APTs ♥ Your Cloud Data

In today's blog we're going to take a look at how harvest.ai has detected nation-state affiliated attacks with sophisticated Content Aware UEBA (user and entity behavioral analytics). These attacks are interesting as attackers utilized compromised user accounts on platforms such as Google for Work, Box, Dropbox and Office 365 to attempt to steal data.

One of the key characteristics of modern sophisticated attacks is the use (or abuse) of legitimate tools by attackers and a distinct reduction in malware. In these attacks there is rarely a single piece of evidence that proves account compromise.

What challenges does this create for organizations? It means that traditional enterprise security and even UBA solutions that look only at authentications and accesses, without understanding content, will have a hard time differentiating between what is merely anomalous and what is malicious- a critical step towards stopping attacks before data can be stolen.

Here are some tactics we have seen in recent attacks:

  • Attackers use stolen credentials to search through email, personal and organization share drives to gain intel and move laterally towards their target
  • Attackers use the stolen credentials to access cloud platforms such as Google, Office 365, Box, Dropbox in addition to on-premise file shares
  • In some cases, attackers may tunnel through the user's endpoint, using their logged in credentials to defeat MFA (multi-factor authentication) and IP geolocation
  • Attackers access data during normal business hours
  • Sustained access over months

Over 8.2% of all business documents in an average Fortune 1000 company are viewable by all users within the organization.

Fortunately, a few things work in the defender's favor:

  1. Attackers use stolen credentials to search through the organization's email, personal and network share drives to gain intel and move laterally (More on this below!)
  2. Attackers' data access patterns are fundamentally different from those of the legitimate user accounts they have compromised
  3. Attackers often slip up over time and access from IPs not commonly used by the user account

Why can #1 above be a good thing? An attacker is going to do everything they can to blend in with the noise and avoid detection, but at the end of the day they will need to access information in a way that is different from the target user. By understanding what kinds of information are important to the attacker and what kinds of information each user and group typically accesses (we do this with natural language processing and AI), we can both narrow in on attacks quickly and reduce false positives.

APTxx example

In our first year, we've detected two (attributed) nation-state affiliated attacks with our UBA and DLP analytics-- each of them involved compromising cloud accounts such as Google for Work, Office 365, Box, etc. as part of their strategy. We believe the attacker behind the compromised account below is an APT that security firm Mandiant first discovered. Let's take a look at a (sanitized) compromised user's account for some insight into what sophisticated attackers are looking for and how we can catch them.

Our analytics (Macie) start by learning what kinds of data each user and their peer groups access, as well as access patterns such as where they typically access from and how often. Here's a high-level overview of our compromised user account:

We can see quickly that our analytics have learned that the user typically accesses Finance and HR related documents, which are important to the business, but not critical. We can also see that the user has a "Bronze" data access categorization, which is the lowest tier of the Bronze, Silver, Gold, Platinum risk framework our analytics apply to each user. As we'll see, a lower risk classification just means that the user account typically accesses less business-critical data; it does not mean that an account compromise will not create risk for the organization.
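To make that concrete, here's a highly simplified sketch (not the production Macie pipeline) of learning that kind of per-account baseline: which document categories an account touches, from which ISPs, and at what hours. The field names and access-log format are illustrative assumptions.

    # Illustrative sketch only -- not harvest.ai's actual implementation.
    from collections import Counter, defaultdict

    def build_baselines(access_log):
        """Learn a simple per-user profile from historical access events.

        `access_log` is an assumed iterable of dicts like:
        {"user": "jdoe", "doc_category": "HR", "source_isp": "Comcast",
         "timestamp": datetime(...)}
        """
        baselines = defaultdict(lambda: {"categories": Counter(),
                                         "isps": Counter(),
                                         "hours": Counter()})
        for event in access_log:
            profile = baselines[event["user"]]
            profile["categories"][event["doc_category"]] += 1   # e.g. "HR", "Finance"
            profile["isps"][event["source_isp"]] += 1
            profile["hours"][event["timestamp"].hour] += 1
        return baselines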

Our AI learns to classify new content and jargon across an organization, getting over one of the biggest roadblocks for traditional DLP

Content Aware Behavioral Analytics

When looking for changes in behavior of a user or group, we often see attackers use the compromised account's credentials to search for documents that ARE highly risky to the organization. These are also likely to be different kinds of documents than the user typically accesses. What kind of features can we look for to identify a compromised account? For a start:

  • The compromised account may start searching for data shared across the organization
  • This data will likely be different from what the user or their peer group typically accesses
  • This data may be classified by our analytics as very important to the organization
  • The attacker may be coming through new or different IP ranges

So we're looking at some fairly high-dimensional features, and it can be difficult to show how our analytics work behind the scenes. To visualize this, we'll build a graph model of the user's accesses in Gephi and see if we can spot where the compromise happened.

In the past year, we've seen this user access 538 documents from 7 different ISPs. Edges are created between documents when they were accessed within an hour of each other-- this allows us to link the nearly 50% of document accesses that don't have a recorded source IP address from the cloud SaaS provider. The graph is colored by the ISP that the data was accessed from. Now we can apply a layout algorithm to group documents accessed at similar times together.
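For readers who want to reproduce this kind of view, here's a minimal sketch of building that access graph with networkx and exporting it as GEXF for Gephi to lay out and color. The (timestamp, document_id, isp) access tuples are an assumed input format, not our internal schema.

    # Illustrative sketch: build the document co-access graph described above.
    from datetime import timedelta
    import networkx as nx

    def build_access_graph(accesses):
        """`accesses` is an assumed list of (timestamp, document_id, isp) tuples."""
        accesses = sorted(accesses, key=lambda a: a[0])      # order by access time
        G = nx.Graph()
        for ts, doc, isp in accesses:
            G.add_node(doc, isp=isp or "unknown")            # color nodes by ISP in Gephi
        # Link any two documents accessed within an hour of each other
        for i, (ts_i, doc_i, _) in enumerate(accesses):
            for ts_j, doc_j, _ in accesses[i + 1:]:
                if ts_j - ts_i > timedelta(hours=1):
                    break
                if doc_i != doc_j:
                    G.add_edge(doc_i, doc_j)
        return G

    # nx.write_gexf(build_access_graph(accesses), "user_accesses.gexf")  # open in Gephi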

Now that we've grouped documents accessed over similar timelines together, we can look for indications of compromise. For example: Is the user accessing from a risky ISP? What is the business value of the docs the user is looking at? Is the user looking at other users' documents or their own? Fortunately, our analytics are really fast at looking for anomalous changes across multiple features- and have identified a risky access below:
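Behind the scenes, a toy version of that multi-feature check might look like the snippet below, reusing the per-account baseline sketched earlier. The thresholds and field names are illustrative, not our production logic.

    # Toy multi-feature check -- thresholds and fields are illustrative only.
    def risky_access(event, profile):
        """Flag an access event against the account's learned baseline."""
        flags = {
            "new_isp": event["source_isp"] not in profile["isps"],
            "critical_content": event["business_value"] == "critical",
            "other_owner": event["doc_owner"] != event["user"],
            "off_baseline_topic": event["doc_category"] not in profile["categories"],
        }
        return sum(flags.values()) >= 3, flags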

Results of a multi-dimensional analytic (content, business value, time, location, peer anomaly) to find accesses that match patterns of a compromised account searching across an organization to steal data


As you can see above, we have one distinct cluster (0.37% of all docs!) where the user matched on all of our criteria- essentially the compromised account being used to access very important business data that is different from what the user typically accesses and that belongs to other users. Additionally, we see a few connections in there from Linode- a cloud compute provider that the attacker was bouncing their connection through. Gotcha! Now for the good stuff-- what is our nation-state affiliated actor looking for?

The screenshot above shows accesses to some very sensitive content, and a huge change from the types of HR and financial documents our user typically accesses. The attacker appears to be focused on data center and information technology topics. Additional indicators of compromise: the documents accessed are not owned by our user, and the large increase in the volume of files accessed over a short period of time is also unusual for this account.

Conclusion

Across Fortune 1000 environments, we see an average of over 8% of all business documents in Google for Work, Office 365, Box and Dropbox being accessible to all users within the domain, and over 1% of those documents being rated by our analytics as business critical-- making it essential that organizations take steps to protect the important content they use in the cloud.

What did we see during the compromise?

  • A distinct shift, detected and alerted on by our analytics, from the user's typical HR time-sheets and vacation schedules to data center migration plans and system uptime schedules after the account was compromised
  • Across industries, we have seen attackers frequently take an interest in IT and data center related content- likely in order to move from cloud knowledge repositories towards the business-specific applications they are targeting

Macie, meet RankBrain

Today Google revealed RankBrain, the AI system that Google has recently started using for a "very large" fraction of the millions of searches it handles every second, featuring more than a few commonalities with Harvest.ai's Macie analytics. What's special about RankBrain? Google says that it's particularly useful when processing searches that its systems have never seen before.

Google has not released a lot of details about specifically how RankBrain's AI works- but it likely involves some fantastic research that Google has published over the past year on Word and Paragraph vectors-- the same algorithms that form the underpinnings of Harvest.ai's Macie analytics. Word and paragraph vectors are essentially vector-based representations of words, paragraphs and documents generated by neural networks-- often referred to as neural language models.

Why is this relevant? Moving from a search engine searching for statistically relevant terms to a more knowledge-based system requires not just understanding which terms are statistically interesting, but also the context in which they were used. Take the example sentence below:

I like to stream TV shows on Hulu and stream movies on Netflix

In a traditional word-count (token n-gram, or bag-of-words) model, this sentence would get turned into:

Example bag-of-words representation


The "token_1gram" example is a typical bag-of-words approach that is useful for many NLP applications, such as search by maintaining a count of the times that each word appeared in an input document. It's great for identifying statistically interesting terms like "stream", "hulu", and "netflix" but loses key semantics that are critical for harder problems such as conveying knowledge. For example, it's impossible to answer the question using this model of whether the user likes to watch movies or shows on Netflix.

In contrast to a BoW model, neural language models maintain a stronger association between "TV shows" and "Hulu" and between "movies" and "Netflix", capturing the semantics of the original statement.

Let's do a quick example and see what we can learn from paragraph vectors trained on a very large corpus of data- such as the full content of Wikipedia or a 3-month data dump of Google News. Given an input search, such as "Hulu" and "Netflix" from the example above, we can find other terms that are often located close to our input query. This can be used to expand the user's original search- which may be something Google has never seen before- to include other relevant terms that may bring in better search results.

A demonstration of the ability to expand on user queries using Word and Paragraph vector algorithms that are likely central to Google's RankBrain AI 


Pretty amazing! Our original query of "Netflix" and "Hulu" became:

  • Hulu
  • Netflix
  • Amazon Unbox
  • Hulu.com
  • Boxee
  • Vudu

How does this work? In this case, we ran the terms "Netflix" and "Hulu" against a trained word vector representation generated from 3 months of Google News articles-- approximately 100 billion words-- utilizing the word2vec implementation open-sourced by Google. As shown above, we can use this model to build on existing queries (even ones we've never seen before) with intelligent recommendations- users don't have to know, or spend the time to think of, what terms to put into a query to Google- it can be expanded automatically.
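Here's a sketch of that lookup using the gensim library and the publicly released GoogleNews-vectors-negative300 model (the local file name is an assumption, and the exact neighbor list will vary with the model you load):

    from gensim.models import KeyedVectors

    # Pre-trained word2vec vectors from ~100B words of Google News (assumed local file)
    vectors = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True)

    # Terms whose vectors sit closest to the combined "Netflix" + "Hulu" query vector
    for term, similarity in vectors.most_similar(positive=["Netflix", "Hulu"], topn=5):
        print(term, round(similarity, 3))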

At Harvest.ai, our Macie analytics use a similar approach to RankBrain to identify and protect important data across an organization, even with just a small set of training data. Go machines!

POST BY ALEX WATSON, FOUNDER, CEO HARVEST.AI

Former Senior Director of Security Research at Websense. CTO at BTS. Co-Founder of APX Labs. Over 10 years of experience in the US Intelligence Community.

 

Topic Modeling FOIA Data

Today we're going to look at an example of how Machine Learning- specifically Natural Language Processing- can be applied to massively extend our ability to understand and interact with large sets of data.

Over the past months, the State Department released 5 dumps of over 50,000 pages of emails from Hillary Clinton's non-official email server under the Freedom of Information Act. There is a tremendous amount of interest from the public around specific topics, and the graphics team at the Wall Street Journal put together an excellent tool to search through the data.

While this is very useful for allowing journalists to search through the data for specific keywords they are interested in, there is not an easy way to answer a broader set of questions- such as "What are these emails about?" and "Which emails are actually important?"

Are random keyword searches, or reading all 50,000 pages really the right way to learn if there's something important or relevant?

Traditional Approaches

First, let's take a look at the data:

  • 8549 total email threads
  • 3553 threads released "IN FULL" (normal)
  • 4063 threads released "IN PART" (edited before release)

We could run a traditional DLP (Data Loss Prevention) tool against the data set, which will search the emails for known terms such as "SENSITIVE" or "CLASSIFIED" or simple regular expressions (regexes) like phone numbers. Okay, let's do that:

  • "SENSITIVE" markings on 449 threads (5%)
  • "UNCLASSIFIED" markings on 8546 threads (99%)
  • "Phone Numbers" 389 (4%)

Well, that's not useful. We want to learn something that matters- not a count of things we already know about. To do this (and avoid reading all 55k pages to find something interesting), I'm going to use Topic Modeling, a branch of Machine Learning that uses a statistical model to discover the abstract "topics" that occur in documents.

Enter Topic Modeling

Specifically, we're going to use Latent Dirichlet Allocation (LDA) for the heavy lifting of discovering what the content of the emails is about. LDA works particularly well in the case of email chains, as it's based on the premise that any document is made up of multiple topics that can be discovered by looking at the statistical distributions of words across topics and of topics across documents.
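Here's a minimal sketch of that step using gensim's LDA implementation. The tokenization is deliberately crude, and the clinton_emails/ directory of extracted .txt files is the same assumption as above.

    import re
    from pathlib import Path
    from gensim import corpora, models

    paths = sorted(Path("clinton_emails").glob("*.txt"))
    # Crude tokenization: lowercase alphabetic tokens of 3+ characters
    docs = [re.findall(r"[a-z]{3,}", p.read_text(errors="ignore").lower()) for p in paths]

    dictionary = corpora.Dictionary(docs)
    dictionary.filter_extremes(no_below=5, no_above=0.5)   # drop very rare / very common terms
    corpus = [dictionary.doc2bow(tokens) for tokens in docs]

    # Discover 50 latent topics across the email corpus
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=50, passes=10)
    print(lda.show_topic(8, topn=5))   # topic index is illustrative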

Clinton email distribution - 50 discovered topics (left) and most statistically interesting terms (right)

Above is a list of the most statistically significant terms found across the entire email corpus. In this case, LDA discovered 50 topics, which are visualized on the left and can be labeled by (you guessed it) the most significant terms for each topic. Larger topics indicate a larger distribution of terms. Now let's get to the good stuff: what are the interesting topics?

LDA Topic 8 - source|libyan|sensitive|libya|benghazi. Visualized with R's excellent LDAvis

Interesting topic found! We can tell that our topic model is working well when all of the salient terms form around a single topic. Looking at the image above, it's interesting to see that the Benghazi topic is spaced away from the other topics on the PCA graph on the left- indicating a different distribution of terms.

So Much Better Than Keywords

As humans, we can quickly look at a topic and tell how interesting it is to us (essentially, we are quickly classifying a large set of documents). Now, rather than doing a text search for "Benghazi", we can do a search against the topic model for something much more powerful- in this case, the emails and subject lines from the released documents that most strongly correlate with the topic about sensitive Benghazi data:

  • Topic 8,source|libyan|sensitive|libya|benghazi
    MatchStrength,FileName,Subject
    1.00,C05739866.pdf.txt,(no subject)
    1.00,C05739864.pdf.txt,It Libya, intel, internal conflict over militias, Sid
    1.00,C05739861.pdf.txt,H: V good intel internal Libya. Sid
    1.00,C05739857.pdf.txt,H: V good Intel internal Libya. Sid
    1.00,C05739824.pdf.txt,H: V good intel Internal Libya. Sid
    1.00,C05739803.pdf.txt,RE: H: latest inter Mayan conflicts, leaders & militias. Sid
    1.00,C05739800.pdf.txt,Re: H: latest Intel libyan conflicts, leaders & militias. Sid
    1.00,C05739796.pdf.txt,(no subject)
    1.00,C05739794.pdf.txt,H: latest intel libyari conflicts, leaders & militias. Sid
    1.00,C05739789.pdf.txt,(no subject)
    1.00,C05739771.pdf.txt,H: Latest Intel: Ubyan leadership private discussions. Sid
    1.00,C05739769.pdf.txt,H: Latest intel: Libyan leadership private discussions. Sid
    1.00,C05739768.pdf.txt,H: Latest intel: Libyan leadership private discussions. Sid
    1.00,C05739651.pdf.txt,(no subject)
    1.00,C05739650.pdf.txt,H: Great to see you. Drop in again. Here's Libya. Ski
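Continuing the earlier sketch (and reusing the `paths`, `corpus` and `lda` objects from it), ranking threads by how strongly they load on a chosen topic is just a sort over the per-document topic distribution:

    TOPIC = 8   # topic index from the LDA run above (illustrative)
    scores = []
    for path, bow in zip(paths, corpus):
        topic_weights = dict(lda.get_document_topics(bow, minimum_probability=0.0))
        scores.append((topic_weights.get(TOPIC, 0.0), path.name))

    # Threads that correlate most strongly with the chosen topic
    for weight, name in sorted(scores, reverse=True)[:15]:
        print(f"{weight:.2f},{name}")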

A simple example- but illustrative of how we can use NLP to amplify our ability to understand and interact with large sets of data-- ultimately making informed decisions based on actual facts, without having to read all 55,000 pages!

 

 

POST BY ALEX WATSON, FOUNDER, CEO Harvest.ai

Former Senior Director of Security Research at Websense. CTO at BTS. Co-Founder of APX Labs.

 

Part 2: OAuth Exploitation in Action

The first part of this article explained what OAuth is and how we are seeing it used with corporate credentials; now I will delve into trends around OAuth abuse.

MALICIOUS SOCIAL ENGINEERING APPS

We are seeing opportunistic applications use social engineering in highly successful campaigns to entice users to grant access to company email, files and contact lists with just a few clicks. While we're mostly seeing these techniques being abused by SPAM bots, there are some very real implications here around targeted attacks and spear phishing.

Above: Example OAuth application permission request

Applications such as "Friend Connect" (just to be clear, not "Google Friend Connect"), a.k.a. "Flipora", a.k.a. "Infoaxe" (somewhat entertaining feedback) are evolving from the traditional approach of malicious applications stealing passwords and contact lists from compromised endpoints to a more nuanced approach: using social engineering to get a user to grant access to data stored in services such as Google for Work, Office 365, Box and Dropbox. We mostly see these applications using the stolen accesses to, you guessed it, send SPAM; but as I later demonstrate, it could be much worse.

Notice how "Friend Connect" redirects to the known malicious / annoying "Flipora" callback URL during the OAuth process

Notice how "Friend Connect" redirects to the known malicious / annoying "Flipora" callback URL during the OAuth process

As shown above, the OAuth scopes requested by the borderline malicious "Friend Connect" application include complete read access to a user's corporate inbox, sent mail, drafts, contact lists and profile information. Currently the app appears to be used primarily to access a user's contact list and send invites and SPAM to people on that list to continue spreading. The scary part is that with a single click, and without ever sharing their password, users are granting applications such as this complete read access to their corporate email.

Example "Friend Connect" OAuth requests and scopes as viewed in the Google Admin Console, granted complete read access to Gmail and contact lists

 

WHY IS THIS A BIG DEAL?

Traditional security systems such as firewalls, web security gateways and IDS have nearly no way to monitor or stop the infection or spread of OAuth-based SPAM bots, or even of legitimate applications that you probably don't want to have access to your company's data. Why? Resetting user passwords doesn't help, because no passwords were stolen. Antivirus and web security gateways can't help, because endpoints aren't actually infected and might be connecting from mobile locations. In this case, external applications are using token access unwittingly granted by users, with scopes such as accessing contact lists and downloading and sending email directly through the cloud provider. IT admins are forced to manually step through the cloud provider interface to investigate and revoke OAuth tokens for each unapproved application and user. (Ouch!)
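To give a sense of what programmatic cleanup can look like (outside of our product), here's a hedged sketch using the Google Admin SDK Directory API with domain-wide delegated service account credentials- the credentials file, addresses, and the app name being revoked are all placeholders.

    from google.oauth2 import service_account
    from googleapiclient.discovery import build

    # Placeholder credentials file and admin account; requires domain-wide delegation
    SCOPES = ["https://www.googleapis.com/auth/admin.directory.user.security"]
    creds = service_account.Credentials.from_service_account_file(
        "service_account.json", scopes=SCOPES, subject="admin@example.com")
    directory = build("admin", "directory_v1", credentials=creds)

    user = "victim@example.com"
    tokens = directory.tokens().list(userKey=user).execute().get("items", [])
    for token in tokens:
        print(token.get("displayText"), token.get("scopes"))
        if token.get("displayText") == "Friend Connect":      # unapproved application
            directory.tokens().delete(userKey=user, clientId=token["clientId"]).execute()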

In addition to applications such as "Friend Connect," there is a second and much larger class of legitimate applications that can be granted very wide access to your company's data by users unwittingly signing in with their credentials and granting access to their email and cloud drives. These applications may access or store data in a way that is not safe or consistent with your company's policy or regulatory needs.

 

FINAL THOUGHTS

Admittedly, SPAM bots are a little different from the more advanced threats we usually focus on, but these applications have hit a soft spot in enterprise security that companies need to be aware of.

OAuth accesses have serious implications for businesses from a data protection perspective: effectively, with one click a user can grant complete access to their corporate resources. Corporations need to have measures in place (such as OAuth application whitelisting and blacklisting) as part of their security strategy as they move critical or regulation-protected assets to the cloud.

What concerns us is that technologies such as OAuth, which are meant to protect users by letting them grant specific accesses to applications without having to provide a password, are being abused by malicious applications in a way that creates a great deal of risk for businesses and requires new education for users. In addition to mass malware and phishing, there are some pretty wide implications around targeted attacks and spear phishing that we'll touch on in a follow-up post.


Post by Alex Watson, Founder, CEO harvest.ai

Former Senior Director of Security Research at Websense. CTO at BTS. Co-Founder of APX Labs. Over 10 years experience in the US Intelligence Community.