Topic Modeling FOIA Data

Today we're going to look at an example of applying machine learning, specifically natural language processing (NLP), to massively extend our ability to understand and interact with large sets of data.

Over the past several months, the State Department has released five batches totaling over 50,000 pages of emails from Hillary Clinton's non-official email server under the Freedom of Information Act. There is tremendous public interest in specific topics, and the graphics team at the Wall Street Journal put together an excellent tool to search through the data.

While this is very useful for journalists searching the data for specific keywords they are interested in, there is no easy way to answer a broader set of questions, such as "What are these emails about?" and "Which emails are actually important?"

Are random keyword searches, or reading all 50,000 pages, really the right way to find out whether there's something important or relevant?

Traditional Approaches

First, let's take a look at the data:

  • 8549 total email threads
  • 3553 threads released "IN FULL" (normal)
  • 4063 threads released "IN PART" (edited before release)

We could run a traditional DLP (Data Loss Prevention) tool against the data set, which searches the emails for known terms such as "SENSITIVE" or "CLASSIFIED", or for simple regular expressions (regexes) like phone number patterns. Okay, let's do that:

  • "SENSITIVE" markings on 449 threads (5%)
  • "UNCLASSIFIED" markings on 8546 threads (99%)
  • Phone number patterns in 389 threads (4%)
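
A DLP-style scan like this is easy to sketch. Here is a hypothetical toy version; the marker list and the phone number regex are illustrative assumptions, not the rules of any actual DLP product:

```python
import re

# Known markers to look for (illustrative; a real DLP tool ships many more)
MARKERS = ["SENSITIVE", "UNCLASSIFIED"]
# Simple US-style phone number pattern, e.g. (202) 555-0147
PHONE_RE = re.compile(r"\b\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}\b")

def dlp_scan(threads):
    """Count threads containing each marker or a phone number pattern."""
    counts = {m: 0 for m in MARKERS}
    counts["phone_number"] = 0
    for text in threads:
        upper = text.upper()
        for m in MARKERS:
            if m in upper:
                counts[m] += 1
        if PHONE_RE.search(text):
            counts["phone_number"] += 1
    return counts

# Invented sample threads standing in for the released email texts
threads = [
    "UNCLASSIFIED U.S. Department of State Case No...",
    "SENSITIVE BUT UNCLASSIFIED -- call me at (202) 555-0147",
]
print(dlp_scan(threads))
```

Run over the real corpus, a scan like this produces exactly the kind of counts shown above: totals of things we already knew to look for.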

Well, that's not useful. We want to learn something that matters, not a count of things we already know about. To do this (and to avoid reading all 55,000 pages to find something interesting), I'm going to use topic modeling, a branch of machine learning that uses statistical models to discover the abstract "topics" that occur in documents.

Enter Topic Modeling

Specifically, we're going to use Latent Dirichlet Allocation (LDA) for the heavy lifting of discovering what the emails are about. LDA works particularly well on email chains because it's based on the premise that every document is a mixture of topics, and those topics can be discovered from the statistical distributions of words across topics and of topics across documents.
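
To make that premise concrete, here is a minimal collapsed Gibbs sampler for LDA in pure Python. This is a toy sketch of the algorithm, not the implementation used on the real corpus (at 8,549 threads you'd reach for an optimized library such as gensim or MALLET), and the tiny corpus below is invented:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, alpha=0.1, beta=0.01, iters=200, seed=7):
    """Toy collapsed Gibbs sampler: returns doc-topic and topic-word counts."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})  # vocabulary size
    # z[d][i]: topic currently assigned to the i-th word of document d
    z = [[rng.randrange(n_topics) for _ in d] for d in docs]
    ndk = [[0] * n_topics for _ in docs]               # doc-topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]  # topic-word counts
    nk = [0] * n_topics                                # words per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]  # remove this word's current assignment
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # resample: p(k) ∝ (ndk + alpha) * (nkw + beta) / (nk + V*beta)
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta)
                           / (nk[t] + V * beta) for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return ndk, nkw

# Invented mini-corpus with two obvious themes
docs = [
    "libya benghazi intel source libya militias".split(),
    "benghazi libya intel security source".split(),
    "haiti earthquake relief aid haiti".split(),
    "haiti aid relief earthquake donors".split(),
]
ndk, nkw = lda_gibbs(docs, n_topics=2)
for counts in ndk:
    print(counts)  # each document should lean heavily toward one topic
```

The key design point is that each word's topic is resampled conditioned on every other assignment, so words that co-occur across documents are pulled into the same topic without any labels being supplied.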

  Clinton email distribution - 50 discovered topics (left) and most statistically interesting terms (right)



Above are the most statistically significant terms across the entire email corpus. In this case, LDA discovered 50 topics, which are visualized on the left and can be labeled by (you guessed it) the most significant terms for each topic. Larger circles indicate topics that account for a larger share of the corpus's terms. Now let's get to the good stuff: what are the interesting topics?
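
Labeling a topic by its top terms is straightforward once you have per-topic word counts from a fitted model. A sketch, where the counts below are made-up illustrative values rather than real model output (LDAvis itself uses more refined saliency and relevance measures):

```python
# Hypothetical topic-word counts, as produced by a fitted LDA model
topic_word_counts = {
    8:  {"source": 412, "libyan": 388, "sensitive": 350,
         "libya": 341, "benghazi": 299, "press": 12},
    23: {"haiti": 520, "earthquake": 333, "relief": 310,
         "aid": 290, "un": 45},
}

def topic_label(word_counts, n_terms=5):
    """Join the n_terms highest-count words into a pipe-delimited label."""
    top = sorted(word_counts, key=word_counts.get, reverse=True)[:n_terms]
    return "|".join(top)

for topic_id, counts in sorted(topic_word_counts.items()):
    print(f"Topic {topic_id}: {topic_label(counts)}")
```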

  LDA Topic 8 - source|libyan|sensitive|libya|benghazi. Visualized with R's excellent LDAvis



Interesting topic found! We can tell that our topic model is working well when the salient terms cluster around a single topic. Looking at the image above, it's interesting to see that the Benghazi topic sits apart from the other topics on the PCA plot on the left, indicating a distinctly different distribution of terms.

So Much Better Than Keywords

As humans, we can quickly look at a topic and tell how interesting it is to us (in effect, we are rapidly classifying a large set of documents). Now, rather than doing a text search for "Benghazi", we can query the topic model for something much more powerful: in this case, the emails and subject lines from the released documents that correlate most strongly with the topic about sensitive Benghazi data:

  • Topic 8,source|libyan|sensitive|libya|benghazi
    1.00,C05739866.pdf.txt,(no subject)
    1.00,C05739864.pdf.txt,It Libya, intel, internal conflict over militias, Sid
    1.00,C05739861.pdf.txt,H: V good intel internal Libya. Sid
    1.00,C05739857.pdf.txt,H: V good Intel internal Libya. Sid
    1.00,C05739824.pdf.txt,H: V good intel Internal Libya. Sid
    1.00,C05739803.pdf.txt,RE: H: latest inter Mayan conflicts, leaders & militias. Sid
    1.00,C05739800.pdf.txt,Re: H: latest Intel libyan conflicts, leaders & militias. Sid
    1.00,C05739796.pdf.txt,(no subject)
    1.00,C05739794.pdf.txt,H: latest intel libyari conflicts, leaders & militias. Sid
    1.00,C05739789.pdf.txt,(no subject)
    1.00,C05739771.pdf.txt,H: Latest Intel: Ubyan leadership private discussions. Sid
    1.00,C05739769.pdf.txt,H: Latest intel: Libyan leadership private discussions. Sid
    1.00,C05739768.pdf.txt,H: Latest intel: Libyan leadership private discussions. Sid
    1.00,C05739651.pdf.txt,(no subject)
    1.00,C05739650.pdf.txt,H: Great to see you. Drop in again. Here's Libya. Ski
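
A ranking like the one above boils down to sorting documents by the weight the fitted model assigns them on the topic of interest. A minimal sketch, where the per-document topic weights (and the third filename) are invented placeholders standing in for real model output:

```python
# Hypothetical per-document topic distributions from a fitted LDA model
doc_topics = {
    "C05739866.pdf.txt": {8: 1.00, 23: 0.00},
    "C05739650.pdf.txt": {8: 0.97, 23: 0.03},
    "C05767015.pdf.txt": {8: 0.02, 23: 0.98},  # placeholder filename
}

def rank_by_topic(doc_topics, topic_id, threshold=0.5):
    """Return (weight, doc) pairs at or above threshold, strongest first."""
    hits = [(topics.get(topic_id, 0.0), doc)
            for doc, topics in doc_topics.items()
            if topics.get(topic_id, 0.0) >= threshold]
    return sorted(hits, reverse=True)

for weight, doc in rank_by_topic(doc_topics, topic_id=8):
    print(f"{weight:.2f},{doc}")
```

Note that this surfaces the garbled-OCR subjects above ("inter Mayan", "Ubyan", "libyari") just as reliably as the clean ones, because the match is on the document's overall term distribution rather than on an exact keyword.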

A simple example, but an illustrative one: we can use NLP to amplify our ability to understand and interact with large sets of data, ultimately making informed decisions on actual facts without having to read all 55,000 pages!




Former Senior Director of Security Research at Websense. CTO at BTS. Co-Founder of APX Labs