Visualizing Data Breach Trends with D3.js Charts

Data breaches continue to make headlines; attacks on Anthem, Sony, and Home Depot have all become part of the public conversation.

I thought it would be an interesting exercise to visualize major breaches and see whether any trends are emerging around the data or industry verticals being targeted, as well as the methods used by attackers.

First, I ran a Google search on “Data Breaches” to see what information was already available. One of the top results is informationisbeautiful.net, which has a great bubble chart that visualizes all the major data breaches from 2004 to Mar 30, 2015.

informationisbeautiful.net Data Breach Visualization

The bubble chart is informative and interactive, but lacks deeper context around trends and aggregations of how breaches occurred over time. For example, it is not easy to tell what the top types of data stolen are or the top ways data is being leaked outside of an organization.

A study of graphical perception by Cleveland and McGill showed that humans decode quantitative information most accurately in the following order: position > length > slope > area.

Graphical perception ranked.

The informationisbeautiful.net bubble chart uses area to encode the different data breaches, but I wanted to represent the information in another way to make it easier for people to discover more from the data. I chose the length approach, with stacked bar charts.

Many thanks to the team at informationisbeautiful.net, who have done quite a bit of legwork compiling available data breach information and sharing it as a CSV file. Using the breach data, we can gain some unique insights into the data that organizations have lost and answer several key questions:

  1. What are the top types of data that have been stolen?
  2. What trends are evolving for stealing/leaking data?
  3. What industries are more likely to be attacked?
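Each of these questions boils down to a group-and-sum over the breach rows. The sketch below is a minimal illustration in plain JavaScript, assuming each CSV row has already been parsed into an object. The field names `dataType` and `records` and the sample rows are hypothetical placeholders, not the actual column names or values in the informationisbeautiful.net data set.

```javascript
// Hypothetical sample rows; real rows would come from the parsed CSV.
// Field names (dataType, records) are assumptions for illustration.
const sampleRows = [
  { year: 2013, dataType: "Email", records: 100 },
  { year: 2013, dataType: "Credit Card", records: 40 },
  { year: 2014, dataType: "Email", records: 60 },
  { year: 2014, dataType: "SSN", records: 75 },
];

// Sum the `records` field for each distinct value of `key`.
function sumBy(rows, key) {
  const totals = new Map();
  for (const row of rows) {
    totals.set(row[key], (totals.get(row[key]) || 0) + row.records);
  }
  return totals;
}

// Return the n largest groups as [name, total] pairs, largest first.
function topN(totals, n) {
  return [...totals.entries()].sort((a, b) => b[1] - a[1]).slice(0, n);
}

const topTypes = topN(sumBy(sampleRows, "dataType"), 2);
console.log(topTypes);
```

Grouping by a different field (for example a leak-method or organization-type column, if the rows carry one) would answer the other two questions with the same helper.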

To see trends over time, I developed a custom stacked bar chart using d3.js that overcomes some of the limitations of static stacked bar charts by making it possible to selectively examine data. The source code is available on our public harvest.ai GitHub and on JSFiddle; feel free to play around with the code and see how the chart is generated.
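A stacked bar chart needs, for each year, a running offset so that every category's segment starts where the previous one ends. The following is a minimal sketch of that computation in plain JavaScript; it is not the code from the repository, and the category names, the shape of `yearTotals`, and the numbers are illustrative assumptions.

```javascript
// Category order determines stacking order, bottom to top (assumed names).
const categories = ["Email", "Credit Card", "SSN"];

// Hypothetical per-year totals, keyed by category.
const yearTotals = [
  { year: 2013, Email: 100, "Credit Card": 40, SSN: 0 },
  { year: 2014, Email: 60, "Credit Card": 0, SSN: 75 },
];

// For each year, turn category totals into segments with a start (y0)
// and end (y1) so each segment sits on top of the previous one.
function stackSegments(rows, keys) {
  return rows.map((row) => {
    let offset = 0;
    const segments = keys.map((key) => {
      const value = row[key] || 0;
      const segment = { key, y0: offset, y1: offset + value };
      offset += value;
      return segment;
    });
    return { year: row.year, total: offset, segments };
  });
}

const stacked = stackSegments(yearTotals, categories);
console.log(JSON.stringify(stacked[0].segments));
```

Each segment's `y0` and `y1` can then be passed through a linear scale to get the pixel position and height of the corresponding rectangle.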

The result is the interactive chart below. Click on the bars to transition between a stacked bar chart and a single-category bar chart.
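The click behavior can be thought of as a small piece of state: either no category is selected (stacked view) or exactly one is (single-category view). The sketch below models that toggle in plain JavaScript, with rendering and transitions left out. The function names and sample data are hypothetical, and the assumption that a second click on the same category returns to the stacked view is mine rather than something taken from the published source.

```javascript
// Hypothetical totals for one year, keyed by category.
const year2014 = { Email: 60, "Credit Card": 20, SSN: 75 };

// Clicking a category selects it; clicking it again returns to the
// stacked view (represented by null). This toggle rule is an assumption.
function nextSelection(current, clicked) {
  return current === clicked ? null : clicked;
}

// Compute the bar segments to draw for one year under a selection.
// Stacked view: every category, offset by the ones below it.
// Single view: only the selected category, resting on the baseline.
function visibleSegments(totals, selected) {
  if (selected !== null) {
    return [{ key: selected, y0: 0, y1: totals[selected] || 0 }];
  }
  let offset = 0;
  return Object.keys(totals).map((key) => {
    const segment = { key, y0: offset, y1: offset + totals[key] };
    offset += totals[key];
    return segment;
  });
}

let selected = null;
selected = nextSelection(selected, "SSN"); // first click: single view
const single = visibleSegments(year2014, selected);
selected = nextSelection(selected, "SSN"); // second click: back to stacked
const all = visibleSegments(year2014, selected);
console.log(single, all);
```

Dropping the selected category to the baseline is what makes year-to-year comparison easier in the single view, since the bars then share a common starting point.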

Total records stolen from 2004 - Mar 30, 2015: 2.06 Billion

Top types of data being stolen (in millions of records):

  • Email address/Online info (593M)
  • Credit Card info (590M)
  • SSN/Personal details (412M)

In the chart below, we can clearly see the aggregations of different types of data being stolen over time, and tell from the legend on the left side of the chart that email, credit card, and SSN/personal details are among the most targeted and leaked types of data.

Among all types of data, email address/online info is the most frequently stolen. As Krebs on Security puts it in “The value of a hacked email account”: “Your email account may be worth far more than you imagine.” A hacked email account can be a bridge for hackers to gain deeper access to your private, financial, and employment information.

Besides email (593M), an almost equal amount of credit card information (590M) was stolen. According to the black market prices of stolen data that accompany informationisbeautiful.net’s data set, freshly acquired credit card info is worth about $26.60-$44.80 apiece and could go as high as $8,000 for an executive's credit card info.

Clicking on the "Full bank account details" bars transforms the chart into the view below. Notice the significant increase in breaches of full bank account details in recent years. Fewer than 2.0M bank account records were stolen in 2006 and 2010, while more than 150M records were stolen in each of 2012 and 2014; those two years account for more than 98% of the bank account data stolen over the past decade.
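The 98% figure is a share-of-total calculation: the records from 2012 and 2014 divided by the records from all years. The sketch below shows the arithmetic in JavaScript. The per-year values are placeholders chosen only to fall within the ranges quoted above; they are not the actual figures from the data set.

```javascript
// Placeholder values in millions of records, chosen only to fit the ranges
// quoted in the text; they are NOT the actual figures from the data set.
const bankRecordsByYear = { 2006: 1.5, 2010: 1.8, 2012: 160, 2014: 155 };

// Fraction of the overall total contributed by the given years.
function shareOfTotal(byYear, years) {
  const total = Object.values(byYear).reduce((sum, v) => sum + v, 0);
  const part = years.reduce((sum, y) => sum + (byYear[y] || 0), 0);
  return part / total;
}

const share = shareOfTotal(bankRecordsByYear, [2012, 2014]);
console.log((share * 100).toFixed(1) + "%");
```

With any values inside the quoted ranges, the two large years dominate the total, which is why the share stays above 98%.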

Top ways information is being leaked:

  • hacked (1.63B)
  • lost / stolen media (166M)
  • inside job / hacked (92.0M)

ANALYSIS: While hacking still accounts for the lion's share of breached data records over time, there has been a significant increase over the past two years in attacks involving insider threats.

Top types of organizations being breached:

  • web (554M)
  • financial (491M)
  • retail (240M)

ANALYSIS: There appears to be a major trend in high-tech industries being targeted by attackers over the past two years.

CONCLUSION: Overall, I was able to show that more insights can be gained when aggregated trends over time are shown with stacked bar charts, which encode values as length.

Based on the data and our stacked bar chart, 2014 was a record-breaking year, with over 575M records stolen. The data set also shows a rising trend in breached data from 2004 to 2014.

As George Orwell wrote in 1984: “Who controls the past controls the future. Who controls the present controls the past.” As we become more interconnected via the web and cloud, our growing past is stored in big data somewhere.


POST BY WINSTON CAI, FRONT END NINJA

Contributions by: Alex Matzner and Alex Watson