Tuesday, June 18, 2013

Understanding Aircraft Bird Strikes (Messing Around with Tableau)

Ok, I'll admit it: I'm a bit of a nervous flyer.  I'm the guy on the plane who looks around at all the flight attendants after every little bump or noise to make sure nothing's wrong.  So when a US Airways A320 was brought down by a flock of geese and forced to ditch in the Hudson, that was the last thing I needed to hear about.  I had no idea something as small as a bird could have such an impact on an airliner weighing tens of tons.

And then there was that episode of MythBusters where they shot chickens out of a cannon into an old aircraft frame to see the damage.

Even though it turns out they were shooting frozen chickens (they forgot to thaw them), the results were still scary.

So how frequent are bird strikes?  And when do they occur?  I was given a dataset containing information about all the recorded bird strikes over the past 12 years.  The first thing I noticed was that the file was almost 100,000 lines long.  Crap.

Fortunately, out of these 100,000 strikes there were only 107 injuries and 5 fatalities.  Humans seemed to fare much better than the birds.  In order to get the most out of this rich dataset, I used Tableau--a data visualization product that can make some fancy interactive visualizations.

This is what I came up with:

To get a full appreciation for the interactive nature of Tableau, you have to click the link above.  Here are some highlights.

I had no idea that birds can fly as high as 25,000 feet!  It makes sense that some of the most damaging (costliest) strikes occurred when the plane was traveling fast at a higher altitude.

Texas and California have had the most strikes, but this is a bit misleading.  They are also the most populous states and have some of the nation's largest airports.  Many more flights originate in these two states than in smaller ones, which means many more chances for a strike.

The majority of strikes occur in clear skies during the day.

Monday, June 3, 2013

Analyzing the Semantic Web

As part of an assignment for a class I'm taking on Coursera, I recently played around with a big data set representing the Semantic Web--an attempt to organize the world's information into a network of data that machines can understand and manipulate without human input.  One of the dreams of this project is to let machines interact with the entirety of the World Wide Web all on their own.

The data set that describes the Semantic Web is a classic example of Big Data.  The set I used for my analysis is known as the Billion Triples dataset.  It contains a billion entries in a format called RDF triples, which requires statements to be expressed in the form:

Subject Predicate Object

For example, to say something like "Bob is Tim's father" in RDF, you would write:

Subject:[Bob] Predicate:[is the father of] Object:[Tim]

In the Semantic Web, the entities in the RDF triples are not confined to people, but can be anything described by a URI (Uniform Resource Identifier)--literally anything.  The Billion Triples database therefore contains a ton of information describing the relationships of interconnected people/places/things/companies.  We can treat this set as a directed graph where each entry represents an edge between two vertices (the subject and object).
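To give a flavor of how the raw data turns into a graph, here's a minimal Python sketch that treats each triple as a directed edge from subject to object.  (The plain whitespace split is an oversimplification of my own--real N-Triples lines can contain literals with spaces--but it shows the idea.)

    def edges_from_triples(path):
        # Each line looks roughly like: <subject> <predicate> <object> .
        # One directed edge per triple: subject -> object.
        with open(path) as f:
            for line in f:
                parts = line.split()
                if len(parts) >= 3:
                    yield (parts[0], parts[2])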

An example of a directed graph.  If this were to represent the Semantic Web, each of the circles would be represented by a URI.  The arrows are drawn whenever a subject points to an object in RDF.

Analyzing graphs like this is a standard task for data scientists, and one common quantity to compute is the out-degree of each node.  This represents the number of objects that a given subject links to.  In the example graph above there are 2 arrows leaving node 7.  Therefore node 7 has an out-degree of 2.  We can learn a lot about a graph by computing a histogram of the out-degree of all its nodes.
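To make that concrete, here's a tiny Python sketch (the edge list is invented, except that node 7 points to two nodes, matching the figure):

    from collections import Counter

    edges = [(1, 2), (1, 3), (3, 4), (5, 4), (7, 5), (7, 6)]

    out_degree = Counter(src for src, dst in edges)   # arrows leaving each node
    histogram = Counter(out_degree.values())          # how many nodes have out-degree d

    print(out_degree[7])   # 2
    print(histogram)       # Counter({2: 2, 1: 2}) for this toy graph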

However, the Billion Triples data set is so large (>0.5 TB) that a single computer can't handle it.  Instead we have to use a cluster of computers, each operating on a small chunk of the data.  This calls for a classic programming model known as MapReduce: a map step processes the chunks in parallel and emits key-value pairs, then all the records sharing a key are routed to a single machine, which performs some simple aggregate function (like counting) in the reduce step.
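For the out-degree computation, the map step can emit the subject of each triple as a key, and the reduce step counts how many times each key appears.  Here's a stripped-down, single-machine Python sketch of the pattern (the function names are mine, not from any particular framework):

    from itertools import groupby

    def map_phase(triples):
        # Map: for every (subject, predicate, object) triple, emit (subject, 1).
        return [(s, 1) for s, p, o in triples]

    def reduce_phase(pairs):
        # Shuffle: sort by key so each key's pairs sit together (on a real
        # cluster this grouping happens across the network).
        pairs.sort(key=lambda kv: kv[0])
        # Reduce: aggregate each group; summing the counts gives the out-degree.
        return {k: sum(v for _, v in g)
                for k, g in groupby(pairs, key=lambda kv: kv[0])}

    triples = [("Bob", "fatherOf", "Tim"), ("Bob", "knows", "Ann"), ("Ann", "knows", "Tim")]
    print(reduce_phase(map_phase(triples)))   # {'Ann': 1, 'Bob': 2}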

When files are this large, they can't be stored in traditional file systems.  Instead, distributed file systems like Hadoop's HDFS break large files into many smaller chunks stored on separate machines across a network.  This way, file access can be parallelized and you can scale your MapReduce jobs up into the cloud.  This is exactly what I did for my analysis of the Semantic Web.

Using Amazon Web Services, I pulled the data set from S3 storage and analyzed it on their cloud computing servers.  To set up the MapReduce jobs, I used Apache Pig, a tool that coordinated the resources of the 20 compute nodes I ran my script on.  In my Pig script I group the entries by subject, then count the number of entries for each subject (the out-degree).  Then I count the number of subjects having out-degree d to build my histogram.  After 4 hours of running in the cloud, I get the following plot:

The distribution of subjects having various values for their out-degree.  Fits to the data were obtained with the curve_fit function in SciPy's optimize module.


The code I wrote to generate this plot fits an exponential function (blue) and a power-law function (red) to the distribution.  A result from graph theory is that the out-degree histogram of a random graph will follow an exponential distribution.  Since the blue line is a terrible fit to the data, we can tell that the Semantic Web is not a random graph.  Instead, we can see that a power law with index -2.2 does a pretty good job of fitting most of the data.
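My fitting code isn't reproduced here, but the idea is straightforward with curve_fit.  A minimal sketch (the histogram values below are invented placeholders that roughly follow d**-2.2, not the real Billion Triples counts):

    import numpy as np
    from scipy.optimize import curve_fit

    # Placeholder histogram: out-degree d and the number of subjects n(d).
    d = np.array([1, 2, 3, 5, 10, 30, 100], dtype=float)
    n = np.array([4e8, 9e7, 3.6e7, 1.2e7, 2.5e6, 2.2e5, 1.6e4])

    def exponential(d, a, b):
        return a * np.exp(-b * d)

    def power_law(d, a, k):
        return a * d**k

    exp_popt, _ = curve_fit(exponential, d, n, p0=(4e8, 0.5))
    pow_popt, _ = curve_fit(power_law, d, n, p0=(4e8, -2.0))
    print(pow_popt[1])   # the fitted index; near -2.2 here by construction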

This index of -2.2 in a power-law distribution is a number that frequently comes up in the analysis of networks.  Researchers studying the patterns of phone calls among AT&T's customers found the exact same distribution with the exact same index.  There is something fundamental in the way networks organize themselves such that they produce power-law histograms with an index near -2.2.  I find it really interesting that a network of telephone calls among friends displays the same properties as a catalog of the relationships between everything in the Semantic Web.  Even more interesting was the cloud computing I got to do (for free with a student grant from AWS, I should add!).