Enron Email Corpus Visualization

Data Visualization · November 2011

Thumbnail of a poster investigating the email trends of Jeff Dasovich.

Thumbnail of a poster showing the network structure of Enron.

Thumbnail of a poster showing the clusters (based on communication) of Enron.

What was the assignment?

In Data Visualization, we were tasked with visualizing networks. I chose the Enron email corpus because it was huge (2.5 GB of email, 517,440 messages). While exploring the data set, I stumbled across an address that had sent over 34,000 emails: Jeff Dasovich’s. The first part of the project, shown in the poster above, investigated his email trends. The second part examined whether clusters were present in Enron’s communication, also shown in the posters above.

What am I most proud of?

This class was an opportunity to be creative in a visual medium for once, so I am most proud of getting to exercise that creativity to produce meaningful graphics summarizing large amounts of data.

What did I learn from the project?

How to identify hypotheses in opaque data sets, and how to support them with visualization.
How to process large amounts of plain text effectively (shard, shard, shard!).
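
To make the sharding point concrete, here is a minimal sketch in Python of the general idea: split the message files into independent shards, count senders within each shard, then merge the partial counts. This is not the code used for the project. The function names (`shard_for`, `split_into_shards`, `count_senders`, `merge_counts`) and the choice of a CRC32 hash are assumptions made for illustration.

```python
import zlib
from collections import Counter
from email.parser import HeaderParser


def shard_for(key: str, num_shards: int) -> int:
    """Map a key (for example a file path) to a stable shard index."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def split_into_shards(keys, num_shards):
    """Group keys into num_shards lists; each list can be processed independently."""
    shards = [[] for _ in range(num_shards)]
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards


def count_senders(raw_messages):
    """Count messages per sender address within one shard of raw email text."""
    parser = HeaderParser()  # parses headers only and leaves the body unparsed
    counts = Counter()
    for raw in raw_messages:
        sender = parser.parsestr(raw).get("From")
        if sender:
            counts[sender.strip().lower()] += 1
    return counts


def merge_counts(partials):
    """Combine per-shard counters into one overall counter."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total
```

Because each shard is counted on its own, the per-shard step could be handed to separate processes or machines, and only the small counters need to be merged at the end. An outlier such as an address with tens of thousands of sent messages would then show up at the top of `merge_counts(...).most_common()`.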

What would I do differently?

The network graph is flawed because the data is bound to the layout algorithm, and it requires manual tweaking to get something half-digestible. An approach I didn’t have time to explore was using a hive plot, which is a new approach to visualizing large, complex networks.

Papers

Project report for Data Visualization

Code

View source code

Projects

Enron Email Corpus Visualization: Investigating everyone's favorite email corpus.
GrizSpace iPhone App: An iPhone application for scheduling and finding classes on UM's campus.
Interactive Hyperelasticity Web Application: An interactive application created to investigate the impact of boundary conditions and other parameters on hyperelastic materials.
NLP Spam Classification of a Social Network: Applying natural language processing techniques to classify spam.
RnaSec: A Ruby library for representing RNA secondary structures as tree data structures.