University of Massachusetts Amherst

The University of Massachusetts provides a WikiWeb for their work with the Enron Email dataset. Their work focuses primarily on social network analysis. They are currently working to prove that email social networks are self-similar (fractal) in their structure.

Topic based categorization of email into folders

Ron Bekkerman, a PhD student supervised by Andrew McCallum, has worked with the Enron corpus to perform automatic categorization of email into folders. The work presented to date uses the email of only seven users, for whom the numbers of messages and folders are particularly large. The seven preprocessed datasets are available for download (14.7 MB, tarred and gzipped).

Link Analysis

MD5 to Digest to Relative Filepath Mapping

Andrés Corrada-Emmanuel has done some work on using MD5 hashes to identify the number of "unique" emails within the Enron corpus. His results show that there are 250,484 unique emails. He has made available a mapping file showing the MD5 digest and relative filepath for all files in the Enron corpus.

This mapping file was created as follows: A first pass calculating the MD5 digest of the email messages was made. Files having the same MD5 digest were then grouped by their timestamp. Those within a day of each other were considered the same message and a revised MD5 digest was calculated for the MD5-date grouping by appending the date of the earliest message in the grouping to the email body. This still resulted in messages with the same MD5 having multiple authors so the de-duplication of messages by this ad-hoc method is clearly not perfect. Caveat emptor!
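The procedure above can be sketched roughly as follows. This is a minimal illustration, not Corrada-Emmanuel's actual code: the function names, the `(path, body, timestamp)` input shape, the choice to hash only the message body, the ISO date format, and the reading of "within a day" as "within 24 hours of the earliest message in the group" are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict
from datetime import datetime, timedelta


def md5_hex(text):
    """Return the MD5 hex digest of a string (UTF-8 encoded)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def dedupe(messages, window=timedelta(days=1)):
    """Map each file path to a revised digest.

    messages: iterable of (path, body, timestamp) tuples (hypothetical shape).
    Messages with identical first-pass digests are clustered by time; each
    cluster gets a revised digest computed from the body plus the date of
    its earliest message.
    """
    # First pass: group messages by the plain MD5 digest of the body.
    by_digest = defaultdict(list)
    for path, body, ts in messages:
        by_digest[md5_hex(body)].append((ts, path, body))

    revised = {}
    for group in by_digest.values():
        group.sort(key=lambda item: item[0])  # earliest message first
        cluster_start = None
        cluster_digest = None
        for ts, path, body in group:
            # Assumption: a new cluster starts once a message falls more
            # than `window` after the earliest message of the current one.
            if cluster_start is None or ts - cluster_start > window:
                cluster_start = ts
                cluster_digest = md5_hex(body + cluster_start.date().isoformat())
            revised[path] = cluster_digest
    return revised
```

Under these assumptions, two copies of the same body sent hours apart share one revised digest, while a copy sent several days later receives a different one. Nothing in this scheme looks at the sender, which is consistent with the observation that one digest can still end up with multiple authors.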

MD5 to Authors Mapping

Using the MD5 to filepath mapping described above, Corrada-Emmanuel has also constructed another mapping file, which shows the authors found in the email headers for each MD5 digest. Since the de-duplication process detailed above is not perfect, some "unique" emails have multiple authors.

The format of the file is: <MD5 digest> <author email address> -%- <author email address> ... Note that the -%- symbol is used to separate email addresses, since the addresses themselves can contain spaces.
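A line in this format could be parsed along the following lines. This is only a sketch: the function name is hypothetical, and it assumes the digest is separated from the first address by whitespace and that each line holds exactly one digest.

```python
def parse_mapping_line(line):
    """Split one mapping line into (digest, [addresses]).

    Assumes the layout '<digest> <address> -%- <address> ...', where the
    digest itself contains no whitespace but addresses may.
    """
    parts = line.strip().split(None, 1)  # digest, then the rest of the line
    if not parts:
        raise ValueError("empty line")
    digest = parts[0]
    rest = parts[1] if len(parts) > 1 else ""
    # Split on the -%- marker rather than on spaces, because an address
    # may itself contain spaces.
    addresses = [a.strip() for a in rest.split("-%-") if a.strip()]
    return digest, addresses


# Example with made-up addresses (not taken from the corpus):
example = "d41d8cd98f00b204e9800998ecf8427e john doe@enron.com -%- jane@enron.com"
digest, authors = parse_mapping_line(example)
```

Splitting on the marker first and trimming whitespace afterwards keeps an address such as `john doe@enron.com` intact.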

MD5 to Recipients Mapping

Similarly, a mapping file between the MD5 digest and the email recipients has been constructed. This is based on all the recipients that appear in the "To", "CC", and "BCC" fields in the header. The format for this file is the same as the MD5 to authors mapping: one line per unique MD5 digest and a -%- separated list of extracted recipient email addresses.

Folder Users

Even though the corpus is claimed to contain 150 users, Andrés's exploration of the data suggests that there are really only 149 different users. He provides yet another mapping, this time between the top folders in the corpus and his normalised form of each author's email address.

Finally, to help combat some of the many inconsistencies in the Enron corpus, particularly the occurrence of multiple email addresses for the same user, Andrés has created a mapping between the raw email address and the normalised email address for the 149 unique Enron folder users that he has identified.