University of Massachusetts Amherst
The University of Massachusetts provides a WikiWeb for
their work with the Enron Email dataset. Their work primarily focuses on social
network analysis. They are currently working to prove that email social
networks are self-similar (fractal) in their relationship.
Topic based categorization of email into folders
Ron Bekkerman, a PhD student
supervised by Andrew McCallum,
has worked with the Enron corpus to perform automatic categorization of email
into folders. The work presented to date uses the email of only seven users,
for whom the numbers of messages and folders are particularly large (download
the seven preprocessed datasets: 14.7 Mb, tarred and gzipped).
Link Analysis
MD5 to Digest to Relative Filepath Mapping
Andrés
Corrada-Emmanuel has done some work on using MD5 hashes to identify the
number of "unique" emails within the Enron corpus. His results show
that there are 250,484 unique emails. He has made available a mapping file showing the MD5
digest and relative filepath for all files in the Enron corpus.
This mapping file was created as follows: A first pass calculating the MD5
digest of the email messages was made. Files having the same MD5 digest were
then grouped by their timestamp. Those within a day of each other were
considered the same message and a revised MD5 digest was calculated for the
MD5-date grouping by appending the date of the earliest message in the grouping
to the email body. Even so, some messages with the same MD5 digest still have
multiple authors, so de-duplication by this ad hoc method is clearly not
perfect. Caveat emptor!
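The two-pass procedure described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Corrada-Emmanuel's actual code: it assumes the digest is computed over the message body, that "within a day" is measured from the earliest message in a group, and that the date is appended in ISO format. The names md5_hex, flush_group and dedupe are hypothetical.

```python
import hashlib
from collections import defaultdict
from datetime import datetime, timedelta


def md5_hex(text):
    """Return the hexadecimal MD5 digest of a string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def flush_group(group, result):
    """Store one MD5-date grouping under its revised digest."""
    earliest_ts, _, body = group[0]
    # Assumption: the earliest date is appended to the body as YYYY-MM-DD.
    revised = md5_hex(body + earliest_ts.date().isoformat())
    result[revised] = [path for _, path, _ in group]


def dedupe(messages, window=timedelta(days=1)):
    """Group (path, body, timestamp) tuples into presumed-identical messages.

    Returns a dict mapping each revised digest to a list of file paths.
    """
    # First pass: bucket messages by the MD5 digest of their body.
    by_digest = defaultdict(list)
    for path, body, ts in messages:
        by_digest[md5_hex(body)].append((ts, path, body))

    # Second pass: within each bucket, split by timestamp proximity.
    result = {}
    for items in by_digest.values():
        items.sort(key=lambda item: item[0])
        group = []
        for item in items:
            # Assumption: a new group starts once a message is more than
            # one day later than the earliest message in the current group.
            if group and item[0] - group[0][0] > window:
                flush_group(group, result)
                group = []
            group.append(item)
        flush_group(group, result)
    return result
```

Because the grouping looks only at body text and timestamps, two different senders who send identical text on the same day end up under one digest, which is consistent with the multiple-author caveat noted above.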
MD5 to Authors Mapping
Using the MD5 to filepath mapping described above, Corrada-Emmanuel has also
constructed another
mapping file, which shows the authors found in the email headers for each
MD5 digest. Since the de-duplication process detailed above is not perfect,
some "unique" emails have multiple authors.
The format of the file is: <MD5 digest> <author email
address> -%- <author email address> ... Note that the
-%- symbol is used to separate email addresses because the addresses
themselves can contain spaces.
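A line in this format could be parsed along the following lines. This is a minimal sketch: it assumes the digest is separated from the address list by whitespace, and the function names and the sample line are made up for illustration.

```python
SEPARATOR = "-%-"


def parse_mapping_line(line):
    """Split one mapping line into (digest, [addresses])."""
    parts = line.strip().split(None, 1)
    if not parts:
        raise ValueError("empty mapping line")
    digest = parts[0]
    rest = parts[1] if len(parts) > 1 else ""
    # Split on the -%- separator rather than on whitespace, because
    # an address may itself contain spaces.
    addresses = [a.strip() for a in rest.split(SEPARATOR) if a.strip()]
    return digest, addresses


def multi_author_digests(lines):
    """Return {digest: addresses} for digests with more than one author."""
    found = {}
    for line in lines:
        if not line.strip():
            continue
        digest, addresses = parse_mapping_line(line)
        if len(addresses) > 1:
            found[digest] = addresses
    return found
```

Calling multi_author_digests on the lines of the authors file would list the "unique" emails for which the de-duplication left more than one author.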
MD5 to Recipients Mapping
Similarly, a mapping
file between the MD5 digest and the email recipients has been constructed.
This is based on all the recipients that appear in the "To",
"CC", and "BCC" fields in the header. The format for
this file is the same as the MD5 to authors mapping: one line per unique MD5
digest and a -%- separated list of extracted recipient email
addresses.
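Since the authors and recipients files share the MD5 digest as a key, they can be joined into sender-to-recipient pairs for link analysis. The sketch below is one hypothetical way to do so. It assumes both files use the whitespace-then-"-%-" layout described above, and the names read_mapping and count_edges are illustrative only.

```python
from collections import Counter

SEPARATOR = "-%-"


def read_mapping(lines):
    """Parse mapping lines into {digest: [addresses]}."""
    mapping = {}
    for line in lines:
        parts = line.strip().split(None, 1)
        if len(parts) < 2:
            continue  # skip blank lines and digests with no addresses
        digest, rest = parts
        mapping[digest] = [a.strip() for a in rest.split(SEPARATOR) if a.strip()]
    return mapping


def count_edges(author_lines, recipient_lines):
    """Count (sender, recipient) pairs across all unique messages."""
    authors = read_mapping(author_lines)
    recipients = read_mapping(recipient_lines)
    edges = Counter()
    for digest, senders in authors.items():
        for sender in senders:
            for recipient in recipients.get(digest, []):
                edges[(sender, recipient)] += 1
    return edges
```

Digests with several authors contribute one edge per author, so the counts inherit the imperfections of the de-duplication.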
Folder Users
Even though the corpus is claimed to contain 150 users, Andrés's
exploration of the data suggests that there are really only 149 different
users. He provides yet
another mapping, this time between the top folders in the corpus and his
normalised form of each author's email address.
Finally, to help combat some of the many inconsistencies in the Enron
corpus, particularly the occurrence of multiple email addresses for the same
user, Andrés has created a mapping
between the raw email address and the normalized email address for the 149
unique Enron folder users that he has identified.