How are people using the Enron Email Corpus?

A number of researchers are applying the Enron email corpus in a variety of contexts. Below is a list of known projects and research topics that make use of the corpus.

This list aims to capture all known uses. Through the Enron Corpus mailing list, we hope to help people avoid duplicating work in processing and marking up the corpus for research purposes (e.g., adding specific markup of concepts or features, analysing the structure of the corpus, or loading the corpus into database tables).

  • Carnegie Mellon University:
    • William Cohen has taken on the responsibility for distributing the Enron email corpus as a resource for researchers.
    • Bryan Klimt and Yiming Yang have done some data cleaning and preliminary data analysis of the Enron corpus, in terms of folder usage, thread characteristics and message distribution across users.

    Read More about Enron work at CMU

  • University of California, Berkeley:
    • A search interface for the Enron email collection, developed by Andrew Fiore and Marti Hearst.
    • A set of categories, developed by Marti Hearst and her students, intended for annotating a subset of the Enron email messages.
    • A subset of about 1700 labeled email messages (4.5M), focusing on business-related emails, on the California Energy Crisis, and on emails that occurred later in the collection, while trying to avoid very personal messages, jokes, and so on. Students in Marti's course annotated the selected messages with the category labels. Each message was labeled by two people, but no claims of consistency, comprehensiveness, or generality are made about these labelings.
    • The Enronic email visualization and clustering tool by Jeff Heer, built on his prefuse toolkit (1.9M jar file). It provides graph-based visualizations of social networks within Enron, based on email interaction between users.
    • A database representation (219 MB compressed) of the Enron email collection, built by Andrew Fiore and Jeff Heer, containing the Enron email messages. This version contains many but not all of the tables used in the search tool, as well as special tables to be used with the Enronic visualization tool. The contents of this database version have been substantially processed to remove duplicates, normalize names, and so on.

    Read More about Enron work at UC Berkeley

  • University of Massachusetts, Amherst:
    • Ron Bekkerman, a PhD student supervised by Andrew McCallum, has worked with the Enron corpus to perform automatic categorization of email into folders. The work presented to date has used the email of only seven users, whose numbers of messages and folders are particularly large.
    • Andrés Corrada-Emmanuel has done a large amount of data consistency checking, based on MD5 digests for each of the email messages. This has allowed him to identify duplicate messages, and to provide mappings between messages and their respective senders and receivers. Other data cleansing work includes normalisation of email addresses, which suggests that only 149 different folder users exist in the corpus.

    Read More about Enron work at UMass

  • University of Southern California:
    • Jitesh Shetty has worked with Jafar Adibi (ISI) to use the Enron corpus for testing the effectiveness of some Link Discovery techniques used for counterterrorism and fraud detection. To do so, they have created yet another cleaned database (MySQL) version of the corpus. They have also looked at social network analysis, and have constructed a list of job titles for many of the users represented in the Enron corpus.

    Read More about Enron work at USC/ISI

  • University of Iowa:

  • Columbia University:

    Owen Rambow, Aaron Hanly, Martin Jansche and others at Columbia University are using the Enron corpus for work in email summarisation, particularly for email thread summarisation.
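
Several of the projects above rely on detecting duplicate messages; the UMass work does so using MD5 digests. As a rough illustration of the idea only (not the method any of these groups actually used), the Python sketch below groups messages whose bodies hash to the same MD5 digest and extracts sender/recipient pairs from the headers. What exactly gets hashed (here, the whitespace-normalised body of a single-part message), the lowercasing of addresses, and the names `body_digest`, `group_duplicates` and `sender_recipient_pairs` are all assumptions made for this example.

```python
import hashlib
from collections import defaultdict
from email import message_from_string
from email.utils import getaddresses


def body_digest(raw_message: str) -> str:
    """Return the MD5 hex digest of a message body (assumption: body only)."""
    msg = message_from_string(raw_message)
    # Assumption: messages are plain single-part text; fall back to the
    # full serialised message for anything multipart.
    body = msg.as_string() if msg.is_multipart() else msg.get_payload()
    # Collapse runs of whitespace so trivial formatting differences
    # do not hide otherwise identical messages.
    normalised = " ".join(body.split())
    return hashlib.md5(normalised.encode("utf-8")).hexdigest()


def group_duplicates(messages: dict[str, str]) -> list[list[str]]:
    """Group message ids whose bodies share a digest; singletons are dropped."""
    by_digest = defaultdict(list)
    for message_id, raw in messages.items():
        by_digest[body_digest(raw)].append(message_id)
    return [sorted(ids) for ids in by_digest.values() if len(ids) > 1]


def sender_recipient_pairs(raw_message: str) -> list[tuple[str, str]]:
    """Return (sender, recipient) address pairs from the From, To and Cc headers."""
    msg = message_from_string(raw_message)
    senders = [addr.lower() for _, addr in getaddresses(msg.get_all("From", []))]
    recipients = [
        addr.lower()
        for _, addr in getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
    ]
    return [(s, r) for s in senders for r in recipients]


if __name__ == "__main__":
    # Hypothetical sample messages, not taken from the corpus.
    sample = {
        "inbox/1": "From: A@enron.com\nTo: b@enron.com\nSubject: hi\n\nHello  world\n",
        "sent/7": "From: A@enron.com\nTo: b@enron.com\nSubject: hi\n\nHello world\n",
        "inbox/2": "From: c@enron.com\nTo: b@enron.com\nSubject: x\n\nOther text\n",
    }
    print(group_duplicates(sample))
    print(sender_recipient_pairs(sample["inbox/1"]))
```

Hashing only the body treats copies of one message stored in different folders as duplicates even when their headers differ; including selected headers in the digest would make the match stricter.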