Available Versions of the Enron Corpus

Raw Corpus

The definitive version of the Enron Corpus is the March 2, 2004 version of the dataset, made available by William Cohen at CMU. Note that even this version has attachments removed and has had some messages deleted "as part of a redaction effort due to requests from affected employees". Note also that invalid email addresses have been converted to something of the form user@enron.com whenever possible (i.e., when the recipient is specified in some parseable format like "Doe, John" or "Mary K. Smith") and to no_address@enron.com when no recipient was specified.
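The address conversion described above can be illustrated with a rough sketch. This is not the procedure actually used to prepare the corpus: the function name normalize_recipient, the first.last local-part scheme, and the handling of unparseable non-empty values are all assumptions made for illustration. Only the enron.com domain and the no_address@enron.com placeholder come from the dataset description.

```python
import re

# Placeholder named in the corpus description for messages with no recipient.
NO_ADDRESS = "no_address@enron.com"

# Very loose check for something that already looks like an address (assumption).
_LOOKS_VALID = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def normalize_recipient(raw):
    """Map a raw recipient field to an address, roughly as described above.

    Hypothetical helper: the real conversion rules are not documented here.
    """
    text = (raw or "").strip().strip('"').strip()
    if not text:
        # No recipient was specified.
        return NO_ADDRESS
    if _LOOKS_VALID.match(text):
        # Already an address; leave it alone apart from lowercasing.
        return text.lower()
    if "," in text:
        # "Doe, John" -> "John Doe"
        last, _, first = text.partition(",")
        text = f"{first} {last}"
    parts = re.findall(r"[A-Za-z]+", text)
    if not parts:
        # Unparseable: the description does not say what happens, so this
        # sketch returns the cleaned input unchanged (assumption).
        return text
    # Assumed local-part scheme: name parts joined with dots, lowercased.
    return ".".join(p.lower() for p in parts) + "@enron.com"


if __name__ == "__main__":
    for sample in ['Doe, John', 'Mary K. Smith', '']:
        print(repr(sample), "->", normalize_recipient(sample))
```

Anyone matching names across messages should inspect the actual addresses in the corpus rather than rely on a scheme like this one, since the real mapping may differ.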

Database Corpus versions

Several organisations and people have spent considerable effort to clean the raw corpus and import the contents into database tables. These include:

Web-accessible Enron Data

  • Bob Arens at the University of Iowa has created a web-accessible search interface to at least part of the Enron corpus. Searches can either return random emails or match keywords in email content. A set of human-annotated emails can also be viewed (see below for more information).

Marked-up Enron Datasets

  • Bob Arens also runs an Annotated Email Viewer, in which emails that human annotators have categorized as useful or not useful can be viewed in a browser. It appears that only a very small subset of the email had been annotated as of June 2005. The "not useful" annotation types include notwork, spam, noattach, and noinfo. I haven't yet found a complete list of the classification tags applied by the human annotators.
  • Marti Hearst at UC Berkeley has developed a set of categories to be used for annotating a subset of the Enron email messages. A subset of about 1700 email messages (4.5M) has been labeled by NLP students. The emails were chosen in a semi-motivated fashion, focusing on business-related emails, the California Energy Crisis, and emails that occurred later in the collection, while trying to avoid very personal messages, jokes, and so on. Students in Marti's ANLP course annotated the selected messages with the possible category labels. Each message was labeled by two people, but no claims of consistency, comprehensiveness, or generality are made about these labelings.