Raw Corpus
The definitive Enron Corpus is the March 2, 2004
version of the dataset, made available by William Cohen at CMU. Note that
even this version has attachments removed and has had some messages deleted
"as part of a redaction effort due to requests from affected
employees". Note also that invalid email addresses have been converted to
something of the form user@enron.com whenever possible (i.e., when the
recipient is specified in some parseable format like "Doe, John" or
"Mary K. Smith") and to no_address@enron.com when no
recipient was specified.
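The address clean-up described above can be illustrated with a minimal sketch. It is not the procedure actually used to prepare the corpus: the function name normalize_recipient, the first.last form of the local part, and the handling of names that cannot be parsed are all assumptions made for illustration. Only the enron.com domain and the no_address@enron.com placeholder come from the description above.

```python
import re

# Placeholder named in the corpus description for a missing recipient.
NO_ADDRESS = "no_address@enron.com"


def normalize_recipient(raw):
    """Map a raw recipient field to an @enron.com address (illustrative only)."""
    text = (raw or "").strip().strip('"').strip()
    if not text:
        return NO_ADDRESS

    # Something that already looks like an address is kept (lower-cased).
    if re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", text):
        return text.lower()

    # "Doe, John" -> "John Doe"
    if "," in text:
        last, _, first = text.partition(",")
        text = f"{first.strip()} {last.strip()}"

    # Keep alphabetic name parts and drop single-letter initials such as "K."
    parts = [p for p in re.findall(r"[A-Za-z]+", text) if len(p) > 1]
    if not parts:
        # ASSUMPTION: the source does not say what happens to unparseable
        # names; this sketch falls back to the placeholder.
        return NO_ADDRESS

    # ASSUMPTION: a first.last local part; the real scheme is not documented here.
    local = f"{parts[0]}.{parts[-1]}" if len(parts) > 1 else parts[0]
    return f"{local.lower()}@enron.com"
```

Under these assumptions, both "Doe, John" and "Mary K. Smith" map to an address of the form user@enron.com, and an empty recipient field maps to the placeholder.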
Database Corpus versions
Several organisations and people have spent considerable effort to clean the
raw corpus and import the contents into database tables. These include:
Web-accessible Enron Data
-
Bob Arens at the University of Iowa has created a web-accessible
search interface to at least part of the Enron corpus. Search
can be either random or based on keyword search of email content. Viewing of a
set of human-annotated emails is also possible (see below for more information).
Marked-up Enron Datasets
-
Bob Arens also runs an Annotated
Email Viewer, in which emails that have been categorized as useful
or not useful by human annotators can be viewed in a browser. It appears
that only a very small subset of email has been human annotated at this stage
(June 2005). The "not useful" annotation types include notwork, spam,
noattach, and noinfo. I haven't yet found a complete set of
classification tags that are being applied by the human annotators.
-
Marti Hearst at UC Berkeley has developed a set of categories
to be used for annotating a subset of the Enron email messages. A
subset of about 1700 labeled email messages (4.5M) has been annotated by
NLP students. The emails were chosen in a semi-motivated fashion (focusing on
business-related emails and the California Energy Crisis and on emails that
occurred later in the collection, trying to avoid very personal messages,
jokes, and so on). Students in Marti's ANLP course annotated the selected
messages with the possible category labels. Each message was labeled by two
people, but no claims of consistency, comprehensiveness, or generality are
made about these labelings.
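Since each message carries labels from two annotators and no consistency claim is made, a reader may want to check agreement before relying on the labels. The following is a minimal sketch of one way to do that, using the mean Jaccard overlap between the two annotators' label sets. The data layout (a dict from message id to a set of labels per annotator) and the label names in the usage example are hypothetical; the actual distribution format of the annotated subset is not described here.

```python
def jaccard(a, b):
    """Jaccard overlap of two label collections; two empty sets count as full agreement."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_agreement(labels_a, labels_b):
    """Average per-message Jaccard overlap over messages labeled by both annotators.

    labels_a and labels_b are hypothetical dicts: message id -> set of category labels.
    """
    shared = labels_a.keys() & labels_b.keys()
    if not shared:
        raise ValueError("no messages were labeled by both annotators")
    return sum(jaccard(labels_a[m], labels_b[m]) for m in shared) / len(shared)


# Usage with made-up message ids and labels:
first = {"m1": {"business", "energy"}, "m2": {"personal"}}
second = {"m1": {"business"}, "m2": {"personal"}}
score = mean_agreement(first, second)  # 0.75: half overlap on m1, full overlap on m2
```

A chance-corrected statistic would be a stricter check; the raw overlap is used here only because it needs no assumptions about the label distribution.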