University of California, Berkeley
Enron Email Analysis Project
- UC Berkeley has set up an Enron Email Analysis Project. Work under this project includes:
- A powerful search
interface for the Enron email collection, developed by
Andrew Fiore and Marti Hearst. The interface connects to
the MySQL database described below using Python, and uses Lucene for the text
queries.
- A set of
categories developed by Marti and students in her Applied
Natural Language Processing (ANLP) course, to be used for annotating a
subset of the Enron email messages.
- A
subset of about 1700 labeled email messages (4.5 MB). These were chosen by
Marti in a semi-motivated fashion (focusing on business-related emails, the
California Energy Crisis, and emails that occurred later in the collection, while
trying to avoid very personal messages, jokes, and so on). Students in Marti's
ANLP course annotated the selected messages with the category labels. Each
message was labeled by two people, but no claims of consistency,
comprehensiveness, or generality are made about these labelings.
- The
Enronic email visualization and clustering tool by Jeff Heer, built on his prefuse toolkit. (1.9 MB jar
file)
- A database
representation (219 MB compressed) of the Enron email collection, built by
Andrew Fiore and Jeff Heer. This version contains many but not all of
the tables used in the search tool, as well as special tables to be used with
the Enronic visualization tool. Andrew did a substantial amount of
processing on the contents of the database to remove duplicates, normalize
names, and so on. This has been tested only on MySQL.
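
Because each message in the labeled subset was annotated by two people and no consistency claims are made, users of the labels may want to measure agreement themselves. The following is a minimal sketch of one way to do that, not part of the project's own tooling. It assumes the labels have already been loaded into one dictionary per annotator, mapping a message ID to a set of category names. The message IDs, category names, and function names are all hypothetical, and the per-message Jaccard overlap is just one simple choice of agreement measure.

```python
def jaccard(a, b):
    """Jaccard overlap of two label sets; treated as 1.0 when both are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_agreement(labels_a, labels_b):
    """Average per-message Jaccard overlap over messages both annotators labeled."""
    shared = labels_a.keys() & labels_b.keys()
    if not shared:
        raise ValueError("the two annotators have no messages in common")
    return sum(jaccard(labels_a[m], labels_b[m]) for m in shared) / len(shared)


# Hypothetical example data: message IDs and category names are made up.
annotator_1 = {"msg1": {"business", "energy_crisis"}, "msg2": {"business"}}
annotator_2 = {"msg1": {"business"}, "msg2": {"business"}}

score = mean_agreement(annotator_1, annotator_2)
print(f"mean label overlap: {score:.2f}")
```

A score near 1 means the two annotators mostly chose the same categories. Messages with low overlap are the ones worth reviewing before using the labels for training or evaluation.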