|
The Enron email corpus contains email data from roughly 150 Enron employees,
mostly senior management. It was originally made public during the Federal
Energy Regulatory Commission's investigation of Enron.
The corpus contains roughly 500,000 messages, organised into folders. Thanks to
the work of people at MIT, SRI and CMU, the dataset has been cleaned (mostly by
stripping attachments and removing some emails for reasons of employee
privacy) and made available online.
An unparalleled research dataset
The Enron corpus is unparalleled among email datasets that can be used for
research purposes. It is larger than any other research-friendly email corpus
(that I know of) by several orders of magnitude.
Many researchers in Natural Language Processing, Machine Learning and a variety
of other fields have realised this, and have started to use the
corpus as the basis of a number of different research programs. These range
from investigations into social networks and organisational communication to
data mining and text classification tasks.
Coordinating research using the Enron corpus
Unfortunately, despite such widespread interest, the community using the Enron
corpus seems to be very fragmented, with many researchers seemingly unaware of
how others are using the corpus. This could result in much wasted effort if
different research groups duplicate each other's work,
especially in terms of data markup and data cleansing, which are both huge
tasks given the size and inconsistencies of the corpus.
The main motivation for creating this website is to pull together all the known
work happening with the Enron Corpus, and to encourage users to share data and
knowledge about the corpus. The Enron Corpus Mailing List has been set up for
exactly this purpose.
If you're working with (or thinking of using) the Enron dataset, why not join the discussion list?
|