As I just mentioned in my last post, I’m trying to set up a useful resource site for people using the Enron Email Corpus.
The Enron corpus is unparalleled among email datasets that can be used for research purposes. It is larger than any other research-friendly email corpus (that I know of) by several orders of magnitude. Many people in Natural Language Processing, Machine Learning and a number of other fields have realised this, and have started to analyse the corpus as the basis for a range of research programs. These run from investigations into social networks and organisational communication to data mining and text classification tasks. A fair amount of research has already been published, though most of it is fairly preliminary at this stage.
Unfortunately, despite such widespread interest, the community using the Enron corpus seems to be very fragmented, with many researchers seemingly unaware of how others are using the corpus. This could lead to a lot of wasted effort if different research groups duplicate each other’s work, especially in data markup and cleansing, both of which are huge tasks given the size and inconsistencies of the corpus.
My main motivation for creating yet another website is to pull together all the known work happening with the Enron corpus, and to encourage users to share data and knowledge about it. I have also set up an Enron Corpus discussion list for exactly this purpose.
If you’re working with (or thinking of using) the Enron dataset, why not join the discussion list? If you know anyone who is using the Enron corpus, point them to the Enron Corpus Mailing List and encourage them to join.
1 Comment so far
[...] back in mid-2005, I setup an Enron Email Mailing List to encourage people to share data, experience, questions and knowledge about working with the Enron [...]
Pingback by Enron Email Mailing List - Available again | Thoughtlets 03.17.08 @ 10:07 am
