Filed under: email, language technology, research, search
Posted by: Andrew Lampert
Exciting news: a new version of the Enron email corpus is now publicly available, and it includes both the email messages and their attachments.
An organisation called EDRM (Electronic Discovery Reference Model) recently made this version of the corpus available for download. The attachments it includes were missing from the widely used versions of the corpus available from CMU, ISI and others. Apparently, the initial data set was created by John Wang and a team at ZL Technologies.
This version of the corpus consists of a series of Microsoft PST files, which contain both email messages and attachments. It’s a reasonably large dataset, especially compared with the email-only versions: the compressed files total about 19 GB, and the uncompressed files about 43 GB. Except where otherwise noted, use of the files is subject to a Creative Commons Attribution 3.0 United States License. Attribution should be noted as “EDRM (edrm.net).”
One thing to note is that every email appears to have had a footer added with EDRM attribution information. I assume this happened as part of the conversion into PST files. The content of the footer is consistent, however, so it could be readily filtered out when processing the emails automatically.
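As a rough illustration of that filtering, here is a minimal Python sketch. It assumes the message bodies have already been extracted from the PST files as plain text and that the footer sits at the very end of each body. The function name `strip_footer` and the `EXAMPLE_FOOTER` string are hypothetical placeholders of mine; the real footer text would need to be copied from a message in the corpus.

```python
def strip_footer(body: str, footer: str) -> str:
    """Return body without a trailing copy of footer, if one is present."""
    trimmed = body.rstrip()
    marker = footer.strip()
    # Only cut when the body really ends with the footer text.
    if marker and trimmed.endswith(marker):
        return trimmed[: -len(marker)].rstrip()
    # No footer found: leave the message untouched.
    return body


# Hypothetical placeholder, NOT the actual EDRM footer text.
EXAMPLE_FOOTER = "*** placeholder attribution footer ***"

message = "Hi Ken,\nSee the attached spreadsheet.\n\n" + EXAMPLE_FOOTER + "\n"
cleaned = strip_footer(message, EXAMPLE_FOOTER)
```

An exact suffix match keeps this conservative: a message that merely mentions EDRM somewhere in its text is left alone. If the footer turns out to vary in whitespace or line wrapping between messages, a normalising step or a regular expression would be needed instead.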
