Exciting news: a new version of the Enron email corpus is now publicly available, and it includes both the email messages and their attachments.
Recently, an organisation called EDRM (Electronic Discovery Reference Model) has made a version of the Enron email corpus available for download that includes attachments, which were missing from the widely used versions of the corpus available from CMU, ISI etc. Apparently, the initial data set was created by John Wang and a team at ZL Technologies.
This version of the corpus consists of a series of Microsoft PST files, which contain both email messages and attachments. It’s a reasonably large dataset, especially compared with the email-only versions: the compressed files total about 19 GB, and the uncompressed files about 43 GB. Except where otherwise noted, use of files is subject to a Creative Commons Attribution 3.0 United States License. Attribution should be noted as “EDRM (edrm.net).”
One thing to note is that every email appears to have had a footer added with EDRM attribution information, presumably as part of the conversion process into PST files. The content of the footer is consistent, however, so it could be readily filtered out when processing the emails automatically.
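Because the footer is consistent, one way to filter it out without hard-coding its text is to find the trailing lines that every message shares and strip them. The following is a minimal sketch of that idea, assuming the message bodies have already been extracted from the PST files as plain-text strings. The function names and the placeholder footer text are made up for illustration and are not the actual EDRM footer.

```python
def common_trailing_lines(bodies):
    """Return the trailing lines (whitespace-stripped) shared by every body."""
    split = [body.rstrip().splitlines() for body in bodies]
    if not split:
        return []
    shared = []
    # Walk all messages from the bottom up in lockstep.
    for lines in zip(*(reversed(s) for s in split)):
        if len({line.strip() for line in lines}) != 1:
            break
        shared.append(lines[0].strip())
    shared.reverse()
    return shared


def strip_footer(body, footer_lines):
    """Remove footer_lines from the end of body if they are present."""
    lines = body.rstrip().splitlines()
    n = len(footer_lines)
    if n and [line.strip() for line in lines[-n:]] == footer_lines:
        lines = lines[:-n]
    return "\n".join(lines).rstrip()


# Placeholder footer text: NOT the real EDRM attribution footer.
FOOTER = "\n\n***\nPLACEHOLDER FOOTER LINE A\nPLACEHOLDER FOOTER LINE B\n"
bodies = [
    "Hi Bob,\nSee attached." + FOOTER,
    "Lunch at noon?" + FOOTER,
]

footer = common_trailing_lines(bodies)
cleaned = [strip_footer(body, footer) for body in bodies]
```

In practice you would want to infer the footer from a reasonably large and varied sample of messages, since a small sample of near-identical messages could cause genuine content to be treated as footer.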
In the early days of email, widely-used conventions for indicating quoted reply content and email signatures made it easy to segment email messages into their functional parts. Today, the explosion of different email formats and styles, coupled with the ad hoc ways in which people vary the structure and layout of their messages, means that simple techniques for identifying quoted replies that used to yield 95% accuracy now find less than 10% of such content.
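To make the "simple techniques" concrete, here is a rough sketch of a convention-based segmenter that relies on two long-standing conventions: quoted reply lines prefixed with ">" and a signature introduced by a "-- " delimiter line. This is only an illustration of the older style of heuristic, not the Zebra approach, and the zone labels and function name are my own for this example.

```python
import re

QUOTE_PREFIX = re.compile(r"^\s*>")  # conventional quoted-reply marker


def segment_by_convention(body):
    """Label each line as 'author', 'quoted' or 'signature' using old conventions."""
    labels = []
    in_signature = False
    for line in body.splitlines():
        if not in_signature and line.rstrip() == "--":
            # Conventional signature delimiter: two dashes (plus a trailing space).
            in_signature = True
        if in_signature:
            labels.append("signature")
        elif QUOTE_PREFIX.match(line):
            labels.append("quoted")
        else:
            labels.append("author")
    return labels


classic = "Sounds good.\n> Shall we meet at 3?\n-- \nAndrew"
outlook_style = "Fine by me.\n-----Original Message-----\nFrom: Someone\nShall we meet at 3?"

classic_labels = segment_by_convention(classic)
outlook_labels = segment_by_convention(outlook_style)
```

The first message is segmented sensibly, but the second, which quotes the original message below a separator line with no ">" prefixes, is labelled entirely as author text. That is exactly the kind of quoted content that these heuristics now miss.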
Many language processing and search tools stand to benefit from better knowledge of the different functional parts of email messages, since this would allow them to focus on relevant content in specific parts of a message. In particular, access to zone information would allow email classification, summarisation and analysis tools to separate or filter out ‘noise’ and focus on the content in specific zones of a message that are relevant to the application at hand. Email contact mining tools, for example, might only access content from the email signature, while tools that attempt to identify tasks or action items in email might restrict themselves to the sender-authored and forwarded content.
Last week, I presented my paper on Segmenting Email Message Text into Zones at the Empirical Methods in Natural Language Processing (EMNLP) conference in Singapore. The focus of this work is Zebra, an SVM-based system that automatically segments and classifies the body text of email messages into nine functional zone types based on graphic, orthographic and lexical cues.
Our set of nine zones includes the following: author, greeting, signoff, quoted reply, forward, signature, advertising, disclaimer and attachment. Zebra currently performs the segmentation and classification of email text into the nine zones with an accuracy of about 87%. When the number of zones is abstracted to two or three zone classes (which is much more likely to be the granularity required for real-world email processing tasks), Zebra’s accuracy increases above 91.5%.
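To see why accuracy rises when zones are abstracted, consider the sketch below, which collapses the nine zone labels into three coarser classes and computes line-level accuracy at both granularities. The particular grouping, the class names and the toy labels are assumptions made for illustration. They are not results from Zebra, and the grouping is not necessarily the one used in the paper.

```python
# Assumed grouping of the nine zones into three coarse classes (illustrative only).
COARSE = {
    "author": "sender",
    "greeting": "sender",
    "signoff": "sender",
    "quoted reply": "quoted",
    "forward": "quoted",
    "signature": "boilerplate",
    "advertising": "boilerplate",
    "disclaimer": "boilerplate",
    "attachment": "boilerplate",
}


def coarsen(labels):
    """Map fine-grained zone labels onto the assumed coarse classes."""
    return [COARSE[label] for label in labels]


def accuracy(gold, predicted):
    """Fraction of lines whose predicted label matches the gold label."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("label sequences must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Toy, made-up labels for a six-line message.
gold = ["greeting", "author", "author", "signoff", "signature", "quoted reply"]
pred = ["author", "author", "author", "signoff", "signature", "forward"]

fine_accuracy = accuracy(gold, pred)
coarse_accuracy = accuracy(coarsen(gold), coarsen(pred))
```

Confusions between zones that fall into the same coarse class (greeting versus author, or quoted reply versus forward) no longer count as errors, so the coarse-grained accuracy can never be lower than the fine-grained figure.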
I’m currently working to finish off the Zebra system, as well as to resolve some licensing issues so that the code can be released for other researchers to use. We have, however, already released our annotated email dataset consisting of almost 12,000 lines of annotated email text that we used to train the Zebra system. If you want to know more, you can read our paper, head over to the Zebra website, or just get in touch with me by email or other means.
One of the main aims of this workshop is to gather email and enterprise computing researchers and practitioners to discuss and propose solutions for email in e-commerce and enterprise contexts.
Topics:
Architecture for enterprise cooperation and interoperability over email
Intelligent email for SMEs
Email-based business task and process management
Email content analysis, message summarization, information extraction
Semantic Email and Semantic Knowledge Extraction
Email social networks for enterprise computing
Email analysis of exchanged documents for semantic alignment via negotiation
Email Workflow Management for Business Processes
Interconnection of email content and enterprise resources (legacy systems, document repositories)
Enterprise resource mashup support for business email
Approaches for email visualization and user interfaces in business contexts
Case studies
Business email datasets
If you’re a researcher working with email, or if your startup or company is in the email space, please consider submitting a paper or demo to the workshop. Full details are available in the Call for Papers.
I’ve previously noted my disappointment with the array of trivial trinkets that have so far defined Gmail Labs. One of the most recent additions, however, finally adds something of use.
Quote selected text allows you to selectively quote and reply to one small part of a message. Like other email clients with this feature (Apple Mail springs to mind), you just highlight the text you want to include in your reply, hit the keyboard shortcut “r” to reply, and the compose template will be just what you selected. This is a simple but useful feature. Note that it only works in Firefox and IE right now. Safari and Chrome support is still in progress.
Luckily for us, Gabor is fitting in some travel between finishing up at Xobni and starting his new company, and Sydney is one of the stops on his itinerary. Gabor is an excellent presenter, so if you’re in Sydney, I highly recommend coming along to the seminar that he will be giving on The Future of Email at CSIRO / Macquarie University, starting at 11am on Wednesday 15th October. (Here are details of our location and how to get here if you’re planning to come along.)
Of course, given Gabor’s experience as an entrepreneur, I’m sure he’ll also be happy to talk about life in a Silicon Valley startup and the lessons he’s learned along the way. So, come along for the seminar, and stick around for what’s sure to be some interesting discussion.
Whistleblower website Wikileaks.org, which famously made its debut revealing secret documents about Guantanamo Bay, has announced that it has acquired a corpus of over 8,000 diplomatic emails from the government of Venezuelan President Hugo Chavez. Controversially, Wikileaks is offering to auction off the corpus to the highest bidder.
The winning bidder will get exclusivity and embargoed access to the documents. However, there is hope for cash-poor email researchers, as Wikileaks claims that it will eventually publish all of the email after the embargo expires.
The corpus allegedly includes email messages and attachments from 2005 to July 2008 that provide insight into the management of Chavez’s “inner circle”, along with “sentiments about CIA activities in Venezuela, Columbian incursions, the visit of the Pope”, and the Bolivarian revolution. Based on the Wikileaks press release below, the email messages appear to be from a single diplomat’s mailbox.
From: Wikileaks Press Office
Date: Wed, 27 Aug 2008 20:38:47 +0100
Inside Venezuela – over 8, 000 diplomatic emails 2005-2008
Wikileaks has prepared for publication over 8,000 internal and
external emails to and from a senior Venezuelan diplomat and former
speech writer for Hugo Chavez. The emails are dated 2005 to July
2008, and include several thousand attachments. The preparation
includes a “one touch” translation system to over a dozen different
languages.
The material provides a unqiue insight into the Bolivarian revolution,
President Chavez’s manamgement of his inner circle, and affairs
ranging from Cuban and Venezuelan contacts, sentiments about CIA
activites in Venezuela, Columbian incursions, the visit of the
Pope and Venezuelan views on many other countries and events.
Organizations wishing to bid for exclusivity (proceeds to our source
defense fund) and embargoed access contact usa@wikileaks.org for
additional information.
Thanks to Rob McArthur for alerting me to the Wired News article about the auction. If anyone out there knows more about this potential corpus, please comment!
Update (3/8/08): Of course, I assume the email messages are likely to be in Spanish, the official language of Venezuela.
A few months back, I had a conversation about my PhD work with Kate Stevens, one of the members of the executive for HCSNet, an Australian Research Council funded collaboration network for researchers working on topics in the broad space of Human Communication Science.
Parts of my on-camera conversation with Kate have made it into the recently released HCSNet Promotional video, which is now available on YouTube. It’s always a bit weird seeing yourself on camera, particularly when sound bites are taken from a much longer conversation! Given the totally unscripted nature of what was recorded though, I think it’s worked out quite well.
Of course, this is also a good opportunity to actually plug the annual HCSNet Summerfest, which will be held at UNSW in Sydney in December. If you’re interested in speech, language, sonics, psychology or any number of topics in between, check out what’s on offer – it’s well worth a few days of your time to meet some inspiring people.
Mark Dredze, Vitor Carvalho and Tessa Lau did an excellent job bringing together a great bunch of people working on a variety of email-related research at the recent EMAIL-08 workshop at AAAI in Chicago. There was a huge amount of energy and enthusiasm amongst the participants, which is a great thing for the future of email research.
Following on from the workshop, we have created a series of new resources to help keep the community connected. The first of these is a new mailing list for those interested in email research. Our intention is for this list to be a central place for people in the email research community to discuss ideas and projects and to announce resources of interest. More information about the list (including subscription information) can be found at http://groups.google.com/group/email-research.
In addition to the list, we have also created a community-maintained email research website that we hope will keep a current list of email datasets, published papers and related information. Please get in touch if you have relevant content for the site.
If you are at all involved in email-related research, I strongly encourage you to join the new Email Research mailing list and to take part in the ongoing discussion of the wider email research community. I’m looking forward to hearing your ideas!
Apparently Microsoft is willing to purchase Yahoo! for (US) $44.6 billion. As Michael Osterman notes, this would give Microsoft a huge addition to its consumer email base, through Yahoo!’s world-leading base of webmail users, to complement Microsoft’s existing Hotmail user community.
Yahoo!’s recent purchase of Zimbra also means that we’d see a Microsoft Zimbra, although it’s unclear how this would be particularly useful to Microsoft as a business-grade email and collaboration offering, given their existing Exchange/Outlook combination. I guess MS might see it as removing a potential competitor from the enterprise email space?
It will be interesting to see whether this often-mooted Yahoo!-Microsoft partnership will actually come to fruition this time. Quite apart from the email aspects of the partnership, I’d be fascinated to see what would come from a merging of their search capabilities and from a combined Microsoft Research and Yahoo! Research.