Exciting news – there’s a new version of the Enron email corpus that’s now publicly available which includes both the email messages and attachments.
Recently, an organisation called EDRM (Electronic Discovery Reference Model) has made a version of the Enron email corpus available for download that includes attachments, which were missing from the widely used versions of the corpus available from CMU, ISI etc. Apparently, the initial data set was created by John Wang and a team at ZL Technologies.
This version of the corpus consists of a series of Microsoft PST files, which contain both email messages and attachments. It’s a reasonably large dataset, especially compared with the email-only versions: the compressed files total about 19 GB, and the uncompressed files about 43 GB. Except where otherwise noted, use of files is subject to a Creative Commons Attribution 3.0 United States License. Attribution should be noted as “EDRM (edrm.net).”
One thing to note is that every email appears to have had a footer added with EDRM attribution information. I assume this happened as part of the conversion process into PST files. The content of the footer is consistent, however, so it could be readily filtered out if you are processing the emails automatically.
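Because the footer is the same on every message, removing it can be as simple as an exact suffix match. The following is only a minimal sketch: the footer text in EXAMPLE_FOOTER is a made-up placeholder (I have not reproduced the real EDRM wording here), and strip_footer is a hypothetical helper name, so you would need to substitute the actual footer string from the corpus.

```python
# Hypothetical placeholder: NOT the real EDRM footer text. Replace it with
# the exact footer string found in the converted messages.
EXAMPLE_FOOTER = (
    "***********\n"
    "PLACEHOLDER ATTRIBUTION FOOTER\n"
    "***********"
)


def strip_footer(body, footer=EXAMPLE_FOOTER):
    """Return body without a trailing copy of footer, if one is present.

    Messages that do not end with the footer are returned unchanged.
    """
    trimmed = body.rstrip()
    if footer and trimmed.endswith(footer):
        # Drop the footer, then any blank lines that separated it from the text.
        return trimmed[: -len(footer)].rstrip()
    return body
```

An exact match is deliberately conservative: it will never remove text that merely resembles the footer, at the cost of missing any message where the footer was altered.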
In the early days of email, widely used conventions for indicating quoted reply content and email signatures made it easy to segment email messages into their functional parts. Today, the explosion of different email formats and styles, coupled with the ad hoc ways in which people vary the structure and layout of their messages, means that simple techniques for identifying quoted replies that used to yield 95% accuracy now find less than 10% of such content.
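To make the “simple techniques” concrete, here is a rough sketch of the kind of rule-based labelling that the older conventions allowed: lines starting with “>” are treated as quoted reply text, and everything after a “-- ” delimiter line is treated as signature. This is my own illustration (label_lines is a hypothetical name), not the method evaluated in any particular study.

```python
def label_lines(body):
    """Label each line of an email body as 'author', 'quoted' or 'signature'.

    Relies only on two classic plain-text conventions:
      * quoted reply lines begin with '>'
      * a line consisting of '-- ' (or '--') starts the signature
    """
    labelled = []
    in_signature = False
    for line in body.splitlines():
        if line in ("-- ", "--"):
            in_signature = True
        if in_signature:
            label = "signature"
        elif line.lstrip().startswith(">"):
            label = "quoted"
        else:
            label = "author"
        labelled.append((label, line))
    return labelled
```

A message that follows the old conventions is labelled sensibly, but a reply that quotes the original under a line such as “On Monday, Bob wrote:” with no “>” markers comes back labelled entirely as author text, which is exactly the failure mode described above.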
Many language processing and search tools stand to benefit from better knowledge of the different functional parts of email messages, since this would allow them to focus on relevant content in specific parts of a message. In particular, access to zone information would allow email classification, summarisation and analysis tools to separate or filter out ‘noise’ and focus on the content in specific zones of a message that are relevant to the application at hand. Email contact mining tools, for example, might only access content from the email signature, while tools that attempt to identify tasks or action items in email might restrict themselves to the sender-authored and forwarded content.
Last week, I presented my paper on Segmenting Email Message Text into Zones at the Empirical Methods in Natural Language Processing (EMNLP) conference in Singapore. The focus of this work is Zebra, an SVM-based system that automatically segments and classifies the body text of email messages into nine functional zone types based on graphic, orthographic and lexical cues.
Our set of nine zones includes the following: author, greeting, signoff, quoted reply, forward, signature, advertising, disclaimer and attachment. Zebra currently performs the segmentation and classification of email text into the nine zones with an accuracy of about 87%. When the number of zones is abstracted to two or three zone classes (which is much more likely to be the granularity required for real-world email processing tasks), Zebra’s accuracy increases above 91.5%.
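As an illustration of why accuracy rises when zones are abstracted, the sketch below collapses the nine zone labels into three coarser classes and compares accuracy before and after. The particular grouping and the class names (“sender”, “quoted”, “boilerplate”) are assumptions made for this example, and the labels are toy data rather than Zebra output.

```python
# Assumed grouping of the nine zones into three classes (illustrative only).
THREE_CLASS = {
    "author": "sender",
    "greeting": "sender",
    "signoff": "sender",
    "quoted reply": "quoted",
    "forward": "quoted",
    "signature": "boilerplate",
    "advertising": "boilerplate",
    "disclaimer": "boilerplate",
    "attachment": "boilerplate",
}


def coarsen(labels):
    """Map fine-grained zone labels onto the three coarse classes."""
    return [THREE_CLASS[label] for label in labels]


def accuracy(predicted, gold):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("label sequences must have the same length")
    if not gold:
        return 0.0
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


# Toy data: two of the four lines are confused with a zone in the same class.
gold = ["greeting", "author", "signoff", "signature"]
predicted = ["author", "author", "signoff", "disclaimer"]

fine_accuracy = accuracy(predicted, gold)
coarse_accuracy = accuracy(coarsen(predicted), coarsen(gold))
```

Confusions between zones that fall into the same coarse class (greeting versus author, signature versus disclaimer) stop counting as errors, so the coarse score can only be equal to or higher than the fine-grained one.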
I’m currently working to finish off the Zebra system, as well as to resolve some licensing issues so that the code can be released for other researchers to use. We have, however, already released our annotated email dataset consisting of almost 12,000 lines of annotated email text that we used to train the Zebra system. If you want to know more, you can read our paper, head over to the Zebra website, or just get in touch with me by email or other means.
About 5 years ago, during my Masters studies, I wrote some simple speech applications using Java Speech API (JSAPI) 1.0 compliant speech engines. At the time, the JSR for JSAPI 2.0 was well underway. Well, it’s taken more than 8 years since the formation of the JSR, but the final release of the JSAPI 2.0 specification is *finally* available, released on 7th May 2009.
Of note, JSAPI 2.0 is now primarily aimed at the Java ME platform (specifically CLDC 1.0 and MIDP 1.0), meaning that it’s hoped the new spec will facilitate speech-enabled Java applications on mobile devices. For this reason, gone are all floating point references and dependencies on AWT (yay!). Recognition Engines may provide full support for application-defined grammars or provide more limited support through specialized built-in grammars. Synthesis Engines may support full text-to-speech capabilities or simple text and audio sequencing. According to documentation in the spec, implementations can require 0.5-1.5 MBytes of ROM for models and algorithms and approximately 128 KBytes of RAM depending on vocabulary and grammar size. Of course, JSAPI 2.0 compliant engines can still run on Java SE platforms, and can obviously make good use of more substantial memory and processing resources.
“We think that the API is well designed and has very comprehensive functions. However, it is therefore highly complex and requires fairly advanced speech recognition and synthesis features. It also assumes a high level of speech recognition understanding from the application developer. It might not be feasible in many Java ME devices in the near term, but can provide good features in those high end platforms where applicable.”
All in all, while it has taken a long time to come to fruition, I’m very pleased to see the JSAPI 2.0 standard finalised. Of course, given that JSAPI is only a specification (not an implementation) it remains to be seen how quickly the various speech recognition and speech synthesis systems move to support the new and modified APIs.
One of the main aims of this workshop is to gather email and enterprise computing researchers and practitioners to discuss and propose solutions for email in e-commerce and enterprise contexts.
Topics:
Architecture for enterprise cooperation and interoperability over email
Intelligent email for SMEs
Email-based business task and process management
Email content analysis, message summarization, information extraction
Semantic Email and Semantic Knowledge Extraction
Email social networks for enterprise computing
Email analysis of exchanged documents for semantic alignment via negotiation
Email Workflow Management for Business Processes
Interconnection of email content and enterprise resources (legacy systems, document repositories)
Enterprise resource mashup support for business email
Approaches for email visualization and user interfaces in business contexts
Case studies
Business email datasets
If you’re a researcher working with email, or if your startup or company is in the email space, please consider submitting a paper or demo to the workshop. Full details are available in the Call for Papers.
Luckily for us, Gabor is fitting in some travel between finishing up at Xobni and starting his new company, and Sydney is one of the stops on his itinerary. Gabor is an excellent presenter, so if you’re in Sydney, I highly recommend coming along to the seminar that he will be giving on The Future of Email at CSIRO / Macquarie University, starting 11am on Wednesday 15th October. (Here are details of our location and how to get here if you’re planning to come along.)
Of course, given Gabor’s experience as an entrepreneur, I’m sure he’ll also be happy to talk about life in a Silicon Valley startup and the lessons he’s learned along the way. So, come along for the seminar, and stick around for what’s sure to be some interesting discussion.
Whistleblower website Wikileaks.org, which famously made its debut revealing secret documents about Guantanamo Bay, has announced that it has acquired a corpus of over 8,000 diplomatic emails from the government of Venezuelan President Hugo Chavez. Controversially, Wikileaks is offering to auction off the corpus to the highest bidder.
The winning bidder will get exclusivity and embargoed access to the documents. However, there is hope for cash-poor email researchers, as Wikileaks claims that they will eventually publish all of the email, after the embargo expires.
The corpus allegedly includes email messages and attachments from 2005 to July 2008 that provide insight into the management of Chavez’s “inner circle”, along with “sentiments about CIA activities in Venezuela, Columbian incursions, the visit of the Pope”, and the Bolivarian revolution. Based on the Wikileaks press release below, the email messages appear to be from a single diplomat’s mailbox.
From: Wikileaks Press Office
Date: Wed, 27 Aug 2008 20:38:47 +0100
Inside Venezuela – over 8, 000 diplomatic emails 2005-2008
Wikileaks has prepared for publication over 8,000 internal and
external emails to and from a senior Venezuelan diplomat and former
speech writer for Hugo Chavez. The emails are dated 2005 to July
2008, and include several thousand attachments. The preparation
includes a “one touch” translation system to over a dozen different
languages.
The material provides a unqiue insight into the Bolivarian revolution,
President Chavez’s manamgement of his inner circle, and affairs
ranging from Cuban and Venezuelan contacts, sentiments about CIA
activites in Venezuela, Columbian incursions, the visit of the
Pope and Venezuelan views on many other countries and events.
Organizations wishing to bid for exclusivity (proceeds to our source
defense fund) and embargoed access contact usa@wikileaks.org for
additional information.
Thanks to Rob McArthur for alerting me to the Wired News article about the auction. If anyone out there knows more about this potential corpus, please comment!
Update (3/8/08): Of course, I assume the email messages are likely to be in Spanish, the official language of Venezuela.
A few months back, I had a conversation about my PhD work with Kate Stevens, one of the members of the executive for HCSNet, an Australian Research Council funded collaboration network for researchers working on topics in the broad space of Human Communication Science.
Parts of my on-camera conversation with Kate have made it into the recently released HCSNet Promotional video, which is now available on YouTube. It’s always a bit weird seeing yourself on camera, particularly when sound bites are taken from a much longer conversation! Given the totally unscripted nature of what was recorded though, I think it’s worked out quite well.
Of course, this is also a good opportunity to actually plug the annual HCSNet Summerfest, which will be held at UNSW in Sydney in December. If you’re interested in speech, language, sonics, psychology or any number of topics in between, check out what’s on offer – it’s well worth a few days of your time to meet some inspiring people.
Mark Dredze, Vitor Carvalho and Tessa Lau did an excellent job bringing together a great bunch of people working on a variety of email-related research at the recent EMAIL-08 workshop at AAAI in Chicago. There was a huge amount of energy and enthusiasm amongst the participants, which is a great thing for the future of email research.
Following on from the workshop, we have created a series of new resources to help keep the community connected. The first of these is a new mailing list for those interested in email research. Our intention is for this list to be a central place for people in the email research community to discuss ideas and projects and to announce resources of interest. More information about the list (including subscription information) can be found at http://groups.google.com/group/email-research.
In addition to the list, we have also created a community-maintained email research website that we hope will keep a current list of email datasets, published papers and related information. Please get in touch if you have relevant content for the site.
If you are at all involved in email-related research, I strongly encourage you to join the new Email Research mailing list and to take part in the ongoing discussion of the wider email research community. I’m looking forward to hearing your ideas!
I know that Robert Dale (who happens to be one of my PhD supervisors) has been working hard towards this for some time now, along with a host of other people from the ACL and CL boards. So thank you and congratulations to all involved!
For those outside the community, the CL journal is arguably the most prestigious journal for our field. Despite this, I find that work seems far more visible when published in the big CL conferences (ACL, Coling, EMNLP etc.). It will be interesting to see how the move to open access changes this balance.
Of course, not having been in Ohio this year, I still have questions about the details. What’s the funding model? Will there be new sections or types of publications accepted? Will each issue contain more papers than previously, now that there aren’t physical page limits? I’m sure I’ll hear the details in time.
The move is not without challenges, but I think this is excellent news for both our CL/NLP communities and for the research community more generally in making high quality published research more easily available to everyone.
Way back in mid-2005, I set up an Enron Email Mailing List to encourage people to share data, experience, questions and knowledge about working with the Enron corpus. While the list has been quite low-traffic, a significant number of email researchers subscribed, and I like to think that it’s been of at least some use to people working with the Enron data.
Unfortunately, if you have tried to post to the list in the past few months (or if you are new and tried to subscribe), it would not have worked.
Due to some technical and people issues (that I have been slow to notice and even slower to address – my apologies for this!) the list disappeared off the face of the internet sometime around September last year. Unfortunately, the mailing list archives were lost in this process, and I haven’t been able to recover them, although I do have a personal archive of all the mailing list messages, if anyone is in desperate need of a copy.
The good news is that I have reconstructed the membership list, based on my personal archives of the list. So the list is now functioning again. If you’re not already a subscriber, and you’d like to join, just head on over to the Enron Email Mailing List page.
If any of you have Enron-specific or more general email research questions or topics you’d like to discuss, I’d encourage you to post them to the list.