Email Zoning: Finding Signal amongst the Textual Noise of Email Messages
In the early days of email, widely-used conventions for indicating quoted reply content and email signatures made it easy to segment email messages into their functional parts. Today, the explosion of different email formats and styles, coupled with the ad hoc ways in which people vary the structure and layout of their messages, means that simple techniques for identifying quoted replies that used to yield 95% accuracy now find less than 10% of such content.
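To make those "simple techniques" concrete, here is a minimal sketch of the kind of convention-based zoning that used to work. It assumes the two classic plain-text conventions: quoted reply lines start with ">", and a signature begins after a line containing only "-- ". The function name naive_zones and the three labels are hypothetical, chosen for illustration; this is not the Zebra approach.

```python
def naive_zones(body: str) -> list[tuple[str, str]]:
    """Label each line 'author', 'quoted' or 'signature' using classic conventions.

    Assumptions (illustrative only): a line starting with '>' is quoted reply
    content, and everything after a '-- ' delimiter line is signature.
    """
    labelled = []
    in_signature = False
    for line in body.splitlines():
        if line in ("-- ", "--"):
            # Conventional signature delimiter; some clients strip the trailing space.
            in_signature = True
            labelled.append(("signature", line))
        elif in_signature:
            labelled.append(("signature", line))
        elif line.lstrip().startswith(">"):
            labelled.append(("quoted", line))
        else:
            labelled.append(("author", line))
    return labelled


sample = "Hi Bob,\nSounds good.\n> Can we meet Tuesday?\n-- \nAlice\n"
for zone, text in naive_zones(sample):
    print(f"{zone:10} {text}")
```

A message that quotes the earlier email under a header like "On Monday, Bob wrote:" without any ">" prefixes, or that places the quoted reply below the signature, defeats both rules, which is the kind of variation described above.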
Many language processing and search tools stand to benefit from better knowledge of the different functional parts of email messages, since this would allow them to focus on relevant content in specific parts of a message. In particular, access to zone information would allow email classification, summarisation and analysis tools to separate or filter out ‘noise’ and focus on the content in specific zones of a message that are relevant to the application at hand. Email contact mining tools, for example, might only access content from the email signature, while tools that attempt to identify tasks or action items in email might restrict themselves to the sender-authored and forwarded content.
Last week, I presented my paper on Segmenting Email Message Text into Zones at the Empirical Methods in Natural Language Processing (EMNLP) conference in Singapore. The focus of this work is Zebra, an SVM-based system that automatically segments and classifies the body text of email messages into nine functional zone types based on graphic, orthographic and lexical cues.
Our set of nine zones includes the following: author, greeting, signoff, quoted reply, forward, signature, advertising, disclaimer and attachment. Zebra currently performs the segmentation and classification of email text into the nine zones with an accuracy of about 87%. When the number of zones is abstracted to two or three zone classes (which is much more likely to be the granularity required for real-world email processing tasks), Zebra’s accuracy increases above 91.5%.
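One way to see why accuracy rises at coarser granularity: confusions between similar fine-grained zones stop counting as errors once those zones share a class. The sketch below is illustrative only. The grouping in COARSE (sender, quoted, boilerplate) is an assumption made for this example and is not necessarily the grouping Zebra uses, and the accuracy helper and the sample labels are hypothetical.

```python
# Hypothetical grouping of the nine zones into three coarse classes.
COARSE = {
    "author": "sender",
    "greeting": "sender",
    "signoff": "sender",
    "quoted reply": "quoted",
    "forward": "quoted",
    "signature": "boilerplate",
    "advertising": "boilerplate",
    "disclaimer": "boilerplate",
    "attachment": "boilerplate",
}


def accuracy(gold, predicted, mapping=None):
    """Fraction of lines whose predicted zone matches the gold zone.

    If a mapping is given, both label sequences are first collapsed
    to the coarser classes before being compared.
    """
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equal length")
    if mapping is not None:
        gold = [mapping[label] for label in gold]
        predicted = [mapping[label] for label in predicted]
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Made-up labels for a four-line message (not real Zebra output).
gold = ["greeting", "author", "signoff", "signature"]
pred = ["author", "author", "signoff", "disclaimer"]

print(accuracy(gold, pred))          # nine-zone comparison
print(accuracy(gold, pred, COARSE))  # three-class comparison
```

In this made-up example, the greeting/author and signature/disclaimer confusions are errors at the nine-zone level but disappear once the labels are collapsed.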
I’m currently working to finish off the Zebra system, as well as to resolve some licensing issues so that the code can be released for other researchers to use. We have, however, already released our annotated email dataset consisting of almost 12,000 lines of annotated email text that we used to train the Zebra system. If you want to know more, you can read our paper, head over to the Zebra website, or just get in touch with me by email or other means.
Details of “Lost” Bush Administration Emails to Remain Secret
Thanks to a unanimous ruling in the U.S. Court of Appeals for the D.C. Circuit, details surrounding millions of Bush Administration emails that were lost (and later found) may remain secret.
Back in October 2005, the Office of Administration in the (Bush Administration) White House allegedly discovered that the Executive Office of the President had lost millions of White House emails between 2003 and 2005. In April 2007, CREW filed a Freedom-of-Information-Act request to the Office of Administration asking for information about the missing emails. CREW sought records about the EOP’s e-mail management system, reports analyzing potential problems with the system, records of retained emails and possibly missing ones, documents discussing plans to find the missing emails, and proposals to institute a new email record system.
Sadly for CREW, the latest ruling finds that the Office of Administration is not an “agency” under the terms of the Freedom-of-Information-Act, and thus need not comply with CREW’s request to provide information about the “misplaced” emails.
Of course, there are other cases still moving through the courts between the Executive Office of the President, CREW and other parties. And, thanks to earlier lawsuits in the 1990s, email from the White House must be treated and preserved as government records. For more information about the “lost” Bush Administration emails, the National Security Archive at George Washington University has a comprehensive chronology of the saga.
(Hat-tip to Roger Matus for alerting me to the ruling)
Java Speech API 2.0 Specification Finally Released
About 5 years ago, during my Masters studies, I wrote some simple speech applications using Java Speech API (JSAPI) 1.0 compliant speech engines. At the time, the JSR for JSAPI 2.0 was well underway. Well, it’s taken more than 8 years since the formation of the JSR, but *finally* the final release of the Java Speech API (JSAPI) 2.0 specification was made available on 7th May 2009.
Of note, JSAPI 2.0 is now primarily aimed at the Java ME platform (specifically CLDC 1.0 and MIDP 1.0), meaning that it’s hoped the new spec will facilitate speech-enabled Java applications on mobile devices. For this reason, gone are all floating point references and dependencies on AWT (yay!). Recognition Engines may provide full support for application-defined grammars or provide more limited support through specialized built-in grammars. Synthesis Engines may support full text-to-speech capabilities or simple text and audio sequencing. According to documentation in the spec, implementations can require 0.5-1.5 MBytes of ROM for models and algorithms and approximately 128 KBytes of RAM depending on vocabulary and grammar size. Of course, JSAPI 2.0 compliant engines can still run on Java SE platforms, and can obviously make good use of more substantial memory and processing resources.
Reinforcing comments made by expert group member Paul Lamere about the difficulties of satisfying all parties and developing a comprehensive speech API, Nokia made the following observation in approving the final specification:
“We think that the API is well designed and has very comprehensive functions. However, it is therefore highly complex and requires fairly advanced speech recognition and synthesis features. It also assumes a high level of speech recognition understanding from the application developer. It might not be feasible in many Java ME devices in the near term, but can provide good features in those high end platforms where applicable.”
Unrelated to Java ME compatibility, also gone are the Java Speech API Grammar Format (JSGF) and Java Speech API Markup Language (JSML), which were defined as companion specifications in JSAPI 1.0. Sensibly, given the standardisation that has thankfully occurred in the intervening years, these have been replaced by the W3C Speech Recognition Grammar Specification (SRGS) and the W3C Speech Synthesis Markup Language (SSML) respectively. After spending some time reviewing the plethora of speech synthesis markup languages, I’m very relieved to see this standardisation.
All in all, while it has taken a long time to come to fruition, I’m very pleased to see the JSAPI 2.0 standard finalised. Of course, given that JSAPI is only a specification (not an implementation) it remains to be seen how quickly the various speech recognition and speech synthesis systems move to support the new and modified APIs.
E3C: Email in eCommerce and Enterprise Contexts
I’m very excited to announce that we’re planning another email workshop, following up from last year’s very successful AAAI workshop (EMAIL-08). This one is titled The 1st International Workshop on Email in e-Commerce and Enterprise Contexts (E3C), and is being held at the 11th IEEE Conference on Commerce and Enterprise Computing (CEC 2009) in Vienna, Austria on July 20th 2009.
Important Dates:
- Full paper submission: March 15th, 2009
- Authors Notification: April 15th, 2009
- Camera ready versions due: May 15th, 2009
- Workshop: July 20, 2009
One of the main aims of this workshop is to gather email and enterprise computing researchers and practitioners to discuss and propose solutions for email in e-commerce and enterprise contexts.
Topics:
- Architecture for enterprise cooperation and interoperability over email
- Intelligent email for SMEs
- Email-based business task and process management
- Email content analysis, message summarization, information extraction
- Semantic Email and Semantic Knowledge Extraction
- Email social networks for enterprise computing
- Email analysis of exchanged documents for semantic alignment via negotiation
- Email Workflow Management for Business Processes
- Interconnection of email content and enterprise resources (legacy systems, document repositories)
- Enterprise resource mashup support for business email
- Approaches for email visualization and user interfaces in business contexts
- Case studies
- Business email datasets
If you’re a researcher working with email, or if your startup or company is in the email space, please consider submitting a paper or demo to the workshop. Full details are available in the Call for Papers.
Google adds Task List to GMail Labs
I’ve been traveling for the past couple of weeks, so I missed the announcement of Tasks as a new feature in GMail Labs. Given my own interests in tasks in email, this seems to be the most useful Labs feature to surface so far. Also of interest are the nearly 500 threads discussing ideas for future enhancements to the Tasks plugin.

The focus seems to be on lightweight interaction, which is definitely the right approach. To add a new task, for example, you just click in an empty part of the task list and start typing. This seems pretty similar to the style of task interaction pioneered by Remember the Milk, and I’d be interested to know how it compares with RTM’s GMail services, particularly their recently announced RTM GMail gadget that can be added via GMail Labs. Are there any users out there who have experimented with RTM’s tools and can offer insight on the comparative strengths and weaknesses of the new Labs task addition?
There doesn’t seem to be much in the way of tight interaction between email and tasks (yet), but I’m sure this will be in the pipeline for future enhancements.
On the topic of tasks in email, if you’re interested in learning more about how people phrase tasks in email messages, have a look at my recent paper, Requests and Commitments in Email are Complex: Eight Reasons to be Cautious, which I presented at the Australasian Language Technology conference in Hobart earlier this week.
Sarah Palin’s Email Leaked
A series of email messages from the controversial Yahoo! Mail account of US Republican vice-presidential candidate Sarah Palin was leaked onto the Internet today.
As with the recently announced Venezuelan government email leak, Wikileaks was again in the scrum, issuing the following press release:
The internet activist group ‘anonymous’, famed for its exposure of unethical behavior by the Scientology cult, has now gone after the Alaskan governor and republican Vice-Presidential candidate Sarah Palin.
At around midnight last night the group gained access to governor Palin’s email account … and handed over the contents to the government sunshine site Wikileaks.org.
Governor Palin has come under media criticism in the past week for using pseudo-private email accounts to avoid Alaskan freedom of information laws.
The zip archive made available by Wikileaks contains screen shots of Palin’s inbox, two example emails, governor Palin’s address box and a couple of family photos. While the emails released so far reveal little, the list of correspondence appears to re-enforce the criticism that Palin is mixing governmental and personal affairs.
The emails quoted in press articles to date seem to show that Palin has improperly used her private email account to conduct government business, thereby avoiding archiving requirements and shielding herself and her government from public scrutiny. It is unclear what if any action will be taken in response. According to the Sydney Morning Herald, the Secret Service contacted The Associated Press and asked for copies of the leaked emails on her Yahoo! account, but AP did not comply.
The Palin email leak is the latest in a string of unauthorised email disclosures. Ironically, it comes almost a year to the day after the MediaDefender email leak. Clearly, our recent discussion about the ethics of email corpora on the email research mailing list is a timely one!
Quote Selected Text: A Useful Gmail Labs Addition
I’ve previously noted my disappointment with the array of trivial trinkets that have so far defined Gmail Labs. One of the most recent additions, however, finally adds something of use.
Quote selected text allows you to selectively quote and reply to one small part of a message. Like other email clients with this feature (Apple Mail springs to mind), you just highlight the text you want to include in your reply, hit the keyboard shortcut “r” to reply, and the compose template will be just what you selected. This is a simple but useful feature. Note that it only works in Firefox and IE right now. Safari and Chrome support is still in progress.
Gabor Cselle on the Future of Email
As many in the email community will know, Gabor Cselle, VP of Engineering at Email startup Xobni, announced a month or so ago that he was leaving Xobni to start his own email company.
Luckily for us, Gabor is fitting in some travel between finishing up at Xobni and starting his new company, and Sydney is one of the stops on his itinerary. Gabor is an excellent presenter, so if you’re in Sydney, I highly recommend coming along to the seminar that he will be giving on The Future of Email at CSIRO / Macquarie University, starting 11am on Wednesday 15th October. (Here are details of our location and how to get here if you’re planning to come along).
Of course, given Gabor’s experience as an entrepreneur, I’m sure he’ll also be happy to talk about life in a Silicon Valley startup and the lessons he’s learned along the way. So, come along for the seminar, and stick around for what’s sure to be some interesting discussion.
Integrating new email features in Outlook using Xobni
Gabor Cselle and Greg Duffy from Xobni gave an excellent keynote at the AAAI Email Workshop. Amongst other insights and Xobni anecdotes, their combined presentation gave an overview of just how difficult and painful it is to integrate new ideas into existing email clients like Microsoft Outlook. Such pain is, unfortunately, unavoidable if you’d like your ideas to reach any of the 400 million Outlook email users out there in the world.
The exciting news I took away from the Xobni presentation was the plan to open up external access to developer APIs to access and extend Xobni’s sidebar. This is what LinkedIn has had access to in order to achieve the recent integration with Xobni, and might be a less painful path to Outlook integration for other developers in the future.
Get Involved in the Email Research Community
Mark Dredze, Vitor Carvalho and Tessa Lau did an excellent job bringing together a great bunch of people working on a variety of email-related research at the recent EMAIL-08 workshop at AAAI in Chicago. There was a huge amount of energy and enthusiasm amongst the participants, which is a great thing for the future of email research.
Following on from the workshop, we have created a series of new resources to help keep the community connected. The first of these is a new mailing list for those interested in email research. Our intention is for this list to be a central place for people in the email research community to discuss ideas and projects and to announce resources of interest. More information about the list (including subscription information) can be found at http://groups.google.com/group/email-research.
In addition to the list, we have also created a community-maintained email research website that we hope will keep a current list of email datasets, published papers and related information. Please get in touch if you have relevant content for the site.
If you are at all involved in email-related research, I strongly encourage you to join the new Email Research mailing list and to take part in the ongoing discussion of the wider email research community. I’m looking forward to hearing your ideas!