In the early days of email, widely used conventions for indicating quoted reply content and email signatures made it easy to segment email messages into their functional parts. Today, the explosion of different email formats and styles, coupled with the ad hoc ways in which people vary the structure and layout of their messages, means that simple techniques for identifying quoted replies that used to yield 95% accuracy now find less than 10% of such content.
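To make the "simple techniques" concrete, here is a minimal sketch of the kind of convention-based segmenter that worked in those early days. It assumes the two classic plain-text conventions: a leading ">" marks quoted reply text, and a line consisting of "-- " starts the signature. The function name and labels are my own illustration, not part of any published system.

```python
def naive_zones(body: str) -> list[tuple[str, str]]:
    """Label each line 'author', 'quoted' or 'signature' using only
    the old plain-text conventions (illustrative, not a real system)."""
    labelled = []
    in_signature = False
    for line in body.splitlines():
        if line == "-- ":
            # Conventional signature delimiter: dash, dash, space.
            in_signature = True
            labelled.append(("signature", line))
        elif in_signature:
            # Everything after the delimiter is treated as signature.
            labelled.append(("signature", line))
        elif line.lstrip().startswith(">"):
            # Leading '>' is the classic quoted-reply marker.
            labelled.append(("quoted", line))
        else:
            labelled.append(("author", line))
    return labelled


message = "Thanks, see below.\n> Can you send the report?\n-- \nAndrew"
for label, line in naive_zones(message):
    print(f"{label:10} | {line}")
```

A message quoted with an "Original Message" banner and no ">" markers, or signed without the "-- " delimiter, falls straight through to "author", which is exactly how rules like these lose most of the content they were meant to find.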
Many language processing and search tools stand to benefit from better knowledge of the different functional parts of email messages, since this would allow them to focus on relevant content in specific parts of a message. In particular, access to zone information would allow email classification, summarisation and analysis tools to separate or filter out ‘noise’ and focus on the content in specific zones of a message that are relevant to the application at hand. Email contact mining tools, for example, might only access content from the email signature, while tools that attempt to identify tasks or action items in email might restrict themselves to the sender-authored and forwarded content.
Last week, I presented my paper on Segmenting Email Message Text into Zones at the Empirical Methods in Natural Language Processing (EMNLP) conference in Singapore. The focus of this work is Zebra, an SVM-based system that automatically segments and classifies the body text of email messages into nine functional zone types based on graphic, orthographic and lexical cues.
Our set of nine zones includes the following: author, greeting, signoff, quoted reply, forward, signature, advertising, disclaimer and attachment. Zebra currently performs the segmentation and classification of email text into the nine zones with an accuracy of about 87%. When the number of zones is abstracted to two or three zone classes (which is much more likely to be the granularity required for real-world email processing tasks), Zebra’s accuracy increases above 91.5%.
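As a rough sketch of what line-level cues and zone abstraction might look like in code, the example below computes a handful of graphic, orthographic and lexical features for a single line, and collapses the nine zones into three coarser classes. The feature names, the greeting word list and the three-way grouping are all assumptions made for illustration. They are not Zebra's actual feature set or code, and the classifier that would consume these features is left out.

```python
import re

# Assumed grouping of the nine zones into three coarser classes
# (illustrative only).
ZONE_TO_CLASS = {
    "author": "sender", "greeting": "sender", "signoff": "sender",
    "quoted reply": "quoted", "forward": "quoted",
    "signature": "boilerplate", "advertising": "boilerplate",
    "disclaimer": "boilerplate", "attachment": "boilerplate",
}

# Hypothetical lexical cue list; a real system would use far more.
GREETING_WORDS = {"hi", "hello", "dear", "hey"}


def line_features(line: str) -> dict[str, float]:
    """Compute a few illustrative cues for one line of email text."""
    stripped = line.strip()
    tokens = re.findall(r"[a-z']+", stripped.lower())
    length = max(len(stripped), 1)  # avoid dividing by zero on blank lines
    return {
        # Graphic cues: layout of the line.
        "is_blank": float(not stripped),
        "length": float(len(stripped)),
        # Orthographic cues: characters, punctuation, capitalisation.
        "starts_with_quote_marker": float(stripped.startswith(">")),
        "punct_ratio": sum(
            not c.isalnum() and not c.isspace() for c in stripped
        ) / length,
        "upper_ratio": sum(c.isupper() for c in stripped) / length,
        # Lexical cues: the words themselves.
        "starts_with_greeting": float(
            bool(tokens) and tokens[0] in GREETING_WORDS
        ),
        "has_email_address": float(
            bool(re.search(r"\S+@\S+\.\S+", stripped))
        ),
    }


print(line_features("Hi Andrew,"))
print(ZONE_TO_CLASS["disclaimer"])
```

Feature dictionaries like these would be turned into vectors and passed to an SVM trained on annotated lines; collapsing labels through a mapping such as ZONE_TO_CLASS is one simple way to evaluate at the coarser granularity.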
I’m currently working to finish off the Zebra system, as well as to resolve some licensing issues so that the code can be released for other researchers to use. We have, however, already released our annotated email dataset consisting of almost 12,000 lines of annotated email text that we used to train the Zebra system. If you want to know more, you can read our paper, head over to the Zebra website, or just get in touch with me by email or other means.
One of the main aims of this workshop is to gather email and enterprise computing researchers and practitioners to discuss and propose solutions for email in e-commerce and enterprise contexts.
Topics:
Architecture for enterprise cooperation and interoperability over email
Intelligent email for SMEs
Email-based business task and process management
Email content analysis, message summarization, information extraction
Semantic Email and Semantic Knowledge Extraction
Email social networks for enterprise computing
Email analysis of exchanged documents for semantic alignment via negotiation
Email Workflow Management for Business Processes
Interconnection of email content and enterprise resources (legacy systems, document repositories)
Enterprise resource mashup support for business email
Approaches for email visualization and user interfaces in business contexts
Case studies
Business email datasets
If you’re a researcher working with email, or if your startup or company is in the email space, please consider submitting a paper or demo to the workshop. Full details are available in the Call for Papers.
A few months back, I had a conversation about my PhD work with Kate Stevens, one of the members of the executive for HCSNet, an Australian Research Council funded collaboration network for researchers working on topics in the broad space of Human Communication Science.
Parts of my on-camera conversation with Kate have made it into the recently released HCSNet Promotional video, which is now available on YouTube. It’s always a bit weird seeing yourself on camera, particularly when sound bites are taken from a much longer conversation! Given the totally unscripted nature of what was recorded though, I think it’s worked out quite well.
Of course, this is also a good opportunity to actually plug the annual HCSNet Summerfest, which will be held at UNSW in Sydney in December. If you’re interested in speech, language, sonics, psychology or any number of topics in between, check out what’s on offer – it’s well worth a few days of your time to meet some inspiring people.
Mark Dredze, Vitor Carvalho and Tessa Lau did an excellent job bringing together a great bunch of people working on a variety of email-related research at the recent EMAIL-08 workshop at AAAI in Chicago. There was a huge amount of energy and enthusiasm amongst the participants, which is a great thing for the future of email research.
Following on from the workshop, we have created a series of new resources to help keep the community connected. The first of these is a new mailing list for those interested in email research. Our intention is for this list to be a central place for people in the email research community to discuss ideas and projects and to announce resources of interest. More information about the list (including subscription information) can be found at http://groups.google.com/group/email-research.
In addition to the list, we have also created a community-maintained email research website that we hope will keep a current list of email datasets, published papers and related information. Please get in touch if you have relevant content for the site.
If you are at all involved in email-related research, I strongly encourage you to join the new Email Research mailing list and to take part in the ongoing discussion of the wider email research community. I’m looking forward to hearing your ideas!
I know that Robert Dale (who happens to be one of my PhD supervisors) has been working hard towards this for some time now, along with a host of other people from the ACL and CL boards. So thank you and congratulations to all involved!
For those outside the community, the CL journal is arguably the most prestigious journal for our field. Despite this, I find that work seems far more visible when published in the big CL conferences (ACL, Coling, EMNLP etc.). It will be interesting to see how the move to open access changes this balance.
Of course, not having been in Ohio this year, I still have questions about the details. What’s the funding model? Will there be new sections or types of publications accepted? Will each issue contain more papers than before, now that there aren’t physical page limits? I’m sure I’ll hear the details in time.
The move is not without challenges, but I think this is excellent news for both our CL/NLP communities and for the research community more generally in making high quality published research more easily available to everyone.
Way back in mid-2005, I set up an Enron Email Mailing List to encourage people to share data, experience, questions and knowledge about working with the Enron corpus. While the list has been quite low-traffic, a significant number of email researchers subscribed, and I like to think that it’s been of at least some use to people working with the Enron data.
Unfortunately, if you have tried to post (or if new people tried to subscribe) to the list in the past few months, things wouldn’t have worked out.
Due to some technical and people issues (that I have been slow to notice and even slower to address – my apologies for this!) the list disappeared off the face of the internet sometime around September last year. Unfortunately, the mailing list archives were lost in this process, and I haven’t been able to recover them, although I do have a personal archive of all the mailing list messages, if anyone is in desperate need of a copy.
The good news is that I have reconstructed the membership list, based on my personal archives of the list. So the list is now functioning again. If you’re not already a subscriber, and you’d like to join, just head on over to the Enron Email Mailing List page.
If any of you have Enron-specific or more general email research questions or topics you’d like to discuss, I’d encourage you to post them to the list.
The main aim of the workshop is to provide a focus for people working on email and other messaging technologies. In some ways this is what I think the Conference on Email and Anti-Spam (CEAS) could have (and perhaps should have) been, but in recent years, CEAS seems to have been heavily focused on the anti-spam aspect of email, at the apparent expense of work more focused on HCI, NLP, AI and so on. Sensing this gap, the Enhanced Messaging Workshop is also hoping to set a multi-year agenda of important research goals for the field of email research and messaging technologies more generally.
For anyone interested, here’s an introduction to the purposes of the workshop from the Call For Participation:
With the rise of the digital workplace, email has become a ubiquitous tool in the office and a primary means of communication. Email’s growth has created new opportunities and challenges for a large variety of artificial intelligence research, focusing an increasing amount of academic and industrial research on email issues. Research seeks to enhance the email user experience by addressing email overload or to learn from email social patterns. Recent papers have dealt with email triage, activity management, email prioritization, summarization, topic tracking, sorting, leak detection, social network analysis, and enhanced intelligent interfaces. The wide spectrum of email research has appeared in a variety of conferences. The growing interest in email has left a fractured community spread through many sub-areas, a particularly important problem for this type of work since all research is aimed at improving a single application.
The Workshop on Enhanced Messaging at AAAI 2008 brings together researchers working on solutions for email and other forms of web messaging from many subfields of AI as well as soliciting participation from the broader community. We will discuss recent progress in the field and share research experiences. The community will outline existing problems in email and construct major research objectives for the next few years. We expect this workshop to be an important step towards building a community structure that will open channels of communication and collaboration as we move forward.
The workshop is aiming to appeal to both academic and industrial researchers (you might notice that Gabor Cselle, VP of Engineering at Xobni, is on the Program Committee too), so if you work at all in the email or messaging space, please have a look at the Call For Participation and consider submitting a paper, poster or demo.
I presented the first published work from my PhD this week at the Australasian Language Technology Workshop (ALTW) – a paper entitled Classifying Speech Acts using Verbal Response Modes. Happily for me, our paper ended up being judged by the international panel of reviewers as the joint recipient of the Best Paper Award for the conference. Stephen Wan, a fellow PhD student at both Macquarie Uni and CSIRO, was awarded the best student presentation award.
As well as reporting on our first classifier of surface speech acts using VRM, the paper also sets the scene for our ongoing research into how we can usefully exploit knowledge of speech acts in email and other forms of online conversation.
In particular, I’m interested in using knowledge of the intentional structure to improve how we currently search conversations and to provide insight into how we might automatically generate summaries of such conversations. I’m also exploring how such structure can provide some automated indication of conversation state. Although our work is still quite preliminary, we got lots of interest and some great feedback and insight from a range of people at ALTW.
If you want to know more about our work, please take a look at the paper. If you have any queries or comments, I’d really appreciate it if you would comment on this post or drop me an email (Andrew.Lampert@csiro.au).