NextMail’11: Next Trends in Email
Thursday March 03rd 2011, 8:36 pm
Filed under: email,information delivery,language technology,mobile,research,technology
Posted by: Andrew Lampert

Is e-mail obsolete? Far from it. We continue to gather more and more information in our inboxes: personal and professional communications, but also marketing and commercial ads, alerts and notifications from websites or social networks, search engine results, agendas, …

The NextMail’11 workshop will focus on current work and emerging trends in email research. I’m happy to be a part of the program committee for the workshop, which will be held as part of the IEEE / WIC / ACM International Conferences on Web Intelligence and Intelligent Agent Technology on August 22, 2011 in Lyon, France.

You can read the full Call For Papers for all the details, but relevant topics include:

  • Email content analysis, information extraction, summarization
  • Email social networks in enterprise
  • Email management strategies within organizations
  • Adaptive email agents and semantic agents
  • Email archive exploration, visualization, regulations and behaviors
  • Email visual interfaces and human/computer interaction with email
  • Case studies, experiments and user studies on email usage
  • Benchmark and email testing datasets
  • Interoperability over email with enterprise resources and legacy systems
  • Semantic email and email mining
  • Unified messaging and web interactions: instant messaging, RSS feeds, annotations, tagging
  • Personal information management integration in email clients, pending task management
  • Interaction between email, PIM and the mobility factor
  • Facing the volume growth, do we need to replace the old protocols?
  • Evolution of infrastructures and uses

Papers are due by 21st March 2011, so get writing!



Subtextual adds a private backchannel within your email message
Thursday August 19th 2010, 11:37 pm
Filed under: email,information delivery,mobile,technology
Posted by: Andrew Lampert

An interesting aspect of online group communication is the phenomenon of backchannel. Backchannel in computer-mediated communication (CMC) allows participants within a group conversation to exchange private communication which is visible only to the sender and receiver. Many existing forms of CMC provide such capability – think IRC, Skype and even Twitter (through direct messages).

Launched 5 months ago, Subtextual (until recently known as bccthis) is an interesting plug-in for Microsoft Outlook that allows the mixing of public (visible to all recipients) and private (visible only to specific recipients) content within a single email message. This allows a sender to send a single message, but add private context addressed to only those people that need it.

Subtextual adds the ability to send a hidden message as part of a normal email message. This hidden content is visible only to selected message recipients – other recipients never see any indication that the message has any additional content. Happily, recipients don’t need any plug-in to view Subtextual messages.

As well as the Outlook plug-in, Subtextual also has a Twitter client (which seems less compelling to me), a Firefox plug-in for using Subtextual with Gmail and a BlackBerry application.

While clearly an interesting idea, I’m not sure whether Subtextual is significant enough to be more than just another feature for Outlook. I am, however, impressed with their family of products across desktop, mobile and web-based email. Together with their recently announced premium version of the Outlook plug-in, it feels like the company is busy experimenting, trying to discover the platforms which can deliver them traction, customers and revenue. I am very interested to see in which direction this company will pivot in the future.



Rethinking Mobile Email
Tuesday August 17th 2010, 11:38 am
Filed under: email,language technology,mobile,technology
Posted by: Andrew Lampert

In work reminiscent of their original ReMail work, but targeted at mobile email, IBM is rethinking mobile email. Their focus is on fast email triage on mobile devices, including how to capture intended actions, such as those that might be actioned on the desktop at a later time (rather than on the mobile device).

While it’s widely acknowledged that desktop email clients have been slow to adapt to changing volumes and styles of email use, the problem is arguably more acute in the mobile space. For starters, obviously the device form factors influence how people use mobile email – you’re not likely to see people typing long-winded messages with their thumbs – yet many mobile email clients are essentially designed as smaller versions of desktop email clients. Mobile email users typically focus on triaging their messages to determine what’s new, what they can delete right away, and what’s important enough to handle immediately. They often defer everything else until they are at a desktop or laptop with a full keyboard and larger display.

I think it’s worth spending 7 minutes or so to watch the video below, where Jeff Pierce outlines the project:



There’s more information, including a short paper, also available at the Triage and Capture: Rethinking Mobile Email website.



iPhone iOS4 adds Event / Date detection in Email
Thursday July 01st 2010, 12:59 pm
Filed under: email,language technology,mobile
Posted by: Andrew Lampert

A quick note about a new feature in the email client on the iPhone in the latest iOS4 release. When you receive an email with a date or time mentioned in it, Apple’s email client automatically detects the date, and presents it as an underlined hyperlink. Clicking the date then creates an event in your Calendar on the date/time that was recognised in the text. Apparently it defaults to using the email’s subject line as the event title. (As an aside, it’s also worth noting that Gmail has had similar functionality since around 2006.)

I’m guessing Apple has rolled their own date recognition code, probably using simple rules or regular expressions. Does anyone know more about the technology behind this feature?
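To make the "simple rules or regular expressions" guess concrete, here is a minimal Python sketch of rule-based date detection. It is not Apple's implementation: the pattern, the function names (`find_dates`, `suggest_events`) and the two supported date formats are all my own assumptions, chosen purely for illustration. The only behaviour borrowed from the feature described above is using the email's subject line as the event title.

```python
import re
from datetime import datetime

# Assumption: only full English month names and two common layouts are handled.
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
DAY = r"(\d{1,2})(?:st|nd|rd|th)?"

# Matches "21st March 2011" (day first) or "August 22, 2011" (month first).
DATE_PATTERN = re.compile(
    rf"\b(?:{DAY}\s+({MONTHS})|({MONTHS})\s+{DAY},?)\s+(\d{{4}})\b"
)


def find_dates(text):
    """Return (matched_text, date) pairs for every recognisable date in text."""
    results = []
    for m in DATE_PATTERN.finditer(text):
        day = m.group(1) or m.group(4)
        month = m.group(2) or m.group(3)
        year = m.group(5)
        try:
            parsed = datetime.strptime(f"{day} {month} {year}", "%d %B %Y")
        except ValueError:
            # Looks like a date but is not a real one, e.g. "31 February 2011".
            continue
        results.append((m.group(0), parsed.date()))
    return results


def suggest_events(subject, body):
    """Build one hypothetical calendar event per detected date,
    titled with the email's subject line."""
    return [
        {"title": subject, "date": found, "matched_text": matched}
        for matched, found in find_dates(body)
    ]
```

Even this toy version hints at why such features are harder than they look: relative expressions ("next Tuesday"), times, abbreviations and other locales would each need further rules.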

(Hat tip to Rob Tot for alerting me to this functionality)



First Clinton Administration Email Released
Friday June 25th 2010, 11:55 pm
Filed under: email,language technology
Posted by: Andrew Lampert

Way back in 2008 at the AAAI Workshop on Enhanced Email, Mark Dredze mooted that emails from the Clinton era would at some stage be released to the public. Happily, just days ago, the William J. Clinton Presidential Library and Museum began releasing email and other records from the US Clinton Administration.

The first release has focused on messages and documents authored by or sent to Elena Kagan. Elena Kagan served in two positions during the Clinton Administration. She was Associate White House Counsel from 1995 to 1996 and Deputy Assistant to the President for Domestic Policy and Deputy Director of the Domestic Policy Council from 1997 to 1999. To date, the released records include email created and received by Elena Kagan, along with 114 messages deemed to be part of the Federal side of the Clinton White House. These messages also include forwards, reply chains, and attachments. The attached documents include notes, memoranda, articles, reports, executive orders, bills, and directives.

Released emails are arranged into categories called “buckets”; within buckets, messages are arranged by creation date. Emails were stored in these “buckets” by the Automated Records Management System (ARMS) that was used during the Clinton Administration to capture email from Lotus Notes. The ARMS databases hold attachments in proprietary software formats that were converted to hexadecimal code (hex-code or hex-dump). Where this hexadecimal code is included in an email message, archivists have converted it back to readable text. Converted attachments have been arranged behind their corresponding created or received email.
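For readers unfamiliar with hex dumps, the conversion back to text can be sketched in a few lines of Python. This is only an illustration of the general idea: I don't know the actual ARMS dump layout or the tools the archivists used, and the function name and the `latin-1` default encoding are my own assumptions. Real attachments in proprietary formats would need format-specific handling beyond simple decoding.

```python
def hex_dump_to_text(hex_code, encoding="latin-1"):
    """Decode a string of hexadecimal digits back into text.

    Returns None when the input is not valid hexadecimal.
    Assumption: the dump is plain hex digits, optionally separated
    by spaces or line breaks, with no offsets or other columns.
    """
    cleaned = "".join(hex_code.split())  # drop spaces and line breaks
    try:
        raw = bytes.fromhex(cleaned)
    except ValueError:
        # Odd number of digits or non-hex characters.
        return None
    return raw.decode(encoding, errors="replace")
```

For example, `hex_dump_to_text("48 65 6c 6c 6f")` returns `"Hello"`.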

The emails released so far span a wide range of topics, including Amtrak, campaign finance reform, gaming/gambling (especially as it relates to Native Americans), timber, regulatory reform, welfare and domestic policy topics such as AIDS, budget appropriations, education, health, labor, race, and tobacco.

Sadly, like many email releases, the messages are rendered as PDF files, rather than in their native digital form. The files I have examined, however, have been OCR’d, and so the message text is at least searchable, and presumably extractable. One obvious question is why the emails should have been OCR’d, when ARMS presumably stored things as electronic text to begin with? As Tom Lee notes, it appears that these are re-digitised versions of data dumped out of ARMS.

To download or take a look at the data yourself, head on over to the Clinton Presidential Library. Alternatively, the Sunlight Foundation has put up a familiar inbox-style view of the data for more convenient browsing.



The Failed Wikileaks Auction of Venezuelan Diplomatic Email Messages
Wednesday January 13th 2010, 3:32 pm
Filed under: email
Posted by: Andrew Lampert

I was recently contacted by Stefan Mey, who interviewed Julian Assange. Assange, an Australian, is the spokesperson of Wikileaks. The interview makes for interesting reading. In discussing how Wikileaks is financed, Mey elicits some interesting comments on the controversial auction of Venezuelan government email that I’ve previously covered on this blog.

Back in September 2008, there was widespread discussion of a collection of 8000 diplomatic emails from the government of Venezuelan President Hugo Chavez that found its way to Wikileaks. They turn out to be from Hugo Chavez’ former speech writer, Freddy Balzan. At the time, Wikileaks tried to auction off access to the email messages to the highest bidder. The auction was ultimately cancelled.

Stefan Mey’s interview with Julian Assange was actually conducted in English; English and German transcripts are available on Stefan’s blog. Below are some extracted comments regarding the Venezuelan email messages:


[Stefan Mey:] In Germany you made an exclusivity deal with two media companies, the Stern and Heise. Are you satisfied with these kind of deals?

[Julian Assange:] We did this in other countries before. Generally we have been satisfied. The problem is it takes too much time to manage. To make a contract, and to determine who should have the exclusivity. Someone can say, oh, we will do a good story. We are going to maximize the political impact. And then they won’t do it. How do we measure this?

You want to make sure that if you give them the exclusivity that they really do what they promise to do …

Yes. One thing that can’t be faked is how much money they pay. If you have an auction and a media organisation pays the most, then they are predicting that they will benefit the most from publishing the story. That is, they will have the maximum number of readers. So this is a very good way to measure who should have the exclusivity. We tried to do it as an experiment in Venezuela.

Why Venezuela?

Because of the character of the document. We had 7.000 Emails from Freddy Balzan, he was Hugo Chavez’ former speech writer and also the former ambassador to Argentina. We knew that this document would have this problem, that it was big and politically important, therefore probably no one would write anything about it for the reason I just said.

What happened?

This auction proved to be a logistical nightmare. Media organisations wanted access to the material before they went to auction. So we would get them to sign non-disclosure agreements, chop up the material and release just every second page or every second sentence. That was too distracting to all the normal work we were doing, so that we said, forget it, we can’t do that. We just released the material as normal. And that’s precisely what happened: No one wrote anything at all about those 7.000 Emails. Even though 15 stories had appeared about the fact that we were holding the auction.

The experiment failed.

The experiment didn’t fail, the experiment taught us about what the burdens were. We would actually need a team of five or six people whose job was just to arrange these auctions.

You plan to continue the auction idea in the future …

We plan to continue it, but we know it will take more resources. But if we pursue that we will not do that for single documents. Instead we will do a subscription. This would be much simpler. We would only have the overhead of doing the auction stuff every three months or six months, not for every document.

So the exclusivity of the story will run out after three months?

No, there will be exclusivity in terms of different time windows in access to the material. As an example: there will be an auction for North America. And you will be ranked in the auction. The media organisation who bids most in the auction, would get access to it first, the one who bids second will get access to it second and so on. Media organisations would have a subscription to Wikileaks.

I haven’t ever actually seen the Venezuelan emails, but in the extract above, Assange seems to indicate that they were eventually made freely available from the Wikileaks site. The Wikileaks site is currently unavailable (though the page states it was to be available again after 11th Jan 2010), instead showing a page requesting funds from supporters, so I can’t confirm whether the emails are actually available for download from the site.

Regardless, as with the infamous MediaDefender emails, it seems unlikely that the emails could be ethically used for research purposes.

Update 13/01/10: I mistakenly stated that the interview was conducted in German. Stefan Mey has confirmed that the interview was in English, and translated into German.



New Enron Email Corpus release with attachments
Wednesday November 25th 2009, 9:13 pm
Filed under: email,language technology,research,search
Posted by: Andrew Lampert

Exciting news – there’s a new version of the Enron email corpus that’s now publicly available which includes both the email messages and attachments.

Recently, an organisation called EDRM (Electronic Discovery Reference Model) has made a version of the Enron email corpus available for download that includes attachments, which were missing from the widely used versions of the corpus available from CMU, ISI etc. Apparently, the initial data set was created by John Wang and a team at ZL Technologies.

This version of the corpus consists of a series of Microsoft PST files, which contain both email messages and attachments. It’s a reasonably large dataset, especially compared with the email-only versions; the total size of the compressed files is about 19 GB. The uncompressed files total about 43 GB. Except where otherwise noted, use of files is subject to a Creative Commons Attribution 3.0 United States License. Attribution should be noted as “EDRM (edrm.net).”

One thing to note is that every email appears to have had a footer added with EDRM attribution information, I assume as part of the conversion process into PST files. The content of the footer is consistent, however, so it could be readily filtered out if processing the emails automatically.
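Because the footer is consistent, filtering it out can be as simple as an exact suffix match. The Python sketch below assumes the message body has already been extracted from the PST file as a plain string. The footer text in `HYPOTHETICAL_FOOTER` is a placeholder, not the real EDRM wording; you would paste in the actual footer as it appears in the corpus.

```python
# Placeholder only: replace with the footer text exactly as it appears
# in the corpus. This is NOT the real EDRM attribution wording.
HYPOTHETICAL_FOOTER = (
    "***********\n"
    "EXAMPLE ATTRIBUTION FOOTER (placeholder text)"
)


def strip_footer(body, footer=HYPOTHETICAL_FOOTER):
    """Remove a known, constant footer from the end of a message body.

    Trailing whitespace is ignored when matching. If the footer is not
    present at the end of the body, the body is returned unchanged.
    """
    trimmed_body = body.rstrip()
    trimmed_footer = footer.strip()
    if trimmed_footer and trimmed_body.endswith(trimmed_footer):
        return trimmed_body[: -len(trimmed_footer)].rstrip()
    return body
```

An exact match is deliberately conservative: a message whose footer has been altered (for instance, re-wrapped by a converter) is left untouched rather than being trimmed in the wrong place.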



Email Zoning: Finding Signal amongst the Textual Noise of Email Messages
Monday August 10th 2009, 3:46 pm
Filed under: email,java,language technology,research,science,search,technology
Posted by: Andrew Lampert

In the early days of email, widely-used conventions for indicating quoted reply content and email signatures made it easy to segment email messages into their functional parts. Today, the explosion of different email formats and styles, coupled with the ad hoc ways in which people vary the structure and layout of their messages, means that simple techniques for identifying quoted replies that used to yield 95% accuracy now find less than 10% of such content.
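As a rough illustration of the kind of "simple technique" I mean, here is a Python sketch that labels lines as quoted when they start with the conventional `>` prefix or look like an "On ... wrote:" attribution line. The two patterns and the function name are my own choices for this example, not a description of any particular tool.

```python
import re

# Conventional plain-text quoting: lines prefixed with ">".
QUOTE_PREFIX = re.compile(r"^\s*>")
# A common attribution line introducing a quoted reply.
ATTRIBUTION = re.compile(r"^On .+ wrote:\s*$")


def label_lines(body):
    """Label each line of an email body as 'quoted' or 'author'
    using only the two surface conventions above."""
    labels = []
    for line in body.splitlines():
        if QUOTE_PREFIX.match(line) or ATTRIBUTION.match(line):
            labels.append(("quoted", line))
        else:
            labels.append(("author", line))
    return labels
```

This works nicely on a traditionally formatted reply, but a message that quotes earlier content under a header such as "-----Original Message-----", with no `>` prefixes at all, comes back labelled entirely as author text. That is exactly the failure mode described above.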

Many language processing and search tools stand to benefit from better knowledge of the different functional parts of email messages, since this would allow them to focus on relevant content in specific parts of a message. In particular, access to zone information would allow email classification, summarisation and analysis tools to separate or filter out ‘noise’ and focus on the content in specific zones of a message that are relevant to the application at hand. Email contact mining tools, for example, might only access content from the email signature, while tools that attempt to identify tasks or action items in email might restrict themselves to the sender-authored and forwarded content.

Last week, I presented my paper on Segmenting Email Message Text into Zones at the Empirical Methods in Natural Language Processing (EMNLP) conference in Singapore. The focus of this work is Zebra, an SVM-based system that automatically segments and classifies the body text of email messages into nine functional zone types based on graphic, orthographic and lexical cues.

Our set of nine zones includes the following: author, greeting, signoff, quoted reply, forward, signature, advertising, disclaimer and attachment. Zebra currently performs the segmentation and classification of email text into the nine zones with an accuracy of about 87%. When the number of zones is abstracted to two or three zone classes (which is much more likely to be the granularity required for real-world email processing tasks), Zebra’s accuracy increases above 91.5%.
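To see why accuracy rises when zones are merged, consider the small Python sketch below. The three-way grouping in `THREE_ZONE_MAP` is an assumption made for illustration (the paper has the definitive classes), and the gold and predicted labels are a made-up toy example, not Zebra output. The point is simply that confusions between zones in the same coarse class stop counting as errors.

```python
# The nine zone types named in the post.
NINE_ZONES = [
    "author", "greeting", "signoff", "quoted reply", "forward",
    "signature", "advertising", "disclaimer", "attachment",
]

# Assumed grouping for illustration only; see the paper for the
# classes actually used in the evaluation.
THREE_ZONE_MAP = {
    "author": "sender", "greeting": "sender", "signoff": "sender",
    "quoted reply": "quoted", "forward": "quoted",
    "signature": "boilerplate", "advertising": "boilerplate",
    "disclaimer": "boilerplate", "attachment": "boilerplate",
}


def coarsen(labels, mapping=THREE_ZONE_MAP):
    """Map fine-grained zone labels onto coarser zone classes."""
    return [mapping[label] for label in labels]


def accuracy(predicted, gold):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("label sequences must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Toy example: two of four lines are wrong at nine-zone granularity,
# but both errors fall inside the same coarse class.
GOLD = ["greeting", "author", "signoff", "signature"]
PREDICTED = ["author", "author", "signoff", "disclaimer"]
```

Here `accuracy(PREDICTED, GOLD)` is 0.5, while `accuracy(coarsen(PREDICTED), coarsen(GOLD))` is 1.0.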

I’m currently working to finish off the Zebra system, as well as to resolve some licensing issues so that the code can be released for other researchers to use. We have, however, already released our annotated email dataset consisting of almost 12,000 lines of annotated email text that we used to train the Zebra system. If you want to know more, you can read our paper, head over to the Zebra website, or just get in touch with me by email or other means.



Details of “Lost” Bush Administration Emails to Remain Secret
Wednesday May 20th 2009, 11:50 pm
Filed under: email,language technology,technology
Posted by: Andrew Lampert

Thanks to a unanimous ruling in the U.S. Court of Appeals for the D.C. Circuit, details surrounding millions of Bush Administration emails that were lost (and later found) may remain secret.

Back in October 2005, the Office of Administration in the (Bush Administration) White House allegedly discovered that the Executive Office of the President had lost millions of White House emails between 2003 and 2005. In April 2007, CREW filed a Freedom-of-Information-Act request to the Office of Administration asking for information about the missing emails. CREW sought records about the EOP’s e-mail management system, reports analyzing potential problems with the system, records of retained emails and possibly missing ones, documents discussing plans to find the missing emails, and proposals to institute a new email record system.

Sadly for CREW, the latest ruling finds that the Office of Administration is not an “agency” under the terms of the Freedom-of-Information-Act, and thus need not comply with CREW’s request to provide information about the “misplaced” emails.

Of course, there are other cases still moving through the courts between the Executive Office of the President, CREW and other parties. And, thanks to earlier lawsuits in the 1990s, email from the White House must be treated and preserved as government records. For more information about the “lost” Bush Administration emails, the National Security Archive at George Washington University has a comprehensive chronology of the saga.

(Hat-tip to Roger Matus for alerting me to the ruling)



Java Speech API 2.0 Specification Finally Released
Friday May 08th 2009, 10:27 pm
Filed under: java,language technology,research,technology
Posted by: Andrew Lampert

About 5 years ago, during my Masters studies, I wrote some simple speech applications using Java Speech API (JSAPI) 1.0 compliant speech engines. At the time, the JSR for JSAPI 2.0 was well underway. Well, it’s taken more than 8 years since the formation of the JSR, but *finally* the final release of the Java Speech API (JSAPI) 2.0 specification has been made available, released on 7th May 2009.

Of note, JSAPI 2.0 is now primarily aimed at the Java ME platform (specifically CLDC 1.0 and MIDP 1.0), meaning that it’s hoped the new spec will facilitate speech-enabled Java applications on mobile devices. For this reason, gone are all floating point references and dependencies on AWT (yay!). Recognition Engines may provide full support for application-defined grammars or provide more limited support through specialized built-in grammars. Synthesis Engines may support full text-to-speech capabilities or simple text and audio sequencing. According to documentation in the spec, implementations can require 0.5-1.5 MBytes of ROM for models and algorithms and approximately 128 KBytes of RAM depending on vocabulary and grammar size. Of course, JSAPI 2.0 compliant engines can still run on Java SE platforms, and can obviously make good use of more substantial memory and processing resources.

Reinforcing comments made by expert group member Paul Lamere about the difficulties of satisfying all parties and developing a comprehensive speech API, Nokia made the following observation in approving the final specification:

“We think that the API is well designed and has very comprehensive functions. However, it is therefore highly complex and requires fairly advanced speech recognition and synthesis features. It also assumes a high level of speech recognition understanding from the application developer. It might not be feasible in many Java ME devices in the near term, but can provide good features in those high end platforms where applicable.”

Unrelated to Java ME compatibility, also gone are the Java Speech API Grammar Format (JSGF) and Java Speech API Markup Language (JSML), which were defined as companion specifications in JSAPI 1.0. Sensibly, given the standardisation that has thankfully occurred in the intervening years, these have been replaced by the W3C Speech Recognition Grammar Specification (SRGS) and the W3C Speech Synthesis Markup Language (SSML) respectively. After spending some time reviewing the plethora of speech synthesis markup languages, I’m very relieved to see this standardisation.

All in all, while it has taken a long time to come to fruition, I’m very pleased to see the JSAPI 2.0 standard finalised. Of course, given that JSAPI is only a specification (not an implementation) it remains to be seen how quickly the various speech recognition and speech synthesis systems move to support the new and modified APIs.