The Failed Wikileaks Auction of Venezuelan Diplomatic Email Messages
Wednesday January 13th 2010, 3:32 pm
Filed under: email
Posted by: Andrew Lampert

I was recently contacted by Stefan Mey, who interviewed Julian Assange. Assange, an Australian, is the spokesperson of Wikileaks. The interview makes for interesting reading. In discussing how Wikileaks is financed, Mey elicits some interesting comments on the controversial auction of Venezuelan government email that I’ve previously covered on this blog.

Back in September 2008, there was widespread discussion of a collection of 8000 diplomatic emails from the government of Venezuelan President Hugo Chavez that found its way to Wikileaks. They turn out to be from Hugo Chavez’ former speech writer, Freddy Balzan. At the time, Wikileaks tried to auction off access to the email messages to the highest bidder. The auction was ultimately cancelled.

Stefan Mey’s interview with Julian Assange was conducted in English; English and German transcripts are available on Stefan’s blog. Below are some extracted comments regarding the Venezuelan email messages:


[Stefan Mey:] In Germany you made an exclusivity deal with two media companies, the Stern and Heise. Are you satisfied with these kinds of deals?

[Julian Assange:] We did this in other countries before. Generally we have been satisfied. The problem is it takes too much time to manage. To make a contract, and to determine who should have the exclusivity. Someone can say, oh, we will do a good story. We are going to maximize the political impact. And then they won’t do it. How do we measure this?

You want to make sure that if you give them the exclusivity that they really do what they promise to do …

Yes. One thing that can’t be faked is how much money they pay. If you have an auction and a media organisation pays the most, then they are predicting that they will benefit the most from publishing the story. That is, they will have the maximum number of readers. So this is a very good way to measure who should have the exclusivity. We tried to do it as an experiment in Venezuela.

Why Venezuela?

Because of the character of the document. We had 7,000 emails from Freddy Balzan; he was Hugo Chavez’ former speech writer and also the former ambassador to Argentina. We knew that this document would have this problem, that it was big and politically important, therefore probably no one would write anything about it for the reason I just said.

What happened?

This auction proved to be a logistical nightmare. Media organisations wanted access to the material before they went to auction. So we would get them to sign non-disclosure agreements, chop up the material and release just every second page or every second sentence. That was too distracting to all the normal work we were doing, so that we said, forget it, we can’t do that. We just released the material as normal. And that’s precisely what happened: No one wrote anything at all about those 7,000 emails. Even though 15 stories had appeared about the fact that we were holding the auction.

The experiment failed.

The experiment didn’t fail, the experiment taught us about what the burdens were. We would actually need a team of five or six people whose job was just to arrange these auctions.

You plan to continue the auction idea in the future …

We plan to continue it, but we know it will take more resources. But if we pursue that we will not do that for single documents. Instead we will do a subscription. This would be much simpler. We would only have the overhead of doing the auction stuff every three months or six months, not for every document.

So the exclusivity of the story will run out after three months?

No, there will be exclusivity in terms of different time windows in access to the material. As an example: there will be an auction for North America. And you will be ranked in the auction. The media organisation who bids most in the auction, would get access to it first, the one who bids second will get access to it second and so on. Media organisations would have a subscription to Wikileaks.

I haven’t ever actually seen the Venezuelan emails, but in the extract above, Assange seems to indicate that they were eventually made freely available from the Wikileaks site. The Wikileaks site is currently unavailable (though the page states it was to be available again after 11th Jan 2010), instead showing a page requesting funds from supporters, so I can’t confirm whether the emails are actually available for download from the site.

Regardless, as with the infamous MediaDefender emails, it seems unlikely that the email could be ethically used for research purposes.

Update 13/01/10: I mistakenly stated that the interview was conducted in German. Stefan Mey has confirmed that the interview was in English, and translated into German.



New Enron Email Corpus release with attachments
Wednesday November 25th 2009, 9:13 pm
Filed under: email, language technology, research, search
Posted by: Andrew Lampert

Exciting news – there’s a new version of the Enron email corpus that’s now publicly available which includes both the email messages and attachments.

Recently, an organisation called EDRM (Electronic Discovery Reference Model) has made a version of the Enron email corpus available for download that includes attachments, which were missing from the widely used versions of the corpus available from CMU, ISI etc. Apparently, the initial data set was created by John Wang and a team at ZL Technologies.

This version of the corpus consists of a series of Microsoft PST files, which contain both email messages and attachments. It’s a reasonably large dataset, especially compared with the email-only versions: the compressed files total about 19 GB, and the uncompressed files about 43 GB. Except where otherwise noted, use of files is subject to a Creative Commons Attribution 3.0 United States License. Attribution should be noted as “EDRM (edrm.net).”

One thing to note is that every email appears to have had a footer added with EDRM attribution information, I assume as part of the conversion process into PST files. The content of the footer is consistent, however, so could be readily filtered out if processing the emails automatically.
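As a rough illustration of that filtering step, here is a minimal Python sketch. It assumes the message body has already been extracted from the PST file as plain text, and the footer string used here is a made-up placeholder, not the actual EDRM footer wording; you would substitute the real text once you have inspected a few messages.

```python
# Hypothetical placeholder: replace with the first line of the real EDRM footer.
FOOTER_MARKER = "*** EDRM footer placeholder ***"


def strip_footer(body: str, marker: str = FOOTER_MARKER) -> str:
    """Remove everything from the last occurrence of the footer marker onwards.

    If the marker is not present, the body is returned unchanged.
    """
    index = body.rfind(marker)
    if index == -1:
        return body
    # Drop the footer and any blank lines that preceded it.
    return body[:index].rstrip()


if __name__ == "__main__":
    message = "See you at the meeting.\n\n" + FOOTER_MARKER + "\nAttribution text here."
    print(strip_footer(message))
```

Searching from the end with rfind is a deliberate choice: it avoids truncating a message early if the same text happens to appear in a quoted reply further up.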



Email Zoning: Finding Signal amongst the Textual Noise of Email Messages
Monday August 10th 2009, 3:46 pm
Filed under: email, java, language technology, research, science, search, technology
Posted by: Andrew Lampert

In the early days of email, widely-used conventions for indicating quoted reply content and email signatures made it easy to segment email messages into their functional parts. Today, the explosion of different email formats and styles, coupled with the ad hoc ways in which people vary the structure and layout of their messages, means that simple techniques for identifying quoted replies that used to yield 95% accuracy now find less than 10% of such content.
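To make the "simple techniques" concrete, here is a small Python sketch of the kind of convention-based heuristic I have in mind. It assumes two classic plain-text conventions: quoted reply lines begin with ">", and a signature begins after a line containing only "--". This is an illustration of the old approach and its assumptions, not the method used in Zebra; the function name and labels are my own.

```python
def label_lines(body: str) -> list[tuple[str, str]]:
    """Label each line as 'author', 'quoted' or 'signature' using classic conventions.

    Assumes quoted lines start with '>' and that a line consisting of '--'
    (optionally followed by whitespace) starts the signature block.
    """
    labelled = []
    in_signature = False
    for line in body.splitlines():
        if line.rstrip() == "--":
            in_signature = True
        if in_signature:
            labelled.append(("signature", line))
        elif line.lstrip().startswith(">"):
            labelled.append(("quoted", line))
        else:
            labelled.append(("author", line))
    return labelled


if __name__ == "__main__":
    example = "Sounds good.\n> Shall we meet at 3?\n-- \nAndrew"
    for label, text in label_lines(example):
        print(f"{label:10} {text}")
```

A heuristic like this breaks down as soon as a client marks quoted content some other way (an "Original Message" header, indentation, or HTML styling), which is exactly the problem described above.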

Many language processing and search tools stand to benefit from better knowledge of the different functional parts of email messages, since this would allow them to focus on relevant content in specific parts of a message. In particular, access to zone information would allow email classification, summarisation and analysis tools to separate or filter out ‘noise’ and focus on the content in specific zones of a message that are relevant to the application at hand. Email contact mining tools, for example, might only access content from the email signature, while tools that attempt to identify tasks or action items in email might restrict themselves to the sender-authored and forwarded content.

Last week, I presented my paper on Segmenting Email Message Text into Zones at the Empirical Methods in Natural Language Processing (EMNLP) conference in Singapore. The focus of this work is Zebra, an SVM-based system that automatically segments and classifies the body text of email messages into nine functional zone types based on graphic, orthographic and lexical cues.

Our set of nine zones includes the following: author, greeting, signoff, quoted reply, forward, signature, advertising, disclaimer and attachment. Zebra currently performs the segmentation and classification of email text into the nine zones with an accuracy of about 87%. When the number of zones is abstracted to two or three zone classes (which is much more likely to be the granularity required for real-world email processing tasks), Zebra’s accuracy increases above 91.5%.
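Abstracting to fewer zone classes amounts to mapping each fine-grained label onto a coarser one. The Python sketch below shows one plausible three-class grouping; the class names and the grouping itself are assumptions made for illustration here, so check the paper for the exact abstraction we evaluated.

```python
# The nine zone types listed above.
NINE_ZONES = [
    "author", "greeting", "signoff", "quoted reply", "forward",
    "signature", "advertising", "disclaimer", "attachment",
]

# Assumed grouping for illustration only: sender-written text, text quoted
# from other messages, and everything else treated as boilerplate.
THREE_CLASS = {
    "author": "sender",
    "greeting": "sender",
    "signoff": "sender",
    "quoted reply": "quoted",
    "forward": "quoted",
    "signature": "boilerplate",
    "advertising": "boilerplate",
    "disclaimer": "boilerplate",
    "attachment": "boilerplate",
}


def coarsen(zones: list[str], mapping: dict[str, str] = THREE_CLASS) -> list[str]:
    """Map a sequence of fine-grained zone labels onto coarser classes."""
    return [mapping[zone] for zone in zones]


if __name__ == "__main__":
    print(coarsen(["greeting", "author", "quoted reply", "signature"]))
```

Because several fine-grained confusions (say, greeting versus author) collapse into the same coarse class, accuracy measured on the coarser labels can only stay the same or go up.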

I’m currently working to finish off the Zebra system, as well as to resolve some licensing issues so that the code can be released for other researchers to use. We have, however, already released our annotated email dataset consisting of almost 12,000 lines of annotated email text that we used to train the Zebra system. If you want to know more, you can read our paper, head over to the Zebra website, or just get in touch with me by email or other means.



Details of “Lost” Bush Administration Emails to Remain Secret
Wednesday May 20th 2009, 11:50 pm
Filed under: email, language technology, technology
Posted by: Andrew Lampert

Thanks to a unanimous ruling in the U.S. Court of Appeals for the D.C. Circuit, details surrounding millions of Bush Administration emails that were lost (and later found) may remain secret.

Back in October 2005, the Office of Administration in the (Bush Administration) White House allegedly discovered that the Executive Office of the President had lost millions of White House emails between 2003 and 2005. In April 2007, CREW filed a Freedom-of-Information-Act request to the Office of Administration asking for information about the missing emails. CREW sought records about the EOP’s e-mail management system, reports analyzing potential problems with the system, records of retained emails and possibly missing ones, documents discussing plans to find the missing emails, and proposals to institute a new email record system.

Sadly for CREW, the latest ruling finds that the Office of Administration is not an “agency” under the terms of the Freedom-of-Information-Act, and thus need not comply with CREW’s request to provide information about the “misplaced” emails.

Of course, there are other cases still moving through the courts between the Executive Office of the President, CREW and other parties. And, thanks to earlier lawsuits in the 1990s, email from the White House must be treated and preserved as government records. For more information about the “lost” Bush Administration emails, the National Security Archive at George Washington University has a comprehensive chronology of the saga.

(Hat-tip to Roger Matus for alerting me to the ruling)



Java Speech API 2.0 Specification Finally Released
Friday May 08th 2009, 10:27 pm
Filed under: java, language technology, research, technology
Posted by: Andrew Lampert

About 5 years ago, during my Masters studies, I wrote some simple speech applications using Java Speech API (JSAPI) 1.0 compliant speech engines. At the time, the JSR for JSAPI 2.0 was well underway. Well, it’s taken more than 8 years since the formation of the JSR, but *finally* the final release of the Java Speech API (JSAPI) 2.0 specification has been made available, released on 7th May 2009.

Of note, JSAPI 2.0 is now primarily aimed at the Java ME platform (specifically CLDC 1.0 and MIDP 1.0), meaning that it’s hoped the new spec will facilitate speech-enabled Java applications on mobile devices. For this reason, gone are all floating point references and dependencies on AWT (yay!). Recognition Engines may provide full support for application-defined grammars or provide more limited support through specialized built-in grammars. Synthesis Engines may support full text-to-speech capabilities or simple text and audio sequencing. According to documentation in the spec, implementations can require 0.5-1.5 MBytes of ROM for models and algorithms and approximately 128 KBytes of RAM depending on vocabulary and grammar size. Of course, JSAPI 2.0 compliant engines can still run on Java SE platforms, and can obviously make good use of more substantial memory and processing resources.

Reinforcing comments made by expert group member Paul Lamere about the difficulties of satisfying all parties and developing a comprehensive speech API, Nokia made the following observation in approving the final specification:

“We think that the API is well designed and has very comprehensive functions. However, it is therefore highly complex and requires fairly advanced speech recognition and synthesis features. It also assumes a high level of speech recognition understanding from the application developer. It might not be feasible in many Java ME devices in the near term, but can provide good features in those high end platforms where applicable.”

Unrelated to Java ME compatibility, also gone are the Java Speech API Grammar Format (JSGF) and Java Speech API Markup Language (JSML), which were defined as companion specifications in JSAPI 1.0. Sensibly, given the standardisation that has thankfully occurred in the intervening years, these have been replaced by the W3C Speech Recognition Grammar Specification (SRGS) and the W3C Speech Synthesis Markup Language (SSML) respectively. After spending some time reviewing the plethora of speech synthesis markup languages, I’m very relieved to see this standardisation.

All in all, while it has taken a long time to come to fruition, I’m very pleased to see the JSAPI 2.0 standard finalised. Of course, given that JSAPI is only a specification (not an implementation) it remains to be seen how quickly the various speech recognition and speech synthesis systems move to support the new and modified APIs.



E3C: Email in eCommerce and Enterprise Contexts
Saturday February 14th 2009, 11:45 am
Filed under: email, language technology, research, science, search, technology
Posted by: Andrew Lampert

I’m very excited to announce that we’re planning another email workshop, following up from last year’s very successful AAAI workshop (EMAIL-08). This one is titled The 1st International Workshop on Email in e-Commerce and Enterprise Contexts (E3C), and is being held at the 11th IEEE Conference on Commerce and Enterprise Computing (CEC 2009) in Vienna, Austria on July 20th 2009.

Important Dates:

  • Full paper submission: March 15th, 2009
  • Authors Notification: April 15th, 2009
  • Camera ready versions due: May 15th, 2009
  • Workshop: July 20, 2009

One of the main aims of this workshop is to gather email and enterprise computing researchers and practitioners to discuss and propose solutions for email in e-commerce and enterprise contexts.

Topics:

  • Architecture for enterprise cooperation and interoperability over email
  • Intelligent email for SMEs
  • Email-based business task and process management
  • Email content analysis, message summarization, information extraction
  • Semantic Email and Semantic Knowledge Extraction
  • Email social networks for enterprise computing
  • Email analysis of exchanged documents for semantic alignment via negotiation
  • Email Workflow Management for Business Processes
  • Interconnection of email content and enterprise resources (legacy systems, document repositories)
  • Enterprise resource mashup support for business email
  • Approaches for email visualization and user interfaces in business contexts
  • Case studies
  • Business email datasets

If you’re a researcher working with email, or if your startup or company is in the email space, please consider submitting a paper or demo to the workshop. Full details are available in the Call for Papers.



Google adds Task List to GMail Labs
Thursday December 11th 2008, 11:17 am
Filed under: email, information delivery, language technology, technology
Posted by: Andrew Lampert

I’ve been traveling for the past couple of weeks, so missed the announcement of Tasks as a new feature in GMail Labs. Given my own interests in tasks in email, this seems to be the most useful Labs feature to surface so far. Also of interest are the nearly 500 threads discussing ideas for future enhancements to the Tasks plugin.

Tasks for Gmail Labs

The focus seems to be on lightweight interaction, which is definitely the right approach. To add a new task, for example, you just click in an empty part of the task list and start typing. This seems pretty similar to the style of task interaction pioneered by Remember the Milk, and I’d be interested to know how it compares with RTM’s GMail services, particularly their recently announced RTM GMail gadget that can be added via GMail Labs. Are there any users out there who have experimented with RTM’s tools and can offer insight on the comparative strengths and weaknesses of the new Labs task addition?

There doesn’t seem to be much in the way of tight interaction between email and tasks (yet), but I’m sure this will be in the pipeline for future enhancements.

On the topic of tasks in email, if you’re interested in learning more about how people phrase tasks in email messages, have a look at my recent paper, Requests and Commitments in Email are Complex: Eight Reasons to be Cautious, which I presented at the Australasian Language Technology conference in Hobart earlier this week.



Sarah Palin’s Email Leaked
Thursday September 18th 2008, 8:24 pm
Filed under: email, technology
Posted by: Andrew Lampert

A series of email messages from the controversial Yahoo! Mail account of US Republican vice-presidential candidate Sarah Palin were leaked onto the Internet today.

As with the recently announced Venezuelan government email leak, Wikileaks was again in the scrum, issuing the following press release:

The internet activist group ‘anonymous’, famed for its exposure of unethical behavior by the Scientology cult, has now gone after the Alaskan governor and republican Vice-Presidential candidate Sarah Palin.

At around midnight last night the group gained access to governor Palin’s email account … and handed over the contents to the government sunshine site Wikileaks.org.

Governor Palin has come under media criticism in the past week for using pseudo-private email accounts to avoid Alaskan freedom of information laws.

The zip archive made available by Wikileaks contains screen shots of Palin’s inbox, two example emails, governor Palin’s address box and a couple of family photos. While the emails released so far reveal little, the list of correspondence appears to reinforce the criticism that Palin is mixing governmental and personal affairs.

The emails quoted in press articles to date seem to show that Palin has improperly used her private email account to conduct government business, thereby avoiding archiving requirements and shielding herself and her government from public scrutiny. It is unclear what if any action will be taken in response. According to the Sydney Morning Herald, the Secret Service contacted The Associated Press and asked for copies of the leaked emails on her Yahoo! account, but AP did not comply.

The Palin email leak is the latest in a string of unauthorised email disclosures. Ironically, it comes almost a year to the day after the MediaDefender email leak. Clearly, our recent discussion about the ethics of email corpora on the email research mailing list is a timely one!



Quote Selected Text: A Useful Gmail Labs Addition
Friday September 12th 2008, 11:04 am
Filed under: email, search, technology
Posted by: Andrew Lampert

I’ve previously noted my disappointment with the array of trivial trinkets that have so far defined Gmail Labs. One of the most recent additions, however, finally adds something of use.

Quote selected text allows you to selectively quote and reply to one small part of a message. Like other email clients with this feature (Apple Mail springs to mind), you just highlight the text you want to include in your reply, hit the keyboard shortcut “r” to reply, and the compose template will be just what you selected. This is a simple but useful feature. Note that it only works in Firefox and IE right now. Safari and Chrome support is still in progress.



Gabor Cselle on the Future of Email
Sunday September 07th 2008, 8:35 pm
Filed under: csiro, email, language technology, research, search, software, technology
Posted by: Andrew Lampert

As many in the email community will know, Gabor Cselle, VP of Engineering at Email startup Xobni, announced a month or so ago that he was leaving Xobni to start his own email company.

Luckily for us, Gabor is fitting in some travel between finishing up at Xobni and starting his new company, and Sydney is one of the stops on his itinerary. Gabor is an excellent presenter, so if you’re in Sydney, I highly recommend coming along to the seminar that he will be giving on The Future of Email at CSIRO / Macquarie University, starting 11am on Wednesday 15th October. (Here are details of our location and how to get here if you’re planning to come along.)

Of course, given Gabor’s experience as an entrepreneur, I’m sure he’ll also be happy to talk about life in a Silicon Valley startup and the lessons he’s learned along the way. So, come along for the seminar, and stick around for what’s sure to be some interesting discussion.