Enron Email Mailing List – Available again
Monday March 17th 2008, 10:06 am
Filed under: email,language technology,research,science,technology,usability
Posted by: Andrew Lampert

Way back in mid-2005, I set up an Enron Email Mailing List to encourage people to share data, experience, questions and knowledge about working with the Enron corpus. While the list has been quite low-traffic, a significant number of email researchers subscribed, and I like to think that it’s been of at least some use to people working with the Enron data.

Unfortunately, if you have tried to post (or if new people tried to subscribe) to the list in the past few months, things wouldn’t have worked out.

Due to some technical and people issues (that I have been slow to notice and even slower to address – my apologies for this!) the list disappeared off the face of the internet sometime around September last year. Unfortunately, the mailing list archives were lost in this process, and I haven’t been able to recover them, although I do have a personal archive of all the mailing list messages, if anyone is in desperate need of a copy.

The good news is that I have reconstructed the membership list, based on my personal archives of the list. So the list is now functioning again. If you’re not already a subscriber, and you’d like to join, just head on over to the Enron Email Mailing List page.

If any of you have Enron-specific or more general email research questions or topics you’d like to discuss, I’d encourage you to post them to the list.

Finally, it’s probably a good time to remind anyone interested in email research about the upcoming AAAI Enhanced Messaging Workshop. You can find out all the details, including the important dates, at http://enhancedmessagingworkshop.googlepages.com.



Do we need sentiment analysis for email?
Tuesday January 22nd 2008, 12:32 pm
Filed under: email,information delivery,language technology,research,technology
Posted by: Andrew Lampert

Brij Singh at MessageDance has posted an interesting motivation for applying sentiment analysis to incoming email. He asks whether the sentiment evoked by incoming email results in cognitive turnover for knowledge workers, thus disrupting their productivity.

Brij thinks that the application of sentiment analysis to email could help address this mental wandering for knowledge-based employees:

I think it’s high time for companies to invest in sentiment classification and routing toxic emails to platform where immediate impact on employee productivity is less. Can carefully controlled social platform enable this process?

Having just yesterday attended a research presentation by Mary Gardiner on sentiment classification, it’s interesting to consider the possibilities and practicalities of applying sentiment classification techniques to email.

One unsupervised technique, pioneered by Turney and Littman, is to use pointwise mutual information (PMI) and word co-occurrence counts from a search engine to help determine the valence of each word in a text. Turney and Littman used the NEAR operator in Altavista to determine the co-occurrence of each word in their text to be classified (in our case, this would be each word from an incoming email message) with each word from a set of words with known positive or negative valence. The counts for co-occurrence with the known-positive words contribute to the positive sentiment of our unclassified word, while counts for co-occurrence with negative words contribute to the negative sentiment. These co-occurrence counts are then normalised and combined to determine the overall valence of each word from our unclassified text. The technique, though simple, worked surprisingly well (80% classification accuracy at the word level), much better than many more complex techniques.
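To make the arithmetic concrete, here is a minimal sketch of the word-level score described above. The seed lists are a tiny illustrative subset, the counts are made up, and the `hits` / `near_hits` dictionaries are hypothetical stand-ins for search engine hit counts and NEAR-query counts. The smoothing constant is also an assumption, there only to avoid taking the log of zero. It is a sketch of the idea, not a reproduction of Turney and Littman's experimental setup.

```python
import math

# Seed words of known valence (illustrative subset only).
POSITIVE_SEEDS = ["good", "excellent"]
NEGATIVE_SEEDS = ["bad", "poor"]

def so_pmi(word, hits, near_hits, smoothing=0.01):
    """Semantic orientation of `word` from co-occurrence counts.

    hits[s]           -> number of documents containing seed s
    near_hits[(w, s)] -> number of documents where w occurs near seed s

    This is the sum of PMI(word, seed) over the positive seeds minus the
    sum over the negative seeds. With equally sized seed lists, the terms
    for hits(word) and the corpus size cancel, so they are not needed.
    """
    score = 0.0
    for seed in POSITIVE_SEEDS:
        score += math.log2(near_hits.get((word, seed), 0) + smoothing)
        score -= math.log2(hits.get(seed, 0) + smoothing)
    for seed in NEGATIVE_SEEDS:
        score -= math.log2(near_hits.get((word, seed), 0) + smoothing)
        score += math.log2(hits.get(seed, 0) + smoothing)
    return score

# Made-up counts, standing in for search engine results.
hits = {"good": 1000, "excellent": 500, "bad": 900, "poor": 600}
near_hits = {
    ("thanks", "good"): 50, ("thanks", "excellent"): 20,
    ("thanks", "bad"): 5, ("thanks", "poor"): 2,
    ("overdue", "good"): 3,
    ("overdue", "bad"): 30, ("overdue", "poor"): 25,
}

thanks_score = so_pmi("thanks", hits, near_hits)    # positive: leans positive
overdue_score = so_pmi("overdue", hits, near_hits)  # negative: leans negative
```

A score above zero suggests positive valence and a score below zero suggests negative valence. For email, each word of an incoming message would be scored this way.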

Ignoring the sad reality that the NEAR operator is no longer available to use in Altavista queries (and that no other search engines offer an operator of similar functionality in their public query interface), it’s interesting to think about whether such a technique could be usefully applied to email. I don’t know if people have addressed how to move from word-level classification up to message-level sentiment classification, but it doesn’t seem to be an insurmountable problem.

More of an issue for email is whether people would be happy for the entire text of their email messages to be sent in clear text to a single search provider. Depending on the volume and nature of data on a user’s own machine, perhaps we could use the desktop search interface to approximate Turney and Littman’s technique, without passing sensitive email data out onto the network? Of course, there’s a big difference in the scale of corpus being used to generate the co-occurrence counts in this case – Altavista, at the time of the experiment, claimed to be indexing around 100 billion words. My desktop search index claims to contain about 1.5 million items (email messages, documents, visited web pages etc.). While that’s not going to get us to 100 billion words, it might be enough to get some credible results?



Enhanced Email Workshop at AAAI 2008
Friday December 14th 2007, 9:47 pm
Filed under: email,language technology,research,science
Posted by: Andrew Lampert

In exciting news, the proposal for an Enhanced Messaging Workshop at AAAI 2008 was recently accepted, thanks to the efforts of Tessa Lau, Vitor Carvalho and Mark Dredze. I’m especially excited to be a member of the program committee for the workshop!

The main aim of the workshop is to provide a focus for people working on email and other messaging technologies. In some ways this is what I think the Conference on Email and Anti-Spam (CEAS) could have (and perhaps should have) been, but in recent years, CEAS seems to have been heavily focused on the anti-spam aspect of email, at the apparent expense of work more focused on HCI, NLP, AI and so on. Sensing this gap, the Enhanced Messaging Workshop is also hoping to set a multi-year agenda of important research goals for the field of email research and messaging technologies more generally.

For anyone interested, here’s an introduction to the purposes of the workshop from the Call For Participation:

With the rise of the digital workplace, email has become a ubiquitous tool in the office and a primary means of communication. Email’s growth has created new opportunities and challenges for a large variety of artificial intelligence research, focusing an increasing amount of academic and industrial research on email issues. Research seeks to enhance the email user experience by addressing email overload or to learn from email social patterns. Recent papers have dealt with email triage, activity management, email prioritization, summarization, topic tracking, sorting, leak detection, social network analysis, and enhanced intelligent interfaces. The wide spectrum of email research has appeared in a variety of conferences. The growing interest in email has left a fractured community spread through many sub-areas, a particularly important problem for this type of work since all research is aimed at improving a single application.

The Workshop on Enhanced Messaging at AAAI 2008 brings together researchers working on solutions for email and other forms of web messaging from many subfields of AI as well as soliciting participation from the broader community. We will discuss recent progress in the field and share research experiences. The community will outline existing problems in email and construct major research objectives for the next few years. We expect this workshop to be an important step towards building a community structure that will open channels of communication and collaboration as we move forward.

The workshop is aiming to appeal to both academic and industrial researchers (you might notice that Gabor Cselle, VP of Engineering at Xobni, is on the Program Committee too), so if you work at all in the email or messaging space, please have a look at the Call For Participation and consider submitting a paper, poster or demo.



Requests and Promises in Email
Friday December 14th 2007, 9:10 am
Filed under: csiro,email,language technology,research
Posted by: Andrew Lampert

On the topic of my PhD work, I presented a paper at the Australasian Document Computing Symposium (ADCS) on Monday in Melbourne about how well humans agree on identifying requests and commitments in email messages. The bottom line appears to be that there is sufficient agreement to have some hope of automating the task, although there is much more work to do to make this happen. If you’re interested in the details, have a look at the paper.

Excitingly, I ended up winning the best presentation award. I think at least in part this was because I presented 40-odd slides in a 15 minute talk – which seemed impossibly many slides to most folks – and still managed to make my research understandable, which of course is the whole point!



Xobni Looking for Latent Structure in Email
Friday October 19th 2007, 4:41 pm
Filed under: email,information delivery,language technology,research,search,technology
Posted by: Andrew Lampert

Email seems to be a flavour of the moment, and Chris Morrison continues the trend over at VentureBeat with a short but informative write-up of four startups innovating around email.

Fuser and Orgoo both focus on the integrated/universal messaging client, bringing IM, social networks and other communication mediums into a single client alongside email. Xoopit is still in stealth-mode, so they haven’t revealed much publicly about the details of their work, but their focus appears to be on extracting and compiling collections of attached documents, images etc. from email archives. More interesting to me is Xobni, who I’ve been following with some interest since Vitor Carvalho brought the company to my attention a few weeks ago.

Chris Morrison notes that while Xobni already pulls out some information like phone numbers from email, there’s much more information waiting for someone to find an innovative way to highlight. Of course, highlighting is only one option for making such structure available and useful for end users. Matt Brezina, co-founder of Xobni, also comments about the latent, untapped structure in email:

“There’s a structure that just hasn’t been broken apart and exposed”
Matt Brezina – Co-Founder Xobni

I think Matt is right on target with this assessment. It’ll be interesting to see which avenues of structure they pursue. I have my own ideas on important latent structure in email, some of which you can hopefully read about in an upcoming conference paper. More details coming if and when the paper actually gets accepted.



MediaDefender Email Corpus: 6600 email messages released
Tuesday September 18th 2007, 11:56 am
Filed under: email,language technology,research,search,technology
Posted by: Andrew Lampert

The internet is buzzing with conversations about the huge email leak from MediaDefender, a company which makes its living selling services and software to prevent illegal content sharing in peer-to-peer networks. I was made aware of this hugely exciting opportunity thanks to the excellent Death By Email blog which provides a good summary of the unfolding drama.

Given its business, MediaDefender is of course not a popular company within the file-sharing community. It thus shouldn’t be surprising that people have been very eager to jump on the more than 6600 company email messages from MediaDefender employees and begin dissecting their content. The emails appear to date from the period between April 2007 and September 2007.

According to Ars-Technica, the e-mail was leaked to the public by a group that calls itself MediaDefender-Defenders. In a text file distributed with the email data, the group claims that MediaDefender employee Jay Mairs forwarded all of his company emails to a Gmail account, from where the email data was leaked. “A special thanks to Jay Maris, for circumventing there entire email-security by forwarding all your emails to your gmail account, and using the really highly secure password: blahbob”.

The group’s motivation for releasing the email is also made clear: “By releasing these emails we hope to secure the privacy and personal integrity of all peer-to-peer users. The emails contains information about the various tactics and technical solutions for tracking p2p users, and disrupt p2p services. So here it is; we hope this is enough to create a viable defense to the tactics used by these companies …”

As someone whose first use of BitTorrent was to download this email corpus, my interest in the data is purely academic – is this another corpus we could use for email research? Conveniently, the MediaDefender email data is released in mbox format, which is a welcome change from the image-based PDF files (created by scanning printed email messages!) that have been released in recent US court cases. Being in mbox format, the data has all the header information, making the data perfect for research purposes.

The (insurmountable?) problem with using this data for research is of course the fact that the email was not legally obtained. So, is there any way we could get ethics approval for publishing experiments using this data? It seems very doubtful to me, but I’d be curious to hear your thoughts.



MyLiveSearch: the end of outdated search results?
Saturday June 02nd 2007, 11:20 pm
Filed under: research,search,technology
Posted by: Andrew Lampert

I’ve recently come across MyLiveSearch. It’s a search engine that claims to be “the only engine that searches the web live”. Search results are apparently crawled and retrieved on the fly, as the user’s query is processed.

According to coverage at IDM, this is achieved through a browser plug in. A user’s search query is first run through a traditional search engine (their own, or an existing engine?), then as results are returned, MyLiveSearch performs deeper scanning on the fly, following embedded links to many more sites and turning up much more detailed and up-to-the-minute results.

What’s especially interesting is that the development team is based in Melbourne, and according to their website, have been working on this technology for the past 8 years. Does anyone know these guys?



Business users receive 10 times more email than personal users?
Friday March 30th 2007, 9:05 pm
Filed under: email,language technology,research,technology
Posted by: Andrew Lampert

This afternoon I came across an interesting quote about Yahoo! Mail in a Sydney Morning Herald article. Here’s the relevant extract from the article:

According to Yahoo engineer [and Group Vice President of Engineering] David Nakayama, in 1997 its total storage capacity for mail accounts was just 200GB. Today, he said that amount of space was consumed by just 10 minutes worth of inbound email.

I thought it would be interesting to do some calculations about email usage based on these numbers, and compare these results with previous numbers I’ve posted about email statistics within CSIRO.

In order to turn David Nakayama’s 200GB figure into something useful, we need to have some idea of the number of users that Yahoo! Mail has. If we believe numbers quoted by TechCrunch, based on Comscore Media Metrix statistics, then this number is something like 250 million users.

Ok, next we need to get some sort of handle on the size of an average email message. My first thought was to look to the Enron Email Corpus. The raw version of the corpus that is available from CMU contains 517,431 email messages (including duplicates) and takes up about 1.34 Gbytes of disk space. So the average email size in the Enron Email Corpus turns out to be approximately 2.7Kbytes. Of course, we should remember that this number underestimates the true average size, since the messages in the Enron Corpus have had all email attachments removed. Despite this methodological flaw, calculations using the average email size in the Enron Corpus should give us an approximate upper bound on the number of email messages being processed at Yahoo! Mail, since smaller messages mean more of them fit into the same number of bytes.

At 2.7 Kbytes per message, every Terabyte of data transferred represents approximately 370 million email messages. The quoted 200 Gbytes every 10 minutes works out to 1200 Gbytes per hour, so Yahoo! Mail is processing roughly 28.8 Terabytes of email per day, which with the Enron numbers, equates to 10.66 billion email messages per day. Based on 250 million users, that’s roughly 42.6 email messages per user per day. That seems like a plausible figure. As we’ve already noted, however, the average email size of 2.7Kbytes from the Enron data is just one data point, and almost certainly under-estimates the true average email size.

It turns out that some of the spam processing companies have looked at this problem too. Just recently, SoftScan released email statistics suggesting that the average size of a spam email message was now 11.76 Kbytes. (This number is apparently increasing, due to the growing number of image spam messages). Presumably, a very large proportion of mail processed by Yahoo! Mail is actually spam, (especially if the numbers are anything like the spam statistics for CSIRO), so numbers based on average spam email size are probably quite a realistic approximation.

At 11.76 Kbytes per message, every Terabyte of data transferred represents approximately 85 million email messages. In this case, our 28.8 Terabytes of daily Yahoo! Mail equates to roughly 2.45 billion email messages, or roughly 9.8 email messages per user per day. Now, of course not all Yahoo! Mail accounts are active, so the volume of email received by each active user is probably somewhat higher than this, but this gives us a reasonable lower-bound on email volume per user.

So, the average Yahoo! Mail user probably receives somewhere between 10 and 43 messages per day (including spam – hopefully a significant amount of which wouldn’t actually reach users’ inboxes).
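For anyone who wants to check the working or plug in their own figures, the calculation above can be sketched in a few lines. The inputs are the numbers quoted in this post. Treating a Gbyte as 10**9 bytes and a Kbyte as 1000 bytes is an assumption, and the function name is just for illustration.

```python
# Figures quoted in the post; decimal units are an assumption.
GB = 10**9
KB = 1000

# 200 Gbytes every 10 minutes -> 6 windows per hour, 24 hours per day.
inbound_bytes_per_day = 200 * GB * 6 * 24   # 28.8 Terabytes
users = 250_000_000

def messages_per_user_per_day(avg_message_bytes):
    """Daily messages per user for a given average message size."""
    messages_per_day = inbound_bytes_per_day / avg_message_bytes
    return messages_per_day / users

# Small messages (Enron average) give the upper bound on message count,
# large messages (SoftScan spam average) give the lower bound.
upper = messages_per_user_per_day(2.7 * KB)
lower = messages_per_user_per_day(11.76 * KB)

print(f"between {lower:.1f} and {upper:.1f} messages per user per day")
```

The results land at roughly 10 and 43, with small differences from the figures in the text due to rounding at intermediate steps.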

Why is this interesting? Well, we see a pretty stark contrast when we compare these numbers to those from inside a company. In my previous post, I calculated that the average number of incoming emails per user per day in CSIRO is upwards of 400 (including spam). That’s at least an order of magnitude greater than the numbers at Yahoo!.

I’m very curious whether these numbers are representative of a more general trend in email volumes for personal email users (who presumably dominate the Yahoo! Mail figures) and business users. Does anyone else have any additional email usage figures they can share that might shed light on this?

Finally, it’s also interesting to quickly consider what these numbers mean in terms of network bandwidth required for running Yahoo! Mail. Some simple back-of-the-envelope calculations tell us that 200 Gbytes/10 minutes equates to roughly 2.7 Gigabits/second in network traffic. And that’s just for email traffic. It should be clear why there is such large scale infrastructure investment from companies like Google, Yahoo! and Microsoft – and that’s not even considering the requirements for search and other applications (crawling, processing queries, serving video data, replicating copies of the internet across data centres etc.).
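The back-of-the-envelope bandwidth figure can be reproduced like this, again assuming decimal units (1 Gbyte = 10**9 bytes, 1 Gigabit = 10**9 bits) and ignoring protocol overhead:

```python
bytes_per_window = 200 * 10**9   # 200 Gbytes of inbound mail
window_seconds = 10 * 60         # received over 10 minutes

bits_per_second = bytes_per_window * 8 / window_seconds
gigabits_per_second = bits_per_second / 10**9

print(f"{gigabits_per_second:.2f} Gigabits/second")
```

This is a sustained average, so peak traffic would presumably be higher.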



The double-edged sword of regression testing
Wednesday February 21st 2007, 9:26 am
Filed under: research,software,technology
Posted by: Andrew Lampert

Sat through an interesting seminar from Kevin Schofield, General Manager of Research at Microsoft Research yesterday, while he visited our Marsfield Lab. In what was a relatively short presentation, Kevin covered only a tiny part of the work at MSR in any detail. Despite time constraints, however, a couple of the points he made really made me stop and think, including his clearly heartfelt comments on the disappearance of Jim Gray.

On the technical side, one of the astounding take-home points for me was the magnitude of complexity in Microsoft’s various code bases. Code complexity is something I’ve been thinking about a bit lately, particularly in terms of concurrency. Kevin’s point was in rather a different dimension of complexity – that of software testing. In quantifying this complexity, he noted that running the full suite of regression tests over the Windows code base takes 8 weeks! On a large farm of servers!

Just think about what the impact of 8 weeks would be on your release schedule. Kinda makes it hard to have a reliable yet agile release cycle, no? If Microsoft wants to run their full suite of regression tests to ensure old bugs have not been reintroduced by new code changes, then there is a huge impact on the agility with which Microsoft can release new versions, respond to bugs and release critical security patches. While the magnitude of their problem may be somewhat larger than most due to the age, size and complexity of their code base, I’m quite sure Microsoft is not alone in having to face such a problem.

Understandably, MSR has been working to address this issue, in part by deriving mappings of the code exercised by each and every test in the regression suite. These mappings are stored and used to prioritise the regression tests, such that the tests that cover the modified code are exercised first.
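I don’t know the details of how MSR implements this, but the basic idea can be sketched as follows. Everything here is hypothetical: the test names, the file names, and the choice to rank tests by how many changed files they touch are all assumptions for illustration.

```python
def prioritise_tests(coverage, changed_files):
    """Order tests so that those covering changed code run first.

    coverage:      dict mapping test name -> set of source files it exercises
    changed_files: iterable of files touched by the code change
    """
    changed = set(changed_files)

    def overlap(test):
        # Number of changed files this test exercises.
        return len(coverage[test] & changed)

    # sorted() is stable (even with reverse=True), so tests with the same
    # overlap keep their original relative order.
    return sorted(coverage, key=overlap, reverse=True)

# Hypothetical coverage map, as recorded from a previous full run.
coverage = {
    "test_login": {"auth.c", "session.c"},
    "test_render": {"gfx.c"},
    "test_logout": {"session.c"},
}

order = prioritise_tests(coverage, changed_files=["session.c"])
print(order)  # ['test_login', 'test_logout', 'test_render']
```

The full suite still runs eventually. The point is only that the tests most likely to catch a regression from the change report back first.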

Sounds like a very logical approach, and I’m surprised that I haven’t come across such techniques before. Perhaps I just haven’t looked in the right places. How do you manage your regression test suites? Do you partition or prioritise them in any novel ways?



I knew it was true!
Tuesday January 16th 2007, 7:02 pm
Filed under: research,science,technology,uni
Posted by: Andrew Lampert

It seems that after reading my previous post, Jorge Cham has finally decided to admit the truth!

The secret world of hidden PhD cameras