Google’s Massive Email Corpus – Now consuming a domain near you
Saturday February 11th 2006, 5:33 pm
Filed under: email, research, technology
Posted by: Andrew Lampert

It seems Google isn’t content with the massive GMail email corpus that they’re building, which is presumably dominated by personal email (and spam). In the latest step of their data collection strategy, they are offering YABS (yet another beta service) to host and manage email for company domains, if you satisfy their selection criteria.

As well as asking for details of your organisation’s size, location and nature (company/personal/ISP/portal/education/other), there are questions about what the organisation uses email for, how many accounts you need, and why you want Google to host it for you. There is also a curious checkbox that asks whether “All email users are in the United States”. I’m not immediately sure what the implications of this question might be; all that springs to mind is that it could easily have something to do with laws covering domestic surveillance of US citizens. That certainly doesn’t instill a great deal of comfort. I also can’t imagine that I’d like my ISP to be outsourcing the hosting and processing of my personal or business email to Google, allowing that email to be used in building detailed user profiles.

Ignoring the privacy implications, however, what a fantastic and completely unparalleled data resource this will create for Google. Instead of seeing each GMail user as an independent and isolated identity that they can only weakly link to other people through networks of communication and browsing, they will now also know exactly which users belong to which organisation. This will allow them to analyse entire sets of email communication for a presumably large number of companies, universities and other organisations that sign up, and will surely give Google a more comprehensive understanding of organisational communication than anyone else in history. You think the Enron email corpus is a big deal? Think of the implications for those who can get their hands on the Google Email Corpus!



Good Times for CSIRO ICT Research
Friday February 03rd 2006, 5:22 pm
Filed under: csiro, research, science, technology
Posted by: Andrew Lampert

The ICT Centre has been a big winner in CSIRO’s recent revision of research priorities, with a substantial increase in research dollars being directed our way. In dollar terms, the CSIRO ICT Centre will see its budget increase by over 14% to $48M in 2006/07, which includes a substantial increase in its involvement in CSIRO National Research Flagships ($5.3M to $12.7M) to address research issues of national significance.

Of course, this new funding is in addition to the recently announced Intelligent Island funding that will be invested into establishing an ICT Centre presence in Tasmania.

I actually spent much of today interviewing candidates for our most recent software engineering position; if our growth continues at its recent pace (and the increased funding almost ensures this) I should probably get used to spending a lot more time interviewing people.



Activities and Tasks in Emails
Friday February 03rd 2006, 4:45 pm
Filed under: email, information delivery, language technology, research, science, technology
Posted by: Andrew Lampert

So I was busy at the International Conference on Intelligent User Interfaces (IUI) earlier this week, and it was a hugely motivating and thought-provoking experience. A great bunch of really switched-on people doing all kinds of interesting things.

One presentation that particularly caught my interest was from Tessa Lau at IBM Almaden Research lab – not surprising really, given that I’ve read about some of Tessa’s previous work in email management. The work she presented at IUI was on IBM’s Unified Activity Management project (UAM). In that context, one of her points that really rang true for me was about the need to move away from being focused on tools to focus more on the activities people perform when dealing with information management. This should, of course, lead to the development of software applications that do a better job of supporting users, who are (and should be!) more concerned about their tasks and activities than about which tool they used to do what, and how they can integrate work that happens to have been performed using different software tools.

As a simple example, rather than keeping the email messages for a given activity in an email client and the Excel documents in a separate folder on the filesystem, can we cluster all relevant information together based on the activity that ties the various artifacts together, not the tool that happened to have been used to create them? Accordingly, a major part of the UAM project is focussed on integrating email content into an overall activity management system that is under development. Doing so requires the ability to associate email content with new or existing activities, which in turn requires a simple, lightweight way of creating new activities from email and of displaying email in the context of existing activities.

When trying to associate incoming email messages with new and existing activities, the IBM team seems to have been inspired by the information retrieval community in using recommendation rather than all-or-nothing mapping of incoming messages to activities. This is a clever way of reducing the likelihood of frustrating users with incorrect categorisations, and is indeed the approach we took in earlier email categorisation work I was involved with a few years ago at CSIRO.
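To make the recommendation idea concrete, here is a minimal sketch of my own; it is not the UAM implementation, and the names `recommend_activities`, `top_k` and the toy activity descriptions are all hypothetical. It assumes each activity is described by a short bag of words and simply ranks activities by cosine similarity to the message, leaving the final choice to the user.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend_activities(message, activities, top_k=3):
    """Return up to top_k (activity, score) suggestions, best first.

    Nothing is filed automatically: the ranked list is only a
    recommendation, and activities with no overlap are left out.
    """
    msg_vec = Counter(tokenize(message))
    scored = [
        (cosine(msg_vec, Counter(tokenize(description))), name)
        for name, description in activities.items()
    ]
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(name, score) for score, name in scored[:top_k] if score > 0]


# Hypothetical activities, each described by a few keywords.
activities = {
    "budget review": "budget spreadsheet quarterly figures finance",
    "hiring": "interview candidates software engineering position",
    "conference trip": "conference travel flights hotel registration",
}
message = "Can you send the interview schedule for the engineering candidates?"
suggestions = recommend_activities(message, activities)
```

Because the result is a ranked list rather than a single hard assignment, a poor top suggestion costs the user a glance instead of a misfiled message.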

Tessa also referred to email signatures as ‘noise’ that, by implication, needs to be removed to recover the communication signal conveyed by email – a very simple and logical description of the nature of email signatures (and often quoted material) in the context of automatic processing of emails.
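As a rough illustration of what removing that noise could look like, here is a small sketch. It assumes two common plain-text conventions: a signature introduced by a line containing only "-- ", and quoted material prefixed with ">". Plenty of real email follows neither convention, so treat this as a starting point only; `strip_noise` is a name I made up.

```python
import re

# Assumed conventions: "-- " on its own line starts a signature,
# and lines beginning with ">" are quoted from an earlier message.
SIG_DELIMITER = re.compile(r"^-- ?$")
QUOTED_LINE = re.compile(r"^\s*>")


def strip_noise(body):
    """Drop quoted lines and everything from the signature delimiter on."""
    kept = []
    for line in body.splitlines():
        if SIG_DELIMITER.match(line):
            break  # the rest of the message is the signature
        if QUOTED_LINE.match(line):
            continue  # skip quoted material
        kept.append(line)
    return "\n".join(kept).strip()


body = "Hi Tessa,\n> earlier question\nYes, Tuesday works.\n-- \nAndrew\nCSIRO"
cleaned = strip_noise(body)
```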

One weakness of the work presented is an implicit assumption that a single email message should be associated with at most one activity. Clearly this suffers from a multiple-inheritance style problem: in practice a single email message can often contain content that is relevant to many different activities. In the present system it is not possible to apply multiple activity labels to a single message. This, of course, sounds a lot like the folder vs. labelling problem that has been all the rage since GMail appeared on the radar.

Another interesting question is whether classifying email messages into activities is different from classifying emails into folders (which is a well-studied text categorisation problem). There certainly seem to be many similarities between the two problems. Perhaps there is a difference of focus (folder classification generally being for archiving, and activity classification more for current work), but this is pure speculation.

Of particular interest for me was that Tessa identified speech act detection in email as a future direction for their research. This is both motivating, given that smart people see similar value in the kinds of ideas I’m playing with, and rather intimidating when I think about who my competitors out there in the research world include!! I think I’d better get a move on with my own research!