Taking Enron Email to the Business World
The blogosphere seems to have recently rediscovered the Enron Email Corpus, thanks to the publicity surrounding Trampoline Systems’ newly released web application for exploring the Enron emails.
Exploring Enron offers a number of different views of the Enron data via:
- Direct access to mailboxes of individual employees;
- A search interface across the entire data set; and
- A Java applet for visually exploring the relationships and messages between users.
Also offered are trendy Web 2.0 compliant ‘tag clouds’ for showing related people and topics when browsing the Enron messages. There is nothing particularly novel in any of this functionality, but Exploring Enron does offer a better-than-prototype quality application that has the potential to bring the Enron email data to the attention of a whole new non-research-oriented audience. In this sense, it continues in the same vein as (while offering greater functionality than) other polished sites like Inboxer’s Enron Email site.
It will be interesting to see whether anything comes from this renewed attention from people who haven’t yet played with a large-scale email corpus.
Research Seminar Podcast
So I’ve taken the plunge and created my first podcast which is also available through iTunes. Don’t be afraid though – you won’t hear much from me except the occasional speaker introduction – it’s a podcast of recorded seminars from the research seminar series that I’ve been jointly running with Cecile Paris at the CSIRO ICT Centre for the past 5 years. The seminar series itself pre-dates my time at CSIRO however – 2006 is its 10th consecutive year!
Anyway, if you’re at all interested in human factors, artificial intelligence or language technology, take a moment to tune in – we have some excellent talks coming up in the near future. As you can see from our collection of past seminars, topics range widely including research and applications in usability, human-computer interaction, user modelling/personalisation, novel interfaces, natural language processing, linguistics, information retrieval, speech processing, system evaluation, computer supported cooperative work, cognitive science and more.
Using Context to Deliver Useful Information to People
As Mitch Kapor, founder of Lotus Development Corporation, once said, “Getting information off the Internet is like taking a drink from a fire hydrant.”
On September 19th, I will be presenting a seminar to the NSW branch of CHISIG – the Computer-Human Interaction Special Interest Group of Australia – about our research at CSIRO that focuses on controlling the flow of information to deliver the right content to the right people at the right time in the right form.
Our research approaches the problem by using knowledge about users and their interaction to tailor the information that is gathered and to present it appropriately. The context information that is captured and reasoned about can include user preferences and characteristics, as well as details of a user’s current task, their previous history of interaction and their environment. This context can determine which information should be retrieved, and how that content should be aggregated, organised, and presented, in order to best support the user.
My presentation will cover work that builds on concepts and techniques from a variety of different fields, including: natural language generation, information extraction, information retrieval, discourse analysis, user modelling, task analysis and HCI, so if any of those topics spark interest (and you happen to be in Sydney) you might consider coming along to PTG Global on Tuesday 19th.
Invention versus Innovation
I work in a research organisation where the word ‘innovation’ is thrown around with gay abandon, especially by management folks. We’re under constant pressure to be inventing and innovating, though the two words are often used interchangeably. But how should we judge when innovation has occurred? What about invention? What’s the difference?
In the midst of their recent article on “Innovation as Language Action” in CACM, Peter Denning and Robert Dunham propose an answer that I find both simple and compelling: innovation occurs where we observe that a group or community has adopted a new practice. Invention is something different – it means to create something new, but it does not require that anyone accept or adopt it.
I should also point out that Denning and Dunham’s article is interesting for many other reasons than the distinction it draws between innovation and invention. In particular, their work takes inspiration from the earlier work of Terry Winograd and Fernando Flores (whose ideas are very influential in my own research), in looking at the specific skills and steps involved in taking new technology inventions into the broader market from a language-action perspective.
Innovation as the adoption of new practices seems nicely consistent with the ideas of Peter Drucker, who himself linked innovation to the adoption of new practices back in the 1950s. Of course, another definitional question that then arises is: what exactly is meant by ‘practice’? Denning and Dunham suggest that ‘practice’ refers to habits, routines, and other forms of embodied recurrent actions taken without conscious thought, and to me this seems largely to capture the concept.
But is Denning and Dunham’s definition of innovation widely accepted?
SIGDial – Day 2
So I had an excellent time at day 2 of SIGDial, although I unfortunately missed the last couple of sessions due to a clash with the Discourse Annotation tutorial I attended. On that note, it seems strange (but presumably unavoidable) that such closely aligned sessions should clash.
Heard some interesting talks that I won’t have time to really do justice to in summarizing them here (I’m sitting in the corridor at the main Coling-ACL conference, taking advantage of a break in sessions). Most interesting work for me included:
- Work using GraphBank – which looks at discourse structure as graph-based rather than tree-based. This allows non-local (i.e., long distance) discourse links to be modelled, which is sometimes an advantage in real discourse. GraphBank, while an interesting idea, is not without its problems however – one specific issue is that it seems to conflate some relations, in particular actual causation with intention or purpose, which can lead to some strange annotations.
- Work from Tilburg University on (yet another) dialogue act taxonomy, called DIT++. How is it different from something like DAMSL? This isn’t entirely clear, but one point of differentiation seems to be a more elaborate and fine-grained set of feedback functions and dialogue control aspects. In general, given that it is a multi-dimensional annotation scheme, there are the usual problems with inter-annotator agreement. To attempt to improve their evaluation scores, the particular work presented looked at developing evaluation metrics that better model the performance of such hierarchical schemes, where coarse-grained agreement is usually ignored if fine-grained disagreement occurs (e.g., when using kappa as the measure of agreement). Unfortunately, the actual weighted metrics proposed seemed rather preliminary and arbitrary, though there is clearly a need for such work.
- A high-level presentation from David Traum on work with Question Answering characters. The main message I took out of his talk was a desire to define the ‘science’ of content creation across different modalities and methods (text, speech and graphics are their focus).
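To make the kappa point from the DIT++ item concrete, here is a minimal sketch of plain Cohen's kappa computed at two levels of a label hierarchy. The label names and the "coarse:fine" naming convention are my own invention for illustration; they are not DIT++ tags, and this is not the weighted metric proposed in the talk.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

def coarsen(label):
    """Keep only the top level of a hypothetical 'coarse:fine' label."""
    return label.split(":")[0]

# Hypothetical annotations: the annotators always agree on the coarse class
# but disagree on the fine-grained label for two of the four utterances.
ann_a = ["feedback:positive", "feedback:negative", "question:yes-no", "inform:answer"]
ann_b = ["feedback:positive", "feedback:positive", "question:wh", "inform:answer"]

fine_kappa = cohen_kappa(ann_a, ann_b)
coarse_kappa = cohen_kappa([coarsen(l) for l in ann_a], [coarsen(l) for l in ann_b])
print(f"fine-grained kappa: {fine_kappa:.3f}, coarse-grained kappa: {coarse_kappa:.3f}")
```

Exact-match kappa on the fine labels treats "feedback:negative" vs "feedback:positive" as being just as wrong as two completely unrelated labels, which is the behaviour a hierarchy-aware metric would try to soften.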
Otherwise, Coling/ACL has been a rather intense experience so far – just getting towards finishing day 4 of consecutive conference days, with another 5 still to go (although tomorrow is actually an excursion day). ACL is awesome fun though – a hugely impressive and inspiring group of people from all over the planet.
Discourse and Dialogue Research Workshop
The 7th SIGDial workshop is being held in Sydney this year, as part of Coling-ACL (for which I have been looking after the website, as part of the local organising committee). SIGDial covers both discourse and dialogue research (though many people consider it a dialogue forum). Here are some highlights from day 1 of my first SIGDial workshop:
- Using user models to tailor help provided within a spoken language dialogue system in BMW cars. The work looked at how to advise users about the available options in the context of a system offering more than 350 features where 90% of users use less than 10% of the features. The novel aspect was its attempt to model users forgetting about available features as well as their learning behaviour in using and internalising options. I still question how usable (or necessary) it is to really have 350 features, let alone 700-1000, which was the prediction for the next generation of the BMW iDrive. What exactly are people controlling while they’re driving with these devices, other than the navigation system, their music and possibly a phone?
- A paper on a multi-domain Spoken Language Dialogue System, looking at an architecture with a central module directing questions to domain specific agents with domain specific language models etc. This work focused on how to correctly identify the domain for any incoming question, including using information about the dialogue history to bias certain domain choices (e.g., bias towards staying on the existing topic depending on the probabilities returned from the speech recognition component). The presenter assumed independence between domain expert agents, and discounted the case where more than one agent is appropriate, or where information from multiple agents is required, so didn’t explore any of the interesting integration or aggregation aspects – it was really more about identifying the domain of a question/utterance, with some fairly simplistic use of discourse features (namely, the discourse history).
- Maria Georgescul gave a very interesting presentation on different algorithms for topic segmentation and different methods of evaluating their performance. She looked at the TextTiling, C99 and TextSeg algorithms, using a variety of evaluation metrics. More interestingly, she also evaluated the effect of using artificial/synthesized topic data created by concatenating fragments of documents from different domains, versus using (expensive to create) hand-annotated thematic data. Her study suggests that using synthetic data to evaluate performance of topic segmentation algorithms can give misleading results under certain conditions, which is something that has long been suspected in the field, but apparently not previously confirmed.
- The keynote from Jonathan Ginzburg looked at whether the content of an utterance in dialogue should be computed using only the grammatical information, or whether it should take account of the participant’s intention using domain-level inference. More specifically, he talked about whether grounding pertains to surface utterance content or to interlocutor intention. I was a bit disappointed overall, as I found his talk a bit opaque and hard to follow.
- Simon Keizer gave an interesting talk on multidimensional dialogue management, where he proposed a new, multi-level dialogue act annotation scheme. Disappointing for me was that he didn’t attempt to contrast or compare this with any of the existing dialogue act schemes, including DAMSL, which seemed remarkably similar. He had built a rule-based DA recognizer using mainly part-of-speech information and word-patterns (cue phrases) as features.
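For anyone unfamiliar with the synthetic-data setup mentioned in the topic segmentation item, here is a rough sketch of how such a test document can be built by concatenation and then scored. The talk used a variety of metrics that I haven't listed; Pk is simply one commonly used segmentation metric that I am assuming here for illustration, and the fragments and function names are made up.

```python
def concatenate_fragments(fragments):
    """Join fragments (lists of sentences) into one synthetic document.

    Returns the sentences plus the set of gold boundaries, where a boundary
    at index i means a new topic starts at sentence i.
    """
    sentences, boundaries = [], set()
    for fragment in fragments:
        if sentences:
            boundaries.add(len(sentences))
        sentences.extend(fragment)
    return sentences, boundaries

def segment_ids(n, boundaries):
    """Map each of n sentence positions to the index of its segment."""
    ids, current = [], 0
    for i in range(n):
        if i in boundaries:
            current += 1
        ids.append(current)
    return ids

def pk(reference, hypothesis, n, k=None):
    """Pk error: share of probes (i, i+k) where the two segmentations disagree
    on whether both ends fall in the same segment. Lower is better."""
    ref, hyp = segment_ids(n, reference), segment_ids(n, hypothesis)
    if k is None:
        # Conventional choice: half the mean reference segment length.
        k = max(1, round(n / (len(reference) + 1) / 2))
    probes = n - k
    if probes <= 0:
        raise ValueError("document too short for window size k")
    errors = sum(
        (ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k]) for i in range(probes)
    )
    return errors / probes

# Three hypothetical four-sentence fragments from different domains.
fragments = [[f"{topic} sentence {i}" for i in range(4)]
             for topic in ("finance", "sport", "medicine")]
sentences, gold = concatenate_fragments(fragments)

perfect_score = pk(gold, gold, len(sentences))
missed_score = pk(gold, {4}, len(sentences))  # hypothesis misses the boundary at 8
print(gold, perfect_score, missed_score)
```

Boundaries created this way coincide with abrupt changes of domain and vocabulary, which is presumably part of why scores on concatenated data may not carry over to hand-annotated thematic data.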
Some interesting posters included:
- Daniel Midgley’s work (I met Daniel at the HCSNet Summerfest last year) looking at adjacency pairs of dialogue acts in the VerbMobil corpus combined with the use of some novel discourse chunking. He also applied Chi-squared normalisation to filter out noise in the adjacency pairs, and ended up with data that empirically supports the original adjacency pairs proposed by Sacks and Schegloff back in 1973.
- Simone Teufel presented work that defines an annotation scheme for classifying sentiment in scientific citations. This allows sentiment analysis classification to be added to citation graphs, and might allow for interesting applications – see which are the controversial papers in your field. Which papers are used by many other papers as a theoretical basis? Which papers receive only positive citations?
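As an aside on the adjacency pair poster, the general idea of using a chi-squared test to separate real adjacency pairs from noise can be sketched as below. This is my own minimal reconstruction using a standard 2x2 Pearson chi-squared statistic over consecutive dialogue acts; I don't know the exact normalisation Daniel used, and the dialogue act sequence is invented.

```python
from collections import Counter

def chi_squared_pairs(acts):
    """Score each adjacent (first, second) dialogue-act pair with Pearson's
    chi-squared statistic over a 2x2 contingency table."""
    pairs = list(zip(acts, acts[1:]))
    total = len(pairs)
    pair_counts = Counter(pairs)
    first_counts = Counter(a for a, _ in pairs)
    second_counts = Counter(b for _, b in pairs)
    scores = {}
    for (a, b), n11 in pair_counts.items():
        n12 = first_counts[a] - n11     # a followed by something other than b
        n21 = second_counts[b] - n11    # something other than a followed by b
        n22 = total - n11 - n12 - n21   # neither a first nor b second
        denom = (n11 + n12) * (n21 + n22) * (n11 + n21) * (n12 + n22)
        scores[(a, b)] = (
            total * (n11 * n22 - n12 * n21) ** 2 / denom if denom else 0.0
        )
    return scores

# Hypothetical sequence of dialogue acts from one conversation.
acts = ["question", "answer", "question", "answer", "inform",
        "question", "answer", "inform", "inform"]
scores = chi_squared_pairs(acts)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
```

Pairs that co-occur far more often than their individual frequencies predict (question followed by answer) get high scores, while incidental sequences score close to zero and can be filtered with a threshold. Note that the statistic alone does not distinguish attraction from repulsion, so a real analysis would also check the direction of the association.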
Looking forward to day 2 tomorrow!
CeBIT Australia 2006
After returning from leave, I was immediately immersed in last minute preparations for CeBIT Australia 2006. After spending much of Monday afternoon assisting with the construction and setup of the CSIRO stand, I then spent 2 days this week at CeBIT show-casing ICT research from across a range of CSIRO divisions. Our main demonstration was again SciFly, our tailored brochure generation system – with much improved robustness and performance from last year. I had several interesting discussions with interested people about applying SciFly and the underlying technology to a range of problems across a variety of industries. For me, this was the most satisfying success metric of my time at CeBIT.
As well as demonstrating our technology at the CSIRO booth, I also gave a short seminar on Contextualised Information Retrieval and Delivery as part of the Future Parc seminar series. The environment was a challenging one for speakers, with much background noise, unreadably small plasma screens for displaying slides, and no less than 6 parallel sessions of seminars at various points around CeBIT to compete with. Despite this, I think I managed to engage at least some of the people in the audience, based on the couple of thoughtful follow-up discussions that I had after the seminar.
Good Times for CSIRO ICT Research
The ICT Centre has been a big winner in CSIRO’s recent revision of research priorities, with a substantial increase in research dollars being directed our way. In dollar terms, the CSIRO ICT Centre will see its budget increase by over 14% to $48M in 2006/07, which includes a substantial increase in its involvement in CSIRO National Research Flagships ($5.3M to $12.7M) to address research issues of national significance.
Of course, this new funding is in addition to the recently announced Intelligent Island funding that will be invested into establishing an ICT Centre presence in Tasmania.
I actually spent much of today interviewing candidates for our most recent software engineering position; if our growth continues at its recent pace (and the increased funding almost ensures this) I should probably get used to spending a lot more time interviewing people.
Activities and Tasks in Emails
So I was busy at the International Conference on Intelligent User Interfaces earlier this week, and it was a hugely motivating and thought-provoking experience. A great bunch of really switched on people doing all kinds of interesting things.
One presentation that particularly caught my interest was from Tessa Lau at IBM Almaden Research lab – not surprising really, given that I’ve read about some of Tessa’s previous work in email management. The work she presented at IUI was on IBM’s Unified Activity Management project (UAM). In that context, one of her points that really rang true for me was about the need to move away from being focused on tools to focus more on the activities people perform when dealing with information management. This should, of course, lead to the development of software applications that do a better job of supporting users, who are (and should be!) more concerned about their tasks and activities than about which tool they used to do what, and how they can integrate work that happens to have been performed using different software tools.
As a simple example, rather than grouping email messages for a given activity in an email client and the Excel documents in a separate folder on the filesystem, can we instead cluster all relevant information together based on the activity which ties the various artifacts together, rather than based on the tool that happened to have been used to create them? Accordingly, a major part of the UAM project is focussed on integrating email content into an overall activity management system that is under development. To do so requires an ability to associate email content with new or existing activities. Obviously, this requires a light-weight and simple way of creating new activities from email, and of displaying email in the context of existing activities.
When trying to associate incoming email messages with new and existing activities, the IBM team seems to have been inspired by the information retrieval community in using recommendation rather than all-or-nothing mapping of incoming messages to activities. This is a clever way of reducing the likelihood of frustrating users with incorrect categorisations, and is indeed the approach we took in earlier email categorisation work I was involved with a few years ago at CSIRO.
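To illustrate the difference between recommendation and all-or-nothing mapping, here is a minimal sketch that ranks candidate activities for a message and returns the top few suggestions rather than committing to a single label. It uses a simple bag-of-words cosine similarity of my own choosing; it is not IBM's method or our earlier CSIRO system, and the activity names and descriptions are hypothetical.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Lower-cased, whitespace-tokenised term counts."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two term-count vectors."""
    dot = sum(u[term] * v[term] for term in u if term in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0

def recommend_activities(message, activities, top_k=3):
    """Return up to top_k (activity, score) suggestions, best first.

    The user picks from (or ignores) the ranked list, so a wrong top
    suggestion costs far less than a wrong automatic filing decision.
    """
    query = bag_of_words(message)
    scored = [(name, cosine(query, bag_of_words(description)))
              for name, description in activities.items()]
    scored = [(name, score) for name, score in scored if score > 0]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]

# Hypothetical activities, each described by a few representative words.
activities = {
    "budget review": "quarterly budget spreadsheet review",
    "conference travel": "flights hotel conference travel booking",
    "hiring": "interview candidates software engineer position",
}
suggestions = recommend_activities(
    "please send the budget spreadsheet before the conference", activities
)
print(suggestions)
```

Because the output is a ranked list, the same message can surface under more than one activity, leaving the final association to the user.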
Tessa also referred to email signatures as ‘noise’ that, by implication, needs to be removed to recover the communication signal conveyed by email – a very simple and logical description of the nature of email signatures (and often quoted material) in the context of automatic processing of emails.
Some weaknesses of the work presented included an implicit assumption that a single email message should be associated with at most one activity. Clearly this suffers from a multiple-inheritance style problem – in practice a single email message can often contain content that is relevant to many different activities. In the present system it is not possible to apply multiple activity labels to a single message. This, of course, sounds a lot like the folder vs. labelling problem that has been all the rage since GMail appeared on the radar.
Another interesting question is whether classifying email messages into activities is different from the classification of emails into folders (which is a well studied text categorisation problem). There certainly seem to be many similarities between both problems. Perhaps there is a difference of focus (folder classification generally being for archiving, and activity classification more for current work), but this is purely speculation.
Of particular interest for me was that Tessa identified speech act detection in email as a future direction for their research. This is motivating, given that smart people see some similar value in the kinds of ideas I’m playing with, but it is also rather intimidating to think who my competitors out there in the research world include!! I think I’d better get a move on with my own research!
R&D Software Engineer Wanted
Ok, so if you’re a software engineer looking for new challenges in 2006, here’s a great opportunity for you. My research team within the CSIRO ICT Centre (the Information Delivery team) is seeking to recruit a highly competent, motivated, and energetic software engineer to our Sydney laboratory.
You will contribute to software engineering, R&D and commercialisation activities within our small but highly productive team carrying out leading-edge research in the area of information engineering and the development of advanced search and delivery technology. This role will have a particular focus on mobile phone and PDA technology.
A degree in Software Engineering or a related discipline is essential; an honours degree or higher qualification would be an advantage, but not essential.
We need you to demonstrate excellent programming expertise in at least Java (preferably other languages too), familiarity with Web services, and preferably have exposure to mobile phone or PDA software development platforms. The development projects underway need you to work on both research prototypes and on commercial products. Your willingness to provide technical support, an ability to write high quality documentation, and a capacity to talk to customers are important.
Finally, you should enjoy working in teams, be honest, trustworthy, and ethical, with an ability to contribute creative ideas to our projects.
| Reference Number: | 2006/63 |
| Position Title: | Software Engineer – Information Delivery |
| Division: | CSIRO ICT Centre |
| Location: | North Ryde, NSW |
| Classification: | CSOF4 to CSOF5 |
| Salary Range: | $58k – $72k + superannuation |
| Tenure: | 12 month term |
| Applicants: | International Applicants Welcome |
| Relocation Assistance: | May be offered to the successful applicant. |
| Applications Close: | 27 Jan 2006 |
| Job Category: | Computer Software/Scientific Research |
For further details and selection criteria, or to apply for this position, please visit: http://recruitment.csiro.au/asp/job_details.asp?RefNo=2006/63
If you have any questions about this position, please post a comment here, or feel free to email me (Andrew.Lampert@csiro.au).