Email Zoning: Finding Signal amongst the Textual Noise of Email Messages
In the early days of email, widely-used conventions for indicating quoted reply content and email signatures made it easy to segment email messages into their functional parts. Today, the explosion of different email formats and styles, coupled with the ad hoc ways in which people vary the structure and layout of their messages, means that simple techniques for identifying quoted replies that used to yield 95% accuracy now find less than 10% of such content.
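To make the "simple techniques" concrete, here is a minimal sketch of a convention-based zoner. It assumes two classic conventions, which are my choice for illustration: quoted reply lines start with ">" and the signature starts after a line containing only "-- ". The class and method names are hypothetical, and this is not how Zebra works.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: a naive, convention-based line zoner.
final class NaiveEmailZoner {
    enum Zone { AUTHOR, QUOTED_REPLY, SIGNATURE }

    // Labels each line of the message body with a zone.
    static List<Zone> zoneLines(String body) {
        List<Zone> zones = new ArrayList<>();
        boolean inSignature = false;
        for (String line : body.split("\r?\n", -1)) {
            // Assumed convention: "-- " (or "--") alone on a line starts the signature.
            if (!inSignature && (line.equals("-- ") || line.equals("--"))) {
                inSignature = true;
            }
            if (inSignature) {
                zones.add(Zone.SIGNATURE);
            } else if (line.startsWith(">")) {
                // Assumed convention: quoted reply lines are prefixed with ">".
                zones.add(Zone.QUOTED_REPLY);
            } else {
                zones.add(Zone.AUTHOR);
            }
        }
        return zones;
    }
}
```

A reply that quotes the original under an "-----Original Message-----" banner, with no ">" prefixes, would be labelled entirely as author text by this sketch, which is exactly the kind of failure described above.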
Many language processing and search tools stand to benefit from better knowledge of the different functional parts of email messages, since this would allow them to focus on relevant content in specific parts of a message. In particular, access to zone information would allow email classification, summarisation and analysis tools to separate or filter out ‘noise’ and focus on the content in specific zones of a message that are relevant to the application at hand. Email contact mining tools, for example, might only access content from the email signature, while tools that attempt to identify tasks or action items in email might restrict themselves to the sender-authored and forwarded content.
Last week, I presented my paper on Segmenting Email Message Text into Zones at the Empirical Methods in Natural Language Processing (EMNLP) conference in Singapore. The focus of this work is Zebra, an SVM-based system that automatically segments and classifies the body text of email messages into nine functional zone types based on graphic, orthographic and lexical cues.
Our set of nine zones includes the following: author, greeting, signoff, quoted reply, forward, signature, advertising, disclaimer and attachment. Zebra currently performs the segmentation and classification of email text into the nine zones with an accuracy of about 87%. When the number of zones is abstracted to two or three zone classes (which is much more likely to be the granularity required for real-world email processing tasks), Zebra’s accuracy increases above 91.5%.
I’m currently working to finish off the Zebra system, as well as to resolve some licensing issues so that the code can be released for other researchers to use. We have, however, already released our annotated email dataset consisting of almost 12,000 lines of annotated email text that we used to train the Zebra system. If you want to know more, you can read our paper, head over to the Zebra website, or just get in touch with me by email or other means.
Java Speech API 2.0 Specification Finally Released
About 5 years ago, during my Masters studies, I wrote some simple speech applications using Java Speech API (JSAPI) 1.0 compliant speech engines. At the time, the JSR for JSAPI 2.0 was well underway. Well, it’s taken more than 8 years since the formation of the JSR, but the final release of the Java Speech API (JSAPI) 2.0 specification was *finally* made available on 7th May 2009.
Of note, JSAPI 2.0 is now primarily aimed at the Java ME platform (specifically CLDC 1.0 and MIDP 1.0), meaning that it’s hoped the new spec will facilitate speech-enabled Java applications on mobile devices. For this reason, gone are all floating point references and dependencies on AWT (yay!). Recognition Engines may provide full support for application-defined grammars or provide more limited support through specialized built-in grammars. Synthesis Engines may support full text-to-speech capabilities or simple text and audio sequencing. According to documentation in the spec, implementations can require 0.5-1.5 MBytes of ROM for models and algorithms and approximately 128 KBytes of RAM depending on vocabulary and grammar size. Of course, JSAPI 2.0 compliant engines can still run on Java SE platforms, and can obviously make good use of more substantial memory and processing resources.
Reinforcing comments made by expert group member Paul Lamere about the difficulties of satisfying all parties and developing a comprehensive speech API, Nokia made the following observation in approving the final specification:
“We think that the API is well designed and has very comprehensive functions. However, it is therefore highly complex and requires fairly advanced speech recognition and synthesis features. It also assumes a high level of speech recognition understanding from the application developer. It might not be feasible in many Java ME devices in the near term, but can provide good features in those high end platforms where applicable.”
Unrelated to Java ME compatibility, also gone are the Java Speech API Grammar Format (JSGF) and Java Speech API Markup Language (JSML), which were defined as companion specifications in JSAPI 1.0. Sensibly, given the standardisation that has thankfully occurred in the intervening years, these have been replaced by the W3C Speech Recognition Grammar Specification (SRGS) and the W3C Speech Synthesis Markup Language (SSML) respectively. After spending some time reviewing the plethora of speech synthesis markup languages, I’m very relieved to see this standardisation.
All in all, while it has taken a long time to come to fruition, I’m very pleased to see the JSAPI 2.0 standard finalised. Of course, given that JSAPI is only a specification (not an implementation) it remains to be seen how quickly the various speech recognition and speech synthesis systems move to support the new and modified APIs.
How do you share software development knowledge and experience?
For those of you who don’t know, I work in a research & development lab, developing software that tries to stem the tide of information overflow by reasoning about the context of each user’s interactions. In the past few months, we’ve had a bit of an influx of software engineers in our building – not only within my own team, but also in other teams working on everything from biomedical imaging to multi-agent systems for energy management.
Given a large and growing body of software engineers with varied skills and experience, I’m attempting to kick off (well, actually rekindle) a discussion forum for exchanging software engineering knowledge, skills and advice. We had our first informal discussion after work tonight and lobbed around a few ideas about how best to make use of such a forum.
One recurring suggestion was for people to give a presentation or lead a discussion about particular challenges faced in their current projects, or to talk about an interesting technology/methodology/idea that has caught their eye. Another semi-serious idea was for people to work together on building a game engine. Given that we could bring the skills of a diverse group of people to bear on a single, focussed activity, it’s actually an interesting suggestion. Otherwise, we talked about wikis and mailing lists and other mediums for sharing questions, answers and ideas.
Our first activity will be a presentation from our recent hire about his experience maintaining and enhancing Java’s Abstract Window Toolkit package within Sun for 6 years. We’re aiming to have a presentation from someone in the building every couple of weeks. The other possibility I’m considering is trying to invite external people to come and talk about their work. I’m wary of trying to run another seminar series, however, given that I already run the HAIL Series which takes quite a lot of my time.
I’m curious though, what ideas would you try to make the most of having a bunch of very smart software engineers working in your building?
R&D Software Engineer Wanted
Ok, so if you’re a software engineer looking for new challenges in 2006, here’s a great opportunity for you. My research team within the CSIRO ICT Centre (the Information Delivery team) is seeking to recruit a highly competent, motivated, and energetic software engineer to our Sydney laboratory.
You will contribute to software engineering, R&D and commercialisation activities within our small but highly productive team carrying out leading-edge research in the area of information engineering and the development of advanced search and delivery technology. This role will have a particular focus on mobile phone and PDA technology.
A degree in Software Engineering or a related discipline is essential; an honours degree or higher qualification would be an advantage, but not essential.
We need you to demonstrate excellent programming expertise in at least Java (preferably other languages too) and familiarity with Web services, preferably with exposure to mobile phone or PDA software development platforms. The development projects underway need you to work on both research prototypes and commercial products. Your willingness to provide technical support, an ability to write high quality documentation, and a capacity to talk to customers are important.
Finally, you should enjoy working in teams, be honest, trustworthy, and ethical, with an ability to contribute creative ideas to our projects.
- Software Engineer – Information Delivery
- CSIRO ICT Centre
- North Ryde, NSW
- CSOF4 to CSOF5
- $58k – $72k + superannuation
- 12 month term
- International Applicants Welcome
- May be offered to the successful applicant.
- 27 Jan 2006
- Computer Software/Scientific Research
For further details, selection criteria and to apply for this position, please visit: http://recruitment.csiro.au/asp/job_details.asp?RefNo=2006/63
If you have any questions about this position, please post a comment here, or feel free to email me (Andrew.Lampert@csiro.au).
Accessing the Global Address List in MS Exchange from Java
Recently, I needed to get access to information stored in the Global Address List (GAL) in Microsoft Exchange, the address book that is commonly accessible as a corporate directory of staff through Outlook. I had a dig around on the interweb and although there are plenty of examples out there on using LDAP or accessing the list through the Outlook application via a Java-COM bridge, I couldn’t find anything that exactly explained how to access the MS Exchange GAL via LDAP. So for my benefit (and the off chance that this might help someone else out there) here’s how I did it.
Firstly, to work out which LDAP server to query, you can look at the configuration of your Outlook client (or of course, whatever other mail client you might use that’s hooked up to your corporate LDAP directory). For Outlook users, here are the steps to determine your current LDAP server:
- Open the Outlook address book
- Choose the Options … item from the Tools menu in the Address Book window. This pops up an Addressing window.
- Highlight the Global Address Book entry in the ordering panel at the bottom of the Addressing window.
- Click the Properties button to see the properties for the Global Address List. This window shows the "Microsoft Exchange Address Book Provider", which specifies the address of the current LDAP server.
The Microsoft Exchange Server administrator creates and maintains this Global Address List (GAL). The GAL contains information for every email user, as well as details of global distribution lists and public folder e-mail addresses. Note that (as far as I’m aware) there is no standard naming for the properties in each GAL entry, so some knowledge of the specific GAL entry format for your organisation is required. In my case, I used the freely available Java LDAP Browser/Editor to browse some entries for people in the CSIRO GAL to understand how the relevant properties (firstname, surname and email address) were stored.
With this information in hand, we can now use the JNDI API to access the LDAP directory. My Java code for this is shown below. Note that "ident" is CSIRO parlance for userId or username. This code returns a list with the user’s firstname, surname and email address from the user’s entry in the GAL.
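A minimal sketch of this kind of JNDI lookup (not the original listing) follows. Everything specific in it is an assumption for illustration: the class and method names, the use of simple authentication, the `sAMAccountName` search attribute, and the `givenName`, `sn` and `mail` attribute names. As noted above, property names vary between organisations, so check your own GAL entries first.

```java
import java.util.ArrayList;
import java.util.Hashtable;
import java.util.List;
import javax.naming.Context;
import javax.naming.NamingEnumeration;
import javax.naming.NamingException;
import javax.naming.directory.Attribute;
import javax.naming.directory.Attributes;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;
import javax.naming.directory.SearchControls;
import javax.naming.directory.SearchResult;

// Hypothetical helper: looks up one person's details in an LDAP directory via JNDI.
final class GalLookup {
    // Assumed attribute names; your organisation's GAL may use different ones.
    static final String[] RETURNED = {"givenName", "sn", "mail"};

    // Builds the JNDI environment for a simple (username/password) LDAP bind.
    static Hashtable<String, String> environment(String ldapUrl, String principal, String password) {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, ldapUrl);
        env.put(Context.SECURITY_AUTHENTICATION, "simple");
        env.put(Context.SECURITY_PRINCIPAL, principal);
        env.put(Context.SECURITY_CREDENTIALS, password);
        return env;
    }

    // Returns [firstname, surname, email] for the first matching entry, or an empty list.
    static List<String> lookup(Hashtable<String, String> env, String baseDn, String ident)
            throws NamingException {
        SearchControls controls = new SearchControls();
        controls.setSearchScope(SearchControls.SUBTREE_SCOPE);
        controls.setReturningAttributes(RETURNED);

        DirContext ctx = new InitialDirContext(env);
        try {
            // The {0} placeholder lets JNDI escape the ident value in the filter.
            NamingEnumeration<SearchResult> results =
                ctx.search(baseDn, "(sAMAccountName={0})", new Object[] {ident}, controls);
            List<String> details = new ArrayList<>();
            if (results.hasMore()) {
                Attributes attrs = results.next().getAttributes();
                for (String name : RETURNED) {
                    Attribute attr = attrs.get(name);
                    details.add(attr == null ? null : String.valueOf(attr.get()));
                }
            }
            return details;
        } finally {
            ctx.close();
        }
    }
}
```

Usage would look something like `GalLookup.lookup(GalLookup.environment("ldap://your-server:389", user, password), "DC=example,DC=org", ident)`, where the server address is the one found in the Outlook properties window and the base DN is a placeholder.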
PC World | CSIRO launches flying datacentre
PC World has published an article (PC World | CSIRO launches flying datacentre) on our recently completed 3-year research project with Boeing (USA) around developing new technology for the RAAF Wedgetail airborne early warning and control (AWACS/AEW&C) aircraft.
With typical journalistic flair, the story has been blown up a little: I’m not sure I’d quite agree that the technology we developed “has also been commercialized for sale to appropriate customers”, nor have there been 20 scientists and engineers working on it (well, maybe close to that number contributed, but there certainly weren’t anywhere near that number of people working full-time for 3 years, as might be inferred from the article) but the important parts are there.
The focus of my team’s contribution has been in intelligent information delivery: how do we prevent air surveillance operators from being overloaded with information, while still ensuring that they aren’t deprived of and don’t overlook any important information? Initial investigations by Robert Tot of current air surveillance operators at the RAAF Williamtown airbase allowed us to observe operators in action to understand the information they use to perform their work tasks. Interviews and observations also highlighted several issues: 1) Operators have to manually integrate information from a number of different sources to perform their job. This can include having to physically move to a different computer terminal (e.g. to access civilian flight plan information). 2) Displaying all of the available information all of the time is infeasible because the display becomes too cluttered.
Our approach to alleviating these and related problems was to create an adaptive graphical user interface that tailors the information displayed at any point in time, and how that information is presented, according to the operator’s current task and role. Based on this context, the relevant information required by the operator is planned, gathered and delivered using Myriad, our Java-based platform for contextualised information delivery.
At the core of Myriad is our Virtual Document Planner (VDP), a goal-decomposition planning engine that, when configured using a set of plans, produces structured representations of content to be delivered that is specific to the current interaction context (which includes who the information is being delivered to, what task they are currently trying to perform, what environment they will view the information in, what information they have previously been presented with etc.).
The structured representation of content produced by the VDP explicitly models the role of each fragment of information to be delivered, through making explicit the rhetorical relations between each piece of information. We can then reason about the content, based on its structure, in deciding which information should be presented and how, based on whatever constraints might apply (e.g. temporal or screen-space constraints).
In order to infer the current (and future) tasks being performed by an operator, our Operator GUI provides a constant stream of user actions to a Task Parsing module which, based on a grammatical model of the operator’s possible tasks, makes statistical predictions of what task is currently being performed, what task is likely to be performed next, and what information the operator requires to perform these tasks. This information allows Myriad to plan the delivery of information proactively, meaning that operators shouldn’t need to request information; instead they should find that information is discreetly made available to them as they require it.
Of course, the proactive delivery of information risks overloading or distracting the operator, who may be deeply engaged in other current tasks. To avoid unwanted distraction or disorientation, we were very careful to provide new information by discreetly displaying a notification of information availability, rather than immediately providing the information itself on screen. In this way, we leave the human operator in control to choose if and when the information is required in order to complete their tasks.
The project has required me to develop our Myriad platform to support the delivery of textual, graphical and spatial information. In addition, I have been responsible for the development of the intelligent, adaptive Graphical User Interface. The GUI is based around the excellent OpenMap framework from BBN for the display of spatial information. In addition, a desktop-like workspace has been created where more verbose information could be displayed (either linked to objects visible on the map, or provided as non-spatial data). To allow the GUI to be completely controlled from Myriad, I created a flexible and extensible command-line API (using the BeanShell Java source interpreter), through which information can be added, displayed, hidden, modified, highlighted etc. on both the map and workspace displays with commands sent to specific GUI channel listeners over message-passing middleware.
IntelliJ IDEA 5 Released
Hooray! JetBrains have finally released IDEA 5 to the public. The new features I’m most looking forward to? Subversion integration, J2ME support (or should that be JME support now?) and of course, the usual smattering of new refactorings, like the ability to safely move non-static methods between classes. Of course, there’s also the well-publicised new JSP and CSS support. Yay!
Software Engineering Job Available!
A fantastic opportunity for an experienced Java Developer. We’re seeking a new software engineer to join our small team of engineers and scientists and be responsible for implementing world-leading research ideas in software.
You can find out more about our work or about the CSIRO ICT Centre here.
Interested? Check out the position description for more information about the position, and to apply.
Finally got my website published
So it’s far from finished, but I’ve finally bitten the bullet and published my new site to the web. It’s been sitting on my staging/development server for more than two years now, although the current incarnation bears very little resemblance to that old site!
I’ve focussed on two sections at the moment:
- Trying to create a collective resource that documents how people are using the Enron Email Corpus. This is a massive collection of real-world email from Enron that is available for research purposes. (If you’re interested, head on over to the Enron Email Corpus pages)
- Documenting relevant resources for my Masters Project (which will hopefully lead into a PhD), looking at discourse structures and intention in email communication. This is based around email classification (at least partially), and will hopefully make use of the Enron corpus, both for investigating patterns of communication, and for ensuring that the tools produced work on real-world, noisy data.
Still got a mountain of uni work to do (about 15,000 words of essays for Speech Recognition, as well as constructing a speaker/speech recognition system in R; and a formal literature review and research proposal for my research project). One day I’ll feel like I’m actually making progress!
Starting RMI server programmatically
I didn’t have time to look at this for CeBIT, but it seems that there’s a simple way to manage the RMI server programmatically, rather than having to rely on external scripts to ensure that an RMI server is available on the appropriate machine.
To start the server on port 1099 (the default RMI port), just execute:
Registry reg = LocateRegistry.createRegistry(1099);
And to stop it again on shutdown, try:
Simple eh? The only thing to be careful of is that the LocateRegistry.createRegistry() method returns the Registry object itself, not a stub (as the other RMI methods do). I’m just going to add this into SciFly, to make it one less thing for people to think about when running SciFly (to remind you, I use RMI to communicate between the barcode scanner, user interface servlet and natural language generation engine components).
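Putting the two steps together, here is a minimal sketch of starting and stopping the registry from code. The class and method names are hypothetical, and using `UnicastRemoteObject.unexportObject` for the shutdown step is my assumption of the usual way to do it; it relies on having the Registry object itself rather than a stub, as noted above.

```java
import java.rmi.NoSuchObjectException;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;

// Hypothetical helper that manages an in-process RMI registry.
final class RegistryLifecycle {
    // Creates and exports a registry in this JVM, listening on the given port.
    static Registry start(int port) throws RemoteException {
        return LocateRegistry.createRegistry(port);
    }

    // Unexports the registry created by start(); 'true' forces it even with calls in progress.
    static boolean stop(Registry registry) throws NoSuchObjectException {
        return UnicastRemoteObject.unexportObject(registry, true);
    }

    public static void main(String[] args) throws Exception {
        Registry reg = start(Registry.REGISTRY_PORT); // 1099, the default RMI port
        System.out.println("Registry started, bound names: " + reg.list().length);
        System.out.println("Registry stopped: " + stop(reg));
    }
}
```

Keeping the reference returned by `start` is the important part: the shutdown call needs that same object.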