Do we need sentiment analysis for email?
Tuesday January 22nd 2008, 12:32 pm
Filed under: email,information delivery,language technology,research,technology
Posted by: Andrew Lampert

Brij Singh at MessageDance has posted an interesting motivation for applying sentiment analysis to incoming email. He asks whether the sentiment evoked by incoming email results in cognitive turnover for knowledge workers, thus disrupting their productivity.

Brij thinks that the application of sentiment analysis to email could help address this mental wandering for knowledge-based employees:

“I think it’s high time for companies to invest in sentiment classification and routing toxic emails to platform where immediate impact on employee productivity is less. Can carefully controlled social platform enable this process?”

Having attended a research presentation by Mary Gardiner on sentiment classification just yesterday, I find it interesting to consider the possibilities and practicalities of applying sentiment classification techniques to email.

One unsupervised technique, pioneered by Turney and Littman, is to use pointwise mutual information (PMI) and word co-occurrence counts from a search engine to help determine the valence of each word in a text. Turney and Littman used the NEAR operator in Altavista to count how often each word in the text to be classified (in our case, each word from an incoming email message) co-occurs with each word from a set of words with known positive or negative valence. Co-occurrence with the known-positive words contributes to the positive sentiment of the unclassified word, while co-occurrence with the known-negative words contributes to its negative sentiment. These co-occurrence counts are then normalised and combined to determine the overall valence of each word in the unclassified text. The technique, though simple, worked surprisingly well (80% classification accuracy at the word level), much better than many more complex techniques.
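As a rough illustration of the idea (not Turney and Littman's actual code), here is a minimal Python sketch. It assumes two hypothetical callables that stand in for search engine queries: `near_hits(a, b)`, the number of documents in which `a` occurs near `b`, and `hits(a)`, the number of documents containing `a`. The short seed lists and the smoothing constant are also assumptions chosen for illustration.

```python
import math

# Illustrative seed words with assumed known valence (equal-sized lists).
POSITIVE_SEEDS = ["good", "nice", "excellent"]
NEGATIVE_SEEDS = ["bad", "nasty", "poor"]


def so_pmi(word, near_hits, hits, smoothing=0.01):
    """Estimate the valence of `word` from co-occurrence counts.

    near_hits(a, b) and hits(a) are hypothetical stand-ins for
    search engine hit counts. The score is the summed PMI with the
    positive seeds minus the summed PMI with the negative seeds.
    Because the seed lists are the same size, the terms involving
    hits(word) and the corpus size cancel and are omitted.
    """
    score = 0.0
    for pos in POSITIVE_SEEDS:
        score += math.log2(near_hits(word, pos) + smoothing)
        score -= math.log2(hits(pos) + smoothing)
    for neg in NEGATIVE_SEEDS:
        score -= math.log2(near_hits(word, neg) + smoothing)
        score += math.log2(hits(neg) + smoothing)
    return score
```

A score above zero suggests positive valence and a score below zero suggests negative valence. The smoothing term only exists to avoid taking the logarithm of zero when a pair never co-occurs.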

Ignoring the sad reality that the NEAR operator is no longer available in Altavista queries (and that no other search engine offers a similar operator in its public query interface), it’s interesting to think about whether such a technique could be usefully applied to email. I don’t know whether people have addressed how to move from word-level classification up to message-level sentiment classification, but it doesn’t seem to be an insurmountable problem.
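One naive way to get from words to messages, offered purely as an assumption rather than anything taken from the literature, would be to average the valence of the words in a message that have a known score. The function name and the `word_valence` dictionary below are hypothetical.

```python
import re


def message_valence(text, word_valence):
    """Average the valence of the words in `text` that have a known score.

    word_valence is a hypothetical dict mapping lowercase words to
    word-level valence scores. Words without a score are ignored, and
    a message with no scored words is treated as neutral (0.0).
    """
    words = re.findall(r"[a-z']+", text.lower())
    scores = [word_valence[w] for w in words if w in word_valence]
    if not scores:
        return 0.0
    return sum(scores) / len(scores)
```

A plain average ignores negation, sarcasm and word order, so it is best seen as a baseline to improve on rather than a solution.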

More of an issue for email is whether people would be happy for the entire text of their email messages to be sent in clear text to a single search provider. Depending on the volume and nature of data on a user’s own machine, perhaps we could use the desktop search interface to approximate Turney and Littman’s technique, without passing sensitive email data out onto the network? Of course, there’s a big difference in the scale of the corpus being used to generate the co-occurrence counts in this case. Altavista, at the time of the experiment, claimed to be indexing around 100 billion words. My desktop search index claims to contain about 1.5 million items (email messages, documents, visited web pages etc.). While that’s not going to get us to 100 billion words, might it be enough to get some credible results?
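To make the local approach concrete, here is a small sketch that counts co-occurrences directly over a collection of local documents instead of querying a search engine. The function name and the ten-token window (a rough stand-in for NEAR) are assumptions. Note also that it counts individual co-occurrences, whereas search engine hit counts are per document.

```python
import re
from collections import Counter


def count_cooccurrences(documents, seeds, window=10):
    """Count how often each word appears within `window` tokens of a seed.

    documents: iterable of strings (e.g. locally indexed email bodies).
    seeds: words with known valence.
    Returns a Counter keyed by (word, seed) pairs.
    """
    seeds = set(seeds)
    near = Counter()
    for doc in documents:
        tokens = re.findall(r"[a-z']+", doc.lower())
        for i, tok in enumerate(tokens):
            if tok not in seeds:
                continue
            # Look at the tokens on either side of the seed occurrence.
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    near[(tokens[j], tok)] += 1
    return near
```

Nothing in this sketch leaves the machine, which is the point. Whether a personal corpus is large enough for the resulting counts to be reliable is exactly the open question.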