Filed under: language technology, research, science, technology
Posted by: Andrew Lampert
How does Jorge Cham at PhD Comics have such a knack for getting inside my head?
MIT’s Technology Review magazine recently published an article on a product called Automatic Linguistic Indexing of Pictures – real time (ALIPR), an automatic image tagging technology. ALIPR seems to be an interesting but immature piece of research around algorithms for automatically applying appropriate tags to images. Unfortunately, I came away from reading the TR article with the feeling that the research in ALIPR is being lost in the hype.
Perhaps the product’s title is the first thing that irritated me – despite claiming to offer “linguistic indexing”, it offers nothing of the sort. Instead, it simply assigns tokens (that in this case happen to be labels from a closed set of 332 words) to images. This is less linguistic than the classical “bag-of-words” approach that is used in text search!
Next, let’s consider some of the statistics quoted in the article. In the first paragraph, we’re told that “At least one accurate tag was generated for 98 percent of all the pictures analysed”. As my colleague Shane Stephens pointed out in referring me to the article, this is an almost meaningless statistic! Think about what it means for a second – in generating 15 tags for an image, 98% of the time, 1 of those tags is relevant to the image. Even if you’ve got 14 completely irrelevant tags, that counts as a hit. That’s not exactly going to give you a tool as useful as a 98% success metric might indicate! The current capability is even less impressive if you look at the generality of tags that are actually applied.
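To put a rough number on how little the 98% figure tells us, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that each of the 15 tags is independently relevant with the same probability; that is my simplification, not something the TR article or the ALIPR authors claim, and the function names are made up for this example.

```python
def at_least_one_hit(per_tag_precision: float, num_tags: int = 15) -> float:
    """Chance that at least one of num_tags tags is relevant,
    ASSUMING each tag is relevant independently with the same probability."""
    return 1.0 - (1.0 - per_tag_precision) ** num_tags


def precision_needed(hit_rate: float, num_tags: int = 15) -> float:
    """Per-tag precision that yields the given 'at least one hit' rate,
    under the same independence assumption (inverse of at_least_one_hit)."""
    return 1.0 - (1.0 - hit_rate) ** (1.0 / num_tags)


if __name__ == "__main__":
    p = precision_needed(0.98)
    print(f"Per-tag precision needed for a 98% hit rate: {p:.1%}")
    print(f"Round trip check: {at_least_one_hit(p):.1%}")
```

Under that assumption, a per-tag precision of roughly 23% would be enough to report "at least one accurate tag" for 98% of images, which is to say that more than three quarters of the tags could be wrong.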
Another apparently noteworthy metric is that for 51% of unseen Flickr images that it tagged, the first tag it assigned was also in the user’s tagset. Let’s interpret this one: only half the time was the tag that ALIPR thought was most relevant out of the 15 tags it applied actually relevant at all. Hmm, it seems there’s rather a large chasm to be crossed before this technology starts living up to the promise that TR is suggesting.
In order to investigate its current capabilities, I’ve tried ALIPR on a few images I’ve got posted at Flickr, and, as you can read below, the results were mixed at best.
The blogosphere seems to have recently rediscovered the Enron Email Corpus, thanks to the publicity surrounding Trampoline Systems’ newly released web application for exploring the Enron emails.
Exploring Enron offers a number of different views of the Enron data via:
Also offered are trendy Web 2.0 compliant ‘tag clouds’ for showing related people and topics when browsing the Enron messages. There is nothing particularly novel in any of this functionality, but Exploring Enron does offer a better-than-prototype quality application that has the potential to bring the Enron email data to the attention of a whole new non-research-oriented audience. In this sense, it continues in the same vein as (while offering greater functionality than) other polished sites like Inboxer’s Enron Email site.
It will be interesting to see whether anything comes from this renewed attention from people who haven’t yet played with a large-scale email corpus.
According to stats from CSIRO’s Information Management and Technology people, there were just shy of 81 million spam and virus-laden email messages blocked during September 2006, representing more than 96% of all email traffic. In fact, less than 3.5% of all email represented legitimate messages, which paints an even bleaker picture of the email world than the numbers presented at VNUNet.com. Even other academic institutions such as Rutgers University show an average of 7% legitimate email over the past month.
In contrast with spam, virus-laden messages continue to represent a relatively small proportion of messages – there were roughly 2500 spam messages blocked for every virus-laden email that was detected, with less than 0.05% of all messages carrying virus payloads, almost an order of magnitude less than the 0.41% rate for emails processed by SoftScan. Interestingly, the CSIRO results are down from a peak in December 2005, when virus emails accounted for 0.5% of email messages.
Looking back over the past 12 months, there were close to 1 billion spam email messages received – an astounding figure for an organisation the size of CSIRO. This represents more than 150,000 spam email messages per employee per annum, or more than 400 spam email messages per employee per day. From the numbers released, we can also determine that the average number of non-spam messages received per employee is less than 20 per day. This starts to demonstrate just how significant the spam problem really is: there are at least 20 spam messages received for every legitimate email message.
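For anyone who wants to check the arithmetic, here is a quick sketch using the rounded figures quoted above. The implied headcount is derived from those figures and is not a number released by CSIRO, and the variable names are just for this example.

```python
DAYS_PER_YEAR = 365

# Rounded figures from the paragraph above.
spam_per_year = 1_000_000_000         # "close to 1 billion" spam messages
spam_per_employee_per_year = 150_000  # "more than 150,000" per employee
legit_per_employee_per_day = 20       # "less than 20" per day (upper bound)

# Daily spam load per employee.
spam_per_employee_per_day = spam_per_employee_per_year / DAYS_PER_YEAR

# Headcount implied by the two spam figures (derived, not an official number).
implied_headcount = spam_per_year / spam_per_employee_per_year

# Spam messages per legitimate message; a lower bound, because the legitimate
# figure is itself an upper bound.
spam_to_legit_ratio = spam_per_employee_per_day / legit_per_employee_per_day

if __name__ == "__main__":
    print(f"Spam per employee per day: {spam_per_employee_per_day:.0f}")
    print(f"Implied headcount: {implied_headcount:.0f}")
    print(f"Spam per legitimate message: at least {spam_to_legit_ratio:.1f}")
```

The quoted figures hang together: 150,000 a year works out to a little over 400 a day, and dividing that by fewer than 20 legitimate messages gives the ratio of at least 20 to 1.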
To the credit of CSIRO IM&T people, very little spam is actually delivered through to my work email address – less than 10 spam messages a week on average. So, kudos to the combination of technology and resources that are applied, which I know includes IronPort Anti-Spam, and almost certainly a whole suite of other tools and techniques.
The bottom line, however, seems to be that email protocols are broken – what other technology or infrastructure suffers misuse on the overwhelming scale that we see in email traffic?