Filed under: language technology,search,technology
Posted by: Andrew Lampert
MIT’s Technology Review magazine recently published an article on a product called Automatic Linguistic Indexing of Pictures – real time (ALIPR), an automatic image tagging technology. ALIPR seems to be interesting but immature research into algorithms for automatically applying appropriate tags to images. Unfortunately, I came away from reading the TR article feeling that the research in ALIPR is being lost in the hype.
Perhaps the first thing that irritated me was the product’s title – despite claiming to offer “linguistic indexing”, ALIPR offers nothing of the sort. Instead, it simply assigns tokens (in this case, labels from a closed set of 332 words) to images. This is less linguistic than the classical “bag-of-words” approach used in text search!
Next, let’s consider some of the statistics quoted in the article. In the first paragraph, we’re told that “At least one accurate tag was generated for 98 percent of all the pictures analysed”. As my colleague Shane Stephens pointed out when referring me to the article, this is an almost meaningless statistic! Think about what it means for a second – in generating 15 tags for an image, 98% of the time at least 1 of those tags is relevant to the image. Even if the other 14 tags are completely irrelevant, that counts as a hit. That’s not going to give you a tool anywhere near as useful as a 98% success metric might suggest! The current capability looks even less impressive once you consider how general the tags that are actually applied tend to be.
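To see just how weak the “at least one accurate tag” bar is, here’s a back-of-the-envelope sketch. The independence assumption and the 23% figure are mine, purely for illustration – this is not ALIPR’s actual model:

```python
# Sketch (my own simplifying assumption, not ALIPR's actual model):
# suppose each of the 15 tags were independently relevant with probability p.
# Then the chance that at least one tag is relevant is 1 - (1 - p)**15.
def at_least_one_hit(p, n_tags=15):
    """Probability that at least one of n_tags independent tags is relevant."""
    return 1 - (1 - p) ** n_tags

# A tagger that is right only ~23% of the time per tag already clears
# the 98% "at least one accurate tag" bar.
print(round(at_least_one_hit(0.23), 2))  # 0.98
```

In other words, under these toy assumptions, a per-tag accuracy of barely one in four is enough to produce the headline 98% figure.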
Another apparently noteworthy metric is that for 51% of the unseen Flickr images it tagged, the first tag ALIPR assigned was also in the user’s tagset. Let’s interpret this one: only half the time was the tag ALIPR thought most relevant out of the 15 tags it applied actually relevant at all. Hmm, it seems there’s a rather large chasm to be crossed before this technology starts living up to the promise TR is suggesting.
To investigate its current capabilities, I tried ALIPR on a few images I have posted on Flickr, and, as you can read below, the results were mixed at best.
Here’s the first photo I tried – a picture of me in the snow on our trek in Bhutan. This should be a reasonably simple image to tag, since it’s a portrait photo (and *lots* of photos are presumably people photos). So, what did ALIPR suggest as tags?
indoor, decoration, people, man-made, doll, snow, old, photo, ice, toy, ship, winter, thing, steam, dogsled
OK, so a few of those are actually reasonable – like snow, winter (although it’s not actually winter), ice, and maybe even people. But look at how general some of those other tags are: photo, man-made, thing – I challenge you to suggest a more general tag than thing! (And yes, if you really can think of something more general than the root of most ontologies, please post it in the comments!) Even tags like people are general enough that you’d probably get reasonable accuracy just by including people in your set of 15 tags for every photo, regardless of what your algorithms tell you. And some of those tags are off the chart in terms of irrelevance: doll, ship, indoor.
Bottom line: 4 tags out of 15 correct (being generous)
What about a slightly harder picture – one which isn’t cropped as a portrait. Again, it’s a picture from the mountains of Bhutan (because right now, that’s all I have published at Flickr). So what tags did ALIPR suggest for this picture?
animal, historical, rock, wild_life, tree, architecture, landscape, elephant, world, building, art, sky, grass, antelope, desert
Again, we see some very general, upper-ontology tags like world that provide coverage in the tagset, while almost all of the more specific tags are wrong (like elephant, building, antelope). There are again some tags that are arguably relevant, like sky, landscape and maybe even rock and grass, but again, the tags are so general as to be probably useless for most purposes. As a point of comparison, I wonder how many people would tag this photo in Flickr with grass or sky?
Bottom line: 4 tags out of 15 correct (being very generous)
Another photo – this time a picture without people or landscapes, removing two of the ‘catch-all’ categories. The tags produced are:
man-made, train, aviation, people, surf_side, water, ocean, sky, indoor, landscape, plane, drawing, beach, grass, poster
Now, I would argue that in this case none of those tags are appropriate for the picture, despite the generality of many of them. If I were feeling extraordinarily generous, I *might* allow man-made as a correct (but useless) tag for the tent or the clothing in the picture.
Bottom line: 1 tag out of 15 correct (being extraordinarily generous)
One final photo – this time not one of mine. This picture of a packet of Doritos chips came from the ALIPR collection of recently tagged images. The tags suggested for this picture were:
animal, wild_life, grass, tree, landscape, rock, wild_cat, people, rural, building, historical, tiger, reptile, forest, lake
I don’t think anyone could argue that any of those tags are even remotely relevant to the image.
Bottom line: 0 tags out of 15 correct.
So what does our little (admittedly biased) set of tests show? If we calculate precision – a metric commonly used in IR and classification tasks – how does ALIPR perform? Of the 60 tags applied across 4 photos, 9 were possibly relevant, giving a precision of 15%. Now, 15% precision sounds very different from the 98% marketing statistic quoted in the story, but it is just a different perspective on the same data: for our 4 pictures, the proportion of photos for which at least 1 tag was applicable is 75% – a much better-sounding result than 15% precision.
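The two numbers above come from the same four per-photo counts; here’s the arithmetic spelled out (counts taken from my generous judgements in this post):

```python
# Per-photo counts of relevant tags from the informal test above.
relevant_per_photo = [4, 4, 1, 0]
tags_per_photo = 15
total_tags = tags_per_photo * len(relevant_per_photo)  # 60 tags in all

# Precision: fraction of all applied tags that were relevant.
precision = sum(relevant_per_photo) / total_tags

# The marketing-friendly metric: fraction of photos with >= 1 relevant tag.
at_least_one = sum(1 for r in relevant_per_photo if r >= 1) / len(relevant_per_photo)

print(f"precision: {precision:.0%}")              # 15%
print(f"at-least-one-hit rate: {at_least_one:.0%}")  # 75%
```

Same data, two very different headline numbers – which is exactly the point.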
I actually think the idea behind ALIPR is interesting, particularly when it’s combined with human feedback mechanisms for further refining and training the system as is done on the ALIPR website (though I wonder how spam or deliberately erroneous tags will affect the system’s performance). It is a neat piece of research in using machine learning techniques to apply labels to images.
ALIPR does not, however, even begin to provide anything approaching the ‘semantics’ of an image, nor does it deserve the ‘linguistic’ moniker – though if I’m being cynical, perhaps that’s only there to create an acronym for which the domain name was still available? I just wish research didn’t have to be hyped in order to be worthy of media attention.