Filed under: language technology,search,technology
Posted by: Andrew Lampert
MIT’s Technology Review magazine recently published an article on a product called Automatic Linguistic Indexing of Pictures – real time (ALIPR), an automatic image tagging technology. ALIPR seems to be an interesting but immature piece of research around algorithms for automatically applying appropriate tags to images. Unfortunately, I came away from reading the TR article with the feeling that the research in ALIPR is being lost in the hype.
Perhaps the product’s title is the first thing that irritated me – despite claiming to offer “linguistic indexing”, it offers nothing of the sort. Instead, it simply assigns tokens (that in this case happen to be labels from a closed set of 332 words) to images. This is less linguistic than the classical “bag-of-words” approach that is used in text search!
Next, let’s consider some of the statistics quoted in the article. In the first paragraph, we’re told that “At least one accurate tag was generated for 98 percent of all the pictures analysed”. As my colleague Shane Stephens pointed out in referring me to the article, this is an almost meaningless statistic! Think about what it means for a second – in generating 15 tags for an image, 98% of the time, 1 of those tags is relevant to the image. Even if you’ve got 14 completely irrelevant tags, that counts as a hit. That’s not exactly going to give you a tool as useful as a 98% success metric might indicate! The current capability is even less impressive if you look at the generality of tags that are actually applied.
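To get a feel for how forgiving the "at least one accurate tag" measure is, here is a rough sketch of what a purely random tagger would score on it. This is an illustration only, not a description of how ALIPR works: it assumes 15 distinct tags are drawn uniformly at random from the 332-word vocabulary, and the number of vocabulary words that count as relevant to a given image (relevant_in_vocab) is a made-up parameter, not something reported in the article.

```python
from math import comb

VOCAB_SIZE = 332      # size of the closed tag vocabulary mentioned above
TAGS_PER_IMAGE = 15   # number of tags generated per image

def chance_of_at_least_one_hit(relevant_in_vocab,
                               vocab_size=VOCAB_SIZE,
                               tags=TAGS_PER_IMAGE):
    """Probability that a random set of `tags` distinct words contains
    at least one of the `relevant_in_vocab` relevant words."""
    # Count the tag sets that miss every relevant word, then take the complement.
    all_miss = comb(vocab_size - relevant_in_vocab, tags)
    return 1 - all_miss / comb(vocab_size, tags)

# Hypothetical numbers of relevant vocabulary words per image.
for k in (5, 10, 20, 40):
    print(k, round(chance_of_at_least_one_hit(k), 2))
```

The more generous you are about what counts as relevant (and very general words like thing or photo make that easy), the closer even random guessing gets to a "hit" on every image.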
Another apparently noteworthy metric is that for 51% of the unseen Flickr images it tagged, the first tag it assigned was also in the user’s tagset. Let’s interpret this one: only about half the time was the tag that ALIPR ranked as most relevant of the 15 it applied actually relevant at all. Hmm, it seems there’s rather a large chasm to be crossed before this technology starts living up to the promise that TR is suggesting.
In order to investigate its current capabilities, I’ve tried ALIPR on a few images I’ve got posted at Flickr, and, as you can read below, the results were mixed at best.
Photo 1
Here’s the first photo I tried – a picture of me in the snow on our trek in Bhutan. This should be a reasonably simple image to tag, since it’s a portrait photo (and *lots* of photos are presumably people photos). So, what did ALIPR suggest as tags?
indoor, decoration, people, man-made, doll, snow, old, photo, ice, toy, ship, winter, thing, steam, dogsled
Ok, so a few of those are actually reasonable – like snow, winter (although it’s not actually winter), ice, and maybe even people. But look at how general some of those other tags are: photo, man-made, thing – I challenge you to suggest a more general tag than thing! (And yes, if you really can think of something more general than the root of most ontologies, please post in the comments!) Even tags like people are general enough that you’d probably get reasonable accuracy just by including people in your set of 15 tags for every photo, regardless of what your algorithms tell you. And some of those tags are off the chart in terms of irrelevance: doll, ship, indoor.
Bottom line: 4 tags out of 15 correct (being generous)
Photo 2
What about a slightly harder picture – one which isn’t cropped as a portrait? Again, it’s a picture from the mountains of Bhutan (because right now, that’s all I have published at Flickr). So what tags did ALIPR suggest for this picture?
animal, historical, rock, wild_life, tree, architecture, landscape, elephant, world, building, art, sky, grass, antelope, desert
Again, we see some very general, upper-ontology tags like world that provide coverage in the tagset, while almost all of the more specific tags are wrong (like elephant, building, antelope). There are again some tags that are arguably relevant, like sky, landscape and maybe even rock and grass, but again, the tags are so general as to be probably useless for most purposes. As a point of comparison, I wonder how many people would tag this photo in Flickr with grass or sky?
Bottom line: 4 tags out of 15 correct (being very generous)
Photo 3
Another photo – this time a picture without people or landscapes, so removing two of the ‘catch-all’ categories. The tags produced are:
man-made, train, aviation, people, surf_side, water, ocean, sky, indoor, landscape, plane, drawing, beach, grass, poster
Now, I would argue that in this case none of those tags are appropriate for the picture, despite the generality of many of them. If I were feeling extraordinarily generous, I *might* allow man-made as a correct (but useless) tag for the tent or the clothing in the picture.
Bottom line: 1 tag out of 15 correct (being extraordinarily generous)
Photo 4
One final photo – this time not one of mine. This picture of a packet of Doritos chips came from the ALIPR collection of recently tagged images. The tags suggested for this picture were:
animal, wild_life, grass, tree, landscape, rock, wild_cat, people, rural, building, historical, tiger, reptile, forest, lake
I don’t think anyone could argue that any of those tags are even remotely relevant to the image.
Bottom line: 0 tags out of 15 correct.
Precision
So what does our little (admittedly biased) set of tests show? Well, if we calculate precision, a metric commonly used in IR and classification tasks, how does ALIPR perform? Out of 60 tags applied across 4 photos, 9 were possibly relevant, giving it a precision of 15%. Now, 15% precision sounds very different from the 98% marketing statistic quoted in the story, but it is actually just a different perspective on the same data – at least 1 tag was applicable for 3 of our 4 pictures, or 75% – a much better-sounding result than 15% precision.
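For anyone who wants to check the arithmetic, here is a minimal sketch that computes both figures from the same four "Bottom line" tallies above. The function names are just mine for illustration; the only inputs are the per-photo counts of (generously) relevant tags and the 15 tags applied per photo.

```python
# Relevant-tag counts for Photos 1-4, taken from the "Bottom line" tallies.
relevant_per_photo = [4, 4, 1, 0]
TAGS_PER_PHOTO = 15

def precision(relevant_counts, tags_per_photo):
    """Fraction of all applied tags that were relevant."""
    total_tags = tags_per_photo * len(relevant_counts)
    return sum(relevant_counts) / total_tags

def hit_rate(relevant_counts):
    """Fraction of photos with at least one relevant tag."""
    hits = sum(1 for count in relevant_counts if count > 0)
    return hits / len(relevant_counts)

print(f"precision: {precision(relevant_per_photo, TAGS_PER_PHOTO):.0%}")  # 15%
print(f"at least one relevant tag: {hit_rate(relevant_per_photo):.0%}")   # 75%
```

Same data, two very different-sounding numbers: the second measure only asks whether a photo got any relevant tag at all, so the 14 misses on Photo 3 cost it nothing.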
Summary
I actually think the idea behind ALIPR is interesting, particularly when it’s combined with human feedback mechanisms for further refining and training the system as is done on the ALIPR website (though I wonder how spam or deliberately erroneous tags will affect the system’s performance). It is a neat piece of research in using machine learning techniques to apply labels to images.
ALIPR does not, however, even begin to provide anything approaching the ‘semantics’ of an image, nor does it deserve the ‘linguistic’ moniker – though if I’m being cynical, perhaps that’s only there to create an acronym for which the domain name was still available? I just wish research didn’t have to be hyped in order to be worthy of media attention.
9 Comments so far
Scoble has video demonstration of a new likeness based search.
They’ve gone a slightly different route, and rather than use tags to find images, they use an existing image.
The image processing algorithm seems good, but not good enough for general use. However, in the targeted verticals they are using it for … wow!
Comment by Geoff Wilson 11.17.06 @ 7:46 am
Yeah – there’s lots of research systems in image retrieval (and more generally multimedia retrieval) that have used input images as a query to find similar images for quite a while now.
I was actually more impressed by Riya’s work in face recognition for photos than the “likeness-based” search, particularly given it is limited to very constrained domains of application right now.
More generally, I’m just not sure how useful image-input search is. A lot of the time, when I’m searching for an image, I don’t have a prototypical image to use as an input query. Even if I do, I’m not sure that you could adequately find similar images using image processing algorithms (I don’t just want something with similar colouring, contrast etc, but the same subject matter most of the time – and that’s *much* harder to detect).
But I guess, as you point out, it may have application in some limited domains. Perhaps even the fashion domain it is currently targeted at is one such domain – I don’t know.
Comment by Andrew Lampert 11.17.06 @ 8:21 am
Hi, we have now been working for a while on a prototype, proof-of-concept image search engine called Behold that combines statistical image auto-annotation with content based image browsing. It currently indexes over 1 million images from university websites. So far the keywords it can handle are very simple but some can be helpful when extracted html metadata is poor.
Comment by Alexei Yavlinsky 11.20.06 @ 11:33 am
Hi Alexei,
I had a quick look at Behold – seems a similar idea in terms of classifying images using a limited vocabulary of annotation tags. Is that right?
I tried ‘beach’ as an example visual search, and was surprised that the first couple of results seem less relevant than later results. Is there any attempt to rank/order the images presented like in text search, or is it simply a set of images with annotations that match the keyword? I was wondering, for example, whether you take into account multiple forms of evidence for image relevance: e.g., your automatic annotations combined with meta-data around the image, plus the image file name or other similar properties.
Thanks for the link.
Comment by Andrew Lampert 11.21.06 @ 9:55 am
Hi,
thanks for the feedback! It is the same idea of using a limited annotation vocabulary, but used for the purpose of helping ameliorate poor text metadata in bulk search, rather than suggesting tags to users. Right now automatic annotations are deliberately not combined with text metadata, to demonstrate raw capabilities of the former. You can try how metadata search would perform on its own by using it as a separate search option. We give a short rundown of how the system works here and we even have a blog for everyone to leave comments.
Comment by Alexei Yavlinsky 11.21.06 @ 1:26 pm
Thanks for the pointers to further information. Nice to see some PhD research actually out there being used in anger! Gives some hope to people like me still in the midst of our own PhD research!! I was interested to see your supervisor is Stefan Rueger – I remember meeting him a couple of years ago when he was visiting a multimedia research group where I work. He even gave a seminar in the seminar series that I co-ordinate. Small world!
Comment by Andrew Lampert 11.21.06 @ 4:36 pm
Hi, in response to your comment about the combination of metadata and ‘automatic’ image tags, this is a feature that I’ve been able to add in the past couple of days! Here’s a link to a post that describes the new feature
Comments and suggestions appreciated. Thanks!
The link in the previous post should have been pointing to here.
Comment by Alexei Yavlinsky 11.26.06 @ 6:05 pm
Nice work Alexei! Looks great from the examples you’ve got in your blog post. I’ll have to spend some time trying it out.
Comment by Andrew Lampert 11.28.06 @ 8:37 pm

