Google abandons PageRank for Wikipedia data?
Wednesday February 14th 2007, 11:31 am
Filed under: language technology,search,technology
Posted by: Andrew Lampert

Something I hadn’t noticed until recently is that, in addition to information about topics such as weather, stock reports, and news, Google OneBox now provides results from Wikipedia if the word ‘info’ or ‘information’ is included in a user’s search query.

Google OneBox appears to be an attempt to gradually and subtly integrate question-answering-style results into the more familiar ranked list of results. This is done only for well-understood domains with comprehensive and trusted data sources, of which Wikipedia is increasingly an excellent example.

The integration of Wikipedia results means that now, if you enter a query such as “csiro info”, you’ll sometimes get a result from Wikipedia above your ranked list of general web links that looks like this:

[Image: CSIRO Google Search Result]

A little experimentation with this feature reveals some curious results.

If we search for “csir info”, we don’t get any Wikipedia OneBox results. A CSIR page does exist on Wikipedia, but it is actually a disambiguation page that points to several possibly intended topics, including the CSIR in India, the CSIR in South Africa, and Australia’s CSIRO (which was once called the CSIR).

More interesting, however, is that Google itself does not appear to be using its own ranking algorithms to determine which Wikipedia page to display in the OneBox results. If we use the search query “csir site:en.wikipedia.org”, which constrains our search to pages from the English Wikipedia, the highest ranked result is in fact the Wikipedia page about the CSIR in South Africa. The disambiguation page appears as the second result. Thus, if Google were using its own ranking algorithms for selecting Wikipedia results for OneBox, we would expect to see the CSIR South Africa Wikipedia article in our OneBox result for our “csir info” search query.

Instead, it appears that results from Wikipedia are only included in OneBox results if there is an exact (or perhaps very close) match between the search query and the title of a Wikipedia article. Disambiguation pages, which prompt a user to choose between multiple topics that might be referred to by a single phrase or term, appear to never be included in the Google OneBox results, even if the title is an exact match.

More importantly, even if external evidence suggests that an article is relevant to a search query, that article won’t be displayed if its title doesn’t match the query terms.
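
To make the hypothesis concrete, here is a minimal Python sketch of the selection rule as I have guessed it from the outside. Everything in it is an assumption: the `ARTICLES` table, the article titles in it, the `TRIGGER_WORDS` set and the `onebox_result` function are hypothetical stand-ins for illustration, not anything Google has documented.

```python
from typing import Optional

# Hypothetical toy index: article title -> is it a disambiguation page?
# The titles are illustrative stand-ins, not a real Wikipedia snapshot.
ARTICLES = {
    "CSIRO": False,
    "CSIR": True,  # disambiguation page
    "Council for Scientific and Industrial Research": False,
}

# Words that appear to trigger a Wikipedia OneBox lookup.
TRIGGER_WORDS = {"info", "information"}


def onebox_result(query: str) -> Optional[str]:
    """Return the title shown in the OneBox, or None if nothing is shown.

    Guessed rule: the query must contain a trigger word, and the remaining
    words must exactly match (ignoring case) the title of an article that
    is not a disambiguation page. Ranking evidence is never consulted.
    """
    words = query.split()
    if not any(w.lower() in TRIGGER_WORDS for w in words):
        return None
    # Drop the trigger words; what is left is the candidate title.
    topic = " ".join(w for w in words if w.lower() not in TRIGGER_WORDS).lower()
    for title, is_disambiguation in ARTICLES.items():
        if title.lower() == topic and not is_disambiguation:
            return title
    return None


print(onebox_result("csiro info"))  # exact title match
print(onebox_result("csir info"))   # only a disambiguation page matches
```

Under these assumptions, “csiro info” yields the CSIRO article, while “csir info” yields nothing: the only exact title match is a disambiguation page, and the South African article is never considered because its title differs from the query.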

Arguably, this makes perfect sense: only results that Google is very confident about are included in OneBox. What is curious is that in determining this confidence level, Google seems to rate the titles of Wikipedia articles as a better indicator of relevance than its own ranking algorithms. For the OneBox results, Google trusts Wikipedia title data, which is really just another form of user-supplied metadata, above any combination of external evidence such as anchor text from pages that link to Wikipedia articles.

I think this can be interpreted as an early example (perhaps the first?) of Google relying on user-generated metadata (Wikipedia article titles) over its sophisticated and highly tuned mathematical ranking algorithms. Is this a sign of things to come? Is Google beating Wikia at its own game before Wikia has even got a beta of its social search engine out the door?


3 Comments so far

Interesting to try and work out the algorithm for it.

Any thoughts about:
microsoft info goes to /wiki/Microsoft_Game_Studios even though /wiki/Microsoft is a thorough (and more applicable) page

spanish info goes to /wiki/spain

and ‘java bad performance info’ doesn’t return anything. Is Google’s indexing so good it doesn’t bother to waste bytes returning common knowledge? That’s clever :P

Comment by aaron 02.14.07 @ 1:01 pm

Interesting examples Aaron. Things are obviously more complicated than simple title comparisons in some cases.

All the nationalities I tried (Australian, Swedish, Dutch) map to the wiki page about the relevant country. Some have additional titles that map to the country article (so /wiki/Australian goes straight to /wiki/Australia), but others, like Swedish, have a disambiguation page that, by my reckoning, is ignored. I suspect that the Spanish -> Spain style mapping could be due to simple word lookup. It’s not simply word stemming, since you’re never going to get from Dutch -> Netherlands by that route.

The Microsoft example is very puzzling to me. Searching wikipedia using google with the query ‘microsoft’ returns the obvious /wiki/Microsoft page. The Game Studios page doesn’t appear in the first 10 links. The /wiki/Microsoft page is both a better answer, and seemingly simpler to identify. I can’t think of a reason that the Microsoft_Game_Studios page would be returned instead. Any ideas?

Another weird example is searching for ‘Andrew info’. The OneBox result is wiki/Andy_Andy. That article isn’t linked anywhere I can find, and I can’t understand how it would be the chosen article. Again, there’s obviously an Andrew->Andy mapping, but I can’t see where this is being made using wikipedia data, so perhaps it’s being done separately using Google resources?

Comment by Andrew Lampert 02.14.07 @ 2:09 pm

Andrew info is a mystery.

my first theory was andrew was in the history of the document and google had an old version, but no.

my second theory: as there is a french version of the andy andy article, maybe a french to english conversion gives andy = andrew, but according to babelfish, it’s just andy. so no.

Third theory was maybe wiki is doing cloaking to the googlebot, but wget says no to that idea.

Why not link to wiki/Andrew_Snoid ? If it was wiki/Andy blah would it still get there ?

why does info james point to wiki/James_(Nip/Tuck) and not wiki/Jimmy_Jimmy ?

Weird behaviour, now i’m confused and won’t sleep properly tonight :-| It’s definitely a different smell to the normal google indexing, esp as site:en.wikipedia.org andy doesn’t list andy andy in the first page.

Even weirder, no other articles point to the Andy Andy, so googlebot would have to come in from the all pages index.

Comment by aaron 02.14.07 @ 2:59 pm


