Deconstructing real Google searches - why Powerset matters
I was looking at the log files for my blog today, as I regularly do, and I was suddenly struck by the variety of search queries in Google for which users were getting referred to my posts. I write often about the different flavors of search - including vertical search, parametric search, semantic search, and so on - so users with queries about Search often land here. But do they always find what they're looking for?
Some Real-life Search Results
Let us examine some of the actual Google queries - in the form of referring URLs - that led users to my blog. In most cases, Google did a fine job of matching the content to the query; in some cases, it was a somewhat random match at best; finally, in a few cases, the Google search algorithms are clearly getting confused. It is this third case that is the most interesting.
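For readers who want to try this with their own logs, here is a minimal sketch of how the search phrase can be pulled out of a referring URL. It assumes the phrase sits in the q parameter, as it does in the Google URLs listed in this post, and that the URL has no stray spaces (the URLs below are shown with spaces after some ampersands, which would need to be removed first). The function name extract_query is my own, for illustration only.

```python
from typing import Optional
from urllib.parse import urlparse, parse_qs


def extract_query(referrer: str) -> Optional[str]:
    """Return the decoded search phrase from a referrer URL, or None if absent."""
    # parse_qs decodes '+' and %-escapes, and maps each parameter to a list of values.
    params = parse_qs(urlparse(referrer).query)
    values = params.get("q")  # assumption: the search phrase is in the 'q' parameter
    return values[0] if values else None


if __name__ == "__main__":
    referrer = "http://www.google.com/search?hl=en&q=search+technology+exits"
    print(extract_query(referrer))
```

Referrers from other search engines may use a different parameter name, so this would need adjusting for anything other than Google-style URLs.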
The Good
In many cases, the match was quite straightforward and very relevant. Some examples are given below.
1.
Query: http://www.google.fr/search?q=Guru+Avinash+Kaushik& ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:fr:official& client=firefox-a
Result: A conversation with Avinash Kaushik, Web Analytics Guru
Well, can't argue with that!
2.
Query: http://www.google.cn/search?sourceid=navclient&aq=t& hl=zh-CN&ie=UTF-8&rlz=1T4XNLA_zh-CNCN246CN247& q=vertical+search
Result: The rise of Vertical Search Engines (VSEs)
Query: http://www.google.com/search? q=wikipedia+to+try+and+compete+with+google& ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official& client=firefox-a
Result: Wikipedia Search to compete with Google
Again, can't argue with those.
3.
Query: http://www.google.com/search?hl=en& q=search+technology+exits
Result: So You've Built an Alternative Search Engine - Now What?
This is actually pretty awesome: the algorithm has figured out "search technology" and "exits". In fact, this post does talk about exit strategies for search engines, so it's a great match.
The Bad
Some search queries are so vague that the matches you get are bound to be somewhat random. I don't blame Google for the following matches:
4.
Query: http://www.google.com/search?hl=en& q=conceptual+architecture
Result: A Conceptual Architecture for Search
Is the search string too vague? Although technically this post matches the search query, I'm guessing that this is not what the user intended to look for.
5.
Query: http://www.google.com/search?hl=en&safe=off& client=firefox-a&rls=org.mozilla%3Aen-US%3Aofficial& hs=pGP&q=disruptive+technologies+blog& btnG=Search
Result: Disruptive technologies for 2007
While the words match, and possibly this may satisfy the user, I get the sense that the user was looking for a blog dedicated to discussing disruptive technologies, not a single post. But who knows? Again, too vague!
I wonder how soon Search technology will progress to the point where the UI automatically asks the user for more information to qualify search terms that are too general or vague. A little while ago, I envisioned a similar scenario (Vertical Search, with authority) when taking a look at the search engine MetaMojo, which has taken some steps in this direction.
The Ugly
In a few cases, though, the proximity of certain keywords fools the search algorithms. Consider the following matches:
6.
Query: http://www.google.com/search? q=best+search+engine+for+directions&ie=utf-8& oe=utf-8&aq=t&rls=org.mozilla:en-US:official& client=firefox-a
Result: Future Directions in Search
A post about "future directions in Search" is not a post about "search engines for directions", although the text itself is undoubtedly a close match.
7.
Query: http://www.google.com/search?hl=en& q=people+search+software+compared&btnG=Search
Result: Search and the Dumbness of Crowds
Hmm? This is a popular post, but I'm not sure if it helps the user, who is not trying to compare search strategies (as this post does); instead, the user appears to be trying to compare people search engines.
Are these good matches? While the content of the posts bears a superficial resemblance to the text in the respective queries, the results are not relevant to the requested user searches.
The Larger Problem
The samples given above are not that important in themselves; the matches from my blog do not always show up at the top of the search results and, although these are real referrals, not many users will actually click on these links in the Results page. But these examples point to a deeper underlying issue, one that will be far from easy to fix in the general sense.
All the major search engines currently rely on the proximity of keywords and search terms to match results. But that approach can be misleading, causing the search engine to systematically produce incorrect results under certain conditions.
To demonstrate, let us take a look at three general use cases.
[Note: The examples given below are all drawn from Google. To be fair, all the major search engines use similar algorithms, and all suffer from similar problems. For its part, Google handles billions of queries every day, usually very competently. As the reigning market leader, though, Google is the obvious target - it goes with the territory!]
1. Difficulty of Finding Long Tail Results
Take Britney Spears. Given the current popularity of articles, pictures and videos of the superstar singer, the results for practically any query with the word "spears" in it will be loaded with matches about her - especially if the search involves television or entertainment in any way.
Let's say you're watching the movie Zulu and you start wondering what those large spears that all the extras are waving about are made of. So you go to Google and type in "movie spears material" - an obviously insufficient description, as the screen shot below shows.

What happens if you expand on the query further - say: "what are movie spears made out of?" - does it help? Here's a screen shot.

The general issue here is that articles about very popular subjects accumulate high levels of PageRank and then totally overwhelm long tail results. This makes it very difficult for a user to find information about unusual topics that happen to lie near these subjects.
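To make the effect concrete, here is a toy sketch in which a page's score is its keyword overlap with the query multiplied by a popularity weight. This is emphatically not Google's ranking formula; the two pages, their text and their popularity numbers are all invented for illustration, and the popularity weight is only a crude stand-in for something like PageRank.

```python
# Hypothetical mini-index: (title, indexed text, popularity weight).
# Every value here is made up for illustration.
PAGES = [
    ("Britney Spears on television", "britney spears television movie news", 1000.0),
    ("Zulu spear construction", "zulu movie spears material wood iron", 2.0),
]


def tokens(text):
    """Lowercase the text and split it into a set of words."""
    return set(text.lower().split())


def score(query, text, popularity, use_popularity=True):
    """Keyword overlap, optionally multiplied by the popularity weight."""
    overlap = len(tokens(query) & tokens(text))
    return overlap * popularity if use_popularity else overlap


def rank(query, use_popularity=True):
    """Return page titles ordered from best to worst score."""
    ordered = sorted(
        PAGES,
        key=lambda page: score(query, page[1], page[2], use_popularity),
        reverse=True,
    )
    return [title for title, _, _ in ordered]


if __name__ == "__main__":
    query = "movie spears material"
    print(rank(query, use_popularity=False))  # overlap only
    print(rank(query, use_popularity=True))   # overlap times popularity
```

On overlap alone, the long tail page matches all three query words and comes first. Once the popularity weight is applied, the celebrity page wins even though it matches only two of the three words.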
2. Keyword Ordering
Since the major search engines focus only on the proximity of keywords without context, a user search that's similar to a popular concept gets swamped with those results, even if the order of keywords in the query has been reversed. For example, a tragic occurrence that's common in modern life is that of a bicycle getting hit by a car. Much less common is the possibility of a car getting hit by a bicycle, although it does happen. How would you search for the latter? Try typing "car hit by bicycle" into Google; here's a screen shot of what you get. [Note the third result, which is actually relevant to this search!]
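The ordering problem is easy to reproduce with a simple "bag of words" model, sketched below. This is a deliberate simplification of my own and not a description of how any real search engine represents queries; it only shows why word order disappears once a query is reduced to its keywords.

```python
from collections import Counter


def bag_of_words(query):
    """Reduce a query to word counts, discarding word order entirely."""
    return Counter(query.lower().split())


if __name__ == "__main__":
    common = bag_of_words("bicycle hit by car")
    rare = bag_of_words("car hit by bicycle")
    # Both queries contain exactly the same words, so the two bags are equal.
    print(common == rare)  # True
```

To a matcher that sees only the bag, the rare event and the common one are the same query, so the far more numerous pages about the common one will dominate.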

3. Keyword Relationships
Since the major search engines focus only on the keywords in the search phrase, all sense of the relationship between the search terms is lost. For example, users commonly change the meaning of search terms by using negations and prepositions; it is also fairly common to look for the less common members of a set.
This takes us into the realm of natural language processing (NLP). Without NLP, the nuances of these query modifications are totally invisible to the search algorithms.
For example, a query such as "Famous Science fiction writers other than Isaac Asimov" is doomed to failure. A screen shot of this search in Google is given below. Most of the returned results are about Isaac Asimov, even when the user is explicitly trying to exclude him from the list of authors found.
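Here is a rough sketch of why this kind of query backfires under pure keyword matching. The stopword list, the two documents and the scoring are all hypothetical and chosen by me for illustration; real engines are far more sophisticated, but the basic failure looks similar.

```python
# Hypothetical stopword list: words a naive matcher throws away before matching.
STOPWORDS = {"other", "than", "the", "a", "of"}

# Hypothetical documents, invented for illustration.
DOCS = {
    "Isaac Asimov biography": "isaac asimov famous science fiction writers",
    "Ursula K. Le Guin profile": "ursula le guin famous science fiction writers",
}


def keywords(query):
    """Lowercase, split, and drop stopwords; 'other than' vanishes here."""
    return {w for w in query.lower().split() if w not in STOPWORDS}


def best_match(query):
    """Return the title of the document sharing the most keywords with the query."""
    wanted = keywords(query)
    return max(DOCS, key=lambda title: len(wanted & set(DOCS[title].split())))


if __name__ == "__main__":
    query = "Famous Science fiction writers other than Isaac Asimov"
    print(sorted(keywords(query)))
    print(best_match(query))
```

Once "other than" is discarded, "isaac" and "asimov" become two extra keywords to match, so the very author the user wanted to exclude ends up with the highest score.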

All of the searches shown above look like gimmicks - queries designed intentionally to mislead Google's search algorithms. And in a sense, they are; these specific queries can be easily fixed by tweaking the search engine. Nevertheless, these queries do point to a real need: the value of understanding the meaning behind both the query and the content indexed.
Semantic Search
That's where the concept of semantic search comes in. I attended a media event earlier this year at stealth search startup Powerset (see: Powerset is Not a Google-killer! ) which showcased a live demo of their search engine, currently in closed alpha, that highlighted solutions to exactly this type of issue.
For example, type "What was said about Jesus" into a major search engine, and you usually get a whole list of results that consist of the teachings of Jesus; this means that the search engine entirely missed the concepts of passive voice and "about". The Powerset results, on the other hand, were consistently on target (for the demo, anyway!).
In other words, when you look at just the keywords in the query, you don't really understand what the user is looking for; by looking at them within context, by taking into account the qualifiers, the prepositions, the negatives, and other such nuances, you can create a semantic graph of the query. The same case can be made for semantic parsing of the content indexed. Put the two together, as Powerset does, and you can get a much better feel for relevance of results.
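As a very rough illustration of what "taking the qualifiers into account" might mean, the sketch below turns one specific pattern, "X other than Y", into a structured query with a topic and an exclusion. This is only a toy of my own devising: it handles a single phrase with a regular expression, it has nothing to do with how Powerset actually parses language, and the names ParsedQuery, parse_query and matches are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParsedQuery:
    topic: str                      # what the user wants
    excluded: Optional[str] = None  # what the user explicitly does not want


def parse_query(query):
    """Split 'X other than Y' into topic X and exclusion Y (toy pattern only)."""
    m = re.match(r"^(?P<topic>.+?)\s+other than\s+(?P<excluded>.+)$", query, flags=re.IGNORECASE)
    if m:
        return ParsedQuery(m.group("topic").strip(), m.group("excluded").strip())
    return ParsedQuery(query.strip())


def matches(parsed, doc_text):
    """A document matches if it has every topic word and not the excluded phrase."""
    text = doc_text.lower()
    words = text.split()
    on_topic = all(w in words for w in parsed.topic.lower().split())
    is_excluded = parsed.excluded is not None and parsed.excluded.lower() in text
    return on_topic and not is_excluded


if __name__ == "__main__":
    parsed = parse_query("Famous Science fiction writers other than Isaac Asimov")
    print(parsed)
    print(matches(parsed, "isaac asimov famous science fiction writers"))
    print(matches(parsed, "ursula le guin famous science fiction writers"))
```

The point is only that once the query carries structure (a topic plus an exclusion) rather than a flat list of keywords, the exclusion can actually be honored. Doing this for arbitrary language is, of course, the hard part.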
What about Google? I'm sure the smart folks in Google's search-quality team are busily working on this problem as well. I look forward to the time when the major search engines handle long tail queries more accurately and make Search a better experience for all of us.
Great roundup, Nitin. I agree that search technology has a long way to improve, though from some demos I've tried in Powerset labs, I'm not yet convinced they're the answer. They're still demos though, so I'm eager to see what they do.
Posted by: David Berkowitz | January 09, 2008 at 09:25 AM
This may be horribly pessimistic, but I don't think it's in Google's best interest to provide accurate search results. Think of the lost advertising revenue if Google returned only one perfect result. It would pretty much destroy Google's AdWords program.
I think that as long as pageviews are tied to revenue, Google will make sure you click through at least 3 pages before you get to your best result. That applies to all ad driven search applications.
They are trying to build up their affiliate advertising business so that they can get out of this trap.
They can process 20 petabytes a day, plus they have millions of points of data governing user behavior. They have enough data to take a random sampling representative of the whole over time. I can't believe they can't figure out what I want. I think it's a revenue model, not a technical issue.
Posted by: Irvin Owens Jr | January 09, 2008 at 09:24 PM
Nice article / debate. Will be interesting to see how things progress.
Posted by: Mary Stanton | January 14, 2008 at 12:02 PM
I went to Powerset's offices last week for a video interview with Founder Lorenzo Thione
Posted by: xavierv | March 10, 2008 at 03:04 PM