I was looking at the log files for my blog today, as I regularly do, and
I was suddenly struck by the variety of search queries in Google for
which users were getting referred to my posts. I write often about the
different flavors of search - including vertical search, parametric
search, semantic search, and so on - so users with queries about Search
often land here. But do they always find what they're looking for?
Some Real-life Search Results
Let us examine some of the actual Google queries - in the form of referring
URLs - that led users to my blog. In most cases, Google did a fine job
of matching the content to the query; in some cases, the match was
somewhat random at best; and in a few cases, the Google search
algorithms clearly got confused. It is this third case that is
the most interesting.
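For readers who want to try this on their own logs, here is a minimal sketch of how the search phrase can be pulled out of a referring URL. It assumes the phrase lives in the `q` parameter, as it does in the Google URLs quoted in this post; the function name `extract_query` is just an illustrative choice.

```python
from typing import Optional
from urllib.parse import urlparse, parse_qs

def extract_query(referrer: str) -> Optional[str]:
    """Return the search phrase from a Google referring URL, or None."""
    parsed = urlparse(referrer)
    # Assumption: only Google hosts are of interest, and the phrase is in "q".
    if "google." not in parsed.netloc:
        return None
    values = parse_qs(parsed.query).get("q")
    # parse_qs decodes "+" and percent-escapes into plain text.
    return values[0] if values else None

print(extract_query("http://www.google.com/search?hl=en&q=search+technology+exits"))
```

Run over a whole log file, something like this turns a column of referring URLs into a readable list of the phrases people actually typed.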
The Good
In many cases, the match was quite straightforward and very relevant. Some examples are given below.
1.
Query: http://www.google.fr/search?q=Guru+Avinash+Kaushik&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:fr:official&client=firefox-a
Result: A conversation with Avinash Kaushik, Web Analytics Guru
Well, can't argue with that!
2.
Query: http://www.google.cn/search?sourceid=navclient&aq=t&hl=zh-CN&ie=UTF-8&rlz=1T4XNLA_zh-CNCN246CN247&q=vertical+search
Result: The rise of Vertical Search Engines (VSEs)
Query: http://www.google.com/search?q=wikipedia+to+try+and+compete+with+google&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a
Result: Wikipedia Search to compete with Google
Again, can't argue with those.
3.
Query: http://www.google.com/search?hl=en&q=search+technology+exits
Result: So You've Built an Alternative Search Engine - Now What?
This is actually pretty awesome: the algorithm has figured out "search
technology" and "exits". In fact, this post does talk about exit
strategies for search engines, so it's a great match.
The Bad
Some search queries are so vague that the matches you get are bound to be somewhat
random. I don't blame Google for the following matches:
4.
Query: http://www.google.com/search?hl=en&q=conceptual+architecture
Result: A Conceptual Architecture for Search
Is the search string too vague? Although technically this post matches
the search query, I'm guessing that this is not what the user intended
to look for.
5.
Query: http://www.google.com/search?hl=en&safe=off&client=firefox-a&rls=org.mozilla%3Aen-US%3Aofficial&hs=pGP&q=disruptive+technologies+blog&btnG=Search
Result: Disruptive technologies for 2007
While the words match, and this may well satisfy the user, I get
the sense that the user was looking for a blog dedicated to discussing disruptive
technologies, not a single post. But who knows? Again, too vague!
I wonder how soon Search technology will progress to the point
where the UI will automatically ask the user for more information to
qualify search terms that are too general or vague. A little while ago,
I envisioned a similar scenario (Vertical Search, with authority) when taking a look at the search engine MetaMojo, which has taken some steps in this direction.
The Ugly
In a few cases, though, the proximity of certain keywords fools the search algorithms. Consider the following matches:
6.
Query: http://www.google.com/search?q=best+search+engine+for+directions&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a
Result: Future Directions in Search
A post about "future directions in Search" is not a post about "search
engines for directions", although the text itself is undoubtedly a close match.
7.
Query: http://www.google.com/search?hl=en&q=people+search+software+compared&btnG=Search
Result: Search and the Dumbness of Crowds
Hmm?
This is a popular post, but I'm not sure if it helps the user, who is
not trying to compare search strategies (as this post does); instead,
the user appears to be trying to compare people search engines.
Are these good matches? While the content of the posts bears a superficial
resemblance to the text in the respective queries, the results are not
relevant to what the users were actually searching for.
The Larger Problem
The samples given above are not that important in themselves: the matches from my blog do
not always show up at the top of the search results and, although these
are real referrals, not many users will actually click on these links
in the Results page. But these examples point to a deeper underlying
issue, one that will be far from easy to fix in the general sense.
All the major search engines currently rely on the proximity of
keywords and search terms to match results. But that approach can be
misleading, causing the search engine to systematically produce incorrect results under certain conditions.
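To make the point concrete, here is a toy sketch of keyword-only matching applied to one of the mismatches above. It is emphatically not how Google scores pages; the stopword list and the `overlap_score` function are assumptions made up for this illustration.

```python
import re

# Assumed, deliberately tiny stopword list for this illustration.
STOPWORDS = {"a", "an", "the", "for", "in", "of", "and", "to"}

def keywords(text):
    """Lowercase the text, split it into words, and drop stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if w not in STOPWORDS}

def overlap_score(query, title):
    """Fraction of the query's keywords that also appear in the title."""
    q, t = keywords(query), keywords(title)
    return len(q & t) / len(q) if q else 0.0

query = "best search engine for directions"
title = "Future Directions in Search"
print(overlap_score(query, title))  # 0.5
```

Half of the query's keywords appear in the title, so a matcher that only counts words sees a decent hit, even though the post has nothing to do with driving directions.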
To demonstrate, let us take a look at three general use cases.
[Note: The examples given below are all drawn from Google. To be fair, all the
major search engines use similar algorithms, and all suffer from
similar problems. For its part, Google handles billions of queries
every day, usually very competently. As the reigning market leader,
though, Google is the obvious target - it goes with the territory!]
1. Difficulty of Finding Long Tail Results
Take Britney Spears. Given the current popularity of articles, pictures and
videos of the superstar singer, the results for practically any query
with the word "spears" in it will be loaded with matches about her -
especially if the search involves television or entertainment in any
way.
Let's say you're watching the movie Zulu
and you start wondering what the large spears that all the extras are
waving about are made of. So, you go to Google and type in "movie spears material" - this is an obviously insufficient description, as the screen shot below shows.

What happens if you expand on the query further - say: "what are movie spears made out of?" - does it help? Here's a screen shot.

The general issue here is that articles about very popular subjects
accumulate high levels of PageRank and then totally overwhelm long tail
results. This makes it very difficult for a user to find information
about unusual topics that happen to lie near these subjects.
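A toy sketch of the effect, under loud assumptions: the two document titles and their popularity weights below are invented for illustration, and multiplying keyword matches by a popularity weight is a simplification chosen for this example, not a description of how PageRank is actually combined with text relevance.

```python
import re

# Hypothetical mini-index: titles and popularity weights are invented.
DOCUMENTS = [
    {"title": "Britney Spears movie and TV news", "popularity": 1000.0},
    {"title": "How movie prop spears are made", "popularity": 3.0},
]

def keyword_matches(query, title):
    """Count the distinct query words that also appear in the title."""
    query_words = set(re.findall(r"[a-z]+", query.lower()))
    title_words = set(re.findall(r"[a-z]+", title.lower()))
    return len(query_words & title_words)

def rank(query, documents):
    """Order documents by keyword matches multiplied by popularity."""
    return sorted(
        documents,
        key=lambda doc: keyword_matches(query, doc["title"]) * doc["popularity"],
        reverse=True,
    )

ranked = rank("movie spears material", DOCUMENTS)
print([doc["title"] for doc in ranked])
```

Both titles match the same two query words ("movie" and "spears"), so the popularity weight alone decides the order and the long tail page ends up last.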
2. Keyword Ordering
Since the major search engines focus only on the proximity of keywords
without context, a user search that's similar to a popular concept gets
swamped with those results, even if the order of keywords in the query has been reversed.
For example, a tragic occurrence that's common in modern life is that
of a bicycle getting hit by a car. Much less common is the possibility
of a car getting hit by a bicycle, although it does happen. How would
you search for the latter? Try typing "car hit by bicycle" into Google; here's a screen shot of what you get. [Note the third result, which is actually relevant to this search!]
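A small sketch shows why the reversal is invisible to a matcher that only counts words. The bag-of-words representation here is a textbook simplification used for illustration, not a claim about any particular engine's internals.

```python
from collections import Counter

def bag_of_words(text):
    """Count each word in the text, ignoring the order the words appear in."""
    return Counter(text.lower().split())

rare = "car hit by bicycle"
common = "bicycle hit by car"

# As unordered word counts, the two queries are indistinguishable.
print(bag_of_words(rare) == bag_of_words(common))  # True

# As ordered word sequences, they clearly differ.
print(rare.split() == common.split())  # False
```

Any ranking built purely on the first representation has no way to tell which vehicle hit which.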

3. Keyword Relationships
Since the major search engines focus only on the keywords in the search
phrase, all sense of the relationship between the search terms is lost.
For example, users commonly change the meaning of search terms by using
negations and prepositions; it is also fairly common to look for the
less common members of a set.
This takes us into the realm of
natural language processing (NLP). Without NLP, the nuances of these
query modifications are totally invisible to the search algorithms.
For example, a query such as "Famous Science fiction writers other than Isaac Asimov" is doomed to failure. A screen shot of this search in Google is given below. Most of the returned results are about Isaac Asimov, even when the user is explicitly trying to exclude him from the list of authors found.

All of the searches shown above look like gimmicks - queries designed intentionally
to mislead Google's search algorithms. And in a sense, they are; these
specific queries can be easily fixed by tweaking the search engine.
Nevertheless, these queries do point to a real need: understanding
the meaning behind both the query and the content indexed.
Semantic Search
That's where the concept of semantic search comes in. I attended a media event earlier this year at stealth search startup Powerset (see: Powerset is Not a Google-killer!),
which showcased a live demo of their search engine, currently in
closed alpha, that highlighted solutions to exactly this type of issue.
For example, type "What was said about Jesus" into a major
search engine, and you usually get a whole list of results that consist
of the teachings of Jesus; this means that the search engine entirely
missed the concepts of passive voice and "about". The Powerset results,
on the other hand, were consistently on target (for the demo, anyway!).
In other words, when you look at just the keywords in the query, you don't
really understand what the user is looking for; by looking at them
within context, by taking into account the qualifiers, the
prepositions, the negatives, and other such nuances, you can create a semantic graph
of the query. The same case can be made for semantic parsing of the
content indexed. Put the two together, as Powerset does, and you can
get a much better feel for relevance of results.
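As a very rough sketch of what honoring a qualifier might look like, the snippet below handles exactly one pattern, the phrase "other than", using the Asimov query from earlier. This is not Powerset's approach (which I have only seen in a demo) and is nowhere near real NLP; the function names and the two result titles are hypothetical.

```python
import re

def parse_query(query):
    """Split a query into a topic and a list of excluded names.

    Assumption: only the literal phrase "other than" is recognized.
    """
    match = re.search(r"^(.*?)\s+other than\s+(.+)$", query, flags=re.IGNORECASE)
    if not match:
        return {"topic": query.strip(), "exclude": []}
    return {"topic": match.group(1).strip(), "exclude": [match.group(2).strip()]}

def filter_results(titles, parsed):
    """Drop any title that mentions an excluded name."""
    excluded = [name.lower() for name in parsed["exclude"]]
    return [t for t in titles if not any(name in t.lower() for name in excluded)]

parsed = parse_query("Famous Science fiction writers other than Isaac Asimov")
print(parsed)

# Hypothetical result titles, invented for this example.
titles = ["Isaac Asimov biography", "Notable science fiction authors"]
print(filter_results(titles, parsed))
```

Even this crude rule keeps the Asimov page out of the results, which hints at how much a real semantic parse of both query and content could add.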
What about Google? I'm sure the smart folks in Google's search-quality team
are busily working on this problem as well. I look forward to the time
when the major search engines handle long tail queries more accurately and make Search a better experience for all of us.