May 15, 2008

Yahoo! SearchMonkey - Released to Developers

The good folks from Yahoo! unveiled their new open search platform, Yahoo! SearchMonkey, at a developer launch party today at their Sunnyvale headquarters. In some ways, the SearchMonkey platform is revolutionary and a major step forward in search, allowing publishers to participate directly in improving the quality of their own information presented on the Yahoo! search results page (this is also implicitly a push for the bottom-up approach to the Semantic Web, which most industry observers have given up on in favor of a top-down approach). The platform also lets publishers and third-party developers build applications aimed at improving the search experience. Finally, and most important, if enough publishers and app developers participate in the program, it promises to improve the quality of search results for end users.

Features

At the simplest level, you can think of SearchMonkey as a community-powered set of rich information boxes (similar to the Google OneBox) that appear on the Yahoo! search results page. Publishers can provide this rich data to the Yahoo! search index in a variety of ways: through structured data feeds (RSS), through RDF or Microformat markup on web pages, or through simple page extraction. The "Information Bar" shows up underneath the main search results. The Yahoo! search team has also provided tools to enable developers to build search-based applications very simply and easily.
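To give a feel for what "Microformat markup on web pages" means in practice, here's a minimal sketch in Python that pulls a few hCard properties out of a page. This is not the SearchMonkey API; the HCardParser class, the sample markup and the choice of the three properties (fn, tel, locality) are my own assumptions for illustration.

```python
from html.parser import HTMLParser


class HCardParser(HTMLParser):
    """Collect text from elements whose class names a simple hCard property."""

    # A small, assumed subset of hCard property names.
    PROPERTIES = {"fn", "tel", "locality"}

    def __init__(self):
        super().__init__()
        self.data = {}
        self._current = None  # property whose text we expect to read next

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in self.PROPERTIES:
                self._current = name

    def handle_data(self, text):
        # Store the first non-blank text that follows a property tag.
        if self._current and text.strip():
            self.data[self._current] = text.strip()
            self._current = None


# Hypothetical publisher markup, invented for this example.
SAMPLE = """
<div class="vcard">
  <span class="fn">Example Pizza</span>
  <span class="tel">555-0100</span>
  <span class="locality">Sunnyvale</span>
</div>
"""

parser = HCardParser()
parser.feed(SAMPLE)
print(parser.data)
```

The point is simply that once a publisher marks up a page this way, the structured fields can be lifted out mechanically instead of being guessed at from free text.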


May 11, 2008

Powerset Launches Wikipedia Search

Semantic search engine Powerset, which we've written about here before, has just launched its initial release. The current release is limited to indexing Wikipedia content, but it provides a great showcase for their technology and user experience.

For example, my search for "Alexander the Great" provided the following results page:


May 07, 2008

Cognition Technologies recognized by KMWorld as one of "100 that matter"

Cognition Technologies, which focuses on Semantic natural language processing technology, was named by KMWorld as one of the top 100 Companies That Matter in Knowledge Management for 2008.

Says Cognition CEO Scott Jarus:

One of the biggest barriers to building a natural language understanding system is to build the semantic map and the dictionary with details of the syntactic behavior of words (i.e. how words behave within context).  Cognition's team has spent more than 20 years building this capability into Cognition’s Semantic NLP for the English language ...  and our technology is commercially available today!

Semantic search and NLP technologies seem to have arrived - they are generating a lot of buzz lately. In addition to mainstays Hakia and Powerset, there is a spate of new entries, including Cognition, BooRah and eeggi. We will be reviewing some of these new alternate search engines on this blog in the near future.

Congratulations, Scott and the Cognition team!



April 29, 2008

Thoughts about Alternative Search Engines Day 2008

I was at the Alternative Search Engines Day event in San Francisco last week. Organized by Charles Knight of the Alt Search Engines blog (and friends), it brought together key people from over 40 alternative search engines. It was an amazing crowd, full of interesting and bright people, and the overall energy was incredible!

At the keynote, Charles gave a pitch for bringing ASEs together that was very well received. He showed us some examples of what a unified user interface that combined multiple search engines would look like. I contributed a tiny bit (expanding on the idea that complementary ASEs could band together to provide Federated Searches for enhanced traffic and usability, and listing a few ways for the Alts to cooperate even while competing).


April 20, 2008

Cooperation of Alternate Search Engines: A Manifesto

(This post is inspired by my discussions with my friend, Charles Knight of AltSearchEngines)

Background

I'll be at the Alternative Search Engines Day tomorrow, a unique event in San Francisco put together by Charles and the AltSearchEngines team. The event is sponsored by SeeqPod, UpTake, Matchpoint, HealthPricer, GoPubMed and Blogdimension. (Unfortunately, it's not open to the general public.) If you're part of an Alternative Search Engine, I hope to see you there!

As I was getting ready for the event, it got me thinking about ASEs and how they can work together.

The Case for the Alts

I love the ASEs - Alts rock! Without them, there would be little innovation in Search, no new frontiers to be explored.

The Alts are the ones that keep pushing the envelope with new directions in search technology, whether it's algorithms, user interface, social search or something else.  Although Google has some fine technology and is synonymous with search, I firmly believe that we're still at Search 1.0, and have a long way to go. Because of all this competition from the Alts, and the resulting innovation, web search continues to improve.


March 31, 2008

Could You Survive For A Day - Without Google?

Can you spend a whole day without using Google? - that's the challenge issued by my friend Charles Knight over on the Alt Search Engines blog (see also ReadWriteWeb's coverage). To help you out, he's going to publish the latest version of his popular Top 100 Alternative Search Engines list tomorrow.

I think this is a great idea! We have all become addicted to the power (and limitations) of Google search - just like television before the age of the Internet, we cannot imagine life without it. And yet, as Charles' list shows, there are plenty of alternative search engines out there, innovating Search in a variety of different ways.

Personally, I'm going to use this opportunity to learn the latest features of Quintura, an innovative search engine we've covered before on this blog (here and here). Quintura has jumped on board this idea by creating a special destination page for discovering the best hoaxes, pranks, jokes and tricks for April Fools' Day. [Rest assured, this is no joke!]

So how about you - can you do it? Why not give it a shot and try out an alternative search engine? Or two, or five, or all hundred on Charles' list? Can you last a day, an hour, even five minutes? Try it and the results may surprise you!

January 29, 2008

Zvents makes Local Search pop!

There is a class of web search engines that can prove even more useful than Google within a certain context. I'm talking, of course, about Vertical Search engines - the writer and tech strategist Sramana Mitra considers them Google's Achilles heel and Profy.com's Cyndy Aleo-Carreira seems to agree. This blog also has long held the position that vertical search represents a powerful mechanism to find information on the web, and is a key category to watch in the search wars of the future. [see: The rise of Vertical Search Engines from Aug 2006].

Another way of achieving a similar focus, in order to improve the relevance of search results, is by segmenting by location rather than by industry vertical - i.e. create a hyperlocal search engine that limits its search results to a given geographical area.

One such alternate search engine is Zvents, which is relentlessly focused on local information, of any sort. This company, which has been around since early 2005, has just introduced an advanced feature called Federated Local search - basically, its own version of Universal Search (recall that Google introduced its Universal Search feature with much fanfare last May).

Federated Local Search: Multi-Dimensional Results for Local Information

What does Universal Search mean, for a local search engine? Initially this was not very clear to me; an email discussion with Paul O'Brien, Director of Marketing at Zvents, inspired me to draw the following diagram:



The basic idea is to enable the user to implement a general-purpose search within a local context. This allows the user to find local information about a given topic, across many different dimensions. For example, a sports fan living in San Jose, CA who tries a local search for the term "hockey", would get the following different types of results:

  • Upcoming games for the San Jose Sharks, the local hockey team
  • The location of Roosevelt Park Roller Hockey Rink
  • The description and link for a local "Hockey Night" event
  • Results about relevant personalities (what Zvents calls "performers")
  • And other related links ...
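The idea in the list above can be sketched in a few lines of Python. This is a toy under stated assumptions, not how Zvents implements it: the Listing class, the LISTINGS data and the federated_local_search function are all hypothetical, and each "kind" stands in for what would be a separate index in a real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Listing:
    kind: str   # the result dimension, e.g. "event" or "venue"
    title: str
    city: str


# Hypothetical data; the first three entries echo the examples in the post.
LISTINGS = [
    Listing("event", "San Jose Sharks hockey game", "San Jose"),
    Listing("venue", "Roosevelt Park Roller Hockey Rink", "San Jose"),
    Listing("event", "Hockey Night", "San Jose"),
    Listing("event", "Hockey Night", "Boston"),          # same topic, wrong city
    Listing("venue", "Downtown Jazz Club", "San Jose"),  # right city, wrong topic
]


def federated_local_search(term, city, listings=LISTINGS):
    """Group the listings that mention `term` and are located in `city` by kind."""
    grouped = {}
    for item in listings:
        if item.city == city and term.lower() in item.title.lower():
            grouped.setdefault(item.kind, []).append(item.title)
    return grouped


print(federated_local_search("hockey", "San Jose"))
```

The location filter is applied before anything else, so a "Hockey Night" in another city never competes with the local one, while the grouping by kind gives the multi-dimensional result page described above.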

Zvents has already partially implemented this vision, although some of the lower-ranked results could provide a better match. Hopefully these will improve in the future as the search index grows and the algorithm improves. A screen shot of this local Hockey search in Zvents is given below.



Similarly, here's a search for the term "Web 2.0" for Cupertino, CA:



Outcome: Relevance

The big advantage of this type of search, over a general-purpose Google or Yahoo! search, is that the user can obtain the benefits of a broad cross-section of results, while still constraining the search to a limited geographical area.

This is not a significant issue in highly developed, urban, technologically advanced areas like Silicon Valley, Boston or New York; but it could one day make a big difference for someone living in David Letterman's "home office" of Wahoo, NE, or even more important, someone trying to find the Boston Public School located in Boston, Ontario - as we've seen before, highly popular keywords tend to swamp nearby long-tail keywords in the search results for major search engines.

From a business model perspective, hyperlocal searches tend to provide highly qualified prospects for local merchants, so I would guess that this type of search is very easily monetizable in the long run.

From a user interface point-of-view, the NLP-like implementation of time period for the search engine ("when: tonight, this weekend, ...") is a nice touch; I tried different possibilities ("next month"), and it seemed to work just fine.
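I don't know how Zvents resolves these phrases internally, but the general technique is easy to sketch. The resolve_when function below is a hypothetical helper of my own that maps a handful of phrases to an inclusive date range; the set of phrases and the definition of "this weekend" as Saturday through Sunday are assumptions.

```python
from datetime import date, timedelta


def resolve_when(phrase, today):
    """Map a phrase to an inclusive (start, end) date range; None if unknown."""
    phrase = phrase.strip().lower()
    if phrase in ("today", "tonight"):
        return today, today
    if phrase == "tomorrow":
        day = today + timedelta(days=1)
        return day, day
    if phrase == "this weekend":
        # date.weekday(): Monday is 0, Saturday is 5, Sunday is 6.
        if today.weekday() == 6:  # already Sunday: only today is left
            return today, today
        saturday = today + timedelta(days=5 - today.weekday())
        return saturday, saturday + timedelta(days=1)
    if phrase == "next month":
        year = today.year + (1 if today.month == 12 else 0)
        month = today.month % 12 + 1
        start = date(year, month, 1)
        # The day before the first of the following month is the month's end.
        following = date(year + (1 if month == 12 else 0), month % 12 + 1, 1)
        return start, following - timedelta(days=1)
    return None


print(resolve_when("next month", date(2008, 1, 29)))
```

Returning None for anything unrecognized lets the caller fall back to a plain keyword search instead of guessing at a date.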

On a more technical note, Zvents has been making waves with the release of Hypertable, its open-source Bigtable clone written in C++.

Going forward, it will be interesting to see how Zvents scales to additional locations, and to additional dimensions within each locality. Will it make inroads into the market share for any of the major search engines, or into that of other locally-focused web sites like topix.com and craigslist?



January 27, 2008

Quintura Launches Site Search Widget

Alternative search engine Quintura, which I've mentioned before on this blog, has launched its site search widget. This widget allows site publishers to provide users with a specialized search limited to that specific site; it joins earlier offerings from Google, Yahoo!, Rollyo and Eurekster swiki in this space.

This blog was an early user of this widget. You can see a customized, Quintura-generated mini-tag cloud in the earlier post; a full-size tag cloud is also available. The widget is hosted by Quintura, so installation was a snap: once the site was indexed, all I had to do was to embed the widget code into my blog pages and provide some styling control.

The biggest benefit of using the Quintura solution, as I've said before, is the dynamic tag cloud that allows the user to navigate the search space; initial feedback from our readers here has been positive, but not enthusiastic.

The real benefits to both users and publishers will come when Quintura search results prove to be better than equivalent results from a mainstream search engine solution, such as Google; as long as the Google site search results are good enough, it will be hard for the Quintura widget to make significant inroads into the market share of the big-G juggernaut.

This widget release is currently in private beta; an invite for this beta is available over on ReadWriteWeb.



January 09, 2008

Deconstructing real Google searches - why Powerset matters

I was looking at the log files for my blog today, as I regularly do, and I was suddenly struck by the variety of search queries in Google for which users were getting referred to my posts. I write often about the different flavors of search - including vertical search, parametric search, semantic search, and so on - so users with queries about Search often land here. But do they always find what they're looking for?

Some Real-life Search Results

Let us examine some of the actual Google queries - in the form of referring URLs - that led users to my blog. In most cases, Google did a fine job of matching the content to the query; in some cases, it was a somewhat random match at best; finally, in a few cases, the Google search algorithms are clearly getting confused. It is this third case that is the most interesting.
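If you want to try this on your own logs, the search phrase can be pulled out of a referring URL with Python's standard library. This is a minimal sketch: query_from_referrer is a name I made up, and it assumes the phrase lives in the q parameter, as it does in the Google URLs listed in this post.

```python
from urllib.parse import urlsplit, parse_qs


def query_from_referrer(referrer):
    """Return the search phrase from a Google-style referring URL, or None."""
    # URLs copied from logs or posts may contain stray spaces from line
    # wrapping; remove all whitespace before parsing.
    cleaned = "".join(referrer.split())
    params = parse_qs(urlsplit(cleaned).query)
    values = params.get("q")
    return values[0] if values else None


print(query_from_referrer(
    "http://www.google.com/search?hl=en& q=search+technology+exits"
))
```

parse_qs also takes care of decoding, so the plus signs in the URL come back as spaces.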

The Good

In many cases, the match was quite straightforward and very relevant. Some examples are given below.
1.

Query: http://www.google.fr/search?q=Guru+Avinash+Kaushik& ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:fr:official& client=firefox-a
Result: A conversation with Avinash Kaushik, Web Analytics Guru

Well, can't argue with that!

2.

Query: http://www.google.cn/search?sourceid=navclient&aq=t& hl=zh-CN&ie=UTF-8&rlz=1T4XNLA_zh-CNCN246CN247& q=vertical+search
Result: The rise of Vertical Search Engines (VSEs)

Query: http://www.google.com/search? q=wikipedia+to+try+and+compete+with+google& ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official& client=firefox-a
Result: Wikipedia Search to compete with Google

Again, can't argue with those.

3.

Query:  http://www.google.com/search?hl=en& q=search+technology+exits
Result: So You've Built an Alternative Search Engine - Now What?

This is actually pretty awesome: the algorithm has figured out "search technology" and "exits"; in fact, this post does talk about exit strategies for search engines, so it's a great match.

The Bad

Some search queries are so vague that the matches you get are bound to be somewhat random. I don't blame Google for the following matches:

4.

Query:  http://www.google.com/search?hl=en& q=conceptual+architecture
Result: A Conceptual Architecture for Search

Is the search string too vague? Although technically this post matches the search query, I'm guessing that this is not what the user intended to look for.

5.

Query: http://www.google.com/search?hl=en&safe=off& client=firefox-a&rls=org.mozilla%3Aen-US%3Aofficial& hs=pGP&q=disruptive+technologies+blog& btnG=Search
Result: Disruptive technologies for 2007

While the words match, and possibly this may satisfy the user, I get the sense that the user was looking for a blog dedicated to discussing disruptive technologies, not a single post. But who knows? Again, too vague!

In the future, I wonder how soon Search technology will progress to the point where the UI will automatically ask the user for more information to qualify search terms that are too general or vague. A little while ago, I envisioned a similar scenario (Vertical Search, with authority) when taking a look at the search engine MetaMojo, which has taken some steps in this direction.

The Ugly

In a few cases, though, the proximity of certain keywords fools the search algorithms. Consider the following matches:

6.

Query: http://www.google.com/search? q=best+search+engine+for+directions&ie=utf-8& oe=utf-8&aq=t&rls=org.mozilla:en-US:official& client=firefox-a
Result: Future Directions in Search

A post about "future directions in Search" is not a post about "search engines for directions", although the text itself is undoubtedly a close match.

7.

Query:  http://www.google.com/search?hl=en& q=people+search+software+compared&btnG=Search
Result: Search and the Dumbness of Crowds

Hmm? This is a popular post, but I'm not sure if it helps the user, who is not trying to compare search strategies (as this post does); instead, the user appears to be trying to compare people search engines.

Are these good matches? While the content of the posts bears a superficial resemblance to the text in the respective queries, the results are not relevant to the requested user searches.

The Larger Problem

The samples given above are not that important; the matches from my blog do not always show up at the top of the search results and although these are real referrals, not many users will actually click on these links in the Results page. But these examples point to a deeper underlying issue, one that will be far from easy to fix in the general sense.

All the major search engines currently rely on the proximity of keywords and search terms to match results. But that approach can be misleading, causing the search engine to systematically produce incorrect results under certain conditions.

To demonstrate, let us take a look at three general use cases.

[Note: The examples given below are all drawn from Google. To be fair, all the major search engines use similar algorithms, and all suffer from similar problems. For its part, Google handles billions of queries every day, usually very competently. As the reigning market leader, though, Google is the obvious target - it goes with the territory!]

1. Difficulty of Finding Long Tail Results

Take Britney Spears. Given the current popularity of articles, pictures and videos of the superstar singer, the results for practically any query with the word "spears" in it will be loaded with matches about her - especially if the search involves television or entertainment in any way.

Let's say you're watching the movie Zulu and you start wondering what those large spears that all the extras are waving about are made of. So you go to Google and type in "movie spears material" - an obviously insufficient description, as the screen shot below shows.




What happens if you expand on the query further - say: "what are movie spears made out of?" - does it help? Here's a screen shot.




The general issue here is that articles about very popular subjects accumulate high levels of PageRank and then totally overwhelm long tail results. This makes it very difficult for a user to find information about unusual topics that happen to lie near these subjects.
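Here's a toy model of that swamping effect in Python. To be clear, this is not how PageRank or Google's ranking works; the score function, the two pages and their popularity weights are all invented for illustration. It only shows what happens when a popularity weight multiplies a simple keyword-overlap score.

```python
def score(query, page_text, popularity):
    """Toy relevance: share of query words found on the page, times popularity."""
    query_words = set(query.lower().split())
    page_words = set(page_text.lower().split())
    return len(query_words & page_words) / len(query_words) * popularity


# (page text, hypothetical popularity weight)
PAGES = [
    ("britney spears movie news", 100.0),
    ("zulu movie spears material discussion", 1.0),
]

QUERY = "movie spears material"
ranked = sorted(PAGES, key=lambda page: score(QUERY, page[0], page[1]), reverse=True)

for text, popularity in ranked:
    print(round(score(QUERY, text, popularity), 1), text)
```

The second page matches every word of the query and the first matches only two of the three, yet the popular page still wins by a wide margin.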

2. Keyword Ordering

Since the major search engines focus only on the proximity of keywords without context, a user search that's similar to a popular concept gets swamped with those results, even if the order of keywords in the query has been reversed. For example, a tragic occurrence that's common in modern life is that of a bicycle getting hit by a car. Much less common is the possibility of a car getting hit by a bicycle, although it does happen. How would you search for the latter? Try typing "car hit by bicycle" into Google; here's a screen shot of what you get.  [Note the third result, which is actually relevant to this search!]
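The effect is easy to reproduce with a bag-of-words model, which is a deliberately crude stand-in of my own choosing and not a description of any real engine's internals. The bag_of_words and ordered_pairs helpers below are hypothetical names.

```python
from collections import Counter


def bag_of_words(text):
    """Lower-case the text and count its words, discarding their order."""
    return Counter(text.lower().split())


def ordered_pairs(text):
    """Keep adjacent word pairs, which preserves who did what to whom."""
    words = text.lower().split()
    return list(zip(words, words[1:]))


a = "car hit by bicycle"
b = "bicycle hit by car"

print(bag_of_words(a) == bag_of_words(b))    # True: the two queries look identical
print(ordered_pairs(a) == ordered_pairs(b))  # False: word order tells them apart
```

Once word order is thrown away, the rare event and the common one are literally the same query, so the more popular reading wins.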



3. Keyword Relationships

Since the major search engines focus only on the keywords in the search phrase, all sense of the relationship between the search terms is lost. For example, users commonly change the meaning of search terms by using negations and prepositions; it is also fairly common to look for the less common members of a set.

This takes us into the realm of natural language processing (NLP). Without NLP, the nuances of these query modifications are totally invisible to the search algorithms.

For example, a query such as "Famous Science fiction writers other than Isaac Asimov" is doomed to failure. A screen shot of this search in Google is given below. Most of the returned results are about Isaac Asimov, even when the user is explicitly trying to exclude him from the list of authors found.
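A small Python sketch shows both the failure and one very narrow fix. Everything here is an assumption for illustration: the three-document index, the naive keyword_rank function, and the split_exclusion helper, which understands only the literal phrase "other than". Real natural language processing is of course far more involved.

```python
import re

# Hypothetical mini index.
DOCS = [
    "Isaac Asimov was a famous science fiction writer",
    "Arthur C. Clarke was a famous science fiction writer",
    "Ursula K. Le Guin was a famous science fiction writer",
]

QUERY = "Famous Science fiction writers other than Isaac Asimov"


def words(text):
    return set(re.findall(r"[a-z]+", text.lower()))


def keyword_rank(query, docs):
    """Rank documents by how many query words they contain."""
    query_words = words(query)
    return sorted(docs, key=lambda doc: len(query_words & words(doc)), reverse=True)


def split_exclusion(query):
    """Split 'A other than B' into (A, B); B is None when nothing is excluded."""
    match = re.search(r"\s+other than\s+", query, flags=re.IGNORECASE)
    if not match:
        return query, None
    return query[:match.start()], query[match.end():]


def exclusion_aware_rank(query, docs):
    """Drop documents mentioning the excluded phrase, then rank the rest."""
    wanted, excluded = split_exclusion(query)
    kept = [d for d in docs if excluded is None or excluded.lower() not in d.lower()]
    return keyword_rank(wanted, kept)


print(keyword_rank(QUERY, DOCS)[0])       # the Asimov page ranks first
print(exclusion_aware_rank(QUERY, DOCS))  # the Asimov page is gone
```

With plain keyword matching, naming the author you want to exclude actually boosts his pages, because "Isaac" and "Asimov" count as two extra matching words.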



All of the searches shown above look like gimmicks - queries designed intentionally to mislead Google's search algorithms. And in a sense, they are; these specific queries can be easily fixed by tweaking the search engine. Nevertheless, these queries do point to a real need: the value of understanding the meaning behind both the query and the content indexed.

Semantic Search

That's where the concept of semantic search comes in. I attended a media event earlier this year at stealth search startup Powerset (see: Powerset is Not a Google-killer!) which showcased a live demo of their search engine, currently in closed alpha, that highlighted solutions to exactly this type of issue.

For example, type "What was said about Jesus" into a major search engine, and you usually get a whole list of results that consist of the teachings of Jesus; this means that the search engine entirely missed the concepts of passive voice and "about". The Powerset results, on the other hand, were consistently on target (for the demo, anyway!).

In other words, when you look at just the keywords in the query, you don't really understand what the user is looking for; by looking at them within context, by taking into account the qualifiers, the prepositions, the negatives, and other such nuances, you can create a semantic graph of the query. The same case can be made for semantic parsing of the content indexed. Put the two together, as Powerset does, and you can get a much better feel for relevance of results.

What about Google? I'm sure the smart folks in Google's search-quality team are busily working on this problem as well. I look forward to the time when the major search engines handle long tail queries more accurately and make Search a better experience for all of us.



December 23, 2007

Web Poll Results: What is the Most Important Component of a Search Engine?

Our last web poll asked readers to vote on what they considered to be the most important component of a Search Engine - an indication of the areas a small search startup should focus on to help capture market share away from the major search engines.

27 readers voted (thank you!). The poll results are shown below.



These results are interesting because I had expected a higher percentage of votes for the Results Visualization choice, followed by the Algorithm choice, but the votes did not match my expectations. Part of the reason could be that readers of this blog are predisposed to have a stronger interest in the algorithms and strategies used by various search engines than in their UI paradigms.

As for the size of the Content Index, that's a metric that is slowly declining in importance. There was a time when the major search engines would fall all over themselves in trying to top each other in terms of the amount of data indexed; but as the content on the web explodes and grows progressively richer, it simply does not matter as much, and that is reflected by the votes.

As expected, the Query Spec choice got completely ignored. The search engine input spec could be so much richer than a minimal number of words or a single phrase. However, I fear that we're condemned to using keyword-ese for specifying our needs to the Search Engine for the long-term future, which is a pity; like the QWERTY keyboard, it may stick with us well after its usefulness has waned.


