May 15, 2008

Yahoo! SearchMonkey - Released to Developers

The good folks from Yahoo! unveiled their new open search platform, Yahoo! SearchMonkey, at a developer launch party today at their Sunnyvale headquarters. In some ways, the SearchMonkey platform is revolutionary and a major step forward in search, allowing publishers to participate directly in improving the quality of their own information presented on the Yahoo! search results page (this is also implicitly a push for the bottom-up approach to the Semantic Web, which most industry observers have given up on in favor of a top-down approach). The platform also lets publishers and third-party developers build applications aimed at improving the search experience. Finally, and most important, if enough publishers and app developers participate in the program, it promises to improve the quality of search results for end users.

Features

At the simplest level, you can think of SearchMonkey as a community-powered set of rich information boxes (similar to the Google OneBox) that appear on the Yahoo! search results page. Publishers can provide this rich data to the Yahoo! search index in a variety of ways: through structured data feeds (RSS), through RDF or Microformat markup on web pages, or through simple page extraction. The "Information Bar" shows up underneath the main search results. The Yahoo! search team has also provided tools to enable developers to build search-based applications very simply and easily.

Continue reading "Yahoo! SearchMonkey - Released to Developers" »

May 11, 2008

Powerset Launches Wikipedia Search

Semantic search engine Powerset, which we've written about here before, has just launched its initial release. The current release is limited to indexing Wikipedia content, but it provides a great showcase for their technology and user experience.

For example, my search for "Alexander the Great" provided the following results page:

Continue reading "Powerset Launches Wikipedia Search" »

April 20, 2008

Cooperation of Alternate Search Engines: A Manifesto

(This post is inspired by my discussions with my friend, Charles Knight of AltSearchEngines.)

Background

I'll be at the Alternative Search Engines Day tomorrow, a unique event in San Francisco put together by Charles and the AltSearchEngines team. The event is sponsored by SeeqPod, UpTake, Matchpoint, HealthPricer, GoPubMed and Blogdimension. (Unfortunately, it's not open to the general public.) If you're part of an Alternative Search Engine, I hope to see you there!

As I was getting ready for the event, it got me thinking about ASEs and how they can work together.

The Case for the Alts

I love the ASEs - Alts rock! Without them, there would be little innovation in Search, no new frontiers to be explored.

The Alts are the ones that keep pushing the envelope with new directions in search technology, whether it's algorithms, user interface, social search or something else.  Although Google has some fine technology and is synonymous with search, I firmly believe that we're still at Search 1.0, and have a long way to go. Because of all this competition from the Alts, and the resulting innovation, web search continues to improve.

Continue reading "Cooperation of Alternate Search Engines: A Manifesto" »

April 01, 2008

Introducing: Gmail Custom Time!

In the grand tradition of Project Teaspoon, Google unveiled yet another profound and significant product today: Gmail Custom Time. This exciting new product is sure to be a life-saver for many a forgetful techie.

To quote Google's own marketing content:

Ever wish you could go back in time and send that crucial email that could have changed everything -- if only it hadn't slipped your mind? Gmail can now help you with those missed deadlines, missed birthdays and missed opportunities.


If you want to see what it looks like, here's the image:


Gmail Custom Time


Remember: you heard it here first! :-)



March 31, 2008

Could You Survive For A Day - Without Google?

Can you spend a whole day without using Google? - that's the challenge issued by my friend Charles Knight over on the Alt Search Engines blog (see also ReadWriteWeb's coverage). To help you out, he's going to publish the latest version of his popular Top 100 Alternative Search Engines list tomorrow.

I think this is a great idea! We have all become addicted to the power (and limitations) of Google search - just like television before the age of the Internet, we cannot imagine life without it. And yet, as Charles' list shows, there are plenty of alternative search engines out there, innovating Search in a variety of different ways.

Personally, I'm going to use this opportunity to learn the latest features of Quintura, an innovative search engine we've covered before on this blog (here and here). Quintura has jumped on board this idea by creating a special destination page for discovering the best hoaxes, pranks, jokes and tricks for April Fools' Day. [Rest assured, this is no joke!]

So how about you - can you do it? Why not give it a shot and try out an alternative search engine? Or two, or five, or all hundred on Charles' list? Can you last a day, an hour, even five minutes? Try it and the results may surprise you!

February 17, 2008

Social Data: Observations from "Search & The Social Graph" Event

Dave McClure moderated an event on Search & The Social Graph at the Yahoo! campus this week, organized by the Search SIG of the Software Development Forum. With the meteoric rise of Facebook and the heightened interest in leveraging the social graph - both Google and Yahoo! have launched new APIs and OpenSocial is gaining momentum - this discussion was timely and attendance was strong.

The panelists represented some of the most interesting players in this space:

  • Kevin Marks from Google
  • Aditya Agarwal from Facebook
  • Kent Brewster of Yahoo!
  • Eve Phillips, CEO of Chirp

It turned out to be an interesting event, with lots of good discussion about the implications of portability, privacy, utility and monetization of social data. No stranger to the social data space, moderator McClure did an outstanding job of keeping things focused and the discussion lively; he was clearly knowledgeable and well-prepared, launching into a series of leading questions that moved the conversation forward.

Key Observations

By grouping together related comments, I've distilled the discussion at this event into the following topics:

1. Relevance of Search Results

- With the explosion of self-publishing and user-generated content on the web, the type of data getting created on the web is changing, and the classic search algorithms are becoming less effective.
- Users are increasingly interested in what their friends and peers are doing online.
- By using a social graph to filter out results during a specific search, you can boost the relevance of search results.
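The last point can be pictured with a minimal sketch. It assumes each result already carries a base relevance score and the set of users who endorsed (shared or bookmarked) it; the function name, the `boost` parameter and the scoring rule are illustrative assumptions, not something the panelists described.

```python
def rerank_with_social_graph(results, friends, boost=0.5):
    """Re-rank results, boosting those endorsed by the searcher's friends.

    results: list of dicts with 'url', 'score' and 'endorsed_by' (a set of user ids)
    friends: set of user ids in the searcher's social graph (hypothetical input)
    """
    def social_score(result):
        # Count only endorsements that come from the searcher's own friends.
        overlap = len(result["endorsed_by"] & friends)
        return result["score"] + boost * overlap

    return sorted(results, key=social_score, reverse=True)


results = [
    {"url": "a.example", "score": 0.9, "endorsed_by": set()},
    {"url": "b.example", "score": 0.7, "endorsed_by": {"alice", "bob"}},
    {"url": "c.example", "score": 0.8, "endorsed_by": {"carol"}},
]
ranked = rerank_with_social_graph(results, friends={"alice", "bob"})
print([r["url"] for r in ranked])
```

Here the result endorsed by two friends moves to the top even though its base score is the lowest; an endorsement from someone outside the graph changes nothing.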

2. Monetization

- It is no longer uncommon for a person to become a media source, using tools such as twitter, blogs and RSS feeds; but this is hard to monetize. A referral model works better in this case than advertising.
- Brand advertising is still big, even for social search, but it works differently than for targeted search
- Online brand advertising will move into more interactive experiences in the future
- The key question is: Does membership in a social group signal an intention that can be targeted by advertisers? The panelists felt that, on balance, it did not
- For a more concrete example: Google's directed search is very monetizable; Facebook has a lot of social data, but user behavior is not very monetizable

3. Privacy

- There is a clear difference between a publicly-proclaimed graph, such as the friends on Facebook, and a private list, such as Email contacts; application developers will ignore this distinction at their peril
- Yahoo!'s Brewster said it best: "There should never be a privacy surprise for the user!"
- Applications should make it clear to users if they are making data public or private; e.g. Flickr is three-valued in this regard

4. Interaction Levels

- From a monetization perspective, all "friends" are not created equal; some connections in the social graph are stronger than others
- The smallest inner set of friends is the most valuable; the first 25 people have 80% of the value
- The viral rate of promotion in Facebook is incredible
- If users can annotate connections, they can more fully express their network graph
- You can infer relationships from user behavior, such as sites visited and click-throughs
- The most important part of social data is the connections, followed by the profile; eventually, it gives you the ability to answer the question: "Who should you go to, to answer this question?"

5. OpenSocial

- OpenSocial allows application developers to write one application, and then take it to where the users are on diverse other social networks
- The vision: take some of the good parts of Facebook and bring those to a lot of people
- This allows any application to spread through the social graph

6. Social Email

- Email networks have a lot of connection data, which has social data buried in it
- These connections can either be one-way or two-way; the difference signals intent on the part of the user
- Google's Marks made an interesting point: a person's email address and personal URL are opposites - with the former, you can communicate with that person; with the latter, the person communicates with you

Facebook

Facebook's Agarwal did a great job of articulating the company's approach to some of these issues. His contributions to the discussion were somewhat Facebook-centric; but given the strong community interest in Facebook lately, this only added to the value of the panel.

In discussing the value of social data for search, Agarwal compared the issues of selecting for relevance among a large number of results for a targeted search, with those of producing Facebook's news feed, which must also present a large amount of data to the user in a format that's easy to consume.

In terms of privacy, Facebook wants to allow users to annotate the social graph, so that they can fully express their network. This will allow users to separate their strong connections from casual friends. The size of a user's graph is another dimension to be considered.

As for data portability, Facebook currently has no plans to implement features that enable it. Agarwal clarified that although philosophically they support data portability initiatives, they have not determined it to be the best use of resources at this time.

Finally, although Agarwal did not acknowledge this directly, the panelists agreed that the Facebook-type social network data and searches are far less monetizable than directly targeted activities that display clear intent, such as a Google search.

Chirp

This was the first time I saw a demo of Chirp. Eve Phillips, Chirp's CEO, gave a demo of chirpscreen, an interactive screen saver that displays content from your social network, such as pictures from Flickr and status messages from Facebook. On the whole, the audience loved it - a series of photos of her friends kept popping up on the screen - but there were some concerns about being able to control what gets shown. According to Phillips, Chirp is planning to introduce new features soon that will allow users to set preferences of what content is displayed, from which sources, and so on.

Open Questions

McClure asked the panelists some incisive questions, which deserve to be listed in their own right; I hope these lead to a wider discussion about social data and related topics:

  • Is Social Search - revolutionary, or evolutionary?
  • Which benefits more from social data: targeted search or discovery?
  • How well does social search monetize?
  • How should we use the social data that's automatically present in Email?
  • If Facebook and other networks encourage lightweight friendships, does it obscure the real social graph?


January 29, 2008

Zvents makes Local Search pop!

There is a class of web search engines that can prove even more useful than Google within a certain context. I'm talking, of course, about Vertical Search engines - the writer and tech strategist Sramana Mitra considers them Google's Achilles heel and Profy.com's Cyndy Aleo-Carreira seems to agree. This blog also has long held the position that vertical search represents a powerful mechanism to find information on the web, and is a key category to watch in the search wars of the future. [see: The rise of Vertical Search Engines from Aug 2006].

Another way of achieving a similar focus, in order to improve the relevance of search results, is by segmenting by location rather than by industry vertical - i.e. create a hyperlocal search engine that limits its search results to a given geographical area.

One such alternate search engine is Zvents, which is relentlessly focused on local information, of any sort. This company, which has been around since early 2005, has just introduced an advanced feature called Federated Local search - basically, its own version of Universal Search (recall that Google introduced its Universal Search feature with much fanfare last May).

Federated Local Search: Multi-Dimensional Results for Local Information

What does Universal Search mean, for a local search engine? Initially this was not very clear to me; an email discussion with Paul O'Brien, Director of Marketing at Zvents, inspired me to draw the following diagram:



The basic idea is to enable the user to run a general-purpose search within a local context. This allows the user to find local information about a given topic, across many different dimensions. For example, a sports fan living in San Jose, CA who tries a local search for the term "hockey" would get the following different types of results:

  • Upcoming games for the San Jose Sharks, the local hockey team
  • The location of Roosevelt Park Roller Hockey Rink
  • The description and link for a local "Hockey Night" event
  • Results about relevant personalities (what Zvents calls "performers")
  • And other related links ...
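To make the idea concrete, here is a minimal sketch of a federated lookup. It assumes one small in-memory index per dimension (events, venues, performers); the index contents, the `LocalResult` type and the function name are hypothetical, and nothing here reflects how Zvents actually implements the feature.

```python
from dataclasses import dataclass


@dataclass
class LocalResult:
    kind: str   # dimension, e.g. "event", "venue", "performer"
    title: str
    city: str


# Hypothetical per-dimension indexes; a real system would query separate backends.
INDEXES = {
    "event": [
        LocalResult("event", "San Jose Sharks hockey home game", "San Jose"),
        LocalResult("event", "Hockey Night", "San Jose"),
        LocalResult("event", "Hockey clinic", "Boston"),
    ],
    "venue": [
        LocalResult("venue", "Roosevelt Park Roller Hockey Rink", "San Jose"),
    ],
    "performer": [
        LocalResult("performer", "San Jose Sharks (hockey team)", "San Jose"),
    ],
}


def federated_local_search(term, city):
    """Run the same term against every index, keeping only results in `city`."""
    term = term.lower()
    grouped = {}
    for kind, index in INDEXES.items():
        hits = [r for r in index if term in r.title.lower() and r.city == city]
        if hits:
            grouped[kind] = hits
    return grouped


for kind, hits in federated_local_search("hockey", "San Jose").items():
    print(kind, [h.title for h in hits])
```

The single query fans out across all dimensions, and the location constraint is applied to each one before the groups are merged into one results page.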

Zvents has already partially implemented this vision, although some of the lower-ranked results could provide a better match. Hopefully these will improve in the future as the search index grows and the algorithm improves. A screen shot of this local Hockey search in Zvents is given below.



Similarly, here's a search for the term "Web 2.0" for Cupertino, CA:



Outcome: Relevance

The big advantage of this type of search, over a general-purpose Google or Yahoo! search, is that the user can obtain the benefits of a broad cross-section of results, while still constraining the search to a limited geographical area.

This is not a significant issue in highly developed, urban, technologically advanced areas like Silicon Valley, Boston or New York; but it could one day make a big difference for someone living in David Letterman's "home office" of Wahoo, NE, or even more important, someone trying to find the Boston Public School located in Boston, Ontario - as we've seen before, highly popular keywords tend to swamp nearby long-tail keywords in the search results for major search engines.

From a business model perspective, hyperlocal searches tend to provide highly qualified prospects for local merchants, so I would guess that this type of search is very easily monetizable in the long run.

From a user interface point-of-view, the NLP-like implementation of time period for the search engine ("when: tonight, this weekend, ...") is a nice touch; I tried different possibilities ("next month"), and it seemed to work just fine.
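A rough sketch of how such "when" phrases might be resolved into date ranges follows. The phrase list, the function name `resolve_when` and the weekend rule are my own assumptions for illustration; I have no knowledge of how Zvents handles this internally.

```python
from datetime import date, timedelta


def resolve_when(phrase, today):
    """Map a 'when' phrase to an inclusive (start, end) date range."""
    phrase = phrase.strip().lower()
    if phrase in ("today", "tonight"):
        return today, today
    if phrase == "tomorrow":
        day = today + timedelta(days=1)
        return day, day
    if phrase == "this weekend":
        # Monday is weekday 0, Saturday is 5, Sunday is 6.
        if today.weekday() == 6:
            return today, today  # on a Sunday, only today is left
        saturday = today + timedelta(days=(5 - today.weekday()) % 7)
        return saturday, saturday + timedelta(days=1)
    if phrase == "next month":
        if today.month == 12:
            year, month = today.year + 1, 1
        else:
            year, month = today.year, today.month + 1
        start = date(year, month, 1)
        # The day before the first of the following month is the last day.
        following = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
        return start, following - timedelta(days=1)
    raise ValueError(f"unrecognized time phrase: {phrase!r}")


print(resolve_when("this weekend", date(2008, 1, 29)))
print(resolve_when("next month", date(2008, 1, 29)))
```

Passing `today` in explicitly keeps the function deterministic and easy to test; a search front end would then filter events to those falling inside the returned range.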

On a more technical note, Zvents has been making waves with the release of its open-source Bigtable clone called Hypertable, which adds a C++ option for this project.

Going forward, it will be interesting to see how Zvents scales to additional locations, and to additional dimensions within each locality. Will it make inroads into the market share for any of the major search engines, or into that of other locally-focused web sites like topix.com and craigslist?



January 27, 2008

Quintura Launches Site Search Widget

Alternative search engine Quintura, which I've mentioned before on this blog, has launched its site search widget. This widget allows site publishers to provide users with a specialized search limited to that specific site; it joins earlier offerings from Google, Yahoo!, Rollyo and Eurekster swiki in this space.

This blog was an early user of this widget. You can see a customized, Quintura-generated mini-tag cloud in the earlier post; a full-size tag cloud is also available. The widget is hosted by Quintura, so installation was a snap: once the site was indexed, all I had to do was to embed the widget code into my blog pages and provide some styling control.

The biggest benefit of using the Quintura solution, as I've said before, is the dynamic tag cloud that allows the user to navigate the search space; initial feedback from our readers here has been positive, but not enthusiastic.

The real benefits to both users and publishers will come when Quintura search results prove to be better than equivalent results from a mainstream search engine solution, such as Google; as long as the Google site search results are good enough, it will be hard for the Quintura widget to make significant inroads into the market share of the big-G juggernaut.

This widget release is currently in private beta; an invite for this beta is available over on ReadWriteWeb.



January 22, 2008

Disambiguation of Search Results? Yup, Google's got that

Just last week, in an email exchange with another search blogger, I wondered when Google would provide options for disambiguation of search results.

When you think about it, that's an obvious requirement for the Results page of any serious search engine. If I query for the search term "Java" - does it mean that I'm looking for results about the programming language, the coffee, or the island in Indonesia?

There's no way for the search engine to be able to tell, although personalization could provide clues. The easiest solution, as I wrote back in 2006, is for the search engine to just ask - which is why Wikipedia offers this page: Java (disambiguation). Alternatively, the results can be grouped into various categories for the user to choose from, which is another way of doing the same thing.
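The grouping option is simple to sketch, assuming each result has already been labeled with a sense (the hard part, which this example does not attempt). The result titles, the `sense` labels and the function name are all made up for illustration.

```python
from collections import defaultdict


def group_by_sense(results):
    """Group result titles by their 'sense' label, keeping rank order within groups."""
    groups = defaultdict(list)
    for result in results:
        groups[result["sense"]].append(result["title"])
    return dict(groups)


# Hypothetical results for the query "java"; sense labels are assumed to exist already.
results = [
    {"title": "The Java Tutorials", "sense": "programming language"},
    {"title": "Java, Indonesia travel guide", "sense": "island"},
    {"title": "Java language specification", "sense": "programming language"},
    {"title": "History of Java coffee", "sense": "coffee"},
]

for sense, titles in group_by_sense(results).items():
    print(f"Did you mean Java ({sense})? {titles}")
```

Because the groups appear in the order their first member was ranked, the most popular sense still comes first, but the less popular senses are no longer buried.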

Until now, Google has been mostly following a third option, which is to simply pick the most popular category regardless of the user's real preference; this can lead to some strange results, as highlighted in my earlier post on deconstructing real Google searches. But this approach doesn't really cut it, since it ignores all the unpopular search results - it's very possible that the long-tail searches can collectively make up a market share that rivals or exceeds the relatively few "popular" searches.

There has also been a limited amount of disambiguation offered by Google's "related searches" feature.

Well, no more. Google appears to be experimenting with offering disambiguation directly by grouping search results into categories. See the screen shot below, which shows Google search results for the query "freebase". Effectively, the results page seems to be asking: do you mean the free semantic web database, or the other kind, associated with drugs? Or a third alternative: FreeBase - a free Windows software program to configure the Apple AirPort Base Station.

The use of horizontal ruled lines to separate the sections is a nice touch!



Obviously this is some type of test; I certainly hope it's successful. I can't wait to see this feature become mainstream among the major search engines. It will be a big step forward in Search!



January 09, 2008

Deconstructing real Google searches - why Powerset matters

I was looking at the log files for my blog today, as I regularly do, and I was suddenly struck by the variety of search queries in Google for which users were getting referred to my posts. I write often about the different flavors of search - including vertical search, parametric search, semantic search, and so on - so users with queries about Search often land here. But do they always find what they're looking for?

Some Real-life Search Results

Let us examine some of the actual Google queries - in the form of referring URLs - that led users to my blog. In most cases, Google did a fine job of matching the content to the query; in some cases, it was a somewhat random match at best; finally, in a few cases, the Google search algorithms are clearly getting confused. It is this third case that is the most interesting.
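For readers who want to try this on their own logs, here is a minimal sketch of pulling the search phrase out of a referring URL with Python's standard `urllib.parse` module. It assumes the phrase lives in the `q` parameter, as it does in the Google URLs quoted in this post; the function name is my own.

```python
from urllib.parse import urlparse, parse_qs


def query_from_referrer(referrer):
    """Return the search phrase from a referring URL, or None if there is no 'q'."""
    params = parse_qs(urlparse(referrer).query)
    values = params.get("q")
    return values[0] if values else None


print(query_from_referrer(
    "http://www.google.com/search?hl=en&q=search+technology+exits"
))
```

`parse_qs` decodes the `+` signs back into spaces, so the function returns the phrase as the user typed it.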

The Good

In many cases, the match was quite straightforward and very relevant. Some examples are given below.
1.

Query: http://www.google.fr/search?q=Guru+Avinash+Kaushik&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:fr:official&client=firefox-a
Result: A conversation with Avinash Kaushik, Web Analytics Guru

Well, can't argue with that!

2.

Query: http://www.google.cn/search?sourceid=navclient&aq=t&hl=zh-CN&ie=UTF-8&rlz=1T4XNLA_zh-CNCN246CN247&q=vertical+search
Result: The rise of Vertical Search Engines (VSEs)

Query: http://www.google.com/search?q=wikipedia+to+try+and+compete+with+google&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a
Result: Wikipedia Search to compete with Google

Again, can't argue with those.

3.

Query: http://www.google.com/search?hl=en&q=search+technology+exits
Result: So You've Built an Alternative Search Engine - Now What?

This is actually pretty awesome: the algorithm has figured out "search technology" and "exits"; in fact, this post does talk about exit strategies for search engines, so it's a great match.

The Bad

Some search queries are so vague that the matches you get are bound to be somewhat random. I don't blame Google for the following matches:

4.

Query: http://www.google.com/search?hl=en&q=conceptual+architecture
Result: A Conceptual Architecture for Search

Is the search string too vague? Although technically this post matches the search query, I'm guessing that this is not what the user intended to look for.

5.

Query: http://www.google.com/search?hl=en&safe=off&client=firefox-a&rls=org.mozilla%3Aen-US%3Aofficial&hs=pGP&q=disruptive+technologies+blog&btnG=Search
Result: Disruptive technologies for 2007

While the words match, and possibly this may satisfy the user, I get the sense that the user was looking for a blog dedicated to discussing disruptive technologies, not a single post. But who knows? Again, too vague!

In the future, I wonder how soon Search technology will progress to the point where the UI will automatically ask the user for more information to qualify search terms that are too general or vague. A little while ago, I envisioned a similar scenario (Vertical Search, with authority) when taking a look at the search engine MetaMojo, which has taken some steps in this direction.

The Ugly

In a few cases, though, the proximity of certain keywords fools the search algorithms. Consider the following matches:

6.

Query: http://www.google.com/search?q=best+search+engine+for+directions&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a
Result: Future Directions in Search

A post about "future directions in Search" is not a post about "search engines for directions", although the text itself is undoubtedly a close match.

7.

Query: http://www.google.com/search?hl=en&q=people+search+software+compared&btnG=Search
Result: Search and the Dumbness of Crowds

Hmm? This is a popular post, but I'm not sure if it helps the user, who is not trying to compare search strategies (as this post does); instead, the user appears to be trying to compare people search engines.

Are these good matches? While the content of the posts bears a superficial resemblance to the text in the respective queries, the results are not relevant to the requested user searches.

The Larger Problem

The samples given above are not that important; the matches from my blog do not always show up at the top of the search results and although these are real referrals, not many users will actually click on these links in the Results page. But these examples point to a deeper underlying issue, one that will be far from easy to fix in the general sense.

All the major search engines currently rely on the proximity of keywords and search terms to match results. But that approach can be misleading, causing the search engine to systematically produce incorrect results under certain conditions.

To demonstrate, let us take a look at three general use cases.

[Note: The examples given below are all drawn from Google. To be fair, all the major search engines use similar algorithms, and all suffer from similar problems. For its part, Google handles billions of queries every day, usually very competently. As the reigning market leader, though, Google is the obvious target - it goes with the territory!]

1. Difficulty of Finding Long Tail Results

Take Britney Spears. Given the current popularity of articles, pictures and videos of the superstar singer, the results for practically any query with the word "spears" in it will be loaded with matches about her - especially if the search involves television or entertainment in any way.

Let's say you're watching the movie Zulu and you start wondering what those large spears that all the extras are waving about are made of. So, you go to Google and type in "movie spears material" - this is an obviously insufficient description, as the screen shot below shows.




What happens if you expand on the query further - say: "what are movie spears made out of?" - does it help? Here's a screen shot.




The general issue here is that articles about very popular subjects accumulate high levels of PageRank and then totally overwhelm long tail results. This makes it very difficult for a user to find information about unusual topics that happen to lie near these subjects.
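A toy model shows how this swamping can happen. The scoring rule below (fraction of query words matched, multiplied by a popularity weight) is a deliberately crude assumption of mine; it is not PageRank and not Google's actual formula, and the documents and weights are invented.

```python
def rank(query, docs):
    """Rank docs by (fraction of query words matched) x (popularity weight)."""
    words = set(query.lower().split())

    def score(doc):
        matched = len(words & set(doc["text"].lower().split()))
        return (matched / len(words)) * doc["popularity"]

    return sorted(docs, key=score, reverse=True)


# Two made-up documents: one hugely popular partial match, one obscure full match.
docs = [
    {"title": "Britney Spears movie news",
     "text": "britney spears new movie", "popularity": 100.0},
    {"title": "Prop spears in Zulu",
     "text": "movie spears material for props", "popularity": 1.0},
]

print([d["title"] for d in rank("movie spears material", docs)])
```

The obscure page matches all three query words, yet the popular page, matching only two, still wins by a wide margin once popularity is multiplied in.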

2. Keyword Ordering

Since the major search engines focus only on the proximity of keywords without context, a user search that's similar to a popular concept gets swamped with those results, even if the order of keywords in the query has been reversed. For example, a tragic occurrence that's common in modern life is that of a bicycle getting hit by a car. Much less common is the possibility of a car getting hit by a bicycle, although it does happen. How would you search for the latter? Try typing "car hit by bicycle" into Google; here's a screen shot of what you get.  [Note the third result, which is actually relevant to this search!]
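The effect is easy to reproduce with a bag-of-words representation, which is a simplifying assumption here and not a description of any real engine's internals. Once a query is reduced to word counts, the two phrasings become indistinguishable; keeping ordered word pairs is one simple way to tell them apart.

```python
from collections import Counter


def bag_of_words(text):
    """Reduce text to word counts, discarding word order entirely."""
    return Counter(text.lower().split())


def bigrams(text):
    """Keep adjacent word pairs, which preserves some ordering information."""
    words = text.lower().split()
    return list(zip(words, words[1:]))


a, b = "car hit by bicycle", "bicycle hit by car"
print(bag_of_words(a) == bag_of_words(b))  # same bag of words
print(bigrams(a) == bigrams(b))            # different word pairs
```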



3. Keyword Relationships

Since the major search engines focus only on the keywords in the search phrase, all sense of the relationship between the search terms is lost. For example, users commonly change the meaning of search terms by using negations and prepositions; it is also fairly common to look for the less common members of a set.

This takes us into the realm of natural language processing (NLP). Without NLP, the nuances of these query modifications are totally invisible to the search algorithms.

For example, a query such as "Famous Science fiction writers other than Isaac Asimov" is doomed to failure. A screen shot of this search in Google is given below. Most of the returned results are about Isaac Asimov, even when the user is explicitly trying to exclude him from the list of authors found.
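To show what honoring the exclusion could look like, here is a minimal sketch that splits the query at a few exclusion phrases and filters results accordingly. The phrase list, the function names and the sample titles are assumptions for illustration; real natural language processing is far more involved than a regular expression.

```python
import re

# A small, assumed list of phrases that signal "leave this out".
EXCLUSION = re.compile(r"\b(?:other than|except|excluding)\b", flags=re.IGNORECASE)


def split_exclusion(query):
    """Split a query into (wanted, excluded); excluded is None if nothing is excluded."""
    match = EXCLUSION.search(query)
    if not match:
        return query.strip(), None
    return query[:match.start()].strip(), query[match.end():].strip()


def filter_results(results, excluded):
    """Drop results whose title mentions the excluded phrase."""
    if not excluded:
        return list(results)
    return [r for r in results if excluded.lower() not in r.lower()]


wanted, excluded = split_exclusion(
    "Famous Science fiction writers other than Isaac Asimov"
)
titles = [
    "Isaac Asimov bibliography",
    "Arthur C. Clarke biography",
    "Ursula K. Le Guin interview",
]
print(wanted, "|", excluded)
print(filter_results(titles, excluded))
```

A keyword-only engine treats "Isaac Asimov" as two more terms to match; this sketch instead treats everything after the exclusion phrase as something to remove.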



All of the searches shown above look like gimmicks - queries designed intentionally to mislead Google's search algorithms. And in a sense, they are; these specific queries can be easily fixed by tweaking the search engine. Nevertheless, these queries do point to a real need: the value of understanding the meaning behind both the query and the content indexed.

Semantic Search

That's where the concept of semantic search comes in. I attended a media event earlier this year at stealth search startup Powerset (see: Powerset is Not a Google-killer!) which showcased a live demo of their search engine, currently in closed alpha, that highlighted solutions to exactly this type of issue.

For example, type "What was said about Jesus" into a major search engine, and you usually get a whole list of results that consist of the teachings of Jesus; this means that the search engine entirely missed the concepts of passive voice and "about". The Powerset results, on the other hand, were consistently on target (for the demo, anyway!).

In other words, when you look at just the keywords in the query, you don't really understand what the user is looking for; by looking at them within context, by taking into account the qualifiers, the prepositions, the negatives, and other such nuances, you can create a semantic graph of the query. The same case can be made for semantic parsing of the content indexed. Put the two together, as Powerset does, and you can get a much better feel for relevance of results.

What about Google? I'm sure the smart folks in Google's search-quality team are busily working on this problem as well. I look forward to the time when the major search engines handle long tail queries more accurately and make Search a better experience for all of us.


