
September 25, 2007

Can the Semantic Web bring us Trusted Search Results?

Nova Spivack, during his recent talk about the Semantic Web (covered in my previous post), made the point that adding semantic processing to a search engine's underlying index makes the issue of trust more serious. This was an intriguing statement, and I followed up with him via email to get further clarification; he was kind enough to respond at length. The questions and answers from our exchange are given below.


1. For the Semantic Web to really take off, the information on the web at large needs to become Semantic Web-compatible; i.e. web pages need to provide semantic information in the form of RDF, OWL etc. Do you see this happening in the foreseeable future, given the huge mass of pages that already exist?
    Or is it more likely that technology will have to solve this problem for us, and we'll need to invent algorithms that can interpret currently existing web pages to extract and apply semantic knowledge on top (such as ClearForest Gnosis)?

DBpedia.org is a good start. Also check out the emerging SPARQL and GRDDL standards at W3C -- they will bring existing data into the RDF world. There is also a growing body of RDF already out there that uses Dublin Core, FOAF and other ontologies to describe content on the Web. More will be coming from many big companies: Adobe, Oracle, Yahoo, etc. And of course other startups like Metaweb and Radar Networks will be adding a lot of content to the mix in different ways as well.


2. In your talk, you mentioned Trust - e.g. you referred to Powerset, wondering how they could add some sense of trust to the results they find through semantic processing (NLP) of web pages, because otherwise it would not be useful.
    I'm not sure I understand.

They are mining full text of the web and automatically building a knowledge base from that. So when they see a web page that says "Microsoft is a terrorist organization" or "Microsoft is a software manufacturer" how do they know which statement is true or false? Who do they trust? How do they determine who to trust? This determines what facts or assertions get what level of weight in their knowledge base. It's the crux of the issue really. You can mine in a lot of assertions, but if you have no good way to filter out the garbage, spam, erroneous statements, or deliberate deception, you can't use it for anything real. One solution is to only mine highly trusted sources -- such as encyclopedias and major newspapers for example. That's not a bad way to start. That would generate a decent knowledge base.

But DBpedia.org might be a better way to start than mining free text. They've already done the heavy lifting of turning Wikipedia into RDF. I'm not sure you need natural language processing to get good, reasonably trustworthy knowledge; just use DBpedia.

In the case of Powerset I believe their goals are different from those of DBpedia/Wikipedia -- I think they don't just want almanac content, they want specialized vertical knowledge about travel, products, etc. That will require that they either be very selective about their data sources or have a sensitive way to measure trust and rank information accordingly.


        a] How does the addition of semantic processing to the underlying index make the issue of trust more serious? Google's PageRank is basically an approximation of the Wisdom of Crowds, using static links to represent votes; if Powerset uses some similar mechanism to rank the information sources in the underlying index, then their results should be no worse than Google's in terms of trust, and better than Google's in terms of relevance. [Incidentally, I've written about Powerset before when I attended their preview event.]

They could perhaps use a pagerank algorithm to attribute more trust to assertions they mine from various sites. That would be one solution. Unfortunately it gets more complex though -- because if their knowledge base has many grades of truth (statements ranked to varying degrees of trust) for a given assertion, then they will have to use modal logic or some other form of fuzzy reasoning to actually do any real reasoning or inferencing. That stuff is hard and uncharted territory to say the least. I don't think they will go there and if they did I think they would not be successful at it. So the question is what can they do without going there?
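The "use PageRank to attribute trust to mined assertions" idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not Powerset's or Google's actual method: the link graph, the site names and the rule of summing source ranks per statement are all made up for the example.

```python
from collections import defaultdict


def pagerank(links, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a dict {site: [sites it links to]}."""
    sites = set(links) | {t for targets in links.values() for t in targets}
    rank = {s: 1.0 / len(sites) for s in sites}
    for _ in range(iterations):
        new_rank = {s: (1.0 - damping) / len(sites) for s in sites}
        for site in sites:
            targets = links.get(site, [])
            if targets:
                # Split this site's rank evenly among the sites it links to.
                share = damping * rank[site] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling site: spread its rank evenly over all sites.
                share = damping * rank[site] / len(sites)
                for t in sites:
                    new_rank[t] += share
        rank = new_rank
    return rank


def score_assertions(assertions, rank):
    """Assumed scoring rule: sum the rank of every site making each statement."""
    scores = defaultdict(float)
    for site, statement in assertions:
        scores[statement] += rank[site]
    return dict(scores)


# Hypothetical link graph and mined statements, invented for illustration.
links = {
    "encyclopedia.example": ["newspaper.example"],
    "newspaper.example": ["encyclopedia.example"],
    "blog.example": ["encyclopedia.example", "newspaper.example"],
    "spam.example": ["encyclopedia.example"],
}
assertions = [
    ("encyclopedia.example", "Microsoft is a software manufacturer"),
    ("newspaper.example", "Microsoft is a software manufacturer"),
    ("spam.example", "Microsoft is a terrorist organization"),
]

rank = pagerank(links)
scores = score_assertions(assertions, rank)
for statement, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{score:.3f}  {statement}")
```

Note that the output is a graded score per statement rather than a true/false verdict, which is exactly where the reasoning difficulty described in the answer above begins.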


        b] How will the Semantic Web improve the trust situation for web search results?

First there is the issue of being able to assign a trust rank to each triple. Trust is relative so in fact there may not be a single global measure of trust that applies equally to everyone. I may trust someone that you don't trust and so I may take what they say to have more weight than you would for example. There isn't room within every triple to store that, but triples could be ratified by other triples that express "endorsement" of their content. So if I agree with something I can simply express that and now it is recorded that I (in an authenticated manner) have said I trust it. If lots of people do that with various assertions (triples), records (objects), and sites and people (sources), then we have a network of trust built in RDF. A network of trust can be reasoned on to determine weighted, socially relative trust rankings for triples. It can determine what triples I am likely to trust versus what you are likely to trust versus what everyone is likely to trust.

Second there is the issue of being able to trust reasoning performed by the system. For that the system needs to be able to explain to a human how it reached some logical conclusion and what data it used to do so. Work is being done both on computer-generated explanations and ways to record and show provenance to address these issues.
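To make the endorsement idea from the first point concrete, here is a minimal Python sketch. It is an illustration only, not Nova's method or any real RDF library's API: the statement ids, the "endorses" predicate, the people and the trust weights are all assumptions, and the scoring rule (sum the viewer's trust in each endorser) is the simplest one that shows trust being relative to the person asking.

```python
# Base assertions, each given an id so that other triples can refer to it.
statements = {
    "s1": ("Microsoft", "is_a", "software manufacturer"),
    "s2": ("Microsoft", "is_a", "terrorist organization"),
}

# Endorsements are themselves triples: (person, "endorses", statement id).
endorsements = [
    ("alice", "endorses", "s1"),
    ("bob", "endorses", "s1"),
    ("mallory", "endorses", "s2"),
]

# Hypothetical personal trust weights (0.0 to 1.0): viewer -> person -> weight.
trust = {
    "me": {"alice": 0.9, "bob": 0.5},
    "you": {"mallory": 0.8, "bob": 0.4},
}


def relative_score(viewer, statement_id):
    """Sum the viewer's trust in everyone who endorsed the statement."""
    weights = trust.get(viewer, {})
    return sum(
        weights.get(person, 0.0)
        for person, predicate, target in endorsements
        if predicate == "endorses" and target == statement_id
    )


def endorsement_count(statement_id):
    """A viewer-independent measure: how many people endorsed the statement."""
    return sum(1 for _, _, target in endorsements if target == statement_id)


for viewer in ("me", "you"):
    for sid, triple in statements.items():
        print(viewer, sid, triple, round(relative_score(viewer, sid), 2))
```

With these made-up weights, "me" ranks s1 above s2 while "you" ranks s2 above s1, even though both viewers read the same set of endorsement triples. A fuller version would also propagate trust through friends of friends.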


Many thanks to Nova for his detailed answers. You can find his blog here: Minding the Planet.



Great Videos: Sir Ken Robinson on Creativity and Education

I'm introducing a new feature on this blog: Great Videos. There are some truly great videos that have caught on virally within the blogosphere - videos that are inspiring, creative, original - that I would like to share with readers. In general, I'm going to stay within the themes of technology, innovation and the future, but some of them may be slightly off-topic for this blog.

Caveat: Since each of these is a popular video, you may have already seen it and can just skip ahead; but for everyone else, I'm pretty sure it will be worth their time to watch. 

I'm leading off with the following video:

Sir Ken Robinson, speaking at the TED Conference (via Presentation Zen):

Topic:           Creativity and Education
Duration:      20:02
Description: This is not a particularly interesting topic by itself, but Sir Ken is a great speaker and he makes it come alive. This video is of interest to anyone who cares about how we educate our children and nourish their creativity.



Here's the original link.

If anyone has any other great videos to share, please let me know in the comments or via email.



September 12, 2007

The Promise of Semantic Web

Last week, I attended a session on the topic of Semantic Web, presented by Nova Spivack, CEO of Radar Networks. I've been following this topic off and on for a few years, but I've always wondered how real it is (remember the AI efforts of the '80s?). After listening to this talk, I'm convinced that at least parts of this technology are on the way, although full-fledged machine-automated reasoning remains as elusive as ever. The highlights of the session are covered below.

What is it good for?

Mr. Spivack defined Semantic Web as a specific set of W3C open standards for working with knowledge. The main idea is to use technologies based on these standards for adding machine-understandable structured data to the web, with the overall goal of enabling automated reasoning algorithms. Note that semantic web advocates do not insist that the ontologies be the same across the web (unlike, say, the microformat approach) - different sites can use different schemas, as long as they are published and can be mapped to one another.

Benefits of the Semantic Web include the following:

  1. Richer content
  2. More precise search and navigation
  3. Increased productivity and better collaboration
  4. Integrating data and applications
  5. Machine-automated reasoning and AI

Most exciting, the availability of semantic information could facilitate much richer web search paradigms, e.g. parametric search and associative search. Semantic web technologies hold the promise of automated reasoning, letting the engine make inferences based on structured data and the links between data.

How does it work?

Semantic Web depends on the following core standards:

  1. RDF - Resource Description Framework - Enables the storage of data as "triples" (subject, predicate, object)
  2. OWL - Web Ontology Language - Defines systems of concepts called "ontologies"
  3. SPARQL - an RDF Query Language - Queries RDF data
  4. SWRL - Semantic Web Rule Language - Enables us to define rules
  5. GRDDL - Gleaning Resource Descriptions from Dialects of Languages - Extracts RDF from XML/XHTML documents
  6. (Microformats) - do they really belong in this list?

Today's web pages present information essentially as text (XHTML); links between pages are simply links, with no semantics associated with each link. In a semantic web, however, data is modeled as a network of nodes, which are connected together using semantic (meaningful) links. Specifically, information is usually modeled as a set of triples, each of which includes a Subject, a Predicate and an Object. The nodes themselves can be arranged in a hierarchy.

Modeling data in this way allows simple inferences to be made that were not explicitly stated in the information provided.
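As a rough illustration of triples and simple inference, here is a small Python sketch. It uses plain tuples rather than a real RDF store, and the predicate names (is_ceo_of, is_a, subclass_of) and the class hierarchy are assumptions made up for the example; the single rule it applies is "if X is_a C and C subclass_of D, then X is_a D".

```python
# A tiny triple store: each fact is a (subject, predicate, object) tuple.
triples = {
    ("Nova Spivack", "is_ceo_of", "Radar Networks"),
    ("Radar Networks", "is_a", "Startup"),
    ("Startup", "subclass_of", "Company"),
    ("Company", "subclass_of", "Organization"),
}


def match(pattern, store):
    """Return triples matching a (subject, predicate, object) pattern.

    None acts as a wildcard, loosely like a variable in a SPARQL query.
    """
    return [
        t for t in store
        if all(p is None or p == v for p, v in zip(pattern, t))
    ]


def infer_types(store):
    """Apply the subclass rule repeatedly until no new triples appear."""
    inferred = set(store)
    changed = True
    while changed:
        changed = False
        for x, _, c in match((None, "is_a", None), inferred):
            for _, _, d in match((c, "subclass_of", None), inferred):
                new_triple = (x, "is_a", d)
                if new_triple not in inferred:
                    inferred.add(new_triple)
                    changed = True
    return inferred


closure = infer_types(triples)
# Facts that were never stated explicitly but follow from the hierarchy.
for triple in sorted(closure - triples):
    print(triple)
```

Nothing in the original data says that Radar Networks is an Organization; that fact falls out of the class hierarchy, which is the kind of simple inference described above.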

Challenges

In his presentation, Mr. Spivack identified the following barriers to adoption:

  1. A lack of tools
  2. Scaling challenges (what if you want to store a trillion+ triples?)
  3. Vision issues (how can we define a practical vision, for the low-hanging fruit?)
  4. Inadequate Content (not enough semantic data available)
  5. No killer apps
  6. Market education

Although all of these are clearly important and need to be solved, I see the core problem as being the lack of a practical, popular "Killer App". If a specific application were to catch the imagination of a large section of the population, I have no doubt that the rest of the problems - technical issues such as tools and scaling - could be solved.

[For more information about Semantic Web technologies, you should check out Nova Spivack's blog: Minding the Planet.]



September 02, 2007

Reactions: The Future of Alternative Search Engines

I recently wrote an article on this blog about the exit strategies for alternative search engines, which highlighted the recent and growing trend of publishers acquiring search engines; I also speculated about Charles Knight's quest to get these Alts to band together in order to grow overall traffic. I've gotten interesting reactions to the piece from some prominent bloggers.

 

I've been a fan of Ashkan Karbasfrooshan, of the HipMojo blog, for a long time. When I asked him about the Alts, his response was as follows:

No one has a crystal ball to predict with enough accuracy what will happen in the next few years with regards to search, but clearly, history suggests that there is always a period of consolidation after innovation and growth, so that should happen in search too.  And, we're seeing the big companies starting to have difficulty adding market share, it's not like adding one of these fringe search engines (including MetaMojo when I refer to alt engines as fringe) will add market share to a major one, but sometimes by buying a small player, a major company adds technology, know-how, but most importantly, brainpower.  That is the single most important variable for who will win or lose in search over the next decade: who can best deploy technology and market it, and not who can best develop technology.

One thing I was unsure of was whether traditional media would be doing the buying, I think ultimately traditional media will partner with search companies because it is not in the DNA to understand which search companies to acquire... which is a shame, because it's to old media that young search companies can most add value to...

 

David Berkowitz, of Inside The Marketers Studio, stuck to his usual line where the Alts are concerned - he doesn't think most of them are good enough to get acquired or survive on their own:

Honestly, I don't think many of the alternative engines are good enough to be acquired, though a few will find that as an exit strategy, and a few others will survive as niche alternatives.

 

Microsoft's Don Dodge, who has a terrific blog, The Next Big Thing, got so interested that he wrote a separate blog post to address this question. Here's a snippet from his post:

My thoughts? Some of these will be acquired by the big search engines or big content publishing networks. Most of them will fade away. I don't see any of them breaking out and creating a significant stand alone business with the possible exceptions of Powerset, Hakia, and Mahalo.

Read Don's post to find out where he thinks the big, untapped opportunities lie.

 

Bob Warfield, of the SmoothSpan blog, responded with an interesting blog post, proposing that:

Alt Search providers can get together and create an Open Sourced Collaborative Search-Oriented Social Network ...

In essence, he argues that the Alts could get together to share the costs and burdens of web crawling and the underlying infrastructure. This would help them to reduce the gap with the big players, which have a huge advantage in terms of resources. You can find his post here.

 

Conclusion

The overall consensus is that, unless something changes, most of the alternative search engines will not survive; some will get acquired and a few players may do well on their own, especially if they focus on a specific niche. A convergence in markets (e.g. local and mobile), or joint efforts on building the infrastructure, would give them a better chance.

Personally, I see the lack of web traffic as being the single biggest weakness of the alternative search engines; regardless of their cool technologies, innovative architectures or stunning visualizations, they cannot survive without getting the word out and capturing market share in search. There are simply too many web site destinations for the average user to remember. If the Alts could somehow cooperate to provide a single entry point that then branches off to different specializations, it would be a huge step forward!


