Nova Spivack, during his recent talk about the Semantic Web (covered in my previous post), made the point that adding semantic processing to the underlying index of a search engine makes the issue of trust more serious. This was an intriguing statement, and I followed up with him via email to get further clarification; he was kind enough to respond at length. The questions and answers from our exchange are given below.
1. For the Semantic Web to really take off, the information on the web at large needs to become Semantic Web-compatible; i.e. web pages need to provide semantic information in the form of RDF, OWL etc. Do you see this happening in the foreseeable future, given the huge mass of pages that already exist? Or is it more likely that technology will have to solve this problem for us, and we'll need to invent algorithms that can interpret currently existing web pages to extract and apply semantic knowledge on top (such as ClearForest Gnosis)?
DBpedia.org is a good start. Also check out the emerging SPARQL and GRDDL standards at W3C -- they will bring existing data into the RDF world. There is also a growing body of RDF already out there in the Dublin Core, FOAF and other ontologies on content on the Web. More will be coming from many big companies: Adobe, Oracle, Yahoo, etc. And of course other startups like Metaweb and Radar Networks will be adding a lot of content to the mix in different ways as well.
2. In your talk, you mentioned Trust - e.g. you referred to Powerset, wondering how they could add some sense of trust to the results they find through semantic processing (NLP) of web pages, because otherwise it would not be useful. I'm not sure I understand.
They are mining full text of the web and automatically building a knowledge base from that. So when they see a web page that says "Microsoft is a terrorist organization" or "Microsoft is a software manufacturer" how do they know which statement is true or false? Who do they trust? How do they determine who to trust? This determines what facts or assertions get what level of weight in their knowledge base. It's the crux of the issue really. You can mine in a lot of assertions, but if you have no good way to filter out the garbage, spam, erroneous statements, or deliberate deception, you can't use it for anything real. One solution is to only mine highly trusted sources -- such as encyclopedias and major newspapers for example. That's not a bad way to start. That would generate a decent knowledge base.
But DBpedia.org might be a better way to start than mining free text. They've already done the heavy lifting of turning Wikipedia into RDF. I'm not sure you need natural language processing to get good, reasonably trustworthy knowledge; you could just use DBpedia.
In the case of Powerset I believe their goals are different from those of DBpedia/Wikipedia -- I think they don't just want almanac content, they want specialized vertical knowledge about travel, products, etc. That will require that they either be very selective about their data sources, or have a sensitive way to measure trust and rank information accordingly.
a] How does the addition of semantic processing to the underlying index make the issue of trust more serious? Google's PageRank is basically an approximation of the Wisdom of Crowds, using static links to represent votes; if Powerset uses some similar mechanism to rank the information sources in the underlying index, then their results should be no worse than Google's in terms of trust, and better than Google's in terms of relevance. [Incidentally, I've written about Powerset before when I attended their preview event.]
They could perhaps use a PageRank-style algorithm to attribute more trust to assertions they mine from various sites. That would be one solution. Unfortunately it gets more complex though -- because if their knowledge base has many grades of truth (statements ranked to varying degrees of trust) for a given assertion, then they will have to use modal logic or some other form of fuzzy reasoning to actually do any real reasoning or inferencing. That stuff is hard and uncharted territory to say the least. I don't think they will go there, and if they did I think they would not be successful at it. So the question is: what can they do without going there?
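To make the idea of weighting mined assertions by source rank concrete, here is a minimal sketch in Python. It is not Powerset's method (nothing in the exchange describes their implementation); it simply runs a plain power-iteration PageRank over a tiny, made-up link graph and then scores each assertion by summing the rank of the sites that state it. The site names, the link graph, the `pagerank` and `score_assertions` helpers, and the damping value are all assumptions for illustration.

```python
from collections import defaultdict


def pagerank(links, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a dict of page -> list of outlinks."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


def score_assertions(mined, rank):
    """Sum the rank of every source that makes a given assertion."""
    scores = defaultdict(float)
    for source, assertion in mined:
        scores[assertion] += rank.get(source, 0.0)
    return dict(scores)


# Hypothetical link graph: nobody links to the blog or the spam site.
links = {
    "encyclopedia.example": ["news.example"],
    "news.example": ["encyclopedia.example"],
    "blog.example": ["encyclopedia.example", "news.example"],
    "spam.example": ["encyclopedia.example"],
}

# Hypothetical mined assertions as (source, (subject, predicate, object)).
software = ("Microsoft", "is_a", "software manufacturer")
terrorist = ("Microsoft", "is_a", "terrorist organization")
mined = [
    ("encyclopedia.example", software),
    ("news.example", software),
    ("spam.example", terrorist),
]

rank = pagerank(links)
scores = score_assertions(mined, rank)
for assertion, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{score:.3f}  {assertion}")
```

In this toy graph the well-linked sources dominate, so the "software manufacturer" assertion outscores the one found only on the unlinked site. Note that the result is a graded score per assertion, which is exactly the situation Nova describes: ranking is easy, but reasoning over statements with different degrees of trust is the hard part.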
b] How will the Semantic Web improve the trust situation for web search results?
First there is the issue of being able to assign a trust rank to each triple. Trust is relative, so in fact there may not be a single global measure of trust that applies equally to everyone. I may trust someone that you don't trust, and so I may give what they say more weight than you would, for example. There isn't room within every triple to store that, but triples could be ratified by other triples that express "endorsement" of their content. So if I agree with something I can simply express that, and now it is recorded that I (in an authenticated manner) have said I trust it. If lots of people do that with various assertions (triples), records (objects), and sites and people (sources), then we have a network of trust built in RDF. A network of trust can be reasoned on to determine weighted, socially relative trust rankings for triples. It can determine what triples I am likely to trust versus what you are likely to trust versus what everyone is likely to trust.
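A rough sketch of how such endorsement triples could yield socially relative rankings follows. It is only an illustration under stated assumptions: plain Python tuples stand in for RDF, authentication of endorsers is not modeled, and the people, trust weights, triple ids, and the `personal_trust` and `triple_score` helpers are all hypothetical. Trust is propagated from the viewer along "trusts" edges by multiplying weights, and a triple's score for that viewer is the sum of the viewer's trust in everyone who endorsed it.

```python
def personal_trust(viewer, trusts, max_hops=3):
    """Trust the viewer places in each person, following 'trusts' edges.

    Weights lie in [0, 1] and are multiplied along a path, so trust fades
    with each hop; the strongest path found to a person wins.
    """
    trust = {viewer: 1.0}
    frontier = [viewer]
    for _ in range(max_hops):
        next_frontier = []
        for person in frontier:
            for friend, weight in trusts.get(person, {}).items():
                candidate = trust[person] * weight
                if candidate > trust.get(friend, 0.0):
                    trust[friend] = candidate
                    next_frontier.append(friend)
        frontier = next_frontier
    return trust


def triple_score(viewer, triple_id, trusts, endorsements):
    """Sum of the viewer's trust in each person who endorsed the triple."""
    trust = personal_trust(viewer, trusts)
    return sum(
        trust.get(person, 0.0)
        for person, endorsed in endorsements
        if endorsed == triple_id
    )


# Hypothetical assertions, keyed by an id so other triples can refer to them.
triples = {
    "t1": ("Microsoft", "is_a", "software manufacturer"),
    "t2": ("Microsoft", "is_a", "terrorist organization"),
}

# Hypothetical "who trusts whom, and how much" edges.
trusts = {
    "alice": {"bob": 0.9},
    "bob": {"carol": 0.5},
    "dave": {"erin": 0.8},
}

# Endorsement triples, reduced to (person, endorsed triple id).
endorsements = [("bob", "t1"), ("carol", "t1"), ("erin", "t2")]

for viewer in ("alice", "dave"):
    for triple_id in triples:
        score = triple_score(viewer, triple_id, trusts, endorsements)
        print(viewer, triple_id, round(score, 2))
```

With this data the same two triples rank differently for alice and for dave, which is the "socially relative" aspect of the answer above. A global ranking could be approximated by averaging the scores over all viewers, though that is only one of many possible choices.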
Second there is the issue of being able to trust reasoning performed by the system. For that the system needs to be able to explain to a human how it reached some logical conclusion and what data it used to do so. Work is being done both on computer-generated explanations and ways to record and show provenance to address these issues.
Many thanks to Nova for his detailed answers. You can find his blog here: Minding the Planet.