May 15, 2008

Yahoo! SearchMonkey - Released to Developers

The good folks from Yahoo! unveiled their new open search platform, Yahoo! SearchMonkey, at a developer launch party today at their Sunnyvale headquarters. In some ways, the SearchMonkey platform is revolutionary and a major step forward in search, allowing publishers to participate directly in improving the quality of their own information presented on the Yahoo! search results page (this is also implicitly a push for the bottom-up approach to the Semantic Web, which most industry observers have given up on in favor of a top-down approach). The platform also lets publishers and third-party developers build applications aimed at improving the search experience. Finally, and most important, if enough publishers and app developers participate in the program, it promises to improve the quality of search results for end users.

Features

At the simplest level, you can think of SearchMonkey as a community-powered set of rich information boxes (similar to the Google OneBox) that appear on the Yahoo! search results page. Publishers can provide this rich data to the Yahoo! search index in a variety of ways: through structured data feeds (RSS), through RDF or Microformat markup on web pages, or through simple page extraction. The "Information Bar" shows up underneath the main search results. The Yahoo! search team has also provided tools to enable developers to build search-based applications very simply and easily.

Continue reading "Yahoo! SearchMonkey - Released to Developers" »

May 11, 2008

Powerset Launches Wikipedia Search

Semantic search engine Powerset, which we've written about here before, has just launched its initial release. The current release is limited to indexing Wikipedia content, but it provides a great showcase for their technology and user experience.

For example, my search for "Alexander the Great" provided the following results page:

Continue reading "Powerset Launches Wikipedia Search" »

May 07, 2008

Cognition Technologies recognized by KMWorld as one of "100 that matter"

Cognition Technologies, which focuses on Semantic natural language processing technology, was named by KMWorld as one of the top 100 Companies That Matter in Knowledge Management for 2008.

Says Cognition CEO Scott Jarus:

One of the biggest barriers to building a natural language understanding system is to build the semantic map and the dictionary with details of the syntactic behavior of words (i.e. how words behave within context).  Cognition's team has spent more than 20 years building this capability into Cognition’s Semantic NLP for the English language ...  and our technology is commercially available today!

Semantic search and NLP technologies seem to have arrived - they are generating a lot of buzz lately. In addition to mainstays Hakia and Powerset, there is a spate of new entries, including Cognition, BooRah and eeggi. We will be reviewing some of these new alternate search engines on this blog in the near future.

Congratulations, Scott and the Cognition team!



March 20, 2008

Tim O'Reilly and Sir Tim Berners-Lee concur: Semantic Web Likely to be Top-Down

In a previous post, I asked the question: Where are the Meaning-Enabled Authoring Tools?, arguing that publishers who regularly post similar content (especially content that conforms to common formats) would get a big advantage from using Semantic Authoring tools for creating new content. By using semantic tools, not only can you get SEO benefits and improve findability, but the content can also be re-purposed more easily for other uses such as web applications and services.

This is essentially a bottom-up approach to the semantic web: adding semantic notation to the content itself. However, as the post went on to say, the prevailing view is definitely a top-down one, viz. that semantic meaning will have to be extracted by applications from perfectly ordinary web pages, and that the adding of semantic knowledge to the content itself is unlikely (aside from very limited contexts, such as Microformats).

Two recent podcasts with two of the leading voices in this space further confirm this view.

Continue reading "Tim O'Reilly and Sir Tim Berners-Lee concur: Semantic Web Likely to be Top-Down" »

February 28, 2008

Semantic Web - What is the Core Problem?

In his latest blog post, Mathew Ingram writes about Paul Miller's interview with Sir Tim Berners-Lee, inventor of the World Wide Web. Miller's interview writeup is very interesting - as Marshall Kirkpatrick notes on the ReadWriteWeb, Sir Tim feels that all the pieces for the Semantic Web are already in place to realize a large part of the dream and to allow us to create applications that leverage the power of structured data and the integration of that data.

[One big problem for the Semantic Web that I've written about recently is the lack of meaning-enabled authoring tools; however, in the interview, Sir Tim indicates that this need is less critical; the structured data we need can come from databases.]

Coming back to Ingram's post, he says that the biggest problem with the Semantic Web is that "it’s as boring as dry toast" - i.e. it's all about the technical side, with discussions about plumbing and widgets and standards, and there's nothing there that will make people sit up and take notice.

Continue reading "Semantic Web - What is the Core Problem?" »

February 27, 2008

Web Poll Results: Killer App for Semantic Web technology

Our recent web poll asked readers to vote on what they saw as the most likely "killer app" for Semantic Web technology, from a variety of choices. A total of 51 readers voted - thank you! The results of the poll are shown below.


    

[Poll results chart: What is the most likely *Killer App* for Semantic Web technology?]



Within this results set, users felt that Web Search was the most promising application of Semantic Web. This seems reasonable - being able to extract the meaning of web pages should help search engines match results better with specific queries. On the other hand, this blog focuses extensively on web search, so the demographics of the audience may be skewed, thus affecting the results.

More interesting is that readers felt that Enterprise Applications are more likely to benefit from semantic analysis, over vertical applications. This is curious because semantic analysis is easier and more effective within a given vertical domain, where some level of background knowledge can be assumed by the parsing algorithms.

Also, Social Applications got only one vote - but as Facebook's Aditya Agarwal noted in a recent SDForum meet, the problem of finding relevant content of interest for a given user from within their social group is actually very similar to the basic problem of web search. So those two choices are not as different as one might think.

Finally, several users chose the "other" option, and listed their own interesting choices:

  • gnodal    [??]
  • contextual search
  • Actually, semantics is not about a particular application but a way of expressing
  • All: Integration of search with social networking and other apps
  • portal and service interoperability


February 24, 2008

Semantic Web: Where are the Meaning-Enabled Authoring Tools?

Jason Kolb sees it as a way to identify data objects using URIs. John Markoff, of the New York Times, calls it Web 3.0. And Nova Spivack has a long post clarifying what it is Not.

What are all these authors talking about? The Semantic Web - much has been written recently about its concepts, approaches and applications. But there's something missing, a piece that hasn't generated much interest to date.

In terms of understanding, finding and displaying content, there is no doubt that the Semantic Web is slowly becoming real (e.g. there were some great demos at a recent SDForum meet). However, a gap is emerging with Content Authoring tools, which have not yet made this paradigm shift.

On the one hand, most authors are comfortable with, and proficient in, desktop authoring tools, such as Microsoft Word, FrontPage, Adobe GoLive and others; this is especially true for professionals and other experts who create technical reference content for web applications, such as legal references, accounting manuals or engineering documents. The current crop of authoring tools produces visually high-quality articles and web pages, but their XML or RDF creation capabilities are severely limited.

On the other hand, parsing Word documents or HTML web pages to extract meaningful structure from them gives poor results; much of the semantic knowledge of the content is lost. There do not appear to be any popular tools that create Semantic content natively and yet are natural and easy for a content author to use.

Top-Down? Or Bottom-Up?

Of course, there are ways to get around this issue to some extent. Allowing authors or readers to add tags to articles or posts allows a measure of classification, but it does not capture the true semantic essence of the document. Automated Semantic Parsing (especially within a given domain) is on the way - a la Spock, twine and Powerset (see writeup) - but it is currently limited in scope and needs a lot of computing power; in addition, if we could put the proper tools in the authors' hands in the first place, extracting the semantic meaning would be so much easier.

For example, imagine that you are building an online repository of content, using paid expert authors or community collaboration, to create a large number of similar records - say, a cookbook of recipes, a stack of electrical circuit designs, or something similar. Naturally, you would want to create domain-specific semantic knowledge of your stack at the same time, so that you can classify and search for content in a variety of ways, including by using intelligent queries.

Ideally, the authors would create the content as meaningful XML text or RDF triples, so that parsing the semantics would be much easier. A side benefit is that this content can then be easily published in a variety of ways and there would be SEO benefits as well, if search engines could understand it more easily. But tools that create information in this way, and yet are natural and easy for authors to use, don't appear to be on their way; and the creation of a custom tool for each individual domain seems a difficult and expensive proposition.
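To make the idea a little more concrete, here is a minimal Python sketch of what the back end of such an authoring tool might do for the cookbook case: the author fills in ordinary form fields, and the tool emits semantic XML. The element names (recipe, ingredient, step) and the function recipe_to_xml are my own assumptions for illustration, not an existing schema or product.

```python
import xml.etree.ElementTree as ET


def recipe_to_xml(title, ingredients, steps):
    """Turn fields from a hypothetical authoring form into semantic XML.

    ingredients is a list of (name, quantity) pairs; steps is a list of strings.
    The element and attribute names are illustrative, not a standard schema.
    """
    recipe = ET.Element("recipe")
    ET.SubElement(recipe, "title").text = title

    ingredients_el = ET.SubElement(recipe, "ingredients")
    for name, quantity in ingredients:
        item = ET.SubElement(ingredients_el, "ingredient", quantity=quantity)
        item.text = name

    steps_el = ET.SubElement(recipe, "steps")
    for number, text in enumerate(steps, start=1):
        step = ET.SubElement(steps_el, "step", number=str(number))
        step.text = text

    # Serialize to a plain string that could be stored or published.
    return ET.tostring(recipe, encoding="unicode")


if __name__ == "__main__":
    print(recipe_to_xml(
        "Pancakes",
        [("flour", "2 cups"), ("milk", "1.5 cups")],
        ["Mix the batter.", "Fry on a hot griddle."],
    ))
```

The author never sees a tag; the structure comes from the form itself. The hard part, of course, is building a form like this for every domain.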

Car Review Example

As a more concrete example: imagine that you control a web site called New-Car-Reviews.com, a hypothetical site that reviews new cars; you pay expert authors to write reviews of new car models every year for this site. Unlike other automobile characteristics, reviews cannot be easily stored in a database and queried. Conceptually, your reviews are similar to this review for the 2008 Volvo S40 2.4i sedan on the automotive site Kelley Blue Book.

In the current paradigm, a typical element of the review is usually written something like this:

    <span id="ctl00">You'll Like This Car If...</span>
        ...description_positive...
    <span id="ctl00">You May Not Like This Car If...</span>
        ...description_negative...

For the future, imagine this: when your authors are originally composing this review, what if they could instead create it with semantic markup embedded:
(In this example, I use straight XML for simplicity; the actual format of the content could be RDF-triples, or some other improved format)

    <advantages>You'll Like This Car If...
        <text>...description_positive...</text>
    </advantages>
    <disadvantages>You May Not Like This Car If...
        <text>...description_negative...</text>
    </disadvantages>

then you can get more value out of the same content:

  (a) You can easily *re-purpose* the content in additional ways, such as for mobile devices, RSS feeds, web services APIs, mashups and so on
  (b) As search engines start to take advantage of semantic notation, you get SEO benefits
  (c) You can provide users with ways to query the content *intelligently* ("show me cars which are family-friendly AND don't roll over easily vs those that work better off-road AND seat 7"), using tools such as the recently-released SPARQL.
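As a rough illustration of the re-purposing point in (a), the sketch below reads a review written with the semantic markup and turns it into JSON that a web service API or mobile client could consume. It uses only the Python standard library. The wrapping review element, its model attribute and the function names are assumptions added for this example; only the advantages, disadvantages and text elements come from the markup discussed in this post.

```python
import json
import xml.etree.ElementTree as ET

# Sample review. The <review model="..."> wrapper is an assumption for this
# sketch; the inner elements follow the markup proposed in the post.
REVIEW_XML = """
<review model="2008 Volvo S40 2.4i">
    <advantages>You'll Like This Car If...
        <text>...description_positive...</text>
    </advantages>
    <disadvantages>You May Not Like This Car If...
        <text>...description_negative...</text>
    </disadvantages>
</review>
"""


def review_to_dict(xml_text):
    """Extract the meaningful parts of a review into a plain dictionary."""
    root = ET.fromstring(xml_text.strip())
    return {
        "model": root.get("model"),
        "advantages": root.findtext("advantages/text", default="").strip(),
        "disadvantages": root.findtext("disadvantages/text", default="").strip(),
    }


def review_to_json(xml_text):
    """Re-purpose the same review as JSON, e.g. for a web service response."""
    return json.dumps(review_to_dict(xml_text), indent=2)


if __name__ == "__main__":
    print(review_to_json(REVIEW_XML))
```

Doing the same thing with the span-based version would mean matching on the heading text, which breaks as soon as an author words a heading differently.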

As a content publisher, you want your content to be found and used as much as possible, and making it meaning-enabled is a big step in this direction. At the same time, you cannot ask authors to use a pure XML tool such as XMLSpy or an ontology editor like Protégé; and MS Word creates unreadable XML that specifies formatting rather than semantics.

A solution for this specific example already exists: Microformats could be applied to handle the problem of annotating the advantages and disadvantages. While the Microformat solution works very well for specific types of information - such as for describing people and addresses - it is too limited to be applicable in a general way to add semantic information to web content at large.

It seems to me that the general problem must be solved if we are to see large-scale adoption of the Semantic Web. It would be a boon to expert authors everywhere, including those who create news articles for the newspaper publishing industry. But there do not seem to be any solutions on the horizon, in terms of technologies, tools or processes to promote the creation of more meaning-rich content.

Reactions: But is there a Business Case?

When I put this question to a group of prominent bloggers and industry thought-leaders in the Semantic Web space, the results were not encouraging. There does not seem to be much interest in building Semantic authoring tools. The main stumbling block is the lack of a clear business model for publishers to embrace this approach.

Jeremy Liew of Lightspeed Venture Partners has recently penned a series of articles focused on the Semantic Web: Meaning = Data + Structure, based on user-generated structure, domain knowledge and user behavior, which focus on the problem of inferring meaning from content.

He questions the business rationale for authors to take the effort to add XML markup to their content, and points to domain-specific extraction approaches as the more likely solution:

The challenge with getting most authors to markup in XML is not just one of tools, but also of motivation IMO. Unless and until a clear business case advantage justifies the additional effort required, and that advantage is greater than other projects offer, you won't see much semantic markup except from academics and others whose interests are more philosophically driven than business driven.

That is why I think the domain specific extraction approaches will likely be more prevalent - the business advantage of better search and structure accrues to the person doing the extraction, and because it is domain specific, the additional effort is lessened

He's right, of course; domain-specific extraction approaches are definitely going to be popular, and are beginning to take off already. They provide significant added value for the extractor. However, extraction is difficult and expensive to do well, so the business case is somewhat dubious for the early adopters.

ReadWriteWeb's Alex Iskold is another thought leader in this space. He has a series of fantastic articles about the Semantic Web, including the problem of annotating data, the different approaches used, and a primer for the structured web.

His comments echoed those of Liew:

There seems to be little incentive for publishers to annotate information.

The problem is that if you go deep enough you hit RDF. The light version is Microformats. But the issue is not the format, its the incentive.


Tim O'Reilly wrote about this issue almost a year ago: Different Approaches to the Semantic Web, in which he echoes the same sentiment:

It seems easy enough, but why hasn't this approach taken off? Because there's no immediate benefit to the user. He or she has to be committed to the goal of building hidden structure into the data. It's an extra task, undertaken for the benefit of others. And as I've written before, one of the secrets of success in Web 2.0 is to harness self-interest, not volunteerism, in a natural "architecture of participation."

Conclusion

I guess I'm a minority of one. It seems to me that if content creators could add semantic meaning while constructing the content in the first place (which is, conceptually, only marginally more difficult for the authors), then the value of the content would increase exponentially at very low cost. That seems like a defensible business case for content publishers.

The business case for publishers to annotate existing web pages and content is certainly very weak. But for new content, if you're creating it for your site anyway, why wouldn't you add semantic markup to make it more findable and usable?

What do you think? Please leave a comment below or email the author (removing the ".aa" at the end) and let us know!



January 21, 2008

SPARQL: Query Language for the Semantic Web

The W3C has announced the publication of SPARQL, a language for querying distributed data on the web. Similar to the way SQL is a generic language used to query relational databases regardless of vendor, SPARQL will allow users and applications to create queries that express high-level goals across many different data sources, regardless of the database technology or data format involved.

From the W3C press release:

"Trying to use the Semantic Web without SPARQL is like trying to use a relational database without SQL," explained Tim Berners-Lee, W3C Director. "SPARQL makes it possible to query information from databases and other diverse sources in the wild, across the Web."

The combination of the SPARQL query language and protocol creates a Web service in its purest sense; running on top of HTTP or SOAP, it provides a standard Web service for anything which asks a question.

"SPARQL's focus on querying the data models saves time for developers; there's no need for a host of little Web services to retrieve different aspects of the state of a system," explained Lee Feigenbaum, Chair of the RDF Data Access Working Group. "This allows the user of the SPARQL endpoint to ask any question -- it is as though they could design their own interface instead of having to work with a limited set of fixed services."

The press release goes on to say that the SPARQL specification defines both a query language and a protocol, and works well with other Semantic Web technologies from the W3C: RDF, RDF Schema, OWL and GRDDL.

InfoWorld has a great article explaining this development in more detail [via Dave Cobley at Altiss]:

Already available in 14 known implementations, SPARQL is designed to be used at the scale of the Web to allow queries over distributed data sources independent of format. It also can be used for mashing up Web 2.0 data.


I see this as a very positive development for the Semantic Web field in general. At its core, the operation of the Semantic Web is composed of the following basic functions:

  • Creating content with meaning (either implicit, like XML, or explicit, like Tags)
  • Understanding or extracting the information from a block of content
  • Classifying the blocks of content (into a hierarchy, taxonomy or folksonomy)
  • Presenting the information in a variety of forms (web, mobile, web services API, mashups, embedded devices and so on)
  • Finding the information of interest; this information may have to be derived from the content provided

The rise of easy-to-use self-publishing tools has led to an explosion in the amount of content available on the Web, and being able to find the answer to a question from this mountain of information is vital.

But first users have to be able to express what they are looking for, in a meaningful way. It is this need that is being addressed by SPARQL, which allows users to formulate intelligent queries. These queries can then be used by agents and applications on our behalf to find us the information we need.
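To give a feel for what such a query does, here is a toy Python sketch of the core idea behind a SPARQL WHERE clause: matching a set of triple patterns, with shared variables, against a collection of (subject, predicate, object) statements. This is not a SPARQL engine and does not use any SPARQL library; the data, the property names and the functions are all made up for illustration. The query it mimics would look roughly like SELECT ?car WHERE { ?car :seats 7 . ?car :terrain "off-road" }.

```python
# Made-up data: each statement is a (subject, predicate, object) triple.
TRIPLES = [
    ("car:A", "seats", 7),
    ("car:A", "terrain", "off-road"),
    ("car:B", "seats", 5),
    ("car:B", "terrain", "city"),
    ("car:C", "seats", 7),
    ("car:C", "terrain", "city"),
]


def is_var(term):
    """Variables are strings that start with '?', as in SPARQL."""
    return isinstance(term, str) and term.startswith("?")


def match_pattern(pattern, triple, bindings):
    """Unify one pattern with one triple.

    Returns the bindings extended with any new variables, or None if the
    triple does not fit the pattern under the existing bindings.
    """
    extended = dict(bindings)
    for pattern_term, triple_term in zip(pattern, triple):
        if is_var(pattern_term):
            if pattern_term in extended and extended[pattern_term] != triple_term:
                return None
            extended[pattern_term] = triple_term
        elif pattern_term != triple_term:
            return None
    return extended


def query(patterns, triples):
    """Return every set of variable bindings that satisfies all patterns."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for bindings in solutions:
            for triple in triples:
                extended = match_pattern(pattern, triple, bindings)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


if __name__ == "__main__":
    # "Which cars seat 7 AND work off-road?"
    print(query([("?car", "seats", 7), ("?car", "terrain", "off-road")], TRIPLES))
```

The point is that the user states what they want as patterns over the data, and the shared variable ?car ties the conditions together, regardless of where the statements originally came from.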



December 08, 2007

Web Poll: What is the *Killer App* for Semantic Web technology?

I recently wrote an article describing working product demos for Semantic Web applications: The Semantic Web is becoming real - slowly. For our latest Web Poll, I would like to ask readers their opinion on this topic.


Which of the following do you see as the most likely Killer App for Semantic Web technology - the one that truly puts it on the map?

  • Internet Search (a la Powerset)
  • Enterprise applications - Supply Chain, Sales Force Automation, et al
  • Social Networking (a la twine)
  • Verticals - Travel, Finance, and the like
  • None; this technology is not real anyway!
  • Other: something else?
     

The web poll appears in the left side bar. Please vote and let us know what you think!

--------

Many thanks to everyone who voted in our previous poll about the most important component of a Search Engine. Check back next week to see the final scores.



November 29, 2007

The Semantic Web is becoming real - slowly

A couple of weeks ago, I attended an event from the SDForum in Palo Alto, featuring a series of project demos showcasing real applications built on the Semantic Web. While I was initially skeptical, I came away amazed at the social and semantic intelligence being built into the latest web applications.


Yahoo!

The most interesting demos came from Dr. Mor Naaman of Yahoo! - these projects were at once the most real and the least relevant to Semantic Web (at least, in its pure form).

TagMaps


Described as "a toolkit to visualize text (tags) geographically on a map", TagMaps allows the creation of applications that mashup text and geographical information (such as Flickr images) with Yahoo! Maps; Yahoo!'s sample application World Explorer is quite amazing. The most interesting thing about this application is that by combining the geo-tagging information about Flickr images with their corresponding tags and then displaying those tags on a map, the application accurately displays items of interest on the map - this is semantic information that has been extracted from the underlying raw data.
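I don't know how TagMaps is actually implemented, but the general idea of combining geo-tags with text tags can be sketched in a few lines of Python: bucket the photos into grid cells on the map, count the tags in each cell, and label each cell with its most common tag. The photo data, the grid size and the function names below are all invented for illustration.

```python
import math
from collections import Counter, defaultdict

# Invented geotagged photos: (latitude, longitude, tags).
PHOTOS = [
    (37.81, -122.48, ["goldengate", "bridge"]),
    (37.82, -122.47, ["bridge", "fog"]),
    (37.76, -122.42, ["mural", "mission"]),
    (37.77, -122.41, ["mission"]),
]


def cell_for(lat, lon, cell_size=0.05):
    """Snap a coordinate to a grid cell of cell_size degrees."""
    return (math.floor(lat / cell_size), math.floor(lon / cell_size))


def top_tag_per_cell(photos, cell_size=0.05):
    """Label each grid cell with the tag used most often by photos inside it."""
    counts = defaultdict(Counter)
    for lat, lon, tags in photos:
        counts[cell_for(lat, lon, cell_size)].update(tags)
    return {
        cell: counter.most_common(1)[0][0]
        for cell, counter in counts.items()
        if counter  # skip cells whose photos carried no tags
    }


if __name__ == "__main__":
    for cell, tag in top_tag_per_cell(PHOTOS).items():
        print(cell, tag)
```

No single photographer said "this spot is a bridge"; the label emerges from many independent tags that happen to share a location.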

ZoneTag



ZoneTag can automatically tag your photos with geographical information; in addition, it can suggest tags for the photo based on the location. This makes it easy to tag photos taken on a cell phone with both types of information.

FireEagle


FireEagle, currently in closed alpha testing, is billed as "a new way to share your location with friends or with other websites and services". The main idea is to create a new user location platform that any third-party can leverage to read and write the location of the user.


Radar Networks

Any set of Semantic Web demos would be incomplete without an entry from Radar Networks. Nova Spivack, CEO of Radar, presented a demo of their offering, twine [tagline: "using information as context"], which is basically a new social network to which Semantic Web concepts have been applied. twine, currently in closed beta, has been getting a lot of press recently as the first true Semantic Web application.

I have to admit, the demo was quite impressive. Mr. Spivack created a new "twine", assigning a series of web pages, articles and other web information to the twine, and the application extracted a whole range of meaning from the content - automatically assigning tags about topics, people, links, locations, even concepts. It was a cool thing to watch!

While this exercise clearly demonstrated that the underlying technology works, and works well - clearly, great things lie ahead for the Semantic Web - I was less than impressed by the actual application chosen by Radar Networks (maybe I just don't see it yet). Does the world really need another custom home page or social networking application, even one that harnesses the Semantic Web?


SRI

Adam Cheyer from SRI presented a demo of an experimental project named CALO. CALO, which stands for Cognitive Assistant that Learns and Organizes, is a DARPA-funded project that gathers the user's context and supports dynamic decision-making. In effect, the "software assistant" learns by watching everything you do, so that it can eventually make intelligent suggestions: for example, it can act as a search assistant or suggest alternate knowledge users for a meeting. A parallel project, CALO Express, is a productized Windows version for commercial use.

An intelligent software assistant is a noble goal, but watching the slides, I wondered if it would get traction commercially - the idea of this virtual assistant watching everything I do was slightly creepy; it's probably a better fit for a more controlled world, such as a defense lab or that perennial Hollywood favorite, a "top-secret government project".


PARC

The folks from the legendary Xerox PARC demonstrated Magitti, a "mobile leisure guide". By implicitly collecting information about the user's behavior within their mobile device, the application learns about your interests within a given context; this is then used to guide the user by suggesting other activities by location, time of day and social peer behavior. Again, a good idea, perfect for today's Facebook-fed generation.


Semantic Web or Privacy: Pick one!

The demos were all very cool and worked flawlessly - it is amazing how much meaning can be gleaned by an application by combining data about geography, time, context and peer groups. At the same time, it requires participants to willingly share information in order to avail of the benefits of semantic processing. Is it a good trade-off, one that users are willing to accept? That remains to be seen. As the early commercial applications of Semantic Web become widespread and more easily available, the answer is likely to become increasingly obvious.


