Jason Kolb sees it as a way to identify data objects using URIs. John Markoff of the New York Times calls it Web 3.0. And Nova Spivack has a long post clarifying what it is not.
What are all these authors talking about? The Semantic Web. Much has been written recently about its concepts, approaches and applications. But there's something missing, a piece that hasn't generated much interest to date.
In terms of understanding, finding and displaying content, there is no doubt that the Semantic Web is slowly becoming real (e.g. there were some great demos at a recent SDForum meet). However, a gap is emerging with content authoring tools, which have not yet made this paradigm shift.
On the one hand, most authors are comfortable with, and proficient in, desktop authoring tools such as Microsoft Word, FrontPage, Adobe GoLive and others; this is especially true for professionals and other experts who create technical reference content for web applications, such as legal references, accounting manuals or engineering documents. The current crop of authoring tools produces visually high-quality articles and web pages, but their XML or RDF creation capabilities are severely limited.
On the other hand, parsing Word documents or HTML web pages to extract meaningful structure from them gives poor results; much of the semantic knowledge of the content is lost. There do not appear to be any popular tools that create semantic content natively and yet are natural and easy for a content author to use.
Top-Down? Or Bottom-Up?
Of course, there are ways to get around this issue to some extent. Allowing authors or readers to add tags to articles or posts gives a measure of classification, but it does not capture the true semantic essence of the document. Automated semantic parsing (especially within a given domain) is on the way - a la Spock, Twine and Powerset (see writeup) - but it is currently limited in scope and needs a lot of computing power. In addition, if we could put the proper tools in the authors' hands in the first place, extracting the semantic meaning would be so much easier.
For example, imagine that you are building an online repository of content, using paid expert authors or community collaboration, to create a large number of similar records - say, a cookbook of recipes, a stack of electrical circuit designs, or something similar. Naturally, you would want to create domain-specific semantic knowledge of your stack at the same time, so that you can classify and search for content in a variety of ways, including by using intelligent queries. Ideally, the authors would create the content as meaningful XML text or RDF triples, so that parsing the semantics would be much easier. A side benefit is that this content could then be easily published in a variety of ways, and there would be SEO benefits as well if search engines could understand it more easily. But tools that create information in this way, and yet are natural and easy for authors to use, don't appear to be on their way; and creating a custom tool for each individual domain seems a difficult and expensive proposition.
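As a rough illustration of what "content as RDF triples" could mean for the cookbook case, here is a minimal Python sketch. It uses plain (subject, predicate, object) tuples rather than a real RDF library, and every name in it (the `recipe:` identifiers, `hasIngredient`, `cookingTimeMinutes`, the helper `subjects`) is a hypothetical choice made up for this example, not part of any standard vocabulary.

```python
# Hypothetical recipe facts stored as (subject, predicate, object) triples.
# The identifiers and predicate names are invented for illustration only.
TRIPLES = {
    ("recipe:pancakes", "hasIngredient", "flour"),
    ("recipe:pancakes", "hasIngredient", "egg"),
    ("recipe:pancakes", "cookingTimeMinutes", 20),
    ("recipe:omelette", "hasIngredient", "egg"),
    ("recipe:omelette", "cookingTimeMinutes", 10),
}


def subjects(triples, predicate, obj):
    """Return the sorted subjects that have the given predicate and object."""
    return sorted(s for (s, p, o) in triples if p == predicate and o == obj)


# "Which recipes use egg?"
egg_recipes = subjects(TRIPLES, "hasIngredient", "egg")

# "Which recipes take 15 minutes or less?"
quick_recipes = sorted(
    s for (s, p, o) in TRIPLES if p == "cookingTimeMinutes" and o <= 15
)

# Combining the two conditions gives a simple "intelligent query".
quick_egg_recipes = [s for s in egg_recipes if s in quick_recipes]

print(egg_recipes)
print(quick_egg_recipes)
```

The point is only that once the author's content arrives as explicit statements, questions like these become simple filters; no text parsing or guessing is needed.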
Car Review Example
As a more concrete example: imagine that you control a web site called New-Car-Reviews.com, a hypothetical site that reviews new cars; you pay expert authors to write reviews of new car models every year for this site. Unlike other automobile characteristics, reviews cannot be easily stored in a database and queried. Conceptually, your reviews are similar to this review for the 2008 Volvo S40 2.4i sedan on the automotive site Kelley Blue Book.
In the current paradigm, a typical element of the review is usually written something like this:
<span id="ctl00">You'll Like This Car If...</span>
...description_positive...
<span id="ctl00">You May Not Like This Car If...</span>
...description_negative...
For the future, imagine this: when your authors are originally composing this review, what if they could instead create it with semantic markup embedded:
(In this example, I use straight XML for simplicity; the actual format of the content could be RDF triples, or some other improved format)
<advantages>You'll Like This Car If...
<text>...description_positive...</text>
</advantages>
<disadvantages>You May Not Like This Car If...
<text>...description_negative...</text>
</disadvantages>
then you can get more value out of the same content:
(a) You can easily *re-purpose* the content in additional ways, such as for mobile devices, RSS feeds, web services APIs, mashups and so on.
(b) As search engines start to take advantage of semantic notation, you get SEO benefits.
(c) You can provide users with ways to query the content *intelligently* ("show me cars which are family-friendly AND don't roll over easily vs those that work better off-road AND seat 7"), using tools such as the recently released SPARQL.
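To make points (a) and (c) a little more concrete, here is a minimal Python sketch using only the standard library. It assumes each review is wrapped in a `<review>` element with a `model` attribute; that wrapper, the sample car names and the review wording are all invented for this example. The keyword filter is only a stand-in for a real query language such as SPARQL, which would run over RDF rather than over this XML.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample data: the <reviews>/<review> wrappers, the model
# attribute and the review wording are assumptions made for this sketch.
REVIEWS_XML = """
<reviews>
  <review model="Sedan A">
    <advantages>You'll Like This Car If...
      <text>you want a family-friendly sedan with a stable ride</text>
    </advantages>
    <disadvantages>You May Not Like This Car If...
      <text>you need to seat seven</text>
    </disadvantages>
  </review>
  <review model="SUV B">
    <advantages>You'll Like This Car If...
      <text>you drive off-road and need to seat seven</text>
    </advantages>
    <disadvantages>You May Not Like This Car If...
      <text>you want low fuel costs</text>
    </disadvantages>
  </review>
</reviews>
"""


def parse_reviews(xml_source):
    """Turn each <review> into a plain dict keyed by its semantic role."""
    root = ET.fromstring(xml_source)
    reviews = []
    for review in root.findall("review"):
        reviews.append({
            "model": review.get("model"),
            "advantages": review.findtext("advantages/text", default="").strip(),
            "disadvantages": review.findtext("disadvantages/text", default="").strip(),
        })
    return reviews


def models_with_advantages(reviews, *keywords):
    """Return models whose advantages text mentions every keyword."""
    return [
        r["model"] for r in reviews
        if all(k.lower() in r["advantages"].lower() for k in keywords)
    ]


reviews = parse_reviews(REVIEWS_XML)

# (a) Re-purposing: the same content as JSON, e.g. for a web services API.
print(json.dumps(reviews, indent=2))

# (c) Querying: cars praised for off-road use AND seating seven.
print(models_with_advantages(reviews, "off-road", "seat seven"))
```

Because the advantages and disadvantages are labeled at authoring time, the code never has to guess which paragraph is which; with the span-based HTML version, that distinction would have to be reverse-engineered from headings.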
As a content publisher, you want your content to be found and used as much as possible, and making it meaning-enabled is a big step in this direction. At the same time, you cannot ask authors to use a pure XML tool such as XMLSpy or an ontology editor like Protégé, and MS Word creates unreadable XML that specifies formatting rather than semantics.
A solution for this specific example already exists: Microformats could be applied to handle the problem of annotating the advantages and disadvantages. But while Microformats work very well for specific types of information - such as describing people and addresses - they are too limited to be applied in a general way to add semantic information to web content at large.
It seems to me that the general problem must be solved if we are to see large-scale adoption of the Semantic Web. It would be a boon to expert authors everywhere, including those who create news articles for the newspaper publishing industry. But there do not seem to be any solutions on the horizon, in terms of technologies, tools or processes, to promote the creation of more meaning-rich content.
Reactions: But is there a Business Case?
When I put this question to a group of prominent bloggers and industry thought-leaders in the Semantic Web space, the results were not encouraging. There does not seem to be much interest in building Semantic authoring tools. The main stumbling block is the lack of a clear business model for publishers to embrace this approach.
Jeremy Liew of Lightspeed Venture Partners has recently penned a series of articles on the Semantic Web: Meaning = Data + Structure, based on user-generated structure, domain knowledge and user behavior, all of which focus on the problem of inferring meaning from content.
He questions the business rationale for authors to take the effort to add XML markup to their content, and points to domain-specific extraction approaches as the more likely solution:
The challenge with getting most authors to markup in XML is not just one of tools, but also of motivation IMO. Unless and until a clear business case advantage justifies the additional effort required, and that advantage is greater than other projects offer, you won't see much semantic markup except from academics and others whose interests are more philosophically driven than business driven.
That is why I think the domain specific extraction approaches will likely be more prevalent - the business advantage of better search and structure accrues to the person doing the extraction, and because it is domain specific, the additional effort is lessened
He's right, of course; domain-specific extraction approaches are definitely going to be popular, and are beginning to take off already. They provide significant added value for the extractor. However, extraction is difficult and expensive to do well, so the business case is somewhat dubious for the early adopters.
ReadWriteWeb's Alex Iskold is another thought leader in this space. He has a series of fantastic articles about the Semantic Web, including the problem of annotating data, the different approaches used, and a primer for the structured web.
His comments echoed those of Liew:
There seems to be little incentive for publishers to annotate information.
The problem is that if you go deep enough you hit RDF. The light version is Microformats. But the issue is not the format, its the incentive.
Tim O'Reilly wrote about this issue almost a year ago in Different Approaches to the Semantic Web, in which he echoes the same sentiment:
It seems easy enough, but why hasn't this approach taken off? Because there's no immediate benefit to the user. He or she has to be committed to the goal of building hidden structure into the data. It's an extra task, undertaken for the benefit of others. And as I've written before, one of the secrets of success in Web 2.0 is to harness self-interest, not volunteerism, in a natural "architecture of participation."
Conclusion
I guess I'm a minority of one. It seems to me that if content creators could add semantic meaning while
constructing the content in the first place (which is, conceptually,
only marginally more difficult for the authors), then the value of the
content would increase exponentially at very low cost. That seems like a defensible
business case for content publishers.
The business case for publishers to annotate existing web pages and content is certainly very weak. But for new content, if you're creating it for your site anyway, why wouldn't you add semantic markup to make it more findable and usable?
What do you think? Please leave a comment below or email the author and let us know!