March 20, 2008

Tim O'Reilly and Sir Tim Berners-Lee concur: Semantic Web Likely to be Top-Down

In a previous post, I asked the question "Where are the Meaning-Enabled Authoring Tools?", arguing that publishers who regularly post similar content (especially content that conforms to common formats) would gain a big advantage from using Semantic Authoring tools to create new content. With semantic tools, you not only get SEO benefits and improved findability; the content can also be re-purposed more easily for other uses such as web applications and services.

This is essentially a bottom-up approach to the semantic web: adding semantic notation to the content itself. However, as the post went on to say, the prevailing view is definitely a top-down one, viz. that semantic meaning will have to be extracted by applications from perfectly ordinary web pages, and that the adding of semantic knowledge to the content itself is unlikely (aside from very limited contexts, such as Microformats).

Two recent podcasts with two of the leading voices in this space further confirm this view.

February 24, 2008

Semantic Web: Where are the Meaning-Enabled Authoring Tools?

Jason Kolb sees it as a way to identify data objects using URIs. John Markoff, of the New York Times, calls it Web 3.0. And Nova Spivack has a long post clarifying what it is Not.

What are all these authors talking about? The Semantic Web - much has been written recently about its concepts, approaches and applications. But there's something missing, a piece that hasn't generated much interest to date.

In terms of understanding, finding and displaying content, there is no doubt that the Semantic Web is slowly becoming real (e.g. there were some great demos at a recent SDForum meet). However, a gap is emerging with Content Authoring tools, which have not yet made this paradigm shift.

On the one hand, most authors are comfortable with, and proficient in, desktop authoring tools such as Microsoft Word, FrontPage, Adobe GoLive and others; this is especially true for professionals and other experts who create technical reference content for web applications, such as legal references, accounting manuals or engineering documents. The current crop of authoring tools produces visually high-quality articles and web pages, but their XML or RDF creation capabilities are severely limited.

On the other hand, parsing Word documents or HTML web pages to extract meaningful structure from them gives poor results; much of the semantic knowledge of the content is lost. There do not appear to be any popular tools that create Semantic content natively and yet are natural and easy for a content author to use.

Top-Down? Or Bottom-Up?

Of course, there are ways to get around this issue to some extent. Allowing authors or readers to add tags to articles or posts allows a measure of classification, but it does not capture the true semantic essence of the document. Automated Semantic Parsing (especially within a given domain) is on the way, a la Spock, twine and Powerset (see writeup), but it is currently limited in scope and needs a lot of computing power. In addition, if we could put the proper tools in the authors' hands in the first place, extracting the semantic meaning would be so much easier.

For example, imagine that you are building an online repository of content, using paid expert authors or community collaboration, to create a large number of similar records - say, a cookbook of recipes, a stack of electrical circuit designs, or something similar. Naturally, you would want to create domain-specific semantic knowledge of your stack at the same time, so that you can classify and search for content in a variety of ways, including by using intelligent queries.

Ideally, the authors would create the content as meaningful XML text or RDF triples, so that parsing the semantics would be much easier. A side benefit is that this content can then be easily published in a variety of ways, and there would be SEO benefits as well if search engines could understand it more easily. But tools that create information in this way, and yet are natural and easy for authors to use, don't appear to be on their way; and the creation of a custom tool for each individual domain seems a difficult and expensive proposition.
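To make the idea of content-as-triples a little more concrete, here is a minimal Python sketch for the cookbook case. Everything in it is an assumption made for illustration: the recipe identifiers, the predicate names (`hasIngredient`, `cookingTimeMinutes`) and the helper functions are invented, and plain tuples stand in for a real RDF store.

```python
# Hypothetical recipe data as (subject, predicate, object) triples.
# All identifiers and predicate names are invented for illustration.
TRIPLES = [
    ("recipe:pad-thai", "type", "Recipe"),
    ("recipe:pad-thai", "hasIngredient", "rice noodles"),
    ("recipe:pad-thai", "hasIngredient", "peanuts"),
    ("recipe:pad-thai", "cookingTimeMinutes", 30),
    ("recipe:omelette", "type", "Recipe"),
    ("recipe:omelette", "hasIngredient", "eggs"),
    ("recipe:omelette", "cookingTimeMinutes", 10),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


def quick_recipes_without(triples, ingredient, max_minutes):
    """Recipes that cook within max_minutes and do not use the ingredient."""
    results = []
    for (s, _, _) in match(triples, predicate="type", obj="Recipe"):
        uses = match(triples, subject=s, predicate="hasIngredient", obj=ingredient)
        times = [o for (_, _, o) in match(triples, subject=s, predicate="cookingTimeMinutes")]
        if not uses and times and times[0] <= max_minutes:
            results.append(s)
    return results


print(quick_recipes_without(TRIPLES, "peanuts", 45))
```

Because the structure is captured when the record is written, a question like "quick recipes without peanuts" becomes a simple pattern match rather than a text-extraction problem.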

Car Review Example

As a more concrete example: imagine that you control a web site called New-Car-Reviews.com, a hypothetical site that reviews new cars; you pay expert authors to write reviews of new car models every year for this site. Unlike other automobile characteristics, reviews cannot be easily stored into a database and queried. Conceptually, your reviews are similar to this review for the 2008 Volvo S40 2.4i sedan on the automotive site Kelley Blue Book.

In the current paradigm, a typical element of the review is usually written something like this:

    <span id="ctl00">You'll Like This Car If...</span>
        ...description_positive...
    <span id="ctl00">You May Not Like This Car If...</span>
        ...description_negative...

For the future, imagine this: when your authors are originally composing this review, what if they could instead create it with semantic markup embedded:
(In this example, I use straight XML for simplicity; the actual format of the content could be RDF-triples, or some other improved format)

    <advantages>You'll Like This Car If...
        <text>...description_positive...</text>
    </advantages>
    <disadvantages>You May Not Like This Car If...
        <text>...description_negative...</text>
    </disadvantages>

then you can get more value out of the same content:

  (a) You can easily *re-purpose* the content in additional ways, such as for mobile devices, RSS feeds, web services APIs, mashups and so on
  (b) As search engines start to take advantage of semantic notation, you get SEO benefits
  (c) You can provide users with ways to query the content *intelligently* ("show me cars which are family-friendly AND don't roll over easily vs those that work better off-road AND seat 7"), using tools such as the recently-released SPARQL.
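As a rough illustration of point (a), the sketch below reads a review marked up in the way shown above and re-purposes it as a one-line feed summary, using only Python's standard library. The `<review>` wrapper element, its `model` attribute and the function names are assumptions added for this example; they are not part of any existing format.

```python
import xml.etree.ElementTree as ET

# Hypothetical review document. The <review> wrapper and its "model"
# attribute are assumptions added so the fragment has a single root.
REVIEW_XML = """
<review model="2008 Volvo S40 2.4i">
    <advantages>You'll Like This Car If...
        <text>...description_positive...</text>
    </advantages>
    <disadvantages>You May Not Like This Car If...
        <text>...description_negative...</text>
    </disadvantages>
</review>
"""


def parse_review(xml_text):
    """Pull the semantically tagged parts of a review into a dict."""
    root = ET.fromstring(xml_text.strip())
    return {
        "model": root.get("model"),
        "advantages": root.findtext("advantages/text", default="").strip(),
        "disadvantages": root.findtext("disadvantages/text", default="").strip(),
    }


def to_feed_summary(review):
    """Re-purpose the same content as a short line for a feed or mobile view."""
    return f"{review['model']}: + {review['advantages']} / - {review['disadvantages']}"


review = parse_review(REVIEW_XML)
print(to_feed_summary(review))
```

The point is not the particular output format: once the advantages and disadvantages are labelled as such, each new use of the content is a few lines of code rather than a fresh scraping exercise.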

As a content publisher, you want your content to be found and used as much as possible, and making it meaning-enabled is a big step in this direction. At the same time, you cannot ask authors to use a pure XML tool such as XMLSpy or an ontology editor like Protégé; and MS Word creates unreadable XML that specifies formatting rather than semantics.

A solution for this specific example already exists: Microformats could be applied to handle the problem of annotating the advantages and disadvantages. While the Microformat solution works very well for specific types of information - such as for describing people and addresses - it is too limited to be applicable in a general way to add semantic information to web content at large.

It seems to me that the general problem must be solved if we are to see large-scale adoption of the Semantic Web. It would be a boon to expert authors everywhere, including those who create news articles for the newspaper publishing industry. But there do not seem to be any solutions on the horizon, in terms of technologies, tools or processes to promote the creation of more meaning-rich content.

Reactions: But is there a Business Case?

When I put this question to a group of prominent bloggers and industry thought-leaders in the Semantic Web space, the results were not encouraging. There does not seem to be much interest in building Semantic authoring tools. The main stumbling block is the lack of a clear business model for publishers to embrace this approach.

Jeremy Liew of Lightspeed Venture Partners has recently penned a series of articles on the Semantic Web: Meaning = Data + Structure, based on user-generated structure, domain knowledge and user behavior, which focus on the problem of inferring meaning from content.

He questions the business rationale for authors to take the effort to add XML markup to their content, and points to domain-specific extraction approaches as the more likely solution:

The challenge with getting most authors to markup in XML is not just one of tools, but also of motivation IMO. Unless and until a clear business case advantage justifies the additional effort required, and that advantage is greater than other projects offer, you won't see much semantic markup except from academics and others whose interests are more philosophically driven than business driven.

That is why I think the domain specific extraction approaches will likely be more prevalent - the business advantage of better search and structure accrues to the person doing the extraction, and because it is domain specific, the additional effort is lessened

He's right, of course; domain-specific extraction approaches are definitely going to be popular, and are beginning to take off already. They provide significant added value for the extractor. However, extraction is difficult and expensive to do well, so the business case is somewhat dubious for the early adopters.

ReadWriteWeb's Alex Iskold is another thought leader in this space. He has a series of fantastic articles about the Semantic Web, including the problem of annotating data, the different approaches used, and a primer for the structured web.

His comments echoed those of Liew:

There seems to be little incentive for publishers to annotate information.

The problem is that if you go deep enough you hit RDF. The light version is Microformats. But the issue is not the format, its the incentive.


Tim O'Reilly wrote about this issue almost a year ago: Different Approaches to the Semantic Web, in which he echoes the same sentiment:

It seems easy enough, but why hasn't this approach taken off? Because there's no immediate benefit to the user. He or she has to be committed to the goal of building hidden structure into the data. It's an extra task, undertaken for the benefit of others. And as I've written before, one of the secrets of success in Web 2.0 is to harness self-interest, not volunteerism, in a natural "architecture of participation."

Conclusion

I guess I'm a minority of one. It seems to me that if content creators could add semantic meaning while constructing the content in the first place (which is, conceptually, only marginally more difficult for the authors), then the value of the content would increase exponentially at very low cost. That seems like a defensible business case for content publishers.

The business case for publishers to annotate existing web pages and content is certainly very weak. But for new content, if you're creating it for your site anyway, why wouldn't you add semantic markup to make it more findable and usable?

What do you think? Please leave a comment below or email the author (removing the ".aa" at the end) and let us know!



January 15, 2008

Indirect Business Models for Blogs

Fred Wilson wrote an interesting post yesterday on the A VC blog: The Long Tail Of Business Models, in response to an earlier article about Media Business Models by Chris Anderson, who first popularized the Long Tail concept.

In his post, Wilson gives us a long list of monetization strategies for free content such as blogs; some are very popular and others less so. A few of the less common ones are reproduced below:


  • Lead generation (you pay for qualified names of potential customers)
  • Subscription revenues
  • Rental of subscriber lists
  • Licensing of brand (people pay to use a media brand as implied endorsement)
  • Alternate output (pdf; print/print-on-demand; customized Shared Book style; etc.)
  • Live events
  • Cost Per Install (popular with top Facebook apps who can help others get installs)
  • Sponsorships (ads of some sort that are sold based on time, not on the number of impressions)
  • Listings (paying a time based amount to list something like a job or real estate on your website)
  • Streaming Audio Advertising (like radio advertising delivered in the audio stream after a certain amount of audio content has been delivered)
  • Streaming Video Advertising (like streaming audio but in video)
  • API Fees (charging third parties to access your API)

The full list is available in his post. Overall, this is extremely valuable for any publisher of free content.

To Wilson's list, I would add the following strategies for generating indirect revenue - i.e. more in line with Business Development. These strategies are not directly monetizable, but equally real all the same, and can be converted into actual income with a bit of effort.

Indirect Revenue Strategies for Blogs

  • Lead-In to Consulting Business; this is more specific than, but a subset of, generic referrals and lead generation
  • Book Writing Opportunities; your blog allows you to gain credibility, build an audience and interact directly with your readers
  • Lead-In to Education Business, such as Classes and Webinars
  • Gather Market Intelligence, using Polls, Surveys, Feedback et al
  • Networking (in the good sense of the word) - you can find others with similar thoughts and interests
  • Define your own Viral Meme; for example, here's one viral term: "Web 2.0"

In addition, of course, there are the intangibles, such as name recognition for authors, increased visibility for brands and fresh content - which equates to increased traffic and SEO benefits - for publishers.

If you know of any additional ideas for indirect monetization, please leave a comment below (or comment on either of the main articles referred to).



December 28, 2007

How long before the walls around content come crashing down?

Scott Karp of Publishing 2.0 has posted an interesting article today: What Is The ROI Of Requiring User Registration To Access Online Content?, in which he takes a close look at the registration wall used by the New York Times online and wonders whether it is worthwhile.

The theory goes that personal data collected from registered users enables sites to better target ads and charge premium rates. But I wonder whether the lost traffic from users who choose not to jump through the registration hoop — which I bet is particularly true of NYTimes’ large volume of visitors from search engines — outweighs the gain of higher ads rates (assuming NYTimes.com is consistently able to charge higher rates).

As Karp notes, the registration requirement presents a barrier to access for users who come in through a search engine, at a time when NYTimes.com is focused on growing its readership beyond the current regular readers. These casual users are exactly the type likely to have a lower tolerance for jumping through registration hoops, notwithstanding the NYTimes.com claim that registration takes "only a minute".

In one of the comments to the article, Howard Owens responds by questioning a critical assumption; Owens asserts that the registration requirement does not, in fact, cause traffic to drop.

I’ve run two registration sites, and have spoken with other newspaper.com site managers who have run their own registration-required sites, and two things I found to be true based on empirical evidence:

1) There is no drop off in traffic past the first 60 days of registration (after 60 days, traffic exceeds pre-registration numbers and continues to grow).
...


Personally I believe that the ROI of requiring user registration is questionable at best. Intuitively, it makes sense that at least some users will get discouraged and drop off when confronted with a "registration required" notice; so there's bound to be some negative impact, with all due respect to Howard Owens [perhaps the numbers he saw can be explained by other changes that happened at the same time, such as SEO enhancements that bumped up traffic, compensating for the impact of the registration?].

At the same time, there is another major trend currently under way that will increase the importance of this debate, and in my opinion, accelerate the crumbling of these registration and payment walls.

This big change is in user behavior. Individual consumers are increasingly flocking first to the major search engines when looking for information and data, rather than to individual web sites, even when they already know high-quality sites that can provide the information. It makes sense from the user's point of view: the user wants to find high-quality content in general, regardless of source, and using a favorite search engine is a quick, easy and comfortable way to do that. This overall trend is inevitable and irreversible. As Don Dodge noted in a recent article: Search engines are the Start page for the Internet.

Anecdotal evidence suggests that, even for major web sites with strong brands, the share of total traffic arriving from search engines is increasing (although I do not have hard numbers to back up this claim). This forces content publishers to open up more information in order to satisfy those users, which further solidifies the position of search engines as the starting point, which in turn forces publishers to open up yet more information, and so on, in a self-reinforcing cycle.

As publishers see this changing user mix - a higher percentage of traffic consisting of new users coming from search engines - engaging those users will increase in importance, and putting barriers in their path will be less acceptable. Instead, publishers will be forced to find new and innovative models for monetization; similarly, user tracking methods will need to be improved to collect data implicitly rather than requiring explicit action from the user.

As the user starts interacting with the site - if she wants to comment, post or otherwise participate, for example - then progressive upsells into registration and payment are perfectly valid and acceptable. By that point, the site is dealing with thoroughly-engaged users, not casual visitors.

I see it as a question of time before 99% of content from major publishers (NYTimes.com included) becomes free and openly accessible on the Web.

To paraphrase Cory Doctorow (he of the free books!) and Tim O'Reilly, on content: the real danger isn't loss of revenue through sharing, it's obscurity and irrelevance.


