February 17, 2008

Social Data: Observations from "Search & The Social Graph" Event

Dave McClure moderated an event on Search & The Social Graph at the Yahoo! campus this week, organized by the Search SIG of the Software Development Forum. With the meteoric rise of Facebook and the heightened interest in leveraging the social graph - both Google and Yahoo! have launched new APIs and OpenSocial is gaining momentum - this discussion was timely and attendance was strong.

The panelists represented some of the most interesting players in this space:

  • Kevin Marks from Google
  • Aditya Agarwal from Facebook
  • Kent Brewster of Yahoo!
  • Eve Phillips, CEO of Chirp

It turned out to be an interesting event, with lots of good discussion about the implications of portability, privacy, utility and monetization of social data. No stranger to the social data space, moderator McClure did an outstanding job of keeping things focused and the discussion lively; he was clearly knowledgeable and well-prepared, launching into a series of leading questions that moved the conversation forward.

Key Observations

By grouping together related comments, I've distilled the discussion at this event into the following topics:

1. Relevance of Search Results

- With the explosion of self-publishing and user-generated content on the web, the type of data getting created on the web is changing, and the classic search algorithms are becoming less effective.
- Users are increasingly interested in what their friends and peers are doing online.
- By using a social graph to filter out results during a specific search, you can boost the relevance of search results.
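
To make the last point concrete, here is a minimal sketch of social re-ranking. It is only an illustration, not how any panelist's product works: the `Result` type, the `endorsed_by` field, the `boost` parameter and all the sample data are assumptions of mine. It re-orders results rather than filtering them out, so nothing is hidden from the user.

```python
from dataclasses import dataclass, field


@dataclass
class Result:
    url: str
    base_score: float  # relevance from a classic ranking algorithm (assumed given)
    endorsed_by: set = field(default_factory=set)  # users who shared or bookmarked it


def social_rerank(results, friends, boost=0.5):
    """Re-rank results, boosting those endorsed by the searcher's friends."""
    def score(result):
        # Count how many of the searcher's friends endorsed this result.
        overlap = len(result.endorsed_by & friends)
        return result.base_score * (1 + boost * overlap)

    return sorted(results, key=score, reverse=True)


# Hypothetical sample data, for illustration only.
results = [
    Result("a.example", 0.9, {"zed"}),
    Result("b.example", 0.7, {"ann", "bob"}),
    Result("c.example", 0.8),
]
ranked = social_rerank(results, friends={"ann", "bob"})
print([r.url for r in ranked])
```

With these made-up numbers, the result endorsed by two friends moves from last place to first, even though its base score is the lowest of the three.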

2. Monetization

- It is no longer uncommon for a person to become a media source, using tools such as Twitter, blogs and RSS feeds; but this is hard to monetize. A referral model works better in this case than advertising.
- Brand advertising is still big, even for social search, but it works differently than for targeted search.
- Online brand advertising will move into more interactive experiences in the future.
- The key question is: Does membership in a social group signal an intention that can be targeted by advertisers? The panelists felt that, on balance, it did not.
- For a more concrete example: Google's directed search is very monetizable; Facebook has a lot of social data, but user behavior is not very monetizable.

3. Privacy

- There is a clear difference between a publicly-proclaimed graph, such as the friends on Facebook, and a private list, such as Email contacts; application developers will ignore this distinction at their peril
- Yahoo!'s Brewster said it best: "There should never be a privacy surprise for the user!"
- Applications should make it clear to users if they are making data public or private; e.g. Flickr is three-valued in this regard

4. Interaction Levels

- From a monetization perspective, all "friends" are not created equal; some connections in the social graph are stronger than others
- The smallest inner set of friends is the most valuable; the first 25 people have 80% of the value
- The viral rate of promotion in Facebook is incredible
- If users can annotate connections, they can more fully express their network graph
- You can infer relationships from user behavior, such as sites visited and click-throughs
- The most important part of social data is the connections, followed by the profile; eventually, it gives you the ability to answer the question: "Who should you go to, to answer this question?"

5. OpenSocial

- OpenSocial allows application developers to write one application, and then take it to where the users are on diverse other social networks
- The vision: take some of the good parts of Facebook and bring those to a lot of people
- This allows any application to spread through the social graph

6. Social Email

- Email networks have a lot of connection data, which has social data buried in it
- These connections can either be one-way or two-way; the difference signals intent on the part of the user
- Google's Marks made an interesting point: a person's email address and personal URL are opposites - with the former, you can communicate with that person; with the latter, the person communicates with you

Facebook

Facebook's Agarwal did a great job of articulating the company's approach to some of these issues. His contributions to the discussion were somewhat Facebook-centric; but given the strong community interest in Facebook lately, this only added to the value of the panel.

In discussing the value of social data for search, Agarwal compared the issues of selecting for relevance among a large number of results for a targeted search, with those of producing Facebook's news feed, which must also present a large amount of data to the user in a format that's easy to consume.

In terms of privacy, Facebook wants to allow users to annotate the social graph, so that they can fully express their network. This will allow users to separate their strong connections from casual friends. The size of a user's graph is another dimension to be considered.

On data portability, Facebook currently has no plans to implement features that enable it. Agarwal clarified that although the company supports data portability initiatives philosophically, it has not determined this to be the best use of resources at this time.

Finally, although Agarwal did not acknowledge this directly, the panelists agreed that the Facebook-type social network data and searches are far less monetizable than directly targeted activities that display clear intent, such as a Google search.

Chirp

This was the first time I saw a demo of Chirp. Eve Phillips, Chirp's CEO, showed chirpscreen, an interactive screen saver that displays content from your social network, such as pictures from Flickr and status messages from Facebook. On the whole, the audience loved it - a series of photos of her friends kept popping up on the screen - but there were some concerns about being able to control what gets shown. According to Phillips, Chirp is planning to introduce new features soon that will allow users to set preferences for what content is displayed, from which sources, and so on.

Open Questions

McClure asked the panelists some incisive questions, which deserve to be listed in their own right; I hope these lead to a wider discussion about social data and related topics:

  • Is social search revolutionary or evolutionary?
  • Which benefits more from social data: targeted search or discovery?
  • How well does social search monetize?
  • How should we use the social data that's automatically present in Email?
  • If Facebook and other networks encourage lightweight friendships, does it obscure the real social graph?


January 07, 2008

Techmeme: Web 2.0 Discovery, with a Web 1.0 twist!

Jeremiah Owyang wrote an interesting post yesterday: The Five Members of the Techmeme Family - in which he lists the different types of bloggers that end up on Techmeme. I think he's right on the money; as an avid follower of the site, I've seen the same dynamics at play.

For technology watchers and bloggers, Techmeme is a gold mine, an invaluable resource that constantly highlights breaking news, unique perspectives and interesting blog posts. Through the site, I've discovered some amazing writers and their high-quality work: Scott Karp on Can Blogs Do Journalism?, Fred Wilson's incisive post - What My Kids Tell Me About The Future of Media, Jeremy Liew's ongoing series about the Semantic Web - Meaning = Data + Structure, Dale Dougherty's wonderful post on Journalism is Burning Or How Breaking News is Broken, and so many others.

In his post, Owyang also looks at how posts are rated on Techmeme. What's interesting is that the person who breaks the story does not necessarily get the lead; a more mainstream news source or blogger often becomes the "top node", even if all he or she is doing is repeating the story without any additional content or unique insight. This is a reasonable approach from an automated content discovery perspective, but it sometimes gives funny results.

As Owyang says:

...

The Breaker: This can be mainstream news source or a mainstream blogger that discovers the story from the Original News Source and blogs it, as a result, they often become the top node, even if they aren’t the original source. It seems as if some websites are naturally geared to be an “H1” even if they are resonators.

The Resonator: Also referred to as those who echo or copy, they repeat what was already said, adding little or no additional content, news or opinion.

...



As an example, consider this Techmeme snapshot from 5:55 PM ET, December 31, 2007 - the image below shows a fragment of that page.



At that time, the big news of the moment was about an executive defection, er, employment change - Steve Souders, Chief Performance Yahoo, left his post at Yahoo! to join Google.

What is interesting to note is the ordering of the various stories on the Techmeme web site.

The lead story on this topic is the Silicon Alley Insider post by Henry Blodget - an A-list blogger. Now, Mr. Blodget is a fine writer and SAI is a great blog, but this particular lead story is written mostly as a breaking-news flash, with minimal opinion and no particularly startling insights. (Where is the story behind the story?)

However, the story had already been broken by techno.blog on the previous day (according to the respective blog post time stamps), so it wasn't really breaking news by the time it appeared on Silicon Alley Insider. And others - for example, Donna Bogatin and Ashkan Karbasfrooshan - provide a lot more additional content and, arguably, much more insight. So how did the big-T pick Blodget's post as the lead?

My belief is that the Techmeme algorithms choose their lead based on the prominence of the source and on the links to a given post (two factors that are generally highly correlated in any case).

This is fine and generally works well. Are there other options, other algorithms that can be used to choose the lead for a developing story, that could highlight the more meaty posts? A few possibilities come to mind:

  • Reader Votes: Within the set of posts for a developing story, allow readers to vote for the ones they like best, so that the most popular ones rise to the top.
  • Link Count: Examine the cross-linking between posts to leverage the implicit knowledge therein, similar to Google's PageRank algorithm. I believe Techmeme already incorporates this to some extent.
  • Bookmark Count: Examine the incidence of social bookmarks for different posts, for popular bookmarking services like del.icio.us.
  • Human Editors: Use human editors to select the top leads. Of course, this may prove too expensive and/or cumbersome.
  • Author Markup: Enable authors to include metadata in some standard format for their posts. By using markup or tags such as "news", "opinion", "analysis", "multi-idea" and so on, authors could indicate the type of their post to the selection engine. Admittedly, this approach is susceptible to gaming, although it could be combined with voting to improve quality.
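
As a rough sketch of how several of these signals might be combined, the snippet below scores each post as a weighted sum of source prominence, link count, bookmark count and reader votes. To be clear, this is not Techmeme's algorithm, which I can only guess at; the `Post` fields, the weights and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Post:
    name: str
    prominence: float   # 0..1, hypothetical measure of how well known the source is
    inbound_links: int  # links from other posts covering the same story
    bookmarks: int      # social bookmark count
    votes: int          # reader votes


def normalize(values):
    """Scale counts to the 0..1 range so the signals are comparable."""
    top = max(values)
    return [v / top if top else 0.0 for v in values]


def pick_lead(posts, weights):
    """Return the post with the highest weighted sum of its signals."""
    signals = {
        "prominence": [p.prominence for p in posts],
        "inbound_links": normalize([p.inbound_links for p in posts]),
        "bookmarks": normalize([p.bookmarks for p in posts]),
        "votes": normalize([p.votes for p in posts]),
    }
    scores = [
        sum(weight * signals[name][i] for name, weight in weights.items())
        for i in range(len(posts))
    ]
    best = max(range(len(posts)), key=scores.__getitem__)
    return posts[best]


# Made-up numbers, for illustration only.
posts = [
    Post("well_known_blog", prominence=0.9, inbound_links=10, bookmarks=5, votes=3),
    Post("in_depth_post", prominence=0.3, inbound_links=6, bookmarks=20, votes=12),
]
prominence_heavy = {"prominence": 0.7, "inbound_links": 0.3, "bookmarks": 0.0, "votes": 0.0}
balanced = {"prominence": 0.1, "inbound_links": 0.3, "bookmarks": 0.3, "votes": 0.3}

print(pick_lead(posts, prominence_heavy).name)
print(pick_lead(posts, balanced).name)
```

With these numbers, the prominence-heavy weights pick the well-known blog as the lead, while the balanced weights pick the in-depth post. The choice of weights is the real editorial decision; the arithmetic is the easy part.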

Over time, the significance of "prominence" as a measure of content quality is eroding - especially for blog posts. As the web evolves, Techmeme and other sites are sure to experiment with these and other alternative approaches; it will be interesting to see which ones emerge as the winners.



August 26, 2007

Survey Results: The Future of Web Search

Thank you to everyone who participated in the last Software Abstractions survey! We asked: which features do you see as the most important ones for Web Search in the future? The results were interesting.

Out of a total of 33 votes, the top votes were closely split between a variety of answers.

  • Personalization  [6 votes]
  • Social Input  [5 votes]
  • Semantic Query  [5 votes]
  • Semantic Index  [6 votes]
  • Trusted Sources  [6 votes]

For search engines with advanced linguistic parsing capabilities, it's reasonable to assume that semantic processing will be applied to both the query and the indexed content as a whole. If you combine those two answers (Semantic Query and Semantic Index), then Semantic Processing is the clear winner with 11 of the 33 votes, or 33%!

The high number of votes for the "Trusted Sources" answer was a surprise - it's clear that users care a great deal about the quality of future search results, including their being spam-free.

The complete picture of results is given below:

 

 


January 31, 2007

A tip o' the hat to - The Repliqa blog

I've been a regular reader of Mark Seremet's Repliqa blog for a long time. It's a great blog - he regularly manages to come up with some fascinating posts, such as this post on Entrepreneurship outside Silicon Valley and this one on Donald Trump. Mark is a serial entrepreneur, and is working on a discovery engine called Repliqa (currently in stealth mode) and a startup called Wallhogs. In his latest post, he gives a nod to the Software Abstractions blog. Thanks, Mark - I appreciate the compliment! Keep those great posts coming ...
