Long Tail post mortem

I turned off the updater services for the Long Tail of the Blogosphere site. I said at the start that it was an experiment, and now I think I can consider it done.

The goal of the site was simple enough: to find interesting blog posts among the thousands (or tens of thousands) of new posts published every day, without waiting for someone to link to them.

The way the web works today, people don't generally find interesting content on their own. Rather, they have sites they read that find the interesting content for them. Slashdot, BoingBoing, Fark, even CNN.

This works, but it means that a new blogger's only way to get noticed is to go out and publicize the new site. Asking Scoble to link to you works for some, but non-tech bloggers have far fewer central sites on which to get their writing noticed.

We're leaving the decisions about what we read online to a few people.

I would rather see the decisions about what I read made by software. Software is unbiased, and with the help of ratings aggregated from a large number of users, it can be much more accurate than a handful of human editors.

For example, a Bayesian algorithm can determine that, because you enjoyed reading a particular set of articles, a particular new article is likely to interest you.
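A minimal sketch of that idea, using a naive Bayes word-count model with Laplace smoothing (this is illustrative only, not the code the site actually ran):

```python
import math
from collections import Counter

def word_counts(docs):
    """Bag-of-words counts over a list of article texts."""
    return Counter(w for doc in docs for w in doc.lower().split())

def interest_score(article, liked, disliked):
    """Log-odds that an article resembles the set the user enjoyed."""
    n_liked, n_disliked = sum(liked.values()), sum(disliked.values())
    score = 0.0
    for w in article.lower().split():
        # Laplace smoothing so words unseen in training don't zero out the score.
        p = (liked[w] + 1) / (n_liked + 2)        # P(word | liked)
        q = (disliked[w] + 1) / (n_disliked + 2)  # P(word | disliked)
        score += math.log(p / q)
    return score  # positive: probably interesting; negative: probably not
```

A real ranker would also account for priors and down-weight very common words, but the core idea is just this comparison of word frequencies.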

Or, and this is the method I was hoping to use with the Long Tail, it can observe that when a particular item was shown to a certain number of people, it was clicked a certain number of times, and from that ratio infer how interesting the item is.

The more people who click on an item, the more people the item will be shown to. It's an automatic rating and classification system.

So why didn't it work? I believe there were two reasons.

First, there weren't enough users. This sort of rating system depends on the users themselves filtering the content, and with an average of 20 users a day, it just didn't go anywhere. Every post was seen once or (rarely) twice, so no data could be gathered about the relative interest of a new blog post. With no filtering possible, the quality of the displayed articles stayed low, which made it likely that a user would leave without having found anything interesting, and never come back.

Second, and in my opinion this was a major factor, there was simply too much spam.

Of 100 random "blog posts", maybe 10 were actually the creative writing of an individual. The rest were automated "welcome to your new blog" items, "first post" items, "sorry I haven't written in so long but I will soon" items, and pure spam. The spam was often formulated to look like real content, the way spam email is, but with maybe a dozen links to online casinos or adult websites. The signal-to-noise ratio made reading no fun.

The automatic filtering of content by users would have cleaned out the spam, given enough users, and some Bayesian spam filtering would have removed most of it before it ever reached the users in the first place.
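That pre-filtering step could have looked something like this, a Graham-style naive Bayes spam filter with a confidence threshold (the training data and threshold value are hypothetical):

```python
import math
from collections import Counter

def spam_probability(post, spam, ham):
    """P(spam | words): naive Bayes with equal priors and Laplace smoothing.
    `spam` and `ham` are Counters of word frequencies from labeled examples."""
    n_spam, n_ham = sum(spam.values()), sum(ham.values())
    log_odds = 0.0
    for w in post.lower().split():
        p = (spam[w] + 1) / (n_spam + 2)
        q = (ham[w] + 1) / (n_ham + 2)
        log_odds += math.log(p / q)
    return 1 / (1 + math.exp(-log_odds))

def keep_real_posts(posts, spam, ham, threshold=0.9):
    """Drop posts the filter is confident are spam before any user sees them."""
    return [p for p in posts if spam_probability(p, spam, ham) < threshold]
```

With even a modest labeled corpus of casino/adult-link spam versus real posts, a filter like this catches most of the formulaic stuff, leaving the users' clicks to sort out the rest.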

But in the end, I don't have the means to draw enough users to make this work. Rather than maintain the infrastructure to pull all the blogs and posts and keep the database updated for the very few users still discovering the site every day, I've turned it off. While the service was up, it harvested nearly half a million blogs and over 3 million posts.

If anyone is interested in taking the idea and running with it, go for it. I can share the software I've written (C# / SQL Server, based on my Syndicache stuff).