I have an avid interest in search engines and search algorithms. If you've been following my blog for any time at all, you know this.
It's commonly known that the Big Three search engines (Google, Bing and Yahoo) base their ranking of results primarily on one thing: links. Yes, the documents need to be relevant to the search query, but after meeting that criterion it's the quantity and quality of links to a document that lands it in the top spot for any given set of keywords.
While links are a great way to provide a baseline of trust or authority, links are not the only way a document should be judged.
Here's an example of how using links as the primary ranking factor fails. Go to Google (or Bing) and search for "old time radio".
You'll see the #1 result is RadioLovers.com. It's a very popular site because it offers a lot of downloads of old time radio shows in MP3 format.
That's great if that's what you were after, but what if you actually wanted to read about Old Time Radio? In that case RadioLovers.com is a terrible #1 result -- it's mostly just a list of links.
The #2 result in Google isn't any better (Otr.net) -- it's all links.
In fact, if you go through each of the top 10 results in Google for "old time radio" you'll see that, with the exception of the Wikipedia entry that comes in at #9, there's very little information about Old Time Radio in any of the results!
Thus the failure of using links as the primary method for ranking web sites becomes clearer.
In some cases using links is perfect, such as when looking for the official web site of a particular company. Search for "Microsoft" at Google and you'll get Microsoft.com -- even though there's very little information on the home page. But if you searched for "Microsoft" in the hope of getting some information about the company, that doesn't start until Google's fourth result (the Wikipedia entry for Microsoft).
That got me thinking: while links are a great way to establish a measure of trust or authority, wouldn't it be better to have an alternative ranking method that puts sites that offer the most information first?
What I'd like to do is:
1. Pull the top 30 results from Google (or Bing or Yahoo).
2. Gather up all of the information in all of those top 30 documents.
3. Score each document to see how much of the overall body of knowledge was covered in each document.
So, for instance, if a particular web page contained 80% of the information covered by the top 30 results, that's the web page I'd want at #1 because it covers most of the information found in all 30 documents. Number two would be the next best coverage, and so on.
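To make the three steps above concrete, here is a minimal Python sketch of one way the scoring could work. It rests on a big assumption: that "information" can be approximated by the set of distinct words in each document. That is only a stand-in for illustration, not a description of how any real engine measures coverage, and the function names (`terms`, `coverage_scores`, `rank_by_coverage`) and the tiny stopword list are made up for this example.

```python
import re

# Assumption: "information" is approximated by the set of distinct
# lowercase words in a document, ignoring a few very common words.
# This is an illustrative stand-in, not a known production method.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "for"}


def terms(text):
    """Return the set of distinct non-stopword terms found in a text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if w not in STOPWORDS}


def coverage_scores(documents):
    """Map each document name to the fraction of the pooled terms it contains.

    `documents` is a dict of {name: full text}. The pool is the union of
    the terms found in every document (step 2); each score is that
    document's share of the pool (step 3).
    """
    term_sets = {name: terms(text) for name, text in documents.items()}
    pool = set().union(*term_sets.values())
    if not pool:
        return {name: 0.0 for name in documents}
    return {name: len(ts) / len(pool) for name, ts in term_sets.items()}


def rank_by_coverage(documents):
    """Return (name, score) pairs sorted so the best coverage comes first."""
    scores = coverage_scores(documents)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy stand-ins for fetched pages (step 1 is not shown here).
    pages = {
        "page-a": "radio drama history",
        "page-b": "radio",
        "page-c": "radio drama comedy shows",
    }
    for name, score in rank_by_coverage(pages):
        print(f"{name}: {score:.0%}")
```

A real version would need something smarter than raw word overlap (a long page full of unrelated words would score well here), but the shape is the same: pool everything, then measure each document against the pool.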
That would mean that I'd have to visit far fewer pages to get a great overall understanding of a topic.
This method would also be great for news items, since it would list the news article with the best overall coverage of a topic first -- saving a lot of time reading multiple articles to try to get a balanced or complete picture.
Wouldn't that be nice?
Well, stop day-dreaming, because it's a reality! I've incorporated the methodology listed above into my search engine, Shablast.
Now, when you enter a query into Shablast and select either "Standard" or "News" from the drop-down list next to the keyword field, Shablast will score the top 30 Bing results using the methodology listed above and sort the results to show the web pages/news articles with the best coverage first.
Search Shablast for "old time radio" in Standard mode and RadioLovers.com comes in at #13 because it only contains 19% of the overall body of knowledge contained in the top 30 results. The #1 result is MysteryShows.com, which contains 64%, followed by the Wikipedia article at #2 with 56%.
Search Shablast for "Microsoft" and the #1 result is the CNet.com news page dedicated to Microsoft because it contains a whopping 84% of all the information contained in the top 30 results for "Microsoft."
The top 30 results are pulled from Bing using their awesome API. The advantage of using the Bing API is that links still play a part. It takes a lot of links to get into the top 30 results for major keywords, which ensures a certain level of trust or authority. But then Shablast examines those top 30 documents and re-sorts them to show the documents with the best coverage first.
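As a small illustration of that last step, here is a hedged Python sketch of re-sorting a link-ranked list by coverage. The function name `resort_by_coverage`, the page names and the input order are all hypothetical, and the tie-breaking rule is an assumption: when two pages have the same coverage, the one the link-based ranking placed higher stays ahead.

```python
def resort_by_coverage(results):
    """Re-sort (name, coverage) pairs so the best coverage comes first.

    `results` must be in the original link-based order. Python's sort is
    stable, so pages with equal coverage keep that original order, which
    lets the link-based ranking act as the tie-breaker (an assumption
    made for this sketch).
    """
    return sorted(results, key=lambda result: result[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical pages in their original link-based order.
    link_ranked = [
        ("page-a", 0.19),
        ("page-b", 0.64),
        ("page-c", 0.56),
        ("page-d", 0.64),
    ]
    for name, coverage in resort_by_coverage(link_ranked):
        print(f"{name}: {coverage:.0%}")
```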
Why not give it a try and let me know what you think, either in a comment on this post or by discussing Shablast at the Shablast discussion forum?