Remember way back when Google used to show the number of pages they have indexed on their home page? Remember the war between Yahoo and Google where they competed to get the most pages indexed? It seemed that the operators of the big engines felt that if their index was bigger, they were the better search engine. It was fun to watch at the time, but eventually those numbers quietly disappeared.
However, when the Cuil search engine came out last year, its creators made the bold claim that it was the largest search engine in the world. Its home page, as of this blog post, proclaims its index to be 127 billion pages. Interestingly, just three days before Cuil officially launched on July 28, 2008, Google stated that their search engine was aware of one trillion URLs, but added that they "don't index every one of those trillion pages." For a moment I wondered whether the number wars were going to start up again. They didn't.
But is it true? Does Cuil index more pages than Google? And why should you care either way?
First, let's see if it's true, then we'll talk about the implications to you as a webmaster.
Here's a simple way to find out: search for a single-word term in Cuil and Google and see how many results come back. For a fair comparison, the word should be extremely common -- likely to appear on just about every single (English) content page indexed by both engines.
For example, the word "the". I searched for "the" in Cuil and Google (and, for fun, Yahoo and Bing). Here are the numbers I got back, sorted with the engines having the most results first.
[Table: result counts for "the" in Cuil, Google, Yahoo, and Bing]
Rather dramatic differences! Based on these results, Cuil does seem to index a lot more pages than Google and the other major search engines (at least pages written in English).
Remember, though: Google made it clear that they don't index every page they are aware of. In fact, assuming that Cuil actually indexes most of the publicly available content on the web, that means that Google is choosing not to index more than 80% of pages which contain the word "the" (which it's safe to say appears on pretty much all content written in English).
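The "more than 80%" figure is simple arithmetic on the result counts. As a minimal sketch -- using Cuil's claimed 127 billion pages and a placeholder Google count of 25 billion, since the exact number from the table isn't reproduced here -- the filtered fraction works out like this:

```python
# Back-of-the-envelope estimate of how much of the web Google filters out.
# cuil_results is Cuil's claimed index size; google_results is a
# placeholder, not the actual count from the table above.
cuil_results = 127_000_000_000
google_results = 25_000_000_000

filtered_fraction = 1 - google_results / cuil_results
print(f"Google appears to filter roughly {filtered_fraction:.0%} of those pages")
```

With counts in that neighborhood, the filtered fraction lands just over 80%, which is where the figure in the text comes from.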
What causes Google to filter a page from its index? The previously referenced blog post on Google's blog says that "many [pages] are similar to each other, or represent auto-generated content ... that isn't very useful to searchers."
Google is notorious for making vague statements that are understood by just about nobody. So what's the real truth? Let's dissect their statement a bit and find out.
Google says that pages which are "similar to each other" aren't necessarily indexed. Statements like this have caused a lot of misunderstanding and fueled misinformation from self-proclaimed search engine optimization gurus, who often claim that your page will be penalized if it's a duplicate of some other page.
We can prove from Google's own results that the engine does, in fact, index duplicate content. How? It's easy: Hop over to EzineArticles.com and grab the title of the most published article in any given category, then search for that title in Google using the "intitle:" operator.
For example, the most published article in the last 60 days in the Finance category right now is titled "Same Day Loans - When You Are Running Out of Options!"
Go to Google and search for this:

intitle:"Same Day Loans - When You Are Running Out of Options!"
Right now 7 results show up. When I click the link at the bottom of the results to show duplicates as well, I get 87 results. That means Google has 87 copies of the same article in its index. Clearly, the fact that the content is the same doesn't prevent Google from adding a page to its index.
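The intitle: check above is easy to reproduce for any article title. Here's a minimal sketch that builds the search URL; the URL format is the common www.google.com/search?q= convention, not an official API:

```python
from urllib.parse import quote_plus

def intitle_query_url(title: str) -> str:
    """Build a Google search URL that finds pages with this exact title."""
    query = f'intitle:"{title}"'  # quotes force an exact-phrase title match
    return "https://www.google.com/search?q=" + quote_plus(query)

url = intitle_query_url("Same Day Loans - When You Are Running Out of Options!")
print(url)
```

Open the resulting URL, then click the link at the bottom of the results to include the omitted duplicates, and you'll see how many copies of the article Google actually holds.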
Reading Google's own words about duplicate content in their support material gives the impression that when they refer to duplicate content, they're mostly talking about content on the same site. They also state that "duplicate content on a site is not grounds for action on that site unless it appears that the intent of the duplicate content is to be deceptive and manipulate search engine results."
So Google appears to be indicating that similar content is not indexed if it's perceived to be for search engine manipulation.
That may be their goal, but it isn't what happens in practice. Content is very often created and syndicated precisely to build links that get a page ranked in Google, and that practice works very well. Perhaps if the duplication is egregious enough Google will step in -- but unless you're doing some large-scale duplication and distribution, you generally have nothing to worry about.
The second part of Google's statement indicated that a page may not be indexed if it's "auto-generated content ... that isn't very useful to searchers." An example they give is a calendar script that would create an infinite number of pages if the search engine crawler kept following all of the links for all of the dates going forward in time. Google was not specifically talking about page content that's generated by software.
Again, we can prove that Google indexes software-generated content by using their own index. I went to Google's shopping page and clicked on one of the "recently found" links (in this case, it was "bicycle trailers"). The first result was from Amazon.com, which has an API that allows people to use Amazon product information on their own sites (e.g., you can create Amazon-like product pages using software).
I searched at Google for two sentences found in the product description (with quotes):
That search returned 10 pages, and when I clicked the link for showing filtered results, 46 pages. So obviously Google has no problem indexing that kind of content, either.
So what is it, then, that will prevent Google from indexing a page? The answer is simple, and it's one that you'll probably never hear from Google.
Sometimes I wonder why Google bothers putting up support materials when they never give you a real answer to anything. They want to be vague to prevent people from manipulating the results, but guess what? People manipulate it all the time anyway!
So here it is, the real answer to why Google's index is 80% smaller than Cuil's, and why so many pages go unindexed:
No links = No Indexing
That is, if a page is a duplicate of some other page on some other site, but the page has no links to it, Google will crawl the page -- and even put it in their index for a while -- but after a few days or weeks the page will usually be removed from their index.
I say "usually" because if the site that the duplicate page appears on has enough links to it overall, then the page will stay indexed even if there aren't any links pointing directly to it. That's why a site like EzineArticles.com has 4,690,000 pages in Google's index even though many of those pages don't have any external links to them -- the site as a whole has enough links that Google considers anything appearing on it worth keeping indexed.
That makes sense, right? Why muck up the index with massive amounts of duplicate content that isn't important enough for anybody to link to it?
What all of this means for you as a webmaster is simple: if you're going to distribute articles or other duplicate content in order to build links to your web site and rank better in Google, you need to make sure that the content you distribute is itself linked to by other external pages. Whether you accomplish that by social bookmarking, by writing additional articles on EzineArticles that link to your articles on "lesser" sites, or through some other method, you need to be sure that the content is linked to.
So, sure, Cuil's database might be a lot larger than Google's, but that doesn't mean that it's better -- Google just does more filtering. That's important for you, because if you want your content to stick around, you need to make sure Google considers it valuable by throwing some links at it.
Please post your thoughts and questions in a comment below.