|on 14 Sep Posted by Admin Category: Search Engines|
by Rob Sullivan
I wrote about this topic last year, but it seems to have resurfaced - the possibility that Google may have hit the limit on the number of pages they can index.
As you have probably noticed if you watch the web closely, Google doesn't seem to have performed a major update of their index in some time. There is a theory out there that may explain why.
The idea goes something like this: Google was built around a unique identifier called a PageID, which is used to categorize the web. Each page gets its own PageID. The problem may be that Google's PageID system was only built to assign a maximum of about 4.2 billion unique PageID values (roughly 2^32, the number of values a 32-bit ID can hold). Since they have been showing over 3 billion pages indexed since November, it is reasonable to assume that they could be nearing that 4.2 billion mark. Now, I'm not saying this is what happened, or that the theory is correct. I am only stating that it is a possibility. But when you consider that the engine doesn't seem to have performed a major update to the index or PageRank in some time, and that their home page now reads "Searching 4,285,199,774 web pages", it starts to make sense.
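To put numbers on the theory, here is a quick sketch of the arithmetic. It assumes the PageID is an unsigned 32-bit integer, which is the usual reading of the "4.2 billion" figure but is not something Google has confirmed. The function name and variable names are mine, for illustration only.

```python
# Assumption (not confirmed by Google): a PageID is an unsigned 32-bit integer.
PAGE_ID_BITS = 32


def page_id_headroom(indexed_pages, bits=PAGE_ID_BITS):
    """Return (max_ids, remaining_ids, percent_used) for a fixed-width ID."""
    max_ids = 2 ** bits                      # number of distinct values the ID can hold
    remaining = max_ids - indexed_pages      # IDs still unassigned
    percent_used = 100 * indexed_pages / max_ids
    return max_ids, remaining, percent_used


# Page count quoted from Google's home page in this article.
homepage_count = 4_285_199_774

max_ids, remaining, percent_used = page_id_headroom(homepage_count)
print(f"{max_ids:,}")          # 4,294,967,296
print(f"{remaining:,}")        # 9,767,522
print(f"{percent_used:.2f}%")  # 99.77%
```

If the 32-bit assumption holds, the count on the home page would leave room for fewer than 10 million more pages, which is why the theory gets attention.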
Maybe they really are broken?
Now, you wouldn't think that such a limitation would have been left untreated for so long. Sure, they may have initially built the system this way, but you would think that, at some point, they would have realized that 4.2 billion pages isn't realistic and they would have tried to find a way to fix it. But then again, perhaps it would have been too big a fix? (Maybe that's what the IPO is for - to fund the fix).
Not convinced? How about this? I have been checking the server logs of some of our new clients. These are brand new sites that have never been used before. One of them, with over 50,000 pages, has been visited by Googlebot only once in the last 5 weeks. In fact, that one visit happened in the last week of July, with nothing at all in August. Compare that to MSNBot, which visited 139 times in the last 5 weeks. Normally Google is the largest consumer of pages among the crawlers, yet it hasn't even made a dent in this site.
However, if you look at the Searchengineposition.com site, you can see indexed pages as recently as August 25 of this year. An article I wrote about internet sales tax is ranking #1 for "internet sales tax a reality".
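If you want to run the same kind of check on your own logs, a rough sketch follows. It assumes an Apache-style access log where the user agent appears somewhere on each line, and it simply counts lines containing "googlebot" or "msnbot". Note that it counts requests rather than distinct visits, and the sample lines below are made up for illustration; they are not from our clients' logs.

```python
# Substrings to look for in each log line (matched case-insensitively).
CRAWLER_TOKENS = {"Googlebot": "googlebot", "MSNBot": "msnbot"}


def count_crawler_requests(log_lines):
    """Count log lines attributable to each crawler by user-agent substring."""
    counts = {name: 0 for name in CRAWLER_TOKENS}
    for line in log_lines:
        lowered = line.lower()
        for name, token in CRAWLER_TOKENS.items():
            if token in lowered:
                counts[name] += 1
                break  # one crawler per request line
    return counts


# Invented sample lines in combined log format, for illustration only.
sample_log = [
    '192.0.2.10 - - [28/Jul/2004:10:01:00 -0700] "GET / HTTP/1.0" 200 5120 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '192.0.2.20 - - [03/Aug/2004:11:15:00 -0700] "GET /a.html HTTP/1.0" 200 2048 "-" "msnbot/0.3 (+http://search.msn.com/msnbot.htm)"',
    '192.0.2.20 - - [04/Aug/2004:09:30:00 -0700] "GET /b.html HTTP/1.0" 200 2048 "-" "msnbot/0.3 (+http://search.msn.com/msnbot.htm)"',
    '192.0.2.30 - - [04/Aug/2004:09:31:00 -0700] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
]

print(count_crawler_requests(sample_log))
```

To use it on a real log, pass an open file object instead of the sample list, since iterating over a file yields one line at a time.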
This plainly shows that the index is growing, because that article is only 19 days old.
But the PageRank values for articles going back to March haven't been updated yet. It used to be that they would update PageRank to within about 6 weeks of the most recent article.
Maybe the entire system isn't broken, but it appears that they do have some bugs. While the index hasn't grown appreciably, there still does appear to be some growth. Just not with new sites.
There is another theory out there which could account for the new sites not getting indexed - it's called the Google Sandbox.
Consider the sandbox as a staging ground. New sites get placed there and will eventually start to show in the main index. It's almost like a quarantine area for new sites.
We've known about the sandbox for some time. Usually a new site spends 2-3 weeks in the sandbox before it starts to rank. We can identify this by reviewing spider logs. Googlebot has visited and requested a ton of pages, and sometimes they show up in the main index for a couple of days, only to disappear for a couple more weeks. Then they return for good, but they usually have to work their way up through the ranks.
In either case, whether it is a flaw in the architecture, or merely the sandbox kicking in, there does seem to be something happening down in California which is causing new sites to take longer than ever to get indexed and existing sites longer than ever to update PageRank.
Maybe the engineers are too busy cashing in on their stocks to worry about fixing the engine, or maybe there really is a fundamental flaw in the Google architecture that is hampering the growth of the index (one which I'm sure they are fixing).
In either case, Google had better get its act together and fix its problems, because other engines (like Yahoo!) are getting better every day, and still others (like MSN) are close to relea