If you’re into the techy side of things, you’ve probably wondered how Googlebot works its way around problems, and specifically, web crawler traps.
For example, let’s suppose that Googlebot encounters some sort of endless loop, or a page that can expand forever (like a calendar page where you can click to load the next month...and the next...and the next...and so on).
At ~46:06 of the English Google SEO Office-Hours from November 19, 2021 (video queued below), John Mueller addressed a question related to this.
The question read:
“Occasionally, when crawling a website, I run across a spider trap, infinitely expanding URLs.
“And I've been wondering how Googlebot handles such situations. Does it somehow ignore those URLs to focus on the rest of the normal URLs on the site or does Googlebot get stuck in some way and miss crawling URLs as a result?”
Mueller responded:

“Yeah, that's a complicated question.
“And it is something that sometimes causes problems. For the most part, I think we end up figuring this out, because what happens is a kind of spider trap area, which is something, for example, maybe you have an infinite calendar, where you can scroll into March 3000, or something like that, and essentially you can just keep on clicking to the next day, and the next day, and it'll always have a calendar page for you.
“That's kind of an infinite space kind of thing. For the most part, because we crawl incrementally, we'll start off, and go off and find...I don't know...maybe 10 or 20 of these pages.”
That’s a good point: Googlebot crawls incrementally. It doesn’t dive in headfirst.

This gives it a chance to dip its toes into the water, so to speak.
“And then we'll say, ‘Well, there's not much content here, but maybe if we look a little bit deeper and we go off and crawl maybe a hundred of those pages....’
“And we start saying, ‘Well, all of this content essentially looks the same, and they're all kind of linked from this long chain where you have to click Next, Next, Next, Next to actually get to that page.
“At some point, our systems are going to say, ‘Well, there's not much value in crawling even deeper here, because we found a lot of the rest of the website that has really strong signals telling us this is actually important.
“And we found this really weird long chain here.
“Then overall, we'll say, ‘Well, these are probably not that important. We don't have to crawl them that often, if at all, if we want to keep them.’
“And rather, we focus on the rest of the site.”
So, there you have it: Googlebot is designed to eventually recognize when it has wandered into a web crawler trap. Once it does, it stops crawling deeper into the trap and focuses on the rest of the site instead.
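The behavior Mueller describes (crawl a handful of pages, notice that everything along a long “Next, Next, Next” chain looks the same, then stop going deeper) can be sketched as a breadth-first crawler with a duplicate-chain cutoff. To be clear, this is a hypothetical illustration of the general idea, not Google’s actual algorithm; the `get_links`, `get_content`, and `duplicate_limit` names are all invented for the example.

```python
from collections import deque

def crawl(start, get_links, get_content, duplicate_limit=5, max_pages=1000):
    """Breadth-first crawl that abandons chains of near-duplicate pages.

    duplicate_limit is a made-up heuristic: how many consecutive
    same-looking pages along one link chain we tolerate before
    treating it as a crawler trap and refusing to go deeper.
    """
    visited = set()
    crawled = []
    # Queue entries: (url, parent page's content fingerprint, duplicate run length)
    queue = deque([(start, None, 0)])
    while queue and len(crawled) < max_pages:
        url, parent_fp, dup_run = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        content = get_content(url)
        crawled.append(url)
        fp = hash(content)  # stand-in for a real content fingerprint
        # Extend the duplicate run if this page looks like its parent.
        run = dup_run + 1 if fp == parent_fp else 0
        if run >= duplicate_limit:
            continue  # looks like a trap chain: crawl no deeper here
        for link in get_links(url):
            queue.append((link, fp, run))
    return crawled

# A toy site: a few normal pages plus an endless calendar section
# where every page just links to the "next" one with identical content.
def get_links(url):
    if url == "/":
        return ["/about", "/blog", "/cal/1"]
    if url == "/blog":
        return ["/post1", "/post2"]
    if url.startswith("/cal/"):
        n = int(url.split("/")[-1])
        return [f"/cal/{n + 1}"]  # infinite "next month" link
    return []

def get_content(url):
    return "calendar page" if url.startswith("/cal/") else f"unique content for {url}"

pages = crawl("/", get_links, get_content, duplicate_limit=3)
```

Despite the calendar being infinite, the crawl terminates: it fetches a few calendar pages, notices they are near-duplicates linked in a chain, and abandons that branch while still covering the rest of the site.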