On Google’s most recent Q&A webmaster session, titled English Google SEO office-hours from August 13, 2021, one of the numerous questions submitted was, “20% of my pages are not getting indexed. It says they’re discovered, but not crawled. Does this have anything to do with the fact that it’s not crawled because of potential overload of my server or does it have to do with the quality of my page?”
NOTE: I should make a point here that the wording above is more accurately stated as the “quality of my site” rather than the “quality of my page.”
You’ll see why below.
John Mueller, a Search Advocate at Google, responded: “Probably…a little bit of both.”
So, let’s use those two as criteria to explore.
One: The Server Side and Crawl Budget. If You Have a Large Site, You May Want to Consider This
First, let me try to quantify what may be considered a large site: tens of thousands of pages (at least 10,000). So, if your site isn’t anywhere near that, I wouldn’t worry too much about this criterion.
The only exceptions might be if you added a large number of pages in a short time, or you’ve done a number of redirects.
Mueller went on to mention something called crawl budget.
Crawl budget is basically what it sounds like: Googlebot will only crawl a certain number of pages of a site within a given time period. (As a side note, I recall Mueller saying that Google doesn’t index 100% of a site. Of course, he was probably talking about large sites.)
This limit is there to help minimize load on your site’s server(s).
Of course, you can see how this might not apply to a small site: today’s web servers are, on average, pretty capable. That’s why crawling each page of a small site may not put too much burden on a server.
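If you do suspect crawl budget is in play, one way to get a rough sense of Googlebot’s activity is to count its requests in your server’s access log. Here’s a minimal sketch in Python; the log lines are made-up examples in the common Apache/Nginx combined format, not real data, and a real log would be read from a file rather than a list:

```python
# Hypothetical access-log lines (combined log format); in practice you would
# read these from your web server's log file.
SAMPLE_LOG = [
    '66.249.66.1 - - [13/Aug/2021:10:00:01 +0000] "GET /page-a HTTP/1.1" 200 1234 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.5 - - [13/Aug/2021:10:00:02 +0000] "GET /page-b HTTP/1.1" 200 987 "-" "Mozilla/5.0"',
    '66.249.66.1 - - [13/Aug/2021:10:00:07 +0000] "GET /page-c HTTP/1.1" 200 555 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

# Count requests whose user-agent string mentions Googlebot.
googlebot_hits = [line for line in SAMPLE_LOG if "Googlebot" in line]
print(f"{len(googlebot_hits)} of {len(SAMPLE_LOG)} requests came from Googlebot")
```

A low and steady Googlebot request rate on a very large site is one signal (though not proof) that crawl budget, rather than quality, is limiting how many pages get crawled.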
With all this said, if your site is small, crawl budget is unlikely to be the problem, so maybe the issue has to do with the next criterion…
Two: On Site Quality: “That Is Something We Take Into Account Quite Strongly”
You’ll recall that above, I made a distinction between page and site. I mention this because Mueller said that Google takes into account the quality of a site quite strongly. If Google looks at your site as a whole (the various pages in its index), and evaluates that your site isn’t of high quality, you may find that additional pages (or some pages) aren’t indexed, even though they’ve been discovered.
I know this brings up a bit of a chicken-and-egg situation, because what if your pages that aren’t indexed are newer and of high quality?
That would imply going back to the pages already in Google’s index (the ones that shape Google’s impression of your site), improving the quality of a significant enough number of them, and waiting for re-indexing.
While We’re On The Subject of Indexing…
I’m reminded of something I saw on one of Google’s Twitter channels:
It was basically a link to the Index Coverage report in Search Console, support for which had recently been made available.
So, in conclusion, if you’re having trouble with indexing, look into your site’s quality. I’d also suggest double-checking your robots.txt to ensure that Google is allowed to crawl those pages.
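That robots.txt check can be automated with Python’s standard library. Here’s a minimal sketch using `urllib.robotparser`; the robots.txt rules and the example.com URLs are hypothetical, and in practice you’d point the parser at your live robots.txt URL with `set_url()` and `read()`:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt policy; in practice, fetch your site's own file.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Check whether Googlebot may crawl specific URLs under this policy.
blocked = parser.can_fetch("Googlebot", "https://example.com/private/page.html")
allowed = parser.can_fetch("Googlebot", "https://example.com/blog/post.html")
print(f"/private/page.html crawlable: {blocked}")   # False: matches Disallow: /private/
print(f"/blog/post.html crawlable: {allowed}")      # True: no matching Disallow
```

Note that `can_fetch()` uses the most specific matching user-agent group, so the `Googlebot` rules apply here rather than the `*` rules.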