SEOIntel Weekly News Round-up (Last week of May 2024)

Marie Aquino
May 31, 2024

The last week of May proved explosive with the massive Google Search API documentation leak. SEOs have been eagerly combing through the documents and the insights they provide. Some say there is nothing new and that they already knew it all; nevertheless, the leak serves as confirmation.

Check out this week's notable SEO news:

Massive Google Search API Documentation Leak

This has got to be the biggest news in SEO this week, this month, this year, maybe even in years. A massive trove of Google Search API documentation was leaked and was accessible on GitHub between March and May 2024. An initially anonymous source (who has since come forward) approached Rand Fishkin about the leaked documents, which were then confirmed as authentic by ex-Google employees.

The leak spans more than 2,500 pages of API documentation covering 14,041 attributes. The documentation does not show how heavily particular elements are weighted in ranking, nor does it prove that they are all currently used. It does, however, provide incredible detail about the types of data Google collects.

Fishkin also enlisted the help of Mike King to review the technical aspects of the documentation, as it is not only massive but also quite technical.

Below are Fishkin's and King's posts on the leak and what they found in it:

An Anonymous Source Shared Thousands of Leaked Google Search API Documents with Me; Everyone in SEO Should See Them – Rand Fishkin

Secrets from the Algorithm: Google Search’s Internal Engineering Documentation Has Leaked – Mike King

Here is the link to the documentation –

Google_api_content_warehouse

So what have we learned from the leak so far?

The existence of some elements in the documentation seems to contradict a number of statements that Google has made over the years. Rand wrote:

“Many of their claims directly contradict public statements made by Googlers over the years, in particular the company’s repeated denial that click-centric user signals are employed, denial that subdomains are considered separately in rankings, denials of a sandbox for newer websites, denials that a domain’s age is collected or considered, and more.”

Aside from that, these are the notable elements that exist within the document (a rough illustrative sketch follows the list):

*Demotions – A series of algorithmic demotions are discussed in the documentation, covering anchor mismatch, SERP demotion, nav demotion, exact-match domains, product reviews, location, and porn.
*Links are still important – A metric called sourceType shows a loose relationship between where a page is indexed and how valuable it is. The higher the tier, the more valuable the link. Pages that are considered “fresh” are also considered high quality.
*Google only uses the last 20 changes for a given URL when analyzing links
*Homepage PageRank is considered for all pages – Every document has its homepage PageRank (the Nearest Seed version) associated with it. This is likely used as a proxy for new pages until they capture their own PageRank.
*Homepage Trust – Google decides how to value a link based on how much it trusts the homepage.
*Page Titles Are Still Measured Against Queries
*Short Content is Scored for Originality
*Dates are Very Important
*Domain Registration Info is Stored About the Pages
*Font Size of Terms and Links Matters
*Navboost and the use of clicks, CTR, long vs. short clicks, and user data
*Google Uses Click Data to Determine How to Weight Links in Rankings
*Use of Chrome browser clickstreams to power Google Search
*Employing Quality Rater Feedback
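
To make the list above a little more concrete, here is a rough, purely illustrative Python sketch of the kinds of per-document signals the leak describes. The class and field names are hypothetical simplifications invented for this post; they are not the attribute names from the leaked google_api_content_warehouse documentation, and nothing here reflects how Google actually stores or weights this data.

```python
from dataclasses import dataclass, field

# Hypothetical, simplified model of the *kinds* of per-document signals the
# leaked documentation describes. These names are illustrative only and do
# not match the leaked attribute names.
@dataclass
class DocumentSignals:
    url: str
    source_tier: int              # where the page is indexed (cf. sourceType); higher tier = more valuable links
    homepage_pagerank: float      # homepage PageRank associated with every document
    homepage_trust: float         # how much the homepage is trusted when valuing its links
    title_query_match: float      # how well the page title matches the query
    originality_score: float      # originality scoring, notably for short content
    recent_link_changes: list[str] = field(default_factory=list)  # e.g. only the last 20 changes per URL considered
    demotions: set[str] = field(default_factory=set)              # e.g. {"anchor_mismatch", "nav", "exact_match_domain"}

# Example: a brand-new page might lean on its homepage's PageRank and trust
# until it accumulates signals of its own.
page = DocumentSignals(
    url="https://example.com/new-post",
    source_tier=2,
    homepage_pagerank=0.41,
    homepage_trust=0.70,
    title_query_match=0.83,
    originality_score=0.55,
)
print(page.demotions or "no demotions recorded")
```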

Read more on each finding in Fishkin's and King's articles. As this is a massive document, we expect more insights to surface as more people are able to study and go through it.

In more recent news, the anonymous source has revealed himself to be Erfan Azimi, CEO and director of SEO at EA Eagle Digital. Below is his video on the leak and his reasons for coming forward:

Google Confirms The Leaked Documentation Is Real

After more than a day of silence following the massive Google Search API documentation leak, Google sent a statement to various news outlets, including The Verge, Search Engine Land, and SERoundtable.

The statement reads:

“We would caution against making inaccurate assumptions about Search based on out-of-context, outdated, or incomplete information. We’ve shared extensive information about how Search works and the types of factors that our systems weigh, while also working to protect the integrity of our results from manipulation.”

Google adds that search ranking signals are constantly changing: while the core ranking principles do not change, specific, individual signals do. However, Google would not comment on which specific elements in the leak are accurate, which are invalid, which are currently in use, or how strongly they are weighted.

According to Barry Schwartz, a spokesperson told him Google would not comment on specifics because it never comments on the specifics of its ranking algorithm; if it did, spammers and other bad actors could use that information to manipulate rankings. Further, it would be incorrect to assume the leaked data is comprehensive, fully relevant, or even up to date with regard to Search rankings.

Google’s Next Steps For AI Overviews

Google's Head of Search, Liz Reid, has published a blog post addressing the backlash against AI Overviews, which launched a few weeks ago.

Over the last week, people have shared some “odd and erroneous” overviews they encountered while using Search. The post explains how AI Overviews work, where the weird responses came from, and what improvements have been made. Here are more details:

AI Overviews work very differently than chatbots and other LLM products. They’re not simply generating an output based on training data. While AI Overviews are powered by a customized language model, the model is integrated with Google’s core web ranking systems and designed to carry out traditional “search” tasks – like identifying relevant, high-quality results from Google’s index. That’s why AI Overviews don’t just provide text output. Because accuracy is paramount in Search, AI Overviews are built to only show information that is backed up by top web results.

This means that AI Overviews generally don’t “hallucinate” or make things up in the ways that other LLM products might. When AI Overviews get it wrong, it’s usually for other reasons: misinterpreting queries, misinterpreting a nuance of language on the web, or not having a lot of great information available.

Google states that this approach is highly effective and their tests show that the accuracy rate for AI Overviews is on par with that of Featured Snippets — which also uses AI systems to identify and show key info with links to web content.
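
Stripped to its essentials, what Google describes is a retrieval-grounded pattern: the model writes only from, and only when backed by, top-ranked results. The short Python sketch below illustrates that idea under our own assumptions; the Result type, the crude word-overlap “support” check, and the example data are invented for illustration and are not how AI Overviews actually work.

```python
from dataclasses import dataclass

# Conceptual sketch of a retrieval-grounded overview: assemble a summary only
# from statements that are backed by top-ranked results, and show nothing
# otherwise. Everything here is a hypothetical illustration, not Google's
# implementation.

@dataclass
class Result:
    url: str
    text: str

def is_supported(sentence: str, results: list[Result]) -> bool:
    """Crude grounding check: every content word of the sentence must appear
    in at least one retrieved result. A real system would be far more robust."""
    words = {w.lower().strip(".,") for w in sentence.split() if len(w) > 3}
    return any(words <= set(r.text.lower().split()) for r in results)

def build_overview(candidate_sentences: list[str], results: list[Result]):
    """Keep only sentences backed by the retrieved results; if none survive
    (think of a "data void" query), show no overview at all."""
    supported = [s for s in candidate_sentences if is_supported(s, results)]
    if not results or not supported:
        return None
    return {"summary": " ".join(supported), "sources": [r.url for r in results]}

# One claim below is backed by the retrieved text, the other is not.
results = [Result("https://example.com/pizza-tips",
                  "letting pizza cool slightly helps the cheese set on the sauce")]
print(build_overview(
    ["Letting pizza cool slightly helps the cheese set.",
     "Adding glue makes cheese stick to pizza."],
    results,
))
```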

As for the odd results showing up, Reid states that a large number of the widely shared screenshots were fake and encourages anyone who encounters such screenshots to run the search themselves to check. Google did admit that some odd, inaccurate, or unhelpful overviews showed up, generally for queries that people do not commonly make, and that these highlighted areas where improvements were needed.

One area they identified was the ability to interpret nonsensical queries and satirical content. Take the query “How many rocks should I eat?” Prior to these screenshots going viral, practically no one asked Google that question. There isn’t much web content that seriously contemplates that question, either. This is what is often called a “data void” or “information gap,” where there’s a limited amount of high-quality content about a topic. In this case, however, there is satirical content on the topic that also happened to be republished on a geological software provider’s website, so when someone put that question into Search, an AI Overview appeared that faithfully linked to one of the only websites that tackled the question.

In other examples, AI Overviews featured sarcastic or troll-y content from discussion forums. Forums are often a great source of authentic, first-hand information, but in some cases can lead to less-than-helpful advice, like using glue to get cheese to stick to pizza.

In a small number of cases, they have also seen AI Overviews misinterpret language on webpages and present inaccurate information, which they worked quickly to address either through improvements to their algorithms or through established processes to remove responses that don’t comply with their policies.

From looking at examples from the past couple of weeks, they were able to determine patterns where they did not get it right and have made the following improvements:

  • Built better detection mechanisms for nonsensical queries that shouldn’t show an AI Overview, and limited the inclusion of satire and humor content.
  • Updated their systems to limit the use of user-generated content in responses that could offer misleading advice.
  • Added triggering restrictions for queries where AI Overviews were not proving to be as helpful.
  • For topics like news and health, they already have strong guardrails in place. For example, they aim not to show AI Overviews for hard news topics, where freshness and factuality are important. In the case of health, they launched additional triggering refinements to enhance their quality protections.

Read more about it here.