Google uses publisher content to train AI despite opt-outs

May 4, 2025


Publishers have pulled 80 billion training tokens from Google DeepMind, but Google is still using their content to train AI for search.

Google uses content from across the web to train its AI models for search, even when website owners have explicitly said no. That was the message from Eli Collins, Vice President at Google DeepMind, during a Washington court hearing, according to Bloomberg.

The current publisher opt-out system only applies to DeepMind, Google's AI research division that trains the Gemini models. Other parts of Google, including the team responsible for web search, can still use this content for their own AI systems.

During the hearing, Diana Aguilar of the US Department of Justice asked, "Once you take the Gemini [AI model] and put it inside the search org, the search org has the ability to train on the data that publishers had opted out of training, correct?" Collins replied, "Correct — for use in search."

Google uses this data to power search features like "AI Overviews," which display AI-generated answers directly above traditional search results. This puts Google in direct competition with website owners, since users may get answers without clicking through to the sites whose content was used to generate them.

Google "loses" half of publisher training data due to opt-out

An internal Google document from summer 2024 listed 160 billion tokens, or short snippets of text, that were originally intended for AI training. Of these, 80 billion tokens were removed because they came from publishers who opted out.

But Collins' testimony suggests that this data is still being used for Google's web search AI, just not by DeepMind directly. In practice, Google is using the data elsewhere inside the company, even though publishers were trying to block all Google AI training.

These details came out as part of an ongoing antitrust case against Google in federal court. The US Department of Justice is pushing for Google to sell off the Chrome browser and to stop paying hardware and app makers to set Google as the default search engine. The DOJ says these restrictions should also apply to Google's AI products, including Gemini, since they benefit from the same search monopoly.

If leading AI labs need high-quality training data to keep their models performing well, a market for this kind of content could emerge. But that would go against the current practice of scraping freely available content from the open web, often justified as "fair use." A US judge recently dismissed this defense in a case involving Meta.
