
Publishers have pulled 80 billion training tokens from Google DeepMind, but Google is still using their content to train AI for search.


Google uses content from across the web to train its AI models for search, even when website owners have explicitly opted out of AI training. That was the message from Eli Collins, Vice President at Google DeepMind, during a Washington court hearing, according to Bloomberg.

The current publisher opt-out system only applies to DeepMind, Google's AI research division that trains the Gemini models. Other parts of Google, including the team responsible for web search, can still use this content for their own AI systems.

During the hearing, Diana Aguilar of the US Department of Justice asked, "Once you take the Gemini [AI model] and put it inside the search org, the search org has the ability to train on the data that publishers had opted out of training, correct?" Collins replied, "Correct — for use in search."


Google uses this data to power search features like "AI Overviews," which display AI-generated answers directly above traditional search results. This puts Google in direct competition with website owners, since users may get answers without clicking through to the sites whose content was used to generate them.

Google "loses" half of publisher training data due to opt-out

An internal Google document from summer 2024 listed 160 billion tokens, or short snippets of text, that were originally intended for AI training. Of these, 80 billion tokens were removed because they came from publishers who opted out.

But Collins' testimony suggests that this data is still being used for Google's web search AI, just not by DeepMind directly. In practice, Google is using the data elsewhere inside the company, even though publishers were trying to block all Google AI training.

These details came out as part of an ongoing antitrust case against Google in federal court. The US Department of Justice is pushing for Google to sell off the Chrome browser and to stop paying hardware and app makers to set Google as the default search engine. The DOJ says these restrictions should also apply to Google's AI products, including Gemini, since they benefit from the same search monopoly.

If leading AI labs need high-quality training data to keep their models performing well, a market for this kind of content could emerge. But that would go against the current practice of scraping freely available content from the open web, often justified as "fair use". A US judge recently dismissed this defense in a case involving Meta.

Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Summary
  • A Google DeepMind vice president testified in an antitrust case that Google uses web content to train its search AI, even when publishers have opted out of AI training through DeepMind's opt-out policy.
  • The opt-out policy applies only to Google DeepMind and does not cover Google Search, which still uses such data for features like "AI Overviews," a feature that can reduce the number of clicks publishers receive from search results.
  • While 80 billion tokens from publishers using the opt-out were removed from Gemini model training, a manager confirmed that Google Search can continue to use this content for its own AI systems.
Matthias is the co-founder and publisher of THE DECODER, exploring how AI is fundamentally changing the relationship between humans and computers.