Publishers have pulled 80 billion training tokens from Google DeepMind, but Google is still using their content to train AI for search.
Google uses content from across the web to train its AI models for search, even when website owners have explicitly opted out. That was the message from Eli Collins, Vice President at Google DeepMind, during a Washington court hearing, according to Bloomberg.
The current publisher opt-out system only applies to DeepMind, Google's AI research division that trains the Gemini models. Other parts of Google, including the team responsible for web search, can still use this content for their own AI systems.
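The testimony as reported does not name the opt-out mechanism. Assuming it refers to the `Google-Extended` token that Google documents for robots.txt, a minimal Python sketch of how such an opt-out can sit alongside normal search crawling might look like this. The publisher domain, the robots.txt contents, and the helper name `allowed` are hypothetical and chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example publisher (not taken from the article).
# It blocks the Google-Extended token but leaves the search crawler allowed.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: Googlebot
Allow: /
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


URL = "https://publisher.example/article"  # hypothetical page

# The publisher has opted out under the Google-Extended token...
opted_out_of_training = not allowed(ROBOTS_TXT, "Google-Extended", URL)
# ...but the same page is still open to the search crawler.
still_crawled_for_search = allowed(ROBOTS_TXT, "Googlebot", URL)

print("Opted out via Google-Extended:", opted_out_of_training)
print("Still crawlable by Googlebot:", still_crawled_for_search)
```

With these rules, the check reports the site as opted out for `Google-Extended` while `Googlebot` may still fetch the same page, which is consistent with the split the testimony describes. The sketch only evaluates robots.txt rules; it says nothing about what Google does with crawled content afterward.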
During the hearing, Diana Aguilar of the US Department of Justice asked, "Once you take the Gemini [AI model] and put it inside the search org, the search org has the ability to train on the data that publishers had opted out of training, correct?" Collins replied, "Correct — for use in search."
Google uses this data to power search features like "AI Overviews," which display AI-generated answers directly above traditional search results. This puts Google in direct competition with website owners, since users may get answers without clicking through to the sites whose content was used to generate them.
Google "loses" half of publisher training data due to opt-out
An internal Google document from summer 2024 listed 160 billion tokens, or short snippets of text, that were originally intended for AI training. Of these, 80 billion tokens were removed because they came from publishers who opted out.
But Collins' testimony suggests that this data is still being used for Google's web search AI, just not by DeepMind directly. In practice, Google is using the data elsewhere inside the company, even though publishers were trying to block all Google AI training.
These details came out as part of an ongoing antitrust case against Google in federal court. The US Department of Justice is pushing for Google to sell off the Chrome browser and to stop paying hardware and app makers to set Google as the default search engine. The DOJ says these restrictions should also apply to Google's AI products, including Gemini, since they benefit from the same search monopoly.
If leading AI labs need high-quality training data to keep their models performing well, a market for this kind of content could emerge. But that would go against the current practice of scraping freely available content from the open web, often justified as "fair use." A US judge recently dismissed this defense in a case involving Meta.