
A new study reveals that AI models are rapidly losing access to their web-based training data. If this trend of restrictions continues, future models may have to learn from less, more biased, and more outdated information.


The Data Provenance Initiative, an independent academic group, conducted a large-scale study documenting a rapid decrease in web data access for AI models. Researchers analyzed robots.txt files and terms of use for 14,000 web domains that serve as sources for popular AI training datasets like C4, RefinedWeb, and Dolma.

From April 2023 to April 2024, the percentage of tokens in these datasets that is completely blocked for AI crawlers rose from about 1% to 5-7%. Tokens are the word and subword units that text is broken into before it is used to train AI models.
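To make the unit concrete, here is a minimal tokenization sketch. It assumes the open-source tiktoken package (the tokenizer OpenAI publishes for its models) is installed; the sample sentence and the tokenizer choice are purely illustrative, and other models split text differently.

```python
# Minimal tokenization sketch (assumes: pip install tiktoken).
# Sentence and tokenizer are illustrative only.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # tokenizer used by recent OpenAI models
text = "AI models learn from web data."

token_ids = encoding.encode(text)                         # one integer ID per token
token_pieces = [encoding.decode([t]) for t in token_ids]  # the text piece behind each ID

print(token_ids)       # a short list of integers
print(token_pieces)    # word and subword pieces, roughly one per word here
print(len(token_ids), "tokens")
```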

The increase was even more significant for key data sources, where the proportion of blocked tokens jumped from less than 3% to 20-33%. Researchers predict this trend will continue in the coming months. OpenAI faces the most frequent blocks, followed by Anthropic and Google.
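For context, such a block amounts to a few lines of plain text per crawler in a site's robots.txt file. The sketch below uses only Python's standard library and a made-up robots.txt (not taken from any site in the study) to show how the publicly documented AI crawler tokens GPTBot (OpenAI), anthropic-ai (Anthropic), and Google-Extended (Google) can be disallowed while a regular search crawler stays allowed:

```python
# Illustrative robots.txt check using only the Python standard library.
# The robots.txt content is an invented example of the kind of block the
# study counts; GPTBot, anthropic-ai and Google-Extended are the publicly
# documented AI crawler tokens of OpenAI, Anthropic and Google.
from urllib.robotparser import RobotFileParser

example_robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(example_robots_txt.splitlines())

for crawler in ["GPTBot", "anthropic-ai", "Google-Extended", "Googlebot"]:
    allowed = parser.can_fetch(crawler, "https://example.com/some-article")
    print(f"{crawler}: {'allowed' if allowed else 'blocked'}")
# Expected output: the three AI crawlers are blocked, Googlebot is allowed.
```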

The visualization from the Data Provenance Initiative shows that many content providers began blocking AI companies from accessing their content in the second half of 2023, either through robots.txt files, clauses in their terms of use, or both. | Image: Data Provenance Initiative

News websites, forums, and social media platforms are the main sources imposing restrictions. On news sites, the share of completely blocked tokens surged from 3% to 45% within a year.

As a result, their representation in the training data is likely to decline in favor of corporate and e-commerce sites, which impose fewer restrictions but often offer lower-quality content. This is a particular problem for AI developers, since the industry has found that training on high-quality data produces better models.

The study also highlights a disparity between the actual use of generative AI models and the content of their training data. This could be relevant in legal cases where publishers sue AI companies, claiming that services like ChatGPT compete with their information offerings based on the publishers' content.

According to the researchers, the web services that dominate the training data do not match how ChatGPT is actually used. Left: share of tokens per web service and how those services are monetized through paywalls and advertising. Right: share of different user requests in WildChat, a dataset of ChatGPT interactions. | Image: Data Provenance Initiative

Overall, this development could make it more difficult, or at least more expensive, to train powerful and reliable AI systems. Providers of high-quality content could find new revenue streams and emerge as major beneficiaries. However, OpenAI, as well as Meta CEO Mark Zuckerberg, has said that licensing all the data needed to train a good AI model would be impossible or unaffordable.

For example, OpenAI has recently negotiated several multi-million-dollar deals with publishers to access their content for real-time display in chat systems and for AI training. Other companies are likely to follow suit, unless a fair-use ruling dramatically changes the situation.

Summary
  • A new study by the Data Provenance Initiative reveals that AI models are rapidly losing access to their web-based training data, with the percentage of completely blocked tokens rising from 1% to 5-7% in just one year.
  • News websites, forums, and social media platforms are the main sources imposing restrictions, with the share of blocked tokens on news sites surging from 3% to 45%, potentially leading to a decline in representation in favor of lower-quality corporate and e-commerce sites.
  • This trend could make it more difficult and expensive to train powerful and reliable AI systems, forcing them to learn from less, more biased, and outdated information, while high-quality content providers could potentially find new revenue streams through licensing deals with AI companies.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.