A recent study by the Reuters Institute shows that a significant number of news websites are blocking AI crawlers from OpenAI and Google. These crawlers collect data from websites to train large language models (LLMs) and to extract real-time information from the web.

According to the study, 48 percent of the most widely used news sites in ten countries were blocking OpenAI's crawler by the end of 2023. A smaller share, 24 percent, also blocked Google's AI crawler. However, almost all sites that block Google's AI crawler also block OpenAI's. The greater reluctance to block Google's crawler may reflect publishers' concern about the potential impact on their search rankings.

The differences between countries are significant, with the percentage of news sites blocking OpenAI ranging from 79 percent in the U.S. to just 20 percent in Mexico and Poland. For Google, the numbers range from 60 percent in Germany to 7 percent in Poland and Spain.

Image: Reuters Institute

Print media sites are more likely to block crawlers than broadcast or digital-born publishers, which is surprising given that language models and chatbots pose the greatest economic risk to digital content. Publishers can block web crawlers via the robots.txt file on their sites.

Image: Reuters Institute
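
For illustration, a blocking entry in a publisher's robots.txt could look like the following. GPTBot and Google-Extended are the user-agent tokens that OpenAI and Google document for their AI crawlers; this is a minimal sketch, not any specific publisher's configuration:

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

Google's regular search crawler, Googlebot, uses a separate user-agent token, so blocking Google-Extended does not by itself remove a site from search results.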

There are two main reasons for blocking crawlers. The first is compensation: The New York Times, for example, believes it should be paid if OpenAI and other AI companies use its content to train AI models.

The second is traffic: other publishers fear that chatbot platforms will not link back to their sites, or that such links will be displayed but rarely clicked. That would mean significant financial losses for publishers, whose business depends largely on traffic to their own sites.

Major AI companies such as OpenAI, Microsoft, and Google have recognized the problem, but have yet to address it. Currently, some publishers are said to be negotiating licensing deals with AI companies.

OpenAI has already announced a deal with Axel Springer and other publishers. It covers both the use of their content as training data and the display of their news reporting in chatbot responses.

Crawlers, also known as "spiders" or "bots," systematically collect data from the internet for various purposes. Search engines, for example, use the data collected by their web crawlers to index websites and answer search queries quickly.


AI companies like OpenAI use crawlers to collect web data for training their models. These models require vast amounts of data, and the web is a major source of high-quality textual and audiovisual material.
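
To make the mechanism concrete, here is a minimal sketch in Python (standard library only) of how a well-behaved crawler checks a site's robots.txt before fetching a page. The site URL and user-agent string are placeholders chosen for illustration; real crawlers also handle rate limits, sitemaps, and many other details:

from urllib import robotparser, request

# Illustrative placeholders -- not a real crawler configuration
SITE = "https://example.com"
USER_AGENT = "GPTBot"

# Fetch and parse the site's robots.txt
rp = robotparser.RobotFileParser()
rp.set_url(SITE + "/robots.txt")
rp.read()

page = SITE + "/some-article"

# Only download the page if robots.txt permits this user agent
if rp.can_fetch(USER_AGENT, page):
    req = request.Request(page, headers={"User-Agent": USER_AGENT})
    with request.urlopen(req) as response:
        html = response.read()
        # ... text would then be extracted for indexing or training ...
else:
    print(f"{USER_AGENT} is disallowed for {page}")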

Summary
  • A study by the Reuters Institute shows that 48 percent of the most used news sites in ten countries blocked the OpenAI crawler by the end of 2023. Only 24 percent also blocked Google's AI crawler.
  • The reasons for blocking are demands for financial compensation for the use of publishers' content, and the fear that chatbot platforms will not link to publishers or that users will not click these links, which would lead to significant losses.
  • Major AI companies such as OpenAI, Microsoft, and Google are aware of the problem, but there is no solution yet. Some publishers are currently negotiating licensing agreements with AI companies. OpenAI has already signed an agreement with Axel Springer and other publishers.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.