
OpenAI's GPTBot crawls the web for content that can be used by AI models. If you do not want this, you can block the bot.


The content that GPTBot visits can be used to improve future AI models, according to OpenAI. Those who give GPTBot access to their content are helping to make AI models more accurate, capable, and safe, the company writes.

Block GPTBot from crawling your site

If you do not want to share your content with OpenAI's models for free, you can block GPTBot. Rules addressed to "User-agent: GPTBot" can keep the bot off your site altogether or only out of individual folders or categories. As with a Google crawler, you control GPTBot by adding these rules to your robots.txt file, using the following user agent token and directives:

User agent token: GPTBot
Full user-agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)
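
The token above is also what shows up in the User-Agent header of GPTBot's requests, so it can be used to spot those requests in server logs. The following is a minimal Python sketch of such a check, not an official method: the helper name `is_gptbot` and the word-boundary match are assumptions made for illustration, and the Googlebot string is a shortened stand-in. Because any client can send an arbitrary User-Agent header, a match only shows what the client claims to be.

```python
import re

# Assumption: finding the "GPTBot" token anywhere in the header is enough for
# this illustration. is_gptbot is a hypothetical helper, not an OpenAI API.
GPTBOT_TOKEN = re.compile(r"\bGPTBot\b", re.IGNORECASE)


def is_gptbot(user_agent: str) -> bool:
    """Return True if the User-Agent header contains the GPTBot token."""
    return bool(GPTBOT_TOKEN.search(user_agent or ""))


# Full user-agent string as quoted in the article.
FULL_UA = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; "
    "GPTBot/1.0; +https://openai.com/gptbot)"
)

print(is_gptbot(FULL_UA))                                    # True
print(is_gptbot("Mozilla/5.0 (compatible; Googlebot/2.1)"))  # False
```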


To block GPTBot from your entire site:
User-agent: GPTBot
Disallow: /

Example of allowing one directory and blocking another:
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
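
Before deploying rules like these, it can help to check how a parser reads them. The sketch below uses Python's standard `urllib.robotparser` module for that. The helper `gptbot_may_fetch` and the `example.com` URLs are hypothetical, and Python's parser applies rules in file order, so its verdict is only an approximation of how GPTBot itself resolves the rules.

```python
from urllib.robotparser import RobotFileParser

# The example rules from the article.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
"""


def gptbot_may_fetch(robots_txt: str, url: str) -> bool:
    """Return True if the given robots.txt text lets GPTBot fetch the URL.

    Hypothetical helper for illustration; it parses the text locally and
    makes no network request.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("GPTBot", url)


# example.com is a placeholder domain.
print(gptbot_may_fetch(ROBOTS_TXT, "https://example.com/directory-1/page.html"))  # True
print(gptbot_may_fetch(ROBOTS_TXT, "https://example.com/directory-2/page.html"))  # False

# The site-wide block refuses every path for GPTBot.
BLOCK_ALL = "User-agent: GPTBot\nDisallow: /\n"
print(gptbot_may_fetch(BLOCK_ALL, "https://example.com/any/page.html"))  # False
```

Paths that no rule mentions stay allowed, so the second variant is the one to use if nothing on the site should be crawled.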

According to OpenAI, pages are automatically filtered out if they sit behind paywalls, request personally identifiable information, or violate OpenAI's content guidelines. Full instructions are available in OpenAI's GPTBot documentation.

ChatGPT and the Content Dilemma

With the launch of ChatGPT's web browsing feature, OpenAI announced that website owners such as publishers could block the crawling bot if they did not want their content to be used within or for ChatGPT.

Blocking the bot, however, means not being present in a potentially emerging content ecosystem - a dilemma similar to (non-)indexing in Google search, where content providers inadvertently become both suppliers to and financially dependent on a third-party ecosystem.


In the case of chatbots, however, the starting position for content providers is even less favorable: While search engines are (largely) designed to send searchers to websites, where their visits create value for the site operator, chatbots are optimized to give searchers the most direct and comprehensive answers possible right in the chat. This almost exclusively benefits the provider of the chatbot.

OpenAI currently does not offer web browsing. The feature was taken offline after it was discovered that ChatGPT browsing could partially read content behind paywalls and pull it into the chat for free. It is not known when the browsing plugin will be back online. OpenAI may be concerned about further legal repercussions over this kind of content use.

Meta, Microsoft and Google also train their chatbots with copyrighted material and pull content from websites into their chatbots without further consent. Publishers are reportedly in talks with these companies about charging billions for the use of their content.

So far, major chatbot providers like Microsoft have paid lip service, at best, to keeping the web ecosystem open. Google's new AI search is designed to keep users in the Google ecosystem much longer than traditional web search.

Summary
  • OpenAI's GPTBot crawls the web for content that can be used by AI models. Site owners can block the bot if they do not want this to happen.
  • GPTBot can be blocked by adding "User-agent: GPTBot" to the robots.txt file with appropriate permissions, similar to how webmasters configure Google crawlers.
  • For content providers, there is a dilemma between participating in a potentially emerging AI ecosystem and avoiding becoming a free content provider for, and economically dependent on, a third-party system.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.