Reddit plays a central role in training large language models. Now the social network is looking to monetize its data.
Whether it is OpenAI’s GPT-3.5 and GPT-4, Meta’s LLaMA, or Google’s Bard: large language models are trained on Internet text, and a significant portion of that training data comes from Reddit threads.
That this happens without compensation appears to be a thorn in Reddit’s side. Like the publishers who have already spoken out publicly against the use of their content to train generative AI models, Reddit has now joined the protest and announced consequences.
“The Reddit corpus of data is really valuable. But we don’t need to give all of that value to some of the largest companies in the world for free.”
Steve Huffman, co-founder and CEO of Reddit
Reddit plans to start charging companies to use its application programming interface (API), the network announced Tuesday.
Reddit API: AI training is now explicitly mentioned
The company has updated its Reddit API usage guidelines. Previously, the guidelines did not mention the use of Reddit data for machine learning and left the question to general law. They now explicitly prohibit this use unless the rightsholders have given permission.
You must not, and must not allow those acting on your behalf to:
use the Data APIs to encourage or promote illegal activity or violation of third party rights (including using User Content to train a machine learning or AI model without the express permission of rightsholders in the applicable User Content);
The FAQ has been updated accordingly: AI training on Reddit content is allowed only with the company’s express permission. According to the site, use of the API for scientific purposes generally remains unrestricted.
Reddit feeds into the GPT-3.5 training data, for example, in several ways. Just over a fifth of the training data consists of the WebText2 dataset, which is built from web pages linked in Reddit posts that scored above a certain rating. Reddit is also part of the Common Crawl collections used by companies like OpenAI, Meta, and Google for AI training.
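To illustrate the principle behind such a dataset, here is a minimal Python sketch of rating-based link filtering. It is not OpenAI's actual pipeline: the `Submission` record, the `select_outbound_links` function, and the threshold value are all hypothetical and serve only to show how user votes can act as a quality filter for web pages.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Illustrative threshold (an assumption); the article only says
# "above a certain rating".
MIN_SCORE = 3


@dataclass(frozen=True)
class Submission:
    """Hypothetical record for a Reddit link post (not a Reddit API type)."""
    url: str
    score: int


def select_outbound_links(submissions, min_score=MIN_SCORE):
    """Return unique non-Reddit URLs from posts scoring at least min_score."""
    seen = set()
    selected = []
    for post in submissions:
        host = urlparse(post.url).netloc.lower()
        is_reddit = host == "reddit.com" or host.endswith(".reddit.com")
        # Skip low-rated posts, links back into Reddit, and repeated URLs.
        if post.score < min_score or is_reddit or post.url in seen:
            continue
        seen.add(post.url)
        selected.append(post.url)
    return selected


sample = [
    Submission("https://example.org/essay", 12),
    Submission("https://example.org/essay", 40),     # duplicate link
    Submission("https://example.com/spam", 1),       # below threshold
    Submission("https://www.reddit.com/r/test", 9),  # internal link
]
print(select_outbound_links(sample))
```

In this sketch only the first link survives. The point is that the votes of Reddit users do the curation work, which is exactly the value Reddit now wants to be paid for.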
Reddit’s move may be related to IPO
The timing of the announcement may be related to Reddit’s initial public offering, which is expected later this year. Several new hires at the company have also fueled speculation that Reddit is developing its own large language model.
Reddit isn’t the only social network to try to monetize its API recently. Twitter, under Elon Musk, has also gone this route, making third-party applications virtually unusable.
How well Huffman’s plan to monetize Reddit’s data will work remains to be seen, since more than a decade of Reddit data is already publicly available via Common Crawl. However, the value of high-quality, human-curated data may increase in the future, and with it the value of Reddit threads.