Reddit plays a central role in training large language models. Now the social network is looking to monetize its data.

Whether it is OpenAI's GPT-3.5 and GPT-4, Meta's LLaMA, or Google's Bard, large language models are trained on Internet text, and a significant portion of that training data comes from Reddit threads.

The fact that this is happening without compensation seems to be a thorn in Reddit's side. Like the publishers who have already spoken out publicly against the use of their content to train generative AI models, Reddit has now joined the protest and announced consequences.

The Reddit corpus of data is really valuable. But we don’t need to give all of that value to some of the largest companies in the world for free.

Steve Huffman, co-founder and CEO of Reddit

Reddit plans to start charging companies to use its application programming interface (API), the network announced Tuesday.

Reddit API: AI training is now explicitly mentioned

The company has updated its Reddit API usage guidelines. Previously, they did not mention the use of Reddit data for machine learning and left the question to the broader legal landscape. Now they explicitly prohibit this use case unless express permission has been given.

You must not, and must not allow those acting on your behalf to:

  • use the Data APIs to encourage or promote illegal activity or violation of third party rights (including using User Content to train a machine learning or AI model without the express permission of rightsholders in the applicable User Content);

Reddit API Terms of Use.

The FAQ has also been updated to reflect this, allowing AI training on Reddit content only with the company's express permission. Use of the API for scientific purposes is not generally restricted, according to the site.

In the GPT-3.5 training data, for example, Reddit plays a role in several ways. Just over a fifth of the training data consists of the WebText2 dataset, which is made up of web pages linked from Reddit posts above a certain rating. Reddit is also part of the Common Crawl collections used by companies like OpenAI, Meta, and Google for AI training.
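To make the rating filter described above concrete, here is a minimal sketch of how such a selection step could look. It is not OpenAI's actual pipeline: the `Submission` type, the `select_outbound_links` function, and the `min_score` threshold are all hypothetical names chosen for illustration, and the real cutoff is only described here as "a certain rating".

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Submission:
    """Hypothetical stand-in for a Reddit post that links to a web page."""
    url: str
    score: int


def select_outbound_links(submissions, min_score):
    """Return unique outbound URLs from posts rated at least min_score.

    min_score is an assumed parameter; the article does not state the cutoff.
    """
    seen = set()
    selected = []
    for sub in submissions:
        if sub.score < min_score:
            continue  # post is rated too low to count as a quality signal
        host = urlparse(sub.url).netloc.lower()
        if host == "reddit.com" or host.endswith(".reddit.com"):
            continue  # keep only links that point outside Reddit
        if sub.url in seen:
            continue  # drop duplicate links
        seen.add(sub.url)
        selected.append(sub.url)
    return selected
```

The idea behind a filter like this is that user votes act as a cheap, human-curated quality signal: pages that Reddit users found worth upvoting are more likely to contain useful text than an arbitrary page from a web crawl.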

Reddit's move may be related to IPO

The timing of the announcement may be related to an initial public offering anticipated for later this year. With several new hires at the company, there is also speculation that Reddit is developing its own large language model.

Reddit isn't the only social network to try to monetize its API recently. Twitter, under Elon Musk, has also gone this route, making third-party applications virtually unusable.

How well Huffman's plan to monetize Reddit's data will work remains to be seen, as more than a decade of Reddit data is already publicly available via Common Crawl. However, the value of high-quality, human-curated data may increase in the future, and with it the value of Reddit threads.

Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Summary
  • Reddit plans to charge for the use of the API in the future.
  • Explicit consent will also be required to train AI models with Reddit's content.
  • The value of high-quality, human-curated data may increase in the future, so Reddit's plan to monetize its own data may work.
Jonathan works as a technology journalist who focuses primarily on how easily AI can already be used today and how it can support daily life.