
Updated on February 22, 2024:

The AI company licensing Reddit data is apparently Google, according to a Reuters report citing anonymous sources. Reuters confirms the license fee of 60 million dollars per year, although it remains unclear which data Reddit will provide in return, and how much of it.

Original article from February 17, 2024:

Reddit has signed a $60 million annual contract with an unnamed AI company, allowing that company to train its AI models on the platform's content.


According to Bloomberg, Reddit disclosed this in advance to potential investors, who are expected to support its planned IPO with a valuation of at least five billion US dollars. The deal shows how Reddit can capitalize on the current interest in AI training data.

Other social media platforms could also sell their user content in this way and generate additional revenue. Meta and X use their social media data to train their own AI models.

Reddit is widely assumed to play a central role in the training of large language models such as OpenAI's GPT-3.5 or GPT-4, Meta's LLaMA, or Google's models.

This is because many Reddit posts already carry a human rating thanks to the platform's upvote and downvote function, which facilitates pre-sorting. The posts also contain additional contextual links. Both of these factors make the data valuable to AI companies.

"The Reddit corpus of data is really valuable. But we don’t need to give all of that value to some of the largest companies in the world for free," said Reddit co-founder Steve Huffman in the spring of 2023.


At the time, Reddit announced that it would start charging companies that wanted to access user data through its API. Previous models were trained on Reddit data for free. Rising licensing costs for training future AI models are not limited to Reddit; they apply to other text sources as well.

AI companies are increasingly partnering with publishers to get data to train their models. OpenAI, for example, has confirmed a deal with Axel Springer that includes making Springer news available on ChatGPT. More deals will follow, the company said. Apple and Google are also said to be offering licensing deals to publishers.

Meta explained in a submission to the US Copyright Office that training AI exclusively on licensed material would be prohibitively expensive at the scale required. OpenAI also told the UK government that developing leading AI models is not possible without training on copyrighted material.

Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Summary
  • Reddit has signed a $60 million annual contract with an AI company, reportedly Google, that allows the company to train its AI models on the platform's content. This shows how Reddit can benefit from the high demand for AI training data.
  • Reddit posts are valuable to AI companies because they contain human ratings through upvote and downvote functions, as well as additional contextual links. Both facilitate the selection of high-quality training data for AI models.
  • AI companies are increasingly partnering with publishers to obtain data to train their models. This will lead to increasing licensing costs for training future AI models.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.