Updated on February 22, 2024:
The AI company licensing Reddit data is apparently Google. This was reported by Reuters, citing anonymous sources. Reuters confirms the license fee of 60 million dollars per year, although it is unclear to what extent and what data Reddit will provide in return.
Original article from February 17, 2024:
Reddit has signed a $60 million annual contract with an unnamed AI company to use the platform's content to train its AI models.
According to Bloomberg, Reddit disclosed this in advance to potential investors, who are expected to support its planned IPO with a valuation of at least five billion US dollars. The deal shows how Reddit can capitalize on the current interest in AI training data.
Other social media platforms could also sell their user content in this way and generate additional revenue. Meta and X use their social media data to train their own AI models.
Many assume that Reddit plays a central role in the training of large language models such as OpenAI's GPT-3.5 or GPT-4, Meta's LLaMa, or Google's models.
This is because many Reddit posts already carry a human rating thanks to the platform's upvote and downvote function, which facilitates pre-sorting. The posts also contain additional contextual links. Both of these factors make the data valuable to AI companies.
"The Reddit corpus of data is really valuable. But we don’t need to give all of that value to some of the largest companies in the world for free," said Reddit co-founder Steve Huffman in the spring of 2023.
At the time, Reddit announced that it would start charging companies that wanted to access user data through its API. Previous models were trained on Reddit data for free. These rising licensing costs for training future AI models affect other text sources in addition to Reddit.
AI companies are increasingly partnering with publishers to get data to train their models. OpenAI, for example, has confirmed a deal with Axel Springer that includes making Springer news available on ChatGPT. More deals will follow, the company said. Apple and Google are also said to be offering licensing deals to publishers.
Meta explained in a submission to the US Copyright Office that training AI on purely licensed material would be prohibitively expensive on the scale required. OpenAI also told the UK government that the development of leading AI models is not possible without training on licensed material.