The Atlantic has developed a search tool that lets users check whether their work appears in LibGen, a massive archive of pirated books, scientific papers, and articles that was reportedly used to train language models. According to court documents, Meta used the LibGen dataset to train its Llama models. OpenAI told Gizmodo that LibGen content is not included in the current versions of ChatGPT or in OpenAI's API. Other AI companies have not yet commented on whether they used LibGen data to train their models. Microsoft recently began offering book licensing deals to publishers.

Image: The Atlantic
Matthias is the co-founder and publisher of THE DECODER, exploring how AI is fundamentally changing the relationship between humans and computers.