Microsoft is being sued by several authors who say their books were used without permission to train a Megatron model. The lawsuit, filed in federal court in New York, claims Microsoft used a dataset of about 200,000 pirated books to build a system that mimics the style, voice, and themes of the original works. The plaintiffs are asking for a ban on further use and up to $150,000 in damages per title.
Courts in similar cases involving Meta and Anthropic have said such use may qualify as "transformative" under fair use rules. But several questions remain unsettled: whether training on pirated books defeats a fair use defense, whether and to what extent scraping copyrighted content from the internet is legal, and whether such use harms the market for the original books, a factor that could keep it from being considered fair use.