What data can be processed to train AI models? Japan and Israel are staking out initial positions, but like everything else on this topic, those positions are still at an early stage.
Large language and image models are trained on a huge amount of data from the Internet. Much of this data is copyrighted and has not been explicitly released for AI model training.
As a result, the legal viability of such models has been under debate, especially in design and art, ever since image generators such as Stable Diffusion became widely available.
Japanese law supports generative AI
In a hearing with Japanese politician Takashi Kii in late April, Japan's Minister of Education, Culture, Sports, Science and Technology, Keiko Nagaoka, confirmed that existing Japanese law allows the use of data collected on the Internet for both non-commercial and commercial purposes. She said this in response to his question about potential copyright issues with generative AI.
While this is not an explicit endorsement of the legitimacy of large AI models trained on copyrighted data, it is a snapshot of existing Japanese law. At the hearing, Kii said he believes new copyright rules adapted to the AI era are needed, so the issue is far from resolved.
Kii also said that Japan does not yet have rules for dealing with generative AI in an educational context.
Israel's Ministry of Justice weighs in on copyright and AI training data
A more specific position paper published by the Israeli Ministry of Justice in 2022 (via Project Disco) states that "typically" the fair use doctrine applies to AI training data from the web, and that some projects may fall under a doctrine that allows "incidental use of copyrighted material" if the copyrighted works are deleted at the end of the training process.
Excluded from this approach are models trained specifically on the works of individual creators in order to compete with them. For example, imagine an AI system trained exclusively on Harry Potter novels to generate more of them.
In addition, the statement refers only to the training and not to the output of the systems, which could infringe copyrights regardless of the training process, the Ministry of Justice notes.
Another special case in the copyright debate is likely to be chatbots, such as those from Microsoft, OpenAI, and Google, which scan web content in real time and present it in a slightly modified form, for example as a search result.
This copyright debate is separate from the debate over copyrighted material in training datasets, although publishers are likely to try to assert any rights they may have if their works are used for AI training or generation without their permission.