AI in practice

Japan develops its own culturally sensitive language model

Matthias Bastian
Abstract visualization of the Japanese flag in a data stream

Midjourney prompted by THE DECODER

The Japanese government and major technology companies such as NEC, Fujitsu and SoftBank are investing "hundreds of millions" in the development of a Japanese language model.

The Japanese language model is intended to capture cultural and linguistic subtleties better than ChatGPT and similar models. It will be trained on Japanese texts using the national supercomputer Fugaku.

Few letters versus many characters

Keisuke Sakaguchi, a natural language processing researcher at Tohoku University in Japan, sees the differences between the writing systems and the limited amount of Japanese training data in Western models as a disadvantage for Japanese users.

For example, ChatGPT sometimes generates "extremely rare characters that most people have never seen" and rare unknown words, Sakaguchi said. Similarly, ChatGPT often fails to apply culturally appropriate and polite communication norms when generating responses in Japanese.

The English alphabet has 26 letters, while Japanese uses two sets of at least 48 basic characters each (the hiragana and katakana syllabaries) plus 2,136 Kanji characters in common use. Each character can have several pronunciations, and there are approximately 50,000 additional, rarely used Kanji.
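This size difference shows up directly in how today's tokenizers handle Japanese. The following minimal sketch, which is not from the article and assumes the open-source tiktoken package and two made-up example sentences, compares how the byte-pair-encoding tokenizer used by GPT-3.5 and GPT-4 splits English and Japanese text of similar meaning:

```python
# Minimal sketch (illustrative, not from the article): compare how a
# largely English-trained BPE tokenizer splits English vs. Japanese text.
# Assumes the open-source `tiktoken` package (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-3.5/GPT-4

samples = {
    "English": "The weather is nice today, isn't it?",
    "Japanese": "今日はいい天気ですね。",
}

for label, text in samples.items():
    tokens = enc.encode(text)
    # English words are often packed several characters per token, while
    # kana and Kanji frequently need one or more tokens each, so the same
    # content tends to cost more tokens in Japanese.
    print(f"{label}: {len(text)} characters -> {len(tokens)} tokens")
```

Fewer characters per token means Japanese text is represented less compactly, which is one reason models trained mostly on English handle it less gracefully.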

Japan gets its own LLM benchmark

To measure LLMs' sensitivity to Japanese culture, researchers developed the Rakuda Ranking, which uses GPT-4-generated, Japan-specific questions and compares how well different LLMs answer them. The best open Japanese LLM currently ranks fourth; the list is topped by GPT-3.5, and GPT-4 is expected to outperform its predecessor by a clear margin.
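The article does not detail how Rakuda turns judge verdicts into a ranking. Benchmarks of this kind commonly aggregate pairwise "model A vs. model B" decisions from a judge model into Elo-style ratings; the sketch below illustrates that general idea, with model names and match outcomes that are entirely made up for the example:

```python
# Illustrative only: aggregate hypothetical pairwise judge verdicts
# (e.g. GPT-4 deciding which answer to a Japan-specific question is better)
# into Elo-style ratings. Names and results are invented for this sketch.
from collections import defaultdict

K = 32  # Elo update factor

def expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, winner: str, loser: str) -> None:
    """Shift both ratings toward the observed pairwise outcome."""
    e_w = expected(ratings[winner], ratings[loser])
    ratings[winner] += K * (1 - e_w)
    ratings[loser] -= K * (1 - e_w)

ratings = defaultdict(lambda: 1000.0)

# Hypothetical verdicts: (winner, loser)
matches = [
    ("gpt-3.5", "open-ja-llm"),
    ("open-ja-llm", "baseline-a"),
    ("gpt-3.5", "baseline-b"),
    ("baseline-a", "baseline-b"),
]
for winner, loser in matches:
    update(ratings, winner, loser)

for model, score in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {score:.0f}")
```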

The Japanese LLM being developed by the Tokyo Institute of Technology, Tohoku University, Fujitsu, and the government-funded research institute RIKEN is expected to be released as open source next year and will have at least 30 billion parameters.

A much larger model is being built by Japan's Ministry of Education, Culture, Sports, Science and Technology. With at least 100 billion parameters, it will also focus on Japanese and be tailored to scientific applications: drawing on published research, it is intended to generate new hypotheses and accelerate research. The model will cost approximately $200 million and is expected to be available to the public in 2031.

Recently, the Japanese Ministry of Education also issued guidelines allowing the limited use of generative artificial intelligence such as ChatGPT in elementary, middle, and high schools.
