
The Japanese government and major technology companies such as NEC, Fujitsu and SoftBank are investing "hundreds of millions" in the development of a Japanese language model.

The model is intended to capture cultural and linguistic subtleties better than ChatGPT and similar systems. It will be trained on Japanese texts using the national supercomputer Fugaku.

Few letters against many characters

Keisuke Sakaguchi, a researcher in natural language processing at Tohoku University in Japan, sees the differences between the writing systems and the limited Japanese training data of Western systems as a disadvantage for Japanese users.

For example, ChatGPT sometimes generates "extremely rare characters that most people have never seen" and rare unknown words, Sakaguchi said. Similarly, ChatGPT often fails to apply culturally appropriate and polite communication norms when generating responses in Japanese.


The English alphabet has 26 letters, while Japanese uses two sets of 48 basic characters (hiragana and katakana), plus 2,136 commonly used Chinese characters (kanji). In addition, many kanji have more than one pronunciation, and there are approximately 50,000 rarely used kanji.

Japan gets its own LLM benchmark

To measure LLMs' sensitivity to Japanese culture, researchers developed the Rakuda Ranking, which uses GPT-4-generated questions to assess how well LLMs can answer Japan-specific questions. The best open Japanese LLM currently ranks fourth. GPT-3.5 tops the list, and GPT-4 would likely outperform its predecessor by a wide margin.

The Japanese LLM being developed by Tokyo Institute of Technology, Tohoku University, Fujitsu, and government-funded RIKEN is expected to be released as open source next year and will have at least 30 billion parameters.

A much larger model is being built by Japan's Ministry of Education, Culture, Sports, Science and Technology. The model, with at least 100 billion parameters, will also be based on the Japanese language and optimized for scientific applications: drawing on published research, it is meant to generate new hypotheses to accelerate research. The model will cost approximately $200 million and is expected to be available to the public in 2031.

Recently, the Japanese Ministry of Education also issued guidelines allowing the limited use of generative artificial intelligence such as ChatGPT in elementary, middle, and high schools.

Summary
  • The Japanese government and technology companies such as NEC, Fujitsu and SoftBank are investing "hundreds of millions" in the development of a Japanese language model that is intended to capture cultural and linguistic subtleties better than ChatGPT and similar systems.
  • The Japanese language model is being trained on the national supercomputer Fugaku and is expected to be released as open source next year with at least 30 billion parameters.
  • Researchers have developed the Rakuda ranking to measure the sensitivity of LLMs to Japanese culture; the best open Japanese LLM currently ranks fourth.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.