
The Chinese government has released a dataset to train language models that reflect their political views. This is another example of how the Chinese government is trying to control generative AI.


The Artificial Intelligence Security Governance Professional Committee of the Cyberspace Administration of China (CAC) announced a public dataset of 50 billion tokens across 100 million data points. The dataset is officially approved by the government and aligned with its policy line.

In terms of dataset size, the filtered version of the Common Crawl dataset used to train GPT-3 has approximately 410 billion tokens. Meta's Llama-2 models were pre-trained on 2 trillion tokens.

So at roughly one-eighth the size of GPT-3's filtered Common Crawl corpus and 2.5 percent of Llama 2's pre-training data, the CCP dataset is relatively small and probably not enough to train a large, capable language model on its own. But it can form part of a broader data mix and be used to align an LLM with the government's positions.


Those interested can download the dataset from the CAC website after registration and authentication.

The CCP's struggle for control where control is difficult

The dataset announcement is noteworthy because it shows that the Chinese government continues to try to reconcile the language and image capabilities of large AI models, and their inherently unpredictable output, with its strict political discourse.

China released guidelines for generative AI services this past summer. For example, organizations that offer AI systems to the public must undergo a safety review process that checks for alignment with the CCP's political views. Generative AI services must adhere to the "core values of socialism" and not attempt to overthrow state power or the socialist system.

A recent CNN test of Baidu's ERNIE Bot, China's answer to ChatGPT, shows what this looks like in practice: ERNIE refused to answer questions about the Tiananmen massacre or Xi Jinping's lifting of term limits. After several such inquiries, CNN's account was suspended.

Baidu's image AI had previously blocked the generation of images for political prompts, such as "Tiananmen Square," the site of the Tiananmen massacre.

Summary
  • The Chinese government has released an official dataset of 50 billion tokens to train language models that reflect its political views.
  • The dataset, unveiled by the Cyberspace Administration of China (CAC), is in line with the government's policy line and can be downloaded from the CAC website after registration and authentication.
  • The move demonstrates China's efforts to harmonize the linguistic and visual capabilities of large-scale AI models with its strict political language use, and to align generative AI services with the "core values of socialism."
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.