CCP releases politically approved LLM dataset with 50 billion tokens

Dec 29, 2023


The Chinese government has released a dataset for training language models that reflect its political views. This is another example of how the Chinese government is trying to control generative AI.

The Artificial Intelligence Security Governance Professional Committee of the Cyberspace Administration of China (CAC) announced a public dataset of 50 billion tokens in 100 million data points. This dataset has been officially approved by the government and is in line with its policies.

In terms of dataset size, the filtered version of the Common Crawl dataset used to train GPT-3 has approximately 410 billion tokens. Meta's Llama-2 models were pre-trained on 2 trillion tokens.

So the CCP dataset is relatively small and probably not enough on its own to train a large, capable language model. But it could be part of a larger data mix or be used to align an LLM with government positions.
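To put these figures side by side, here is a minimal Python sketch of the arithmetic. It uses only the approximate token counts quoted in this article; the variable and function names are illustrative and not taken from any official source.

```python
# Token counts as reported in the article (approximate figures).
CAC_TOKENS = 50e9            # CAC dataset, total tokens
CAC_DATA_POINTS = 100e6      # CAC dataset, number of data points
GPT3_COMMON_CRAWL = 410e9    # filtered Common Crawl used for GPT-3
LLAMA2_PRETRAIN = 2e12       # Llama 2 pre-training tokens


def share(part: float, whole: float) -> float:
    """Return `part` as a percentage of `whole`."""
    return 100 * part / whole


avg_tokens_per_point = CAC_TOKENS / CAC_DATA_POINTS
share_of_gpt3_cc = share(CAC_TOKENS, GPT3_COMMON_CRAWL)
share_of_llama2 = share(CAC_TOKENS, LLAMA2_PRETRAIN)

print(f"Average tokens per data point: {avg_tokens_per_point:.0f}")
print(f"Share of GPT-3's filtered Common Crawl: {share_of_gpt3_cc:.1f}%")
print(f"Share of Llama 2's pre-training data: {share_of_llama2:.1f}%")
```

By these numbers, the dataset averages about 500 tokens per data point and amounts to roughly 12 percent of GPT-3's filtered Common Crawl portion and 2.5 percent of Llama 2's pre-training data.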

Those interested can download the dataset from the CAC website after registration and authentication.

The CCP's struggle for control where control is difficult

The dataset announcement is noteworthy because it shows that the Chinese government is still trying to reconcile the text and image capabilities of large AI models, and their inherent unpredictability, with its tightly controlled political discourse.

China released guidelines for generative AI services in the summer of 2023. Under these rules, organizations that offer AI systems to the public must undergo a safety review process that checks for alignment with the CCP's political views. Generative AI services must adhere to the "core values of socialism" and must not attempt to overthrow state power or the socialist system.

Baidu's ERNIE Bot, the Chinese counterpart to ChatGPT, shows what this looks like in practice. In a recent test by CNN, ERNIE did not answer questions about the Tiananmen massacre or Xi Jinping's lifting of term limits. After several such questions, CNN's account was suspended.

Baidu's image AI had previously blocked the generation of images for political prompts, such as "Tiananmen Square," the site of the Tiananmen massacre.
