
The Chinese government has released a dataset for training language models that reflect its political views. This is another example of how the Chinese government is trying to control generative AI.

The Artificial Intelligence Security Governance Professional Committee of the Cyberspace Administration of China (CAC) announced a public dataset of 50 billion tokens in 100 million data points. This dataset has been officially approved by the government and is in line with its policies.

In terms of dataset size, the filtered version of the Common Crawl dataset used to train GPT-3 has approximately 410 billion tokens. Meta's Llama-2 models were pre-trained on 2 trillion tokens.

So the CCP dataset is relatively small and probably not enough on its own to train a large, capable language model. But it could serve as one component of a pretraining data mix or be used to align an existing LLM with the government's preferred positions.
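As a rough illustration of how a comparatively small corpus can still matter, here is a minimal Python sketch of weighted corpus sampling in a pretraining data mix. All corpus names, token counts, and mixing weights below are hypothetical and not taken from the CAC announcement; the general technique of oversampling a small dataset is a common pretraining practice, not a description of any official recipe.

```python
# Minimal sketch (hypothetical): blending a small, curated corpus into a
# larger pretraining mix via sampling weights. Names, sizes, and weights
# are illustrative only.
import random

# Hypothetical corpora with approximate token counts
corpora = {
    "web_crawl":       410_000_000_000,  # e.g. a filtered Common Crawl-style corpus
    "books_and_code":  100_000_000_000,
    "approved_corpus":  50_000_000_000,  # small, curated "aligned" dataset
}

# Oversample the small corpus so it still shapes the model:
# here it gets 10% of training samples despite being <10% of total tokens.
mix_weights = {"web_crawl": 0.75, "books_and_code": 0.15, "approved_corpus": 0.10}

def sample_source(weights: dict) -> str:
    """Pick which corpus the next training example is drawn from."""
    names, probs = zip(*weights.items())
    return random.choices(names, weights=probs, k=1)[0]

if __name__ == "__main__":
    draws = [sample_source(mix_weights) for _ in range(10_000)]
    for name in corpora:
        share = draws.count(name) / len(draws)
        print(f"{name}: {share:.1%} of sampled examples")
```

The point of the weights is that a 50-billion-token corpus can be given a disproportionate share of training samples relative to its size, which is how a small dataset can still noticeably influence a model's behavior.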


Those interested can download the dataset from the CAC website after registration and authentication.

The CCP's struggle for control where control is difficult

The dataset announcement is noteworthy because it shows that the Chinese government continues to try to reconcile the linguistic and visual capabilities of large AI models, and their inherently unpredictable outputs, with its strict political discourse.

China released guidelines for generative AI services this past summer. For example, organizations that offer AI systems to the public must undergo a safety review process that checks for alignment with the CCP's political views. Generative AI services must adhere to the "core values of socialism" and not attempt to overthrow state power or the socialist system.

Baidu's ERNIE bot, the Chinese answer to ChatGPT, showed what this looks like in practice in a recent CNN test: ERNIE refused to answer questions about the Tiananmen massacre or Xi Jinping's abolition of term limits. After several such questions, CNN's account was suspended.

Baidu's image AI had previously blocked the generation of images for political prompts, such as "Tiananmen Square," the site of the Tiananmen massacre.

Summary
  • The Chinese government has released an official dataset of 50 billion tokens to train language models that reflect its political views.
  • The dataset, unveiled by the Cyberspace Administration of China (CAC), is in line with the government's policy line and can be downloaded from the CAC website after registration and authentication.
  • The move demonstrates China's efforts to harmonize the linguistic and visual capabilities of large-scale AI models with its strict political language use, and to align generative AI services with the "core values of socialism."
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.