AI in practice

CCP releases politically approved LLM dataset with 50 billion tokens

Matthias Bastian
[Image: abstract visualization of a data stream of glowing Chinese tokens flowing through a futuristic digital network]

DALL-E 3 prompted by THE DECODER

The Chinese government has released a dataset for training language models that reflect its political views, another example of its efforts to control generative AI.

The Artificial Intelligence Security Governance Professional Committee of the Cyberspace Administration of China (CAC) has announced a public dataset of roughly 100 million data points totaling 50 billion tokens. The dataset is officially approved by the government and aligned with its policies.

For a sense of scale: the filtered Common Crawl portion used to train GPT-3 contains about 410 billion tokens, and Meta's Llama 2 models were pre-trained on 2 trillion tokens.

The CCP dataset is therefore relatively small and likely insufficient on its own to train a large, capable language model. It could, however, form part of a broader data mix or be used to align an LLM with approved positions.

Those interested can download the dataset from the CAC website after registration and authentication.

The CCP's struggle for control where control is difficult

The dataset announcement is noteworthy because it shows the Chinese government still trying to reconcile the language and image capabilities of large AI models, including their inherently unpredictable outputs, with its strict political discourse.

China released guidelines for generative AI services this past summer. For example, organizations that offer AI systems to the public must undergo a safety review process that checks for alignment with the CCP's political views. Generative AI services must adhere to the "core values of socialism" and not attempt to overthrow state power or the socialist system.

Baidu's ERNIE Bot, the Chinese counterpart to ChatGPT, shows what this looks like in practice: in a recent CNN test, ERNIE refused to answer questions about the Tiananmen massacre or Xi Jinping's removal of presidential term limits. After repeated inquiries, CNN's account was suspended.

Baidu's image AI had previously blocked the generation of images for political prompts, such as "Tiananmen Square," the site of the Tiananmen massacre.