2025-11-05
AI research
Jonathan Kemper

German Commons shows that big AI datasets don’t have to live in copyright limbo

Jonathan writes for THE DECODER about how AI tools can improve both work and creative projects.

German Commons is now the largest openly licensed German text dataset, offering a foundation for building legally compliant German language models.


Most large language models train on web data with unclear copyright. German Commons takes a different approach: every text comes from institutions with clear, verifiable licensing. The project, led by the University of Kassel, the University of Leipzig, and hessian.AI, relied on the licensing info provided by these sources, without additional verification. According to their study, the team collected 154.56 billion tokens from 35.78 million documents.

The dataset pulls from 41 sources across seven categories: web content, political documents, legal texts, news, business, cultural, and scientific material. Contributors include the German National Library, Austrian National Library, the German Digital Dictionary (DWDS), the Leibniz Institute for the German Language (IDS), and Wikimedia projects.

News and historical texts shape the dataset

News makes up the largest chunk of the collection, with cultural content coming next. Most of this comes from historical newspaper archives and digitized books from the 1700s to 1900s. Web content is a smaller share, and science and business are underrepresented.


Most texts are public domain, and all licenses allow redistribution, modification, and commercial use.

To get the data ready, the team built a multi-step pipeline for quality filtering, deduplication, and fixing text formatting. Since much of the data is from OCR-scanned documents, they used special filters to catch common conversion errors. German characters like umlauts were especially tricky for the software.
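A pipeline of this kind can be sketched in a few lines of Python. The version below is only a minimal illustration, not the team's actual code: the function names, the stopword-based language check, the word-count threshold, and the exact-match deduplication are all assumptions made for this example. It repairs two typical scan artifacts (decomposed umlauts and words hyphenated across line breaks), drops very short or non-German texts, and removes exact duplicates.

```python
import hashlib
import re
import unicodedata

# Hypothetical values for illustration; not taken from the German Commons pipeline.
MIN_WORDS = 5
GERMAN_STOPWORDS = {
    "der", "die", "das", "und", "ist", "nicht",
    "ein", "eine", "mit", "von", "zu", "den",
}


def fix_formatting(text: str) -> str:
    """Repair common scan artifacts before any filtering happens."""
    # Compose "a" + combining diaeresis into a single "ä" character.
    text = unicodedata.normalize("NFC", text)
    # Rejoin words that were hyphenated across a line break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


def looks_german(text: str, min_ratio: float = 0.1) -> bool:
    """Crude language check: share of words that are German stopwords."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return False
    hits = sum(word in GERMAN_STOPWORDS for word in words)
    return hits / len(words) >= min_ratio


def passes_quality(text: str) -> bool:
    """Drop very short documents and texts that do not look German."""
    return len(text.split()) >= MIN_WORDS and looks_german(text)


def deduplicate(texts):
    """Yield each text once, using a hash of the lowercased content."""
    seen = set()
    for text in texts:
        key = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield text


def run_pipeline(raw_docs):
    """Formatting fixes first, then quality filtering, then deduplication."""
    cleaned = (fix_formatting(doc) for doc in raw_docs)
    kept = (doc for doc in cleaned if passes_quality(doc))
    return list(deduplicate(kept))
```

Running the formatting fixes first matters in this sketch: two scans of the same sentence that differ only in a line-break hyphen become identical after cleaning, so the deduplication step can catch them.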

The chart shows that science texts use the most technical language, while web content is mostly everyday language. | Image: German Commons

Quality filtering removed 46 percent of the original data, mostly non-German texts and very short documents. After all pipeline steps, including deduplication, 51 percent of the collected data remained.
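The two percentages leave a small gap that is easy to work out. The sketch below assumes both figures refer to the same original total, which the article does not state explicitly, and that the remaining loss comes from the other pipeline steps; the variable names are made up for this example.

```python
# Figures from the article, expressed as fractions of the collected data.
# Assumption: both percentages use the same base (the full collection).
removed_by_quality = 0.46
final_kept = 0.51

# Share of the collection that survived quality filtering.
after_quality = 1 - removed_by_quality

# Share of the collection lost in the remaining steps (e.g. deduplication).
removed_later = after_quality - final_kept

# The same loss, relative to what survived quality filtering.
removed_later_relative = removed_later / after_quality

print(f"Survived quality filtering: {after_quality:.0%}")
print(f"Lost in later steps: {removed_later:.0%} of the collection")
print(f"Relative to survivors: {removed_later_relative:.1%}")
```

Under these assumptions, roughly 3 percentage points of the collection are removed after quality filtering, which is between 5 and 6 percent of the texts that passed it.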

A review of 385,467 text samples found very little toxic content. In categories like violence and discrimination, about 95 percent of texts scored as harmless.

Open source pipeline lets the community build better German AI

The team made their llmdata data processing library open source for full reproducibility. The pipeline is tailored for German and can be expanded by others.


German Commons is free on Hugging Face, making it easier to train German language models without worrying about copyright issues.

This release is part of a broader trend in AI toward open, legally compliant datasets. The Common Pile project from the University of Toronto and EleutherAI recently released an 8 TB English-language dataset built entirely from openly licensed sources. Early results show that models trained on this data are competitive, though they still have some gaps with everyday language.

Earlier, the German OpenGPT-X project used Teuken-7B to show how multilingual European AI models can be built. That 7-billion-parameter model was trained on all 24 official EU languages, but the training data did not go through a full license check.

Summary
  • A research group has published the largest openly licensed German text dataset so far, called German Commons, containing 154.56 billion tokens from 35.78 million documents across 41 sources.
  • The corpus mainly includes public domain news and historical texts, has been thoroughly filtered, and is considered largely free of problematic content according to their analysis.
  • Both the German Commons dataset and the custom-built open source pipeline "llmdata" are freely available to support the development of legally compliant German language models.
Sources
arXiv, Hugging Face