Sweden's National Library trains AI on 500 years of data

Kb-labb / National Library of Sweden

KBLab, the data lab at the National Library of Sweden, combines thousands of works into a single dataset that is used to train AI models.

By law, the National Library of Sweden has collected virtually all Swedish-language writings from the past 500 years. A total of 16 petabytes have already been collected, and the collection is growing by 50 terabytes every month.

On this basis, KBLab, the integrated research department established in 2019, has trained more than two dozen AI models. "Before our lab was created, researchers couldn’t access a dataset at the library — they’d have to look at a single object at a time," said KBLab director Love Börjeson. "There was a need for the library to create datasets that enabled researchers to conduct quantity-oriented research."

Highly specialized data sets for research

Thanks to this work, researchers will soon be able to create highly specialized datasets, "for example, pulling up every Swedish postcard that depicts a church, every text written in a particular style or every mention of a historical figure across books, newspaper articles and TV broadcasts," according to the Nvidia blog. Hardware from the graphics processor maker was used for the training.

KBLab trained its first model on 20 GB of data; today it uses about 70 GB, according to Hugging Face. Soon, it plans to train on a full terabyte of text. In addition to Swedish, that dataset will include Dutch, Norwegian and German content, which should improve the performance of the AI models.

Generative text model in development

In addition to the Transformer models that understand Swedish text, KBLab has an AI tool that converts audio to text, allowing the library to transcribe its extensive collection of radio broadcasts so that researchers can search the audio for specific content.

KBLab is also developing generative text models and an AI model to automatically create descriptions of video content. Together with researchers at the University of Gothenburg and the Swedish Academy, KBLab is supporting the modernization of dictionaries.
