
KBLab, the data lab at the National Library of Sweden, combines thousands of works into datasets used to train AI models.


By law, the National Library of Sweden has collected virtually all Swedish-language writings from the past 500 years. A total of 16 petabytes have already been collected, and the collection is growing by 50 terabytes every month.

On this basis, KBLab, the library's integrated research lab established in 2019, has trained more than two dozen AI models. "Before our lab was created, researchers couldn't access a dataset at the library — they'd have to look at a single object at a time," said Love Börjeson, KBLab's director. "There was a need for the library to create datasets that enabled researchers to conduct quantity-oriented research."

Highly specialized datasets for research

Thanks to this work, researchers will soon be able to create highly specialized datasets, "for example, pulling up every Swedish postcard that depicts a church, every text written in a particular style or every mention of a historical figure across books, newspaper articles and TV broadcasts," according to the Nvidia blog. The models were trained on hardware from the graphics processor maker.


For its first model, KBLab used 20 GB of data; today it works with about 70 GB, according to Hugging Face, and it plans to scale up to a full terabyte of Swedish text. In addition to Swedish, the dataset will also include Dutch, Norwegian and German, which should improve the models' performance.

Generative text model in development

In addition to the Transformer models that understand Swedish text, KBLab has built an AI tool that converts audio to text. It allows the library to transcribe its extensive collection of radio broadcasts so that researchers can search the audio for specific content.

KBLab is also developing generative text models and an AI model to automatically create descriptions of video content. Together with researchers at the University of Gothenburg and the Swedish Academy, KBLab is supporting the modernization of dictionaries.

Summary
  • To enable scientists to extract highly accurate datasets from several centuries of Swedish texts, the National Library of Sweden is training AI models on thousands of works.
  • The initial AI models are based on about 70 gigabytes of data.
  • However, the collection grows by about 50 terabytes every month, so training will be expanded to even larger datasets and models.
Jonathan works as a freelance tech journalist for THE DECODER, focusing on AI tools and how GenAI can be used in everyday work.