KBLab, the data lab at the National Library of Sweden, combines thousands of works into a single dataset that is used to train AI models.
By law, the National Library of Sweden has collected virtually all Swedish-language writings from the past 500 years. A total of 16 petabytes have already been collected, and the collection is growing by 50 terabytes every month.
On this basis, KBLab, the library's in-house research lab established in 2019, has trained more than two dozen AI models. "Before our lab was created, researchers couldn’t access a dataset at the library; they’d have to look at a single object at a time," said Love Börjeson, director of KBLab. "There was a need for the library to create datasets that enabled researchers to conduct quantity-oriented research."
Highly specialized datasets for research
Thanks to this work, researchers will soon be able to create highly specialized datasets, "for example, pulling up every Swedish postcard that depicts a church, every text written in a particular style or every mention of a historical figure across books, newspaper articles and TV broadcasts," according to the Nvidia blog. Nvidia hardware was used to train the models.
KBLab trained its first model on 20 GB of data and today uses about 70 GB, according to Hugging Face. It plans to scale up to a full terabyte of text soon. Besides Swedish, that dataset will include Dutch, Norwegian and German, which should improve the performance of the AI models.
Generative text model in development
In addition to the Transformer models that understand Swedish text, KBLab has an AI tool that converts audio to text, allowing the library to transcribe its extensive collection of radio broadcasts so that researchers can search the audio for specific content.
KBLab is also developing generative text models and an AI model to automatically create descriptions of video content. Together with researchers at the University of Gothenburg and the Swedish Academy, KBLab is supporting the modernization of dictionaries.