Dataset search engine aims to increase transparency of language models

An analysis of the training texts of the open-source language model Bloom has surfaced sensitive data and racist expressions. The dataset search tool that revealed them is meant to improve the transparency of future models.

"Why do LMs say what they say? We often don't know - but now we might get a better idea." With these words, NLP researcher Aleksandra "Ola" Piktus announces her latest project on Twitter.

Together with Hugging Face, the researcher has released the "Roots" search app for the open-source Bloom model. It can be used to search the training corpus behind the 176-billion-parameter model, which covers 59 languages.

Bloom was launched in July 2022 as an open-source alternative to OpenAI's GPT-3. It stands out for being freely available and multilingual.

Tool aims to raise bar "for next generation" language models

The ability to browse training material without programming knowledge is necessary to develop a common understanding of general problems and raise the bar for next-generation language models, Piktus says.

Using the Roots search engine, she found sensitive private data, language contamination, and fake news. A woman had previously used a similar search engine for AI training images to find private medical photos of herself in the dataset.

Roots searches 1.6 TB of text data in 46 natural and 13 programming languages. Piktus' analysis shows that individual data points vary drastically in length. To compare and rank them, she divided them into equal-sized passages of 128 words and assigned each a unique ID.
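
The passage splitting described above can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' actual code: the function name `split_into_passages`, the whitespace-based word splitting, and the `<doc_id>_<n>` ID scheme are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    passage_id: str  # hypothetical ID scheme: "<doc_id>_<running number>"
    text: str


def split_into_passages(doc_id: str, text: str, size: int = 128) -> list[Passage]:
    """Split one document into passages of at most `size` words each."""
    words = text.split()  # assumption: words are separated by whitespace
    passages = []
    for n, start in enumerate(range(0, len(words), size)):
        chunk = " ".join(words[start:start + size])
        passages.append(Passage(f"{doc_id}_{n}", chunk))
    return passages


if __name__ == "__main__":
    document = " ".join(f"word{i}" for i in range(300))
    for passage in split_into_passages("doc42", document):
        print(passage.passage_id, len(passage.text.split()))
```

Only the last passage of a document can be shorter than 128 words, so long and short documents become comparable units for ranking.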

Sensitive data is blacked out instead of removed

The OSCAR dataset came to Piktus' attention as a source of particularly large amounts of personal data. To keep this data from becoming public through the search engine, a script redacts the results. "This way one can inspect the data and observe the problem, but personal information are predominantly removed," the accompanying paper states.
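
A redaction step of this kind could look roughly like the following Python sketch. The regular expressions, the `redact` function, and the `[REDACTED]` mask are hypothetical. The script actually used for Roots may cover more categories of personal data and work differently.

```python
import re

# Assumed, deliberately simple patterns for two kinds of personal data.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str, mask: str = "[REDACTED]") -> str:
    """Replace e-mail addresses and phone-like numbers in a search result."""
    text = EMAIL.sub(mask, text)
    return PHONE.sub(mask, text)


if __name__ == "__main__":
    print(redact("Mail ola@example.org or call +1 555 123 4567 today"))
```

Because the mask is applied to the displayed result rather than to the stored corpus, the surrounding text stays visible and the problem remains observable, which matches the approach described in the paper.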

Racism and hate speech can come from movie subtitles

The researchers involved also observed evidence of "low-quality text," such as racial slurs, sexually explicit language, or hate speech, some of which came from datasets of movie subtitles.

While humans can contextualize this form of language, which is usually used consciously, a language model adopts it without any such contextualization, the researchers say.

The current version of the tool is heavily influenced by the user interface of popular search engines. In the future, the researchers plan to display more quantitative information, such as the frequency of certain terms, the number of hits or co-occurrence statistics.
