"Shocking amount" of low-quality machine translations on the web could affect LLMs

Jan 18, 2024 Maximilian Schreiner

Machine translations dominate freely available multilingual web content, a new study finds.

A new study from Amazon's AI Lab and UC Santa Barbara shows that a significant share of web content in many languages is machine-translated, especially in low-resource languages. The researchers examined the quality of translations on the web and found that text translated into many languages is of lower quality than text available in only one or a few languages. According to the team, this points to machine translation.

In addition, the quality of the original text is often low: much of the content identified is low-quality English content that is then translated into many low-resource languages, presumably to generate advertising revenue.

For the study, the team collected billions of translations and filtered out duplicate sentences. The study resulted in the largest multilingual corpus to date, with 6.4 billion unique sentences in 90 languages.

Study recommends filtering translation data before training AI

The results suggest that machine-translated content makes up a large portion of translations on the web, especially in low-resource languages, and raise concerns about using such content to train AI models, the team says.

The authors of the study suggest that detecting machine translation, for example by taking into account how many languages a text has been translated into when filtering training data for AI models, could help improve model quality. They also emphasize the need to further investigate the impact of machine-translated content on the training and performance of AI models.
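
As a rough illustration of this kind of filter, the Python sketch below counts how many languages each sentence appears in and drops sentences above a cutoff. This is not the study's code: the input format, the function names, and the `max_langs` threshold are all assumptions made for the example, and a real pipeline would need a cutoff tuned on actual data.

```python
from typing import Dict, List

# Hypothetical input format: each translation tuple maps a language code to
# a sentence, and all sentences in one tuple are translations of each other.
TranslationTuple = Dict[str, str]


def parallelism_by_sentence(tuples: List[TranslationTuple]) -> Dict[str, int]:
    """Map each sentence to the size of the largest tuple it belongs to."""
    degree: Dict[str, int] = {}
    for tup in tuples:
        n_langs = len(tup)
        for sentence in tup.values():
            degree[sentence] = max(degree.get(sentence, 0), n_langs)
    return degree


def filter_training_sentences(
    sentences: List[str], degree: Dict[str, int], max_langs: int = 8
) -> List[str]:
    """Keep sentences found in at most `max_langs` languages.

    `max_langs` is an assumed, illustrative cutoff. Sentences that never
    appear in a translation tuple are treated as monolingual (degree 1).
    """
    return [s for s in sentences if degree.get(s, 1) <= max_langs]


# Toy data: one sentence translated into four languages, one into two.
example_tuples = [
    {"en": "Buy cheap now", "fr": "Achetez pas cher", "de": "Jetzt billig kaufen", "sw": "Nunua sasa"},
    {"en": "The council met on Tuesday.", "fr": "Le conseil s'est réuni mardi."},
]
degree = parallelism_by_sentence(example_tuples)
kept = filter_training_sentences(
    ["Buy cheap now", "The council met on Tuesday.", "An untranslated line."],
    degree,
    max_langs=2,
)
print(kept)
```

With the toy cutoff of two languages, the sentence that exists in four languages is dropped and the other two are kept. The number of languages is only a proxy for machine translation, so a filter like this would also discard some legitimate, widely translated text.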

Sources:

Arxiv