Machine translations dominate freely available multilingual web content, a new study finds.

A new study from Amazon's AI Lab and UC Santa Barbara shows that a significant share of multilingual web content is machine-translated, especially in low-resource languages. The researchers examined the quality of translations on the web and found that texts that exist in many languages tend to be of lower quality than texts available in only one or a few languages. According to the team, this points to machine translation.

In addition, the quality of the original text is often low: much of the identified content is low-quality English text that is then translated into many low-resource languages, presumably to generate advertising revenue.

For the study, the team collected billions of translations and filtered out duplicate sentences. The study resulted in the largest multilingual corpus to date, with 6.4 billion unique sentences in 90 languages.

Study recommends filtering translation data before training AI

The results suggest that machine-translated content makes up a large portion of translations on the web, especially in low-resource languages, and raise concerns about using such content to train AI models, the team says.

The authors of the study suggest that detecting machine translation, for example by taking into account how many languages a text appears in when filtering training data for AI models, could help improve model quality. They also emphasize the need to further investigate the impact of machine-translated content on the training and performance of AI models.
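
To make the filtering idea concrete, here is a minimal sketch in Python. It is not the study's method: the record layout (`set_id`, `lang`, `sentence`), the function name `filter_by_parallelism`, and the `max_languages` threshold are all assumptions made for illustration. The sketch simply drops sentences that belong to a translation set spanning many languages, since the study associates that pattern with machine translation.

```python
from collections import defaultdict


def filter_by_parallelism(records, max_languages=2):
    """Keep records whose translation set spans at most `max_languages` languages.

    `records` is an iterable of (set_id, lang, sentence) tuples, where
    `set_id` (hypothetical) identifies sentences that are translations
    of one another. The threshold is an illustrative assumption, not a
    value taken from the study.
    """
    records = list(records)

    # Collect the distinct languages seen for each translation set.
    languages_per_set = defaultdict(set)
    for set_id, lang, _sentence in records:
        languages_per_set[set_id].add(lang)

    # Keep only records from sets that appear in few languages.
    return [
        record
        for record in records
        if len(languages_per_set[record[0]]) <= max_languages
    ]


if __name__ == "__main__":
    sample = [
        ("a", "en", "The weather is nice today."),
        ("a", "de", "Das Wetter ist heute schön."),
        ("b", "en", "Click here for the best deals."),
        ("b", "fr", "Cliquez ici pour les meilleures offres."),
        ("b", "sw", "Bofya hapa kwa ofa bora."),
        ("b", "xh", "Cofa apha ukuze ufumane izaphulelo."),
    ]
    for record in filter_by_parallelism(sample, max_languages=2):
        print(record)
```

In practice, a suitable threshold would have to be tuned per language pair, and a filter like this would likely be combined with other quality signals rather than used on its own.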

Summary
  • A new study shows that a significant amount of multilingual web content is machine-translated, especially in low-resource languages, which can affect the quality of AI models trained on such data.
  • The study found that texts translated into many languages were of lower quality than texts available in only one or a few languages - an indication that machine translation was used.
  • The authors of the study recommend filtering out machine translation from training data and further investigating the impact of machine-translated content on the performance of AI models.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.