
Google DeepMind develops open-source AI to tackle biases in evaluating language models

Image: Midjourney prompted by THE DECODER

Researchers have developed an AI system that automatically evaluates the quality of text generated by large language models. The family of models, called "FLAMe," outperforms commercial systems like GPT-4 in many areas.

Researchers at Google DeepMind, Google, and UMass Amherst have developed new models for automatically scoring AI-generated text. The models, called "FLAMe" (Foundational Large Autorater Models), have been trained to evaluate the quality of generated texts across several categories.

Such automated evaluation is becoming increasingly important as human evaluation is time-consuming and costly, and as AI texts become more widespread. Previous AI-based scoring systems often suffered from bias or used copyrighted data.

FLAMe, on the other hand, has been trained on over 5.3 million human ratings from 102 different tasks. These cover areas such as general writing quality, factual accuracy, mathematical reasoning, and programming. The data comes exclusively from publicly available sources with open licenses.


In tests, FLAMe outperformed commercial systems like GPT-4 and Claude 3 in 8 out of 12 evaluation tasks. The system performed particularly well in assessing factual accuracy and attribution: on this evaluation, FLAMe achieved an overall score of 81.1 percent, while GPT-4 reached 80.6 percent.

Researchers release FLAMe for free

The researchers also developed a variant specifically optimized for reward modeling called FLAMe-RM. It achieved an accuracy of 87.8 percent on RewardBench, a standard benchmark for reward models, surpassing both GPT-4 and GPT-4o. Such reward models score candidate responses so that language models can be aligned with human preferences, for example in reinforcement learning from human feedback (RLHF).
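For context, benchmarks like RewardBench report pairwise accuracy: how often a rater scores the human-preferred response above the rejected one. Below is a minimal Python sketch of that metric, with a hypothetical `rate` callable standing in for a model call such as FLAMe-RM (the toy rater is purely illustrative):

```python
from typing import Callable

# An "autorater" maps (prompt, response) to a quality score.
Rater = Callable[[str, str], float]

def pairwise_accuracy(pairs: list[tuple[str, str, str]], rate: Rater) -> float:
    """Share of (prompt, chosen, rejected) triples where the rater scores
    the human-preferred response higher: the metric RewardBench reports."""
    hits = sum(rate(p, chosen) > rate(p, rejected) for p, chosen, rejected in pairs)
    return hits / len(pairs)

if __name__ == "__main__":
    # Toy rater standing in for a real model call (hypothetical).
    def toy_rate(prompt: str, response: str) -> float:
        return float("Paris" in response)  # crude factuality proxy for the demo

    data = [
        ("What is the capital of France?", "Paris.", "Lyon."),
        ("What is the capital of France?", "It is Paris.", "Marseille."),
    ]
    print(pairwise_accuracy(data, toy_rate))  # -> 1.0
```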

According to the researchers, a key advantage of FLAMe is its lower bias compared to commercial systems: tests showed it is less easily swayed by text length or irrelevant contextual information.
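One simple way to probe such a length bias, sketched below with the same hypothetical rater interface as above, is to score an answer plain and padded with irrelevant filler and see whether the score rises with length alone:

```python
def length_bias_gap(rate, prompt: str, answer: str,
                    filler: str = " To reiterate, this point bears repeating.",
                    copies: int = 5) -> float:
    """Score the same answer with and without irrelevant padding.
    A positive gap suggests the rater rewards length rather than quality."""
    return rate(prompt, answer + filler * copies) - rate(prompt, answer)
```

A robust autorater should return a gap near zero, since the padded answer adds no information.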

The researchers see FLAMe as an important step towards developing open and transparent evaluation systems for AI-generated texts. They plan to make the training data and models publicly available to enable further research in this field.


However, the researchers also point out potential risks: excessive reliance on such automated evaluation systems could sideline human perspectives, and the systems may amplify biases already present in their training data.
