Researchers have developed a family of AI models called "FLAMe" that automatically evaluate the quality of text produced by large language models, and they outperform commercial systems like GPT-4 in many areas.
Researchers at Google DeepMind, Google, and UMass Amherst have developed new models for automatically scoring AI-generated text. The models, called "FLAMe" (Foundational Large Autorater Models), have been trained to evaluate the quality of generated texts across several categories.
Such automated evaluation is becoming increasingly important as human evaluation is time-consuming and costly, and as AI-generated text becomes more widespread. Previous AI-based scoring systems often suffered from bias or relied on data with unclear or restrictive licenses.
FLAMe, on the other hand, has been trained on over 5.3 million human ratings from 102 different tasks. These cover areas such as general writing quality, factual accuracy, mathematical reasoning, and programming. The data comes exclusively from publicly available sources with open licenses.
In tests, FLAMe outperformed commercial systems such as GPT-4 and Claude 3 on 8 of 12 evaluation tasks. It was particularly strong at assessing factual accuracy and attribution, where it achieved an overall score of 81.1 percent compared to 80.6 percent for GPT-4.
Researchers release FLAMe for free
The researchers also developed a variant optimized specifically for reward modeling, called FLAMe-RM. It reached 87.8 percent accuracy on RewardBench, a standard benchmark for reward models, surpassing both GPT-4 and GPT-4o. Reward models like this are used to align language models with human preferences, for example in reinforcement learning from human feedback (RLHF).
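To make the idea of reward modeling more concrete, here is a minimal, purely illustrative Python sketch of the kind of pairwise evaluation RewardBench performs: a reward model scores a human-preferred and a human-rejected response to the same prompt, and accuracy is the fraction of pairs where the preferred response gets the higher score. The `reward_score` function is a hypothetical placeholder, not FLAMe's actual interface.

```python
# Illustrative sketch only: pairwise reward-model evaluation in the style of
# RewardBench. reward_score() is a stand-in for a real autorater such as
# FLAMe-RM; its name and logic are hypothetical.

from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response preferred by human raters
    rejected: str  # response judged worse by human raters


def reward_score(prompt: str, response: str) -> float:
    """Placeholder reward model: assigns a scalar quality score.
    A real autorater would run a language model here."""
    # Toy heuristic so the sketch runs: reward overlap with the prompt.
    overlap = len(set(prompt.lower().split()) & set(response.lower().split()))
    return overlap + 0.01 * len(response.split())


def pairwise_accuracy(pairs: list[PreferencePair]) -> float:
    """Fraction of pairs where the model prefers the human-preferred response."""
    correct = sum(
        reward_score(p.prompt, p.chosen) > reward_score(p.prompt, p.rejected)
        for p in pairs
    )
    return correct / len(pairs)


if __name__ == "__main__":
    pairs = [
        PreferencePair(
            prompt="Explain why the sky is blue.",
            chosen="The sky appears blue because air molecules scatter shorter "
                   "blue wavelengths of sunlight more strongly.",
            rejected="Because it just is.",
        ),
    ]
    print(f"Pairwise accuracy: {pairwise_accuracy(pairs):.1%}")
```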
According to the researchers, a key advantage of FLAMe is its lower bias compared to commercial systems. Their tests show that FLAMe is less susceptible to biases from text length or irrelevant contextual information.
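One simple way to probe for such a length bias, sketched below under the assumption of a generic pairwise judge (the `judge_prefers_first` function is hypothetical and not part of FLAMe), is to measure how often the longer of two responses wins comparisons: a rate far above 50 percent suggests the judge is rewarding length rather than quality.

```python
# Illustrative sketch only: probing an autorater for length bias, i.e. a
# tendency to prefer longer responses regardless of quality. The judge
# function below is a hypothetical stand-in for any autorater.

def judge_prefers_first(prompt: str, response_a: str, response_b: str) -> bool:
    """Placeholder autorater: returns True if it prefers response_a.
    A real probe would query the model under test here."""
    # Toy stand-in so the sketch runs.
    return len(response_a) >= len(response_b)


def longer_preferred_rate(samples: list[tuple[str, str, str]]) -> float:
    """Fraction of unequal-length pairs where the judge picks the longer response."""
    relevant = [(p, a, b) for p, a, b in samples if len(a) != len(b)]
    wins = sum(
        judge_prefers_first(p, a, b) == (len(a) > len(b))
        for p, a, b in relevant
    )
    return wins / len(relevant)
```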
The researchers see FLAMe as an important step towards developing open and transparent evaluation systems for AI-generated texts. They plan to make the training data and models publicly available to enable further research in this field.
However, the scientists also point out potential risks: Excessive use of such automated evaluation systems could lead to neglecting human perspectives. There's also a risk that the systems may amplify existing biases in the training data.