
Researchers have developed an AI system that automatically evaluates the quality of text generated by large language models. The model family, called "FLAMe," outperforms commercial systems like GPT-4 in many areas.


Researchers at Google DeepMind, Google, and UMass Amherst have developed new models for automatically scoring AI-generated text. The models, called "FLAMe" (Foundational Large Autorater Models), have been trained to evaluate the quality of generated texts across several categories.

Such automated evaluation is becoming increasingly important: human evaluation is time-consuming and costly, and AI-generated texts are becoming more widespread. Previous AI-based scoring systems have often suffered from bias or relied on copyrighted data.

FLAMe, by contrast, was trained on more than 5.3 million human ratings from 102 different tasks. These cover areas such as general writing quality, factual accuracy, mathematical reasoning, and programming. The data comes exclusively from publicly available sources with open licenses.
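
To train one model on so many heterogeneous tasks, the paper casts all of them into a unified text-to-text format. The sketch below illustrates the idea for a pairwise-comparison task; the template wording, field names, and example data are our own illustration, not FLAMe's actual training format.

```python
# Minimal sketch (not from the paper): unifying heterogeneous human-rating
# tasks into a single text-to-text format, in the spirit of FLAMe's
# multitask training. Template wording and field names are illustrative.

PAIRWISE_TEMPLATE = (
    "Task: pairwise response quality\n"
    "Prompt: {prompt}\n"
    "Response A: {response_a}\n"
    "Response B: {response_b}\n"
    "Which response is better, A or B?"
)

def to_text_to_text(example: dict) -> tuple[str, str]:
    """Convert one pairwise human rating into an (input, target) training pair."""
    input_text = PAIRWISE_TEMPLATE.format(
        prompt=example["prompt"],
        response_a=example["response_a"],
        response_b=example["response_b"],
    )
    target_text = example["human_preference"]  # "A" or "B"
    return input_text, target_text

pair = {
    "prompt": "Summarize the article in one sentence.",
    "response_a": "Google researchers built FLAMe, a family of autorater models.",
    "response_b": "An article about AI.",
    "human_preference": "A",
}
print(to_text_to_text(pair))
```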


In tests, FLAMe outperformed commercial systems like GPT-4 and Claude 3 on 8 out of 12 evaluation tasks. It performed particularly well at assessing factual accuracy and attribution, achieving an overall score of 81.1 percent, compared to 80.6 percent for GPT-4.

Researchers release FLAMe for free

The researchers also developed a variant specifically optimized for reward modeling, called FLAMe-RM. It achieved 87.8 percent accuracy on RewardBench, a standard benchmark for reward models, surpassing GPT-4 and GPT-4o. Such reward models are used to align language models with human preferences, for example in reinforcement learning from human feedback (RLHF).
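
RewardBench essentially measures how often a model's preference between two candidate responses matches a human label. Here is a minimal sketch of that accuracy computation, assuming a hypothetical autorater_prefers_a function that wraps a model call; it is not FLAMe's actual API.

```python
# Minimal sketch of RewardBench-style pairwise accuracy. The autorater sees a
# prompt plus two candidate responses and must pick the one humans preferred.
# `autorater_prefers_a` is a hypothetical stand-in for a real model call.

from typing import Callable

def pairwise_accuracy(
    examples: list[dict],
    autorater_prefers_a: Callable[[str, str, str], bool],
) -> float:
    """Fraction of pairs where the autorater agrees with the human preference."""
    correct = 0
    for ex in examples:
        model_picks_a = autorater_prefers_a(
            ex["prompt"], ex["response_a"], ex["response_b"]
        )
        human_picks_a = ex["human_preference"] == "A"
        correct += model_picks_a == human_picks_a
    return correct / len(examples)

# Example with a trivial rater that always prefers the longer response:
longer = lambda p, a, b: len(a) >= len(b)
data = [{"prompt": "q", "response_a": "long answer", "response_b": "x",
         "human_preference": "A"}]
print(pairwise_accuracy(data, longer))  # 1.0 on this single pair
```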

According to the scientists, a key advantage of FLAMe is its lower bias compared to commercial systems. In their tests, FLAMe was less susceptible to biases such as favoring longer texts or being swayed by irrelevant contextual information.
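
To make the length-bias claim concrete: one simple probe restricts attention to pairs where humans preferred the shorter response and counts how often a rater still picks the longer one. Below is a hedged sketch, reusing the hypothetical autorater_prefers_a stand-in from above; this is an illustration, not the paper's evaluation protocol.

```python
# Minimal length-bias probe (our illustration, not the paper's protocol).
# On pairs where humans preferred the SHORTER response, count how often the
# autorater still picks the longer one; a high rate suggests it rewards
# verbosity rather than quality.

def length_bias_rate(examples, autorater_prefers_a) -> float:
    flips, probed = 0, 0
    for ex in examples:
        a, b = ex["response_a"], ex["response_b"]
        longer_is_a = len(a) > len(b)
        human_picks_a = ex["human_preference"] == "A"
        if human_picks_a == longer_is_a:
            continue  # humans preferred the longer response; skip this pair
        probed += 1
        model_picks_a = autorater_prefers_a(ex["prompt"], a, b)
        if model_picks_a == longer_is_a:
            flips += 1  # rater sided with length against the human label
    return flips / probed if probed else 0.0
```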

The researchers see FLAMe as an important step towards developing open and transparent evaluation systems for AI-generated texts. They plan to make the training data and models publicly available to enable further research in this field.

However, the scientists also point out potential risks: excessive reliance on such automated evaluation systems could sideline human perspectives, and the systems may amplify biases present in their training data.

Summary
  • Researchers at Google DeepMind, Google, and UMass Amherst have developed a family of AI models called FLAMe that automatically rate the quality of AI-generated text. The models were trained on more than 5.3 million human ratings from 102 different tasks.
  • In tests, FLAMe outperformed commercial systems such as GPT-4 and Claude 3 on 8 out of 12 evaluation tasks. On factual accuracy and attribution, FLAMe scored 81.1 percent, while GPT-4 scored 80.6 percent.
  • The researchers see FLAMe as an important step in the development of open and transparent AI text scoring systems. They plan to make the training data and models publicly available, but also point out potential risks, such as the neglect of human perspectives.