Researchers have developed a family of AI models called "FLAMe" that automatically evaluate the quality of text produced by large language models, and they outperform commercial systems like GPT-4 in many areas.
Researchers at Google DeepMind, Google, and UMass Amherst have developed new models for automatically scoring AI-generated text. The models, called "FLAMe" (Foundational Large Autorater Models), have been trained to evaluate the quality of generated texts across several categories.
Such automated evaluation is becoming increasingly important as human evaluation is time-consuming and costly, and as AI-generated text becomes more widespread. Previous AI-based scoring systems often suffered from bias or relied on data with unclear or restrictive licenses.
FLAMe, on the other hand, has been trained on over 5.3 million human ratings from 102 different tasks. These cover areas such as general writing quality, factual accuracy, mathematical reasoning, and programming. The data comes exclusively from publicly available sources with open licenses.
In tests, FLAMe outperformed commercial systems such as GPT-4 and Claude 3 on 8 of 12 evaluation tasks. It was particularly strong at assessing factual accuracy and attribution, where it achieved an overall score of 81.1 percent compared to 80.6 percent for GPT-4.
Researchers release FLAMe for free
The researchers also developed a variant optimized specifically for reward modeling, called FLAMe-RM. It reached 87.8 percent accuracy on RewardBench, a standard benchmark for reward models, surpassing both GPT-4 and GPT-4o. Reward models like this are used to align language models with human preferences, for example in reinforcement learning from human feedback (RLHF).
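To make the idea of reward modeling more concrete, here is a minimal, purely illustrative Python sketch of the kind of pairwise evaluation RewardBench performs: a reward model scores a human-preferred and a human-rejected response to the same prompt, and accuracy is the fraction of pairs where the preferred response gets the higher score. The `reward_score` function is a hypothetical placeholder, not FLAMe's actual interface.

```python
# Illustrative sketch only: pairwise reward-model evaluation in the style of
# RewardBench. reward_score() is a stand-in for a real autorater such as
# FLAMe-RM; its name and logic are hypothetical.

from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response preferred by human raters
    rejected: str  # response judged worse by human raters


def reward_score(prompt: str, response: str) -> float:
    """Placeholder reward model: assigns a scalar quality score.
    A real autorater would run a language model here."""
    # Toy heuristic so the sketch runs: reward overlap with the prompt.
    overlap = len(set(prompt.lower().split()) & set(response.lower().split()))
    return overlap + 0.01 * len(response.split())


def pairwise_accuracy(pairs: list[PreferencePair]) -> float:
    """Fraction of pairs where the model prefers the human-preferred response."""
    correct = sum(
        reward_score(p.prompt, p.chosen) > reward_score(p.prompt, p.rejected)
        for p in pairs
    )
    return correct / len(pairs)


if __name__ == "__main__":
    pairs = [
        PreferencePair(
            prompt="Explain why the sky is blue.",
            chosen="The sky appears blue because air molecules scatter shorter "
                   "blue wavelengths of sunlight more strongly.",
            rejected="Because it just is.",
        ),
    ]
    print(f"Pairwise accuracy: {pairwise_accuracy(pairs):.1%}")
```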
According to the researchers, a key advantage of FLAMe is its lower bias compared to commercial systems. Their tests show that FLAMe is less susceptible to biases from text length or irrelevant contextual information.
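One simple way to probe for such a length bias, sketched below under the assumption of a generic pairwise judge (the `judge_prefers_first` function is hypothetical and not part of FLAMe), is to measure how often the longer of two responses wins comparisons: a rate far above 50 percent suggests the judge is rewarding length rather than quality.

```python
# Illustrative sketch only: probing an autorater for length bias, i.e. a
# tendency to prefer longer responses regardless of quality. The judge
# function below is a hypothetical stand-in for any autorater.

def judge_prefers_first(prompt: str, response_a: str, response_b: str) -> bool:
    """Placeholder autorater: returns True if it prefers response_a.
    A real probe would query the model under test here."""
    # Toy stand-in so the sketch runs.
    return len(response_a) >= len(response_b)


def longer_preferred_rate(samples: list[tuple[str, str, str]]) -> float:
    """Fraction of unequal-length pairs where the judge picks the longer response."""
    relevant = [(p, a, b) for p, a, b in samples if len(a) != len(b)]
    wins = sum(
        judge_prefers_first(p, a, b) == (len(a) > len(b))
        for p, a, b in relevant
    )
    return wins / len(relevant)
```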
The researchers see FLAMe as an important step towards developing open and transparent evaluation systems for AI-generated texts. They plan to make the training data and models publicly available to enable further research in this field.
However, the scientists also point out potential risks: Excessive use of such automated evaluation systems could lead to neglecting human perspectives. There's also a risk that the systems may amplify existing biases in the training data.