summary Summary

Prometheus 2, a freely available language model, has been optimized to evaluate other language models, catching up with commercial models such as GPT-4.

These evaluations allow researchers and developers to objectively measure and compare the performance of their language models and receive detailed feedback on strengths and weaknesses for targeted improvements, helping to continuously enhance the quality and reliability of language models.

Until now, proprietary models such as GPT-4 have often been used for these evaluations, but they lack transparency, are difficult to control, and are not affordable for many, according to a research team led by Seungone Kim of KAIST AI. Kim's team developed Prometheus 2 to provide an independent, transparent, and detailed evaluation of language models for everyone.

Prometheus 2 can perform evaluations similar to humans and GPT-4, mastering the two most common evaluation methods: direct evaluation, assigning scores on a scale, and pairwise comparison, deciding which of two responses is better.

Prometheus 2 can score answers directly or select the better of two answers. | Image: Kim et al.

It can also evaluate on user-defined criteria, not limited to general aspects such as helpfulness and harmlessness, allowing for optimization for specific applications, the researchers report.

For example, a medical advice chatbot can be trained and tested on criteria such as trustworthiness, empathy, and professional correctness, enabling the development of high-quality language models for different applications, the team explained.

A new data set and mixed weights

To train Prometheus 2, the researchers created a new pairwise comparison dataset called the "Preference Collection," which contains more than 1,000 different evaluation criteria beyond basic characteristics.

They found that the best results came from training two separate models - one for direct ratings based on the Feedback Collection dataset, and one for pairwise comparisons based on the existing Preference Collection dataset - and then combining their learned weights.

In tests with eight datasets (four for direct ratings, four for pairwise comparisons), Prometheus 2 achieved the highest agreement with human judgments and commercial language models of all freely available rating models.


Although it lags behind GPT-4 and Claude 3 Opus in many tests, it can significantly close the gap with proprietary models, the researchers report.

Prometheus 2 can evaluate generated text as well as GPT-4 and Opus 3, but offers much more transparency and is potentially cheaper. The table shows the results for direct evaluation. | Image: Kim et al.

Prometheus 2 supports independent and transparent evaluation of language models for everyone, contributing to greater fairness and accessibility in the field, according to Kim's team. The code and data are available on Github.

The Prometheus 2 models (7B & 8x7B) are available from HuggingFace. According to the team, the faster 7B model achieves 80 percent of the evaluation performance of the 8x7B model, is on par with Mistral's Mixtral-8x7B, and better than Meta's Llama 2 70B.

Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Support our independent, free-access reporting. Any contribution helps and secures our future. Support now:
Bank transfer
  • Prometheus 2 is a freely available language model that can evaluate other language models as well as commercial models such as GPT-4, but is more transparent and potentially cheaper.
  • The model was trained on two separate datasets - one for direct scores and one for pairwise comparisons. By combining the learned weights, the researchers achieved the best results.
  • In tests on eight datasets, Prometheus 2 achieved the highest agreement with human judgments of any freely available model. This makes it possible for anyone to perform an independent and detailed evaluation of language models.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.