Open-source model Prometheus 2 can evaluate other language models nearly as well as GPT-4

Prometheus 2, a freely available language model, has been optimized to evaluate other language models, catching up with commercial models such as GPT-4.

Such evaluations let researchers and developers measure and compare the performance of their language models objectively. They also provide detailed feedback on strengths and weaknesses, which supports targeted improvements in quality and reliability.

Until now, proprietary models such as GPT-4 have often been used for these evaluations, but they lack transparency, are difficult to control, and are not affordable for many, according to a research team led by Seungone Kim of KAIST AI. Kim's team developed Prometheus 2 to provide an independent, transparent, and detailed evaluation of language models for everyone.

Prometheus 2 can perform evaluations similar to humans and GPT-4, mastering the two most common evaluation methods: direct evaluation, assigning scores on a scale, and pairwise comparison, deciding which of two responses is better.

Prometheus 2 can score answers directly or select the better of two answers. | Image: Kim et al.
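To make the two modes concrete, here is a minimal Python sketch of how a judge model's verdict could be read out in each case. The article does not describe the output format, so the "[RESULT]" marker, the 1-to-5 scale, and the function names `parse_direct` and `parse_pairwise` are assumptions for illustration only.

```python
import re

# Assumed convention (not stated in the article): the judge writes free-text
# feedback and ends with "[RESULT] <verdict>".
_RESULT = re.compile(r"\[RESULT\]\s*([A-Za-z0-9]+)")


def parse_direct(output: str, low: int = 1, high: int = 5) -> int:
    """Direct evaluation: extract an integer score on an assumed low-high scale."""
    match = _RESULT.search(output)
    if match is None or not match.group(1).isdigit():
        raise ValueError("no integer score found in judge output")
    score = int(match.group(1))
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside the range {low}-{high}")
    return score


def parse_pairwise(output: str) -> str:
    """Pairwise comparison: extract which response ("A" or "B") was preferred."""
    match = _RESULT.search(output)
    if match is None or match.group(1).upper() not in {"A", "B"}:
        raise ValueError("no A/B verdict found in judge output")
    return match.group(1).upper()
```

Direct evaluation yields a number per response, while pairwise comparison yields only a preference between two responses, which is why the two modes need separate parsing and separate agreement metrics.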

It can also evaluate on user-defined criteria, not limited to general aspects such as helpfulness and harmlessness, allowing for optimization for specific applications, the researchers report.

For example, a medical advice chatbot can be trained and tested on criteria such as trustworthiness, empathy, and professional correctness, enabling the development of high-quality language models for different applications, the team explained.
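As a rough illustration of what a user-defined criterion might look like in practice, the sketch below bundles a criterion with per-score descriptions and renders it into an evaluation prompt. The `Rubric` class, the `build_direct_prompt` helper, and the prompt layout are hypothetical and are not taken from the Prometheus 2 code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rubric:
    """A hypothetical user-defined evaluation criterion with score descriptions."""
    criterion: str
    score_descriptions: dict[int, str]

    def render(self) -> str:
        # List the score descriptions in ascending score order.
        lines = [f"Criterion: {self.criterion}"]
        for score in sorted(self.score_descriptions):
            lines.append(f"Score {score}: {self.score_descriptions[score]}")
        return "\n".join(lines)


def build_direct_prompt(instruction: str, response: str, rubric: Rubric) -> str:
    """Assemble an assumed prompt layout for scoring one response."""
    return (
        f"Instruction:\n{instruction}\n\n"
        f"Response to evaluate:\n{response}\n\n"
        f"Rubric:\n{rubric.render()}\n\n"
        "Write feedback, then give a score."
    )


empathy = Rubric(
    criterion="Empathy",
    score_descriptions={
        1: "Dismissive of the patient's concern.",
        3: "Polite but generic.",
        5: "Acknowledges the concern and responds with care.",
    },
)
prompt = build_direct_prompt("I feel dizzy every morning.", "Please see a doctor.", empathy)
```

Swapping in a rubric for trustworthiness or professional correctness changes only the `Rubric` instance, not the surrounding code.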

A new data set and mixed weights

To train Prometheus 2, the researchers created a new pairwise comparison dataset called the "Preference Collection," which contains more than 1,000 different evaluation criteria beyond basic characteristics.

They found that the best results came from training two separate models, one for direct ratings on the existing Feedback Collection dataset and one for pairwise comparisons on the new Preference Collection dataset, and then merging their learned weights.
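The article does not say how the weights are combined. One simple possibility is linear interpolation between the two models' parameters, sketched below with plain Python lists standing in for tensors. The function name `merge_weights` and the mixing factor `alpha` are assumptions, not details confirmed by the researchers.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def merge_weights(direct: Weights, pairwise: Weights, alpha: float = 0.5) -> Weights:
    """Linearly interpolate two checkpoints: alpha * direct + (1 - alpha) * pairwise.

    Assumes both models share the same architecture, so parameter names and
    shapes match exactly.
    """
    if direct.keys() != pairwise.keys():
        raise ValueError("models must share the same parameter names")
    merged: Weights = {}
    for name in direct:
        a, b = direct[name], pairwise[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [alpha * x + (1 - alpha) * y for x, y in zip(a, b)]
    return merged
```

With `alpha=0.5` each parameter is the plain average of the two models. Merging of this kind only makes sense when both models were fine-tuned from the same base model.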

In tests with eight datasets (four for direct ratings, four for pairwise comparisons), Prometheus 2 achieved the highest agreement with human judgments and commercial language models of all freely available rating models.
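The article does not name the agreement metrics. A common choice, assumed here, is a correlation coefficient for direct ratings and simple accuracy for pairwise comparisons. The helper names `pearson` and `pairwise_accuracy` are illustrative.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between judge scores and human scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def pairwise_accuracy(judge: Sequence[str], human: Sequence[str]) -> float:
    """Share of comparisons where the judge picked the same response as humans."""
    if len(judge) != len(human) or not human:
        raise ValueError("need two equally long, non-empty sequences")
    return sum(j == h for j, h in zip(judge, human)) / len(human)
```

A correlation near 1 means the judge ranks responses much as human raters do, even if its absolute scores are shifted. Accuracy on pairwise data is directly interpretable as the fraction of matching preferences.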

Although Prometheus 2 lags behind GPT-4 and Claude 3 Opus in many tests, it significantly narrows the gap to proprietary models, the researchers report.

Prometheus 2 can evaluate generated text nearly as well as GPT-4 and Claude 3 Opus, but offers much more transparency and is potentially cheaper. The table shows the results for direct evaluation. | Image: Kim et al.

Prometheus 2 supports independent and transparent evaluation of language models for everyone, contributing to greater fairness and accessibility in the field, according to Kim's team. The code and data are available on GitHub.

The Prometheus 2 models (7B and 8x7B) are available on Hugging Face. According to the team, the faster 7B model achieves 80 percent of the evaluation performance of the 8x7B model, is on par with Mistral's Mixtral-8x7B, and outperforms Meta's Llama 2 70B.
