Prometheus 2, a freely available language model optimized to evaluate other language models, is catching up with commercial models such as GPT-4.
Such evaluations let researchers and developers objectively measure and compare the performance of their language models and obtain detailed feedback on strengths and weaknesses, supporting targeted improvements and steadily raising model quality and reliability.
Until now, proprietary models such as GPT-4 have often been used for these evaluations, but they lack transparency, are difficult to control, and are not affordable for many, according to a research team led by Seungone Kim of KAIST AI. Kim's team developed Prometheus 2 to provide an independent, transparent, and detailed evaluation of language models for everyone.
Prometheus 2 can deliver evaluations close to those of humans and GPT-4, and it masters the two most common evaluation methods: direct assessment, which assigns a score on a scale, and pairwise ranking, which decides which of two responses is better.
It can also evaluate against user-defined criteria rather than being limited to general aspects such as helpfulness and harmlessness, which allows it to be tailored to specific applications, the researchers report.
For example, a medical advice chatbot can be trained and tested on criteria such as trustworthiness, empathy, and professional correctness, enabling the development of high-quality language models for different applications, the team explained.
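To make this concrete, the following is a minimal sketch of a direct-assessment call with a user-defined empathy rubric, assuming the model id published on Hugging Face and the standard transformers text-generation pipeline; the prompt template and rubric wording are illustrative placeholders, not the exact format from the Prometheus 2 repository.

```python
# Minimal sketch: direct assessment against a custom rubric.
# The prompt template and rubric below are illustrative placeholders,
# not the exact format used by the Prometheus 2 repository.
from transformers import pipeline

judge = pipeline(
    "text-generation",
    model="prometheus-eval/prometheus-7b-v2.0",  # 7B judge as published on Hugging Face
    device_map="auto",
)

rubric = (
    "Criterion: empathy\n"
    "Score 1: dismissive of the user's concerns\n"
    "Score 3: acknowledges concerns but stays generic\n"
    "Score 5: addresses the user's concerns specifically and supportively"
)

prompt = (
    "You are a fair judge. Evaluate the response to the instruction "
    "against the rubric, then give a score.\n\n"
    f"Rubric:\n{rubric}\n\n"
    "Instruction: I keep getting headaches and I'm worried. What should I do?\n"
    "Response: Headaches are usually harmless, but since you're worried, "
    "tracking triggers and seeing a doctor if they persist is a sensible next step.\n\n"
    "Feedback (end with 'Score: <1-5>'):"
)

print(judge(prompt, max_new_tokens=256)[0]["generated_text"])
```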
A new dataset and merged weights
To train Prometheus 2, the researchers created a new pairwise comparison dataset called the "Preference Collection," which contains more than 1,000 different evaluation criteria beyond basic characteristics.
They found that the best results came from training two separate models - one for direct assessment, trained on the existing Feedback Collection dataset, and one for pairwise ranking, trained on the new Preference Collection - and then merging their learned weights.
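The merging step itself is simple in principle; below is a minimal sketch of linear weight merging between two fine-tunes of the same base model. The checkpoint names and the 0.5 interpolation coefficient are hypothetical placeholders, not the values used by the authors.

```python
# Minimal sketch: linear weight merging of two fine-tunes that share
# one base architecture. Checkpoint names and the coefficient are
# hypothetical placeholders, not the authors' actual values.
import torch
from transformers import AutoModelForCausalLM

direct = AutoModelForCausalLM.from_pretrained("my-org/judge-direct", torch_dtype=torch.bfloat16)
pairwise = AutoModelForCausalLM.from_pretrained("my-org/judge-pairwise", torch_dtype=torch.bfloat16)

alpha = 0.5  # weight given to the direct-assessment model
with torch.no_grad():
    pairwise_state = pairwise.state_dict()
    merged = {
        name: alpha * tensor + (1.0 - alpha) * pairwise_state[name]
        for name, tensor in direct.state_dict().items()
    }
direct.load_state_dict(merged)
direct.save_pretrained("judge-merged")
```

Because every parameter is interpolated element-wise, the two checkpoints must share the same architecture; the result is a single model that handles both evaluation formats.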
In tests on eight datasets (four for direct assessment, four for pairwise ranking), Prometheus 2 achieved the highest agreement with human judgments and with commercial language models of all freely available evaluator models.
Although it still lags behind GPT-4 and Claude 3 Opus in many tests, it significantly narrows the gap between open and proprietary evaluator models, the researchers report.
Prometheus 2 supports independent and transparent evaluation of language models for everyone, contributing to greater fairness and accessibility in the field, according to Kim's team. The code and data are available on GitHub.
The Prometheus 2 models (7B & 8x7B) are available on Hugging Face. According to the team, the faster 7B model achieves 80 percent of the evaluation performance of the 8x7B model, is on par with Mistral's Mixtral-8x7B, and outperforms Meta's Llama 2 70B.
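For readers who want to try the smaller judge, here is a minimal sketch of a pairwise-ranking call with the 7B model, again assuming the transformers text-generation pipeline; the prompt layout and verdict format are illustrative, not the repository's exact template.

```python
# Minimal sketch: pairwise ranking with the 7B judge. The prompt layout
# and verdict format are illustrative, not the exact template from the
# Prometheus 2 repository.
from transformers import pipeline

judge = pipeline(
    "text-generation",
    model="prometheus-eval/prometheus-7b-v2.0",
    device_map="auto",
)

prompt = (
    "You are a fair judge. Compare the two responses to the instruction "
    "and decide which one is better.\n\n"
    "Instruction: Explain what a hash table is in one sentence.\n"
    "Response A: A hash table maps keys to values by hashing each key "
    "to a bucket, giving near-constant-time lookups on average.\n"
    "Response B: It is a kind of list.\n\n"
    "Verdict (end with 'Winner: A' or 'Winner: B'):"
)

print(judge(prompt, max_new_tokens=128)[0]["generated_text"])
```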