Prometheus 2, a freely available language model optimized to evaluate other language models, is catching up with commercial models such as GPT-4.
Such evaluations let researchers and developers objectively measure and compare the performance of their language models and obtain detailed feedback on strengths and weaknesses, supporting targeted improvements and steadily raising model quality and reliability.
Until now, proprietary models such as GPT-4 have often been used for these evaluations, but they lack transparency, are difficult to control, and are not affordable for many, according to a research team led by Seungone Kim of KAIST AI. Kim's team developed Prometheus 2 to provide an independent, transparent, and detailed evaluation of language models for everyone.
Prometheus 2 can deliver evaluations close to those of humans and GPT-4, and it masters the two most common evaluation methods: direct assessment, which assigns a score on a scale, and pairwise ranking, which decides which of two responses is better.
It can also evaluate against user-defined criteria rather than being limited to general aspects such as helpfulness and harmlessness, which allows it to be tailored to specific applications, the researchers report.
For example, a medical advice chatbot can be trained and tested on criteria such as trustworthiness, empathy, and professional correctness, enabling the development of high-quality language models for different applications, the team explained.
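To make this concrete, the following is a minimal sketch of a direct-assessment call with a user-defined empathy rubric, assuming the model id published on Hugging Face and the standard transformers text-generation pipeline; the prompt template and rubric wording are illustrative placeholders, not the exact format from the Prometheus 2 repository.

```python
# Minimal sketch: direct assessment against a custom rubric.
# The prompt template and rubric below are illustrative placeholders,
# not the exact format used by the Prometheus 2 repository.
from transformers import pipeline

judge = pipeline(
    "text-generation",
    model="prometheus-eval/prometheus-7b-v2.0",  # 7B judge as published on Hugging Face
    device_map="auto",
)

rubric = (
    "Criterion: empathy\n"
    "Score 1: dismissive of the user's concerns\n"
    "Score 3: acknowledges concerns but stays generic\n"
    "Score 5: addresses the user's concerns specifically and supportively"
)

prompt = (
    "You are a fair judge. Evaluate the response to the instruction "
    "against the rubric, then give a score.\n\n"
    f"Rubric:\n{rubric}\n\n"
    "Instruction: I keep getting headaches and I'm worried. What should I do?\n"
    "Response: Headaches are usually harmless, but since you're worried, "
    "tracking triggers and seeing a doctor if they persist is a sensible next step.\n\n"
    "Feedback (end with 'Score: <1-5>'):"
)

print(judge(prompt, max_new_tokens=256)[0]["generated_text"])
```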
A new dataset and merged weights
To train Prometheus 2, the researchers created a new pairwise comparison dataset called the "Preference Collection," which contains more than 1,000 different evaluation criteria beyond basic characteristics.
They found that the best results came from training two separate models - one for direct assessment, trained on the existing Feedback Collection dataset, and one for pairwise ranking, trained on the new Preference Collection - and then merging their learned weights.
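The merging step itself is simple in principle; below is a minimal sketch of linear weight merging between two fine-tunes of the same base model. The checkpoint names and the 0.5 interpolation coefficient are hypothetical placeholders, not the values used by the authors.

```python
# Minimal sketch: linear weight merging of two fine-tunes that share
# one base architecture. Checkpoint names and the coefficient are
# hypothetical placeholders, not the authors' actual values.
import torch
from transformers import AutoModelForCausalLM

direct = AutoModelForCausalLM.from_pretrained("my-org/judge-direct", torch_dtype=torch.bfloat16)
pairwise = AutoModelForCausalLM.from_pretrained("my-org/judge-pairwise", torch_dtype=torch.bfloat16)

alpha = 0.5  # weight given to the direct-assessment model
with torch.no_grad():
    pairwise_state = pairwise.state_dict()
    merged = {
        name: alpha * tensor + (1.0 - alpha) * pairwise_state[name]
        for name, tensor in direct.state_dict().items()
    }
direct.load_state_dict(merged)
direct.save_pretrained("judge-merged")
```

Because every parameter is interpolated element-wise, the two checkpoints must share the same architecture; the result is a single model that handles both evaluation formats.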
In tests on eight datasets (four for direct assessment, four for pairwise ranking), Prometheus 2 achieved the highest agreement with human judgments and with commercial language models of all freely available evaluator models.
Although it still lags behind GPT-4 and Claude 3 Opus in many tests, it significantly narrows the gap between open and proprietary evaluator models, the researchers report.
Prometheus 2 supports independent and transparent evaluation of language models for everyone, contributing to greater fairness and accessibility in the field, according to Kim's team. The code and data are available on GitHub.
The Prometheus 2 models (7B & 8x7B) are available on Hugging Face. According to the team, the faster 7B model achieves 80 percent of the evaluation performance of the 8x7B model, is on par with Mistral's Mixtral-8x7B, and outperforms Meta's Llama 2 70B.
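For readers who want to try the smaller judge, here is a minimal sketch of a pairwise-ranking call with the 7B model, again assuming the transformers text-generation pipeline; the prompt layout and verdict format are illustrative, not the repository's exact template.

```python
# Minimal sketch: pairwise ranking with the 7B judge. The prompt layout
# and verdict format are illustrative, not the exact template from the
# Prometheus 2 repository.
from transformers import pipeline

judge = pipeline(
    "text-generation",
    model="prometheus-eval/prometheus-7b-v2.0",
    device_map="auto",
)

prompt = (
    "You are a fair judge. Compare the two responses to the instruction "
    "and decide which one is better.\n\n"
    "Instruction: Explain what a hash table is in one sentence.\n"
    "Response A: A hash table maps keys to values by hashing each key "
    "to a bucket, giving near-constant-time lookups on average.\n"
    "Response B: It is a kind of list.\n\n"
    "Verdict (end with 'Winner: A' or 'Winner: B'):"
)

print(judge(prompt, max_new_tokens=128)[0]["generated_text"])
```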