
Nvidia has introduced a new large language model that outperforms others in alignment benchmarks. The company achieved this through a special training procedure combining evaluation and preference models.


The new model, called Llama-3.1-Nemotron-70B-Instruct, is based on Meta's open-source Llama 3.1 model. Nvidia optimized it to provide helpful answers to user queries by combining different training methods.

However, the results only show that the answers align better with human preferences, not that the content is necessarily more accurate. In fact, the Nemotron variant performs slightly worse than the base model on the MMLU Pro benchmark, which tests factual knowledge.

Nvidia created two new datasets for training: HelpSteer2 and HelpSteer2-Preference. HelpSteer2 contains over 20,000 prompt-response pairs. Multiple annotators rated each response on a 1-5 scale for criteria like helpfulness, correctness, and coherence. HelpSteer2-Preference adds comparisons between two answers to the same prompt. Annotators indicated which answer they preferred and how strong their preference was.
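Schematically, entries in the two datasets might look like this (a minimal sketch; the field names are illustrative, not the official schema):

```python
# Hypothetical record in the style of HelpSteer2:
# one response, rated per criterion by multiple annotators.
helpsteer2_example = {
    "prompt": "Explain how a transformer processes text.",
    "response": "A transformer splits text into tokens ...",
    "helpfulness": 4,
    "correctness": 5,
    "coherence": 5,
}

# Hypothetical record in the style of HelpSteer2-Preference:
# two responses to the same prompt, plus a graded preference.
helpsteer2_preference_example = {
    "prompt": "Explain how a transformer processes text.",
    "response_a": "A transformer splits text into tokens ...",
    "response_b": "Transformers are neural networks ...",
    "preferred": "a",          # which answer the annotator chose
    "preference_strength": 2,  # how strongly they preferred it
}
```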


Combining reward models

Nvidia used these datasets to train two types of reward models: regression models and Bradley-Terry models. Regression models like SteerLM learn to assign scores for individual criteria to a single response. Bradley-Terry models learn from preference comparisons to maximize the reward gap between the preferred and the rejected response.
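As a rough illustration, the two objectives might look like this in PyTorch (a minimal sketch; `reward_model`, the encoded inputs, and the function names are assumptions for clarity, not Nvidia's implementation):

```python
import torch.nn.functional as F

def regression_loss(reward_model, response_enc, rating):
    # SteerLM-style regression: predict the annotated score directly.
    predicted = reward_model(response_enc)
    return F.mse_loss(predicted, rating)

def bradley_terry_loss(reward_model, chosen_enc, rejected_enc):
    # Bradley-Terry: push the preferred response's reward
    # above the rejected response's reward.
    margin = reward_model(chosen_enc) - reward_model(rejected_enc)
    return -F.logsigmoid(margin).mean()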

The researchers found that combining both approaches yielded the best results. They first trained a SteerLM regression model using only helpfulness ratings. This model then served as the starting point for a scaled Bradley-Terry model, which also considered the strength of preferences between responses.
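The preference strength from HelpSteer2-Preference can be folded in by weighting each comparison, for example like this (again a sketch of the idea, not the exact loss from Nvidia's paper):

```python
import torch.nn.functional as F

def scaled_bradley_terry_loss(reward_model, chosen_enc, rejected_enc, strength):
    # Weight each comparison by the annotated preference strength,
    # so strongly preferred answers pull the rewards further apart.
    margin = reward_model(chosen_enc) - reward_model(rejected_enc)
    return -(strength * F.logsigmoid(margin)).mean()
```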

To fine-tune the language model to the learned rewards, Nvidia used the REINFORCE algorithm. Unlike the commonly used PPO (Proximal Policy Optimization), REINFORCE estimates the value of an action more stably and without bias, according to the team.
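In its simplest form, REINFORCE scores sampled responses with the reward model and reinforces those that beat a baseline. A minimal sketch with a plain batch-mean baseline (Nvidia's exact variance-reduction setup may differ):

```python
def reinforce_loss(logprobs, rewards):
    # logprobs: summed token log-probabilities of each sampled response
    # rewards:  scalar scores from the trained reward model
    baseline = rewards.mean()        # simple baseline to reduce variance
    advantage = rewards - baseline
    # Minimizing this loss performs gradient ascent on expected reward.
    return -(advantage.detach() * logprobs).mean()
```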

Improved helpfulness and longer responses

The final Llama-3.1-Nemotron-70B-Instruct model achieved first place in several alignment benchmarks: Arena Hard, AlpacaEval 2 LC, and GPT-4-Turbo MT-Bench, outperforming top models such as GPT-4o and Claude 3.5 Sonnet. In Arena Hard, it scored 85.0, well ahead of its starting model, Llama-3.1-70B-Instruct, at 55.7.

Table of performance metrics for various AI models; the REINFORCE variant shows the highest values in MT-Bench and AlpacaEval.
The REINFORCE model stands out with particularly high scores in benchmarks that assess the usefulness of the answers.

The new model also produces longer responses, averaging 2,200 characters compared to about 1,800 for other models.


Nemotron passes the strawberry test

The improvements are evident in specific applications. For example, Llama-3.1-Nemotron-70B-Instruct can correctly answer the question "How many r in strawberry?" by going through the letters one by one and counting the "r"s. The original model and commercial competitors often gave the wrong answer to this question.
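For reference, the answer the models are being tested against is easy to verify:

```python
>>> "strawberry".count("r")
3
```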

Table comparing different AI models' answers to the question "How many r in strawberry?"; the REINFORCE model gives the most detailed answer.
This comparison table reveals the performance of different AI models when answering a seemingly simple question. Only the REINFORCE model shows a deeper understanding of the task.

Nvidia emphasizes that the new model demonstrates techniques for improving helpfulness in general applications. However, it has not been optimized for specialized domains like mathematics.

The Llama-3.1-Nemotron-70B-Instruct model is available for free testing on HuggingChat and at Nvidia.

Summary
  • Nvidia has introduced a new large language model, Llama-3.1-Nemotron-70B-Instruct, optimized to provide helpful answers to user queries. Training combined several methods, including regression and Bradley-Terry reward models.
  • Nvidia created two new datasets for the training data: HelpSteer2, with over 20,000 rated prompt-response pairs, and HelpSteer2-Preference, which adds pairwise comparisons between two responses to the same prompt. Combining both approaches produced the best results.
  • In alignment benchmarks such as Arena Hard, AlpacaEval 2 LC and GPT-4-Turbo MT-Bench, Llama-3.1-Nemotron-70B-Instruct achieved first place in each case, outperforming top models such as GPT-4o and Claude 3.5 Sonnet. The model is available for free in HuggingChat or from Nvidia.
Jonathan works as a technology journalist who focuses primarily on how easily AI can already be used today and how it can support daily life.