Nvidia has introduced a new large language model that outperforms competing models on alignment benchmarks. The company achieved this with a special training procedure that combines two kinds of reward models: one that rates individual responses and one that learns from preference comparisons.
The new model, called Llama-3.1-Nemotron-70B-Instruct, is based on Meta's open-source Llama 3.1 model. Nvidia optimized it to provide helpful answers to user queries by combining different training methods.
However, the results only show that the answers align better with human preferences, not that the content is necessarily more accurate. In fact, the Nemotron variant performs slightly worse than the base model on the MMLU Pro benchmark, which tests factual knowledge.
Nvidia created two new datasets for training: HelpSteer2 and HelpSteer2-Preference. HelpSteer2 contains over 20,000 prompt-response pairs. Multiple annotators rated each response on a 1-5 scale for criteria like helpfulness, correctness, and coherence. HelpSteer2-Preference adds comparisons between two answers to the same prompt. Annotators indicated which answer they preferred and how strong their preference was.
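For illustration, the snippet below sketches how the HelpSteer2 data could be inspected with the Hugging Face datasets library. The repository name and field names follow the public dataset card and should be read as assumptions, not as part of Nvidia's training code.

```python
# Illustrative sketch: inspecting HelpSteer2 with the Hugging Face "datasets" library.
# The repo id and field names are assumptions based on the public dataset card.
from datasets import load_dataset

ds = load_dataset("nvidia/HelpSteer2", split="train")  # assumed repository name

example = ds[0]
print(example["prompt"])    # the user prompt
print(example["response"])  # one model response to that prompt

# Per-criterion annotator ratings (assumed field names):
for criterion in ["helpfulness", "correctness", "coherence"]:
    print(criterion, example[criterion])
```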
Combining reward models
Nvidia used these datasets to train two types of reward models: regression models and Bradley-Terry models. Regression models such as SteerLM learn to assign a score to a single response for each criterion, such as helpfulness. Bradley-Terry models are trained on preference comparisons and learn to maximize the reward gap between the preferred and the rejected response.
The researchers found that combining both approaches yielded the best results. They first trained a SteerLM regression model using only helpfulness ratings. This model then served as the starting point for a scaled Bradley-Terry model, which also considered the strength of preferences between responses.
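The following sketch illustrates the two objectives in generic PyTorch. It is not Nvidia's implementation: the helpfulness targets, the scalar reward scores, and the margin-style use of preference strength are assumptions chosen to make the idea concrete.

```python
# Illustrative sketch (not Nvidia's code) of the two reward-model objectives described above.
import torch
import torch.nn.functional as F

def regression_loss(pred_helpfulness, annotated_helpfulness):
    # SteerLM-style regression: predict the annotated helpfulness rating directly.
    return F.mse_loss(pred_helpfulness, annotated_helpfulness)

def scaled_bradley_terry_loss(reward_chosen, reward_rejected, preference_strength):
    # Bradley-Terry: maximize the probability that the preferred response receives
    # the higher reward. Using the annotated preference strength as a margin is one
    # plausible way to "scale" the objective; Nvidia's exact formulation may differ.
    return -F.logsigmoid(reward_chosen - reward_rejected - preference_strength).mean()
```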
To fine-tune the language model against the learned reward signal, Nvidia used the REINFORCE algorithm. Unlike the more commonly used PPO (Proximal Policy Optimization), REINFORCE estimates the value of an action more stably and without bias, according to the team.
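As a rough illustration, here is a textbook REINFORCE-style loss with a simple batch-mean baseline. This is a generic sketch, not Nvidia's exact variant, and the tensor names are assumptions.

```python
# Minimal sketch of a REINFORCE-style update for reward fine-tuning (textbook version).
# log_probs: summed token log-probabilities of each sampled response under the current policy.
# rewards:   scalar scores from the trained reward model for the same responses.
import torch

def reinforce_loss(log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    # A simple baseline (the batch-mean reward) reduces variance without biasing the gradient.
    baseline = rewards.mean()
    advantages = rewards - baseline
    # Gradient ascent on expected reward == gradient descent on -(advantage * log-prob).
    return -(advantages.detach() * log_probs).mean()
```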
Improved helpfulness and longer responses
The final Llama-3.1-Nemotron-70B-Instruct model took first place in several benchmarks: Arena Hard, AlpacaEval 2 LC, and GPT-4-Turbo MT-Bench, outperforming top models such as GPT-4o and Claude 3.5 Sonnet. In Arena Hard, it scored 85.0, well ahead of the starting model Llama-3.1-70B-Instruct at 55.7.
The new model also produces longer responses, averaging 2,200 characters compared to about 1,800 for other models.
Nemotron passes the strawberry test
The improvements are evident in specific applications. For example, Llama-3.1-Nemotron-70B-Instruct can correctly answer the question "How many r in strawberry?" by going through the letters one by one and counting the "r"s. The original model and commercial competitors often gave the wrong answer to this question.
Nvidia emphasizes that the new model demonstrates techniques for improving helpfulness in general applications. However, it has not been optimized for specialized domains like mathematics.
The Llama-3.1-Nemotron-70B-Instruct model is available for free testing on HuggingChat and on Nvidia's website.