The Allen Institute for AI has released Tülu 3 405B, an open-source language model that reportedly matches or exceeds the performance of DeepSeek V3 and GPT-4o. The team credits much of this success to a new training approach called RLVR.

Built on Llama 3.1, the model uses "Reinforcement Learning with Verifiable Rewards" (RLVR), which only rewards the system when it produces verifiably correct answers. This approach works particularly well for mathematical tasks where results can be easily checked, according to AI2.
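
To make the mechanism concrete, here is a minimal sketch of such a reward function in Python. It is a hypothetical illustration, not AI2's implementation: the "Answer:" marker convention and the exact string match are assumptions chosen for clarity.

```python
# Minimal sketch of a verifiable reward in the spirit of RLVR.
# Illustration only: the marker convention and exact-match check
# are assumptions, not AI2's actual implementation.

def extract_final_answer(completion: str) -> str:
    """Pull the final answer out of a model completion, assuming it
    follows an 'Answer:' marker (a hypothetical convention)."""
    return completion.rsplit("Answer:", 1)[-1].strip()

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Binary reward: 1.0 only if the extracted answer matches the
    known-correct result, 0.0 otherwise. For math tasks the check
    could instead compare numerically evaluated expressions."""
    return 1.0 if extract_final_answer(completion) == ground_truth else 0.0

# A math problem where correctness is trivial to verify:
print(verifiable_reward("7 * 6 = 42. Answer: 42", "42"))  # 1.0
print(verifiable_reward("My guess: Answer: 41", "42"))    # 0.0
```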

Image: Allen AI

Training the 405-billion-parameter model pushed technical limits, requiring 32 compute nodes with 256 GPUs working together. Each training step took 35 minutes, and the team relied on workarounds such as a smaller helper model to manage the computational demands. The project faced ongoing technical hurdles that demanded constant attention, the kind of engineering detail rarely shared by companies developing similar models.

Significant performance improvements demonstrated

AI2 says Tülu outperforms other open-source models like Llama 3.1 405B Instruct and Nous Hermes 3 405B, despite training being cut short by computing constraints. It also matches or exceeds the performance of DeepSeek V3 and GPT-4o. The training process combined Supervised Finetuning, Direct Preference Optimization, and RLVR, an approach with parallels to DeepSeek's R1 training; in particular, the team found that reinforcement learning benefited larger models more.
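
Schematically, that recipe chains the three stages in sequence. The outline below is a hypothetical sketch with stub functions standing in for full training loops; the names are placeholders, not the project's actual code.

```python
# Hypothetical outline of the three-stage recipe described above.
# Each stage function is a stub standing in for a full training loop.

def supervised_finetune(model, sft_data):
    # Stage 1 stub: instruction tuning on curated prompt/response data.
    return model

def direct_preference_optimization(model, preference_data):
    # Stage 2 stub: optimize on chosen/rejected response pairs.
    return model

def rlvr(model, verifiable_tasks):
    # Stage 3 stub: RL where the reward is 1.0 only for answers that
    # pass an automatic correctness check (see verifiable_reward above).
    return model

def train_tulu3(base_model, sft_data, preference_data, verifiable_tasks):
    model = supervised_finetune(base_model, sft_data)
    model = direct_preference_optimization(model, preference_data)
    model = rlvr(model, verifiable_tasks)
    return model
```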

Users can test the model in the AI2 Playground, with code available on GitHub and models on Hugging Face.

Summary
  • The Allen Institute for AI has released Tülu 3 405B, an open-source language model that AI2 says matches or surpasses the performance of DeepSeek V3 and GPT-4o, thanks to a new training approach called Reinforcement Learning with Verifiable Rewards (RLVR).
  • RLVR only rewards the system when it produces verifiably correct answers, which works particularly well for mathematical tasks. Training the 405-billion-parameter model pushed technical limits, requiring 32 compute nodes with 256 GPUs working together, with each training step taking 35 minutes.
  • Despite having to end training early due to computing constraints, Tülu outperforms other open-source models like Llama 3.1 405B Instruct and Nous Hermes 3 405B. Users can test the model in the AI2 Playground, with code available on GitHub and models on Hugging Face.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.