Google's experimental AI model Gemini 2.0 Flash Thinking has jumped ahead of its competitors, scoring impressive results in math, science, and general performance tests.
According to testing platform lmarena.ai, the latest version of Gemini has made significant gains in the Chatbot Arena, improving its score by 17 points since December 2024. This puts it ahead of competitors like OpenAI's GPT-4o models and Anthropic's Claude 3.5 Sonnet.
The model improves across nearly all categories, taking the lead in complex tasks, programming, and creative writing. The only area where it still lags is style control: how well the model formats its outputs.
Under the hood, Google says it has added new features such as code execution and expanded the model's context window to one million tokens. The company has also improved how closely the model's thinking process aligns with its final responses.
Google relies on years of experience with planning systems
Google DeepMind CEO Demis Hassabis says this progress builds on more than a decade of work on AI planning systems, going all the way back to AlphaGo. By combining these tried-and-true planning methods with modern foundation models, the company has seen particularly strong results in math and science testing.
Our latest update to our Gemini 2.0 Flash Thinking model (available here: https://t.co/Rr9DvqbUdO) scores 73.3% on AIME (math) & 74.2% on GPQA Diamond (science) benchmarks. Thanks for all your feedback, this represents super fast progress from our first release just this past...
- Demis Hassabis (@demishassabis) January 21, 2025
This update follows the first version of Gemini 2.0 Flash Thinking, which Google launched in December 2024. That initial release introduced explicit thought processes that help the model improve its reasoning, and it also performed well in testing.