Microsoft puts GPT-4 ahead of Gemini Ultra again, using Google's own tricks

Microsoft claims that GPT-4, combined with a special prompting strategy, outperforms Google's Gemini Ultra on the language understanding benchmark MMLU (Measuring Massive Multitask Language Understanding).

Medprompt is a prompting strategy recently introduced by Microsoft that was originally developed for medical challenges. However, Microsoft researchers have found that it is also suitable for more general applications.

By running GPT-4 with a modified version of Medprompt, Microsoft has now achieved a new state-of-the-art (SoTA) score on the MMLU benchmark.

Microsoft's announcement is notable because Google highlighted Ultra's new top score on the MMLU benchmark during the big reveal of its new Gemini AI system last week.

Microsoft tricks back: Complex prompts improve benchmark performance

Google's messaging at the time of Gemini's launch was somewhat misleading: the Ultra model achieved the best result in the MMLU benchmark to date, but with a more complex prompting strategy than is usual in this benchmark. With the standard prompting strategy (5-shot), Gemini Ultra performs worse than GPT-4.
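For readers unfamiliar with the term, "5-shot" means the model sees five solved example questions before the question it has to answer. The following is a minimal sketch of how such a prompt could be assembled for a multiple-choice item. The function names and the exact text layout are assumptions made for illustration; they are not taken from Microsoft's or Google's evaluation code.

```python
from typing import Optional, Sequence, Tuple

LETTERS = "ABCD"

# One solved example: (question, answer choices, correct letter).
Example = Tuple[str, Sequence[str], str]


def format_item(question: str, choices: Sequence[str],
                answer: Optional[str] = None) -> str:
    """Render one multiple-choice item; leave the answer open if none is given."""
    lines = [question]
    lines += [f"{LETTERS[i]}. {choice}" for i, choice in enumerate(choices)]
    lines.append(f"Answer: {answer}" if answer is not None else "Answer:")
    return "\n".join(lines)


def build_few_shot_prompt(examples: Sequence[Example], question: str,
                          choices: Sequence[str]) -> str:
    """Concatenate the solved examples, then append the unanswered target item."""
    blocks = [format_item(q, c, a) for q, c, a in examples]
    blocks.append(format_item(question, choices))
    return "\n\n".join(blocks)
```

Passing five solved examples yields a 5-shot prompt; the model is then expected to continue the text with a single answer letter.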

With Medprompt+, Microsoft reports that GPT-4 reaches a record 90.10 percent on MMLU, surpassing Gemini Ultra's 90.04 percent.

To achieve this result, Microsoft researchers extended Medprompt into Medprompt+. The extension adds a simpler prompting method to Medprompt, along with a strategy that derives the final answer by combining the answers from the basic Medprompt strategy and the simpler method.
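The article does not spell out how the two sets of answers are combined. Purely as an illustration of the general idea, the sketch below merges answer letters from two prompting strategies with a weighted vote. The function name, the `simple_weight` parameter, and the tie-breaking rule are hypothetical and should not be read as Microsoft's actual method.

```python
from collections import Counter
from typing import Sequence


def combine_answers(complex_votes: Sequence[str], simple_votes: Sequence[str],
                    simple_weight: float = 1.0) -> str:
    """Pick a final answer letter by weighted vote over two strategies.

    Hypothetical combination rule for illustration only: each answer from
    the complex strategy counts 1.0, each answer from the simple strategy
    counts `simple_weight`.
    """
    if not complex_votes and not simple_votes:
        raise ValueError("at least one answer is required")
    tally: Counter = Counter()
    for letter in complex_votes:
        tally[letter] += 1.0
    for letter in simple_votes:
        tally[letter] += simple_weight
    # Iterate in alphabetical order so that ties resolve deterministically.
    return max(sorted(tally), key=lambda letter: tally[letter])
```

For example, `combine_answers(["A", "A", "B"], ["B", "B"])` gives B three votes against two for A, so the simpler strategy tips the result.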

The MMLU benchmark is a comprehensive test of general knowledge and reasoning. It contains roughly 16,000 multiple-choice questions from 57 subject areas, including mathematics, history, law, computer science, engineering, and medicine. It is widely considered one of the most important benchmarks for language models.

When Microsoft measures performance, GPT-4 outperforms Gemini Ultra on even more benchmarks

In addition to the MMLU result, Microsoft has published results for other benchmarks that compare GPT-4 with Gemini Ultra using the simple prompts that are standard for those benchmarks. Measured this way, GPT-4 is said to outperform Gemini Ultra on several benchmarks, including GSM8K, MATH, HumanEval, BIG-Bench-Hard, DROP, and HellaSwag.

Microsoft has published Medprompt and similar prompting strategies in a GitHub repository called Promptbase. The repository contains scripts, general tools, and information to help reproduce the results and improve the performance of the base models.

The mostly small differences in the benchmarks are unlikely to matter much in practice; they are mainly used by Microsoft and Google for PR purposes. However, what Microsoft is emphasizing here, and what was already apparent when Ultra was announced, is that the two models are on par.

This could mean either that OpenAI is ahead of Google or that it is very difficult to develop a much more capable LLM than GPT-4. It could be that LLM technology in its current form has already reached its limits, as Bill Gates recently suggested. GPT-4.5 or GPT-5 from OpenAI might provide some clarity here.
