GPT-4 dominates other LLMs in "real-world pragmatic missions," study finds

Midjourney prompted by THE DECODER

There are many commercial and open-source generative text AIs. A benchmark developed specifically for testing assistance tasks now shows that GPT-4 stands out in this segment.

"AgentBench" is a standardized test specifically designed to measure the assistance capabilities of large language models in "real-world pragmatic missions". All tests are conducted in real-time interactive environments. This makes the benchmark particularly suitable for what it is intended to measure: how well a large language model can handle a variety of everyday tasks in a total of eight areas.

Operating System: Here, LLMs must perform tasks related to the use of a computer operating system.
Database: This environment is about how well LLMs can work with databases.
Knowledge Graph: This environment tests how well LLMs can work with knowledge graphs.
Digital card game: This tests how well LLMs can understand digital card games and develop strategies.
Lateral thinking puzzles: This challenge tests how creative LLMs can be in solving problems. It requires them to think outside the box.
Budgets: This scenario is about tasks that occur in a budget, based on the Alfworld dataset.
Internet Shopping: This scenario tests how well LLMs perform on tasks related to online shopping.
Web browsing: Based on the Mind2Web dataset, this scenario tests how well LLMs can perform tasks related to using the Internet.

Video: AgentBench

GPT-4 is by far the most capable AI model

In total, the team of researchers from Tsinghua University, Ohio State University, and UC Berkeley tested 25 large language models from multiple providers, as well as open-source models.

The result: OpenAI's GPT-4 achieved the highest overall score of 4.41 and was ahead in almost all disciplines. Only in the web shopping task, GPT-3.5 came out on top.

The Claude model from competitor Anthropic follows closely with an overall score of 2.77, ahead of OpenAI's free GPT-3.5 Turbo model. The average score of the commercial models is 2.24.

The gap between GPT-4 and the commercial and open-source models, in general, is even more pronounced: The average performance score of open-source models is only 0.42 points. They generally fail in all complex tasks and are far behind GPT-3.5, the researchers write. OpenChat based on Llama-13B is the best-performing OS model with 1.15 points.

We unveil that while top LLMs are becoming capable of tackling complex real-world missions, for open-sourced competitiors there are still a long way to go.

From the paper

The preliminary results list is missing many commercial models such as PaLM 2 from Google, Claude 2 or the models from Aleph Alpha as well as other open-source models.

The research team is therefore making a toolkit, the datasets, and the benchmark environment available to the research community on Github for more extensive performance comparisons. More information and a demo are available on the project website.

Join our community

Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.

Recommendation

AI research

GPT-4 dominates other LLMs in "real-world pragmatic missions," study finds

GPT-4 is by far the most capable AI model

Study reveals major reasoning flaws in smaller AI language models

OpenAI's new method shows how GPT-4 "thinks" in human-understandable concepts

GPT-4 fails at simple tasks that humans can easily solve

GPT-4 Turbo's best new feature doesn't work very well

OpenAI launches new ChatGPT agent that automates complex tasks for Pro, Plus, and Team

Kimi-K2 is the next open-weight AI milestone from China after Deepseek

New Energy-Based Transformer architecture aims to bring better "System 2 thinking" to AI models

GPT-4 dominates other LLMs in "real-world pragmatic missions," study finds

GPT-4 is by far the most capable AI model

Share

Bank details