
Is GPT-4 Turbo "smarter" than GPT-4, as announced by Sam Altman? And what does that even mean?


OpenAI's latest AI model GPT-4 Turbo is available as a preview via the OpenAI API and directly in ChatGPT. Since the preview was released on 6 November, the model has already been updated once.

According to OpenAI CEO Sam Altman, GPT-4 Turbo is "much faster", "smarter" and, above all, cheaper than GPT-4. While the gains in speed and price are obvious, the "smartness" of the new model is the subject of heated debate on forums and social media. Some users report obvious performance losses and see GPT-4's capabilities as steadily declining since the first version, others report improvements, and still others report gains in some use cases and losses in others.

GPT-4 vs. GPT-4 Turbo in code benchmarks

The developers of Mentat AI, an AI-based coding assistant, tested the new model on coding tasks. GPT-4 (gpt-4-0314) solved 86 out of 122 tasks, while GPT-4 Turbo (gpt-4-1106-preview) solved 84 out of 122 tasks. However, a closer look at the results showed that GPT-4 solved 76 items in the first trial and 10 in the second trial, while GPT-4 Turbo solved only 56 items in the first trial and 28 in the second trial.
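To make the gap between the two attempts easier to see, the counts above can be converted into pass rates. The following is a minimal sketch that uses only the figures reported by Mentat AI; the helper name `pass_rates` and the dictionary layout are illustrative and not part of Mentat's benchmark code.

```python
# Counts reported by Mentat AI in the text above (out of 122 tasks).
TOTAL_TASKS = 122
results = {
    "gpt-4-0314": {"first_attempt": 76, "second_attempt": 10},
    "gpt-4-1106-preview": {"first_attempt": 56, "second_attempt": 28},
}


def pass_rates(first_attempt, second_attempt, total=TOTAL_TASKS):
    """Return (first-attempt rate, rate after the retry) as fractions of total."""
    first_rate = first_attempt / total
    overall_rate = (first_attempt + second_attempt) / total
    return first_rate, overall_rate


for model, counts in results.items():
    first, overall = pass_rates(**counts)
    print(f"{model}: first attempt {first:.1%}, after retry {overall:.1%}")
```

The overall rates end up close together, while the first-attempt rates differ by roughly 16 percentage points, which is the gap the memorization hypothesis tries to explain.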

Image: Mentat AI

The team, therefore, suspects that GPT-4 memorized a large proportion of the training tasks and that this knowledge may have been lost in GPT-4 Turbo due to optimizations such as distillation. To test this theory, the team repeated the benchmarks without showing the models the instructions for each task, just the names of the tasks, the functions, and a reference to the source of the tests. Without instructions, an AI model can only solve the tasks if it has memorized them.

GPT-4 was able to solve almost 60 percent of the tasks, GPT-4 Turbo only 30 percent. The team interprets these results as a clear indication that GPT-4 memorized more tasks than GPT-4 Turbo. According to these results, GPT-4 could therefore have a kind of "memory bonus" that gives it an advantage in some benchmarks - and in practice. Of course, this could also be a disadvantage in other use cases, for example, if it spits out memorized blocks of code instead of looking for a more efficient solution.

Aider, another AI coding assistant, also tested the new model with coding tasks. As expected, GPT-4 Turbo is significantly faster than previous GPT-4 models, although precise measurements are not yet possible because of the limits OpenAI currently imposes on the preview model. It also appears to be better at generating correct code on the first attempt: it solves 53% of the coding tasks on the first try, while previous models solve only 46-47%. After a second chance to correct errors by checking the error output of the test suite, GPT-4 Turbo performs about as well (~62%) as the old models (63-64%).

Image: Aider

The tested GPT-4 model is expected to be decommissioned in June 2024.

GPT-4 Turbo probably uses chain-of-thought

In another test, X user Vlad, CEO of KagiHQ, shows the performance of the two models in the PyLLM benchmark. According to the results, GPT-4 Turbo has an accuracy of 87% compared to 52% for GPT-4 and is almost five times faster, at 48 versus 10 tokens per second. GPT-4 Turbo is also 30% cheaper in practice. According to Vlad, it could be even cheaper, but its output is on average twice as verbose as GPT-4's. As a possible explanation for this jump in performance, another user points out that GPT-4 Turbo seems to automatically use chain-of-thought (CoT) prompting in the background, which would also explain the longer output. But even with CoT, GPT-4's accuracy is only just under 60%.
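How a lower per-token price and longer output combine into an effective saving can be sketched with a small calculation. The per-token prices and the workload below are assumptions for illustration (they are not taken from the benchmark and may not match Vlad's setup); only the verbosity factor of 2 and the token rates of 48 and 10 per second come from the text above.

```python
def request_cost(input_tokens, output_tokens, input_price, output_price):
    """Cost in USD for one request; prices are USD per 1,000 tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1000


# Assumed list prices in USD per 1,000 tokens (not stated in the article).
GPT4 = {"input_price": 0.03, "output_price": 0.06}
GPT4_TURBO = {"input_price": 0.01, "output_price": 0.03}

# Hypothetical workload: 1,000 prompt tokens and a 500-token GPT-4 answer.
input_tokens = 1000
gpt4_output_tokens = 500
verbosity = 2  # GPT-4 Turbo output is reported as about twice as long

gpt4_cost = request_cost(input_tokens, gpt4_output_tokens, **GPT4)
turbo_cost = request_cost(input_tokens, gpt4_output_tokens * verbosity, **GPT4_TURBO)

savings = 1 - turbo_cost / gpt4_cost  # fraction saved per request
speedup = 48 / 10                     # tokens per second, from the benchmark

print(f"GPT-4: ${gpt4_cost:.3f}, GPT-4 Turbo: ${turbo_cost:.3f}")
print(f"Effective saving: {savings:.0%}, speedup: {speedup:.1f}x")
```

Under these assumptions the doubled output length cancels the lower output price, so the saving comes entirely from cheaper input tokens; a workload with a different input-to-output ratio would give a different figure.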


X user Jeffrey Wang, co-founder of Metaphor Systems, tested the new model on SAT reading tests, in which the model has to answer questions about a text. He found that GPT-4 Turbo made significantly more mistakes than GPT-4.

OpenAI's lack of transparency could be an opportunity for the competition

So is GPT-4 Turbo "smarter"? There is no clear answer yet, at least not if you take "smarter" to mean "better": so far, the tests show no clear leap in performance. The open-ended nature of the term - "smarter" can mean many things, including greater resource efficiency in relation to performance - is a deliberate choice. Altman would probably have said that GPT-4 Turbo is better or more powerful than GPT-4 if that were clearly the case.

So the focus of the presentation was on the new "smart" features: GPT-4 Turbo can handle more text, has a JSON mode and more modalities integrated, can call functions more reliably and has more up-to-date knowledge.

In addition, OpenAI has not yet published a model card or benchmarks for GPT-4 Turbo, as it did for GPT-4 - a trend that began with the last update of GPT-4, whose capabilities were also repeatedly criticized. This lack of transparency means that it is up to the community to figure out which model is best suited for which tasks, and how to control it most accurately. The examples given here are not sufficient for this because of their small sample sizes. Things get even more complicated in ChatGPT, where OpenAI's model interfaces with other systems.


The question of which model is better, and not just "smarter", therefore remains open - as does whether benchmarks can be used to answer it at all. It is unclear, to say the least, to what extent benchmarks reflect real-world experience. Nevertheless, it would be desirable for end users if OpenAI were a little more open about the specific improvements and capabilities of the new models so that they can make an informed decision. If OpenAI does not take on this role, competitors such as Google could. The lack of transparency could be an opportunity for Google to use Gemini to highlight clear advantages over OpenAI's offering.

Summary
  • OpenAI's GPT-4 Turbo is "faster", "smarter" and cheaper than GPT-4, according to CEO Sam Altman, but the "smartness" of the new model is being debated on forums and social media.
  • Tests by Mentat AI and Aider, AI-based coding assistants, show mixed results in terms of GPT-4 Turbo's performance compared to GPT-4, with some tasks being solved better and others worse.
  • OpenAI has not yet published a model card or benchmarks for GPT-4 Turbo, leaving it up to the community to figure out which model is best suited for which tasks. This lack of transparency could be an opportunity for competitors such as Google.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.