Date: 2025-08-02

Matthias is the co-founder and publisher of THE DECODER, exploring how AI is fundamentally changing the relationship between humans and computers.

OpenAI is getting ready to roll out GPT-5, the follow-up to GPT-4 from March 2023. But according to The Information, the new model is shaping up to be more of a modest upgrade than a game-changer.

Internal testing shows some progress in areas like programming, math, and following complex instructions—for example, automating customer service workflows. Still, the step up from GPT-4 is much smaller than the jump from GPT-3 (2020) to GPT-4 (2023).

According to people familiar with the evaluations, GPT-5 produces more user-friendly applications and is better at managing its own compute usage. But overall, the advances are reportedly modest.

This plateau has been predicted for a while. Microsoft co-founder Bill Gates was already forecasting this in the fall of 2023, while LLM critic Gary Marcus, former OpenAI chief scientist Ilya Sutskever, and Meta's Yann LeCun have repeatedly argued that the Transformer-based LLM architecture is reaching its limits.

One example: OpenAI originally developed an internal model codenamed "Orion" to directly succeed GPT-4o. But Orion didn't deliver the kind of gains OpenAI was looking for, so it was released as GPT-4.5 in early 2025 instead—a clear sign the company wasn't ready to call it GPT-5. For good reason. GPT-4.5 barely made a mark, ran slower and cost more than GPT-4o, and quickly disappeared from the spotlight.

A central problem, according to The Information, was that changes in pretraining worked for smaller models but didn't scale well as the models got bigger. At the same time, OpenAI was running out of high-quality web data to train on. Orion stalled before it could earn the GPT-5 name.

As recently as June 2025, none of the models in development at OpenAI were considered strong enough to be called GPT-5, according to sources cited by The Information.

This challenge isn't unique to OpenAI. For example, the latest launch of Anthropic's Claude 4 models also delivered only modest improvements overall, except for a noticeable boost in coding performance. Anthropic is already using a hybrid architecture that combines a large language model with specialized reasoning components, an approach that OpenAI may adopt for GPT-5 as well.

Reasoning models show powerful capabilities but are difficult to control

Alongside its main models, OpenAI has also been working on so-called reasoning models or "large reasoning models" (LRMs), which tend to perform better on complex tasks when given more compute. These models could become useful tools for math, web search, and programming—or they might point to a new direction for language models entirely. Still, there are open questions about how well they generalize and how much energy they require.

OpenAI's biggest breakthrough in this area came at the end of 2023 with the Q* model, which reportedly cracked math problems it hadn't seen before. Building on this, OpenAI developed the o1 and o3 models, both based on GPT-4o and aimed at specialized applications.

Both o1 and o3 were trained using reinforcement learning (RL), but the o3 "teacher model" got much more compute and direct access to web and code sources. During RL training, the model generates answers to expert-level questions and improves itself by comparing its responses to human solutions.
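The training loop described here can be illustrated with a toy policy-gradient example. This is only a minimal sketch, not OpenAI's method: it assumes a fixed list of hypothetical candidate answers, an exact-match reward against a single reference solution, and a plain REINFORCE update on a softmax policy. The names `softmax`, `train`, and `candidates` are made up for illustration.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(candidates, reference, steps=500, lr=0.5, seed=0):
    """Toy REINFORCE loop (assumption: reward is 1.0 on exact match, else 0.0)."""
    rng = random.Random(seed)
    logits = [0.0] * len(candidates)
    for _ in range(steps):
        probs = softmax(logits)
        # The "model" samples one answer from its current policy.
        i = rng.choices(range(len(candidates)), weights=probs)[0]
        # The answer is compared with the human reference solution.
        reward = 1.0 if candidates[i] == reference else 0.0
        # Softmax policy gradient: d log p(i) / d logit_j = onehot(i)_j - probs_j
        for j in range(len(logits)):
            grad = (1.0 if j == i else 0.0) - probs[j]
            logits[j] += lr * reward * grad
    return softmax(logits)


candidates = ["12", "15", "18"]  # hypothetical answers to one question
final_probs = train(candidates, reference="15")
print(dict(zip(candidates, final_probs)))
```

In this sketch, only answers that match the reference are reinforced, so probability mass shifts toward the correct answer over time. Real systems grade free-form responses and update a large neural network, which this toy deliberately leaves out.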

But when these models were adapted for chat, o3 reportedly lost some of its capability. As one source told The Information, the chat version had to be "dumbed down" because it wasn't trained enough for real conversation, which hurt performance in both chat and API settings.

In April, the ARC-AGI benchmark highlighted this issue: the public o3 version did worse on a tough puzzle test than the internal base model, showing that many original reasoning abilities didn't survive the transition to chat.

The o3-pro model is a good example of how tricky this balance is. Experts rated o3-pro higher than o3 for science, programming, and business tasks, but in daily life it struggled with simple stuff. For instance, replying to "Hi, I'm Sam Altman" took several minutes and racked up $80 in compute costs for a trivial answer—a textbook case of overthinking. GPT-5 is supposed to land somewhere between specialized reasoning and practical conversation.

GPT-5 aims to lay the groundwork for more autonomous AI agents

Even with these technical setbacks, GPT-5 is meant to drive progress in "agentic" systems, meaning applications where an AI can carry out multiple steps on its own. The new model should be able to follow complex instructions more efficiently, with less human oversight. (GPT-4.1 was already moving in this direction.)

GPT-5 is also said to be more capable than GPT-4o without using much more compute. Internal tests show it's better at gauging which tasks need more or less computing power, which could make processes more efficient and help avoid the kind of overthinking seen in models like o3-pro.
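The idea of matching compute to task difficulty can be sketched as a simple router. This is a hypothetical illustration only: how GPT-5 actually makes this decision is not public, and the tiers, token numbers, keyword list, and the `difficulty_score` heuristic below are all assumptions standing in for what would be a learned estimate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    label: str
    max_reasoning_tokens: int


# Hypothetical tiers; the token counts are illustrative, not real limits.
TIERS = [
    Budget("minimal", 0),
    Budget("moderate", 2_000),
    Budget("extended", 20_000),
]

# Hypothetical keywords that hint at a harder task.
HARD_HINTS = ("prove", "derive", "debug", "optimize", "refactor")


def difficulty_score(prompt: str) -> int:
    """Crude stand-in for a learned difficulty estimate (0 = easy, 2 = hard)."""
    text = prompt.lower()
    score = 0
    if len(text.split()) > 40:
        score += 1
    if any(hint in text for hint in HARD_HINTS):
        score += 1
    return score


def pick_budget(prompt: str) -> Budget:
    """Map the difficulty score to a reasoning budget tier."""
    return TIERS[min(difficulty_score(prompt), len(TIERS) - 1)]


print(pick_budget("Hi, I'm Sam Altman").label)
print(pick_budget("Prove that the sum of two even numbers is even").label)
```

Under these assumptions, a short greeting lands in the cheapest tier instead of triggering minutes of reasoning, while a proof request gets a larger budget.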

For OpenAI, even small improvements in GPT-5 may be enough to keep customers and investors engaged. The company is still growing fast, despite high operating costs. In coding-related AI, where Anthropic currently leads with its Claude models, OpenAI is hoping to regain ground with GPT-5.

OpenAI is increasingly relying on reinforcement learning, especially a "universal verifier" that automatically rates the quality of model responses—even for subjective tasks like creative writing.

This universal verifier was also used in the OpenAI model that recently achieved gold-medal-level performance at the International Mathematical Olympiad. OpenAI researcher Jerry Tworek has suggested that this RL system could form the basis for artificial general intelligence (AGI).