Summary

A study examines how ChatGPT's behavior changed between March and June 2023 and finds some significant changes - and not for the better.

Researchers from Stanford University and UC Berkeley evaluated the performance of an older and a newer version of GPT-3.5 and GPT-4 on four different tasks:

  1. Solving math problems
  2. Answering tricky/dangerous questions
  3. Generating code
  4. Visual reasoning

The study shows that the performance and behavior of the two versions differ, sometimes dramatically. For example, GPT-4 (March 2023) is able to recognize prime numbers with 97.6% accuracy, while GPT-4 (June 2023) fails (2.4% accuracy) and ignores the chain-of-thought prompt. GPT-3.5 (June 2023), on the other hand, performs significantly better than GPT-3.5 (March 2023) on this task.

GPT-4 produces significantly less directly executable code in June

There is also a significant drop in the direct executability of code: For GPT-4, the percentage of directly executable generations dropped from 52% in March to 10% in June. The drop for GPT-3.5 was also significant (from 22% to 2%). The reason: In March, both GPT-4 and GPT-3.5 followed the user's instructions ("just the code") and thus produced directly executable generations.


In June, however, both models wrapped the code in extra triple quotes before and after, making it not directly executable without manual intervention. The quality of the generated code appears to be at a similar level, but the team did not perform a detailed comparison.
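The kind of manual intervention this requires can be illustrated with a short script. The snippet below is a minimal sketch, not from the study, and assumes the triple quotes are Markdown-style backtick fences; it strips them so the remaining code can be run directly:

```python
import re

def strip_code_fences(text: str) -> str:
    """Remove a Markdown-style triple-backtick fence (e.g. ```python ... ```)
    so that only the raw code remains."""
    match = re.search(r"```(?:\w+)?\n(.*?)```", text, re.DOTALL)
    return match.group(1).strip() if match else text.strip()

# Hypothetical June-style response that wraps the code in a fence
response = "```python\nprint(1 + 1)\n```"
print(strip_code_fences(response))  # -> print(1 + 1)
```

A pipeline that fed March-era output straight into an interpreter would break on June-era output without a preprocessing step like this.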

The team also shows that GPT-4 answers fewer tricky questions in June and performs slightly better on visual reasoning tasks - but makes errors in June that the March model does not. The researchers also note a slight improvement for GPT-3.5.

Team recommends that companies implement monitoring analyses for LLM services

Is GPT-4 worse now than it was in March? The study does not provide a clear answer to this question, but it does seem to show that the June version contains bugs that were not present in the older version.

So what is the real message here?

"Our findings demonstrate that the behavior of GPT-3.5 and GPT-4 has varied significantly over a relatively short amount of time," the researchers said. "This highlights the need to continuously evaluate and assess the behavior of LLMs in production applications."


Whether these changes are bugs, as Peter Welinder, VP of Product at OpenAI, suggested in a similar case, or evidence of a general decline in quality caused by optimizations OpenAI has made to cut costs, remains unclear - and that opacity is a problem for OpenAI's customers.

As a result, the team recommends that users or companies using LLM services as part of their workflow implement a similar monitoring analysis for their applications based on the research presented. To support this process and further research into language model drift, the team is making the evaluation and ChatGPT data available on GitHub.
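What such a monitoring analysis could look like in practice is sketched below. This is an illustrative example, not the team's actual evaluation code: it assumes you have stored model generations per version and a task-specific check (here: whether a generation compiles as plain Python), and it compares the pass rate across snapshots:

```python
def pass_rate(generations, check):
    """Fraction of model generations that pass a task-specific check."""
    if not generations:
        return 0.0
    return sum(1 for g in generations if check(g)) / len(generations)

def is_directly_executable(code: str) -> bool:
    """Illustrative criterion: the generation compiles as plain Python."""
    try:
        compile(code, "<generation>", "exec")
        return True
    except SyntaxError:
        return False

# Hypothetical snapshots of generations from two model versions
march = ["print(1)", "x = 2"]
june = ["```python\nprint(1)\n```", "y = 3"]  # fenced output fails the check

print(pass_rate(march, is_directly_executable))  # 1.0
print(pass_rate(june, is_directly_executable))  # 0.5
```

Running a fixed prompt set through each new model version and alerting when such a rate drops is the kind of continuous evaluation the researchers recommend.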

OpenAI reacts to the research, "looking into it"

Responding to the research, Logan Kilpatrick, currently lead of developer relations at OpenAI, said the company is aware of the reported regressions and is looking into them. He also called for a public OpenAI eval set to test against these known regression cases as new models are released.

In a separate post, he thanked everyone for reporting their experiences with GPT-4's performance. The company thus seems to be shifting its communication after several instances of OpenAI staff saying nothing had changed, and after Welinder shared his hypothesis that "when you use it more heavily, you start noticing issues you didn't see before."


Generative AI market still needs quality control

However, the problems OpenAI faces with GPT-4 are not unique to the company. Any company that deploys and regularly updates AI models may be reluctant to communicate changes that do not directly translate into performance improvements or cost reductions for its customers - and in some cases, the company may not even notice a degradation before deployment. Case in point: Reports of performance degradation have also surfaced for the generative AI service Midjourney, and they follow the same pattern of personal anecdotes and speculation.

This all just shows how new and complicated the generative AI product market is. In the future, evaluations like the one Kilpatrick wants to build may help move from anecdotal evidence to clear benchmarks.

  • A study by researchers from Stanford University and UC Berkeley reveals significant changes in the performance of GPT-3.5 and GPT-4 models between March and June, with some tasks showing decreased accuracy and direct executability of code.
  • The study highlights the need for continuous evaluation and assessment of language models, as their behavior can vary significantly over a short period.
  • OpenAI's lead of developer relations, Logan Kilpatrick, acknowledged the reported regressions and called for a public OpenAI evaluation set to test against known regression cases as new models are released.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.