Study shows evidence of degradation in ChatGPT's performance since March

A study examines how ChatGPT's behavior changed between March and June 2023 and finds some significant changes, and not for the better.

Researchers from Stanford University and UC Berkeley evaluated the performance of an older and a newer version of GPT-3.5 and GPT-4 on four different tasks:

Solving math problems
Answering tricky/dangerous questions
Generating code
Visual reasoning

The study shows that the performance and behavior of the two versions differ, sometimes dramatically. For example, GPT-4 (March 2023) is able to recognize prime numbers with 97.6% accuracy, while GPT-4 (June 2023) fails (2.4% accuracy) and ignores the chain-of-thought prompt. GPT-3.5 (June 2023), on the other hand, performs significantly better than GPT-3.5 (March 2023) on this task.

GPT-4 produces significantly less directly executable code in June

There is also a significant drop in the direct executability of code: For GPT-4, the percentage of directly executable generations dropped from 52% in March to 10% in June. The drop for GPT-3.5 was also significant (from 22% to 2%). The reason: In March, both GPT-4 and GPT-3.5 followed the user's instructions ("just the code") and thus produced directly executable generations.

In June, however, both models wrapped the code in extra triple quotes before and after it, so it could no longer be executed without manual intervention. The quality of the generated code appears to be at a similar level, but the team did not perform a detailed comparison.
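The failure mode described above can be illustrated with a minimal Python sketch. It assumes the wrapper is a markdown-style fence of three backticks with an optional language tag, which the article does not confirm, and it treats "parses as Python" as a rough stand-in for direct executability; the study's own check may well be stricter. The helper names strip_fences and is_directly_executable are hypothetical.

```python
import re

# Built at runtime so this example contains no literal fence of its own.
FENCE = "`" * 3

# Opening fence, optional language tag, then the body up to the closing fence.
_FENCED = re.compile(
    re.escape(FENCE) + r"[a-zA-Z0-9_+-]*[ \t]*\n(.*?)\n?" + re.escape(FENCE),
    re.DOTALL,
)


def strip_fences(response: str) -> str:
    """Return the body of the first fenced block, or the stripped text if none."""
    match = _FENCED.search(response)
    return match.group(1) if match else response.strip()


def is_directly_executable(source: str) -> bool:
    """True if the text parses as Python. The code is compiled, never run."""
    try:
        compile(source, "<llm-output>", "exec")
        return True
    except SyntaxError:
        return False


# A response wrapped the way the June models reportedly wrapped theirs.
wrapped = FENCE + "python\nprint(1 + 1)\n" + FENCE

print(is_directly_executable(wrapped))                # raw response
print(is_directly_executable(strip_fences(wrapped)))  # after unwrapping
```

A post-processing step like this is the kind of "manual intervention" the wrapped output requires; it says nothing about whether the unwrapped code is correct.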

The team also shows that GPT-4 answers fewer tricky questions in June and performs slightly better on visual reasoning tasks, although the June model makes errors that the March model does not. The researchers also note a slight improvement for GPT-3.5.

Team recommends that companies implement monitoring analysis for LLM services

Is GPT-4 worse now than it was in March? The study does not provide a clear answer to this question, but it does seem to show that the June version contains bugs that were not present in the older version.

So what is the real message here?

"Our findings demonstrate that the behavior of GPT-3.5 and GPT-4 has varied significantly over a relatively short amount of time," the researchers said. "This highlights the need to continuously evaluate and assess the behavior of LLMs in production applications."

Whether these changes are bugs, as Peter Welinder, VP Product at OpenAI, suggested in a similar case, or evidence of a general decline in quality resulting from optimizations OpenAI has made to cut costs, remains unclear. That is a problem, because the changes are opaque to OpenAI's customers.

As a result, the team recommends that users or companies using LLM services as part of their workflow implement a similar monitoring analysis for their applications based on the research presented. To support this process and further research into language model drift, the team is making the evaluation and ChatGPT data available on GitHub.
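A monitoring check of this kind could look roughly like the following Python sketch. It is not the researchers' evaluation code. It reuses the prime-number task from the study as an example: a fixed set of numbers is sent to two model snapshots, and a regression is flagged when accuracy drops by more than a tolerance. The model callables, the yes/no answer format, and the 5-point tolerance are all assumptions made for illustration; in practice the callables would wrap whatever client an application uses.

```python
from typing import Callable, Dict, Sequence


def is_prime(n: int) -> bool:
    """Ground truth by trial division, sufficient for small test sets."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def accuracy(model: Callable[[int], str], numbers: Sequence[int]) -> float:
    """Share of numbers where the model's yes/no answer matches ground truth."""
    correct = 0
    for n in numbers:
        predicted = model(n).strip().lower().startswith("yes")
        correct += predicted == is_prime(n)
    return correct / len(numbers)


def drift_report(
    old: Callable[[int], str],
    new: Callable[[int], str],
    numbers: Sequence[int],
    tolerance: float = 0.05,  # assumed threshold, tune per application
) -> Dict[str, object]:
    """Compare two snapshots on the same inputs and flag a drop in accuracy."""
    old_acc = accuracy(old, numbers)
    new_acc = accuracy(new, numbers)
    return {"old": old_acc, "new": new_acc, "regressed": old_acc - new_acc > tolerance}


# Hypothetical stand-ins for two model snapshots; no real API is called.
def old_snapshot(n: int) -> str:
    return "Yes" if is_prime(n) else "No"


def new_snapshot(n: int) -> str:
    return "No"  # always answers "No", regardless of the input


report = drift_report(old_snapshot, new_snapshot, [7, 11, 13, 15])
print(report)
```

Running the same fixed inputs against every new snapshot is what turns anecdotal impressions into a comparable number over time.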

OpenAI reacts to the research, "looking into it"

Responding to the research, Logan Kilpatrick, currently lead of developer relations at OpenAI, said the company is aware of the reported regressions and is looking into them. He also called for a public OpenAI eval set to test against these known regression cases as new models are released.

Thanks for taking the time to do this research! The team is aware of the reported regressions and looking into it.

Side note: it would be cool for research like this to have a public OpenAI eval set. That way, as new models come online, we can test against these known…

— Logan.GPT (@OfficialLoganK) July 19, 2023

In a separate post, he thanked everyone for reporting their experiences with GPT-4 model performance. The company thus seems to be shifting its communication after a few instances of OpenAI staff saying nothing had changed and Welinder sharing his hypothesis that "when you use it more heavily, you start noticing issues you didn't see before."

Generative AI market still needs quality control

However, the problems OpenAI faces with GPT-4 are not unique to the company. Any company that deploys and regularly updates AI models may be reluctant to communicate changes if they do not directly translate into performance improvements or cost reductions for its customers, and in some cases the company may not even notice degradation prior to deployment. Case in point: Reports of performance degradation have also surfaced with the generative AI service Midjourney. Those reports show the same pattern of personal accounts and speculation.

This all just shows how new and complicated the generative AI product market is. In the future, evaluations like the one Kilpatrick wants to build may help move from anecdotal evidence to clear benchmarks.
