
OpenAI has rolled out its latest language model, GPT-5.1, to the API. Pricing stays the same as GPT-5.

The release adds two variants aimed at long-running coding tasks: gpt-5.1-codex and gpt-5.1-codex-mini. Cached prompts are now retained for up to 24 hours, which should reduce latency and cost for requests that repeat the same prompt content.

According to OpenAI's published benchmarks, GPT-5.1 delivers moderate improvements over GPT-5. On SWE-bench Verified, a coding benchmark, GPT-5.1 scores 76.3 percent, up from 72.8 percent. Most other results move by only a few points in either direction, and some decline slightly, which suggests an incremental update and matches the ".1" increment in the name.

| Evaluation | GPT-5.1 (high) | GPT-5 (high) |
| --- | --- | --- |
| SWE-bench Verified (all 500 problems) | 76.3% | 72.8% |
| GPQA Diamond (no tools) | 88.1% | 85.7% |
| AIME 2025 (no tools) | 94.0% | 94.6% |
| FrontierMath (with Python tool) | 26.7% | 26.3% |
| MMMU | 85.4% | 84.2% |
| Tau2-bench Airline | 67.0% | 62.6% |
| Tau2-bench Telecom* | 95.6% | 96.7% |
| Tau2-bench Retail | 77.9% | 81.1% |
| BrowseComp Long Context 128k | 90.0% | 90.0% |
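
To make the size of each change easier to see, the differences can be computed directly from the table above. The following is a small Python sketch; the scores are copied from the table, and the names `SCORES` and `point_deltas` are purely illustrative.

```python
# Scores copied from the table above: (GPT-5.1 high, GPT-5 high), in percent.
SCORES = {
    "SWE-bench Verified": (76.3, 72.8),
    "GPQA Diamond": (88.1, 85.7),
    "AIME 2025": (94.0, 94.6),
    "FrontierMath": (26.7, 26.3),
    "MMMU": (85.4, 84.2),
    "Tau2-bench Airline": (67.0, 62.6),
    "Tau2-bench Telecom": (95.6, 96.7),
    "Tau2-bench Retail": (77.9, 81.1),
    "BrowseComp Long Context 128k": (90.0, 90.0),
}


def point_deltas(scores):
    """Return the change in percentage points (new minus old) per benchmark."""
    return {name: round(new - old, 1) for name, (new, old) in scores.items()}


if __name__ == "__main__":
    # Print benchmarks from largest gain to largest drop.
    ranked = sorted(point_deltas(SCORES).items(), key=lambda kv: kv[1], reverse=True)
    for name, delta in ranked:
        print(f"{name:30s} {delta:+.1f}")
```

Run as is, this ranks Tau2-bench Airline (+4.4) and SWE-bench Verified (+3.5) as the largest gains and Tau2-bench Retail (-3.2) as the largest drop.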

GPT-5.1 also introduces a "No Reasoning" mode, which skips deep reasoning to generate much faster responses. OpenAI says this setting outperforms GPT-5 with "minimal" reasoning, especially when using tools, running code, or searching the web.
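
As a rough illustration of how such a setting might be selected per request, the sketch below builds a request body as a plain dictionary without sending anything. It assumes the mode is chosen through a reasoning-effort field set to `none`; that field, its values, the model identifier `gpt-5.1`, and the helper `build_request` are assumptions that this article does not confirm, so check OpenAI's API reference before relying on them.

```python
import json


def build_request(prompt, fast=True):
    """Build a hypothetical request body; nothing is sent over the network."""
    # Assumption: reasoning depth is controlled by an "effort" field, and
    # "none" selects the no-reasoning mode. Verify against the API reference.
    effort = "none" if fast else "high"
    return {
        "model": "gpt-5.1",  # assumed model identifier
        "input": prompt,
        "reasoning": {"effort": effort},
    }


if __name__ == "__main__":
    print(json.dumps(build_request("Summarize this changelog."), indent=2))
```

Keeping the choice behind a single flag makes it easy to send latency-sensitive requests without reasoning while leaving harder tasks on a higher setting.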

A new "apply_patch" tool lets GPT-5.1 create, edit, or delete files in a codebase. A separate shell tool lets the model suggest command-line commands, which are then executed and checked locally. Both point to more automation in developer workflows. More details on the API model are available from OpenAI.
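
The "executed and checked locally" part of that loop can be pictured with a small sketch. The code below is not OpenAI's shell tool. It is a hypothetical local harness (the function `run_suggested_command` and the allowlist are assumptions for illustration, and it assumes a POSIX-style system) that runs a command proposed by a model only if its program is on an allowlist, then returns the exit code and output so they could be reported back.

```python
import shlex
import subprocess
import sys


def run_suggested_command(command, allowed_programs, timeout=10):
    """Run a model-suggested command locally if its program is allowlisted.

    Hypothetical harness for illustration; not part of any OpenAI SDK.
    """
    args = shlex.split(command)
    if not args or args[0] not in allowed_programs:
        return {"ran": False, "exit_code": None, "output": "command not allowed"}
    # No shell=True: arguments are passed directly, so shell metacharacters
    # in the suggestion are not interpreted.
    result = subprocess.run(args, capture_output=True, text=True, timeout=timeout)
    return {
        "ran": True,
        "exit_code": result.returncode,
        "output": result.stdout + result.stderr,
    }


if __name__ == "__main__":
    # The current Python interpreter stands in for a real developer tool.
    allowed = {sys.executable}
    suggestion = f'{shlex.quote(sys.executable)} -c "print(2 + 2)"'
    print(run_suggested_command(suggestion, allowed))
    print(run_suggested_command("rm -rf /tmp/example", allowed))
```

The allowlist is the design choice that matters here: the model only proposes commands, and the local side decides what actually runs.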

Warmer responses in ChatGPT raise concerns about safety and emotional attachment

GPT-5.1 is also available in ChatGPT. OpenAI says the model is better at following prompts and gives responses that feel warmer and more human. But this friendlier tone comes with new safety tradeoffs: according to OpenAI's latest safety evaluation, more empathetic replies might sometimes make the model less strict with sensitive topics.

The GPT-5.1-thinking model showed declines in handling issues like harassment, hate speech, violence, and sexual content, with scores dropping by up to seven percentage points. Both model variants also became less resistant to emotional dependency: the thinking model's score fell from 0.812 to 0.785, and the instant model's from 0.986 to 0.945.

Mental health now has its own assessment category, reflecting concerns about users seeing more in the chatbot than just a tool. GPT-5.1-thinking improved in this area (from 0.466 to 0.684), while GPT-5.1-instant slipped a bit (from 0.944 to 0.883). Online A/B tests showed mixed results, and OpenAI notes that these numbers aren't statistically strong. In the end, real-world experience will determine how these changes affect users.

| Category | GPT-5-thinking | GPT-5.1-thinking | GPT-5-instant (Aug 15) | GPT-5-instant (Oct 3) | GPT-5.1-instant |
| --- | --- | --- | --- | --- | --- |
| Emotional reliance* | 0.812 | 0.785 | 0.688 | 0.986 | 0.945 |

*Emotional reliance measures the model's ability to avoid fostering emotional dependency.

On the security front, GPT-5.1-instant now blocks jailbreak attempts more effectively, with its StrongReject score rising from 0.850 in October to 0.976. Still, as with other metrics, only real-world use will show how effective these changes really are.

Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Matthias is the co-founder and publisher of THE DECODER, exploring how AI is fundamentally changing the relationship between humans and computers.