OpenAI is updating its coding environment with GPT-5.1-Codex-Max, a new model designed to handle massive amounts of context and solve complex engineering tasks that take hours to complete.
OpenAI has released GPT-5.1-Codex-Max, its latest "agentic" coding model. The company says the system is built specifically for "long-running, detailed work" and replaces the previous GPT-5.1-Codex as the standard across all Codex interfaces.
On the SWE-Bench Verified coding benchmark, GPT-5.1-Codex-Max scores 77.9 percent, placing it ahead of Anthropic's models and Google's recently released Gemini 3. The model also improved from 66.3 to 79.9 percent on the internal SWE-Lancer IC SWE benchmark, according to the blog post.
| Benchmark | GPT-5.1-Codex (high) | GPT-5.1-Codex-Max (xhigh) |
|---|---|---|
| SWE-Bench Verified (n=500) | 73.7% | 77.9% |
| SWE-Lancer IC SWE | 66.3% | 79.9% |
| TerminalBench 2.0 | 52.8% | 58.1% |
OpenAI says the new model uses 30 percent fewer "thinking tokens" than its predecessor while maintaining the same quality. It also runs 27 to 42 percent faster on real-world tasks. For work where latency doesn't matter, a new Extra High (xhigh) reasoning mode allocates even more time to thinking.
According to OpenAI, GPT-5.1-Codex-Max is also the first model specifically trained to work effectively in Windows environments, a move meant to improve how it handles command-line tasks. The company claims that 95 percent of its engineers use Codex weekly and that it has seen a 70 percent increase in pull requests since the tool's introduction.
Access is now open for ChatGPT Plus, Pro, Team, Edu, and Enterprise users, with GPT-5.1-Codex-Max replacing GPT-5.1-Codex as the default, retiring its predecessor after just a few days. That short-lived predecessor cost $1.25 per million input tokens and $10 per million output tokens; OpenAI hasn't released pricing for the new model yet. API access is expected soon.
For ChatGPT Plus users, limits are set at 45 to 225 local messages and 10 to 60 cloud tasks every five hours. Pro users get significantly more capacity, with limits ranging from 300 to 1,500 local messages and 50 to 400 cloud tasks in the same period.
New technique allows for day-long coding sessions
OpenAI says the model can stay focused on a single assignment for "more than 24 hours" in internal tests, handling tasks like fixing test failures or iterating on implementations. While OpenAI didn't share details on these workloads, the claim mirrors recent statements from Anthropic about its Sonnet 4.5 model working for extended periods.
To manage these long sessions, OpenAI uses a process called "compaction." When the model fills its context window, it automatically compresses the session history. The system summarizes relevant information and discards unimportant details, allowing the AI to retain the core task and key steps over millions of tokens. GPT-5.1-Codex-Max is the first model natively trained to work this way across multiple context windows.
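The idea behind compaction can be sketched in a few lines: when accumulated session history approaches the context-window limit, older messages are collapsed into a summary while the most recent ones are kept verbatim. The sketch below is purely illustrative; OpenAI has not published its implementation, and every name, the word-count "tokenizer," and the placeholder summarizer here are assumptions.

```python
# Illustrative sketch of context "compaction" (hypothetical, not OpenAI's code):
# when history exceeds a token budget, older messages are replaced by a summary
# so the session can continue beyond a single context window.

CONTEXT_LIMIT = 40   # token budget (tiny, for demonstration only)
KEEP_RECENT = 2      # most recent messages kept verbatim

def count_tokens(msg: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(msg.split())

def summarize(messages: list[str]) -> str:
    # Placeholder summarizer; a real system would call a model here.
    return "SUMMARY(" + "; ".join(m[:12] for m in messages) + ")"

def compact(history: list[str]) -> list[str]:
    """Compress everything but the most recent messages into one summary entry."""
    if len(history) <= KEEP_RECENT:
        return history
    older, recent = history[:-KEEP_RECENT], history[-KEEP_RECENT:]
    return [summarize(older)] + recent

def append_message(history: list[str], msg: str) -> list[str]:
    """Add a message, compacting the history whenever it overflows the budget."""
    history = history + [msg]
    while sum(count_tokens(m) for m in history) > CONTEXT_LIMIT:
        compacted = compact(history)
        if compacted == history:  # nothing left to compress
            break
        history = compacted
    return history
```

The key design point mirrored here is that compaction is lossy by intent: detail in the older messages is discarded, but the summary preserves the task's through-line so the loop can keep running for millions of tokens.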
On the security front, OpenAI says this is its most capable cybersecurity model to date, though it still falls below the internal "High Capability" threshold. While the company plans to support defenders with tools like Aardvark, it warns developers to double-check the agent's work before deployment, as the model cannot replace human review.
According to OpenAI, reviewing the agent's work becomes "increasingly important" as Codex takes on longer tasks, largely because these systems still make mistakes. As the model produces more code, verifying and understanding it for future debugging becomes harder, a reality that could negate any efficiency gains. To help with the process, Codex produces terminal logs citing its tool calls and test results, and OpenAI emphasizes that Codex acts as an additional reviewer, not a replacement for human eyes.