LLMs crush coding and math but choke on casual questions, and that's not a contradiction
AI models can solve complex programming tasks in hours but fall apart when faced with basic everyday questions. Andrej Karpathy explains why that's not actually a contradiction.
There are two different ways people think about AI progress right now, according to Karpathy. The first group has tried the free version of ChatGPT or its voice mode and walked away with an opinion shaped by silly mistakes and hallucinations. Those outdated models don't reflect where things actually stand today, Karpathy says.
The second group uses the latest models—like OpenAI's GPT-5.4 Thinking or Claude Opus 4.6—inside capable harnesses like Codex or Claude Code for professional work in programming, math, and research. Progress in these areas has been massive this year, Karpathy says, with models now capable of autonomously restructuring entire codebases or hunting down security vulnerabilities on their own. Karpathy says these two groups are basically talking past each other.
It really is simultaneously the case that OpenAI's free and I think slightly orphaned (?) "Advanced Voice Mode" will fumble the dumbest questions in your Instagram's reels and *at the same time*, OpenAI's highest-tier and paid Codex model will go off for 1 hour to coherently restructure an entire code base, or find and exploit vulnerabilities in computer systems.
Karpathy via X
Karpathy's take points to something bigger. Areas like code and math, where you can clearly check whether an answer is right or wrong and reinforce it directly through reinforcement learning with verifiable rewards, are seeing larger and, above all, more measurable gains from AI progress than fuzzy domains like writing or consulting, where there's no clean metric to optimize against.
Why verifiability drives AI progress
This raises a core question in AI research right now: can general intelligence actually emerge from language models, or can these models only be tuned to perform well within specific domains?
Karpathy laid out this structural problem in an earlier essay: in the "Software 2.0" paradigm, what matters isn't whether you can specify a task, but whether you can verify the result. A system can only be trained efficiently through reinforcement learning when it gets automated feedback: pass/fail checks or clear reward signals. "The more a task/job is verifiable, the more amenable it is to automation in the new programming paradigm," Karpathy says.
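To make the idea concrete, here is a minimal sketch of what a "verifiable reward" can look like for a code-generation task. This is purely illustrative, not Karpathy's or OpenAI's actual setup; the task (generate a `square` function) and the function names are invented for the example. The point is that the reward is computed automatically by running tests, with no human judgment in the loop.

```python
def verifiable_reward(candidate_code: str, test_cases: list[tuple[int, int]]) -> float:
    """Return 1.0 if the model's generated code passes every test, else 0.0.

    Hypothetical task: the model must produce a function named `square`.
    The pass/fail check itself is the reward signal used for training.
    """
    namespace: dict = {}
    try:
        # Execute the model's output in an isolated namespace.
        exec(candidate_code, namespace)
        fn = namespace["square"]
        passed = all(fn(x) == expected for x, expected in test_cases)
    except Exception:
        # Crashes, syntax errors, or a missing function all count as failure.
        passed = False
    return 1.0 if passed else 0.0


tests = [(2, 4), (3, 9), (-1, 1)]
print(verifiable_reward("def square(x):\n    return x * x", tests))  # 1.0
print(verifiable_reward("def square(x):\n    return x + x", tests))  # 0.0
```

A domain like consulting has no equivalent of this function: there is no `exec`-and-check step that can score an answer automatically, which is exactly the asymmetry Karpathy describes.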
Last summer, rumors circulated about a universal verifier from OpenAI that would make reinforcement learning work across all domains. So far, nothing concrete has shipped. Meanwhile, Jerry Tworek, one of the key figures behind OpenAI's reinforcement learning strategy, recently left the company and said that "deep learning research is done."