
A new study from Google DeepMind and several US universities shows that most benchmarks for AI-generated code don't really match what developers value.


Instead of only checking whether code works, the new "Vibe Checker" system also measures how well code follows detailed instructions. The researchers found that combining both functional correctness and instruction following produces results that align much more closely with human preferences.

The main issue is that widely used benchmarks focus on pass@k metrics, which only check whether at least one of k generated solutions passes the unit tests. This overlooks the many non-functional requirements developers care about, such as style, documentation, and error handling.
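For context, the sketch below shows the standard unbiased pass@k estimator from the Codex paper; it is a generic illustration of what these benchmarks measure, not code from the study, and it says nothing about style, documentation, or error handling.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of
    k samples is correct, given n generations of which c pass the unit
    tests (Chen et al., 2021)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 3 of 10 generated solutions pass the unit tests.
print(pass_at_k(n=10, c=3, k=1))  # 0.3
print(pass_at_k(n=10, c=3, k=5))  # ~0.92
```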

This disconnect is clear in environments like Copilot Arena, where programmers compare different AI models. There, benchmark rankings often show little or even negative correlation with what human evaluators actually prefer.


VeriCode: Defining real-world code quality

To address this gap, the researchers created VeriCode, a taxonomy of 30 verifiable code instructions organized into five categories: Coding Style & Conventions, Logic & Code Patterns, Documentation & Commenting, Error Handling & Exception Management, and Library & API Constraints.

Table with five examples from the VeriCode taxonomy, showing the category, prompt, linter rule, and parameters for each instruction.
Each instruction is linked to a linter check, with options to customize things like line length, branch limits, or docstring style. | Image: Zhong et al.
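Based on the table above, a single taxonomy entry could be represented roughly as follows; the field names and the example rule are illustrative assumptions, not the released VeriCode schema.

```python
from dataclasses import dataclass, field

@dataclass
class VeriCodeInstruction:
    category: str   # one of the five VeriCode categories
    prompt: str     # natural-language instruction given to the model
    ruff_rule: str  # Ruff rule code that acts as the verifier
    params: dict = field(default_factory=dict)  # tunable parameters

# Hypothetical entry for a line-length instruction.
line_length = VeriCodeInstruction(
    category="Coding Style & Conventions",
    prompt="Keep every line at most 79 characters long.",
    ruff_rule="E501",
    params={"line-length": 79},
)
```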

VeriCode is built from over 800 rules in the Python linter Ruff, filtered down to the most relevant and challenging ones. Each instruction is paired with a deterministic verifier that gives a simple pass/fail result.

A key strength of VeriCode is its flexibility, the researchers say. By adjusting parameters like line length or maximum function branches, hundreds of different variants can be generated from the 30 basic rules.
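A minimal sketch of how such a deterministic verifier could work, assuming Ruff is installed and reusing the hypothetical rule and line-length parameter from above; the released VeriCode verifiers may be implemented differently.

```python
import subprocess
import tempfile
from pathlib import Path

def verify_with_ruff(code: str, rule: str, line_length: int = 79) -> bool:
    """Run Ruff with a single selected rule and return a pass/fail
    verdict for the generated code."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "candidate.py"
        path.write_text(code)
        result = subprocess.run(
            ["ruff", "check", "--select", rule,
             "--line-length", str(line_length), "--no-cache", str(path)],
            capture_output=True, text=True,
        )
    # Ruff exits with code 0 when no violations of the selected rule remain.
    return result.returncode == 0

# Variants of the same base rule, produced by changing one parameter.
for limit in (79, 100, 120):
    verify_with_ruff("x = 1\n", rule="E501", line_length=limit)
```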

Vibe Checker: Expanding benchmark coverage

Using VeriCode, the team developed the Vibe Checker testbed. It expands BigCodeBench to BigVibeBench (1,140 real-world programming tasks) and LiveCodeBench to LiveVibeBench (1,055 algorithmic tasks).

For each task, an LLM-based selector chooses relevant, non-conflicting instructions. The evaluation includes two modes: single-turn generation (all instructions at once) and multi-turn editing (instructions added in stages).
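In pseudocode terms, the two protocols could look roughly like this; the generate and edit calls stand in for whatever model interface is used and are not the released Vibe Checker code.

```python
def single_turn(model, task: str, instructions: list[str]) -> str:
    # All instructions are attached to the task prompt in one request.
    prompt = task + "\n" + "\n".join(instructions)
    return model.generate(prompt)

def multi_turn(model, task: str, instructions: list[str]) -> str:
    # Instructions are introduced one at a time as follow-up edits.
    code = model.generate(task)
    for instruction in instructions:
        code = model.edit(code, instruction)
    return code

def evaluate(code: str, unit_tests, verifiers) -> tuple[bool, float]:
    functional = unit_tests(code)                  # pass@1-style check
    followed = sum(v(code) for v in verifiers)     # deterministic verifiers
    return functional, followed / len(verifiers)   # instruction-following rate
```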

Diagram of the two evaluation protocols for AI code generation: single-turn with all instructions in one prompt on the left, multi-turn with instructions introduced step by step on the right.
Both methods test for functional correctness and instruction following. | Image: Zhong et al.

The researchers tested 31 leading large language models from 10 model families. Although the added instructions are non-functional and should not change what the code has to do, the pass@1 rate drops for every model: with five instructions, average pass@1 falls by 5.85 percent on BigVibeBench and 6.61 percent on LiveVibeBench.

Following several instructions at once remains challenging even for advanced models: with five instructions, the top performers reach success rates of only 46.75 and 40.95 percent, and most models drop below 50 percent once three or more instructions are in play.

Table of instruction-following scores for various AI models with one to five instructions, showing single-turn and multi-turn results for BigVibeBench and LiveVibeBench.
Leading AI models struggle to follow multiple instructions simultaneously. The table shows IF scores across both benchmarks. Light red indicates below 50 percent, dark red below 30 percent. | Image: Zhong et al.

Single-turn generation better preserves code functionality, while multi-turn editing leads to somewhat higher rates of instruction adherence. The researchers also observed a "lost-in-the-middle" effect: models are less likely to follow instructions that appear in the middle of the context.

To see how these metrics compare with human preferences, the team matched scores against more than 800,000 human ratings from LMArena. The combination of functional correctness and instruction following was a much stronger predictor of human choice than either measure alone.


What matters most depends on the context: for everyday programming, instruction following is the main differentiator among advanced models. For competitive algorithmic problems, functional correctness is more important.
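One simple way to picture such a combined signal is a weighted blend of the two scores, with the weight shifted by context; the formula and numbers below are illustrative assumptions, not the model the researchers fit to the LMArena data.

```python
def preference_score(functional: float, instruction_following: float,
                     alpha: float = 0.5) -> float:
    """Blend functional correctness and instruction following into one
    preference signal; alpha controls how much correctness counts."""
    return alpha * functional + (1 - alpha) * instruction_following

# Illustrative weighting: everyday tasks lean on instruction following,
# competitive algorithmic tasks lean on functional correctness.
everyday = preference_score(0.72, 0.41, alpha=0.3)
algorithmic = preference_score(0.58, 0.45, alpha=0.7)
```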

Implications for AI training and software development

The study highlights that instruction following is a crucial but often overlooked part of code evaluation. Factoring in these non-functional requirements offers a clearer picture of what works in practice.

This has direct consequences for model training. Currently, pass@k is the primary reward in RLVR (Reinforcement Learning with Verifiable Rewards), which narrows the definition of code quality. VeriCode can provide a scalable and verifiable way to broaden what AI models learn.
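As a rough sketch, a VeriCode-style reward for RLVR could add the fraction of satisfied instructions on top of the usual test-based signal; the reward shape and weight below are assumptions, not the paper's training recipe.

```python
def rlvr_reward(code: str, unit_tests, verifiers, beta: float = 0.5) -> float:
    """Combine a binary functional reward with the share of verifiable
    instructions the generated code satisfies."""
    functional = 1.0 if unit_tests(code) else 0.0
    instruction = sum(v(code) for v in verifiers) / max(len(verifiers), 1)
    return functional + beta * instruction
```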

The VeriCode taxonomy and its verifiers will be released publicly, and the approach can extend to programming languages beyond Python.

Recent research shows the growing, but complex, role of AI in software development. A Google Cloud survey finds that developers now use AI tools for hours every day. The Stack Overflow Developer Survey reveals a "trust paradox": as AI use increases, confidence in the accuracy of generated code declines. A METR study adds to this concern, showing that experienced open-source developers actually took longer to finish tasks with AI assistance, even though they felt like they were moving faster.

Summary
  • A new study from US universities and Google DeepMind finds that widely used tests for AI-generated code miss key qualities like style, documentation, and error handling—details that matter in real-world programming.
  • The research team introduces new tools, the VeriCode taxonomy and the Vibe Checker testbed, to measure these overlooked aspects. Their approach matches human preferences much better than previous benchmarks.
  • After reviewing 31 top AI models, the study shows that even the best systems have trouble following several instructions at once, underlining how important instruction following is for producing quality code.