The new S* framework helps AI language models generate more reliable, higher-quality code.
Researchers at the University of California, Berkeley have created a framework called S* that improves how AI language models generate code. The system combines two different approaches - parallel and sequential scaling - with a new way of selecting the best results.
While generating multiple code snippets at once and picking the best one (parallel scaling) isn't new, the Berkeley team added something extra: they combined it with sequential scaling, in which the system iteratively refines its solutions through systematic debugging.
The framework builds on test-time compute as one of its core building blocks, but with a twist: unlike current reasoning models such as OpenAI's o1, S* incorporates external feedback from code execution rather than relying solely on internal reasoning chains. This design makes it compatible with both traditional large language models (LLMs) and newer large reasoning models (LRMs).
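To make the idea concrete, here is a minimal sketch of such a hybrid loop in Python. The function and helper names (generate, revise, run_tests) are illustrative assumptions, not the paper's actual implementation.

```python
from typing import Callable, Tuple

# Illustrative sketch of S*-style hybrid scaling, not the authors' code.
# The three callables stand in for whatever model client and sandbox you use.

def hybrid_scaling(
    problem: str,
    generate: Callable[[str], str],                     # problem -> candidate program
    revise: Callable[[str, str, str], str],             # (problem, code, error log) -> revised program
    run_tests: Callable[[str, str], Tuple[bool, str]],  # (problem, code) -> (all passed?, error log)
    num_samples: int = 8,
    max_rounds: int = 3,
) -> list[str]:
    """Parallel scaling: draw several candidate programs.
    Sequential scaling: debug each candidate iteratively using execution feedback."""
    candidates = [generate(problem) for _ in range(num_samples)]

    refined = []
    for code in candidates:
        for _ in range(max_rounds):
            passed, error_log = run_tests(problem, code)
            if passed:
                break
            # External feedback: the concrete error log is fed back into the prompt
            # instead of relying solely on the model's internal reasoning chain.
            code = revise(problem, code, error_log)
        refined.append(code)
    return refined
```

The refined candidates then go into the selection step described in the next section.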
Using AI to evaluate code solutions
The second key innovation is what the team calls "adaptive input synthesis." In testing, the researchers used GPT-4o mini to generate test inputs for the different candidate solutions. By executing the candidates on these inputs and analyzing the actual outputs, the model can reliably identify the best solution.
The system asks the AI model to create test inputs specifically designed to spot differences between two programs. It uses carefully crafted prompts that tell the model to consider edge cases (like empty inputs or extreme values), generate complex but manageable test cases, and create inputs that could reveal potential errors.
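A rough sketch of how such a prompt might be assembled follows; the wording is illustrative and not the paper's exact prompt.

```python
# Illustrative prompt for synthesizing distinguishing test inputs.
# The actual prompts used in the S* paper are worded differently.

def build_distinguishing_prompt(problem: str, program_a: str, program_b: str) -> str:
    return f"""You are given a programming problem and two candidate solutions.

Problem:
{problem}

Program A:
{program_a}

Program B:
{program_b}

Generate test inputs that are likely to make the two programs produce different outputs.
- Consider edge cases such as empty inputs or extreme values.
- Prefer complex but still manageable test cases.
- Focus on inputs that could reveal potential errors.
Return one test input per line."""
```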
The system then runs both programs on these test inputs and feeds the results back to the AI model, which decides which solution works better based on the actual execution outcomes.
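Put together, the selection step could look roughly like the following sketch. It assumes candidate programs are plain Python scripts reading from stdin, and it relies on a hypothetical llm_judge helper that returns "A" or "B"; neither assumption comes from the paper itself.

```python
import subprocess

def run_program(code: str, test_input: str, timeout: float = 5.0) -> str:
    """Execute a candidate Python program with the given stdin and capture its stdout."""
    try:
        result = subprocess.run(
            ["python", "-c", code],
            input=test_input,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "<timed out>"
    return result.stdout.strip()

def select_better(problem: str, program_a: str, program_b: str,
                  test_inputs: list[str], llm_judge) -> str:
    """Run both candidates on the synthesized inputs and let the model decide
    which one behaves correctly, based on the observed outputs."""
    observations = []
    for test_input in test_inputs:
        out_a = run_program(program_a, test_input)
        out_b = run_program(program_b, test_input)
        observations.append(f"input: {test_input!r}\nA -> {out_a!r}\nB -> {out_b!r}")
    # llm_judge is an assumed helper that prompts the model with the problem,
    # both programs, and the observed outputs, and returns "A" or "B".
    verdict = llm_judge(problem, program_a, program_b, "\n\n".join(observations))
    return program_a if verdict.strip().upper().startswith("A") else program_b
```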
S* framework significantly improves performance of small models
The team tested S* with 12 different language models of varying sizes and types and found consistent improvements across the board. Qwen2.5-Coder-7B-Instruct with S* performed about 10 percent better than the much larger Qwen2.5-Coder-32B-Instruct without it. In some cases, smaller models using S* even outperformed larger reasoning models: GPT-4o mini with S* beat o1-preview. Even powerful reasoning models showed improvement when using the framework.
The framework does have some clear constraints. It's currently optimized only for programming competition tasks and hasn't been tested on more complex software engineering challenges. The team also focused exclusively on improving accuracy, setting aside questions of resource efficiency.
The approach of combining iterative refinement with search likely contributed to OpenAI's success on the ARC-AGI benchmark, where the company made multiple parallel queries to its o3 reasoning model and selected the best answers - though the exact method remains undisclosed. S* follows a similar philosophy and could lead to better code generation capabilities in the future.