Summary

The new S* framework helps AI language models generate more accurate and reliable code.

Researchers at the University of California, Berkeley have created a framework called S* that improves how AI language models generate code. The system combines two different approaches - parallel and sequential scaling - with a new way of selecting the best results.

While generating multiple code snippets at once and picking the best one (parallel scaling) isn't new, the Berkeley team added something extra. They combined it with sequential scaling, where the system continuously improves its solutions through systematic debugging.

The framework builds on a variation of test-time compute scaling. Unlike current reasoning models such as OpenAI o1, S* incorporates external feedback from code execution rather than relying solely on internal reasoning chains. This design makes it compatible with both traditional large language models (LLMs) and newer large reasoning models (LRMs).
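To make the idea concrete, here is a minimal sketch of how parallel sampling and execution-guided sequential debugging could fit together. The llm.generate() interface, the execute() sandbox runner, and the public-test format are illustrative assumptions, not the authors' actual implementation.

```python
import subprocess

def execute(program: str, test_input: str, timeout: float = 5.0) -> str:
    """Run a candidate program in a fresh interpreter, feeding test_input on stdin (assumed runner)."""
    result = subprocess.run(
        ["python", "-c", program],
        input=test_input, capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout.strip()

def run_tests(program: str, tests: list[tuple[str, str]]) -> list[str]:
    """Collect failure descriptions for every public test the program gets wrong."""
    errors = []
    for test_input, expected in tests:
        output = execute(program, test_input)
        if output != expected.strip():
            errors.append(f"input={test_input!r} expected={expected!r} got={output!r}")
    return errors

def generate_candidates(llm, problem: str, public_tests, n_parallel=8, max_rounds=3):
    """Parallel sampling of drafts, each refined sequentially with execution feedback."""
    candidates = []
    for _ in range(n_parallel):                      # parallel scaling: independent drafts
        program = llm.generate(f"Write a Python program that solves:\n{problem}")
        for _ in range(max_rounds):                  # sequential scaling: iterative repair
            errors = run_tests(program, public_tests)
            if not errors:
                break
            program = llm.generate(
                f"Problem:\n{problem}\n\nCurrent solution:\n{program}\n\n"
                "Failed tests:\n" + "\n".join(errors) + "\n\nReturn a corrected program."
            )
        candidates.append(program)
    return candidates
```

Each draft is sampled independently (parallel scaling) and then repaired a few times against failing public tests (sequential scaling); the resulting candidates are ranked afterwards by the selection step described below.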

Using AI to evaluate code solutions

The second key innovation is what the team calls "adaptive input synthesis." In testing, they used GPT-4o mini to generate test inputs for different potential solutions. By running these inputs and analyzing the actual results, the AI can reliably identify the best solution.

The system asks the AI model to create test inputs specifically designed to spot differences between two programs. It uses carefully crafted prompts that tell the model to consider edge cases (like empty inputs or extreme values), generate complex but manageable test cases, and create inputs that could reveal potential errors.

The system then runs both programs using these test inputs and shows the results back to the AI model, which decides which solution works better based on real test outcomes.
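Below is a rough sketch of that pairwise selection step, reusing the hypothetical llm.generate() interface and the execute() runner from the sketch above; the prompts are paraphrased for illustration and are not the exact prompts from the paper.

```python
def synthesize_distinguishing_inputs(llm, problem: str, program_a: str, program_b: str) -> list[str]:
    """Ask the model for stdin inputs likely to make the two candidates disagree."""
    response = llm.generate(
        f"Problem:\n{problem}\n\nProgram A:\n{program_a}\n\nProgram B:\n{program_b}\n\n"
        "Propose a few test inputs, one per line, that could expose differences "
        "between these programs. Cover edge cases such as empty or extreme values."
    )
    return [line for line in response.splitlines() if line.strip()]

def pick_better(llm, problem: str, program_a: str, program_b: str) -> str:
    """Run both candidates on the synthesized inputs and let the model judge the real outputs."""
    test_inputs = synthesize_distinguishing_inputs(llm, problem, program_a, program_b)
    comparison = []
    for test_input in test_inputs:
        out_a = execute(program_a, test_input)   # execute() is the assumed runner from the earlier sketch
        out_b = execute(program_b, test_input)
        comparison.append(f"input={test_input!r}\nA -> {out_a!r}\nB -> {out_b!r}")
    verdict = llm.generate(
        f"Problem:\n{problem}\n\nExecution results:\n" + "\n\n".join(comparison) +
        "\n\nWhich program is more likely correct? Answer with 'A' or 'B'."
    )
    return program_a if verdict.strip().upper().startswith("A") else program_b
```

Given a pool of candidates from the generation step, the best one can then be chosen by comparing them pairwise, for example in a simple knockout tournament over the pool.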

S* framework significantly improves performance of small models

The team tested S* with 12 language models of varying sizes and types and found consistent improvements across the board. Qwen2.5-7B-Coder-Instruct with S* performed about 10 percent better than the larger Qwen2.5-32B-Coder-Instruct without it, and in some cases smaller models using S* outperformed larger reasoning models: GPT-4o mini with S* beat o1-Preview. Even powerful reasoning models showed improvement when using the framework.

The framework does have some clear constraints. It's currently optimized only for programming competition tasks and hasn't been tested on more complex software engineering challenges. The team also focused exclusively on improving accuracy, setting aside questions of resource efficiency.

Recommendation

The approach of combining iterative improvements with search capabilities likely contributed to OpenAI's success in the ARC benchmark, where they made multiple parallel queries to their o3 reasoning model and selected the best answers - though the exact method remains unknown. S* follows a similar philosophy and could lead to better code generation capabilities in the future.

Summary
  • Researchers at the University of California, Berkeley have developed the S* framework, which improves the performance of AI language models in code generation by combining parallel and sequential scaling approaches and introducing a novel selection mechanism.
  • S* uses a language model to generate test inputs specifically designed to reveal differences between candidate program solutions. The actual results of these tests are then used to select the best solution.
  • In the evaluation on 12 different language models, S* showed consistent performance improvements. Small models with S* were even able to outperform large reasoning models without S*. However, the framework has so far only been optimized for programming competition tasks and has not been tested for more complex software engineering tasks.