A new study from Stanford University finds that AI agents can get much better at solving complex tasks simply by learning from their own successful experiences.
So far, building effective AI agents has often meant substantial manual effort: hand-tuned prompts, handpicked example sets, or specialized action spaces. These methods work, but they are time-consuming and hard to scale. The Stanford team proposes a much simpler alternative: let agents improve themselves by learning from what has worked in the past.
Their method builds on a ReAct architecture, where a language model creates a plan for each task, then observes, reasons, and acts. The difference is that at each step, the agent draws examples from a database filled not with hand-chosen samples but with successful trajectories from earlier tasks, all collected automatically by the system. In this context, a trajectory is the full sequence of steps an AI agent takes to solve a problem.
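In code terms, the core mechanism might look like the Python sketch below. The TrajectoryDB class, the word-overlap retrieval, and the prompt format are illustrative assumptions, not the paper's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    task: str
    steps: list[str]   # the recorded thought/action/observation sequence
    success: bool

@dataclass
class TrajectoryDB:
    trajectories: list = field(default_factory=list)

    def add(self, traj: Trajectory) -> None:
        # Only successful runs are kept as future in-context examples.
        if traj.success:
            self.trajectories.append(traj)

    def retrieve(self, task: str, k: int = 3) -> list:
        # Toy similarity: word overlap between task descriptions.
        # A real system would likely use embedding-based retrieval.
        words = set(task.split())
        ranked = sorted(self.trajectories,
                        key=lambda t: len(words & set(t.task.split())),
                        reverse=True)
        return ranked[:k]

def build_prompt(task: str, db: TrajectoryDB) -> str:
    # Retrieved trajectories are prepended as few-shot demonstrations
    # ahead of the agent's usual ReAct prompt for the new task.
    shots = "\n\n".join(
        f"Task: {t.task}\n" + "\n".join(t.steps) for t in db.retrieve(task)
    )
    return f"{shots}\n\nTask: {task}\nThought:"
```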
Self-generated examples are enough
Even a straightforward version of this approach, called Traj-Bootstrap, leads to a clear jump in success rates across three benchmarks. On ALFWorld, the success rate rises from 73% to 89%; Wordcraft goes from 55% to 64%, and InterCode-SQL from 75% to 79%.
This improvement comes from a positive feedback loop. Successful examples help with new tasks, which then produce more successful examples. The system learns from itself and keeps getting better, with no extra training data or model tuning required.
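A minimal sketch of that loop, reusing the TrajectoryDB above and assuming a solve() callable that runs the ReAct agent and reports whether it succeeded:

```python
def traj_bootstrap(tasks, db, solve):
    """Basic Traj-Bootstrap loop (illustrative).

    `solve` is an assumed callable that runs the agent on a task with the
    retrieved in-context examples and returns (trajectory, success).
    """
    for task in tasks:
        trajectory, success = solve(task, db.retrieve(task))
        if success:
            db.add(trajectory)   # success feeds back into the example pool
    return db
```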
Two ways to build a better database
Not every collected trajectory helps, and some can even make things worse. To fix this, the researchers developed two selection strategies.
DB-Selection runs several databases in parallel. Each time the databases double in size, the best-performing one is duplicated while the least effective is dropped. This evolutionary approach quickly boosts results, pushing the ALFWorld success rate up to 91%.
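One way to picture DB-Selection, again building on the sketches above; the doubling trigger and the evaluate() scoring function are assumptions for illustration, not the paper's exact procedure:

```python
import copy
import random

def db_selection(tasks, population, solve, evaluate):
    # `population` is a list of TrajectoryDB instances grown in parallel;
    # `evaluate` is assumed to return a success rate on held-out tasks.
    threshold = 2 * max(1, min(len(db.trajectories) for db in population))
    for task in tasks:
        db = random.choice(population)        # each database grows on its own
        trajectory, success = solve(task, db.retrieve(task))
        if success:
            db.add(trajectory)
        if min(len(db.trajectories) for db in population) >= threshold:
            scores = [evaluate(db) for db in population]
            best = copy.deepcopy(population[scores.index(max(scores))])
            population.pop(scores.index(min(scores)))   # drop the weakest
            population.append(best)                     # clone the strongest
            threshold *= 2                              # next doubling point
    return population
```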
Exemplar-Selection rates each trajectory by how often it helps solve new problems. This method works especially well for Wordcraft, raising success to 72%, and for InterCode-SQL, boosting it to 81%.
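Exemplar-Selection can be sketched as a simple utility score per stored trajectory; the help-rate metric and the top_k cutoff here are illustrative assumptions:

```python
from collections import defaultdict

def exemplar_selection(tasks, db, solve, top_k=50):
    used = defaultdict(int)     # how often a trajectory was shown as an example
    helped = defaultdict(int)   # how often such a run then succeeded
    for task in tasks:
        examples = db.retrieve(task)
        _, success = solve(task, examples)
        for ex in examples:
            used[id(ex)] += 1
            if success:
                helped[id(ex)] += 1

    # Keep only the trajectories with the highest empirical help rate.
    def help_rate(t):
        return helped[id(t)] / used[id(t)] if used[id(t)] else 0.0

    db.trajectories = sorted(db.trajectories, key=help_rate, reverse=True)[:top_k]
    return db
```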
A bit of human input remains useful. The system performs better if the initial database includes a few handpicked examples to get the agent started in the right direction. Without these, performance drops, according to the team.
GPT-4o-mini with Traj-Bootstrap beats GPT-4o
On ALFWorld, Traj-Bootstrap with the smaller GPT-4o-mini actually outperforms the larger GPT-4o by a percentage point. Using DB-Selection, the system matches the performance of more complex, hierarchical setups that rely on manually defined observation and action spaces.
The method is also efficient compared to strategies where an agent gets multiple guesses per task. An agent using Traj-Bootstrap matches in a single attempt the performance the baseline system only reaches after three or four tries.
The study suggests that model size matters less than data quality. Instead of constantly building new models or optimizing prompts, it is often enough to collect good examples and select them wisely. This matches a trend seen in other areas of generative AI.