OpenAI's 2017 research paper "World of Bits" ended with a clear-eyed assessment: "We showed that while standard supervised and reinforcement learning techniques can be applied to achieve adequate results across these environments, the gap between agents and humans remains large, and welcomes additional modeling advances."
That paper outlined a long-term vision for the company, one that's now inching closer to reality with the new ChatGPT agent. Casey Chu, a member of the development team, confirmed in a recent interview that this goal never faded: "This project has a very long lineage, dating back to around 2017. Our codename is 'World of Bits 2' for the computer use part." The lineage stretches back even further: in 2016, OpenAI released a blog post about the related training environment Universe.
But the way OpenAI tries to close that "large gap" has fundamentally changed. The biggest shift is the starting point: instead of beginning from scratch, the new agent is built on top of a massive foundation model pretrained without supervision. That baseline competence is now a prerequisite for everything that follows. "Before we apply Reinforcement Learning, the model must be good enough to achieve a basic completion of tasks," says Issa Fulford.
According to OpenAI, reinforcement learning is very data-efficient
OpenAI now relies on reinforcement learning (RL) for crucial fine-tuning, calling the process extremely data-efficient: "The scale of the data is minuscule compared to the scale of pre-training data. We are able to teach the model new capabilities by curating these much smaller, high-quality datasets," Fulford explains. These datasets are made up of dynamic collections of difficult, targeted tasks. The team starts by defining what they want the agent to accomplish, then designs training scenarios accordingly. "We work backwards from the use cases we want to solve to train the model and build the product," Fulford adds.
When it comes to hands-on training, the agent faces these tasks and has to figure out solutions without being told how. As Chu puts it, "We essentially give the model all these tools, lock it in a room, and it experiments. We don't tell it when to use what tool, it figures that out by itself." The mechanism driving this experimental learning is simple but effective: a reward system based on the outcome. Edward Sun explains: "As long as you can grade the task—judge whether the model's performance on the result was good or not—you can reliably train the model to become even better at it."
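The loop Chu and Sun describe can be pictured as a minimal outcome-graded policy update. The sketch below is purely illustrative and is not OpenAI's actual training stack: a toy agent picks tools freely, only the final result is graded, and that single score nudges future tool choices. All names here (TOOLS, TARGET, grade_outcome) are hypothetical.

```python
import math
import random

# Toy sketch of outcome-based reward training (illustrative only, not OpenAI's code).
# The "agent" chooses a tool at each step, nobody tells it which one to use,
# and the only feedback is a single grade on the final result (REINFORCE-style).

TOOLS = ["browser", "terminal", "editor"]
TARGET = ["browser", "editor"]  # hypothetical task: the grader accepts exactly this tool sequence

# One preference score per (step, tool); softmax turns scores into choice probabilities.
prefs = [{t: 0.0 for t in TOOLS} for _ in range(len(TARGET))]

def softmax(scores):
    m = max(scores.values())
    exps = {t: math.exp(s - m) for t, s in scores.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}

def rollout():
    """The agent 'locked in a room': it samples a tool at every step with no hints."""
    traj = []
    for step_prefs in prefs:
        probs = softmax(step_prefs)
        traj.append(random.choices(TOOLS, weights=[probs[t] for t in TOOLS])[0])
    return traj

def grade_outcome(trajectory):
    """The grader judges only the end result, never the individual clicks or keystrokes."""
    return 1.0 if trajectory == TARGET else 0.0

LEARNING_RATE = 0.5
for _ in range(2000):
    traj = rollout()
    reward = grade_outcome(traj)  # one scalar per episode
    for step, chosen in enumerate(traj):
        probs = softmax(prefs[step])
        # Push probability toward the choices that led to a good grade.
        for t in TOOLS:
            prefs[step][t] += LEARNING_RATE * reward * ((1.0 if t == chosen else 0.0) - probs[t])

# After training, the most-preferred tool per step matches what the grader rewards.
print([max(p, key=p.get) for p in prefs])
```

The appeal of this setup, as Sun points out, is that only the grading step has to be defined per task: any task whose result can be checked automatically can, in principle, be trained this way.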
Massive scaling of computing power
This approach, where only the final result needs to be evaluated, is far more efficient than collecting thousands of human demonstrations for every mouse click and keystroke. It lets OpenAI train agents across hundreds of thousands of virtual machines at once, allowing them to independently discover the best solutions to complex problems.
The "further advances" called for in the 2017 paper didn't come from a new algorithm, but from scaling up on every level. "Essentially, the scale of the training has changed," Chu says. " I don't know the exact multiplier, but it must be something like 100,000x in terms of compute."
For now, OpenAI says the agent still shouldn't be used for critical tasks.