Mind's Eye: How physics data improves large language models

Google combines language models with a physics simulator. The hybrid AI system achieves new best scores on a physical reasoning benchmark.

Large language models like OpenAI's GPT-3 or Google's PaLM cannot reason reliably. This limitation is central to the debate about the role of deep learning on the way to more general forms of artificial intelligence.

Methods like chain-of-thought prompting, more training data and larger models like PaLM have led to better benchmark results, but not to a fundamental breakthrough.

Researchers are therefore experimenting with hybrid approaches that use the language capabilities of AI models to issue queries to specialized, external libraries or systems.

OpenAI's WebGPT was an early example of this. More recently, AI researcher Sergey Karayev has connected GPT-3 to a Python interpreter that can perform exact mathematical calculations or make API requests.

Google links PaLM with MuJoCo simulator

Researchers at Google are now going a step further by linking large language models to a physics simulator. Because current language models are trained exclusively on written text, they lack the grounded experience that humans have of the real world, the team writes.

The resulting inability to relate language to the physical world leads to knowledge being misrepresented, which in turn leads to obvious errors in reasoning.

The UTOPIA benchmark includes questions for language models that require physical reasoning to answer. | Image: Google

To ground language models in the physical world, the researchers propose to send text prompts from a language model to a MuJoCo physics simulation and then use the results as part of the input to the language model. The team calls this paradigm "Mind's Eye" and is testing it with the UTOPIA physical reasoning benchmark developed for this purpose.

Mind's Eye: text to code to simulation

Mind's Eye consists of three modules: a language model, a text-to-code model and the physics simulator. The language model passes questions whose answers require physical reasoning to the text-to-code model, which was trained on 200,000 text-code pairs in the style of the UTOPIA benchmark.

The generated code is passed to MuJoCo, where it is executed. The simulation result is then converted into text. Finally, this text is inserted into the language model's input, where it serves as context for the model's final response.
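
The flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the researchers' implementation: a tiny free-fall integrator stands in for MuJoCo, the text-to-code step is a fixed placeholder rather than a trained model, and all names (`DropSpec`, `text_to_code`, `build_grounded_prompt`, the "Hints:" prompt layout) are hypothetical.

```python
from dataclasses import dataclass

G = 9.81  # gravitational acceleration in m/s^2


@dataclass
class DropSpec:
    """Hypothetical stand-in for the simulation code a text-to-code model emits."""
    name: str
    mass_kg: float
    height_m: float


def text_to_code(question: str) -> list[DropSpec]:
    # Placeholder: a real system would generate simulator code from the question.
    # Here a fixed scene with two balls of different mass is returned.
    return [DropSpec("light ball", 0.1, 10.0), DropSpec("heavy ball", 5.0, 10.0)]


def simulate_fall_time(spec: DropSpec, dt: float = 1e-4) -> float:
    """Integrate free fall without air resistance (semi-implicit Euler).

    The mass is deliberately unused: without drag it does not affect the fall.
    """
    height, velocity, elapsed = spec.height_m, 0.0, 0.0
    while height > 0.0:
        velocity += G * dt
        height -= velocity * dt
        elapsed += dt
    return elapsed


def result_to_text(specs: list[DropSpec], times: list[float],
                   tolerance: float = 1e-3) -> str:
    """Turn numeric simulation results into sentences a language model can read."""
    lines = [f"The {s.name} ({s.mass_kg} kg) lands after {t:.2f} s."
             for s, t in zip(specs, times)]
    if max(times) - min(times) < tolerance:
        lines.append("All objects land at the same time.")
    else:
        first = specs[times.index(min(times))]
        lines.append(f"The {first.name} lands first.")
    return " ".join(lines)


def build_grounded_prompt(question: str) -> str:
    """Text -> code -> simulation -> text -> prompt for the language model."""
    specs = text_to_code(question)
    times = [simulate_fall_time(s) for s in specs]
    evidence = result_to_text(specs, times)
    return f"Hints: {evidence}\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    print(build_grounded_prompt(
        "Two balls of different mass are dropped from the same height "
        "in a vacuum. Which one lands first?"
    ))
```

The resulting prompt would then be sent to the language model, which answers with the simulation evidence in its context instead of relying only on what it memorized during training.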

Google relies on DeepMind's MuJoCo simulator for a physical grounding of language models. | Image: Google

Using this method, the Google researchers achieve massive leaps in performance: with Mind's Eye, Google's PaLM 540B reaches 92.5 percent instead of 39.4 percent in UTOPIA, and OpenAI's InstructGPT reaches 99.1 percent instead of 68.6 percent. On average, the accuracy of the tested language models increases by 27.9 percent in the zero-shot and by 46 percent in the few-shot scenario.
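
To put the two individual results in perspective, the absolute gains can be computed directly from the scores quoted above. The short sketch below uses only those four numbers; the helper name `absolute_gain` is chosen here for illustration.

```python
def absolute_gain(baseline: float, with_minds_eye: float) -> float:
    """Accuracy improvement in percentage points, rounded to one decimal."""
    return round(with_minds_eye - baseline, 1)


# UTOPIA accuracy in percent: (without Mind's Eye, with Mind's Eye)
scores = {
    "PaLM 540B": (39.4, 92.5),
    "InstructGPT": (68.6, 99.1),
}

for model, (baseline, grounded) in scores.items():
    print(f"{model}: +{absolute_gain(baseline, grounded)} percentage points")
```

For PaLM 540B this is a gain of 53.1 percentage points, for InstructGPT 30.5 percentage points.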

Mind's Eye leads to significantly better results in the UTOPIA benchmark than other methods. | Image: Google

Google's Mind's Eye expands the possibilities of language models

Besides being scalable and grounding its reasoning in simulation, Mind's Eye is also very efficient: since the method delegates domain-specific knowledge to external expert modules (in this case MuJoCo), this knowledge is decoupled from the reasoning capability.

The size of the required language model can thus be significantly reduced, since the network does not need to memorize domain-specific knowledge. In practice, the team was able to show that smaller models with Mind's Eye match the performance of models 100 times larger in the UTOPIA benchmark.

"We conclude that Mind’s Eye is not only effective and scalable but also efficient, as it is able to boost the reasoning performance of small-scale LMs significantly, requiring neither handcrafted prompts nor costly fine-tuning."

Mind's Eye Paper

The team believes that the idea of relying on a simulation pipeline for reasoning can easily be extended to other areas, especially those where simulations already exist. Simulations of economic change or thermodynamics could be used, for example.

"The dynamic nature of Mind’s Eye where we generate grounding evidence unlocks the scaling potential of these models," the paper states.

In other words, the potential of large AI models is not yet exhausted, and techniques like Mind's Eye can unlock new capabilities without requiring fundamentally new AI technologies.
