
STEVE-1 is a generative AI model that can perform tasks in Minecraft using text instructions.

AI models that can respond to natural language instructions have become incredibly popular, but creating models that can follow instructions for complex sequential tasks remains a challenge. Researchers have now introduced STEVE-1, an AI assistant that can follow a wide range of short-horizon text and visual instructions in Minecraft.

STEVE-1 builds on two existing AI models: VPT, a foundation model pre-trained on 70,000 hours of Minecraft gameplay, and MineCLIP, which aligns text captions with Minecraft videos. Using an approach inspired by DALL-E 2's unCLIP method, the researchers first fine-tuned VPT to follow visual goals encoded by MineCLIP, and then trained a separate module to translate text prompts into MineCLIP visual embeddings.

Image: Lifshitz, Paster et al.

This two-step model allows STEVE-1 to follow both text and visual instructions in Minecraft with only $60 of computation and 2,000 labeled examples.
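
The two-step flow described above (text prompt, then a prior that produces a visual goal embedding, then a goal-conditioned policy) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `toy_text_encoder`, `toy_prior`, and `toy_policy` are hypothetical stand-ins, not MineCLIP, VPT, or any part of the STEVE-1 code, and the real components are learned neural networks rather than hand-written rules.

```python
import math
from typing import List, Sequence

Vector = List[float]


def normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def toy_text_encoder(prompt: str, dim: int = 8) -> Vector:
    """Hypothetical stand-in for a text encoder: buckets characters into a vector."""
    v = [0.0] * dim
    for i, ch in enumerate(prompt.lower()):
        v[(ord(ch) + i) % dim] += 1.0
    return normalize(v)


def toy_prior(text_embedding: Vector) -> Vector:
    """Hypothetical stand-in for the text-to-visual prior.

    The real prior is trained; this one just applies a fixed cyclic shift
    to show that text embeddings are mapped into a different space.
    """
    return normalize(text_embedding[1:] + text_embedding[:1])


def toy_policy(observation: Vector, goal_embedding: Vector) -> str:
    """Hypothetical stand-in for a goal-conditioned policy.

    Chooses an action from how well the current observation matches the goal.
    """
    score = sum(o * g for o, g in zip(observation, goal_embedding))
    return "attack" if score > 0.5 else "move_forward"


def act_on_text(prompt: str, observation: Vector) -> str:
    """Text -> visual goal embedding -> action, mirroring the two-step design."""
    goal = toy_prior(toy_text_encoder(prompt))
    return toy_policy(observation, goal)


goal = toy_prior(toy_text_encoder("chop a tree"))
action_near_goal = act_on_text("chop a tree", goal)
action_far_from_goal = act_on_text("chop a tree", [0.0] * 8)
```

The point of the split is that the policy only ever sees visual goal embeddings, so a visual instruction can skip the prior and be passed to the policy directly, while a text instruction goes through the prior first.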


STEVE-1 outclasses previous AI agents in Minecraft

In the researchers' tests, STEVE-1 significantly outperformed previous AI agents in Minecraft when given relevant instructions, gathering far more resources and exploring farther. When prompted with text or images, it can perform a variety of short-horizon tasks such as chopping trees, gathering resources, and exploring.

The researchers found that chaining prompts improved performance on longer-term tasks, such as crafting items or building structures, from near zero to a success rate of 50 to 70 percent. The team also shows STEVE-1 responding to human instructions in real time, demonstrating its potential as an interactive assistant.
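
Why chaining helps can be seen in a toy sketch. The world below is a hypothetical illustration and not the paper's evaluation setup: the prompt strings, the step budgets, and the `toy_step` rules are all assumptions. A single prompt for the final goal fails because its prerequisite is never met, while a chain of short-horizon prompts reaches it.

```python
from typing import Dict, List, Tuple


def toy_step(prompt: str, inventory: Dict[str, int]) -> None:
    """Hypothetical agent step: each prompt only triggers one short-horizon skill."""
    if prompt == "collect wood":
        inventory["log"] += 1
    elif prompt == "craft planks" and inventory["log"] > 0:
        # One log yields four planks, as in Minecraft.
        inventory["log"] -= 1
        inventory["planks"] += 4


def run_chain(chain: List[Tuple[str, int]]) -> Dict[str, int]:
    """Run each (prompt, step budget) pair in order and return the final inventory."""
    inventory = {"log": 0, "planks": 0}
    for prompt, steps in chain:
        for _ in range(steps):
            toy_step(prompt, inventory)
    return inventory


# A single prompt for the end goal makes no progress without logs.
single = run_chain([("craft planks", 5)])

# Chaining two short-horizon prompts satisfies the prerequisite first.
chained = run_chain([("collect wood", 3), ("craft planks", 2)])
```

In this toy world, `single` ends with no planks and `chained` ends with eight planks and one leftover log.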

STEVE-1 is a blueprint for "instructable agents in domains beyond Minecraft"

As with image generation, switching to a longer, more specific prompt dramatically improves STEVE-1's performance on long-horizon tasks. However, finding such prompts is unintuitive and time-consuming, and more work needs to be done, the paper states.

Because STEVE-1 works directly from raw pixel input and low-level mouse and keyboard actions, the approach could be applied more broadly to create instructable agents in domains beyond Minecraft, the team said. Future work will focus on improving STEVE-1's ability to handle longer, more complex instructions by incorporating large language models to help the agent plan and execute multistep tasks.

More information and the code are available on the STEVE-1 project page.

Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Summary
  • STEVE-1, a generative AI model, can perform tasks in Minecraft using text instructions, building on two existing AI models (VPT and MineCLIP) and outperforming previous AI agents for short-term tasks.
  • By chaining text prompts, STEVE-1 shows improved performance on longer-term tasks, and the researchers demonstrate its potential as an interactive assistant that responds in real time.
  • The approach used for STEVE-1 can be applied to create instructable agents beyond Minecraft, with future work focusing on improving the AI's ability to handle longer, more complex instructions.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.