Google Labs demonstrated new capabilities of its multimodal Gemini 2.0 Flash model, showing how users can now interact with images through simple taps and text prompts. The combination of visual and language processing enables new applications.

In demonstration videos, Gemini 2.0 adjusts a scene's lighting after a user marks the image region containing a light switch and asks the model to operate it. The system interprets both the visual selection and the written instruction.

The demonstrations are particularly impressive for their consistency. In one example, when a user marks a car door handle, Gemini 2.0 generates an image of the car with an open door. In each case, the rest of the scene - the room's contents or the car's appearance - barely changes from the original image; only the requested modification is made.
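
For readers who want to experiment with this kind of point-and-prompt interaction, the sketch below shows roughly how it could look with Google's public google-generativeai Python SDK. The image file, the tapped pixel coordinates, and the prompt wording are illustrative assumptions, and unlike the Labs demos this call only returns a text description of the requested change rather than an edited image.

```python
# Rough sketch of a point-and-prompt request to Gemini 2.0 Flash.
# The image file, coordinates, and prompt text are illustrative assumptions.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # replace with a real API key
model = genai.GenerativeModel("gemini-2.0-flash")

image = Image.open("living_room.jpg")  # hypothetical example image
prompt = (
    "The user tapped the light switch at pixel (412, 305). "
    "Explain how the room's lighting would change if that switch were flipped, "
    "and which parts of the image would need to be modified to show it."
)

# generate_content accepts a mix of images and text as input parts.
response = model.generate_content([image, prompt])
print(response.text)
```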

This makes Gemini 2.0 an interesting tool for creatives - but also a promising system for other applications, as Google explains.

Visual planning for robot systems

Gemini 2.0's particular strength lies in visual cause-and-effect recognition. The system can visually plan and evaluate sequences of actions, which could be especially relevant for robotics. The combination of image understanding and language processing allows the system to comprehend and implement complex navigation instructions using an approach called "Visual Chain of Thought" (vCoT).
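
To make the idea more concrete, here is a minimal sketch of what a visual chain-of-thought style prompt could look like using the same public Python SDK. The scene image, the robot's goal, and the command vocabulary are assumptions for illustration; the sketch only elicits a reasoned plan as text and is not Google's actual robotics stack.

```python
# Minimal vCoT-style sketch: ask the model to describe the scene and reason
# about the effect of each candidate action before committing to a plan.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.0-flash")

scene = Image.open("office_hallway.jpg")  # hypothetical robot camera frame
prompt = (
    "You are planning for a mobile robot that sees this image. "
    "Goal: reach the door at the far end of the hallway without collisions.\n"
    "Step 1: list the objects you can see and roughly where they are.\n"
    "Step 2: for each candidate move, reason about how it would change the scene.\n"
    "Step 3: output a numbered sequence of commands such as "
    "'forward 2 m' or 'turn left 90 degrees'."
)

response = model.generate_content([scene, prompt])
print(response.text)
```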

Google's developers believe this capability for robust and fault-tolerant planning represents an important step in developing adaptive robot systems. The company has been working on using large language models and multimodal models in robotics for some time, for example with its Robotics Transformer models.

Last summer, a Google DeepMind team demonstrated how robots could navigate complex environments using Gemini 1.5 Pro's large context window. In a demonstration, the company showed robots navigating unknown environments solely through human instructions, video tutorials, and LLM reasoning - enabling them to return to locations they had been shown before.

Summary
  • Google Labs demonstrates new ways to interact with Gemini 2.0, which combines visual and language processing. Users can modify images - flipping a light switch or opening a car door, for example - by tapping a spot in the picture and entering a text prompt.
  • Gemini 2.0 recognizes cause-and-effect relationships and can visually plan sequences of actions, which makes it a promising tool for robotics, where it could help robots understand and carry out complex navigation instructions.
  • Google has been working on using large language models and multimodal models like Gemini in robotics for some time. Last summer, DeepMind demonstrated how robots can navigate unfamiliar environments using human instructions, videos, and LLM reasoning with Gemini 1.5 Pro.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.