Google Labs demonstrated new capabilities of its multimodal Gemini 2.0 Flash model, showing how users can now interact with images through simple taps and text prompts. The combination of visual and language processing opens up new applications, from image editing to visual planning.
In demonstration videos, Gemini 2.0 adjusts the lighting of a scene when the user marks a light switch in the image and instructs the model to operate it. The system interprets both the visual selection and the written instruction.
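In practical terms, this tap-and-prompt workflow boils down to a multimodal API call: an input image, a short text instruction about the marked region, and a request for the model to return an image rather than text. The sketch below is illustrative only - it assumes the google-genai Python SDK and an experimental model ID ("gemini-2.0-flash-exp") with native image output enabled via response_modalities, which at the time of these demos was limited to early testers; the file names and the prompt are invented for the example.

```python
# Illustrative sketch: send an image plus a tap-style instruction to Gemini
# and save the returned, edited image. Assumes the google-genai SDK and an
# experimental model with native image output enabled (early-tester access).
from io import BytesIO

from google import genai
from google.genai import types
from PIL import Image

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

with open("living_room.png", "rb") as f:       # hypothetical input image
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",              # assumed experimental model ID
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
        "The marked region is a light switch. Turn the lights on and "
        "return the edited image, changing nothing else in the scene.",
    ],
    # Ask for an image (not just text) in the response.
    config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
)

# Extract the first image part from the response and save it.
for part in response.candidates[0].content.parts:
    if part.inline_data is not None:
        Image.open(BytesIO(part.inline_data.data)).save("living_room_lit.png")
```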
One more thing.... @ameliovr and team have been experimenting with Gemini 2.0's native image out capabilities to navigate, modify, interact, and simulate inside images using tap & prompt. eg turning on the lights if you tap the lightswitch pic.twitter.com/qI1LN5iAU6
- Simon (@tokumin) December 12, 2024
The demonstrations are particularly impressive for their consistency. In one example, when a user marks a car door handle, Gemini 2.0 generates an image of the car with an open door. The rest of the scene - the room's contents in one demo, the car's appearance in the other - barely changes from the original image; only the requested modification is made.
Native image output with Gemini 2.0 Flash is what I've always wanted from an image model, the consistency when iterating is so magical. All without any complex orchestration or developer tooling. pic.twitter.com/VMNUW84gyo
- Logan Kilpatrick (@OfficialLoganK) December 15, 2024
This makes Gemini 2.0 an interesting tool for creatives - but also a promising system for other applications, as Google explains.
Visual planning for robot systems
Gemini 2.0's particular strength lies in visual cause-and-effect recognition. The system can visually plan and evaluate sequences of actions, which could be especially relevant for robotics. The combination of image understanding and language processing allows the system to comprehend and implement complex navigation instructions using an approach called "Visual Chain of Thought" (vCoT).
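Google has not published an interface for this planning capability, but the idea behind vCoT can be sketched as a loop: at each step, the current scene image and the goal go to the model, which proposes the next action and renders the expected result, and that result becomes the input for the following step. The code below is only a conceptual sketch under the same assumptions as above (google-genai SDK, experimental model with image output enabled); the function names, prompts, and file names are hypothetical.

```python
# Conceptual sketch of a visual chain-of-thought (vCoT) planning loop.
# Not an official API: it alternates between "describe the next action"
# and "render the resulting scene", feeding each result back in.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key
MODEL = "gemini-2.0-flash-exp"                 # assumed experimental model ID

def plan_step(image_bytes: bytes, goal: str) -> str:
    """Ask the model for the single next action toward the goal, given the scene."""
    response = client.models.generate_content(
        model=MODEL,
        contents=[
            types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
            f"Goal: {goal}. Describe the single next action to take and "
            "what the scene should look like afterwards.",
        ],
    )
    return response.text

def render_step(image_bytes: bytes, action: str) -> bytes:
    """Ask the model to render the scene after the proposed action (image output)."""
    response = client.models.generate_content(
        model=MODEL,
        contents=[
            types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
            f"Show the scene after this action: {action}",
        ],
        config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
    )
    for part in response.candidates[0].content.parts:
        if part.inline_data is not None:
            return part.inline_data.data
    raise RuntimeError("no image returned")

# Hypothetical usage: iterate plan -> render, feeding each rendered state back in.
with open("kitchen.png", "rb") as f:
    scene = f.read()
for _ in range(3):
    action = plan_step(scene, "put the kettle on the stove")
    scene = render_step(scene, action)
```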
Google's developers believe this capability for robust and fault-tolerant planning represents an important step toward adaptive robot systems. The company has been working on using large language models and multimodal models in robotics for some time, for example with its Robotics Transformer (RT) models.
Last summer, a Google DeepMind team demonstrated how robots can navigate complex environments using Gemini 1.5 Pro's large context window. The robots found their way through previously unknown spaces using only human instructions, video tours, and LLM reasoning, which also enabled them to return to locations they had been shown before.