
Gemini's ground-up multimodal training enables direct image generation, potentially offering more precise outputs compared to traditional image models.


Google has enabled native image generation capabilities in its Gemini 2.0 Flash language model for developers. According to a company blog post, developers can now test this feature through Google AI Studio and the Gemini API across all supported regions. The integration process requires minimal code, with Google providing an experimental version of Gemini 2.0 Flash (gemini-2.0-flash-exp) for testing.
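
The snippet below is a minimal sketch of such an integration, assuming the google-genai Python SDK; the prompt, API key, and output handling are illustrative, and exact parameter names may differ between SDK versions.

    # Minimal sketch: requesting image output from gemini-2.0-flash-exp via the
    # google-genai Python SDK (pip install google-genai). The API key and prompt
    # are placeholders, and parameter names may vary between SDK versions.
    from google import genai
    from google.genai import types

    client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

    response = client.models.generate_content(
        model="gemini-2.0-flash-exp",
        contents="Generate an image of a lighthouse on a cliff at sunset",
        # Request both text and image parts in the response.
        config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
    )

    # The response can interleave text and inline image data; save the images.
    for i, part in enumerate(response.candidates[0].content.parts):
        if part.text:
            print(part.text)
        elif part.inline_data:
            with open(f"image_{i}.png", "wb") as f:
                f.write(part.inline_data.data)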

Built-in multimodal processing enhances accuracy

What sets Gemini's image generation apart is its multimodal foundation. As a large multimodal model (LMM), Gemini combines text understanding, enhanced reasoning, and multimodal input processing to generate more accurate images than traditional image generation models.

The model sometimes shows impressive accuracy in translating prompts to images. | Image: Gemini 2.0 Flash Exp by THE DECODER

Google says the model can create consistent visual narratives by combining text and images, maintaining character and setting consistency across multiple images. It also enables conversational image editing through multiple dialogue steps, making it particularly useful for iterative improvements while maintaining context throughout the conversation.
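
The following sketch illustrates this conversational flow using the same SDK's chat interface, under the same assumptions as above; the prompts and the save_images helper are hypothetical, not an official Google example.

    # Sketch of multi-step conversational image editing with a chat session.
    # Assumes the google-genai SDK; prompts and helper are illustrative.
    from google import genai
    from google.genai import types

    client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key
    chat = client.chats.create(
        model="gemini-2.0-flash-exp",
        config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
    )

    def save_images(response, prefix):
        # Write any inline image parts from this turn to disk.
        for i, part in enumerate(response.candidates[0].content.parts):
            if part.inline_data:
                with open(f"{prefix}_{i}.png", "wb") as f:
                    f.write(part.inline_data.data)

    # Turn 1: generate an initial image.
    save_images(chat.send_message("Draw a red bicycle against a brick wall"), "turn1")
    # Turn 2: refine it; the chat history keeps the scene consistent.
    save_images(chat.send_message("Same bicycle and wall, but at night"), "turn2")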


Video: via Oriol Vinyals

The model's built-in world knowledge helps create realistic and accurate images, though Google notes this knowledge, while extensive, isn't absolute. The system also excels at incorporating text into images, with internal benchmarks showing superior text integration compared to leading competitor models.

OpenAI rumored to launch multimodal image features in March

OpenAI already showed what's possible in this space with its GPT-4o model back in May 2024. Like Gemini, GPT-4o is built as a native multimodal AI system that can process text, audio, image, and video inputs while generating various outputs including text, audio, and images.

The company demonstrated a range of capabilities, from visual storytelling and detailed character designs to creative typography and realistic 3D rendering. While these features haven't been released to the public yet, industry sources suggest OpenAI will launch them in March 2025, timing that seems even more likely following Google's announcement. OpenAI employees have also hinted at upcoming image generation features.

Summary
  • Google has released an experimental version of the AI model Gemini 2.0 Flash for developers, which can understand text and generate images directly based on the input.
  • Gemini's unique capability lies in its ability to simultaneously process text, analyze content logically, and interpret images, enabling it to create more precise and realistic images compared to pure image models.
  • Google highlights Gemini's strengths in maintaining consistency across multiple images when depicting figures and scenes, as well as its ability to accurately incorporate text elements into the generated images.