Greg Brockman, co-founder of OpenAI, shared an image on X generated by GPT-4o that demonstrates the potential of the model's image generation capabilities.

The image appears photorealistic, and the handwritten text on the chalkboard is grammatically correct and coherent. Brockman does not reveal the prompt, but the chalkboard text was likely part of it.

AI image generator Ideogram proves that image models can render text accurately, though not yet with the complexity Brockman shows in the image. DALL-E 3 and Midjourney are barely capable of displaying text.

A man stands with his back to the camera in front of a green chalkboard, writing information about an AI model on it.
The image is photorealistic, and the OpenAI logo on the back of the T-shirt and the handwritten chalkboard text are rendered correctly. This goes beyond the capabilities of existing image models. | Image: Greg Brockman via X

GPT-4o can render images like this because it was trained as a multimodal model from the ground up. GPT-4 with DALL-E 3, by contrast, is a language model linked to an image model.

GPT-4o has a number of other multimodal capabilities. It can accept text, audio, images, and video as input, and produce text, audio, and images as output, in any combination. This allows the generation of visual stories, detailed and consistent character designs, creative typography, and even 3D renderings.

GPT-4o masters image and text generation in combination. | Image: OpenAI

Multimodal capabilities such as audio and images will be phased in over the coming months. The individual features are still undergoing red teaming and further safety testing. It is not yet known whether OpenAI will release the additional features under a separate brand, as with DALL-E, or simply as a feature of GPT-4o.

A little anecdote: OpenAI communicated the GPT-4o launch so poorly that many believed the new audio functionality was already available in ChatGPT, when in fact only the language model had been released. OpenAI CEO Sam Altman subsequently had to clear up this widespread misconception.

Because of the presentation, however, some users discovered the ChatGPT audio feature that had been available for months, mistook it for the new audio feature demonstrated by OpenAI, and posted enthusiastic demonstrations of the "next big AI thing" on social media. This is AI progress outpacing its influencers.

Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Summary
  • Greg Brockman, co-founder of OpenAI, shared a photorealistic image generated by GPT-4o that demonstrates the model's advanced image generation capabilities, particularly in rendering grammatically correct and meaningful text.
  • Built from the ground up to be multimodal, GPT-4o can accept any combination of text, audio, image, and video as input and generate any combination of text, audio, and image as output, enabling visual storytelling, detailed character design, creative typography, and 3D rendering.
  • OpenAI plans to gradually roll out the multimodal capabilities of GPT-4o, which is currently available for free as a text and code model in ChatGPT, in the coming months after further safety testing.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.