While OpenAI's presentation of GPT-4o focused primarily on its speech and audio capabilities, the model's true distinction is its multimodal foundation.

GPT-4o accepts any combination of text, audio, images, and video as input, and produces any combination of text, audio, and images as output.

Following GPT-4o's unveiling, OpenAI demonstrated some of these capabilities in a blog post.

For example, the model can generate visual narratives. In one demo, a robot types a series of diary entries on a typewriter, and the model renders each entry as an image of the typed page. It can then take the scene further, showing the robot tearing up the page because it is unhappy with what it has written.

Image: OpenAI

The model can also create detailed and consistent character designs for movies or stories. One demo introduces the robot Geary, who is shown playing baseball, programming, riding a bike, and cooking.

Image: OpenAI

Another example is Sally, a postal worker who delivers a letter and then panics because she is being chased by a golden retriever, one of the most dangerous species on earth.

Image: OpenAI

GPT-4o can also generate various creative typographic styles. For example, it can output a poem directly as a designed sheet of paper or even develop new fonts.

Image: OpenAI

In another example, GPT-4o designs a commemorative coin, such as one marking the release of GPT-4 in 2023. Futuristic or Victorian typefaces are also supported.

Image: OpenAI

GPT-4o can even generate 3D renderings of objects such as the OpenAI logo or sculptures and display them from different angles.

Image: OpenAI

Finally, the demos suggest that the model understands causal relationships in images to some extent. The task is to visualize three stacked cubes bearing the letters "G", "P", and "T". Both the order of the letters (G, P, T) and the order of the colors (red, blue, green) must be correct.

This test has long been considered a benchmark for image generators' understanding of the world, and GPT-4o passes it in six of seven trials in this particular demonstration; the single failure shows a slight inconsistency in the color order.

Image: OpenAI

Other examples include placing logos on objects, transcribing and summarizing audio and video material, and transferring photos of people into new contexts such as movie posters or cartoon styles.

Image: OpenAI

The wealth of capabilities makes it clear that GPT-4o is a step toward an Omni model that understands and generates text, images, audio, and eventually video.


Prafulla Dhariwal, Head of Omni Development, describes GPT-4o as the "first natively fully multimodal model" and its introduction as a "huge, org-wide effort".

The new tokenization suggests that the model was developed from scratch. GPT-4o, which has now been released as a standalone model, could be a precursor of sorts to GPT-5. According to speculation, Omni has been in the works since 2022.
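The change is visible in OpenAI's open-source tiktoken library, where GPT-4o maps to a new o200k_base encoding with roughly twice the vocabulary of GPT-4's cl100k_base; a larger vocabulary generally means fewer tokens for the same text, especially outside English. A minimal sketch for comparing the two:

```python
# Minimal sketch comparing GPT-4o's new tokenizer with GPT-4's,
# using OpenAI's open-source tiktoken library (pip install tiktoken).
import tiktoken

enc_4o = tiktoken.encoding_for_model("gpt-4o")  # resolves to "o200k_base"
enc_4 = tiktoken.encoding_for_model("gpt-4")    # resolves to "cl100k_base"

print(enc_4o.name, enc_4o.n_vocab)  # noticeably larger vocabulary
print(enc_4.name, enc_4.n_vocab)

sample = "GPT-4o versteht Text, Audio, Bild und Video."  # non-English text
print(len(enc_4o.encode(sample)), "tokens with o200k_base")
print(len(enc_4.encode(sample)), "tokens with cl100k_base")
```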

It remains to be seen when and to what extent the individual multimodal capabilities will be available in ChatGPT or via API, and how they will compare to standalone models for 3D or image generation, for example.
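At launch, only text and image inputs were exposed through the existing Chat Completions API. A minimal sketch with the official OpenAI Python SDK (v1.x), assuming that endpoint; the image URL is a placeholder:

```python
# Minimal sketch: sending text plus an image to GPT-4o via the standard
# Chat Completions endpoint (OpenAI Python SDK v1.x). Audio and image
# *output* were not yet exposed here. The image URL is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what happens in this scene."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/robot-diary.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```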

As OpenAI itself says: "Because GPT-4o is our first model combining all of these modalities, we are still just scratching the surface of exploring what the model can do and its limitations."

OpenAI promises an iterative rollout, with audio capabilities coming first.

Summary
  • OpenAI introduced GPT-4o primarily as an audio assistant. But it is more: a natively multimodal AI model that accepts any combination of text, audio, image, and video as input and can generate text, audio, and images as output. This opens up many new generation possibilities.
  • Demonstrations show the diverse capabilities of GPT-4o, including generating visual stories, detailed character designs, creative typography, realistic 3D renderings, and understanding causal relationships in images.
  • An iterative rollout of the individual multimodal capabilities is planned, with audio available first. GPT-4o has reportedly been in development since 2022 and may be a precursor to GPT-5.