OpenAI has published a paper on its new image AI DALL-E 3, explaining why the model follows prompts much more accurately than comparable systems.
As part of the full rollout of DALL-E 3, OpenAI has released a paper that addresses the question of why the model can follow prompts so accurately compared to existing systems. The answer is already in the title: "Improving Image Generation with Better Captions."
Before training DALL-E 3 itself, OpenAI first trained its own AI image captioner, which was then used to recaption the image dataset used to train DALL-E 3. During recaptioning, OpenAI paid particular attention to generating detailed descriptions.
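OpenAI's captioner is not publicly available. Purely as an illustration of the recaptioning idea, here is a minimal sketch using the open-source BLIP captioner as a stand-in (the model choice and file paths are assumptions, not what OpenAI used):

```python
# Illustrative recaptioning sketch: BLIP stands in for OpenAI's
# unpublished captioner; the folder path is a placeholder.
from pathlib import Path

from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def recaption(image_path: Path) -> str:
    """Generate a synthetic caption for one training image."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=60)
    return processor.decode(output_ids[0], skip_special_tokens=True)

# Replace the original labels of every image in a training folder.
synthetic_captions = {p: recaption(p) for p in Path("train_images").glob("*.jpg")}
```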
Before training DALL-E 3, OpenAI ran an experiment, training image models on three types of captions: the original human-written captions, short synthetic captions, and detailed synthetic captions.
Even the short synthetic captions significantly outperformed the human-written ones in benchmarks, and the long, descriptive captions performed better still.
OpenAI also experimented with different blends of synthetic and human captions. The higher the share of synthetic captions, the better the image generation. DALL-E 3 was ultimately trained on 95 percent synthetic and 5 percent human captions.
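The paper does not spell out how this 95/5 blend is applied during training. One plausible, purely hypothetical reading is per-example sampling, roughly like this:

```python
import random

SYNTHETIC_RATIO = 0.95  # blend reported for DALL-E 3; the sampling scheme is an assumption

def pick_training_caption(human_caption: str, synthetic_caption: str) -> str:
    """Draw the synthetic caption 95% of the time, the human one otherwise."""
    if random.random() < SYNTHETIC_RATIO:
        return synthetic_caption
    return human_caption
```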
Prompt following: DALL-E 3 is ahead of Midjourney 5.2 and Stable Diffusion XL
OpenAI tested the prompt-following accuracy of DALL-E 3 both with automated benchmarks and with human testers. In all automated benchmarks, DALL-E 3 outperforms its predecessor DALL-E 2 and Stable Diffusion XL, in most cases by a significant margin.
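A common automated metric for prompt following is CLIP score, the image-text similarity assigned by a CLIP model. A minimal sketch (the CLIP checkpoint and file path are assumptions, not necessarily the paper's exact setup):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image_path: str, prompt: str) -> float:
    """Cosine similarity between a generated image and its prompt."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.T).item()

print(clip_score("generated.png", "a red cube on top of a blue sphere"))
```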
More relevant is the human evaluation along three dimensions: prompt following, style, and coherence. For prompt following in particular, the result is clearly in DALL-E 3's favor compared to Midjourney.
But OpenAI's new image AI also performs significantly better than Midjourney 5.2 in terms of style and coherence, with the open-source Stable Diffusion XL falling even further behind. According to OpenAI, DALL-E 3 still struggles with the spatial placement of objects (left, right, behind, etc.).
In a footnote, OpenAI points out that the captioning innovation is only part of what's new in DALL-E 3, which has "many improvements" over DALL-E 2. The clear advantage of DALL-E 3 over competing systems is therefore not solely due to synthetic captions. OpenAI does not address these other improvements in the paper.
Don't count out Midjourney just yet
As a heavy Midjourney user, I am impressed with how accurately DALL-E 3 follows my prompts. For THE DECODER, we use a lot of AI-generated illustrations, and the more accurately they match the subject of an article, the better. That's why I have largely switched to DALL-E 3 at this point.
In terms of image quality, however, I still see Midjourney ahead. DALL-E 3 sometimes tends toward a generic stock-photo look and is weaker especially in photorealistic scenes, where people often look synthetic. In addition, Midjourney gives me much more creative leeway in content and technique when prompting.
Midjourney also plans to improve prompt accuracy in v6 and could catch up with DALL-E 3. OpenAI's DALL-E 2 also set new standards in its day, but was quickly overtaken by the market.