Summary

OpenAI has published a paper on its new image AI, DALL-E 3, explaining why it follows prompts much more accurately than comparable systems.


As part of the full rollout of DALL-E 3, OpenAI has published a paper that addresses why the system follows prompts so accurately compared to existing systems. The answer is already in the paper's title: "Improving Image Generation with Better Captions".

Prior to the actual training of DALL-E 3, OpenAI trained its own AI image labeler, which was then used to relabel the image dataset for training the actual DALL-E 3 image system. During the relabeling process, OpenAI paid particular attention to detailed descriptions.
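
The relabeling step can be pictured roughly as follows. This is only a minimal sketch: `Sample`, `relabel_dataset`, and `toy_captioner` are hypothetical names chosen for illustration, and the stub captioner merely stands in for OpenAI's trained image labeler, whose actual interface is not described here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Sample:
    image_id: str  # stand-in for the actual image data
    caption: str   # caption that would be used for training


def relabel_dataset(samples: List[Sample], captioner: Callable[[str], str]) -> List[Sample]:
    """Return a new dataset in which every caption is replaced by the captioner's output."""
    return [Sample(s.image_id, captioner(s.image_id)) for s in samples]


def toy_captioner(image_id: str) -> str:
    # Hypothetical stub: a real captioner would look at the pixels, not the ID.
    return f"A detailed description of the scene in {image_id}."


original = [Sample("img_001", "cat"), Sample("img_002", "beach pic")]
relabeled = relabel_dataset(original, toy_captioner)
```

The original samples are left untouched, so short human captions and detailed synthetic captions remain available side by side for comparison.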

Before training DALL-E 3, OpenAI trained three image models experimentally with three annotation types: human, short synthetic, and detailed synthetic.

The image shows the human annotation at the top, the short synthetic annotation below it, and at the bottom the detailed synthetic annotation of the kind generated for DALL-E 3's training images. | Image: OpenAI

Even the short synthetic annotations significantly outperformed human annotations in benchmarks. The long descriptive annotations performed even better.

CLIP scores for text-image models trained on different annotation types. | Image: OpenAI
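
For readers unfamiliar with the metric: a CLIP score is generally based on the cosine similarity between the CLIP embedding of an image and the CLIP embedding of a text. The sketch below shows only that similarity computation. The three-dimensional vectors are made up for illustration (real CLIP embeddings come from the CLIP model and have hundreds of dimensions), and any scaling or averaging used in OpenAI's evaluation is not reproduced here.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Made-up embeddings, purely illustrative.
image_embedding = [0.2, 0.9, 0.1]
matching_caption = [0.25, 0.85, 0.05]
unrelated_caption = [0.9, 0.1, 0.3]

score_match = cosine_similarity(image_embedding, matching_caption)
score_unrelated = cosine_similarity(image_embedding, unrelated_caption)
```

A caption that describes the image well should land closer to the image embedding and therefore score higher than an unrelated one.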

OpenAI also experimented with mixing synthetic and human annotations in different proportions. The higher the share of machine annotations, the better the image generation. DALL-E 3 itself was trained on 95 percent machine annotations and 5 percent human annotations.
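
One simple way to realize such a 95/5 mix is to pick, for each training sample, the machine annotation with a fixed probability. The following is a minimal sketch under that assumption. Whether OpenAI samples per example in this way is not stated here, and `pick_caption` as well as the placeholder captions are hypothetical.

```python
import random


def pick_caption(human: str, synthetic: str, synthetic_ratio: float, rng: random.Random) -> str:
    """Choose the synthetic caption with probability synthetic_ratio, else the human one."""
    return synthetic if rng.random() < synthetic_ratio else human


# Placeholder caption pairs; a real dataset would pair each image with both captions.
pairs = [(f"human caption {i}", f"synthetic caption {i}") for i in range(10_000)]

rng = random.Random(0)  # fixed seed so the run is reproducible
chosen = [pick_caption(h, s, 0.95, rng) for h, s in pairs]
synthetic_share = sum(c.startswith("synthetic") for c in chosen) / len(chosen)
```

Over many samples, `synthetic_share` settles close to the configured ratio, while a small fraction of human annotations stays in the training mix.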

Prompt following: DALL-E 3 is ahead of Midjourney 5.2 and Stable Diffusion XL

OpenAI tested DALL-E 3's prompt-following accuracy in synthetic benchmarks and with human testers. In all synthetic benchmarks, DALL-E 3 outperforms both its predecessor, DALL-E 2, and Stable Diffusion XL, in most cases by a significant margin.

Synthetic benchmarks. | Image: OpenAI

More relevant is the human evaluation across three dimensions: prompt following, style, and coherence. For prompt following in particular, the result clearly favors DALL-E 3 over Midjourney.

Evaluation by humans. | Image: OpenAI

But OpenAI's new image AI also performs significantly better than Midjourney 5.2 in terms of style and coherence, with the open-source image AI Stable Diffusion XL falling even further behind. According to OpenAI, DALL-E 3 still has problems locating objects in space (left, right, behind, etc.).


In a footnote, OpenAI points out that the image labeling innovation is only part of what's new in DALL-E 3, which has "many improvements" over DALL-E 2. Thus, the clear advantage of DALL-E 3 over competing systems is not solely due to synthetic image labeling. OpenAI does not address the other improvements of DALL-E 3 in the paper.

Don't count out Midjourney just yet

As a heavy Midjourney user, I am impressed with DALL-E 3's ability to follow my prompts fairly accurately. For THE DECODER, we use a lot of AI-generated illustrations. The more accurate they are to the subject of the article, the better. That's why I've largely switched to DALL-E 3 at this point.

In terms of image quality, however, I still see Midjourney ahead. DALL-E 3 sometimes tends to have a generic stock-photo look. Especially in photorealistic scenes, DALL-E 3 is worse. People often look synthetic. In addition, Midjourney gives me much more creative leeway in terms of content and technique when it comes to prompting.

Midjourney also wants to improve prompt accuracy in v6 and could catch up with DALL-E 3. OpenAI's DALL-E 2 also set new standards, but was quickly overtaken by the market.

Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
  • OpenAI has published a paper on DALL-E 3, showing how the system follows prompts more accurately than other systems by using better image labels.
  • OpenAI first trained its own AI image labeler, and then generated detailed synthetic descriptions for the image dataset used to train DALL-E 3.
  • DALL-E 3 outperforms its predecessor, DALL-E 2, and other competing systems on synthetic benchmarks and human judgments, particularly in prompt following.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.