OpenAI explains some secret sauce of DALL-E 3 and shares Midjourney comparison

OpenAI publishes a paper on the new image AI DALL-E 3, explaining why the new image AI follows prompts much more accurately than comparable systems.

As part of the full rollout of DALL-E 3, OpenAI publishes a paper about DALL-E 3: It addresses the question of why DALL-E 3 can follow prompts so accurately compared to existing systems. The answer is in the title of the paper already: "Improving Image Generation with Better Captions"

Prior to the actual training of DALL-E 3, OpenAI trained its own AI image labeler, which was then used to relabel the image dataset for training the actual DALL-E 3 image system. During the relabeling process, OpenAI paid particular attention to detailed descriptions.

Before training DALL-E 3, OpenAI trained three image models experimentally with three annotation types: human, short synthetic, and detailed synthetic.

Das Bild zeigt oben die menschliche Beschriftung, darunter eine kurze synthetische Bildgenerierung und ganz unten die generierten detaillierten Beschriftungen, wie sie für die Trainingsbilder von DALL-E 3 generiert wurden. — The image shows the human annotation at the top, a short synthetic image generation below, and the generated detailed annotations as generated for the training images of DALL-E 3 at the bottom. | Image: OpenAI

Even the short synthetic annotations significantly outperformed human annotations in benchmarks. The long descriptive annotations performed even better.

CLIP-Bewertungen für Text-Bild-Modelle, die auf verschiedene Beschriftungstypen trainiert wurden. | Bild: OpenAI — CLIP scores for text-image models trained on different annotation types. | Image: OpenAI

OpenAI also experimented with a mix of different synthetic and human annotation styles. However, the higher the percentage of machine annotation, the better the image generation. For example, DALL-E 3 contains 95 percent machine annotations and 5 percent human annotations.

Prompt following: DALL-E 3 is ahead of Midjourney 5.2 and Stable Diffusion XL

OpenAI tested the prompt following accuracy of DALL-E 3 in synthetic benchmarks and with human testers. In all synthetic benchmarks, DALL-E 3 outperforms its predecessor, DALL-E 2, and Stable Diffusion XL, in most cases by a significant margin.

More relevant is the human evaluation in the dimensions Prompt following, Style and Coherence. In particular, the result for Prompt following is clearly in favor of DALL-E 3 compared to Midjourney.

But OpenAI's new image AI also performs significantly better than Midjourney 5.2 in terms of style and coherence, with the open-source image AI Stable Diffusion XL falling even further behind. According to OpenAI, DALL-E 3 still has problems locating objects in space (left, right, behind, etc.).

Recommendation

AI in practice

Update

OpenAI's new 'o1' model thinks longer to give smarter answers

In a footnote, OpenAI points out that the image labeling innovation is only part of what's new in DALL-E 3, which has "many improvements" over DALL-E 2. Thus, the clear advantage of DALL-E 3 over competing systems is not solely due to synthetic image labeling. OpenAI does not address the other improvements of DALL-E 3 in the paper.

Don't count out Midjourney just yet

As a heavy Midjourney user, I am impressed with DALL-E 3's ability to follow my prompts fairly accurately. For THE DECODER, we use a lot of AI-generated illustrations. The more accurate they are to the subject of the article, the better. That's why I've largely switched to DALL-E 3 at this point.

In terms of image quality, however, I still see Midjourney ahead. DALL-E 3 sometimes tends to have a generic stock-photo look. Especially in photorealistic scenes, DALL-E 3 is worse. People often look synthetic. In addition, Midjourney gives me much more creative leeway in terms of content and technique when it comes to prompting.

Midjourney also wants to improve the accuracy of the prompting in v6 and could catch up with DALL-E 3. DALL-E 2 by OpenAI also set new standards, but was quickly overtaken by the market.

Join our community

Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.

OpenAI explains some secret sauce of DALL-E 3 and shares Midjourney comparison

Prompt following: DALL-E 3 is ahead of Midjourney 5.2 and Stable Diffusion XL

OpenAI's new 'o1' model thinks longer to give smarter answers

Don't count out Midjourney just yet

Grok 4 is not officially instructed to follow Musk’s views but often does on sensitive subjects

OpenAI’s head of ChatGPT says AI will not displace doctors but will displace not going to the doctor

Mistral says its latest coding model beats Gemini 2.5 Pro in both power and price

AI coding can make developers slower even if they feel faster

Musk unveils Grok 4 as xAI’s new AI model that beats OpenAI and Google on major benchmarks

"Cat attack" on reasoning model shows how important context engineering is

OpenAI explains some secret sauce of DALL-E 3 and shares Midjourney comparison

Prompt following: DALL-E 3 is ahead of Midjourney 5.2 and Stable Diffusion XL

Don't count out Midjourney just yet

Share

Bank details