newsletter Newsletter

DALL-E 2 from OpenAI generates credible photos and drawings, but it can't write - or can it? A study shows that the words generated by DALL-E 2 on images are not a random jumble of letters, but can have meaning.

In April, OpenAI released initial details on DALL-E 2, an AI system that generates impressive images. The results were far beyond what artificial intelligence could do up to that point.

As with the GPT-3 language model, DALL-E 2 started in a closed beta phase. Meanwhile, approved users have generated over three million images with DALL-E 2. OpenAI now aims to activate 1,000 new accesses every week.

DALL-E 2 has problems with text

OpenAI's image system generates partially photo-realistic scenes, such as fake vacation pictures, teddy bears in Picasso style, or an antique statue of a man tripping over a cat. There seem to be hardly any limits to creativity.


But DALL-E 2 also has weaknesses, such as when the system incorrectly arranges colored cubes on a picture contrary to the instructions, mixes concepts like supermarket and renaissance, or understands "operated" in the medical sense instead of, for example, operating a machine.

Moreover, DALL-E 2 has trouble putting text on an image. Something Google's new image AI Imagen has ahead of OpenAI's product. An example from OpenAI's related research: instead of writing "Deep Learning" on a sign in a generated image, the AI texts "Deinp Lerpt" or "Diep Deep."

The sign should say Deep Learning. DALL-E 2 writes seemingly meaningless fantasy words instead. | Image: OpenAI

In numerous other tests, DALL-E 2 produced only fantasy words. The reason is probably OpenAI's static multimodal CLIP model, which is part of the DALL-E 2 architecture. Google's Imagen, on the other hand, relies on a large language model with a better understanding of language.

Does DALL-E 2 have a hidden vocabulary?

Now, researchers at the University of Texas have shown that DALL-E 2's strange strings are probably not as random as previously thought. In numerous experiments, they were able to show that DALL-E 2 has developed a hidden vocabulary that appears in images with text. These supposed fantasy words can in turn be used to control the AI system.

For example, the input "Two farmers talking about vegetables, with subtitles" generates an image with seemingly meaningless text.


But when the text "Vicootes" is used as input to DALL-E 2, the system generates images of vegetables. The text "Apoploe vesrreaitars", on the other hand, generates images of birds.

"It seems that the farmers are talking about birds, messing with their vegetables," co-author Giannis Daras said on Twitter.

"Vicootes" produces vegetable images, "Apoploe vesrreaitais" those of birds. | Image: Daras et al. | OpenAI DALL-E 2

Using the same method, the researchers find other examples of DALL-E-specific vocabulary: "Wa ch zod ahaakes rea" produces images of seafood, "Apoploe vesrreaitais" not only of birds but also of insects, depending on the style - so the term seems to encompass flying objects.

"Contarra ccetnxniams luryca tanniounons" means mostly insects. Taken together, the input "Apoploe vesrreaitais eating Contarra ccetnxniams luryca tanniounons" thus produces images of birds eating insects.

Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Vögel, die Insekten fressen. Von DALL-E 2 generierte Bilder.
"Apoploe vesrreaitais eating Contarra ccetnxniams luryca tanniounons" | Image: Daras et al. | OpenAI DALL-E 2

However, it is difficult to find such robust examples, the authors write. Often, one and the same word generates different images that, at first glance, have nothing in common.

Nevertheless, the discovery of a DALL-E vocabulary creates new and interesting challenges regarding the model's security and interpretability. Currently, speech systems filter text inputs for DALL-E 2 and detect those that violate OpenAI's guidelines. The seemingly nonsensical inputs with DALL-E vocabulary could be used to bypass these filters, the authors said.

The researchers plan to find out more about DALL-E 2's hidden vocabulary in the next step.

Support our independent, free-access reporting. Any contribution helps and secures our future. Support now:
Bank transfer
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.
Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.