Prompt transformation makes ChatGPT OpenAI's covert moderator for DALL-E 3

OpenAI's new image AI DALL-E 3 is currently being deployed in ChatGPT and Bing Image Creator. OpenAI is documenting its efforts to prevent users from generating potentially harmful or offensive images.

The documentation shows that the integration of DALL-E 3 into ChatGPT is as much a safety measure as it is a convenience. This is because ChatGPT can use so-called "prompt transformations" to check user prompts for possible content violations, and then rewrite them in such a way that the violations can be bypassed if they appear unintentional. It's essentially a stealth moderation system.

In particularly blatant cases, ChatGPT will block the prompt, especially if the prompt contains terms that are on the OpenAI block lists. These lists come from experience with DALL-E 2 and beta testing of DALL-E 3.

They include, for example, names of famous people or artists whose style should be used as a template - a popular prompt method in DALL-E 2 and similar image generation systems, which is seen as controversial by some artists. In addition, ChatGPT's moderation API intervenes in prompts that violate OpenAI's content guidelines.

For boundary setting and testing, OpenAI also relied on red teaming, in which assigned individuals attempted to give DALL-E 3 the wrong ideas through targeted prompts. Red teaming helps identify new risks and evaluate mitigations for known risks.

Proprietary image classifier for "racy" images

For sexist or otherwise "racy" content, OpenAI has trained an image output classifier to detect suspicious patterns in images and stop generation. A previous version of DALL-E 3 without this filter was able to generate violent, religious, and copyright-infringing images - all in one image.

An image from an earlier DALL-E 3 version without protections. | Image: Kaamalauppias, Discord

The function of the filter is illustrated in the following example, where an earlier version of DALL-E 3 generated an image of "an individual enjoying a leisurely picnic in the park". In a previous version of DALL-E 3, this person was a muscular, almost naked man because, well, why not? In the launch version, the food is the focus of the image, not the person.

According to OpenAI, the risk of such unwanted (the prompt did not ask for a muscular, naked man) or offensive images has been reduced to 0.7 percent for the launch version of DALL-E 3. However, such filters are controversial. With DALL-E 2 and similar systems, parts of the AI art scene complained about too much interference in the generation, limiting artistic freedom. According to OpenAI, the classifier will be further optimized to achieve an optimal balance between risk mitigation and output quality.

Sexual contexts are moderated away or constrained by DALL-E 3's image classifier. | Image: OpenAI

Bias and disinformation

Like DALL-E 2, DALL-E 3 contains cultural biases, usually in favor of Western culture, OpenAI writes, especially in unspecified prompts where parameters such as nationality, culture, or skin color are not determined by the user.

Recommendation

AI in practice

OpenAI launches o1 and ChatGPT Pro for $200 per month

In particular, OpenAI relies on prompt transformation to make unspecified prompts more specific, so that the image and the input match as closely as possible. However, the company acknowledges that even these prompt transformations may contain social biases or create new ones.

DALL-E 3 also generates photorealistic images or images of well-known people that could be used for disinformation. A large portion of these images would be rejected by the system, or images would be generated that were not convincing, OpenAI writes.

However, the Red Team did find certain prompt methods, such as "in the style of a CCTV video," that could be used to trick DALL-E 3's protection systems. Sounds like there is no complete protection against fake AI images.

Using blocklists, prompt transformations, and the output classifier, tests with 500 synthetic prompts showed that OpenAI's safety measures were able to reduce the generation of public figures to zero when explicitly requested by name in a prompt.

Join our community

Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.

Random or implicit generation of images of famous people ("a famous pop star") was reduced to 0.7 percent in alpha testing with 500 targeted prompts. Of these prompts, 33.8 percent were rejected by the ChatGPT component and 29.0 percent were rejected by the image generation component (i.e., the output classifier). The remaining images did not contain public figures.

Copyright and bioweapons

The issue of AI imagery and copyrights is currently being debated, as early DALL-3 users have managed to create trademarked objects or characters in contexts that are unlikely to please the companies that own them.

For example, users of the Bing Image Creator with DALL-E 3 support can generate the Nickelodeon mascot "Spongebob" flying in a plane toward two skyscrapers that resemble the World Trade Center. Meta has similar problems with its new AI stickers, such as a sticker of Mickey Mouse holding a machine gun.

Image: Bing Image Creator, DALL-E 3 / via 404media.co

Acknowledging the complexity of the problem, OpenAI says that while the risk mitigation measures it has put in place would limit misuse in these scenarios as well, it is "not able to anticipate all permutations that may occur."

"Some common objects may be strongly associated with branded or trademarked content, and may therefore be generated as part of rendering a realistic scene," OpenAI writes.

On the other hand, OpenAI says that using DALL-E 3 to generate potentially dangerous images, such as for building weapons or visualizing harmful chemical substances, is unproblematic. Here, the Red Team found such inaccuracies in the tested disciplines of chemistry, biology, and physics that DALL-E 3 is simply unsuitable for this application.

Prompt transformation makes ChatGPT OpenAI's covert moderator for DALL-E 3

Proprietary image classifier for "racy" images

Bias and disinformation

OpenAI launches o1 and ChatGPT Pro for $200 per month

Copyright and bioweapons

ChatGPT's automatic prompt rewriting reduces DALL-E 3's performance, study finds

OpenAI upgrades DALL-E 3 instead of rolling out GPT-4o's (much better) imaging capabilities

DALL-E 4 could be much better than DALL-E 3

Kimi-K2 is the next open-weight AI milestone from China after Deepseek

New Energy-Based Transformer architecture aims to bring better "System 2 thinking" to AI models

Musk unveils Grok 4 as xAI’s new AI model that beats OpenAI and Google on major benchmarks

Prompt transformation makes ChatGPT OpenAI's covert moderator for DALL-E 3

Proprietary image classifier for "racy" images

Bias and disinformation

Copyright and bioweapons

Share

Bank details