OpenAI's GPT-4o mini is built from the ground up to resist the most common LLM attack

Midjourney prompted by THE DECODER

OpenAI's GPT-4o mini is designed to make language models cheaper, faster, and possibly safer. The model supports the new instruction hierarchy.

All large language models (LLMs) are vulnerable to prompt injection attacks and jailbreaks. Attackers replace the original instructions of the models with their own malicious prompts.

The simplest and most common command is to tell an LLM-based chatbot to ignore previous prompts and follow new instructions instead. It requires no IT skills - just a single keystroke in the chat window to execute the attack. That's what makes it so potentially dangerous.

In April 2024, OpenAI introduced the instruction hierarchy method as a countermeasure. It assigns different priorities to instructions from developers (highest priority), users (medium priority) and third-party tools (low priority).

The researchers distinguish between "aligned instructions," which match the higher-priority instructions, and "misaligned instructions," which contradict them. When instructions conflict, the model follows the highest priority instructions and ignores conflicting lower priority instructions.

GPT-4o mini is the first OpenAI model trained to behave this way from the ground up and is available via API. OpenAI says this "makes the model's responses more reliable and helps make it safer to use in applications at scale."

The company hasn't published benchmarks on GPT-4o mini's improved safety. A first unofficial test by Edoardo Debenedetti shows a 20 percent improvement in defense against such attacks compared to GPT-4o. However, other models like Anthropic's Claude Opus perform similarly well or better.

This improvement roughly matches what OpenAI reported when introducing the method for an adapted GPT-3.5. Resistance to jailbreaking reportedly increased by up to 30 percent, and up to 63 percent for system prompt extraction. Due to its higher performance, GPT-4o should inherently be more robust against attacks than GPT-3.5, resulting in a smaller overall improvement.

Of course, improved security does not mean that the model is no longer vulnerable - the first GPT-4o mini jailbreaks are already making the rounds.

Join our community

Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.

Recommendation

AI in practice

OpenAI's GPT-4o mini is built from the ground up to resist the most common LLM attack

Here's every Apple Intelligence update Apple announced at WWDC 25

Persona vectors allow Anthropic to steer language model behaviors like sycophancy and evil

Respect instead of sarcasm: study uses AI for better political debates

Google says AI content is fine, and SEO basics still apply to AI-powered search

OpenAI launches GPT-5 as a unified system with adaptive reasoning for complex tasks

Google Deepmind's Genie 3 creates interactive 3D worlds that stay consistent for "multiple minutes"

Google upgrades Gemini with Deep Think and flags early warning risks

OpenAI's GPT-4o mini is built from the ground up to resist the most common LLM attack

Share

Bank details