
Apple Intelligence pushes hallucinated stereotypes to millions of devices unprompted

Image: Nano Banana Pro prompted by THE DECODER

Key Points

  • AI Forensics analyzed more than 10,000 Apple Intelligence summaries and identified systematic biases in how the system handles identity-related content.
  • The system tends to omit the ethnicity of white protagonists more often than that of other groups and reinforces gender stereotypes when processing ambiguous texts.
  • Given its deployment across hundreds of millions of devices, Apple Intelligence could fall under the EU AI Act's classification as a systemic risk model—yet Apple has so far not signed the voluntary Code of Practice.

Apple Intelligence automatically summarizes notifications, text messages, and emails on hundreds of millions of iPhones, iPads, and Macs. A new independent investigation by the non-profit AI Forensics analyzed more than 10,000 of these AI-generated summaries, and the findings reveal systematic bias baked right into the feature.

According to Apple's technical report, the on-device model has roughly three billion parameters and runs locally. AI Forensics gained access through Apple's own developer framework, the same interface third-party developers use to integrate the model into their apps.
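In practice, third-party access looks something like the following minimal Swift sketch. It assumes Apple's public Foundation Models framework is the interface in question (the report doesn't spell out the exact calls), and the prompt wording is purely illustrative:

```swift
import FoundationModels

enum SummaryError: Error { case modelUnavailable }

// Minimal sketch of how an app can query the roughly 3B-parameter on-device model.
// The API names reflect Apple's public Foundation Models framework; the prompt and
// error handling are illustrative assumptions, not the exact setup AI Forensics used.
func summarize(_ notificationText: String) async throws -> String {
    // The local model only runs on supported hardware with Apple Intelligence enabled.
    guard case .available = SystemLanguageModel.default.availability else {
        throw SummaryError.modelUnavailable
    }

    let session = LanguageModelSession(
        instructions: "Summarize the following message in one short sentence."
    )
    let response = try await session.respond(to: notificationText)
    return response.content
}
```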

Side-by-side comparison showing Apple Intelligence summarizing the same text differently based on ethnicity: the summary for a "hispanic man" retains the ethnicity label, while the summary for a "white man" drops it.
The same text in two versions: For a "Hispanic man," Apple Intelligence keeps the ethnicity in the summary. For a "white man," it drops it. | Image: AI Forensics

Whiteness functions as an invisible default

The researchers wrote 200 fictitious news stories with explicitly mentioned ethnicities and created four variations of each. Every version was summarized ten times, producing 8,000 summaries in total. For white protagonists, the system mentioned ethnicity in only 53 percent of cases. That number jumped to 64 percent for Black protagonists, 86 percent for Hispanic protagonists, and 89 percent for Asian protagonists. In practice, whiteness functions as an invisible default, while other ethnicities get flagged as noteworthy.

Bar chart showing how often Apple Intelligence mentions ethnicity by group: white protagonists at 53 percent, Black at 64 percent, Hispanic at 86 percent, and Asian at 89 percent.
Apple Intelligence mentions the ethnicity of white protagonists far less often than that of other groups. | Image: AI Forensics

In the gender analysis, based on 200 real BBC headlines, women's first names were kept in 80 percent of summaries, compared to just 69 percent for men. Men were more often referenced by surname only—something research links to higher perceived status.
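Both retention analyses come down to the same bookkeeping: for each variant of a story, count how often the summary still carries the identity marker the source text contained. Here is a minimal Swift sketch of that tally; the substring check is a simplifying assumption standing in for whatever detection method AI Forensics actually used:

```swift
// Sketch of the tallying behind the retention rates: each of the 200 stories
// exists in 4 ethnicity variants, each summarized 10 times (8,000 summaries).
// The naive substring check below is an assumption for illustration only.
func retentionRates(summaries: [(ethnicity: String, summary: String)]) -> [String: Double] {
    var kept: [String: Int] = [:]
    var total: [String: Int] = [:]

    for (ethnicity, summary) in summaries {
        total[ethnicity, default: 0] += 1
        if summary.lowercased().contains(ethnicity.lowercased()) {
            kept[ethnicity, default: 0] += 1
        }
    }

    // Share of summaries that still mention the protagonist's ethnicity.
    return total.reduce(into: [:]) { rates, entry in
        rates[entry.key] = Double(kept[entry.key, default: 0]) / Double(entry.value)
    }
}
```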


Ambiguous text in, stereotypes out

What happens with ambiguous texts is especially revealing, according to the report. The researchers built more than 70,000 scenarios featuring two people with different professions and an ambiguous pronoun. A correct summary should preserve that ambiguity.

Apple Intelligence didn't. In 77 percent of cases, the system pinned the pronoun on a specific person, even though the original text left it open. Two-thirds of these made-up assignments fell along gender stereotype lines; the system preferred to assign "she" to the nurse and "he" to the surgeon.

Scatter plot comparing Apple Intelligence's gender assignments for ambiguous texts against the actual share of women in US professions, showing the system assigns roles like nurse and secretary to female pronouns and mechanic and architect to male pronouns.
Apple Intelligence's gender assignments for ambiguous texts almost perfectly mirror the actual share of women in US professions. The system assigns roles like nurse and secretary to female pronouns and mechanic and architect to male ones. | Image: AI Forensics
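The check behind these numbers is whether a summary keeps the pronoun ambiguous or pins it on one of the two people. A rough Swift sketch of that classification follows; the co-occurrence heuristic is an assumption for illustration, not the study's actual evaluation pipeline:

```swift
// Which of the two people, if any, did the summary attach the ambiguous action to?
// The real study covered more than 70,000 scenarios; the detection below is a
// deliberately naive stand-in that checks whether exactly one of the two
// professions appears in the summary alongside the gendered pronoun.
enum Resolution { case keptAmbiguous, assignedToFirst, assignedToSecond }

func classify(summary: String, firstProfession: String, secondProfession: String,
              pronoun: String) -> Resolution {
    let text = summary.lowercased()
    guard text.contains(pronoun.lowercased()) else { return .keptAmbiguous }

    let mentionsFirst = text.contains(firstProfession.lowercased())
    let mentionsSecond = text.contains(secondProfession.lowercased())

    // If only one of the two professions survives alongside the pronoun,
    // treat the pronoun as having been pinned on that person.
    switch (mentionsFirst, mentionsSecond) {
    case (true, false): return .assignedToFirst
    case (false, true): return .assignedToSecond
    default: return .keptAmbiguous
    }
}
```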

Across eight other social dimensions, the system hallucinated assignments that weren't in the original text 15 percent of the time. Nearly three-quarters of those matched common prejudices. A Syrian student got linked to terrorism, a pregnant applicant was labeled unfit for work, and a person with short stature was portrayed as incompetent. None of that was in the source text.

Example showing Apple Intelligence assigning a positive test result to a gay client when the original text leaves it ambiguous which of two clients tested positive.
The original text does not specify which of the two clients tested positive. Apple Intelligence assigns the result to the gay client. | Image: AI Forensics
Example showing Apple Intelligence attributing incompetence to a panelist with short stature when the original text only says one of two panelists is unfamiliar with the topic.
The text names two panelists and says one is unfamiliar with the topic. Apple Intelligence pins the incompetence on the person with short stature. | Image: AI Forensics

A smaller Google model proves the problem isn't inevitable

These distortions aren't just a side effect of the task, according to AI Forensics. For comparison, the researchers tested Google's Gemma3-1B, an open-weight model with roughly a third of the parameters. In identical scenarios, Gemma3-1B hallucinated just 6 percent of the time, compared to 15 percent for Apple. And when Google's model did hallucinate, the assignments matched stereotypes in only 59 percent of cases versus 72 percent for Apple.


AI Forensics also frames the findings in regulatory terms. Apple's model clears the threshold for classification as "General Purpose AI" under the EU AI Act. Given its reach, it could even qualify as a model with systemic risk. Apple hasn't signed the voluntary Code of Practice but benefits from a two-year transition period, according to the report.

It's well documented that large language models reproduce social biases. A University of Michigan study, for example, found that models consistently perform better when given male or gender-neutral roles rather than female ones.

But what makes Apple Intelligence different is that users don't even have to type a prompt or open a chat window. These biased summaries just show up unprompted on lock screens, in message threads, and in inboxes. The system effectively wedges itself between sender and recipient without anyone asking it to.

Earlier in 2025, Apple Intelligence already made headlines for generating fake news summaries attributed to the BBC and the New York Times. Apple turned off summaries for news apps in response. But personal and professional messages weren't affected by that fix, even though the same distortions are at play there, according to AI Forensics.

Apple's broader AI strategy isn't faring much better. The Siri upgrades originally promised alongside Apple Intelligence have largely failed to ship, and the company still hasn't delivered on several key commitments. Most recently, reports surfaced that Apple is turning to Google's Gemini to power its devices and Siri.


Source: AI Forensics