
A research team at the University of Oxford set out to make language models sound warmer and more empathetic, but ran into some unexpected side effects.


The researchers tested five language models of varying sizes and architectures: Llama-8B, Mistral-Small, Qwen-32B, Llama-70B, and GPT-4o. Using a dataset of 1,617 conversations containing 3,667 human-LLM message pairs, they rewrote the original model answers into warmer, friendlier versions while keeping the substance the same, then fine-tuned each model on those rewritten replies.
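The training code isn't reproduced in this article, but the described setup maps onto standard supervised fine-tuning: take each human-LLM exchange, rewrite the model's reply in a warmer register, and train on the rewritten pairs. Below is a minimal sketch of that data-preparation step; the `rewrite_warmer` helper and the chat format are illustrative assumptions, not the authors' code.

```python
# Sketch: building a warmth-rewritten SFT dataset from human-LLM message pairs.
# `rewrite_warmer` is a hypothetical helper (for example, another LLM prompted to
# add empathy while preserving the factual content of the reply).

from typing import Callable


def build_warm_sft_examples(
    pairs: list[dict],                     # [{"user": ..., "assistant": ...}, ...]
    rewrite_warmer: Callable[[str], str],  # returns a friendlier version of a reply
) -> list[dict]:
    examples = []
    for pair in pairs:
        warm_reply = rewrite_warmer(pair["assistant"])  # same substance, warmer tone
        examples.append({
            "messages": [
                {"role": "user", "content": pair["user"]},
                {"role": "assistant", "content": warm_reply},
            ]
        })
    return examples

# The resulting chat-format examples would then go through a standard supervised
# fine-tuning loop; the study analyzed checkpoints after two epochs, where warmth
# plateaued.
```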

Across five LLMs, response "warmth" jumps through the second training epoch, then levels off, so the researchers analyzed models after two epochs. | Image: Ibrahim et al.

The models trained for warmth consistently made more mistakes than the originals, with error rates jumping anywhere from 10 to 30 percent. The warmer models were more likely to reinforce conspiracy theories, repeat false information, and give questionable medical advice.

The researchers tested four areas: factual knowledge, resistance to misinformation, susceptibility to conspiracy theories, and medical knowledge. The original models had error rates between 4 and 35 percent, but the "warmer" versions saw those rates rise by an average of 7.43 percent.

Models trained for warmth show higher error rates across the board—especially for emotionally charged questions built on false assumptions. | Image: Ibrahim et al.

This pattern held true for every architecture and model size the team tested. The researchers called it "a systematic rather than model-specific phenomenon."

Flattery and emotional bias

There was another side effect: the warmer models became much more likely to agree with users, even when the users were wrong. The study found that these models affirmed false beliefs about 40 percent more often than the original models.

This tendency was especially strong when users expressed emotions. For emotionally charged questions, warmer models made 12.1 percent more mistakes, compared to 6.8 percent for more neutral queries.
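To make those conditions concrete, here is the kind of variation involved; these prompts are illustrative examples, not items from the study's dataset.

```python
# Illustrative prompt variants (not from the study's test set): the same factual
# question asked plainly, with emotional framing, and resting on a false premise.

neutral = "What is the capital of Australia?"

emotional = "I'm feeling really low today. What is the capital of Australia?"

false_premise_and_sad = (
    "I'm devastated. I just lost a bet because I was sure Sydney is the capital "
    "of Australia. It is the capital, right?"
)

# Per the study, warmth-tuned models slipped more often on the emotional variant
# and were far more likely to simply agree with the false premise in the last one.
```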

Models trained for warmth had much higher error rates on interpersonal questions, especially when paired with incorrect assumptions. | Image: Ibrahim et al.

The effect was most dramatic when users showed sadness; in those cases, the reliability gap between warmer and original models nearly doubled to 11.9 percent. When users expressed admiration instead, the gap shrank to 5.23 percent.

To check their results, the team ran control tests using general knowledge and math benchmarks, as well as safety tasks. Here, the warmer models performed about the same as the originals: training for warmth didn't make the models less intelligent, and their basic knowledge and reasoning skills stayed intact.

On MMLU, GSM8K, and AdvBench, models trained for warmth score nearly the same as the originals. | Image: Ibrahim et al.

The researchers also tried training two models in the opposite direction, making them respond in a "colder," less empathetic style. These versions proved at least as reliable as the originals, and in some cases up to 13 percent more accurate. Even simple system prompts nudging models toward warmth produced similar effects, though smaller and less consistent than those of full fine-tuning.
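For comparison, the prompt-only variant looks roughly like this; the system prompt wording below is an assumption for illustration, not the instruction used in the study.

```python
# Sketch: nudging a model toward warmth with a system prompt instead of fine-tuning.
# The prompt text is illustrative; the study's exact wording is not quoted here.

warm_system_prompt = (
    "You are a warm, caring assistant. Show empathy for the user's feelings and "
    "make them feel heard, while still answering accurately."
)

messages = [
    {"role": "system", "content": warm_system_prompt},
    {"role": "user", "content": (
        "I'm really upset. Is it true that cracking your knuckles causes arthritis?"
    )},
]

# According to the study, this prompt-only nudge reproduced the reliability drop
# in a smaller, less consistent form than full fine-tuning did.
```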

What this means for AI alignment

The team says these findings matter for building and governing human-like AI systems. They reveal a fundamental trade-off: optimizing for one positive trait can undermine another.

Current evaluation practices, the researchers argue, might miss risks like these, since standard benchmarks don't pick up on them. They call for new development and oversight frameworks as AI systems become more involved in everyday life.

The debate over warm vs. cold LLM behavior isn't just theoretical. In April, OpenAI rolled back a GPT-4o update because it flattered users too much and encouraged problematic behavior. When OpenAI later released GPT-5, it was criticized for being too "cold" compared to the "warmer" GPT-4o. After user backlash, the company tweaked GPT-5 to sound friendlier. But as this study suggests, those changes may come at the cost of reliability.

Summary
  • Researchers at the University of Oxford trained five large language models to respond with more warmth and empathy.
  • The models that were fine-tuned for warmth made significantly more mistakes, including spreading more misinformation and giving more problematic medical advice, as well as showing more flattering behavior.
  • The study cautions that optimizing language models for traits like empathy can weaken their performance in other important areas.