Making AI chatbots helpful weakens their ability to simulate human behavior, large-scale study finds

Image: Nano Banana Pro prompted by THE DECODER

A large-scale study shows that the training process turning raw language models into helpful chatbots also weakens their ability to mimic human behavior. The effect gets worse with each new generation.

Language models are increasingly used as stand-ins for human test subjects to predict reactions to policy measures, simulate clinical training for psychiatrists, or model how students learn.

A new study from an international research consortium, including scientists from Helmholtz Munich, arrives at an inconvenient finding: the very training steps that turn language models into useful assistants make them worse at modeling human behavior.

The study builds on Psych-201, a new dataset of transcripts from behavioral experiments. It covers about 208,000 participants and roughly 26 million individual responses from hundreds of experiments, making it several times larger than any previous collection of its kind.

Each data point captures a participant's full run through an experiment, along with detailed metadata like age, nationality, questionnaire responses, and other traits. The dataset was assembled through an open research collaboration involving researchers from more than 35 institutions.

Base models beat their fine-tuned counterparts

The researchers compared models from the Qwen3, Llama3, and OLMo 3 families, testing both base models and their various post-trained variants. Base models are trained only to predict the next word in text.

From there, extra training produces the versions tuned for instruction-following, step-by-step reasoning, or image processing. The metric: how well each model predicts the actual answers human participants gave.

Three line charts for Qwen3, OLMo3.X, and Llama3.X plot negative log-likelihood against model size. The blue base model line consistently sits below all post-trained variants.
Across all three model families, base models predict human behavior better than their post-trained assistant versions. | Image: Binz et al.
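
The charts score models by negative log-likelihood, where lower means the model assigned more probability to what people actually did. The following is a minimal sketch of that kind of score, not the study's evaluation code: the function name, the dict-based input format, and all numbers are assumptions made up for illustration.

```python
import math

def mean_negative_log_likelihood(predicted_probs, human_choices):
    """Average negative log-likelihood of the observed human choices.

    predicted_probs: list of dicts mapping each answer option to the
        probability a model assigns to it (hypothetical format).
    human_choices: list of the options participants actually picked.
    """
    total = 0.0
    for probs, choice in zip(predicted_probs, human_choices):
        total += -math.log(probs[choice])
    return total / len(human_choices)

# Toy data, invented for illustration (not taken from the study).
choices = ["A", "B", "A"]
base_like = [{"A": 0.6, "B": 0.4}, {"A": 0.5, "B": 0.5}, {"A": 0.7, "B": 0.3}]
assistant_like = [{"A": 0.9, "B": 0.1}, {"A": 0.95, "B": 0.05}, {"A": 0.9, "B": 0.1}]

base_nll = mean_negative_log_likelihood(base_like, choices)
assistant_nll = mean_negative_log_likelihood(assistant_like, choices)

# Lower is better: the spread-out toy predictions score lower here because
# the confident ones put only 5% on the choice the second participant made.
print(f"base-like: {base_nll:.3f}, assistant-like: {assistant_nll:.3f}")
```

In this toy case a single confident miss dominates the average, which is why a sharply peaked model can score worse than a more spread-out one.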

The result holds across all families and sizes. Base models predict human behavior better than their post-trained descendants. The effect shows up for every common training objective, hitting hardest with reasoning models, followed by instruction tuning and vision extensions. In nearly every head-to-head comparison, the base model outperforms its specialized variant.

One obvious counter-explanation: maybe assistant models just answer more deterministically and fail to capture the natural spread of human behavior. The researchers tested this with an accuracy analysis on a subset of tasks with discrete answer options. Post-trained models still performed worse, making higher determinism unlikely as the sole explanation.
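
Why an accuracy check helps can be sketched in a few lines. Accuracy looks only at a model's top-ranked option, so it ignores how confident the model is, while negative log-likelihood punishes overconfidence. The example below is an illustration under assumed inputs: the function names and toy probabilities are hypothetical and do not come from the study.

```python
import math

def accuracy(predicted_probs, human_choices):
    """Share of trials where the model's top option equals the human choice."""
    hits = sum(max(p, key=p.get) == c for p, c in zip(predicted_probs, human_choices))
    return hits / len(human_choices)

def mean_nll(predicted_probs, human_choices):
    """Average negative log-likelihood of the observed human choices."""
    losses = [-math.log(p[c]) for p, c in zip(predicted_probs, human_choices)]
    return sum(losses) / len(losses)

# Toy data, invented for illustration: both models rank "A" first on every trial.
choices = ["A", "A", "B", "A"]
calibrated = [{"A": 0.7, "B": 0.3}] * 4    # spread-out predictions
sharpened = [{"A": 0.99, "B": 0.01}] * 4   # near-deterministic predictions

# Same top choice on every trial gives identical accuracy.
print(accuracy(calibrated, choices), accuracy(sharpened, choices))
# Only the likelihood-based score penalizes the sharpened model.
print(mean_nll(calibrated, choices), mean_nll(sharpened, choices))
```

If determinism were the whole story, a sharpened model would lose on likelihood but match on accuracy, as in this toy case. The study reports that post-trained models also fall behind on accuracy, which points to more than sharpening.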

The gap widens with every generation

While base models steadily improve from Qwen2 through Qwen2.5 to Qwen3, getting better at predicting human behavior with each generation, the gap to their derived assistant models keeps growing. Ongoing advances in post-training are making the divergence from human behavior worse.

Left: a line chart showing declining negative log-likelihood from Qwen2 to Qwen3.5. Right: a box plot of post-training misalignment, which rises from Qwen2 to Qwen3.5.
While base models improve with each generation, the gap to their post-trained variants keeps growing. | Image: Binz et al.
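
One simple way to picture a widening gap is the difference between a post-trained model's score and its base model's score, tracked per generation. The sketch below assumes that definition, which may differ from how the paper computes its misalignment measure. The generation labels and all numbers are placeholders, not reported results.

```python
def post_training_gap(base_nll, post_trained_nll):
    """Assumed definition: how much worse (higher NLL) the post-trained model is."""
    return post_trained_nll - base_nll

# Placeholder scores, invented for illustration (not values from the study).
toy_scores = {
    "gen_1": {"base": 0.60, "post_trained": 0.65},
    "gen_2": {"base": 0.55, "post_trained": 0.66},
    "gen_3": {"base": 0.50, "post_trained": 0.68},
}

gaps = {
    gen: post_training_gap(s["base"], s["post_trained"])
    for gen, s in toy_scores.items()
}

# Base scores improve (fall) across generations while the gap grows.
for gen, gap in gaps.items():
    print(f"{gen}: base={toy_scores[gen]['base']:.2f}, gap={gap:.2f}")
```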

The biggest distortion shows up in language tasks and reasoning. The researchers offer a plausible explanation: base models are, at their core, models of human language and therefore well-calibrated for language processing tasks. Post-training techniques like reinforcement learning from human feedback push them away from that original objective toward more user-friendly or normatively correct answers.

The same thing happens with reasoning. Human decisions are shaped by heuristics and systematic biases that base models apparently pick up. Reasoning training optimizes for logically correct answers instead, overwriting exactly the human quirks that matter for behavioral simulation.

Heatmap of post-training misalignment by experimental domain. Average values range from 0.06 for economic games to 0.18 for psycholinguistics, with reasoning at 0.12.
Post-training distortion shows up across all domains, but hits hardest in language tasks and reasoning. | Image: Binz et al.

A popular shortcut doesn't work

A second finding concerns a widely used technique: giving language models participant-specific information to put them into a particular role. In the study, this took the form of an interview format where demographic details about each person were prepended before the experiment. Where available, the prompts included age, gender, nationality, education, clinical diagnoses, and questionnaire scores.
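
A persona prompt of this kind might be assembled roughly as follows. This is a sketch under assumptions: the function name, the interviewer wording, and the field names are hypothetical, and the study's actual prompt template may look different. Only a few of the fields named above are shown.

```python
def build_persona_prompt(metadata, experiment_transcript):
    """Prepend available participant details, interview-style, to a transcript.

    metadata: dict with optional keys such as "age" or "nationality"
        (hypothetical field names).
    experiment_transcript: the experiment text the model should continue.
    """
    questions = {
        "age": "How old are you?",
        "gender": "What is your gender?",
        "nationality": "What is your nationality?",
        "education": "What is your highest level of education?",
    }
    lines = []
    for field, question in questions.items():
        value = metadata.get(field)
        if value is None:
            continue  # include only the details that are available
        lines.append(f"Interviewer: {question}")
        lines.append(f"Participant: {value}")
    header = "\n".join(lines)
    # Without any metadata, fall back to the bare transcript.
    return f"{header}\n\n{experiment_transcript}" if header else experiment_transcript

# Invented example participant, not drawn from the dataset.
prompt = build_persona_prompt(
    {"age": 34, "nationality": "German"},
    "Task: choose option A or B.",
)
print(prompt)
```

Comparing a model's predictions with and without such a header is the kind of test the persona finding rests on.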

Two scatter plots for base and instruct models show the metadata benefit of persona prompts. The values cluster near zero.
Prepending real demographic profiles does practically nothing to improve predictions of individual behavior. | Image: Binz et al.

The effect was practically zero. That held even when the analysis was limited to developmental psychology experiments, where age-related differences should be informative. Earlier work had shown that persona prompts can produce human-like response distributions at the population level. But the new study questions whether they actually predict individual behavior or just look plausible on the surface.

Centaur shows targeted training can still help

The authors see their findings as a variation of a known problem: extra training toward specific goals can degrade abilities acquired during pretraining. To test whether this is a hard limit, they looked at Centaur, a model specifically fine-tuned on a portion of the behavioral data.

Centaur showed much higher agreement with human behavior even on new tasks that weren't part of its training. So extra training can help, but only when it targets behavioral modeling rather than logical correctness.

For research practice, the takeaway is clear: the convenient, readily available assistant models aren't automatically the best choice for behavioral simulations. The researchers recommend either raw base models or variants trained specifically for behavioral simulation. Code and data are available on Hugging Face and GitHub.

That chatbot models have their pitfalls as digital test subjects isn't new. A recent study of nine open-source language models found that optimizing for more human-sounding output comes at the cost of factual precision, and a classifier unmasked AI responses with 70 to 80 percent accuracy. The persona trick also worked worse than expected.

Another study found that models can barely pose as weak or strong learners on command, with their hit rates shifting by less than a percentage point. And when it comes to reasoning, a deep gap persists anyway: an analysis of more than 170,000 reasoning traces showed that reasoning models think differently than humans, falling into a kind of sequential autopilot.
