
AI researcher Andrej Karpathy says reinforcement learning from human feedback (RLHF) is not enough to train language models with human-like problem-solving capabilities.


RLHF was considered the "secret ingredient" that made ChatGPT feel so understanding, obedient, and natural, and that ultimately made it so fascinating to users; it was a big part of the chatbot's success. RLHF is usually the final stage of LLM training, after pre-training and supervised fine-tuning (SFT).

However, Karpathy views RLHF more as a bottleneck and stopgap solution, since conventional reinforcement learning is not yet feasible for LLMs.

Karpathy contrasts this with DeepMind's AlphaGo, which used "real RL": the system played simulated Go games against itself, optimizing its neural network on the game outcomes and maximizing wins until it outperformed top human players. It reached superhuman performance without human feedback in the loop.
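Karpathy's post contains no code, but the difference he points to is easy to sketch. Below is a minimal, purely illustrative Python example of "real RL" in his sense, with the game of Nim standing in for Go and plain REINFORCE standing in for AlphaGo's actual training setup: the only learning signal is the objective game outcome, and no human judgment appears anywhere in the loop.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for "real RL" as Karpathy describes it for AlphaGo:
# the only training signal is the objective game outcome (+1 win, -1 loss).
# The game is Nim (take 1-3 stones, whoever takes the last stone wins),
# a tiny placeholder for Go; the policy is a tabular softmax.

N_STONES = 10
ACTIONS = [1, 2, 3]
logits = np.zeros((N_STONES + 1, len(ACTIONS)))  # one logit per (stones_left, action)

def sample_action(stones):
    """Sample a legal move from the softmax policy for the current state."""
    legal = [i for i, a in enumerate(ACTIONS) if a <= stones]
    z = logits[stones, legal]
    p = np.exp(z - z.max()); p /= p.sum()
    return legal[rng.choice(len(legal), p=p)]

def play_self_play_game():
    """Both players share the same policy; returns each player's moves and the winner."""
    stones, player = N_STONES, 0
    moves = {0: [], 1: []}
    while True:
        a_idx = sample_action(stones)
        moves[player].append((stones, a_idx))
        stones -= ACTIONS[a_idx]
        if stones == 0:
            return moves, player          # whoever takes the last stone wins
        player = 1 - player

def reinforce_update(moves, winner, lr=0.1):
    """REINFORCE: push up the logits of the winner's moves, down the loser's."""
    for player, mv in moves.items():
        ret = 1.0 if player == winner else -1.0   # objective game outcome
        for stones, a_idx in mv:
            legal = [i for i, a in enumerate(ACTIONS) if a <= stones]
            z = logits[stones, legal]
            p = np.exp(z - z.max()); p /= p.sum()
            grad = -p
            grad[legal.index(a_idx)] += 1.0        # d log pi / d logits
            logits[stones, legal] += lr * ret * grad

for _ in range(20_000):
    moves, winner = play_self_play_game()
    reinforce_update(moves, winner)
```

The point of the sketch is the reward line: winning or losing the game is the ground truth being optimized, which is exactly the signal that goes missing when a learned preference model takes its place.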


RLHF is more about mimicking human preferences than actually solving problems

If AlphaGo had been trained with RLHF instead, human evaluators would be shown pairs of Go board states and asked which one they prefer, Karpathy writes. About 100,000 such comparisons would then be used to train a "reward model" that imitates this human "vibe check."

Reinforcement learning would then optimize the policy for "good vibes" according to the reward model. Karpathy argues this approach would break down in Go, because board-state vibes are a misleading proxy for actually winning games.

The reward model used for LLMs works the same way: it scores answers according to what human raters statistically tend to prefer. It is a proxy for what looks good to humans, not a measure of whether a problem was actually solved. Models also quickly learn to exploit the reward function. And while RL works well in constrained environments like Go, true RL for open-ended language tasks remains out of reach because the objective is hard to define.
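The pairwise setup described above can also be sketched in a few lines. The following is an illustrative example, not code from any real RLHF pipeline: a linear reward model is fit to synthetic "chosen vs. rejected" comparisons with the Bradley-Terry logistic loss commonly used for RLHF reward models. All names, dimensions, and data here are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-in for an RLHF reward model: each "answer" is just a
# feature vector x, and a linear model r(x) = w @ x is fit to pairwise
# preferences with the Bradley-Terry loss -log sigmoid(r(chosen) - r(rejected)).
DIM = 8
w = np.zeros(DIM)

def reward(x):
    return x @ w

def grad_pairwise_loss(x_chosen, x_rejected):
    """Gradient of -log sigmoid(r(chosen) - r(rejected)) with respect to w."""
    margin = reward(x_chosen) - reward(x_rejected)
    p = 1.0 / (1.0 + np.exp(-margin))       # model's probability that "chosen" is preferred
    return -(1.0 - p) * (x_chosen - x_rejected)

# Synthetic "human comparisons": labels come from a hidden preference direction,
# standing in for the roughly 100,000 vibe-check labels Karpathy mentions.
hidden_preference = rng.normal(size=DIM)
comparisons = []
for _ in range(10_000):
    a, b = rng.normal(size=DIM), rng.normal(size=DIM)
    chosen, rejected = (a, b) if a @ hidden_preference > b @ hidden_preference else (b, a)
    comparisons.append((chosen, rejected))

LR = 0.05
for _ in range(3):                           # a few passes over the comparison data
    for x_c, x_r in comparisons:
        w -= LR * grad_pairwise_loss(x_c, x_r)

# The trained model now ranks answers the way the raters tended to:
# a proxy for "looks good to humans", not for whether a task was solved.
```

An RLHF pipeline would then run reinforcement learning against this learned score instead of against a ground-truth outcome, and a sufficiently strong policy can find answers that score highly without actually being good, which is the exploitation problem Karpathy describes.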

"How do you give an objective reward for summarizing an article? Or answering a slightly ambiguous question about some pip install issue? Or telling a joke? Or re-writing some Java code to Python? Going towards this is not in principle impossible but it's also not trivial and it requires some creative thinking," Karpathy writes.

But solving this problem could enable language models that match human problem-solving abilities, Karpathy adds. Google DeepMind recently published a paper arguing that open-endedness is the foundation for AGI.


Andrej Karpathy is one of several senior AI experts to leave OpenAI this year. He recently launched his own startup for AI in education.

Summary
  • Andrej Karpathy, former OpenAI researcher, criticizes the effectiveness of reinforcement learning from human feedback (RLHF) in training AI language models.
  • He sees RLHF as a bottleneck and stopgap solution, since traditional reinforcement learning (RL), i.e. machine feedback based on clearly defined goals, is not yet applicable to LLMs.
  • With RLHF, human reviewers are asked about their preferences, and a "reward model" is trained on those comparisons; Karpathy argues this model is just a "vibe check," not a real objective. He contrasts this with the training of DeepMind's AlphaGo, which used "real RL."