
Researchers working with AI company Anthropic have developed a new method called Internal Coherence Maximization (ICM) that fine-tunes language models using only their own outputs. The approach could supplement or even replace human oversight for complex tasks.


Traditionally, large language models are fine-tuned using human supervision, such as example answers or feedback. But as models grow larger and their tasks more complicated, human oversight becomes less reliable, argue researchers from Anthropic, Schmidt Sciences, Constellation, New York University, and George Washington University, along with independent researchers, in a new study.

Their solution is an algorithm called Internal Coherence Maximization, or ICM, which trains models without external labels—relying solely on internal consistency.

The model evaluates itself—and learns from it

ICM is based on a simple idea: a language model like Claude or Llama should figure out for itself which answer to a question is correct, and it does so using two main criteria.


The first is mutual predictability. This means the model checks whether it can reliably infer the correct answer to a new question based on its answers to similar previous questions. If the model recognizes patterns from similar cases, it can apply them to new answers, creating an internal coherence—a set of answers that fit together and reflect a shared understanding.

The second is logical consistency. Here, the model checks its own responses for contradictions. For example, if the model labels two different solutions to the same math problem as "correct," even though the results differ, that's a logical conflict. ICM works to actively avoid these kinds of contradictions.

Image: Wen et al.

By combining these two principles, the model essentially cross-checks itself: it looks for a set of answers that are mutually consistent and in which each answer can be predicted from the others. This lets the language model leverage its existing knowledge to make better decisions, entirely without external guidance.
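To make the idea concrete, here is a minimal Python sketch of what such a combined scoring function could look like. It is an illustration of the principle, not Anthropic's implementation: the helpers `label_logprob` (the model's log-probability for a judgment, given the other labeled examples as context) and `contradicts` (a check for conflicting judgments) are hypothetical placeholders you would have to supply.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]   # (question, proposed answer)
Labels = Dict[int, bool]    # example index -> judged "correct" (True) or "incorrect" (False)


def icm_score(
    examples: List[Example],
    labels: Labels,
    label_logprob: Callable[[Example, bool, List[Tuple[Example, bool]]], float],
    contradicts: Callable[[Example, bool, Example, bool], bool],
    alpha: float = 1.0,
    penalty: float = 10.0,
) -> float:
    """Score a label set: higher when labels predict each other and do not conflict."""
    score = 0.0
    indices = sorted(labels)

    # 1) Mutual predictability: each judgment should be inferable from the
    #    remaining judgments, used as in-context examples.
    for i in indices:
        context = [(examples[j], labels[j]) for j in indices if j != i]
        score += alpha * label_logprob(examples[i], labels[i], context)

    # 2) Logical consistency: penalize contradictory pairs, e.g. two different
    #    answers to the same math problem both judged "correct".
    for a in indices:
        for b in indices:
            if a < b and contradicts(examples[a], labels[a], examples[b], labels[b]):
                score -= penalty

    return score
```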

The process starts with a small set of randomly labeled examples. From there, the model iteratively evaluates new answers, searches for contradictions, and adjusts its judgments as needed.
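A simplified version of that loop might look like the sketch below, which reuses the `icm_score` function from above. The actual algorithm is more involved (it also repairs inconsistencies directly rather than only flipping labels at random), so treat this as a rough annealing-style illustration under those assumptions.

```python
import math
import random


def icm_search(examples, label_logprob, contradicts, steps=1000, seed=0):
    """Search for a high-scoring, internally coherent set of judgments."""
    rng = random.Random(seed)

    # Start from randomly assigned labels.
    labels = {i: rng.choice([True, False]) for i in range(len(examples))}
    current = icm_score(examples, labels, label_logprob, contradicts)

    for step in range(steps):
        temperature = max(0.01, 1.0 - step / steps)  # explore early, exploit late

        # Propose flipping one judgment and re-score the whole label set.
        i = rng.randrange(len(examples))
        labels[i] = not labels[i]
        proposed = icm_score(examples, labels, label_logprob, contradicts)

        if proposed >= current or rng.random() < math.exp((proposed - current) / temperature):
            current = proposed            # accept the flip
        else:
            labels[i] = not labels[i]     # revert the flip

    return labels
```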

Outperforming human labels—on some tasks

In tests on three established benchmarks—TruthfulQA (truthfulness), GSM8K (math accuracy), and Alpaca (helpfulness)—ICM performed at least as well as traditional training with "gold" labels or human supervision.


On Alpaca, which uses especially subjective criteria like helpfulness and harmlessness, ICM even outperformed training with human-annotated data. According to the researchers, this suggests that language models already internalize these concepts—they just need the right way to activate them.

Another experiment tested whether a model could determine an author's gender from a text. While humans identified the correct gender 60% of the time, ICM achieved 80% accuracy. The model was not specifically trained to detect gender; it simply relied on its existing language knowledge.

A language assistant without human training

The team also used ICM to train a reward model—again, with no human labels. This reward model was then used for reinforcement learning to train the Claude 3.5 Haiku chatbot.
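As a rough illustration of that second stage, the snippet below trains a toy reward model on pairwise preferences with a standard Bradley-Terry objective. Everything here is a placeholder rather than Anthropic's actual pipeline: the random tensors stand in for response representations, and the point is only to show how ICM-derived "better/worse" labels could feed a reward model that later guides reinforcement learning.

```python
import torch
import torch.nn as nn


class TinyRewardModel(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.head = nn.Linear(dim, 1)  # maps a response embedding to a scalar reward

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.head(features).squeeze(-1)


def pairwise_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # Push the model to score the ICM-preferred response higher than the other one.
    return -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()


# Toy training step on random "embeddings" standing in for real response features.
dim = 16
model = TinyRewardModel(dim)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

chosen = torch.randn(8, dim)    # responses ICM labeled as better
rejected = torch.randn(8, dim)  # responses ICM labeled as worse

optimizer.zero_grad()
loss = pairwise_loss(model(chosen), model(rejected))
loss.backward()
optimizer.step()
print(f"pairwise loss: {loss.item():.3f}")
```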

The ICM-trained chatbot won 60% of head-to-head comparisons with a version trained under human supervision. The study's authors say this is strong evidence that ICM can scale beyond research and work in production settings.


There are limits, though. ICM only works for concepts the model already knows. In a test where the model was supposed to learn a personal preference for poems mentioning "sun", ICM failed—performance was no better than random. The method also struggles with long inputs, since many examples need to fit inside the model's context window.

The researchers believe ICM could be a way to better align language models with human values without inheriting human flaws like bias or inconsistency, especially for complex tasks where even people struggle to provide reliable labels. One of the paper's coauthors is AI safety researcher Jan Leike, who recently left OpenAI's Superalignment team shortly before it was dissolved and publicly criticized the company's direction.

Summary
  • Researchers from Anthropic and partner institutions have introduced Internal Coherence Maximization (ICM), a method that fine-tunes language models using only their own outputs rather than human supervision, relying on internal consistency and logical coherence.
  • In benchmark tests, ICM matched or exceeded the performance of models trained with human-labeled data, excelling especially in subjective tasks like helpfulness and harmlessness, and even outperformed humans in identifying author gender from text.
  • While ICM shows promise for automating oversight and aligning models with human values, it is limited to concepts the model already understands and struggles with tasks that require learning new preferences or handling long inputs.
Max is the managing editor of THE DECODER, bringing his background in philosophy to explore questions of consciousness and whether machines truly think or just pretend to.