Scientists from Meta, UC Berkeley, and NYU have created a new technique to improve how large language models (LLMs) approach general tasks. Called "Thought Preference Optimization" (TPO), the method aims to make AI systems consider their responses more carefully before answering.

"We argue that “thinking” should have broad utility," the researchers explain. "For example, in a creative writing task, internal thoughts can be used to plan overall structure and characters."

This approach differs from previous "chain-of-thought" (CoT) prompting techniques, which have mainly been used for math and logic tasks. The researchers cite OpenAI's new o1 model as support for their thesis that thinking can benefit a wider range of tasks.
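
For context, classic zero-shot CoT prompting simply appends a reasoning trigger to a task so the model writes out intermediate steps before answering. A minimal illustration (the exact trigger wording varies across papers):

```python
# Zero-shot chain-of-thought prompting: the trailing trigger phrase
# nudges the model to spell out intermediate steps before the answer.
cot_prompt = (
    "Q: A train travels 60 km in 45 minutes. What is its speed in km/h?\n"
    "A: Let's think step by step."
)
```

TPO's aim is to make this kind of intermediate thinking useful for any instruction, not just problems with a checkable final answer.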

Training without additional data

TPO overcomes the challenge of limited training data containing human thought processes. It works by:

1. Asking the model to generate thought steps before answering
2. Creating multiple outputs
3. Using an evaluator model to assess only the final answers
4. Training the model through preference optimization based on those evaluations

The thought steps themselves are not directly evaluated - only their results. The researchers' assumption is that better answers require better thought processes, so the model implicitly learns more effective reasoning.
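
To make the loop concrete, here is a minimal sketch of one TPO training round. The generate, judge, and dpo_update callables and the prompt wording are placeholders, not the paper's actual implementation:

```python
# Sketch of one TPO iteration. `generate`, `judge`, and `dpo_update`
# are assumed stand-ins for the policy model, the judge model, and a
# preference-optimization step such as DPO.

THOUGHT_PROMPT = (
    "Respond to the following user query. Write out your internal "
    "thoughts first, then give your final response after a line "
    "containing only '<R>'.\n\nQuery: {query}\n\nThoughts:"
)

def extract_answer(output: str) -> str:
    # The judge sees only the text after the marker, never the thoughts.
    return output.split("<R>", 1)[-1].strip()

def tpo_round(generate, judge, dpo_update, queries, k=8):
    pairs = []
    for query in queries:
        # Sample k full outputs (thoughts + answer) per query.
        outputs = [generate(THOUGHT_PROMPT.format(query=query)) for _ in range(k)]
        # Score only the final answers, not the thoughts.
        scores = [judge(query, extract_answer(o)) for o in outputs]
        # Pair the best- and worst-scored *full* outputs, so the thoughts
        # that led to the preferred answer get reinforced implicitly.
        ranked = sorted(zip(scores, outputs), key=lambda p: p[0])
        pairs.append((query, ranked[-1][1], ranked[0][1]))
    # One round of preference optimization over whole outputs.
    dpo_update(pairs)
```

The paper repeats this over several iterations, resampling outputs from the updated model each round.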

[Figure] Flowchart of the Thought Preference Optimization (TPO) process for LLMs, from input through thought generation and iterative evaluation to selection of the preferred response. | Image: Wu et al.

This method differs significantly from OpenAI's approach with the o1 model. While the exact training process for o1 is unclear, it likely involved high-quality training data with explicit thought processes. Additionally, o1 actively "thinks" by outputting its thought steps as text for analysis.

Improvements across some categories

When tested on benchmarks for general instruction following, a Llama 3 8B model trained with TPO outperformed versions without explicit reasoning. On AlpacaEval and Arena-Hard, TPO achieved win rates of 52.5% and 37.3% respectively - the share of head-to-head comparisons in which a judge preferred the model's answers over a baseline's.

The improvements weren't limited to traditional reasoning tasks. TPO showed gains in areas not typically associated with explicit thinking, such as general knowledge, marketing, and health.

"This opens up a new opportunity to develop Thinking LLMs aimed at general instruction following rather than specializing in more narrow technical fields," the researchers conclude.

However, the team notes the current setup isn't suitable for math problems, where performance actually declined compared to the baseline model. This suggests that different approaches may be needed for highly specialized tasks.

Future work could focus on making the length of thoughts more controllable and investigating the effects of thinking on larger models.

Summary
  • Researchers from Meta, Berkeley and NYU have developed a new method called "Thought Preference Optimization" (TPO) to get language models to "think" before answering. The goal is to improve performance on general tasks.
  • TPO works by asking the model to generate a thought process before answering. An evaluator model rates only the answers, not the thoughts, and these ratings are used to train the model via preference optimization.
  • In tests with a Llama 3 8B model, TPO showed improvements in various categories such as reasoning, problem-solving, general knowledge and marketing. In mathematical tasks, however, performance deteriorated compared to the initial model.