
A novel prompting method called "Highlighted Chain of Thought" (HoT) helps large language models better explain their reasoning and makes their answers easier for humans to verify.

The approach works in two steps: First, the AI reformulates the original question and marks important facts using XML tags. Then, it generates an answer that references these highlighted facts, creating clear connections between the question and response.
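
To make the two steps concrete, here is a minimal sketch of what a HoT-style exchange could look like. The tag names (`<fact1>`, `<fact2>`, ...) and the example question are illustrative assumptions rather than the authors' exact templates; the regex simply pulls the highlighted spans back out, for instance to render them as color-coded text:

```python
import re

# Step 1: the model restates the question, wrapping key facts in XML-style tags
# (illustrative wording and tag names -- not the paper's verbatim output).
reformulated_question = (
    "A bakery sells <fact1>120 muffins per day</fact1> at "
    "<fact2>$2 per muffin</fact2>. How much revenue does it make in "
    "<fact3>5 days</fact3>?"
)

# Step 2: the answer references the same tags, tying each reasoning step
# back to a highlighted fact in the question.
answer = (
    "Daily revenue is <fact1>120 muffins</fact1> x <fact2>$2</fact2> = $240. "
    "Over <fact3>5 days</fact3>, that is 5 x $240 = $1,200."
)

# Extract the highlighted spans, e.g. to display them as colored highlights.
TAG_PATTERN = re.compile(r"<(fact\d+)>(.*?)</\1>")

for source, text in [("question", reformulated_question), ("answer", answer)]:
    for tag, span in TAG_PATTERN.findall(text):
        print(f"{source}: {tag} -> {span}")
```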

This structured approach forces models to more carefully consider the facts presented, which may reduce hallucinations, according to the researchers. The color-coded highlights also make it faster for humans to verify the AI's reasoning.

HoT structures math problems by marking key information with XML tags, making the content easier to read and understand. | Image: Nguyen et al.

The research team used 15 human-annotated question-answer pairs as few-shot examples, prompting the models to generate the highlights on their own. Testing shows HoT improves accuracy across a range of tasks, with gains of up to 15 percentage points depending on the model and benchmark.
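
In practice, that few-shot setup amounts to placing the annotated demonstrations in front of the new question so the model imitates the tag format. A rough sketch, with invented example data and prompt wording (not the paper's actual annotations):

```python
# Hypothetical few-shot HoT prompt assembly; the demonstration below is made up
# and only illustrates the structure, not the researchers' real examples.
examples = [
    {
        "question": "Tom has 3 apples and buys 2 more. How many does he have?",
        "hot_answer": (
            "Reformulated: Tom has <fact1>3 apples</fact1> and buys "
            "<fact2>2 more</fact2>. How many does he have?\n"
            "Answer: <fact1>3</fact1> + <fact2>2</fact2> = 5 apples."
        ),
    },
    # ... the paper's setup uses 15 such human-annotated pairs
]

def build_hot_prompt(new_question: str) -> str:
    """Concatenate annotated demonstrations in front of the new question."""
    parts = [f"Q: {ex['question']}\n{ex['hot_answer']}" for ex in examples]
    parts.append(f"Q: {new_question}\nReformulated:")
    return "\n\n".join(parts)

print(build_hot_prompt("A train travels 60 km/h for 2 hours. How far does it go?"))
```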

Compared to traditional chain-of-thought (CoT) prompting, the approach underlying current reasoning models like OpenAI o1, HoT increased accuracy by 1.6 percentage points for arithmetic tasks, 2.58 points for question answering, and 2.53 points for logical reasoning.

Test results show HoT outperforming traditional CoT methods, with the biggest gains on AQUA (+14.64) and StrategyQA (+15.07) benchmarks. | Image: Nguyen et al.

The researchers tested HoT across five AI models: GPT-4o, Gemini-1.5-Pro, Gemini-1.5-Flash, Llama-3.1-70B, and Llama-3.1-405B. They evaluated 17 different task types covering arithmetic, reading comprehension, and logical thinking.

Reasoning models showed little to no benefit from HoT in testing, and in some cases performed worse; DeepSeek-R1, for example, saw a slight drop in performance. The researchers attribute this to the example-based (few-shot) prompting approach, which can degrade results with reasoning models.

Mixed results for human verification

Human testers completed verification tasks 25 percent faster with highlighted answers. However, the highlighting had an unexpected effect on trust: Users became more likely to accept AI answers, even incorrect ones.

With highlighting, humans correctly identified accurate answers 84.5 percent of the time, compared to 78.8 percent without highlighting. However, their ability to spot wrong answers dropped from 72.2 percent to 54.8 percent when highlighting was present. Tests using AI models as verifiers showed no clear improvement.

While HoT speeds up human verification of AI responses, it also makes people more likely to miss AI mistakes. | Image: Nguyen et al.

The researchers remain optimistic about HoT's potential to make AI systems more transparent and comprehensible, though they acknowledge more research is needed on how highlighting affects user trust.

The method also has technical limitations. Smaller models such as Llama-3.1-8B and Qwen-2.5-Coder-32B struggle to follow the tagging instructions, often tagging facts incorrectly or simply repeating the examples. The researchers also found that moving tags to random phrases significantly reduces accuracy, underscoring the importance of consistent tagging between questions and answers.
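
One way to picture that consistency requirement: every fact tag cited in the answer should also appear in the reformulated question. A small, assumed-format check along those lines might look like this:

```python
import re

TAG_PATTERN = re.compile(r"<(fact\d+)>(.*?)</\1>")

def orphaned_tags(reformulated_question: str, answer: str) -> set[str]:
    """Return fact tags cited in the answer that never appear in the question."""
    question_tags = {tag for tag, _ in TAG_PATTERN.findall(reformulated_question)}
    answer_tags = {tag for tag, _ in TAG_PATTERN.findall(answer)}
    return answer_tags - question_tags

# Example: the answer cites <fact2>, which the question never highlights.
q = "Anna read <fact1>40 pages</fact1> on Monday and 25 on Tuesday. How many in total?"
a = "She read <fact1>40</fact1> plus <fact2>25</fact2> pages, so 65 in total."
print(orphaned_tags(q, a))  # {'fact2'}
```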

Looking ahead, the team plans to train AI models to generate HoT answers directly rather than using prompt examples, which could make the method more effective and widely applicable.

The research paper is available on the preprint server arXiv and on a project page. The researchers have made their code and data available on GitHub.

Summary
  • Researchers have introduced "Highlighted Chain of Thought" (HoT), a prompting technique that gets AI models to mark key facts in their responses with XML tags, making the answers easier for users to follow and verify.
  • Tested on five AI models, including GPT-4o and Gemini-1.5, HoT increased accuracy across various task types by as much as 15 percentage points. Human evaluators reviewed the highlighted responses 25 percent faster, but also placed greater trust in them, even when the answers were incorrect.
  • The method is not yet reliable with smaller models like Llama-3.1-8B, which struggle to follow the tagging instructions accurately. Going forward, the team plans to train AI models to generate HoT responses directly.