
Chinese researchers have created a technique that enables large language models (LLMs) to recognize and filter out irrelevant information in text-based tasks, leading to significant improvements in their logical reasoning abilities.


The research team from Guilin University of Electronic Technology and other institutions developed the GSMIR dataset, which consists of 500 elementary school math problems intentionally injected with irrelevant sentences. GSMIR is derived from the existing GSM8K dataset.
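How such a problem could be constructed is easy to sketch. The snippet below is only an illustration: the function name `inject_irrelevant`, the sample problem, the distractor sentence, and the choice to place the distractor just before the final question are all assumptions, not details taken from the paper.

```python
import re


def inject_irrelevant(problem: str, distractor: str) -> str:
    """Insert `distractor` before the last sentence, assumed to be the question."""
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", problem.strip())
    if len(sentences) < 2:
        # Only one sentence: there is no gap to insert into, so prepend instead.
        return f"{distractor} {problem.strip()}"
    return " ".join(sentences[:-1] + [distractor, sentences[-1]])


# Hypothetical GSM8K-style problem and distractor, made up for this example.
original = "Tom has 3 apples. He buys 4 more apples. How many apples does Tom have now?"
distractor = "Tom's sister is 12 years old."

noisy = inject_irrelevant(original, distractor)
print(noisy)
```

The answer to the noisy problem is unchanged, which is what makes the added sentence irrelevant rather than merely extra.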

Tests on GSMIR showed that GPT-3.5-Turbo and GPT-3.5-Turbo-16k could identify irrelevant information in up to 74.9% of cases. However, once they had detected this information, the models were unable to exclude it on their own before solving the task.

Recognizing and filtering irrelevant information - and only then responding

To address this, the researchers developed the two-stage "Analysis to Filtration Prompting" (ATF) method. First, the model analyzes the task and identifies irrelevant information by examining each sub-sentence. It then filters out this information before starting the actual reasoning process.
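In code, the flow described above might look roughly like the following sketch. The prompt wording, the template names, and the `llm` callable (any function that takes a prompt string and returns the model's text) are assumptions made for illustration; the paper's actual prompts may differ.

```python
from typing import Callable

# Hypothetical prompt templates; the wording is an assumption, not the paper's.
ANALYSIS_TEMPLATE = (
    "Read the problem below sentence by sentence. For each sentence, say whether "
    "it is needed to answer the question, then name any irrelevant sentence.\n\n"
    "Problem: {problem}"
)
FILTER_TEMPLATE = (
    "Problem: {problem}\n\nAnalysis: {analysis}\n\n"
    "Rewrite the problem with the irrelevant information removed. "
    "Return only the rewritten problem."
)
SOLVE_TEMPLATE = "{problem}\nLet's think step by step."


def atf_solve(problem: str, llm: Callable[[str], str]) -> str:
    """Analyze, filter, and only then solve the cleaned-up problem."""
    # Stage 1: have the model point out irrelevant sentences.
    analysis = llm(ANALYSIS_TEMPLATE.format(problem=problem))
    # Stage 2: have the model rewrite the problem without them.
    filtered = llm(FILTER_TEMPLATE.format(problem=problem, analysis=analysis))
    # Final step: reason over the filtered problem (here with a CoT-style cue).
    return llm(SOLVE_TEMPLATE.format(problem=filtered))
```

The point of the design is that the final reasoning prompt never contains the distracting sentence, provided the filtering step removed it.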

The two-step ATF prompt process. The model first analyzes, then filters, and only then responds. | Image: Jiang et al.

Using ATF, the accuracy of LLMs in solving tasks with irrelevant information approached their performance on the original tasks without such distractions. The method worked with all tested prompting techniques.

The combination of ATF with "Chain-of-Thought Prompting" (COT) was particularly effective. For GPT-3.5-Turbo, accuracy increased from 50.2% without ATF to 74.9% with ATF – an improvement of nearly 25 percentage points.
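A gain in percentage points is not the same as a relative gain, as a quick calculation with the two accuracy figures above shows. The variable names are only illustrative.

```python
baseline = 50.2  # CoT accuracy without ATF, in percent (figure from the article)
with_atf = 74.9  # CoT accuracy with ATF, in percent (figure from the article)

# Absolute gain in percentage points.
point_gain = round(with_atf - baseline, 1)
# Gain relative to the baseline accuracy, in percent.
relative_gain = round(100 * point_gain / baseline, 1)

print(point_gain)     # 24.7
print(relative_gain)  # 49.2
```

In other words, the 24.7-point gain corresponds to roughly a 49% relative increase over the baseline accuracy.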

Benchmark results comparing various prompting methods with and without ATF. The methods tested include standard, instructed, chain-of-thought (with and without examples), and least-to-most prompting. GSM8K-SLC is the GSMIR dataset without the irrelevant information. The study presents two tables but does not make clear how they differ. Most likely, the upper table shows results for GPT-3.5-Turbo-16k and the lower table those for GPT-3.5-Turbo, and the labeling in the paper is incorrect. Both tables show that ATF consistently improved accuracy across all prompting methods on tasks containing irrelevant information. | Image: Jiang et al.

The smallest improvement came when ATF was combined with Standard Prompting (SP), where accuracy increased by only 3.3 percentage points. The researchers suggest that this is because SP's accuracy on the original questions was already very low at 18.5%, with most wrong answers likely caused by calculation errors rather than by irrelevant information.

Because the ATF method is specifically designed to reduce the impact of irrelevant information, but not to improve the general computational ability of LLMs, the effect of ATF in combination with SP was limited.

With other prompting techniques, such as COT, which better support LLMs in correctly solving reasoning tasks, ATF was able to improve performance more significantly because irrelevant information accounted for a larger proportion of errors.


The study has some limitations. Experiments were conducted only with GPT-3.5, and the researchers only examined tasks containing a single piece of irrelevant information. In real-world scenarios, problem descriptions may contain multiple confounding factors.

In approximately 15% of cases, irrelevant information was not recognized as such. More than half of these instances involved "weak irrelevant information" that did not impact the model's ability to arrive at the correct answer.

This suggests that ATF is most effective for "strong irrelevant information" that significantly interferes with the reasoning process. Only 2.2% of cases saw relevant information incorrectly classified as irrelevant.
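The two kinds of mistakes mentioned here, missing the irrelevant sentence and wrongly flagging a relevant one, can be counted per problem. The sketch below shows one way to do that; the `Judgment` record, the `error_rates` function, and the four toy entries are hypothetical and do not reproduce the study's data.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    flagged: set[str]  # sentences the model marked as irrelevant
    distractor: str    # the sentence that was actually injected


def error_rates(judgments: list[Judgment]) -> tuple[float, float]:
    """Return (share of problems with a missed distractor,
    share of problems where a relevant sentence was flagged)."""
    n = len(judgments)
    missed = sum(j.distractor not in j.flagged for j in judgments)
    wrongly_flagged = sum(bool(j.flagged - {j.distractor}) for j in judgments)
    return missed / n, wrongly_flagged / n


# Toy data, invented purely to exercise the function.
toy = [
    Judgment(flagged={"d1"}, distractor="d1"),        # correct
    Judgment(flagged=set(), distractor="d2"),         # distractor missed
    Judgment(flagged={"d3", "s1"}, distractor="d3"),  # relevant sentence flagged
    Judgment(flagged={"d4"}, distractor="d4"),        # correct
]

miss_rate, false_flag_rate = error_rates(toy)
print(miss_rate, false_flag_rate)
```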

Despite these limitations, the study shows that language models' logical reasoning abilities can be enhanced by filtering out irrelevant information through prompt engineering. While the ATF method could help LLMs better handle noisy real-world data, it does not address their fundamental weaknesses in logic.

Join our community
Join the DECODER community on Discord, Reddit or Twitter - we can't wait to meet you.
Summary
  • Researchers at Guilin University of Electronic Technology have developed a technique that helps large language models (LLMs) identify and remove irrelevant information in text-based tasks, significantly improving their reasoning capabilities.
  • The two-step "Analysis to Filtration Prompting" (ATF) method first analyzes the task and identifies irrelevant information by examining each sub-sentence. It then filters out this information before the model begins its reasoning process. When combined with Chain-of-Thought Prompting (COT), the accuracy of GPT-3.5-Turbo improved by nearly 25 percentage points, from 50.2% to 74.9%.
  • The study has limitations. Only GPT-3.5 variants were tested, and the tasks each contained only one piece of irrelevant information. Real-world scenarios often involve multiple confounding factors.
Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.