Researchers have conducted a systematic analysis of large language models' capabilities for inductive and deductive reasoning. The study reveals both surprising strengths and clear limitations of these AI systems.
Scientists from the University of California, Los Angeles and Amazon have carried out a new study examining the reasoning abilities of large language models (LLMs) in greater detail. According to the researchers, it is the first to systematically distinguish between inductive and deductive reasoning.
Inductive reasoning involves deriving general rules from specific observations, while deductive reasoning applies general rules to particular cases. The study aimed to determine which type of reasoning poses a greater challenge for LLMs.
To isolate inductive reasoning, the researchers developed a new method called "SolverLearner." In this approach, the model infers a function that maps inputs to outputs from a handful of examples. An external program then executes that function, so the deductive step of applying the rule never happens inside the model (see the sketch below).
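A minimal sketch of this separation, not the authors' actual implementation: the `llm_propose_function` stub below stands in for a real LLM call, and the examples and the proposed rule are invented for illustration. The point is that the model only produces the rule, while an ordinary Python interpreter applies it.

```python
# Sketch of the SolverLearner separation (illustrative, not the authors' code):
# the language model only *induces* a rule from input-output examples;
# an external interpreter *applies* it, so the deductive step never runs
# inside the model. `llm_propose_function` is a hypothetical stand-in
# for prompting a real LLM.

def llm_propose_function(examples):
    """Stand-in for an LLM call: given input-output pairs, return the
    source code of a function f that maps inputs to outputs."""
    # Hardcoded here: a plausible model answer for the examples below.
    return "def f(x):\n    return 2 * x + 1\n"

# Few-shot examples shown to the model (inductive phase).
examples = [(1, 3), (4, 9), (10, 21)]

# The proposed rule is executed externally (deductive phase outside the LLM).
namespace = {}
exec(llm_propose_function(examples), namespace)
f = namespace["f"]

# Verify the induced rule against the examples, then apply it to a new input.
assert all(f(x) == y for x, y in examples)
print(f(7))  # 15
```

Because the rule is executed externally, any error in applying it cannot be blamed on the model's deductive abilities, which is what lets the method measure induction in isolation.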
The results show that LLMs like GPT-4 achieve near-perfect performance in inductive reasoning using SolverLearner, with 100 percent accuracy in most cases. However, the models struggle more with deductive reasoning, especially on "counterfactual" tasks that deviate from typical training data.
For instance, the models handled arithmetic in the decimal system well, but had difficulties when calculating in other number systems (see the example below). They also showed weaknesses when analyzing sentences with unusual word order and when reasoning about spatial relations in modified coordinate systems.
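To make the counterfactual arithmetic concrete, here is a small illustrative computation (the specific numbers are chosen for illustration, not taken from the paper's test set): the same surface expression evaluates differently once its digits are read in base 9, so a model that falls back on memorized decimal facts produces the wrong answer.

```python
# Illustration of a counterfactual arithmetic task: the same expression
# evaluated in base 10 versus base 9. The base-9 answer cannot be read
# off memorized decimal facts.

def add_in_base(a: str, b: str, base: int) -> str:
    """Add two numbers given as digit strings in the given base
    (digits 0-9 only, so bases up to 10)."""
    total = int(a, base) + int(b, base)
    digits = ""
    while total:
        digits = str(total % base) + digits
        total //= base
    return digits or "0"

print(add_in_base("76", "76", 10))  # 152  (the familiar decimal result)
print(add_in_base("76", "76", 9))   # 163  (the counterfactual base-9 result)
```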
The researchers conclude that deductive reasoning is the greater challenge for current LLMs. How reliably a model applies a given rule depends heavily on how often similar tasks appeared in its training data.
Language models, grokking and architectural adaptations
The study confirms both the strengths and limitations of current AI language models. It demonstrates that these systems have impressive abilities in recognizing patterns and deriving rules. However, they still struggle to correctly apply learned rules to new situations.
For the tests, the team did not use prompting techniques such as chain-of-thought, which improve the models' deductive inferences somewhat but, according to the researchers, not to a satisfactory level. OpenAI's new o1 model was not included in the evaluation.
A separate study by researchers from Ohio State University and Carnegie Mellon University also recently examined the logical reasoning capabilities of transformer models. It analyzed whether the models can acquire the ability to draw implicit inferences through "grokking," particularly in composition and comparison tasks.
The results indicate that, with extended training well beyond the point of overfitting, the models do acquire the ability to make implicit inferences in both task types. However, they generalized to unseen examples only in the comparison tasks. The researchers attribute this difference to the internal structure of the learned circuits and recommend adjustments to the transformer architecture; in an initial experiment, one such adjustment already produced a qualitative improvement.
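The two task types can be illustrated with a toy example (the facts and names below are invented for illustration, not taken from the study): a composition query chains two stored facts into a two-hop inference, while a comparison query relates two entities through a stored attribute.

```python
# Toy illustration of the two task types examined in the grokking study
# (facts and names are invented for illustration only).

# Atomic facts a transformer would be trained to store.
spouse = {"Alice": "Bob"}                   # relation 1
birth_year = {"Bob": 1970, "Carol": 1985}   # relation 2 / attribute

# Composition task: chain two facts ("In which year was Alice's spouse born?")
# -> requires an implicit two-hop inference over stored facts.
print(birth_year[spouse["Alice"]])  # 1970

# Comparison task: relate two entities via an attribute
# ("Is Bob older than Carol?").
print(birth_year["Bob"] < birth_year["Carol"])  # True -> Bob is older
```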