I was reading about the challenges that large language models (LLMs) face despite their impressive progress in recent years, and I came across the research paper Not All LLM Reasoners Are Created Equal by Arian Hosseini, Alessandro Sordoni, Daniel Toyama, Aaron Courville, and Rishabh Agarwal, a collaboration between Mila, Google DeepMind, and Microsoft Research. The paper examines complex reasoning in LLMs.
Talking about the progress: Large language models (LLMs) have made life easier for students, working professionals, and many others by handling complex tasks such as high school and college-level math problems. This impressive performance has led many to believe that LLMs have also mastered simpler grade-school math, as measured by benchmarks like GSM8K. However, digging deeper into their abilities reveals a different story, particularly for the smaller, more cost-efficient models. While seemingly powerful, smaller LLMs show surprising weaknesses when tested on more complex problems requiring multi-step reasoning.
The study assessed how well LLMs can solve math problems that build on one another, where the solution to one problem directly feeds into the next. This type of evaluation goes beyond standard single-question tests and exposes the limitations of LLMs, particularly the smaller ones. The results showed a significant performance drop when these models were tasked with solving paired problems compared to solving each problem independently. Surprisingly, the gap was most prominent in smaller, specialized models, which are often praised for efficiency and speed. While they perform well on simple tasks, their ability to handle multi-step or compositional reasoning problems is limited, making them less reliable in real-world applications.
The research explains why smaller LLMs, despite being efficient and successful in basic tasks, struggle with complex reasoning. One major reason is that these models get distracted by additional context. They also have difficulty with “second-hop reasoning,” which involves using the solution of the first problem to inform the second. This weakness is not caused by common issues like test-set leakage, where models have seen test problems during training. Instead, it stems from their inability to maintain focus and logically connect different parts of a problem.
Instruction-tuning, where models are fine-tuned to follow human instructions, is a common strategy to improve performance. However, its effectiveness varies across different model sizes. Smaller models show inconsistent improvements, indicating that their training methods may need adjustment. When fine-tuned on grade-school math problems, smaller models often overfit, becoming too specialized to the training data and failing to generalize to new problems.
In summary, while smaller LLMs can offer good performance at a lower cost, their brittleness in handling complex, multi-step reasoning tasks limits their practical use, especially in scenarios requiring consistent, reliable performance across various problems.
Let X be the answer to Q1:
Q1: There are 27 unicorns left in the world. One-third of them are in the Scottish Highlands. Two-thirds of the Scottish unicorns are female. How many female Scottish unicorns are there? Solve it and use the value of X to solve Q2. Explain your answer step by step.
Q2: Zack’s locker is half as big as Timothy’s locker. Peter’s locker is 1/4 as big as Zack’s locker. If Peter’s locker is X cubic inches, how big is Timothy’s locker in cubic inches?
The answer to Question-1 (Q1) becomes the variable X in Question-2 (Q2). The model must solve the first question correctly in order to solve the second. The new final answer of Q2 is obtained by modifying its code-form solution and executing it.
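To make the chaining concrete, here is the worked arithmetic for the example above as a short Python sketch. The numbers follow directly from the two questions; the function names are purely illustrative and not from the paper.

```python
# Q1: 27 unicorns; one-third live in the Scottish Highlands; two-thirds of those are female.
def solve_q1():
    total_unicorns = 27
    scottish = total_unicorns // 3        # 9 unicorns in the Highlands
    female_scottish = scottish * 2 // 3   # 6 female Scottish unicorns
    return female_scottish

# Q2: Peter's locker is X cubic inches and 1/4 as big as Zack's; Zack's is half of Timothy's.
def solve_q2(x):
    zack = x * 4        # Peter's locker is 1/4 of Zack's, so Zack's is 4 times Peter's
    timothy = zack * 2  # Zack's locker is half of Timothy's, so Timothy's is twice Zack's
    return timothy

x = solve_q1()      # X = 6
print(solve_q2(x))  # Timothy's locker is 48 cubic inches
```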
According to the graph presented in the paper:

The graph highlights a critical reasoning gap in current models. Although models can achieve high accuracy on individual reasoning questions (GSM8K), their performance degrades significantly when those questions are chained together compositionally. This suggests that improving models' ability to handle compositional reasoning tasks is a key challenge in advancing machine reasoning capabilities.

The graph compares both open-weight models, whose weights anyone can download and study, and closed-source models, which only their creators can access. Its main focus is the reasoning gap: how far each model's performance on chained reasoning tasks falls short of what its performance on the individual questions would predict. Whether a model is open or closed turns out not to matter much here.

What does matter is scale and specialization: smaller, math-specialized, and cost-efficient models tend to show larger reasoning gaps, suggesting they do not generalize well across broader reasoning tasks. In contrast, larger models like GPT-4o and others in the LLAMA or GPT family tend to perform better across the board in reasoning tasks, narrowing the gap.
The exploration of compositional grade-school math (GSM) in the research context offers a deeper insight into the challenges large language models (LLMs) face when solving interconnected reasoning problems. Each question in compositional GSM consists of two parts: Question-1 and Question-2. The answer to Question-1 becomes a variable, referred to as X, used in solving Question-2. This unique design forces models to maintain consistency and accuracy across chained questions, adding complexity to the task beyond traditional single-question formats. Researchers ensure that the modified questions remain logical and practical by verifying them through large-scale generation and manual review processes.
A core concept introduced in this study is the Reasoning Gap, which quantifies the discrepancy between expected model performance on individual tasks and their performance on compositional tasks. The reasoning gap is calculated as:
Δ = S_comp − (S_1 × S_2)
where S_comp represents the model's accuracy on compositional tasks, while S_1 and S_2 represent the accuracies on the respective components (Question-1 and Question-2). A large negative reasoning gap indicates that the model struggles to maintain performance when reasoning tasks are chained together.
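As a rough sketch of how this gap is computed from measured accuracies (the numbers below are made-up placeholders for illustration, not results from the paper):

```python
def reasoning_gap(s_comp: float, s1: float, s2: float) -> float:
    """Delta = S_comp - S_1 * S_2: observed compositional accuracy minus the
    accuracy expected if the two questions were answered independently."""
    return s_comp - s1 * s2

# Hypothetical accuracies, for illustration only
s1, s2 = 0.85, 0.80   # accuracy on Question-1 and Question-2 asked in isolation
s_comp = 0.55         # accuracy when the two questions are chained together
print(round(reasoning_gap(s_comp, s1, s2), 2))  # -0.13 -> a negative gap signals brittleness
```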
Cheaper models (low-cost) generally perform similarly to their high-cost counterparts on the simpler original GSM8K test. However, they struggle significantly more with the compositional GSM test. The reasoning gap increases for cheaper models. This indicates that cost-efficient LLMs may handle simpler tasks well but are less capable of managing more complex, compositional reasoning tasks.
The experiments were conducted using various models, such as GPT-4o, LLAMA, Gemini, and Mistral, to assess their ability to solve three test sets: the original GSM8K, the modified GSM8K (with the substitution of X), and the compositional GSM. The models were tested using an 8-shot prompt strategy, as outlined in Zhang et al. (2024), with the same approach applied to both the original and modified GSM8K test sets. A similar prompt was developed for the compositional GSM test set to maintain consistency across the experiments. The study evaluated a variety of models, including GPT-4o, GPT-4o mini, LLAMA3, Phi, Gemini, Gemma2, Mistral, and math-specialized models like Numina-7B and Mathstral-7B.
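As a hedged sketch of what such a setup might look like in practice (the exemplar text and formatting below are placeholders, not the actual prompt from Zhang et al. (2024) or the paper):

```python
# Minimal sketch of assembling a few-shot prompt for the compositional GSM setting.
FEW_SHOT_EXEMPLARS = [
    # Each exemplar pairs a chained (Q1, Q2) problem with a worked, step-by-step solution.
    ("Q1: ...\nQ2 (uses X): ...", "Step-by-step solution ending with the final answer."),
    # ... seven more exemplars would complete an 8-shot prompt
]

def build_prompt(q1: str, q2: str) -> str:
    """Concatenate the exemplars with the target compositional question."""
    shots = "\n\n".join(f"{q}\n{a}" for q, a in FEW_SHOT_EXEMPLARS)
    target = (
        "Let X be the answer to Q1.\n"
        f"Q1: {q1}\n"
        "Solve it and use the value of X to solve Q2.\n"
        f"Q2: {q2}"
    )
    return f"{shots}\n\n{target}"
```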
The research highlights three key findings:
Large language models (LLMs) have shown difficulty handling compositional tasks, especially in mathematical problem-solving such as GSM8K. A prevalent hypothesis attributes these struggles to benchmark leakage, which occurs when models are exposed to test data during training and can artificially inflate performance metrics. Studies indicate that leakage may lead to overestimating LLMs' abilities on mathematical tasks, as seen in models evaluated on GSM1K or variations of MATH problems. To determine whether leakage explains the compositional failures, the authors compared LLMs' ability to solve the modified GSM tasks against the original GSM8K benchmark. The results suggest that leakage is not the primary issue, as models displayed similar accuracy across both versions.
Moreover, the core of the problem lies in how LLMs handle multi-step reasoning and maintain context. The study notes several critical areas where models falter, including getting distracted by additional context, struggling with second-hop reasoning, and failing to stay focused across the chained problem.
This analysis underscores the need for more robust methods of improving compositional reasoning in LLMs. Current approaches, such as instruction tuning and math specialization, offer some benefits. However, they are insufficient to address the reasoning gaps in compositional tasks. Researchers may need to rethink how models are trained. The focus should be on developing more generalized reasoning abilities rather than optimizing for specific benchmarks.
Furthermore, the study points to alternative techniques such as code-based reasoning, in which models generate executable code to solve problems. While this approach shows promise, especially for smaller models, the broader challenge remains: how can we ensure that LLMs maintain coherence and accuracy across complex, multi-step reasoning tasks? A sketch of what code-based reasoning looks like follows below.
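In this sketch, instead of producing a free-form textual chain of thought, the model is prompted to emit a short Python program whose execution yields the final answer. The generated program here is hand-written for the unicorn/locker example above; in practice it would come from a model call.

```python
# Illustrative "code-form" solution a model might generate for the Q1/Q2 example.
generated_program = """
scottish = 27 // 3                  # Q1: unicorns in the Scottish Highlands
female_scottish = scottish * 2 // 3
x = female_scottish                 # answer to Q1 feeds into Q2 as X
zack = x * 4                        # Peter's locker is 1/4 of Zack's
timothy = zack * 2                  # Zack's locker is half of Timothy's
answer = timothy
"""

namespace = {}
exec(generated_program, namespace)  # execute the model's code-form solution
print(namespace["answer"])          # 48
```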
Smaller LLMs, while efficient and effective for simple tasks, struggle with complex, multi-step reasoning, especially in compositional tasks where answers must be linked across questions. This “reasoning gap” limits their reliability in real-world applications. Larger models like GPT-4o perform better but at a higher cost, highlighting the need for improved training methods to enhance reasoning abilities in smaller, more cost-effective models.
In conclusion, this research sheds light on the limitations of current LLMs in handling compositional reasoning tasks. As LLMs continue to evolve, addressing the reasoning gap in compositional GSM will be crucial for advancing their ability to tackle more complex and interconnected problems in real-world applications.
Q1. How well do LLMs handle math problems?
Ans. LLMs, or Large Language Models, excel at handling tasks like high school and college-level math problems. However, while they perform well on simple math tasks, they often struggle with complex, multi-step reasoning tasks, especially smaller, cost-efficient models.
Q2. What is compositional reasoning, and why do smaller LLMs struggle with it?
Ans. Compositional reasoning requires solving interconnected problems where the solution to one part impacts the next. Smaller LLMs struggle with “second-hop reasoning,” which involves using an earlier solution to solve subsequent parts, leading to errors in multi-step problems.
Q3. How do smaller models compare with larger ones on compositional tasks?
Ans. Smaller models are often less capable of handling compositional reasoning tasks, showing significant performance drops when required to link answers across multiple steps. Larger models like GPT-4o perform better but come with higher computational costs.
Q4. What is the reasoning gap?
Ans. The reasoning gap measures the discrepancy between a model’s performance on individual tasks and its performance on compositional tasks. A large reasoning gap indicates the model struggles to maintain performance when reasoning tasks are chained together.
Q5. How can the reasoning abilities of LLMs be improved?
Ans. Researchers suggest that training methods need to be improved. Techniques like instruction-tuning and math specialization help but aren’t enough. One possible path forward for enhancing multi-step reasoning capabilities is code-based reasoning, where models generate executable code to solve problems.