Do Smaller Models Often Struggle with Higher-Order Thinking?

Pankaj Singh Last Updated : 18 Oct, 2024
11 min read

I was reading about the challenges that large language models (LLMs) face despite their impressive progress in recent years, and I came across the research paper Not All LLM Reasoners Are Created Equal by Arian Hosseini, Alessandro Sordoni, Daniel Toyama, Aaron Courville, and Rishabh Agarwal, from Mila, Google DeepMind, and Microsoft Research. The paper examines complex reasoning in LLMs.

First, the progress: large language models (LLMs) have made life easier for students, working professionals, and many others by handling complex tasks such as high-school and college-level math problems. This impressive performance has led many to believe that LLMs have also mastered simpler grade-school math, as measured by benchmarks like GSM8K. Digging deeper into their abilities, however, reveals a different story, particularly for the smaller, more cost-efficient models. While seemingly powerful, these smaller LLMs show surprising weaknesses when tested on problems that require multi-step reasoning.

The study assessed how well LLMs can solve math problems that build on one another, where the solution to one problem directly feeds into the next. This type of evaluation goes beyond standard single-question tests and exposes the limitations of LLMs, particularly the smaller ones. The results showed a significant performance gap when these models were tasked with solving paired problems compared to solving the same problems independently. Surprisingly, the gap was most prominent in smaller, specialized models that are often praised for their efficiency and speed. While they perform well on simple tasks, their ability to handle multi-step, compositional reasoning problems is limited, making them less reliable in real-world applications.

Complex Reasoning in LLMs: Do Smaller Models Struggle?

Overview

  • Smaller LLMs struggle with complex multi-step reasoning tasks.
  • Performance drops significantly when LLMs handle interconnected problems.
  • Instruction-tuning provides inconsistent improvements for smaller models.
  • Reasoning gaps limit smaller models’ reliability in real-world applications.
  • Math-specialized models still face difficulties with compositional reasoning.
  • Improving multi-step reasoning requires better training approaches.

Why Do Smaller LLMs Struggle with Complex Reasoning?

The research explains why smaller LLMs, despite being efficient and successful in basic tasks, struggle with complex reasoning. One major reason is that these models get distracted by additional context. They also have difficulty with “second-hop reasoning,” which involves using the solution of the first problem to inform the second. This weakness is not caused by common issues like test-set leakage, where models have seen test problems during training. Instead, it stems from their inability to maintain focus and logically connect different parts of a problem.

Instruction-tuning, where models are fine-tuned to follow human instructions, is a common strategy to improve performance. However, its effectiveness varies across different model sizes. Smaller models show inconsistent improvements, indicating that their training methods may need adjustment. When fine-tuned on grade-school math problems, smaller models often overfit, becoming too specialized to the training data and failing to generalize to new problems.

In summary, while smaller LLMs can offer good performance at a lower cost, their brittleness in handling complex, multi-step reasoning tasks limits their practical use, especially in scenarios requiring consistent, reliable performance across various problems.

GSM8K accuracy vs. compositional GSM accuracy
Source: Link

Example Problem from the Compositional GSM Test

Let X be the answer to the Q1: 

Q1: There are 27 unicorns left in the world. One-third of them are in the Scottish Highlands. Two-thirds of the Scottish unicorns are female. How many female Scottish unicorns are there? Solve it and use the value of X to solve Q2. Explain your answer step by step. 

Q2: Zack’s locker is half as big as Timothy’s locker. Peter’s locker is 1/4 as big as Zack’s locker. If Peter’s locker is X cubic inches, how big is Timothy’s locker in cubic inches?

The answer to Question-1 (Q1) becomes the variable X in Question-2 (Q2). The model must solve the first question correctly in order to solve the second. The new final answer for Q2 is calculated by substituting X into its code-form solution and executing it.
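To make the chaining concrete, here is a minimal Python sketch (mine, not the paper's actual code or data) of how such a pair can be scored: Q1 is solved first, its answer is substituted as X into a code-form solution of Q2, and the modified code is executed to obtain the new ground-truth final answer.

```python
# Minimal sketch of how a compositional GSM pair can be constructed and scored.
# The code-form solutions below are illustrative, not taken from the dataset.

def solve_q1() -> int:
    # Q1: 27 unicorns; one-third are in the Scottish Highlands;
    # two-thirds of the Scottish unicorns are female.
    unicorns = 27
    scottish = unicorns // 3          # 9
    return scottish * 2 // 3          # 6 female Scottish unicorns

def solve_q2(x: int) -> int:
    # Q2: Peter's locker is X cubic inches and is 1/4 as big as Zack's;
    # Zack's locker is half as big as Timothy's.
    zack = x * 4
    timothy = zack * 2
    return timothy

X = solve_q1()                        # answer to Q1 becomes the variable X
new_final_answer = solve_q2(X)        # executing the modified Q2 solution
print(X, new_final_answer)            # 6 48
```

A model is credited with the compositional problem only if its final answer matches this new final answer, which requires getting Q1 right along the way.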

According to the given graph:

  1. GSM8K Accuracy: This represents the performance of models on the GSM8K dataset, a standard reasoning benchmark consisting of single-question problems. The score on this axis is the geometric mean of the model’s accuracies on the two individual components of each pair, S1 and S2.
  2. Compositional GSM Accuracy: This is a more challenging task where two questions from the GSM8K dataset are chained together. The answer to the first question (Q1) becomes a variable in the second question (Q2). For a model to get a compositional GSM problem correct, it must answer both questions correctly, so the expected accuracy (if the two questions were independent) is S1 × S2; a minimal numeric sketch follows this list.
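For a concrete sense of these two axes, here is a tiny numeric sketch; the accuracies are made up for illustration and are not figures from the paper.

```python
import math

s1, s2 = 0.92, 0.88          # hypothetical accuracies on Question-1 and Question-2

gsm8k_axis = math.sqrt(s1 * s2)      # geometric mean plotted on the GSM8K axis
expected_compositional = s1 * s2     # the dashed y = x^2 reference curve

print(f"x = {gsm8k_axis:.3f}, expected compositional accuracy = {expected_compositional:.3f}")
# A model sitting on the dashed curve would score exactly s1 * s2 on compositional GSM;
# points below the curve signal a reasoning gap.
```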

Key Observations

  • Most models fall below the y = x² trend line (dashed curve): This line shows the expected performance if a model’s compositional accuracy were the product of its accuracies on Q1 and Q2. Most points falling below it suggest a reasoning gap: models struggle more with compositional tasks than their individual GSM8K accuracies predict.
  • Better performance on single tasks than on compositional tasks: The graph shows that models perform well on GSM8K, but their performance declines on compositional questions. Even as GSM8K accuracy nears 100%, compositional GSM accuracy remains lower.
  • Outliers with high compositional accuracy: Models like GPT-4o, Gemini 1.5 Pro, and Qwen2.5-MATH-72B-IT excel in both GSM8K and compositional GSM, indicating superior reasoning accuracy across chained problems.
  • Models with lower compositional GSM accuracy: Models like Mistral-7B-PT and Phi-2 show a larger gap between their GSM8K and compositional GSM accuracy, suggesting their reasoning struggles with more complex, chained tasks.

The graph highlights a critical reasoning gap in current models. Although models can achieve high accuracy on individual reasoning questions (GSM8K), their performance significantly degrades when these questions are chained together in a compositional manner. This suggests that improving models’ ability to handle compositional reasoning tasks is a key challenge in advancing machine reasoning capabilities.

Reasoning Gap of Notable Open-Weight and Closed-Source LLMs

Reasoning Gap of notable open-weights and closed-source LLMs
Source: Link

The graph compares language models (like AI models that understand and generate text). Some of these models are “open-weight,” meaning anyone can use and study them, while others are “closed-source,” meaning only the creators can access them.

The graph’s main focus is the “reasoning gap.” It measures how much each model’s performance on compositional reasoning tasks falls short of the performance expected from its accuracy on the individual questions.

  • If a model has a more negative reasoning gap value, it performs worse on compositional reasoning tasks than its single-question accuracy would predict.
  • A reasoning gap closer to zero means the model largely preserves its performance when questions are chained.

Graph Analysis

The graph shows how large the reasoning gap is for each model; whether a model is open-weight or closed-source does not, by itself, determine where it lands.

  1. Phi-3-mini-4k-IT has the largest negative reasoning gap, meaning it performs the most poorly on reasoning tasks compared to the others. It is a smaller and more cost-efficient model.
  2. Gemma2-9B-IT and LLAMA3-8B-IT also show significant reasoning gaps, ranking just above the Phi models in terms of weaker performance.
  3. Qwen2.5-MATH-72B-IT shows much better performance, positioned closer to a reasoning gap of 0, indicating a strong performance, particularly in math-specialized tasks.
  4. GPT-4o, as expected, has the smallest reasoning gap (nearly 0), making it the most capable in reasoning tasks among the models listed.
  5. General Trend: Smaller and more cost-efficient models, particularly those specialised in mathematics (indicated by the light green bars), seem to have a larger reasoning gap (poorer performance). Larger, more powerful models like GPT-4o tend to close this gap, achieving much better reasoning results.

The chart shows that smaller, math-specialized, and cost-efficient models tend to have greater reasoning gaps, suggesting they may not generalise well across broader reasoning tasks. In contrast, larger models like GPT-4o and others in the LLAMA or GPT family tend to perform better across the board in reasoning tasks, narrowing the gap.

Compositional Grade-School Math (GSM) and Language Model Reasoning Gaps

Compositional Grade-School Math (GSM) and Language Model Reasoning Gaps
Source: Link

The exploration of compositional grade-school math (GSM) in the research context offers a deeper insight into the challenges large language models (LLMs) face when solving interconnected reasoning problems. Each question in compositional GSM consists of two parts: Question-1 and Question-2. The answer to Question-1 becomes a variable, referred to as X, used in solving Question-2. This unique design forces models to maintain consistency and accuracy across chained questions, adding complexity to the task beyond traditional single-question formats. Researchers ensure that the modified questions remain logical and practical by verifying them through large-scale generation and manual review processes.

A core concept introduced in this study is the Reasoning Gap, which quantifies the discrepancy between expected model performance on individual tasks and their performance on compositional tasks. The reasoning gap is calculated as:

Δ = S_comp − S1 × S2

where S_comp represents the model’s accuracy on compositional tasks, while S1 and S2 represent the accuracies on the respective components (Question-1 and Question-2). A significant reasoning gap indicates that the model struggles to maintain performance when chaining reasoning tasks together.
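As a quick illustration of the formula (with made-up numbers, not the paper's measurements), the gap is simply the measured compositional accuracy minus the product of the component accuracies:

```python
def reasoning_gap(s_comp: float, s1: float, s2: float) -> float:
    """Delta = S_comp - S1 * S2, as defined above."""
    return s_comp - s1 * s2

# Hypothetical values: the model solves each question alone fairly well,
# but chains them correctly only about half the time.
print(reasoning_gap(s_comp=0.55, s1=0.90, s2=0.85))   # -0.215, a sizeable negative gap
```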

Analysis per Model Family

  1. GPT (4o and 4o mini): Both versions perform similarly on the original GSM8K test, achieving around 90% accuracy. However, the low-cost version (4o mini) exhibits a more significant performance drop on the Compositional GSM test, with 14.2% lower accuracy compared to the high-cost version (4o), suggesting that it struggles more with complex reasoning tasks.
  2. Gemini (1.5 Pro and 1.5 Flash): Both Gemini models show slightly lower original GSM8K accuracy (about 80%), but the low-cost model (1.5 Flash) shows a more substantial performance drop (–11.3%) compared to the high-cost version (1.5 Pro, –5.8%).
  3. LLAMA3 (70B-IT and 8B-IT): The high-cost model (70B-IT) maintains a decent accuracy on both tests, with only a small gap of –4.9%. In contrast, the low-cost model (8B-IT) experiences a significant decline in performance, particularly on the compositional test, where it shows a 27.5% drop, indicating that compositional reasoning tasks are especially challenging for this more affordable variant.
  4. Gemma2 (27B-IT and 9B-IT): The Gemma2 models exhibit the most significant reasoning gaps. The low-cost version (9B-IT) sees a massive 37.3% drop in accuracy, while the high-cost version (27B-IT) also experiences a notable decline (18%).

Cheaper models (low-cost) generally perform similarly to their high-cost counterparts on the simpler original GSM8K test. However, they struggle significantly more with the compositional GSM test. The reasoning gap increases for cheaper models. This indicates that cost-efficient LLMs may handle simpler tasks well but are less capable of managing more complex, compositional reasoning tasks.

Experiment Results and Insights

GSM8K 8-shot Prompt – Source: Link

The experiments were conducted using various models, such as GPT-4o, LLAMA, Gemini, and Mistral, to assess their ability to solve three test sets: the original GSM8K, the modified GSM8K (with the substitution of X), and the compositional GSM. The models were tested using an 8-shot prompt strategy, as outlined in Zhang et al. (2024), with the same approach applied to both the original and modified GSM8K test sets. A similar prompt was developed for the compositional GSM test set to maintain consistency across the experiments. The study evaluated a variety of models, including GPT-4o, GPT-4o mini, LLAMA3, Phi, Gemini, Gemma2, Mistral, and math-specialized models like Numina-7B and Mathstral-7B.
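The exact prompt text comes from the paper and from Zhang et al. (2024); the sketch below only illustrates the general few-shot pattern, with placeholder exemplars that are my own assumption rather than the published prompt.

```python
# Sketch of assembling an 8-shot prompt for the compositional GSM test set.
# The exemplar wording and formatting are placeholders, not the published prompt.

EXEMPLARS = [
    {
        "question": (
            "Let X be the answer to the Q1:\n"
            "Q1: <worked example question 1> "
            "Solve it and use the value of X to solve Q2. "
            "Explain your answer step by step.\n"
            "Q2: <worked example question 2>"
        ),
        "answer": "<step-by-step solution ending in the final numeric answer>",
    },
    # ... seven more exemplars in the same format
]

def build_prompt(test_question: str) -> str:
    shots = "\n\n".join(
        f"Question: {ex['question']}\nAnswer: {ex['answer']}" for ex in EXEMPLARS
    )
    return f"{shots}\n\nQuestion: {test_question}\nAnswer:"
```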

Pretrained vs instruction tuned
Source: Link

The research highlights three key findings:

  1. Cost-Efficient and Smaller LLMs Struggle with Compositional Tasks: While smaller models, such as GPT-4o mini and Gemini 1.5 Flash, perform comparably on GSM8K benchmarks, they exhibit significantly larger reasoning gaps when faced with compositional GSM. These models, which are cost-efficient and optimized for standard benchmarks, seem to have reasoning weaknesses that become evident in more complex, multi-step problems.
  2. Instruction-Tuning Effects Vary by Model Size: Instruction-tuning boosts LLMs’ understanding of task-specific instructions, but its impact varies by model size. Smaller models show significant accuracy gains on GSM8K but struggle with compositional GSM tasks, while larger models perform more consistently, implying small models may be over-optimized for certain tasks.
  3. Math-Specialization Doesn’t Solve the Reasoning Gap: Math-focused models like Qwen2.5-Math and Numina-7B face similar reasoning gaps on compositional GSM as general-purpose models. Despite being tailored for complex math, they struggle to generalize from single questions to multi-step reasoning.

Why Do LLMs Struggle with Compositional GSM?

Large language models (LLMs) have shown difficulty handling compositional tasks, especially in mathematical problem-solving benchmarks such as GSM8K. A prevalent hypothesis attributes these struggles to benchmark leakage, which occurs when models are exposed to test data during training and which can artificially inflate performance metrics. Studies indicate that leakage may lead to overestimating LLMs’ abilities on mathematical tasks, as seen when models are evaluated on GSM1K or variations of MATH problems. To determine whether leakage explains the compositional results, the authors compared LLMs’ performance on modified GSM tasks with the original GSM8K benchmark. The results suggest that leakage is not the primary issue, as models displayed similar accuracy across both versions.

Moreover, the core of the problem lies in how LLMs handle multi-step reasoning and maintain context. The study notes several critical areas where models falter:

  • Overfitting to Benchmarks: Many models perform well on established benchmarks like GSM8K but struggle when presented with modified or compositional questions. This suggests that models may be overfitting to specific datasets rather than learning generalized reasoning skills.
  • Distraction by Context: LLMs can be easily distracted when presented with irrelevant or additional context. For example, even when models correctly solve Question-1, they often fail to use this information accurately in Question-2, leading to incorrect final answers.
  • Lack of Transfer Between Subtasks: Solving Question-1 doesn’t guarantee a correct solution for Question-2. Many models exhibit a gap between solving the first part of a compositional problem and effectively using the result to solve the second part. This failure reveals a disconnect in the model’s ability to transfer reasoning across chained tasks; a rough diagnostic sketch of this distinction follows this list.
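To separate these failure modes in practice, an evaluation script could compare the model's intermediate Q1 answer and its final answer against the gold values. This is a rough sketch of that bookkeeping, not the paper's evaluation code.

```python
# Rough sketch for classifying compositional GSM failures.
# "Second-hop failure" means Q1 was solved but X was not carried into Q2 correctly.

def classify_failure(q1_pred: int, q1_gold: int, final_pred: int, final_gold: int) -> str:
    if final_pred == final_gold:
        return "correct"
    if q1_pred != q1_gold:
        return "first-hop failure (Q1 wrong)"
    return "second-hop failure (Q1 right, X misused in Q2)"

# Example: the model finds X = 6 correctly but answers 24 instead of 48 for Q2.
print(classify_failure(q1_pred=6, q1_gold=6, final_pred=24, final_gold=48))
```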

Implications for Future Research

This analysis underscores the need for more robust methods of improving compositional reasoning in LLMs. Current approaches, such as instruction tuning and math specialization, offer some benefits. However, they are insufficient to address the reasoning gaps in compositional tasks. Researchers may need to rethink how models are trained. The focus should be on developing more generalized reasoning abilities rather than optimizing for specific benchmarks.

Furthermore, the study suggests alternative techniques, such as code-based reasoning, in which models generate executable code to solve problems. This approach could offer a path forward, and it shows promise especially for smaller models, but the broader challenge remains: how can we ensure that LLMs maintain coherence and accuracy across complex, multi-step reasoning tasks?
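As a rough sketch of what code-based reasoning looks like in practice, the model emits a short Python program instead of a prose chain of thought, and a harness executes it to read off the final answer. Here, `call_model` is a stand-in for whatever LLM client you use, and the generated program is invented for this example.

```python
# Sketch of the code-based reasoning pattern: the model writes a program,
# and the harness executes it to obtain the final answer.

def call_model(prompt: str) -> str:
    # Stand-in for an actual LLM call; imagine it returns generated Python code.
    return (
        "scottish = 27 // 3\n"
        "X = scottish * 2 // 3\n"
        "zack = X * 4\n"
        "answer = zack * 2\n"
    )

code = call_model("Write Python that solves Q1 and Q2 above; store the result in `answer`.")
namespace = {}
exec(code, namespace)          # in practice, model-generated code should run in a sandbox
print(namespace["answer"])     # 48 for the unicorn/locker example
```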

Conclusion

Smaller LLMs, while efficient and effective for simple tasks, struggle with complex, multi-step reasoning, especially in compositional tasks where answers must be linked across questions. This “reasoning gap” limits their reliability in real-world applications. Larger models like GPT-4 perform better but at a higher cost, highlighting the need for improved training methods to enhance reasoning abilities in smaller, more cost-effective models.

In conclusion, this research sheds light on the limitations of current LLMs in handling compositional reasoning tasks. As LLMs continue to evolve, addressing the reasoning gap in compositional GSM will be crucial for advancing their ability to tackle more complex and interconnected problems in real-world applications.

If you are looking for a Generative AI course online, explore the GenAI Pinnacle Program.

Frequently Asked Questions

Q1. What are LLMs, and how do they perform on simple vs. complex math problems?

Ans. LLMs, or Large Language Models, excel at handling tasks like high school and college-level math problems. However, while they perform well on simple math tasks, they often struggle with complex, multi-step reasoning tasks, especially smaller, cost-efficient models.

Q2. What is compositional reasoning, and why is it challenging for LLMs?

Ans. Compositional reasoning requires solving interconnected problems where the solution to one part impacts the next. Smaller LLMs struggle with “second-hop reasoning,” which involves using an earlier solution to solve subsequent parts, leading to errors in multi-step problems.

Q3. How do smaller LLMs compare to larger models in handling compositional tasks?

Ans. Smaller models are often less capable of handling compositional reasoning tasks, showing significant performance drops when required to link answers across multiple steps. Larger models like GPT-4 perform better but come with higher computational costs.

Q4. What is the ‘Reasoning Gap’ in the context of LLMs?

Ans. The reasoning gap measures the discrepancy between a model’s performance on individual tasks and its performance on compositional tasks. A large reasoning gap indicates the model struggles to maintain accuracy when reasoning steps must be chained together.

Q5. What solutions have researchers suggested to improve LLMs’ compositional reasoning?

Ans. Researchers suggest that training methods need to be improved. Techniques like instruction-tuning and math specialization help but aren’t enough. One possible path forward for enhancing multi-step reasoning capabilities is code-based reasoning, where models generate executable code to solve problems.

Hi, I am Pankaj Singh Negi - Senior Content Editor | Passionate about storytelling and crafting compelling narratives that transform ideas into impactful content. I love reading about technology revolutionizing our lifestyle.
