Do Smaller Models Often Struggle with Higher-Order Thinking?

Pankaj Singh Last Updated : 18 Oct, 2024
11 min read

I was reading about the challenges that large language models (LLMs) face despite their impressive progress in recent years, and I came across the research paper Not All LLM Reasoners Are Created Equal by Arian Hosseini, Alessandro Sordoni, Daniel Toyama, Aaron Courville, and Rishabh Agarwal, from Mila, Google DeepMind, and Microsoft Research. The paper examines complex reasoning in LLMs.

First, the progress: large language models (LLMs) have made life easier for students, working professionals, and many others by handling complex tasks such as high-school and college-level math problems. This impressive performance has led many to believe that LLMs have also mastered simpler grade-school math, as measured by benchmarks like GSM8K. Digging deeper into their abilities, however, reveals a different story, particularly for the smaller, more cost-efficient models. While seemingly powerful, these smaller LLMs show surprising weaknesses when tested on problems that require multi-step reasoning.

The study assessed how well LLMs can solve math problems that build on one another, where the solution to one problem directly feeds into the next. This type of evaluation goes beyond standard single-question tests and exposes the limitations of LLMs, particularly the smaller ones. The results showed a significant performance gap when these models were tasked with solving paired problems compared to solving the same problems independently. Surprisingly, the gap was most prominent in smaller, specialized models that are often praised for their efficiency and speed. While they perform well on simple tasks, their ability to handle multi-step, compositional reasoning problems is limited, making them less reliable in real-world applications.

Complex Reasoning in LLMs: Do Smaller Models Struggle?

Overview

  • Smaller LLMs struggle with complex multi-step reasoning tasks.
  • Performance drops significantly when LLMs handle interconnected problems.
  • Instruction-tuning provides inconsistent improvements for smaller models.
  • Reasoning gaps limit smaller models’ reliability in real-world applications.
  • Math-specialized models still face difficulties with compositional reasoning.
  • Improving multi-step reasoning requires better training approaches.

Why Do Smaller LLMs Struggle with Complex Reasoning?

The research explains why smaller LLMs, despite being efficient and successful in basic tasks, struggle with complex reasoning. One major reason is that these models get distracted by additional context. They also have difficulty with “second-hop reasoning,” which involves using the solution of the first problem to inform the second. This weakness is not caused by common issues like test-set leakage, where models have seen test problems during training. Instead, it stems from their inability to maintain focus and logically connect different parts of a problem.

Instruction-tuning, where models are fine-tuned to follow human instructions, is a common strategy to improve performance. However, its effectiveness varies across different model sizes. Smaller models show inconsistent improvements, indicating that their training methods may need adjustment. When fine-tuned on grade-school math problems, smaller models often overfit, becoming too specialized to the training data and failing to generalize to new problems.

In summary, while smaller LLMs can offer good performance at a lower cost, their brittleness in handling complex, multi-step reasoning tasks limits their practical use, especially in scenarios requiring consistent, reliable performance across various problems.

GSM8K accuracy vs. compositional GSM accuracy
Source: Link

Example Problem from the Compositional GSM Test

Let X be the answer to the Q1: 

Q1: There are 27 unicorns left in the world. One-third of them are in the Scottish Highlands. Two-thirds of the Scottish unicorns are female. How many female Scottish unicorns are there? Solve it and use the value of X to solve Q2. Explain your answer step by step. 

Q2: Zack’s locker is half as big as Timothy’s locker. Peter’s locker is 1/4 as big as Zack’s locker. If Peter’s locker is X cubic inches, how big is Timothy’s locker in cubic inches?

The answer to Question-1 (Q1) becomes the variable X in Question-2 (Q2). The model must solve the first question correctly in order to solve the second. The new final answer for Q2 is calculated by substituting X into its code-form solution and executing it.
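To make the chaining concrete, here is a minimal Python sketch (mine, not the paper's actual code or data) of how such a pair can be scored: Q1 is solved first, its answer is substituted as X into a code-form solution of Q2, and the modified code is executed to obtain the new ground-truth final answer.

```python
# Minimal sketch of how a compositional GSM pair can be constructed and scored.
# The code-form solutions below are illustrative, not taken from the dataset.

def solve_q1() -> int:
    # Q1: 27 unicorns; one-third are in the Scottish Highlands;
    # two-thirds of the Scottish unicorns are female.
    unicorns = 27
    scottish = unicorns // 3          # 9
    return scottish * 2 // 3          # 6 female Scottish unicorns

def solve_q2(x: int) -> int:
    # Q2: Peter's locker is X cubic inches and is 1/4 as big as Zack's;
    # Zack's locker is half as big as Timothy's.
    zack = x * 4
    timothy = zack * 2
    return timothy

X = solve_q1()                        # answer to Q1 becomes the variable X
new_final_answer = solve_q2(X)        # executing the modified Q2 solution
print(X, new_final_answer)            # 6 48
```

A model is credited with the compositional problem only if its final answer matches this new final answer, which requires getting Q1 right along the way.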

According to the given graph:

  1. GSM8K Accuracy: This represents the performance of models on the GSM8K dataset, a standard reasoning benchmark consisting of single-question problems. The score on this axis is the geometric mean of the model’s accuracies on the two individual components of each pair, S1 and S2.
  2. Compositional GSM Accuracy: This is a more challenging task where two questions from the GSM8K dataset are chained together. The answer to the first question (Q1) becomes a variable in the second question (Q2). For a model to get a compositional GSM problem correct, it must answer both questions correctly, so the expected accuracy (if the two questions were independent) is S1 × S2; a minimal numeric sketch follows this list.
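For a concrete sense of these two axes, here is a tiny numeric sketch; the accuracies are made up for illustration and are not figures from the paper.

```python
import math

s1, s2 = 0.92, 0.88          # hypothetical accuracies on Question-1 and Question-2

gsm8k_axis = math.sqrt(s1 * s2)      # geometric mean plotted on the GSM8K axis
expected_compositional = s1 * s2     # the dashed y = x^2 reference curve

print(f"x = {gsm8k_axis:.3f}, expected compositional accuracy = {expected_compositional:.3f}")
# A model sitting on the dashed curve would score exactly s1 * s2 on compositional GSM;
# points below the curve signal a reasoning gap.
```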

Key Observations

  • Most models fall below the y = x² trend line (dashed curve): This line shows the expected performance if a model’s compositional accuracy were the product of its accuracies on Q1 and Q2. Most points falling below it suggest a reasoning gap: models struggle more with compositional tasks than their individual GSM8K accuracies predict.
  • Better performance on single tasks than on compositional tasks: The graph shows that models perform well on GSM8K, but their performance declines on compositional questions. Even as GSM8K accuracy nears 100%, compositional GSM accuracy remains lower.
  • Outliers with high compositional accuracy: Models like GPT-4o, Gemini 1.5 Pro, and Qwen2.5-MATH-72B-IT excel in both GSM8K and compositional GSM, indicating superior reasoning accuracy across chained problems.
  • Models with lower compositional GSM accuracy: Models like Mistral-7B-PT and Phi-2 show a larger gap between their GSM8K and compositional GSM accuracy, suggesting their reasoning struggles with more complex, chained tasks.

The graph highlights a critical reasoning gap in current models. Although models can achieve high accuracy on individual reasoning questions (GSM8K), their performance significantly degrades when these questions are chained together in a compositional manner. This suggests that improving models’ ability to handle compositional reasoning tasks is a key challenge in advancing machine reasoning capabilities.

Reasoning Gap of Notable Open-Weight and Closed-Source LLMs

Reasoning Gap of notable open-weights and closed-source LLMs
Source: Link

The graph compares language models (like AI models that understand and generate text). Some of these models are “open-weight,” meaning anyone can use and study them, while others are “closed-source,” meaning only the creators can access them.

The graph’s main focus is the “reasoning gap.” It measures how much each model’s performance on compositional reasoning tasks falls short of the performance expected from its accuracy on the individual questions.

  • If a model has a more negative reasoning gap value, it performs worse on compositional reasoning tasks than its single-question accuracy would predict.
  • A reasoning gap closer to zero means the model largely preserves its performance when questions are chained.

Graph Analysis

The graph shows how large the reasoning gap is for each model; whether a model is open-weight or closed-source does not, by itself, determine where it lands.

  1. Phi-3-mini-4k-IT has the largest negative reasoning gap, meaning it performs the most poorly on reasoning tasks compared to the others. It is a smaller and more cost-efficient model.
  2. Gemma2-9B-IT and LLAMA3-8B-IT also show significant reasoning gaps, ranking just above the Phi models in terms of weaker performance.
  3. Qwen2.5-MATH-72B-IT shows much better performance, positioned closer to a reasoning gap of 0, indicating a strong performance, particularly in math-specialized tasks.
  4. GPT-4o, as expected, has the smallest reasoning gap (nearly 0), making it the most capable in reasoning tasks among the models listed.
  5. General Trend: Smaller and more cost-efficient models, particularly those specialised in mathematics (indicated by the light green bars), seem to have a larger reasoning gap (poorer performance). Larger, more powerful models like GPT-4o tend to close this gap, achieving much better reasoning results.

The chart shows that smaller, math-specialized, and cost-efficient models tend to have greater reasoning gaps, suggesting they may not generalise well across broader reasoning tasks. In contrast, larger models like GPT-4o and others in the LLAMA or GPT family tend to perform better across the board in reasoning tasks, narrowing the gap.

Compositional Grade-School Math (GSM) and Language Model Reasoning Gaps

Compositional Grade-School Math (GSM) and Language Model Reasoning Gaps
Source: Link

The exploration of compositional grade-school math (GSM) in the research context offers a deeper insight into the challenges large language models (LLMs) face when solving interconnected reasoning problems. Each question in compositional GSM consists of two parts: Question-1 and Question-2. The answer to Question-1 becomes a variable, referred to as X, used in solving Question-2. This unique design forces models to maintain consistency and accuracy across chained questions, adding complexity to the task beyond traditional single-question formats. Researchers ensure that the modified questions remain logical and practical by verifying them through large-scale generation and manual review processes.

A core concept introduced in this study is the Reasoning Gap, which quantifies the discrepancy between expected model performance on individual tasks and their performance on compositional tasks. The reasoning gap is calculated as:

Δ = S_comp − S1 × S2

where S_comp represents the model’s accuracy on compositional tasks, while S1 and S2 represent the accuracies on the respective components (Question-1 and Question-2). A significant reasoning gap indicates that the model struggles to maintain performance when chaining reasoning tasks together.
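As a quick illustration of the formula (with made-up numbers, not the paper's measurements), the gap is simply the measured compositional accuracy minus the product of the component accuracies:

```python
def reasoning_gap(s_comp: float, s1: float, s2: float) -> float:
    """Delta = S_comp - S1 * S2, as defined above."""
    return s_comp - s1 * s2

# Hypothetical values: the model solves each question alone fairly well,
# but chains them correctly only about half the time.
print(reasoning_gap(s_comp=0.55, s1=0.90, s2=0.85))   # -0.215, a sizeable negative gap
```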

Analysis per Model Family

  1. GPT (4o and 4o mini): Both versions perform similarly on the original GSM8K test, achieving around 90% accuracy. However, the low-cost version (4o mini) exhibits a more significant performance drop on the Compositional GSM test, with 14.2% lower accuracy compared to the high-cost version (4o), suggesting that it struggles more with complex reasoning tasks.
  2. Gemini (1.5 Pro and 1.5 Flash): Both Gemini models show slightly lower original GSM8K accuracy (about 80%), but the low-cost model (1.5 Flash) shows a more substantial performance drop (–11.3%) compared to the high-cost version (1.5 Pro, –5.8%).
  3. LLAMA3 (70B-IT and 8B-IT): The high-cost model (70B-IT) maintains a decent accuracy on both tests, with only a small gap of –4.9%. In contrast, the low-cost model (8B-IT) experiences a significant decline in performance, particularly on the compositional test, where it shows a 27.5% drop, indicating that compositional reasoning tasks are especially challenging for this more affordable variant.
  4. Gemma2 (27B-IT and 9B-IT): The Gemma2 models exhibit the most significant reasoning gaps. The low-cost version (9B-IT) sees a massive 37.3% drop in accuracy, while the high-cost version (27B-IT) also experiences a notable decline (18%).

Cheaper models (low-cost) generally perform similarly to their high-cost counterparts on the simpler original GSM8K test. However, they struggle significantly more with the compositional GSM test. The reasoning gap increases for cheaper models. This indicates that cost-efficient LLMs may handle simpler tasks well but are less capable of managing more complex, compositional reasoning tasks.

Experiment Results and Insights

GSM8K 8-shot Prompt – Source: Link

The experiments were conducted using various models, such as GPT-4o, LLAMA, Gemini, and Mistral, to assess their ability to solve three test sets: the original GSM8K, the modified GSM8K (with the substitution of X), and the compositional GSM. The models were tested using an 8-shot prompt strategy, as outlined in Zhang et al. (2024), with the same approach applied to both the original and modified GSM8K test sets. A similar prompt was developed for the compositional GSM test set to maintain consistency across the experiments. The study evaluated a variety of models, including GPT-4o, GPT-4o mini, LLAMA3, Phi, Gemini, Gemma2, Mistral, and math-specialized models like Numina-7B and Mathstral-7B.
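The exact prompt text comes from the paper and from Zhang et al. (2024); the sketch below only illustrates the general few-shot pattern, with placeholder exemplars that are my own assumption rather than the published prompt.

```python
# Sketch of assembling an 8-shot prompt for the compositional GSM test set.
# The exemplar wording and formatting are placeholders, not the published prompt.

EXEMPLARS = [
    {
        "question": (
            "Let X be the answer to the Q1:\n"
            "Q1: <worked example question 1> "
            "Solve it and use the value of X to solve Q2. "
            "Explain your answer step by step.\n"
            "Q2: <worked example question 2>"
        ),
        "answer": "<step-by-step solution ending in the final numeric answer>",
    },
    # ... seven more exemplars in the same format
]

def build_prompt(test_question: str) -> str:
    shots = "\n\n".join(
        f"Question: {ex['question']}\nAnswer: {ex['answer']}" for ex in EXEMPLARS
    )
    return f"{shots}\n\nQuestion: {test_question}\nAnswer:"
```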

Pretrained vs instruction tuned
Source: Link

The research highlights three key findings:

  1. Cost-Efficient and Smaller LLMs Struggle with Compositional Tasks: While smaller models, such as GPT-4o mini and Gemini 1.5 Flash, perform comparably on GSM8K benchmarks, they exhibit significantly larger reasoning gaps when faced with compositional GSM. These models, which are cost-efficient and optimized for standard benchmarks, seem to have reasoning weaknesses that become evident in more complex, multi-step problems.
  2. Instruction-Tuning Effects Vary by Model Size: Instruction-tuning boosts LLMs’ understanding of task-specific instructions, but its impact varies by model size. Smaller models show significant accuracy gains on GSM8K but struggle with compositional GSM tasks, while larger models perform more consistently, implying small models may be over-optimized for certain tasks.
  3. Math-Specialization Doesn’t Solve the Reasoning Gap: Math-focused models like Qwen2.5-Math and Numina-7B face similar reasoning gaps on compositional GSM as general-purpose models. Despite being tailored for complex math, they struggle to generalize from single questions to multi-step reasoning.

Why Do LLMs Struggle with Compositional GSM?

Large language models (LLMs) have shown difficulty handling compositional tasks, especially in mathematical problem-solving benchmarks such as GSM8K. A prevalent hypothesis attributes these struggles to benchmark leakage, which occurs when models are exposed to test data during training and which can artificially inflate performance metrics. Studies indicate that leakage may lead to overestimating LLMs’ abilities on mathematical tasks, as seen when models are evaluated on GSM1K or variations of MATH problems. To determine whether leakage explains the compositional results, the authors compared LLMs’ performance on modified GSM tasks with the original GSM8K benchmark. The results suggest that leakage is not the primary issue, as models displayed similar accuracy across both versions.

Moreover, the core of the problem lies in how LLMs handle multi-step reasoning and maintain context. The study notes several critical areas where models falter:

  • Overfitting to Benchmarks: Many models perform well on established benchmarks like GSM8K but struggle when presented with modified or compositional questions. This suggests that models may be overfitting to specific datasets rather than learning generalized reasoning skills.
  • Distraction by Context: LLMs can be easily distracted when presented with irrelevant or additional context. For example, even when models correctly solve Question-1, they often fail to use this information accurately in Question-2, leading to incorrect final answers.
  • Lack of Transfer Between Subtasks: Solving Question-1 doesn’t guarantee a correct solution for Question-2. Many models exhibit a gap between solving the first part of a compositional problem and effectively using the result to solve the second part. This failure reveals a disconnect in the model’s ability to transfer reasoning across chained tasks; a rough diagnostic sketch of this distinction follows this list.
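To separate these failure modes in practice, an evaluation script could compare the model's intermediate Q1 answer and its final answer against the gold values. This is a rough sketch of that bookkeeping, not the paper's evaluation code.

```python
# Rough sketch for classifying compositional GSM failures.
# "Second-hop failure" means Q1 was solved but X was not carried into Q2 correctly.

def classify_failure(q1_pred: int, q1_gold: int, final_pred: int, final_gold: int) -> str:
    if final_pred == final_gold:
        return "correct"
    if q1_pred != q1_gold:
        return "first-hop failure (Q1 wrong)"
    return "second-hop failure (Q1 right, X misused in Q2)"

# Example: the model finds X = 6 correctly but answers 24 instead of 48 for Q2.
print(classify_failure(q1_pred=6, q1_gold=6, final_pred=24, final_gold=48))
```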

Implications for Future Research

This analysis underscores the need for more robust methods of improving compositional reasoning in LLMs. Current approaches, such as instruction tuning and math specialization, offer some benefits. However, they are insufficient to address the reasoning gaps in compositional tasks. Researchers may need to rethink how models are trained. The focus should be on developing more generalized reasoning abilities rather than optimizing for specific benchmarks.

Furthermore, the study suggests alternative techniques, such as code-based reasoning, in which models generate executable code to solve problems. This approach could offer a path forward, and it shows promise especially for smaller models, but the broader challenge remains: how can we ensure that LLMs maintain coherence and accuracy across complex, multi-step reasoning tasks?
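As a rough sketch of what code-based reasoning looks like in practice, the model emits a short Python program instead of a prose chain of thought, and a harness executes it to read off the final answer. Here, `call_model` is a stand-in for whatever LLM client you use, and the generated program is invented for this example.

```python
# Sketch of the code-based reasoning pattern: the model writes a program,
# and the harness executes it to obtain the final answer.

def call_model(prompt: str) -> str:
    # Stand-in for an actual LLM call; imagine it returns generated Python code.
    return (
        "scottish = 27 // 3\n"
        "X = scottish * 2 // 3\n"
        "zack = X * 4\n"
        "answer = zack * 2\n"
    )

code = call_model("Write Python that solves Q1 and Q2 above; store the result in `answer`.")
namespace = {}
exec(code, namespace)          # in practice, model-generated code should run in a sandbox
print(namespace["answer"])     # 48 for the unicorn/locker example
```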

Conclusion

Smaller LLMs, while efficient and effective for simple tasks, struggle with complex, multi-step reasoning, especially in compositional tasks where answers must be linked across questions. This “reasoning gap” limits their reliability in real-world applications. Larger models like GPT-4 perform better but at a higher cost, highlighting the need for improved training methods to enhance reasoning abilities in smaller, more cost-effective models.

In conclusion, this research sheds light on the limitations of current LLMs in handling compositional reasoning tasks. As LLMs continue to evolve, addressing the reasoning gap in compositional GSM will be crucial for advancing their ability to tackle more complex and interconnected problems in real-world applications.

If you are looking for a Generative AI course online, explore the GenAI Pinnacle Program.

Frequently Asked Questions

Q1. What are LLMs, and how do they perform on simple vs. complex math problems?

Ans. LLMs, or Large Language Models, excel at handling tasks like high school and college-level math problems. However, while they perform well on simple math tasks, they often struggle with complex, multi-step reasoning tasks, especially smaller, cost-efficient models.

Q2. What is compositional reasoning, and why is it challenging for LLMs?

Ans. Compositional reasoning requires solving interconnected problems where the solution to one part impacts the next. Smaller LLMs struggle with “second-hop reasoning,” which involves using an earlier solution to solve subsequent parts, leading to errors in multi-step problems.

Q3. How do smaller LLMs compare to larger models in handling compositional tasks?

Ans. Smaller models are often less capable of handling compositional reasoning tasks, showing significant performance drops when required to link answers across multiple steps. Larger models like GPT-4 perform better but come with higher computational costs.

Q4. What is the ‘Reasoning Gap’ in the context of LLMs?

Ans. The reasoning gap measures the discrepancy between a model’s performance on individual tasks and its performance on compositional tasks. A large reasoning gap indicates the model struggles to maintain accuracy when reasoning steps must be chained together.

Q5. What solutions have researchers suggested to improve LLMs’ compositional reasoning?

Ans. Researchers suggest that training methods need to be improved. Techniques like instruction-tuning and math specialization help but aren’t enough. One possible path forward for enhancing multi-step reasoning capabilities is code-based reasoning, where models generate executable code to solve problems.

Hi, I am Pankaj Singh Negi - Senior Content Editor | Passionate about storytelling and crafting compelling narratives that transform ideas into impactful content. I love reading about technology revolutionizing our lifestyle.
