Over the last few years, significant progress has been made in researching and improving the reasoning capabilities of large language models, with a strong focus on enhancing their proficiency in solving
arithmetic and mathematical problems.
A model with strong arithmetic and mathematical reasoning can help in areas such as personalized learning, problem-solving, and curriculum design.
This article explores how advancements in mathematical reasoning are driving innovations in AI models like Qwen2.5-Math and its applications in personalized learning, problem-solving, and curriculum design.
The Qwen2.5-Math series is the latest addition to Alibaba Cloud’s Qwen series of open-source, math-specific large language models. It follows the earlier release of Qwen2-Math, a series of specialized mathematical language models based on the Qwen2 LLMs. These models demonstrate superior mathematical capabilities, surpassing both open-source alternatives and even some closed-source models like GPT-4o.
This series demonstrates significant performance gains over the Qwen2-Math series on both Chinese and English mathematics benchmarks. While Qwen2-Math applied Chain-of-Thought (CoT) reasoning to English math problems only, the Qwen2.5-Math series expands its capabilities by incorporating both CoT and Tool-Integrated Reasoning (TIR) to tackle math problems in both Chinese and English effectively.
The comparison between Qwen2.5-Math and Qwen2-Math highlights the advancements in mathematical reasoning and problem-solving capabilities achieved in the latest iteration of Alibaba Cloud’s math-specific language models.
| Property | Qwen2-Math | Qwen2.5-Math |
|---|---|---|
| Pre-training data size | 700B tokens (Qwen Math Corpus v1) | Over 1T tokens (Qwen Math Corpus v2) |
| Languages supported | English | English and Chinese |
| Approach | Chain-of-Thought (CoT) | Chain-of-Thought (CoT), Tool-Integrated Reasoning (TIR) |
| Benchmark scores (GSM8K / MATH / MMLU-STEM) | 89.1 / 60.5 / 79.1 | 90.8 / 66.8 / 82.8 |
| Model variants | Qwen2-Math-1.5B/7B/72B | Qwen2.5-Math-1.5B/7B/72B |
The Qwen2.5-Math series is trained using the Qwen Math Corpus v2, comprising over 1 trillion high-quality mathematical data tokens in both English and Chinese. This dataset includes synthetic mathematical data generated using the Qwen2-Math-72B-Instruct model and aggregated mathematical Chinese data sourced from web content, books, and code repositories through multiple recall cycles.
The chain-of-thought (CoT) dataset for Qwen2.5-Math is a comprehensive collection of mathematical problems aimed at improving the reasoning capabilities of the model. It includes:
To address the computational and algorithmic challenges faced by CoT prompting—such as solving quadratic equations or computing eigenvalues—the tool-integrated reasoning (TIR) dataset was introduced. This dataset enhances the model’s proficiency in symbolic manipulation and precise calculations by enabling it to use a Python interpreter for reasoning tasks. It includes:
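To make the TIR idea concrete, here is a minimal sketch (not from the Qwen codebase) of the kind of computation the model can delegate to a Python interpreter instead of reasoning about it token by token, using solving a quadratic equation as the example:

```python
import math

def solve_quadratic(a, b, c):
    """Return the sorted real roots of a*x^2 + b*x + c = 0 (assumes a != 0)."""
    disc = b * b - 4 * a * c
    if disc < 0:
        return []  # no real roots
    r = math.sqrt(disc)
    # A set removes the duplicate root when the discriminant is zero.
    return sorted({(-b - r) / (2 * a), (-b + r) / (2 * a)})

# In TIR mode, the model would emit code like the call below, the
# interpreter runs it, and the exact numeric result is fed back into
# the model's reasoning chain.
roots = solve_quadratic(1, -5, 6)
print(roots)  # [2.0, 3.0]
```

This delegation is what lets the model stay precise on symbolic manipulation and arithmetic, where pure CoT decoding is error-prone.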
Since the Qwen2.5-Math model is an upgraded version of Qwen2-Math, its training pipeline is derived from that of Qwen2-Math as follows:
Enhancing model performance is key to delivering faster, more accurate results, ensuring efficiency and reliability in applications.
The base models Qwen2.5-Math-1.5B/7B/72B achieved significant improvements on English math benchmarks (GSM8K, MATH, and MMLU-STEM) and Chinese math benchmarks (CMATH, GaoKao Math Cloze, and GaoKao Math QA) compared to Qwen2-Math-1.5B/7B/72B.
For example, the Qwen2.5-Math-1.5B/7B/72B models show score improvements of 5.4, 5.0, and 6.3 points on MATH, and of 3.4, 12.2, and 19.8 points on GaoKao Math QA, respectively.
The Qwen2.5-Math-72B-Instruct model outperformed both open-source models and top closed-source models, such as GPT-4o and Gemini Math-Specialized 1.5 Pro.
The Qwen2.5-Math-72B-Instruct model surpasses its predecessor (the Qwen2-Math-72B-Instruct model) by an average of 4.4 points in English and 6.1 points in Chinese. This performance marks its position as the leading open-source mathematical model available today.
On extremely challenging benchmarks such as AIME 2024 and AMC23, models like Claude 3 Opus, GPT-4 Turbo, and Gemini 1.5 Pro solve only 1 or 2 out of 30 problems. In contrast, Qwen2.5-Math-72B-Instruct demonstrates remarkable performance, solving 9 problems in greedy-decoding CoT mode and 12 problems in TIR mode. Furthermore, with the assistance of the reward model (RM), Qwen2.5-Math-7B-Instruct achieves an impressive 21 solved problems, showcasing its superior mathematical problem-solving capabilities.
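The RM-assisted scores come from best-of-N selection: sample several candidate solutions and let the reward model pick the highest-scoring one. Below is a minimal sketch of that selection step, with toy stand-ins for the generator and the reward model (the real system samples from Qwen2.5-Math and scores candidates with its trained RM):

```python
def best_of_n(generate, reward, n=8):
    """Sample n candidate answers and return the one the reward model
    scores highest (the RM-based selection used in the benchmark results)."""
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=reward)

# Toy stand-ins: the generator replays four fixed candidate answers,
# and the "reward model" prefers answers close to 42.
samples = iter([40, 41, 42, 39])
fake_generate = lambda: next(samples)
fake_reward = lambda ans: -abs(ans - 42)
print(best_of_n(fake_generate, fake_reward, n=4))  # 42
```

The key design point is that sampling diversity plus a good scorer recovers correct solutions that a single greedy decode would miss.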
Let’s explore the Qwen2.5-Math demo hosted on Hugging Face Spaces.
This space provides a web-based user interface to input mathematical or arithmetic problems in either image or text format for testing the model’s capabilities.
To support multiple modalities, this space uses Qwen2-VL for OCR and Qwen2.5-Math for mathematical reasoning.
Qwen-VL (Qwen Large Vision Language Model) is a multimodal vision-language model that supports both images and text as inputs. It natively supports English and Chinese and performs various image-to-text generation tasks such as image captioning, visual question answering, visual reasoning, and text recognition.
The Qwen-VL series contains several models, including Qwen-VL, Qwen-VL-Chat, Qwen-VL-Plus, and Qwen-VL-Max. Qwen-VL-Max is Qwen’s most capable large vision-language model, delivering optimal performance on an even broader range of complex tasks.
The system uses the qwen-vl-max-0809 model to understand, process, and extract textual information from the input images. The process_image() function first receives the input image and extracts the math-related content, ensuring accurate transcription of any LaTeX formulas. The system then applies the following standard prompt to extract the textual, math-related content from the image.
The prompt instructs: “Describe the math-related content in this image, ensuring accurate transcription of any LaTeX formulas. Do not describe non-mathematical details.”
```python
import os
os.system('pip install dashscope -U')

import tempfile
from pathlib import Path
import secrets
import dashscope
from dashscope import MultiModalConversation, Generation
from PIL import Image

YOUR_API_TOKEN = os.getenv('YOUR_API_TOKEN')
dashscope.api_key = YOUR_API_TOKEN
math_messages = []

def process_image(image, shouldConvert=False):
    global math_messages
    math_messages = []  # reset the conversation when a new image is uploaded
    uploaded_file_dir = os.environ.get("GRADIO_TEMP_DIR") or str(
        Path(tempfile.gettempdir()) / "gradio"
    )
    os.makedirs(uploaded_file_dir, exist_ok=True)
    name = f"tmp{secrets.token_hex(20)}.jpg"
    filename = os.path.join(uploaded_file_dir, name)
    if shouldConvert:
        # Flatten transparent images onto a white background before saving as JPEG
        new_img = Image.new('RGB', size=(image.width, image.height), color=(255, 255, 255))
        new_img.paste(image, (0, 0), mask=image)
        image = new_img
    image.save(filename)
    messages = [{
        'role': 'system',
        'content': [{'text': 'You are a helpful assistant.'}]
    }, {
        'role': 'user',
        'content': [
            {'image': f'file://{filename}'},
            {'text': 'Please describe the math-related content in this image, ensuring that any LaTeX formulas are correctly transcribed. Non-mathematical details do not need to be described.'}
        ]
    }]
    response = MultiModalConversation.call(model='qwen-vl-max-0809', messages=messages)
    os.remove(filename)
    return response.output.choices[0]["message"]["content"]
```
This step extracts the image description, which is then passed to the Qwen2.5-Math model along with the user’s question to generate the response. The qwen2.5-math-72b-instruct model performs the mathematical reasoning in this process.
```python
def get_math_response(image_description, user_question):
    global math_messages
    if not math_messages:
        math_messages.append({'role': 'system', 'content': 'You are a helpful math assistant.'})
    math_messages = math_messages[:1]  # keep only the system message between turns
    if image_description is not None:
        content = f'Image description: {image_description}\n\n'
    else:
        content = ''
    query = f"{content}User question: {user_question}"
    math_messages.append({'role': 'user', 'content': query})
    response = Generation.call(
        model="qwen2.5-math-72b-instruct",
        messages=math_messages,
        result_format='message',
        stream=True
    )
    answer = None
    for resp in response:
        if resp.output is None:
            continue
        answer = resp.output.choices[0].message.content
        # Escape backslashes so LaTeX renders correctly in the UI
        yield answer.replace("\\", "\\\\")
    print(f'query: {query}\nanswer: {answer}')
    if answer is None:
        math_messages.pop()
    else:
        math_messages.append({'role': 'assistant', 'content': answer})
```
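Because `Generation.call` is invoked with `stream=True`, callers consume `get_math_response` as a generator, re-rendering each partial answer as it arrives. The pattern can be illustrated with a mock generator that needs no API key (`mock_math_response` is a hypothetical stand-in, not part of the demo code):

```python
def mock_math_response(description, question):
    """Hypothetical stand-in for get_math_response: yields progressively
    longer partial answers, mimicking DashScope's stream=True behaviour."""
    answer = f"Given {description}, the answer to '{question}' is 42."
    for cut in range(10, len(answer), 10):
        yield answer[:cut]  # partial answer so far
    yield answer  # final, complete answer

final = None
for partial in mock_math_response("x + 40 = 82", "find x"):
    final = partial  # a UI would re-render the answer box on every chunk
print(final)  # the last yielded value is the complete answer
```

In the streaming API, each chunk carries the full answer accumulated so far, which is why the loop simply keeps the latest value rather than concatenating chunks.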
Now that we know the models used in this space, let’s look at some examples to assess the model’s ability to solve mathematical and arithmetic problems.
An input image containing the following problem statement –
The model finds the values x = 5 and y = 2, providing step-by-step natural-language reasoning along the way.
An input image containing the following problem statement –
The model evaluates the last expression as 50.
An input image containing the following problem statement –
The model evaluates the above expression as 5.
In this article, we explored Qwen2.5-Math—a series of mathematical models with robust reasoning capabilities. We examined its components, training data, architecture, and performance on various standard benchmarks. Additionally, we reviewed the demo, testing it with a range of moderate to complex examples.
A. Qwen2.5-Math is an upgraded version of Qwen2-Math, offering improved performance, better accuracy in solving complex mathematical problems, and enhanced training techniques.
A. Qwen2.5-Math typically outperforms Qwen2-Math on complex tasks due to its advanced training and refined capabilities in mathematical reasoning.
A. Both models are designed for mathematical reasoning, but Qwen2.5-Math uses more sophisticated algorithms and training data to solve challenging problems more effectively.
A. Qwen2.5-Math benefits from a larger and more diverse dataset, which enhances its ability to generalize and solve complex mathematical problems more accurately than Qwen2-Math.
A. Qwen2.5-Math is optimized for faster processing and provides quicker responses than Qwen2-Math while maintaining high accuracy.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.