Getting Started with Qwen2.5-Math

Vikas Verma · Last Updated: 27 Dec, 2024 · 7 min read

Over the last few years, significant progress has been made in researching and improving the reasoning capabilities of large language models, with a strong focus on enhancing their proficiency in solving
arithmetic and mathematical problems.

A model with strong arithmetic and mathematical reasoning can help with:

  • Personalized Learning: AI-powered tutors can adapt to individual students’ needs, helping them understand complex mathematical concepts more effectively.
  • Problem Solving Assistance: Automating step-by-step explanations for solving problems improves student engagement and comprehension.
  • Curriculum Design: Creating adaptive and progressive learning modules in subjects like algebra and calculus.

This article explores how advancements in mathematical reasoning are driving innovations in AI models like Qwen2.5-Math and its applications in personalized learning, problem-solving, and curriculum design.

Learning Objectives

  • Understand and explore the Qwen2.5-Math series and its components.
  • Learn about Qwen2.5-Math model architecture.
  • Gain hands-on experience with Qwen2.5-Math through worked examples.
  • Learn about the performance of Qwen2.5-Math on various benchmarks.

What is Qwen2.5-Math?

The Qwen2.5-Math series is the latest addition to Alibaba Cloud’s Qwen series of open-source, math-specific large language models. It follows the earlier release of Qwen2-Math, a series of specialized mathematical language models based on the Qwen2 LLMs. These models demonstrate superior mathematical capabilities, surpassing both open-source alternatives and even some closed-source models like GPT-4o.

This series demonstrates significant performance enhancements over the Qwen2-Math series on both Chinese and English mathematics benchmarks. While Qwen2-Math applies Chain-of-Thought (CoT) reasoning to English math problems only, the Qwen2.5-Math series expands its capabilities by incorporating both CoT and Tool-Integrated Reasoning (TIR) to tackle math problems in both Chinese and English effectively.
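In practice, the two modes differ mainly in the system prompt sent to an instruct model. Below is a minimal sketch of building a chat request for each mode, assuming the CoT and TIR system prompts published on the Qwen2.5-Math model cards (treat the exact wording as illustrative, and `build_messages` as a hypothetical helper):

```python
# System prompts for the two reasoning modes, as described on the
# Qwen2.5-Math model cards (illustrative wording).
COT_SYSTEM = "Please reason step by step, and put your final answer within \\boxed{}."
TIR_SYSTEM = (
    "Please integrate natural language reasoning with programs to solve "
    "the problem above, and put your final answer within \\boxed{}."
)

def build_messages(problem: str, mode: str = "cot") -> list:
    """Build a chat-template message list for CoT or TIR prompting."""
    system = COT_SYSTEM if mode == "cot" else TIR_SYSTEM
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": problem},
    ]

messages = build_messages("Find the value of x such that 2x + 3 = 11.", mode="tir")
```

The resulting message list can then be passed to the model through any chat-completion API that accepts role/content pairs.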

Qwen2.5-Math vs Qwen2-Math

The comparison between Qwen2.5-Math and Qwen2-Math highlights the advancements in mathematical reasoning and problem-solving capabilities achieved in the latest iteration of Alibaba Cloud’s math-specific language models.

| Property | Qwen2-Math | Qwen2.5-Math |
| --- | --- | --- |
| Pre-training data size | 700B tokens (Qwen Math Corpus v1) | Over 1T tokens (Qwen Math Corpus v2) |
| Languages supported | English | English and Chinese |
| Approach | Chain-of-Thought (CoT) | Chain-of-Thought (CoT), Tool-Integrated Reasoning (TIR) |
| Benchmark scores (GSM8K, MATH, MMLU-STEM) | 89.1, 60.5, 79.1 | 90.8, 66.8, 82.8 |
| Model variants | Qwen2-Math-1.5B/7B/72B | Qwen2.5-Math-1.5B/7B/72B |

Optimizing Training Data

The Qwen2.5-Math series is trained using the Qwen Math Corpus v2, comprising over 1 trillion high-quality mathematical data tokens in both English and Chinese. This dataset includes synthetic mathematical data generated using the Qwen2-Math-72B-Instruct model and aggregated mathematical Chinese data sourced from web content, books, and code repositories through multiple recall cycles.

Chain-of-Thought (CoT) Dataset

The chain-of-thought (CoT) dataset for Qwen2.5-Math is a comprehensive collection of mathematical problems aimed at improving the reasoning capabilities of the model. It includes:

  • 580K English and 500K Chinese mathematical problems, including both annotated and synthesized items.
  • Annotated data derived from sources like GSM8K, MATH, and NuminaMath.

Tool-Integrated Reasoning (TIR) Dataset

To address the computational and algorithmic challenges faced by CoT prompting—such as solving quadratic equations or computing eigenvalues—the tool-integrated reasoning (TIR) dataset was introduced. This dataset enhances the model’s proficiency in symbolic manipulation and precise calculations by enabling it to use a Python interpreter for reasoning tasks. It includes:

  • 190k problems from benchmarks like GSM8K, MATH, CollegeMath, and NuminaMath.
  • 205k problems created using techniques from MuggleMath and DotaMath to evolve queries within GSM8K and MATH training sets.
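Conceptually, tool-integrated reasoning lets the model emit a short Python program whose printed output is fed back into the conversation as the “tool result”. The sketch below illustrates that execution loop with the model call stubbed out by a hard-coded completion; `run_tool_code` and the fence handling are illustrative, not the actual Qwen implementation:

```python
import contextlib
import io
import re

FENCE = "`" * 3  # the ``` delimiter, built indirectly so it renders cleanly here

def run_tool_code(completion: str) -> str:
    """Extract the first fenced python block and return its printed output."""
    match = re.search(FENCE + r"python\n(.*?)" + FENCE, completion, re.DOTALL)
    if match is None:
        return ""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(match.group(1), {})  # a real system would sandbox this call
    return buffer.getvalue().strip()

# Stubbed "model completion" for: solve x^2 - 5x + 6 = 0
completion = (
    "Compute the roots with the quadratic formula.\n"
    + FENCE + "python\n"
    + "import math\n"
    + "a, b, c = 1, -5, 6\n"
    + "d = math.sqrt(b * b - 4 * a * c)\n"
    + "print(sorted([(-b - d) / (2 * a), (-b + d) / (2 * a)]))\n"
    + FENCE
)
print(run_tool_code(completion))  # prints [2.0, 3.0]
```

The tool output would then be appended to the chat history so the model can state the final boxed answer.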

Efficient Model Training

Source: HuggingFace

Since the Qwen2.5-Math model is an upgraded version of the Qwen2-Math model, its training builds on Qwen2-Math as follows:

  • Qwen2-Math models train on Qwen Math Corpus v1, a high-quality dataset that contains approximately 700 billion tokens of mathematical content.
  • Developers train a math-specific reward model, Qwen2-Math-RM, derived from the Qwen2-Math-72B model.
  • The Qwen2.5 series base models serve for parameter initialization, enhancing language understanding, code generation, and text reasoning capabilities.
  • After training the base Qwen2.5-Math model, developers train a math-specific reward model, Qwen2.5-Math-RM-72B, based on Qwen2.5-Math-72B. This reward model evolves the SFT data through Rejection Sampling for the SFT model (Qwen2.5-Math-SFT).
  • An instruct model (Qwen2.5-Math-Instruct) is built at the end to polish the quality of responses. This model is created through an additional iteration using the Qwen2-Math-Instruct models and Qwen2.5-Math-RM-72B. The process incorporates Tool-Integrated Reasoning (TIR) data and SFT data, refined via Group Relative Policy Optimization (GRPO), to further polish the model’s performance.
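The rejection-sampling step above amounts to best-of-N selection under a reward model: sample several candidate solutions, score each, and keep the highest-scoring one for the SFT set. A minimal sketch with a toy stand-in reward (in the real pipeline the scorer is Qwen2.5-Math-RM-72B; `reward_model` and `rejection_sample` here are hypothetical helpers):

```python
def reward_model(problem: str, solution: str) -> float:
    """Toy stand-in for Qwen2.5-Math-RM-72B: prefer solutions with a boxed answer."""
    return 1.0 if "\\boxed{" in solution else 0.0

def rejection_sample(problem: str, candidates: list) -> str:
    """Keep the candidate the reward model scores highest."""
    return max(candidates, key=lambda s: reward_model(problem, s))

candidates = [
    "The answer is probably 4.",
    "2x + 3 = 11, so 2x = 8 and x = \\boxed{4}.",
]
best = rejection_sample("Solve 2x + 3 = 11.", candidates)
print(best)  # the fully worked, boxed solution is kept
```

Repeating this over the whole problem set yields the evolved SFT data referred to above.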

Optimizing Model Performance

Enhancing model performance is key to delivering faster, more accurate results, ensuring efficiency and reliability in applications.

Base Models Performance

The base models Qwen2.5-Math-1.5B/7B/72B achieved significant improvements on English math benchmarks (GSM8K, MATH, and MMLU-STEM) and Chinese math benchmarks (CMATH, GaoKao Math Cloze, and GaoKao Math QA) compared to Qwen2-Math-1.5B/7B/72B.

Base Models Benchmark Results

For example, the Qwen2.5-Math-1.5B/7B/72B models show significant score improvements of 5.4, 5.0, and 6.3 points on MATH, and of 3.4, 12.2, and 19.8 points on GaoKao Math QA, respectively.

Instruction-tuned Models Performance

The Qwen2.5-Math-72B-Instruct model outperformed both open-source models and top closed-source models, such as GPT-4o and Gemini Math-Specialized 1.5 Pro.


The Qwen2.5-Math-72B-Instruct model surpasses its predecessor (the Qwen2-Math-72B-Instruct model) by an average of 4.4 points in English and 6.1 points in Chinese. This performance marks its position as the leading open-source mathematical model available today.

On the extremely challenging benchmarks such as AIME 2024 and AMC23, models like Claude3 Opus, GPT-4 Turbo, and Gemini 1.5 Pro solve only 1 or 2 out of 30 problems. In contrast, Qwen2.5-Math-72B-Instruct demonstrates remarkable performance, solving 9 problems in Greedy decoding CoT mode and 12 problems in TIR mode. Furthermore, with the assistance of the reward model (RM), Qwen2.5-Math-7B-Instruct achieves an impressive 21 solved problems, showcasing its superior mathematical problem-solving capabilities.

Qwen2.5-Math vs Closed-source models

Running Demo

Let’s try the Qwen2.5-Math demo using its Hugging Face Space.

This space provides a web-based user interface to input mathematical or arithmetic problems in either image or text format for testing the model’s capabilities.

To support multiple modalities, this space uses Qwen2-VL for OCR and Qwen2.5-Math for mathematical reasoning.

Qwen-VL (Qwen Large Vision Language Model) is a multimodal vision-language model that accepts images and text as inputs. It natively supports English and Chinese and performs various image-to-text generation tasks such as image captioning, visual question answering, visual reasoning, and text recognition.

The Qwen-VL series contains several models, such as Qwen-VL, Qwen-VL-Chat, Qwen-VL-Plus, and Qwen-VL-Max. Qwen-VL-Max is Qwen’s most capable large vision-language model, delivering optimal performance on an even broader range of complex tasks.

Step 1: Extracting Math Content Using Qwen2-VL

The system uses the qwen-vl-max-0809 model to understand, process, and extract textual information from the input images. The process_image() function receives the input image and extracts the math-related content, ensuring accurate transcription of any LaTeX formulas. The system then applies the following standard prompt to extract the math-related textual content from the image.

The prompt instructs: “Describe the math-related content in this image, ensuring accurate transcription of any LaTeX formulas. Do not describe non-mathematical details.”

import os

os.system('pip install dashscope -U')

import tempfile
from pathlib import Path
import secrets
import dashscope
from dashscope import MultiModalConversation, Generation
from PIL import Image

YOUR_API_TOKEN = os.getenv('YOUR_API_TOKEN')
dashscope.api_key = YOUR_API_TOKEN
math_messages = []

def process_image(image, shouldConvert=False):
    global math_messages
    math_messages = []  # reset the chat history when a new image is uploaded
    uploaded_file_dir = os.environ.get("GRADIO_TEMP_DIR") or str(
        Path(tempfile.gettempdir()) / "gradio"
    )
    os.makedirs(uploaded_file_dir, exist_ok=True)

    # Save the image to a temporary file so it can be passed via a file:// URI
    name = f"tmp{secrets.token_hex(20)}.jpg"
    filename = os.path.join(uploaded_file_dir, name)

    if shouldConvert:
        # Flatten transparency onto a white background before saving as JPEG
        new_img = Image.new('RGB', size=(image.width, image.height), color=(255, 255, 255))
        new_img.paste(image, (0, 0), mask=image)
        image = new_img
    image.save(filename)

    messages = [{
        'role': 'system',
        'content': [{'text': 'You are a helpful assistant.'}]
    }, {
        'role': 'user',
        'content': [
            {'image': f'file://{filename}'},
            {'text': 'Please describe the math-related content in this image, ensuring that any LaTeX formulas are correctly transcribed. Non-mathematical details do not need to be described.'}
        ]
    }]

    response = MultiModalConversation.call(model='qwen-vl-max-0809', messages=messages)

    os.remove(filename)

    return response.output.choices[0]["message"]["content"]
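The shouldConvert branch above flattens transparent images (such as sketchpad output) onto a white background, because JPEG cannot store an alpha channel. A standalone sketch of just that flattening step:

```python
from PIL import Image

# A tiny fully transparent RGBA canvas with one opaque red pixel,
# standing in for a user's hand-drawn formula.
sketch = Image.new("RGBA", (4, 4), (0, 0, 0, 0))
sketch.putpixel((1, 1), (255, 0, 0, 255))

# Paste onto an opaque white canvas, using the image's own alpha as mask.
flat = Image.new("RGB", sketch.size, (255, 255, 255))
flat.paste(sketch, (0, 0), mask=sketch)

print(flat.getpixel((0, 0)), flat.getpixel((1, 1)))  # white background, red stroke
```

Transparent pixels stay white while opaque pixels are copied through, which is exactly what the JPEG export needs.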

Step 2: Mathematical Reasoning Using Qwen2.5-Math

This step extracts the image description, which is then passed to the Qwen2.5 model along with the user question to generate the response. The qwen2.5-math-72b-instruct model performs the mathematical reasoning in this process.

def get_math_response(image_description, user_question):
    global math_messages
    if not math_messages:
        math_messages.append({'role': 'system', 'content': 'You are a helpful math assistant.'})
    math_messages = math_messages[:1]
    if image_description is not None:
        content = f'Image description: {image_description}\n\n'
    else:
        content = ''
    query = f"{content}User question: {user_question}"
    math_messages.append({'role': 'user', 'content': query})
    response = Generation.call(
        model="qwen2.5-math-72b-instruct",
        messages=math_messages,
        result_format='message',
        stream=True
    )
    answer = None
    for resp in response:
        if resp.output is None:
            continue
        answer = resp.output.choices[0].message.content
        yield answer.replace("\\", "\\\\")
    print(f'query: {query}\nanswer: {answer}')
    if answer is None:
        math_messages.pop()
    else:
        math_messages.append({'role': 'assistant', 'content': answer})
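Because get_math_response is a generator that re-yields the growing answer on each streamed chunk, a Gradio front end simply drains it and displays the last value. A runnable sketch of that streaming pattern, with the DashScope call replaced by a stub (`fake_math_response` is a hypothetical stand-in) so it runs without an API key:

```python
def fake_math_response(image_description, user_question):
    """Stand-in for get_math_response: yields progressively longer answers."""
    partial = ""
    for piece in ["The hypotenuse", " is", " \\boxed{5}."]:
        partial += piece
        # Same backslash escaping as the real function, so LaTeX survives
        # markdown rendering in the UI.
        yield partial.replace("\\", "\\\\")

final = None
for chunk in fake_math_response(None, "Legs 3 and 4; find the hypotenuse."):
    final = chunk  # the UI re-renders each chunk; we keep the last one
print(final)
```

The same drain-the-generator loop works unchanged against the real get_math_response once a DashScope key is configured.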

Now that we know which models power this space, let’s look at some examples to assess the model’s ability to solve mathematical and arithmetic problems.

Example 1

An input image containing the following problem statement –

Model Output - Example 1: Qwen2.5-Math

The model finds x = 5 and y = 2, providing step-by-step natural-language reasoning as it solves for both variables.

Example 2

An input image containing the following problem statement –

Model Output - Example 2: Qwen2.5-Math

The model finds out the value of the last expression as 50.

Example 3

An input image containing the following problem statement –

Model Output - Example 3: Qwen2.5-Math

The model finds out the value of the above expression as 5.

Conclusion

In this article, we explored Qwen2.5-Math—a series of mathematical models with robust reasoning capabilities. We examined its components, training data, architecture, and performance on various standard benchmarks. Additionally, we reviewed the demo, testing it with a range of moderate to complex examples.

Key Takeaways

  • The Qwen2.5-Math models support both Chinese and English and showcase advanced mathematical reasoning capabilities, utilizing techniques such as Chain-of-Thought (CoT) and Tool-Integrated Reasoning (TIR).
  • The Qwen2.5 series includes multiple variants based on the number of parameters, with models available in 1.5B, 7B, and 72B parameters.
  • The Qwen2.5-Math models leverage 1 trillion tokens for pre-training, a substantial increase compared to the 700 billion tokens used for Qwen2-Math.
  • Qwen2.5-Math surpasses Qwen2-Math across various English and Chinese benchmarks. Additionally, it outperforms models like Claude3 Opus, GPT-4 Turbo, and Gemini 1.5 Pro on challenging benchmarks such as AIME 2024.

Frequently Asked Questions

Q1. What is the difference between Qwen2.5-Math and Qwen2-Math?

A. Qwen2.5-Math is an upgraded version of Qwen2-Math, offering improved performance, better accuracy in solving complex mathematical problems, and enhanced training techniques.

Q2. Which model performs better for complex mathematical tasks, Qwen2.5-Math or Qwen2-Math?

A. Qwen2.5-Math typically outperforms Qwen2-Math on complex tasks due to its advanced training and refined capabilities in mathematical reasoning.

Q3. How do Qwen2.5-Math and Qwen2-Math handle mathematical reasoning?

A. Both models are designed for mathematical reasoning, but Qwen2.5-Math uses more sophisticated training techniques and richer training data to solve challenging problems more effectively.

Q4. What is the significance of training data in Qwen2.5-Math vs Qwen2-Math?

A. Qwen2.5-Math benefits from a larger and more diverse dataset, which enhances its ability to generalize and solve complex mathematical problems more accurately than Qwen2-Math.

Q5. Are there any differences in the speed of processing between Qwen2.5-Math and Qwen2-Math?

A. Qwen2.5-Math is optimized for faster processing and provides quicker responses than Qwen2-Math while maintaining high accuracy.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

A Data Science professional with 7.5 years of experience in data science, machine learning, and programming. Hands-on experience in different domains like data analytics, deep learning, big data, and natural language processing.
