Over the last few years, significant progress has been made in researching and improving the reasoning capabilities of large language models, with a strong focus on enhancing their proficiency in solving
arithmetic and mathematical problems.
A model with strong arithmetic and mathematical reasoning can help in areas such as personalized learning, problem-solving, and curriculum design.
This article explores how advancements in mathematical reasoning are driving innovations in AI models like Qwen2.5-Math and its applications in personalized learning, problem-solving, and curriculum design.
The Qwen2.5-Math series is the latest addition to Alibaba Cloud’s Qwen series of open-source, math-specific large language models. It follows the earlier release of Qwen2-Math, a series of specialized mathematical language models based on the Qwen2 LLMs. These models demonstrate superior mathematical capabilities, surpassing both open-source alternatives and even some closed-source models like GPT-4o.
This series demonstrates significant performance gains over the Qwen2-Math series on both Chinese and English mathematics benchmarks. While Qwen2-Math applied Chain-of-Thought (CoT) reasoning to English math problems only, the Qwen2.5-Math series expands its capabilities by incorporating both CoT and Tool-Integrated Reasoning (TIR) to tackle math problems in both Chinese and English effectively.
The comparison between Qwen2.5-Math and Qwen2-Math highlights the advancements in mathematical reasoning and problem-solving capabilities achieved in the latest iteration of Alibaba Cloud’s math-specific language models.
| Property | Qwen2-Math | Qwen2.5-Math |
|---|---|---|
| Pre-training data size | 700B tokens (Qwen Math Corpus v1) | Over 1T tokens (Qwen Math Corpus v2) |
| Languages supported | English | English and Chinese |
| Approach | Chain-of-Thought (CoT) | Chain-of-Thought (CoT), Tool-Integrated Reasoning (TIR) |
| Benchmark scores (GSM8K / MATH / MMLU-STEM) | 89.1 / 60.5 / 79.1 | 90.8 / 66.8 / 82.8 |
| Model variants | Qwen2-Math-1.5B/7B/72B | Qwen2.5-Math-1.5B/7B/72B |
The Qwen2.5-Math series is trained using the Qwen Math Corpus v2, comprising over 1 trillion high-quality mathematical data tokens in both English and Chinese. This dataset includes synthetic mathematical data generated using the Qwen2-Math-72B-Instruct model and aggregated mathematical Chinese data sourced from web content, books, and code repositories through multiple recall cycles.
The chain-of-thought (CoT) dataset for Qwen2.5-Math is a comprehensive collection of mathematical problems aimed at improving the reasoning capabilities of the model. It includes:
To address the computational and algorithmic challenges faced by CoT prompting—such as solving quadratic equations or computing eigenvalues—the tool-integrated reasoning (TIR) dataset was introduced. This dataset enhances the model’s proficiency in symbolic manipulation and precise calculations by enabling it to use a Python interpreter for reasoning tasks. It includes:
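To make the TIR idea concrete, here is a minimal sketch (not from the Qwen codebase) of the kind of computation the model can delegate to a Python interpreter instead of reasoning about it token by token, using solving a quadratic equation as the example:

```python
import math

def solve_quadratic(a, b, c):
    """Return the sorted real roots of a*x^2 + b*x + c = 0 (assumes a != 0)."""
    disc = b * b - 4 * a * c
    if disc < 0:
        return []  # no real roots
    r = math.sqrt(disc)
    # A set removes the duplicate root when the discriminant is zero.
    return sorted({(-b - r) / (2 * a), (-b + r) / (2 * a)})

# In TIR mode, the model would emit code like the call below, the
# interpreter runs it, and the exact numeric result is fed back into
# the model's reasoning chain.
roots = solve_quadratic(1, -5, 6)
print(roots)  # [2.0, 3.0]
```

This delegation is what lets the model stay precise on symbolic manipulation and arithmetic, where pure CoT decoding is error-prone.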
Since the Qwen2.5-Math model is an upgraded version of Qwen2-Math, its training pipeline is derived from that of Qwen2-Math as follows:
Enhancing model performance is key to delivering faster, more accurate results, ensuring efficiency and reliability in applications.
The base models Qwen2.5-Math-1.5B/7B/72B achieved significant improvements on English math benchmarks (GSM8K, MATH, and MMLU-STEM) and Chinese math benchmarks (CMATH, GaoKao Math Cloze, and GaoKao Math QA) compared to Qwen2-Math-1.5B/7B/72B.
For example, the Qwen2.5-Math-1.5B/7B/72B models show score improvements of 5.4, 5.0, and 6.3 points on MATH, and of 3.4, 12.2, and 19.8 points on GaoKao Math QA, respectively.
The Qwen2.5-Math-72B-Instruct model outperformed both open-source models and top closed-source models, such as GPT-4o and Gemini Math-Specialized 1.5 Pro.
The Qwen2.5-Math-72B-Instruct model surpasses its predecessor (the Qwen2-Math-72B-Instruct model) by an average of 4.4 points in English and 6.1 points in Chinese. This performance marks its position as the leading open-source mathematical model available today.
On extremely challenging benchmarks such as AIME 2024 and AMC23, models like Claude 3 Opus, GPT-4 Turbo, and Gemini 1.5 Pro solve only 1 or 2 out of 30 problems. In contrast, Qwen2.5-Math-72B-Instruct demonstrates remarkable performance, solving 9 problems in greedy-decoding CoT mode and 12 problems in TIR mode. Furthermore, with the assistance of the reward model (RM), Qwen2.5-Math-7B-Instruct achieves an impressive 21 solved problems, showcasing its superior mathematical problem-solving capabilities.
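The RM-assisted scores come from best-of-N selection: sample several candidate solutions and let the reward model pick the highest-scoring one. Below is a minimal sketch of that selection step, with toy stand-ins for the generator and the reward model (the real system samples from Qwen2.5-Math and scores candidates with its trained RM):

```python
def best_of_n(generate, reward, n=8):
    """Sample n candidate answers and return the one the reward model
    scores highest (the RM-based selection used in the benchmark results)."""
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=reward)

# Toy stand-ins: the generator replays four fixed candidate answers,
# and the "reward model" prefers answers close to 42.
samples = iter([40, 41, 42, 39])
fake_generate = lambda: next(samples)
fake_reward = lambda ans: -abs(ans - 42)
print(best_of_n(fake_generate, fake_reward, n=4))  # 42
```

The key design point is that sampling diversity plus a good scorer recovers correct solutions that a single greedy decode would miss.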
Let’s explore the Qwen2.5-Math demo hosted on Hugging Face Spaces.
This space provides a web-based user interface to input mathematical or arithmetic problems in either image or text format for testing the model’s capabilities.
To support multiple modalities, this space uses Qwen2-VL for OCR and Qwen2.5-Math for mathematical reasoning.
Qwen-VL (Qwen Large Vision Language Model) is a multimodal vision-language model that supports both images and text as inputs. It natively supports English and Chinese and performs various image-to-text generation tasks such as image captioning, visual question answering, visual reasoning, and text recognition.
The Qwen-VL series contains several models, including Qwen-VL, Qwen-VL-Chat, Qwen-VL-Plus, and Qwen-VL-Max. Qwen-VL-Max is Qwen’s most capable large vision-language model, delivering optimal performance on an even broader range of complex tasks.
The system uses the qwen-vl-max-0809 model to understand, process, and extract textual information from the input images. The process_image() function first receives the input image and extracts the math-related content, ensuring accurate transcription of any LaTeX formulas. The system then applies the following standard prompt to extract the textual, math-related content from the image.
The prompt instructs: “Describe the math-related content in this image, ensuring accurate transcription of any LaTeX formulas. Do not describe non-mathematical details.”
```python
import os
os.system('pip install dashscope -U')

import tempfile
from pathlib import Path
import secrets
import dashscope
from dashscope import MultiModalConversation, Generation
from PIL import Image

YOUR_API_TOKEN = os.getenv('YOUR_API_TOKEN')
dashscope.api_key = YOUR_API_TOKEN
math_messages = []

def process_image(image, shouldConvert=False):
    global math_messages
    math_messages = []  # reset the conversation when a new image is uploaded
    uploaded_file_dir = os.environ.get("GRADIO_TEMP_DIR") or str(
        Path(tempfile.gettempdir()) / "gradio"
    )
    os.makedirs(uploaded_file_dir, exist_ok=True)
    name = f"tmp{secrets.token_hex(20)}.jpg"
    filename = os.path.join(uploaded_file_dir, name)
    if shouldConvert:
        # Flatten transparent images onto a white background before saving as JPEG
        new_img = Image.new('RGB', size=(image.width, image.height), color=(255, 255, 255))
        new_img.paste(image, (0, 0), mask=image)
        image = new_img
    image.save(filename)
    messages = [{
        'role': 'system',
        'content': [{'text': 'You are a helpful assistant.'}]
    }, {
        'role': 'user',
        'content': [
            {'image': f'file://{filename}'},
            {'text': 'Please describe the math-related content in this image, ensuring that any LaTeX formulas are correctly transcribed. Non-mathematical details do not need to be described.'}
        ]
    }]
    response = MultiModalConversation.call(model='qwen-vl-max-0809', messages=messages)
    os.remove(filename)
    return response.output.choices[0]["message"]["content"]
```
This step extracts the image description, which is then passed to the Qwen2.5-Math model along with the user’s question to generate the response. The qwen2.5-math-72b-instruct model performs the mathematical reasoning in this process.
```python
def get_math_response(image_description, user_question):
    global math_messages
    if not math_messages:
        math_messages.append({'role': 'system', 'content': 'You are a helpful math assistant.'})
    math_messages = math_messages[:1]  # keep only the system message between turns
    if image_description is not None:
        content = f'Image description: {image_description}\n\n'
    else:
        content = ''
    query = f"{content}User question: {user_question}"
    math_messages.append({'role': 'user', 'content': query})
    response = Generation.call(
        model="qwen2.5-math-72b-instruct",
        messages=math_messages,
        result_format='message',
        stream=True
    )
    answer = None
    for resp in response:
        if resp.output is None:
            continue
        answer = resp.output.choices[0].message.content
        # Escape backslashes so LaTeX renders correctly in the UI
        yield answer.replace("\\", "\\\\")
    print(f'query: {query}\nanswer: {answer}')
    if answer is None:
        math_messages.pop()
    else:
        math_messages.append({'role': 'assistant', 'content': answer})
```
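Because `Generation.call` is invoked with `stream=True`, callers consume `get_math_response` as a generator, re-rendering each partial answer as it arrives. The pattern can be illustrated with a mock generator that needs no API key (`mock_math_response` is a hypothetical stand-in, not part of the demo code):

```python
def mock_math_response(description, question):
    """Hypothetical stand-in for get_math_response: yields progressively
    longer partial answers, mimicking DashScope's stream=True behaviour."""
    answer = f"Given {description}, the answer to '{question}' is 42."
    for cut in range(10, len(answer), 10):
        yield answer[:cut]  # partial answer so far
    yield answer  # final, complete answer

final = None
for partial in mock_math_response("x + 40 = 82", "find x"):
    final = partial  # a UI would re-render the answer box on every chunk
print(final)  # the last yielded value is the complete answer
```

In the streaming API, each chunk carries the full answer accumulated so far, which is why the loop simply keeps the latest value rather than concatenating chunks.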
Now that we know the models used in this space, let’s look at some examples to assess the model’s ability to solve mathematical and arithmetic problems.
An input image containing the following problem statement –
The model finds the values x = 5 and y = 2, providing step-by-step natural-language reasoning along the way.
An input image containing the following problem statement –
The model evaluates the last expression as 50.
An input image containing the following problem statement –
The model evaluates the above expression as 5.
In this article, we explored Qwen2.5-Math—a series of mathematical models with robust reasoning capabilities. We examined its components, training data, architecture, and performance on various standard benchmarks. Additionally, we reviewed the demo, testing it with a range of moderate to complex examples.
A. Qwen2.5-Math is an upgraded version of Qwen2-Math, offering improved performance, better accuracy in solving complex mathematical problems, and enhanced training techniques.
A. Qwen2.5-Math typically outperforms Qwen2-Math on complex tasks due to its advanced training and refined capabilities in mathematical reasoning.
A. Both models are designed for mathematical reasoning, but Qwen2.5-Math uses more sophisticated algorithms and training data to solve challenging problems more effectively.
A. Qwen2.5-Math benefits from a larger and more diverse dataset, which enhances its ability to generalize and solve complex mathematical problems more accurately than Qwen2-Math.
A. Qwen2.5-Math is optimized for faster processing and provides quicker responses than Qwen2-Math while maintaining high accuracy.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.