Is Qwen2.5-Max Better than DeepSeek-R1 and Kimi k1.5?

Anu Madan Last Updated : 02 Feb, 2025
7 min read

It’s Lunar New Year in China, and the GenAI world has plenty to celebrate too, thanks to the launch of one impressive model after another by Chinese companies. Alibaba recently joined in with Qwen2.5-Max, a model it claims outperforms offerings from OpenAI, DeepSeek, and Meta. Packed with advanced reasoning plus image and video generation, Qwen2.5-Max is set to shake up the GenAI landscape. In this blog, we will compare the performance of Qwen2.5-Max, DeepSeek-R1, and Kimi k1.5 on several fronts to find the best LLM at present!

Introduction to Qwen2.5-Max, DeepSeek-R1, and Kimi k1.5

  • Qwen2.5-Max: A closed-source multimodal LLM by Alibaba Cloud, pretrained on over 20 trillion tokens and fine-tuned using RLHF. It shows advanced reasoning capabilities along with the ability to generate images and videos.
  • DeepSeek-R1: An open-source model by DeepSeek, trained using reinforcement learning combined with supervised fine-tuning. It excels in logical thinking, complex problem-solving, mathematics, and coding.
  • Kimi k1.5: An open-source multimodal LLM by Moonshot AI that can process large amounts of content in a single prompt. It can conduct real-time web searches across 100+ websites and work with multiple files at once. It shows great results in tasks involving STEM, coding, and general reasoning.

Qwen2.5-Max Vs DeepSeek-R1 Vs Kimi k1.5: Technical Comparison

Let’s begin comparing Qwen2.5-Max, DeepSeek-R1, and Kimi k1.5, starting with their technical details. For this, we will compare the benchmark performances and features of these three models.

Benchmark Performance Comparison

Based on the available data, here is how Qwen2.5-Max performs against DeepSeek-R1 and Kimi k1.5 on various standard benchmark tests.

  1. LiveCodeBench: This benchmark determines how each model handles coding tasks, including writing, debugging, and understanding code. Kimi k1.5 and Qwen2.5-Max are almost tied, indicating both are very capable of generating and parsing code snippets.
  2. GPQA (Graduate-Level Google-Proof Question Answering): This benchmark evaluates a model’s ability to answer difficult, expert-written questions that cannot be solved by simple lookup, testing deep reasoning and factual knowledge. On this benchmark, DeepSeek-R1 leads Qwen2.5-Max by a significant margin, highlighting its strength here.
  3. MMLU: This benchmark tests multi-subject knowledge and language understanding across various domains (mathematics, sciences, humanities, etc.). Qwen2.5-Max leads DeepSeek-R1, showing a better grasp of diverse academic and real-world topics.
  4. C-Eval: This benchmark covers nuanced, exam-style questions across multiple disciplines in Chinese, including higher-level reasoning. Qwen2.5-Max performs better here than Kimi k1.5, making it the stronger reasoning model on this test.

Feature Comparison

Each of the three models has its own web chat interface which offers different features. Here’s how Qwen2.5-Max performs against its peers when it comes to features:

| Feature          | Qwen2.5-Max | DeepSeek-R1 | Kimi k1.5 |
|------------------|-------------|-------------|-----------|
| Image Analysis   | No          | Yes         | Yes       |
| Web Interface    | Yes         | Yes         | Yes       |
| Image Generation | Yes         | No          | No        |
| Web Search       | No          | Yes         | Yes       |
| Artifacts        | Yes         | No          | No        |
| Documents Upload | Single      | Multiple    | Multiple  |
| Common Phrase    | No          | No          | Yes       |

Qwen2.5-Max Vs DeepSeek-R1 Vs Kimi k1.5: Application-based Analysis

Now let’s compare the currently reigning top three Chinese models – Qwen2.5-Max, DeepSeek-R1, and Kimi k1.5 – based on their applications. For this, we will be testing their performance on three different tasks involving reasoning, multi-step processing, and coding.

We will give the same prompt to all three models and compare their responses, giving each a score of 0, 0.5, or 1 based on its performance. At the end, we’ll tally the scores to find out which of them is the best. So let’s get started.

Task 1: Advanced Reasoning

Prompt: “Prove mathematically that Earth is round”

Outputs

Qwen2.5-Max: “Using geometry, gravity, satellite motion, and historical experiments, we have shown mathematically that the Earth is round. The curvature of the horizon, uniformity of gravity, and consistency of satellite orbits all support this conclusion.”

DeepSeek-R1: “The Earth is a sphere (or more precisely, an oblate spheroid). By calculating Earth’s circumference using angular differences in shadows and distances between two locations, Eratosthenes mathematically proved Earth’s spherical shape. Modern measurements and observations (e.g., horizon curvature, lunar eclipses) further confirm this.”

Kimi k1.5: “The consistency of measured distances and angles with spherical geometry (e.g., d = Rθ) proves the Earth is round. Flat-Earth models fail to explain these relationships.”

My Analysis

Qwen2.5-Max: The model provided some key observations supporting the claim, then covered various parameters like the geometry of a sphere, horizon curvature, and rotational and centrifugal force, several of which were not required. It did cover Eratosthenes’ experiment as one of those parameters.

DeepSeek-R1: The model explained the theory in brief, along with all the mathematical calculations needed to prove the given task. It also gave some additional proofs and key observations to establish its result, using Eratosthenes’ experiment as the core of the proof.

Kimi k1.5: This model gave the clearest and most concise response. It proved the hypothesis with a short mathematical calculation, using the measurements from Eratosthenes’ experiment without explicitly mentioning it.

The most relevant proof of Earth’s curvature would be the one involving Eratosthenes’ method (circa 240 BCE), since it’s one of the oldest and most comprehensive proofs involving the mathematical analysis of the angular difference between two places. All three models used that approach in one way or the other.

Qwen2.5-Max provided 8 different ways to prove that the Earth is round, without properly explaining any of them. DeepSeek-R1 took Eratosthenes’ method and explained its theory and mathematics in concise and clear terms. Kimi k1.5 used the simplest approach, based on the required method, without even explicitly mentioning it.
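To make the math behind Eratosthenes’ method concrete, here is a minimal Python sketch. The 7.2° shadow angle and the roughly 800 km Alexandria–Syene distance are the commonly cited historical approximations, not values taken from any of the model outputs:

```python
def circumference_from_shadow(angle_deg: float, distance_km: float) -> float:
    """Eratosthenes' method: if the Sun is directly overhead in one city
    while casting shadows at angle theta in another city a known distance
    away, that distance is the arc d = R*theta. Scaling up, the full
    circumference is (360 / theta) * d."""
    return (360.0 / angle_deg) * distance_km

# Approximate historical values: a 7.2-degree shadow angle between
# Alexandria and Syene, which are about 800 km apart.
print(circumference_from_shadow(7.2, 800))  # -> 40000.0 km (the real value is ~40,075 km)
```

This single consistency check, measured arc lengths matching d = Rθ for one value of R, is essentially the argument all three models relied on.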

Score: Qwen2.5-Max: 0 | DeepSeek-R1: 0.5 | Kimi k1.5: 1

Task 2: Multi-step Document Processing & Analysis

Prompt: Summarise the lesson in 1 line, create a flowchart to explain the process happening in the lesson, and then translate the summary into French.
🔗 Lesson

Outputs

My Analysis

Qwen2.5-Max: The summary was concise and laid out the topics covered in the lesson. The flowchart covered all essential headings and their subheadings, as required.

DeepSeek-R1: The summary of the lesson was crisp, concise, and to the point. The flowchart covered all essential headings but had more than the required content in the sub-headings.

Kimi k1.5: The summary covered all the topics and was quite simple, yet a bit long compared to the others. Instead of a flowchart about the lesson, the model generated a flowchart of the process covered in the lesson; overall, this flowchart was clear and crisp.

I wanted a simple, crisp, one-line summary of the lesson, which DeepSeek-R1 and Qwen2.5-Max alike delivered. As for the flowchart, while Kimi k1.5’s had exactly the design and crispness I asked for, it lacked details about the flow of the lesson. The flowchart by DeepSeek-R1 was a bit content-heavy, while Qwen2.5-Max gave a good flowchart covering all the essentials.

Score: Qwen2.5-Max: 1 | DeepSeek-R1: 0.5 | Kimi k1.5: 0.5

Task 3: Coding

Prompt: “Write an HTML code for a wordle kind of an app”

Note: Before you enter your prompt in Qwen2.5-Max, click on Artifacts. This way, you will be able to visualize the output of your code within the chat interface.

Output:

Qwen2.5-Max:

DeepSeek-R1:

Kimi k1.5:

My Analysis:

Qwen2.5-Max: The model generated the code quickly, and the app itself looked a lot like the actual Wordle app. Instead of listing the alphabet at the bottom, it let us type our 5 letters directly and automatically updated them on the board. With its Artifacts feature, it was super easy to test the code right there in the chat.

DeepSeek-R1: The model took some time to generate the code, but the output was great. The app it built was almost the same as the actual Wordle app: we could click the letters we wished to guess and they would appear in the word. The only issue was that I had to copy the code and run it in a different interface.

Kimi k1.5: The model generated the code quickly enough, but the output was a distorted version of the actual Wordle app. The word board was not appearing, and neither were all the letters; the enter and delete buttons were overlapping the alphabet. Here too, I had to run the code in a different interface to visualize the output.

Firstly, I wanted the generated app to be as similar to the actual Wordle app as possible. Secondly, I wanted to put in minimum effort to test the generated code. The result generated by DeepSeek-R1 was the closest to the ask, while Qwen2.5-Max’s fairly good result was the easiest to test.
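Since the generated HTML can’t be reproduced here, a compact way to see what all three models had to get right is Wordle’s letter-feedback rule. Below is a minimal, hypothetical Python sketch of that rule (the function name and the 'G'/'Y'/'_' symbols are our own; the apps the models produced were in HTML/JavaScript):

```python
def score_guess(guess: str, answer: str) -> str:
    """Wordle-style feedback: 'G' = right letter, right spot (green),
    'Y' = right letter, wrong spot (yellow), '_' = not in the word (gray).
    Greens are marked first; yellows then consume the remaining answer
    letters, so duplicate letters are not over-counted."""
    feedback = ["_"] * len(guess)
    remaining = []  # answer letters not matched by a green
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            feedback[i] = "G"
        else:
            remaining.append(a)
    for i, g in enumerate(guess):
        if feedback[i] == "_" and g in remaining:
            feedback[i] = "Y"
            remaining.remove(g)
    return "".join(feedback)

print(score_guess("crane", "cider"))  # -> GY__Y
```

The two-pass structure (greens first, then yellows) is the detail quick implementations most often get wrong when a guess repeats a letter.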

Score: Qwen2.5-Max: 1 | DeepSeek-R1: 1 | Kimi k1.5: 0

Final Score

Qwen2.5-Max: 2 | DeepSeek-R1: 2 | Kimi k1.5: 1.5

Conclusion

Qwen2.5-Max is an impressive LLM that gives models like DeepSeek-R1 and Kimi k1.5 tough competition. Its responses were comparable across all the tasks. Although it currently lacks the ability to analyze images or search the web, once those features are live, Qwen2.5-Max will be a very hard model to beat. It already offers video generation capabilities that even GPT-4o doesn’t have yet. Moreover, its interface is quite intuitive, with features like Artifacts that make it simpler to run code within the same platform. All in all, Qwen2.5-Max by Alibaba is an all-round LLM that is here to redefine how we work with LLMs!

Frequently Asked Questions

Q1. What is Qwen2.5-Max? 

A. Qwen2.5-Max is Alibaba’s latest multimodal LLM, optimized for text, image, and video generation and pretrained on over 20 trillion tokens.

Q2. How does Qwen2.5-Max perform compared to DeepSeek-R1 and Kimi k1.5?

A. Compared to DeepSeek-R1 and Kimi k1.5, it excels in reasoning, multimodal content creation, and programming support, making it a strong competitor in the Chinese AI ecosystem.

Q3. Is Qwen2.5-Max open-source?

A. No, Qwen2.5-Max is a closed-source model, while DeepSeek-R1 and Kimi k1.5 are open-source.

Q4. Can Qwen2.5-Max generate images and videos?

A. Yes! The Qwen2.5-Max model supports image and video generation.

Q5. Can Kimi k1.5 and DeepSeek-R1 perform web searches?

A. Yes, both DeepSeek-R1 and Kimi k1.5 support real-time web search, whereas Qwen2.5-Max currently lacks web search capabilities. This gives DeepSeek-R1 and Kimi an edge in retrieving the latest online information.

Q6. Should I choose Qwen2.5-Max, DeepSeek-R1, or Kimi k1.5?

A. Depending on your use case, choose:
– Qwen2.5-Max: If you need multimodal capabilities (text, images, video) and advanced AI reasoning.
– DeepSeek-R1: If you want the flexibility of an open-source model, superior question-answering performance, and web search integration.
– Kimi k1.5: If you need efficient document handling, STEM-based problem-solving, and real-time web access.

Anu Madan has 5+ years of experience in content creation and management. Having worked as a content creator, reviewer, and manager, she has created several courses and blogs. Currently, she is working on creating and strategizing content curation and design around Generative AI and other upcoming technologies.
