OpenAI was the first to introduce reasoning models like o1 and o1-mini, but is it the only player in the game? Not by a long shot! Chinese LLMs like DeepSeek, Qwen, and now Kimi are stepping up to challenge OpenAI by delivering similar capabilities at much more affordable prices. After DeepSeek’s impressive debut, it’s Kimi AI’s turn to shine with its new Kimi k1.5 model. In this article, we will test Kimi k1.5 against OpenAI o1 on the same tasks and see which one performs better!
Please note: Kimi k1.5 is free, while accessing o1 and o1-mini through ChatGPT Plus costs $20/month. Before diving into the tasks, let’s compare the two models.
Kimi k1.5 is a multi-modal LLM by Moonshot AI, trained with reinforcement learning (RL) and designed to excel at reasoning tasks across text, vision, and coding. Launched recently, Kimi k1.5 has quickly gained attention for its impressive performance, matching the capabilities of OpenAI’s full o1 model rather than just the preview or mini variants.
Kimi k1.5 matches OpenAI’s o1 and o1-mini on long-CoT tasks and outperforms GPT-4o and Claude 3.5 Sonnet on short-CoT tasks. Its multi-modal capabilities, particularly in visual reasoning, position it as a strong competitor to OpenAI’s models. Its RL training techniques, multi-modal data recipes, and infrastructure optimizations have been pivotal in achieving these results.
The Kimi k1.5 long-CoT model achieves state-of-the-art (SOTA) performance through a combination of long-context scaling, improved policy optimization methods, and vision-text reinforcement learning (RL). Unlike traditional language model pretraining, which relies on next-token prediction and is limited by the available training data, Kimi k1.5 uses RL to scale its training data by learning to explore with rewards. This yields a simple yet effective RL framework that avoids more complex techniques such as Monte Carlo tree search or value functions.
Key benchmarks highlight the model’s exceptional performance:
The model’s ability to handle long-context tasks like planning, reflection, and correction is enhanced by partial rollouts during training, improving efficiency and performance.
The short-CoT model builds on the success of the long-CoT model, using effective long2short methods to distill long-CoT techniques into a more efficient framework. This approach combines fine-tuning, reinforcement learning, and long-to-short distillation, delivering rapid and accurate reasoning for short-context tasks.
Notable achievements include:
The long-to-short distillation techniques, including model merging and shortest rejection sampling, ensure high performance with reduced computational costs, making the short-CoT model highly efficient.
To know more, you can check out their research paper here!
Now, let’s evaluate Kimi k1.5 and OpenAI o1 on the same prompts!
Prompt: Solve this zebra puzzle correctly.
Collector Position | Nationality | Lot | Amount Spent | Shirt Color |
---|---|---|---|---|
Collector #1 | Portuguese | Toy | Unknown | Blue |
Collector #2 | Japanese | Lot 40 | $400,000 | Purple |
Collector #3 | Italian | Lot 30 | $700,000 | Pink |
Collector #4 | Canadian | Unknown | $200,000 | Yellow |
Collector #5 | Unknown | Unknown | $800,000 | Unknown |
Collector # | Nationality | Lot | Item | Amount | Shirt Color |
---|---|---|---|---|---|
1 | Italian | 30 | Clock | $400,000 | Blue |
2 | Japanese | 40 | Painting | $200,000 | Purple |
3 | Canadian | 10 | Vase | $700,000 | Pink |
4 | Portuguese | 34 | Toy | $800,000 | Yellow |
5 | (Unknown) |
The response provided by OpenAI o1 is significantly inaccurate. The resulting table does not adhere to the required format specified in the question, making the response unreliable. On the other hand, Kimi k1.5 provided a response in the correct format as per the question, but the content of the answer itself is incorrect.
Verdict:
OpenAI o1: ❌ | Kimi k1.5: ❌
Prompt: You walk into a room and see a bed. On the bed there are two dogs, four cats, a giraffe, five cows, and a duck. There are also three chairs and a table. How many legs are on the floor?
Correct Answer: 22. All of the animals are on the bed, so their legs never touch the floor. That leaves your own 2 legs + 4 (bed) + 4 (table) + 12 (3 chairs × 4) = 22 legs on the floor.
OpenAI o1 took noticeably longer to generate its response. Its explanation was detailed and the arithmetic it showed was correct, but it failed to count the human’s legs on the floor, so its final answer was wrong. In contrast, Kimi k1.5 provided the correct answer in less time.
Verdict:
OpenAI o1: ❌ | Kimi k1.5: ✅
Prompt: Create a simple HTML page with a button that changes the background color to Pink when clicked with some confetti.
Both models followed the prompt and produced working pages. In terms of the look and functionality of the webpage, however, Kimi k1.5 does a much better job!
Verdict:
OpenAI o1: ❌ | Kimi k1.5: ✅
Prompt: Read the article – https://www.analyticsvidhya.com/blog/2024/07/building-agentic-rag-systems-with-langgraph/ to understand the process of building a vector database for Wikipedia data. Summarize the key steps in a concise manner.
Key Points to Summarize the Process of Building a Vector Database for Wikipedia Data:
The article “A Comprehensive Guide to Building Agentic RAG Systems with LangGraph” outlines the process of building a vector database for Wikipedia data as part of creating an Agentic RAG system. Here are the key steps summarized:
1. Install the required libraries: `langchain`, `langchain-openai`, `langchain-community`, `langgraph`, and `langchain-chroma`.
2. Set the API keys securely using the `getpass` function to avoid exposing them in the code.
3. Use `RecursiveCharacterTextSplitter` to split the documents into chunks of approximately 2000 characters with an overlap of 300 characters.
4. Use OpenAI embeddings (`text-embedding-3-small`) to convert the document chunks into embeddings. (A minimal code sketch of these steps appears below.)

Kimi k1.5’s output excels by providing detailed, actionable implementation steps with specific technical parameters and library requirements. It balances comprehensive depth with clear organization, explains the reasoning behind each step, and effectively connects the technical process to the broader goal of building an Agentic RAG system.
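To make these steps concrete, here is a minimal, hypothetical Python sketch of such a pipeline. This is not the article’s or either model’s exact code: the Wikipedia query, the document count, and the final step of storing the embeddings in a Chroma vector store are assumptions based on the libraries listed above (the loader also needs the `wikipedia` package installed).

```python
import getpass
import os

from langchain_community.document_loaders import WikipediaLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter  # ships with langchain
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# Set the API key securely instead of hard-coding it
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API key: ")

# Load some Wikipedia documents (query and document count are illustrative)
docs = WikipediaLoader(query="Large language model", load_max_docs=5).load()

# Split into ~2000-character chunks with a 300-character overlap
splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=300)
chunks = splitter.split_documents(docs)

# Embed the chunks and store them in a Chroma vector database
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vector_db = Chroma.from_documents(documents=chunks, embedding=embeddings)

# Quick sanity check: run a similarity search over the stored vectors
results = vector_db.similarity_search("What is retrieval-augmented generation?", k=3)
print(results[0].page_content[:200])
```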
Verdict:
OpenAI o1: ❌ | Kimi k1.5: ✅
Prompt: The image shows a highly simplified schematic of a datacenter in space, with some simplifying assumptions that need justification.
Tasks:
Estimate the lower bound of the required radiator surface area for the datacenter.
Address how to handle the Sun and deep space in the context of thermal management.
Explain how the First Law of Thermodynamics applies to this scenario.
Input Image:
Below is a step-by-step reasoning process with clearly stated simplifying assumptions, justifications, and the application of the First Law of Thermodynamics. The final result will be a lower bound on the radiator area required to dissipate 1 GW of heat generated by a space-based data center.
1. Power Input and Heat Generation:
Assume the data center is powered by a 1 GW solar power source. Nearly all of the electrical power consumed by GPUs will be converted into heat (electrical-to-thermal energy conversion is essentially 100% at steady state since all electrical energy not stored must eventually dissipate as heat). Thus, the waste heat to be removed is approximately 1 GW = 1×10^9 W.
2. Radiative Heat Transfer Only:
In space, there is no convective or conductive cooling to the environment—only radiative heat transfer. Therefore, the primary mode of heat rejection is thermal radiation from a radiator surface into deep space.
3. Emissivity and Absorptivity:
Assume the radiator is constructed with a high-emissivity surface, say ε = 0.9. This is plausible for surfaces coated with certain high-emissivity paints or materials. For a lower bound on the required area, we assume ideal conditions.
To estimate the lower bound of the required radiator surface area for the data center in space, I start by applying the Stefan-Boltzmann Law, which describes the power radiated by a surface: P = εσAT⁴, where ε is the emissivity, σ is the Stefan-Boltzmann constant (5.67 × 10⁻⁸ W·m⁻²·K⁻⁴), A is the radiating area, and T is the radiator temperature.
To handle the Sun and deep space in thermal management, the data center should be shielded from direct sunlight using a sunshade or positioned in a location where it is not exposed to the Sun, such as the dark side of a planet or the Moon. The radiator must be designed to efficiently radiate heat into the vacuum of space, which is extremely cold.
The First Law of Thermodynamics applies to this scenario by ensuring that the heat generated by the data center’s equipment is equal to the heat radiated away by the radiator. This law guarantees that the total energy in the system remains constant, with the heat generated being fully accounted for in the radiator’s design.
The lower bound of the required radiator surface area is approximately 766 square meters. The data center should be shielded from direct sunlight and the radiator should be optimized for heat dissipation in the vacuum of space to maintain thermal equilibrium.
Kimi k1.5’s response is superior for several reasons. It demonstrates clear mathematical reasoning through a step-by-step approach, starting with the fundamental Stefan-Boltzmann Law equation. Kimi clearly defines all variables and their values, shows the mathematical process of solving for the radiator area, and provides a concrete numerical result of 766 square meters. The explanation includes clear justifications for thermal management strategies, practical considerations such as positioning the radiator on the dark side of a planet, and a direct connection to the First Law of Thermodynamics with real-world application. The response concludes with specific numbers and actionable recommendations.
In contrast, OpenAI o1’s response remains more theoretical, focusing on general assumptions and setup rather than completing the actual calculation. It lacks a concrete numerical solution and does not fully address the thermal management aspect, making it less practical and actionable compared to Kimi k1.5’s detailed and solution-oriented approach.
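For readers who want to reproduce the kind of estimate both models were asked for, here is a small illustrative Python sketch of the Stefan-Boltzmann calculation. The emissivity and radiator temperatures below are assumed values chosen only to show the arithmetic; since the area scales as 1/T⁴, the result depends entirely on those assumptions and will not necessarily match either model’s figure.

```python
# Illustrative lower-bound radiator area from the Stefan-Boltzmann law:
# P = epsilon * sigma * A * T**4  =>  A = P / (epsilon * sigma * T**4)

SIGMA = 5.67e-8  # Stefan-Boltzmann constant, W / (m^2 * K^4)

def radiator_area(power_w: float, emissivity: float, temp_k: float) -> float:
    """Radiator area (m^2) needed to reject power_w watts at temperature temp_k."""
    return power_w / (emissivity * SIGMA * temp_k**4)

P = 1e9  # 1 GW of waste heat, as assumed in the prompt

# The answer is extremely sensitive to the assumed radiator temperature.
for temp in (300, 400, 600, 1000):
    area = radiator_area(P, emissivity=0.9, temp_k=temp)
    print(f"T = {temp:4d} K  ->  A ≈ {area:,.0f} m^2")
```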
Verdict:
OpenAI o1: ❌ | Kimi k1.5: ✅
Task | Winner |
---|---|
Zebra Puzzle | Neither |
Logical Reasoning | Kimi k1.5 |
Create an HTML Site | Kimi k1.5 |
Web Search | Kimi k1.5 |
Image Analysis | Kimi k1.5 |
Also Read: Kimi k1.5 vs DeepSeek R1: Battle of the Best Chinese LLMs
Free models like Kimi k1.5 and DeepSeek R1 are challenging OpenAI o1’s dominance, offering superior performance in reasoning, coding, and multi-modal tasks at no cost. With Kimi k1.5 outperforming OpenAI in key benchmarks and DeepSeek R1 excelling in coding challenges, is paying $20/month for OpenAI o1 still justified? Let us know in the comment section below!
Stay tuned to Analytics Vidhya Blog for more such awesome content!