AI-powered reasoning models are taking the world by storm in 2025! With the launch of DeepSeek-R1 and o3-mini, we have seen unprecedented levels of logical reasoning capabilities in AI chatbots. In this article, we will access these models via their APIs and evaluate their logical reasoning skills to find out if o3-mini can replace DeepSeek-R1. We will be comparing their performance on standard benchmarks as well as real-world applications like solving logical puzzles and even building a Tetris game! So buckle up and join the ride.
DeepSeek-R1 and o3-mini offer unique approaches to structured thinking and deduction, making them apt for various kinds of complex problem-solving tasks. Before we speak of their benchmark performance, let’s first have a sneak peek at the architecture of these models.
o3-mini is OpenAI’s most advanced reasoning model. It uses a dense transformer architecture, processing each token with all model parameters for strong performance but high resource consumption. In contrast, DeepSeek’s most logical model, R1, employs a Mixture-of-Experts (MoE) framework, activating only a subset of parameters per input for greater efficiency. This makes DeepSeek-R1 more scalable and computationally optimized while maintaining solid performance.
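The dense-vs-MoE difference can be illustrated with a toy sketch. This is a simplified illustration only, not the actual architectures; the dimensions, router, and weights below are made up for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 8, 4, 1   # toy sizes, nothing like the real models

# Each "expert" is a small weight matrix; an MoE layer keeps several of them
# and a learned router picks which ones process each token.
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
router = rng.standard_normal((d, n_experts))

def dense_forward(x):
    # Dense layer: every parameter touches every token.
    return sum(x @ e for e in experts)

def moe_forward(x):
    # MoE layer: only the top-k experts fire; the rest stay inactive.
    scores = x @ router                      # router logits, one per expert
    chosen = np.argsort(scores)[-top_k:]     # indices of the top-k experts
    weights = np.exp(scores[chosen])
    weights /= weights.sum()                 # softmax over the chosen experts
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

x = rng.standard_normal(d)
print(dense_forward(x).shape, moe_forward(x).shape)  # same output shape, ~1/4 the expert FLOPs
```

With top-1 routing, the MoE layer applies only one of the four expert matrices per token, which is exactly the efficiency argument made for DeepSeek-R1's design.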
Learn More: Is OpenAI’s o3-mini Better Than DeepSeek-R1?
Now let's see how well these models perform on logical reasoning tasks. First, let's look at their scores on the LiveBench benchmark tests.
Source: livebench.ai
The benchmark results show that OpenAI’s o3-mini outperforms DeepSeek-R1 in almost all aspects, except for math. With a global average score of 73.94 compared to DeepSeek’s 71.38, the o3-mini demonstrates slightly stronger overall performance. It particularly excels in reasoning, achieving 89.58 versus DeepSeek’s 83.17, reflecting superior analytical and problem-solving capabilities.
Also Read: Google Gemini 2.0 Pro vs DeepSeek-R1: Who Does Coding Better?
Since we are testing these models through their APIs, let’s see how much these models cost.
| Model | Context Length | Input Price | Cached Input Price | Output Price |
|---|---|---|---|---|
| o3-mini | 200k | $1.10/M tokens | $0.55/M tokens | $4.40/M tokens |
| deepseek-chat | 64k | $0.27/M tokens | $0.07/M tokens | $1.10/M tokens |
| deepseek-reasoner | 64k | $0.55/M tokens | $0.14/M tokens | $2.19/M tokens |
As seen in the table, OpenAI’s o3-mini is nearly twice as expensive as DeepSeek R1 in terms of API costs. It charges $1.10 per million tokens for input and $4.40 for output, whereas DeepSeek R1 offers a more cost-effective rate of $0.55 per million tokens for input and $2.19 for output, making it a more budget-friendly option for large-scale applications.
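Using the per-million-token prices from the table, estimating a request's cost is a one-line calculation. A minimal sketch (the token counts in the example are illustrative):

```python
# Per-million-token API prices (USD), taken from the pricing table above.
PRICES = {
    "o3-mini":           {"input": 1.10, "output": 4.40},
    "deepseek-reasoner": {"input": 0.55, "output": 2.19},
}

def estimate_cost(model, input_tokens, output_tokens):
    """Approximate request cost, ignoring cached-input discounts."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Example: a request with 28 input tokens and ~3,300 output tokens.
for model in PRICES:
    print(f"{model}: ${estimate_cost(model, 28, 3300):.6f}")
```

Because both prices differ by roughly a factor of two, the o3-mini estimate comes out about twice the DeepSeek-R1 one for any input/output mix.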
Sources: DeepSeek-R1 | o3-mini
Before we step into the hands-on performance comparison, let’s learn how to access DeepSeek-R1 and o3-mini using APIs.
All you have to do for this is import the necessary libraries and API keys:

from openai import OpenAI
from IPython.display import display, Markdown
import tiktoken  # used for token counting throughout
import time

with open("path_of_api_key") as file:
    openai_api_key = file.read().strip()

with open("path_of_api_key") as file:
    deepseek_api = file.read().strip()
Now that we have API access, let's compare DeepSeek-R1 and o3-mini based on their logical reasoning capabilities. For this, we will give the same prompt to both models and evaluate their responses based on these metrics:
We will then score the models 0 or 1 for each task, depending on their performance. So let’s try out the tasks and see who emerges as the winner in the DeepSeek-R1 vs o3-mini reasoning battle!
This task requires the model to implement a fully functional Tetris game using Python, efficiently managing game logic, piece movement, collision detection, and rendering without relying on external game engines.
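To give a sense of the game logic involved, here is a minimal, hypothetical sketch of the grid-based collision check that any Tetris implementation needs. The board representation and piece encoding are assumptions for illustration, not taken from either model's output:

```python
# Board: 2D list of 0/1 cells (0 = empty). Piece: list of (row, col) offsets.
BOARD_W, BOARD_H = 10, 20
board = [[0] * BOARD_W for _ in range(BOARD_H)]

def collides(piece_cells, row, col):
    """True if placing the piece at (row, col) hits a wall, the floor, or a locked block."""
    for dr, dc in piece_cells:
        r, c = row + dr, col + dc
        if c < 0 or c >= BOARD_W or r >= BOARD_H:
            return True          # outside the playfield
        if r >= 0 and board[r][c]:
            return True          # overlaps a previously locked block
    return False

L_PIECE = [(0, 0), (1, 0), (2, 0), (2, 1)]
print(collides(L_PIECE, 0, 0))   # False: empty board, inside bounds
print(collides(L_PIECE, 18, 0))  # True: bottom cell would land below the floor
```

A move or rotation is applied only if `collides` returns False for the new position; the same check drives piece locking when a downward step fails.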
Prompt: “Write a python code for this problem: generate a Python code for the Tetris game“
Input to DeepSeek-R1 API
INPUT_COST_CACHE_HIT = 0.14 / 1_000_000 # $0.14 per 1M tokens
INPUT_COST_CACHE_MISS = 0.55 / 1_000_000 # $0.55 per 1M tokens
OUTPUT_COST = 2.19 / 1_000_000 # $2.19 per 1M tokens
# Start timing
task1_start_time = time.time()
# Initialize OpenAI client for DeepSeek API
client = OpenAI(api_key=deepseek_api, base_url="https://api.deepseek.com")
messages = [
{
"role": "system",
"content": """You are a professional Programmer with a large experience."""
},
{
"role": "user",
"content": """write a python code for this problem: generate a python code for Tetris game."""
}
]
# Get token count using tiktoken (adjust model name if necessary)
encoding = tiktoken.get_encoding("cl100k_base") # Use a compatible tokenizer
input_tokens = sum(len(encoding.encode(msg["content"])) for msg in messages)
# Call DeepSeek API
response = client.chat.completions.create(
model="deepseek-reasoner",
messages=messages,
stream=False
)
# Get output token count
output_tokens = len(encoding.encode(response.choices[0].message.content))
task1_end_time = time.time()
total_time_taken = task1_end_time - task1_start_time
# Assume cache miss for worst-case pricing (adjust if cache info is available)
input_cost = (input_tokens / 1_000_000) * INPUT_COST_CACHE_MISS
output_cost = (output_tokens / 1_000_000) * OUTPUT_COST
total_cost = input_cost + output_cost
# Print results
print("Response:", response.choices[0].message.content)
print("------------------ Total Time Taken for Task 1: ------------------", total_time_taken)
print(f"Input Tokens: {input_tokens}, Output Tokens: {output_tokens}")
print(f"Estimated Cost: ${total_cost:.6f}")
# Display result
from IPython.display import Markdown
display(Markdown(response.choices[0].message.content))
Response by DeepSeek-R1
You can find DeepSeek-R1’s complete response here.
Output token cost:
Input Tokens: 28 | Output Tokens: 3323 | Estimated Cost: $0.0073
Code Output
Input to o3-mini API
task1_start_time = time.time()
client = OpenAI(api_key=openai_api_key)
messages = [
{
"role": "system",
"content": """You are a professional Programmer with a large experience ."""
},
{
"role": "user",
"content": """write a python code for this problem: generate a python code for Tetris game.
"""
}
]
# cl100k_base is a compatible tokenizer for rough token counts on OpenAI models
encoding = tiktoken.get_encoding("cl100k_base")
# Calculate token counts
input_tokens = sum(len(encoding.encode(msg["content"])) for msg in messages)
completion = client.chat.completions.create(
model="o3-mini-2025-01-31",
messages=messages
)
output_tokens = len(encoding.encode(completion.choices[0].message.content))
task1_end_time = time.time()
input_cost_per_1k = 0.0011   # $1.10 per 1M tokens = $0.0011 per 1K input tokens
output_cost_per_1k = 0.0044  # $4.40 per 1M tokens = $0.0044 per 1K output tokens
# Calculate cost
input_cost = (input_tokens / 1000) * input_cost_per_1k
output_cost = (output_tokens / 1000) * output_cost_per_1k
total_cost = input_cost + output_cost
print(completion.choices[0].message)
print("------------------ Total Time Taken for Task 1: ------------------", task1_end_time - task1_start_time)
print(f"Input Tokens: {input_tokens}, Output Tokens: {output_tokens}")
print(f"Estimated Cost: ${total_cost:.6f}")
# Display result
from IPython.display import Markdown
display(Markdown(completion.choices[0].message.content))
Response by o3-mini
You can find o3-mini’s complete response here.
Output token cost:
Input Tokens: 28 | Output Tokens: 3235 | Estimated Cost: $0.014265
Code Output
Comparative Analysis
In this task, the models were required to generate functional Tetris code that allows for actual gameplay. DeepSeek-R1 successfully produced a fully working implementation, as demonstrated in the code output video. In contrast, while o3-mini’s code appeared well-structured, it encountered errors during execution. As a result, DeepSeek-R1 outperforms o3-mini in this scenario, delivering a more reliable and playable solution.
Score: DeepSeek-R1: 1 | o3-mini: 0
This task requires the model to efficiently analyze relational inequalities rather than relying on basic sorting methods.
Prompt: “ In the following question assuming the given statements to be true, find which of the conclusion among the given conclusions is/are definitely true and then give your answers accordingly.
Statements:
H > F ≤ O ≤ L; F ≥ V < D
Conclusions: I. L ≥ V II. O > D
The options are:
A. Only I is true
B. Only II is true
C. Both I and II are true
D. Either I or II is true
E. Neither I nor II is true.”
Input to DeepSeek-R1 API
INPUT_COST_CACHE_HIT = 0.14 / 1_000_000 # $0.14 per 1M tokens
INPUT_COST_CACHE_MISS = 0.55 / 1_000_000 # $0.55 per 1M tokens
OUTPUT_COST = 2.19 / 1_000_000 # $2.19 per 1M tokens
# Start timing
task2_start_time = time.time()
# Initialize OpenAI client for DeepSeek API
client = OpenAI(api_key=deepseek_api, base_url="https://api.deepseek.com")
messages = [
{"role": "system", "content": "You are an expert in solving Reasoning Problems. Please solve the given problem."},
{"role": "user", "content": """ In the following question, assuming the given statements to be true, find which of the conclusions among given conclusions is/are definitely true and then give your answers accordingly.
Statements: H > F ≤ O ≤ L; F ≥ V < D
Conclusions:
I. L ≥ V
II. O > D
The options are:
A. Only I is true
B. Only II is true
C. Both I and II are true
D. Either I or II is true
E. Neither I nor II is true
"""}
]
# Get token count using tiktoken (adjust model name if necessary)
encoding = tiktoken.get_encoding("cl100k_base") # Use a compatible tokenizer
input_tokens = sum(len(encoding.encode(msg["content"])) for msg in messages)
# Call DeepSeek API
response = client.chat.completions.create(
model="deepseek-reasoner",
messages=messages,
stream=False
)
# Get output token count
output_tokens = len(encoding.encode(response.choices[0].message.content))
task2_end_time = time.time()
total_time_taken = task2_end_time - task2_start_time
# Assume cache miss for worst-case pricing (adjust if cache info is available)
input_cost = (input_tokens / 1_000_000) * INPUT_COST_CACHE_MISS
output_cost = (output_tokens / 1_000_000) * OUTPUT_COST
total_cost = input_cost + output_cost
# Print results
print("Response:", response.choices[0].message.content)
print("------------------ Total Time Taken for Task 2: ------------------", total_time_taken)
print(f"Input Tokens: {input_tokens}, Output Tokens: {output_tokens}")
print(f"Estimated Cost: ${total_cost:.6f}")
# Display result
from IPython.display import Markdown
display(Markdown(response.choices[0].message.content))
Output token cost:
Input Tokens: 136 | Output Tokens: 352 | Estimated Cost: $0.000004
Response by DeepSeek-R1
Input to o3-mini API
task2_start_time = time.time()
client = OpenAI(api_key=openai_api_key)
messages = [
{
"role": "system",
"content": """You are an expert in solving Reasoning Problems. Please solve the given problem"""
},
{
"role": "user",
"content": """In the following question, assuming the given statements to be true, find which of the conclusions among given conclusions is/are definitely true and then give your answers accordingly.
Statements: H > F ≤ O ≤ L; F ≥ V < D
Conclusions:
I. L ≥ V
II. O > D
The options are:
A. Only I is true
B. Only II is true
C. Both I and II are true
D. Either I or II is true
E. Neither I nor II is true
"""
}
]
# cl100k_base is a compatible tokenizer for rough token counts on OpenAI models
encoding = tiktoken.get_encoding("cl100k_base")
# Calculate token counts
input_tokens = sum(len(encoding.encode(msg["content"])) for msg in messages)
completion = client.chat.completions.create(
model="o3-mini-2025-01-31",
messages=messages
)
output_tokens = len(encoding.encode(completion.choices[0].message.content))
task2_end_time = time.time()
input_cost_per_1k = 0.0011   # $1.10 per 1M tokens = $0.0011 per 1K input tokens
output_cost_per_1k = 0.0044  # $4.40 per 1M tokens = $0.0044 per 1K output tokens
# Calculate cost
input_cost = (input_tokens / 1000) * input_cost_per_1k
output_cost = (output_tokens / 1000) * output_cost_per_1k
total_cost = input_cost + output_cost
# Print results
print(completion.choices[0].message)
print("------------------ Total Time Taken for Task 2: ------------------", task2_end_time - task2_start_time)
print(f"Input Tokens: {input_tokens}, Output Tokens: {output_tokens}")
print(f"Estimated Cost: ${total_cost:.6f}")
# Display result
from IPython.display import Markdown
display(Markdown(completion.choices[0].message.content))
Output token cost:
Input Tokens: 135 | Output Tokens: 423 | Estimated Cost: $0.002010
Response by o3-mini
Comparative Analysis
o3-mini delivers the most efficient solution, providing a concise yet accurate response in significantly less time. It maintains clarity while ensuring logical soundness, making it ideal for quick reasoning tasks. DeepSeek-R1, while equally correct, is much slower and more verbose. Its detailed breakdown of logical relationships enhances explainability but may feel excessive for straightforward evaluations. Though both models arrive at the same conclusion, o3-mini’s speed and direct approach make it the better choice for practical use.
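For reference, the answer follows from chaining the inequalities: F ≤ O ≤ L together with F ≥ V gives L ≥ O ≥ F ≥ V, so conclusion I is definitely true, while nothing relates O to D, so conclusion II cannot be confirmed, making option A correct. A small brute-force check over integer assignments confirms this:

```python
from itertools import product

always_I, always_II, satisfiable = True, True, False

# Enumerate small integer values for H, F, O, L, V, D and keep only the
# assignments consistent with the statements H > F <= O <= L and F >= V < D.
for H, F, O, L, V, D in product(range(4), repeat=6):
    if not (H > F <= O <= L and F >= V < D):
        continue
    satisfiable = True
    always_I  &= (L >= V)   # conclusion I: L >= V
    always_II &= (O > D)    # conclusion II: O > D

print(satisfiable, always_I, always_II)  # True True False -> only I holds (option A)
```

Values 0-3 suffice here because the statements chain at most four variables, so any counterexample can be rescaled into that range.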
Score: DeepSeek-R1: 0 | o3-mini: 1
This task challenges the model to recognize numerical patterns, which may involve arithmetic operations, multiplication, or a combination of mathematical rules. Instead of brute-force searching, the model must adopt a structured approach to deduce the hidden logic efficiently.
Prompt: “Study the given matrix carefully and select the number from among the given options that can replace the question mark (?) in it.
____________
| 7 | 13 | 174|
| 9 | 25 | 104|
| 11 | 30 | ? |
|_____|____|___|
The options are:
A 335
B 129
C 431
D 100
Please mention your approach that you have taken at each step.“
Input to DeepSeek-R1 API
INPUT_COST_CACHE_HIT = 0.14 / 1_000_000 # $0.14 per 1M tokens
INPUT_COST_CACHE_MISS = 0.55 / 1_000_000 # $0.55 per 1M tokens
OUTPUT_COST = 2.19 / 1_000_000 # $2.19 per 1M tokens
# Start timing
task3_start_time = time.time()
# Initialize OpenAI client for DeepSeek API
client = OpenAI(api_key=deepseek_api, base_url="https://api.deepseek.com")
messages = [
{
"role": "system",
"content": """You are an Expert in solving Reasoning Problems. Please solve the given problem"""
},
{
"role": "user",
"content": """
Study the given matrix carefully and select the number from among the given options that can replace the question mark (?) in it.
__________________
| 7 | 13 | 174|
| 9 | 25 | 104|
| 11 | 30 | ? |
|_____|_____|____|
The options are:
A 335
B 129
C 431
D 100
Please mention your approach that you have taken at each step
"""
}
]
# Get token count using tiktoken (adjust model name if necessary)
encoding = tiktoken.get_encoding("cl100k_base") # Use a compatible tokenizer
input_tokens = sum(len(encoding.encode(msg["content"])) for msg in messages)
# Call DeepSeek API
response = client.chat.completions.create(
model="deepseek-reasoner",
messages=messages,
stream=False
)
# Get output token count
output_tokens = len(encoding.encode(response.choices[0].message.content))
task3_end_time = time.time()
total_time_taken = task3_end_time - task3_start_time
# Assume cache miss for worst-case pricing (adjust if cache info is available)
input_cost = (input_tokens / 1_000_000) * INPUT_COST_CACHE_MISS
output_cost = (output_tokens / 1_000_000) * OUTPUT_COST
total_cost = input_cost + output_cost
# Print results
print("Response:", response.choices[0].message.content)
print("------------------ Total Time Taken for Task 3: ------------------", total_time_taken)
print(f"Input Tokens: {input_tokens}, Output Tokens: {output_tokens}")
print(f"Estimated Cost: ${total_cost:.6f}")
# Display result
from IPython.display import Markdown
display(Markdown(response.choices[0].message.content))
Output token cost:
Input Tokens: 134 | Output Tokens: 274 | Estimated Cost: $0.000003
Response by DeepSeek-R1
Input to o3-mini API
task3_start_time = time.time()
client = OpenAI(api_key=openai_api_key)
messages = [
{
"role": "system",
"content": """You are an Expert in solving Reasoning Problems. Please solve the given problem"""
},
{
"role": "user",
"content": """
Study the given matrix carefully and select the number from among the given options that can replace the question mark (?) in it.
__________________
| 7 | 13 | 174|
| 9 | 25 | 104|
| 11 | 30 | ? |
|_____|_____|____|
The options are:
A 335
B 129
C 431
D 100
Please mention your approach that you have taken at each step
"""
}
]
# cl100k_base is a compatible tokenizer for rough token counts on OpenAI models
encoding = tiktoken.get_encoding("cl100k_base")
# Calculate token counts
input_tokens = sum(len(encoding.encode(msg["content"])) for msg in messages)
completion = client.chat.completions.create(
model="o3-mini-2025-01-31",
messages=messages
)
output_tokens = len(encoding.encode(completion.choices[0].message.content))
task3_end_time = time.time()
input_cost_per_1k = 0.0011   # $1.10 per 1M tokens = $0.0011 per 1K input tokens
output_cost_per_1k = 0.0044  # $4.40 per 1M tokens = $0.0044 per 1K output tokens
# Calculate cost
input_cost = (input_tokens / 1000) * input_cost_per_1k
output_cost = (output_tokens / 1000) * output_cost_per_1k
total_cost = input_cost + output_cost
# Print results
print(completion.choices[0].message)
print("------------------ Total Time Taken for Task 3: ------------------", task3_end_time - task3_start_time)
print(f"Input Tokens: {input_tokens}, Output Tokens: {output_tokens}")
print(f"Estimated Cost: ${total_cost:.6f}")
# Display result
from IPython.display import Markdown
display(Markdown(completion.choices[0].message.content))
Output token cost:
Input Tokens: 134 | Output Tokens: 736 | Estimated Cost: $0.003386
Output by o3-mini
Comparative Analysis
Here, the pattern followed in each row is:
(1st number)^3 − (2nd number)^2 = 3rd number
Applying this pattern: 7^3 − 13^2 = 343 − 169 = 174, 9^3 − 25^2 = 729 − 625 = 104, and so the third row gives 11^3 − 30^2 = 1331 − 900 = 431.
Hence, the correct answer is 431 (option C).
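The row rule is easy to verify programmatically, as a quick sanity check of the pattern stated above:

```python
rows = [(7, 13, 174), (9, 25, 104), (11, 30, None)]

# Row rule: (1st number)**3 - (2nd number)**2 == 3rd number
for a, b, c in rows[:2]:
    assert a**3 - b**2 == c, f"rule fails for row {(a, b, c)}"

# Apply the same rule to the incomplete third row.
a, b, _ = rows[2]
print(a**3 - b**2)  # 1331 - 900 = 431 -> option C
```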
DeepSeek-R1 correctly identifies and applies this pattern, arriving at the right answer. Its structured approach ensures accuracy, though it takes significantly longer to compute the result. o3-mini, on the other hand, fails to establish a consistent pattern. It attempts multiple operations, such as multiplication, addition, and exponentiation, but does not arrive at a definitive answer, resulting in an unclear and incorrect response. Overall, DeepSeek-R1 outperforms o3-mini in logical reasoning and accuracy, while o3-mini struggles due to its inconsistent and ineffective approach.
Score: DeepSeek-R1: 1 | o3-mini: 0
| Task No. | Task Type | Model | Performance | Time Taken (seconds) | Cost |
|---|---|---|---|---|---|
| 1 | Code Generation | DeepSeek-R1 | ✅ Working Code | 606.45 | $0.0073 |
| 1 | Code Generation | o3-mini | ❌ Non-working Code | 99.73 | $0.014265 |
| 2 | Alphabetical Reasoning | DeepSeek-R1 | ✅ Correct | 74.28 | $0.000004 |
| 2 | Alphabetical Reasoning | o3-mini | ✅ Correct | 8.08 | $0.002010 |
| 3 | Mathematical Reasoning | DeepSeek-R1 | ✅ Correct | 450.53 | $0.000003 |
| 3 | Mathematical Reasoning | o3-mini | ❌ Wrong Answer | 12.37 | $0.003386 |
As we have seen in this comparison, both DeepSeek-R1 and o3-mini demonstrate unique strengths catering to different needs. DeepSeek-R1 excels in accuracy-driven tasks, particularly in mathematical reasoning and complex code generation, making it a strong candidate for applications requiring logical depth and correctness. However, one significant drawback is its slower response times, partly due to ongoing server maintenance issues that have affected its accessibility. On the other hand, o3-mini offers significantly faster response times, but its tendency to produce incorrect results limits its reliability for high-stakes reasoning tasks.
This analysis underscores the trade-offs between speed and accuracy in language models. While o3-mini may be useful for rapid, low-risk applications, DeepSeek-R1 stands out as the superior choice for reasoning-intensive tasks, provided its latency issues are addressed. As AI models continue to evolve, striking a balance between performance efficiency and correctness will be key to optimizing AI-driven workflows across various domains.
Also Read: Can OpenAI’s o3-mini Beat Claude Sonnet 3.5 in Coding?
Q. What are the main strengths of DeepSeek-R1 and o3-mini?
A. DeepSeek-R1 excels in mathematical reasoning and complex code generation, making it ideal for applications that require logical depth and accuracy. o3-mini, on the other hand, is significantly faster but often sacrifices accuracy, leading to occasional incorrect outputs.

Q. Which model is better for coding tasks?
A. DeepSeek-R1 is the better choice for coding and reasoning-intensive tasks due to its superior accuracy and ability to handle complex logic. While o3-mini provides quicker responses, it may generate errors, making it less reliable for high-stakes programming tasks.

Q. What applications is o3-mini best suited for?
A. o3-mini is best suited for low-risk, speed-dependent applications, such as chatbots, casual text generation, and interactive AI experiences. However, for tasks requiring high accuracy, DeepSeek-R1 is the preferred option.

Q. Which model is better for reasoning and math queries?
A. DeepSeek-R1 has superior logical reasoning and problem-solving capabilities, making it a strong choice for mathematical computations, programming assistance, and scientific queries. o3-mini provides quick but sometimes inconsistent responses in complex problem-solving scenarios.