Can o3-mini Replace DeepSeek-R1 for Logical Reasoning?

Vipin Vashisth Last Updated : 12 Feb, 2025
12 min read

AI-powered reasoning models are taking the world by storm in 2025! With the launch of DeepSeek-R1 and o3-mini, we have seen unprecedented levels of logical reasoning capabilities in AI chatbots. In this article, we will access these models via their APIs and evaluate their logical reasoning skills to find out if o3-mini can replace DeepSeek-R1. We will be comparing their performance on standard benchmarks as well as real-world applications like solving logical puzzles and even building a Tetris game! So buckle up and join the ride.

DeepSeek-R1 vs o3-mini: Logical Reasoning Benchmarks

DeepSeek-R1 and o3-mini offer unique approaches to structured thinking and deduction, making them apt for various kinds of complex problem-solving tasks. Before we speak of their benchmark performance, let’s first have a sneak peek at the architecture of these models.

o3-mini is OpenAI’s most advanced reasoning model. It uses a dense transformer architecture, processing each token with all model parameters for strong performance but high resource consumption. In contrast, DeepSeek’s most logical model, R1, employs a Mixture-of-Experts (MoE) framework, activating only a subset of parameters per input for greater efficiency. This makes DeepSeek-R1 more scalable and computationally optimized while maintaining solid performance.
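The dense-versus-MoE distinction can be illustrated with a toy sketch. This is a simplified illustration, not either model's actual architecture; the layer size, gating function, and top-k value below are made up for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)

def dense_layer(x, weights):
    # Dense transformer layer: every token is processed by ALL parameters.
    return sum(w @ x for w in weights)

def moe_layer(x, experts, gate, top_k=2):
    # Mixture-of-Experts: a gate scores each expert, and only the
    # top-k experts run, so most parameters stay idle for each token.
    scores = gate @ x
    active = np.argsort(scores)[-top_k:]
    return sum(scores[i] * (experts[i] @ x) for i in active)

d = 8
x = rng.normal(size=d)                                # one token embedding
experts = [rng.normal(size=(d, d)) for _ in range(4)]
gate = rng.normal(size=(4, d))

y_dense = dense_layer(x, experts)    # 4 matrix multiplies per token
y_moe = moe_layer(x, experts, gate)  # only 2 matrix multiplies per token
```

With the same four weight matrices, the MoE path touches half the parameters per token, which is the efficiency argument made above.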

Learn More: Is OpenAI’s o3-mini Better Than DeepSeek-R1?

Now let’s see how well these models perform on logical reasoning tasks, starting with their scores on the LiveBench benchmark.

o3-mini & DeepSeek-R1 Logical Reasoning benchmarks

Sources: livebench.ai

The benchmark results show that OpenAI’s o3-mini outperforms DeepSeek-R1 in almost all aspects, except for math. With a global average score of 73.94 compared to DeepSeek’s 71.38, the o3-mini demonstrates slightly stronger overall performance. It particularly excels in reasoning, achieving 89.58 versus DeepSeek’s 83.17, reflecting superior analytical and problem-solving capabilities.

Also Read: Google Gemini 2.0 Pro vs DeepSeek-R1: Who Does Coding Better?

DeepSeek-R1 vs o3-mini: API Pricing Comparison

Since we are testing these models through their APIs, let’s see how much these models cost.

| Model | Context Length | Input Price | Cached Input Price | Output Price |
|---|---|---|---|---|
| o3-mini | 200k | $1.10/M tokens | $0.55/M tokens | $4.40/M tokens |
| deepseek-chat | 64k | $0.27/M tokens | $0.07/M tokens | $1.10/M tokens |
| deepseek-reasoner | 64k | $0.55/M tokens | $0.14/M tokens | $2.19/M tokens |

As seen in the table, OpenAI’s o3-mini is roughly twice as expensive as DeepSeek-R1 in terms of API costs. It charges $1.10 per million input tokens and $4.40 per million output tokens, whereas DeepSeek-R1 (deepseek-reasoner) charges $0.55 per million input tokens and $2.19 per million output tokens, making it the more budget-friendly option for large-scale applications.
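For back-of-the-envelope budgeting, the table translates into a simple per-request estimate. The prices are copied from the table above; the 10k-input/50k-output workload is hypothetical:

```python
# Per-million-token prices (USD) from the table above.
PRICES = {
    "o3-mini": {"input": 1.10, "output": 4.40},
    "deepseek-reasoner": {"input": 0.55, "output": 2.19},
}

def estimate_cost(model, input_tokens, output_tokens):
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Hypothetical workload: 10k input tokens, 50k output tokens.
cost_o3 = estimate_cost("o3-mini", 10_000, 50_000)
cost_r1 = estimate_cost("deepseek-reasoner", 10_000, 50_000)
```

At roughly 2x on both input and output rates, deepseek-reasoner comes out at about half the o3-mini bill for the same token volume.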

Sources: DeepSeek-R1 | o3-mini

How to Access DeepSeek-R1 and o3-mini via API

Before we step into the hands-on performance comparison, let’s learn how to access DeepSeek-R1 and o3-mini using APIs.

All you have to do is import the necessary libraries and load the API keys:

from openai import OpenAI
from IPython.display import display, Markdown
import tiktoken  # used below for token counting
import time

with open("path_of_api_key") as file:
    openai_api_key = file.read().strip()

with open("path_of_api_key") as file:
    deepseek_api = file.read().strip()

DeepSeek-R1 vs o3-mini: Logical Reasoning Comparison

Now that we’ve gotten the API access, let’s compare DeepSeek-R1 and o3-mini based on their logical reasoning capabilities. For this, we will give the same prompt to both the models and evaluate their responses based on these metrics:

  1. Time taken by the model to generate the response,
  2. Quality of the generated response, and
  3. Cost incurred to generate the response.

We will then score the models 0 or 1 for each task, depending on their performance. So let’s try out the tasks and see who emerges as the winner in the DeepSeek-R1 vs o3-mini reasoning battle!
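The three metrics can be wrapped in a small harness. The sketch below is illustrative rather than the article's exact code: the stubbed call runs without an API key, and a whitespace word count stands in for tiktoken:

```python
import time

def evaluate(call_model, count_tokens, prices):
    # prices: per-million-token "input"/"output" rates in USD
    start = time.time()
    prompt, reply = call_model()
    elapsed = time.time() - start
    cost = (count_tokens(prompt) * prices["input"]
            + count_tokens(reply) * prices["output"]) / 1_000_000
    return {"seconds": elapsed, "reply": reply, "cost_usd": cost}

# Stubbed call so the harness can be tried without network access.
result = evaluate(
    call_model=lambda: ("solve this puzzle", "the answer is A"),
    count_tokens=lambda s: len(s.split()),   # crude stand-in for tiktoken
    prices={"input": 0.55, "output": 2.19},  # deepseek-reasoner rates
)
```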

Task 1: Building a Tetris Game

This task requires the model to implement a fully functional Tetris game using Python, efficiently managing game logic, piece movement, collision detection, and rendering without relying on external game engines.
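Collision detection, one of the subsystems the task asks for, reduces to checking every cell of a piece against the board bounds and the occupied cells. A minimal sketch using a simplified grid representation (not taken from either model's output):

```python
def collides(board, piece, row, col):
    """True if placing `piece` (a list of (row, col) offsets) with its
    origin at (row, col) would leave the board or overlap a filled cell."""
    rows, cols = len(board), len(board[0])
    for dr, dc in piece:
        r, c = row + dr, col + dc
        if not (0 <= r < rows and 0 <= c < cols) or board[r][c]:
            return True
    return False

board = [[0] * 4 for _ in range(4)]
board[3] = [1, 1, 1, 1]                     # filled bottom row
square = [(0, 0), (0, 1), (1, 0), (1, 1)]   # 2x2 "O" piece

free = collides(board, square, 0, 0)     # top-left corner is empty
blocked = collides(board, square, 2, 0)  # rows 2-3 overlap the filled row
```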

Prompt: “Write a python code for this problem: generate a Python code for the Tetris game“

Input to DeepSeek-R1 API

INPUT_COST_CACHE_HIT = 0.14 / 1_000_000  # $0.14 per 1M tokens
INPUT_COST_CACHE_MISS = 0.55 / 1_000_000  # $0.55 per 1M tokens
OUTPUT_COST = 2.19 / 1_000_000  # $2.19 per 1M tokens

# Start timing
task1_start_time = time.time()

# Initialize OpenAI client for DeepSeek API
client = OpenAI(api_key=deepseek_api, base_url="https://api.deepseek.com")

messages = [
    {
        "role": "system",
        "content": """You are a professional Programmer with a large experience."""
    },
    {
        "role": "user",
        "content": """write a python code for this problem: generate a python code for Tetris game."""
    }
]

# Get token count using tiktoken (adjust model name if necessary)
encoding = tiktoken.get_encoding("cl100k_base")  # Use a compatible tokenizer
input_tokens = sum(len(encoding.encode(msg["content"])) for msg in messages)

# Call DeepSeek API
response = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=messages,
    stream=False
)

# Get output token count
output_tokens = len(encoding.encode(response.choices[0].message.content))

task1_end_time = time.time()

total_time_taken = task1_end_time - task1_start_time

# Assume cache miss for worst-case pricing (adjust if cache info is available)
# Note: the constants above are already per-token rates, so we
# multiply by the token counts directly.
input_cost = input_tokens * INPUT_COST_CACHE_MISS
output_cost = output_tokens * OUTPUT_COST

total_cost = input_cost + output_cost

# Print results
print("Response:", response.choices[0].message.content)
print("------------------ Total Time Taken for Task 1: ------------------", total_time_taken)
print(f"Input Tokens: {input_tokens}, Output Tokens: {output_tokens}")
print(f"Estimated Cost: ${total_cost:.6f}")

# Display result
display(Markdown(response.choices[0].message.content))

Response by DeepSeek-R1

DeepSeek-R1 task 1 output

You can find DeepSeek-R1’s complete response here.

Output token cost:

Input Tokens: 28 | Output Tokens: 3323 | Estimated Cost: $0.0073

Code Output

Input to o3-mini API

task1_start_time = time.time()

# Initialize OpenAI client for o3-mini
client = OpenAI(api_key=openai_api_key)

messages = [
    {
        "role": "system",
        "content": """You are a professional Programmer with a large experience ."""
    },
    {
        "role": "user",
        "content": """write a python code for this problem: generate a python code for Tetris game.
"""
    }
]

# Use a compatible encoding (cl100k_base is the best option for new OpenAI models)
encoding = tiktoken.get_encoding("cl100k_base")

# Calculate input token count
input_tokens = sum(len(encoding.encode(msg["content"])) for msg in messages)

completion = client.chat.completions.create(
    model="o3-mini-2025-01-31",
    messages=messages
)

output_tokens = len(encoding.encode(completion.choices[0].message.content))

task1_end_time = time.time()

input_cost_per_1k = 0.0011   # $1.10 per 1M tokens = $0.0011 per 1K input tokens
output_cost_per_1k = 0.0044  # $4.40 per 1M tokens = $0.0044 per 1K output tokens

# Calculate cost
input_cost = (input_tokens / 1000) * input_cost_per_1k
output_cost = (output_tokens / 1000) * output_cost_per_1k
total_cost = input_cost + output_cost

# Print results
print(completion.choices[0].message.content)
print("------------------ Total Time Taken for Task 1: ------------------", task1_end_time - task1_start_time)
print(f"Input Tokens: {input_tokens}, Output Tokens: {output_tokens}")
print(f"Estimated Cost: ${total_cost:.6f}")

# Display result
display(Markdown(completion.choices[0].message.content))

Response by o3-mini

o3-mini task 1 output

You can find o3-mini’s complete response here.

Output token cost: 

Input Tokens: 28 | Output Tokens: 3235 | Estimated Cost: $0.014265

Code Output

Comparative Analysis

In this task, the models were required to generate functional Tetris code that allows for actual gameplay. DeepSeek-R1 successfully produced a fully working implementation, as demonstrated in the code output video. In contrast, while o3-mini’s code appeared well-structured, it encountered errors during execution. As a result, DeepSeek-R1 outperforms o3-mini in this scenario, delivering a more reliable and playable solution.

Score: DeepSeek-R1: 1 | o3-mini: 0

Task 2: Analyzing Relational Inequalities

This task requires the model to efficiently analyze relational inequalities rather than relying on basic sorting methods.

Prompt: “In the following question, assuming the given statements to be true, find which of the given conclusions is/are definitely true and then give your answer accordingly.

Statements: H > F ≤ O ≤ L; F ≥ V < D

Conclusions:
I. L ≥ V
II. O > D

The options are:
A. Only I is true
B. Only II is true
C. Both I and II are true
D. Either I or II is true
E. Neither I nor II is true”

Input to DeepSeek-R1 API

INPUT_COST_CACHE_HIT = 0.14 / 1_000_000  # $0.14 per 1M tokens
INPUT_COST_CACHE_MISS = 0.55 / 1_000_000  # $0.55 per 1M tokens
OUTPUT_COST = 2.19 / 1_000_000  # $2.19 per 1M tokens

# Start timing
task2_start_time = time.time()

# Initialize OpenAI client for DeepSeek API
client = OpenAI(api_key=deepseek_api, base_url="https://api.deepseek.com")

messages = [
    {"role": "system", "content": "You are an expert in solving Reasoning Problems. Please solve the given problem."},
    {"role": "user", "content": """ In the following question, assuming the given statements to be true, find which of the conclusions among given conclusions is/are definitely true and then give your answers accordingly.
        Statements: H > F ≤ O ≤ L; F ≥ V < D
        Conclusions:
        I. L ≥ V
        II. O > D
        The options are:
        A. Only I is true 
        B. Only II is true
        C. Both I and II are true
        D. Either I or II is true
        E. Neither I nor II is true
    """}
]

# Get token count using tiktoken (adjust model name if necessary)
encoding = tiktoken.get_encoding("cl100k_base")  # Use a compatible tokenizer
input_tokens = sum(len(encoding.encode(msg["content"])) for msg in messages)

# Call DeepSeek API
response = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=messages,
    stream=False
)

# Get output token count
output_tokens = len(encoding.encode(response.choices[0].message.content))

task2_end_time = time.time()

total_time_taken = task2_end_time - task2_start_time

# Assume cache miss for worst-case pricing (adjust if cache info is available)
# Note: the constants above are already per-token rates, so we
# multiply by the token counts directly.
input_cost = input_tokens * INPUT_COST_CACHE_MISS
output_cost = output_tokens * OUTPUT_COST

total_cost = input_cost + output_cost

# Print results
print("Response:", response.choices[0].message.content)
print("------------------ Total Time Taken for Task 2: ------------------", total_time_taken)
print(f"Input Tokens: {input_tokens}, Output Tokens: {output_tokens}")
print(f"Estimated Cost: ${total_cost:.6f}")

# Display result
display(Markdown(response.choices[0].message.content))

Output token cost:

Input Tokens: 136 | Output Tokens: 352 | Estimated Cost: $0.000004

Response by DeepSeek-R1

deepseek-r1 task 2 output

Input to o3-mini API

task2_start_time = time.time()

client = OpenAI(api_key=openai_api_key)

messages = [
    {
        "role": "system",
        "content": """You are an expert in solving Reasoning Problems. Please solve the given problem"""
    },
    {
        "role": "user",
        "content": """In the following question, assuming the given statements to be true, find which of the conclusions among given conclusions is/are definitely true and then give your answers accordingly.
        Statements: H > F ≤ O ≤ L; F ≥ V < D
        Conclusions:
        I. L ≥ V
        II. O > D
        The options are:
        A. Only I is true 
        B. Only II is true
        C. Both I and II are true
        D. Either I or II is true
        E. Neither I nor II is true
        """
    }
]

# Use a compatible encoding (cl100k_base is the best option for new OpenAI models)
encoding = tiktoken.get_encoding("cl100k_base")

# Calculate token counts
input_tokens = sum(len(encoding.encode(msg["content"])) for msg in messages)

completion = client.chat.completions.create(
    model="o3-mini-2025-01-31",
    messages=messages
)

output_tokens = len(encoding.encode(completion.choices[0].message.content))

task2_end_time = time.time()


input_cost_per_1k = 0.0011   # $1.10 per 1M tokens = $0.0011 per 1K input tokens
output_cost_per_1k = 0.0044  # $4.40 per 1M tokens = $0.0044 per 1K output tokens

# Calculate cost
input_cost = (input_tokens / 1000) * input_cost_per_1k
output_cost = (output_tokens / 1000) * output_cost_per_1k
total_cost = input_cost + output_cost


# Print results
print(completion.choices[0].message.content)
print("------------------ Total Time Taken for Task 2: ------------------", task2_end_time - task2_start_time)
print(f"Input Tokens: {input_tokens}, Output Tokens: {output_tokens}")
print(f"Estimated Cost: ${total_cost:.6f}")

# Display result
display(Markdown(completion.choices[0].message.content))

Output token cost:

Input Tokens: 135 | Output Tokens: 423 | Estimated Cost: $0.002010

Response by o3-mini

o3-mini task 2 output

Comparative Analysis

o3-mini delivers the most efficient solution, providing a concise yet accurate response in significantly less time. It maintains clarity while ensuring logical soundness, making it ideal for quick reasoning tasks. DeepSeek-R1, while equally correct, is much slower and more verbose. Its detailed breakdown of logical relationships enhances explainability but may feel excessive for straightforward evaluations. Though both models arrive at the same conclusion, o3-mini’s speed and direct approach make it the better choice for practical use.
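The models' shared conclusion can also be double-checked mechanically: enumerate small value assignments, keep those satisfying the statements, and test whether each conclusion holds in every one of them. A brute-force sketch:

```python
from itertools import product

def satisfies(h, f, o, l, v, d):
    # Statements: H > F <= O <= L and F >= V < D
    return h > f <= o <= l and f >= v < d

assignments = [a for a in product(range(4), repeat=6) if satisfies(*a)]

conclusion_I = all(l >= v for h, f, o, l, v, d in assignments)   # L >= V
conclusion_II = all(o > d for h, f, o, l, v, d in assignments)   # O > D
```

Conclusion I always holds (L ≥ O ≥ F ≥ V by transitivity), while conclusion II fails on counterexamples, so only I is definitely true (option A).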

Score: DeepSeek-R1: 0 | o3-mini: 1

Task 3: Logical Reasoning in Math

This task challenges the model to recognize numerical patterns, which may involve arithmetic operations, multiplication, or a combination of mathematical rules. Instead of brute-force searching, the model must adopt a structured approach to deduce the hidden logic efficiently.

Prompt: “Study the given matrix carefully and select the number from among the given options that can replace the question mark (?) in it.

| 7  | 13 | 174 |
| 9  | 25 | 104 |
| 11 | 30 | ?   |

The options are:
A. 335
B. 129
C. 431
D. 100

Please mention the approach that you have taken at each step.“

Input to DeepSeek-R1 API

INPUT_COST_CACHE_HIT = 0.14 / 1_000_000  # $0.14 per 1M tokens
INPUT_COST_CACHE_MISS = 0.55 / 1_000_000  # $0.55 per 1M tokens
OUTPUT_COST = 2.19 / 1_000_000  # $2.19 per 1M tokens

# Start timing
task3_start_time = time.time()

# Initialize OpenAI client for DeepSeek API
client = OpenAI(api_key=deepseek_api, base_url="https://api.deepseek.com")

messages = [
    {
        "role": "system",
        "content": """You are a Expert in solving Reasoning Problems. Please solve the given problem"""
    },
    {
        "role": "user",
        "content": """ 
Study the given matrix carefully and select the number from among the given options that can replace the question mark (?) in it.
    __________________
	|  7  | 13	| 174| 
	|  9  | 25	| 104|
	|  11 | 30	| ?  |
    |_____|_____|____|
    The options are: 
   A 335
   B 129
   C 431 
   D 100
   Please mention your approch that you have taken at each step
 """

    }
]
# Get token count using tiktoken (adjust model name if necessary)
encoding = tiktoken.get_encoding("cl100k_base")  # Use a compatible tokenizer
input_tokens = sum(len(encoding.encode(msg["content"])) for msg in messages)

# Call DeepSeek API
response = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=messages,
    stream=False
)

# Get output token count
output_tokens = len(encoding.encode(response.choices[0].message.content))

task3_end_time = time.time()

total_time_taken = task3_end_time - task3_start_time

# Assume cache miss for worst-case pricing (adjust if cache info is available)
# Note: the constants above are already per-token rates, so we
# multiply by the token counts directly.
input_cost = input_tokens * INPUT_COST_CACHE_MISS
output_cost = output_tokens * OUTPUT_COST

total_cost = input_cost + output_cost

# Print results
print("Response:", response.choices[0].message.content)
print("------------------ Total Time Taken for Task 3: ------------------", total_time_taken)
print(f"Input Tokens: {input_tokens}, Output Tokens: {output_tokens}")
print(f"Estimated Cost: ${total_cost:.6f}")

# Display result
display(Markdown(response.choices[0].message.content))

Output token cost:

Input Tokens: 134 | Output Tokens: 274 | Estimated Cost: $0.000003

Response by DeepSeek-R1

deepseek r1 task 3 output

Input to o3-mini API

task3_start_time = time.time()

client = OpenAI(api_key=openai_api_key)

messages = [
    {
        "role": "system",
        "content": """You are a Expert in solving Reasoning Problems. Please solve the given problem"""
    },
    {
        "role": "user",
        "content": """ 
Study the given matrix carefully and select the number from among the given options that can replace the question mark (?) in it.
    __________________
	|  7  | 13	| 174| 
	|  9  | 25	| 104|
	|  11 | 30	| ?  |
    |_____|_____|____|
    The options are: 
   A 335
   B 129
   C 431 
   D 100
   Please mention your approch that you have taken at each step
 """

    }
]

# Use a compatible encoding (cl100k_base is the best option for new OpenAI models)
encoding = tiktoken.get_encoding("cl100k_base")

# Calculate token counts
input_tokens = sum(len(encoding.encode(msg["content"])) for msg in messages)

completion = client.chat.completions.create(
    model="o3-mini-2025-01-31",
    messages=messages
)

output_tokens = len(encoding.encode(completion.choices[0].message.content))

task3_end_time = time.time()


input_cost_per_1k = 0.0011   # $1.10 per 1M tokens = $0.0011 per 1K input tokens
output_cost_per_1k = 0.0044  # $4.40 per 1M tokens = $0.0044 per 1K output tokens

# Calculate cost
input_cost = (input_tokens / 1000) * input_cost_per_1k
output_cost = (output_tokens / 1000) * output_cost_per_1k
total_cost = input_cost + output_cost

# Print results
print(completion.choices[0].message.content)
print("------------------ Total Time Taken for Task 3: ------------------", task3_end_time - task3_start_time)
print(f"Input Tokens: {input_tokens}, Output Tokens: {output_tokens}")
print(f"Estimated Cost: ${total_cost:.6f}")

# Display result
display(Markdown(completion.choices[0].message.content))

Output token cost:

Input Tokens: 134 | Output Tokens: 736 | Estimated Cost: $0.003386

Output by o3-mini

o3-mini vs DeepSeek-R1 API logical reasoning
logical reasoning task 3 output

Comparative Analysis

Here, the pattern followed in each row is:

(1st number)^3−(2nd number)^2 = 3rd number

Applying this pattern:

  • Row 1: 7^3 – 13^2 = 343 – 169 = 174
  • Row 2: 9^3 – 25^2 = 729 – 625 = 104
  • Row 3: 11^3 – 30^2 = 1331 – 900 = 431

Hence, the correct answer is 431.
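The row pattern is easy to verify programmatically:

```python
rows = [(7, 13, 174), (9, 25, 104)]  # the two complete rows of the matrix

def third(a, b):
    # Pattern: (1st number)^3 - (2nd number)^2
    return a ** 3 - b ** 2

checks = [third(a, b) == c for a, b, c in rows]  # both known rows match
missing = third(11, 30)                          # 1331 - 900 = 431
```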

DeepSeek-R1 correctly identifies and applies this pattern, leading to the right answer. Its structured approach ensures accuracy, though it takes significantly longer to compute the result. o3-mini, on the other hand, fails to establish a consistent pattern. It attempts multiple operations, such as multiplication, addition, and exponentiation, but does not arrive at a definitive answer. This results in an unclear and incorrect response. Overall, DeepSeek-R1 outperforms o3-mini in logical reasoning and accuracy, while o3-mini struggles due to its inconsistent and ineffective approach.

Score: DeepSeek-R1: 1 | o3-mini: 0

Final Score: DeepSeek-R1: 2 | o3-mini: 1

Logical Reasoning Comparison Summary

| Task No. | Task Type | Model | Performance | Time Taken (seconds) | Cost |
|---|---|---|---|---|---|
| 1 | Code Generation | DeepSeek-R1 | ✅ Working Code | 606.45 | $0.0073 |
| 1 | Code Generation | o3-mini | ❌ Non-working Code | 99.73 | $0.014265 |
| 2 | Relational Reasoning | DeepSeek-R1 | ✅ Correct | 74.28 | $0.000004 |
| 2 | Relational Reasoning | o3-mini | ✅ Correct | 8.08 | $0.002010 |
| 3 | Mathematical Reasoning | DeepSeek-R1 | ✅ Correct | 450.53 | $0.000003 |
| 3 | Mathematical Reasoning | o3-mini | ❌ Wrong Answer | 12.37 | $0.003386 |

Conclusion

As we have seen in this comparison, both DeepSeek-R1 and o3-mini demonstrate unique strengths catering to different needs. DeepSeek-R1 excels in accuracy-driven tasks, particularly in mathematical reasoning and complex code generation, making it a strong candidate for applications requiring logical depth and correctness. However, one significant drawback is its slower response times, partly due to ongoing server maintenance issues that have affected its accessibility. On the other hand, o3-mini offers significantly faster response times, but its tendency to produce incorrect results limits its reliability for high-stakes reasoning tasks.

This analysis underscores the trade-offs between speed and accuracy in language models. While o3-mini may be useful for rapid, low-risk applications, DeepSeek-R1 stands out as the superior choice for reasoning-intensive tasks, provided its latency issues are addressed. As AI models continue to evolve, striking a balance between performance efficiency and correctness will be key to optimizing AI-driven workflows across various domains.

Also Read: Can OpenAI’s o3-mini Beat Claude Sonnet 3.5 in Coding?

Frequently Asked Questions

Q1. What are the key differences between DeepSeek-R1 and o3-mini?

A. DeepSeek-R1 excels in mathematical reasoning and complex code generation, making it ideal for applications that require logical depth and accuracy. o3-mini, on the other hand, is significantly faster but often sacrifices accuracy, leading to occasional incorrect outputs.

Q2. Is DeepSeek-R1 better than o3-mini for coding tasks?

A. DeepSeek-R1 is the better choice for coding and reasoning-intensive tasks due to its superior accuracy and ability to handle complex logic. While o3-mini provides quicker responses, it may generate errors, making it less reliable for high-stakes programming tasks.

Q3. Is o3-mini suitable for real-world applications?

A. o3-mini is best suited for low-risk, speed-dependent applications, such as chatbots, casual text generation, and interactive AI experiences. However, for tasks requiring high accuracy, DeepSeek-R1 is the preferred option.

Q4. Which model is better for reasoning and problem-solving – DeepSeek-R1 or o3-mini?

A. DeepSeek-R1 has superior logical reasoning and problem-solving capabilities, making it a strong choice for mathematical computations, programming assistance, and scientific queries. o3-mini provides quick but sometimes inconsistent responses in complex problem-solving scenarios.

Hello! I'm Vipin, a passionate data science and machine learning enthusiast with a strong foundation in data analysis, machine learning algorithms, and programming. I have hands-on experience in building models, managing messy data, and solving real-world problems. My goal is to apply data-driven insights to create practical solutions that drive results. I'm eager to contribute my skills in a collaborative environment while continuing to learn and grow in the fields of Data Science, Machine Learning, and NLP.
