DeepSeek-R1’s advanced reasoning capabilities have made it the new leader in the generative LLM field. It has caused a stir in the AI industry, with reports of Nvidia losing roughly $600 billion in market value after its launch. But what made DeepSeek-R1 famous overnight? In this article, we’ll explore why DeepSeek-R1 is gaining so much attention, delve into its groundbreaking capabilities, and analyze how its reasoning powers are reshaping real-world applications. Stay tuned as we break down the model’s performance through a detailed, structured analysis.
This article was published as a part of the Data Science Blogathon.
In simple words, DeepSeek-R1 is a cutting-edge language model series developed by DeepSeek, a company established in 2023 by Liang Wenfeng. It achieves advanced reasoning capabilities in LLMs through reinforcement learning (RL). There are two variants:
- DeepSeek-R1-Zero: trained purely via RL on the base model, without supervised fine-tuning (SFT). It autonomously develops advanced reasoning behaviors such as self-verification and multi-step reflection, achieving 71% accuracy on the AIME 2024 benchmark.
- DeepSeek-R1: enhanced with cold-start data and multi-stage training (RL + SFT). It addresses the readability issues of R1-Zero and outperforms OpenAI’s o1 on tasks such as MATH-500 (97.3% accuracy) and coding challenges (Codeforces rating 2029).
DeepSeek uses Group Relative Policy Optimization (GRPO), an RL technique that drops the critic model and thereby reduces RL training costs. GRPO optimizes the policy by sampling a group of outputs per question and normalizing their rewards within the group, eliminating the need for a separate critic.
The project also distills R1’s reasoning patterns into smaller models (1.5B to 70B parameters), enabling efficient deployment. According to the benchmarks, its 7B distilled model surpasses GPT-4o.
DeepSeek-R1 Paper here.
| Model | AIME 2024 (pass@1) | AIME 2024 (cons@64) | MATH-500 (pass@1) | GPQA Diamond (pass@1) | LiveCodeBench (pass@1) | CodeForces (Rating) |
|---|---|---|---|---|---|---|
| OpenAI-o1-mini | 63.6 | 80.0 | 90.0 | 60.0 | 53.8 | 1820 |
| OpenAI-o1-0912 | 74.4 | 83.3 | 94.8 | 77.3 | 63.4 | 1843 |
| DeepSeek-R1-Zero | 71.0 | 86.7 | 95.9 | 73.3 | 50.0 | 1444 |
Accuracy Plot of Deepseek-R1-Zero on AIME Dataset
DeepSeek’s open-sourced models, training pipeline, and benchmarks aim to democratize RL-driven reasoning research, offering scalable solutions for STEM, coding, and knowledge-intensive tasks. DeepSeek-R1 points the way to a new era of low-cost, high-throughput SLMs and LLMs.
Before going into the cutting-edge GRPO, let’s surf through some basics of Reinforcement Learning (RL).
Reinforcement Learning is about the interaction between an Agent and an Environment. During training, the agent takes actions so that it maximizes the cumulative reward. Think of a bot playing chess, or a robot on a factory floor trying to complete tasks with actual items.
The agent learns by doing: it receives a positive reward when it does things right and a negative reward otherwise. Through these repeated trials, it gradually finds the optimal strategy for adapting to an unknown environment.
Here is a simple diagram of Reinforcement Learning with its three components: the policy, the value function, and the experience gathered from the environment.
The experience gathered is used to update the policy through optimization. The value function provides insights that refine the policy. The policy guides the agent, which interacts with the environment to collect new experiences, and the cycle goes on until the agent learns the optimal strategy or adapts well enough to the environment. A minimal sketch of this loop is shown below.
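To make the cycle concrete, here is a minimal, hypothetical sketch of the agent-environment loop in Python. The toy environment, policy, and update rule are placeholders for illustration and have nothing to do with DeepSeek’s actual training setup:

```
import random

# Toy, illustrative environment and policy -- placeholders only.
class ToyEnvironment:
    def reset(self):
        return 0  # initial state

    def step(self, state, action):
        reward = 1.0 if action == 1 else -1.0  # reward "good" actions, penalize others
        next_state = state + 1
        done = next_state >= 10                # episode ends after 10 steps
        return next_state, reward, done

def policy(state, p_good):
    # Stochastic policy: take the rewarded action with probability p_good.
    return 1 if random.random() < p_good else 0

env, p_good = ToyEnvironment(), 0.5  # p_good is the "policy parameter" we improve

for episode in range(200):
    state, done, rewards = env.reset(), False, []
    while not done:                            # agent-environment interaction
        action = policy(state, p_good)
        state, reward, done = env.step(state, action)
        rewards.append(reward)                 # gathered experience
    # Crude policy update: nudge the policy toward behavior that earned reward.
    p_good = min(1.0, max(0.0, p_good + 0.01 * (sum(rewards) / len(rewards))))

print(f"Probability of taking the rewarded action after training: {p_good:.2f}")
```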
For the training of DeepSeek-R1-Zero, the authors use Group Relative Policy Optimization (GRPO), which eliminates the critic model and lowers the training cost.
Based on my understanding of the DeepSeek-R1 research paper, here is the schematic training process of the DeepSeek-R1-Zero and DeepSeek-R1 models.
Tentative DeepSeek-R1-Zero and R1 Training Diagram
For each question q, GRPO samples a group of outputs {o1, o2, ..., oG} from the old policy and optimizes the policy model by maximizing the objective below:
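The original article shows this objective as an image; reconstructed from the paper as I read it, it has the following form:

$$
\mathcal{J}_{GRPO}(\theta) = \mathbb{E}\left[q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O \mid q)\right]
\frac{1}{G}\sum_{i=1}^{G}\left(
\min\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)}A_i,\
\operatorname{clip}\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)},\ 1-\epsilon,\ 1+\epsilon\right)A_i\right)
- \beta\, \mathbb{D}_{KL}\left(\pi_\theta \,\|\, \pi_{ref}\right)\right)
$$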
Here, epsilon and beta are hyper-parameters, and A_i is the advantage, computed using the group of rewards {r1, r2, ..., rG} corresponding to the outputs within each group.
For the advantage calculation, rewards are normalized within each group of outputs: r_i is the reward for output i, and it is centered and scaled by the mean and standard deviation of all rewards in the group. A small sketch of this normalization follows.
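Here is a minimal NumPy sketch of that group-relative advantage calculation, purely as an illustration of the idea rather than DeepSeek’s training code:

```
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its group's mean and standard deviation."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Rewards for a group of G = 4 sampled outputs for one question
print(group_relative_advantages([1.0, 0.0, 0.5, 1.0]))
# Outputs scoring above the group mean get positive advantages, the rest negative.
```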
The objective thus maximizes clipped policy updates while applying a KL penalty that keeps the new policy close to the reference policy.
KL Divergence, also known as Relative Entropy, is a statistical distance function that measures the difference between the model’s probability distribution (Q) and the true probability distribution (P).
For more KL-Divergence
The below equation is the mathematical form of KL-Divergence:
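The original article shows the equation as an image; in its standard discrete form it is:

$$
D_{KL}(P \,\|\, Q) = \sum_{x} P(x)\,\log\frac{P(x)}{Q(x)}
$$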
Relative entropy, or KL distance, is always a non-negative real number. It takes its lowest value of 0 if and only if Q and P are identical, meaning the model probability distribution (Q) perfectly overlaps the true probability distribution (P).
Here is a simple example to showcase KL divergence. We will use the entropy function from SciPy’s stats package, which calculates the relative entropy between two distributions.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import entropy
# Define two probability distributions P and Q
x = np.linspace(-3, 3, 100)
P = np.exp(-(x**2)) # Gaussian-like distribution
Q = np.exp(-((x - 1) ** 2)) # Shifted Gaussian
# Normalize to ensure they sum to 1
P /= P.sum()
Q /= Q.sum()
# Compute KL divergence
kl_div = entropy(P, Q)
Here, P is a Gaussian-like distribution and Q is a shifted Gaussian.
plt.style.use("ggplot")
plt.figure(figsize=(12, 8))
plt.plot(x, P, label="P (Original)", linestyle="dashed", color="blue")
plt.plot(x, Q, label="Q (Shifted)", linestyle="solid", color="red")
plt.fill_between(x, P, Q, color="yellow", alpha=0.3, label="Difference")
plt.title(f"KL Divergence: {kl_div:.4f}")
plt.xlabel("x")
plt.ylabel("Probability Density")
plt.legend()
plt.show()
The yellow region highlights where P and Q differ, which is what the KL divergence quantifies.
In the GRPO objective, GRPO samples a group of outputs for each query and computes advantages relative to the group’s mean and standard deviation, which avoids training a separate critic model. The objective combines a clipped probability ratio with a KL penalty so the policy stays close to the reference policy.
The ratio is the probability of the output under the new policy divided by its probability under the old policy, and clip(ratio) bounds it between 1 - epsilon and 1 + epsilon. A small sketch of this per-output term follows.
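Here is a minimal NumPy sketch of that per-output term (clipped ratio times advantage, minus a KL penalty). It only illustrates the mechanics; the epsilon and beta values are illustrative, and this is not the actual GRPO implementation:

```
import numpy as np

def grpo_surrogate_term(logp_new, logp_old, advantage, kl_to_ref, eps=0.2, beta=0.04):
    """Per-output term: min(ratio * A, clip(ratio) * A) minus a KL penalty.

    logp_new / logp_old: log-probability of the output under the new / old policy.
    advantage: group-normalized advantage for this output.
    kl_to_ref: estimated KL divergence of the new policy from the reference policy.
    eps, beta: clipping range and KL weight (illustrative values, not from the paper).
    """
    ratio = np.exp(logp_new - logp_old)
    clipped_ratio = np.clip(ratio, 1 - eps, 1 + eps)
    return min(ratio * advantage, clipped_ratio * advantage) - beta * kl_to_ref

# An output that became slightly more likely under the new policy, with a positive advantage
print(grpo_surrogate_term(logp_new=-1.0, logp_old=-1.2, advantage=0.8, kl_to_ref=0.05))
```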
The conversation process between User and Assistant
The user asks a question, and the model or assistant solves it by first thinking about the reasoning process and then responding to the user.
The reasoning process and the answer are enclosed in tags, as shown below:
<think> reasoning process</think>
<answer> answer here </answer>
USER: Prompt
Assistant: Answer
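For illustration, here is how such a prompt could be assembled in Python. The wording below is a simplified paraphrase of the template described in the paper, not the exact training template:

```
# Simplified, illustrative version of an R1-Zero-style prompt template.
TEMPLATE = (
    "A conversation between User and Assistant. The user asks a question, and the "
    "Assistant solves it. The Assistant first thinks about the reasoning process and "
    "then provides the answer. The reasoning process and answer are enclosed within "
    "<think> </think> and <answer> </answer> tags, respectively.\n"
    "User: {prompt}\n"
    "Assistant:"
)

print(TEMPLATE.format(prompt="Solve 3x + 5 = 20 for x."))
```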
The self-evolution process of DeepSeek-R1-Zero demonstrates how Reinforcement Learning can improve the model’s reasoning capabilities autonomously. The chart shows how the model’s ability to handle complex reasoning tasks evolves over training.
DeepSeek-R1 answers two significant questions that arise from the promising results of the Zero model.
DeepSeek-R1 uses cold-start data: the developers collect thousands of cold-start examples and use them to fine-tune DeepSeek-V3-Base as the starting point for RL.
This data has two important advantages compared to DeepSeek-R1-Zero.
According to the DeepSeek-R1 paper, the developers set the maximum generation length to 32,768 tokens for the models. They found that greedy decoding of long reasoning outputs leads to higher repetition rates and significant run-to-run variability. Therefore, they use pass@k evaluation with a sampling temperature of 0.6 and a top-p value of 0.95, generating k responses for each question.
Pass@1 is then calculated as:
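The original article shows the formula as an image; from the paper’s description, it is simply the average correctness over the k samples:

$$
\text{pass@1} = \frac{1}{k}\sum_{i=1}^{k} p_i
$$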
Here, p_i denotes the correctness of the i-th response. According to the research paper, this method ensures more reliable performance estimates.
We can see that on education-oriented knowledge benchmarks such as MMLU, MMLU-Pro, and GPQA Diamond, DeepSeek-R1 performs better than DeepSeek-V3, with the improvement coming primarily from higher accuracy on STEM-related questions. DeepSeek-R1 also delivers great results on IF-Eval, a benchmark designed to assess the model’s ability to follow format instructions.
That’s enough math and theory, which I hope significantly boosts your overall knowledge of Reinforcement Learning and its cutting-edge application in the development of the DeepSeek-R1 model. Now we will get our hands on DeepSeek-R1 using Ollama and taste the newly minted LLM.
The evaluation of DeepSeek-R1-7B focuses on its enhanced reasoning capabilities, particularly its performance in complex problem-solving scenarios. By analyzing key benchmarks, this assessment provides insights into how effectively the model handles intricate reasoning tasks compared to its predecessors.
$ollama run deepseek-r1:7b
Now I give the model a linear inequality question from the NCERT textbook,
and the response is:
Which is accurate according to the book.
Amazing!!
Now we will set up a testing environment using LlamaIndex, which is a more convenient way to do this.
# create conda env
$conda create --name dstest python=3.12
# Activate conda env
conda activate dstest
# create a folder
md dsreason
# switch to dir
cd dsreason
Now we install the necessary packages
$pip install llama-index llama-index-llms-ollama jupyterlab
Now open VS Code and create a Jupyter Notebook named prompt_analysis.ipynb in the root of the project folder.
from llama_index.llms.ollama import Ollama
from IPython.display import display, Markdown
llm = Ollama(model="deepseek-r1:7b", request_timeout=120.0, context_window=4000)
Make sure deepseek-r1:7b is still running via Ollama in your terminal (ollama run deepseek-r1:7b).
Now, let’s start with a mathematical problem.
Important: The outputs will be very long, so the outputs in this blog are abridged. For the full outputs, see the blog’s code repository here.
This section explores complex problem-solving tasks that require a deep understanding of various reasoning techniques, from mathematical calculations to ethical dilemmas. By engaging with these scenarios, you will enhance your ability to think critically, analyze data, and draw logical conclusions across diverse contexts.
A store offers a 20% discount on all items. After applying the discount, there’s an additional 10% off for loyalty card members. If an item originally costs $150, what is the final price for a loyalty card member? Show your step-by-step calculation and explain your reasoning.
math_prompt= """A store offers a 20% discount on all items. After applying the discount,
there's an additional 10% off for loyalty card members.
If an item originally costs $150, what is the final price
for a loyalty card member? Show your step-by-step calculation and
explain your reasoning."""
response = llm.complete(math_prompt)
display(Markdown(f"**Question:** {math_prompt}\n **Answer:** {response}"))
Output:
The key aspects of this prompt are step-by-step percentage arithmetic and a clear explanation of each calculation; a quick check of the expected answer follows.
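As a quick sanity check of the expected answer (my own verification, not the model’s output):

```
original_price = 150
after_store_discount = original_price * (1 - 0.20)  # 20% off       -> 120.0
final_price = after_store_discount * (1 - 0.10)     # extra 10% off -> 108.0
print(round(final_price, 2))  # 108.0
```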
Consider these statements: (1) All birds can fly. (2) Penguins are birds. (3) Penguins cannot fly. Identify any contradictions in these statements. If there are contradictions, explain how to resolve them using logical reasoning.
contradiction_prompt = """Consider these statements:
All birds can fly
Penguins are birds
Penguins cannot fly
Identify any contradictions in these statements.
If there are contradictions, explain how to resolve them using logical reasoning."""
contradiction_response = llm.complete(contradiction_prompt)
display(
    Markdown(
        f"**Question:** {contradiction_prompt}\n **Answer:** {contradiction_response}"
    )
)
Output:
This tests logical consistency: the model identifies the contradiction, proposes logical resolutions, and demonstrates understanding of class relationships and syllogistic reasoning.
In a forest ecosystem, a disease kills 80% of the wolf population. Describe the potential chain of effects this might have on the ecosystem over the next 5 years. Include at least three levels of cause and effect, and explain your reasoning for each step.
chain_analysis_prompt = """
In a forest ecosystem, a disease kills 80% of the wolf population.
Describe the potential chain of effects this might have on the ecosystem over the next 5 years.
Include at least three levels of cause and effect, and explain your reasoning for each step."""
chain_analysis_response = llm.complete(chain_analysis_prompt)
display(
    Markdown(
        f"**Question:** {chain_analysis_prompt}\n **Answer:** {chain_analysis_response}"
    )
)
Output:
This prompt shows that the model understands complex systems, tracks multiple causal chains, considers indirect effects, and applies domain knowledge.
Consider this sequence: 2, 6, 12, 20, 30, __. What’s the next number?
pattern_prompt = """
"Consider this sequence: 2, 6, 12, 20, 30, __
What's the next number?
Explain the pattern
Create a formula for the nth term
Verify your formula works for all given numbers"""
pattern_response = llm.complete(pattern_prompt)
display(Markdown(f"**Question:** {pattern_prompt}\n **Answer:** {pattern_response}"))
Output:
The model excels at identifying numerical patterns, generating mathematical formulas, explaining the reasoning process, and verifying the solution. A quick check of the expected pattern is shown below.
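As a quick check of the expected pattern (my own verification): the n-th term is n(n+1), so the next number should be 42.

```
# Verify the n-th term formula n*(n+1) against the given sequence
sequence = [2, 6, 12, 20, 30]
assert all(n * (n + 1) == value for n, value in enumerate(sequence, start=1))
print(6 * 7)  # next term -> 42
```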
A bag contains 3 red marbles, 4 blue marbles, and 5 green marbles. If you draw two marbles without replacement:
Show all calculations and explain your approach.
prob_prompt = """
A bag contains 3 red marbles, 4 blue marbles, and 5 green marbles.
If you draw two marbles without replacement:
What's the probability of drawing two blue marbles?
What's the probability of drawing marbles of different colors?
Show all calculations and explain your approach.
"""
prob_prompt_response = llm.complete(prob_prompt)
display(
    Markdown(f"**Question:** {prob_prompt}\n **Answer:** {prob_prompt_response}")
)
Output:
The model can calculate probabilities, handle conditional problems, and explain probabilistic reasoning. A brute-force cross-check of the expected values is shown below.
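For reference, here is a brute-force cross-check of the expected values (my own verification, not the model’s output): P(two blue) = 1/11 ≈ 0.091 and P(different colors) = 47/66 ≈ 0.712.

```
from itertools import combinations

marbles = ["R"] * 3 + ["B"] * 4 + ["G"] * 5
pairs = list(combinations(marbles, 2))  # every way to draw 2 marbles without replacement

p_two_blue = sum(pair == ("B", "B") for pair in pairs) / len(pairs)
p_different = sum(a != b for a, b in pairs) / len(pairs)
print(round(p_two_blue, 4), round(p_different, 4))  # 0.0909 (1/11) and 0.7121 (47/66)
```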
This code has logical errors that prevent it from running correctly.
```
def calculate_average(numbers):
    sum = 0
    count = 0
    for num in numbers:
        if num > 0:
            sum += num
            count += 1
    return sum / count

result = calculate_average([1, -2, 3, -4, 5])
```
debugging_prompt = """
This code has logical errors that prevent it from running correctly.
```
def calculate_average(numbers):
    sum = 0
    count = 0
    for num in numbers:
        if num > 0:
            sum += num
            count += 1
    return sum / count

result = calculate_average([1, -2, 3, -4, 5])
```
1. Identify all potential problems
2. Explain why each is a problem
3. Provide a corrected version
4. Explain why your solution is better
"""
debugging_response = llm.complete(debugging_prompt)
display(
    Markdown(f"**Question:** {debugging_prompt}\n **Answer:** {debugging_response}")
)
Output:
DeepSeek-R1 finds edge cases, understands error conditions, applies corrections, and explains the technical solution. One possible corrected version is shown below for comparison.
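For comparison, here is one possible corrected version (my own fix, not the model’s output). It avoids shadowing the built-in sum, averages all numbers rather than only the positive ones, and guards against an empty list:

```
def calculate_average(numbers):
    """Return the average of all numbers, raising a clear error on empty input."""
    if not numbers:
        raise ValueError("numbers must not be empty")
    total = 0  # avoid shadowing the built-in sum()
    for num in numbers:
        total += num  # include every number, not just the positive ones
    return total / len(numbers)

result = calculate_average([1, -2, 3, -4, 5])
print(result)  # 0.6
```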
Compare electric cars and traditional gasoline cars in terms of:
For each factor, provide specific examples and data points. Then, explain which type of car would be better for:
Justify your recommendations.
comparative_analysis_prompt = """
Compare electric cars and traditional gasoline cars in terms of:
Environmental impact
Long-term cost
Convenience
Performance
For each factor, provide specific examples and data points.
Then, explain which type of car would be better for:
a) A city dweller with a short commute
b) A traveling salesperson who drives 30,000 miles annually
Justify your recommendations.
"""
comparative_analysis_prompt_response = llm.complete(comparative_analysis_prompt)
display(
    Markdown(
        f"**Question:** {comparative_analysis_prompt}\n **Answer:** {comparative_analysis_prompt_response}"
    )
)
Output:
It is a huge response, and I loved the reasoning process. It analyzes multiple factors, considers context, makes sensible recommendations, and balances competing priorities.
A self-driving car must make a split-second decision:
What should the car do? Provide your reasoning, considering:
ethical_prompt = """
A self-driving car must make a split-second decision:
Swerve left: Hit two pedestrians
Swerve right: Hit a wall, seriously injuring the passenger
Continue straight: Hit one pedestrian
What should the car do? Provide your reasoning, considering:
Ethical frameworks used
Assumptions made
Priority hierarchy
Long-term implications
"""
ethical_prompt_response = llm.complete(ethical_prompt)
display(
    Markdown(f"**Question:** {ethical_prompt}\n **Answer:** {ethical_prompt_response}")
)
Output:
These types of problems are among the most challenging for generative AI models. This one tests ethical reasoning, multiple perspectives, moral dilemmas, and value judgments. Overall, the model handled it well. I think more ethical, domain-specific fine-tuning would produce a more profound response.
A study claims that coffee drinkers live longer than non-coffee drinkers. The study observed 1000 people aged 40-50 for 5 years.
Identify:
stat_prompt = '''
A study claims that coffee drinkers live longer than non-coffee drinkers. The study observed 1000 people aged 40-50 for 5 years.
Identify:
Potential confounding variables
Sampling biases
Alternative explanations
What additional data would strengthen or weaken the conclusion?
'''
stat_prompt_response = llm.complete(stat_prompt)
display(
    Markdown(f"**Question:** {stat_prompt}\n **Answer:** {stat_prompt_response}")
)
Output:
It understands the statistical concepts well enough, identifies research limitations, thinks critically about the data, and proposes methodological improvements.
time_series_prompt = '''
A water tank loses 10% of its water to evaporation each day. If it starts with 1000 liters:
How much water remains after 7 days?
After how many days will less than 500 liters remain?
Create a formula for the amount remaining after n days
What assumptions are you making?
'''
time_series_prompt_res = llm.complete(time_series_prompt)
display(
    Markdown(f"**Question:** {time_series_prompt}\n **Answer:** {time_series_prompt_res}")
)
Output:
DeepSeek loves mathematical problems: it handles exponential decay, provides a good mathematical model, and shows its calculations. A quick check of the expected numbers is shown below.
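A quick check of the expected answers (my own verification): the remaining water follows 1000 × 0.9^n, so about 478.3 liters remain after 7 days, and day 7 is the first day with less than 500 liters left.

```
initial = 1000.0
print(round(initial * 0.9 ** 7, 1))  # water left after 7 days -> 478.3 liters

# First day on which less than 500 liters remain
day = next(n for n in range(1, 100) if initial * 0.9 ** n < 500)
print(day)  # 7
```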
constrain_sat_prompt = '''
Schedule these 5 meetings with these constraints:
Marketing (1 hour)
Sales (30 mins)
Development (2 hours)
Client call (1 hour)
Team lunch (1 hour)
Constraints:
Working hours: 9 AM to 5 PM
Client call must be between 2-4 PM
Team lunch must be between 12-2 PM
Development team is only available in the morning
Marketing and Sales must be consecutive
Provide a valid schedule and explain your reasoning.
'''
constrain_sat_prompt_res = llm.complete(constrain_sat_prompt)
display(
    Markdown(f"**Question:** {constrain_sat_prompt}\n **Answer:** {constrain_sat_prompt_res}")
)
Output:
It can handle multiple constraints, produce an optimized schedule, and explain its problem-solving process.
cross_domain_analogical_prompt = '''
Consider these three scenarios:
A. A computer network handling packet loss
B. A city's traffic system during rush hour
C. A cell's response to protein misfolding
Create a detailed analogy that maps corresponding elements across all three scenarios.
Identify which elements don't have clear correspondences.
Explain how a solution in one domain could inspire solutions in the others.
Where does the analogy break down and why?
'''
cross_domain_analogical_prompt_res = llm.complete(cross_domain_analogical_prompt)
display(
    Markdown(f"**Question:** {cross_domain_analogical_prompt}\n **Answer:** {cross_domain_analogical_prompt_res}")
)
Output:
It did a nice job of mapping very different domains onto one another, which is impressive. This type of reasoning connects domains so that one domain’s problems can be tackled with solutions from another, and it supports research on cross-domain understanding.
There are plenty of example prompts like these that you can experiment with on your local system without spending a penny. I will use DeepSeek-R1 for more research and for learning about different areas. All you need is a laptop, your time, and a nice place to work.
All the code used in this article is available here.
DeepSeek-R1 shows promising capabilities across various reasoning tasks: structured logical analysis, step-by-step problem solving, multi-context understanding, and knowledge accumulation across different subjects. However, there is room for improvement in areas such as complex temporal reasoning, handling deep ambiguity, and generating creative solutions. Most importantly, it demonstrates how a model like DeepSeek-R1 can be developed without the burden of huge GPU training costs.
Its open-sourced models push AI toward a more democratic realm. New research will soon build on this training method, leading to more potent and powerful AI models with even better reasoning capabilities. While AGI may still be in the distant future, DeepSeek-R1’s advancements point toward a future where it emerges hand in hand with people. DeepSeek-R1 is undoubtedly a key step toward more advanced AI reasoning systems.
A. While it may not match the power of the larger 32B or 70B models, it shows comparable performance on structured reasoning tasks, particularly in mathematical and logical analysis.
A. Write step-by-step requirements, focus on clear instructions, and include explicit evaluation criteria. Multi-part questions often yield better insight than single questions.
A. We are human; we must use our own judgment to evaluate the responses. The model’s output should be assessed as part of a broader evaluation strategy that includes quantitative metrics and real-world testing. Following this principle leads to better evaluation.
Human → Prompt → AI → Response → Human → Actual Response
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.