On Day 6 of #OpenSourceWeek, DeepSeek presented an in-depth overview of the DeepSeek-V3/R1 inference system. This article digs into the system’s design principles, optimization strategies, and performance statistics, highlighting the significant advances made in throughput and latency optimization.
The primary objectives of the DeepSeek-V3/R1 inference system are higher throughput and lower latency. To meet these goals, DeepSeek implemented a sophisticated architecture built around cross-node Expert Parallelism (EP): sharding the experts across many GPUs scales up the batch each expert processes, improving the efficiency of GPU matrix computations, while spreading the experts out reduces the memory-access load on each GPU, which in turn helps latency.
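To make the core idea concrete, here is a minimal, illustrative sketch of how tokens are grouped by the GPU that hosts their routed expert. The expert count, contiguous sharding, and single-expert routing are simplifying assumptions for illustration, not DeepSeek's actual code.

```python
# Minimal, illustrative sketch of cross-node Expert Parallelism (EP):
# experts are sharded across GPUs, and each token is dispatched to the GPU
# hosting its routed expert. With a larger EP group, each GPU holds fewer
# experts, so every expert sees a bigger effective batch.
from collections import defaultdict

NUM_EXPERTS = 16    # toy value; DeepSeek-V3 uses far more routed experts
EP_GROUP_SIZE = 4   # number of GPUs the experts are sharded across

def host_gpu(expert_id: int) -> int:
    # contiguous sharding: experts 0-3 live on GPU 0, 4-7 on GPU 1, ...
    return expert_id // (NUM_EXPERTS // EP_GROUP_SIZE)

def dispatch(token_to_expert: dict[int, int]) -> dict[int, list[int]]:
    """Group token ids by the GPU hosting their routed expert
    (conceptually, the all-to-all 'dispatch' step of EP)."""
    per_gpu: dict[int, list[int]] = defaultdict(list)
    for token_id, expert_id in token_to_expert.items():
        per_gpu[host_gpu(expert_id)].append(token_id)
    return dict(per_gpu)

# 8 tokens routed by a (fake) gating network
routing = {0: 2, 1: 5, 2: 5, 3: 9, 4: 14, 5: 1, 6: 7, 7: 12}
print(dispatch(routing))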
However, the implementation of EP introduces complexities, particularly in terms of cross-node communication and the need for effective load balancing across different Data Parallelism (DP) instances.
To tackle these challenges, DeepSeek focused on three key strategies: scaling the batch size with large-scale cross-node Expert Parallelism, hiding communication latency behind computation, and balancing load evenly across GPUs and DP instances.
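As one deliberately simplified illustration of the load-balancing strategy, the sketch below routes each incoming request to the DP instance with the fewest queued tokens; per the write-up, the real system also balances KVCache usage and per-expert load, which this toy version ignores.

```python
# Simplified load-balancing sketch: send each request to the data-parallel
# (DP) instance with the fewest queued tokens.

class DPLoadBalancer:
    def __init__(self, num_instances: int):
        self.pending_tokens = [0] * num_instances

    def route(self, request_tokens: int) -> int:
        # pick the least-loaded instance by queued token count
        target = min(range(len(self.pending_tokens)),
                     key=lambda i: self.pending_tokens[i])
        self.pending_tokens[target] += request_tokens
        return target

    def complete(self, instance: int, request_tokens: int) -> None:
        # called when a request finishes, releasing its load
        self.pending_tokens[instance] -= request_tokens

lb = DPLoadBalancer(num_instances=4)
for length in [512, 128, 2048, 64, 300]:
    print(f"request of {length} tokens -> DP instance {lb.route(length)}")
```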
The architecture of DeepSeek-V3/R1 employs different degrees of parallelism during the prefill and decode phases: the prefill phase runs the routed experts with EP32 and the MLA/shared-expert layers with DP32, with each deployment unit spanning 4 nodes, while the decode phase scales up to EP144 and DP144, with each deployment unit spanning 18 nodes.
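As a back-of-the-envelope check, assuming DeepSeek-V3's 256 routed experts plus the 32 redundant expert copies reported in the Day 6 write-up, these parallelism degrees work out to roughly 9 routed experts per GPU during prefill and 2 during decode. The constants below are reported values, not something derived here.

```python
# Back-of-the-envelope check of the parallelism degrees above.
ROUTED_EXPERTS = 256
REDUNDANT_EXPERTS = 32
total = ROUTED_EXPERTS + REDUNDANT_EXPERTS   # 288 expert instances to place

for phase, ep_degree, nodes in [("prefill", 32, 4), ("decode", 144, 18)]:
    per_gpu = total / ep_degree
    print(f"{phase}: EP{ep_degree} across {nodes} nodes "
          f"-> {per_gpu:g} routed experts per GPU (plus the shared expert)")
# prefill: 288 / 32  = 9 routed experts per GPU
# decode:  288 / 144 = 2 routed experts per GPU
```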
To optimize throughput, DeepSeek developed a communication-computation overlapping mechanism. During the prefill phase, the system alternates between two microbatches, hiding the communication cost of one microbatch behind the computation of the other. In the decode phase, it subdivides the attention layer into two steps and uses a 5-stage pipeline to achieve seamless overlapping.
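The following toy sketch illustrates the dual-microbatch idea: while one microbatch's results are being "communicated" on a background thread (standing in for a CUDA stream doing all-to-all), the next microbatch is being "computed" on the main thread. The compute and communicate functions are placeholders, not DeepSeek's kernels.

```python
# Toy illustration of dual-microbatch communication-computation overlap.
import time
from concurrent.futures import ThreadPoolExecutor

def compute(mb: str) -> str:          # stand-in for attention/MoE computation
    time.sleep(0.05)
    return f"computed({mb})"

def communicate(result: str) -> str:  # stand-in for cross-node dispatch/combine
    time.sleep(0.05)
    return f"communicated({result})"

def prefill_step(microbatches: list[str]) -> None:
    with ThreadPoolExecutor(max_workers=1) as comm_stream:
        pending = None
        for mb in microbatches:
            out = compute(mb)            # compute the current microbatch...
            if pending is not None:
                pending.result()         # ...while the previous one's comm ran in parallel
            pending = comm_stream.submit(communicate, out)
        if pending is not None:
            pending.result()

prefill_step(["mb0", "mb1", "mb2", "mb3"])
```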
🚀 Day 6 of #OpenSourceWeek: One More Thing – DeepSeek-V3/R1 Inference System Overview
Optimized throughput and latency via:
🔧 Cross-node EP-powered batch scaling
🔄 Computation-communication overlap
⚖️ Load balancing
Statistics of DeepSeek's Online Service:
⚡ 73.7k/14.8k…
— DeepSeek (@deepseek_ai) March 1, 2025
This diagram depicts a system with two main components: Prefill and Decode services, each managed by load balancers for parallel processing. The API Server directs requests to these services. Both services utilize an optional external key-value cache (KVCache) for storage. The system is designed for efficient and scalable handling of API requests through parallel processing and caching.
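The sketch below is a hypothetical, heavily simplified model of that flow: the API server routes each request through per-service load balancers (round-robin here) to prefill and decode instances, and both stages share an external KVCache keyed by prompt prefix. All names and policies are illustrative assumptions, not the production design.

```python
# Simplified model of the request flow in the diagram.
import itertools

prefill_pool = itertools.cycle(["prefill-0", "prefill-1"])
decode_pool = itertools.cycle(["decode-0", "decode-1"])
kvcache: dict[str, str] = {}    # external KV cache store (simplified)

def handle_request(prompt: str) -> str:
    if prompt in kvcache:                      # prefix already prefilled earlier
        kv = kvcache[prompt]
    else:
        prefill_node = next(prefill_pool)      # prefill load balancer picks an instance
        kv = f"kv({prompt!r}) from {prefill_node}"
        kvcache[prompt] = kv                   # persist for future cache hits
    decode_node = next(decode_pool)            # decode load balancer picks an instance
    return f"decoding on {decode_node} using {kv}"

print(handle_request("Hello, DeepSeek"))   # cache miss: runs prefill
print(handle_request("Hello, DeepSeek"))   # cache hit: skips prefill
```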
The performance of the DeepSeek-V3/R1 inference system has been impressive. Over a 24-hour window, the service processed roughly 608 billion input tokens, of which 342 billion (56.3%) hit the on-disk KV cache, and generated about 168 billion output tokens at an average speed of 20 to 22 tokens per second. Each H800 node sustained an average throughput of roughly 73.7k input tokens per second (including cache hits) during prefill and about 14.8k output tokens per second during decode.
The operational costs and revenue of the DeepSeek-V3/R1 system are noteworthy. Assuming a leasing cost of $2 per hour per H800 GPU, the total daily cost of running the inference services amounted to $87,072.
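Working backwards from that figure, the daily cost implies an average occupancy of roughly 227 H800 nodes. The $2/hour GPU price and 24-hour window come from the article; 8 GPUs per node is an assumption about node size rather than a stated fact.

```python
# Working backwards from the reported daily cost.
GPU_HOURLY_COST = 2.0
GPUS_PER_NODE = 8     # assumption about H800 node size
HOURS_PER_DAY = 24

daily_cost = 87_072
gpu_hours = daily_cost / GPU_HOURLY_COST                 # 43,536 GPU-hours
avg_nodes = gpu_hours / (GPUS_PER_NODE * HOURS_PER_DAY)  # ≈ 226.75 nodes
print(f"{gpu_hours:,.0f} GPU-hours/day ≈ {avg_nodes:.2f} H800 nodes occupied on average")
```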
If all tokens were billed at DeepSeek-R1’s pricing, the theoretical total daily revenue would be $562,027, a remarkable cost profit margin of 545%. The pricing structure is as follows: $0.14 per million input tokens on a cache hit, $0.55 per million input tokens on a cache miss, and $2.19 per million output tokens.
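Plugging the token volumes above into these prices roughly reproduces the reported figure; small rounding differences are expected, and the calculation below is a sanity check rather than DeepSeek's own accounting.

```python
# Sanity check of the theoretical revenue and margin.
PRICE_INPUT_HIT = 0.14 / 1e6    # $ per input token served from cache
PRICE_INPUT_MISS = 0.55 / 1e6   # $ per input token requiring full prefill
PRICE_OUTPUT = 2.19 / 1e6       # $ per output token

input_tokens, cached_tokens, output_tokens = 608e9, 342e9, 168e9
cost = 87_072

revenue = (cached_tokens * PRICE_INPUT_HIT
           + (input_tokens - cached_tokens) * PRICE_INPUT_MISS
           + output_tokens * PRICE_OUTPUT)
print(f"theoretical daily revenue ≈ ${revenue:,.0f}, "
      f"cost profit margin ≈ {(revenue - cost) / cost:.0%}")
# prints roughly $562,100 and 546%, close to the reported $562,027 and 545%
```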
However, actual revenue is lower due to several factors: DeepSeek-V3 is priced significantly below R1, only a subset of services is monetized (web and app access remain free), and nighttime discounts apply during off-peak hours.
Note: the theoretical revenue is based on API pricing calculations and does not reflect actual earnings.
For a detailed analysis, please refer to the Day 6 GitHub repository.
Previous updates from #OpenSourceWeek include FlashMLA (Day 1), DeepEP (Day 2), DeepGEMM (Day 3), DualPipe and EPLB (Day 4), and 3FS with Smallpond (Day 5).
The DeepSeek-V3/R1 inference system represents a significant advancement in the field of artificial intelligence, particularly in optimizing throughput and latency. Through the innovative use of cross-node Expert Parallelism, effective load balancing, and communication-computation overlapping, DeepSeek has achieved impressive performance metrics.
As DeepSeek continues to refine its systems and share insights with the community, it contributes to the broader goal of artificial general intelligence (AGI). The insights gained from this week will not only deepen our understanding but also pave the way for future innovations in AI technology.
DeepSeek encourages the community to engage with these resources, which provide valuable insight into the ongoing development of the project and its implications for the future of AI.