DeepSeek #OpenSourceWeek Day 3: Release of DeepGEMM

Harsh Mishra Last Updated : 27 Feb, 2025
4 min read

As part of the ongoing #OpenSourceWeek, DeepSeek announced the release of DeepGEMM, a cutting-edge library designed for efficient FP8 General Matrix Multiplications (GEMMs). The library is tailored to support both dense and Mixture-of-Experts (MoE) GEMMs, making it a powerful tool for V3/R1 training and inference. With DeepGEMM, DeepSeek aims to push the boundaries of performance and efficiency in AI workloads, furthering its commitment to open-source innovation in the field.

This release marks Day 3 of Open Source Week, following the successful launches of DeepSeek FlashMLA on Day 1 and DeepSeek DeepEP on Day 2.

What is GEMM?

General Matrix Multiplication (GEMM) is an operation that multiplies two matrices and accumulates the result into a third. It is a fundamental operation in linear algebra, widely used across applications. Its general form is C = αAB + βC, where A and B are the input matrices, C is the output (and accumulator) matrix, and α and β are scalar coefficients.

GEMM is critical for optimizing model performance. It is particularly important in deep learning, where it dominates the cost of both training and inference of neural networks.
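To make the formula above concrete, here is a minimal pure-Python sketch of the GEMM operation C ← αAB + βC (real libraries such as DeepGEMM implement the same contract with highly optimized GPU kernels, not triple loops):

```python
def gemm(alpha, A, B, beta, C):
    """Compute C <- alpha * (A @ B) + beta * C for lists-of-lists matrices."""
    m, k = len(A), len(A[0])
    n = len(B[0])
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):          # inner (reduction) dimension
                acc += A[i][p] * B[p][j]
            C[i][j] = alpha * acc + beta * C[i][j]
    return C

A = [[1.0, 2.0],
     [3.0, 4.0]]
B = [[5.0, 6.0],
     [7.0, 8.0]]
C = [[0.0, 0.0],
     [0.0, 0.0]]
gemm(1.0, A, B, 0.0, C)  # plain matrix product: [[19, 22], [43, 50]]
```

With α = 1 and β = 0 this reduces to an ordinary matrix product; β ≠ 0 lets a kernel accumulate into an existing output, which is how GEMM calls are chained in practice.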

[Image: GEMM tiling diagram. Source: NVIDIA]

This image depicts GEMM (General Matrix Multiplication), showing matrices A, B, and the resulting C. It highlights tiling, dividing matrices into smaller blocks (Mtile, Ntile, Ktile) for optimized cache usage. The blue and yellow tiles illustrate the multiplication process, contributing to the green “Block_m,n” tile in C. This technique improves performance by enhancing data locality and parallelism.
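The tiling idea in the figure can be sketched in a few lines of Python. This is an illustration of the blocking technique only (the `tile` size is an arbitrary choice here); on a GPU the tiles map to shared memory and register blocks rather than Python loops:

```python
def tiled_matmul(A, B, tile=2):
    """Matrix product computed tile by tile: each (i0, j0, p0) block reuses a
    small working set of A, B, and C, which is what improves cache locality."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):            # rows of the Block_m,n output tile
        for j0 in range(0, n, tile):        # columns of the output tile
            for p0 in range(0, k, tile):    # Ktile steps along the reduction
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = C[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += A[i][p] * B[p][j]
                        C[i][j] = acc
    return C
```

The result is identical to the untiled product; only the traversal order changes, so each small block of A and B is reused many times before being evicted from fast memory.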

What is FP8?

FP8, or 8-bit floating point, is a reduced-precision format designed for high-performance computing that represents real-valued numerical data compactly. In machine learning and deep learning, huge datasets can impose heavy computational load; FP8 plays a vital role here by reducing that computational cost.

The FP8 E5M2 variant, for example, consists of:

  • 1 sign bit
  • 5 exponent bits
  • 2 fraction bits

This compact representation allows for faster computations and reduced memory usage, making it ideal for training large models on modern hardware. The trade-off is a potential loss of precision, but in many deep learning scenarios, this loss is acceptable and can even lead to improved performance due to reduced computational load.
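The precision loss is easy to see by rounding a value to a 2-bit mantissa, as E5M2 does. The sketch below simulates only the mantissa rounding (it ignores E5M2's exponent range and special values for simplicity):

```python
import math

def round_to_e5m2(x):
    """Round a float to the nearest value representable with a 2-bit mantissa
    (E5M2 precision); exponent range limits are ignored for simplicity."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)   # x = m * 2**e, with 0.5 <= |m| < 1
    scale = 2 ** 3         # 1 implicit + 2 explicit mantissa bits
    return math.ldexp(round(m * scale) / scale, e)

round_to_e5m2(3.14159)  # -> 3.0
round_to_e5m2(0.1)      # -> 0.09375
```

Only four significand values exist per power-of-two interval, so pi collapses to 3.0. Training recipes tolerate this because gradients and activations are noisy anyway, while the smaller format doubles throughput and halves memory traffic versus FP16.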

[Image: FP8 formats (E4M3, E5M2) compared with FP16 and BF16. Source: NVIDIA]

This image illustrates FP8 (8-bit Floating Point) formats, specifically E4M3 and E5M2, alongside FP16 and BF16 for comparison. It shows how FP8 representations allocate bits for sign, exponent, and mantissa, affecting precision and range. E4M3 uses 4 exponent bits and 3 mantissa bits, while E5M2 uses 5 and 2 respectively. The image highlights the trade-offs in precision and range between different floating-point formats, with FP8 offering reduced precision but lower memory footprint.

Need for DeepGEMM

DeepGEMM addresses the challenges in Matrix Multiplication by providing a lightweight, high-performance library that is easy to use and flexible enough to handle a variety of GEMM operations.

  • Addresses a Critical Need: DeepGEMM fills a gap in the AI community by providing optimized FP8 GEMM.
  • High-Performance and Lightweight: It offers fast computation with a small memory footprint.
  • Supports Dense and MoE Layouts: It’s versatile, handling both standard and Mixture-of-Experts model architectures.
  • Essential for Large-Scale AI: Its efficiency is crucial for training and running complex AI models.
  • Optimizes MoE Architectures: DeepGEMM implements specialized GEMM types (contiguous-grouped, masked-grouped) for MoE efficiency.
  • Enhances DeepSeek’s Models: It directly improves the performance of DeepSeek’s AI models.
  • Benefits the Global AI Ecosystem: By offering a highly efficient tool, it aids AI developers worldwide.
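The "contiguous-grouped" MoE layout mentioned above can be sketched conceptually: tokens are sorted by the expert they were routed to and concatenated along the M dimension, so each expert's GEMM runs over one contiguous slice. This pure-Python stand-in illustrates the data layout only, not DeepGEMM's actual kernel interface:

```python
def grouped_gemm_contiguous(tokens, group_sizes, expert_weights):
    """Sketch of a contiguous-grouped GEMM for MoE: rows of `tokens` are
    pre-sorted by expert, and each contiguous group of `group_sizes[g]` rows
    is multiplied by that expert's weight matrix."""
    out, row = [], 0
    for size, W in zip(group_sizes, expert_weights):
        for i in range(row, row + size):
            out.append([sum(tokens[i][p] * W[p][j] for p in range(len(W)))
                        for j in range(len(W[0]))])
        row += size
    return out
```

Keeping each expert's tokens contiguous is what lets a single fused kernel sweep over all experts without launching one GEMM per expert; the masked-grouped variant instead pads each group to a fixed size and skips the padding, which suits inference-time decoding.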

Key Features of DeepGEMM

DeepGEMM stands out with its impressive features:

  • High Performance: Achieving up to 1350+ FP8 TFLOPS on NVIDIA Hopper GPUs, DeepGEMM is optimized for speed and efficiency.
  • Lightweight Design: The library has no heavy dependencies and is designed to be as clean and readable as a tutorial, keeping the focus on core functionality rather than elaborate setup.
  • Just-In-Time Compilation: DeepGEMM is fully Just-In-Time (JIT): all kernels are compiled at runtime, so there is no installation-time compilation step and users can concentrate on their actual implementation rather than build configuration.
  • Concise Core Logic: With core logic comprising approximately 300 lines of code, DeepGEMM outperforms many expert-tuned kernels across a wide range of matrix sizes. This compact design not only facilitates easier understanding and modification but also ensures high efficiency.
  • Support for Diverse Layouts: The library supports both dense layouts and two types of MoE layouts, catering to different computational needs.

Performance Metrics

DeepGEMM has been rigorously tested across various matrix shapes, demonstrating significant speedups compared to existing implementations. Below is a summary of performance metrics:

  M    | N    | K    | Computation | Memory Bandwidth | Speedup
  -----|------|------|-------------|------------------|--------
  64   | 2112 | 7168 | 206 TFLOPS  | 1688 GB/s        | 2.7x
  128  | 7168 | 2048 | 510 TFLOPS  | 2277 GB/s        | 1.7x
  4096 | 4096 | 7168 | 1304 TFLOPS | 500 GB/s         | 1.1x

Table 1: Performance metrics showcasing DeepGEMM’s efficiency across various configurations.

Installation Guide

Getting started with DeepGEMM is straightforward. Here’s a quick guide to install the library:

Step 1: Prerequisites

  • Hopper architecture GPUs (sm_90a)
  • Python 3.8 or above
  • CUDA 12.3 or above (recommended: 12.8 or above)
  • PyTorch 2.1 or above
  • CUTLASS 3.6 or above (can be cloned as a Git submodule)

Step 2: Clone the DeepGEMM Repository

git clone --recursive git@github.com:deepseek-ai/DeepGEMM.git

Step 3: Install the Library

python setup.py install

Step 4: Import DeepGEMM in your Python Project

import deep_gemm

For detailed installation instructions and additional information, visit the DeepGEMM GitHub repository.

Conclusion

DeepGEMM stands out as a powerful FP8 GEMM library, known for its speed and ease of use, making it a great fit for tackling the challenges of advanced machine learning tasks. With its lightweight design, fast execution, and flexibility to work with different data layouts, DeepGEMM is a go-to tool for developers everywhere. Whether you’re working on training or inference, this library is built to simplify complex workflows, helping researchers and practitioners push the boundaries of what’s possible in AI.

Stay tuned to Analytics Vidhya Blog for our detailed analysis on DeepSeek’s Day 4 release!

Harsh Mishra is an AI/ML Engineer who spends more time talking to Large Language Models than actual humans. Passionate about GenAI, NLP, and making machines smarter (so they don’t replace him just yet). When not optimizing models, he’s probably optimizing his coffee intake. 🚀☕
