DeepSeek #OpenSourceWeek Day 5: Launch of 3FS and Smallpond Framework

Harsh Mishra Last Updated : 28 Feb, 2025

On February 28, 2025, day 5 of its #OpenSourceWeek, DeepSeek open-sourced the Fire-Flyer File System (3FS) and the Smallpond data processing framework. Both are designed to speed up data access and processing, particularly for AI training and inference workloads.

Fire-Flyer File System (3FS)

The Fire-Flyer File System (3FS) is a high-performance distributed file system that leverages modern SSDs and RDMA networks. It aims to provide a robust shared storage layer that simplifies the development of distributed applications.

What is RDMA?

Remote direct memory access (RDMA) lets one computer read from or write to the memory of another computer directly, bypassing the operating system on both machines. Taking the kernel out of the data path gives low-latency, high-throughput transfers between the two memory spaces.

Key Features of 3FS

  • Performance and Usability
    • Achieves an impressive 6.6 TiB/s aggregate read throughput in a 180-node cluster.
    • Supports 3.66 TiB/min throughput on the GraySort benchmark in a 25-node cluster.
    • Delivers 40+ GiB/s peak throughput per client node for KVCache lookups.
  • Disaggregated Architecture
    • Combines the throughput of thousands of SSDs with the network bandwidth of hundreds of storage nodes.
    • Enables applications to access storage resources in a locality-oblivious manner.
  • Strong Consistency
    • Implements Chain Replication with Apportioned Queries (CRAQ) for strong consistency, simplifying application code.
  • File Interfaces
    • Provides stateless metadata services backed by a transactional key-value store (e.g., FoundationDB).
    • Familiar file interface eliminates the need for learning a new storage API.
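
To make the strong-consistency bullet more concrete, here is a minimal single-process sketch of the idea behind CRAQ: writes travel down a chain of replicas and are committed at the tail, while any replica may serve reads. A replica holding only committed ("clean") data answers locally; a replica with an uncommitted ("dirty") version asks the tail which version is committed. This is a toy model under simplifying assumptions (no networking, no failures), not the 3FS implementation, and the names CraqChain, start_write, and finish_write are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    versions: dict = field(default_factory=dict)  # key -> {version: value}
    clean: dict = field(default_factory=dict)     # key -> newest committed version


class CraqChain:
    """Toy single-process model of a CRAQ replication chain (hypothetical names)."""

    def __init__(self, length):
        self.nodes = [Node() for _ in range(length)]
        self.pending = {}       # (key, version) -> value still in flight
        self.last_version = {}  # key -> newest version handed out

    @property
    def tail(self):
        return self.nodes[-1]

    def start_write(self, key, value):
        """Store a new dirty version on every node ahead of the tail."""
        version = self.last_version.get(key, 0) + 1
        self.last_version[key] = version
        self.pending[(key, version)] = value
        for node in self.nodes[:-1]:
            node.versions.setdefault(key, {})[version] = value
        return version

    def finish_write(self, key, version):
        """The write reaches the tail, which commits; the ack flows back to the head."""
        value = self.pending.pop((key, version))
        self.tail.versions.setdefault(key, {})[version] = value
        for node in reversed(self.nodes):
            node.clean[key] = version
            # Versions older than the committed one are no longer needed.
            node.versions[key] = {
                v: val for v, val in node.versions[key].items() if v >= version
            }

    def write(self, key, value):
        self.finish_write(key, self.start_write(key, value))

    def read(self, node_index, key):
        """Serve a read from any node in the chain."""
        node = self.nodes[node_index]
        versions = node.versions.get(key, {})
        if versions and max(versions) == node.clean.get(key):
            return versions[max(versions)]    # clean: answer locally
        committed = self.tail.clean.get(key)  # dirty: ask the tail what is committed
        return None if committed is None else versions[committed]


chain = CraqChain(3)
chain.write("k", "v1")
in_flight = chain.start_write("k", "v2")       # v2 has not reached the tail yet
print([chain.read(i, "k") for i in range(3)])  # every replica still answers v1
chain.finish_write("k", in_flight)
print([chain.read(i, "k") for i in range(3)])  # every replica now answers v2
```

The point of the sketch is that a reader never observes an uncommitted value regardless of which replica it contacts, which is what lets application code skip its own consistency checks.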

Diverse Workloads Supported

  • Data Preparation
    • Organizes outputs of data analytics pipelines into hierarchical directory structures.
    • Efficiently manages large volumes of intermediate outputs.
  • Dataloaders
    • Enables random access to training samples across compute nodes, eliminating the need for prefetching or shuffling datasets.
  • Checkpointing
    • Supports high-throughput parallel checkpointing for large-scale training.
  • KVCache for Inference
    • Provides a cost-effective alternative to DRAM-based caching, offering high throughput and significantly larger capacity.
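
The dataloader workload can be illustrated with a small sketch: if samples are stored as fixed-size records in a shared file, any worker can fetch sample i with a single seek, so shuffling reduces to permuting indices instead of rewriting the dataset. The record layout, the function names, and the use of a local temporary file in place of a 3FS mount are all assumptions made for illustration.

```python
import os
import random
import struct
import tempfile

# Hypothetical record layout: a 4-byte sample id followed by a 16-byte payload.
RECORD = struct.Struct("<I16s")


def write_samples(path, count):
    """Write `count` fixed-size records to a file."""
    with open(path, "wb") as f:
        for i in range(count):
            f.write(RECORD.pack(i, f"sample-{i}".encode()))


def read_sample(path, index):
    """Fetch one record by seeking straight to its byte offset."""
    with open(path, "rb") as f:
        f.seek(index * RECORD.size)
        sample_id, payload = RECORD.unpack(f.read(RECORD.size))
    return sample_id, payload.rstrip(b"\x00").decode()


def epoch_order(count, seed):
    """Shuffle indices only; the data on disk is never reordered."""
    order = list(range(count))
    random.Random(seed).shuffle(order)
    return order


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "samples.bin")
    write_samples(path, 100)
    for index in epoch_order(100, seed=0)[:3]:
        print(read_sample(path, index))
```

This access pattern only pays off when the storage layer handles small random reads well, which is the property the Dataloaders bullet attributes to 3FS.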

Performance Insights

The performance of 3FS has been validated through rigorous testing. For instance, a read stress test on a large 3FS cluster demonstrated an aggregate read throughput of 6.6 TiB/s with background traffic from training jobs.
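
As a rough sanity check on that figure, the aggregate throughput can be divided by the node count to estimate what each node contributes. The calculation assumes the load is spread evenly across the 180 nodes, which the source does not state; the function name is hypothetical.

```python
TIB = 2**40
GIB = 2**30


def per_node_gib_per_s(aggregate_tib_per_s, nodes):
    """Average per-node throughput in GiB/s, assuming an even spread of load."""
    return aggregate_tib_per_s * TIB / nodes / GIB


# 6.6 TiB/s aggregate read throughput across a 180-node cluster
print(f"{per_node_gib_per_s(6.6, 180):.1f} GiB/s per node")  # prints 37.5 GiB/s per node
```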

Smallpond Framework

Alongside 3FS, DeepSeek introduced Smallpond, a lightweight distributed data processing framework built to run on top of it. Smallpond uses DuckDB as its compute engine and stores data in Parquet format on a distributed file system (e.g., 3FS).

Key Features of Smallpond

  • Performance: Smallpond uses DuckDB to deliver native-level performance for efficient data processing.
  • Scalability: Leverages high-performance distributed file systems for intermediate storage, enabling PB-scale data handling without memory bottlenecks.
  • Simplicity: No long-running services or complex dependencies, making it easy to deploy and maintain.
  • Efficient Data Processing
    • Utilizes a two-phase approach for sorting large-scale datasets, enhancing performance and efficiency.
    • Successfully sorted 110.5 TiB of data across 8,192 partitions in just 30 minutes and 14 seconds, achieving an average throughput of 3.66 TiB/min.
  • Integration with 3FS
    • Smallpond works seamlessly with 3FS, leveraging its high throughput and strong consistency features.
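
The article does not spell out the two sorting phases. A common design, assumed here, is to first route each record to a partition chosen by the leading bits of its key, then sort every partition independently. Because the partitions are already ordered relative to one another, concatenating them yields a fully sorted result. The sketch below shows this on in-memory integers; the function names and the 32-bit key width are illustrative assumptions, not Smallpond's API.

```python
import random


def partition_by_prefix(keys, prefix_bits, key_bits=32):
    """Phase 1: route each key to the partition named by its top `prefix_bits` bits."""
    partitions = [[] for _ in range(1 << prefix_bits)]
    shift = key_bits - prefix_bits
    for key in keys:
        partitions[key >> shift].append(key)
    return partitions


def sort_partitions(partitions):
    """Phase 2: sort each partition on its own (this step parallelizes trivially)."""
    return [sorted(partition) for partition in partitions]


def two_phase_sort(keys, prefix_bits=3):
    """Concatenate the sorted partitions; partition order already follows key order."""
    result = []
    for partition in sort_partitions(partition_by_prefix(keys, prefix_bits)):
        result.extend(partition)
    return result


rng = random.Random(0)
keys = [rng.randrange(2**32) for _ in range(1000)]
print(two_phase_sort(keys) == sorted(keys))  # prints True
```

Under this scheme, 8,192 partitions would correspond to 13 prefix bits, and no partition ever needs to see data that belongs to another.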

Getting Started with 3FS and Smallpond

3FS Installation Instructions

Clone the repository and install the necessary dependencies to get started with 3FS.

# Clone the 3FS repository
git clone https://github.com/deepseek-ai/3fs

# Enter the directory, initialize submodules, and apply the patches
cd 3fs
git submodule update --init --recursive
./patches/apply.sh

For more usage and options, please refer to the 3FS documentation.

Getting Started with Smallpond

To get started with Smallpond, please follow these steps:

Installation

  • Make sure you have Python 3.8+ installed on your device.
  • Install Smallpond using pip (inside a Jupyter notebook, prefix the command with !):
pip install smallpond

Initialization

The first step is to initialize a Smallpond session:

import smallpond
sp = smallpond.init()

Loading Data

You can create a DataFrame from a set of files. For example, to load Parquet files:

df = sp.read_parquet("path/to/dataset/*.parquet")

Partitioning Data

Smallpond requires users to manually specify data partitions. Here are some examples:

df = df.repartition(3)  # Repartition by files
df = df.repartition(3, by_row=True)  # Repartition by rows
df = df.repartition(3, hash_by="host")  # Repartition by hash of a column
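
To show what hash partitioning guarantees, the following standalone sketch assigns rows to partitions by hashing one column, so all rows sharing a value for that column land in the same partition. It is a plain-Python illustration, not Smallpond's implementation: the helper name and the choice of CRC32 as the hash are assumptions.

```python
import zlib
from collections import defaultdict


def hash_partition(rows, npartitions, column):
    """Group rows into `npartitions` buckets by a stable hash of `column`."""
    buckets = defaultdict(list)
    for row in rows:
        # CRC32 is used because it is stable across runs, unlike Python's hash().
        bucket = zlib.crc32(str(row[column]).encode()) % npartitions
        buckets[bucket].append(row)
    return [buckets[i] for i in range(npartitions)]


rows = [
    {"host": "a.example", "hits": 1},
    {"host": "b.example", "hits": 2},
    {"host": "a.example", "hits": 3},
    {"host": "c.example", "hits": 4},
]
for i, partition in enumerate(hash_partition(rows, 3, "host")):
    print(i, partition)
```

Keeping equal keys together is what makes per-partition aggregations and joins on that column correct without another shuffle.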

Transforming Data

You can apply Python functions or SQL expressions to transform your data. For example:

df = df.map('a + b as c')  # Using SQL-like syntax
df = df.map(lambda row: {'c': row['a'] + row['b']})  # Using a Python function
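
The two forms above express the same transformation. As a standalone illustration that does not require Smallpond, the sketch below evaluates the SQL expression with Python's built-in sqlite3 module (standing in for DuckDB, purely as an assumption for portability) and compares it with the row-wise Python function. The helper names map_sql and map_python are hypothetical.

```python
import sqlite3

rows = [{"a": 1, "b": 2}, {"a": 10, "b": 20}]


def map_python(rows, fn):
    """Apply a Python function to every row."""
    return [fn(row) for row in rows]


def map_sql(rows, expression):
    """Evaluate a SQL select expression over the rows using in-memory SQLite."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    conn.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
    conn.executemany("INSERT INTO t VALUES (:a, :b)", rows)
    result = [dict(r) for r in conn.execute(f"SELECT {expression} FROM t ORDER BY rowid")]
    conn.close()
    return result


sql_result = map_sql(rows, "a + b as c")
py_result = map_python(rows, lambda row: {"c": row["a"] + row["b"]})
print(sql_result == py_result)  # prints True
```

The SQL form lets the engine evaluate the expression over whole columns, while the Python form is more flexible but runs once per row.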

Saving Data

After processing your data, you can save it back to various formats. For instance, to save your DataFrame as a Parquet file:

df.write_parquet("path/to/output/dataset.parquet")

Running Smallpond Jobs

To execute a job in Smallpond, you can use the following command:

sp.run(df)

This command will trigger the execution of the transformations and save the results as specified.

Monitoring and Debugging

Smallpond provides tools for monitoring job progress and debugging. When a job fails, its logs are the first place to look. Beyond that, users have access to a knowledge base of documentation and tutorials that includes real-world examples.

Use cases and step-by-step guides are also available through the official support channel.

For more details, refer to the Smallpond documentation.

Conclusion

Open-sourcing 3FS and the Smallpond framework is a significant step forward for data processing. Their performance, ease of use, and strong consistency give researchers and developers in the open-source community a solid foundation. As data-intensive applications evolve ever faster, 3FS and Smallpond offer infrastructure built to meet the demands of modern workloads.
