On February 28, 2025, DeepSeek open-sourced the Fire-Flyer File System (3FS) and the Smallpond data processing framework. Both are designed to improve data access and processing, particularly for AI training and inference workloads.
The Fire-Flyer File System (3FS) is a high-performance distributed file system that leverages modern SSDs and RDMA networks. It aims to provide a robust shared storage layer that simplifies the development of distributed applications.
Remote direct memory access (RDMA) lets one computer read from or write to the memory of another directly, bypassing the operating system on both machines. Keeping the kernel out of the data path reduces latency and CPU overhead for each transfer.
The performance of 3FS has been validated through rigorous testing. For instance, a read stress test on a large 3FS cluster demonstrated an aggregate read throughput of 6.6 TiB/s with background traffic from training jobs.
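To put the reported figure in more familiar terms, the short calculation below converts 6.6 TiB/s to decimal terabytes per second and estimates how long a full scan of a dataset would take at that rate. The 1 PiB dataset size is a hypothetical value chosen only for illustration; it does not come from the test itself.

```python
# Unit conversion for the reported aggregate read throughput.
TIB = 2**40   # bytes in one tebibyte (binary unit)
TB = 10**12   # bytes in one terabyte (decimal unit)

reported_tib_per_s = 6.6  # figure quoted from the read stress test

bytes_per_s = reported_tib_per_s * TIB
tb_per_s = bytes_per_s / TB

# Hypothetical dataset size, used only to make the rate tangible.
dataset_tib = 1024  # 1 PiB expressed in TiB
seconds_to_read = dataset_tib / reported_tib_per_s

print(f"{tb_per_s:.2f} TB/s")                      # 7.26 TB/s
print(f"{seconds_to_read:.0f} s to scan 1 PiB")    # 155 s to scan 1 PiB
```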
DeepSeek released Smallpond alongside 3FS as a lightweight distributed data processing framework designed to run on it. Smallpond uses DuckDB as its compute engine and stores data in Parquet format on a distributed file system such as 3FS.
Clone the repository and install the necessary dependencies to get started with 3FS.
# Clone the 3FS repository
git clone https://github.com/deepseek-ai/3fs
# Navigate to the directory and initialize submodules
cd 3fs
git submodule update --init --recursive
./patches/apply.sh
For more usage and options, please refer to the 3FS documentation.
To get started with Smallpond, install it with pip:
pip install smallpond
The first step is to initialize a Smallpond session:
import smallpond
sp = smallpond.init()
You can create a DataFrame from a set of files. For example, to load Parquet files:
df = sp.read_parquet("path/to/dataset/*.parquet")
Smallpond requires users to manually specify data partitions. Here are some examples:
df = df.repartition(3) # Repartition by files
df = df.repartition(3, by_row=True) # Repartition by rows
df = df.repartition(3, hash_by="host") # Repartition by hash of a column
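The three strategies differ in what gets divided and how. The sketch below illustrates the idea in plain Python on in-memory lists. It is not Smallpond's implementation: the function names are hypothetical, and the exact assignment policy Smallpond uses (for example, round-robin versus contiguous ranges) is an assumption.

```python
from __future__ import annotations

import zlib


def partition_by_files(files: list[str], n: int) -> list[list[str]]:
    """Assign whole files to n partitions round-robin; files are never split."""
    parts: list[list[str]] = [[] for _ in range(n)]
    for i, name in enumerate(files):
        parts[i % n].append(name)
    return parts


def partition_by_rows(rows: list[dict], n: int) -> list[list[dict]]:
    """Split rows into n contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(rows), n)
    parts, start = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)
        parts.append(rows[start:start + size])
        start += size
    return parts


def partition_by_hash(rows: list[dict], n: int, key: str) -> list[list[dict]]:
    """Route each row by a stable hash of row[key]; equal keys share a partition."""
    parts: list[list[dict]] = [[] for _ in range(n)]
    for row in rows:
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        digest = zlib.crc32(str(row[key]).encode("utf-8"))
        parts[digest % n].append(row)
    return parts


# Illustrative data only.
files = ["part-0.parquet", "part-1.parquet", "part-2.parquet", "part-3.parquet"]
rows = [{"host": h, "bytes": b}
        for h, b in [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]]

print(partition_by_files(files, 3))
print([len(p) for p in partition_by_rows(rows, 3)])
print(partition_by_hash(rows, 3, "host"))
```

Hash partitioning is the one to reach for when a later step needs all rows with the same key, such as the same `host`, to land in the same partition.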
You can apply Python functions or SQL expressions to transform your data. For example:
df = df.map('a + b as c') # Using SQL-like syntax
df = df.map(lambda row: {'c': row['a'] + row['b']}) # Using a Python function
After processing your data, you can save it back to various formats. For instance, to save your DataFrame as a Parquet file:
df.write_parquet("path/to/output/dataset.parquet")
To execute a job in Smallpond, you can use the following command:
sp.run(df)
This command will trigger the execution of the transformations and save the results as specified.
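The read, repartition, transform, and write calls from this walkthrough can be collected into a single function. This is a minimal sketch rather than an official recipe: `build_pipeline` and `main` are hypothetical names, and it assumes the dataset has the columns `host`, `a`, and `b` used in the earlier snippets. Only calls that appear in the snippets above are used.

```python
def build_pipeline(sp, input_glob: str, output_path: str, partitions: int = 3):
    """Chain the walkthrough's calls on an already-initialized Smallpond session."""
    df = sp.read_parquet(input_glob)                  # load the Parquet files
    df = df.repartition(partitions, hash_by="host")   # group rows by hash of `host`
    df = df.map("a + b as c")                         # derive column `c` with a SQL expression
    df.write_parquet(output_path)                     # save the result as Parquet
    return df


def main() -> None:
    # Imported here so the function above can be read and tested
    # without Smallpond installed.
    import smallpond

    sp = smallpond.init()
    build_pipeline(
        sp,
        "path/to/dataset/*.parquet",         # placeholder input path
        "path/to/output/dataset.parquet",    # placeholder output path
    )


# Call main() to run against a real dataset; it is left uncalled here
# because the paths above are placeholders.
```

Passing the session in as an argument keeps the pipeline definition separate from session setup, which makes it easy to exercise the function with a stand-in object.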
Smallpond provides tools for monitoring job progress and debugging. When a job fails, its logs are the first place to look for the cause. The documentation and tutorials explain how to use Smallpond effectively and include real-world examples.
Use cases and step-by-step guides are also available through the official support channel, where users can get help with problems they encounter.
The open-sourcing of 3FS and Smallpond is a significant step forward for data processing. Their performance, ease of use, and consistency give researchers and developers in the open-source community a strong foundation to build on. As data-intensive applications evolve at a faster pace, 3FS and Smallpond offer infrastructure suited to the workloads of modern applications.