Scalable Data Processing for Large Language Models

08 Apr 2025, 1:04 PM – 2:04 PM

About the Event

Training Large Language Models (LLMs) requires processing massive datasets efficiently. Traditional CPU-based data pipelines struggle to keep pace with the exponential growth of training data, creating bottlenecks in model training. In this talk, we present NeMo Curator, a GPU-accelerated, scalable, Python-based framework designed to curate high-quality datasets for LLMs. Built on GPU-accelerated processing with RAPIDS, NeMo Curator provides modular pipelines for synthetic data generation, deduplication, filtering, classification, and PII reduction, improving both data quality and training efficiency. Attendees will gain insight into configurable pipelines that streamline the end-to-end data curation process for LLM training.
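To make one of those modules concrete, here is a minimal plain-Python sketch of exact deduplication by content hashing. This is illustrative only and does not use NeMo Curator's actual API: the framework runs the same idea (plus fuzzy, MinHash-based deduplication) on GPUs via RAPIDS, and the function name `exact_dedup` is our own.

```python
import hashlib

def exact_dedup(documents):
    """Drop documents whose normalized text hashes to an already-seen digest.

    Illustrative sketch only: NeMo Curator implements exact and fuzzy
    deduplication with GPU-accelerated RAPIDS primitives; this shows the
    core idea in plain Python.
    """
    seen = set()
    unique = []
    for doc in documents:
        # Light normalization so trivially different copies collide.
        digest = hashlib.md5(doc.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```

The same hashing scheme scales to large corpora because each document is reduced to a fixed-size digest before comparison; on a GPU cluster the hashing and set membership are what get parallelized.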


Key Takeaways:

  • Challenges in Data Processing for LLMs – Understanding the bottlenecks of traditional CPU-based pipelines.
  • Introduction to NeMo Curator – A GPU-accelerated, scalable framework for efficient dataset curation.
  • Key Features & Modules – Exploring synthetic data generation, deduplication, filtering, classification, and PII reduction.
  • Real-World Use Cases – How NeMo Curator enhances data quality and model training efficiency.
  • Live Demonstration – Configuring and running a data curation pipeline using NeMo Curator.
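The PII-reduction step listed above can be sketched with simple regular expressions. This is a hypothetical, rule-based illustration of redaction, not NeMo Curator's PII module (which covers many more entity types, typically combining NER models with rules); the patterns and the `redact_pii` name are our own.

```python
import re

# Illustrative patterns only: real PII reduction handles names, addresses,
# credit cards, and more, usually with model-based entity recognition.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact_pii(text):
    """Replace each matched PII span with a typed placeholder like [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```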


About the Speaker

Allison Ding

Developer Advocate - Data Science at NVIDIA

Allison Ding is a developer advocate for GPU-accelerated AI APIs, libraries, and tools at NVIDIA, with a specialization in advanced data science techniques and large language models (LLMs). She brings over eight years of hands-on experience as a data scientist, focusing on managing and delivering end-to-end data science solutions. Her academic background includes a strong emphasis on natural language processing (NLP). Allison holds a master’s degree in Applied Statistics from Cornell University and a master’s degree in Computer Science from San Francisco Bay University. You can reach her on LinkedIn.

Registration Details

2,311 registered till now
