Scalable Data Processing for Large Language Models
08 Apr 2025 13:04pm - 08 Apr 2025 14:04pm
About the Event
Training Large Language Models (LLMs) requires processing massive-scale datasets efficiently. Traditional CPU-based data pipelines struggle to keep up with the exponential growth of data, leading to bottlenecks in model training. In this talk, we present NeMo Curator, an accelerated, scalable Python-based framework designed to curate high-quality datasets for LLMs. Leveraging GPU-accelerated processing with RAPIDS, NeMo Curator provides modular pipelines for synthetic data generation, deduplication, filtering, classification, and PII reduction, improving data quality and training efficiency. Attendees will gain insights into configurable pipelines that enhance the overall data curation process for training LLMs.
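To make the idea of a modular curation pipeline concrete, here is a minimal sketch in plain Python. It is not the NeMo Curator API and it runs on the CPU only; the stage names (`filter_short`, `exact_dedup`, `mask_emails`), the `run_pipeline` helper, and the sample documents are all hypothetical and chosen for illustration. It shows only how filtering, deduplication, and a simple form of PII handling can be composed as independent stages.

```python
import hashlib
import re
from typing import Callable, List

# A stage takes a list of documents and returns a (possibly smaller) list.
Stage = Callable[[List[str]], List[str]]

# Deliberately simple email pattern; real PII detection is far broader.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def filter_short(min_words: int) -> Stage:
    """Build a stage that drops documents with fewer than min_words words."""
    def stage(docs: List[str]) -> List[str]:
        return [d for d in docs if len(d.split()) >= min_words]
    return stage


def exact_dedup(docs: List[str]) -> List[str]:
    """Keep the first copy of each document, comparing hashes of
    case-folded, whitespace-normalized text."""
    seen = set()
    kept = []
    for d in docs:
        normalized = " ".join(d.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(d)
    return kept


def mask_emails(docs: List[str]) -> List[str]:
    """Replace email addresses with a placeholder token."""
    return [EMAIL_RE.sub("<EMAIL>", d) for d in docs]


def run_pipeline(docs: List[str], stages: List[Stage]) -> List[str]:
    """Apply each stage in order, feeding its output to the next stage."""
    for stage in stages:
        docs = stage(docs)
    return docs


docs = [
    "Large language models need clean training data.",
    "large  language models need clean training data.",  # duplicate after normalization
    "Too short.",
    "Contact the maintainers at team@example.com for dataset access.",
]

curated = run_pipeline(docs, [filter_short(5), exact_dedup, mask_emails])
for doc in curated:
    print(doc)
```

Because each stage shares the same list-in, list-out signature, stages can be reordered, removed, or swapped for accelerated implementations without touching the rest of the pipeline.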
Key Takeaways:
- Challenges in Data Processing for LLMs – Understanding the bottlenecks of traditional CPU-based pipelines.
- Introduction to NeMo Curator – A GPU-accelerated, scalable framework for efficient dataset curation.
- Key Features & Modules – Exploring synthetic data generation, deduplication, filtering, classification, and PII reduction.
- Real-World Use Cases – How NeMo Curator enhances data quality and model training efficiency.
- Live Demonstration – Configuring and running a data curation pipeline using NeMo Curator.
- Best articles get published on Analytics Vidhya’s Blog Space
Who is this DataHour for?
About the Speaker
Allison Ding is a developer advocate for GPU-accelerated AI APIs, libraries, and tools at NVIDIA, with a specialization in advanced data science techniques and large language models (LLMs). She brings over eight years of hands-on experience as a data scientist, focusing on managing and delivering end-to-end data science solutions. Her academic background includes a strong emphasis on natural language processing (NLP). Allison holds a master’s degree in Applied Statistics from Cornell University and a master’s degree in Computer Science from San Francisco Bay University. You can reach her on LinkedIn.
Become a Speaker
Share your vision, inspire change, and leave a mark on the industry. We're calling for innovators and thought leaders to speak at our event.
- Professional Exposure
- Networking Opportunities
- Thought Leadership
- Knowledge Exchange
- Leading-Edge Insights
- Community Contribution
