Pandas on Ray – A Library to Make Pandas Faster with Just One Line of Code

Pranav Dar Last Updated : 07 Mar, 2018

Overview

Pandas on Ray is a library that makes Pandas significantly faster by distributing its work across the available cores.
Using it requires changing just one line of code: the import statement.
Read on to see the import statement and speed up your computations!

Introduction

Dealing with large-scale data has always been a challenging task for data scientists. With limited resources and computational power, it often becomes a daunting experience.

Pandas is one of the most commonly used Python libraries, but running it on a single core becomes a bottleneck with large datasets. Most users do not want to rework their entire workflow just to fit their existing hardware; they do, however, want Pandas to run faster regardless of the size of the data.

So researchers at Berkeley have come up with Pandas on Ray, a library that wraps Pandas and transparently distributes the data and computation. It’s targeted at existing Pandas users who want their programs to run faster and scale better without making major changes to their code.

Ray itself is a flexible, high-performance distributed execution framework.

According to the researchers, “The user does not need to know how many cores their system or cluster has, nor do they need to specify how to distribute the data”. Even on a single machine, users can continue using their usual Pandas notebooks but will experience a significant upgrade in processing speed.
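To make "distributing the data and computation" concrete, here is a minimal sketch of the general partition, apply, and combine idea using only the Python standard library. It is not how Pandas on Ray is implemented; the function names (split_into_partitions, parallel_column_sum) and the sample data are hypothetical, chosen purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
import os


def split_into_partitions(rows, n_partitions):
    """Split a list of rows into at most n_partitions contiguous chunks."""
    if n_partitions < 1:
        raise ValueError("n_partitions must be at least 1")
    size = max(1, -(-len(rows) // n_partitions))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def parallel_column_sum(rows, column, n_workers=None):
    """Sum one column by summing each partition separately, then combining."""
    # Like the library described above, default to the number of cores found.
    n_workers = n_workers or os.cpu_count() or 1
    partitions = split_into_partitions(rows, n_workers)
    # Threads keep this sketch simple and portable. They show the structure of
    # the computation, not the speedup a real distributed framework provides.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = pool.map(
            lambda part: sum(row[column] for row in part), partitions
        )
        return sum(partial_sums)


# Hypothetical sample data: 100 rows with a single "price" column.
rows = [{"price": i} for i in range(1, 101)]
total = parallel_column_sum(rows, "price")
print(total)  # 5050
```

The caller never says how to split the data or how many workers to use, which is the convenience the researchers describe; a real framework adds scheduling, shared memory, and multi-machine support on top of this basic shape.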

All the user needs to do is replace the usual Pandas import statement with the following:

import ray.dataframe as pd

And you’re good to go! Ray is initialized automatically with the number of cores available to you.
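If you want the same script to run on machines where Pandas on Ray is not installed, one option is to fall back to plain Pandas. The sketch below assumes the module path ray.dataframe given in this article, which may not be importable in the Ray version you have; the helper import_first_available and the sample DataFrame are hypothetical, and the example assumes the imported module offers the usual Pandas DataFrame constructor and column sum.

```python
import importlib


def import_first_available(module_names):
    """Return the first module in module_names that imports, or None."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None


# Prefer the Pandas on Ray module named in the article, else plain Pandas.
# "ray.dataframe" is taken from the article and may not exist in your install.
pd = import_first_available(["ray.dataframe", "pandas"])

if pd is None:
    print("Neither ray.dataframe nor pandas is installed")
else:
    # From here on, this is ordinary Pandas code with made-up sample data.
    df = pd.DataFrame({"city": ["A", "B", "A"], "sales": [10, 20, 30]})
    print(pd.__name__, int(df["sales"].sum()))
```

Because the rest of the script only refers to pd, nothing else needs to change whichever module ends up being imported.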

You can read the official research paper, which includes a dataset and a demo on how to use the library, on Berkeley’s blog here.

We have also covered another product from Ray, a reinforcement learning library called RLlib, which you can read about here.

Our take on this

While still very much in its nascent stages, this is shaping up to be a very promising library. Large datasets tend to be a problem when computational resources are limited, so Pandas on Ray should provide a workaround for that.

This is a promising alternative to Dask, though it is not at the same level yet. You can read about the difference between Ray and Dask here.

It is not available for Windows yet and there is no word on when that might happen. Currently, it can be used on both Mac and Linux machines.

Are you planning to use this library? Let us know in the comments section below.

Responses From Readers

Sarthak Agarwal

Can you share some example code for this?

Faizan Shaikh

Hi Sarthak - you can take a look at their official github repository for documentation and tutorials

Jean-Claude KOUASSI

That is awesome! I spent several weeks writing my own generators to deal with larger-than-memory data during machine learning model training (across several frameworks). I am most fascinated by the speed of their queries; they managed to read the whole dataset in so little time. For only 45 days of work, they did a great job. I will start using Pandas on Ray now.

Steven Breij

Package seems to be unavailable for both Anaconda environment and Windows Machines. Any chance I'm missing something?