This article was published as a part of the Data Science Blogathon.
Pandas is a python library that needs no introduction. Pandas provide an easier way to do preprocessing and analysis on our data. However, if we are working on a larger data, pandas takes too much time for data preprocessing. The primary reason for the slowdown is pandas can’t run the program parallelly, and it only uses one CPU core for running the program. We have to shift to distributed computing platforms like Spark for working on large data. But maintaining distributed computing platforms is quite difficult as it involves understanding of it’s computational paradigms and inner workings of the platform.
Moreover, a steep learning curve is required to use distributed computing platforms, and it is challenging for beginners to use them. This is where the Modin comes from. It makes your Pandas code run parallelly by changing just one line of code.
Modin is an open source library developed by UC Berkeley’s RISELab to speed up the computation by distributive computing. Modin uses Ray/Dask libraries in backend to parallelize the code and also we don’t need any distributive computing knowledge to use Modin. Modin Dataframe has a similar API to Pandas. So all we are to do is to continue using Pandas API as was before. Modin provides the speedup of upto 4x on 4 core laptop. Modin can be used for dataset size ranging from 1MB to 1TB.
Modin can be installed through the pip command and it uses Ray/Dask library as the backend. Modin will automatically detect the ray/dask engine in our computer. In case if Ray/Dask library is not preinstalled on your computer, Modin can be installed with its dependencies by the below commands.
pip install "modin[ray]" # Install Modin dependencies and Ray to run on Ray
pip install "modin[dask]" # Install Modin dependencies and Dask to run on Dask
pip install "modin[all]" # Install all of the above
The high-level architecture diagram of Modin can be given below
Currently, we can use Modin with pandas API. SQLite API is in experimental mode with Modin. In the future Modin, developers are planning to come up with separate APIs for Modin but nothing such is developed yet. The Query Compiler layer which is beneath the API layer will compose the query and perform some optimizations based on the format of the data.
Modin will run with Ray/Dask as its backend. We can also make Modin work with our own backend library since it is an open-source library and also even though we can run Modin directly with python without the backend it won’t serve our purpose as Modin can’t run code in a parallel fashion on its own.
The Modin Dataframe is partitioned along both rows and columns and each partition is a separate pandas dataframe.
We can change the default partitions in modin by using the repartition() method.
We can replace pandas with modin with just one line of code
import modin.pandas as pd
Modin vs Pandas Comparision pd.read_csv()
import modin.pandas as pd import time start_time=time.time() data_modin=pd.read_csv("../input/uwmgi-mask-dataset/train.csv") end_time=time.time() duration=end_time-start_time print("Time taken to run the code "+str(duration)) Time taken to run the code 0.4164540767669678 import pandas import time start_time=time.time() data_pandas=pandas.read_csv("../input/uwmgi-mask-dataset/train.csv") end_time=time.time() duration=end_time-start_time print("Time taken to run the code "+str(duration)) Time taken to run the code 0.6549224853515625
df.fillna()
%%time data_modin.fillna(0)
CPU times: user 7.68 ms, sys: 5.06 ms, total: 12.7 ms Wall time: 10.9 ms
%%time data_pandas.fillna(0) CPU times: user 59.2 ms, sys: 5.38 ms, total: 64.6 ms Wall time: 62.3 ms
Pandas is a heavy library with a wide collection of APIs. Even though Modin supports popular APIs of Pandas it won’t support all the APIs of Pandas.
Dask, Vaex, Ray, Cudf and Koalas are some of the popular alternatives to Modin.
Libraries like Dask and Koalas try to resolve the performance issue for large datasets in their own ways but it won’t preserve the Pandas API behaviour and we have to make significant changes to our panda’s code to make it run on dask/Koalas. Also, Dask/Koalas support only row partitioning whereas Modin supports row, column and cell partitioning of dataframe which helps Modin to support a wide variety of Pandas API. Due to the control over partitioning Modin supports pandas’ methods like transpose(). quantile(), and median() which are difficult to apply in the row-based partition of data.
Similarly, libraries like Vaex are designed for data visualization for large datasets but not as a replacement for pandas. So, it’s better to give a try for Vaex for data visualization but for ease of adoption and performance, Modin beats the Vaex library. Also, we can’t compare the libraries like Ray/Cudf with Modin. Ray/Cudf don’t provide any high-level pandas API. Ray/Cudf can be used as the backend support for Modin for optimal performance with fewer code changes.
Also, Dask/Ray uses lazy evaluation which executes the code only when the user explicitly evaluates the result. Even though lazy evaluation decreases time complexity, sometimes it increases space complexity. Unlike Dask/Ray, Modin by default doesn’t execute lazy evaluation. But Modin also supports lazy evaluation with the Omnisci engine.
Modin is still in its early stages of development. Developers are planning to bring more features into it.
Modin’s core motto is to make users use the same tools for small and large datasets without changing the APIs. In the future developers are planning to integrate Modin with Pyarrow, Sqlite and various other libraries. The key takeaways of the article are
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.