As we all know, Pandas is Python's most popular data manipulation library. However, it has a few drawbacks. In this article, we will learn about Polars, another powerful data manipulation library for Python, written in the Rust programming language. Although it is written in Rust, Polars provides a Python package, which is the easiest way to get started with Polars if you are already familiar with Pandas.
In this tutorial, you will learn about the Polars library, its installation, and its core operations such as selecting, filtering, grouping, joining, and lazy execution.
This article was published as a part of the Data Science Blogathon.
Polars has two different APIs: an eager API and a lazy API. Eager execution is similar to pandas, where the code runs as soon as it is encountered and the results are returned immediately. Lazy execution, on the other hand, does not run until you request the result. Lazy execution can be more efficient because it avoids running unnecessary code, which can lead to better performance.
Let us look at a few applications of this library as follows:
Apart from these, there are many other applications, such as joining and merging data, filtering and querying data using its powerful expression syntax, and computing summary statistics. Thanks to these capabilities, Polars can be used in various domains such as business, e-commerce, finance, healthcare, education, government sectors, etc. One example would be to collect real-time data from a hospital, analyze patients' health conditions, and generate visualizations such as the percentage of patients suffering from a particular disease.
Before using any library, you must install it. The Polars library can be installed using the pip command as follows:
pip install polars
To check if it is installed, run the commands below:
import polars as pl
print(pl.__version__)
0.17.3
Before using the Polars library, you need to import it. Creating a DataFrame in Polars is similar to pandas:
import polars as pl

# Creating a new dataframe
df = pl.DataFrame(
    {
        'name': ['Alice', 'Bob', 'Charlie', 'John', 'Tim'],
        'age': [25, 30, 35, 27, 39],
        'city': ['New York', 'London', 'Paris', 'UAE', 'India']
    }
)
df
Polars library provides various methods to load data from multiple sources. Let us look at an example of loading a CSV file.
df = pl.read_csv('/content/sample_data/california_housing_test.csv')
df
Let us compare the read times of both libraries to see how fast the Polars library is. To do so, we use the 'time' module of Python and read the above CSV file with both pandas and Polars.
import time
import pandas as pd
import polars as pl
# Measure read time with pandas
start_time = time.time()
pandas_df = pd.read_csv('/content/sample_data/california_housing_test.csv')
pandas_read_time = time.time() - start_time
# Measure read time with Polars
start_time = time.time()
polars_df = pl.read_csv('/content/sample_data/california_housing_test.csv')
polars_read_time = time.time() - start_time
print("Pandas read time:", pandas_read_time)
print("Polars read time:", polars_read_time)
Pandas read time: 0.014296293258666992
Polars read time: 0.002387523651123047
As you can observe from the above output, the read time of the Polars library is lower than that of the pandas library. As you can see in the code, we measure the read time by taking the difference between the start time and the time after the read operation.
Let us look at one more example of a simple filter operation on the same data frame using both pandas and Polars libraries.
# Measure execution time with pandas
start_time = time.time()
res1 = pandas_df[pandas_df['total_rooms'] < 20]['population'].mean()
pandas_exec_time = time.time() - start_time

# Measure execution time with Polars
start_time = time.time()
res2 = polars_df.filter(pl.col('total_rooms') < 20).select(pl.col('population').mean())
polars_exec_time = time.time() - start_time
print("Pandas execution time:", pandas_exec_time)
print("Polars execution time:", polars_exec_time)
Output:
Pandas execution time: 0.0010499954223632812
Polars execution time: 0.0007154941558837891
You can print summary statistics of the data, such as count, mean, min, max, etc., using the "describe" method as follows:
df.describe()
The shape attribute returns the shape of the data frame, i.e., the total number of rows and the total number of columns.
print(df.shape)
(3000, 9)
The head() function returns the first five rows of the dataset by default as follows:
df.head()
The sample() function gives us an impression of the data: you can get n random rows from the dataset. Here, we get 3 random rows, as shown below:
df.sample(3)
Similarly, the rows() method returns the rows of the data frame, and the columns attribute returns the column names.
df.rows()
df.columns
The select function applies selection expressions over the columns.
Examples:
df.select('latitude')
Selecting multiple columns:
df.select('longitude','latitude')
df.select(
    pl.sum('median_house_value'),
    pl.col("latitude").sort(),
)
Similarly, the filter function allows you to filter rows based on a certain condition.
Examples:
df.filter(pl.col("total_bedrooms") == 200)
df.filter(pl.col("total_bedrooms").is_between(200, 500))
You can group data based on specific columns using the “groupby” function.
Example:
df.groupby(by='housing_median_age').agg(
    pl.col('median_house_value').mean().alias('avg_house_value')
)
Here we group the data by the column 'housing_median_age', calculate the mean 'median_house_value' for each group, and name the resulting column 'avg_house_value'.
You can join or concatenate two data frames using various functions provided by Polars.
Join: Let us look at an example of an inner join on two data frames. In an inner join, the resultant data frame consists of only those rows where the join key exists in both data frames.
Example 1:
import polars as pl
# Create the first DataFrame
df1 = pl.DataFrame({
    'id': [1, 2, 3, 4],
    'emp_name': ['John', 'Bob', 'Khan', 'Mary']
})

# Create the second DataFrame
df2 = pl.DataFrame({
    'id': [2, 4, 5, 7],
    'emp_age': [35, 20, 25, 32]
})

df3 = df1.join(df2, on="id")
df3
In the above example, we perform a join on two different data frames, specifying the "id" column as the join key. Other types of join operations include the left join, outer join, cross join, etc.
Concatenate:
To perform the concatenation of two data frames, we use the concat() function in Polars as follows:
import polars as pl
# Create the first DataFrame
df1 = pl.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['John', 'Bob', 'Khan', 'Mary']
})

# Create the second DataFrame
df2 = pl.DataFrame({
    'id': [2, 4, 5, 7],
    'name': ['Anny', 'Lily', 'Sana', 'Jim']
})

df3 = pl.concat([df2, df1])
df3
The 'concat()' function stacks the data frames vertically, one below the other. The resultant data frame consists of the rows from 'df2' followed by the rows from 'df1', since we passed 'df2' first. Note that the column names and data types must match when concatenating two data frames this way.
The main benefit of the Polars library is its support for lazy execution, which allows us to postpone computation until it is needed. This helps with large datasets, where we can avoid executing unnecessary operations and run only the required ones. Let us look at an example:
lazy_plan = (
    df.lazy()
    .filter(pl.col('housing_median_age') > 2)
    .select(pl.col('median_house_value') * 2)
)
result = lazy_plan.collect()
print(result)
In the above example, we use the lazy() method to define a lazy computation plan. This plan filters rows where 'housing_median_age' is greater than 2 and then selects 'median_house_value' multiplied by 2. To execute the plan, we call the 'collect' method and store the output in the result variable.
In conclusion, Python's Polars library is an efficient and powerful toolkit for manipulating large datasets. Polars makes full use of Python and works well with other widespread libraries such as NumPy, Pandas, and Matplotlib. This interoperability allows straightforward data combination and examination across different fields, making it an adaptable resource for many uses. The library's core capabilities, including data filtering, aggregation, grouping, and merging, let users process data at scale and generate valuable insights.
A. Polars is a powerful and fast data manipulation library built in Rust, offering data frames similar to Python's pandas library.
A. If you are working with large datasets and speed is your concern, you can definitely go with Polars; it is much faster than pandas.
A. Polars is completely written in the Rust programming language.
A. Polars can be faster than NumPy for tabular data workloads because it focuses on efficient data handling and is implemented in Rust. However, the choice depends on the specific use case.
A. A Polars DataFrame is the data structure Polars uses for handling tabular data. In a DataFrame, the data is organized as rows and columns.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.