In this article, we’ll discuss 9 functions of PySpark that are most useful and essential for performing efficient data analysis on structured data.
#installing pyspark
!pip install pyspark
#importing pyspark
import pyspark

#importing SparkSession
from pyspark.sql import SparkSession

#creating a SparkSession object and providing appName
spark = SparkSession.builder.appName("pysparkdf").getOrCreate()
This SparkSession object will interact with the functions and methods of Spark SQL. Now, let’s create a Spark DataFrame by reading a CSV file. We will be using a simple dataset: the Nutrition Data on 80 Cereal Products dataset available on Kaggle.
#creating a dataframe using the spark object by reading a csv file
df = spark.read.option("header", "true").csv("/content/cereal.csv")
#show the top 10 rows of the dataframe
df.show(10)
This is the DataFrame we will be using for data analysis. Now, let’s print the schema of the DataFrame to know more about the dataset.
df.printSchema()
The DataFrame consists of 16 features or columns. Each column contains string-type values.
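Since every column is read as a string by default, one option is to let Spark infer the column types at read time with the inferSchema option. Here is a small sketch, assuming the same file path as above:

#reading the csv again, letting Spark infer the column data types
df_inferred = spark.read.option("header", "true").option("inferSchema", "true").csv("/content/cereal.csv")
df_inferred.printSchema()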
Let’s get started with the functions:
df.select('name', 'mfr', 'rating').show(10)
In the output, we get a subset of the dataframe with the three columns: name, mfr, and rating.
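select() also accepts column expressions, so we can rename columns while selecting them. A quick sketch; the alias names here are just illustrative:

from pyspark.sql.functions import col
df.select(col("name").alias("cereal_name"), col("rating").alias("cereal_rating")).show(10)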
In the DataFrame schema, we saw that all the columns are of string type. Let’s change the data type of the calories column to integer.
df.withColumn("Calories",df['calories'].cast("Integer")).printSchema()
In the schema, we can see that the data type of the calories column has changed to integer.
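If several columns need the same conversion, we can apply the cast in a loop. A minimal sketch; the list of numeric columns below is an assumption based on the cereal dataset:

#casting a few numeric columns to integer in one pass
numeric_cols = ["calories", "protein", "fat", "sodium"]
df_casted = df
for c in numeric_cols:
    df_casted = df_casted.withColumn(c, df_casted[c].cast("Integer"))
df_casted.printSchema()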
Let’s find out the count of each cereal present in the dataset.
df.groupBy("name", "calories").count().show()
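groupBy() is usually combined with aggregate functions beyond count(). For example, we could compute the average calories per manufacturer; a small sketch, assuming the calories column is cast to integer first:

from pyspark.sql.functions import avg
df.withColumn("calories", df["calories"].cast("Integer")) \
  .groupBy("mfr").agg(avg("calories").alias("avg_calories")).show()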
Let’s sort the dataframe based on the protein column of the dataset.
df.orderBy("protein").show()
We can see that the entire dataframe is sorted based on the protein column.
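orderBy() sorts in ascending order by default; to sort in descending order we can use the desc() method on the column. A brief sketch; note that string columns sort lexicographically unless they are cast to a numeric type first:

from pyspark.sql.functions import col
df.orderBy(col("protein").desc()).show()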
The name column of the dataframe contains values made up of more than one word. Let’s split the name column into two columns at the space between the words.
from pyspark.sql.functions import split
df1 = df.withColumn('Name1', split(df['name'], " ").getItem(0)) \
        .withColumn('Name2', split(df['name'], " ").getItem(1))
df1.select("name", "Name1", "Name2").show()
In this output, we can see that the name column is split into two columns, Name1 and Name2.
Let’s add a column “intake quantity” which contains a constant value for each of the cereals along with the respective cereal name.
from pyspark.sql.functions import col, lit
df2 = df.select(col("name"), lit("75 gm").alias("intake quantity"))
df2.show()
In the output, we can see that a new column, intake quantity, is created that contains the intake quantity of each cereal.
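lit() also works with withColumn() if we want to add the constant column to the full dataframe instead of a projection. A small sketch:

#adding the constant column to the whole dataframe
df3 = df.withColumn("intake quantity", lit("75 gm"))
df3.select("name", "intake quantity").show(10)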
Let’s see the cereals that are rich in vitamins.
from pyspark.sql.functions import when
df.select("name", when(df.vitamins >= "25", "rich in vitamins")).show()
The filter() function is used to filter data on a given condition. It is a method of the DataFrame itself, so no import from pyspark.sql.functions is needed. Let’s filter the cereals that have 100 calories.
df.filter(df.calories == "100").show()
In this output, we can see that the data is filtered to show only the cereals that have 100 calories.
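Multiple filter conditions can be combined with the & (and) and | (or) operators, with each condition wrapped in parentheses. A short sketch:

#cereals with 100 calories and at least 3 grams of protein
df.filter((df.calories == "100") & (df.protein >= "3")).show()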
Let’s find out whether there are any null values present in the dataset.
isNotNull():
#filter data by non-null values
df.filter(df.name.isNotNull()).show()
There are no null values present in this dataset. Hence, the entire dataframe is displayed.
isNull():
df.filter(df.name.isNull()).show()
Again, there are no null values. Therefore, an empty dataframe is displayed.
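To check every column for nulls at once, we can count the null values per column in a single pass. A small sketch combining count(), when(), and isNull():

from pyspark.sql.functions import count, when
df.select([count(when(df[c].isNull(), c)).alias(c) for c in df.columns]).show()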
In this blog, we have discussed the 9 most useful functions for efficient data processing. These PySpark functions combine the strengths of both Python and SQL.
Thanks for reading. Do let me know if there is any comment or feedback.