A Must-Read Guide on How to Work with PySpark on Google Colab for Data Scientists!

Aniruddha Bhandari Last Updated : 23 Nov, 2020
7 min read

Overview

  • Understand the integration of PySpark in Google Colab
  • We’ll also look at how to perform Data Exploration with PySpark in Google Colab

 

Introduction

Google Colab is a lifesaver for data scientists when it comes to working with huge datasets and running complex models.

And for data engineers, PySpark is, simply put, a demigod!

So what happens when we take these two, each the finest player in their respective category, and combine them together?

We get the perfect solution (almost) for all your data science and machine learning problems!


In this article, we will see how we can run PySpark in a Google Colaboratory notebook. We will also perform some basic data exploratory tasks common to most data science problems. So, let’s get cracking!

Note – I am assuming you are already familiar with the basics of Spark and Google Colab. If not, I recommend brushing up on both before reading this one.

 

Table of Contents

  • Connecting Google Drive to Colab
  • Reading data from Google Drive
  • Setting up PySpark in Google Colab
  • Load data into PySpark
  • Understanding the Data
  • Data Exploration with PySpark Dataframes
    • Show column details
    • Display rows
    • Number of rows in dataframe
    • Display specific columns
    • Describing the columns
    • Distinct values for Categorical columns
    • Aggregate with Groupby
    • Counting and Removing Null values
    • Save to file

 

Connecting Drive to Colab

The first thing you want to do when you are working on Colab is to mount your Google Drive. This will enable you to access any directory on your Drive from inside the Colab notebook.

from google.colab import drive
drive.mount('/content/drive')

Once you have done that, the next obvious step is to load the data.

Bonus – You can find some amazing hacks for Google Colab in this article!

 

Reading Data from Drive

Now, I am assuming that you will be working with a fairly large dataset. In that case, the best way to upload the data to Drive is as a zip file. Just drag and drop the zip file into any directory you want on Drive.

Unzipping this data is not a hassle at all. You just have to provide the path to the zip file along with the !unzip command.

!unzip "/content/drive/My Drive/AV articles/PySpark on Colab/black_friday_train.zip"

If you aren't sure of the file's exact location, you can check it from the Files panel on the left side of Colab.

[Image: file path shown in the Colab Files side panel]

Right, let's set up Spark.

 

Setting up PySpark in Colab

Spark is written in the Scala programming language and requires the Java Virtual Machine (JVM) to run. Therefore, our first task is to install Java.

!apt-get install openjdk-8-jdk-headless -qq > /dev/null

Next, we will download Apache Spark 3.0.1 with Hadoop 2.7.

!wget -q https://www-us.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz

Now, we just need to extract that file.

!tar xf spark-3.0.1-bin-hadoop2.7.tgz

Note – At the time of writing this article, 3.0.1 was the latest version of Apache Spark. But Spark is developing quite rapidly. So, if there is a newer version of Spark when you are executing this code, then you just need to replace 3.0.1, wherever you see it, with the latest version.

There is one last thing we need to install, and that is the findspark library. It locates Spark on the system and makes it importable as a regular library.

!pip install -q findspark

Now that we have installed all the necessary dependencies in Colab, it is time to set the environment variables. This will enable us to run PySpark in the Colab environment.

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.1-bin-hadoop2.7"

Time for the real test!

We need to locate Spark in the system. For that, we import findspark and use the findspark.init() method.

import findspark
findspark.init()

Bonus – If you want to know the location where Spark is installed, use findspark.find().

findspark.find()

Now, we can import SparkSession from pyspark.sql and create a SparkSession, which is the entry point to Spark.

You can give a name to the session using appName() and add some configurations with config() if you wish.

from pyspark.sql import SparkSession

spark = SparkSession.builder\
        .master("local")\
        .appName("Colab")\
        .config('spark.ui.port', '4050')\
        .getOrCreate()

Finally, print the SparkSession variable.

spark

[Image: SparkSession output]

If everything goes well, you should be able to view the above output.

If you want to view the Spark UI, you would have to include a few more lines of code to create a public URL for the UI page.

!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
!unzip ngrok-stable-linux-amd64.zip
get_ipython().system_raw('./ngrok http 4050 &')
!curl -s http://localhost:4040/api/tunnels

 

[Image: ngrok public URL output]
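The curl command prints the tunnel information as raw JSON. If you would rather print just the public URL, a small sketch (assuming the tunnel has come up and is visible through ngrok's local API on port 4040) is to pipe the response through Python:

!curl -s http://localhost:4040/api/tunnels | python3 -c "import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"

If this raises a JSON or index error, the tunnel probably hasn't started yet; wait a few seconds and re-run the command.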

Now you should be able to view the jobs and their stages at the link created.

[Image: Spark UI showing jobs and their stages]

Great! Now let’s get started with PySpark!

 

Loading data into PySpark

First things first, we need to load the dataset. We will use the spark.read.csv method. Setting the inferSchema parameter to True lets Spark automatically determine the data type of each column, but it has to make an extra pass over the data to do so. If you don't want that to happen, you can instead provide the schema explicitly via the schema parameter.

df = spark.read.csv("train.csv", header=True, inferSchema=True)

This will create a Spark dataframe.
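For reference, here is a minimal sketch of what passing an explicit schema could look like. The column names below are taken from the Black Friday dataset and the types are assumptions; the schema must list every column in the file in order, so verify it against the printSchema() output shown later.

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Assumed schema for the Black Friday training file -- adjust if your printSchema() output differs
schema = StructType([
    StructField("User_ID", IntegerType(), True),
    StructField("Product_ID", StringType(), True),
    StructField("Gender", StringType(), True),
    StructField("Age", StringType(), True),
    StructField("Occupation", IntegerType(), True),
    StructField("City_Category", StringType(), True),
    StructField("Stay_In_Current_City_Years", StringType(), True),
    StructField("Marital_Status", IntegerType(), True),
    StructField("Product_Category_1", IntegerType(), True),
    StructField("Product_Category_2", IntegerType(), True),
    StructField("Product_Category_3", IntegerType(), True),
    StructField("Purchase", IntegerType(), True),
])

df = spark.read.csv("train.csv", header=True, schema=schema)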

Bonus – There are multiple data sources in Spark and you can know all about them in this article!

 

Understanding the Data

We have the Black Friday dataset here, from the DataHack platform. It contains purchase summaries of various customers of a retail company for the past month: customer demographics, purchase details, and the total purchase amount. The goal is to predict the purchase amount per customer for various products.

[Image: preview of the Black Friday dataset]

 

Data Exploration with PySpark DF

It is now time to use the PySpark dataframe functions to explore our data. And along the way, we will keep comparing it with Pandas dataframes.

Show column details

The first step in an exploratory data analysis is to check out the schema of the dataframe. This will give you a bird’s-eye view of the columns in the dataframe along with their data types.

df.printSchema()

[Image: df.printSchema() output]

Display Rows

Now you would obviously want to have a view of the actual data as well.

Just as a Pandas dataframe has the head() function, a Spark dataframe has the show() function. You can pass the number of rows you want to print within the parentheses.

df.show(5)

[Image: df.show(5) output]

Number of rows in DF

If you want to know the total number of rows in the dataframe, which you would, just use the count() function.

df.count()
550068

Display specific columns

Sometimes you might want to view only specific columns from the dataframe. For that, you can leverage the capabilities of Spark SQL.

Using the select() function, you can specify the columns you want to view.

df.select("User_ID","Gender","Age","Occupation").show(5)

[Image: df.select() output]
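Since select() is only one way of tapping into Spark SQL, you can also register the dataframe as a temporary view and query it with plain SQL. A quick sketch (the view name black_friday is arbitrary):

# Register the dataframe as a temporary SQL view
df.createOrReplaceTempView("black_friday")

# Run any Spark SQL query against that view
spark.sql("SELECT User_ID, Gender, Age, Occupation FROM black_friday LIMIT 5").show()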

Describing the columns

Often when we are working with numeric features, we want to have a look at the statistics regarding the dataframe. The describe() function is best suited for such purposes.

It is pretty similar to Pandas' describe() function, but it reports fewer statistics, and it describes string columns as well.

df.describe().show()

[Image: df.describe() output]
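If you want a few more statistics than describe() gives you, Spark dataframes also have a summary() method that adds approximate quartiles. A small sketch, restricted to a couple of numeric columns to keep the output readable:

# summary() reports count, mean, stddev, min, 25%, 50%, 75% and max
df.select("Occupation", "Purchase").summary().show()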

Distinct values for Categorical columns

The distinct() function will come in handy when you want to determine the unique values in the categorical columns of the dataframe.

df.select("City_Category").distinct().show()

[Image: distinct City_Category values]

Aggregate with Groupby

We can use the groupBy function to group the dataframe on a column and then apply an aggregate function to each group to derive useful insights.

Here, we group the rows by city category and determine the total Purchase per City_Category. For this, we use the sum aggregate function from the Spark SQL functions module.

from pyspark.sql import functions as F
df.groupBy("City_Category").agg(F.sum("Purchase")).show()

[Image: groupBy aggregation output]
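You can also compute several aggregates at once and give them readable names with alias(). A small sketch building on the same groupBy:

# Total and average purchase per city category, sorted by the total
df.groupBy("City_Category") \
  .agg(F.sum("Purchase").alias("Total_Purchase"),
       F.avg("Purchase").alias("Avg_Purchase")) \
  .orderBy(F.desc("Total_Purchase")) \
  .show()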

Counting and Removing Null values

Now, we all know that real-world data is rarely free of missing values. Therefore, it is prudent to always check for missing values and handle them if present.

df.select([F.count(F.when(F.isnull(c), c)).alias(c) for c in df.columns]).show()

We have a couple of columns with null values, so it's best to fill them in. In our dataset, a null value in the product category columns could mean that the user didn't buy that product. Therefore, it is best to replace these null values with 0.

We will use the fillna() function to replace the null values. Since Spark dataframes are immutable, we need to store the result in a new dataframe.

df = df.fillna({'Product_Category_2':0, 'Product_Category_3':0})

We can check the null values again to verify the change.

df.select([F.count(F.when(F.isnull(c), c)).alias(c) for c in df.columns]).show()

[Image: null value counts after fillna]

Perfect! There are no more null values in the dataframe.
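If filling in a default value doesn't make sense for a particular column, dropping the affected rows is the other common option. A minimal sketch:

# Drop every row that contains a null in any column
df_clean = df.dropna()

# Or only consider specific columns when deciding which rows to drop
df_clean = df.dropna(subset=["Product_Category_2", "Product_Category_3"])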

Save to file

Finally, after doing all the analysis, if you want to save your results into a new CSV file, you can do that using the write.csv() function.

df.write.csv("/content/drive/My Drive/AV articles/PySpark on Colab/preprocessed_data")

But there is a catch here. There won't be just a single CSV file saved; instead, Spark writes one CSV file for each partition of the dataframe. So if the dataframe has 2 partitions, two CSV files will be saved.

df.rdd.getNumPartitions()
2

[Image: the saved part files]
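If you really do want a single CSV file, one common workaround (not something the code above does) is to coalesce the dataframe down to one partition before writing; you can also ask Spark to write the header row. A sketch, with an example output path:

# Collapse to a single partition so only one part file is written
df.coalesce(1).write.csv(
    "/content/drive/My Drive/AV articles/PySpark on Colab/preprocessed_single",
    header=True,
    mode="overwrite"
)

Keep in mind that coalescing to one partition funnels all the data through a single task, which is fine for a dataset of this size but not something you'd do on truly big data.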
Bonus – I converted the Spark dataframe to an RDD here. What's the difference between the two? Check out this article!

Writing multiple part files also isn't very convenient when we have to load them again. So, we can instead convert the Spark dataframe to the good old Pandas dataframe and then use the usual to_csv() method to store the result.

# Spark df to Pandas df
df_pd = df.toPandas()

# Store result
df_pd.to_csv("/content/drive/My Drive/AV articles/PySpark on Colab/pandas_preprocessed_data.csv")

[Image: CSV file saved with Pandas]

 

End Notes

I hope you enjoyed working with PySpark in Colab as much as I did in writing this article!
This is by no means an exhaustive article on the capabilities of PySpark dataframes. For that, you can check out this awesome article on PySpark Dataframes. And if you are looking to go the extra mile and build a machine learning model using PySpark, then I highly recommend going over this article!

I am on a journey to becoming a data scientist. I love to unravel trends in data, visualize them, and predict the future with ML algorithms! But the most satisfying part of this journey is sharing my learnings, from the challenges that I face, with the community to make the world a better place!

