Google Colab is a lifesaver for data scientists when it comes to working with huge datasets and running complex models.
While for data engineers, PySpark is, simply put, a demigod!
So what happens when we take these two, each the finest player in their respective category, and combine them together?
We get the perfect solution (almost) for all your data science and machine learning problems!
In this article, we will see how we can run PySpark in a Google Colaboratory notebook. We will also perform some basic data exploratory tasks common to most data science problems. So, let’s get cracking!
Note – I am assuming you are already familiar with the basics of Spark and Google Colab. If not, I recommend going over the following articles before reading this one:
The first thing you want to do when you are working on Colab is mounting your Google Drive. This will enable you to access any directory on your Drive inside the Colab notebook.
from google.colab import drive
drive.mount('/content/drive')
Once you have done that, the next obvious step is to load the data.
Bonus – You can find some amazing hacks for Google Colab in this article!
Now, I am assuming that you will be working with a large enough dataset. Therefore, the best way to upload data to Drive is in a zip format. Just drag and drop your zip folder inside any directory you want on Drive.
Unzipping this data is not a hassle at all. You just have to provide the path to the zip folder along with the !unzip command.
!unzip "/content/drive/My Drive/AV articles/PySpark on Colab/black_friday_train.zip"
If you aren’t sure of the exact location of the folder, you can check it from the side panel in Colab.
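If you would rather check from code, a quick directory listing works just as well. A minimal sketch, assuming the same Drive path used above:
!ls "/content/drive/My Drive/AV articles/PySpark on Colab/"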
Right, let’s set up Spark
Spark is written in the Scala programming language and requires the Java Virtual Machine (JVM) to run. Therefore, our first task is to download Java.
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
Next, we download Apache Spark (prebuilt for Hadoop 2.7) and extract it:
!wget -q https://www-us.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
!tar xf spark-3.0.1-bin-hadoop2.7.tgz
Note – At the time of writing this article, 3.0.1 was the latest version of Apache Spark. But Spark is developing quite rapidly. So, if there is a newer version of Spark when you are executing this code, then you just need to replace 3.0.1, wherever you see it, with the latest version.
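If you expect to switch versions often, it can help to keep the version string in a single variable and let Colab substitute it into the shell commands, so you only ever change it in one place. The sketch below assumes the release is available under archive.apache.org (older releases usually are); verify the URL for the version you pick:
# Keep the version strings in one place (assumed URL pattern; verify for your release)
SPARK_VERSION = "3.0.1"
HADOOP_VERSION = "2.7"
SPARK_DIR = f"spark-{SPARK_VERSION}-bin-hadoop{HADOOP_VERSION}"

!wget -q https://archive.apache.org/dist/spark/spark-{SPARK_VERSION}/{SPARK_DIR}.tgz
!tar xf {SPARK_DIR}.tgz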
There is one last thing that we need to install and that is the findspark library. It will locate Spark on the system and import it as a regular library.
!pip install -q findspark
Then, we set the environment paths so that the system can locate Java and Spark:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.1-bin-hadoop2.7"
Time for the real test!
We need to locate Spark in the system. For that, we import findspark and use the findspark.init() method.
import findspark
findspark.init()
findspark.find()
Now, we can import SparkSession from pyspark.sql and create a SparkSession, which is the entry point to Spark.
You can give a name to the session using appName() and add some configurations with config() if you wish.
from pyspark.sql import SparkSession

spark = SparkSession.builder\
        .master("local")\
        .appName("Colab")\
        .config('spark.ui.port', '4050')\
        .getOrCreate()
spark
If everything goes well, running spark in a cell should display the SparkSession details (Spark version, master, and app name).
If you want to view the Spark UI, you would have to include a few more lines of code to create a public URL for the UI page.
!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
!unzip ngrok-stable-linux-amd64.zip
get_ipython().system_raw('./ngrok http 4050 &')
!curl -s http://localhost:4040/api/tunnels
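The curl command dumps the raw JSON from the ngrok API. If you only want the public URL, you can pipe the response through a small Python one-liner. This assumes the tunnel actually came up; if ngrok failed to start, there is nothing to parse and the command will throw a JSON error:
!curl -s http://localhost:4040/api/tunnels | python3 -c "import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"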
Now you should be able to view the jobs and their stages at the link created.
Great! Now let’s get started with PySpark!
First things first, we need to load the dataset. We will use the read.csv method. The inferSchema parameter tells Spark to automatically determine the data type of each column, but to do so it has to go over the data once. If you don’t want that extra pass, you can instead provide the schema explicitly through the schema parameter.
df = spark.read.csv("train.csv", header=True, inferSchema=True)
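If you prefer to skip the extra pass over the data, here is roughly what an explicit schema could look like. The column names and types below are my reading of the Black Friday file, so treat this as a sketch and verify it against printSchema() before relying on it:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Assumed column names and types for the Black Friday data; verify against printSchema()
schema = StructType([
    StructField("User_ID", IntegerType(), True),
    StructField("Product_ID", StringType(), True),
    StructField("Gender", StringType(), True),
    StructField("Age", StringType(), True),
    StructField("Occupation", IntegerType(), True),
    StructField("City_Category", StringType(), True),
    StructField("Stay_In_Current_City_Years", StringType(), True),
    StructField("Marital_Status", IntegerType(), True),
    StructField("Product_Category_1", IntegerType(), True),
    StructField("Product_Category_2", IntegerType(), True),
    StructField("Product_Category_3", IntegerType(), True),
    StructField("Purchase", IntegerType(), True),
])

df_explicit = spark.read.csv("train.csv", header=True, schema=schema)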
This will create a Spark dataframe.
Bonus – There are multiple data sources in Spark and you can know all about them in this article!
We are using the Black Friday dataset from the DataHack platform. It contains purchase summaries of various customers of a retail company from the past month. We are provided with customer demographics, purchase details, and total purchase amounts. The goal is to predict the purchase amount per customer for various products.
It is now time to use the PySpark dataframe functions to explore our data. And along the way, we will keep comparing it with the Pandas dataframes.
The first step in an exploratory data analysis is to check out the schema of the dataframe. This will give you a bird’s-eye view of the columns in the dataframe along with their data types.
df.printSchema()
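If you just need the column names or the (name, type) pairs as plain Python objects, for example to build a list of numeric columns, the columns and dtypes attributes have you covered:
# Column names as a list, and (name, type) tuples
print(df.columns)
print(df.dtypes)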
Now you would obviously want to have a view of the actual data as well.
Just as a Pandas dataframe has the head() function, a Spark dataframe has the show() function. You can pass the number of rows you want to print within the parentheses.
df.show(5)
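A small aside: show() truncates long values by default. Passing truncate=False prints them in full:
df.show(5, truncate=False)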
If you want to know the total number of rows in the dataframe, which you would, just use the count() function.
df.count()
550068
Sometimes you might want to view only specific columns from the dataframe. For that, you can leverage the capabilities of Spark SQL.
Using the select() function, you can specify the columns you want to view.
df.select("User_ID","Gender","Age","Occupation").show(5)
Often when we are working with numeric features, we want to have a look at the statistics regarding the dataframe. The describe() function is best suited for such purposes.
It is pretty similar to Pandas’ describe() function, but it reports fewer statistics and describes the string columns as well.
df.describe().show()
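describe() also accepts specific column names if you only care about a few of them:
df.describe("Purchase", "Occupation").show()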
The distinct() function will come in handy when you want to determine the unique values in the categorical columns of the dataframe.
df.select("City_Category").distinct().show()
We can use the groupBy function to group the dataframe column values and then apply an aggregate function on them to derive some useful insight.
Here, we can group the various city categories in the dataframe and determine the total Purchase per City category. For this, we have to use the sum aggregate function from the Spark SQL functions module.
from pyspark.sql import functions as F

df.groupBy("City_Category").agg(F.sum("Purchase")).show()
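You are not limited to one aggregate per group. A sketch along the same lines, computing the average purchase and the number of purchases per city category (the alias names are just illustrative):
# Multiple aggregates in a single pass, sorted by city category
df.groupBy("City_Category")\
  .agg(F.avg("Purchase").alias("avg_purchase"),
       F.count("Purchase").alias("num_purchases"))\
  .orderBy("City_Category")\
  .show()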
Now, we all know that real-world data is rarely free of missing values. Therefore, it is prudent to always check for them and handle them if present. The snippet below counts the null values in each column by combining F.when and F.isnull inside a list comprehension.
df.select([F.count(F.when(F.isnull(c), c)).alias(c) for c in df.columns]).show()
We have a few columns with null values, so it is best to replace them. In our dataset, a null value in the Product_Category_2 or Product_Category_3 column most likely means the user didn’t buy a product from that category, so it makes sense to replace these nulls with 0.
We will use the fillna() function to replace the null values. Since Spark dataframes are immutable, we need to store the result in a new dataframe.
df = df.fillna({'Product_Category_2':0, 'Product_Category_3':0})
df.select([F.count(F.when(F.isnull(c), c)).alias(c) for c in df.columns]).show()
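If imputation had not made sense for these columns, the other common option would have been to drop the affected rows before this step. A minimal sketch of that alternative (not applied here, since we have already filled the nulls):
# Drop rows that have nulls in the given columns instead of imputing them
df_dropped = df.na.drop(subset=["Product_Category_2", "Product_Category_3"])
df_dropped.count()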
Finally, after doing all the analysis, if you want to save your results into a new CSV file, you can do so using the write.csv() function.
df.write.csv("/content/drive/My Drive/AV articles/PySpark on Colab/preprocessed_data")
Note that write.csv() does not produce a single CSV file; it creates a directory containing one part file per partition of the dataframe. You can check how many partitions the dataframe has with getNumPartitions():
df.rdd.getNumPartitions()
2
Since our dataframe has two partitions here, the output directory will contain two part files. If you would rather have everything in one CSV file, you can convert the Spark dataframe to a Pandas dataframe with toPandas() and save it the usual way:
# Spark df to Pandas df
df_pd = df.toPandas()

# Store result
df_pd.to_csv("/content/drive/My Drive/AV articles/PySpark on Colab/pandas_preprocessed_data.csv")
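Another way to end up with a single file while staying on the Spark side is to coalesce the dataframe into one partition before writing. A sketch with a hypothetical output folder; keep in mind that this pushes all the data through a single partition, so it is only sensible for data that comfortably fits in one:
# Collapse to one partition so write.csv produces a single part file (output path is illustrative)
df.coalesce(1).write.csv(
    "/content/drive/My Drive/AV articles/PySpark on Colab/single_file_output",
    header=True
)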