Learning Path : Step by Step Guide for Beginners to Learn SparkR

shashwat.2014 Last Updated : 04 Jul, 2016

5 min read

Introduction

Lately, I’ve been reading the book Data Scientist at Work to draw some inspiration from successful data scientists. Among other things, I found that most of the data scientists have emphasized upon the evolution of Spark and its incredible extent of computational power.

This piqued my interest to know more about Spark. Since then, I’ve done an extensive research on this topic to come across every possible bit of information I could find.

Fortunately, Spark has extensive packages for different programming languages. I think, being an R user, my inherent inclination to SparkR is justified.

After I finished with the research, I realized there is no structured learning path available on SparkR. I even connected with folks who are keen to learn SparkR, but none came across such structured learning path. Have you faced the same difficulty ? If yes, here’s your answer.

This inspired me to create this step by step learning path. I’ve listed the best resources available on SparkR. If you manage to complete the 7 steps thoroughly, you are expected to acquire intermediate level of adeptness on Spark. However, your journey from intermediate to expert level would require hours of practice. You knew that, right ? Let’s begin!

Learning Path : Step by Step Guide for Beginners to Learn SparkR

Step 1: What is Spark? Why do we need it?

Spark is an Apache project promoted as “lightning fast cluster computing”. It’s astonishing computing speed makes it 100x faster than hadoop and 10x faster than Mapreduce in memory. For large data processing, Spark has become first choice of every data scientist or engineer today.

You see Amazon, eBay, Yahoo, Facebook, everyone is using Spark for data processing on insanely large data sets. Apache Spark has one of the fastest growing big data community with more than 750 contributors from 200+ companies worldwide. According to the 2015 Data Science Salary Survey by O’Reilly, presence of Apache Spark skills added $11,000 extra to the median salary.

To explore the amazing world of Spark in detail, you can refer this article.

You can also watch this video to learn more about the value that Spark has added to the business world:

However, if you more of a person who read stuffs, you can skip the video and check this recommended blog.

Interesting Read: Apache officially sets a new record in large scale sorting

Step 2: What is Spark R?

Being an R user, let’s channelize our focus on SparkR.

R is one of the most widely used programming languages in data science. With its simple syntax and ability to run complex algorithms, it is probably the first choice of language for beginners.

But, R suffers from a problem. That is, its data processing capacity is limited to memory on a single node. This limits the amount of data you can process with R. Now, you know why does R runs out of memory when you attempt to work on large data sets. To overcome this memory problem, we can use SparkR.

Along with R, Apache Spark provides APIs for various languages such as Python, Scala, Java, SQL and many more. These APIs act as a bridge in connecting these tools with Spark.

For a detailed view of SparkR, this is a must watch video:

Note: SparkR has a limitation. Currently, it only support linear predictive models. Therefore, if you were excited to run boosting algorithm on SparkR, you might have to wait until the next version is rolled out.

Step 3 : Setting up your Machine

If you are still reading, I presume that this new technology has sparked a curiosity in you and that you would be determined to complete this journey. So, lets move on with setting up the machine:

To install SparkR, firstly, we need to install Spark in our systems, since it runs at the backend.

Following resources will help you in installation on your respective OS:

After you’ve successfully installed, it just takes few extra steps to initiate SparkR , once you are done with Spark installation. Following resources will help you to initiate SparkR locally:

Step 4 : Getting the Basics Right

Start with R: Though I assume that you would be knowing R if you are interested to work with Big Data. However, if R is not your domain, this course by data camp will help you to get started with R.

Exercise: Install a package swirl in R and do the complete set of exercises.

Database handling with SQL: SQL is widely used in SparkR in order to implement functions easily using simple commands. This helps in reducing the code lines you have to write. Also, increases the speed of operations. If you are not familiar with SQL, you should do this course by codecademy.

Exercise: Practice 1 and Practice 2

Step 5 : Data Exploration with SparkR and SQL

Once your basics are at place, it’s time to learn to work with SparkR & SQL.

SparkR enables us to use a number of data exploration operations using a combination of R and SQL simultaneously. The most common ones being select, collect, group_By, summarize, subset and arrange. You can learn these operations with this article.

Exercise: Do this exercise by AmpBerkley

Dataset used in above exercise: Download

Step 6 : Building Predictive Models (Linear) on SparkR

As mentioned above, SparkR only supports linear modeling algorithms such as Regression. However, it’s just a matter of time until we are facing this constraint. I am expecting them to soon roll out an updated version which would support non-linear models as well.

SparkR implements linear modeling using the function glm. On the other hand, at present, Spark has a machine learning library known as MLlib (for more info on MLlib, click here), which supports non-linear modeling.

Learn and Practice: To build your first linear regression model on SparkR, follow this link. To build a logistic regression model, follow this link.

Step 7 : Integrating SparkR with Hive for Faster Computation

SparkR works even faster with Apache Hive for database management.

Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. Integrating Hive with SparkR would help running queries even faster and more efficiently.

If you want to step into bigdata, the use of hive would really be a great advantage for efficient data processing. You can install Hive by following the links given for respective OS:

For Windows
For Ubuntu
For Mac OS

After you’ve installed R successfully, you can start integrating Hive with SparkR using the steps demonstrated in this video. Alternatively, if you are more comfortable in reading, this video is also available in text format on this blog.

For a quick overview on SparkR, you can also follow its official documentation.

End Notes

I hope that I have made the learning path clear enough to accelerate your journey into data science using SparkR.

SparkR is often being seen as an intermediate step to switch into Big Data using R. I learned SparkR because I used to find immense difficulty in working on large data sets in R. SparkR provided me a convenient and cost free way to continue with my learning.

In addition, for a R user, SparkR can also provide headstart to someone who wishes to transition into big data industry. It’s is much powerful than I have explored yet.

Did you find this article helpful ? Have you worked on SparkR ? Do share your suggestions / experience in the comments section below.

You want to apply your analytical skills and test your potential? Then participate in our Hackathons to compete with many Data Scientists from all over the world.

shashwat.2014

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.6

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

Responses From Readers

subro

Thank you, Great learning Path, BTW what else books do you read on datascience

Show 1 reply

Shashwat Srivastava

Thanks Subro. You can check out some of the best books on data science in one of our blogs. http://www.analyticsvidhya.com/blog/2014/06/books-data-scientists-or-aspiring-ones/

Dox vK

After reading th is article I did some research myself and found the following link that I thought deserves a read, https://databricks.com/blog/2016/06/15/an-introduction-to-writing-apache-spark-applications-on-databricks.html

Show 2 reply

Shashwat Srivastava

Thanks for sharing the article. It would definitely be helpful for those wanting to write spark applications.

Neil Dewar

Databricks is a great resource for people wanting to learn Spark. If you don't want to go through having to do a local installation of Spark, Hive etc you can use Spark in the cloud by signing up for a free databricks Community Edition account. Once you have the account, you will have access to databricks great training materials. Here's a link to the community edition sign-up: https://databricks.com/try-databricks?x=1&utm_expid=115457034-1.KtU7kcGAQAu9HxHVJLjLHg.1

Vinod Pathak

What are your Views on sparklyr? I think this is also a new way to handle big data. It gives dplyr capability and several machine learning algorithms. Check out this link : http://spark.rstudio.com/

Show 1 reply

Shashwat Srivastava

Hi Vinod, Sparklyr is quite a new package and is very useful for implementing Spark packages. There is no doubt about the fact that it is very useful for beginners in Spark. Infact it is still developing with more and more features getting embedded into it. Thanks for mentioning it. I'm sure it will be utilised by those interested in Spark.

Reading list

Learning Path : Step by Step Guide for Beginners to Learn SparkR

Introduction

Step 1: What is Spark? Why do we need it?

Step 2: What is Spark R?

Step 3 : Setting up your Machine

Step 4 : Getting the Basics Right

Step 5 : Data Exploration with SparkR and SQL

Step 6 : Building Predictive Models (Linear) on SparkR

Step 7 : Integrating SparkR with Hive for Faster Computation

End Notes

You want to apply your analytical skills and test your potential? Then participate in our Hackathons to compete with many Data Scientists from all over the world.

Login to continue reading and enjoy expert-curated content.