Data scientists spend close to 70% (if not more) of their time cleaning, massaging and preparing data. That’s no secret – multiple surveys have confirmed that number. I can attest to it as well – it is simply the most time-consuming aspect of a data science project.
Unfortunately, it is also among the least interesting things we do as data scientists. There is no getting around it, though. It is an inevitable part of our role. We simply cannot build powerful and accurate models without ensuring our data is well prepared.
So how can we make this phase of our job interesting?
Welcome to the wonderful world of Tidyverse! It is the most powerful collection of R packages for preparing, wrangling and visualizing data. Tidyverse has completely changed the way I work with messy data – it has actually made data cleaning and massaging fun!
If you’re a data scientist and have not yet come across Tidyverse, this article will blow your mind. I will show you the top R packages bundled in Tidyverse that make data preparation an enjoyable experience. We’ll also look at code snippets for each package to help you get started.
You can also check out my pick of the top eight useful R packages you should incorporate into your data science work.
Tidyverse is a collection of essential R packages for data science. The packages under the tidyverse umbrella help us explore and interact with our data. There is a whole host of things you can do with your data, such as subsetting, transforming, visualizing, and so on.
Tidyverse was created by the great Hadley Wickham and his team with the aim of providing all these utilities to clean and work with data.
Let’s now look at some versatile Tidyverse libraries that the majority of data scientists use to manage and streamline their data workflows.
Ready to explore the tidyverse? Go ahead and install it directly from within RStudio:
install.packages("tidyverse")
We’ll be working on the food demand forecasting challenge in this article. I have taken a random 10% sample from the train file for faster computation. You can take the entire dataset if you want (and if your machine can support it!).
Let’s begin!
dplyr is one of my all-time favorite packages. It is simply the most useful package in R for data manipulation. One of the greatest advantages of this package is that you can use the pipe operator %>% to chain different functions together. From filtering to grouping the data, this package does it all.
Here are the core functions dplyr offers: filter() to subset rows, select() to pick columns, mutate() to create new variables, arrange() to reorder rows, summarise() to collapse groups into summaries, group_by() to define those groups, and a family of *_join() functions to merge tables.
Let’s look at an example to understand how to use these different functions in R.
Open up the food forecasting dataset we downloaded earlier. Apart from the training set, we have two other files containing centre and meal information. We can join them with our train file to add more features. Let’s use dplyr to merge the files, as sketched below. Again, I’m just using 10% of the overall data to make the computation faster.
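Here is a minimal sketch of how the merge can look with dplyr. The file names (train.csv, fulfilment_center_info.csv, meal_info.csv) are assumptions based on the challenge download, so adjust them if yours differ:

library(tidyverse)

# Read the files (file names assumed; adjust to your download)
train   <- read_csv("train.csv")
centers <- read_csv("fulfilment_center_info.csv")

# Take random 10% samples for faster computation; sampling each file
# separately is what produces the NAs in the output below
set.seed(1)
train   <- sample_frac(train, 0.10)
centers <- sample_frac(centers, 0.10)

# Attach the centre details to the training rows; the meal file can be
# joined the same way with left_join(meals, by = "meal_id")
train <- train %>% left_join(centers, by = "center_id")
head(train)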
Output:
       id week center_id meal_id checkout_price base_price emailer_for_promotion homepage_featured
1 1448490    1        55    2631         243.50     242.50                     0                 0
2 1446016    1        55    2290         311.43     310.43                     0                 0
3 1313873    1        55    2306         243.50     340.53                     0                 0
4 1440008    1        55    1962         582.03     612.13                     1                 0
5 1107611    1        24    1770         340.53     486.03                     0                 0
6 1298505    1        24    1198         147.50     191.09                     0                 0
  num_orders city_code region_code center_type op_area
1         40        NA          NA        <NA>      NA
2        162        NA          NA        <NA>      NA
3         28        NA          NA        <NA>      NA
4        231        NA          NA        <NA>      NA
5         54        NA          NA        <NA>      NA
6        148        NA          NA        <NA>      NA
Note: We see a lot of NAs here. This is because we randomly chose samples from each of the three files and then merged them. If you use the whole dataset, you will not observe this many missing values.
Next, let’s chain three dplyr functions together to summarise the data. Here, we’ll keep only the ‘TYPE_A’ rows of the ‘center_type’ variable and calculate the mean of the ‘num_orders’ variable for those centres:
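A sketch, continuing from the merged train data above:

train %>%
  select(center_type, num_orders) %>%
  filter(center_type == "TYPE_A") %>%
  summarise(avg_A = mean(num_orders, na.rm = TRUE))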
Here, %>% is called the piping operator. It comes in handy when we want to chain two or more functions together.
Output:
     avg_A
1 286.3757
Go ahead and try out the other functions. Trust me, they will completely change the way you do data preparation.
The tidyr package complements dplyr perfectly. It boosts the power of dplyr for data manipulation and pre-processing. Its key functions are: gather() to reshape wide data into long format, spread() to do the reverse, separate() to split one column into several, and unite() to combine several columns into one.
Let’s see a quick example of how to use tidyr. We’ll unite two binary variables and create only one column for both:
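A sketch on the merged train data; by default, unite() joins the values with an underscore and drops the original columns:

train <- train %>%
  unite(email_home, emailer_for_promotion, homepage_featured)
head(train)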
Output:
       id week center_id meal_id checkout_price base_price email_home num_orders city_code region_code
1 1448490    1        55    2631         243.50     242.50        0_0         40        NA          NA
2 1446016    1        55    2290         311.43     310.43        0_0        162        NA          NA
3 1313873    1        55    2306         243.50     340.53        0_0         28        NA          NA
4 1440008    1        55    1962         582.03     612.13        1_0        231        NA          NA
5 1107611    1        24    1770         340.53     486.03        0_0         54        NA          NA
6 1298505    1        24    1198         147.50     191.09        0_0        148        NA          NA
  center_type op_area
1        <NA>      NA
2        <NA>      NA
3        <NA>      NA
4        <NA>      NA
5        <NA>      NA
Here’s another example of how tidyr works:
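A sketch with a small toy data frame, reconstructed to match the output below:

# toy data: three groups, each measured on three factors
data <- data.frame(
  variable1 = rep(c("A", "B", "C"), each = 3),
  variable2 = rep(c("factor1", "factor2", "factor3"), times = 3),
  num       = 1:9
)
head(data)

# pivot variable2 into columns, filling the cells from num
spread(data, variable2, num)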
Output:
  variable1 variable2 num
1         A   factor1   1
2         A   factor2   2
3         A   factor3   3
4         B   factor1   4
5         B   factor2   5
6         B   factor3   6
> spread(data, variable2, num)
  variable1 factor1 factor2 factor3
1         A       1       2       3
2         B       4       5       6
3         C       7       8       9
We easily converted the factor variables into a table that can be swiftly interpreted without much pre-processing.
Dealing with string variables is a tricky challenge. They can often trip up our final analysis because we skipped over them initially, thinking they wouldn’t affect our model. That’s a mistake.
stringr is my go-to package in R for such situations. It plays a big role in processing raw data into a cleaner, easily understandable format. stringr contains a variety of functions that make working with string data really easy.
Some basic operations you can perform with the stringr package are: str_sub() to extract substrings, str_split() to split strings, str_c() to combine strings, and str_to_lower()/str_to_upper() to change case.
There are many more functions inside the stringr package. Let’s try a couple of them:
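A quick sketch, assuming a string x whose value is inferred from the output below:

x <- "Analytics Vidhya 001"
str_to_lower(x)
str_to_upper(x)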
Output:
> str_to_lower(x)
[1] "analytics vidhya 001"
> str_to_upper(x)
[1] "ANALYTICS VIDHYA 001"
Combine two strings:
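str_c() does this, with an optional separator (the strings here are just examples):

str_c("Analytics", "Vidhya", sep = " ")
# [1] "Analytics Vidhya"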
The forcats package is dedicated to dealing with categorical variables or factors. Anyone who has worked with categorical data knows what a nightmare they can be. forcats feels like a godsend.
It is quite frustrating when a factor turns up where we least expect it. Tibbles help here, since they never silently convert strings to factors. forcats aims to fill in the remaining missing pieces so we can access the power of factors with minimum effort.
Use the following example to experiment with factors in your data:
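A minimal sketch on the merged train data: fct_count() tallies each level of a factor and returns a tibble with columns f (the level) and n (the count):

fct_count(as_factor(train$center_type))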
Output:
# A tibble: 4 x 2
  f          n
  <fct>  <int>
1 TYPE_A  1890
2 TYPE_B   569
3 TYPE_C   537
4 NA     42657
We have plenty of ways to read data in R. So why use the readr package? readr solves the problem of parsing a flat file into a tibble. It improves on the standard file-importing functions and significantly speeds up computation.
You can easily read a .CSV file in the following way:
read_delim("filename.csv",delim=",")
Use this function and you’ll automatically see the difference in the time RStudio takes to read in huge data files.
We work with dataframes in R. It’s one of the first things we learn: convert your data into a dataframe before proceeding with any data science steps.
Tibble is a modern take on the dataframe. It truly stands out when we’re trying to detect anomalies in our dataset. How? A tibble does not silently change variable names or types, and unlike a regular dataframe, it complains when you refer to a variable that does not exist.
Along with an enhanced print() method, the tibble package makes it easy to handle big datasets containing complex objects. Such features let us catch inherent data issues early on, producing cleaner code and data.
data <- as_tibble(train)
head(data)
Notice how the data type is mentioned along with each column name. This is a very useful way to present data. Running the example above, we can easily see the “tibble” output R gives us:
Output:
# A tibble: 456,548 x 9
        id  week center_id meal_id checkout_price base_price emailer_for_pro~ homepage_featur~
     <int> <int>     <int>   <int>          <dbl>      <dbl>            <int>            <int>
 1  1.38e6     1        55    1885           137.       152.                0                0
 2  1.47e6     1        55    1993           137.       136.                0                0
 3  1.35e6     1        55    2539           135.       136.                0                0
 4  1.34e6     1        55    2139           340.       438.                0                0
 5  1.45e6     1        55    2631           244.       242.                0                0
 6  1.27e6     1        55    1248           251.       252.                0                0
 7  1.19e6     1        55    1778           183.       184.                0                0
 8  1.50e6     1        55    1062           182.       183.                0                0
 9  1.03e6     1        55    2707           193.       192.                0                0
10  1.05e6     1        55    1207           326.       384.                0                1
# ... with 456,538 more rows, and 1 more variable: num_orders <int>
The train file that we converted to the tibble format now gives us a much clearer look at the data types and the number of variables. Looks pretty neat and tidy, right?
The purrr package in R provides a complete toolkit for enhancing R’s functional programming. We can use purrr’s functions to replace many loops with a single line of code.
Which function do you typically use to check the mean of every column in your data? Most data scientists using R tend to lean on the summary() function. It gives us the descriptive statistics for each column.
An even better way to compute just the mean of every column, without writing any ugly loops, is the “map” family of functions. Let’s see how, using our training set:
map_dbl(train, ~ mean(.x))  # apply mean() to every column, returning a named numeric vector
Output:
                   id                  week             center_id               meal_id
         1.250096e+06          7.476877e+01          8.210580e+01          2.024337e+03
       checkout_price            base_price emailer_for_promotion     homepage_featured
         3.322389e+02          3.541566e+02          8.115247e-02          1.091999e-01
           num_orders
         2.618728e+02
I’m sure you must have heard of ggplot2. It is far and away the best visualization package I have ever used. Data scientists universally love using ggplot2 to produce their charts and visualizations. It’s so useful and popular that its grammar has even been ported to Python!
There is so much we can do with this package. Whether it’s building box plots, density plots, violin plots, tile plots, time series plots – you name it and ggplot2 has a function for it.
Let’s see a few examples of how to create some really intuitive plots with ggplot2 in R.
‘num_orders’ is the target variable in our food forecasting dataset. Let’s look at its distribution by generating a density chart:
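A sketch on the merged train data:

ggplot(train, aes(x = num_orders)) +
  geom_density(fill = "steelblue", alpha = 0.6)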
As you can see above, the dependent variable is right-skewed.
Now, how about drawing up a violin plot? It’s a nice alternative to boxplots for detecting outliers:
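A sketch; the empty x aesthetic draws a single violin, and you can swap in a grouping variable such as center_type to compare distributions:

ggplot(train, aes(x = "", y = num_orders)) +
  geom_violin(fill = "tomato") +
  labs(x = NULL)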
Woah. There are plenty of outliers in our data. Don’t you love how a simple visualization offers up so many insights?
Next, plot a scatterplot to check the relationship between the checkout price and the base price:
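A sketch:

ggplot(train, aes(x = base_price, y = checkout_price)) +
  geom_point(alpha = 0.3)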
Interestingly, there seems to be a pretty strong linear relationship between the two variables. We can certainly dig deeper into this when we’re working on this challenge to understand how these variables affect our overall model building strategy.
The power of visualization never ceases to amaze me.
These packages are installed along with the tidyverse but are not attached when you call library(tidyverse), so you need to load each one explicitly. I have also provided the installation commands for each package in this section, in case you want to install them on their own.
install.packages("readxl") library(readxl) data <- read_xlsx("filename.xlxs")
install.packages("haven")
library(haven)
dat <- read_sas("path to file", "path to formats catalog")
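The tidyverse installation also brings in lubridate for working with dates and times, and like the others in this section it must be loaded explicitly. A quick sketch, with date strings made up to match the output below:

install.packages("lubridate")
library(lubridate)

# parse year-month-day strings into proper Date objects
ymd(c("2019-01-11", "2018-09-12", "2019-04-01"))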
Output:
"2019-01-11" "2018-09-12" "2019-04-01"
Output:
"9H 10M 1S" "9H 10M 2S" "9H 10M 3S"
Pretty awesome!
Tidyverse is the most popular collection of R packages – which isn’t all that surprising given how useful and easy to use they are. You’re definitely missing out on time savings and a much more efficient workflow if you aren’t using the Tidyverse packages.
Have you used these R packages before? Are there any other packages you feel should be incorporated into Tidyverse? I want to hear your thoughts, feedback, and experience with Tidyverse. Let me know in the comments section below!
And if you get stuck at any point while using these packages, I’ll be happy to help you out.
We have summarised the use of every package under the tidyverse in this amazing cheatsheet; you can access it here.
Perfect timing, Akshat! I am now starting my first 10k-row assignment, flying solo with no help from our instructor. He works for xtol – I’ll let you know later this week. You call it the training set. Should I do my 80/20 split before tidyverse? Tks, Alex Rosental, 35 yr experience MsChE
Hi Alex! You should do the split after all the pre-processing in order to keep the train and test sets similar in nature. Plus, these packages will help in EDA, which will then aid you in feature engineering. Moreover, since the train and test sets should have the same number of features, it is advisable to use tidyverse before splitting.
very good post, keep going!
Nice article, though there is an error where you mention "some basic functions that you can perform with the stringr package are: substr, paste, strsplit, tolower/toupper". Functions in the stringr package start with str_, as in str_sub, str_split, str_to_lower/str_to_upper. There is actually no replacement for paste or paste0 in the stringr package.
Hi Alain! Thank you for going through the article. These errors have been rectified, thank you for the feedback :)