Data scientists spend close to 70% (if not more) of their time cleaning, massaging and preparing data. That’s no secret – multiple surveys have confirmed that number. I can attest to it as well – it is simply the most time-consuming aspect of a data science project.
Unfortunately, it is also among the least interesting things we do as data scientists. There is no getting around it, though. It is an inevitable part of our role. We simply cannot build powerful and accurate models without ensuring our data is well prepared.
So how can we make this phase of our job interesting?
Welcome to the wonderful world of Tidyverse! It is the most powerful collection of R packages for preparing, wrangling and visualizing data. Tidyverse has completely changed the way I work with messy data – it has actually made data cleaning and massaging fun!
If you’re a data scientist and have not yet come across Tidyverse, this article will blow your mind. I will show you the top R packages bundled in Tidyverse that make data preparation an enjoyable experience. We’ll also look at code snippets for each package to help you get started.
You can also check out my pick of the top eight useful R packages you should incorporate into your data science work.
Tidyverse is a collection of essential R packages for data science. The packages under the tidyverse umbrella help us explore and interact with our data. There is a whole host of things you can do with your data, such as subsetting, transforming, visualizing, and so on.
Tidyverse was created by the great Hadley Wickham and his team with the aim of providing all these utilities to clean and work with data.
Let’s now look at some versatile Tidyverse libraries that the majority of data scientists use to manage and streamline their data workflows.
Ready to explore the tidyverse? Go ahead and install it directly from within RStudio:
install.packages("tidyverse")
We’ll be working on the food demand forecasting challenge in this article. I have taken a random 10% sample from the train file for faster computation. You can take the entire dataset if you want (and if your machine can support it!).
Let’s begin!
dplyr is one of my all-time favorite packages. It is simply the most useful package in R for data manipulation. One of the greatest advantages of this package is that you can use the pipe operator %>% to chain different functions together. From filtering to grouping the data, this package does it all.
Here are the core functions dplyr offers: filter() to subset rows, select() to pick columns, mutate() to create new variables, arrange() to reorder rows, summarise() to collapse groups into summaries, group_by() to define those groups, and a family of *_join() functions to merge tables.
Let’s look at an example to understand how to use these different functions in R.
Open up the food forecasting dataset we downloaded earlier. Apart from the training set, we have two other files containing centre and meal information. We can join them with our train file to add more features. Let’s use dplyr to merge the files, as sketched below. Again, I’m just using 10% of the overall data to make the computation faster.
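Here is a minimal sketch of how the merge can look with dplyr. The file names (train.csv, fulfilment_center_info.csv, meal_info.csv) are assumptions based on the challenge download, so adjust them if yours differ:

library(tidyverse)

# Read the files (file names assumed; adjust to your download)
train   <- read_csv("train.csv")
centers <- read_csv("fulfilment_center_info.csv")

# Take random 10% samples for faster computation; sampling each file
# separately is what produces the NAs in the output below
set.seed(1)
train   <- sample_frac(train, 0.10)
centers <- sample_frac(centers, 0.10)

# Attach the centre details to the training rows; the meal file can be
# joined the same way with left_join(meals, by = "meal_id")
train <- train %>% left_join(centers, by = "center_id")
head(train)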
Output:
       id week center_id meal_id checkout_price base_price emailer_for_promotion homepage_featured
1 1448490    1        55    2631         243.50     242.50                     0                 0
2 1446016    1        55    2290         311.43     310.43                     0                 0
3 1313873    1        55    2306         243.50     340.53                     0                 0
4 1440008    1        55    1962         582.03     612.13                     1                 0
5 1107611    1        24    1770         340.53     486.03                     0                 0
6 1298505    1        24    1198         147.50     191.09                     0                 0
  num_orders city_code region_code center_type op_area
1         40        NA          NA        <NA>      NA
2        162        NA          NA        <NA>      NA
3         28        NA          NA        <NA>      NA
4        231        NA          NA        <NA>      NA
5         54        NA          NA        <NA>      NA
6        148        NA          NA        <NA>      NA
Note: We see a lot of NAs here. This is because we randomly chose samples from each of the three files and then merged them. If you use the whole dataset, you will not observe this many missing values.
Next, let’s chain three dplyr functions together to summarise the data. Here, we’ll keep only the ‘TYPE_A’ rows of the ‘center_type’ variable and calculate the mean of the ‘num_orders’ variable for those centres:
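A sketch, continuing from the merged train data above:

train %>%
  select(center_type, num_orders) %>%
  filter(center_type == "TYPE_A") %>%
  summarise(avg_A = mean(num_orders, na.rm = TRUE))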
Here, %>% is called the piping operator. It comes in handy when we want to chain two or more functions together.
Output:
     avg_A
1 286.3757
Go ahead and try out the other functions. Trust me, they will completely change the way you do data preparation.
The tidyr package complements dplyr perfectly. It boosts the power of dplyr for data manipulation and pre-processing. Its key functions are: gather() to reshape wide data into long format, spread() to do the reverse, separate() to split one column into several, and unite() to combine several columns into one.
Let’s see a quick example of how to use tidyr. We’ll unite two binary variables and create only one column for both:
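A sketch on the merged train data; by default, unite() joins the values with an underscore and drops the original columns:

train <- train %>%
  unite(email_home, emailer_for_promotion, homepage_featured)
head(train)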
Output:
       id week center_id meal_id checkout_price base_price email_home num_orders city_code region_code
1 1448490    1        55    2631         243.50     242.50        0_0         40        NA          NA
2 1446016    1        55    2290         311.43     310.43        0_0        162        NA          NA
3 1313873    1        55    2306         243.50     340.53        0_0         28        NA          NA
4 1440008    1        55    1962         582.03     612.13        1_0        231        NA          NA
5 1107611    1        24    1770         340.53     486.03        0_0         54        NA          NA
6 1298505    1        24    1198         147.50     191.09        0_0        148        NA          NA
  center_type op_area
1        <NA>      NA
2        <NA>      NA
3        <NA>      NA
4        <NA>      NA
5        <NA>      NA
Here’s another example of how tidyr works:
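A sketch with a small toy data frame, reconstructed to match the output below:

# toy data: three groups, each measured on three factors
data <- data.frame(
  variable1 = rep(c("A", "B", "C"), each = 3),
  variable2 = rep(c("factor1", "factor2", "factor3"), times = 3),
  num       = 1:9
)
head(data)

# pivot variable2 into columns, filling the cells from num
spread(data, variable2, num)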
Output:
  variable1 variable2 num
1         A   factor1   1
2         A   factor2   2
3         A   factor3   3
4         B   factor1   4
5         B   factor2   5
6         B   factor3   6
> spread(data, variable2, num)
  variable1 factor1 factor2 factor3
1         A       1       2       3
2         B       4       5       6
3         C       7       8       9
We easily converted the factor variables into a table that can be swiftly interpreted without much pre-processing.
Dealing with string variables is a tricky challenge. They can often trip up our final analysis because we skipped over them initially, thinking they wouldn’t affect our model. That’s a mistake.
stringr is my go-to package in R for such situations. It plays a big role in processing raw data into a cleaner, easily understandable format. stringr contains a variety of functions that make working with string data really easy.
Some basic operations you can perform with the stringr package are: str_sub() to extract substrings, str_split() to split strings, str_c() to combine strings, and str_to_lower()/str_to_upper() to change case.
There are many more functions inside the stringr package. Let’s try a couple of them:
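A quick sketch, assuming a string x whose value is inferred from the output below:

x <- "Analytics Vidhya 001"
str_to_lower(x)
str_to_upper(x)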
Output:
> str_to_lower(x)
[1] "analytics vidhya 001"
> str_to_upper(x)
[1] "ANALYTICS VIDHYA 001"
Combine two strings:
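str_c() does this, with an optional separator (the strings here are just examples):

str_c("Analytics", "Vidhya", sep = " ")
# [1] "Analytics Vidhya"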
The forcats package is dedicated to dealing with categorical variables or factors. Anyone who has worked with categorical data knows what a nightmare they can be. forcats feels like a godsend.
It is quite frustrating when a factor turns up where we least expect it. Tibbles help here, since they never silently convert strings to factors. forcats aims to fill in the remaining missing pieces so we can access the power of factors with minimum effort.
Use the following example to experiment with factors in your data:
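A minimal sketch on the merged train data: fct_count() tallies each level of a factor and returns a tibble with columns f (the level) and n (the count):

fct_count(as_factor(train$center_type))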
Output:
# A tibble: 4 x 2
  f          n
  <fct>  <int>
1 TYPE_A  1890
2 TYPE_B   569
3 TYPE_C   537
4 NA     42657
We have plenty of ways to read data in R. So why use the readr package? readr solves the problem of parsing a flat file into a tibble. It improves on the standard file-importing functions and significantly speeds up computation.
You can easily read a .CSV file in the following way:
read_delim("filename.csv",delim=",")
Use this function and you’ll automatically see the difference in the time RStudio takes to read in huge data files.
We work with dataframes in R. It’s one of the first things we learn: convert your data into a dataframe before proceeding with any data science steps.
Tibble is a modern take on the dataframe. It truly stands out when we’re trying to detect anomalies in our dataset. How? A tibble does not silently change variable names or types, and unlike a regular dataframe, it complains when you refer to a variable that does not exist.
Along with an enhanced print() method, the tibble package makes it easy to handle big datasets containing complex objects. Such features let us catch inherent data issues early on, producing cleaner code and data.
data <- as_tibble(train)
head(data)
Notice how the data type is mentioned along with each column name. This is a very useful way to present data. Running the example above, we can easily see the “tibble” output R gives us:
Output:
# A tibble: 456,548 x 9
        id  week center_id meal_id checkout_price base_price emailer_for_pro~ homepage_featur~
     <int> <int>     <int>   <int>          <dbl>      <dbl>            <int>            <int>
 1  1.38e6     1        55    1885           137.       152.                0                0
 2  1.47e6     1        55    1993           137.       136.                0                0
 3  1.35e6     1        55    2539           135.       136.                0                0
 4  1.34e6     1        55    2139           340.       438.                0                0
 5  1.45e6     1        55    2631           244.       242.                0                0
 6  1.27e6     1        55    1248           251.       252.                0                0
 7  1.19e6     1        55    1778           183.       184.                0                0
 8  1.50e6     1        55    1062           182.       183.                0                0
 9  1.03e6     1        55    2707           193.       192.                0                0
10  1.05e6     1        55    1207           326.       384.                0                1
# ... with 456,538 more rows, and 1 more variable: num_orders <int>
The train file that we converted to the tibble format now gives us a much clearer look at the data types and the number of variables. Looks pretty neat and tidy, right?
The purrr package in R provides a complete toolkit for enhancing R’s functional programming. We can use purrr’s functions to replace many loops with a single line of code.
Which function do you typically use to check the mean of every column in your data? Most data scientists using R tend to lean on the summary() function. It gives us the descriptive statistics for each column.
An even better way to compute just the mean of every column, without writing any ugly loops, is the “map” family of functions. Let’s see how, using our training set:
map_dbl(train, ~ mean(.x))  # apply mean() to every column, returning a named numeric vector
Output:
                   id                  week             center_id               meal_id
         1.250096e+06          7.476877e+01          8.210580e+01          2.024337e+03
       checkout_price            base_price emailer_for_promotion     homepage_featured
         3.322389e+02          3.541566e+02          8.115247e-02          1.091999e-01
           num_orders
         2.618728e+02
I’m sure you must have heard of ggplot2. It is far and away the best visualization package I have ever used. Data scientists universally love using ggplot2 to produce their charts and visualizations. It’s so useful and popular that its grammar has even been ported to Python!
There is so much we can do with this package. Whether it’s building box plots, density plots, violin plots, tile plots, time series plots – you name it and ggplot2 has a function for it.
Let’s see a few examples of how to create some really intuitive plots with ggplot2 in R.
‘num_orders’ is the target variable in our food forecasting dataset. Let’s look at its distribution by generating a density chart:
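A sketch on the merged train data:

ggplot(train, aes(x = num_orders)) +
  geom_density(fill = "steelblue", alpha = 0.6)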
As you can see above, the dependent variable is right-skewed.
Now, how about drawing up a violin plot? It’s a nice alternative to boxplots for detecting outliers:
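A sketch; the empty x aesthetic draws a single violin, and you can swap in a grouping variable such as center_type to compare distributions:

ggplot(train, aes(x = "", y = num_orders)) +
  geom_violin(fill = "tomato") +
  labs(x = NULL)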
Woah. There are plenty of outliers in our data. Don’t you love how a simple visualization offers up so many insights?
Next, plot a scatterplot to check the relationship between the checkout price and the base price:
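A sketch:

ggplot(train, aes(x = base_price, y = checkout_price)) +
  geom_point(alpha = 0.3)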
Interestingly, there seems to be a pretty strong linear relationship between the two variables. We can certainly dig deeper into this when we’re working on this challenge to understand how these variables affect our overall model building strategy.
The power of visualization never ceases to amaze me.
These packages are installed along with the tidyverse but are not attached when you call library(tidyverse), so you need to load each one explicitly. I have also provided the installation commands for each package in this section, in case you want to install them on their own.
install.packages("readxl") library(readxl) data <- read_xlsx("filename.xlxs")
install.packages("haven")
library(haven)
dat <- read_sas("path to file", "path to formats catalog")
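The tidyverse installation also brings in lubridate for working with dates and times, and like the others in this section it must be loaded explicitly. A quick sketch, with date strings made up to match the output below:

install.packages("lubridate")
library(lubridate)

# parse year-month-day strings into proper Date objects
ymd(c("2019-01-11", "2018-09-12", "2019-04-01"))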
Output:
"2019-01-11" "2018-09-12" "2019-04-01"
Output:
"9H 10M 1S" "9H 10M 2S" "9H 10M 3S"
Pretty awesome!
Tidyverse is the most popular collection of R packages – which isn’t all that surprising given how useful and easy to use they are. You’re definitely missing out on time savings and a much more efficient workflow if you aren’t using the Tidyverse packages.
Have you used these R packages before? Are there any other packages you feel should be incorporated into Tidyverse? I want to hear your thoughts, feedback, and experience with Tidyverse. Let me know in the comments section below!
And if you get stuck at any point while using these packages, I’ll be happy to help you out.
We have summarised the use of every package under the tidyverse in this amazing cheatsheet; you can access it here.
Perfect timing, Akshat! I am now starting my first 10k-row assignment, flying solo with no help from our instructor. He works for xtol – I’ll let you know later this week. You call it the training set. Should I do my 80/20 split before tidyverse? Tks, Alex Rosental, 35 yr experience MsChE
Hi Alex! You should do the split after all the pre-processing in order to keep the train and test sets similar in nature. Plus, these packages will help in EDA, which will then aid you in feature engineering. Moreover, since the train and test sets should have the same number of features, it is advisable to use tidyverse before splitting.
very good post, keep going!
Nice article, though there is an error where you mention "some basic functions that you can perform with the stringr package are: substr, paste, strsplit, tolower/toupper". Functions in the stringr package start with str_, as in str_sub, str_split, str_to_lower/str_to_upper. There is actually no replacement for paste or paste0 in the stringr package.
Hi Alain! Thank you for going through the article. These errors have been rectified, thank you for the feedback :)