The Ultimate Swiss Army Knife of ‘apply’ Family in R

Purva Huilgol Last Updated : 06 Jan, 2022

10 min read

Overview

Data manipulation is one of the most crucial steps one has to perform in the machine learning lifecycle
Let’s learn to use the most widely-used set of apply functions for transforming data in R

Introduction

Data manipulation is one of the most crucial steps one has to perform in the machine learning lifecycle. It entails transforming the provided data so that it can be used for building predictive models.

Additionally, it is here that a skilled data scientist applies their intuition and experience to extract as much information from the data as possible. Thus, it should come as no surprise that there exists a plethora of functions and tools in both Python and R to help us in this task.

Featured Image

Today, we will be using R and learning about the most widely-used set of ‘apply‘ functions for transforming data in R. This family of functions offers efficient and quick operations on data. This is particularly useful when we want to work only with certain columns. This set of functions is called the apply() functions. Along with its variants like sapply(), mapply(), etc., we are provided a multipurpose swiss-army knife for data manipulation.

If you are interested in having a career in Data Science and learning about these amazing things, I recommend you check out our Certified AI & ML BlackBelt Accelerate Program.

The various set of functions in this family are:

Setting the Context
apply
lapply
sapply
vapply
tapply
mapply

Setting the Context

I will first cover how each function above works by using simple datasets and then we will take up a real dataset to use these functions.

So fire up your Notebooks or R studio, and let us get started!

We don’t need to install any other libraries to use the apply functions. So let us start by creating a simple matrix of numerical values from 1 to 20 distributed among 5 rows and 4 columns:

data <- matrix(c(1:20), nrow = 5 , ncol = 4)
data

This is how our matrix looks. Now, let us start with the apply() function

apply()

The general syntax of the apply() function can be obtained using the help section. Just execute this code to get a detailed documentation

?apply

As we can see, the apply function has the structure of apply(X, MARGIN, FUN, …). Here,

X refers to the dataset(in our case the matrix) we will be applying the operations on
MARGIN parameters allow us to specify if we want to apply the operation by the row or by the column.
- MARGIN = 1 for rows
- MARGIN = 2 for columns
FUN refers to any user-defined or in-built function we want to ‘apply’ on X

Let us look at the simple example of calculating the mean of each row:

mean_rows <- apply(data, 1, mean)
mean_rows

That was fairly simple! We can see how the apply() function can be used to summarise our data. In the same vein, Let us try finding e sum along each column:

sum_cols <- apply(data, 2, sum)
sum_cols

If we want to apply the function on all the elements, we just write the apply function like this:

all_sqrt <- apply(data, 1:2, sqrt)
all_sqrt

apply square root all function

What if we want to apply a user-defined function to the data? For example, I have a function that finds the square root of(x – 1) for each row:

fn = function(x)
{
   return(sqrt(x - 1))
}

We then apply this function along each row:

apply(data, 1, fn)

apply function row wise

So far, we have used functions that take only 1 parameter and applied them to the data. The best part about the apply family is that they work with functions having multiple arguments as well! Let us apply a user-defined function that takes 3 arguments:

fn = function(x1, x2, x3)
{    
    return(x1^2 + x2 * x1 + x3)
}

We take x1 as each value from ‘data’ and x2, x3 as other arguments that will be first declared and then passed through the apply function:

b = 2
c = 1

# apply along each row:
row_fn <- apply(data, 1, fn, x2 = b, x3 = c)

# apply along each column:
col_fn <- apply(data, 2, fn, x2 = b, x3 = c)

Let us check row_fn and col_fn

row_fn

apply row_fn

col_fn

Apply column fn

The rest of the apply() family follows a similar structure with similar arguments except for a few changes. Let us next use the lapply() function.

lapply()

The apply() function above has a constraint the data needs to be a matrix of at least 2 dimensions for the apply() function to be performed on it. The lapply() function removes this constraint. Short for list-apply, you can use the lapply function on a list or a vector. Be it a list of vectors or just a simple vector, the lapply() can be used on both. Since we are now dealing with vectors/lists, the lapply function does not need the MARGIN parameter either. That being said, the return type of lapply is also a list.

It takes only the data and the function as the basic parameters:

lapply(X, FUN)

Let us look at some examples:

# define a list
cart <- c("BREAD","BUTTER","MILK","COOKIES")

# use lapply to convert all to lower case
cart_lower <- lapply(cart, tolower)

#output
cart_lower

lapply lower case

We will now take a more complex list of lists:

data <- list(l1 = c(1, 2, 3, 4),
             l2 = c(5, 6, 7, 8),
             l3 = c(9, 10, 11, 12))

# apply the 'sum' function on data:
sum_list <- lapply(data, sum)

#output
sum_list

sapply()

The sapply() function(short for simplified-apply) is similar to the lapply function. The only difference is the return type of the output – sapply() simplifies the output based on the values returned. I have created a simple table that tells us what type is returned:

Return Value	Length of each element	Output
List	1	Vector
List	> 1 and of the same length	Matrix
List	> 1 and of the varying length	List

We will see examples of all the above scenarios:

Scenario 1: Length of each element = 1

data <- list(l1 = c(1, 2, 3, 4))

# apply the 'sum' function on data:
sum_sapply1 <- sapply(data, sum)

#output
sum_sapply1

Using lapply to see the difference in outputs:

sum_lapply1 <- lapply(data, sum)
sum_lapply1

Scenario 2: Length of each element > 1 and same

data <- list(l1 = c(1, 2, 3, 4),
             l2 = c(5, 6, 7, 8),
             l3 = c(9, 10, 11, 12))

# apply the 'sum' function on data:
sum_sapply2 <- sapply(data, sum)

#output
sum_sapply2

sapply sum

What output does lapply() give us?

sum_lapply2 <- lapply(data, sum)
sum_lapply2

Scenario 3: Length of each element > 1 and different

data <- list(l1 = c(1, 2, 3),
l2 = c(5, 6, 7, 8),
l3 = c(9, 10))

# apply the 'sum' function on data:
sum_sapply3 <- sapply(data, sum)

#output
sum_sapply3

laaply length

Let us compare it with the output of lapply() on the same data:

sum_lapply3 <- lapply(data, sum)

#output
sum_lapply3

You can see how the output differs from lapply above which returned a list

vapply()

Coming to the vapply() function. The trio of lapply(), apply(), and vapply() are specially tailored for vectors of all types. Unlike lapply() and sapply() which decide the data type of the output for us, vapply() allows us to choose the data type of the output structure. Thus, the parameters for vapply() are:

vapply(X, FUN, FUN.VALUE)

Here FUN.VALUE is used to provide the data type you want.

This is most useful when our lost/vectors contain a mix of numbers and strings:

data <- list(l1 = c(1, 2, 3, 4),
             l2 = c(5, 6, 7, 8),
             l3 = c(9, 10, 11, 12),
             l4 = c("a", "b", "c", "a"))

# apply the 'max' function on data:
sum_vapply <- vapply(data, max, numeric(1))

vapply error

As expected, we got an error because it is not possible to compute the max value from a list of characters. The numeric(1) specifies that we want the output to be individual numeric values where the length of each element is 1. What if we use lapply() or sapply()?

lapply(data, max)
sapply(data, max)

Thus, we can see that both lapply() and sapply() actually provided outputs for the same. In fact, sapply() even converted the output to a vector of type character. Ideally, this is not what we want. Generally, this is how we use the vapply() function

data <- list(l1 = c(1, 2, 3, 4),
             l2 = c(5, 6, 7, 8),
             l3 = c(9, 10, 11, 12))

max_vapply <- vapply(data, max, numeric(1))

max_vapply

vapply numeric

Thus, it is always better to use vapply() when you are working with data frames that have different datatypes of features.

tapply()

In the simplest terms, tapply() allows us to divide the data into groups and perform the operation on each group. Thus, when you provide a vector as input, tapply() performs the specified operation on each subset of the vector. The parameters it takes are:

tapply(X, INDEX, FUN)

where INDEX represents the factor you want to use to separate the data. Sounds familiar? Yes, tapply() is nothing but a simple way to perform a groupy operation and apply some function on this grouped data!

To observe how tapply() works let us create 2 simple vectors

item_cat <- c("HOME", "SNACKS", "BEVERAGE", "STORAGE", "CLEANING", "STORAGE", "HOME", "BEVERAGE", "ELECTRONICS", "SNACKS")
item_qty <-c(25, 30, 45, 66, 15, 50, 35, 20, 15, 35)

Let us now use tapply to get the mean quantity of each item category:

tapply(item_qty, item_cat, mean)

What did the tapply() function do? We grouped the item_qty vector by the item_cat vector to create subsets of the vectors. We then calculate the mean of each subset.

What makes it so easy to use tapply() is that it automatically takes the unique values from the item_cat vector and applies the function we want on the data almost instantly. We can even get more than one value on each subset:

tapply(item_qty, item_cat, function(x) c(mean(x), sum(x)))

tapply mean and sum of elements

Now, we come to the last function in the apply() family of functions – the mapply() function.

mapply()

mapply() stands for multivariate-apply and basically is a multivariate version of sapply(). The mapply function is best explained by examples – so let us use it first and then try to understand how it works.

Let us first take a function that generally does not take 2 lists or 2 vectors as arguments – for example, the max function. We take 2 lists first:

list1 <- list(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))
list2 <- list(a = c(10, 11, 12), b = c(13, 14, 15), c = c(16, 17, 18))

Now, what if we want to find the max values between each pair of lists elements?

max(list1$a, list2$a)

Now, this function cannot be applied at the same time across all elements of list1 and list2. In such cases, we use the mapply() function:

mapply(function(num1, num2) max(c(num1, num2)), list1, list2)

mapply() function apply

Thus, the mapply function is used to perform functions on data that don’t generally accept multiple lists/vectors as arguments. It is also useful when you want to create new columns. Let us first create a dataframe from the matrix we defined initially:

df <- as.data.frame(data)

dataframe.data

We will now create a new variable that contains the product of columns V1 and V3:

mapply(function(x, y) x/y, df$V1, df$V3)

Thus, we see that mapply is a really handy function when working with data frames.

Now, let us see how to use these functions on real-life datasets. For simplicity, let us take up the iris dataset:

iris_df<-datasets::iris

head(iris_df)

We can now compute the mean of sepal length and sepal width of each row using the apply() function:

iris_df['Sepal_mean'] <- apply(iris_df[c("Sepal.Length", "Sepal.Width")], 1, mean)

Similarly, we can obtain summary values of each column for each species in the dataframe:

tapply(iris_df$Sepal.Width, iris_df$Species, mean)

We can also create a new column showing the sum of petal length and petal width using the mapply() function:

iris_df['Sum_Petal'] <- mapply(function(x, y) x+y, iris_df$Petal.Length, iris_df$Petal.Width)

End Notes

So far, we learned the various functions in the apply() family of functions in R. These set of functions provide extremely efficient means to apply various operations on data in a split second. The article covers the basics of these functions for the purpose of making you understand how these functions work.

I encourage you to try more complicated functions on more complicated datasets to fully understand how useful these functions are. How have these functions helped you in working with datasets in R? Please share your replies and any questions below!

Purva Huilgol

Associate of Data Science @ JP Morgan

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

MUID

Used by Microsoft Clarity, to store and track visits across websites.

Expiry: 1 Year

Type: HTTP

_clck

Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.

Expiry: 1 Year

Type: HTTP

_clsk

Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.

Expiry: 1 Day

Type: HTTP

SRM_I

Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Years

Type: HTTP

SM

Use to measure the use of the website for internal analytics

Expiry: 1 Years

Type: HTTP

CLID

The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.

Expiry: 1 Year

Type: HTTP

SRM_B

Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Months

Type: HTTP

_gid

This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.

Expiry: 399 Days

Type: HTTP

_ga_#

Used by Google Analytics, to store and count pageviews.

Expiry: 399 Days

Type: HTTP

_gat_#

Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.

Expiry: 1 Day

Type: HTTP

collect

Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.

Expiry: Session

Type: PIXEL

AEC

cookies ensure that requests within a browsing session are made by the user, and not by other sites.

Expiry: 6 Months

Type: HTTP

G_ENABLED_IDPS

use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.

Expiry: 2 Years

Type: HTTP

test_cookie

This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.

Expiry: 1 Year

Type: HTTP

_we_us

this is used to send push notification using webengage.

Expiry: 1 Year

Type: HTTP

WebKlipperAuth

used by webenage to track auth of webenagage.

Expiry: Session

Type: HTTP

ln_or

Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.

Expiry: 1 Day

Type: HTTP

JSESSIONID

Use to maintain an anonymous user session by the server.

Expiry: 1 Year

Type: HTTP

li_rm

Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.

Expiry: 1 Year

Type: HTTP

AnalyticsSyncHistory

Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

lms_analytics

Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

liap

Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.

Expiry: 6 Months

Type: HTTP

visit

allow for the Linkedin follow feature.

Expiry: 1 Year

Type: HTTP

li_at

often used to identify you, including your name, interests, and previous activity.

Expiry: 2 Months

Type: HTTP

s_plt

Tracks the time that the previous page took to load

Expiry: Session

Type: HTTP

lang

Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings

Expiry: Session

Type: HTTP

s_tp

Tracks percent of page viewed

Expiry: Session

Type: HTTP

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

Indicates the start of a session for Adobe Experience Cloud

Expiry: Session

Type: HTTP

s_pltp

Provides page name value (URL) for use by Adobe Analytics

Expiry: Session

Type: HTTP

s_tslv

Used to retain and fetch time since last visit in Adobe Analytics

Expiry: 6 Months

Type: HTTP

li_theme

Remembers a user's display preference/theme setting

Expiry: 6 Months

Type: HTTP

li_theme_set

Remembers which users have updated their display / theme preferences

Expiry: 6 Months

Type: HTTP

Reading list

Intoduction to Python

Variables and data types

OOPs Concepts

Conditional statement

Looping Constructs

Data Structures

String Manipulation

Functions

Modules, Packages and Standard Libraries

Python Libraries for Data Science

Reading Data Files in Python

Preprocessing, Subsetting and Modifying Pandas Dataframes

Sorting and Aggregating Data in Pandas

Visualizing Patterns and Trends in Data

Programming

The Ultimate Swiss Army Knife of ‘apply’ Family in R

Overview

Introduction

Table of Contents

Setting the Context

apply()

lapply()

sapply()

Scenario 1: Length of each element = 1

Scenario 2: Length of each element > 1 and same

Scenario 3: Length of each element > 1 and different

vapply()

tapply()

mapply()

End Notes

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM

CLID

SRM_B

Google (7)

_gid

_ga_#

_gat_#

collect

AEC

G_ENABLED_IDPS

test_cookie

Webengage (2)

_we_us

WebKlipperAuth

LinkedIn (16)

ln_or

JSESSIONID

li_rm

AnalyticsSyncHistory

lms_analytics

liap

visit

li_at

s_plt

lang

s_tp

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg