Since you are reading this article, I am assuming that you are on your journey to becoming a data scientist. There is a high possibility that you are already aware of some data visualization and analytics tools like Excel, SQL and Tableau, and might have heard the name ‘Python’. In this article we will bring out another fruit from the data scientist’s basket and introduce R. But you might be thinking: ‘What is R?’, ‘Why R?’, ‘Is it not just an alternative to Python?’
Don’t worry! Before moving on to the wide range of topics this article covers, let’s begin with a very basic question: ‘What is R?’
Such an easy question! It is a programming language.
No! R is not just a programming language; it is a statistical programming language.
That is what sets it apart from Python, which is a general-purpose programming language and was not designed specifically for data manipulation and statistical analysis.
Essentially, R is:
A Statistical Language: It is popularly said that ‘R is a language made by statisticians, for statisticians’.
A Programming Language: Like every other programming language, including Python, we have to write code to derive the desired results and accomplish tasks.
An Object-Based Language: In R everything we create is saved as an object, and all operations are done on those objects.
A Dynamically Typed Language: Unlike languages such as C and C++, in R there is no need to declare the class of the objects we create. R automatically infers the data type of the object from its value.
An Open Source Language: It is available to the public and is completely free to use.
A Modular Language: There are various libraries (also known as packages) which contain pre-written code and can be used to expand the capability of the language.
We can gain a lot from this language and perform a wide range of tasks with it.
If you find it interesting so far, the next step in learning R is installing the software. Search for R and R Studio on Google and download both; install R first, since R Studio is an interface that runs on top of it. Alternatively, you can install R Studio from Anaconda as well.
Get familiar with the R Studio interface, mainly the Script, Console, Environment, History, Plots, Packages and Help panes.
We will be using an R Script for this article. To open an R Script, just follow these steps.
R Studio → File → New File → R Script
Remember: We will write the commands/code in the R Script and look for the output in the Console. (There are reasons for this, trust me!)
You are good to go now! Let’s begin!
In R Studio we can execute commands in multiple ways: with the Run, Source and Source with Echo buttons shown above the script, or with their keyboard shortcuts.
Shortcut for the Run button : Ctrl + Enter
Shortcut for the Source button : Ctrl + Shift + S
Shortcut for the Source with Echo button : Ctrl + Shift + Enter
Since the code we write in any programming language consists of logic, functions and syntax, it is necessary to learn the R syntax first.
Point 1: R is a case-sensitive language. Both user-defined and predefined objects must be written exactly as they were defined.
Point 2: Since the point above covers how objects must be written, it is important to mention the rules to follow while creating user-defined objects in R. These are known as the Object Naming Rules, and they state that the object name:
Can contain only alphabets
Can contain both alphabets and numbers
Cannot contain only numbers
Cannot start with a number
Cannot contain any special characters except . and _ ; a name cannot begin with _ (or with . followed by a number), and these characters are best kept in the middle of the name rather than at the beginning or end
Cannot contain any spaces
Should not coincide with any other object name, whether predefined or user defined.
Breaking some of these rules leads to an error; breaking others (such as reusing an existing name) is allowed by R but should be avoided, because the new object overwrites or masks the existing one.
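To see case sensitivity and the naming rules in action, here is a minimal sketch. The object names (sales2023, total_sales, total, Total) are made up for illustration; make.names() is a base R helper that shows how R itself would repair an invalid name.

```r
# Valid names: letters, digits, . and _ (not starting with a digit or _)
sales2023   <- 10
total_sales <- 20
total.sales <- 30

# R is case sensitive: total and Total are two different objects
total <- 1
Total <- 2
different_objects <- !identical(total, Total)
print(different_objects)

# make.names() turns invalid names into valid ones,
# which is a handy way to check whether a name follows the rules
repaired <- make.names(c("2sales", "total sales", "ok_name"))
print(repaired)
```

A name that comes back unchanged from make.names() was already valid.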
Writing Comments in R:
A) Single Line Comments: Use the # symbol. It is usually placed at the beginning of the line, but it can appear anywhere in the line; everything following the symbol is commented, i.e. not executed.
B) Multiple Line Comments: To comment a set of lines, i.e. make them non-executable, select those lines and press Ctrl+Shift+C in R Studio.
Fun Fact: We can write multiple commands in a single line just by separating them with a ; (semicolon).
How? For the following code:
var1=100
var2=200
var3=var1-var2
We can simply write :
var1=100 ; var2=200 ; var3=var1-var2
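Comments and the ; separator can be tried together in a short script. This is only a sketch that reuses the var1, var2 and var3 names from the example above.

```r
# This whole line is a comment and is not executed
var1 = 100   # a comment can also follow code on the same line

# Three commands on one line, separated by ;
var2 = 200; var3 = var1 - var2; print(var3)
```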
Operators are symbols used to perform certain operations on operands. The categories of operators and the symbols within each category are similar across most data science tools and languages, so they should not come as a surprise to you. Let’s quickly look at the various operators R has!
Arithmetic Operators: + , - , * , / (gives the exact answer) , %% (the modulus operator, which gives the remainder) , %/% (integer division)
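Here is a small sketch of the arithmetic operators at work. The numbers 17 and 5 and the object names are arbitrary choices for illustration.

```r
num1 <- 17
num2 <- 5

sum_result  <- num1 + num2    # addition
diff_result <- num1 - num2    # subtraction
prod_result <- num1 * num2    # multiplication
quot_result <- num1 / num2    # exact division
rem_result  <- num1 %% num2   # remainder after division
int_result  <- num1 %/% num2  # integer division (quotient without the remainder)

print(c(sum_result, diff_result, prod_result, quot_result, rem_result, int_result))
```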
Relational Operators: < , > , <= , >= , == , !=
So the relational operators result in the output as TRUE or FALSE.
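A quick sketch of the relational operators follows; the values 10 and 20 and the object names are illustrative only.

```r
m <- 10
n <- 20

is_less          <- m < n     # is m smaller than n?
is_greater       <- m > n     # is m larger than n?
is_less_equal    <- m <= 10   # is m at most 10?
is_greater_equal <- n >= 25   # is n at least 25?
is_equal         <- m == n    # are they equal?
is_not_equal     <- m != n    # are they different?

print(c(is_less, is_greater, is_less_equal, is_greater_equal, is_equal, is_not_equal))
```

Each comparison returns a single logical value, TRUE or FALSE.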
We just learnt relational operators. There are instances where we need to combine the results of several relational operators, and for that we have the logical operators.
Logical Operators : & (and) , | (or)
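As a minimal sketch, two relational checks can be combined like this. The age and income values and their thresholds are invented for the example.

```r
age    <- 25
income <- 40000

# & is TRUE only when both conditions are TRUE
both_conditions <- (age > 18) & (income > 50000)

# | is TRUE when at least one condition is TRUE
either_condition <- (age > 18) | (income > 50000)

print(both_conditions)
print(either_condition)
```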
Assignment Operators: = , <- , ->
<- is going to be the most used operator while writing R commands, whereas -> is going to be the least used.
What about = ?
You must have seen = used as the assignment operator in most other tools, including Python of course.
Well, in R both = and <- can be used to assign values. They do differ in some situations (for example, = is also what we use to pass arguments inside a function call), which you will get to know when you learn about data manipulation in R.
Let’s create some objects x, y and z using the <- operator.
The values are simply stored in the objects, which we can then see in the Environment.
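The original values of x, y and z are not shown here, so the ones below are assumptions chosen to cover three different data types. The sketch also shows the other two assignment operators for comparison.

```r
# Leftward assignment: the usual style in R (values are illustrative)
x <- 100
y <- "R"
z <- TRUE

# The same thing with the other two assignment operators
x_equals = 100        # = also assigns at the top level
100 -> x_rightward    # rightward assignment: value first, name last

print(x); print(y); print(z)
```

Nothing is printed by the assignments themselves; the objects simply appear in the Environment until we print them.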
Lastly, let’s introduce the operator we need whenever we want to use a function from a particular library without loading the whole library. (And we need this a LOT!)
Package Reference Operator: It is made up of 2 colons, written as ::, and allows us to use the functions defined in R libraries.
Want to see how? The syntax is:
LibraryName::FunctionName()
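As a small sketch of this syntax, the call below uses median() from the stats library. stats ships with R, so nothing needs to be installed; the numbers are arbitrary.

```r
# LibraryName::FunctionName() with a library that comes with R
middle_value <- stats::median(c(4, 8, 15, 16, 23))
print(middle_value)
```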
By now we have talked a bit about the ‘objects’ in R. Every object in R has a class, and this class can be a data type or a data structure. The concept of class is important in R mainly because the class of an object determines which functions can be used with it.
So we have 2 new terms: data type and data structure. Let’s look at each in detail.
The concept of data type is an old one, and by now we are quite familiar with the major data types used across tools and technologies. When I encounter this term, what immediately comes to mind is text, numbers, date and boolean. That’s pretty much it.
Most structured data can be put into these 4 categories, though the names differ slightly between tools. In R we can classify them as:
Text: This category can further be classified into character and factor. Anything stored in " " or ' ' has the class character; a factor stores text that takes a limited set of categories.
Numbers: This category can further be classified into numeric, integer and complex. We will encounter the numeric type the most.
Boolean: R has the logical data type, which holds the TRUE and FALSE values we saw as the output of relational operators.
Date: The most complicated data type in R. It is not a basic data type but a derived one, and hence a whole new topic that needs to be discussed separately.
To understand the concept of dates in R, refer to this article: Link to Article to be added before publishing
Previously we created a few objects: x, y and z.
Let’s determine their class (the data type here). The class of each object is printed in the Console.
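The class() function reports the class of any object. The sketch below assumes illustrative values for x, y and z (the originals are not shown), and adds a few extra objects to cover the other data types listed earlier.

```r
# Illustrative values, one per common data type
x <- 100
y <- "R"
z <- TRUE

class_x <- class(x)
class_y <- class(y)
class_z <- class(z)
print(c(class_x, class_y, class_z))

# The remaining types mentioned above
class_integer <- class(5L)                      # the L suffix makes an integer
class_complex <- class(2 + 3i)
class_factor  <- class(factor(c("low", "high")))
class_date    <- class(Sys.Date())
print(c(class_integer, class_complex, class_factor, class_date))
```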
Does the name ‘Type Casting’ suggest anything about the concept?
It does! Type refers to the data type we just learnt about, and casting refers to the conversion of this data type from one to another.
Essentially, Type Casting is the process of changing the data type of an object in R to another data type.
Suppose we have an object demo with some particular data type. To see this object as another data type, say new_datatype, we write the command as.new_datatype(demo) and we are done.
Note: Did you know that we can use the Console alone for both the input and the output? Let us use the Console for the rest of the article!
Suppose a is an object with the value 100, so its class is numeric.
To display this object as character we can write as.character(a).
And we get the value of the numeric object a as a character, i.e. “100”.
Remember that the object a remains numeric, as we didn’t save our result into any object.
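Here is a minimal sketch of that example. The name a_as_text is introduced only for illustration, to show what changes once the result is saved.

```r
a <- 100

# Displaying a as character does not change a itself
print(as.character(a))
class_of_a <- class(a)
print(class_of_a)

# To keep the converted value, save it into an object
a_as_text <- as.character(a)
class_of_text <- class(a_as_text)
print(class_of_text)
```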
Similarly we can write commands like :
as.character(demo)
as.numeric(demo)
as.logical(demo)
as.integer(demo)
and so on….
But this process has some rules; not every value can be transformed to every other type. The data types follow a precedence, according to which type casting is done.
Taking the most used data types, from highest to lowest: character, numeric, logical.
Here type casting can always be done from bottom to top (logical to numeric to character), but not always the other way round.
In general, a character object cannot be converted to numeric or logical unless its text happens to look like a number or a logical value; otherwise the result is NA. Similarly, converting from numeric to logical does not preserve the original value: 0 becomes FALSE and every other number becomes TRUE.
However there are various exceptions and special cases to this.
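A few conversions in both directions make the precedence easier to see. This is only a sketch with made-up values; suppressWarnings() is used because R warns when a conversion produces NA.

```r
# Going up the order always works
logical_to_numeric   <- as.numeric(TRUE)
numeric_to_character <- as.character(3.5)

# Going down works only when the text looks like a number
text_number <- as.numeric("42")
text_words  <- suppressWarnings(as.numeric("forty two"))

# Numeric to logical: 0 is FALSE, any other number is TRUE
zero_as_logical <- as.logical(0)
five_as_logical <- as.logical(5)

print(logical_to_numeric)
print(numeric_to_character)
print(text_number)
print(text_words)
print(c(zero_as_logical, five_as_logical))
```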
To learn about the concept of Type Casting in detail, I strongly suggest you go through this article: Link to Article to be added before publishing
We touched on this at the beginning when we discussed that R is a modular language and therefore has something known as a library (also called a package). R has two kinds of libraries: System libraries, which come with R, and User libraries, which number more than 18,000.
These libraries are available on CRAN (Comprehensive R Archive Network), which is a global repository of open source packages.
Now when it comes to libraries in R there are 3 things to keep in mind: Available Libraries, Installed Libraries and Loaded Libraries.
Available libraries are all the packages on CRAN, installed libraries are those installed on your system, and loaded libraries are the ones you explicitly load each time you open R in order to use their functions.
How to Install a Library ?
I will be using the GUI method here which is quite easy !
Go to Packages → Install → Give the Package Name
How to Load a Library?
Having a library installed on your system doesn’t mean that you can use the functions defined inside it at any time; you need to explicitly load that library, and the preferred way of doing so is: library(library_name)
(Note: dplyr is one of the most commonly used libraries in R. It contains various important predefined functions for data manipulation, such as mutate(), rename() and filter().)
We already met the package reference operator earlier, so let me shed some more light on it. It is used in the function call whenever we want to use a particular function from a particular library, for example dplyr::mutate().
Here the operator :: denotes that we are referring to the mutate function from the dplyr library.
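The install, load and reference steps can also be done from code. The sketch below uses the tools library, which comes with R, so that it runs without an internet connection; for dplyr the same pattern applies once it is installed. The file name sales.csv is just a string for the example.

```r
# Installing is a one-time step that needs an internet connection,
# so it is left commented out here:
# install.packages("dplyr")

# Check that a library is installed (this does not attach it)
is_installed <- requireNamespace("tools", quietly = TRUE)

# Load the library for the current session
library(tools)
is_loaded <- "package:tools" %in% search()

# Once loaded, its functions can be called directly...
ext_after_loading <- file_ext("sales.csv")

# ...or through the package reference operator, loaded or not
ext_with_reference <- tools::file_ext("sales.csv")

print(c(is_installed, is_loaded))
print(c(ext_after_loading, ext_with_reference))
```

search() lists the libraries currently loaded in the session, which is a quick way to tell installed and loaded libraries apart.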
Just a few minutes ago you read about the data types in R. In the same way, we have another concept known as data structures. The major ones are Vector and Data Frame.
A vector is an R object that stores multiple elements in a single object. It is also the building block of a data frame, around which most work in R revolves.
How to create a vector?
Use the c function and pass the values inside it.
For example, we can create one vector called name and another called numbers.
Here we took similar values to create the vectors: all character in the case of name and all numeric in the case of numbers. We can also create vectors from mixed values (character, numeric, logical etc. together), but in that case type casting occurs and the vector takes the highest data type among its elements as its class, according to the precedence rules we learnt above.
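The sketch below creates the two vectors and then a couple of mixed ones. The actual values are assumptions made for illustration.

```r
# Vectors with values of a single type
name    <- c("Asha", "Ravi", "Meera")
numbers <- c(10, 20, 30)
print(class(name))
print(class(numbers))

# Mixing types: every element is cast to the highest type present
mixed_with_text <- c("Asha", 25, TRUE)   # character wins
mixed_numbers   <- c(1, TRUE, FALSE)     # numeric wins over logical
print(mixed_with_text)
print(class(mixed_numbers))
```

Notice that 25 and TRUE are stored as text in the first mixed vector, so they can no longer be used in arithmetic without converting them back.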
A data frame is a two-dimensional data structure which is essentially made up of multiple one-dimensional data structures called vectors. Since it is 2D, it has rows and columns, where the columns are nothing but vectors and the rows are made up of the data that these vectors contain.
There are 2 ways to get a data frame:
Method 1 : Importing the data from some source and saving it as a data frame. We will be talking about this at the end.
Method 2 : Combining Vectors
For this we use the data.frame function.
Let’s create some vectors v1, v2 and v3 first.
Now we will combine v1, v2 and v3 to form a data frame called details.
Our data frame is ready. To display it, simply write details in the Console.
To view the data frame properly in a new tab, we can use the View function: simply write View(details).
Did you notice something?
Yes! All 3 vectors have the same size. Well, that’s necessary here.
So this is how we can create data frames in R.
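Put together, the steps look like the sketch below. The contents of v1, v2 and v3 are assumptions; only the names come from the text above.

```r
# Three vectors of the same length (values are illustrative)
v1 <- c(1, 2, 3)
v2 <- c("Asha", "Ravi", "Meera")
v3 <- c(TRUE, FALSE, TRUE)

# Each vector becomes one column of the data frame
details <- data.frame(v1, v2, v3)
print(details)

# Number of rows and columns
details_dim <- dim(details)
print(details_dim)

# In R Studio, View(details) opens the data frame in a new tab
# View(details)
```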
In R we can import the data (our data frame) from various sources.
To name a few, we can import CSV files, delimited files (such as tab-delimited files), Excel files (spreadsheets), SAS files, XML files etc.
The format we follow to import these data frames in R is:
df <- function_name("file_absolute_path/filename.extension")
R has different functions to read different types of files, and function_name stands for whichever one matches the file.
file_absolute_path refers to the absolute path of the file, which on Windows can be obtained by simply replacing \ with / in the path.
Finally, df is the name of the data frame in which we store the imported data.
Additional arguments can be added towards the end of the command for some file types.
Try importing the data frames yourself with a little bit of research on the functions and arguments required for various types of files.
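As a starting point, here is a sketch for the CSV case using read.csv(), which comes with R. So that it runs anywhere, it first writes a small made-up file (students.csv, with invented columns) to a temporary folder; with your own data you would skip that step and pass your file path instead.

```r
# Create a small CSV file to import (illustrative data only)
csv_path <- file.path(tempdir(), "students.csv")
write.csv(data.frame(id = 1:3, score = c(88, 92, 79)),
          csv_path, row.names = FALSE)

# df <- function_name("file_absolute_path/filename.extension")
df <- read.csv(csv_path)
print(df)

# Additional arguments go after the path, for example:
df_first_two <- read.csv(csv_path, nrows = 2)   # read only the first 2 rows
print(df_first_two)

# Tab-delimited files can be read with read.delim() in the same way.
# Excel and SAS files need extra libraries, e.g. readxl::read_excel()
# and haven::read_sas(), which must be installed first.
```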
We have already used the word ‘function’ several times by now. Functions come in handy when we want to perform a certain task multiple times. Some functions are already defined in R, like sum() and min(), which find the sum of numbers and the least number amongst a set of numbers respectively. We can also create our own functions in R, popularly known as UDFs.
Functions can be categorized broadly as:
Built-In Functions: Those functions which are already defined in R.
User Defined Functions: Those functions in R which we write on our own.
Let’s get to know how we can write our own functions !
Quite a simple process! To define our own function in R we use the function keyword. The syntax is:
Your_function_name <- function(argument(s)){
Statement 1
Statement 2
.
.
}
Your_function_name: any identifier you want to give the function
argument(s): the user input
Let’s create a function cube which finds the cube of the number the user inputs.
With the function definition in place, we have our own function ‘cube’, which can be called using the statement cube(), passing the number (you want to find the cube of) as an argument.
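One possible way to write the cube function is sketched below. The argument name x is an assumption, and the original definition may have been written differently.

```r
# Function definition
cube <- function(x) {
  x^3   # the last evaluated expression is returned
}

# Function call
cube_of_three <- cube(3)
print(cube_of_three)

# Because R works on whole vectors, the same function handles several numbers
cubes <- cube(c(1, 2, 4))
print(cubes)
```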
Similarly, we can create more such functions according to our usage.
So here we come to the end of this article covering the basics of R. But this is just the start; to get the actual taste of programming in R we need to cover many more advanced topics.
In this article, we covered the introductory part of programming in the R language. I hope that by now you are confident enough to write commands in R and perform tasks like data manipulation, but this isn’t all! With R programming we can perform many more sophisticated tasks, and hence the learning must not stop here. Let’s move on to learning Statistical Analysis and Machine Learning with the R language.
Read more articles from our blog page!
If you still have any queries on the R language, do let me know in the comments below. We can have a quick chat there.
You can connect with me on LinkedIn: https://www.linkedin.com/in/ayushi-gupta25/