This article was published as a part of the Data Science Blogathon
Pandas is a widely used Python library. It is used in multiple stages of data analytics starting from data manipulation to data analysis. Pandas is supported by two important python packages i.e. NumPy & Matplotlib. In this article, we will explore Pandas’ package & various functions on an Income dataset. The use of these functions will efficiently help in data exploration.
Source: Google Images
Each dataset in Pandas is represented in a tabular format know as data frames. Every value within a column has the same data type, either text or numeric, but different columns can contain different data types.
Here are some pandas’ functions we will discuss in this article.
Once you first receive a new dataset, you want to quickly explore it and get a sense of its contents. Pandas have several methods for this. The first is the head, which returns the first few rows of the DataFrame. The head function is used to see the first few rows of the dataset.
income_data.head()
Name | Age | Gender | Occupation | Income | Expenses | |
0 | Santosh | 32 | Male | Service | 50000 | 20000 |
1 | Manish | 34 | Male | Business | 100000 | 40000 |
2 | Michael | 25 | Male | Service | 20000 | 10000 |
3 | Ramesh | 45 | Male | Service | 50000 | 20000 |
There is a total of 6 columns in the data frame. Some of the columns are numerical while others are categorical in nature.
E.g., Age, Income is a continuous variable while Gender, Occupation is a categorical variable. Once you are aware of how the data looks like you can plan your analytics strategies like whether to go for classification or regression kind of solution.
The info function highlights the total number of rows in the dataset, names of the columns, their data type, and any missing value. It is used to print the summary of a data frame.
income_data.info()
Column | Non-Null Count | Dtype | |
0 | Name | 7 non-null | object |
1 | Age | 7 non-null | int |
2 | Gender | 7 non-null | object |
3 | Occupation | 7 non-null | object |
4 | Income | 7 non-null | int |
5 | Expenses | 7 non-null | int |
It is very important to know the data types of variables that aides in understanding the nature of data. The info function also highlights which column has missing values.
The describe method computes some summary statistics for numerical columns, like mean and median. “count” is the number of non-missing values in each column. Describe is good for a quick overview of numeric variables.
income_data.describe()
Age | Income | Expenses | |
count | 7 | 7 | 7 |
mean | 31.571429 | 68571.42857 | 27142.85714 |
std | 6.900656 | 34965.96985 | 12198.75091 |
min | 25 | 20000 | 10000 |
25% | 26.5 | 50000 | 20000 |
50% | 32 | 60000 | 25000 |
75% | 33 | 87500 | 35000 |
max | 45 | 125000 | 45000 |
The shape attribute of the data frame contains a tuple that holds the number of rows followed by the number of columns. Since this is an attribute instead of a method, we write it without parentheses.
income_data.shape
Output:
(7,6)
.columns & .index
There are two more components of a DataFrame i.e. labels for columns and rows. The columns attribute contains column names, and the index attribute contains row numbers or row names.
income_data.columns
Output
Index(['Name', 'Age', 'Gender', 'Occupation', 'Income', 'Expenses'], dtype='object')
income_data.index
Output
RangeIndex(start=0, stop=7, step=1)
astype()
If you want to cast a Python object to a particular datatype astype() is used for this. It is a useful function in case data is not stored in the correct format. In the below example savings column is stored as a string, using astype() we will convert it back to integer.
income_data['Income']= income_data['Income'].astype('int')
Output
Income is an integer column now.
The first thing you can do is to change the order of the rows by sorting them so that the most interesting data is at the top of the DataFrame. You can sort rows using the sort_values method, passing in a column name that you want to sort by. For example, when we apply sort_values on the age, income column of the income DataFrame, we get the oldest person at the top, Mr. Ramesh, and the youngest person at the bottom.
income_data.sort_values(['Age','Income'], ascending=[False, True]).head()
Name | Age | Gender | Occupation | Income | Expenses |
Ramesh | 45 | Male | Service | 50000 | 20000 |
Manish | 34 | Male | Business | 100000 | 40000 |
Santosh | 32 | Male | Service | 50000 | 20000 |
Mohan | 32 | Male | Business | 125000 | 45000 |
Ragini | 28 | Female | Service | 60000 | 25000 |
We are often required to sort the dataset by a single column or multiple columns. Sorting can be done in both ascending & descending order.
value_counts() is used when you want to see the count of unique values of various columns. Below is the example which highlights the same on income data for the income column.
income_data['Income'].value_counts()
Output:
50000 2 75000 1 100000 1 60000 1 20000 1 125000 1 Name: Income, dtype: int64
In the above result, we can see that income 50000 is present two times.
drop_duplicates() removes duplicate rows from the dataset. This is a very important function when the dataset contains a lot of duplicate values
income_data.drop_duplicates(inplace=True)
Inplace=True implies that the changes will be made to the original dataset.
Subsetting columns & rows
Sometimes we may want to select just one column. We can do this using the name of the DataFrame, followed by square brackets with a column name inside. To select multiple columns, you need two pairs of square brackets. In this code, the inner and outer square brackets are performing different tasks. The outer square brackets are responsible for subsetting the DataFrame, and the inner square brackets are creating a list of column names to subset as shown below.
Name | Age | Occupation |
Santosh | 32 | Service |
Manish | 34 | Business |
Michael | 25 | Service |
Ramesh | 45 | Service |
Mohan | 32 | Business |
There are lots of different ways to subset rows. The most common way to do this is by creating a logical condition to filter against. For example, let’s find all the persons whose income is greater than 50000.
Name | Age | Gender | Occupation | Income | Expenses |
Manish | 34 | Male | Business | 100000 | 40000 |
Mohan | 32 | Male | Business | 125000 | 45000 |
Shweta | 25 | Female | Service | 75000 | 30000 |
Ragini | 28 | Female | Service | 60000 | 25000 |
Adding a column in the data frame goes by various names, including mutating a DataFrame, transforming a DataFrame, and feature engineering. In the below example, we have created a new column ‘Savings’ from the formula (Income-Expenses)
income_data['Savings']= income_data['Income'] - income_data['Expenses'] income_data.head()
Output
Name | Age | Gender | Occupation | Income | Expenses | Savings |
Santosh | 32 | Male | Service | 50000 | 20000 | 30000 |
Manish | 34 | Male | Business | 100000 | 40000 | 60000 |
Michael | 25 | Male | Service | 20000 | 10000 | 10000 |
Ramesh | 45 | Male | Service | 50000 | 20000 | 30000 |
Mohan | 32 | Male | Business | 125000 | 45000 | 80000 |
set_index() & reset_index()
If you want to convert any column of a dataset as its index, it can be done via set_index. In the below example, passenger id is set as an index column. This is useful when you want to set one of the columns as the index.
Age | Gender | Occupation | Income | Expenses | |
Name | |||||
Santosh | 32 | Male | Service | 50000 | 20000 |
Manish | 34 | Male | Business | 100000 | 40000 |
Michael | 25 | Male | Service | 20000 | 10000 |
Ramesh | 45 | Male | Service | 50000 | 20000 |
Mohan | 32 | Male | Business | 125000 | 45000 |
Similarly, if you want to remove the index you can use reset_index. This is useful when you want to add an index as one of the columns.
Name | Age | Gender | Occupation | Income | Expenses | |
0 | Santosh | 32 | Male | Service | 50000 | 20000 |
1 | Manish | 34 | Male | Business | 100000 | 40000 |
2 | Michael | 25 | Male | Service | 20000 | 10000 |
3 | Ramesh | 45 | Male | Service | 50000 | 20000 |
4 | Mohan | 32 | Male | Business | 125000 | 45000 |
With loc and iloc you can do almost any data selection operation on DataFrames. loc is label-based, which means that you have to specify rows and columns based on their row and column labels. iloc is integer index-based, so you have to specify rows and columns by their integer index. loc and iloc also allow you to select both rows and columns from a DataFrame as shown below example.
Name | Age | Gender | Occupation | Income | Expenses |
Santosh | 32 | Male | Service | 50000 | 20000 |
Manish | 34 | Male | Business | 100000 | 40000 |
Ramesh | 45 | Male | Service | 50000 | 20000 |
Mohan | 32 | Male | Business | 125000 | 45000 |
Age | Gender |
32 | Male |
34 | Male |
25 | Male |
45 | Male |
Pandas groupBy() function is used to split the data into groups using some predefined criteria. In the below example we will calculate the mean income of the candidates by grouping it by age.
income_agg = income_data.groupby("Age")["Income"].mean() income_agg
Age 25 47500 28 60000 32 87500 34 100000 45 50000 Name: Income, dtype: int64
The pivot table takes column-wise data as input and groups the entries into a tabular format. The first argument is the column name Age that contains values to aggregate. The index argument highlights the columns to group by. The columns argument lists the columns to group by and display in columns. The default aggregation function is mean.
Occupation | Business | Service |
Gender | ||
Female | NaN | 26.5 |
Male | 33 | 34 |
I hope you enjoyed reading, and feel free to use my code to try it out for your purposes. Also, if there is any feedback on code or just the blog post, feel free to email me at [email protected].
The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.