30 Quick Pandas Tips and Tricks for Beginners

Ayushi Trivedi Last Updated : 06 May, 2024
9 min read

Introduction

Reading this article is essential for anyone looking to enhance their proficiency in Python data manipulation and analysis using Pandas. Whether you’re a beginner or an experienced data practitioner, this tutorial offers a wealth of insights and techniques to streamline your workflow and maximize your productivity. By mastering the 30 tips and tricks outlined in this article, you’ll gain valuable skills for reading and exporting data, modifying datasets, performing analysis, and visualizing results. Explaining each tip clearly and concisely, the article makes it easy to understand and implement both basic concepts and advanced strategies. By investing your time in this article, you’ll gain the knowledge and confidence needed to excel in your data projects and propel your career forward.


You can also enroll in our free python course today!

30 Quick Pandas Tips and Tricks for Beginners

Here are 30 Python tips and tricks using pandas:

1. Read Data from CSV

Python users can easily manipulate and analyze data by reading it from a CSV file into a DataFrame using the pandas function pd.read_csv(). By default it expects comma-separated values, and its many parameters let you control the delimiter, header row, column types, and how missing values are parsed.

Read data from a CSV file into a DataFrame:

import pandas as pd
df = pd.read_csv('data.csv')

This code snippet allows data analysis and manipulation using pandas functionalities by importing the pandas library and reading data from a CSV file called “data.csv” into a pandas DataFrame called “df.”
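As a quick self-contained sketch, the same call works on any file-like object, so an inline CSV via io.StringIO (with made-up columns) stands in for a file on disk:

```python
import io

import pandas as pd

# A small inline CSV stands in for a file on disk (hypothetical data).
csv_text = "name,score\nAlice,90\nBob,85\n"

# read_csv accepts file paths or file-like objects alike.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)          # (2, 2)
print(list(df.columns))  # ['name', 'score']
```

In a real script you would pass the path string 'data.csv' instead of the StringIO object.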

2. Display DataFrame

The code df.head() returns the first few rows of a DataFrame. The head() function in pandas returns the first n rows, which defaults to 5 if no argument is passed.

Display the first few rows of a DataFrame:

print(df.head())

This code returns the first five rows, as no argument has been passed.

3. Select Columns

Select specific columns from a DataFrame:

selected_columns = df[['column1', 'column2']]

By extracting the specified columns ‘column1’ and ‘column2’ from the original DataFrame df, this code creates a new DataFrame named selected_columns, enabling focused analysis or operations on those specific columns without changing the original DataFrame.

4. Filter Rows

Filter rows based on a condition:

filtered_data = df[df['column'] > 0]

The code filters DataFrame df using boolean indexing: it selects the rows where the value in the column named ‘column’ is greater than 0 and assigns them to filtered_data.
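For instance, with a small hypothetical DataFrame, the boolean-indexing pattern looks like this:

```python
import pandas as pd

# Hypothetical data with positive, zero, and negative values.
df = pd.DataFrame({'column': [-2, 5, 0, 7]})

# Boolean indexing keeps only rows where the condition is True.
filtered_data = df[df['column'] > 0]

print(filtered_data['column'].tolist())  # [5, 7]
```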

5. Group By and Aggregate

groupby in Pandas organizes data into groups based on specified columns or conditions. agg applies aggregate functions, like mean or sum, to each group, producing a summary statistic for each.

Group by a column and perform aggregation:

grouped_data = df.groupby('column').agg({'column2': 'mean'})

This code aggregates data in DataFrame ‘df’ by unique values in ‘column’, then computes the mean of ‘column2’ for each group. The result is stored in a new DataFrame named ‘grouped_data’.
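A minimal runnable sketch with hypothetical values, keeping the generic column names from the snippet above:

```python
import pandas as pd

# Hypothetical data: two groups in 'column', numeric values in 'column2'.
df = pd.DataFrame({
    'column': ['east', 'east', 'west'],
    'column2': [10.0, 20.0, 30.0],
})

# agg with a dict maps each column to an aggregation function.
grouped_data = df.groupby('column').agg({'column2': 'mean'})

print(grouped_data.loc['east', 'column2'])  # 15.0
print(grouped_data.loc['west', 'column2'])  # 30.0
```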

6. Sort DataFrame

The sort_values() function in Pandas sorts the DataFrame ‘df’ by the column (or list of columns) passed to its by parameter, in ascending order by default. This function helps in reordering rows based on the values in the chosen columns.

Sort DataFrame by one or more columns:

sorted_df = df.sort_values(by='column', ascending=False)

This code creates a new DataFrame called “sorted_df” by sorting the DataFrame “df” based on values in “column” in descending order.

7. Handle Missing Values

The dropna() function returns a copy of DataFrame ‘df’ with rows containing missing values (NaN) removed, effectively reducing its size. Conversely, fillna(value) returns a copy in which missing values are filled with a specified ‘value’, providing a way to handle or impute missing data. Neither modifies ‘df’ unless the result is assigned back or inplace=True is passed.

Handle missing values in DataFrame:

df = df.dropna()  # Drop rows with missing values
df = df.fillna(value)  # Fill missing values with a specific value
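To see the copy-versus-in-place behavior concretely, here is a small sketch with hypothetical data:

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value.
df = pd.DataFrame({'a': [1.0, np.nan, 3.0]})

# Both methods return a new DataFrame; the original is unchanged
# unless the result is assigned back (or inplace=True is passed).
dropped = df.dropna()
filled = df.fillna(0)

print(len(dropped))            # 2
print(filled['a'].tolist())    # [1.0, 0.0, 3.0]
print(df['a'].isna().sum())    # 1 -- the original still has the NaN
```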

8. Apply Function Row-wise

The apply() function in Pandas applies a function along an axis of the DataFrame ‘df’: with axis=0 (the default) the function receives each column, and with axis=1 it receives each row. It allows for flexible operations on DataFrame elements using custom or built-in functions.

Apply a function row-wise to DataFrame:

df['new_column'] = df.apply(lambda row: row['column1'] + row['column2'], axis=1)

This code creates a new column ‘new_column’ in the DataFrame ‘df’. By applying a lambda function row-wise, it adds values from “column1” and “column2” for each row to determine the values in this column.
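A minimal runnable version with hypothetical numbers:

```python
import pandas as pd

# Hypothetical numeric columns.
df = pd.DataFrame({'column1': [1, 2], 'column2': [10, 20]})

# axis=1 passes each row to the lambda as a Series.
df['new_column'] = df.apply(lambda row: row['column1'] + row['column2'], axis=1)

print(df['new_column'].tolist())  # [11, 22]
```

For simple element-wise arithmetic like this, the vectorized form df['column1'] + df['column2'] is faster; apply() shines when the per-row logic is more involved.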

9. Rename Columns

The rename() function in Pandas allows modifying the column or index labels of a DataFrame. It takes a dictionary where keys represent the old column names/index labels and values represent the new names, allowing for straightforward renaming.

Rename columns in DataFrame:

df.rename(columns={'old_column': 'new_column'}, inplace=True)

This code renames the ‘old_column’ column in DataFrame ‘df’ to ‘new_column’. When inplace=True, ‘df’ receives the modification straight away; no new DataFrame is created.

10. Concatenate DataFrames

Pandas’ pd.concat() function joins DataFrames together into a single DataFrame by concatenating them along rows (axis=0) or columns (axis=1). It offers an adaptable method for vertically or horizontally combining data from several sources.

Concatenate multiple DataFrames:

concatenated_df = pd.concat([df1, df2], axis=0)

This code concatenates DataFrames ‘df1’ and ‘df2’ along rows (axis=0), creating a new DataFrame named ‘concatenated_df’ that stacks ‘df2’ below ‘df1’. It merges the rows of both DataFrames vertically.

11. Merge DataFrames

Pandas’ pd.merge() function combines two DataFrames based on one or more common key columns, similar to a SQL join. The how parameter (‘inner’, ‘outer’, ‘left’, or ‘right’) controls which keys from each DataFrame are kept in the result.

Merge two DataFrames on a common column:

merged_df = pd.merge(df1, df2, on='common_column', how='inner')

This code creates a new DataFrame called “merged_df” by performing an inner join of DataFrames “df1” and “df2” on “common_column”, keeping only the rows whose key values appear in both DataFrames.
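A small self-contained sketch with hypothetical key and value columns:

```python
import pandas as pd

# Two hypothetical tables sharing a key column.
df1 = pd.DataFrame({'common_column': [1, 2, 3], 'left_val': ['a', 'b', 'c']})
df2 = pd.DataFrame({'common_column': [2, 3, 4], 'right_val': ['x', 'y', 'z']})

# how='inner' keeps only keys present in both frames (2 and 3 here).
merged_df = pd.merge(df1, df2, on='common_column', how='inner')

print(merged_df['common_column'].tolist())  # [2, 3]
print(list(merged_df.columns))  # ['common_column', 'left_val', 'right_val']
```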

12. Pivot Table

The pivot_table() function in Pandas reshapes a DataFrame to create a pivot table based on supplied columns. You can use one column as the new index, another to create additional columns, and a third column to provide values for the pivot table’s cell values. It enables the aggregation of duplicate entries through the use of different functions.

Create a pivot table from DataFrame:

pivot_table = df.pivot_table(index='index_column', columns='column', values='value_column', aggfunc='mean')

With ‘index_column’ as the rows, ‘column’ as the columns, and ‘value_column’ as the data to aggregate, this code builds a pivot table from DataFrame ‘df’. Duplicate entries are combined using the ‘mean’ function.
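The aggregation of duplicate entries can be seen in this minimal sketch with made-up data, where the pair (‘r1’, ‘A’) appears twice:

```python
import pandas as pd

# Hypothetical data; ('r1', 'A') occurs twice and will be averaged.
df = pd.DataFrame({
    'index_column': ['r1', 'r1', 'r2'],
    'column': ['A', 'A', 'B'],
    'value_column': [1.0, 3.0, 5.0],
})

pivot_table = df.pivot_table(
    index='index_column', columns='column',
    values='value_column', aggfunc='mean',
)

print(pivot_table.loc['r1', 'A'])  # 2.0 -- the mean of 1.0 and 3.0
print(pivot_table.loc['r2', 'B'])  # 5.0
```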

13. Date and Time Operations

In order to manipulate and analyze dates and times in a DataFrame, Pandas’ pd.to_datetime() function turns input into datetime objects. It can parse a number of different date formats and return a DatetimeIndex or datetime objects.

Convert string to datetime format and extract date/time components:

df['datetime_column'] = pd.to_datetime(df['datetime_column'])
df['year'] = df['datetime_column'].dt.year

This code converts the ‘datetime_column’ in DataFrame ‘df’ to datetime format using pd.to_datetime(). Then, it extracts the year from the datetime values and assigns them to a new column named ‘year’ using the dt.year accessor.
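A runnable sketch with hypothetical date strings:

```python
import pandas as pd

# Hypothetical ISO-format date strings.
df = pd.DataFrame({'datetime_column': ['2024-01-15', '2023-06-30']})

# Convert strings to datetime, then extract a component via the .dt accessor.
df['datetime_column'] = pd.to_datetime(df['datetime_column'])
df['year'] = df['datetime_column'].dt.year

print(df['year'].tolist())  # [2024, 2023]
```

The .dt accessor also exposes .month, .day, .dayofweek, and many other components.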

14. Resample Time Series Data

Time-series data in a DataFrame can have its frequency changed using Pandas’ resample() method, which requires a datetime-like index. Combined with aggregation functions like sum or mean, it can downsample (reduce frequency), upsample (increase frequency), or keep the same frequency.

Resample time series data to a different frequency:

resampled_df = df.resample('D').sum()

This code resamples the original DataFrame ‘df’ to a daily frequency (‘D’), building a new DataFrame called ‘resampled_df’ in which the values within each day are summed.
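A minimal sketch, assuming hypothetical intraday readings indexed by timestamps (resample needs such a datetime index):

```python
import pandas as pd

# Hypothetical readings at various times across two days.
idx = pd.to_datetime([
    '2024-01-01 09:00', '2024-01-01 17:00', '2024-01-02 12:00',
])
df = pd.DataFrame({'value': [1, 2, 4]}, index=idx)

# Downsample to one row per day, summing within each day.
resampled_df = df.resample('D').sum()

print(resampled_df['value'].tolist())  # [3, 4]
```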

15. Apply Function Element-wise

Apply a function element-wise to DataFrame:

df['new_column'] = df['column'].apply(lambda x: func(x))

By applying the function ‘func’ to each element in ‘column’ using the apply() function and a lambda expression, this code produces a new column called ‘new_column’ in DataFrame ‘df’.

16. Convert Categorical to Numeric

Categorical variables in Pandas are transformed into dummy or indicator variables using the pd.get_dummies() function. In order to represent each category in the original categorized column, it constructs a new DataFrame and assigns 1 or 0 to each category depending on whether it is present or absent.

Convert categorical variable to numeric using one-hot encoding:

encoded_df = pd.get_dummies(df, columns=['categorical_column'])

Each category in the ‘categorical_column’ of DataFrame ‘df’ is represented by a dummy variable in a new DataFrame ‘encoded_df’ created by this code. It transforms categorical data into a numerical format appropriate for machine learning models.
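A small sketch with a hypothetical color column; note that recent pandas versions return boolean dummy columns, hence the astype(int) below:

```python
import pandas as pd

# Hypothetical categorical data.
df = pd.DataFrame({'categorical_column': ['red', 'blue', 'red']})

# One dummy column per category, named '<column>_<category>'.
encoded_df = pd.get_dummies(df, columns=['categorical_column'])

print(list(encoded_df.columns))
# ['categorical_column_blue', 'categorical_column_red']
print(encoded_df['categorical_column_red'].astype(int).tolist())  # [1, 0, 1]
```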

17. Rolling Window Operations

The rolling() function in Pandas creates a rolling window object, enabling the application of functions like mean, sum, etc., over a specified window size along a series or DataFrame axis. It facilitates computation of rolling statistics for time-series or sequential data analysis.

Perform rolling window calculations on DataFrame:

rolling_mean = df['column'].rolling(window=3).mean()

This code computes the rolling mean of the ‘column’ in DataFrame ‘df’ using a window size of 3. It calculates the mean value for each window of consecutive elements, providing a smoothed representation of the data.
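A minimal runnable sketch with hypothetical values; note the first window-1 results are NaN because those windows are incomplete:

```python
import pandas as pd

# Hypothetical sequential data.
df = pd.DataFrame({'column': [1.0, 2.0, 3.0, 4.0]})

# Mean over each window of 3 consecutive values.
rolling_mean = df['column'].rolling(window=3).mean()

print(rolling_mean.tolist())  # [nan, nan, 2.0, 3.0]
```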

18. String Operations

Perform string operations on DataFrame columns:

df['new_column'] = df['column'].str.upper()

The DataFrame ‘df’ gains a new column called ‘new_column’, with each value being the uppercase equivalent of the corresponding value in the ‘column’. It converts the strings to uppercase using Pandas’ str.upper() method.

19. Drop Duplicates

Remove duplicate rows from DataFrame:

unique_df = df.drop_duplicates()

This code creates a new DataFrame ‘unique_df’ by removing duplicate rows from DataFrame ‘df’ with the drop_duplicates() method, keeping the first occurrence of each duplicated row by default.

20. Sampling

Pandas’ sample() function provides a random sample of the data by selecting a predetermined number of rows (one by default) at random from the DataFrame ‘df’. It is helpful for investigating or examining a portion of the dataset.

Randomly sample rows from DataFrame:

sampled_df = df.sample(n=100)

This code creates a new DataFrame ‘sampled_df’ by randomly sampling 100 rows from DataFrame ‘df’. It provides a subset of the original data for analysis or processing.

21. Interpolate Missing Values

Pandas’ interpolate() function fills in missing values in DataFrame ‘df’ using techniques such as linear (the default), polynomial, or spline interpolation. It helps to smooth out the data by estimating the missing values based on the values of nearby data points.

Interpolate missing values in DataFrame:

df.interpolate(method='linear', inplace=True)

In DataFrame “df,” this code replaces missing values with interpolated values based on nearby data points through linear interpolation. When inplace=True, ‘df’ receives the changes directly, negating the need to create a new DataFrame.

22. Apply Custom Aggregation

DataFrame rows can be grouped using the groupby() function in Pandas based on unique values in one or more columns. It generates a GroupBy object, which may be applied to the grouped data to perform several operations such as filtering, transformation, and aggregation.

Apply custom aggregation functions in groupby:

custom_agg = df.groupby('column').agg(custom_func)

‘custom_func’ is a custom aggregation function that is applied to each group of DataFrame ‘df’ based on unique values in ‘column’. The ‘custom_agg’ DataFrame is created by aggregating data in accordance with the supplied custom logic.

23. Convert Data Types

To convert a DataFrame column’s data type to a desired type in Pandas, use the astype() method. The column’s values are transformed into the designated data type, guaranteeing consistency and suitability for additional operations or analysis.

Convert data types of DataFrame columns:

df['column'] = df['column'].astype('float')

In order to ensure that numerical values are represented in a floating-point format for arithmetic operations, this code uses the astype() method to convert the data type of the ‘column’ in DataFrame ‘df’ to float.

24. Ranking

The rank() function in Pandas assigns ranks to the values in a Series or DataFrame column. By default, it ranks the values in ascending order. You can specify ascending=False to rank them in descending order.

Rank rows in DataFrame:

df['rank'] = df['column'].rank(ascending=False)

This code calculates the rank of values in the ‘column’ of DataFrame ‘df’, assigning a rank to each value based on its position when sorted in descending order. The ranks are stored in a new column named ‘rank’.

25. Crosstabulation

Pandas’ pd.crosstab() function computes a cross-tabulation from two or more factors (variables). By tallying the frequency of occurrences for every combination of factor values, it offers valuable information about the relationships among variables.

Create a cross-tabulation of two or more factors:

cross_tab = pd.crosstab(df['factor1'], df['factor2'])

By generating a cross-tabulation, or contingency table, from DataFrame ‘df’, this code illustrates the frequency of occurrences for every combination of levels of ‘factor1’ and ‘factor2’. The result is stored in a new DataFrame called ‘cross_tab’.
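A small sketch with two hypothetical factors:

```python
import pandas as pd

# Hypothetical observations of two categorical variables.
df = pd.DataFrame({
    'factor1': ['a', 'a', 'b'],
    'factor2': ['x', 'y', 'x'],
})

# Rows are levels of factor1, columns are levels of factor2,
# and each cell counts how often that combination occurs.
cross_tab = pd.crosstab(df['factor1'], df['factor2'])

print(cross_tab.loc['a', 'x'])  # 1
print(cross_tab.loc['b', 'y'])  # 0 -- ('b', 'y') never occurs
```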

26. Convert DataFrame to Numpy Array

The data in DataFrame ‘df’ is converted into a NumPy array using Pandas’ to_numpy() method. This makes it simple to integrate with other libraries and makes numerical calculations more effective.

Convert DataFrame to a Numpy array:

np_array = df.to_numpy()

This code converts DataFrame ‘df’ into a NumPy array ‘np_array’, maintaining the underlying data structure while allowing for compatibility with NumPy-based operations and libraries.

27. Datetime Indexing

To set a certain column as a DataFrame’s index, use Pandas’ set_index() function. In order to facilitate indexing and alignment operations, it modifies the DataFrame’s index labels to correspond to the values in the selected column.

Set datetime column as index for time series analysis:

df.set_index('datetime_column', inplace=True)

By setting the ‘datetime_column’ as the DataFrame ‘df’ index, this code essentially modifies the index labels to reflect the values in that column. When inplace=True, ‘df’ immediately receives the modification; no new DataFrame is created.

28. Memory Usage

Pandas’ memory_usage() function determines how much memory is used by each column in DataFrame ‘df’. It gives information about the memory footprint of the DataFrame by returning a Series with the memory use of each column expressed in bytes.

Check memory usage of DataFrame:

memory_usage = df.memory_usage(deep=True)

This code computes the memory usage of DataFrame ‘df’, accounting for the memory consumed by the data and index values, particularly useful for analyzing memory usage of large datasets. The deep=True parameter ensures accurate measurement by considering the memory usage of object dtypes and strings.

29. Drop Columns

Pandas’ drop() function eliminates rows or columns from a DataFrame according to provided labels (column names or index values). It makes it possible to eliminate unnecessary rows and columns, allowing for flexible data handling.

Drop columns from DataFrame:

df.drop(columns=['column1', 'column2'], inplace=True)

This code deletes ‘column1’ and ‘column2’ from DataFrame ‘df’. When inplace=True, ‘df’ receives the modification straight away; no new DataFrame is created.

30. Export Data to CSV

The to_csv() method in Pandas writes the DataFrame ‘df’ to a CSV (Comma-Separated Values) file. It allows for exporting DataFrame data to a CSV file format, making it accessible for storage, sharing, or further analysis in other applications.

Export DataFrame to a CSV file:

df.to_csv('output.csv', index=False)

This code saves the DataFrame ‘df’ to a CSV file named ‘output.csv’ without including the index values as a separate column. Writing the data to the CSV file makes it accessible for storage or sharing with other applications.

Conclusion

Gaining proficiency with Pandas opens up a vast array of Python data manipulation, analysis, and visualization options. By incorporating these 30 quick tips and tricks into your workflow, you can speed up your data tasks, extract more insightful information from your datasets, and sharpen your Pandas data manipulation skills. Whether you’re reading data from many sources, transforming and cleaning it, or doing sophisticated analysis, Pandas offers the tools you need to manage your data effectively. Keep trying, investigating, and learning, and you’ll continue to hone your Pandas skills and succeed in your data projects.


My name is Ayushi Trivedi. I am a B. Tech graduate. I have 3 years of experience working as an educator and content editor. I have worked with various python libraries, like numpy, pandas, seaborn, matplotlib, scikit, imblearn, linear regression and many more. I am also an author. My first book named #turning25 has been published and is available on amazon and flipkart. Here, I am technical content editor at Analytics Vidhya. I feel proud and happy to be AVian. I have a great team to work with. I love building the bridge between the technology and the learner.
