30 Quick Pandas Tips and Tricks for Beginners

Ayushi Trivedi Last Updated : 06 May, 2024
9 min read

Introduction

Reading this article is essential for anyone looking to enhance their proficiency in Python data manipulation and analysis using Pandas. Whether you’re a beginner or an experienced data practitioner, this tutorial offers a wealth of insights and techniques to streamline your workflow and maximize your productivity. By mastering the 30 tips and tricks outlined in this article, you’ll gain valuable skills for reading and exporting data, modifying datasets, performing analysis, and visualizing results. Explaining each tip clearly and concisely, the article makes it easy to understand and implement both basic concepts and advanced strategies. By investing your time in this article, you’ll gain the knowledge and confidence needed to excel in your data projects and propel your career forward.


You can also enroll in our free python course today!

30 Quick Pandas Tips and Tricks for Beginners

Here are 30 Python tips and tricks using pandas:

1. Read Data from CSV

Python users can easily manipulate and analyze data by reading it from a CSV file into a DataFrame using the pandas function pd.read_csv(). By default it expects comma-separated values, and its many parameters let you control the delimiter, header row, column types, and how missing values are parsed.

Read data from a CSV file into a DataFrame:

import pandas as pd
df = pd.read_csv('data.csv')

This code snippet allows data analysis and manipulation using pandas functionalities by importing the pandas library and reading data from a CSV file called “data.csv” into a pandas DataFrame called “df.”
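As a quick self-contained sketch, the same call works on any file-like object, so an inline CSV via io.StringIO (with made-up columns) stands in for a file on disk:

```python
import io

import pandas as pd

# A small inline CSV stands in for a file on disk (hypothetical data).
csv_text = "name,score\nAlice,90\nBob,85\n"

# read_csv accepts file paths or file-like objects alike.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)          # (2, 2)
print(list(df.columns))  # ['name', 'score']
```

In a real script you would pass the path string 'data.csv' instead of the StringIO object.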

2. Display DataFrame

The code df.head() returns the first few rows of a DataFrame. The head() function in pandas returns the first n rows, which defaults to 5 if no argument is passed.

Display the first few rows of a DataFrame:

print(df.head())

This code returns the first five rows, as no argument has been passed.

3. Select Columns

Select specific columns from a DataFrame:

selected_columns = df[['column1', 'column2']]

By extracting the specified columns ‘column1’ and ‘column2’ from the original DataFrame df, this code creates a new DataFrame named selected_columns, enabling focused analysis or operations on those specific columns without changing the original DataFrame.

4. Filter Rows

Filter rows based on a condition:

filtered_data = df[df['column'] > 0]

The code filters DataFrame df using boolean indexing: it selects the rows where the value in the column named ‘column’ is greater than 0 and assigns them to filtered_data.
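For instance, with a small hypothetical DataFrame, the boolean-indexing pattern looks like this:

```python
import pandas as pd

# Hypothetical data with positive, zero, and negative values.
df = pd.DataFrame({'column': [-2, 5, 0, 7]})

# Boolean indexing keeps only rows where the condition is True.
filtered_data = df[df['column'] > 0]

print(filtered_data['column'].tolist())  # [5, 7]
```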

5. Group By and Aggregate

groupby in Pandas organizes data into groups based on specified columns or conditions. agg applies aggregate functions, like mean or sum, to each group, producing a summary statistic for each.

Group by a column and perform aggregation:

grouped_data = df.groupby('column').agg({'column2': 'mean'})

This code aggregates data in DataFrame ‘df’ by unique values in ‘column’, then computes the mean of ‘column2’ for each group. The result is stored in a new DataFrame named ‘grouped_data’.
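A minimal runnable sketch with hypothetical values, keeping the generic column names from the snippet above:

```python
import pandas as pd

# Hypothetical data: two groups in 'column', numeric values in 'column2'.
df = pd.DataFrame({
    'column': ['east', 'east', 'west'],
    'column2': [10.0, 20.0, 30.0],
})

# agg with a dict maps each column to an aggregation function.
grouped_data = df.groupby('column').agg({'column2': 'mean'})

print(grouped_data.loc['east', 'column2'])  # 15.0
print(grouped_data.loc['west', 'column2'])  # 30.0
```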

6. Sort DataFrame

The sort_values() function in Pandas sorts the DataFrame ‘df’ by the column (or list of columns) passed to its by parameter, in ascending order by default. This function helps in reordering rows based on the values in the chosen columns.

Sort DataFrame by one or more columns:

sorted_df = df.sort_values(by='column', ascending=False)

This code creates a new DataFrame called “sorted_df” by sorting the DataFrame “df” based on values in “column” in descending order.

7. Handle Missing Values

The dropna() function returns a copy of DataFrame ‘df’ with rows containing missing values (NaN) removed, effectively reducing its size. Conversely, fillna(value) returns a copy in which missing values are filled with a specified ‘value’, providing a way to handle or impute missing data. Neither modifies ‘df’ unless the result is assigned back or inplace=True is passed.

Handle missing values in DataFrame:

df = df.dropna()  # Drop rows with missing values
df = df.fillna(value)  # Fill missing values with a specific value
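To see the copy-versus-in-place behavior concretely, here is a small sketch with hypothetical data:

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value.
df = pd.DataFrame({'a': [1.0, np.nan, 3.0]})

# Both methods return a new DataFrame; the original is unchanged
# unless the result is assigned back (or inplace=True is passed).
dropped = df.dropna()
filled = df.fillna(0)

print(len(dropped))            # 2
print(filled['a'].tolist())    # [1.0, 0.0, 3.0]
print(df['a'].isna().sum())    # 1 -- the original still has the NaN
```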

8. Apply Function Row-wise

The apply() function in Pandas applies a function along an axis of the DataFrame ‘df’: with axis=0 (the default) the function receives each column, and with axis=1 it receives each row. It allows for flexible operations on DataFrame elements using custom or built-in functions.

Apply a function row-wise to DataFrame:

df['new_column'] = df.apply(lambda row: row['column1'] + row['column2'], axis=1)

This code creates a new column ‘new_column’ in the DataFrame ‘df’. By applying a lambda function row-wise, it adds values from “column1” and “column2” for each row to determine the values in this column.
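A minimal runnable version with hypothetical numbers:

```python
import pandas as pd

# Hypothetical numeric columns.
df = pd.DataFrame({'column1': [1, 2], 'column2': [10, 20]})

# axis=1 passes each row to the lambda as a Series.
df['new_column'] = df.apply(lambda row: row['column1'] + row['column2'], axis=1)

print(df['new_column'].tolist())  # [11, 22]
```

For simple element-wise arithmetic like this, the vectorized form df['column1'] + df['column2'] is faster; apply() shines when the per-row logic is more involved.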

9. Rename Columns

The rename() function in Pandas allows modifying the column or index labels of a DataFrame. It takes a dictionary where keys represent the old column names/index labels and values represent the new names, allowing for straightforward renaming.

Rename columns in DataFrame:

df.rename(columns={'old_column': 'new_column'}, inplace=True)

This code renames the ‘old_column’ column in DataFrame ‘df’ to ‘new_column’. When inplace=True, ‘df’ receives the modification straight away; no new DataFrame is created.

10. Concatenate DataFrames

Pandas’ pd.concat() function joins DataFrames together into a single DataFrame by concatenating them along rows (axis=0) or columns (axis=1). It offers an adaptable method for vertically or horizontally combining data from several sources.

Concatenate multiple DataFrames:

concatenated_df = pd.concat([df1, df2], axis=0)

This code concatenates DataFrames ‘df1’ and ‘df2’ along rows (axis=0), creating a new DataFrame named ‘concatenated_df’ that stacks ‘df2’ below ‘df1’. It merges the rows of both DataFrames vertically.

11. Merge DataFrames

Pandas’ pd.merge() function combines two DataFrames based on one or more common key columns, similar to a SQL join. The how parameter (‘inner’, ‘outer’, ‘left’, or ‘right’) controls which keys from each DataFrame are kept in the result.

Merge two DataFrames on a common column:

merged_df = pd.merge(df1, df2, on='common_column', how='inner')

This code creates a new DataFrame called “merged_df” by performing an inner join of DataFrames “df1” and “df2” on “common_column”, keeping only the rows whose key values appear in both DataFrames.
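A small self-contained sketch with hypothetical key and value columns:

```python
import pandas as pd

# Two hypothetical tables sharing a key column.
df1 = pd.DataFrame({'common_column': [1, 2, 3], 'left_val': ['a', 'b', 'c']})
df2 = pd.DataFrame({'common_column': [2, 3, 4], 'right_val': ['x', 'y', 'z']})

# how='inner' keeps only keys present in both frames (2 and 3 here).
merged_df = pd.merge(df1, df2, on='common_column', how='inner')

print(merged_df['common_column'].tolist())  # [2, 3]
print(list(merged_df.columns))  # ['common_column', 'left_val', 'right_val']
```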

12. Pivot Table

The pivot_table() function in Pandas reshapes a DataFrame to create a pivot table based on supplied columns. You can use one column as the new index, another to create additional columns, and a third column to provide values for the pivot table’s cell values. It enables the aggregation of duplicate entries through the use of different functions.

Create a pivot table from DataFrame:

pivot_table = df.pivot_table(index='index_column', columns='column', values='value_column', aggfunc='mean')

With ‘index_column’ as the rows, ‘column’ as the columns, and ‘value_column’ as the data to aggregate, this code builds a pivot table from DataFrame ‘df’. Duplicate entries are combined using the ‘mean’ function.
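The aggregation of duplicate entries can be seen in this minimal sketch with made-up data, where the pair (‘r1’, ‘A’) appears twice:

```python
import pandas as pd

# Hypothetical data; ('r1', 'A') occurs twice and will be averaged.
df = pd.DataFrame({
    'index_column': ['r1', 'r1', 'r2'],
    'column': ['A', 'A', 'B'],
    'value_column': [1.0, 3.0, 5.0],
})

pivot_table = df.pivot_table(
    index='index_column', columns='column',
    values='value_column', aggfunc='mean',
)

print(pivot_table.loc['r1', 'A'])  # 2.0 -- the mean of 1.0 and 3.0
print(pivot_table.loc['r2', 'B'])  # 5.0
```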

13. Date and Time Operations

In order to manipulate and analyze dates and times in a DataFrame, Pandas’ pd.to_datetime() function turns input into datetime objects. It can parse a number of different date formats and return a DatetimeIndex or datetime objects.

Convert string to datetime format and extract date/time components:

df['datetime_column'] = pd.to_datetime(df['datetime_column'])
df['year'] = df['datetime_column'].dt.year

This code converts the ‘datetime_column’ in DataFrame ‘df’ to datetime format using pd.to_datetime(). Then, it extracts the year from the datetime values and assigns them to a new column named ‘year’ using the dt.year accessor.
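A runnable sketch with hypothetical date strings:

```python
import pandas as pd

# Hypothetical ISO-format date strings.
df = pd.DataFrame({'datetime_column': ['2024-01-15', '2023-06-30']})

# Convert strings to datetime, then extract a component via the .dt accessor.
df['datetime_column'] = pd.to_datetime(df['datetime_column'])
df['year'] = df['datetime_column'].dt.year

print(df['year'].tolist())  # [2024, 2023]
```

The .dt accessor also exposes .month, .day, .dayofweek, and many other components.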

14. Resample Time Series Data

Time-series data in a DataFrame can have its frequency changed using Pandas’ resample() method, which requires a datetime-like index. Combined with aggregation functions like sum or mean, it can downsample (reduce frequency), upsample (increase frequency), or keep the same frequency.

Resample time series data to a different frequency:

resampled_df = df.resample('D').sum()

This code resamples the original DataFrame ‘df’ to a daily frequency (‘D’), building a new DataFrame called ‘resampled_df’ in which the values within each day are summed.
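A minimal sketch, assuming hypothetical intraday readings indexed by timestamps (resample needs such a datetime index):

```python
import pandas as pd

# Hypothetical readings at various times across two days.
idx = pd.to_datetime([
    '2024-01-01 09:00', '2024-01-01 17:00', '2024-01-02 12:00',
])
df = pd.DataFrame({'value': [1, 2, 4]}, index=idx)

# Downsample to one row per day, summing within each day.
resampled_df = df.resample('D').sum()

print(resampled_df['value'].tolist())  # [3, 4]
```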

15. Apply Function Element-wise

Apply a function element-wise to DataFrame:

df['new_column'] = df['column'].apply(lambda x: func(x))

By applying the function ‘func’ to each element in ‘column’ using the apply() function and a lambda expression, this code produces a new column called ‘new_column’ in DataFrame ‘df’.

16. Convert Categorical to Numeric

Categorical variables in Pandas are transformed into dummy or indicator variables using the pd.get_dummies() function. In order to represent each category in the original categorized column, it constructs a new DataFrame and assigns 1 or 0 to each category depending on whether it is present or absent.

Convert categorical variable to numeric using one-hot encoding:

encoded_df = pd.get_dummies(df, columns=['categorical_column'])

Each category in the ‘categorical_column’ of DataFrame ‘df’ is represented by a dummy variable in a new DataFrame ‘encoded_df’ created by this code. It transforms categorical data into a numerical format appropriate for machine learning models.
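A small sketch with a hypothetical color column; note that recent pandas versions return boolean dummy columns, hence the astype(int) below:

```python
import pandas as pd

# Hypothetical categorical data.
df = pd.DataFrame({'categorical_column': ['red', 'blue', 'red']})

# One dummy column per category, named '<column>_<category>'.
encoded_df = pd.get_dummies(df, columns=['categorical_column'])

print(list(encoded_df.columns))
# ['categorical_column_blue', 'categorical_column_red']
print(encoded_df['categorical_column_red'].astype(int).tolist())  # [1, 0, 1]
```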

17. Rolling Window Operations

The rolling() function in Pandas creates a rolling window object, enabling the application of functions like mean, sum, etc., over a specified window size along a series or DataFrame axis. It facilitates computation of rolling statistics for time-series or sequential data analysis.

Perform rolling window calculations on DataFrame:

rolling_mean = df['column'].rolling(window=3).mean()

This code computes the rolling mean of the ‘column’ in DataFrame ‘df’ using a window size of 3. It calculates the mean value for each window of consecutive elements, providing a smoothed representation of the data.
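A minimal runnable sketch with hypothetical values; note the first window-1 results are NaN because those windows are incomplete:

```python
import pandas as pd

# Hypothetical sequential data.
df = pd.DataFrame({'column': [1.0, 2.0, 3.0, 4.0]})

# Mean over each window of 3 consecutive values.
rolling_mean = df['column'].rolling(window=3).mean()

print(rolling_mean.tolist())  # [nan, nan, 2.0, 3.0]
```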

18. String Operations

Perform string operations on DataFrame columns:

df['new_column'] = df['column'].str.upper()

The DataFrame ‘df’ gains a new column called ‘new_column’, with each value being the uppercase equivalent of the corresponding value in the ‘column’. It converts the strings to uppercase using Pandas’ str.upper() method.

19. Drop Duplicates

Remove duplicate rows from DataFrame:

unique_df = df.drop_duplicates()

This code creates a new DataFrame ‘unique_df’ by removing duplicate rows from DataFrame ‘df’ with the drop_duplicates() method, keeping the first occurrence of each duplicated row by default.

20. Sampling

Pandas’ sample() function provides a random sample of the data by selecting a predetermined number of rows (one by default) at random from the DataFrame ‘df’. It is helpful for investigating or examining a portion of the dataset.

Randomly sample rows from DataFrame:

sampled_df = df.sample(n=100)

This code creates a new DataFrame ‘sampled_df’ by randomly sampling 100 rows from DataFrame ‘df’. It provides a subset of the original data for analysis or processing.

21. Interpolate Missing Values

Pandas’ interpolate() function fills in missing values in DataFrame ‘df’ using techniques such as linear (the default), polynomial, or spline interpolation. It helps to smooth out the data by estimating the missing values based on the values of nearby data points.

Interpolate missing values in DataFrame:

df.interpolate(method='linear', inplace=True)

In DataFrame “df,” this code replaces missing values with interpolated values based on nearby data points through linear interpolation. When inplace=True, ‘df’ receives the changes directly, negating the need to create a new DataFrame.

22. Apply Custom Aggregation

DataFrame rows can be grouped using the groupby() function in Pandas based on unique values in one or more columns. It generates a GroupBy object, which may be applied to the grouped data to perform several operations such as filtering, transformation, and aggregation.

Apply custom aggregation functions in groupby:

custom_agg = df.groupby('column').agg(custom_func)

‘custom_func’ is a custom aggregation function that is applied to each group of DataFrame ‘df’ based on unique values in ‘column’. The ‘custom_agg’ DataFrame is created by aggregating data in accordance with the supplied custom logic.

23. Convert Data Types

To convert a DataFrame column’s data type to a desired type in Pandas, use the astype() method. The column’s values are transformed into the designated data type, guaranteeing consistency and suitability for additional operations or analysis.

Convert data types of DataFrame columns:

df['column'] = df['column'].astype('float')

In order to ensure that numerical values are represented in a floating-point format for arithmetic operations, this code uses the astype() method to convert the data type of the ‘column’ in DataFrame ‘df’ to float.

24. Ranking

The rank() function in Pandas assigns ranks to the values in a Series or DataFrame column. By default, it ranks the values in ascending order. You can specify ascending=False to rank them in descending order.

Rank rows in DataFrame:

df['rank'] = df['column'].rank(ascending=False)

This code calculates the rank of values in the ‘column’ of DataFrame ‘df’, assigning a rank to each value based on its position when sorted in descending order. The ranks are stored in a new column named ‘rank’.

25. Crosstabulation

Pandas’ pd.crosstab() function computes a cross-tabulation from two or more factors (variables). By tallying the frequency of occurrences for every combination of factor values, it offers valuable information about the relationships among variables.

Create a cross-tabulation of two or more factors:

cross_tab = pd.crosstab(df['factor1'], df['factor2'])

By generating a cross-tabulation, or contingency table, from DataFrame ‘df’, this code illustrates the frequency of occurrences for every combination of levels of ‘factor1’ and ‘factor2’. The result is stored in a new DataFrame called ‘cross_tab’.
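A small sketch with two hypothetical factors:

```python
import pandas as pd

# Hypothetical observations of two categorical variables.
df = pd.DataFrame({
    'factor1': ['a', 'a', 'b'],
    'factor2': ['x', 'y', 'x'],
})

# Rows are levels of factor1, columns are levels of factor2,
# and each cell counts how often that combination occurs.
cross_tab = pd.crosstab(df['factor1'], df['factor2'])

print(cross_tab.loc['a', 'x'])  # 1
print(cross_tab.loc['b', 'y'])  # 0 -- ('b', 'y') never occurs
```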

26. Convert DataFrame to Numpy Array

The data in DataFrame ‘df’ is converted into a NumPy array using Pandas’ to_numpy() method. This makes it simple to integrate with other libraries and makes numerical calculations more effective.

Convert DataFrame to a Numpy array:

np_array = df.to_numpy()

This code converts DataFrame ‘df’ into a NumPy array ‘np_array’, maintaining the underlying data structure while allowing for compatibility with NumPy-based operations and libraries.

27. Datetime Indexing

To set a certain column as a DataFrame’s index, use Pandas’ set_index() function. In order to facilitate indexing and alignment operations, it modifies the DataFrame’s index labels to correspond to the values in the selected column.

Set datetime column as index for time series analysis:

df.set_index('datetime_column', inplace=True)

By setting the ‘datetime_column’ as the DataFrame ‘df’ index, this code essentially modifies the index labels to reflect the values in that column. When inplace=True, ‘df’ immediately receives the modification; no new DataFrame is created.

28. Memory Usage

Pandas’ memory_usage() function determines how much memory is used by each column in DataFrame ‘df’. It gives information about the memory footprint of the DataFrame by returning a Series with the memory use of each column expressed in bytes.

Check memory usage of DataFrame:

memory_usage = df.memory_usage(deep=True)

This code computes the memory usage of DataFrame ‘df’, accounting for the memory consumed by the data and index values, particularly useful for analyzing memory usage of large datasets. The deep=True parameter ensures accurate measurement by considering the memory usage of object dtypes and strings.

29. Drop Columns

Pandas’ drop() function eliminates rows or columns from a DataFrame according to provided labels (column names or index values). It makes it possible to eliminate unnecessary rows and columns, allowing for flexible data handling.

Drop columns from DataFrame:

df.drop(columns=['column1', 'column2'], inplace=True)

This code deletes ‘column1’ and ‘column2’ from DataFrame ‘df’. When inplace=True, ‘df’ receives the modification straight away; no new DataFrame is created.

30. Export Data to CSV

The to_csv() method in Pandas writes the DataFrame ‘df’ to a CSV (Comma-Separated Values) file. It allows for exporting DataFrame data to a CSV file format, making it accessible for storage, sharing, or further analysis in other applications.

Export DataFrame to a CSV file:

df.to_csv('output.csv', index=False)

This code saves the DataFrame ‘df’ to a CSV file named ‘output.csv’ without including the index values as a separate column. Writing the data to the CSV file makes it accessible for storage or sharing with other applications.

Conclusion

Gaining proficiency with Pandas opens up a vast array of Python data manipulation, analysis, and visualization options. By incorporating these 30 quick tips and tricks into your workflow, you can speed up your data tasks, extract more insightful information from your datasets, and sharpen your Pandas data manipulation skills. Whether you’re reading data from many sources, transforming and cleaning it, or doing sophisticated analysis, Pandas offers the tools you need to manage your data effectively. Keep trying, investigating, and learning, and you’ll continue to hone your Pandas skills and succeed in your data projects.


My name is Ayushi Trivedi. I am a B. Tech graduate. I have 3 years of experience working as an educator and content editor. I have worked with various python libraries, like numpy, pandas, seaborn, matplotlib, scikit, imblearn, linear regression and many more. I am also an author. My first book named #turning25 has been published and is available on amazon and flipkart. Here, I am technical content editor at Analytics Vidhya. I feel proud and happy to be AVian. I have a great team to work with. I love building the bridge between the technology and the learner.
