Reading this article is essential for anyone looking to enhance their proficiency in Python data manipulation and analysis using Pandas. Whether you’re a beginner or an experienced data practitioner, this tutorial offers a wealth of insights and techniques to streamline your workflow and maximize your productivity. By mastering the 30 tips and tricks outlined in this article, you’ll gain valuable skills for reading and exporting data, modifying datasets, performing analysis, and visualizing results. Explaining each tip clearly and concisely, the article makes it easy to understand and implement both basic concepts and advanced strategies. By investing your time in this article, you’ll gain the knowledge and confidence needed to excel in your data projects and propel your career forward.
You can also enroll in our free python course today!
Here are 30 Python tips and tricks using pandas:
Python users can easily manipulate and analyze data by reading it from a CSV file into a DataFrame using the pandas function pd.read_csv(). It handles incomplete or distorted data and recognizes the delimiter automatically.
Read data from a CSV file into a DataFrame:
import pandas as pd
df = pd.read_csv('data.csv')
This code snippet allows data analysis and manipulation using pandas functionalities by importing the pandas library and reading data from a CSV file called “data.csv” into a pandas DataFrame called “df.”
The code df.head() prints the first few rows of a DataFrame. The head() function in pandas defaults to returning the first n rows of a DataFrame, which is 5 if no argument is passed.
Display the first few rows of a DataFrame:
print(df.head())
This code returns the first five rows, as no argument has been passed.
Select specific columns from a DataFrame:
selected_columns = df[['column1', 'column2']]
By extracting specified columns ‘column1’ and ‘column2’ from the original DataFrame df, this code creates a new DataFrame named selected_columns, enabling focussed analysis or operations on those specific columns without changing the original DataFrame.
Filter rows based on a condition:
filtered_data = df[df['column'] > 0]
The code filters DataFrame df
to include rows where the value in column ‘column’ is greater than 0. It selects rows from DataFrame df
where the values in the column named ‘column’ are greater than 0 and assigns them to filtered_data.
groupby in Pandas organizes data into groups based on specified columns or conditions. agg applies aggregate functions, like mean or sum, to each group, producing a summary statistic for each.
Group by a column and perform aggregation:
grouped_data = df.groupby('column').agg({'column2': 'mean'})
This code aggregates data in DataFrame ‘df’ by unique values in ‘column’, then computes the mean of ‘column2’ for each group. The result is stored in a new DataFrame named ‘grouped_data’.
The sort_values() function in Pandas arranges the DataFrame ‘df’ based on its values, by default in ascending order, without specifying a particular column. This function helps in reordering rows based on the values across all columns.
Sort DataFrame by one or more columns:
sorted_df = df.sort_values(by='column', ascending=False)
This code creates a new DataFrame called “sorted_df” by sorting the DataFrame “df” based on values in “column” in descending order.
The dropna() function removes rows containing missing values (NaN) from the DataFrame ‘df’, effectively reducing its size. Conversely, fillna(value) fills missing values in ‘df’ with a specified ‘value’, providing a way to handle or impute missing data.
Handle missing values in DataFrame:
df.dropna() # Drop rows with missing values
df.fillna(value) # Fill missing values with a specific value
The apply() function in Pandas applies a function along either the rows (axis=0) or columns (axis=1) of the DataFrame ‘df’. It allows for flexible operations on DataFrame elements using custom or built-in functions.
Apply a function row-wise to DataFrame:
df['new_column'] = df.apply(lambda row: row['column1'] + row['column2'], axis=1)
This code creates a new column ‘new_column’ in the DataFrame ‘df’. By applying a lambda function row-wise, it adds values from “column1” and “column2” for each row to determine the values in this column.
The rename() function in Pandas allows modifying the column or index labels of a DataFrame. It takes a dictionary where keys represent the old column names/index labels and values represent the new names, allowing for straightforward renaming.
Rename columns in DataFrame:
df.rename(columns={'old_column': 'new_column'}, inplace=True)
This code renames the ‘old_column’ column in DataFrame ‘df’ to ‘new_column’. When inplace=True, ‘df’ receives the modification straight away; no new DataFrame is created.
Pandas’ pd.concat() function joins DataFrames together into a single DataFrame by concatenating them along rows (axis=0) or columns (axis=1). It offers an adaptable method for vertically or horizontally combining data from several sources.
Concatenate multiple DataFrames:
concatenated_df = pd.concat([df1, df2], axis=0)
This code concatenates DataFrame ‘df1’ and ‘df2’ along rows (axis=0), creating a new DataFrame named ‘concatenated_df’ stacking ‘df2’ below ‘df1’. It merges the rows of both DataFrames vertically.
Pandas’ pd.concat() function joins DataFrames together into a single DataFrame by concatenating them along rows (axis=0) or columns (axis=1). It offers a method for vertically or horizontally combining data from several sources.
Merge two DataFrames on a common column:
merged_df = pd.merge(df1, df2, on='common_column', how='inner')
This code creates a new DataFrame called “concatenated_df” by concatenating DataFrames “df1” and “df2” along the rows (axis=0) and stacking the rows of “df2” underneath “df1.”
The pivot_table() function in Pandas reshapes a DataFrame to create a pivot table based on supplied columns. You can use one column as the new index, another to create additional columns, and a third column to provide values for the pivot table’s cell values. It enables the aggregation of duplicate entries through the use of different functions.
Create a pivot table from DataFrame:
pivot_table = df.pivot_table(index='index_column', columns='column', values='value_column', aggfunc='mean')
With ‘index_column’ as the rows, ‘column’ as the columns, and ‘value_column’ as the data to aggregate, this code builds a pivot table from DataFrame ‘df’. To combine duplicate entries, the’mean’ function is utilized.
In order to manipulate and analyze dates and times in a DataFrame, Pandas’ pd.to_datetime() function turns input into datetime objects. It can parse a number of different date formats and return a DatetimeIndex or datetime objects.
Convert string to datetime format and extract date/time components:
df['datetime_column'] = pd.to_datetime(df['datetime_column'])
df['year'] = df['datetime_column'].dt.year
This code converts the ‘datetime_column’ in DataFrame ‘df’ to datetime format using pd.to_datetime(). Then, it extracts the year from the datetime values and assigns them to a new column named ‘year’ using the dt.year accessor.
Time-series data in a DataFrame can have its frequency changed using Pandas’ resample() method. When using aggregation functions like sum, mean, etc., it can either upsample (increase frequency), downsample (reduce frequency), or maintain the same frequency.
Resample time series data to a different frequency:
resampled_df = df.resample('D').sum()
This code resamples the original DataFrame ‘df’ on a daily frequency (‘D’) to build a new DataFrame called’resampled_df’. The values inside each day are then added up by adding them together.
Apply a function element-wise to DataFrame:
df['new_column'] = df['column'].apply(lambda x: func(x))
By applying the function ‘func’ to each element in ‘column’ using the apply() function and a lambda expression, this code produces a new column called ‘new_column’ in DataFrame ‘df’.
Categorical variables in Pandas are transformed into dummy or indicator variables using the pd.get_dummies() function. In order to represent each category in the original categorized column, it constructs a new DataFrame and assigns 1 or 0 to each category depending on whether it is present or absent.
Convert categorical variable to numeric using one-hot encoding:
encoded_df = pd.get_dummies(df, columns=['categorical_column'])
Each category in the ‘categorical_column’ of DataFrame ‘df’ is represented by dummy variables in a new DataFrame ‘encoded_df’ created by this code. It transforms numerical data into a format appropriate for machine learning models using categorical variables.
The rolling() function in Pandas creates a rolling window object, enabling the application of functions like mean, sum, etc., over a specified window size along a series or DataFrame axis. It facilitates computation of rolling statistics for time-series or sequential data analysis.
Perform rolling window calculations on DataFrame:
rolling_mean = df['column'].rolling(window=3).mean()
This code computes the rolling mean of the ‘column’ in DataFrame ‘df’ using a window size of 3. It calculates the mean value for each window of consecutive elements, providing a smoothed representation of the data.
Perform string operations on DataFrame columns:
df['new_column'] = df['column'].str.upper()
The DataFrame ‘df’ gains a new column called ‘new_column’, with each value being the uppercase equivalent of the corresponding value in the ‘column’. It converts the strings to uppercase using Pandas’ str.upper() method.
Remove duplicate rows from DataFrame:
unique_df = df.drop_duplicates()
This code creates a new column ‘new_column’ in DataFrame ‘df’, where each value is the uppercase version of the corresponding value in the ‘column’, achieved by using the str.upper() method in Pandas.
Pandas’ sample() function provides a random sample of the data by selecting a predetermined number of rows (one by default) at random from the DataFrame ‘df’. It is helpful for investigating or examining a portion of the dataset.
Randomly sample rows from DataFrame:
sampled_df = df.sample(n=100)
This code creates a new DataFrame ‘sampled_df’ by randomly sampling 100 rows from DataFrame ‘df’. It provides a subset of the original data for analysis or processing.
Pandas’ interpolate() function uses spline, polynomial, and linear interpolation techniques to fill in missing values in DataFrame ‘df’. It helps to smooth out the data by estimating the missing values based on the values of nearby data points.
Interpolate missing values in DataFrame:
df.interpolate(method='linear', inplace=True)
In DataFrame “df,” this code replaces missing values with interpolated values based on nearby data points through linear interpolation. When inplace=True, ‘df’ receives the changes directly, negating the need to create a new DataFrame.
DataFrame rows can be grouped using the groupby() function in Pandas based on unique values in one or more columns. It generates a GroupBy object, which may be applied to the grouped data to perform several operations such as filtering, transformation, and aggregation.
Apply custom aggregation functions in groupby:
custom_agg = df.groupby('column').agg(custom_func)
‘custom_func’ is a custom aggregation function that is applied to each group of DataFrame ‘df’ based on unique values in ‘column’. The ‘custom_agg’ DataFrame is created by aggregating data in accordance with the supplied custom logic.
To convert a DataFrame column’s data type to a desired type in Pandas, use the astype() method. The column’s values are transformed into the designated data type, guaranteeing consistency and suitability for additional operations or analysis.
Convert data types of DataFrame columns:
df['column'] = df['column'].astype('float')
In order to ensure that numerical values are represented in a floating-point format for arithmetic operations, this code uses the astype() method to convert the data type of the ‘column’ in DataFrame ‘df’ to float.
The rank() function in Pandas assigns ranks to the values in a Series or DataFrame column. By default, it ranks the values in ascending order. You can specify ascending=False to rank them in descending order.
Rank rows in DataFrame:
df['rank'] = df['column'].rank(ascending=False)
This code calculates the rank of values in the ‘column’ of DataFrame ‘df’, assigning a rank to each value based on its position when sorted in descending order. The ranks are stored in a new column named ‘rank’.
Pandas’ pd.crosstab() function computes a cross-tabulation from a DataFrame containing two or more factors, or variables. By tallying the frequency of occurrences for every combination of factor values, it offers valuable information about the correlations among variables.
Create a cross-tabulation of two or more factors:
cross_tab = pd.crosstab(df['factor1'], df['factor2'])
By generating a cross-tabulation or contingency table from DataFrame ‘df’, this code illustrates the frequency of occurrences for every possible combination of factor levels for the factors ‘factor1’ and ‘factor2’. The outcome is kept in a brand-new DataFrame called “cross_tab.”
The data in DataFrame ‘df’ is converted into a NumPy array using Pandas’ to_numpy() method. This makes it simple to integrate with other libraries and makes numerical calculations more effective.
Convert DataFrame to a Numpy array:
np_array = df.to_numpy()
This code converts DataFrame ‘df’ into a NumPy array ‘np_array’, maintaining the underlying data structure while allowing for compatibility with NumPy-based operations and libraries.
To set a certain column as a DataFrame’s index, use Pandas’ set_index() function. In order to facilitate indexing and alignment operations, it modifies the DataFrame’s index labels to correspond to the values in the selected column.
Set datetime column as index for time series analysis:
df.set_index('datetime_column', inplace=True)
By setting the ‘datetime_column’ as the DataFrame ‘df’ index, this code essentially modifies the index labels to reflect the values in that column. When inplace=True, ‘df’ immediately receives the modification; no new DataFrame is created.
Pandas’ memory_usage() function determines how much memory is used by each column in DataFrame ‘df’. It gives information about the memory footprint of the DataFrame by returning a Series with the memory use of each column expressed in bytes.
Check memory usage of DataFrame:
memory_usage = df.memory_usage(deep=True)
This code computes the memory usage of DataFrame ‘df’, accounting for the memory consumed by the data and index values, particularly useful for analyzing memory usage of large datasets. The deep=True parameter ensures accurate measurement by considering the memory usage of object dtypes and strings.
Pandas’ drop() function eliminates rows or columns from a DataFrame according to provided labels (column names or index values). It makes it possible to eliminate unnecessary rows and columns, allowing for flexible data handling.
Drop columns from DataFrame:
df.drop(columns=['column1', 'column2'], inplace=True)
This code deletes ‘Column1’ and ‘Column2’ from DataFrame ‘df’. When inplace=True, ‘df’ receives the modification straight away; no new DataFrame is created.
The to_csv()
method in Pandas writes the DataFrame ‘df’ to a CSV (Comma-Separated Values) file. It allows for exporting DataFrame data to a CSV file format, making it accessible for storage, sharing, or further analysis in other applications.
Export DataFrame to a CSV file:
df.to_csv('output.csv', index=False)
This code saves the DataFrame ‘df’ to a CSV file named ‘output.csv’ without including the index values as a separate column. Writing the data to the CSV file makes it accessible for storage or sharing with other applications.
Gaining proficiency with Pandas gives up a vast array of Python data manipulation, analysis, and visualization options. You may expedite your data chores, extract more insightful information from your datasets, and improve your proficiency with Pandas data manipulation by implementing these 30 short tips and techniques into your workflow. Pandas offers the tools you need to effectively manage your data chores, whether you’re reading data from many sources, transforming and cleaning it, or doing sophisticated analysis. You’ll continue to hone your Pandas skills and succeed in your data projects if you keep trying, investigating, and learning.