Python Joins: A Guide to Merge Pandas and Learn Different Join Methods

Abhishek Sharma Last Updated : 05 Nov, 2024

9 min read

“I have two different tables in join Python, but I’m not sure how to join them. What criteria should I consider? What are the different ways I can join these tables using merge pandas dataframes?”

Does this sound familiar? I have encountered this question plenty of times on online discussion forums. Working with one table is fairly straightforward, but things become challenging when we have data spread across two or more tables. We can use different programming languages to work with tables, but today, we will focus on Python.

Python programming already has functions for basic tasks like string joins and concatenating lists, but performing these tasks becomes much easier if we use packages. One such package is the pandas library, which you must have heard of if you are learning Python programming or working on machine learning projects.

This is where the concept of Joins comes in. I cannot emphasize the number of times I have used these Joins in Pandas! They’ve come in especially handy during data science hackathons when I needed to join multiple tables quickly. This article will mainly introduce you to the different types of joins in pandas and how to use them. It will also tell you how to deal with redundancy or duplicate values in the resulting dataframes.

Learning Objectives

Get acquainted with the different types of python joins in Pandas
Learn how to join strings in Python pandas
Learn how to handle redundancy and duplication in Joins

Note: If you’re new to the world of Pandas and Python, I recommend taking the below free courses:

Looking to learn SQL joins? We have you covered! Head over here to learn all about SQL joins.

How to Use Join in Python?
Understanding Python Joins
Inner Join in Pandas
Full Join in Pandas
Left Join in Pandas
Right Join in Pandas
Handling Redundancy/Duplicates in Python Joins
Conclusion
Frequently Asked Questions

How to Use Join in Python?

The join() method is like a friendly string conductor that combines individual strings into a harmonious whole. Imagine you have a group of strings (like a band), and you want to join them using a specific separator (like a musical note). That’s where join() steps in!

What Does `join()` Do?

The join() method takes all the items in an iterable (like a list or tuple) and joins them into a single string.
You get to choose the separator (the glue that holds the strings together).

Syntax:

Pythonresult_string = separator.join(iterable) AI-generated code. Review and use carefully.

Parameters

separator: A string that acts as the glue between the items.
iterable: Any collection of strings (e.g., a list, tuple, or even a string itself).

Understanding Python Joins

I’m sure you’re quite familiar with e-commerce sites like Amazon and Flipkart. We are bombarded by their advertisements when visiting non-related websites – that’s the power of targeted marketing!

We’ll take a simple problem from a related marketing brand here. We are given two tables – one containing product data and the other with customer-level information.

We will use these tables to understand hands-on how the different types of joins work using Merge Pandas.

Inner Join in Pandas

Inner join is the most common type of join you’ll be working with. It returns a dataframe with only those rows that have common characteristics.

An inner join requires each row in the two joined dataframes to have matching column values. This is similar to the intersection of two sets.
Let’s start by importing the Pandas library:

import pandas as pd

For this tutorial, we have two dataframes – product and customer. The product dataframe contains product details like Product_ID, Product_name, Category, Price, and Seller_City. The customer dataframe contains details like id, name, age, Product_ID, Purchased_Product, and City.

Our task is to use our joining skills and generate meaningful information from the data. I encourage you to follow along with the code we’ll cover in this tutorial.

Python Code

import pandas as pd

product=pd.DataFrame({
    'Product_ID':[101,102,103,104,105,106,107],
    'Product_name':['Watch','Bag','Shoes','Smartphone','Books','Oil','Laptop'],
    'Category':['Fashion','Fashion','Fashion','Electronics','Study','Grocery','Electronics'],
    'Price':[299.0,1350.50,2999.0,14999.0,145.0,110.0,79999.0],
    'Seller_City':['Delhi','Mumbai','Chennai','Kolkata','Delhi','Chennai','Bengalore']
})

print(product)

customer=pd.DataFrame({
    'id':[1,2,3,4,5,6,7,8,9],
    'name':['Olivia','Aditya','Cory','Isabell','Dominic','Tyler','Samuel','Daniel','Jeremy'],
    'age':[20,25,15,10,30,65,35,18,23],
    'Product_ID':[101,0,106,0,103,104,0,0,107],
    'Purchased_Product':['Watch','NA','Oil','NA','Shoes','Smartphone','NA','NA','Laptop'],
    'City':['Mumbai','Delhi','Bangalore','Chennai','Chennai','Delhi','Kolkata','Delhi','Mumbai']
})

In the above code snippet, we have passed a dict in the pandas DataFrame constructor. The dictionary has key-value pairs, where keys are Python strings, and the values are iterable objects; these could be lists, tuples, json, etc.

We want to know about all the products sold online and who purchased them. Using an inner join, we can easily do this.

The merge() function

Pandas is our friend here. By default, the merge pandas function performs an inner join. The merge function only takes series or dataframe objects as input for merging; if we pass a different data type, it will result in a typeerror. The two dataframe / series objects are passed in the merge function with a comma as the separator. For example:

pd.merge(product,customer,on='Product_ID')

The return value type of the merge function is dataframe.

There is another function in pandas called the concat function. The name suggests we can concatenate strings using this function, but it just concatenates two dataframes.

Here, I have performed inner join on the product and customer dataframes on the ‘Product_ID’ column.

But what if the column name strings are different in the two dataframes? Then, we have to mention both column names str explicitly.

‘left_on’ and ‘right_on’ are two parameters through which we can achieve this. ‘left_on’ is the name of the key in the left dataframe and ‘right_on’ in the right dataframe, so the

syntax

pd.merge(product,customer,left_on='Product_name',right_on='Purchased_Product')

Also, we cannot pass empty strings in left_on or right_on arguments

Let’s try the above code in the live coding window below!!

import pandas as pd
product=pd.DataFrame({
    'Product_ID':[101,102,103,104,105,106,107],
    'Product_name':['Watch','Bag','Shoes','Smartphone','Books','Oil','Laptop'],
    'Category':['Fashion','Fashion','Fashion','Electronics','Study','Grocery','Electronics'],
    'Price':[299.0,1350.50,2999.0,14999.0,145.0,110.0,79999.0],
    'Seller_City':['Delhi','Mumbai','Chennai','Kolkata','Delhi','Chennai','Bengalore']
})
print("Product Dataframe: \n\n")
print(product.head())

customer=pd.DataFrame({
    'id':[1,2,3,4,5,6,7,8,9],
    'name':['Olivia','Aditya','Cory','Isabell','Dominic','Tyler','Samuel','Daniel','Jeremy'],
    'age':[20,25,15,10,30,65,35,18,23],
    'Product_ID':[101,0,106,0,103,104,0,0,107],
    'Purchased_Product':['Watch','NA','Oil','NA','Shoes','Smartphone','NA','NA','Laptop'],
    'City':['Mumbai','Delhi','Bangalore','Chennai','Chennai','Delhi','Kolkata','Delhi','Mumbai']
})
print("\n\n Customer Dataframe: \n\n")
print(customer.head())
print("\n\n Merged Dataframe: \n\n")
print(pd.merge(product,customer,on='Product_ID'))

Let’s take things up a notch. The leadership team now wants more details about the products sold. They want to know about all the products the seller sells to the same city, i.e., both the seller and the customer belong to the same city.

In this case, we have to perform an inner join on both Product_ID and Seller_City of the product and Product_ID and City columns of the customer dataframe.

So, how can we do this?

Don’t scratch your head! We only passed single strings in the left_on or right_on arguments in the above example. We can pass an iterable object of column names to the left_on and right_on arguments, which can be a list or a tuple, as they are iterable in left_on or right_on:

pd.merge(product,customer,how='inner',left_on=['Product_ID','Seller_City'],right_on=['Product_ID','City'])

Here the list items are column names, which are the names of the columns on which we want to join two dataframes.

Full Join in Pandas

Here’s another interesting task for you. We have to combine both dataframes to find all the products that are not sold and all the customers who didn’t purchase anything from us.

We can use Full Join for this purpose.

Full Join, also known as Full Outer Join, returns all those records which either have a match in the left or right dataframe.

When rows in both the dataframes do not match, the resulting dataframe will have NaN for every dataframe column that lacks a matching row.

We can perform Full join by passing the how parameter as ‘outer’ to the merge() function, which takes input python strings as arguments:

pd.merge(product,customer,on='Product_ID',how='outer')

We cannot pass non-string values in the how parameter.

Did you notice what happened here? All the non-matching rows of both the dataframes have NaN values for the columns of other dataframes. But wait – we still don’t know which row belongs to which dataframe.

For this, Pandas provides us with a fantastic solution. By mentioning the indicator argument as True in the function, a new column of name _merge will be created in the resulting dataframe:

pd.merge(product,customer,on='Product_ID',how='outer',indicator=True)

As you can see, the _merge column mentions which row belongs to which dataframe.

Left Join in Pandas

Let’s say the leadership team wants information about only those customers who bought something from us. You guessed it—we can use the Left Join concept here.

Left join, also known as Left Outer Join, returns a dataframe containing all the rows of the left dataframe.

All the non-matching rows of the left dataframe contain NaN for the columns in the right dataframe. It is simply an inner join plus all the non-matching rows of the left dataframe filled with NaN for columns of the right dataframe.

Performing a left join is actually quite similar to a full join. Just change the how argument to a different str that is ‘left’:

pd.merge(product,customer,on='Product_ID',how='left')

Here, you can clearly see that all the unsold products contain NaN for the columns belonging to the customer dataframe.

Right Join in Pandas

Similarly, we can use the right join to create a table of customers, including information about the products they bought.

Right join, also known as Right Outer Join, is similar to the Left Outer Join. The only difference is that all the rows of the right dataframe are taken as it is and only those of the left dataframe that are common in both.

Similar to other joins, we can perform a right join by changing the how argument to ‘right’:

pd.merge(product,customer,on='Product_ID',how='right')

Take a look carefully at the above dataframe – we have NaN values for columns of the product dataframe. Pretty straightforward, right?

Handling Redundancy/Duplicates in Python Joins

Duplicate values can be tricky obstacles. They can cause problems while performing joins. These values won’t give an error but will create redundancy in our resulting dataframe. I’m sure you can imagine how harmful that can be!

Here, we have a new dataframe product_dup with duplicate details about products:

product_dup=pd.DataFrame({
'Product_ID':[101,102,103,104,105,106,107,103,107],
'Product_name':['Watch','Bag','Shoes','Smartphone','Books','Oil','Laptop','Shoes','Laptop',
'Category':['Fashion','Fashion','Fashion','Electronics','Study','Grocery','Electronics','Fashion','Electronics'],
'Price':[299.0,1350.50,2999.0,14999.0,145.0,110.0,79999.0,2999.0,79999.0],
'Seller_City':['Delhi','Mumbai','Chennai','Kolkata','Delhi','Chennai','Bengalore','Chennai','Bengalore']
})

Let’s see what happens if we perform an inner join on this dataframe:

pd.merge(product_dup,customer,how='inner',on='Product_ID')

As you can see, we have duplicate rows in the resulting dataset as well. To solve this, there is a validate argument in the merge() function, which we can set to ‘one_to_one,’ ‘one_to_many,’ ‘many_to_one,’ and‘many_to_many.’

This ensures that there exists only a particular mapping across both the dataframes. If the mapping condition is not satisfied, then it throws a MergeError. To solve this, we can delete duplicates before applying join:

pd.merge(product_dup.drop_duplicates(),customer,how='inner',on='Product_ID')

But, if you want to keep these duplicates, then you can give validate values as per your requirements, and it will not throw an error:

pd.merge(product_dup,customer,how='inner',on='Product_ID',validate='many_to_many')

Conclusion

By now, you must have strong expertise in the one-of-a-kind types of joins in Pandas and how to utilize the merge() feature to mix dataframes successfully. Mastering those joins will beautify your potential to control and analyze records, making your data science and gadget learning tasks more efficient and insightful.

Remember, at the same time as joins are powerful, they could introduce complexity, particularly while managing large datasets or redundant facts. Pandas also have other functions, such as concat() and append(), that can merge dataframes. I encourage you to explore and apply them in your next project alongside what you’ve learned about joins in this tutorial.

Key Takeaways

Joins in pandas refer to the many different ways functions in Python are used to join two dataframes.
The four main types of joins in pandas are Left join, Right join, Inner join, and Cross join.
The validate argument in the merge() function can detect and remove duplicate values in the newly formed dataset.

Courses:

Python Tutorials:

12 Useful Pandas Techniques in Python for Data Manipulation

Frequently Asked Questions

Q1. How to join 2 DataFrames in pandas?

A. We can perform the join operation on two dataframes in Pandas using multiple functions, such as the Pandas join function. We can also use the pandas merge function to do a merge operation on the dataframes. So, there are multiple ways to join two dataframes in pandas.

Q2. What are the different types of joins?

A. Mainly, there are 4 types of joins for a dataframe. They are Left join, Right join, Inner join, and Cross join.
The most frequently used joins for dataframes and tables are left join and inner join. In the inner join, all the records with matching keys are present, whereas in the left join, all the records from the left table are present.

Q3 What is the difference between inner and outer join in pandas?

A. The inner join only finds and returns all matching rows; it returns no unmatched rows. However, the outer join returns all matching rows and some unmatched ones as well. There are two types of outer join: the left outer join and the right outer join.

Abhishek Sharma

He is a data science aficionado, who loves diving into data and generating insights from it. He is always ready for making machines to learn through code and writing technical blogs. His areas of interest include Machine Learning and Natural Language Processing still open for something new and exciting.

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

MUID

Used by Microsoft Clarity, to store and track visits across websites.

Expiry: 1 Year

Type: HTTP

_clck

Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.

Expiry: 1 Year

Type: HTTP

_clsk

Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.

Expiry: 1 Day

Type: HTTP

SRM_I

Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Years

Type: HTTP

SM

Use to measure the use of the website for internal analytics

Expiry: 1 Years

Type: HTTP

CLID

The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.

Expiry: 1 Year

Type: HTTP

SRM_B

Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Months

Type: HTTP

_gid

This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.

Expiry: 399 Days

Type: HTTP

_ga_#

Used by Google Analytics, to store and count pageviews.

Expiry: 399 Days

Type: HTTP

_gat_#

Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.

Expiry: 1 Day

Type: HTTP

collect

Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.

Expiry: Session

Type: PIXEL

AEC

cookies ensure that requests within a browsing session are made by the user, and not by other sites.

Expiry: 6 Months

Type: HTTP

G_ENABLED_IDPS

use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.

Expiry: 2 Years

Type: HTTP

test_cookie

This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.

Expiry: 1 Year

Type: HTTP

_we_us

this is used to send push notification using webengage.

Expiry: 1 Year

Type: HTTP

WebKlipperAuth

used by webenage to track auth of webenagage.

Expiry: Session

Type: HTTP

ln_or

Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.

Expiry: 1 Day

Type: HTTP

JSESSIONID

Use to maintain an anonymous user session by the server.

Expiry: 1 Year

Type: HTTP

li_rm

Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.

Expiry: 1 Year

Type: HTTP

AnalyticsSyncHistory

Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

lms_analytics

Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

liap

Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.

Expiry: 6 Months

Type: HTTP

visit

allow for the Linkedin follow feature.

Expiry: 1 Year

Type: HTTP

li_at

often used to identify you, including your name, interests, and previous activity.

Expiry: 2 Months

Type: HTTP

s_plt

Tracks the time that the previous page took to load

Expiry: Session

Type: HTTP

lang

Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings

Expiry: Session

Type: HTTP

s_tp

Tracks percent of page viewed

Expiry: Session

Type: HTTP

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

Indicates the start of a session for Adobe Experience Cloud

Expiry: Session

Type: HTTP

s_pltp

Provides page name value (URL) for use by Adobe Analytics

Expiry: Session

Type: HTTP

s_tslv

Used to retain and fetch time since last visit in Adobe Analytics

Expiry: 6 Months

Type: HTTP

li_theme

Remembers a user's display preference/theme setting

Expiry: 6 Months

Type: HTTP

li_theme_set

Remembers which users have updated their display / theme preferences

Expiry: 6 Months

Type: HTTP

Reading list

Intoduction to Python

Variables and data types

OOPs Concepts

Conditional statement

Looping Constructs

Data Structures

String Manipulation

Functions

Modules, Packages and Standard Libraries

Python Libraries for Data Science

Reading Data Files in Python

Preprocessing, Subsetting and Modifying Pandas Dataframes

Sorting and Aggregating Data in Pandas

Visualizing Patterns and Trends in Data

Programming

Python Joins: A Guide to Merge Pandas and Learn Different Join Methods

Table of contents

How to Use Join in Python?

What Does join() Do?

Syntax:

Parameters

Understanding Python Joins

Inner Join in Pandas

Python Code

The merge() function

syntax

Full Join in Pandas

Left Join in Pandas

Right Join in Pandas

Handling Redundancy/Duplicates in Python Joins

Conclusion

Frequently Asked Questions

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM

CLID

SRM_B

Google (7)

_gid

_ga_#

_gat_#

collect

AEC

G_ENABLED_IDPS

test_cookie

Webengage (2)

_we_us

WebKlipperAuth

LinkedIn (16)

ln_or

JSESSIONID

li_rm

AnalyticsSyncHistory

lms_analytics

liap

visit

li_at

s_plt

lang

What Does `join()` Do?