Machine learning now helps the retail industry in many ways, from forecasting sales performance to identifying likely buyers. Market basket analysis is a data mining technique retailers use to increase sales by better understanding customer purchasing patterns. Analyzing large datasets, such as purchase histories, reveals product groupings and products likely to be purchased together. In this article, we cover market basket analysis and its components, then implement it in Python on a real-world dataset.
This article was published as a part of the Data Science Blogathon.
Market basket analysis is a strategic data mining technique retailers use to enhance sales by understanding customer purchasing patterns. This method involves examining substantial datasets, such as historical purchase records, to unveil inherent product groupings and identify items that customers tend to buy together.
By recognizing these co-occurrence patterns, retailers can make informed decisions to optimize inventory management, devise effective marketing strategies, employ cross-selling tactics, and refine store layouts for improved customer engagement.
For example, if customers are buying milk, how likely are they to also buy bread (and which kind of bread) on the same trip to the supermarket? This information may lead to an increase in sales by helping retailers do selective marketing based on predictions, cross-selling, and planning their shelf space for optimal product placement.
Now, just think of the universe as the set of items available at the store. Each item has a Boolean variable that represents its presence or absence. We can represent each basket with a Boolean vector of values assigned to these variables. We can then analyze the Boolean vectors to identify purchase patterns that reflect items frequently associated or bought together, representing such patterns as association rules.
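The Boolean-vector view described above can be sketched in a few lines of Python. The catalogue and the baskets here are hypothetical and exist only to make the idea concrete; a real store would have far more items.

```python
# Hypothetical catalogue and baskets, used only for illustration.
ITEMS = ["bread", "butter", "eggs", "milk"]

baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"eggs"},
    {"milk", "eggs"},
]


def to_boolean_vector(basket, items=ITEMS):
    """Return one True/False flag per catalogue item, in catalogue order."""
    return [item in basket for item in items]


def co_occurrence_count(vectors, i, j):
    """Count the baskets in which the items at positions i and j are both present."""
    return sum(1 for v in vectors if v[i] and v[j])


vectors = [to_boolean_vector(b) for b in baskets]
milk, bread = ITEMS.index("milk"), ITEMS.index("bread")
milk_and_bread = co_occurrence_count(vectors, milk, bread)

print(vectors[0])       # flags for bread, butter, eggs, milk in the first basket
print(milk_and_bread)   # number of baskets containing both milk and bread
```

Counting positions that are True together is the raw material from which association rules are later derived.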
Industry | Applications of Market Basket Analysis |
---|---|
Retail | Identify frequently purchased product combinations and create promotions or cross-selling strategies |
E-commerce | Suggest complementary products to customers and improve the customer experience |
Hospitality | Identify which menu items are often ordered together and create meal packages or menu recommendations |
Healthcare | Understand which medications are often prescribed together and identify patterns in patient behavior or treatment outcomes |
Banking/Finance | Identify which products or services are frequently used together by customers and create targeted marketing campaigns or bundle deals |
Telecommunications | Understand which products or services are often purchased together and create bundled service packages that increase revenue and improve the customer experience |
Let I = {I1, I2, …, Im} be the set of all items. Let D, the data, be a set of database transactions where each transaction T is a nonempty itemset such that T ⊆ I. Each transaction is associated with an identifier, called a TID. Let A be a set of items (an itemset). A transaction T is said to contain A if A ⊆ T. An association rule is an implication of the form A ⇒ B, where A ⊂ I, B ⊂ I, A ≠ ∅, B ≠ ∅, and A ∩ B = ∅. A is called the antecedent of the rule and B the consequent.
The rule A ⇒ B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A ∪ B (i.e., every item in A and every item in B). This is taken as the probability P(A ∪ B). The rule A ⇒ B has confidence c in the transaction set D, where c is the percentage of transactions in D containing A that also contain B. This is the conditional probability P(B|A). That is,
support(A ⇒ B) = P(A ∪ B)
confidence(A ⇒ B) = P(B|A) = support(A ∪ B) / support(A)
Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong.
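As a minimal sketch of these definitions, the snippet below computes support and confidence directly from a list of transactions and checks whether a rule is strong. The transactions, the function names, and the thresholds are all assumptions chosen for illustration.

```python
# Hypothetical transactions, used only for illustration.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"eggs"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate P(B|A) as support(A and B together) / support(A).

    Assumes the antecedent occurs at least once, otherwise this divides by zero.
    """
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def is_strong(antecedent, consequent, transactions, min_sup, min_conf):
    """A rule is strong when it meets both thresholds."""
    both = set(antecedent) | set(consequent)
    return (support(both, transactions) >= min_sup
            and confidence(antecedent, consequent, transactions) >= min_conf)


s = support({"milk", "bread"}, transactions)        # 2 of the 5 transactions
c = confidence({"milk"}, {"bread"}, transactions)   # 2 of the 3 milk transactions
strong = is_strong({"milk"}, {"bread"}, transactions, min_sup=0.3, min_conf=0.6)
```

Note that confidence is directional: the confidence of milk ⇒ bread is generally not the same as that of bread ⇒ milk, even though both rules share the same support.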
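Generally, association rule mining can be viewed as a two-step process:
1. Find all frequent itemsets, i.e., itemsets whose support is at least the minimum support threshold.
2. Generate strong association rules from the frequent itemsets, i.e., rules that satisfy both the minimum support and the minimum confidence thresholds.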
Association Rule Mining is primarily used to identify an association between different items in a set and then find frequent patterns in a transactional or relational database.
The best example of the association is seen in the following image.
Several data mining algorithms are used in market basket analysis. A key objective is to predict accurately which items customers are likely to buy together.
The Apriori algorithm is widely used and well known for association rule mining, which makes it a popular choice for market basket analysis in Python. It is considered more accurate than the AIS and SETM algorithms. It finds frequent itemsets in transactions and identifies association rules between them. Its main limitation is frequent itemset generation: it must scan the database many times, which is computationally costly and slows performance on large datasets. It relies on the concepts of support and confidence.
The AIS algorithm makes multiple passes over the entire transaction database, scanning all transactions in each pass. In the first pass, it counts the support of individual items and determines which are frequent (large) in the database. In each subsequent pass, it extends the large itemsets of the previous pass to generate candidate itemsets: as each transaction is scanned, it determines which large itemsets of the previous pass are contained in the transaction and extends them with other items from that transaction. AIS was the first published algorithm for generating all large itemsets in a transactional database.
It focused on enhancing databases with the functionality needed to process decision support queries. The technique is limited to rules with only one item in the consequent.
The SETM algorithm is quite similar to the AIS algorithm. It also makes multiple passes over the database. In the first pass, it counts the support of single items and determines which are frequent in the database. It then generates candidate itemsets by extending the large itemsets of the previous pass. In addition, SETM stores the TIDs (transaction IDs) of the generating transactions alongside the candidate itemsets.
The FP-Growth (Frequent Pattern Growth) algorithm represents the data as a frequent pattern tree, or FP-tree, and mines frequent itemsets from it. It improves on the Apriori algorithm because it generates frequent patterns without candidate generation. The FP-tree structure maintains the associations between the itemsets.
A frequent pattern tree is a tree structure built from the initial itemsets of the data. Its main purpose is to mine the most frequent patterns. Every node of the FP-tree represents an item of an itemset. The root node represents null, whereas the lower nodes represent the itemsets of the data. As the tree is built, it maintains the association of each node with the nodes below it, that is, between itemsets.
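To make the level-wise search and its repeated database scans concrete, here is a minimal, unoptimized sketch of Apriori's frequent itemset generation in plain Python. It is not the apyori implementation; the function name and the toy transactions are assumptions for illustration.

```python
from itertools import combinations


def apriori_frequent_itemsets(transactions, min_support):
    """Level-wise search for frequent itemsets; returns {frozenset: support}."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent_among(candidates):
        # One full pass over the data per level: the repeated scan that makes
        # Apriori costly on large datasets.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: k / n for c, k in counts.items() if k / n >= min_support}

    singletons = {frozenset([item]) for t in transactions for item in t}
    current = frequent_among(singletons)
    frequent = dict(current)
    size = 2
    while current:
        prev = set(current)
        # Join step: merge frequent (size-1)-itemsets into candidates of length `size`.
        candidates = {a | b for a in prev for b in prev if len(a | b) == size}
        # Prune step: every (size-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in prev for sub in combinations(c, size - 1))
        }
        current = frequent_among(candidates)
        frequent.update(current)
        size += 1
    return frequent


# Hypothetical transactions, used only for illustration.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"eggs"},
]
result = apriori_frequent_itemsets(transactions, min_support=0.4)
for itemset, sup in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

The prune step is what the algorithm is named for: an itemset can only be frequent if all of its subsets are frequent, so candidates with an infrequent subset are discarded before the data is scanned again.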
For Example:
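The tree structure described above can be sketched as follows. This is only the construction phase, under assumed names (`FPNode`, `build_fp_tree`) and toy transactions; the mining phase of FP-Growth, which walks the tree to extract frequent patterns, is not shown.

```python
from collections import Counter


class FPNode:
    """One node of the tree: an item, a count, and links to parent and children."""

    def __init__(self, item, parent=None):
        self.item = item        # None for the root (the null node)
        self.count = 0
        self.parent = parent
        self.children = {}      # item -> FPNode


def build_fp_tree(transactions, min_count):
    """Build an FP-tree from transactions; returns (root, frequent_item_counts)."""
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item: c for item, c in counts.items() if c >= min_count}
    root = FPNode(None)
    for t in transactions:
        # Keep only frequent items, most frequent first (ties broken alphabetically),
        # so that transactions sharing a prefix share a path in the tree.
        ordered = sorted((i for i in set(t) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
            child.count += 1
            node = child
    return root, frequent


# Hypothetical transactions, used only for illustration.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"eggs"},
]
root, frequent = build_fp_tree(transactions, min_count=2)
```

Because the first three transactions all start with bread after sorting, they share a single `bread` node under the root, which is how the tree compresses the data without generating candidates.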
Implementing Market Basket Analysis (MBA) in marketing has many advantages. MBA applies to customer data from point of sale (PoS) systems.
It helps retailers in the following ways:
Consider Amazon, the world’s largest eCommerce platform, as an example of market basket analysis. From a customer’s perspective, market basket analysis is like shopping at a supermarket: it looks at all the items customers buy together in a single purchase and then shows the related products customers tend to buy along with them.
Now, let us implement market basket analysis in Python.
Here are the steps involved in using the apriori algorithm to implement an MBA:
In this implementation, we use the Store Data dataset, which is publicly available on Kaggle. It contains 7,501 transaction records, each listing the items sold in one transaction.
The Apriori algorithm is frequently used by data scientists. First, we import the necessary libraries; the apyori package provides the apriori function that runs the algorithm.
import pandas as pd
import numpy as np
from apyori import apriori
Now we read the dataset downloaded from Kaggle. The file has no header row, so its first row is already the first transaction; we therefore pass header=None.
st_df=pd.read_csv("store_data.csv",header=None)
print(st_df)
Once the dataset is loaded, we need the list of items in every transaction. We run two loops: one over the transactions and one over the columns of each transaction. The resulting list of lists is the input from which the association rules are generated.
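# Convert the DataFrame into a list of lists (one inner list per transaction).
# Note: range(1, 7501) starts at row 1, so row 0 is not included, and str()
# turns empty cells (NaN) into the literal string 'nan'.
l = []
for i in range(1, 7501):
    l.append([str(st_df.values[i, j]) for j in range(0, 20)])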
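Because shorter transactions are padded with empty cells, the conversion above produces the string 'nan' as if it were an item. One possible way to avoid that is sketched below; `clean_transactions` is a hypothetical helper, not part of pandas or apyori, and the sample rows are made up to mimic the shape of the data.

```python
import math


def clean_transactions(rows):
    """Drop missing cells from each row and skip rows that end up empty."""
    cleaned = []
    for row in rows:
        items = []
        for cell in row:
            if cell is None:
                continue
            if isinstance(cell, float) and math.isnan(cell):
                continue
            text = str(cell).strip()
            # Also drop cells that were already stringified to 'nan'.
            if text and text.lower() != "nan":
                items.append(text)
        if items:
            cleaned.append(items)
    return cleaned


# Hypothetical rows with NaN padding, shaped like the rows of the DataFrame.
rows = [
    ["shrimp", "almonds", float("nan")],
    ["burgers", float("nan"), float("nan")],
    [float("nan"), float("nan"), float("nan")],
]
transactions = clean_transactions(rows)
print(transactions)
```

With the DataFrame from this article, the helper could be applied as `clean_transactions(st_df.values.tolist())`, which would also include row 0. Doing so changes the input, so the supports and rules would differ from the output listed in this article.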
With the list of transactions ready, we run the Apriori algorithm, which derives the association rules from the list. We set the minimum support to 0.0045, the minimum confidence to 0.2, and the minimum lift to 3. We set the minimum length to 2 because a rule must associate at least two items.
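To get a feel for these thresholds, the short sketch below translates the minimum support into a transaction count and shows how lift relates to support and confidence. Lift is not defined earlier in the article; the standard definition, lift(A ⇒ B) = confidence(A ⇒ B) / support(B), is assumed here. The figure of 7,500 transactions follows from `range(1, 7501)` in the conversion loop, and the confidence and lift values are copied from the first rule in the printed results.

```python
import math


def lift(support_ab, support_a, support_b):
    """lift(A => B) = support(A and B) / (support(A) * support(B)).

    A value of 1 means A and B occur together as often as independence predicts;
    values above 1 mean they co-occur more often than that.
    """
    return support_ab / (support_a * support_b)


# Number of transactions passed to apriori by the loop range(1, 7501).
n_transactions = 7500
min_support = 0.0045

# Smallest number of transactions an itemset must appear in to pass min_support.
min_count = math.ceil(min_support * n_transactions)
smallest_passing_support = min_count / n_transactions

# Confidence and lift of the first printed rule; since lift = confidence / support(B),
# dividing one by the other recovers the support of the consequent.
first_rule_confidence = 0.2905982905982906
first_rule_lift = 4.843304843304844
implied_consequent_support = first_rule_confidence / first_rule_lift

print(min_count, smallest_passing_support, implied_consequent_support)
```

A minimum lift of 3 therefore keeps only rules whose items are bought together at least three times as often as they would be if purchases were independent.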
#applying apriori algorithm
association_rules = apriori(l, min_support=0.0045, min_confidence=0.2, min_lift=3, min_length=2)
association_results = list(association_rules)
After running the code above, we have the list of association rules between the items. Now, we print the itemset behind each rule.
for i in range(0, len(association_results)):
    print(association_results[i][0])
frozenset({'light cream', 'chicken'})
frozenset({'mushroom cream sauce', 'escalope'})
frozenset({'pasta', 'escalope'})
frozenset({'herb & pepper', 'ground beef'})
frozenset({'tomato sauce', 'ground beef'})
frozenset({'whole wheat pasta', 'olive oil'})
frozenset({'shrimp', 'pasta'})
frozenset({'nan', 'light cream', 'chicken'})
frozenset({'shrimp', 'frozen vegetables', 'chocolate'})
frozenset({'spaghetti', 'cooking oil', 'ground beef'})
frozenset({'mushroom cream sauce', 'nan', 'escalope'})
frozenset({'nan', 'pasta', 'escalope'})
frozenset({'spaghetti', 'frozen vegetables', 'ground beef'})
frozenset({'olive oil', 'frozen vegetables', 'milk'})
frozenset({'shrimp', 'frozen vegetables', 'mineral water'})
frozenset({'spaghetti', 'olive oil', 'frozen vegetables'})
frozenset({'spaghetti', 'shrimp', 'frozen vegetables'})
frozenset({'spaghetti', 'frozen vegetables', 'tomatoes'})
frozenset({'spaghetti', 'grated cheese', 'ground beef'})
frozenset({'herb & pepper', 'mineral water', 'ground beef'})
frozenset({'nan', 'herb & pepper', 'ground beef'})
frozenset({'spaghetti', 'herb & pepper', 'ground beef'})
frozenset({'olive oil', 'milk', 'ground beef'})
frozenset({'nan', 'tomato sauce', 'ground beef'})
frozenset({'spaghetti', 'shrimp', 'ground beef'})
frozenset({'spaghetti', 'olive oil', 'milk'})
frozenset({'soup', 'olive oil', 'mineral water'})
frozenset({'whole wheat pasta', 'nan', 'olive oil'})
frozenset({'nan', 'shrimp', 'pasta'})
frozenset({'spaghetti', 'olive oil', 'pancakes'})
frozenset({'nan', 'shrimp', 'frozen vegetables', 'chocolate'})
frozenset({'spaghetti', 'nan', 'cooking oil', 'ground beef'})
frozenset({'spaghetti', 'nan', 'frozen vegetables', 'ground beef'})
frozenset({'spaghetti', 'frozen vegetables', 'milk', 'mineral water'})
frozenset({'nan', 'frozen vegetables', 'milk', 'olive oil'})
frozenset({'nan', 'shrimp', 'frozen vegetables', 'mineral water'})
frozenset({'spaghetti', 'nan', 'frozen vegetables', 'olive oil'})
frozenset({'spaghetti', 'nan', 'shrimp', 'frozen vegetables'})
frozenset({'spaghetti', 'nan', 'frozen vegetables', 'tomatoes'})
frozenset({'spaghetti', 'nan', 'grated cheese', 'ground beef'})
frozenset({'nan', 'herb & pepper', 'mineral water', 'ground beef'})
frozenset({'spaghetti', 'nan', 'herb & pepper', 'ground beef'})
frozenset({'nan', 'milk', 'olive oil', 'ground beef'})
frozenset({'spaghetti', 'nan', 'shrimp', 'ground beef'})
frozenset({'spaghetti', 'nan', 'milk', 'olive oil'})
frozenset({'soup', 'nan', 'olive oil', 'mineral water'})
frozenset({'spaghetti', 'nan', 'olive oil', 'pancakes'})
frozenset({'spaghetti', 'milk', 'mineral water', 'nan', 'frozen vegetables'})
Next, we use a for loop to display the rule, support, confidence, and lift for each of the association rules above. The string 'nan' that appears in some rules is the placeholder for empty cells in the CSV, not a real product.
for item in association_results:
    # item[0] is the frozenset of items in the rule. A frozenset is unordered,
    # so items[0] -> items[1] is only a label built from two of its members.
    pair = item[0]
    items = [x for x in pair]
    print("Rule: " + items[0] + " -> " + items[1])
    # item[1] is the support of the itemset
    print("Support: " + str(item[1]))
    # item[2][0] is the first ordered statistic of the record; its fields at
    # index 2 and 3 are the confidence and the lift
    print("Confidence: " + str(item[2][0][2]))
    print("Lift: " + str(item[2][0][3]))
    print("-----------------------------------------------------")
Rule: light cream -> chicken
Support: 0.004533333333333334
Confidence: 0.2905982905982906
Lift: 4.843304843304844
-----------------------------------------------------
Rule: mushroom cream sauce -> escalope
Support: 0.005733333333333333
Confidence: 0.30069930069930073
Lift: 3.7903273197390845
-----------------------------------------------------
Rule: pasta -> escalope
Support: 0.005866666666666667
Confidence: 0.37288135593220345
Lift: 4.700185158809287
-----------------------------------------------------
Rule: herb & pepper -> ground beef
Support: 0.016
Confidence: 0.3234501347708895
Lift: 3.2915549671393096
-----------------------------------------------------
Rule: tomato sauce -> ground beef
Support: 0.005333333333333333
Confidence: 0.37735849056603776
Lift: 3.840147461662528
-----------------------------------------------------
Rule: whole wheat pasta -> olive oil
Support: 0.008
Confidence: 0.2714932126696833
Lift: 4.130221288078346
-----------------------------------------------------
Rule: shrimp -> pasta
Support: 0.005066666666666666
Confidence: 0.3220338983050848
Lift: 4.514493901473151
-----------------------------------------------------
Rule: nan -> light cream
Support: 0.004533333333333334
Confidence: 0.2905982905982906
Lift: 4.843304843304844
-----------------------------------------------------
Rule: shrimp -> frozen vegetables
Support: 0.005333333333333333
Confidence: 0.23255813953488372
Lift: 3.260160834601174
-----------------------------------------------------
Rule: spaghetti -> cooking oil
Support: 0.0048
Confidence: 0.5714285714285714
Lift: 3.281557646029315
-----------------------------------------------------
Rule: mushroom cream sauce -> nan
Support: 0.005733333333333333
Confidence: 0.30069930069930073
Lift: 3.7903273197390845
-----------------------------------------------------
Rule: nan -> pasta
Support: 0.005866666666666667
Confidence: 0.37288135593220345
Lift: 4.700185158809287
-----------------------------------------------------
Rule: spaghetti -> frozen vegetables
Support: 0.008666666666666666
Confidence: 0.3110047846889952
Lift: 3.164906221394116
-----------------------------------------------------
Rule: olive oil -> frozen vegetables
Support: 0.0048
Confidence: 0.20338983050847456
Lift: 3.094165778526489
-----------------------------------------------------
Rule: shrimp -> frozen vegetables
Support: 0.0072
Confidence: 0.3068181818181818
Lift: 3.2183725365543547
-----------------------------------------------------
Rule: spaghetti -> olive oil
Support: 0.005733333333333333
Confidence: 0.20574162679425836
Lift: 3.1299436124887174
-----------------------------------------------------
Rule: spaghetti -> shrimp
Support: 0.006
Confidence: 0.21531100478468898
Lift: 3.0183785717479763
-----------------------------------------------------
Rule: spaghetti -> frozen vegetables
Support: 0.006666666666666667
Confidence: 0.23923444976076555
Lift: 3.497579674864993
-----------------------------------------------------
Rule: spaghetti -> grated cheese
Support: 0.005333333333333333
Confidence: 0.3225806451612903
Lift: 3.282706701098612
-----------------------------------------------------
Rule: herb & pepper -> mineral water
Support: 0.006666666666666667
Confidence: 0.390625
Lift: 3.975152645861601
-----------------------------------------------------
Rule: nan -> herb & pepper
Support: 0.016
Confidence: 0.3234501347708895
Lift: 3.2915549671393096
-----------------------------------------------------
Rule: spaghetti -> herb & pepper
Support: 0.0064
Confidence: 0.3934426229508197
Lift: 4.003825878061259
-----------------------------------------------------
Rule: olive oil -> milk
Support: 0.004933333333333333
Confidence: 0.22424242424242424
Lift: 3.411395906324912
-----------------------------------------------------
Rule: nan -> tomato sauce
Support: 0.005333333333333333
Confidence: 0.37735849056603776
Lift: 3.840147461662528
-----------------------------------------------------
Rule: spaghetti -> shrimp
Support: 0.006
Confidence: 0.5232558139534884
Lift: 3.004914704939635
-----------------------------------------------------
Rule: spaghetti -> olive oil
Support: 0.0072
Confidence: 0.20300751879699247
Lift: 3.0883496774390333
-----------------------------------------------------
Rule: soup -> olive oil
Support: 0.0052
Confidence: 0.2254335260115607
Lift: 3.4295161157945335
-----------------------------------------------------
Rule: whole wheat pasta -> nan
Support: 0.008
Confidence: 0.2714932126696833
Lift: 4.130221288078346
-----------------------------------------------------
Rule: nan -> shrimp
Support: 0.005066666666666666
Confidence: 0.3220338983050848
Lift: 4.514493901473151
-----------------------------------------------------
Rule: spaghetti -> olive oil
Support: 0.005066666666666666
Confidence: 0.20105820105820105
Lift: 3.0586947422647217
-----------------------------------------------------
Rule: nan -> shrimp
Support: 0.005333333333333333
Confidence: 0.23255813953488372
Lift: 3.260160834601174
-----------------------------------------------------
Rule: spaghetti -> nan
Support: 0.0048
Confidence: 0.5714285714285714
Lift: 3.281557646029315
-----------------------------------------------------
Rule: spaghetti -> nan
Support: 0.008666666666666666
Confidence: 0.3110047846889952
Lift: 3.164906221394116
-----------------------------------------------------
Rule: spaghetti -> frozen vegetables
Support: 0.004533333333333334
Confidence: 0.28813559322033905
Lift: 3.0224013274860737
-----------------------------------------------------
Rule: nan -> frozen vegetables
Support: 0.0048
Confidence: 0.20338983050847456
Lift: 3.094165778526489
-----------------------------------------------------
Rule: nan -> shrimp
Support: 0.0072
Confidence: 0.3068181818181818
Lift: 3.2183725365543547
-----------------------------------------------------
Rule: spaghetti -> nan
Support: 0.005733333333333333
Confidence: 0.20574162679425836
Lift: 3.1299436124887174
-----------------------------------------------------
Rule: spaghetti -> nan
Support: 0.006
Confidence: 0.21531100478468898
Lift: 3.0183785717479763
-----------------------------------------------------
Rule: spaghetti -> nan
Support: 0.006666666666666667
Confidence: 0.23923444976076555
Lift: 3.497579674864993
-----------------------------------------------------
Rule: spaghetti -> nan
Support: 0.005333333333333333
Confidence: 0.3225806451612903
Lift: 3.282706701098612
-----------------------------------------------------
Rule: nan -> herb & pepper
Support: 0.006666666666666667
Confidence: 0.390625
Lift: 3.975152645861601
-----------------------------------------------------
Rule: spaghetti -> nan
Support: 0.0064
Confidence: 0.3934426229508197
Lift: 4.003825878061259
-----------------------------------------------------
Rule: nan -> milk
Support: 0.004933333333333333
Confidence: 0.22424242424242424
Lift: 3.411395906324912
-----------------------------------------------------
Rule: spaghetti -> nan
Support: 0.006
Confidence: 0.5232558139534884
Lift: 3.004914704939635
-----------------------------------------------------
Rule: spaghetti -> nan
Support: 0.0072
Confidence: 0.20300751879699247
Lift: 3.0883496774390333
-----------------------------------------------------
Rule: soup -> nan
Support: 0.0052
Confidence: 0.2254335260115607
Lift: 3.4295161157945335
-----------------------------------------------------
Rule: spaghetti -> nan
Support: 0.005066666666666666
Confidence: 0.20105820105820105
Lift: 3.0586947422647217
-----------------------------------------------------
Rule: spaghetti -> milk
Support: 0.004533333333333334
Confidence: 0.28813559322033905
Lift: 3.0224013274860737
-----------------------------------------------------
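The listing above labels each rule with two members of an unordered set, and several entries involve the placeholder 'nan'. A more careful printout, sketched below, reads the antecedent and consequent from each ordered statistic and skips itemsets containing 'nan'. To keep the sketch runnable without apyori installed, it defines stand-in namedtuples that mirror the field names of apyori's result records (`items`, `support`, `ordered_statistics`, and `items_base`, `items_add`, `confidence`, `lift`). The numbers in the sample records are copied from the output above, but the antecedent/consequent split shown for them is an assumption.

```python
from collections import namedtuple

# Stand-ins mirroring the field names of apyori's result tuples, so this sketch
# runs on its own; with apyori installed, association_results can be passed instead.
RelationRecord = namedtuple("RelationRecord", ["items", "support", "ordered_statistics"])
OrderedStatistic = namedtuple("OrderedStatistic",
                              ["items_base", "items_add", "confidence", "lift"])


def format_rules(records, ignore=("nan",)):
    """Return one line per rule, using the antecedent/consequent of each statistic."""
    lines = []
    for record in records:
        # Skip itemsets that contain a placeholder such as 'nan'.
        if any(item in ignore for item in record.items):
            continue
        for stat in record.ordered_statistics:
            if not stat.items_base:
                continue  # skip statistics with an empty antecedent, if present
            base = ", ".join(sorted(stat.items_base))
            add = ", ".join(sorted(stat.items_add))
            lines.append(
                f"{base} -> {add} | support={record.support:.4f} "
                f"confidence={stat.confidence:.4f} lift={stat.lift:.4f}"
            )
    return lines


# Sample records: values taken from the printed output, direction assumed.
sample = [
    RelationRecord(
        frozenset({"light cream", "chicken"}), 0.004533333333333334,
        [OrderedStatistic(frozenset({"light cream"}), frozenset({"chicken"}),
                          0.2905982905982906, 4.843304843304844)],
    ),
    RelationRecord(
        frozenset({"nan", "light cream", "chicken"}), 0.004533333333333334,
        [OrderedStatistic(frozenset({"light cream"}), frozenset({"nan", "chicken"}),
                          0.2905982905982906, 4.843304843304844)],
    ),
]
lines = format_rules(sample)
for line in lines:
    print(line)
```

Iterating over every entry of `ordered_statistics`, rather than only the first, also surfaces all directions of a rule that passed the thresholds.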
In this tutorial, we discussed market basket analysis and walked through implementing it in Python with the Apriori algorithm. We also looked at its uses and advantages, and saw that other algorithms, such as FP-Growth, AIS, and SETM, can be used for market basket analysis in data mining.
A. The purpose of the market basket is to analyze consumer purchasing patterns and identify product associations.
A. To calculate a market basket, count the number of transactions containing a set of items and analyze co-occurrence.
A. Amazon uses market basket analysis to recommend products by identifying frequently bought together items and improving cross-selling strategies.
A. Market basket analysis for pricing helps determine optimal pricing strategies by understanding how product bundles influence purchasing decisions.