Market Basket Analysis

gracecamc168
Dec 30, 2020
Updated: Jan 7, 2021

One of the most interesting analyses I have learned from the course is Association Rule Mining, which is also called Market Basket Analysis. It is very useful for retailers when selecting or recommending products. The general idea is to find out what else people purchased when they bought a given item or set of items. This is helpful for understanding consumer behavior, and uncovering these relationships helps to design better marketing activities. The dataset I am using can be found on Kaggle, and it contains one week of sales transactions from a supermarket in the US. The dataset is called MarketBasketOptimisation. R is the programming language for this project.

library(arules)

library(arulesViz)

Let's read the file

MarketBasket <- read.transactions("Market_Basket_Optimisation.csv", format = "basket", sep = ",")

summary(MarketBasket)

The first five transactions

inspect(MarketBasket[1:5])

Let's calculate the total number of possible association rules for these 119 products. For d items, this number is 3^d - 2^(d+1) + 1.

3^119-2^(119+1)+1

In total, there are 5.990034e+56 possible rules.
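
As a sanity check on the formula 3^d - 2^(d+1) + 1, the sketch below compares it with a direct count for small d. The function names are made up for illustration, and a rule is assumed to be a pair of non-empty, disjoint item sets (antecedent and consequent).

```r
# Closed-form count of possible rules over d items
n_rules_formula <- function(d) 3^d - 2^(d + 1) + 1

# Direct count: pick k items for the antecedent, then any non-empty
# subset of the remaining d - k items for the consequent
n_rules_direct <- function(d) {
  total <- 0
  for (k in seq_len(d - 1)) {
    total <- total + choose(d, k) * (2^(d - k) - 1)
  }
  total
}

n_rules_formula(3)  # 12
n_rules_direct(3)   # 12
n_rules_formula(119)
```

The two counts agree for small d, and the last line evaluates the same expression used above for 119 products.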

Next, let's look at how often each product appears

itemFrequency(MarketBasket) # returns the proportion of transactions containing each item

count <- itemFrequency(MarketBasket, type = "absolute") # returns the number of transactions containing each item

sorted <- sort(count, decreasing = TRUE) # mineral water is the most popular item

# visualize the top 20 most-wanted products

itemFrequencyPlot(MarketBasket, type="absolute",topN = 20, col = rainbow(20))
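
To make it clearer what the absolute and relative item frequencies mean, here is a minimal base R sketch on a made-up set of ten baskets. The data and variable names are hypothetical and are not taken from the supermarket dataset.

```r
# Hypothetical toy data: 10 baskets, each a character vector of items
baskets <- list(
  c("apple", "milk"), c("apple", "milk"), c("apple", "milk", "bread"),
  c("apple", "bread"), c("apple"),
  c("milk", "bread"), c("bread"), c("bread", "eggs"), c("eggs"),
  c("eggs", "bread")
)

# Absolute frequency: number of baskets containing each item
# (unique() guards against an item listed twice in one basket)
counts <- sort(table(unlist(lapply(baskets, unique))), decreasing = TRUE)

# Relative frequency: share of baskets containing each item
props <- counts / length(baskets)

print(counts)
print(props)
```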

Next, let's calculate support, confidence, expected confidence and lift ratio by using the apriori() function. Support tells us how frequently the items appear together in the data. Confidence tells us how often the if-then statement holds. For example, for the rule Apple -> Milk with a confidence of 0.6, 60% of the transactions that include apples also include milk. Expected confidence is the share of all transactions that include the consequent (here, milk), regardless of what else was bought. The greater the confidence, the better.

The lift ratio evaluates the strength of the association: it is the confidence divided by the expected confidence. We need it because support and confidence alone can be misleading. For example, supermarkets and online retailers sometimes run big promotions, which attract many more customers to buy items they would not normally buy at the regular, higher price. Such items then show up in many baskets, so rules that point to them can have high support and confidence even when there is no real relationship between the items. Lift corrects for this by taking the overall popularity of the consequent into account.

A lift ratio greater than 1 indicates a positive association between the items, and the higher the lift ratio, the stronger the association. A lift ratio close to 1 means the items are bought independently of each other. A lift ratio below 1 indicates a negative association, and the closer it is to 0, the stronger the negative association.
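
The Apple -> Milk example can be worked through by hand. The base R sketch below uses a made-up set of ten baskets (hypothetical data, chosen so that the confidence comes out at 0.6) and a small helper function, support(), that is not part of arules.

```r
# Hypothetical toy data: 10 baskets, each a character vector of items
baskets <- list(
  c("apple", "milk"), c("apple", "milk"), c("apple", "milk", "bread"),
  c("apple", "bread"), c("apple"),
  c("milk", "bread"), c("bread"), c("bread", "eggs"), c("eggs"),
  c("eggs", "bread")
)

# Share of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

lhs <- "apple"
rhs <- "milk"

supp_rule <- support(c(lhs, rhs), baskets)      # share of baskets with apple and milk
conf_rule <- supp_rule / support(lhs, baskets)  # share of apple baskets that also have milk
exp_conf  <- support(rhs, baskets)              # share of all baskets with milk
lift_rule <- conf_rule / exp_conf               # confidence relative to the baseline

round(c(support = supp_rule, confidence = conf_rule,
        expected_confidence = exp_conf, lift = lift_rule), 2)
```

In this toy data, milk appears in 60% of the baskets with apples but in only 40% of all baskets, so the lift is 1.5.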

Run the association rule analysis with the selected minimum support and confidence

rule <- apriori(MarketBasket, parameter=list(minlen=2,supp=0.01,conf=0.5))

sort_rule <- sort(rule, by = "lift", decreasing = TRUE)

inspect(sort_rule)

With these support and confidence thresholds, we get only two rules.

The first rule tells us that eggs, ground beef and mineral water are bought together in only about 1% of purchases (76 out of 7501). The confidence of 0.51 means that 51% of the transactions that include eggs and ground beef also include mineral water. The lift ratio of 2.12 indicates that a customer who bought eggs and ground beef is about 2.12 times as likely to buy mineral water as a randomly chosen customer: 51% versus a baseline of roughly 24% (0.51 / 2.12). There is a strong association between eggs, ground beef and mineral water. We might suggest that the store manager place these items next to each other. For online purchases, mineral water could be shown as a product recommendation next to eggs and ground beef.
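
The numbers quoted for the first rule can be checked with a few lines of arithmetic. This is only a sketch: the variable names are made up, and the baseline share of baskets with mineral water is derived from the rounded confidence and lift rather than read from the data.

```r
# Figures reported for the first rule
count_rule <- 76     # purchases containing eggs, ground beef and mineral water
n_trans    <- 7501   # total number of purchases
conf_rule  <- 0.51   # confidence of the rule
lift_rule  <- 2.12   # lift of the rule

# Support of the rule: about 1% of all purchases
support_rule <- count_rule / n_trans

# Since lift = confidence / expected confidence, the implied share of
# all purchases that include mineral water is confidence / lift
baseline <- conf_rule / lift_rule

round(c(support = support_rule, baseline = baseline), 3)
```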

Let's try different thresholds

rule1 <- apriori(MarketBasket, parameter=list(minlen=2,supp=0.01,conf=0.45))

summary(rule1)

This setting yields 6 rules.

rule2 <- apriori(MarketBasket, parameter=list(minlen=2,supp=0.01,conf=0.3))

summary(rule2)

This setting yields 63 rules.

rule3 <- apriori(MarketBasket, parameter=list(minlen=2,supp=0.005,conf=0.5))

summary(rule3)

This setting yields 20 rules.
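
To illustrate why the number of rules changes with the thresholds, here is a simplified base R sketch on made-up data. It is not the Apriori algorithm and does not use arules: it only enumerates rules with one item on each side and counts how many pass a given minimum support and confidence. All names and data are hypothetical.

```r
# Hypothetical toy data: 10 baskets, each a character vector of items
baskets <- list(
  c("apple", "milk"), c("apple", "milk"), c("apple", "milk", "bread"),
  c("apple", "bread"), c("apple"),
  c("milk", "bread"), c("bread"), c("bread", "eggs"), c("eggs"),
  c("eggs", "bread")
)
items <- sort(unique(unlist(baskets)))

# Share of baskets that contain every item in `x`
support <- function(x) {
  mean(vapply(baskets, function(b) all(x %in% b), logical(1)))
}

# Enumerate all one-to-one rules lhs -> rhs with support and confidence
pairs <- expand.grid(lhs = items, rhs = items, stringsAsFactors = FALSE)
pairs <- pairs[pairs$lhs != pairs$rhs, ]
pairs$support <- mapply(function(a, b) support(c(a, b)),
                        pairs$lhs, pairs$rhs, USE.NAMES = FALSE)
pairs$confidence <- pairs$support /
  vapply(pairs$lhs, support, numeric(1), USE.NAMES = FALSE)

# Number of rules that pass both thresholds
count_rules <- function(min_supp, min_conf) {
  sum(pairs$support >= min_supp & pairs$confidence >= min_conf)
}

# Try a small grid of thresholds
grid <- expand.grid(supp = c(0.15, 0.25), conf = c(0.35, 0.55))
grid$n_rules <- mapply(count_rules, grid$supp, grid$conf)
print(grid)
```

Lowering either threshold can only keep or increase the number of rules, which matches the pattern seen in the runs above.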

After experimenting with the support and confidence thresholds, I found that the highest usable support is 0.01; we can't go beyond this value. Holding support at 0.01, the highest usable confidence is 0.5; we can't go beyond this value either.

As mentioned above, the higher the confidence, the better. For this analysis, the best thresholds for support and confidence are 0.01 and 0.5, respectively.

One limitation of association rule analysis with apriori() is the choice of support and confidence thresholds. We need to experiment with them, which can take some time. Another limitation is computation time, which can become long when the dataset contains a large number of transactions. Overall, though, I like this type of analysis. At the very least, it gives me useful information, not just for marketing campaigns but also for increasing the chance of earning more from customers.
