One of the most interesting analyses I have learned in this course is Association Rule Mining, also called Market Basket Analysis. It is very useful for retailers' product selection and recommendations. The general idea is "people who bought this item also purchased these items." This helps us understand consumer behavior, and uncovering these relationships supports better marketing activities. The dataset I am using can be found on Kaggle; it contains one week of sales transactions from a supermarket in the US. The dataset is called Market Basket Optimisation (the file Market_Basket_Optimisation.csv). R is the programming language for this project.
library(arules)
library(arulesViz)
Let's read the file
MarketBasket <- read.transactions("Market_Basket_Optimisation.csv",
                                  format = "basket", sep = ",")
summary(MarketBasket)
The first five transactions
inspect(MarketBasket[1:5])
Let's calculate the total number of possible association rules for these 119 products
3^119-2^(119+1)+1
In total, there are 5.990034e+56 possible rules.
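The expression above is the usual count of possible association rules over d items, 3^d - 2^(d+1) + 1. As a sanity check, the sketch below compares that formula with a brute-force count for small d. The helper names `count_rules_formula` and `count_rules_brute` are hypothetical and are not part of arules.

```r
# Closed-form count of possible rules for d items
count_rules_formula <- function(d) 3^d - 2^(d + 1) + 1

# Brute-force count: every item is either unused (0), on the
# left-hand side (1), or on the right-hand side (2) of a rule.
# A valid rule needs at least one item on each side.
count_rules_brute <- function(d) {
  grid <- expand.grid(rep(list(0:2), d))
  has_lhs <- apply(grid == 1, 1, any)
  has_rhs <- apply(grid == 2, 1, any)
  sum(has_lhs & has_rhs)
}

# The two counts should agree for small d
for (d in 2:5) {
  stopifnot(count_rules_formula(d) == count_rules_brute(d))
}

# The value for the 119 products in this dataset
count_rules_formula(119)
```

The brute-force version is only feasible for a handful of items, which is exactly why apriori() prunes the search with support and confidence thresholds instead of enumerating everything.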
Count how often each product appears
itemFrequency(MarketBasket) # proportion of transactions containing each item
count <- itemFrequency(MarketBasket, type = "absolute") # number of transactions containing each item
sorted <- sort(count, decreasing = TRUE) # mineral water is the most popular item
# visualize the top 20 most-wanted products
itemFrequencyPlot(MarketBasket, type="absolute",topN = 20, col = rainbow(20))
Next, let's calculate support, confidence, expected confidence, and lift using the apriori() function. Support tells us how frequently an itemset appears in the data. Confidence tells us how often the if-then statement holds: for the rule Apple -> Milk with confidence 0.6, 60% of the transactions that include apples also include milk. Expected confidence is the share of all transactions that contain the right-hand side (milk), regardless of what else is in the basket. Lift is confidence divided by expected confidence, and it measures the strength of the association. We need lift because support and confidence alone can mislead. For example, supermarkets and online retailers sometimes run big promotions that lead many more customers to buy items they would not normally buy at the regular price. Such an item shows up in a large share of baskets, so rules pointing to it can have high support and confidence even when it is unrelated to the other items in the rule. Lift corrects for this by comparing confidence with how often the item is bought anyway. A lift greater than 1 means the items occur together more often than we would expect if they were independent, and the higher the lift, the stronger the association. A lift close to 1 means there is little or no association, and a lift below 1, approaching 0, indicates a negative association. All else being equal, a higher confidence is better.
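To make these definitions concrete, here is a minimal base-R sketch that computes the four measures by hand for the rule apple -> milk. The five baskets are made-up toy data, not taken from the supermarket dataset, and the `support` helper is a hypothetical name.

```r
# Toy baskets (hypothetical data, for illustration only)
baskets <- list(
  c("apple", "milk"),
  c("apple", "milk", "bread"),
  c("apple", "bread"),
  c("milk"),
  c("milk", "bread")
)

# Share of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

lhs <- "apple"
rhs <- "milk"

supp_rule  <- support(c(lhs, rhs), baskets)      # P(apple and milk)
confidence <- supp_rule / support(lhs, baskets)  # P(milk | apple)
expected   <- support(rhs, baskets)              # P(milk), the expected confidence
lift       <- confidence / expected              # confidence relative to baseline

round(c(support = supp_rule, confidence = confidence,
        expected_confidence = expected, lift = lift), 3)
```

In this toy data, milk is in four of the five baskets, so the rule has a fairly high confidence but a lift below 1: knowing that a basket contains apples makes milk slightly less likely than average, not more.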
Run the association rule analysis with the chosen support and confidence thresholds
rule <- apriori(MarketBasket, parameter = list(minlen = 2, supp = 0.01, conf = 0.5))
sort_rule <- sort(rule, by = "lift", decreasing = TRUE)
inspect(sort_rule)
These support and confidence thresholds return only two rules.
The first rule tells us that eggs, ground beef, and mineral water were bought together in only about 1% of purchases, 76 out of 7501. The confidence of 0.51 means that 51% of the transactions that include eggs and ground beef also include mineral water. The lift of 2.12 means that a customer who bought eggs and ground beef was about 2.12 times as likely to buy mineral water as a randomly chosen customer, whose baseline rate is roughly 24% (0.51 / 2.12). This suggests a strong association between eggs, ground beef, and mineral water. We might suggest that the store manager place these items next to each other. For online purchases, mineral water could be shown as a product recommendation next to eggs and ground beef.
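The figures quoted for the first rule can be cross-checked with a little arithmetic. The sketch below uses only the rounded numbers reported in the text, so the results are approximate; the variable names are hypothetical.

```r
n_transactions <- 7501  # total purchases reported above
n_rule         <- 76    # purchases with eggs, ground beef and mineral water
confidence     <- 0.51  # reported confidence of the rule
lift           <- 2.12  # reported lift of the rule

rule_support <- n_rule / n_transactions  # about 0.01, i.e. roughly 1%
baseline     <- confidence / lift        # implied share of baskets with mineral water
uplift_pct   <- (lift - 1) * 100         # percent more likely than a random customer

round(c(support = rule_support, baseline = baseline, uplift_pct = uplift_pct), 3)
```

Read this way, a lift of 2.12 corresponds to mineral water being about 112% more likely in baskets with eggs and ground beef, against a baseline of roughly 24% of all baskets.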
Let's try different thresholds
rule1 <- apriori(MarketBasket, parameter=list(minlen=2,supp=0.01,conf=0.45))
summary(rule1)
This setting returns 6 rules.
rule2 <- apriori(MarketBasket, parameter=list(minlen=2,supp=0.01,conf=0.3))
summary(rule2)
This setting returns 63 rules.
rule3 <- apriori(MarketBasket, parameter=list(minlen=2,supp=0.005,conf=0.5))
summary(rule3)
This setting returns 20 rules.
After experimenting with the support and confidence thresholds, 0.01 turned out to be the highest support value that still returned rules. Holding support at 0.01, the highest confidence that still returned rules was 0.5.
As mentioned above, a higher confidence is better. For this analysis, the best thresholds for support and confidence are therefore 0.01 and 0.5, respectively.
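Trying thresholds one at a time gets tedious, so the search can be scripted. This is a minimal sketch, assuming the arules package is installed; `count_rules_grid` is a hypothetical helper, not an arules function, and the grid values in the usage comment are only examples.

```r
library(arules)

# Count the rules apriori() returns for every (support, confidence) pair.
# `trans` is a transactions object such as the one read.transactions() creates.
count_rules_grid <- function(trans, supports, confidences) {
  grid <- expand.grid(supp = supports, conf = confidences)
  grid$n_rules <- mapply(function(s, cf) {
    rules <- apriori(trans,
                     parameter = list(minlen = 2, supp = s, conf = cf),
                     control = list(verbose = FALSE))
    length(rules)  # number of rules found for this pair
  }, grid$supp, grid$conf)
  grid
}

# Usage, assuming MarketBasket was read with read.transactions():
# count_rules_grid(MarketBasket,
#                  supports = c(0.005, 0.01, 0.02),
#                  confidences = c(0.3, 0.45, 0.5, 0.6))
```

The resulting table shows at a glance where the rule count drops to zero, which is the boundary that the manual experiments above were looking for.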
One limitation of association rule analysis with apriori() is choosing the support and confidence thresholds: we need to experiment with them, which can take some time. Another limitation is computation time, which can grow considerably when the dataset contains many transactions. Overall, though, I like this type of analysis. It gives me useful information that can help not only with marketing campaigns but also with increasing revenue from customers.