Recommendation Systems (R)
Association mining is commonly used to make product recommendations by identifying products that are frequently bought together. More generally, it is a technique for finding associations between many variables, and it is often used by grocery stores, e-commerce websites, and anyone with large transaction databases. A familiar example from daily life: when you order something on Amazon, the site suggests what else you might want to buy.
Based on the concept of strong rules, Rakesh Agrawal, Tomasz Imieliński and Arun Swami introduced association rules for discovering regularities between products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets.
Association mining in R
Michael Hahsler has authored and maintains two very useful R packages relating to association rule mining: the arules package and the arulesViz package.
library(tidyverse)
library(arulesViz)
library(arules)
library(kableExtra)
Data and EDA
A public dataset of purchase records from a grocery store is used here. The first few transactions look like this.
data <- read.transactions('..\\data\\groceries.csv', format = 'basket', sep=',')
inspect(head(data))
## items
## [1] {citrus fruit,
## margarine,
## ready soups,
## semi-finished bread}
## [2] {coffee,
## tropical fruit,
## yogurt}
## [3] {whole milk}
## [4] {cream cheese,
## meat spreads,
## pip fruit,
## yogurt}
## [5] {condensed milk,
## long life bakery product,
## other vegetables,
## whole milk}
## [6] {abrasive cleaner,
## butter,
## rice,
## whole milk,
## yogurt}
I want to find the most frequently bought items.
itemFrequencyPlot(data, topN = 20, type = "absolute")
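As a rough illustration of what this plot summarises, the absolute frequency of an item is simply the number of baskets that contain it. The sketch below computes that in base R on a small made-up list of baskets. The `baskets` object and its contents are assumptions for illustration only and are not taken from the grocery data.

```r
# Hypothetical toy baskets; not taken from groceries.csv
baskets <- list(
  c("whole milk", "yogurt"),
  c("whole milk", "butter", "rice"),
  c("yogurt", "coffee"),
  c("whole milk")
)

# Count in how many baskets each item appears
# (unique() guards against an item listed twice in one basket)
item_counts <- table(unlist(lapply(baskets, unique)))

# Order from most to least frequent
item_counts <- sort(item_counts, decreasing = TRUE)
print(item_counts)
```

With `type = "absolute"` the plot shows counts like these; a relative frequency would divide each count by the number of baskets.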
I want to find the items that are frequently bought together.
freq.items <- eclat(data, parameter = list(supp = 0.01, maxlen = 15))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.01 1 15 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 98
##
## create itemset ...
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [88 item(s)] done [0.00s].
## creating sparse bit matrix ... [88 row(s), 9835 column(s)] done [0.00s].
## writing ... [333 set(s)] done [0.01s].
## Creating S4 object ... done [0.00s].
inspect(head(freq.items))
## items support count
## [1] {hard cheese, whole milk} 0.01006609 99
## [2] {butter milk, whole milk} 0.01159126 114
## [3] {butter milk, other vegetables} 0.01037112 102
## [4] {ham, whole milk} 0.01148958 113
## [5] {sliced cheese, whole milk} 0.01077783 106
## [6] {oil, whole milk} 0.01128622 111
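The log above reports an "Absolute minimum support count" of 98 for `supp = 0.01`. The arithmetic can be checked with the numbers printed in the output alone. How arules rounds internally is an assumption here; the sketch only shows that the printed values are consistent with each other.

```r
n_transactions <- 9835   # from the eclat log above
min_support    <- 0.01   # supp passed to eclat()

# Relative threshold expressed as a (fractional) number of transactions
threshold <- min_support * n_transactions   # 98.35

# Rounded down, this matches the count printed in the log
floor(threshold)                            # 98

# Smallest whole count whose relative support reaches 0.01
min_count <- ceiling(threshold)             # 99

# Relative support of an itemset that occurs exactly min_count times
min_count / n_transactions
```

The last value is about 0.010066, which agrees with the support shown for {hard cheese, whole milk}, the itemset with a count of 99.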
Product recommendation rules
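Three measures are used to evaluate and filter the rules that are generated: support, confidence and lift. In the apriori call used here, minimum support and minimum confidence are set as cutoffs, while lift is reported for each rule.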
Support is an indication of how frequently the item set appears in the data set.
$$ \text{Support} = \frac{\text{Number of transactions with both } A \text{ and } B}{\text{Total number of transactions}} = P\left(A \cap B\right) $$
Confidence is an indication of how often the rule has been found to be true.
$$ \text{Confidence} = \frac{\text{Number of transactions with both } A \text{ and } B}{\text{Total number of transactions with } A} = \frac{P\left(A \cap B\right)}{P\left(A\right)} $$
Lift is the factor by which the co-occurrence of A and B exceeds the co-occurrence that would be expected had they been independent. A lift above 1 means A and B occur together more often than expected under independence, and the higher the lift, the stronger the association.
$$ \text{Lift} = \frac{\text{Confidence}}{\text{Expected confidence}} = \frac{P\left(A \cap B\right)}{P\left(A\right) \cdot P\left(B\right)} $$
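To make the three definitions concrete, here is a minimal base R sketch that computes them by hand for a single rule, {bread} => {butter}. The `baskets` list is a hypothetical toy dataset invented for illustration; it is not the grocery data and the code does not use arules.

```r
# Hypothetical toy baskets, for illustration only
baskets <- list(
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("bread", "milk"),
  c("milk"),
  c("butter", "milk")
)
n <- length(baskets)

# Logical vectors: does each basket contain A (bread) / B (butter)?
has_a <- vapply(baskets, function(b) "bread"  %in% b, logical(1))
has_b <- vapply(baskets, function(b) "butter" %in% b, logical(1))

n_both <- sum(has_a & has_b)   # baskets containing both A and B

support    <- n_both / n                                  # P(A and B)
confidence <- n_both / sum(has_a)                         # P(A and B) / P(A)
lift       <- support / (mean(has_a) * mean(has_b))       # P(A and B) / (P(A) * P(B))

c(support = support, confidence = confidence, lift = lift)
```

Here bread and butter appear together in 2 of 5 baskets, bread appears in 3 and butter in 3, so support is 0.4, confidence is 2/3 and lift is 0.4 / (0.6 * 0.6), slightly above 1.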
rules <- apriori(data, parameter = list(support = 0.0015, confidence = 0.9))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.9 0.1 1 none FALSE TRUE 5 0.0015 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 14
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
## sorting and recoding items ... [153 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [7 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules_optim <- rules[]
inspect(rules_optim)
## lhs rhs support confidence coverage lift count
## [1] {liquor,
## red/blush wine} => {bottled beer} 0.001931876 0.9047619 0.002135231 11.235269 19
## [2] {flour,
## root vegetables,
## whipped/sour cream} => {whole milk} 0.001728521 1.0000000 0.001728521 3.913649 17
## [3] {cream cheese,
## other vegetables,
## sugar} => {whole milk} 0.001525165 0.9375000 0.001626843 3.669046 15
## [4] {butter,
## pip fruit,
## whipped/sour cream} => {whole milk} 0.001830198 0.9000000 0.002033554 3.522284 18
## [5] {domestic eggs,
## tropical fruit,
## whipped/sour cream} => {whole milk} 0.001830198 0.9000000 0.002033554 3.522284 18
## [6] {fruit/vegetable juice,
## tropical fruit,
## whipped/sour cream} => {other vegetables} 0.001931876 0.9047619 0.002135231 4.675950 19
## [7] {root vegetables,
## sausage,
## tropical fruit,
## yogurt} => {whole milk} 0.001525165 0.9375000 0.001626843 3.669046 15
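lhs is the “left hand side” and rhs is the “right hand side” of a rule. The first row, for example, answers the question: if a customer buys liquor and red/blush wine (the lhs column), will the customer also buy bottled beer (the rhs column)? Its confidence of about 0.90 says that roughly 90% of the baskets containing liquor and red/blush wine also contained bottled beer.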
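The columns of the first rule can be cross-checked against the formulas using only the numbers printed above. This is a sketch of the arithmetic, not a call into arules; the implied support of {bottled beer} is derived here and does not appear in the output.

```r
n_transactions <- 9835          # from the apriori log
count_rule     <- 19            # 'count': baskets containing lhs and rhs
coverage       <- 0.002135231   # 'coverage': support of the lhs alone
lift_reported  <- 11.235269     # 'lift' column

# Support = count / total transactions
support_rule <- count_rule / n_transactions

# Confidence = support of the rule / support of the lhs
confidence <- support_rule / coverage

# Lift = confidence / support(rhs), so the rhs support implied by the table is
support_rhs <- confidence / lift_reported

round(c(support = support_rule, confidence = confidence, support_rhs = support_rhs), 4)
```

The recomputed support and confidence agree with the table, and the implied support of {bottled beer} is roughly 0.08: a customer with liquor and red/blush wine in the basket is about 11 times more likely to buy bottled beer than a customer picked at random.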
The rules can be sorted in descending order of confidence, support and lift.
inspect(sort(rules_optim, by = "confidence", decreasing = TRUE))
## lhs rhs support confidence coverage lift count
## [1] {flour,
## root vegetables,
## whipped/sour cream} => {whole milk} 0.001728521 1.0000000 0.001728521 3.913649 17
## [2] {cream cheese,
## other vegetables,
## sugar} => {whole milk} 0.001525165 0.9375000 0.001626843 3.669046 15
## [3] {root vegetables,
## sausage,
## tropical fruit,
## yogurt} => {whole milk} 0.001525165 0.9375000 0.001626843 3.669046 15
## [4] {liquor,
## red/blush wine} => {bottled beer} 0.001931876 0.9047619 0.002135231 11.235269 19
## [5] {fruit/vegetable juice,
## tropical fruit,
## whipped/sour cream} => {other vegetables} 0.001931876 0.9047619 0.002135231 4.675950 19
## [6] {butter,
## pip fruit,
## whipped/sour cream} => {whole milk} 0.001830198 0.9000000 0.002033554 3.522284 18
## [7] {domestic eggs,
## tropical fruit,
## whipped/sour cream} => {whole milk} 0.001830198 0.9000000 0.002033554 3.522284 18
inspect(sort(rules_optim, by = "support", decreasing = TRUE))
## lhs rhs support confidence coverage lift count
## [1] {liquor,
## red/blush wine} => {bottled beer} 0.001931876 0.9047619 0.002135231 11.235269 19
## [2] {fruit/vegetable juice,
## tropical fruit,
## whipped/sour cream} => {other vegetables} 0.001931876 0.9047619 0.002135231 4.675950 19
## [3] {butter,
## pip fruit,
## whipped/sour cream} => {whole milk} 0.001830198 0.9000000 0.002033554 3.522284 18
## [4] {domestic eggs,
## tropical fruit,
## whipped/sour cream} => {whole milk} 0.001830198 0.9000000 0.002033554 3.522284 18
## [5] {flour,
## root vegetables,
## whipped/sour cream} => {whole milk} 0.001728521 1.0000000 0.001728521 3.913649 17
## [6] {cream cheese,
## other vegetables,
## sugar} => {whole milk} 0.001525165 0.9375000 0.001626843 3.669046 15
## [7] {root vegetables,
## sausage,
## tropical fruit,
## yogurt} => {whole milk} 0.001525165 0.9375000 0.001626843 3.669046 15
inspect(sort(rules_optim, by = "lift", decreasing = TRUE))
## lhs rhs support confidence coverage lift count
## [1] {liquor,
## red/blush wine} => {bottled beer} 0.001931876 0.9047619 0.002135231 11.235269 19
## [2] {fruit/vegetable juice,
## tropical fruit,
## whipped/sour cream} => {other vegetables} 0.001931876 0.9047619 0.002135231 4.675950 19
## [3] {flour,
## root vegetables,
## whipped/sour cream} => {whole milk} 0.001728521 1.0000000 0.001728521 3.913649 17
## [4] {cream cheese,
## other vegetables,
## sugar} => {whole milk} 0.001525165 0.9375000 0.001626843 3.669046 15
## [5] {root vegetables,
## sausage,
## tropical fruit,
## yogurt} => {whole milk} 0.001525165 0.9375000 0.001626843 3.669046 15
## [6] {butter,
## pip fruit,
## whipped/sour cream} => {whole milk} 0.001830198 0.9000000 0.002033554 3.522284 18
## [7] {domestic eggs,
## tropical fruit,
## whipped/sour cream} => {whole milk} 0.001830198 0.9000000 0.002033554 3.522284 18
plot(rules_optim[1:7], method = "graph")
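Changing the support and confidence cutoffs changes which rules are returned: lower cutoffs produce more rules, including weaker ones, while higher cutoffs keep only the rules with the most evidence behind them. Tuning these cutoffs can lead to better recommendations.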
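The effect of the cutoffs can be seen on a small scale without arules. The sketch below enumerates every single-item rule A => B by brute force on a hypothetical toy dataset and then filters by support and confidence. This is not the Apriori algorithm, and `baskets`, `contains` and `filter_rules` are names made up for this illustration.

```r
# Hypothetical toy baskets, for illustration only
baskets <- list(
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("bread", "milk"),
  c("milk"),
  c("butter", "milk")
)
items <- sort(unique(unlist(baskets)))

# TRUE/FALSE per basket: does it contain the given item?
contains <- function(item) vapply(baskets, function(b) item %in% b, logical(1))

# All ordered pairs of distinct items, i.e. candidate rules lhs => rhs
rules <- expand.grid(lhs = items, rhs = items, stringsAsFactors = FALSE)
rules <- rules[rules$lhs != rules$rhs, ]

# Support = share of baskets with both items
rules$support <- mapply(
  function(a, b) mean(contains(a) & contains(b)),
  rules$lhs, rules$rhs, USE.NAMES = FALSE
)
# Confidence = baskets with both items / baskets with the lhs item
rules$confidence <- mapply(
  function(a, b) sum(contains(a) & contains(b)) / sum(contains(a)),
  rules$lhs, rules$rhs, USE.NAMES = FALSE
)

# Keep only rules that pass both cutoffs
filter_rules <- function(rules, min_support, min_confidence) {
  rules[rules$support >= min_support & rules$confidence >= min_confidence, ]
}

loose  <- filter_rules(rules, min_support = 0.2, min_confidence = 0.4)
strict <- filter_rules(rules, min_support = 0.3, min_confidence = 0.6)

nrow(loose)    # all 6 candidate rules pass
nrow(strict)   # only the 4 rules with confidence 2/3 remain
```

On real data the same trade-off applies to the `support` and `confidence` values passed to `apriori()`: looser cutoffs surface more candidate recommendations, stricter ones fewer but better supported ones.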