Transcript Document
Introduction to Machine Learning
Lecture 13: Introduction to Association Rules
Albert Orriols i Puig
[email protected]
Artificial Intelligence – Machine Learning
Enginyeria i Arquitectura La Salle
Universitat Ramon Llull
Recap of Lectures 5-12

LET'S START WITH DATA CLASSIFICATION
Recap of Lectures 5-12

Data Set → Classification Model. How?

We have seen four different types of approaches to classification:
• Decision trees (C4.5)
• Instance-based algorithms (kNN & CBR)
• Bayesian classifiers (Naïve Bayes)
• Neural networks (Perceptron, Adaline, Madaline, SVM)
Today's Agenda

□ Introduction to Association Rules
□ A Taxonomy of Association Rules
□ Measures of Interest
□ Apriori
Introduction to AR

□ Ideas come from market basket analysis (MBA)
• Let's go shopping!
  Customer 1: milk, eggs, sugar, bread
  Customer 2: milk, eggs, cereal, bread
  Customer 3: eggs, sugar
• What do my customers buy? Which products are bought together?
• Aim: find associations and correlations between the different items that customers place in their shopping baskets
Introduction to AR

□ Formalizing the problem a little bit:
• Transaction database T: a set of transactions T = {t1, t2, …, tn}
• Each transaction contains a set of items (an itemset)
• An itemset is a collection of items I = {i1, i2, …, im}
□ General aim:
• Find frequent/interesting patterns, associations, correlations, or causal structures among sets of items or elements in databases or other information repositories
• Put these relationships in terms of association rules: X → Y
Example of AR

TID | Items
T1  | bread, jelly, peanut-butter
T2  | bread, peanut-butter
T3  | bread, milk, peanut-butter
T4  | beer, bread
T5  | beer, milk

□ Examples:
• bread → peanut-butter
• beer → bread
□ Frequent itemsets: items that frequently appear together
• I = {bread, peanut-butter}
• I = {beer, bread}
What's an Interesting Rule?

□ Support count (σ): frequency of occurrence of an itemset (over the transaction database above)
• σ({bread, peanut-butter}) = 3
• σ({beer, bread}) = 1
□ Support (s): fraction of transactions that contain an itemset
• s({bread, peanut-butter}) = 3/5
• s({beer, bread}) = 1/5
□ Frequent itemset: an itemset whose support is greater than or equal to a minimum support threshold (minsup)
What's an Interesting Rule?

□ An association rule is an implication between two itemsets: X → Y
□ Many measures of interest exist. The two most used are:
• Support (s): the occurring frequency of the rule, i.e., the fraction of transactions that contain both X and Y
• Confidence (c): the strength of the association, i.e., how often items in Y appear in transactions that contain X

s(X → Y) = σ(X ∪ Y) / (# of transactions)
c(X → Y) = σ(X ∪ Y) / σ(X)
Interestingness of Rules

Rule                    | s    | c
bread → peanut-butter   | 0.60 | 0.75
peanut-butter → bread   | 0.60 | 1.00
beer → bread            | 0.20 | 0.50
peanut-butter → jelly   | 0.20 | 0.33
jelly → peanut-butter   | 0.20 | 1.00
jelly → milk            | 0.00 | 0.00

□ Many other interestingness measures exist
• The methods presented herein are based on these two measures
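The support and confidence values in the table above can be checked with a few lines of code. The following is a minimal sketch (the helper names `sigma`, `support`, and `confidence` are ours, not from the lecture) over the five-transaction basket database:

```python
# Toy transaction database from the slides
T = [
    {"bread", "jelly", "peanut-butter"},
    {"bread", "peanut-butter"},
    {"bread", "milk", "peanut-butter"},
    {"beer", "bread"},
    {"beer", "milk"},
]

def sigma(itemset):
    """Support count: number of transactions that contain the itemset."""
    return sum(1 for t in T if itemset <= t)

def support(X, Y):
    """s(X -> Y) = sigma(X union Y) / number of transactions."""
    return sigma(X | Y) / len(T)

def confidence(X, Y):
    """c(X -> Y) = sigma(X union Y) / sigma(X)."""
    denom = sigma(X)
    return sigma(X | Y) / denom if denom else 0.0

print(support({"bread"}, {"peanut-butter"}))     # 0.6
print(confidence({"bread"}, {"peanut-butter"}))  # 0.75
```

Running it over every rule in the table reproduces the s and c columns above.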
Types of AR

□ Binary association rules:
• bread → peanut-butter
□ Quantitative association rules:
• weight in [70kg, 90kg] → height in [170cm, 190cm]
□ Fuzzy association rules:
• weight in TALL → height in TALL
□ Let's start from the beginning:
• Binary association rules – Apriori
Apriori

□ This is the most influential AR miner
□ It consists of two steps:
1. Generate all frequent itemsets whose support ≥ minsup
2. Use the frequent itemsets to generate association rules
□ So, let's pay attention to the first step
Apriori

[Figure: lattice of all itemsets over {A, B, C, D, E}, from the empty set (null) at the top down to ABCDE at the bottom]

Given d items, we have 2^d possible itemsets.
• Do I have to generate them all?
Apriori

□ Let's avoid expanding the whole graph
□ Key idea:
• Downward closure property: any subset of a frequent itemset is also a frequent itemset
□ Therefore, the algorithm iteratively does:
• Create itemsets
• Only continue exploring those whose support ≥ minsup
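A quick way to see why the pruning is safe is the contrapositive of downward closure: support can only shrink as an itemset grows, so every superset of an infrequent itemset is infrequent too. A small check (the `sigma` helper is ours) on the basket database:

```python
# Transaction database from the earlier slides
T = [
    {"bread", "jelly", "peanut-butter"},
    {"bread", "peanut-butter"},
    {"bread", "milk", "peanut-butter"},
    {"beer", "bread"},
    {"beer", "milk"},
]

def sigma(itemset):
    """Support count of an itemset."""
    return sum(1 for t in T if itemset <= t)

base = frozenset({"beer"})
for extra in ("bread", "milk", "jelly"):
    # every transaction containing the superset also contains `base`,
    # so its support count can never exceed sigma(base)
    assert sigma(base | {extra}) <= sigma(base)
print("downward closure holds on this database")
```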
Example Itemset Generation

[Figure: the same itemset lattice over {A, B, C, D, E}; an infrequent itemset is marked, and all of its supersets are pruned from the search]
Recovering the Example

Minimum support (count) = 3

TID | Items
T1  | bread, jelly, peanut-butter
T2  | bread, peanut-butter
T3  | bread, milk, peanut-butter
T4  | beer, bread
T5  | beer, milk

1-itemsets:
Item     | count
bread    | 4
peanut-b | 3
jelly    | 1
milk     | 2
beer     | 2

2-itemsets:
Item            | count
bread, peanut-b | 3
Apriori Algorithm

□ k = 1
□ Generate frequent itemsets of length 1
□ Repeat until no frequent itemsets are found:
• k := k + 1
• Generate candidate itemsets of size k from the frequent itemsets of size k-1
• Compute the support of each candidate by scanning the DB
Apriori Algorithm

Algorithm Apriori(T)
  C1 ← init-pass(T);                        // n: no. of transactions in T
  F1 ← {f | f ∈ C1, f.count/n ≥ minsup};
  for (k = 2; Fk-1 ≠ ∅; k++) do
    Ck ← candidate-gen(Fk-1);
    for each transaction t ∈ T do
      for each candidate c ∈ Ck do
        if c is contained in t then
          c.count++;
      end
    end
    Fk ← {c ∈ Ck | c.count/n ≥ minsup};
  end
  return F ← ∪k Fk;
Apriori Algorithm

Function candidate-gen(Fk-1)
  Ck ← ∅;
  forall f1, f2 ∈ Fk-1
      with f1 = {i1, …, ik-2, ik-1}
      and  f2 = {i1, …, ik-2, i'k-1}
      and  ik-1 < i'k-1 do
    c ← {i1, …, ik-1, i'k-1};               // join f1 and f2
    Ck ← Ck ∪ {c};
    for each (k-1)-subset s of c do
      if (s ∉ Fk-1) then                    // prune
        delete c from Ck;
    end
  end
  return Ck;
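The two routines above can be sketched in runnable Python. This is our own minimal rendering, not the lecture's reference code: it counts candidates with a dictionary instead of `c.count++` fields and takes `minsup` as a fraction of the number of transactions:

```python
from itertools import combinations

def candidate_gen(F_prev, k):
    """Join two (k-1)-itemsets that differ in one item, then prune any
    candidate with an infrequent (k-1)-subset (downward closure)."""
    candidates = set()
    for f1 in F_prev:
        for f2 in F_prev:
            c = f1 | f2
            if len(c) == k and all(
                frozenset(s) in F_prev for s in combinations(c, k - 1)
            ):
                candidates.add(c)
    return candidates

def apriori(transactions, minsup):
    """Return all frequent itemsets with support fraction >= minsup."""
    n = len(transactions)
    counts = {}
    for t in transactions:                  # init-pass: count 1-itemsets
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Fk = {c for c, cnt in counts.items() if cnt / n >= minsup}
    frequent = set(Fk)
    k = 2
    while Fk:
        Ck = candidate_gen(Fk, k)
        counts = {c: sum(1 for t in transactions if c <= t) for c in Ck}
        Fk = {c for c, cnt in counts.items() if cnt / n >= minsup}
        frequent |= Fk
        k += 1
    return frequent

# The TDB from the run-through slide, minsup = 2/4
TDB = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for itemset in sorted(apriori(TDB, 0.5), key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset))
```

On this database it recovers L1 = {A, B, C, E}, L2 = {AC, BC, BE, CE}, and L3 = {BCE}, matching the worked run.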
Example of Apriori Run

Database TDB:
Tid | Items
10  | A, C, D
20  | B, C, E
30  | A, B, C, E
40  | B, E

1st scan → C1:
Itemset | sup
{A}     | 2
{B}     | 3
{C}     | 3
{D}     | 1
{E}     | 3

L1 (drop {D}, whose support count is below 2):
Itemset | sup
{A}     | 2
{B}     | 3
{C}     | 3
{E}     | 3

C2 = {{A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}}

2nd scan → C2 with counts:
Itemset | sup
{A, B}  | 1
{A, C}  | 2
{A, E}  | 1
{B, C}  | 2
{B, E}  | 3
{C, E}  | 2

L2:
Itemset | sup
{A, C}  | 2
{B, C}  | 2
{B, E}  | 3
{C, E}  | 2

C3 = {{B, C, E}}

3rd scan → L3:
Itemset   | sup
{B, C, E} | 2
Apriori

□ Remember that Apriori consists of two steps:
1. Generate all frequent itemsets whose support ≥ minsup
2. Use frequent itemsets to generate association rules
□ We accomplished step 1, so we have all frequent itemsets
□ So, let's pay attention to the second step
Rule Generation in Apriori

□ Given a frequent itemset L:
• Find all non-empty subsets F of L such that the association rule F → {L - F} satisfies the minimum confidence
• Create the rule F → {L - F}
□ If L = {A, B, C}:
• The candidate rules are: AB→C, AC→B, BC→A, A→BC, B→AC, C→AB
• In general, there are 2^k - 2 candidate rules, where k is the length of the itemset L
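This enumeration is straightforward to sketch in Python. `gen_rules` below is our own illustrative helper, not the lecture's code; it scores every non-empty proper subset F of L as an antecedent:

```python
from itertools import combinations

def gen_rules(L, transactions, minconf):
    """All rules F -> (L - F), for non-empty proper subsets F of the
    frequent itemset L, whose confidence meets minconf."""
    def sigma(itemset):
        return sum(1 for t in transactions if itemset <= t)

    sup_L = sigma(L)
    rules = []
    for r in range(1, len(L)):                 # 2^k - 2 candidates in total
        for F in map(frozenset, combinations(L, r)):
            conf = sup_L / sigma(F)            # c(F -> L-F) = sigma(L)/sigma(F)
            if conf >= minconf:
                rules.append((F, L - F, conf))
    return rules

# Frequent itemset {B, C, E} from the Apriori run, minconf = 0.75
TDB = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for F, rest, conf in gen_rules(frozenset({"B", "C", "E"}), TDB, 0.75):
    print(sorted(F), "->", sorted(rest), round(conf, 2))
```

Only BC→E and CE→B survive here: all other candidate antecedents give confidence 2/3, below the 0.75 threshold. Note that σ(F) ≥ σ(L) > 0 for any F ⊆ L when L is frequent, so the division is safe.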
Can You Be More Efficient?

□ Can we apply the same trick used with support?
• Confidence does not have the anti-monotone property
• That is, is c(AB → D) > c(A → D)? We don't know!
□ But the confidence of rules generated from the same itemset does have the anti-monotone property
• L = {A, B, C, D}
• c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
□ We can apply this property to prune the rule generation
Example of Efficient Rule Generation

[Figure: lattice of the rules generated from {A, B, C, D}, from ABC→D, ABD→C, ACD→B, BCD→A at the top down to A→BCD, B→ACD, C→ABD, D→ABC at the bottom; once a rule has low confidence, all rules below it are pruned]
Challenges in AR Mining

□ Challenges:
• Apriori scans the database multiple times
• Most often, there is a high number of candidates
• Support counting for the candidates can be time-expensive
□ Several methods try to improve on these points by:
• Reducing the number of scans of the database
• Shrinking the number of candidates
• Counting the support of candidates more efficiently
Next Class

□ Advanced topics in association rule mining