Transcript Document

Introduction to Machine Learning
Lecture 13: Introduction to Association Rules

Albert Orriols i Puig
[email protected]
Artificial Intelligence – Machine Learning
Enginyeria i Arquitectura La Salle
Universitat Ramon Llull
Recap of Lectures 5-12
LET'S START WITH DATA CLASSIFICATION
Artificial Intelligence
Machine Learning
Slide 2
Recap of Lectures 5-12
Data Set → Classification Model: how?
We have seen four different types of approaches to classification:
• Decision trees (C4.5)
• Instance-based algorithms (kNN & CBR)
• Bayesian classifiers (Naïve Bayes)
• Neural networks (Perceptron, Adaline, Madaline, SVM)
Today's Agenda
□ Introduction to Association Rules
□ A Taxonomy of Association Rules
□ Measures of Interest
□ Apriori
Introduction to AR
□ Ideas come from market basket analysis (MBA)
  • Let's go shopping!
    Customer 1: milk, eggs, sugar, bread
    Customer 2: milk, eggs, cereal, bread
    Customer 3: eggs, sugar
  • What do my customers buy? Which products are bought together?
  • Aim: find associations and correlations between the different items that customers place in their shopping basket
Introduction to AR
□ Formalizing the problem a little bit
  • Transaction database T: a set of transactions T = {t1, t2, …, tn}
  • Each transaction contains a set of items (an itemset)
  • An itemset is a collection of items I = {i1, i2, …, im}
□ General aim:
  • Find frequent/interesting patterns, associations, correlations, or causal structures among sets of items or elements in databases or other information repositories
  • Put these relationships in terms of association rules: X → Y
Example of AR
□ Transaction database:

TID  Items
T1   bread, jelly, peanut-butter
T2   bread, peanut-butter
T3   bread, milk, peanut-butter
T4   beer, bread
T5   beer, milk

□ Example rules:
  • bread → peanut-butter
  • beer → bread
□ Frequent itemsets: items that frequently appear together
  • I = {bread, peanut-butter}
  • I = {beer, bread}
What's an Interesting Rule?
(using the transaction database T1–T5 above)
□ Support count (σ): the frequency of occurrence of an itemset
  • σ({bread, peanut-butter}) = 3
  • σ({beer, bread}) = 1
□ Support (s): the fraction of transactions that contain an itemset
  • s({bread, peanut-butter}) = 3/5
  • s({beer, bread}) = 1/5
□ Frequent itemset
  • An itemset whose support is greater than or equal to a minimum support threshold (minsup)
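As an illustration (not part of the original slides), the support count and support of an itemset can be sketched in Python; the transaction list mirrors the T1–T5 table above, and the function names are my own:

```python
# Sketch of support count and support over the slide's example transactions.
transactions = [
    {"bread", "jelly", "peanut-butter"},
    {"bread", "peanut-butter"},
    {"bread", "milk", "peanut-butter"},
    {"beer", "bread"},
    {"beer", "milk"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions that contain every item of X."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """s(X): fraction of transactions that contain X."""
    return support_count(itemset, transactions) / len(transactions)

print(support_count({"bread", "peanut-butter"}, transactions))  # 3
print(support({"beer", "bread"}, transactions))                 # 0.2
```

Representing each transaction as a set makes the "contains an itemset" check a simple subset test (`itemset <= t`).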
What's an Interesting Rule?
□ An association rule is an implication between two itemsets: X → Y
□ Many measures of interest exist. The two most used are:
  • Support (s): the occurring frequency of the rule, i.e., the fraction of transactions that contain both X and Y
  • Confidence (c): the strength of the association, i.e., how often items in Y appear in transactions that contain X

  s(X → Y) = σ(X ∪ Y) / (# of trans.)
  c(X → Y) = σ(X ∪ Y) / σ(X)
Interestingness of Rules
(computed over the transaction database T1–T5 above)

Rule                     s     c
bread → peanut-butter    0.60  0.75
peanut-butter → bread    0.60  1.00
beer → bread             0.20  0.50
peanut-butter → jelly    0.20  0.33
jelly → peanut-butter    0.20  1.00
jelly → milk             0.00  0.00

□ Many other interestingness measures exist
  • The methods presented herein are based on these two measures
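The support and confidence values in the table can be reproduced with a short sketch (illustrative code, not from the slides; the function names are my own):

```python
# Sketch: computing support and confidence for rules X -> Y over the
# example transaction database T1-T5.
transactions = [
    {"bread", "jelly", "peanut-butter"},
    {"bread", "peanut-butter"},
    {"bread", "milk", "peanut-butter"},
    {"beer", "bread"},
    {"beer", "milk"},
]
n = len(transactions)

def sigma(itemset):
    """Support count: number of transactions containing the whole itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(X, Y):
    """s(X -> Y) = sigma(X u Y) / n."""
    return sigma(X | Y) / n

def confidence(X, Y):
    """c(X -> Y) = sigma(X u Y) / sigma(X)."""
    return sigma(X | Y) / sigma(X)

print(support({"bread"}, {"peanut-butter"}))     # 0.6
print(confidence({"bread"}, {"peanut-butter"}))  # 0.75
print(confidence({"jelly"}, {"peanut-butter"}))  # 1.0
```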
Types of AR
□ Binary association rules:
  • bread → peanut-butter
□ Quantitative association rules:
  • weight in [70kg – 90kg] → height in [170cm – 190cm]
□ Fuzzy association rules:
  • weight in TALL → height in TALL
□ Let's start from the beginning:
  • Binary association rules – Apriori
Apriori
□ This is the most influential AR miner
□ It consists of two steps:
  1. Generate all frequent itemsets whose support ≥ minsup
  2. Use frequent itemsets to generate association rules
□ So, let's pay attention to the first step
Apriori
[Lattice diagram: all itemsets over {A, B, C, D, E}, from the empty set (null) at the top, through the 1-itemsets A…E, 2-itemsets AB…DE, and so on, down to ABCDE]
Given d items, we have 2^d possible itemsets.
• Do I have to generate them all?
Apriori
□ Let's avoid expanding the whole graph
□ Key idea:
  • Downward closure property: any subset of a frequent itemset is also a frequent itemset
□ Therefore, the algorithm iteratively does:
  • Create itemsets
  • Only continue exploration of those whose support ≥ minsup
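A quick numeric check of the downward closure property (an illustrative sketch, not from the slides, using the earlier T1–T5 transactions):

```python
# Sketch illustrating downward closure: every subset of an itemset has
# support at least as large as the itemset itself, so once an itemset is
# infrequent, none of its supersets can be frequent.
from itertools import combinations

transactions = [
    {"bread", "jelly", "peanut-butter"},
    {"bread", "peanut-butter"},
    {"bread", "milk", "peanut-butter"},
    {"beer", "bread"},
    {"beer", "milk"},
]

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

itemset = {"bread", "milk", "peanut-butter"}
for k in range(1, len(itemset)):
    for sub in combinations(sorted(itemset), k):
        # each proper subset must be at least as frequent
        assert support(set(sub)) >= support(itemset)
print("every subset is at least as frequent as the full itemset")
```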
Example Itemset Generation
[Same lattice over {A, B, C, D, E}: once an itemset (e.g., AB) is marked as an infrequent itemset, all of its supersets are crossed out and never generated]
Given d items, we have 2^d possible itemsets.
• Do I have to generate them all?
Recovering the Example
Minimum support count = 3

TID  Items
T1   bread, jelly, peanut-butter
T2   bread, peanut-butter
T3   bread, milk, peanut-butter
T4   beer, bread
T5   beer, milk

1-itemsets:
Item      count
bread     4
peanut-b  3
jelly     1
milk      2
beer      2

2-itemsets:
Item             count
bread, peanut-b  3
Apriori Algorithm
□ k = 1
□ Generate frequent itemsets of length 1
□ Repeat until no frequent itemsets are found:
  • k := k + 1
  • Generate candidate itemsets of size k from the frequent (k-1)-itemsets
  • Compute the support of each candidate by scanning the DB
Apriori Algorithm

Algorithm Apriori(T)                        // n: no. of transactions in T
  C1 ← init-pass(T);
  F1 ← {f | f ∈ C1, f.count/n ≥ minsup};
  for (k = 2; Fk-1 ≠ ∅; k++) do
    Ck ← candidate-gen(Fk-1);
    for each transaction t ∈ T do
      for each candidate c ∈ Ck do
        if c is contained in t then
          c.count++;
      end
    end
    Fk ← {c ∈ Ck | c.count/n ≥ minsup};
  end
  return F ← ∪k Fk;
Apriori Algorithm

Function candidate-gen(Fk-1)
  Ck ← ∅;
  forall f1, f2 ∈ Fk-1
      with f1 = {i1, …, ik-2, ik-1}
      and f2 = {i1, …, ik-2, i'k-1}
      and ik-1 < i'k-1 do
    c ← {i1, …, ik-1, i'k-1};        // join f1 and f2
    Ck ← Ck ∪ {c};
    for each (k-1)-subset s of c do
      if (s ∉ Fk-1) then
        delete c from Ck;            // prune
    end
  end
  return Ck;
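The join and prune steps of candidate-gen can be sketched in Python (an illustrative translation of the pseudocode, not the original code; itemsets are represented as sorted tuples):

```python
# Sketch of candidate-gen: join frequent (k-1)-itemsets that share their
# first k-2 items, then prune candidates with an infrequent (k-1)-subset.
from itertools import combinations

def candidate_gen(F_prev):
    """F_prev: set of frequent (k-1)-itemsets, each a sorted tuple."""
    C = set()
    for f1 in F_prev:
        for f2 in F_prev:
            # join step: same prefix, last items ordered f1 < f2
            if f1[:-1] == f2[:-1] and f1[-1] < f2[-1]:
                c = f1 + (f2[-1],)
                # prune step: every (k-1)-subset of c must be frequent
                if all(s in F_prev for s in combinations(c, len(c) - 1)):
                    C.add(c)
    return C

# Frequent 2-itemsets from the running example:
F2 = {("A", "C"), ("B", "C"), ("B", "E"), ("C", "E")}
print(candidate_gen(F2))  # {('B', 'C', 'E')}
```

Only {B, C} and {B, E} share the prefix B, so the join produces a single candidate {B, C, E}, and it survives pruning because all three of its 2-subsets are frequent.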
Example of Apriori Run
Database TDB (minimum support count = 2):

Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan → C1:
Itemset  sup
{A}      2
{B}      3
{C}      3
{D}      1
{E}      3

L1 ({D} is pruned):
Itemset  sup
{A}      2
{B}      3
{C}      3
{E}      3

C2 (generated from L1): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan → C2 with supports:
Itemset  sup
{A, B}   1
{A, C}   2
{A, E}   1
{B, C}   2
{B, E}   3
{C, E}   2

L2:
Itemset  sup
{A, C}   2
{B, C}   2
{B, E}   3
{C, E}   2

C3: {B, C, E}

3rd scan → L3:
Itemset    sup
{B, C, E}  2
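The whole frequent-itemset phase, run on the TDB example above, can be sketched as follows (illustrative code, not the original implementation; `apriori` and its representation choices are my own):

```python
# Sketch of Apriori's frequent-itemset phase on the TDB example
# (minimum support count = 2). Itemsets are sorted tuples.
from itertools import combinations

def apriori(transactions, minsup_count):
    """Return {itemset (sorted tuple): support count} for frequent itemsets."""
    # frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    F = {c: n for c, n in counts.items() if n >= minsup_count}
    result = dict(F)
    k = 2
    while F:
        # candidate generation: join + prune
        prev = set(F)
        C = set()
        for f1 in prev:
            for f2 in prev:
                if f1[:-1] == f2[:-1] and f1[-1] < f2[-1]:
                    c = f1 + (f2[-1],)
                    if all(s in prev for s in combinations(c, k - 1)):
                        C.add(c)
        # support counting by scanning the database once per level
        counts = {c: sum(1 for t in transactions if set(c) <= t) for c in C}
        F = {c: n for c, n in counts.items() if n >= minsup_count}
        result.update(F)
        k += 1
    return result

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
frequent = apriori(tdb, 2)
print(frequent[("B", "C", "E")])  # 2
```

On this database the result matches the run above: {D} and the three infrequent 2-itemsets are discarded, and {B, C, E} is the only frequent 3-itemset.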
Apriori
□ Remember that Apriori consists of two steps:
  1. Generate all frequent itemsets whose support ≥ minsup
  2. Use frequent itemsets to generate association rules
□ We accomplished step 1, so we have all frequent itemsets
□ So, let's pay attention to the second step
Rule Generation in Apriori
□ Given a frequent itemset L
  • Find all non-empty subsets F of L such that the association rule F → {L - F} satisfies the minimum confidence
  • Create the rule F → {L - F}
□ If L = {A, B, C}
  • The candidate rules are: AB → C, AC → B, BC → A, A → BC, B → AC, C → AB
  • In general, there are 2^k - 2 candidate rules, where k is the length of the itemset L
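A sketch of this brute-force rule generation over one frequent itemset, using the earlier T1–T5 transactions (illustrative code; the minconf value and function names are my own):

```python
# Sketch: generate every rule F -> L - F from a frequent itemset L and
# keep those whose confidence reaches minconf (2^k - 2 candidates).
from itertools import combinations

transactions = [
    {"bread", "jelly", "peanut-butter"},
    {"bread", "peanut-butter"},
    {"bread", "milk", "peanut-butter"},
    {"beer", "bread"},
    {"beer", "milk"},
]

def sigma(itemset):
    return sum(1 for t in transactions if itemset <= t)

def gen_rules(L, minconf):
    rules = []
    for r in range(1, len(L)):              # all non-empty proper subsets
        for F in combinations(sorted(L), r):
            F = set(F)
            conf = sigma(L) / sigma(F)      # c(F -> L-F) = sigma(L)/sigma(F)
            if conf >= minconf:
                rules.append((frozenset(F), frozenset(L - F), conf))
    return rules

L = {"bread", "peanut-butter"}
for antecedent, consequent, conf in gen_rules(L, minconf=0.8):
    print(set(antecedent), "->", set(consequent), conf)
# with minconf = 0.8, only peanut-butter -> bread (confidence 1.0) survives
```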
Can You Be More Efficient?
□ Can we apply the same trick used with support?
  • Confidence does not have the anti-monotone property
  • That is, is c(AB → D) > c(A → D)? We don't know!
□ But the confidence of rules generated from the same itemset does have the anti-monotone property
  • For L = {A, B, C, D}:
    c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
  • We can apply this property to prune the rule generation
Example of Efficient Rule Generation
[Lattice of the rules generated from the frequent itemset ABCD, from rules like ABC → D at the top down to A → BCD at the bottom; once a rule (e.g., BCD → A) has low confidence, every rule below it that moves more items into the consequent is pruned]
Challenges in AR Mining
□ Challenges:
  • Apriori scans the database multiple times
  • Most often, there is a high number of candidates
  • Support counting for candidates can be time-expensive
□ Several methods try to improve on these points by:
  • Reducing the number of scans of the database
  • Shrinking the number of candidates
  • Counting the support of candidates more efficiently
Next Class
□ Advanced topics in association rule mining