Data Mining Association Rules: Algorithm Apriori and FP-Trees


Data Mining Association Rules
Jen-Ting Tony Hsiao
Alexandros Ntoulas
CS 240B
May 21, 2002
Outline

- Overview of data mining association rules
- Algorithm Apriori
- FP-Trees

Data Mining Association Rules

- Data mining is motivated by the decision support problem faced by most large retail organizations
- Bar-code technology made it possible for retail organizations to collect and store massive amounts of sales data, referred to as basket data
- Successful organizations view such databases as important pieces of the marketing infrastructure
- Organizations are interested in instituting information-driven marketing processes, managed by database technology, that enable marketers to develop and implement customized marketing programs and strategies

Data Mining Association Rules

- For example, 98% of customers that purchase tires and auto accessories also get automotive services done
- Finding all such rules is valuable for cross-marketing and attached mailing applications
- Other applications include catalog design, add-on sales, store layout, and customer segmentation based on buying patterns
- The databases involved in these applications are very large, so it is imperative to have fast algorithms for this task

Formal Definition of Problem

- Formal statement of the problem [AIS93]:
  - Let I = {i1, i2, …, im} be a set of literals, called items. Let D be a set of transactions, where each transaction T is a set of items such that T ⊆ I.
  - A transaction T contains X, a set of some items in I, if X ⊆ T.
  - An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = Ø.
  - X ⇒ Y holds in the transaction set D with confidence c if c% of transactions in D that contain X also contain Y.
  - X ⇒ Y has support s in the transaction set D if s% of transactions in D contain X ∪ Y.

Example of Confidence and Support

- Given the following transactions (verified in the sketch below):

    {1 2 3 4}
    {1 2 4 5}
    {1 2 5}

- 1 ⇒ 2 has 100% confidence, with 100% support
- 3 ⇒ 4 has 100% confidence, with 33% support
- 2 ⇒ 3 has 33% confidence, with 33% support

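These definitions map directly onto code. Below is a minimal Python sketch, not part of the original slides, that reproduces the three numbers above; transactions, support, and confidence are illustrative names:

    # A minimal sketch of the support/confidence definitions, applied to the
    # example transactions above. All names here are illustrative.
    transactions = [{1, 2, 3, 4}, {1, 2, 4, 5}, {1, 2, 5}]

    def support(itemset):
        """Fraction of transactions that contain every item in `itemset`."""
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(x, y):
        """Fraction of transactions containing x that also contain y."""
        return support(x | y) / support(x)

    print(confidence({1}, {2}), support({1, 2}))  # 1.00, 1.00  ->  1 => 2
    print(confidence({3}, {4}), support({3, 4}))  # 1.00, 0.33  ->  3 => 4
    print(confidence({2}, {3}), support({2, 3}))  # 0.33, 0.33  ->  2 => 3
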
Problem Decomposition

1. Find all sets of items (itemsets) that have transaction support above minimum support. The support for an itemset is the number of transactions that contain the itemset. Itemsets with minimum support are called large itemsets, and all others small itemsets.
2. Use the large itemsets to generate the desired rules. If ABCD and AB are large itemsets, we can determine whether the rule AB ⇒ CD holds by computing the ratio conf = support(ABCD)/support(AB). If conf ≥ minconf, then the rule holds. (The rule will surely have minimum support, because ABCD is large.)

Discovering Large Itemsets

- Algorithms for discovering large itemsets make multiple passes over the data
- In the first pass, we count the support of individual items and determine which of them are large (i.e., have minimum support)
- In each subsequent pass, we start with a seed set of itemsets found to be large in the previous pass. We use this seed set to generate new potentially large itemsets, called candidate itemsets, and count the actual support for these candidates during the pass over the data
- At the end of the pass, we determine which of the candidate itemsets are actually large; they become the seed for the next pass
- This process continues until no new large itemsets are found

Notations

- The number of items in an itemset is its size; we call an itemset of size k a k-itemset
- Items within an itemset are kept in lexicographic order

k-itemset   An itemset having k items.
Lk          Set of large k-itemsets (those with minimum support).
            Each member of this set has two fields: i) itemset and ii) support count.
Ck          Set of candidate k-itemsets (potentially large itemsets).
            Each member of this set has two fields: i) itemset and ii) support count.

Algorithm Apriori
1)  L1 = {large 1-itemsets};
2)  for ( k = 2; Lk-1 ≠ Ø; k++ ) do begin
3)      Ck = apriori-gen(Lk-1);   // New candidates
4)      forall transactions t ∈ D do begin
5)          Ct = subset(Ck, t);   // Candidates contained in t
6)          forall candidates c ∈ Ct do
7)              c.count++;
8)      end
9)      Lk = {c ∈ Ck | c.count ≥ minsup};
10) end
11) Answer = ∪k Lk;

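As a concrete rendering of the pseudocode, here is a hedged Python sketch (itemsets as frozensets, transactions as Python sets, minsup an absolute count). It relies on apriori_gen, sketched on the next slide, and replaces the subset() function, which the paper implements with a hash tree, by a naive scan:

    from collections import defaultdict

    def apriori(transactions, minsup):
        """Level-wise sketch of Apriori; returns {itemset: support count}."""
        # Pass 1: count single items to obtain the large 1-itemsets.
        counts = defaultdict(int)
        for t in transactions:
            for item in t:
                counts[frozenset([item])] += 1
        L = {c: n for c, n in counts.items() if n >= minsup}
        answer = dict(L)
        while L:
            Ck = apriori_gen(set(L))      # new candidates from the seed set
            counts = defaultdict(int)
            for t in transactions:        # one pass over the data
                for c in Ck:              # subset(Ck, t), done naively here
                    if c <= t:
                        counts[c] += 1
            L = {c: n for c, n in counts.items() if n >= minsup}
            answer.update(L)
        return answer
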
Apriori Candidate Generation

- The apriori-gen function takes as argument Lk-1, the set of all large (k-1)-itemsets. It returns a superset of the set of all large k-itemsets.
- First, in the join step, we join Lk-1 with Lk-1:

    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

- Next, in the prune step, we delete all itemsets c ∈ Ck such that some (k-1)-subset of c is not in Lk-1 (see the Python sketch below):

    forall itemsets c ∈ Ck do
        forall (k-1)-subsets s of c do
            if (s ∉ Lk-1) then
                delete c from Ck;

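The same two steps as a Python sketch, continuing the frozenset representation and assuming items are mutually comparable so that lexicographic order is defined:

    def apriori_gen(L_prev):
        """Sketch of apriori-gen: join L(k-1) with itself, then prune."""
        # Join step: merge two (k-1)-itemsets that agree on all but the
        # last item in lexicographic order.
        Ck = set()
        for p in L_prev:
            for q in L_prev:
                sp, sq = sorted(p), sorted(q)
                if sp[:-1] == sq[:-1] and sp[-1] < sq[-1]:
                    Ck.add(p | q)
        # Prune step: delete c if some (k-1)-subset of c is not in L(k-1).
        return {c for c in Ck
                if all(c - {item} in L_prev for item in c)}
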
Apriori-gen Example

- Let L3 be {{1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4}}
- In the join step:
  - {1 2 3} joins with {1 2 4} to produce {1 2 3 4}
  - {1 3 4} joins with {1 3 5} to produce {1 3 4 5}
  - After the join step, C4 is {{1 2 3 4}, {1 3 4 5}}
- In the prune step:
  - {1 2 3 4} is tested for the existence of its 3-item subsets in L3: {1 2 3}, {1 2 4}, {1 3 4}, and {2 3 4} are all present
  - {1 3 4 5} is tested for {1 3 4}, {1 3 5}, {1 4 5}, and {3 4 5}; {1 4 5} is not found, so this itemset is deleted
- We are then left with only {1 2 3 4} in C4 (reproduced by the sketch below)

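Running the apriori_gen sketch from the previous slide on this L3 reproduces the result:

    L3 = {frozenset(s) for s in ({1,2,3}, {1,2,4}, {1,3,4}, {1,3,5}, {2,3,4})}
    print(apriori_gen(L3))   # {frozenset({1, 2, 3, 4})}
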
Discovering Rules

- For every large itemset l, we find all non-empty subsets of l
- For every such subset a, we output a rule of the form a ⇒ (l - a) if the ratio of support(l) to support(a) is at least minconf
- We consider all subsets of l to generate rules with multiple consequents

Simple Algorithm

- We can improve the preceding procedure by generating the subsets of a large itemset in a recursive depth-first fashion
- For example, given an itemset ABCD, we first consider the subset ABC, then AB, etc.
- If a subset a of a large itemset l does not generate a rule, the subsets of a need not be considered for generating rules using l
- For example, if ABC ⇒ D does not have enough confidence, we need not check whether AB ⇒ CD holds

Simple Algorithm
1)  forall large itemsets lk, k ≥ 2 do
2)      call genrules(lk, lk)

Procedure genrules(lk: large k-itemset, am: large m-itemset)
1)      A = {(m-1)-itemsets am-1 | am-1 ⊂ am};
2)      forall am-1 ∈ A do begin
3)          conf = support(lk)/support(am-1);
4)          if (conf ≥ minconf) then begin
5)              output the rule am-1 ⇒ (lk - am-1),
6)                  with confidence = conf and support = support(lk);
7)              if (m - 1 > 1) then
8)                  call genrules(lk, am-1);
9)                  // to generate rules with subsets of am-1
10)                 // as the antecedents
11)         end
12)     end

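A compact Python rendering of genrules, a sketch rather than the paper's exact code. It assumes support is the itemset-to-count dictionary returned by the earlier apriori sketch, so every subset of a large itemset can be looked up:

    def genrules(l_k, a_m, support, minconf):
        """Depth-first rule generation from large itemset l_k (a sketch)."""
        for item in a_m:
            a = a_m - {item}                  # an (m-1)-subset of a_m
            conf = support[l_k] / support[a]
            if conf >= minconf:
                print(f"{sorted(a)} => {sorted(l_k - a)}",
                      f"(conf={conf:.2f}, sup={support[l_k]})")
                if len(a) > 1:                # recurse only under rules that held
                    genrules(l_k, a, support, minconf)

Like the pseudocode above, this sketch can reach the same antecedent along several branches; the faster algorithm on the next slides avoids much of this redundant work.
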
Faster Algorithm

- Earlier, we saw that if a rule does not hold, then no rule with a smaller antecedent (and hence a larger consequent) for the same itemset holds either
- For example, if ABC ⇒ D does not have enough confidence, AB ⇒ CD will not hold either
- This also works in the other direction: if a rule holds, then every rule with a larger antecedent (and a smaller consequent) for the same itemset also holds
- For example, if AB ⇒ CD has enough confidence, then ABC ⇒ D will also hold

Faster Algorithm
1)  forall large itemsets lk, k ≥ 2 do begin
2)      H1 = { consequents of rules derived from lk with one item in the consequent };
3)      call ap-genrules(lk, H1);
4)  end

Procedure ap-genrules(lk: large k-itemset, Hm: set of m-item consequents)
1)  if (k > m + 1) then begin
2)      Hm+1 = apriori-gen(Hm);
3)      forall hm+1 ∈ Hm+1 do begin
4)          conf = support(lk)/support(lk - hm+1);
5)          if (conf ≥ minconf) then
6)              output the rule (lk - hm+1) ⇒ hm+1 with
7)                  confidence = conf and support = support(lk);
8)          else
9)              delete hm+1 from Hm+1;
10)     end
11)     call ap-genrules(lk, Hm+1);
12) end

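The same procedure as a Python sketch, reusing apriori_gen from earlier to grow candidate consequents; the driver seeds H1 with the confident one-item consequents of each large itemset, and support is again the itemset-to-count dictionary:

    def ap_genrules(l_k, H_m, support, minconf):
        """Sketch of ap-genrules: extend m-item consequents to m+1 items."""
        if not H_m:
            return
        m = len(next(iter(H_m)))
        if len(l_k) > m + 1:
            H_next = apriori_gen(H_m)         # candidate (m+1)-item consequents
            for h in list(H_next):
                conf = support[l_k] / support[l_k - h]
                if conf >= minconf:
                    print(f"{sorted(l_k - h)} => {sorted(h)}",
                          f"(conf={conf:.2f}, sup={support[l_k]})")
                else:
                    H_next.discard(h)         # no superset of h can succeed
            ap_genrules(l_k, H_next, support, minconf)
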
Faster Algorithm

- For example, consider a large itemset ABCDE. Assume that ACDE ⇒ B and ABCE ⇒ D are the only one-item-consequent rules with minimum confidence.
- In the simple algorithm, the recursive call genrules(ABCDE, ACDE) will test whether the two-item-consequent rules ACD ⇒ BE, ADE ⇒ BC, CDE ⇒ AB, and ACE ⇒ BD hold.
- The first of these rules cannot hold, because E ∈ BE and ABCD ⇒ E does not have minimum confidence. The second and third rules cannot hold for similar reasons.
- The only two-item-consequent rule that can possibly hold is ACE ⇒ BD, where B and D are the consequents of the valid one-item-consequent rules. This is the only rule that the faster algorithm will test.

Mining Frequent Itemsets

- If we use candidate generation:
  - We need to generate a huge number of candidate sets.
  - We need to repeatedly scan the database and check a large set of candidates by pattern matching.
- Can we avoid that?
  - FP-Trees (Frequent Pattern Trees)

FP-Tree

- General idea: divide and conquer
- Outline of the procedure:
  - For each item, create its conditional pattern base
  - Then expand it and create its conditional FP-Tree
  - Repeat the process recursively on each constructed FP-Tree
  - Until the FP-Tree is either empty or contains only one path

Example
TID    Items bought
100    {f, a, c, d, g, i, m, p}
200    {a, b, c, f, l, m, o}
300    {b, f, h, j, o}
400    {b, c, k, s, p}
500    {a, f, c, e, l, p, m, n}

- For support = 50%:
  - Step 1: Scan the DB and find the frequent 1-itemsets
  - Step 2: Order them in descending frequency
  - Step 3: Scan the DB again and construct the FP-Tree (see the sketch below)

Item   frequency
f      4
c      4
a      3
b      3
m      3
p      3

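A Python sketch of the three steps, with a minimal illustrative Node class; transactions are lists of items and minsup is an absolute count (3 here, for the 50% threshold over 5 transactions):

    from collections import defaultdict

    class Node:
        """FP-tree node: item label, count, parent link, children by item."""
        def __init__(self, item, parent):
            self.item, self.count, self.parent = item, 0, parent
            self.children = {}

    def build_fp_tree(transactions, minsup):
        # Step 1: scan the DB once and keep only the frequent items.
        freq = defaultdict(int)
        for t in transactions:
            for item in t:
                freq[item] += 1
        freq = {i: n for i, n in freq.items() if n >= minsup}
        # Step 2: fix a frequency-descending order (ties broken arbitrarily;
        # the slides use f, c, a, b, m, p).
        rank = {i: r for r, i in enumerate(sorted(freq, key=lambda i: -freq[i]))}
        # Step 3: scan again, reorder each transaction, insert it as a path.
        root, header = Node(None, None), defaultdict(list)  # item -> nodes
        for t in transactions:
            node = root
            for item in sorted((i for i in t if i in freq), key=rank.get):
                if item not in node.children:
                    node.children[item] = Node(item, node)
                    header[item].append(node.children[item])
                node = node.children[item]
                node.count += 1
        return root, header, freq
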
Example – Conditional Pattern Base

- Start with the frequent-item header table
- Traverse the tree by following the links from each frequent item
- Accumulate the prefix paths to form a conditional pattern base

Header Table
Item   frequency
f      4
c      4
a      3
b      3
m      3
p      3

[Figure: the FP-Tree with node-links from the header table. Paths from the root: f:4 - c:3 - a:3 - m:2 - p:2; f:4 - c:3 - a:3 - b:1 - m:1; f:4 - b:1; c:1 - b:1 - p:1]

Conditional pattern bases
item   cond. pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1

Example – Conditional FP-Tree

- For every pattern base:
  - Accumulate the frequency of each item
  - Construct an FP-Tree over the frequent items of the pattern base

m-conditional pattern base: fca:2, fcab:1

[Figure: the m-conditional FP-tree, a single path root - f:3 - c:3 - a:3]

All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam

Mining The FP-Tree

- Construct the conditional pattern base for each node in the initial FP-Tree
- Construct the conditional FP-Tree from each conditional pattern base
- Recursively visit the conditional FP-Trees and expand the frequent patterns (see the sketch below)
- If a conditional FP-Tree contains only one path, enumerate all the combinations of its items

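Putting the four steps together: a recursive FP-growth sketch built on build_fp_tree above (illustrative, not the paper's exact code; the single-path shortcut is omitted because the general recursion yields the same patterns):

    def fp_growth(root, header, freq, minsup, suffix=frozenset()):
        """Recursively mine an FP-tree; yields (pattern, support) pairs."""
        for item in freq:
            pattern = suffix | {item}
            yield pattern, freq[item]
            # Conditional pattern base: the prefix path of every `item`
            # node, weighted by that node's count.
            base = []
            for node in header[item]:
                path, parent = [], node.parent
                while parent.item is not None:
                    path.append(parent.item)
                    parent = parent.parent
                base.extend([path] * node.count)
            # Conditional FP-tree for `item`, then recurse on it.
            cond = build_fp_tree(base, minsup)
            if cond[2]:                       # any frequent items left?
                yield from fp_growth(*cond, minsup, pattern)

    # Usage on the example DB (support 50% of 5 transactions -> minsup = 3):
    db = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
          list("bcksp"), list("afcelpmn")]
    tree = build_fp_tree(db, 3)
    for pattern, sup in fp_growth(*tree, 3):
        print("".join(sorted(pattern)), sup)
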
Example – Mining The FP-Tree
Item   Conditional pattern base     Conditional FP-tree
p      {(fcam:2), (cb:1)}           {(c:3)}|p
m      {(fca:2), (fcab:1)}          {(f:3, c:3, a:3)}|m
b      {(fca:1), (f:1), (c:1)}      NULL
a      {(fc:3)}                     {(f:3, c:3)}|a
c      {(f:3)}                      {(f:3)}|c
f      NULL                         NULL

Example – Mining The FP-Tree

- Recursively visit the conditional FP-Trees:

[Figure: four conditional FP-trees. m-conditional: root - f:3 - c:3 - a:3; am-conditional: root - f:3 - c:3; cm-conditional: root - f:3; cam-conditional: root - f:3]

Example – Mining The FP-Tree

- If we have a single path, the frequent itemsets of a tree can be generated by enumerating all combinations of the nodes of the tree.

[Figure: the m-conditional FP-tree, a single path root - f:3 - c:3 - a:3]

Enumeration of all patterns concerning m (see the sketch below): m, fm, cm, am, fcm, fam, cam, fcam

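For a single path this enumeration is just the power set of the path items, each combined with the suffix m. A small sketch:

    from itertools import combinations

    path = [("f", 3), ("c", 3), ("a", 3)]    # the single path above, top-down
    m_support = 3
    for r in range(len(path) + 1):
        for combo in combinations(path, r):
            # On a single path, a pattern's support is the count of its
            # deepest node (the minimum along the path).
            sup = min((n for _, n in combo), default=m_support)
            print("".join(i for i, _ in combo) + "m", sup)
    # -> m, fm, cm, am, fcm, fam, cam, fcam, each with support 3
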
Advantages of FP-Tree

- Completeness:
  - Never breaks a long pattern of any transaction
  - Preserves complete information for frequent-pattern mining
- Compactness:
  - Reduces irrelevant information: infrequent items are gone
  - Frequency-descending ordering: more frequent items are more likely to be shared
  - Never larger than the original database (not counting node-links and counts)

Comparison of FP-Tree Method and Apriori
[Figure: run time (sec., 0-100) versus support threshold (%, 0-3) on data set T25I20D10K, with two curves: D1 FP-growth runtime and D1 Apriori runtime]

References

- [AIS93] R. Agrawal, T. Imielinski, and A. Swami. Mining Association Rules between Sets of Items in Large Databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 207-216, May 1993.
- R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB), 1994.
- Jiawei Han, Jian Pei, and Yiwen Yin. Mining Frequent Patterns without Candidate Generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 1-12, 2000.