DATA MINING
ASSOCIATION RULES
Outline
– Large Itemsets
– Basic Algorithms
– Simple Apriori Algorithm
– FP Algorithm
– Advanced Topics
Association Rules Outline
• Association Rules Problem Overview
– Large itemsets
• Association Rules Algorithms
– Apriori
– Sampling
– Partitioning
– Parallel Algorithms
• Comparing Techniques
• Incremental Algorithms
• Advanced AR Techniques
Example: Market Basket Data
• Items frequently purchased together:
  Computer ⇒ Printer
• Uses:
– Placement
– Advertising
– Sales
– Coupons
• Objective: increase sales and reduce costs
• Called Market Basket Analysis, Shopping Cart
Analysis
Association Rule Definitions
• Set of items: I = {I1, I2, …, Im}
• Transactions: D = {t1, t2, …, tn}, where each tj ⊆ I
• Itemset: {Ii1, Ii2, …, Iik} ⊆ I
• Support of an itemset: percentage of transactions which contain that itemset.
• Large (Frequent) itemset: an itemset whose number of occurrences is above a threshold.
Transaction Data: Supermarket Data
• Market basket transactions:
t1: {bread, cheese, milk}
t2: {apple, jam, salt, ice-cream}
…
…
tn: {biscuit, jam, milk}
• Concepts:
– An item: an item/article in a basket
– I: the set of all items sold in the store
– A Transaction: items purchased in a basket; it may
have TID (transaction ID)
– A Transactional dataset: A set of transactions
Transaction Data: A Set Of Documents
• A text document data set. Each document is
treated as a “bag” of keywords
doc1: Student, Teach, School
doc2: Student, School
doc3: Teach, School, City, Game
doc4: Baseball, Basketball
doc5: Basketball, Player, Spectator
doc6: Baseball, Coach, Game, Team
doc7: Basketball, Team, City, Game
Association Rules Example
TID   Items
1     Bread, Milk
2     Bread, Biscuit, FruitJuice, Eggs
3     Milk, Biscuit, FruitJuice, Coke
4     Bread, Milk, Biscuit, FruitJuice
5     Bread, Milk, Biscuit, Coke

Example rules:
{Biscuit} ⇒ {FruitJuice}
{Milk, Bread} ⇒ {Eggs, Coke}
{FruitJuice, Bread} ⇒ {Milk}
Association Rule Definitions
• Association Rule (AR): an implication X ⇒ Y where X, Y ⊆ I and X ∩ Y = ∅
• Support of AR (s) X ⇒ Y: percentage of transactions that contain X ∪ Y
• Confidence of AR (α) X ⇒ Y: ratio of the number of transactions that contain X ∪ Y to the number that contain X
Association Rules Ex (cont’d)
• Itemset
  – A collection of one or more items
    • Example: {Milk, Bread, Biscuit}
  – k-itemset: an itemset that contains k items
• Support count (σ)
  – Frequency of occurrence of an itemset
  – E.g. σ({Milk, Bread, Biscuit}) = 2  (in the transaction table above)
• Support
  – Fraction of transactions that contain an itemset
  – E.g. s({Milk, Bread, Biscuit}) = 2/5
• Frequent Itemset
  – An itemset whose support is greater than or equal to a minsup threshold
Association Rule Problem
• Given a set of items I = {I1, I2, …, Im} and a database of transactions D = {t1, t2, …, tn}, where ti = {Ii1, Ii2, …, Iik} and each Iij ∈ I, the Association Rule Problem is to identify all association rules X ⇒ Y with a minimum support and confidence.
• This task is also referred to as link analysis.
• NOTE: The support of the rule X ⇒ Y is the same as the support of the itemset X ∪ Y.
Definition: Association Rule
• An implication expression of the form X ⇒ Y, where X and Y are itemsets
  – Example: {Milk, Biscuit} ⇒ {FruitJuice}
• Rule Evaluation Metrics (using the transaction table above):
  – Support (s): fraction of transactions that contain both X and Y
  – Confidence (c): measures how often items in Y appear in transactions that contain X
• Example: {Milk, Biscuit} ⇒ {FruitJuice}
  s = σ(Milk, Biscuit, FruitJuice) / |T| = 2/5 = 0.4
  c = σ(Milk, Biscuit, FruitJuice) / σ(Milk, Biscuit) = 2/3 ≈ 0.67
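To make these two metrics concrete, here is a minimal Python sketch (added for illustration only; the transactions list and the helper names support and confidence are mine, not from the slides) that reproduces the numbers above:

# Illustrative sketch: support and confidence over the toy transaction table above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Biscuit", "FruitJuice", "Eggs"},
    {"Milk", "Biscuit", "FruitJuice", "Coke"},
    {"Bread", "Milk", "Biscuit", "FruitJuice"},
    {"Bread", "Milk", "Biscuit", "Coke"},
]

def support(itemset, transactions):
    # Fraction of transactions containing every item of `itemset`.
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    # support(X ∪ Y) / support(X)
    return support(set(X) | set(Y), transactions) / support(X, transactions)

print(support({"Milk", "Biscuit", "FruitJuice"}, transactions))       # 0.4
print(confidence({"Milk", "Biscuit"}, {"FruitJuice"}, transactions))  # 0.666...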
Association Rule Mining Task
• Given a set of transactions T, the goal of
association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold
• Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf
thresholds
⇒ Computationally prohibitive!
Mining Association Rules
Example rules (from the transaction table above):
  {Milk, Biscuit} ⇒ {FruitJuice}        (s=0.4, c=0.67)
  {Milk, FruitJuice} ⇒ {Biscuit}        (s=0.4, c=1.0)
  {Biscuit, FruitJuice} ⇒ {Milk}        (s=0.4, c=0.67)
  {FruitJuice} ⇒ {Milk, Biscuit}        (s=0.4, c=0.67)
  {Biscuit} ⇒ {Milk, FruitJuice}        (s=0.4, c=0.5)
  {Milk} ⇒ {Biscuit, FruitJuice}        (s=0.4, c=0.5)
Observations
• All the above rules are binary partitions of the
same itemset:
{Milk, Biscuit, FruitJuice}
• Rules originating from the same itemset have
identical support but can have different
confidence
• Thus, we may decouple the support and
confidence requirements
Example 2
• Transaction data:
  t1: Butter, Cocoa, Milk
  t2: Butter, Cheese
  t3: Cheese, Boots
  t4: Butter, Cocoa, Cheese
  t5: Butter, Cocoa, Clothes, Cheese, Milk
  t6: Cocoa, Clothes, Milk
  t7: Cocoa, Milk, Clothes
• Assume: minsup = 30%, minconf = 80%
• An example frequent itemset: {Cocoa, Clothes, Milk}   [sup = 3/7]
• Association rules from the itemset:
  Clothes ⇒ Milk, Cocoa        [sup = 3/7, conf = 3/3]
  …
  Clothes, Cocoa ⇒ Milk        [sup = 3/7, conf = 3/3]
  …
Mining Association Rules
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high confidence rules from each
frequent itemset, where each rule is a binary
partitioning of a frequent itemset
• Frequent itemset generation is still
computationally expensive
Frequent Itemset Generation
[Figure: the itemset lattice over five items A–E, from the empty set (null) at the top, through all 1-, 2-, 3- and 4-itemsets, down to ABCDE.]

Given d items, there are 2^d possible candidate itemsets.
Frequent Itemset Generation
• Brute-force approach:
  – Each itemset in the lattice is a candidate frequent itemset
  – Count the support of each candidate by scanning the database
  – Match each transaction against every candidate
  – Complexity ~ O(NMw): expensive, since M = 2^d!
    (N = number of transactions, M = number of candidates, w = maximum transaction width;
    the transaction table is the same five-transaction example used above.)
Frequent Itemset Generation Strategies
• Reduce the number of candidates (M)
– Complete search: M=2d
– Use pruning techniques to reduce M
• Reduce the number of transactions (N)
– Reduce size of N as the size of itemset
increases
• Reduce the number of comparisons (NM)
– Use efficient data structures to store the
candidates or transactions
– No need to match every candidate against
every transaction
DATA MINING VESIT M.VIJAYALAKSHMI
20
Reducing Number of Candidates
• Apriori principle:
– If an itemset is frequent, then all of its subsets must
also be frequent
• Apriori principle holds due to the following
property of the support measure:
∀X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)
– Support of an itemset never exceeds the support of
its subsets
Apriori Rule
• Frequent Itemset Property:
Any subset of a Frequent itemset is
Frequent.
• Contrapositive:
  If an itemset is not frequent, none of its supersets are frequent.
Illustrating Apriori Principle
[Figure: the itemset lattice over items A–E illustrating the Apriori principle — once an itemset is found to be infrequent, the entire sub-lattice of its supersets is pruned from the search space.]
Illustrating Apriori Principle
Minimum support count = 3

Items (1-itemsets):
  Bread: 4, Coke: 2, Milk: 4, FruitJuice: 3, Biscuit: 4, Eggs: 1

Pairs (2-itemsets) — no need to generate candidates involving Coke or Eggs:
  {Bread, Milk}: 3
  {Bread, FruitJuice}: 2
  {Bread, Biscuit}: 3
  {Milk, FruitJuice}: 2
  {Milk, Biscuit}: 3
  {FruitJuice, Biscuit}: 3

Triplets (3-itemsets):
  {Bread, Milk, Biscuit}: 3

If every subset is considered: 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13 candidates.
Apriori Algorithm
• Method:
– Let k=1
– Generate frequent itemsets of length 1
– Repeat until no new frequent itemsets are identified
• Generate length (k+1) candidate itemsets from
length k frequent itemsets
• Prune candidate itemsets containing subsets of
length k that are infrequent
• Count the support of each candidate by scanning
the DB
• Eliminate candidates that are infrequent, leaving
only those that are frequent
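A compact Python sketch of this level-wise loop is given below (illustrative only, not from the slides; for brevity, candidates are formed from all items that appear in the current frequent itemsets and then pruned, rather than with the pairwise join shown on the next slides):

from itertools import combinations

def apriori(transactions, min_count):
    # transactions: iterable of item collections; min_count: minimum support count.
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:                          # frequent 1-itemsets
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    all_frequent = dict(frequent)
    k = 1
    while frequent:
        items = sorted({i for s in frequent for i in s})
        # Generate (k+1)-candidates, then prune those with an infrequent k-subset
        candidates = [frozenset(c) for c in combinations(items, k + 1)]
        candidates = [c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))]
        # Count support with one scan of the DB and keep the frequent ones
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_count}
        all_frequent.update(frequent)
        k += 1
    return all_frequent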
Apriori Algorithm
A level-wise, candidate-generation-and-test approach (Agrawal & Srikant 1994). Min_sup = 2.
[The transaction database D itself appears only as a figure in the original slide.]

Scan D → 1-candidates with support counts:
  a:2, b:3, c:3, d:1, e:3
Frequent 1-itemsets:
  a:2, b:3, c:3, e:3

2-candidates: ab, ac, ae, bc, be, ce
Scan D → counting:
  ab:1, ac:2, ae:1, bc:2, be:3, ce:2
Frequent 2-itemsets:
  ac:2, bc:2, be:3, ce:2

3-candidates: bce
Scan D → counting: bce:2
Frequent 3-itemsets: bce:2
Example – Finding Frequent Itemsets
minsup = 0.5 (notation — itemset:count)

Dataset T:
  T100: 1, 3, 4
  T200: 2, 3, 5
  T300: 1, 2, 3, 5
  T400: 2, 5

1. Scan T → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
   → F1: {1}:2, {2}:3, {3}:3, {5}:3
   → C2: {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}
2. Scan T → C2: {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2
   → F2: {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2
   → C3: {2,3,5}
3. Scan T → C3: {2,3,5}:2 → F3: {2,3,5}
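As a quick check, applying the illustrative apriori sketch from earlier to this dataset (min_count = 2, i.e. minsup = 0.5 over four transactions) reproduces F1, F2 and F3; a hypothetical invocation:

T = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(T, min_count=2)   # `apriori` as sketched above (illustrative)
# Among the returned itemsets: {2,3,5} with count 2, matching F3.
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(set(itemset), count)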
Details: Ordering Of Items
• The items in I are sorted in lexicographic
order (which is a total order).
• The order is used throughout the algorithm
in each itemset.
• {w[1], w[2], …, w[k]} represents a k-itemset
w consisting of items w[1], w[2], …, w[k],
where w[1] < w[2] < … < w[k] according to
the total order.
Apriori-Gen
• Generate candidates of size i+1 from large
itemsets of size i.
• Approach used: join two large itemsets of size i if they agree on their first i−1 items.
• May also prune candidates that have subsets that are not large.
Candidate-Gen Function
Function candidate-gen(Fk-1)
  Ck ← ∅;
  forall f1, f2 ∈ Fk-1
      with f1 = {i1, …, ik-2, ik-1}
      and  f2 = {i1, …, ik-2, i'k-1}
      and  ik-1 < i'k-1 do
    c ← {i1, …, ik-1, i'k-1};        // join f1 and f2
    Ck ← Ck ∪ {c};
    for each (k-1)-subset s of c do
      if (s ∉ Fk-1) then
        delete c from Ck;            // prune
    end
  end
  return Ck;
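For readers who prefer running code, here is a minimal Python rendering of this join-and-prune step (illustrative only; the name candidate_gen and the representation of itemsets as sorted tuples are my own choices, not part of the original pseudocode):

from itertools import combinations

def candidate_gen(F_k_minus_1):
    # F_k_minus_1: set of frequent (k-1)-itemsets, each a tuple sorted in the total order.
    F = sorted(F_k_minus_1)
    k_minus_1 = len(F[0]) if F else 0
    Ck = set()
    for f1 in F:
        for f2 in F:
            # Join: identical first k-2 items, last item of f1 smaller than last of f2
            if f1[:-1] == f2[:-1] and f1[-1] < f2[-1]:
                c = f1 + (f2[-1],)
                # Prune: every (k-1)-subset of c must itself be frequent
                if all(tuple(s) in F_k_minus_1 for s in combinations(c, k_minus_1)):
                    Ck.add(c)
    return Ck

# For the example on the next slide, F3 = {(1,2,3), (1,2,4), (1,3,4), (1,3,5), (2,3,4)}:
# candidate_gen(F3) keeps (1,2,3,4) and drops (1,3,4,5), since (1,4,5) is not in F3.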
An Example
• F3 = {{1, 2, 3}, {1, 2, 4}, {1, 3, 4},
{1, 3, 5}, {2, 3, 4}}
• After join
– C4 = {{1, 2, 3, 4}, {1, 3, 4, 5}}
• After pruning:
– C4 = {{1, 2, 3, 4}}
because {1, 4, 5} is not in F3 ({1, 3, 4, 5} is
removed)
Apriori-Gen Example
[The worked Apriori-Gen example appears only as figures in the original slides.]
Apriori Adv/Disadv
• Advantages:
– Uses large itemset property.
– Easily parallelized
– Easy to implement.
• Disadvantages:
– Assumes transaction database is memory resident.
– Requires up to m database scans.
Step 2: Generating Rules From Frequent
Itemsets
• Frequent itemsets ⇒ association rules
• One more step is needed to generate association
rules
• For each frequent itemset X,
  for each proper nonempty subset A of X:
  – Let B = X − A
  – A ⇒ B is an association rule if confidence(A ⇒ B) ≥ minconf, where
    support(A ⇒ B) = support(A ∪ B) = support(X)
    confidence(A ⇒ B) = support(A ∪ B) / support(A)
Generating Rules: An example
• Suppose {2,3,4} is frequent, with sup=50%
– Proper nonempty subsets: {2,3}, {2,4}, {3,4}, {2}, {3}, {4},
with sup=50%, 50%, 75%, 75%, 75%, 75% respectively
– These generate the following association rules:
  • 2,3 ⇒ 4, confidence = 100%
  • 2,4 ⇒ 3, confidence = 100%
  • 3,4 ⇒ 2, confidence = 67%
  • 2 ⇒ 3,4, confidence = 67%
  • 3 ⇒ 2,4, confidence = 67%
  • 4 ⇒ 2,3, confidence = 67%
• All rules have support = 50%
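The subset enumeration above is easy to mechanize; a short illustrative Python sketch (the function name rules_from_itemset and the precomputed support dictionary are assumptions for the example, not part of the slides):

from itertools import combinations

def rules_from_itemset(X, support, minconf):
    # X: frozenset; support: dict mapping frozenset -> support value.
    rules = []
    for r in range(1, len(X)):
        for antecedent in combinations(sorted(X), r):
            A = frozenset(antecedent)
            B = X - A
            conf = support[X] / support[A]
            if conf >= minconf:
                rules.append((A, B, support[X], conf))
    return rules

# Hypothetical supports matching the example above:
support = {frozenset({2, 3, 4}): 0.50, frozenset({2, 3}): 0.50, frozenset({2, 4}): 0.50,
           frozenset({3, 4}): 0.75, frozenset({2}): 0.75, frozenset({3}): 0.75,
           frozenset({4}): 0.75}
for A, B, s, c in rules_from_itemset(frozenset({2, 3, 4}), support, minconf=0.6):
    print(sorted(A), "=>", sorted(B), f"s={s:.0%}, c={c:.0%}")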
Rule Generation
• Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f ⇒ L − f satisfies the minimum confidence requirement
  – If {A,B,C,D} is a frequent itemset, the candidate rules are:
    ABC ⇒ D,  ABD ⇒ C,  ACD ⇒ B,  BCD ⇒ A,
    AB ⇒ CD,  AC ⇒ BD,  AD ⇒ BC,  BC ⇒ AD,  BD ⇒ AC,  CD ⇒ AB,
    A ⇒ BCD,  B ⇒ ACD,  C ⇒ ABD,  D ⇒ ABC
• If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L ⇒ ∅ and ∅ ⇒ L)
Generating Rules
• To recap: to obtain A ⇒ B, we need support(A ∪ B) and support(A).
• All the information required for computing confidence has already been recorded during itemset generation, so there is no need to read the transaction data T again.
• This step is not as time-consuming as frequent itemset generation.
Rule Generation
• How to efficiently generate rules from
frequent itemsets?
– In general, confidence does not have an anti-monotone property:
  c(ABC ⇒ D) can be larger or smaller than c(AB ⇒ D)
– But the confidence of rules generated from the same itemset does have an anti-monotone property
– e.g., for L = {A,B,C,D}:  c(ABC ⇒ D) ≥ c(AB ⇒ CD) ≥ c(A ⇒ BCD)
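A one-line justification (added here for clarity): all three rules share the same numerator s(ABCD), since c(X ⇒ Y) = s(X ∪ Y)/s(X), while their denominators satisfy s(A) ≥ s(AB) ≥ s(ABC); moving items from the antecedent to the consequent can therefore never increase confidence, which is exactly the ordering shown above.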
Rule Generation for Apriori Algorithm

[Figure: the lattice of rules generated from the frequent itemset {A,B,C,D}, from ABCD ⇒ ∅ at the top down to ∅ ⇒ ABCD at the bottom. If a rule such as CD ⇒ AB has low confidence, every rule below it whose antecedent is a subset of CD (e.g., D ⇒ ABC, C ⇒ ABD) can be pruned.]
APriori - Performance Bottlenecks
• The core of the Apriori algorithm:
– Use frequent (k – 1)-itemsets to generate candidate frequent
k-itemsets
– Use database scan and pattern matching to collect counts for
the candidate itemsets
• Bottleneck of Apriori: candidate generation
– Huge candidate sets:
  • 10^4 frequent 1-itemsets will generate about 10^7 candidate 2-itemsets
  • To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates
– Multiple scans of the database:
  • Needs (n + 1) scans, where n is the length of the longest pattern
Mining Frequent Patterns
Without Candidate Generation
• Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure
– highly condensed, but complete for frequent pattern mining
– avoid costly database scans
• Develop an efficient, FP-tree-based frequent pattern
mining method
– A divide-and-conquer methodology: decompose mining
tasks into smaller ones
– Avoid candidate generation: sub-database test only!
Construct FP-tree From A Transaction DB
TID   Items bought                   (ordered) frequent items
100   {f, a, c, d, g, i, m, p}       {f, c, a, m, p}
200   {a, b, c, f, l, m, o}          {f, c, a, b, m}
300   {b, f, h, j, o}                {f, b}
400   {b, c, k, s, p}                {c, b, p}
500   {a, f, c, e, l, p, m, n}       {f, c, a, m, p}

min_support = 0.5 (i.e., a support count of at least 3 out of 5 transactions)

Steps:
1. Scan the DB once and find the frequent 1-itemsets (single-item patterns)
2. Order the frequent items in frequency-descending order
3. Scan the DB again and construct the FP-tree

Header table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3

[Figure: the resulting FP-tree. Root {} has children f:4 and c:1. Under f:4: c:3 and b:1; under c:3: a:3; under a:3: m:2 and b:1 (with m:1 below that b:1); under m:2: p:2. Under the root's c:1: b:1, with p:1 below it. Header-table entries link to all nodes of their item.]
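A bare-bones Python sketch of the two-pass construction just described (illustrative only; the class and function names FPNode and build_fp_tree are my own):

from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}               # item -> FPNode

def build_fp_tree(transactions, min_count):
    # Pass 1: count item frequencies and keep only frequent items
    freq = defaultdict(int)
    for t in transactions:
        for item in t:
            freq[item] += 1
    freq = {i: c for i, c in freq.items() if c >= min_count}
    root = FPNode(None, None)
    header = defaultdict(list)           # item -> node-links (list of FPNodes)
    # Pass 2: insert each transaction's frequent items in frequency-descending order
    for t in transactions:
        ordered = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += 1
    return root, header, freq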
Benefits of the FP-tree Structure
• Completeness:
– never breaks a long pattern of any transaction
– preserves complete information for frequent pattern mining
• Compactness
– reduces irrelevant information — infrequent items are gone
– frequency-descending ordering: more frequent items are more likely to be shared
– never larger than the original database (not counting node-links and counts)
Major Steps to Mine FP-tree
1) Construct conditional pattern base for each node in
the FP-tree
2) Construct conditional FP-tree from each conditional
pattern-base
3) Recursively mine conditional FP-trees and grow
frequent patterns obtained so far
– If the conditional FP-tree contains a single path, simply enumerate all the patterns
Step 1: FP-tree to Conditional Pattern Base
• Start from the frequent-item header table of the FP-tree
• Traverse the FP-tree by following the node-links of each frequent item
• Accumulate all the transformed prefix paths of that item to form its conditional pattern base

[Figure: the FP-tree and header table from the construction slide above.]

Conditional pattern bases:
  item   conditional pattern base
  c      f:3
  a      fc:3
  b      fca:1, f:1, c:1
  m      fca:2, fcab:1
  p      fcam:2, cb:1
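Continuing the earlier FP-tree sketch, a conditional pattern base can be collected by walking upward from every node of an item via the header table's node-links (illustrative helper; prefix_paths is my own name):

def prefix_paths(item, header):
    # For each node of `item`, walk up to the root and record the prefix path
    # together with that node's count (e.g., for m this yields fca:2 and fcab:1).
    base = []
    for node in header[item]:
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            base.append((list(reversed(path)), node.count))
    return base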
Properties of FP-tree for Conditional
Pattern Base Construction
• Node-link property
– For any frequent item ai, all the possible frequent patterns
that contain ai can be obtained by following ai's node-links,
starting from ai's head in the FP-tree header
• Prefix path property
– To calculate the frequent patterns for a node ai in a path P,
only the prefix sub-path of ai in P need to be accumulated,
and its frequency count should carry the same count as node
ai.
Step 2: Construct Conditional FP-tree
• For each pattern-base
– Accumulate the count for each item in the base
– Construct the FP-tree for the frequent items of the pattern
base
m-conditional pattern base: fca:2, fcab:1

[Figure: on the left, the global FP-tree with its header table; on the right, the m-conditional FP-tree built from the pattern base — a single path {} – f:3 – c:3 – a:3 (b is dropped because its count in the base, 1, is below the minimum support count).]

All frequent patterns concerning m:
  m, fm, cm, am, fcm, fam, cam, fcam
Mining Frequent Patterns by Creating
Conditional Pattern-Bases
Item   Conditional pattern base          Conditional FP-tree
p      {(fcam:2), (cb:1)}                {(c:3)} | p
m      {(fca:2), (fcab:1)}               {(f:3, c:3, a:3)} | m
b      {(fca:1), (f:1), (c:1)}           Empty
a      {(fc:3)}                          {(f:3, c:3)} | a
c      {(f:3)}                           {(f:3)} | c
f      Empty                             Empty
Step 3: Recursively mine the conditional
FP-tree
[Figure: starting from the m-conditional FP-tree ({} – f:3 – c:3 – a:3), mine recursively.]

• Conditional pattern base of "am": (fc:3) → am-conditional FP-tree: {} – f:3 – c:3
• Conditional pattern base of "cm": (f:3) → cm-conditional FP-tree: {} – f:3
• Conditional pattern base of "cam": (f:3) → cam-conditional FP-tree: {} – f:3
Single FP-tree Path Generation
• Suppose an FP-tree T has a single path P
• The complete set of frequent pattern of T can be generated by
enumeration of all the combinations of the sub-paths of P
[Figure: the m-conditional FP-tree is a single path: {} – f:3 – c:3 – a:3.]

All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam
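Enumerating "all combinations of the sub-paths" is just a powerset over the path items, each combined with the suffix item; a tiny illustrative sketch:

from itertools import combinations

path = [("f", 3), ("c", 3), ("a", 3)]      # the single-path m-conditional FP-tree
suffix, suffix_count = "m", 3
for r in range(len(path) + 1):
    for combo in combinations(path, r):
        pattern = [i for i, _ in combo] + [suffix]
        # Support count of the pattern = minimum count along the chosen sub-path
        count = min([c for _, c in combo] + [suffix_count])
        print("".join(pattern), count)      # m, fm, cm, am, fcm, fam, cam, fcam (all 3)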
Principles of Frequent Pattern
Growth
• Pattern growth property
– Let α be a frequent itemset in DB, B be α's conditional pattern base, and β be an itemset in B. Then α ∪ β is a frequent itemset in DB if and only if β is frequent in B.
• “abcdef ” is a frequent pattern, if and only if
– “abcde ” is a frequent pattern, and
– “f ” is frequent in the set of transactions containing “abcde ”
Why Is Frequent Pattern
Growth Fast?
• Performance study shows
– FP-growth is an order of magnitude faster than Apriori,
and is also faster than tree-projection
• Reasoning
– No candidate generation, no candidate test
– Use compact data structure
– Eliminate repeated database scan
– Basic operation is counting and FP-tree building
FP-Tree Construction
TID   Items
1     {A,B}
2     {B,C,D}
3     {A,C,D,E}
4     {A,D,E}
5     {A,B,C}
6     {A,B,C,D}
7     {B,C}
8     {A,B,C}
9     {A,B,D}
10    {B,C,E}

After reading TID=1: null – A:1 – B:1
After reading TID=2: null with two branches, A:1 – B:1 and B:1 – C:1 – D:1
FP-Tree Construction
Transaction database: the same ten transactions as above.

Header table: items A, B, C, D, E, each with a pointer (node-link) into the tree.

[Figure: the final FP-tree. Root null has children A:7 and B:3. Under A:7: B:5, C:1 and D:1; under B:5: C:3 (with child D:1) and D:1; under C:1: D:1 with child E:1; under D:1: E:1. Under B:3: C:3, with children D:1 and E:1.]

Pointers are used to assist frequent itemset generation.
FP-growth
Build the conditional pattern base for E:
  P = {(A:1, C:1, D:1), (A:1, D:1), (B:1, C:1)}

[Figure: the full FP-tree with the three prefix paths that end at an E node highlighted.]

Recursively apply FP-growth on P.
FP-growth
Conditional pattern base for E:
  P = {(A:1, C:1, D:1, E:1), (A:1, D:1, E:1), (B:1, C:1, E:1)}

[Figure: the conditional FP-tree for E — null with branches A:2 – C:1 – D:1 – E:1, A:2 – D:1 – E:1, and B:1 – C:1 – E:1.]

Count for E is 3, so {E} is a frequent itemset.
Recursively apply FP-growth on P.
FP-growth
Conditional pattern base for D within the conditional base for E:
  P = {(A:1, C:1, D:1), (A:1, D:1)}

[Figure: the conditional tree for D within E — null – A:2, with branches C:1 – D:1 and D:1.]

Count for D is 2, so {D, E} is a frequent itemset.
Recursively apply FP-growth on P.
FP-growth
Conditional pattern base for C within D within E:
  P = {(A:1, C:1)}

[Figure: the conditional tree for C within D within E — null – A:1 – C:1.]

Count for C is 1, so {C, D, E} is NOT a frequent itemset.
FP-growth
Conditional tree for A within D within E:

[Figure: null – A:2.]

Count for A is 2, so {A, D, E} is a frequent itemset.

Next step: construct the conditional tree for C within the conditional tree for E.
Continue until the conditional tree for A (which has only the node A) has been explored.
Benefits of the FP-tree Structure
• Performance study shows
  – FP-growth is an order of magnitude faster than Apriori, and is also faster than tree-projection
• Reasoning
  – No candidate generation, no candidate test
  – Uses a compact data structure
  – Eliminates repeated database scans
  – Basic operation is counting and FP-tree building

[Figure: run time (sec.) versus support threshold (%) on dataset D1, comparing D1 FP-growth runtime and D1 Apriori runtime.]
Iceberg Queries
• Iceberg query: compute aggregates over one attribute or a set of attributes, only for those groups whose aggregate value is above a certain threshold
• Example:
select P.custID, P.itemID, sum(P.qty)
from purchase P
group by P.custID, P.itemID
having sum(P.qty) >= 10
• Compute iceberg queries efficiently by Apriori:
– First compute lower dimensions
– Then compute higher dimensions only when all the lower ones
are above the threshold
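A minimal sketch of that two-phase idea in Python (the row format, helper name iceberg_pairs and the threshold are assumptions for illustration; quantities are assumed non-negative so the pruning is safe):

from collections import defaultdict

def iceberg_pairs(purchases, threshold=10):
    # purchases: iterable of (custID, itemID, qty) rows.
    # Phase 1 (lower dimensions): total qty per custID and per itemID.
    by_cust, by_item = defaultdict(int), defaultdict(int)
    for cust, item, qty in purchases:
        by_cust[cust] += qty
        by_item[item] += qty
    # Phase 2 (higher dimension): only count pairs whose 1-D aggregates both pass,
    # since a pair's total can never exceed either of its 1-D totals.
    pairs = defaultdict(int)
    for cust, item, qty in purchases:
        if by_cust[cust] >= threshold and by_item[item] >= threshold:
            pairs[(cust, item)] += qty
    return {k: v for k, v in pairs.items() if v >= threshold}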
Multiple-Level Association Rules
• Items often form a hierarchy.
• Items at the lower level are expected to have lower support.
• Rules regarding itemsets at the appropriate levels could be quite useful.
• The transaction database can be encoded based on dimensions and levels.
• We can explore shared multi-level mining.

[Figure: a concept hierarchy — Food at the root, with categories such as milk (Dairy) and bread below it; milk refines into skim and 2% (with brands such as Amul), and bread into wheat and white.]

TID   Items (encoded)
T1    {111, 121, 211, 221}
T2    {111, 211, 222, 323}
T3    {112, 122, 221, 411}
T4    {111, 121}
T5    {111, 122, 211, 221, 413}
Mining Multi-Level Associations
• A top-down, progressive-deepening approach:
  – First find high-level strong rules:
    milk ⇒ bread [20%, 60%].
  – Then find their lower-level "weaker" rules:
    2% milk ⇒ wheat bread [6%, 50%].
• Variations on mining multiple-level association rules:
  – Level-crossed association rules:
    2% milk ⇒ Wonder wheat bread
  – Association rules with multiple, alternative hierarchies:
    2% milk ⇒ Wonder bread
Multi-level Association: Uniform
Support vs. Reduced Support
• Uniform Support: the same minimum support for all
levels
– + One minimum support threshold. No need to
examine itemsets containing any item whose
ancestors do not have minimum support.
– − Lower-level items do not occur as frequently. If the support threshold is
  • too high ⇒ we miss low-level associations
  • too low ⇒ we generate too many high-level associations
• Reduced Support: reduced minimum support at lower
levels
Multi-level Association:
Redundancy Filtering
• Some rules may be redundant due to “ancestor”
relationships between items.
• Example
– milk ⇒ wheat bread [support = 8%, confidence = 70%]
– 2% milk ⇒ wheat bread [support = 2%, confidence = 72%]
• We say the first rule is an ancestor of the second rule.
• A rule is redundant if its support is close to the
“expected” value, based on the rule’s ancestor.
Multi-Dimensional Association:
Concepts
• Single-dimensional rules:
  buys(X, "milk") ⇒ buys(X, "bread")
• Multi-dimensional rules: ≥ 2 dimensions or predicates
  – Inter-dimension association rules (no repeated predicates)
    age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")
  – Hybrid-dimension association rules (repeated predicates)
    age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")
• Categorical Attributes
– finite number of possible values, no ordering among values
• Quantitative Attributes
– numeric, implicit ordering among values
Techniques for Mining
MD Associations
• Search for frequent k-predicate set:
– Example: {age, occupation, buys} is a 3-predicate set.
– Techniques can be categorized by how quantitative attributes (such as age) are treated.
1. Using static discretization of quantitative attributes
– Quantitative attributes are statically discretized by using
predefined concept hierarchies.
2. Quantitative association rules
– Quantitative attributes are dynamically discretized into "bins" based on the distribution of the data.
3. Distance-based association rules
– This is a dynamic discretization process that considers the
distance between data points.
Interestingness Measurements
• Objective measures — two popular measurements:
  1. support; and
  2. confidence
• Subjective measures — a rule (pattern) is interesting if
  1. it is unexpected (surprising to the user); and/or
  2. it is actionable (the user can do something with it)
Criticism of Support and Confidence
• Example 1: Among 5000 students
• 3000 play basketball
• 3750 eat cereal
• 2000 both play basketball and eat cereal
– play basketball ⇒ eat cereal [40%, 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
– play basketball ⇒ not eat cereal [20%, 33.3%] is far more accurate, although it has lower support and confidence.

              basketball    not basketball    sum(row)
  cereal      2000          1750              3750
  not cereal  1000          250               1250
  sum(col.)   3000          2000              5000
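To make the criticism concrete (a computation added here for clarity, anticipating the correlation/lift measure on the next slides): P(basketball) = 3000/5000 = 0.6, P(cereal) = 3750/5000 = 0.75, and P(basketball, cereal) = 2000/5000 = 0.4, so P(basketball, cereal) / (P(basketball) · P(cereal)) = 0.4 / 0.45 ≈ 0.89 < 1 — playing basketball and eating cereal are in fact negatively correlated, despite the 66.7% confidence.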
(Cont.)
• Example 2:
  – X and Y: positively correlated
  – X and Z: negatively correlated
  – yet the support and confidence of X ⇒ Z dominate
• We need a measure of dependent or correlated events:
  corr(A, B) = P(A ∪ B) / ( P(A) · P(B) )

  Transactions:  X: 1 1 1 1 0 0 0 0
                 Y: 1 1 0 0 0 0 0 0
                 Z: 0 1 1 1 1 1 1 1

  Rule     Support    Confidence
  X ⇒ Y    25%        50%
  X ⇒ Z    37.50%     75%

• P(B|A) / P(B) is also called the lift of the rule A ⇒ B