Frequent itemset mining and
temporal extensions
Sunita Sarawagi
[email protected]
http://www.it.iitb.ac.in/~sunita
Association rules
Given several sets of items, for example:
  Set of items purchased
  Set of pages visited on a website
  Set of doctors visited
Find all rules that correlate the presence of one set of items with another
  Rules are of the form X => Y where X and Y are sets of items
  E.g.: purchase of books A & B => purchase of C
Parameters: Support and Confidence
All rules X => Z have two parameters:
  Support: probability that a transaction has both X and Z
  Confidence: conditional probability that a transaction having X also contains Z
Two parameters to association rule mining:
  Minimum support s
  Minimum confidence c

Transaction ID | Items bought
2000           | A, B, C
1000           | A, C
4000           | A, D
5000           | B, E, F

With s = 50% and c = 50%:
  A => C (support 50%, confidence 66.6%)
  C => A (support 50%, confidence 100%)
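As an illustration, here is a minimal Python sketch (names and layout are my own, not from the slides) of how these two numbers are computed for the toy database above:

# Hypothetical helpers: support and confidence over the four example baskets.
transactions = [
    {"A", "B", "C"},   # TID 2000
    {"A", "C"},        # TID 1000
    {"A", "D"},        # TID 4000
    {"B", "E", "F"},   # TID 5000
]

def support(itemset, db):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(x, y, db):
    """Conditional probability that a transaction with X also contains Y."""
    return support(x | y, db) / support(x, db)

print(support({"A", "C"}, transactions))        # 0.5   -> support of A => C
print(confidence({"A"}, {"C"}, transactions))   # 0.666 -> confidence of A => C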
Applications of fast itemset counting
Cross selling in retail, banking
Catalog design and store layout
Applications in medicine: find redundant tests
Improve predictive capability of classifiers
that assume attribute independence
Improved clustering of categorical attributes
Finding association rules in large
databases
Number of transactions: in millions
Number of distinct items: tens of thousands
Lots of work on scalable algorithms
Typically two parts to the algorithm:
  1. Finding all frequent itemsets with support > S
  2. Finding rules with confidence greater than C
Frequent itemset search is the more expensive step
Apriori algorithm,
FP-tree algorithm
The Apriori Algorithm
L1 = {frequent items of size one};
for (k = 1; Lk != ∅; k++)
  Ck+1 = candidates generated from Lk by:
    • Join Lk with itself
    • Prune any (k+1)-itemset that has a k-subset not in Lk
  for each transaction t in database do
    • increment the count of all candidates in Ck+1 that are contained in t
  Lk+1 = candidates in Ck+1 with at least min_support
return ∪k Lk;
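A minimal Python sketch of the counting step of this loop, applied to one candidate set (the variable names and the toy database are illustrative; candidate generation itself is covered on the next slide):

from itertools import combinations

def count_and_filter(candidates, db, min_support_count):
    """One Apriori pass: count each candidate itemset over the database
    and keep those that meet the minimum support count."""
    counts = {c: 0 for c in candidates}
    for t in db:                          # one scan of the database
        for c in candidates:
            if c <= t:                    # candidate contained in transaction t
                counts[c] += 1
    return {c: n for c, n in counts.items() if n >= min_support_count}

# Toy database from the worked example below; C2 built from L1 = {1, 2, 3, 5}.
db = [frozenset(t) for t in ({1, 2, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5})]
c2 = [frozenset(p) for p in combinations([1, 2, 3, 5], 2)]
print(count_and_filter(c2, db, 2))        # keeps {1,2}, {2,3}, {2,5}, {3,5}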
How to Generate Candidates?
Suppose the items in Lk-1 are listed in an order
Step 1: self-joining Lk-1
  insert into Ck
  select p.item1, p.item2, …, p.itemk-1, q.itemk-1
  from Lk-1 p, Lk-1 q
  where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
Step 2: pruning
  forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
      if (s is not in Lk-1) then delete c from Ck
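A hedged Python sketch of the same join-and-prune step (the function name and tuple representation are my own); itemsets are kept as sorted tuples so the prefix join mirrors the SQL above:

from itertools import combinations

def gen_candidates(l_prev):
    """Generate Ck from Lk-1: self-join on the first k-2 items,
    then prune any candidate with an infrequent (k-1)-subset."""
    l_prev = sorted(l_prev)                      # itemsets as sorted tuples
    prev_set = set(l_prev)
    k = len(l_prev[0]) + 1
    candidates = []
    for i, p in enumerate(l_prev):
        for q in l_prev[i + 1:]:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:          # join step
                c = p + (q[-1],)
                # prune step: every (k-1)-subset must already be frequent
                if all(s in prev_set for s in combinations(c, k - 1)):
                    candidates.append(c)
    return candidates

# L3 = {abc, abd, acd, ace, bcd}: the join yields abcd and acde;
# acde is pruned because ade (and cde) are not in L3.
l3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]
print(gen_candidates(l3))                        # [('a', 'b', 'c', 'd')]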
The Apriori Algorithm — Example
(minimum support count = 2, i.e. 50% of the 4 transactions)

Database D
TID | Items
100 | 1, 2, 4
200 | 2, 3, 5
300 | 1, 2, 3, 5
400 | 2, 5

Scan D once to get C1 with counts:
itemset | sup.
{1}     | 2
{2}     | 4
{3}     | 2
{4}     | 1
{5}     | 3

L1 (drop {4}, which is below minimum support):
itemset | sup.
{1}     | 2
{2}     | 4
{3}     | 2
{5}     | 3

C2 (self-join of L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D to count C2:
itemset | sup.
{1 2}   | 2
{1 3}   | 1
{1 5}   | 1
{2 3}   | 2
{2 5}   | 3
{3 5}   | 2

L2 (drop {1 3} and {1 5}):
itemset | sup.
{1 2}   | 2
{2 3}   | 2
{2 5}   | 3
{3 5}   | 2

C3 (self-join of L2): {2 3 5}

Scan D to count C3, giving L3:
itemset | sup.
{2 3 5} | 2
Improvements to Apriori
Apriori with well-designed data structures works well in practice when frequent itemsets are not too long (the common case)
Lots of enhancements proposed:
  Sampling: count in two passes
  Invert the database to be column-major instead of row-major and count by intersection
  Count itemsets of multiple lengths in one pass
Reducing passes is not very useful since I/O is not the bottleneck
Main bottleneck: candidate generation and counting, which are not optimized for long itemsets
Mining Frequent Patterns
Without Candidate Generation
Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure
  highly condensed, but complete for frequent pattern mining
Develop an efficient, FP-tree-based frequent pattern mining method
  A divide-and-conquer methodology: decompose mining tasks into smaller ones
  Avoid candidate generation
Construct FP-tree from Database
min_support = 0.5
TID | Items bought               | (Ordered) frequent items
100 | {f, a, c, d, g, i, m, p}   | {f, c, a, m, p}
200 | {a, b, c, f, l, m, o}      | {f, c, a, b, m}
300 | {b, f, h, j, o}            | {f, b}
400 | {b, c, k, s, p}            | {c, b, p}
500 | {a, f, c, e, l, p, m, n}   | {f, c, a, m, p}

Steps:
1. Scan DB once, find frequent 1-itemsets
2. Order frequent items by decreasing frequency
3. Scan DB again, construct the FP-tree

Item frequencies: f:4, c:4, a:3, b:3, m:3, p:3

Resulting FP-tree (indentation shows parent/child links, root = {}):
{}
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1
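A compact Python sketch of this two-scan construction (class and variable names are my own): count item frequencies, reorder each transaction's frequent items by descending count, and insert them as a path that shares prefixes with earlier transactions. Ties in frequency are broken alphabetically here, so the exact tree shape can differ from the one drawn above while remaining a valid FP-tree:

from collections import Counter, defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_fp_tree(db, min_count):
    """Two scans: (1) find frequent-item counts, (2) insert ordered transactions."""
    freq = Counter(i for t in db for i in t)                    # scan 1
    freq = {i: c for i, c in freq.items() if c >= min_count}
    root, header = Node(None, None), defaultdict(list)          # header: item -> nodes
    for t in db:                                                # scan 2
        items = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for i in items:                                         # insert path, sharing prefixes
            if i not in node.children:
                node.children[i] = Node(i, node)
                header[i].append(node.children[i])
            node = node.children[i]
            node.count += 1
    return root, header, freq

# The five transactions above; min support 0.5 of 5 transactions means count >= 3.
db = [list("facdgimp"), list("abcflmo"), list("bfhjo"), list("bcksp"), list("afcelpmn")]
root, header, freq = build_fp_tree(db, min_count=3)
print({i: freq[i] for i in "fcabmp"})   # {'f': 4, 'c': 4, 'a': 3, 'b': 3, 'm': 3, 'p': 3}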
Step 1: FP-tree to Conditional Pattern Base
Starting from the frequent-item header table of the FP-tree:
  Traverse the FP-tree by following the node links of each frequent item
  Accumulate all of the transformed prefix paths of that item to form its conditional pattern base

Header Table
Item | frequency
f    | 4
c    | 4
a    | 3
b    | 3
m    | 3
p    | 3

Conditional pattern bases
item | conditional pattern base
c    | f:3
a    | fc:3
b    | fca:1, f:1, c:1
m    | fca:2, fcab:1
p    | fcam:2, cb:1

(The FP-tree itself is the one constructed on the previous slide.)
Step 2: Construct Conditional FP-tree
For each pattern base:
  Accumulate the count for each item in the base
  Construct the FP-tree for the frequent items of the pattern base

m-conditional pattern base: fca:2, fcab:1
m-conditional FP-tree: {} -> f:3 -> c:3 -> a:3 (b is dropped, count 1 is below minimum support)
All frequent patterns concerning m:
  m, fm, cm, am, fcm, fam, cam, fcam
Mining Frequent Patterns by
Creating Conditional Pattern-Bases
Item   Conditional pattern base      Conditional FP-tree
p      {(fcam:2), (cb:1)}            {(c:3)}|p
m      {(fca:2), (fcab:1)}           {(f:3, c:3, a:3)}|m
b      {(fca:1), (f:1), (c:1)}       Empty
a      {(fc:3)}                      {(f:3, c:3)}|a
c      {(f:3)}                       {(f:3)}|c
f      Empty                         Empty
Repeat this recursively for higher items…
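A hedged sketch of that recursion, reusing build_fp_tree and the variables from the construction sketch above (function and variable names are my own): for each item, emit the pattern, gather its conditional pattern base by walking parent pointers from the header table, build the conditional FP-tree, and recurse:

def fp_growth(root, header, freq, min_count, suffix=()):
    """Recursively mine frequent patterns from an FP-tree by building a
    conditional pattern base and a conditional FP-tree for each item."""
    patterns = {}
    for item in freq:
        new_suffix = (item,) + suffix
        patterns[new_suffix] = freq[item]            # support of item plus suffix
        # Conditional pattern base: the prefix path of every node holding item,
        # repeated as many times as that node's count.
        cond_base = []
        for node in header[item]:
            path, parent = [], node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            cond_base.extend([list(reversed(path))] * node.count)
        # Build the conditional FP-tree and recurse on it.
        cond_root, cond_header, cond_freq = build_fp_tree(cond_base, min_count)
        if cond_freq:
            patterns.update(fp_growth(cond_root, cond_header, cond_freq,
                                      min_count, new_suffix))
    return patterns

patterns = fp_growth(root, header, freq, min_count=3)
print(sorted(p for p in patterns if "m" in p))
# Eight patterns containing m, each with support 3: m, fm, cm, am, fcm, fam, cam, fcam.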
FP-growth vs. Apriori: Scalability With
the Support Threshold
Data set T25I20D10K
[Chart: run time (sec.) against support threshold (%) from 0 to 3%, comparing D1 FP-growth runtime and D1 Apriori runtime; run-time axis 0 to 100 sec.]
Criticism to Support and Confidence
X and Y: positively correlated
X and Z: negatively related
Yet the support and confidence of X => Z dominate

  X: 1 1 1 1 0 0 0 0
  Y: 1 1 0 0 0 0 0 0
  Z: 0 1 1 1 1 1 1 1

Rule  | Support | Confidence
X=>Y  | 25%     | 50%
X=>Z  | 37.5%   | 75%

Need to measure departure from expected.
For two items: P(A ∧ B) / (P(A) P(B))
For k items, expected support is derived from the supports of (k-1)-itemsets using iterative scaling methods
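A worked check of this measure on the table above (my own arithmetic, using the supports shown):
  P(X ∧ Y) / (P(X) P(Y)) = 0.25 / (0.5 × 0.25) = 2.0, above 1, so X and Y are positively correlated
  P(X ∧ Z) / (P(X) P(Z)) = 0.375 / (0.5 × 0.875) ≈ 0.86, below 1, so X and Z are negatively correlated despite the higher support and confidence of X => Z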
Prevalent correlations are not interesting
Analysts already know about prevalent rules
Interesting rules are those that deviate from prior expectation
Mining's payoff is in finding surprising phenomena

[Cartoon: the rule "bedsheets and pillow covers sell together!" is reported in 1995 and again in 1998, drawing a "Zzzz..." reaction]
What makes a rule surprising?
Does not match prior expectation
  Correlation between milk and cereal remains roughly constant over time
Cannot be trivially derived from simpler rules
  Milk 10%, cereal 10%
  Milk and cereal 10% … surprising
  Eggs 10%
  Milk, cereal and eggs 0.1% … surprising! (expected 1%)
Finding surprising temporal patterns
Algorithms to mine for surprising patterns
  Encode itemsets into bit streams using two models:
    • Mopt: the optimal model that allows change along time
    • Mcons: the constrained model that does not allow change along time
  Surprise = difference in number of bits between Mopt and Mcons
One item: optimal model
Milk-buying habits modeled by a biased coin
Customer tosses this coin to decide whether to buy milk
  Head or "1" denotes "basket contains milk"
  Coin bias is Pr[milk]
Analyst wants to study Pr[milk] along time
  A single coin with fixed bias is not interesting
  Changes in bias are interesting
The coin segmentation problem
Players A and B
A has a set of coins with different biases
A repeatedly:
  Picks an arbitrary coin
  Tosses it an arbitrary number of times
B observes the head/tail sequence:
  0 0 1 0 1 1 0 1 0 1 1 0 1 0 0 0 1
and guesses the transition points and biases

[Figure: A picks a coin, tosses it, and returns it; B sees only the toss sequence]
How to explain the data
Given n head/tail observations
Can assume n different coins, each with bias 0 or 1
  • Data fits perfectly (with probability one)
  • Many coins needed
Or assume one coin
  • May fit the data poorly
"Best explanation" is a compromise:
  0 0 1 0 | 1 1 0 1 0 1 1 | 0 1 0 0 0 1
   1/4          5/7            1/3
Coding examples
Sequence of k zeroes
  Naïve encoding takes k bits
  Run length takes about log k bits
1000 bits, 10 randomly placed 1's, rest 0's
  Posit a coin with bias 0.01
  Data encoding cost is (Shannon's theorem):
    -10 log2(0.01) - 990 log2(0.99) ≈ 81 bits « 1000 bits
  Note that 10 log2(100) ≈ 66 bits
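A quick numeric check of those two figures (a throwaway Python sketch, base-2 logs):

from math import log2

# Bits to code 1000 tosses with 10 ones under a coin of bias 0.01
print(round(-(10 * log2(0.01) + 990 * log2(0.99)), 1))   # 80.8, i.e. about 81 bits
# Bits for the ten 1's alone at log2(100) bits each
print(round(10 * log2(100), 1))                          # 66.4, i.e. about 66 bits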
How to find optimal segments
Sequence of 17 tosses:
0 0 1 0 1 1 0 1 0 1 1 0 1 0 0 0 1
Derived graph with 18 nodes, one per boundary position
Edge cost = model cost + data cost
  Model cost = one node ID + one Pr[head]
  Data cost for a segment, e.g. one with Pr[head] = 5/7: cost of coding its 5 heads and 2 tails
Optimal segmentation = shortest path in this graph
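A hedged dynamic-programming sketch of this shortest-path search (the per-segment model cost is an assumed constant here, standing in for the node-ID plus bias-parameter charge):

from math import log2

def data_cost(heads, tails):
    """Shannon bound for coding a segment under its empirical bias."""
    n, cost = heads + tails, 0.0
    for k in (heads, tails):
        if k:
            cost -= k * log2(k / n)
    return cost

def segment(seq, model_cost_bits=8.0):
    """Optimal segmentation as a shortest path over boundary positions 0..n."""
    n = len(seq)
    best = [0.0] + [float("inf")] * n          # best[j] = cheapest coding of seq[:j]
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):                     # edge i -> j covers seq[i:j]
            h = sum(seq[i:j])
            cost = best[i] + model_cost_bits + data_cost(h, (j - i) - h)
            if cost < best[j]:
                best[j], back[j] = cost, i
    cuts, j = [], n                            # walk back pointers to recover segments
    while j > 0:
        cuts.append((back[j], j))
        j = back[j]
    return list(reversed(cuts)), best[n]

tosses = [0,0,1,0,1,1,0,1,0,1,1,0,1,0,0,0,1]   # the 17-toss sequence above
print(segment(tosses))   # chosen segments and total bits; depends on the model cost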
Two or more items
“Unconstrained” segmentation
  k items induce a 2^k-sided coin
  "milk and cereal" = 11, "milk, not cereal" = 10, "neither" = 00, etc.
Shortest path finds significant shifts in any of the coin-face probabilities
Problem: some of these shifts may be completely explained by the marginals

[Figure: a four-sided coin with faces 00, 01, 10, 11]
Example
Drop in the joint sale of milk and cereal is completely explained by the drop in the sale of milk
Pr[milk & cereal] / (Pr[milk] Pr[cereal]) remains constant over time
Call this ratio θ; in this example θ = 2

[Chart: support (0 to 0.4) against time (0 to 10) for Milk, Cereal, and Both]
Constant-θ segmentation
Observed support: p11
Support expected under independence: p1• p•1 = (p10 + p11)(p01 + p11)
θ = p11 / ((p10 + p11)(p01 + p11))
Compute the global θ over all time
All coins must have this common value of θ
Segment as before
Compare with the unconstrained coding cost
Is all this really needed?
Simpler alternative:
  Aggregate data into suitable time windows
  Compute support, correlation, θ, etc. in each window
  Use a variance threshold to choose itemsets
Pitfalls:
  Choices: windows, thresholds
  May miss fine detail
  Over-sensitive to outliers
Experiments
Millions of baskets over several years
Two algorithms:
  Complete MDL approach
  MDL segmentation + statistical tests (MStat)
Data set:
  2.8 million transactions
  7 years, 1987 to 1993
  15,800 items
  Average 2.62 items per basket

Little agreement in itemset ranks
[Scatter plots: Rank(MDL) against Rank(Stat, 4 week) and against Rank(MStat), both axes 0 to 1600]
Simpler methods do not approximate MDL

MDL has high selectivity
[Histograms of Freq against Score: MDL scores roughly -2000 to 6000, MStat scores roughly 0 to 15]
Scores of the best itemsets stand out from the rest using MDL
Three anecdotes
[Three plots of support against time]
Polo shirt & shorts: high MStat score, small marginals
Bedsheets & pillow cases: high MDL score, high correlation, small % variation
Men's & women's shorts: significant gradual drift
Conclusion
New notion of surprising patterns based on:
  Joint support expected from marginals
  Variation of joint support along time
Robust MDL formulation
Efficient algorithms:
  Near-optimal segmentation using shortest path
  Pruning criteria
Successful application to real data
References
R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94, 487-499, Santiago, Chile, 1994.
S. Chakrabarti, S. Sarawagi, and B. Dom. Mining surprising patterns using temporal description length. Proc. of the 24th Int'l Conference on Very Large Databases (VLDB), 1998.
J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD'00, 1-12, Dallas, TX, May 2000.
J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers. (Some of the slides in the talk are taken from this book.)
H. Toivonen. Sampling large databases for association rules. VLDB'96, 134-145, Bombay, India, Sept. 1996.