Frequent itemset mining and
temporal extensions
Sunita Sarawagi
[email protected]
http://www.it.iitb.ac.in/~sunita
Association rules

 Given several sets of items, for example:
   • set of items purchased
   • set of pages visited on a website
   • set of doctors visited
 Find all rules that correlate the presence of one set of items with another
   • Rules are of the form X => Y where X and Y are sets of items
   • Eg: purchase of books A & B => purchase of C
Parameters: Support and Confidence

 All rules X => Z have two parameters:
   • Support: probability that a transaction contains both X and Z
   • Confidence: conditional probability that a transaction having X also contains Z
 Two parameters to association rule mining:
   • Minimum support s
   • Minimum confidence c

Transaction ID | Items Bought
2000           | A, B, C
1000           | A, C
4000           | A, D
5000           | B, E, F

With s = 50% and c = 50%:
 A => C (support 50%, confidence 66.6%)
 C => A (support 50%, confidence 100%)
Applications of fast itemset counting

 Cross-selling in retail, banking
 Catalog design and store layout
 Applications in medicine: find redundant tests
 Improve predictive capability of classifiers that assume attribute independence
 Improved clustering of categorical attributes
Finding association rules in large databases

 Number of transactions: in the millions
 Number of distinct items: tens of thousands
 Lots of work on scalable algorithms
 Typically two parts to the algorithm:
   1. Finding all frequent itemsets with support > S
   2. Finding rules with confidence greater than C
 Frequent itemset search is the more expensive part
 Algorithms: Apriori, FP-tree
The Apriori Algorithm

L1 = {frequent items of size one};
for (k = 1; Lk != ∅; k++)
   Ck+1 = candidates generated from Lk by
      • joining Lk with itself
      • pruning any (k+1)-itemset that has a k-subset not in Lk
   for each transaction t in the database do
      • increment the count of all candidates in Ck+1 that are contained in t
   Lk+1 = candidates in Ck+1 with min_support
return ∪k Lk;
How to Generate Candidates?

 Suppose the items in Lk-1 are listed in an order

 Step 1: self-joining Lk-1
   insert into Ck
   select p.item1, p.item2, …, p.itemk-1, q.itemk-1
   from Lk-1 p, Lk-1 q
   where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

 Step 2: pruning
   forall itemsets c in Ck do
      forall (k-1)-subsets s of c do
         if (s is not in Lk-1) then delete c from Ck
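
The two steps above translate directly into code. Below is a minimal Python sketch of the whole Apriori loop (my own illustration, not the slides' implementation): the join unions pairs of frequent k-itemsets, and the prune step checks that every k-subset of a candidate is itself frequent.

from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support count} for all frequent itemsets."""
    needed = min_support * len(transactions)
    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {s for s, c in counts.items() if c >= needed}
    frequent = {s: counts[s] for s in L}
    k = 1
    while L:
        # Join step: pairs of k-itemsets whose union has k+1 items
        candidates = {a | b for a in L for b in L if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be in L
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k))}
        # Counting pass: one scan over the transactions
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        L = {c for c, cnt in counts.items() if cnt >= needed}
        frequent.update({c: counts[c] for c in L})
        k += 1
    return frequent

A real implementation would store the candidates in a hash tree or prefix trie so the counting pass does not compare every candidate against every transaction.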
The Apriori Algorithm — Example

Database D (minimum support count = 2):

TID | Items
100 | 1 2 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5

Scan D → C1 with counts:
{1}: 2, {2}: 4, {3}: 2, {4}: 1, {5}: 3

L1 (drop {4}, which is below minimum support):
{1}: 2, {2}: 4, {3}: 2, {5}: 3

C2 (self-join of L1):
{1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D → C2 with counts:
{1 2}: 2, {1 3}: 1, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2

L2:
{1 2}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2

C3 (self-join of L2, after pruning): {2 3 5}

Scan D → L3:
{2 3 5}: 2
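
For reference, running the apriori sketch from a few slides back on this database (items written as integers) reproduces the L1, L2, and L3 tables:

db = [frozenset({1, 2, 4}), frozenset({2, 3, 5}),
      frozenset({1, 2, 3, 5}), frozenset({2, 5})]
result = apriori(db, 0.5)   # 50% of 4 transactions = support count 2
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
# [1] 2, [2] 4, [3] 2, [5] 3, [1, 2] 2, [2, 3] 2, [2, 5] 3, [3, 5] 2, [2, 3, 5] 2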
Improvements to Apriori

 Apriori with well-designed data structures works well in practice when frequent
  itemsets are not too long (the common case)
 Lots of enhancements proposed:
   • Sampling: count in two passes
   • Invert the database to be column-major instead of row-major and count by
     intersecting item columns
   • Count itemsets of multiple lengths in one pass
 Reducing passes is not that useful since I/O is not the bottleneck
 Main bottleneck: candidate generation and counting, which is not optimized for
  long itemsets
Mining Frequent Patterns Without Candidate Generation

 Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure
   • highly condensed, but complete for frequent pattern mining
 Develop an efficient, FP-tree-based frequent pattern mining method
   • a divide-and-conquer methodology: decompose mining tasks into smaller ones
   • avoid candidate generation
Construct FP-tree from Database
min_support = 0.5

TID | Items bought             | (ordered) frequent items
100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
200 | {a, b, c, f, l, m, o}    | {f, c, a, b, m}
300 | {b, f, h, j, o}          | {f, b}
400 | {b, c, k, s, p}          | {c, b, p}
500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}

 Scan DB once, find frequent 1-itemsets
 Order frequent items by decreasing frequency: f:4, c:4, a:3, b:3, m:3, p:3
 Scan DB again, construct the FP-tree by inserting each transaction's ordered
  frequent items as a path from the root

Resulting FP-tree (root {}):
{}
├─ f:4
│  ├─ c:3
│  │  └─ a:3
│  │     ├─ m:2
│  │     │  └─ p:2
│  │     └─ b:1
│  │        └─ m:1
│  └─ b:1
└─ c:1
   └─ b:1
      └─ p:1
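
A minimal Python sketch of the two scans just described (the FPNode class and header-table layout are my own illustration, not code from the slides):

from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}

def build_fptree(transactions, min_count):
    # Scan 1: count item frequencies and keep only the frequent items
    freq = defaultdict(int)
    for t in transactions:
        for item in t:
            freq[item] += 1
    freq = {i: c for i, c in freq.items() if c >= min_count}
    root, header = FPNode(None, None), defaultdict(list)
    # Scan 2: insert each transaction's frequent items, ordered by decreasing frequency
    # (ties broken alphabetically here; the slide happens to order f before c)
    for t in transactions:
        items = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])   # node-link for this item
            node = node.children[item]
            node.count += 1
    return root, header, freq

db = [set("facdgimp"), set("abcflmo"), set("bfhjo"), set("bcksp"), set("afcelpmn")]
root, header, freq = build_fptree(db, min_count=3)   # 0.5 support over 5 baskets
print(freq)   # {'f': 4, 'c': 4, 'a': 3, 'b': 3, 'm': 3, 'p': 3} (in some order)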
Step 1: FP-tree to Conditional Pattern Base

 Start at the frequent-item header table of the FP-tree
 Traverse the FP-tree by following the node-links of each frequent item
 Accumulate all transformed prefix paths of that item to form its conditional
  pattern base

Header table (item: frequency): f:4, c:4, a:3, b:3, m:3, p:3

Conditional pattern bases:

item | cond. pattern base
c    | f:3
a    | fc:3
b    | fca:1, f:1, c:1
m    | fca:2, fcab:1
p    | fcam:2, cb:1
Step 2: Construct Conditional FP-tree

 For each pattern base:
   • accumulate the count for each item in the base
   • construct the FP-tree over the frequent items of the pattern base

m-conditional pattern base: fca:2, fcab:1
m-conditional FP-tree: {} → f:3 → c:3 → a:3 (b is dropped since its count of 1 is
below the minimum support)

All frequent patterns concerning m:
m, fm, cm, am, fcm, fam, cam, fcam
Mining Frequent Patterns by Creating Conditional Pattern-Bases

Item | Conditional pattern-base | Conditional FP-tree
p    | {(fcam:2), (cb:1)}       | {(c:3)}|p
m    | {(fca:2), (fcab:1)}      | {(f:3, c:3, a:3)}|m
b    | {(fca:1), (f:1), (c:1)}  | Empty
a    | {(fc:3)}                 | {(f:3, c:3)}|a
c    | {(f:3)}                  | {(f:3)}|c
f    | Empty                    | Empty

Repeat this recursively for higher items…
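
The same divide-and-conquer recursion can be sketched without the tree data structure: for each frequent item, build its conditional pattern base as plain weighted transactions and recurse. This is my own tree-less illustration of the logic in the table above, not the FP-growth implementation itself:

from collections import defaultdict

def mine_conditional(weighted_db, min_count, suffix=()):
    """Yield (pattern, support count) from a list of (items, count) pairs."""
    freq = defaultdict(int)
    for items, cnt in weighted_db:
        for i in items:
            freq[i] += cnt
    # Process frequent items from least to most frequent, as in the table above
    order = sorted((i for i in freq if freq[i] >= min_count), key=lambda i: freq[i])
    for pos, item in enumerate(order):
        pattern = suffix + (item,)
        yield pattern, freq[item]
        higher = set(order[pos + 1:])
        # Conditional pattern base: transactions containing `item`, restricted
        # to the items that rank above it
        cond = [([i for i in items if i in higher], cnt)
                for items, cnt in weighted_db if item in items]
        yield from mine_conditional(cond, min_count, pattern)

db = [(set("facdgimp"), 1), (set("abcflmo"), 1), (set("bfhjo"), 1),
      (set("bcksp"), 1), (set("afcelpmn"), 1)]
print(sorted(mine_conditional(db, 3)))   # includes e.g. {f, c, a, m} with count 3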
FP-growth vs. Apriori: Scalability With the Support Threshold

[Chart on data set T25I20D10K: run time (sec., 0 to 100) versus support threshold
(%, 0 to 3), comparing D1 FP-growth runtime and D1 Apriori runtime.]
Criticism to Support and Confidence

 X and Y are positively correlated, X and Z are negatively related, yet the support
  and confidence of X=>Z dominate:

X | 1 1 1 1 0 0 0 0
Y | 1 1 0 0 0 0 0 0
Z | 0 1 1 1 1 1 1 1

Rule | Support | Confidence
X=>Y | 25%     | 50%
X=>Z | 37.50%  | 75%

 Need to measure departure from expected.
 For two items: compare the observed support with the independence estimate, i.e.
  look at the ratio P(A ∧ B) / (P(A) P(B))
 For k items, the expected support is derived from the supports of the (k-1)-itemsets
  using iterative scaling methods
Prevalent correlations are not interesting

 Analysts already know about prevalent rules
 Interesting rules are those that deviate from prior expectation
 Mining's payoff is in finding surprising phenomena

[Cartoon: the rule "bedsheets and pillow covers sell together!" is reported in 1995
and again, unchanged, in 1998, to a yawning analyst.]
What makes a rule surprising?

 Does not match prior expectation
   • Correlation between milk and cereal remains roughly constant over time
 Cannot be trivially derived from simpler rules
   • Milk 10%, cereal 10%
   • Milk and cereal 10% … surprising
   • Eggs 10%
   • Milk, cereal and eggs 0.1% … surprising! (expected 1%)
Finding surprising temporal patterns

 Algorithms to mine for surprising patterns
 Encode itemsets into bit streams using two models
   • Mopt: the optimal model that allows change along time
   • Mcons: the constrained model that does not allow change along time
 Surprise = difference in the number of bits between Mopt and Mcons
One item: optimal model

 Milk-buying habits modeled by a biased coin
 The customer tosses this coin to decide whether to buy milk
   • Head or "1" denotes "basket contains milk"
   • The coin's bias is Pr[milk]
 The analyst wants to study Pr[milk] along time
   • A single coin with fixed bias is not interesting
   • Changes in bias are interesting
The coin segmentation problem

 Players A and B
 A has a set of coins with different biases
 A repeatedly:
   • picks an arbitrary coin
   • tosses it an arbitrary number of times
 B observes the head/tail sequence
   • and guesses the transition points and biases

A: pick coin, toss, return; repeat
B observes: 0 0 1 0 1 1 0 1 0 1 1 0 1 0 0 0 1
How to explain the data

 Given n head/tail observations
   • Can assume n different coins, each with bias 0 or 1
      - data fits perfectly (with probability one)
      - many coins needed
   • Or assume one coin
      - may fit the data poorly
 The "best explanation" is a compromise:

0 0 1 0 | 1 1 0 1 0 1 1 | 0 1 0 0 0 1
  1/4         5/7            1/3
Coding examples

 Sequence of k zeroes
   • naïve encoding takes k bits
   • run-length encoding takes about log k bits
 1000 bits, 10 randomly placed 1's, the rest 0's
   • posit a coin with bias 0.01
   • data encoding cost is (Shannon's theorem):
       -10 log2(0.01) - 990 log2(0.99) ≈ 81 bits « 1000 bits
   • note that 10 log2(100) ≈ 66 bits
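
A quick arithmetic check of those figures in Python:

import math
print(-10 * math.log2(0.01) - 990 * math.log2(0.99))   # ≈ 80.8 bits, far below 1000
print(10 * math.log2(100))                              # ≈ 66.4 bits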
How to find optimal segments

Sequence of 17 tosses:
0 0 1 0 1 1 0 1 0 1 1 0 1 0 0 0 1

Build a derived graph with 18 nodes, one per position between tosses; a segmentation
is a path from the first node to the last, and the optimal segmentation is a shortest
path in this graph.
 Edge cost = model cost + data cost
 Model cost = one node ID + one Pr[head]
 Data cost of an edge = coding cost of the tosses it spans under that segment's bias,
  e.g. for Pr[head] = 5/7 with 5 heads and 2 tails
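
A minimal dynamic-programming sketch of that shortest-path search (my own illustration; the per-segment model cost is a simplified constant standing in for "one node ID + one Pr[head]"):

import math

def data_cost(heads, tails):
    """Shannon coding cost (bits) of a segment under its empirical bias."""
    n = heads + tails
    return sum(-k * math.log2(k / n) for k in (heads, tails) if k)

def segment(bits, model_cost=10.0):
    """Cheapest segmentation of a 0/1 sequence: shortest path over n+1 boundary nodes."""
    n = len(bits)
    best = [math.inf] * (n + 1)    # best[j] = cheapest encoding of bits[:j]
    back = [0] * (n + 1)
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(j):         # edge (i, j) covers bits[i:j]
            h = sum(bits[i:j])
            cost = best[i] + model_cost + data_cost(h, j - i - h)
            if cost < best[j]:
                best[j], back[j] = cost, i
    cuts, j = [], n                # recover segment boundaries from back pointers
    while j > 0:
        cuts.append(j)
        j = back[j]
    return sorted(cuts), best[n]

tosses = [0,0,1,0,1,1,0,1,0,1,1,0,1,0,0,0,1]
print(segment(tosses))             # chosen segment end positions and total bits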
Two or more items

 "Unconstrained" segmentation: k items induce a 2^k-sided coin
   • "milk and cereal" = 11, "milk, not cereal" = 10, "neither" = 00, etc.
 The shortest path finds significant shifts in any of the coin face probabilities
 Problem: some of these shifts may be completely explained by the marginals
Example

 A drop in the joint sale of milk and cereal is completely explained by a drop in the
  sale of milk
 Pr[milk & cereal] / (Pr[milk] · Pr[cereal]) remains constant over time
 Call this ratio θ

[Chart: support (0 to 0.4) of Milk, Cereal, and Both plotted against time (0 to 10),
with θ = 2 throughout.]
Constant-θ segmentation

θ = observed support / support expected under independence
  = p11 / (p1• · p•1)
  = p11 / ((p10 + p11)(p01 + p11))

 Compute a global θ over all time
 All coins must have this common value of θ
 Segment as before
 Compare with the unconstrained coding cost
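
A small sketch of how such a global θ could be computed from raw baskets (a hypothetical helper of my own, applying the ratio above to the pair "milk" and "cereal"):

def global_theta(baskets, x="milk", y="cereal"):
    """θ = P[x and y] / (P[x] · P[y]) over all baskets (sets of items)."""
    n = len(baskets)
    p_x  = sum(x in b for b in baskets) / n
    p_y  = sum(y in b for b in baskets) / n
    p_xy = sum(x in b and y in b for b in baskets) / n
    return p_xy / (p_x * p_y)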
Is all this really needed?

 Simpler alternative
   • aggregate data into suitable time windows
   • compute support, correlation, θ, etc. in each window
   • use a variance threshold to choose itemsets
 Pitfalls
   • choices: windows, thresholds
   • may miss fine detail
   • over-sensitive to outliers
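
The simpler windowed alternative is easy to sketch (my own illustration; the window size and variance threshold are exactly the arbitrary choices the slide warns about):

from statistics import pvariance

def windowed_support(baskets, itemset, window):
    """Support of `itemset` in consecutive windows of `window` baskets each."""
    itemset = set(itemset)
    return [sum(itemset <= b for b in chunk) / len(chunk)
            for chunk in (baskets[i:i + window] for i in range(0, len(baskets), window))
            if chunk]

def looks_interesting(baskets, itemset, window, var_threshold):
    """Flag itemsets whose per-window support varies more than the threshold."""
    return pvariance(windowed_support(baskets, itemset, window)) > var_threshold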
Experiments

 Millions of baskets over several years
 Two algorithms
   • complete MDL approach
   • MDL segmentation + statistical tests (MStat)
 Data set
   • 2.8 million transactions
   • 7 years, 1987 to 1993
   • 15,800 items
   • average of 2.62 items per basket
Little agreement in itemset ranks

[Scatter plots: Rank(MDL) versus Rank(Stat, 4 week) and Rank(MDL) versus Rank(MStat),
ranks 0 to 1600.]

Simpler methods do not approximate MDL.
MDL has high selectivity

[Histograms of itemset score frequencies: MDL scores range roughly from -2000 to 6000,
MStat scores from 0 to 20.]

Scores of the best itemsets stand out from the rest using MDL.
Three anecdotes (support plotted against time)

 High MStat score
   • small marginals
   • polo shirt & shorts
 High correlation
   • small % variation
   • bedsheets & pillow cases
 High MDL score
   • significant gradual drift
   • men's & women's shorts

[Three time-series plots of support for the three itemset pairs.]
Conclusion

 New notion of surprising patterns based on
   • joint support expected from the marginals
   • variation of joint support along time
 Robust MDL formulation
 Efficient algorithms
   • near-optimal segmentation using shortest path
   • pruning criteria
 Successful application to real data
References

 R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94,
  487-499, Santiago, Chile, 1994.
 S. Chakrabarti, S. Sarawagi, and B. Dom. Mining surprising patterns using temporal
  description length. Proc. of the 24th Int'l Conference on Very Large Databases
  (VLDB), 1998.
 J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation.
  SIGMOD'00, 1-12, Dallas, TX, May 2000.
 J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann
  Publishers. (Some of the slides in this talk are taken from this book.)
 H. Toivonen. Sampling large databases for association rules. VLDB'96, 134-145,
  Bombay, India, Sept. 1996.