Transcript
Mining surprising patterns using temporal description length
Soumen Chakrabarti (IIT Bombay)
Sunita Sarawagi (IIT Bombay)
Byron Dom (IBM Almaden)
1
Market basket mining algorithms
Find prevalent rules that hold over large fractions of data
Useful for promotions and store arrangement
Intensively researched
[Cartoon, 1990: "Milk and cereal sell together!"]
2
Prevalent ≠ Interesting
Analysts already know about prevalent rules
Interesting rules are those that deviate from prior expectation
Mining's payoff is in finding surprising phenomena
[Cartoon: "Milk and cereal sell together!" again in 1995 and in 1998; Zzzz...]
3
What makes a rule surprising?
Does not match prior expectation
Correlation between milk and cereal remains roughly constant over time
Cannot be trivially derived from simpler rules
Milk 10%, cereal 10%; milk and cereal 10% … surprising
Eggs 10%; milk, cereal and eggs 0.1% … surprising!
Expected 1% in both cases (10% × 10%)
4
Two views on data mining
[Diagram, view 1: Data → Mining Program → Discovery]
[Diagram, view 2: Data and a Model of the Analyst's Knowledge of the Data → Mining Program → Discovery → Analyst]
5
Our contributions
A new notion of surprising patterns
Detect changes in correlation along time
Filter out steady, uninteresting correlations
Algorithms to mine for surprising patterns
Encode data into bit streams using two models
Surprise = difference in number of bits needed
Experimental results
Demonstrate superiority over prevalent patterns
6
A simpler problem: one item
Milk-buying habits modeled by a biased coin
Customer tosses this coin to decide whether to buy milk
Head or "1" denotes "basket contains milk"
Coin bias is Pr[milk]
Analyst wants to study Pr[milk] along time
A single coin with fixed bias is not interesting
Changes in bias are interesting (see the sketch below)
7
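A minimal sketch of this setup, assuming Python and made-up numbers: a customer whose bias Pr[milk] jumps halfway through the sequence, which is exactly the kind of change the analyst wants to detect.

    import random

    random.seed(0)
    # Hypothetical example: Pr[milk] shifts from 0.2 to 0.6 after 50 baskets
    biases = [0.2] * 50 + [0.6] * 50
    baskets = [1 if random.random() < b else 0 for b in biases]  # 1 = basket contains milk
    print(sum(baskets[:50]), sum(baskets[50:]))  # roughly 10 vs. 30 milk purchases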
The coin segmentation problem
Players A and B
A has a set of coins with different biases
A repeatedly picks an arbitrary coin and tosses it an arbitrary number of times
B observes the H/T sequence and guesses the transition points and biases
[Diagram: A picks a coin, tosses it, and returns it; B sees only the sequence 0 0 1 0 1 1 0 1 0 1 1 0 1 0 0 0 1]
8
How to explain the data
Given n head/tail observations
Can assume n different coins with bias 0 or 1
• Data fits perfectly (with probability one)
• Many coins needed
Or assume one coin
• May fit data poorly
"Best explanation" is a compromise, e.g. three coins:
0 0 1 0 | 1 1 0 1 0 1 1 | 0 1 0 0 0 1   (biases 1/4, 5/7, 1/3)
9
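To make the compromise concrete, here is a rough sketch in Python (the 5-bit cost per coin description is an arbitrary assumption, not the paper's exact model cost) comparing the total description length of the one-coin and coin-per-toss explanations of the 17 tosses above.

    import math

    def data_cost(ones, zeros, p):
        # Shannon cost, in bits, of encoding the tosses under a coin with bias p
        bits = 0.0
        if ones:
            bits -= ones * math.log2(p)
        if zeros:
            bits -= zeros * math.log2(1 - p)
        return bits

    tosses = [0,0,1,0,1,1,0,1,0,1,1,0,1,0,0,0,1]
    ones, zeros = sum(tosses), len(tosses) - sum(tosses)
    COIN_BITS = 5  # assumed cost of describing one coin's bias

    one_coin = COIN_BITS + data_cost(ones, zeros, ones / len(tosses))  # tiny model, imperfect fit
    per_toss = len(tosses) * COIN_BITS                                 # perfect fit, large model
    print(round(one_coin, 1), per_toss)  # about 22 bits vs. 85 bits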
Coding examples
Sequence of k zeroes
Naïve encoding takes k bits
Run length takes about log k bits
1000 bits, 10 randomly placed 1's, rest 0's
Posit a coin with bias 0.01
Data encoding cost is (Shannon's theorem):
−10 log 0.01 − 990 log 0.99 ≈ 81 bits ≪ 1000 bits
Note that 10 log 100 ≈ 66 bits
10
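A quick check of the arithmetic above, assuming Python and base-2 logarithms:

    import math

    ones, zeros, p = 10, 990, 0.01
    cost = -(ones * math.log2(p) + zeros * math.log2(1 - p))
    print(round(cost, 1))                  # about 80.8 bits, far below the naive 1000
    print(round(-ones * math.log2(p), 1))  # dominant term: 10 * log2(100), about 66.4 bits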
How to find optimal segments
Sequence of 17 tosses:
0 0 1 0 1 1 0 1 0 1 1 0 1 0 0 0 1
Derived graph with 18 nodes; the optimal segmentation is a shortest path through it
Edge cost = model cost + data cost
Model cost = one node ID + one Pr[head]
Data cost for, e.g., Pr[head] = 5/7, 5 heads, 2 tails
11
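Here is a sketch of that computation as a dynamic program over the derived graph, assuming Python; the per-segment model cost ("one node ID + one Pr[head]") is approximated with log2(T+1) plus a few bits of assumed precision, which is an illustration rather than the paper's exact encoding.

    import math

    def data_cost(ones, zeros):
        # Shannon cost of one segment encoded with its empirical bias
        n, bits = ones + zeros, 0.0
        if ones:
            bits -= ones * math.log2(ones / n)
        if zeros:
            bits -= zeros * math.log2(zeros / n)
        return bits

    def segment(tosses, prec_bits=4):
        T = len(tosses)
        model_cost = math.log2(T + 1) + prec_bits  # one node ID + one Pr[head] (assumed precision)
        prefix = [0]
        for t in tosses:
            prefix.append(prefix[-1] + t)
        best = [math.inf] * (T + 1)  # best[j] = cheapest encoding of tosses[:j]
        back = [0] * (T + 1)
        best[0] = 0.0
        for j in range(1, T + 1):
            for i in range(j):  # edge (i, j) covers tosses i..j-1 with one coin
                ones = prefix[j] - prefix[i]
                cost = best[i] + model_cost + data_cost(ones, (j - i) - ones)
                if cost < best[j]:
                    best[j], back[j] = cost, i
        cuts, j = [], T
        while j > 0:  # walk back along the shortest path to recover segment boundaries
            cuts.append(j)
            j = back[j]
        return best[T], sorted(cuts)

    tosses = [0,0,1,0,1,1,0,1,0,1,1,0,1,0,0,0,1]
    print(segment(tosses))  # total bits and the chosen cut points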
Approximate shortest path
Suppose there are T tosses
Make T^(1−ε) chunks, each with T^ε nodes (tune ε)
Find shortest paths within chunks
Some nodes are chosen in each chunk
Solve a shortest path over all chosen nodes
12
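A self-contained sketch of the chunked approximation, again assuming Python; the model cost per segment is an assumed constant here, and the code is only meant to mirror the idea of solving shortest paths inside T^ε-node chunks and then one more shortest path over the chosen nodes.

    import math

    def seg_cost(ones, zeros):
        n, bits = ones + zeros, 0.0
        if ones:
            bits -= ones * math.log2(ones / n)
        if zeros:
            bits -= zeros * math.log2(zeros / n)
        return bits

    def shortest_path(tosses, nodes, model_cost):
        # exact DP restricted to the given candidate cut points ("nodes", sorted, incl. 0 and len)
        prefix = [0]
        for t in tosses:
            prefix.append(prefix[-1] + t)
        best, back = {nodes[0]: 0.0}, {}
        for j in nodes[1:]:
            best[j] = math.inf
            for i in nodes:
                if i >= j:
                    break
                ones = prefix[j] - prefix[i]
                c = best[i] + model_cost + seg_cost(ones, (j - i) - ones)
                if c < best[j]:
                    best[j], back[j] = c, i
        cuts, j = [], nodes[-1]
        while j in back:
            cuts.append(j)
            j = back[j]
        return best[nodes[-1]], sorted(cuts)

    def approx_segment(tosses, eps=0.5, model_cost=8.0):  # model_cost: assumed bits per segment
        T = len(tosses)
        chunk = max(2, int(round(T ** eps)))  # about T**eps nodes per chunk, T**(1-eps) chunks
        chosen = {0, T}
        for s in range(0, T, chunk):
            e = min(T, s + chunk)
            _, cuts = shortest_path(tosses[s:e], list(range(e - s + 1)), model_cost)
            chosen.update(s + c for c in cuts)  # nodes picked inside this chunk
        return shortest_path(tosses, sorted(chosen), model_cost)  # final path over chosen nodes

    tosses = [0,0,1,0,1,1,0,1,0,1,1,0,1,0,0,0,1]
    print(approx_segment(tosses))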
Two or more items
"Unconstrained" segmentation
k items induce a 2^k-sided coin (for k = 2 the faces are 00, 01, 10, 11)
"milk and cereal" = 11, "milk, not cereal" = 10, "neither" = 00, etc.
Shortest path finds significant shifts in any of the coin face probabilities
Problem: some of these shifts may be completely explained by lower-order marginals
13
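A small sketch of the face encoding, assuming Python and hypothetical item names:

    items = ["milk", "cereal"]  # the k = 2 items being tracked (illustrative choice)

    def face(basket):
        # "11" = both, "10" = milk but not cereal, "00" = neither, etc.
        return "".join("1" if it in basket else "0" for it in items)

    baskets = [{"milk", "cereal"}, {"milk", "bread"}, {"eggs"}, {"cereal"}]
    print([face(b) for b in baskets])  # ['11', '10', '00', '01']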
Example
Drop in joint sale of milk and cereal is completely explained by drop in sale of milk
Pr[milk & cereal] / (Pr[milk] Pr[cereal]) remains constant over time
Call this ratio θ
[Plot: support of Milk, Cereal, and Both against time; θ = 2 throughout]
14
Constant-θ segmentation
θ = p11 / ((p10 + p11)(p01 + p11))
Observed support p11 in the numerator; the denominator is the support expected under independence
Compute global θ over all time
All coins must have this common value of θ
Segment by constrained optimization
Compare with unconstrained coding cost
15
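A sketch of computing θ for one segment from its face counts, assuming Python and following the formula above:

    def theta(counts):
        # counts: number of baskets with each face in the segment, keys '00', '01', '10', '11'
        n = sum(counts.values())
        p11 = counts["11"] / n
        p1_ = (counts["10"] + counts["11"]) / n  # Pr[milk]
        p_1 = (counts["01"] + counts["11"]) / n  # Pr[cereal]
        return p11 / (p1_ * p_1)  # observed support over the support expected under independence

    # 10% milk, 10% cereal, always bought together: theta = 0.1 / (0.1 * 0.1) = 10
    print(theta({"11": 100, "10": 0, "01": 0, "00": 900}))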
Is all this really needed?
Simpler alternative
Aggregate data into suitable time windows
Compute support, correlation, θ, etc. in each window
Use a variance threshold to choose itemsets (see the sketch below)
Pitfalls
Choices: windows, thresholds
May miss fine detail
Over-sensitive to outliers
16
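For contrast, a minimal sketch of this windowed baseline, assuming Python and reusing the face encoding from earlier; the window size and variance threshold are exactly the arbitrary choices listed under pitfalls.

    import statistics

    def window_thetas(faces, window=4):
        # faces: time-ordered list of '00'/'01'/'10'/'11' outcomes, one per basket
        out = []
        for start in range(0, len(faces), window):
            w = faces[start:start + window]
            n = len(w)
            p11 = w.count("11") / n
            p1_ = (w.count("10") + w.count("11")) / n
            p_1 = (w.count("01") + w.count("11")) / n
            out.append(p11 / (p1_ * p_1) if p1_ and p_1 else float("nan"))
        return out

    faces = ["11", "10", "00", "01"] * 8  # a steady stream: the correlation never changes
    print(statistics.pvariance(window_thetas(faces, window=8)))  # 0.0, so this itemset is not flagged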
… but no simpler
"Smoothing leads to an estimated trend that is descriptive rather than analytic or explanatory. Because it is not based on an explicit probabilistic model, the method cannot be treated rigorously in terms of mathematical statistics."
T. W. Anderson, The Statistical Analysis of Time Series
17
Experiments
2.8 million baskets over 7 years, 1987-93
15800 items, average 2.62 items per basket
Two algorithms
Complete MDL approach
MDL segmentation + statistical tests (MStat)
Anecdotes
MDL effective at penalizing obvious itemsets
18
Quality of approximation
[Plots against ε (0.3 to 0.9): approximation quality (Approx/OPT − 1, roughly 0 to 0.002) and running time in seconds (roughly 0 to 1800)]
19
Little agreement in itemset ranks
[Scatter plots of Rank(MDL) against Rank(Stat, 4 week) and against Rank(MStat), ranks 0 to 1600]
Simpler methods do not approximate MDL
20
MDL has high selectivity
[Histograms: frequency of itemset scores under MDL (scores roughly −2000 to 6000) and under MStat (scores roughly 0 to 15)]
Scores of the best itemsets stand out from the rest under MDL
21
Three anecdotes
[Three plots of support against time:]
Polo shirt & shorts: high MStat score, small marginals
Bedsheets & pillow cases: high correlation, small % variation
Men's & women's shorts: high MDL score, significant gradual drift
22
Conclusion
New notion of surprising patterns based on
Joint support expected from marginals
Variation of joint support along time
Robust MDL formulation
Efficient algorithms
Near-optimal segmentation using shortest path
Pruning criteria
Successful application to real data
23