Mining surprising patterns using temporal description length

Soumen Chakrabarti (IIT Bombay)
Sunita Sarawagi (IIT Bombay)
Byron Dom (IBM Almaden)

Market basket mining algorithms

- Find prevalent rules that hold over large fractions of data
- Useful for promotions and store arrangement
- Intensively researched

[Cartoon, 1990: a miner exclaims "Milk and cereal sell together!"]

Prevalent ≠ Interesting

- Analysts already know about prevalent rules
- Interesting rules are those that deviate from prior expectation
- Mining's payoff is in finding surprising phenomena

[Cartoon, 1995 and 1998: the same "Milk and cereal sell together!" discovery again and again; the analyst snores "Zzzz..."]

What makes a rule surprising?

- Does not match prior expectation, e.g. that the correlation between milk and cereal remains roughly constant over time
- Cannot be trivially derived from simpler rules (worked through in the sketch below):
  - Milk 10%, cereal 10%: expected joint support under independence is 1%, but milk and cereal sell together 10% of the time … surprising
  - Eggs 10%: taking the pair rule as given, the expected support of "milk, cereal and eggs" is 1%, but the observed support is 0.1% … surprising!

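To make the independence arithmetic concrete, here is a minimal Python sketch of the two checks above. The numbers are the slide's illustration, not real data:

```python
# Expected joint support under independence vs. observed support.
p_milk, p_cereal, p_eggs = 0.10, 0.10, 0.10

expected_pair = p_milk * p_cereal          # 0.01 if milk and cereal are independent
observed_pair = 0.10                       # 10x the expectation ... surprising

expected_triple = observed_pair * p_eggs   # 0.01, taking the pair rule as given
observed_triple = 0.001                    # 10x below expectation ... surprising!
```
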
Two views on data mining

[Diagram contrasting two flows:]
1. Data -> Mining Program -> Discovery
2. Data + Model of Analyst's Knowledge of the Data -> Mining Program -> Discovery, reported back to the Analyst

Our contributions

- A new notion of surprising patterns
  - Detect changes in correlation along time
  - Filter out steady, uninteresting correlations
- Algorithms to mine for surprising patterns
  - Encode data into bit streams using two models
  - Surprise = difference in number of bits needed
- Experimental results
  - Demonstrate superiority over prevalent patterns

A simpler problem: one item

- Milk-buying habits modeled by a biased coin
  - Customer tosses this coin to decide whether to buy milk
  - Head or "1" denotes "basket contains milk"
  - Coin bias is Pr[milk]
- Analyst wants to study Pr[milk] along time
  - A single coin with fixed bias is not interesting
  - Changes in bias are interesting

The coin segmentation problem

- Players A and B
- A has a set of coins with different biases
- A repeatedly picks an arbitrary coin and tosses it an arbitrary number of times
- B observes only the head/tail stream and guesses the transition points and biases (simulated in the sketch below)

[Diagram: A picks, tosses, and returns coins, producing 0 0 1 0 1 1 0 1 0 1 1 0 1 0 0 0 1; B sees the same sequence and must segment it]

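A minimal Python sketch of player A's process; `toss_sequence` is a name invented here, and the biases and lengths match the worked example on the next slide:

```python
import random

def toss_sequence(segments):
    """Simulate player A: for each (bias, length) pair, toss that coin
    'length' times; player B sees only the concatenated 0/1 stream."""
    return [int(random.random() < bias)
            for bias, length in segments
            for _ in range(length)]

observed = toss_sequence([(1/4, 4), (5/7, 7), (1/3, 6)])
```
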
How to explain the data

- Given n head/tail observations, we can:
  - Assume n different coins with bias 0 or 1: the data fits perfectly (with probability one), but many coins are needed
  - Or assume one coin: may fit the data poorly
- The "best explanation" is a compromise

0 0 1 0 1 1 0 1 0 1 1 0 1 0 0 0 1
(three segments with biases 1/4, 5/7, 1/3)

Coding examples

- Sequence of k zeroes
  - Naïve encoding takes k bits
  - Run-length encoding takes about log k bits
- 1000 bits, 10 randomly placed 1's, rest 0's
  - Posit a coin with bias 0.01
  - Data encoding cost (by Shannon's theorem) is
    10 × log(1/0.01) + 990 × log(1/0.99) ≈ 81 bits ≪ 1000 bits
  - Note that 10 × log 100 ≈ 66 bits

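A quick check of this arithmetic, assuming the standard Shannon code length of -log2(p) bits per outcome; `shannon_cost` is an illustrative helper, not from the paper:

```python
import math

def shannon_cost(heads, tails, bias):
    """Bits to encode a 0/1 sequence under a coin with the given bias."""
    return -(heads * math.log2(bias) + tails * math.log2(1 - bias))

print(shannon_cost(10, 990, 0.01))   # ~80.8 bits, far below the naive 1000
print(10 * math.log2(100))           # ~66.4 bits: the cost of the ten 1's alone
```
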
How to find optimal segments

Sequence of 17 tosses:
0 0 1 0 1 1 0 1 0 1 1 0 1 0 0 0 1

Derived graph with 18 nodes, one per toss boundary; an edge from node i to node j represents encoding tosses i through j-1 with a single coin:
- Edge cost = model cost + data cost
- Data cost, e.g. for a segment with Pr[head] = 5/7: the cost of encoding its 5 heads and 2 tails at that bias
- Model cost = one node ID + one Pr[head]
- The optimal segmentation is the shortest path through this graph (see the sketch below)

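The shortest path over this graph can be found by dynamic programming over toss boundaries. A sketch under simplifying assumptions: the paper's model cost derives from its MDL encoding, while `model_bits` here is just a tunable per-segment constant:

```python
import math

def seg_cost(heads, tails, model_bits):
    """Edge cost: model bits plus the Shannon data bits at the segment's
    empirical bias (e.g. Pr[head] = 5/7 for 5 heads, 2 tails)."""
    n = heads + tails
    data_bits = 0.0
    for k in (heads, tails):
        if k:
            data_bits -= k * math.log2(k / n)
    return model_bits + data_bits

def best_segmentation(tosses, model_bits=3.0):
    """Node i is the boundary before toss i; cost[i] is the cheapest
    encoding of tosses[:i]. One edge per candidate segment, O(T^2) total."""
    n = len(tosses)
    cost = [math.inf] * (n + 1)
    back = [0] * (n + 1)
    cost[0] = 0.0
    for j in range(1, n + 1):
        heads = 0
        for i in range(j - 1, -1, -1):                # segment tosses[i:j]
            heads += tosses[i]
            c = cost[i] + seg_cost(heads, (j - i) - heads, model_bits)
            if c < cost[j]:
                cost[j], back[j] = c, i
    cuts, j = [], n                                   # recover boundaries
    while j > 0:
        cuts.append(j)
        j = back[j]
    return cost[n], sorted(cuts)

tosses = [0,0,1,0,1,1,0,1,0,1,1,0,1,0,0,0,1]
print(best_segmentation(tosses))
```
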
Approximate shortest path

- Suppose there are T tosses
- Make T^(1-ε) chunks, each with T^ε nodes (tune ε)
- Find shortest paths within chunks
- Some nodes are chosen in each chunk
- Solve a shortest path problem over all chosen nodes (sketched below)

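A sketch of this two-level scheme, reusing `best_segmentation` and `seg_cost` from the previous sketch. The rule used here for choosing nodes within a chunk (keep the boundaries the exact DP picks) is an assumption, not necessarily the paper's construction:

```python
def approx_segmentation(tosses, eps=0.5, model_bits=3.0):
    """Segment each chunk of about T**eps tosses exactly, keep only the
    boundaries chosen there, then rerun the DP restricted to those nodes."""
    T = len(tosses)
    chunk = max(2, round(T ** eps))
    chosen = {0, T}
    for start in range(0, T, chunk):
        _, cuts = best_segmentation(tosses[start:start + chunk], model_bits)
        chosen.update(start + c for c in cuts)
    nodes = sorted(chosen)
    cost = {0: 0.0}
    for b in nodes[1:]:
        cost[b] = min(
            cost[a] + seg_cost(sum(tosses[a:b]),
                               (b - a) - sum(tosses[a:b]), model_bits)
            for a in nodes if a < b
        )
    return cost[T]
```
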
Two or more items

- "Unconstrained" segmentation: k items induce a 2^k-sided coin
  - "milk and cereal" = 11, "milk, not cereal" = 10, "neither" = 00, etc. (see the sketch below)
- Shortest path finds significant shifts in any of the coin-face probabilities
- Problem: some of these shifts may be completely explained by a lower-order marginal

[Diagram: the four faces 00, 01, 10, 11 for k = 2]

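Mapping a basket to one face of the 2^k-sided coin is a simple bit-vector encoding; `face` is an illustrative helper, not the paper's code:

```python
def face(basket, items=("milk", "cereal")):
    """Encode presence/absence of k items as one of 2**k coin faces."""
    code = 0
    for item in items:
        code = (code << 1) | (item in basket)
    return code

print(face({"milk", "cereal"}))  # 3, face "11"
print(face({"milk"}))            # 2, face "10" (milk, not cereal)
print(face(set()))               # 0, face "00" (neither)
```
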
Example

- A drop in joint sales of milk and cereal is completely explained by a drop in sales of milk
- Pr[milk & cereal] / (Pr[milk] × Pr[cereal]) remains constant over time
- Call this ratio θ

[Chart: support of Milk, Cereal, and Both against time (0 to 10), support axis 0 to 0.4; θ = 2 throughout]

Constant-θ segmentation

    θ = p11 / (p1• × p•1) = p11 / ((p10 + p11) × (p01 + p11))

Here p11 is the observed support of the pair, and the denominator, the product of the marginals, is the support expected under independence.

- Compute a global θ over all time (a one-line computation; see the sketch below)
- All coins must have this common value of θ
- Segment by constrained optimization
- Compare with the unconstrained coding cost

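In code, θ for one stretch of data is a ratio over the 2x2 face probabilities (p00 is implied); a minimal sketch:

```python
def theta(p11, p10, p01):
    """Observed joint support over the support expected if the two
    items were independent: theta = p11 / (p1. * p.1)."""
    return p11 / ((p10 + p11) * (p01 + p11))

# The earlier milk-and-cereal example: both marginals 10%, joint 10%.
print(theta(p11=0.10, p10=0.0, p01=0.0))  # 10.0, far from independence
```
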
Is all this really needed?

- Simpler alternative (sketched below)
  - Aggregate data into suitable time windows
  - Compute support, correlation, θ, etc. in each window
  - Use a variance threshold to choose itemsets
- Pitfalls
  - Arbitrary choices: window sizes, thresholds
  - May miss fine detail
  - Over-sensitive to outliers

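A sketch of that simpler baseline, assuming per-period support series for the itemset and its two marginals; `windowed_theta_variance` is a name invented here:

```python
import statistics

def windowed_theta_variance(joint, m1, m2, window=4):
    """Average supports within fixed windows, compute theta per window,
    and report its variance; a threshold on this variance would then
    pick 'interesting' itemsets. The window size and threshold are the
    arbitrary choices the slide warns about."""
    thetas = []
    for i in range(0, len(joint), window):
        n = len(joint[i:i + window])
        j = sum(joint[i:i + window]) / n
        a = sum(m1[i:i + window]) / n
        b = sum(m2[i:i + window]) / n
        thetas.append(j / (a * b))          # assumes nonzero marginals
    return statistics.pvariance(thetas)
```
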
… but no simpler

"Smoothing leads to an estimated trend that is descriptive rather than analytic or explanatory. Because it is not based on an explicit probabilistic model, the method cannot be treated rigorously in terms of mathematical statistics."

T. W. Anderson, The Statistical Analysis of Time Series

Experiments

- 2.8 million baskets over 7 years, 1987-93
- 15,800 items, average 2.62 items per basket
- Two algorithms
  - Complete MDL approach
  - MDL segmentation + statistical tests (MStat)
- Anecdotes
  - MDL is effective at penalizing obvious itemsets

Quality of approximation

[Two charts against Epsilon (0.3 to 0.9): approximation error Approx/OPT-1 (axis 0 to 0.002) and running time in seconds (axis 0 to 1800)]

Little agreement in itemset ranks

[Scatter plots of Rank(MDL) against Rank(Stat, 4 week) and against Rank(MStat); all rank axes run 0 to 1600]

Simpler methods do not approximate MDL.

MDL has high selectivity

[Score histograms: MDL scores spread from about -2000 to 6000, while MStat scores run only from 0 to 15]

The scores of the best itemsets stand out from the rest under MDL.

Three anecdotes

- High MStat score: small marginals (polo shirt & shorts)
- High correlation but small % variation (bedsheets & pillow cases)
- High MDL score: significant gradual drift (men's & women's shorts)

[Three charts plotting θ against time, one per itemset]

Conclusion

- New notion of surprising patterns based on
  - Joint support expected from marginals
  - Variation of joint support along time
- Robust MDL formulation
- Efficient algorithms
  - Near-optimal segmentation using shortest path
  - Pruning criteria
- Successful application to real data