Finding Relevant Pat..

Download Report

Transcript Finding Relevant Pat..

Finding Relevant Patterns in
Bursty Sequences
Alexander Lachmann
and
Mirek Riedewald
(VLDB 2008)
Introduction
• Finding frequent subsequences
– For set of input sequences, find all subsequences
with support  minSupport
• Many applications
–
–
–
–
Purchase patterns in sales transactions
Patterns of page visits in Web logs
Patterns of hardware and software faults
Patterns in sensor readings, e.g., natural and
industrial process monitoring, supply chain (RFID)
– Patterns in DNA data
– Medical treatment sequence analysis
2009/1/13
2
Sequence Mining Challenges
• Exponential number of subsequences
– abc contains a, ab, ac, bc, etc.
• Monotonicity of support
– support(abc)  support(ab)  support(b)
– Exploited for pruning search space
– But: long frequent sequence implies many
frequent subsequences
2009/1/13
3
Outline
• Properties of bursty sequences
2009/1/13
4
Event Bursts
Timeout event
Component error event
2009/1/13
a
b
a
b
a
b
a
b
5
Event Bursts
Timeout event
Component error event
a
b
a
b
a
b
a
b
Subsequences: babab
2009/1/13
6
Event Bursts
Timeout event
Component error event
a
b
a
b
a
b
a
b
Subsequences: babab, baaba
2009/1/13
7
Event Bursts
Timeout event
Component error event
a
b
a
b
a
b
a
b
Subsequences: babab, baaba, abba, abab, bbb, aaaa, baaaa
baab, and so on
2009/1/13
8
High Support, Many Results
Lemma: Let A=a1…an be a sequence of length n
where the aj are i.i.d. and each aj is either i1 or i2 with
probability 0.5. Then any given sequence of m items,
such that mn and each item is either i1 or i2, is a
subsequence of A with probability at least
1 0.5

n / m  m
• n=100, m=10: 0.99; n=100, m=20: 0.53; n=20, m=5: 0.72
• Bound not tight
• Generalizable to skewed distributions, more items
Monotonicity of support: many frequent sequences that are
just variations of burst events
2009/1/13
9
Proof Idea
Sequence A (n items)
n/m
n/m
n/m
1-0.5n/m
1 0.5
2009/1/13
m segments, each with n/m items
Prob. that segment contains at least
one instance of i1 (same for i2)

n / m  m
Prob. that each segment contains the
right item
10
Solutions?
Bursts of common events = irrelevant results at
high cost
• Solution 1: higher minSupport? No.
– Potential loss of relevant patterns
– Irrelevant patterns have high support
• Solution 2: remove common events? No.
– Occurrence of common events important
• Burst of timeouts can signal mechanical problem
• Better: capture notion of burst directly
– Replace aaa by “many ‘a’ events occurred”
– Replace a’s and b’s by “bursts of ‘a’ and ‘b’ occurred
simultaneously”
2009/1/13
11
Outline
• Properties of bursty sequences
• Input transformation
– From bursts to intervals
– From intervals back to sequences
2009/1/13
12
Bursts to Intervals
a
a
a
b
b
a
b
c
d
1
2
3
4
5
6
7
8
9
10
Input sequence S ={(a,1),(b,2),(c,3),({a,d},4),(b,5),({a,b},8),(a,10)}
S(a)={(a,1),(a,4),(a,8),(a,10)} S(b)={(b,2),(b,5),(b,8)}
2009/1/13
S(c)={(c,3)} S(d)={(d,4)}
13
Bursts to Intervals
a
a
b
c
d
1
2
3
Intervals (S(a))={(a,1,4),(a,8,10)}
Intervals (S(b))={(b,2,8)}
Intervals (S(c))={(c,3,3)}
Intervals (S(d))={(d,4,4)}
Intervals (S)={(a,1,4),(a,8,10),(b,2,8),(c,3,3),(d,4,4)}
Merge threshold 
4
5
6
7
8
9
10
Problem: Not a sequence any more, cannot use traditional mining algorithms
2009/1/13
14
Intervals Back to Sequences
a
a
b
c
d
transf (S)=
1
2{a, b, c}
3 {a,4b, d} 5
6
7 {a, 8
b}
9
10
-Create only locally maximal transactions
T(S)={1,2,3,4,8,10}
-Transaction encodes notion of concurrent events
At time 1, start(1)={(a,1,4)},but end(1)=  and start (2)   ,no transaction.
Similarly, at time 2.
Use traditional sequence mining algorithms on transformed data.
At time 3, start(3)={(c,3,3)} and end(3)={(c,3,3)}, add transaction({a,b,c},3)
Similarly, at time 4, add transaction({a,b,d},4) and at time 8, add
2009/1/13
transaction({a,b},8)
15
Outline
• Properties of bursty sequences
• Input transformation
• Transformation Properties
– Length reduction
– Structure preservation
2009/1/13
16
Length Reduction
• Theorem 1: Transformed sequence has at most
as many transactions as original sequence
– Proof idea: transactions only created for times when
some interval starts, intervals only start where
transactions were in the original sequence
– Note: new transactions can have more items
• Significant reduction in length and i-length in
practice
2009/1/13
17
Structure Preservation
• Notation: also use  for set step
– {a, b, c} equivalent to abc
• Theorem 2: Any subsequence A=a1 ~1 a2 ~2…~k-1 ak,
~i being  or , in an original sequence is preserved in
the transformed sequence as A’=a1 ~’1 a2 ~’2…~’k-1 ak,
~’i being  or , such that ~’i is  only if ~i is .
• Intuition: some sequence steps become set steps
– {a, b}c preserved as {a, b, c}, but not abc or c{a, b}
– baacb preserved as {a, b, c}
2009/1/13
18
Impact on Mining Results
• Theorem 2 not strong enough to guarantee
some A’ is frequent for frequent A
– ab in original data preserved as ab or ab in
transformed data
– Fragmentation of support possible
• Solution 1: Lower support threshold
– Use minSupport / 2n to preserve all sequences with
up to n sequence steps
– Guarantees some A’ will be frequent
– Huge increase in mining cost and result size
2009/1/13
20
Impact on Mining Results (cont.)
• Solution 2: Modify support definition
– Transaction with k items supports all
sequences over these k items
– E.g., {a, b} supports ab and ba
– Guarantees some A’ will be frequent
– Huge increase in mining cost and result size
possible
• Solution 3: Do nothing
– Our choice
2009/1/13
21
Why Do Nothing?
• Experiments show that fragmentation of support
does not a significant problem.
• Experiments show that lowering the support
threshold leads to a large increase in total
runtime and in the number of frequent patterns
found.
2009/1/13
22
Outline
•
•
•
•
Properties of bursty sequences
Input transformation
Transformation Properties
Algorithm
2009/1/13
23
Basic Algorithm
1. Scan input sequences to create intervals
2. Scan intervals to generate locally maximal
transactions
3. Apply frequent sequence mining algorithm
•
Cost of mining dominates transformation cost
–
2009/1/13
Can try various transformations to explore size
reduction before mining starts
24
Redundant Results
a
a
b
c
d
{a, b, c} {a, b, d}
{a, b}
{a, b, c}{a, b, d} supports {a, b}b
- Both b’s originate from same burst interval, hence redundant
{a, b, c}{a, b, d} also supports {a, b}{b, d}
- Not redundant; difference: d not in bursts of first transaction
2009/1/13
25
Example
a1
a2
b1
b
c1
d1
{a1, b1, c1} {a1, b1, d1}
{a2, b1}
Witness check :
1. Witness for {a, b} for {a, b}b
1. Try {a1, b1, c1}: ok
2. Witness for b:
1. Try {a1, b1, d1}: reject, b1 is in previous witness {a1, b1, c1}
2. Try {a2, b1}: reject, b1 is in previous witness {a1, b1, c1}
2009/1/13
27
Another Example
a1
a2
b1
b
c1
d1
{a1, b1, c1} {a1, b1, d1}
{a2, b1}
Witness check for {a, b}{b, d}:
1. Witness for {a, b}
1. Try {a1, b1, c1}: ok
2. Witness for {b, d}:
1. Try {a1, b1, d1}: ok, because {b1, d1} is not in
previous witness {a1, b1, c1}
2009/1/13
28
Extended Algorithm
• Alternative: post-process result
– For each result sequence P, compute nonredundant support
– Remove results with non-redundant support
below minSupport threshold
2009/1/13
29
Outline
•
•
•
•
•
Properties of bursty sequences
Input transformation
Transformation Properties
Algorithm
Experiments
2009/1/13
30
Synthetic Data
• Synthetic data: Poisson process with
parameters H, L, HL, LL
– H, L: mean interarrival time for high and low
frequency periods
– HL, LL: mean duration of high and low
frequency periods
– Sequence length = 200
– #items = 50 (dense data) or 100 (sparse data)
2009/1/13
32
Dense, Synthetic
Defaults:
H=2, L=40,
HL=100, LL=500
minSupport=0.17
merge threshold=5
2009/1/13
33
Preserving Relevant Patterns
• Initial evaluation by domain expert positive
• Objective measure??
– Preservation of non-redundant structure
• Map a1 ~1 a2 ~2…~k-1 ak to {a1, a2,…, ak}
• Measure precision and recall of mapped patterns
• Results
– Different real datasets, merge thresholds 1 min to 1
day, various minSupport
• 89%-97% recall
• Virtually 100% precision
2009/1/13
36
Conclusions
• New transformation for mining bursty sequences
• Proved important properties
• Showed reduction in result size and mining cost,
while keeping important structure
• Efficiency??
• Mining intervals ????
• Recall???
2009/1/13
37