Mining Serial Episod..

Download Report

Transcript Mining Serial Episod..

Mining Serial Episode Rules with Time
Lags over Multiple Data Streams
Tung-Ying Lee, En Tzu Wang
Dept. of CS, National Tsing Hua Univ. (Taiwan)
Arbee L.P. Chen
Dept. of CS, National Chengchi Univ. (Taiwan)
DaWaK’08
Outline
 Introduction
 Related work
 Preliminaries
 Method
 Experiments
 Conclusions
Introduction
 In many applications, data are generated as a form of
continuous data streams.
 Continuously detecting flow and occupancy of a road to qualify
the congestion condition of a road forms data streams
 When roads A and B have heavy traffic, 5 mins later, road C will
most likely be congested
 Serial episode rules with time lags (SER) : XlagY
Related Work
 Finding episodes/episode rules from static time series data
has been studied for decades
 Episodes
Precursor
B
Successor
D
A
A
B
D
C
Episode
Serial episode
A
B
D
L
Serial episode rule
E
Preliminaries
 Environment: a centralized system collecting n synchronized data
streams DS1, DS2, …, DSn
 n-tuple event: a set of items coming from all streams at the same time
 itemset: a subset of an n-tuple event
 serial episode: described as an ordered list of itemsets
e.g. serial episode (aA)(bB)
Itemset {gA}
time:
1, 2, 3, 4, 5, 6, 7, 8
DS1: a, b, b, c, g, a, b, f
DS2: A, B, S, G, A, B, A, F
…
DSn: , , , , , , , 
n-tuple event
Preliminaries (cont.)
 Minimal occurrence: given a serial episode S, a time interval [a, b] is a
minimal occurrence of S, if
 S occurs in [a, b]
 S does not occur in any proper subintervals of [a, b]
 If (b-a+1)  T, a time bound given by users, [a, b] is valid
 MO(S): the set of all minimal occurrences of S
 Supp(S): the number of valid minimal occurrences of S
Time bound T: 3
DS1
DS2
Serial episodes Minimal Occurrences
Support
(a A)(b B)
[1, 2], [6, 7], [11, 12], [13, 14], [18, 19] 5
(g G)
[5, 5], [10, 10], [15, 15], [17, 17]
4
6
Preliminaries (cont.)
 A SER is R: S1Lag = LS2
 Supp(R): |{[a, b]|[a, b]MO(S1)[a, b]: valid  [c, d] MO(S2)[c,
d]: valid s.t. (c-a) = L}
 Conf(R) = Supp(R)/Supp(S1)
4
Time bound T: 3
DS1
DS2
Serial episode rules
(a A)(b B)→4 (g G)
7
Support,
Confidence
[1, 2]→[5, 5], [6, 7]→[10, 10], [11, Supp: 4,
12]→[15, 15], [13, 14]→[17, 17]
Conf: 4/5 = 0.8
Minimal Occurrences
Preliminaries (cont.)

Problem Formulation: given 4 parameters
 the maximum time lag (Lmax)
 the minimum support (minsup)
 the minimum confidence (minconf)
 the time bound (T)

Find all SERs e.g. R: S1Lag = LS2 satisfying
 L  Lmax
 Supp(R)  N  minsup, (N: the number of received n-tuple events)
 Conf(R)  minconf
 Calculating supports for serial episodes and SERs must take T into account
Time bound T: 3
DS1
DS2
Serial episode rules Minimal Occurrences
Support, Confidence
[1, 2]→[5, 5], [6, 7]→[10, 10], [11, Supp: 4, 4  (N=19)  0.2
(a A)(b B)→4 (g G)
12]→[15, 15], [13, 14]→[17, 17]
Conf: 4/5 = 0.8
8
Lmax Minsup
5
0.2
Minconf
0.8
T
3
Preliminaries (cont.)
 Using the prefix tree for keeping serial episodes
 S: a serial episode, X: an item
 S+X: X follows S
 S+_X: X and the last itemset in S appear at the same time
A
Serial episode (AB)
_B
Level 0

Root
B
B
Serial episode (A)(B)
9
Level 1
Level 2
LossyDL
 The concept of LossyDL: keeping the valid minimal occurrences of a
serial episode for generating rules
Each item in the current 2-tuple
event needs to be processed
(traversing in a bottom-up order)
The last two minimal occurrences
needs to be checked
Using Lossy Counting [VLDB02],
whenever N  0 mod 1/, the
oldest minimal occurrence is
removed
10
Processing C can
generate (B)(C): [2, 3]
and (BC): [3, 3]

At time point = 3, a 2-tupe
event (BC) arrives, T = 3
B
A
[1, 1]
B
[2, 2] [3, 3]
B
[1, 2] [1, 3]: not minimal
B
[1, 3]
[2, 3]
LossyDL (Rule Generation)
 Mining SERs
 For any two serial episode with supports  (minsup  )  N are
checked to see if any minimal occurrences of them can be combined.
Then, Supp(R) can be computed
 For each R: S1Lag = LS2, it will be returned if
 Supp(R)  (minsup  )  N, and
 (Supp(R) + N)/Supp(S1)  minconf
Serial episodes
(a A)(b B)
(g G)
Serial episode rules
11
(a A)(b B)→4 (g G)
Minimal Occurrences
[1, 2], [6, 7], [11, 12], [13, 14], [18, 19]
[5, 5], [10, 10], [15, 15], [17, 17]
Minimal Occurrences
[1, 2]→[5, 5], [6, 7]→[10, 10], [11,
12]→[15, 15], [13, 14]→[17, 17]
TLT
 A lot of minimal occurrences are kept in LossyDL, but only the last two are used
while updating
 Observations
 XL(AB) and XLA, obviously Supp(XLA)  Supp(XL(AB)): XL(AB) is
not significant if XLA does not satisfy one of minsup and minconf
 (AB)L(CD) and ALC, obviously Supp(ALC)  Supp((AB)L(CD)):
(AB)L(CD) is not significant if Supp(ALC) < Supp(AB)  minconf
12
TLT (cont.)
 Observations (cont.):
 Given a SER: (A)(B)5(CD), and T = 3
 A1B or A2B, that is ApB, 0<p< T (T1 types)
 A1B4(CD), A2B3(CD), that is ApBLp(CD)
 Supp(ApBLp(CD))  min(Supp(ApB), Supp(BLpC))
 (A)(B)5(CD) is not significant, if

pmin(Supp(ApB), Supp(BLpC)) < Supp(A)(B)  minconf
 Using the observations to prune insignificant rules
 Time lag table (TLT)
 ALB is a reduced SER, if A and B are single items
 For finding S1LmaxS2, the reduced SERs having a time lag at most Lmax+T1 (from
the first itemset of precursor to the last itemset of successor)
 Using Lmax+T1 Time Lag Tables to keep the supports of reduced SER
13
TLT (cont.)
 The support and the last two minimal occurrences of an serial episode are
kept in the prefix tree
 Keeping supports instead of keeping minimal occurrence lists
 Keeping the last two minimal occurrences for updating the supports
 Whenever N  0 mod 1/, all supports are decreased by 1
 In addition, the last Lmax+T1 n-tuple events are kept for updating the Time
Lag Tables
14
TLT (Rule Generation)
 Mining SERs
 Any two serial episode with supports  (minsup  )  N form the
candidate SERs
 A candidate SER will be returned if it can pass the pruning rules from the
above observations
15
Experiments
 real dataset
 PDOMEI: the dataset contains the dryness and climate indices
derived by experts, usually used to predict droughts
 Parameter setting
  = 0.1minsup
 Lmax = 10
16
17
Conclusions
 We address the problem of finding significant serial episode rules with time
lags over multiple data streams and propose two methods to solve it. TLT is
more space-efficient, but LossyDL has high precision
18