Mining time-delayed associations from event datasets
Download
Report
Transcript Mining time-delayed associations from event datasets
Mining time-delayed
associations from event
datasets
Loo, Kin Kong
13 July, 2006
Contents
Motivation
Definition of time-delayed association
A simple algorithm
Improving the simple algorithm
Experiments
Conclusion
The “Butterfly Effect”...
“A butterfly's wings might create tiny changes in
the atmosphere that ultimately cause a tornado
to appear” - from Wikipedia
......
http://www.clipartxp.com/Butterflies/04_Butterfly.jpg
http://www.wrh.noaa.gov/hnx/newslet/summer05/xwind.jpg
Road network
Network of roads and roundabouts
City Centre
2
A
4
E
3
4
B
D
5
C
4
Road network
Alerts issued by a network monitoring system
City Centre
2
A
4
E
3
4
B
D
5
C
4
0
A
A
A
1
2
3
4
B
B
5
6
B
7
8
C
A
D
B
9 10 11 12 13 14 15
Time
Episode
Proposed by Mannila et al in [Mannila95]
“An episode is a partially ordered collection of
events occurring together”
Episodes can be described as directed acyclic
graphs
X
Y
X
Y
X
Y
Z
Minimal Occurrences
A time interval [ts, te) is a minimal occurrence of
an episode if
exists in [ts, te)
does not occur in any proper sub-interval [ts', te') of
[ts, te)
0
A
A
A
1
2
3
4
B
B
5
6
B
7
8
C
A
D
B
9 10 11 12 13 14 15
Time
Time-delayed associations
A time-delayed association r between two event
[u, v]
types is in the form I
J,
I and J are two event types
0<u v
Informally interpreted as “if an event of type I
occurs at time t, then it is likely that an event of
type J occurs within the interval [t+u, t+v]
Definition
Let D be the event dataset
A1, ..., Am are event attributes with domains D1, ..., Dm
An event e is in the form (A1, ..., Am, t), where t is the
time that e occurred
OR
e is in the form (E, t), where E is an event type
Notations:
Ee = the event type of the event
te = the time when e occurred
Let E be the set of all possible event types
Definition (Cont’d)
A time-delayed association relates two event
types, in the form I [u, v] J such that 0 < u v
An event i is a match to I [u, v] J if
Ei = I
j | Ej = J and ti + u tj ti + v
The event j here is a consequence to which i
corresponds
Support = (number of distinct matches) / |D|
Confidence = (number of distinct matches) /
(number of occurrences of I)
Length = 2
Complex association
Treating a time-delayed association as a complex
event type, it can be extended and associate to
another event type
Let R be the complex event type representing
[u, v]
I
J
A time-delayed association between R and K is in
the form R [u, v] K
K is an “ordinary” event type
u and v are same as that in I
[u, v]
J
Complex association (Cont’d)
An event i is a match to R
[u, v]
K if
i is a match of I
J
j such that j is a consequence of i and
k | Ek = K and tj + u tk tj + v
[u, v]
Support = (number of distinct matches) / |D|
Confidence = supp(R [u, v] K ) / supp(I [u, v] J)
Length = 1 + length(I
[u, v]
J)
Problem statement
Given an event data set D, time constraints u, v
such that 0 < u v, find all time-delayed
associations with support not smaller than s and
confidence not smaller than c.
MQ Mappings
Matches and their corresponding consequences
can be arranged in a table-like structure
0
A
A
A
1
2
3
A
B
B
4
5
6
[, 4]
B
B
7
8
C
D
A
B
B
B
9 10 11 12 13 14 15
[, 4]
B
C
m
q
m
q
1
5
6
10
2
5
8
10
2
6
3
5
3
6
12
13
12
14
12
15
Time
From MQ mappings to
MQ mapping
0
A
A
A
A
1
2
3
[, 4]
B
4
B
B
5
6
B
B
7
[, 4]
8
C
C
D
A
B
B
B
9 10 11 12 13 14 15
(A
[, 4]
B)
Time
[, 4]
m
q
m
q
m
q
1
5
6
10
2
10
2
5
8
10
3
10
3
5
2
6
3
6
12
13
12
14
12
15
C
Algorithm BRUTE-FORCE
Room for improvement
Every possible permutation is evaluated (until
one is found to be infrequent)!
Very large number of MQ mappings are
generated
too much for main memory
intermediate results may need to be swapped
to secondary storage
Hence, we need
Good pruning strategies
Good cache management
Pruning strategy
Apriori property does not hold in time-delayed
associations
“(A B) C is frequent” does not implies that “A C is
frequent”
Multiplicity of consequences
We define the multiplicity of a consequence as
the number of distinct matches corresponding to
the consequence
[, 4]
[, 4]
[, 4]
[, 4]
A
B
B
C
(A
B)
C
m
q
multi
m
q
m
q
1
5
3
6
10
2
10
2
5
8
10
4
10
3
5
2
6
3
6
12
13
1
12
14
1
12
15
1
2
0
A
A
A
1
2
3
4
B
B
5
6
B
7
8
C
D
A
B
B
B
9 10 11 12 13 14 15
Time
GlobalK
Sort all multiplicities in reverse order
Find the minimal k such that sum of top-k of the
multiplicities is not less than s |D|
A
[, 4]
B
B
[, 4]
C
(A
[, 4]
B)
[, 4]
m
q
multi
m
q
m
q
1
5
3
6
10
2
10
2
5
8
10
4
10
3
5
2
6
3
6
2
s |D| = 3
k =1
12
13
1
12
14
1
12
15
1
Infrequent
C
SectTop
GlobalK does not make use of temporal
information
We can divide the whole history into a number of
segments and keep information for each
segment
For each segment, a vector containing multiplicities of
consequences in inverse order is kept
SectTop (Cont’d)
0
A
[, 4]
A
A
A
1
2
3
B
m
q
multi
1
5
3
2
5
3
5
2
6
3
6
12
13
1
12
14
1
12
15
1
4
B
B
5
6
B
[, 4]
m
B
7
8
C
q
C
D
A
B
B
B
9 10 11 12 13 14 15
Time
s |D| = 3
Cannot be frequent
2
6
10
8
10
Cache replacement strategies
Common cache replacement strategies:
FIFO – “oldest” data in cache are replaced
LRU – data that have not been referenced for the
longest time are replaced
LFU – data that are least frequently referenced are
replaced
Candidate generation
Experiment
Open and closing prices of 33 HSI constituents
for around 1400 transaction days (since Jan
2000) are obtained for experiments
The prices are transformed to events as follows:
Closing price
Open price
Type A
Open price
Open price
=
Closing price
Closing price
Type B
Type C
Hence, 33 data points are generated each day,
each data point in the form “(0008B, date)”
Experiment - Effectiveness of
pruning strategies
No. of candidate actually enumerated
s (%)
s (%)
Experiment - Effectiveness of
pruning strategies (Cont’d)
Case when [u, v] = [, 1] at high and low s
Experiment - Effectiveness of
pruning strategies (Cont’d)
Case when [u, v] = [, 2] at high and low s
Experiment - Cache replacement
strategy and candidate generation
LRU, [u, v] = [, 1] at high s (0.6%), with and
without using pruning strategy
Experiment - Cache replacement
strategy and candidate generation
LFU, [u, v] = [, 1] at high s (0.6%), with and
without using pruning strategy
Experiment - Cache replacement
strategy and candidate generation
LRU, [u, v] = [, 1] at low s (0.3%), with and without
using pruning strategy
Conclusion
Time-delayed association rules offer an alternative tool for
sequential analysis of event datasets
Although the Apriori property does not hold in the model
of time-delayed association, we can still derive good
pruning strategies to trim the search space for frequent
associations
Due to the large volume of intermediate data generated, it
is essential to have a rightly chosen cache replacement
strategy
The way candidates are generated can play a crucial role
in the performance of the algorithm
Reference
Mannila, H., Toivonen, H., and Verkamo, A.I. 1995. Discovering frequent episodes in
sequences. In Proceedings of the First International Conference on Knowledge Discovery
and Data Mining (KDD ’95). Montr´eal, Canada, pp. 210–215.
Thank you.