Mining time-delayed associations from event datasets

Download Report

Transcript Mining time-delayed associations from event datasets

Mining time-delayed
associations from event
datasets
Loo, Kin Kong
13 July, 2006
Contents






Motivation
Definition of time-delayed association
A simple algorithm
Improving the simple algorithm
Experiments
Conclusion
The “Butterfly Effect”...

“A butterfly's wings might create tiny changes in
the atmosphere that ultimately cause a tornado
to appear” - from Wikipedia
......
http://www.clipartxp.com/Butterflies/04_Butterfly.jpg
http://www.wrh.noaa.gov/hnx/newslet/summer05/xwind.jpg
Road network

Network of roads and roundabouts
City Centre
2
A
4
E
3
4
B
D
5
C
4
Road network
Alerts issued by a network monitoring system

City Centre
2
A
4
E
3
4
B
D
5
C
4
0
A
A
A
1
2
3
4
B
B
5
6
B
7
8
C
A
D
B
9 10 11 12 13 14 15
Time
Episode



Proposed by Mannila et al in [Mannila95]
“An episode is a partially ordered collection of
events occurring together”
Episodes can be described as directed acyclic
graphs
X
Y
X
Y
X
Y
Z
Minimal Occurrences

A time interval [ts, te) is a minimal occurrence of
an episode  if


 exists in [ts, te)
 does not occur in any proper sub-interval [ts', te') of
[ts, te)
0
A
A
A
1
2
3
4
B
B
5
6
B
7
8
C
A
D
B
9 10 11 12 13 14 15
Time
Time-delayed associations

A time-delayed association r between two event
[u, v]
types is in the form I
J,



I and J are two event types
0<u v
Informally interpreted as “if an event of type I
occurs at time t, then it is likely that an event of
type J occurs within the interval [t+u, t+v]
Definition

Let D be the event dataset

A1, ..., Am are event attributes with domains D1, ..., Dm
An event e is in the form (A1, ..., Am, t), where t is the
time that e occurred
OR
e is in the form (E, t), where E is an event type



Notations:
Ee = the event type of the event
te = the time when e occurred
Let E be the set of all possible event types
Definition (Cont’d)


A time-delayed association relates two event
types, in the form I [u, v] J such that 0 < u  v
An event i is a match to I [u, v] J if






Ei = I
 j | Ej = J and ti + u  tj  ti + v
The event j here is a consequence to which i
corresponds
Support = (number of distinct matches) / |D|
Confidence = (number of distinct matches) /
(number of occurrences of I)
Length = 2
Complex association



Treating a time-delayed association as a complex
event type, it can be extended and associate to
another event type
Let R be the complex event type representing
[u, v]
I
J
A time-delayed association between R and K is in
the form R [u, v] K


K is an “ordinary” event type
u and v are same as that in I
[u, v]
J
Complex association (Cont’d)

An event i is a match to R


[u, v]
K if
i is a match of I
J
 j such that j is a consequence of i and
 k | Ek = K and tj + u  tk  tj + v
[u, v]

Support = (number of distinct matches) / |D|
Confidence = supp(R [u, v] K ) / supp(I [u, v] J)

Length = 1 + length(I

[u, v]
J)
Problem statement

Given an event data set D, time constraints u, v
such that 0 < u  v, find all time-delayed
associations with support not smaller than s and
confidence not smaller than c.
MQ Mappings

Matches and their corresponding consequences
can be arranged in a table-like structure
0
A
A
A
1
2
3
A
B
B
4
5
6
[, 4]
B
B
7
8
C
D
A
B
B
B
9 10 11 12 13 14 15
[, 4]
B
C
m
q
m
q
1
5
6
10
2
5
8
10
2
6
3
5
3
6
12
13
12
14
12
15
Time
From MQ mappings to
MQ mapping
0
A
A
A
A
1
2
3
[, 4]
B
4
B
B
5
6
B
B
7
[, 4]
8
C
C
D
A
B
B
B
9 10 11 12 13 14 15
(A
[, 4]
B)
Time
[, 4]
m
q
m
q
m
q
1
5
6
10
2
10
2
5
8
10
3
10
3
5
2
6
3
6
12
13
12
14
12
15
C
Algorithm BRUTE-FORCE
Room for improvement



Every possible permutation is evaluated (until
one is found to be infrequent)!
Very large number of MQ mappings are
generated
 too much for main memory
 intermediate results may need to be swapped
to secondary storage
Hence, we need


Good pruning strategies
Good cache management
Pruning strategy

Apriori property does not hold in time-delayed
associations

“(A B) C is frequent” does not implies that “A C is
frequent”
Multiplicity of consequences

We define the multiplicity of a consequence as
the number of distinct matches corresponding to
the consequence
[, 4]
[, 4]
[, 4]
[, 4]
A
B
B
C
(A
B)
C
m
q
multi
m
q
m
q
1
5
3
6
10
2
10
2
5
8
10
4
10
3
5
2
6
3
6
12
13
1
12
14
1
12
15
1
2
0
A
A
A
1
2
3
4
B
B
5
6
B
7
8
C
D
A
B
B
B
9 10 11 12 13 14 15
Time
GlobalK


Sort all multiplicities in reverse order
Find the minimal k such that sum of top-k of the
multiplicities is not less than s  |D|
A
[, 4]
B
B
[, 4]
C
(A
[, 4]
B)
[, 4]
m
q
multi
m
q
m
q
1
5
3
6
10
2
10
2
5
8
10
4
10
3
5
2
6
3
6
2
s  |D| = 3
k =1
12
13
1
12
14
1
12
15
1
Infrequent
C
SectTop


GlobalK does not make use of temporal
information
We can divide the whole history into a number of
segments and keep information for each
segment

For each segment, a vector containing multiplicities of
consequences in inverse order is kept
SectTop (Cont’d)
0
A
[, 4]
A
A
A
1
2
3
B
m
q
multi
1
5
3
2
5
3
5
2
6
3
6
12
13
1
12
14
1
12
15
1
4
B
B
5
6
B
[, 4]
m
B
7
8
C
q
C
D
A
B
B
B
9 10 11 12 13 14 15
Time
s  |D| = 3
Cannot be frequent
2
6
10
8
10
Cache replacement strategies

Common cache replacement strategies:



FIFO – “oldest” data in cache are replaced
LRU – data that have not been referenced for the
longest time are replaced
LFU – data that are least frequently referenced are
replaced
Candidate generation
Experiment


Open and closing prices of 33 HSI constituents
for around 1400 transaction days (since Jan
2000) are obtained for experiments
The prices are transformed to events as follows:
Closing price
Open price
Type A

Open price
Open price
=
Closing price
Closing price
Type B
Type C
Hence, 33 data points are generated each day,
each data point in the form “(0008B, date)”
Experiment - Effectiveness of
pruning strategies

No. of candidate actually enumerated
s (%)
s (%)
Experiment - Effectiveness of
pruning strategies (Cont’d)

Case when [u, v] = [, 1] at high and low s
Experiment - Effectiveness of
pruning strategies (Cont’d)

Case when [u, v] = [, 2] at high and low s
Experiment - Cache replacement
strategy and candidate generation

LRU, [u, v] = [, 1] at high s (0.6%), with and
without using pruning strategy
Experiment - Cache replacement
strategy and candidate generation

LFU, [u, v] = [, 1] at high s (0.6%), with and
without using pruning strategy
Experiment - Cache replacement
strategy and candidate generation

LRU, [u, v] = [, 1] at low s (0.3%), with and without
using pruning strategy
Conclusion




Time-delayed association rules offer an alternative tool for
sequential analysis of event datasets
Although the Apriori property does not hold in the model
of time-delayed association, we can still derive good
pruning strategies to trim the search space for frequent
associations
Due to the large volume of intermediate data generated, it
is essential to have a rightly chosen cache replacement
strategy
The way candidates are generated can play a crucial role
in the performance of the algorithm
Reference

Mannila, H., Toivonen, H., and Verkamo, A.I. 1995. Discovering frequent episodes in
sequences. In Proceedings of the First International Conference on Knowledge Discovery
and Data Mining (KDD ’95). Montr´eal, Canada, pp. 210–215.
Thank you.