SDM 05 - 井手剛の研究紹介

Download Report

Transcript SDM 05 - 井手剛の研究紹介

Tokyo Research Laboratory
Knowledge Discovery from Heterogeneous
Dynamic Systems using Change-Point Correlations
IBM Research, Tokyo Research Lab
Tsuyoshi Ide
SDM 05 | 2005/04/21 |
© Copyright IBM Corporation 2005
Tokyo Research Laboratory
Data streams in the real world are full of intractable features
How can we discover knowledge on system’s mechanism?
 Example:
• time series data taken from various sensors in a car
heterogeneity
sudden &
unpredictable
changes
strong
correlation
Direct comparison between time-series leads to
meaningless results due to the heterogeneity
SDM 05 | 2005/04/21 |
© Copyright IBM Corporation 2005
Tokyo Research Laboratory
Tackling the heterogeneity using a nonlinear transformation
─ “Correlate by features, not by values”
original
recordings
apparently
different
changepoint score
synchronized
features
SDM 05 | 2005/04/21 |
potential
relationship
© Copyright IBM Corporation 2005
Tokyo Research Laboratory
Singular spectrum transformation (SST) ─ transforms an original time
series into a time-series of the change-point scores using SVD
reference interval
test interval
representative
patterns
H  st n ,...,st 2 , st 1 
xt 2 w1
u1 , u2 ,..
SVD
xt 1
Test matrix
(Test vectors)
1. For each of reference and test intervals, construct a
matrix using subsequences as column vectors
2. Perform SVD on those matrices to find
representative patterns
3. Compute dissimilarity between the patterns
SDM 05 | 2005/04/21 |
compute the dissimilarity
test
vec
tor
w

u2
u1
© Copyright IBM Corporation 2005
Tokyo Research Laboratory
Surprisingly, SST removes the heterogeneity with a single parameter set.
No ad hoc parameter tuning is needed.
changepoint score
raw data
w =15
(15 data points)
SST
• recordings taken from a car
• quite heterogeneous
SDM 05 | 2005/04/21 |
• comparable to each other
• existing clustering techniques are applicable
© Copyright IBM Corporation 2005
Tokyo Research Laboratory
SST is robust over a very wide range of parameters
Dependence on the pattern length, w
original data
change-point score
w =20, l =3
change-point score
for 5 < w < 38, l =3
change-point score
for 5 < w < 38, l =3
robust over a very wide range of w
SDM 05 | 2005/04/21 |
© Copyright IBM Corporation 2005
Tokyo Research Laboratory
Application to an automobile data set. SST effectively detects changepoints irrespective of the heterogeneous behavior
 Data
powertrain control module
 sampling rate = 0.1 sec

raw data
CP score
(w =25; 25 data points)
 Variables
x1: fuel flow rate
 x2: engaged gear
 x3: vehicle speed
 x4: engine RPM
 x5: manifold absolute pressure

SDM 05 | 2005/04/21 |
© Copyright IBM Corporation 2005
Tokyo Research Laboratory
Application to an automobile data set.
SST is useful for discovering hidden structures among variables
 x2 and x4 behave quite differently in the original data
 However, after the SST, they forms the nearest pair in
the MDS plot.

Multidimensional scaling
method for retrieving the coordinates from a distance matrix
 This result can be naturally understood by the
mechanism of the car
engaged gear
engine RPM
SDM 05 | 2005/04/21 |
© Copyright IBM Corporation 2005
Tokyo Research Laboratory
Backup
SDM 05 | 2005/04/21 |
© Copyright IBM Corporation 2005
Tokyo Research Laboratory
Looking for a change-point detection method applicable to heterogeneous
systems without any ad hoc parameter tuning
 Existing methods




CUSUM (cumulated sum)
• Suitable only for such a data that distributes around a constant.
Autoregressive models
• Cannot be used for streams with sudden & unpredictable changes.
Wavelet transforms
• Essentially the same as differentiation. Applicable to a very limited class of variables.
Otherwise, fine parameter tuning for individual variable is needed.
Gaussian mixture models
• Cannot be used for streams with sudden & unpredictable changes.
A new “nonparametric” approach
Singular Spectrum Transformation
Note: The original inspiration for this approach was based on an aspect of [Moskvina-Zhigljavsky 2003]
SDM 05 | 2005/04/21 |
© Copyright IBM Corporation 2005
Tokyo Research Laboratory
SST needs only two parameters in a standard setting:
the pattern length and the number of patterns
pattern length
w
w
w vectors
reference interval at t
w
test interval at t
w vectors
# of patterns
w
SDM 05 | 2005/04/21 |
t
l
© Copyright IBM Corporation 2005
Tokyo Research Laboratory
Computing the distances between variables and visualize them using
multidimensional scaling
 Normalization policies and distance metrics
original time series
normalization
 dt x (t )
i
distance
between
xi & xj
2
1
change-point score
 dt x (t )  1
i
standardize
2
dt
(
x

x
)
 i j
 dt | x
i
 xj |
* SST series are nonnegative by definition, so they are
naturally interpreted as the probability densities of changes

Multidimensional scaling


(b) SST
(a) raw
retrieves the coordinates from a distance matrix
eigenvalue analysis shows d =2 is sufficient
SDM 05 | 2005/04/21 |
standardize
SST
smoothing
smoothing
L2 distance
L1 distance
multidimensional
scaling
© Copyright IBM Corporation 2005
Tokyo Research Laboratory
Summary and future work
 Summary



We proposed a new framework of data mining for heterogeneous dynamic systems
The SST is a robust and effective way to remove the heterogeneity
An experiment demonstrated the utility of SST in discovering hidden structures
 Future work


Speed up and sophisticate SST
Develop methodologies for causality analysis
x
z
v
w
x
z
SDM 05 | 2005/04/21 |
car B
y
car A
y
w
z
v
© Copyright IBM Corporation 2005