SDM 05 - 井手剛の研究紹介
Download
Report
Transcript SDM 05 - 井手剛の研究紹介
Tokyo Research Laboratory
Knowledge Discovery from Heterogeneous
Dynamic Systems using Change-Point Correlations
IBM Research, Tokyo Research Lab
Tsuyoshi Ide
SDM 05 | 2005/04/21 |
© Copyright IBM Corporation 2005
Tokyo Research Laboratory
Data streams in the real world are full of intractable features
How can we discover knowledge on system’s mechanism?
Example:
• time series data taken from various sensors in a car
heterogeneity
sudden &
unpredictable
changes
strong
correlation
Direct comparison between time-series leads to
meaningless results due to the heterogeneity
SDM 05 | 2005/04/21 |
© Copyright IBM Corporation 2005
Tokyo Research Laboratory
Tackling the heterogeneity using a nonlinear transformation
─ “Correlate by features, not by values”
original
recordings
apparently
different
changepoint score
synchronized
features
SDM 05 | 2005/04/21 |
potential
relationship
© Copyright IBM Corporation 2005
Tokyo Research Laboratory
Singular spectrum transformation (SST) ─ transforms an original time
series into a time-series of the change-point scores using SVD
reference interval
test interval
representative
patterns
H st n ,...,st 2 , st 1
xt 2 w1
u1 , u2 ,..
SVD
xt 1
Test matrix
(Test vectors)
1. For each of reference and test intervals, construct a
matrix using subsequences as column vectors
2. Perform SVD on those matrices to find
representative patterns
3. Compute dissimilarity between the patterns
SDM 05 | 2005/04/21 |
compute the dissimilarity
test
vec
tor
w
u2
u1
© Copyright IBM Corporation 2005
Tokyo Research Laboratory
Surprisingly, SST removes the heterogeneity with a single parameter set.
No ad hoc parameter tuning is needed.
changepoint score
raw data
w =15
(15 data points)
SST
• recordings taken from a car
• quite heterogeneous
SDM 05 | 2005/04/21 |
• comparable to each other
• existing clustering techniques are applicable
© Copyright IBM Corporation 2005
Tokyo Research Laboratory
SST is robust over a very wide range of parameters
Dependence on the pattern length, w
original data
change-point score
w =20, l =3
change-point score
for 5 < w < 38, l =3
change-point score
for 5 < w < 38, l =3
robust over a very wide range of w
SDM 05 | 2005/04/21 |
© Copyright IBM Corporation 2005
Tokyo Research Laboratory
Application to an automobile data set. SST effectively detects changepoints irrespective of the heterogeneous behavior
Data
powertrain control module
sampling rate = 0.1 sec
raw data
CP score
(w =25; 25 data points)
Variables
x1: fuel flow rate
x2: engaged gear
x3: vehicle speed
x4: engine RPM
x5: manifold absolute pressure
SDM 05 | 2005/04/21 |
© Copyright IBM Corporation 2005
Tokyo Research Laboratory
Application to an automobile data set.
SST is useful for discovering hidden structures among variables
x2 and x4 behave quite differently in the original data
However, after the SST, they forms the nearest pair in
the MDS plot.
Multidimensional scaling
method for retrieving the coordinates from a distance matrix
This result can be naturally understood by the
mechanism of the car
engaged gear
engine RPM
SDM 05 | 2005/04/21 |
© Copyright IBM Corporation 2005
Tokyo Research Laboratory
Backup
SDM 05 | 2005/04/21 |
© Copyright IBM Corporation 2005
Tokyo Research Laboratory
Looking for a change-point detection method applicable to heterogeneous
systems without any ad hoc parameter tuning
Existing methods
CUSUM (cumulated sum)
• Suitable only for such a data that distributes around a constant.
Autoregressive models
• Cannot be used for streams with sudden & unpredictable changes.
Wavelet transforms
• Essentially the same as differentiation. Applicable to a very limited class of variables.
Otherwise, fine parameter tuning for individual variable is needed.
Gaussian mixture models
• Cannot be used for streams with sudden & unpredictable changes.
A new “nonparametric” approach
Singular Spectrum Transformation
Note: The original inspiration for this approach was based on an aspect of [Moskvina-Zhigljavsky 2003]
SDM 05 | 2005/04/21 |
© Copyright IBM Corporation 2005
Tokyo Research Laboratory
SST needs only two parameters in a standard setting:
the pattern length and the number of patterns
pattern length
w
w
w vectors
reference interval at t
w
test interval at t
w vectors
# of patterns
w
SDM 05 | 2005/04/21 |
t
l
© Copyright IBM Corporation 2005
Tokyo Research Laboratory
Computing the distances between variables and visualize them using
multidimensional scaling
Normalization policies and distance metrics
original time series
normalization
dt x (t )
i
distance
between
xi & xj
2
1
change-point score
dt x (t ) 1
i
standardize
2
dt
(
x
x
)
i j
dt | x
i
xj |
* SST series are nonnegative by definition, so they are
naturally interpreted as the probability densities of changes
Multidimensional scaling
(b) SST
(a) raw
retrieves the coordinates from a distance matrix
eigenvalue analysis shows d =2 is sufficient
SDM 05 | 2005/04/21 |
standardize
SST
smoothing
smoothing
L2 distance
L1 distance
multidimensional
scaling
© Copyright IBM Corporation 2005
Tokyo Research Laboratory
Summary and future work
Summary
We proposed a new framework of data mining for heterogeneous dynamic systems
The SST is a robust and effective way to remove the heterogeneity
An experiment demonstrated the utility of SST in discovering hidden structures
Future work
Speed up and sophisticate SST
Develop methodologies for causality analysis
x
z
v
w
x
z
SDM 05 | 2005/04/21 |
car B
y
car A
y
w
z
v
© Copyright IBM Corporation 2005