Transcript Document
Clustering Time Series Data An
Evolutionary Approach
Monica Chiş, Soumya Banerjee, Aboul Ella Hassanien
Springer 2009
Outline
Introduction
Related Works: Time Series Clustering
Evolutionary Time Series Clustering Algorithm
(ETSC)
Experimental Results
Conclusions and Future Work
Introduction
Time series are an ordered sequence of values
of a variable at equally spaced time intervals.
Traditional time series analysis and modeling
tend to be based on non-automatic and trialand-error approaches.
(Cont.)
Time series arise in many important areas.
Clustering of time series has attracted the
interest of researchers from a wide range of
fields.
Related Works: Time Series
Clustering
The mining of time series data has attracted great
attention in the data mining community in recent
years.
There are two main categories in time series
clustering:
Whole clustering
Subsequence
clustering
the clustering performed on many individual time
series to group similar series into clusters
based on sliding window extractions of a single time
series and aims to find similarity and differences
among different time window of a single time series.
(Cont.) Clustering Approach
Hierarchical Clustering
Generality
Only small datasets
K-Means
Distance is minimized
(Cont.) Time-Series
Use lower dimensionality that preserves the
original information
Describes the original shape of the time-series
data as closely as possible.
(Cont.)
Another categorization of clustering time series is the
classification in model based or model free.
Model based
assume some form of the underlying generating
process, estimate the model from each data then
cluster based on similarity between model parameters.
Markov chain
Hidden Makov Models
Model free
Fast Fourier transforms
Wavelet transforms
involve the specification of a specific distance
measure and/or a transformation of the data.
Evolutionary Time Series Clustering
Algorithm (ETSC)
A new evolutionary algorithms focused on
whole clustering of time series, called
Evolutionary Time Series Clustering- ETSC is
presented.
(Cont.)
Apply an evolutionary algorithm for solving
a problem is To find :
1. A suitable representation for candidate
solutions
2. By finding a good representation for
clustering time series dataset.
The fitness function will be considered a
similarity measure of time series
(Cont.)
Two time-series are similarity
if they have enough non overlapping time
ordered subsequences that are similar.
Two subsequences are similarity
if one is enclosed within an envelope of a user
defined width around another
Evolutionary Time Series Clustering
選擇與複製
突變
Experimental Results
Conclusions and Future Work
An evolutionary approach to perform clustering on
time series data set was presented.
Some different fitness functions are tested.
Hierarchical Clustering
Calculate the similarity
between all possible
combinations of two profiles
Keys
• Similarity
• Clustering
Two most similar clusters
are grouped together to form
a new cluster
Calculate the similarity
between the new cluster and
all remaining clusters.
出處:IR講義
(Cont.)
出處:IR講義
The K-Means Clustering Method
Example
10
10
9
9
8
8
7
7
6
6
5
5
10
9
8
7
6
5
4
4
3
2
1
0
0
1
2
3
4
5
6
7
8
K=2
Arbitrarily choose K
object as initial
cluster center
9
10
Assign
each
objects
to most
similar
center
3
2
1
0
0
1
2
3
4
5
6
7
8
9
10
Update
the
cluster
means
4
3
2
1
0
0
1
2
3
4
5
6
reassign
10
9
9
8
8
7
7
6
6
5
5
4
3
2
1
0
1
2
3
4
5
6
7
8
8
9
10
reassign
10
0
7
9
10
Update
the
cluster
means
4
3
2
1
0
0
1
2
3
4
5
6
出處:IR講義
7
8
9
10
Evoluationary Algorithm
Evolutionary algorithms are randomized search
optimization techniques guided by the principles of
natural biology evolution processes
(Cont.)
(Cont.)
They are effective and robust search processes
provide a near optimal solution of a fitness function.
(Cont.)
Evolutionary algorithms are used to improve the
robustness and accuracy of the more traditional
techniques used in feature extraction, feature
selection, classification and clustering
Gene Algorithm
1975年 John Holland 學者所提出
成為非常著名的演算法
Genetic Algorithm (GA)
基因演算法
又稱: 遺傳演算法
出處:基因演算法介紹
基因演算法的概念
將問題編碼成 染色體的型式
基因: 0 or 1
染色體: x = 0 1 0 0 1 1 0
設計 適應函數 (fitness function)
f(x) :適應函數,給一 x 可得一 適應值 f(x)
適應值愈高, 表愈適應環境
出處:基因演算法介紹
基因演算法的三個基本動作
選擇(Selection)與複製 (Reproduction):
X=0100110
選擇 適應值高 之染色體
複製成多個染色體
出處:基因演算法介紹
(Cont.)
交換 (crossover):
父:0 1 0 0 1 1 0
母:1 0 1 1 0 0 1
交換
0111010
1000101
出處:基因演算法介紹
(Cont.)
突變 (mutation):
0100110
1
出處:基因演算法介紹
(Cont.)
產生初始族群
是否滿足
停止條件
否
是
產生最適個體
計算適應値
選擇
複製
交換
突變
產生下一子代
出處:基因演算法介紹
Fitness Function (1)
Fitness Function (2)
(Cont.)
An ordered set of m realvalued variables
The dataset could be described in matrix as follows
An individual is represented as a vector
(Cont.)
dj is a time series with m real-valued
cj is an integer number from 1 to p
Representation indicates how time series are
assigned to classes
The value cj of the gene j indicates the classes to
which the object dj is assigned.
All the times series for dataset are assigned to a
class (cluster).
The number of classes is determining by the
evolutionary algorithms