Transcript Document

Clustering Time Series Data An
Evolutionary Approach
Monica Chiş, Soumya Banerjee, Aboul Ella Hassanien
Springer 2009
Outline
 Introduction
 Related Works: Time Series Clustering
 Evolutionary Time Series Clustering Algorithm
(ETSC)
 Experimental Results
 Conclusions and Future Work
Introduction
 Time series are an ordered sequence of values
of a variable at equally spaced time intervals.
 Traditional time series analysis and modeling
tend to be based on non-automatic and trialand-error approaches.
(Cont.)
 Time series arise in many important areas.
 Clustering of time series has attracted the
interest of researchers from a wide range of
fields.
Related Works: Time Series
Clustering
 The mining of time series data has attracted great
attention in the data mining community in recent
years.
 There are two main categories in time series
clustering:
 Whole clustering
 Subsequence
clustering
the clustering performed on many individual time
series to group similar series into clusters
based on sliding window extractions of a single time
series and aims to find similarity and differences
among different time window of a single time series.
(Cont.) Clustering Approach
 Hierarchical Clustering
 Generality
 Only small datasets
 K-Means
 Distance is minimized
(Cont.) Time-Series
 Use lower dimensionality that preserves the
original information
 Describes the original shape of the time-series
data as closely as possible.
(Cont.)
 Another categorization of clustering time series is the
classification in model based or model free.
 Model based
assume some form of the underlying generating
process, estimate the model from each data then
cluster based on similarity between model parameters.
 Markov chain
 Hidden Makov Models
 Model free
 Fast Fourier transforms
 Wavelet transforms
involve the specification of a specific distance
measure and/or a transformation of the data.
Evolutionary Time Series Clustering
Algorithm (ETSC)
 A new evolutionary algorithms focused on
whole clustering of time series, called
Evolutionary Time Series Clustering- ETSC is
presented.
(Cont.)
 Apply an evolutionary algorithm for solving
a problem is To find :
1. A suitable representation for candidate
solutions
2. By finding a good representation for
clustering time series dataset.
 The fitness function will be considered a
similarity measure of time series
(Cont.)
 Two time-series are similarity

if they have enough non overlapping time
ordered subsequences that are similar.
 Two subsequences are similarity

if one is enclosed within an envelope of a user
defined width around another
Evolutionary Time Series Clustering
選擇與複製
突變
Experimental Results
Conclusions and Future Work
 An evolutionary approach to perform clustering on
time series data set was presented.
 Some different fitness functions are tested.
Hierarchical Clustering
Calculate the similarity
between all possible
combinations of two profiles
Keys
• Similarity
• Clustering
Two most similar clusters
are grouped together to form
a new cluster
Calculate the similarity
between the new cluster and
all remaining clusters.
出處:IR講義
(Cont.)
出處:IR講義
The K-Means Clustering Method
 Example
10
10
9
9
8
8
7
7
6
6
5
5
10
9
8
7
6
5
4
4
3
2
1
0
0
1
2
3
4
5
6
7
8
K=2
Arbitrarily choose K
object as initial
cluster center
9
10
Assign
each
objects
to most
similar
center
3
2
1
0
0
1
2
3
4
5
6
7
8
9
10
Update
the
cluster
means
4
3
2
1
0
0
1
2
3
4
5
6
reassign
10
9
9
8
8
7
7
6
6
5
5
4
3
2
1
0
1
2
3
4
5
6
7
8
8
9
10
reassign
10
0
7
9
10
Update
the
cluster
means
4
3
2
1
0
0
1
2
3
4
5
6
出處:IR講義
7
8
9
10
Evoluationary Algorithm
 Evolutionary algorithms are randomized search
optimization techniques guided by the principles of
natural biology evolution processes
(Cont.)
(Cont.)
 They are effective and robust search processes
provide a near optimal solution of a fitness function.
(Cont.)
 Evolutionary algorithms are used to improve the
robustness and accuracy of the more traditional
techniques used in feature extraction, feature
selection, classification and clustering
Gene Algorithm
 1975年 John Holland 學者所提出
 成為非常著名的演算法
 Genetic Algorithm (GA)
 基因演算法
 又稱: 遺傳演算法
出處:基因演算法介紹
基因演算法的概念
 將問題編碼成 染色體的型式
 基因: 0 or 1
 染色體: x = 0 1 0 0 1 1 0
 設計 適應函數 (fitness function)
 f(x) :適應函數,給一 x 可得一 適應值 f(x)
 適應值愈高, 表愈適應環境
出處:基因演算法介紹
基因演算法的三個基本動作
 選擇(Selection)與複製 (Reproduction):
 X=0100110
 選擇 適應值高 之染色體
 複製成多個染色體
出處:基因演算法介紹
(Cont.)
 交換 (crossover):
 父:0 1 0 0 1 1 0
 母:1 0 1 1 0 0 1
交換
0111010
1000101
出處:基因演算法介紹
(Cont.)
 突變 (mutation):
 0100110
1
出處:基因演算法介紹
(Cont.)
產生初始族群
是否滿足
停止條件
否
是
產生最適個體
計算適應値
選擇
複製
交換
突變
產生下一子代
出處:基因演算法介紹
Fitness Function (1)
Fitness Function (2)
(Cont.)
 An ordered set of m realvalued variables
 The dataset could be described in matrix as follows
 An individual is represented as a vector
(Cont.)
dj is a time series with m real-valued
cj is an integer number from 1 to p
 Representation indicates how time series are
assigned to classes
 The value cj of the gene j indicates the classes to
which the object dj is assigned.
 All the times series for dataset are assigned to a
class (cluster).
 The number of classes is determining by the
evolutionary algorithms