
國立雲林科技大學
National Yunlin University of Science and Technology
Scaling Clustering Algorithm for
Massive Data Sets using Data Streams
Advisor: Dr. Hsu
Graduate: Jian-Lin Kuo
Authors: Silvia Nittel, Kelvin T. Leung, Amy Braverman
Intelligent Database Systems Lab
N.Y.U.S.T.
I.M.
Outline
- Motivation
- Objective
- Introduction
- Literature review
- Implementing K-means Using Data Stream
- Space & Time complexity
- Parallelizing Partial/Merge K-means
- Experimental Evaluation
- Conclusions
Motivation
- Computing data mining algorithms such as clustering techniques on massive data sets is still neither feasible nor efficient today.
- To cluster massive data sets or subsets, overall execution time and scalability are important issues.
Objective
- The proposed approach achieves overall high-performance computation for massive data.
Introduction
- To improve data distribution and analysis, we substitute data sets with compressed counterparts.
- We partition the data set into 1° × 1° grid cells.
Literature review
- K-means algorithm
  1. Initialization: select a set of k initial cluster centroids randomly, i.e. m_j, 1 ≤ j ≤ k.
  2. Distance Calculation: for each data point x_i, 1 ≤ i ≤ n, compute its Euclidean distance to each centroid.
  3. Centroid Recalculation: for each 1 ≤ j ≤ k, compute μ_j, the actual mean of cluster C_j, which becomes the new centroid (where μ_j = (Σ_{x∈C_j} x) / |C_j|).
  4. Convergence Condition: repeat (2) and (3) until the convergence criterion is met: ||MSE(n−1) − MSE(n)|| ≤ 1×10⁻⁹.
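The four steps above can be sketched as follows (a minimal illustration, not the authors' implementation; function and variable names are ours):

```python
import numpy as np

def kmeans(X, k, tol=1e-9, seed=0):
    """Minimal k-means following steps (1)-(4); illustrative only."""
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k random data points as initial centroids m_j.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    prev_mse = np.inf
    while True:
        # 2. Distance Calculation: Euclidean distance of each x_i to each centroid.
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # 3. Centroid Recalculation: mu_j = (sum of x in C_j) / |C_j|.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
        # 4. Convergence Condition: stop when |MSE(n-1) - MSE(n)| <= 1e-9.
        mse = float((dist[np.arange(len(X)), labels] ** 2).mean())
        if abs(prev_mse - mse) <= tol:
            return centroids, labels
        prev_mse = mse
```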
Literature review (cont.)
- Computing K-means via a serial algorithm:
  1. scanning a grid cell C_j at a time,
  2. compressing it, and
  3. scanning the next grid cell.
- All data points of one grid cell are kept in memory.
Literature review (cont.)
- The quality of the clustering process is indicated by the error function E, which is defined as
  E = Σ_{k=1..K} Σ_{x∈C_k} ||μ_k − x||²
- In this case,
  - memory complexity is O(N)
  - time complexity is O(GRINK)
Literature review (cont.)
- Parallel implementations of K-means
  - Method A is a naive way of parallelizing k-means: assign the clustering of one grid cell each to a processor.
  - Method B is to assign each run R_i of k-means on one grid cell C_j, using one set of initial, randomly chosen k seeds, to a processor.
Literature review (cont.)
  - Method C is to divide the grid cell into disjoint subsets (clusters) assigned to different slaves by choosing a set of initial centroids.
  - It reduces both the computational and the memory bottleneck.
Implementing K-means Using Data Stream
- It consists of the following steps:
  - Scan the temporal-spatial grid cells.
    1) We assume that the data has been scanned once and sorted into one-degree-latitude × one-degree-longitude grid buckets used as data input.
    2) All data points belonging to the grid cell have to be available.
Implementing K-means Using Data Stream (cont.)
- Partial k-means on a subset of data points.
  1) Instead of storing all data points v_1,…,v_n of a grid cell C_s in memory, divide the data of C_s into p partitions P_1,…,P_p.
  2) All data points v_1,…,v_m of a partition P_j can be stored in available memory.
  3) Select a set of k random seeds for a partition P_j, run until the convergence criterion is met, and repeat for several sets of random k seeds.
  4) The partial k-means produces a set of weighted centroids for P_j: {(c_1j, w_1j), (c_2j, w_2j), ..., (c_kj, w_kj)}.
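The partial step might be sketched as below (our illustration, not the paper's code; the helper, partition scheme, and the choice of cluster size as weight are assumptions):

```python
import numpy as np

def _kmeans(X, k, iters=20, seed=0):
    """Tiny fixed-iteration k-means helper (illustrative)."""
    rng = np.random.default_rng(seed)
    c = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        labels = np.linalg.norm(X[:, None, :] - c[None, :, :], axis=2).argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                c[j] = X[labels == j].mean(axis=0)
    return c, labels

def partial_kmeans(X, k, p):
    """Split a grid cell's data into p partitions P_1..P_p, cluster each
    partition that fits in memory, and emit weighted centroids (c_ij, w_ij),
    where the weight is the number of points in the cluster."""
    weighted = []
    for Pj in np.array_split(X, p):
        c, labels = _kmeans(Pj, k)
        w = np.bincount(labels, minlength=k)
        weighted.extend((c[j], int(w[j])) for j in range(k) if w[j] > 0)
    return weighted
```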
Implementing K-means Using Data Stream (cont.)
- Merge k-means on the results of step 2.
  1) It performs another k-means using the set of all centroids that were computed by the partial k-means for all partitions P_1,…,P_p.
  2) This gives a set S of M D-dimensional centroids {(c_1, w_1), (c_2, w_2), ..., (c_M, w_M)}, where M is the total number of centroids over P_1,…,P_p.
Implementing K-means Using Data Stream (cont.)
- Merge K-means algorithm:
  1) Initialization: select the subset of k initial cluster centroids z_i whose weights w_i are the k largest weights in S.
  2) Distance Calculation: for each data point c_i, 1 ≤ i ≤ M, compute its Euclidean distance to each centroid z_j, 1 ≤ j ≤ k, and then find the closest cluster centroid.
  3) Centroid Recalculation: for each 1 ≤ j ≤ k, compute the actual weighted mean of cluster C_j, which becomes the new centroid (where μ_j = (Σ_{x_i∈C_j} w_i·x_i) / (Σ_{x_i∈C_j} w_i)).
  4) Convergence Condition: repeat (2) and (3) until the convergence criterion is met, e.g. ||MSE(n−1) − MSE(n)|| ≤ 1×10⁻⁹.
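The merge step can be sketched as a weighted k-means over the partial centroids (our illustration; here the weighted mean divides by the sum of weights, and all names are assumptions):

```python
import numpy as np

def merge_kmeans(weighted, k, tol=1e-9, iters=100):
    """Merge-step sketch: weighted k-means over the M partial centroids,
    seeded with the k heaviest centroids (step 1)."""
    C = np.array([c for c, _ in weighted], dtype=float)
    w = np.array([wt for _, wt in weighted], dtype=float)
    # 1. Initialization: the k centroids with the largest weights become seeds z_j.
    z = C[np.argsort(w)[-k:]].copy()
    prev_mse = np.inf
    for _ in range(iters):
        # 2. Distance Calculation: assign each (c_i, w_i) to its closest z_j.
        d = np.linalg.norm(C[:, None, :] - z[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # 3. Centroid Recalculation: weighted mean of each cluster.
        for j in range(k):
            m = labels == j
            if w[m].sum() > 0:
                z[j] = (C[m] * w[m, None]).sum(axis=0) / w[m].sum()
        # 4. Convergence Condition on the change in weighted MSE.
        mse = float((w * d[np.arange(len(C)), labels] ** 2).sum() / w.sum())
        if abs(prev_mse - mse) <= tol:
            break
        prev_mse = mse
    return z
```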
Space & Time complexity
- Partial k-means vs. Serial k-means

                     Space    Time
    Serial K-means   O(N)     O(NKI)
    Partial K-means  O(N')    O(N'KI')

  where N is the number of data points,
  K is the number of centroids,
  I is the number of iterations to converge,
  O(N'p) = O(N) in the space complexity (p is the number of partitions), and
  O(N'KI'p) << O(NKI) in the time complexity.
Space & Time complexity (cont.)
- Merge k-means

                   Space    Time
    Merge K-means  O(Kp)    O(IK²p)

  where K is the number of weighted centroids from each partition,
  p is the number of partitions, and
  I is the number of iterations to converge.
Parallelizing Partial/Merge K-means
- Several options for parallelization can be considered.
  - Option 1 is to clone the partial k-means to as many machines as possible, compute all k-means algorithms on the data partitions in parallel, and merge the results on one of the machines.
  - Option 2 is to send a data partition to several machines at the same time, and perform partial k-means with a different set of initial seeds on each machine in parallel.
  - Option 3 is to break up the partial k-means into several finer-grained operators.
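Option 1's orchestration might look like the sketch below (a scheduling illustration only: the worker is a stand-in rather than a real partial k-means, all names are ours, and threads stand in for separate machines):

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def cluster_partition(Pj):
    """Stand-in for the partial k-means worker: returns one weighted
    summary (mean, point count) for the partition, for illustration."""
    return Pj.mean(axis=0), len(Pj)

def option1(X, p):
    """Option 1 sketch: ship each of the p partitions to a worker, run the
    partial step in parallel, then merge the weighted results on one node."""
    parts = np.array_split(X, p)
    with ThreadPoolExecutor(max_workers=p) as pool:
        results = list(pool.map(cluster_partition, parts))
    # a real pipeline would now run merge k-means over `results`
    return results
```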
Experimental Evaluation
- The goal of the experimental evaluation is to
  - compare the scalability of the partial/merge k-means (5-split / 10-split case),
  - measure the speed-up of the processing if the partial k-means operators are parallelized and run on different machines,
  - compare the achieved clustering quality with a serial k-means that clusters all data points in the same iteration, and
  - analyze the quality of the merge k-means operator with regard to the size and number of data partitions.
Experimental Evaluation (cont.)
- Experiment Environment
  - a Conquest version that was implemented using JDK 1.3.1,
  - four Dell OptiPlex GX260 PCs, each equipped with a 2.8 GHz Intel Pentium 4 processor, 1 GB of RAM, and an 80 GB hard disk, connected by a Netgear GS508T Gigabit switch.
Experimental Evaluation (cont.)
- Data Sets
  - the EOS MISR data set,
  - 1° × 1° grid cells with the following characteristics:
    1) the number of data points per grid cell varies among 250, 2,500, 5,000, 20,000, 50,000, and 75,000 points,
    2) six attributes for each data point,
    3) a fixed k for all configurations (k = 40).
Experimental Evaluation (cont.)
- Overall execution time, serial vs. partial/merge K-means
  - The computation time for the serial k-means increases exponentially with the number of data points per grid cell.
  - The overall execution time of the partial/merge k-means is in most cases significantly lower.
Experimental Evaluation (cont.)
- Comparing 10-split vs. 5-split vs. serial (figure)
Experimental Evaluation (cont.)
- Minimum mean square error, serial vs. 5-split vs. 10-split (figure)
- Partial K-means processing time, 5-split vs. 10-split (figure)
Conclusions
- The partial/merge stream-based k-means
  - makes it simpler to find an appropriate cluster representation, and
  - provides a highly scalable, parallel, and efficient approach with significantly higher clustering quality.