clustering-basic


What is Cluster Analysis?
• Clustering: partitioning a data set into several groups (clusters) such that
– Homogeneity: objects belonging to the same cluster are similar to each other
– Separation: objects belonging to different clusters are dissimilar to each other
• Three fundamental elements of clustering:
– The set of objects
– The set of attributes
– The distance measure
Supervised versus Unsupervised Learning
• Supervised learning (classification)
– Supervision: training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
– New data are classified based on the training set
• Unsupervised learning (clustering)
– Class labels of the training data are unknown
– Given a set of measurements, observations, etc., the task is to establish the existence of classes or clusters in the data
What Is Good Clustering?
• A good clustering method produces high-quality clusters with
– high intra-class similarity
– low inter-class similarity
• The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
• The quality of a clustering result depends on both the similarity measure used by the method and its implementation
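The intra-/inter-class similarity criterion can be checked numerically. A minimal sketch, assuming a hypothetical toy data set and Euclidean distance (helper names are illustrative, not from the source):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_intra_distance(cluster):
    """Average pairwise distance within one cluster (lower = more homogeneous)."""
    pairs = [(i, j) for i in range(len(cluster)) for j in range(i + 1, len(cluster))]
    return sum(euclidean(cluster[i], cluster[j]) for i, j in pairs) / len(pairs)

def mean_inter_distance(c1, c2):
    """Average distance between points of two clusters (higher = better separated)."""
    dists = [euclidean(a, b) for a in c1 for b in c2]
    return sum(dists) / len(dists)

# Two toy clusters (hypothetical data): a good clustering keeps
# intra-cluster distances small and inter-cluster distances large.
c1 = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1)]
c2 = [(5.0, 5.0), (5.1, 5.2), (4.9, 5.1)]
print(mean_intra_distance(c1) < mean_inter_distance(c1, c2))  # True
```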
Requirements of Clustering in Data Mining
• Scalability
• Ability to deal with different types of attributes
• Minimal requirements for domain knowledge to determine input parameters
• Ability to deal with noise and outliers
• Discovery of clusters with arbitrary shape
• Insensitivity to the order of input records
• Handling of high dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability
Application Examples
• As a stand-alone tool: explore the data distribution
• As a preprocessing step for other algorithms
• Pattern recognition, spatial data analysis, image processing, market research, the WWW, …
– Cluster documents
– Cluster web log data to discover groups of similar access patterns
Co-expressed Genes
• Gene expression data matrix
• Gene expression patterns
• Why look for co-expressed genes?
– Co-expression indicates co-function.
– Co-expression also indicates co-regulation.
Gene-based Clustering
[Figure: three gene expression profiles (expression value vs. time point) from Iyer's data [2], showing examples of co-expressed genes and coherent patterns in gene expression data.]
[2] Iyer, V.R. et al. The transcriptional program in the response of human fibroblasts to serum. Science, 283:83–87, 1999.
Data Matrix
• For memory-based clustering
– Also called object-by-variable structure
• Represents n objects with p variables (attributes, measures)
– A relational table

\[
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots  &        & \vdots  &        & \vdots  \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots  &        & \vdots  &        & \vdots  \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]
Two-way Clustering of Microarray Data

          sample 1  sample 2  sample 3  sample 4  …
gene 1      0.13      0.72      0.10      0.57
gene 2      0.34      1.58      1.05      1.15
gene 3      0.43      1.10      0.97      1.00
gene 4      1.22      0.97      1.00      0.85
gene 5     -0.89      1.21      1.29      1.08
gene 6      1.10      1.45      1.44      1.12
gene 7      0.83      1.15      1.10      1.00
gene 8      0.87      1.32      1.35      1.13
gene 9     -0.33      1.01      1.38      1.21
gene 10     0.10      0.85      1.03      1.00
…

• Clustering genes
– Samples are attributes
– Find genes with similar function
• Clustering samples
– Genes are attributes
– Find samples with similar phenotype, e.g. cancers
– Feature selection: informative genes
– Curse of dimensionality
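Which axis is clustered decides which dimension serves as the attributes; transposing the matrix switches between the two views. A minimal sketch using the first three genes of the microarray matrix:

```python
# Rows = genes, columns = samples 1-4 (first three genes of the matrix above).
data = [
    [0.13, 0.72, 0.10, 0.57],   # gene 1
    [0.34, 1.58, 1.05, 1.15],   # gene 2
    [0.43, 1.10, 0.97, 1.00],   # gene 3
]

# Clustering genes: each row is an object, the samples are its attributes.
gene_objects = data

# Clustering samples: transpose, so each sample becomes an object
# whose attributes are the expression values across genes.
sample_objects = [list(col) for col in zip(*data)]

print(sample_objects[0])  # [0.13, 0.34, 0.43] -- sample 1 across genes 1-3
```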
Dissimilarity Matrix
• For memory-based clustering
– Also called object-by-object structure
– Proximities of pairs of objects
– d(i,j): dissimilarity between objects i and j
– Nonnegative
– Close to 0: similar

\[
\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots  & \vdots  & \vdots & \ddots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
\]
Distance Matrix

Original data matrix (genes g1–g10 over samples s1–s4):

        s1     s2     s3     s4   …
g1     0.13   0.72   0.10   0.57
g2     0.34   1.58   1.05   1.15
g3     0.43   1.10   0.97   1.00
g4     1.22   0.97   1.00   0.85
g5    -0.89   1.21   1.29   1.08
g6     1.10   1.45   1.44   1.12
g7     0.83   1.15   1.10   1.00
g8     0.87   1.32   1.35   1.13
g9    -0.33   1.01   1.38   1.21
g10    0.10   0.85   1.03   1.00

Distance matrix computed from it:

        g1      g2      g3      g4    …
g1      0     D(1,2)  D(1,3)  D(1,4)
g2              0     D(2,3)  D(2,4)
g3                      0     D(3,4)
g4                              0
…
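Computing the distance matrix from the data matrix can be sketched as follows; a minimal version assuming Euclidean distance between gene rows (function names are illustrative):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two rows of the data matrix."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_matrix(X):
    """n x n symmetric dissimilarity matrix with D[i][j] = d(i, j)."""
    n = len(X)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            D[i][j] = D[j][i] = euclidean(X[i], X[j])
    return D

# First three genes of the data matrix above.
genes = [
    [0.13, 0.72, 0.10, 0.57],   # g1
    [0.34, 1.58, 1.05, 1.15],   # g2
    [0.43, 1.10, 0.97, 1.00],   # g3
]
D = distance_matrix(genes)
# The diagonal is 0 and the matrix is symmetric: D[i][j] == D[j][i].
```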
How Good Is the Clustering?
• Dissimilarity/similarity depends on the distance function
– Different applications use different functions
– Inter-cluster distance → maximize
– Intra-cluster distance → minimize
• Judgment of clustering quality is typically highly subjective
Types of Data in Clustering
• Interval-scaled variables
• Binary variables
• Nominal, ordinal, and ratio variables
• Variables of mixed types
Interval-valued Variables
• Continuous measurements on a roughly linear scale
– Weight, height, latitude and longitude coordinates, temperature, etc.
• Effect of measurement units on attributes
– Smaller unit → larger variable range → larger effect on the result
– Remedy: standardization plus background knowledge
Standardization
• Calculate the mean absolute deviation

\[
s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right),
\]

where

\[
m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right).
\]

– The deviations are not squared, so the effect of outliers is reduced.
• Calculate the standardized measurement (z-score)

\[
z_{if} = \frac{x_{if} - m_f}{s_f}
\]

• The mean absolute deviation is more robust than the standard deviation
– The effect of outliers is reduced but remains detectable
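The two steps above can be sketched directly; a minimal version on hypothetical height data with one outlier:

```python
def standardize(values):
    """z-scores computed with the mean absolute deviation s_f
    (not the standard deviation), as in the formulas above."""
    n = len(values)
    m = sum(values) / n                      # m_f: mean of the attribute
    s = sum(abs(x - m) for x in values) / n  # s_f: mean absolute deviation
    return [(x - m) / s for x in values]

# Hypothetical heights in cm; 250.0 is an outlier.
heights = [160.0, 170.0, 180.0, 250.0]
z = standardize(heights)
# m_f = 190, s_f = (30 + 20 + 10 + 60) / 4 = 30
print(z)  # [-1.0, -0.666..., -0.333..., 2.0]
```

The outlier still stands out (z = 2.0), but less extremely than it would with a squared deviation: reduced, yet detectable.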
Minkowski Distance
• Minkowski distance: a generalization

\[
d(i,j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\right)^{1/q} \quad (q > 0)
\]

• If q = 2, d is the Euclidean distance
• If q = 1, d is the Manhattan distance
• Example: for X_i = (1, 7) and X_j = (7, 1), the Manhattan distance (q = 1) is 6 + 6 = 12, while the Euclidean distance (q = 2) is √72 ≈ 8.48.
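The general formula reduces to a one-liner; a minimal sketch reproducing the X_i = (1, 7), X_j = (7, 1) example:

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between points x and y."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

xi, xj = (1, 7), (7, 1)
manhattan = minkowski(xi, xj, 1)   # q = 1: |1-7| + |7-1| = 12
euclid = minkowski(xi, xj, 2)      # q = 2: sqrt(36 + 36) = sqrt(72) ~ 8.49
```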
Properties of Minkowski Distance
• Nonnegative: d(i,j) ≥ 0
• The distance of an object to itself is 0: d(i,i) = 0
• Symmetric: d(i,j) = d(j,i)
• Triangle inequality: d(i,j) ≤ d(i,k) + d(k,j)
Major Clustering Approaches
• Partitioning algorithms: construct various partitions and then evaluate them by some criterion
• Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion
• Density-based: based on connectivity and density functions
• Grid-based: based on a multiple-level granularity structure
Clustering Algorithms
• If we "cluster" the clustering algorithms themselves:
– Partition-based
• Centroid-based: K-means
• Medoid-based: PAM, CLARA, CLARANS
– Hierarchical clustering
– Density-based
– Model-based
– …
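The centroid-based branch of the taxonomy can be illustrated with K-means. A minimal pure-Python sketch, assuming naive initialization from the first k points and a fixed iteration count (a real implementation would use better initialization and a convergence test):

```python
import math

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    """Component-wise mean of a set of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def kmeans(points, k, iters=20):
    """Minimal K-means: alternate assignment and centroid update."""
    centroids = points[:k]                  # naive init: first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                    # assign each point to nearest centroid
            idx = min(range(k), key=lambda c: dist(p, centroids[c]))
            clusters[idx].append(p)
        centroids = [centroid(c) if c else centroids[i]  # recompute centroids
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Two well-separated toy groups of three points each.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cents, clus = kmeans(pts, 2)
print(sorted(len(c) for c in clus))  # [3, 3]
```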