PowerPoint Presentation - University of Utah

UNSUPERVISED ANALYSIS
•GOAL A: FIND GROUPS OF GENES THAT HAVE
CORRELATED EXPRESSION PROFILES. THESE GENES ARE
BELIEVED TO BELONG TO THE SAME BIOLOGICAL
PROCESS.
•GOAL B: DIVIDE TISSUES INTO GROUPS WITH SIMILAR
GENE EXPRESSION PROFILES. THESE TISSUES ARE
EXPECTED TO BE IN THE SAME BIOLOGICAL (CLINICAL)
STATE.
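Goal A relies on a measure of how correlated two expression profiles are. A minimal sketch of one common choice, the Pearson correlation, is given here; the slides do not say which measure was used, and the gene names and expression values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical expression levels of three genes across five samples.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [2.1, 3.9, 6.2, 8.0, 9.8]  # roughly 2 * gene_a: co-expressed
gene_c = [5.0, 1.0, 4.0, 2.0, 3.0]  # no clear relation to gene_a

r_ab = pearson(gene_a, gene_b)  # close to 1
r_ac = pearson(gene_a, gene_c)  # weak correlation
```

A dissimilarity such as 1 - r could then be passed to a clustering method, so that strongly correlated genes end up close together.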
CLUSTERING
DEFINITION OF THE CLUSTERING PROBLEM
CLUSTER ANALYSIS YIELDS DENDROGRAM
T (RESOLUTION)
BUT WHAT ABOUT THE OKAPI?
STATEMENT OF THE PROBLEM
GIVEN DATA POINTS Xi, i=1,2,...N, EMBEDDED IN
D-DIMENSIONAL SPACE, IDENTIFY THE
UNDERLYING STRUCTURE OF THE DATA.
AIMS: PARTITION THE DATA INTO M CLUSTERS,
POINTS OF SAME CLUSTER - "MORE SIMILAR"
M ALSO TO BE DETERMINED!
GENERATE DENDROGRAM,
IDENTIFY SIGNIFICANT, “STABLE” CLUSTERS
"ILL POSED": WHAT IS "MORE SIMILAR"?
RESOLUTION
CLUSTER ANALYSIS YIELDS DENDROGRAM
LINEAR ORDERING OF DATA
[Figure: dendrogram with axis T; labels YOUNG and OLD]
Agglomerative Hierarchical Clustering
Need to define the distance between the
new cluster and the other clusters.
Single Linkage: distance between closest pair.
Complete Linkage: distance between farthest pair.
Average Linkage: average distance between all pairs,
or distance between cluster centers.
[Figure: five data points (1-5) and their dendrogram; axis: distance between joined clusters]
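The single and complete linkage rules can be sketched as two small functions. This is an illustrative sketch that assumes Euclidean distance; the function names and the example points are hypothetical and not taken from the slides.

```python
import math
from itertools import product

def single_linkage(a, b):
    """Distance between the closest pair of points, one from each cluster."""
    return min(math.dist(p, q) for p, q in product(a, b))

def complete_linkage(a, b):
    """Distance between the farthest pair of points, one from each cluster."""
    return max(math.dist(p, q) for p, q in product(a, b))

cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(3.0, 0.0), (5.0, 0.0)]

d_single = single_linkage(cluster_a, cluster_b)      # pair (1,0)-(3,0)
d_complete = complete_linkage(cluster_a, cluster_b)  # pair (0,0)-(5,0)
```

The same two clusters are 2.0 apart under single linkage and 5.0 apart under complete linkage, which is why the resulting dendrogram depends on the update rule.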
The dendrogram induces a linear ordering
of the data points
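How a dendrogram yields a linear ordering can be seen in a naive agglomerative loop that keeps each cluster as an ordered list of its leaves. The sketch below assumes single linkage and Euclidean distance; `agglomerate` and the one-dimensional example points are hypothetical.

```python
import math

def agglomerate(points):
    """Naive single-linkage agglomerative clustering.

    Returns (heights, order): the distance at each merge and the
    left-to-right leaf order of the resulting dendrogram.
    """
    # Each cluster is stored as an ordered list of point indices (its leaves).
    clusters = [[i] for i in range(len(points))]
    heights = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(points[p], points[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        heights.append(d)
        # Concatenating the leaf lists puts the merged branches side by side.
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return heights, clusters[0]

points = [(0.0,), (10.0,), (1.0,), (11.0,), (5.0,)]
heights, order = agglomerate(points)
```

Points that merge early end up adjacent in `order`. The ordering is not unique: the two branches at every merge can be swapped without changing the tree, so a dendrogram with N leaves is consistent with 2^(N-1) orderings.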
Hierarchical Clustering Summary
• Results depend on distance update method
• Greedy iterative process
• NOT robust against noise
• No inherent measure to identify stable clusters
COMPACT WELL SEPARATED CLOUDS – EVERYTHING WORKS
2 FLAT CLOUDS - SINGLE LINKAGE WORKS
SINGLE LINKAGE SENSITIVE TO NOISE
Average linkage
Need to define the distance between the
new cluster and the other clusters.
Average Linkage: average distance between all pairs
Mean Linkage: distance between centroids
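The difference between the two rules can be checked on a small example. This sketch assumes Euclidean distance; the function names and points are hypothetical. What the slide calls mean linkage is often called centroid linkage elsewhere.

```python
import math
from itertools import product

def average_linkage(a, b):
    """Average of the distances over all pairs, one point from each cluster."""
    return sum(math.dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def mean_linkage(a, b):
    """Distance between the centroids of the two clusters."""
    return math.dist(centroid(a), centroid(b))

cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 0.0), (3.0, 2.0)]

d_avg = average_linkage(cluster_a, cluster_b)  # (3 + sqrt(13)) / 2
d_mean = mean_linkage(cluster_a, cluster_b)    # centroids (0,1) and (3,1)
```

Here the centroids are exactly 3.0 apart, while the average over all four pairs is larger because the diagonal pairs are farther apart.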
[Figure: five data points (1-5) and their dendrogram; axis: distance between joined clusters]
how many clusters?
3 LARGE
MANY small
(SPC)
K-means
•Start with random positions of centroids. (Iteration = 0)
•Assign data points to centroids. (Iteration = 1)
•Move centroids to center of assigned points. (Iteration = 1)
•Iterate till minimal cost. (Iteration = 3)
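The K-means steps can be sketched in a few lines of plain Python. This is a minimal sketch assuming Euclidean distance; the initial centroids are passed in explicitly (fixed rather than random, so the run is reproducible), and the example points are hypothetical.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Basic k-means; `centroids` holds the initial centroid positions."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda k: math.dist(p, centroids[k]))
                  for p in points]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(old)  # empty cluster: leave centroid in place
        if new_centroids == centroids:  # nothing moved, so stop iterating
            break
        centroids = new_centroids
    return centroids, labels

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
init = [(0.0, 0.0), (1.0, 0.0)]  # hypothetical starting centroids
centroids, labels = kmeans(points, init)
```

With these starting positions the first assignment is lopsided, but after two more passes the centroids settle on the two visible groups. Other starting positions can converge to a different partition.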
K-means - Summary
• Result depends on initial centroids’ position
• Fast algorithm: compute distances from data
points to centroids
• Must preset K
• Fails for non-spherical distributions
TSS vs K
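A curve of this kind can be illustrated on a toy data set, assuming TSS stands for the total within-cluster sum of squared distances to the cluster mean (the slides do not define it). Instead of running K-means, the sketch below uses a brute-force search over contiguous splits of sorted one-dimensional values, which is only feasible for very small data sets; `best_tss` and the values are hypothetical.

```python
from itertools import combinations

def sse(segment):
    """Sum of squared deviations from the mean of one group."""
    m = sum(segment) / len(segment)
    return sum((x - m) ** 2 for x in segment)

def best_tss(values, k):
    """Smallest total within-cluster sum of squares over all ways of
    cutting the sorted 1-D values into k contiguous groups."""
    xs = sorted(values)
    n = len(xs)
    best = float("inf")
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0, *cuts, n)
        total = sum(sse(xs[a:b]) for a, b in zip(bounds, bounds[1:]))
        best = min(best, total)
    return best

# Three tight groups around 1, 5 and 9 (hypothetical values).
values = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9, 9.0, 9.2, 8.8]
curve = {k: best_tss(values, k) for k in range(1, 6)}
```

The cost drops sharply up to K = 3 and only slowly afterwards. Such a bend in the curve is one heuristic for picking K, since the cost itself never increases as K grows.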
Iris setosa
Iris versicolor
Iris virginica
50 specimens from each group
4 numbers for each flower
150 data points in 4-dimensional space
150 points in d=4
3 large clusters
Output of SPC
Stable clusters
“live” for large T
Choosing a value for T
Same data - Average Linkage
No analog for 
Same data - Average Linkage
Examining
this cluster
glioblastoma
GLIOBLASTOMA: M. HEGI et al., CHUV, CLONTECH ARRAYS
Coupled Two-Way Clustering (CTWC) of 358 Genes and 36 Samples
[Fig. 2A: labels S1(G1), S2, S3, G5, G12, GENES, T, A(II); sample groups ScGBM, PrGBM, CL]
Super-Paramagnetic Clustering of All Samples Using Stable Gene Cluster G5
[Fig. 2B: sample clusters S1(G5), S10, S11, S12, S13, S14]
Accession numbers: AB004904, M32977, M35410, X51602, M96322, AB004903, X52946, J04111, X79067
Gene names: STAT-induced STAT inhibitor 3, VEGF, IGFBP2, VEGFR1, gravin, STAT-induced STAT inhibitor 2, PTN, c-jun, TIS11B