Dimension reduction:
PCA and Clustering
by Agnieszka S. Juncker
Parts of the slides are adapted from Chris Workman
The DNA Array Analysis Pipeline
• Question
• Experimental Design
• Array design / Probe design (or: buy chip/array)
• Sample Preparation
• Hybridization
• Image analysis
• Normalization
• Expression Index Calculation
• Comparable Gene Expression Data
• Statistical Analysis / Fit to Model (time series)
• Advanced Data Analysis: Clustering, Meta analysis, PCA, Classification, Promoter Analysis, Survival analysis, Regulatory Network
Motivation: Multidimensional data
Gene expression matrix: 36 probes (rows) × 8 patients (columns)

Probe ID       Pat1  Pat2  Pat3  Pat4  Pat5  Pat6  Pat7  Pat8
209619_at      7758  4705  5342  7443  8747  4933  7950  5031
32541_at        280   387   392   238   385   329   337   163
206398_s_at    1050   835  1268  1723  1377   804  1846  1180
219281_at       391   593   298   265   491   517   334   387
207857_at      1425   977  2027  1184   939   814   658   593
211338_at        37    27    28    38    33    16    36    23
213539_at       124   197   454   116   162   113    97    97
221497_x_at     120    86   175    99   115    80    83   119
213958_at       179   225   449   174   185   203   186   185
210835_s_at     203   144   197   314   250   353   173   285
209199_s_at     758  1234   833  1449   769  1110   987   638
217979_at       570   563   972   796   869   494   673  1013
201015_s_at     533   343   325   270   691   460   563   321
203332_s_at     649   354   494   554   710   455   748   392
204670_x_at    5577  3216  5323  4423  5771  3374  4328  3515
208788_at       648   327  1057   746   541   270   361   774
210784_x_at     142   151   144   173   148   145   131   146
204319_s_at     298   172   200   298   196   104   144   110
205049_s_at    3294  1351  2080  2066  3726  1396  2244  2142
202114_at       833   674   733  1298   862   371   886   501
213792_s_at     646   375   370   436   738   497   546   406
203932_at      1977  1016  2436  1856  1917   822  1189  1092
203963_at        97    63    77   136    85    74    91    61
203978_at       315   279   221   260   227   222   232   141
203753_at      1468  1105   381  1154   980  1419  1253   554
204891_s_at      78    71   152    74   127    57    66   153
209365_s_at     472   519   365   349   756   528   637   828
209604_s_at     772    74   130   216   108   311    80   235
211005_at        49    58   129    70    56    77    61    61
219686_at       694   342   345   502   960   403   535   513
38521_at        775   604   305   563   542   543   725   587
217853_at       367   168   107   160   287   264   273   113
217028_at      4926  2667  3542  5163  4683  3281  4822  3978
201137_s_at    4733  2846  1834  5471  5079  2330  3345  1460
202284_s_at     600  1823  1657  1177   972  2303  1574  1731
201999_s_at     897   959   800   808   297  1014   998   663
Dimension reduction methods
• Principal component analysis (PCA)
• Cluster analysis
• Multidimensional scaling
• Correspondence analysis
• Singular value decomposition
Principal Component Analysis (PCA)
• Used for visualization of complex data
• Developed to capture as much of the variation in the data as possible
Principal components
• 1st principal component (PC1)
  – the direction along which there is the greatest variation
• 2nd principal component (PC2)
  – the direction with the maximum remaining variation, orthogonal to PC1
Principal components
• General properties of principal components:
  – summary variables
  – linear combinations of the original variables
  – uncorrelated with each other
  – capture as much of the original variance as possible
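To make these properties concrete, here is a minimal PCA sketch in NumPy (an illustration, not the implementation behind the plots in these slides). It assumes a hypothetical matrix `X` with one row per sample (patient) and one column per gene; it centers the data, takes the principal directions from an SVD, and returns both the 2-D coordinates used for plotting and the fraction of variance each PC explains, as in the variance bar plot below.

```python
import numpy as np

def pca(X, n_components=2):
    """Project X onto its first principal components."""
    Xc = X - X.mean(axis=0)              # center each variable (gene)
    # SVD of the centered matrix: rows of Vt are the PC directions
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T    # coordinates along PC1, PC2, ...
    var_explained = s**2 / np.sum(s**2)  # fraction of variance per PC
    return scores, var_explained

# Toy data: 8 samples x 36 genes, standing in for the table above
rng = np.random.default_rng(0)
X = rng.lognormal(mean=6, sigma=1, size=(8, 36))

scores, var_explained = pca(X)
print(scores[:, :2])                         # 2-D coordinates for plotting
print(np.round(100 * var_explained[:4], 1))  # variance (%) per PC
```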
Principal components
PCA - example
PCA on all Genes
Leukemia data, precursor B and T
Plot of 34 patients, dimension of 8973 genes reduced to 2
PCA of genes (Leukemia data)
Plot of 8973 genes, dimension of 34 patients reduced to 2
Principal components - Variance
[Bar plot: variance (%) explained by each of the first ten principal components, PC1 through PC10; y-axis from 0 to 25%.]
Clustering methods
• Hierarchical
  – agglomerative (bottom-up), e.g. UPGMA
  – divisive (top-down)
• Partitioning
  – e.g. K-means clustering
Hierarchical clustering
• Representation of all pairwise distances
• Parameters: none (apart from the choice of distance measure)
• Result:
  – all items end up in one large cluster
  – a hierarchical tree (dendrogram)
• Deterministic
Hierarchical clustering – UPGMA Algorithm
• Assign each item to its own cluster
• Join the nearest clusters
• Re-estimate the distances between clusters
• Repeat until all items are joined in a single cluster
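As a sketch of how this could be run in practice, assuming a SciPy environment and a toy samples-by-genes matrix: `method="average"` in `scipy.cluster.hierarchy.linkage` corresponds to UPGMA.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 36))         # 10 samples, 36 genes (toy data)

dists = pdist(X, metric="euclidean")  # all pairwise distances
Z = linkage(dists, method="average")  # UPGMA: join nearest clusters,
                                      # re-estimate distances, repeat
labels = fcluster(Z, t=3, criterion="maxclust")  # cut tree into 3 clusters
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) would draw the hierarchical tree
```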
Hierarchical clustering
Hierarchical clustering
Data with clustering order and distances
Dendrogram representation
Leukemia data - clustering of patients
Leukemia data - clustering of patients on top 100 significant genes
Leukemia data - clustering of genes
K-means clustering
• Partition data into K clusters
• Parameter: Number of clusters (K) must be chosen
• Randomized initialization:
  – may give different clusters each time
K-means - Algorithm
• Assign each item a class in 1 to K (randomly)
• For each class 1 to K
– Calculate the centroid (one of the K-means)
– Calculate distance from centroid to each item
• Assign each item to the nearest centroid
• Repeat until no items are re-assigned (convergence)
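A direct NumPy transcription of these steps might look like the sketch below. It is illustrative only: it assumes no cluster ever becomes empty, and in practice one would usually reach for a library implementation such as scikit-learn's KMeans.

```python
import numpy as np

def kmeans(X, K, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, K, size=len(X))      # random initial classes
    for _ in range(max_iter):
        # centroid of each class (the "K means"); assumes no class is empty
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # distance from each item to each centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)             # assign to nearest centroid
        if np.array_equal(new_labels, labels):    # convergence: no re-assignment
            break
        labels = new_labels
    return labels, centroids

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=m, size=(20, 2)) for m in (0, 5, 10)])
labels, centroids = kmeans(X, K=3)
print(np.bincount(labels))   # cluster sizes
```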
K-means clustering, K=3 (successive iterations shown across three slides)
K-means clustering of Leukemia data
K-means clustering of Cell Cycle data
Self Organizing Maps (SOM)
• Partitioning method (similar to the K-means method)
• Clusters are organized in a two-dimensional grid
• The size of the grid must be specified (e.g. 2x2 or 3x3)
• The SOM algorithm finds the optimal organization of the data in the grid
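A minimal, illustrative SOM sketch (not necessarily the exact variant used in the lecture): each unit in a 2x2 grid holds a weight vector, and for every training sample the best-matching unit and its grid neighbors are pulled toward that sample.

```python
import numpy as np

def train_som(X, grid=(2, 2), n_iter=1000, lr=0.5, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    # grid coordinates of the units, e.g. (0,0), (0,1), (1,0), (1,1)
    units = np.array([(i, j) for i in range(grid[0]) for j in range(grid[1])])
    W = rng.normal(size=(len(units), X.shape[1]))   # one weight vector per unit
    for t in range(n_iter):
        x = X[rng.integers(len(X))]                 # pick a random data point
        bmu = np.argmin(np.linalg.norm(W - x, axis=1))  # best-matching unit
        # neighborhood function: units close to the BMU on the grid move most
        g = np.exp(-np.sum((units - units[bmu])**2, axis=1) / (2 * sigma**2))
        decay = 1 - t / n_iter                      # shrink the learning rate
        W += (lr * decay) * g[:, None] * (x - W)    # pull units toward x
    return W, units

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
W, units = train_som(X)
# assign each item to its best-matching grid cell
cells = [tuple(units[np.argmin(np.linalg.norm(W - x, axis=1))]) for x in X]
```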
SOM - example
Comparison of clustering methods
• Hierarchical clustering
  – Distances between all variables
  – Time-consuming with a large number of genes
  – An advantage to cluster on a selected set of genes
• K-means clustering
  – Faster algorithm
  – Does not show relations between all variables
• SOM
  – More advanced algorithm
Distance measures
• Euclidean distance
  $d(x, y) = \left( \sum_{i=1}^{N} (x_i - y_i)^2 \right)^{1/2}$
• Vector angle distance
  $d(x, y) = 1 - \cos\theta = 1 - \frac{\sum_i x_i y_i}{\left( \sum_i x_i^2 \, \sum_i y_i^2 \right)^{1/2}}$
• Pearson's distance
  $d(x, y) = 1 - \mathrm{CC} = 1 - \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\left( \sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2 \right)^{1/2}}$
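Written out in NumPy, the three measures look like this (a sketch; `x` and `y` are two expression profiles of equal length):

```python
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def vector_angle(x, y):
    # 1 - cos(theta) between the two profiles
    return 1 - np.dot(x, y) / np.sqrt(np.sum(x**2) * np.sum(y**2))

def pearson(x, y):
    # 1 - correlation coefficient; like vector angle on centered profiles
    xc, yc = x - x.mean(), y - y.mean()
    return 1 - np.dot(xc, yc) / np.sqrt(np.sum(xc**2) * np.sum(yc**2))

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])
print(euclidean(x, y), vector_angle(x, y), pearson(x, y))
# vector_angle and pearson are both 0 here: y is proportional to x,
# even though the Euclidean distance between them is large
```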
Comparison of distance measures
Summary
• Dimension reduction is important for visualizing data
• Methods:
  – Principal Component Analysis
  – Clustering:
    • Hierarchical
    • K-means
    • Self-organizing maps
    (the choice of distance measure is important)
Coffee break
Next: Exercises in PCA and clustering