Microarray Analysis 3
Microarray Data Analysis
Data preprocessing and visualization
Supervised learning
Unsupervised learning
Machine learning approaches
Clustering and pattern detection
Prediction of gene regulatory regions based on coregulated genes
Linkage between gene expression data and gene
sequence/function databases
…
Unsupervised learning
Supervised methods
Can only validate or reject hypotheses
Cannot lead to discovery of unexpected partitions
Unsupervised learning
No prior knowledge is used
Explore structure of data on the basis of
correlations and similarities
DEFINITION OF THE CLUSTERING PROBLEM
Eytan Domany
CLUSTER ANALYSIS YIELDS DENDROGRAM
T (RESOLUTION)
BUT WHAT ABOUT THE OKAPI?
Centroid methods – K-means
Data points at $X_i$, $i = 1, \dots, N$
Centroids at $Y_\nu$, $\nu = 1, \dots, K$
Assign data point $i$ to centroid $\nu$; $S_i = \nu$
Cost E:
$$E(S_1, S_2, \dots, S_N; Y_1, \dots, Y_K) = \sum_{i=1}^{N} \sum_{\nu=1}^{K} \delta(S_i, \nu)\,(X_i - Y_\nu)^2$$
Minimize E over $S_i$, $Y_\nu$
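As a short worked step (not on the original slide), the centroid update used by K-means follows from this cost. The notation is assumed here: $\nu$ is the centroid index and $\delta(S_i, \nu)$ equals 1 when point $i$ is assigned to centroid $\nu$ and 0 otherwise. Holding the assignments fixed and setting the derivative with respect to $Y_\nu$ to zero gives the mean of the assigned points:

```latex
\frac{\partial E}{\partial Y_\nu}
  = -2 \sum_{i=1}^{N} \delta(S_i, \nu)\,(X_i - Y_\nu) = 0
\quad\Longrightarrow\quad
Y_\nu = \frac{\sum_{i=1}^{N} \delta(S_i, \nu)\, X_i}{\sum_{i=1}^{N} \delta(S_i, \nu)}
```

Conversely, for fixed centroids E is smallest when each point is assigned to its nearest centroid, so alternating the two steps never increases E.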
K-means
“Guess” K=3
Start with random positions of centroids. (Iteration = 0)
Assign each data point to the closest centroid. (Iteration = 1)
Move centroids to the center of their assigned points. (Iteration = 2)
Iterate until the cost is minimal. (Iteration = 3)
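The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the lecturer's implementation. The function names, the `seed` parameter, and the choice to draw the starting centroids from the data points themselves are all assumptions made for the example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=100, seed=0):
    """Alternate assignment and centroid update until assignments stop changing."""
    rng = random.Random(seed)
    # Assumption: "random positions" are taken as k distinct data points.
    centroids = [list(p) for p in rng.sample(points, k)]
    assign = None
    for _ in range(max_iter):
        # Assign each data point to its closest centroid.
        new_assign = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_assign == assign:
            break  # no assignment changed, so the cost cannot decrease further
        assign = new_assign
        # Move each centroid to the center (mean) of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # a centroid with no points keeps its old position
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    cost = sum(sq_dist(p, centroids[a]) for p, a in zip(points, assign))
    return assign, centroids, cost

# Hypothetical one-dimensional data: two tight groups far apart.
points = [(0.0,), (1.0,), (100.0,), (101.0,)]
assign, centroids, cost = kmeans(points, k=2)
print(assign, centroids, cost)
```

On less clearly separated data, different `seed` values can end in different local minima of the cost, so in practice the procedure is usually repeated from several starting positions.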
K-means - Summary
Fast algorithm: compute distances from data
points to centroids
Result depends on initial centroids’ position
Must preset K
Fails for “non-spherical” distributions
Agglomerative Hierarchical Clustering
Initially, each point is its own cluster.
At each step, merge the pair of nearest clusters.
Need to define the distance between the new cluster and the other clusters.
Distance between joined clusters:
Single Linkage: distance between closest pair.
Complete Linkage: distance between farthest pair.
Average Linkage: average distance between all pairs, or distance between cluster centers.
[Figure: five data points labeled 1 to 5 and the corresponding dendrogram]
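The merge procedure and the three linkage rules can be sketched as follows. This is a naive illustration in plain Python. The function names and the toy one-dimensional points are hypothetical, and the average-linkage variant shown is the "average distance between all pairs" form, not the cluster-center form.

```python
from itertools import combinations

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def cluster_distance(c1, c2, points, linkage="average", dist=euclidean):
    """Distance between two clusters, each given as a tuple of point indices."""
    pair_d = [dist(points[i], points[j]) for i in c1 for j in c2]
    if linkage == "single":      # closest pair
        return min(pair_d)
    if linkage == "complete":    # farthest pair
        return max(pair_d)
    if linkage == "average":     # mean over all cross-cluster pairs
        return sum(pair_d) / len(pair_d)
    raise ValueError("unknown linkage: " + linkage)

def agglomerate(points, linkage="average", dist=euclidean):
    """Return the list of merges as (cluster_a, cluster_b, distance)."""
    clusters = [(i,) for i in range(len(points))]  # initially each point is a cluster
    merges = []
    while len(clusters) > 1:
        # Greedy step: pick the pair of nearest clusters under the chosen linkage.
        a, b = min(combinations(clusters, 2),
                   key=lambda ab: cluster_distance(ab[0], ab[1], points, linkage, dist))
        d = cluster_distance(a, b, points, linkage, dist)
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, d))
    return merges

# Hypothetical toy data.
points = [(0.0,), (1.0,), (5.0,), (7.0,), (20.0,)]
for method in ("single", "complete", "average"):
    print(method, agglomerate(points, linkage=method))
```

The merge distances are the heights at which branches join in the dendrogram. They differ between linkage rules even when the merge order is the same.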
The dendrogram induces a linear ordering
of the data points
Dendrogram
Hierarchical Clustering Summary
Results depend on distance update method
Greedy iterative process
NOT robust against noise
No inherent measure to identify stable clusters
Average Linkage – the most widely used clustering
method in gene expression analysis
[Figure: breast cancer data, Nature, 2002]
Heat map
Cluster both genes and samples
Samples should cluster together based on experimental design
Often a way to catch labelling errors or heterogeneity in samples
Epinephrine Treated Rat Fibroblast Cell
ID   Probe          1h       5h       10h      18h      24h
1    D21869_s_at    25.7     55.0     170.7    305.5    807.9
2    D25233_at      705.2    578.2    629.2    641.7    795.3
3    D25543_at      2148.7   1303.0   915.5    149.2    96.3
4    L03294_g_at    241.8    421.5    577.2    866.1    2107.3
5    J03960_at      774.5    439.8    314.3    256.1    44.4
6    M81855_at      1487.6   1283.7   1372.1   1469.1   1611.7
7    L14936_at      1212.6   1848.5   2436.2   3260.5   4650.9
8    L19998_at      767.9    290.8    300.2    129.4    51.5
9    AB017912_at    1813.7   3520.6   4404.3   6853.1   9039.4
10   M32855_at      234.1    23.1     789.4    312.7    67.8
Heat map
Correlation coefficient
Normalized across each gene
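One common reading of "normalized across each gene" is a per-gene z-score: subtract the gene's mean and divide by its standard deviation over the time points. That reading is an assumption here, since the slide does not name the exact normalization. The sketch below applies it to three probes from the epinephrine table and derives the correlation coefficient from the normalized profiles. The function names are illustrative.

```python
from statistics import mean, pstdev

def zscore(row):
    """Normalize one gene's profile to mean 0 and (population) std 1."""
    m, s = mean(row), pstdev(row)
    return [(v - m) / s for v in row]

def pearson(a, b):
    """Pearson correlation as the mean product of z-scored profiles."""
    za, zb = zscore(a), zscore(b)
    return sum(x * y for x, y in zip(za, zb)) / len(za)

# Values copied from the table (1h, 5h, 10h, 18h, 24h).
profiles = {
    "D21869_s_at": [25.7, 55.0, 170.7, 305.5, 807.9],
    "L03294_g_at": [241.8, 421.5, 577.2, 866.1, 2107.3],
    "D25543_at":   [2148.7, 1303.0, 915.5, 149.2, 96.3],
}

for probe, row in profiles.items():
    print(probe, [round(z, 2) for z in zscore(row)])

print(pearson(profiles["D21869_s_at"], profiles["L03294_g_at"]))
print(pearson(profiles["D21869_s_at"], profiles["D25543_at"]))
```

After normalization the two rising probes show nearly the same pattern even though their raw intensities differ by roughly an order of magnitude, while the falling probe correlates negatively with both.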
Distance Issues
Euclidean distance
Pearson distance
[Figure: expression levels (axis 0 to 400) of gene1 to gene4 (g1 to g4) at time0 to time3]
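The two distances can disagree about which genes are "close". The sketch below uses three hypothetical profiles (not the values from the figure, which are not recoverable here) and takes Pearson distance to mean 1 minus the correlation coefficient, which is one common convention and an assumption in this example.

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def pearson_distance(a, b):
    """1 - Pearson correlation: 0 for identical shape, 2 for opposite shape."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return 1 - cov / (va * vb) ** 0.5

# Hypothetical expression profiles over four time points.
rising_low  = [50, 100, 150, 200]
rising_high = [250, 300, 350, 400]   # same shape, higher level
falling_low = [200, 150, 100, 50]    # similar level, opposite trend

print(euclidean(rising_low, rising_high), euclidean(rising_low, falling_low))
print(pearson_distance(rising_low, rising_high), pearson_distance(rising_low, falling_low))
```

Euclidean distance groups the two low-level genes together despite their opposite trends, whereas Pearson distance groups the two rising genes together regardless of level. Which behavior is appropriate depends on whether absolute expression level or expression pattern matters for the question at hand.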
Exercise
Use the Average Linkage algorithm and Manhattan distance.
Gene ID   Exp1   Exp2
1         45     55
2         55     78
3         148    1303
4         241    765
5         774    439
6         607    383
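A possible first step for the exercise is to compute all pairwise Manhattan distances and find the pair that merges first. The sketch assumes the table is read column-wise, so that gene 1 is (45, 55), gene 2 is (55, 78), and so on. If the slide intended a different layout, the numbers change. The function and variable names are illustrative.

```python
from itertools import combinations

def manhattan(a, b):
    """Manhattan (city-block) distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Assumed reading of the exercise table: gene ID -> (Exp1, Exp2).
genes = {
    1: (45, 55),
    2: (55, 78),
    3: (148, 1303),
    4: (241, 765),
    5: (774, 439),
    6: (607, 383),
}

# All pairwise distances, sorted from nearest to farthest.
pairs = sorted(combinations(genes, 2),
               key=lambda p: manhattan(genes[p[0]], genes[p[1]]))
for i, j in pairs:
    print(i, j, manhattan(genes[i], genes[j]))

closest = pairs[0]
```

Under average linkage, once the closest pair is merged, the distance from the new cluster to any other gene is the mean of that gene's distances to the two merged members. The table is then updated and the step repeated until one cluster remains.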
Exercise
Issues in Cluster Analysis
A lot of clustering algorithms
A lot of distance/similarity metrics
Which clustering algorithm runs faster and uses
less memory?
How many clusters after all?
Are the clusters stable?
Are the clusters meaningful?
Which Clustering Method
Should I Use?
What is the biological question?
Do I have a preconceived notion of how many
clusters there should be?
How strict do I want to be? Split or join?
Can a gene be in multiple clusters?
Hard or soft boundaries between clusters
The End
Thank you for taking this course. Bioinformatics is a very
diverse and fascinating subject. We hope you all decide to
continue your pursuit of it.
We will be very glad to answer your emails or schedule
appointments to talk about any bioinformatics-related
questions you might have.
We wish you all a wonderful summer break!