Microarray Analysis 3


Microarray Data Analysis


Data preprocessing and visualization
Supervised learning


Unsupervised learning




Machine learning approaches
Clustering and pattern detection
Prediction of gene regulatory regions based on coregulated genes
Linkage between gene expression data and gene sequence/function databases
…
Unsupervised learning


Supervised methods

Can only validate or reject hypotheses

Cannot lead to discovery of unexpected partitions
Unsupervised learning

No prior knowledge is used

Explore structure of data on the basis of correlations and similarities
DEFINITION OF THE CLUSTERING PROBLEM
Eytan Domany
CLUSTER ANALYSIS YIELDS DENDROGRAM
[Figure: dendrogram with axis T (resolution)]
BUT WHAT ABOUT THE OKAPI?
Centroid methods – K-means
Data points at $X_i$, $i = 1, \dots, N$
Centroids at $Y_\nu$, $\nu = 1, \dots, K$
Assign data point $i$ to centroid $\nu$: $S_i = \nu$
Cost $E$:
$$E(S_1, S_2, \dots, S_N; Y_1, \dots, Y_K) = \sum_{i=1}^{N} \sum_{\nu=1}^{K} \delta(S_i, \nu)\,(X_i - Y_\nu)^2$$
Minimize $E$ over $S_i$, $Y_\nu$
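A short worked step, as a sketch: with the assignments $S_i$ held fixed, setting the gradient of the cost to zero shows why moving each centroid to the mean of its assigned points lowers $E$. The index symbol $\nu$ and the reading of $\delta$ as the Kronecker delta (1 if $S_i = \nu$, else 0) are notational assumptions.

```latex
E = \sum_{i=1}^{N} \sum_{\nu=1}^{K} \delta(S_i, \nu)\,(X_i - Y_\nu)^2
\qquad
\frac{\partial E}{\partial Y_\nu}
  = -2 \sum_{i=1}^{N} \delta(S_i, \nu)\,(X_i - Y_\nu) = 0
\quad\Longrightarrow\quad
Y_\nu = \frac{\sum_{i=1}^{N} \delta(S_i, \nu)\, X_i}{\sum_{i=1}^{N} \delta(S_i, \nu)}
```

Minimizing over $S_i$ with the centroids held fixed is the other half: each point picks the $\nu$ with the smallest $(X_i - Y_\nu)^2$.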
K-means

“Guess” K = 3.
Iteration 0: start with random positions of centroids.
Iteration 1: assign each data point to the closest centroid.
Iteration 2: move each centroid to the center of its assigned points.
Iteration 3: repeat the assign and move steps until the cost no longer decreases.
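The K-means iterations can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the toy points, and the two hand-picked starting positions are all assumptions made for the example. Starting positions are passed in explicitly so that the effect of initialization is reproducible.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, init_centroids, max_iter=100):
    """Alternate the assign and move steps until assignments stop changing."""
    centroids = [list(c) for c in init_centroids]
    k = len(centroids)
    assignment = None
    for _ in range(max_iter):
        # Assign each data point to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the cost cannot decrease further
        assignment = new_assignment
        # Move each centroid to the center (mean) of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its old position
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    cost = sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignment))
    return assignment, centroids, cost

# Hypothetical toy data: three tight groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]

# One starting centroid per group.
good_assign, good_centroids, good_cost = kmeans(points, [(0, 0), (10, 10), (20, 0)])
# Two starting centroids inside the same group.
bad_assign, bad_centroids, bad_cost = kmeans(points, [(0, 0), (0, 1), (10, 10)])

print("good start:", good_assign, round(good_cost, 2))
print("bad start: ", bad_assign, round(bad_cost, 2))
```

The second run stops at a higher cost than the first, which is one way to see that the result depends on where the centroids start.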
K-means - Summary

Fast algorithm: compute distances from data points to centroids

Result depends on initial centroids’ position
Must preset K
Fails for “non-spherical” distributions


Agglomerative Hierarchical Clustering
Initially, each point is its own cluster.
At each step, merge the pair of nearest clusters.
Need to define the distance between the new cluster and the other clusters:
Single Linkage: distance between closest pair.
Complete Linkage: distance between farthest pair.
Average Linkage: average distance between all pairs, or distance between cluster centers.
[Figure: five data points (1 to 5) and their dendrogram; dendrogram height is the distance between joined clusters]
The dendrogram induces a linear ordering of the data points.
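The linkage rules differ only in how they turn point-to-point distances into a cluster-to-cluster distance. A small sketch, assuming Euclidean distance between points; the function names and the two toy clusters are made up for illustration.

```python
from itertools import product

def euclidean(a, b):
    """Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_linkage(A, B):
    """Distance between the closest pair, one point from each cluster."""
    return min(euclidean(a, b) for a, b in product(A, B))

def complete_linkage(A, B):
    """Distance between the farthest pair, one point from each cluster."""
    return max(euclidean(a, b) for a, b in product(A, B))

def average_linkage(A, B):
    """Average distance over all pairs, one point from each cluster."""
    return sum(euclidean(a, b) for a, b in product(A, B)) / (len(A) * len(B))

def centroid_linkage(A, B):
    """Distance between the two cluster centers."""
    center_a = [sum(col) / len(A) for col in zip(*A)]
    center_b = [sum(col) / len(B) for col in zip(*B)]
    return euclidean(center_a, center_b)

# Hypothetical clusters: two vertical pairs, three units apart.
A = [(0, 0), (0, 2)]
B = [(3, 0), (3, 2)]

d_single = single_linkage(A, B)
d_complete = complete_linkage(A, B)
d_average = average_linkage(A, B)
d_centroid = centroid_linkage(A, B)
print(d_single, d_complete, d_average, d_centroid)
```

On the same two clusters the four rules give different numbers, which is why the dendrogram depends on the distance update method.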
Hierarchical Clustering Summary

Results depend on distance update method

Greedy iterative process

NOT robust against noise

No inherent measure to identify stable clusters

Average Linkage – the most widely used clustering
method in gene expression analysis
[Figure: breast cancer example, Nature 2002]
Heat map
Cluster both genes and samples

Samples should cluster together based on experimental design

Often a way to catch labelling errors or heterogeneity in samples
Epinephrine Treated Rat Fibroblast Cell

ID  Probe        1h      5h      10h     18h     24h
1   D21869_s_at  25.7    55.0    170.7   305.5   807.9
2   D25233_at    705.2   578.2   629.2   641.7   795.3
3   D25543_at    2148.7  1303.0  915.5   149.2   96.3
4   L03294_g_at  241.8   421.5   577.2   866.1   2107.3
5   J03960_at    774.5   439.8   314.3   256.1   44.4
6   M81855_at    1487.6  1283.7  1372.1  1469.1  1611.7
7   L14936_at    1212.6  1848.5  2436.2  3260.5  4650.9
8   L19998_at    767.9   290.8   300.2   129.4   51.5
9   AB017912_at  1813.7  3520.6  4404.3  6853.1  9039.4
10  M32855_at    234.1   23.1    789.4   312.7   67.8
Heat map
Correlation coefficient
Normalized across each gene
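One common reading of "normalized across each gene" is a per-gene z-score: subtract the gene's own mean and divide by its own standard deviation, so heat map colors show the shape of the profile rather than its magnitude. The sketch below assumes that reading (the slides do not say which normalization was used) and applies it to three probes from the epinephrine time course.

```python
import statistics

def zscore(values):
    """Center one gene's profile on its own mean and scale by its own std."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population std; assumed nonzero here
    return [(v - mean) / std for v in values]

# Three probes from the time course (1h, 5h, 10h, 18h, 24h).
profiles = {
    "D21869_s_at": [25.7, 55.0, 170.7, 305.5, 807.9],
    "L14936_at": [1212.6, 1848.5, 2436.2, 3260.5, 4650.9],
    "D25543_at": [2148.7, 1303.0, 915.5, 149.2, 96.3],
}

normalized = {probe: zscore(values) for probe, values in profiles.items()}
for probe, z in normalized.items():
    print(probe, [round(x, 2) for x in z])
```

After normalization the two rising probes land on a similar scale even though their raw intensities differ by an order of magnitude, and the falling probe shows the mirrored pattern.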
Distance Issues

Euclidean distance
Pearson distance
[Figure: grouping of g1, g3, g2, g4; plot of expression (0 to 400) for gene1 to gene4 at time0 to time3]
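The two distances can disagree about which genes are "close". The sketch below uses hypothetical profiles (not the values from the figure) and takes Pearson distance to be 1 minus the Pearson correlation, which is one common convention and an assumption here.

```python
import math

def euclidean(a, b):
    """Euclidean distance: sensitive to the magnitude of expression."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_distance(a, b):
    """1 - Pearson correlation: sensitive to the shape of the profile only."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1 - cov / math.sqrt(var_a * var_b)

# Hypothetical profiles over four time points.
low_rising = [50, 100, 150, 200]
high_rising = [100, 200, 300, 400]   # same shape as low_rising, twice the level
falling = [200, 150, 100, 50]        # opposite trend, similar level

for name, other in [("high_rising", high_rising), ("falling", falling)]:
    print(name,
          round(euclidean(low_rising, other), 1),
          round(pearson_distance(low_rising, other), 3))
```

Euclidean distance places `low_rising` nearer to `falling` because their levels are similar, while Pearson distance places it nearer to `high_rising` because their trends match.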
Exercise

Use Average Linkage Algorithm and Manhattan distance.

Gene ID  Exp1  Exp2
1        45    55
2        55    78
3        148   1303
4        241   765
5        774   439
6        607   383
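One way to check a hand-worked answer is a naive agglomerative loop. This sketch assumes the exercise table reads gene 1 = (45, 55), gene 2 = (55, 78), gene 3 = (148, 1303), gene 4 = (241, 765), gene 5 = (774, 439), gene 6 = (607, 383), and that average linkage means the mean Manhattan distance over all cross-cluster pairs. Function names are illustrative.

```python
def manhattan(a, b):
    """Manhattan (city-block) distance between two points."""
    return sum(abs(x - y) for x, y in zip(a, b))

def average_linkage_clustering(points):
    """Repeatedly merge the two clusters with the smallest average distance.

    points maps a label to its coordinates. Returns a list of merges as
    (cluster_a, cluster_b, distance), in the order they happen.
    """
    clusters = [[label] for label in points]  # initially, each point = cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                d = sum(manhattan(points[a], points[b]) for a, b in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Assumed reading of the exercise table: gene ID -> (Exp1, Exp2).
genes = {1: (45, 55), 2: (55, 78), 3: (148, 1303),
         4: (241, 765), 5: (774, 439), 6: (607, 383)}

merges = average_linkage_clustering(genes)
for left, right, height in merges:
    print(left, "+", right, "at", height)
```

The merge heights printed by the loop are the heights at which branches join in the dendrogram.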
Issues in Cluster Analysis






A lot of clustering algorithms
A lot of distance/similarity metrics
Which clustering algorithm runs faster and uses less memory?
How many clusters after all?
Are the clusters stable?
Are the clusters meaningful?
Which Clustering Method Should I Use?





What is the biological question?
Do I have a preconceived notion of how many clusters there should be?
How strict do I want to be? Split or join?
Can a gene be in multiple clusters?
Hard or soft boundaries between clusters?
The End

Thank you for taking this course. Bioinformatics is a very
diverse and fascinating subject. We hope you all decide to
continue your pursuit of it.

We will be very glad to answer your emails or schedule
appointments to talk about any bioinformatics related
questions you might have.

We wish you all a wonderful summer break!