Gene Expression : Clustering
Gene Expression
Clustering
The Main Goal
Gain insight into the gene’s function.
Using:
Sequence
Transcription levels.
Microarray Technology
Microarray - standard laboratory technique.
Information about gene expression.
Tens of thousands of data points.
Analyze by computational methods.
Gene Clustering
To cluster genes means to group together
genes with similarity in their expression
patterns.
Why do we need to cluster genes?
Unknown gene function.
Common regulatory elements.
Pathways and biological processes.
Defining new disease subclasses.
Predict categorization of new samples.
Data reduction and visualization.
Gene Clustering
Clustering methods can be divided into two
major groups:
Supervised clustering – classify according to previous
knowledge (group prediction).
Unsupervised clustering – no previous knowledge is
used (pattern discovery).
Unsupervised clustering
In many cases we have little a priori knowledge
about genes.
There are many different methods of
unsupervised clustering.
We will present Hierarchical clustering.
The Method
Hierarchical clustering
All data instances start in their own clusters.
The two most closely related clusters are merged.
This is repeated until a single cluster remains.
Arranges the data into a tree structure
that can be cut into the desired number of
clusters.
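The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the cited studies: it assumes Euclidean distance between profiles and average linkage between clusters, and the names `agglomerate`, `n_clusters`, and the `toy` profiles are hypothetical.

```python
from itertools import combinations
from math import dist


def average_linkage(c1, c2):
    """Mean pairwise Euclidean distance between two clusters of points."""
    return sum(dist(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))


def agglomerate(points, n_clusters=1):
    """Merge the two closest clusters until only n_clusters remain."""
    clusters = [[p] for p in points]  # every instance starts in its own cluster
    while len(clusters) > n_clusters:
        # Find the most closely related pair of clusters.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


# Hypothetical toy profiles (two chips per gene), not data from the slides.
toy = [(0.0, 0.0), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9)]
two_groups = agglomerate(toy, n_clusters=2)
print(two_groups)
```

Stopping the loop early at `n_clusters` gives the same grouping as cutting the full tree at that number of clusters.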
Hierarchical clustering
The raw data
Gene     Chip1      Chip2      …   Chip20
1        x1,1       x1,2       …   x1,20
2        x2,1       x2,2       …   x2,20
3        x3,1       x3,2       …   x3,20
.        .          .              .
.        .          .              .
.        .          .              .
12,000   x12000,1   x12000,2   …   x12000,20
Hierarchical clustering
Normalized data
Hierarchical clustering
Calculate the Distance Matrix
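$$x = (x_1, x_2, \ldots, x_n), \qquad y = (y_1, y_2, \ldots, y_n)$$
Euclidean distance formula:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
Correlation coefficient (r):
$$d(x, y) = \frac{\frac{1}{N} \sum_{i=1}^{N} (X_i - E(X))(Y_i - E(Y))}{\sqrt{V(X)\,V(Y)}}$$
$$E(X) = \frac{1}{N} \sum_{i=1}^{N} X_i, \qquad V(X) = \frac{1}{N} \sum_{i=1}^{N} (X_i - E(X))^2$$
[Figure: genes A, B and C]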
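As a rough sketch, the two measures on this slide can be written directly from their definitions. The function names are illustrative. The correlation coefficient is a similarity (1 means identical shape), so a common convention, assumed here rather than taken from the slides, is to use 1 - r when a distance is needed.

```python
from math import sqrt


def euclidean(x, y):
    """Square root of the summed squared differences."""
    return sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def mean(x):
    return sum(x) / len(x)


def variance(x):
    """Population variance, dividing by N as on the slide."""
    m = mean(x)
    return sum((xi - m) ** 2 for xi in x) / len(x)


def correlation(x, y):
    """Covariance of x and y divided by sqrt(V(X) * V(Y))."""
    mx, my = mean(x), mean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / len(x)
    return cov / sqrt(variance(x) * variance(y))


# Hypothetical profiles: same shape, very different magnitude.
low = [1.0, 2.0, 3.0]
high = [10.0, 20.0, 30.0]
print(euclidean(low, high))    # large: the magnitudes differ
print(correlation(low, high))  # close to 1: the shapes agree
```

The example shows why the choice matters: two genes that rise and fall together but at different amplitudes are far apart in Euclidean terms yet almost perfectly correlated.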
Hierarchical clustering
Calculate the Distance Matrix
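Distance between two clusters:
Average linkage – the average (midpoint) of the distances between their members.
Single linkage – the smallest distance between their members.
Complete linkage – the largest distance between their members.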
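The three linkage rules differ only in how they summarize the member-to-member distances. The sketch below applies them to the clusters {A, B} and {C, D}, using the four-gene distance matrix from the average-linkage slides. Average linkage is taken here as the mean over all member pairs, which is one common reading of "midpoint" and an assumption on my part.

```python
# Pairwise distances from the four-gene matrix on the average-linkage slides.
DIST = {
    ("A", "B"): 1.58, ("A", "C"): 3.09, ("A", "D"): 4.74,
    ("B", "C"): 2.61, ("B", "D"): 5.00, ("C", "D"): 2.70,
}


def gene_distance(a, b):
    """Look up a symmetric distance; a gene is at distance 0 from itself."""
    if a == b:
        return 0.0
    return DIST[(a, b)] if (a, b) in DIST else DIST[(b, a)]


def member_distances(c1, c2):
    """All distances between a member of c1 and a member of c2."""
    return [gene_distance(a, b) for a in c1 for b in c2]


def single_linkage(c1, c2):
    return min(member_distances(c1, c2))


def complete_linkage(c1, c2):
    return max(member_distances(c1, c2))


def average_linkage(c1, c2):
    d = member_distances(c1, c2)
    return sum(d) / len(d)


ab, cd = ["A", "B"], ["C", "D"]
print(single_linkage(ab, cd))             # B-C, the closest pair
print(complete_linkage(ab, cd))           # B-D, the farthest pair
print(round(average_linkage(ab, cd), 2))  # mean of the four pairs
```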
Hierarchical clustering
Calculate the Distance Matrix
Gene   Chip1   Chip2
A      -2.0     1.0
B      -1.5    -0.5
C       1.0     0.25

       A      B      C
A    0.00   1.58   3.09
B    1.58   0.00   2.61
C    3.09   2.61   0.00
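The distance matrix on this slide can be reproduced from the three expression profiles with the Euclidean distance. A short check, using only the values shown (the variable names are illustrative):

```python
from math import dist

# Expression values for genes A, B, C on Chip1 and Chip2, as on the slide.
expression = {"A": (-2.0, 1.0), "B": (-1.5, -0.5), "C": (1.0, 0.25)}
genes = list(expression)

# Full symmetric matrix of Euclidean distances, rounded to two decimals.
matrix = {
    g: {h: round(dist(expression[g], expression[h]), 2) for h in genes}
    for g in genes
}

for g in genes:
    print(g, " ".join(f"{matrix[g][h]:.2f}" for h in genes))
```

For example, the A-B entry is sqrt(0.5^2 + 1.5^2) = sqrt(2.5), which rounds to 1.58.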
Hierarchical clustering
Average Linkage Algorithm
       A      B      C      D
A    0.00   1.58   3.09   4.74
B    1.58   0.00   2.61   5.00
C    3.09   2.61   0.00   2.70
D    4.74   5.00   2.70   0.00
[Figure: leaves ordered A D B C]
Hierarchical clustering
Average Linkage Algorithm
[Figure: leaves ordered A B D C]
       AB      C      D
AB   0.00   2.85   4.87
C    2.85   0.00   2.70
D    4.87   2.70   0.00
Hierarchical clustering
Average Linkage Algorithm
[Figure: leaves ordered A B C D]
       AB     CD
AB   0.00   3.86
CD   3.86   0.00
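The merge sequence on these slides can be traced with a small script. This is a sketch under one assumption: the distance between two clusters is the average over all gene pairs in the original four-gene matrix. Under that rule the AB-D distance is (4.74 + 5.00) / 2 = 4.87 and the final AB-CD distance is 15.44 / 4 = 3.86. The names `avg_dist` and `upgma` are illustrative.

```python
from itertools import combinations

# Pairwise distances copied from the four-gene matrix on the slide.
D = {
    frozenset("AB"): 1.58, frozenset("AC"): 3.09, frozenset("AD"): 4.74,
    frozenset("BC"): 2.61, frozenset("BD"): 5.00, frozenset("CD"): 2.70,
}


def avg_dist(c1, c2):
    """Average of all gene-to-gene distances between two clusters."""
    total = sum(D[frozenset((a, b))] for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def upgma(genes):
    """Repeatedly merge the closest pair; return the list of merges."""
    clusters = [(g,) for g in genes]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: avg_dist(*pair))
        merges.append(("".join(c1), "".join(c2), round(avg_dist(c1, c2), 2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


merges = upgma("ABCD")
for left, right, height in merges:
    print(f"merge {left} + {right} at distance {height:.2f}")
```

The merge heights are what a dendrogram draws as branch heights: A and B join first, then C and D, and the two pairs join last.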
Hierarchical clustering
Dendrogram
[Figure: dendrogram with leaves A B C D]
Hierarchical clustering
Heat maps
Red: high expression levels.
Green: low expression levels.
Black: intermediate expression levels.
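One way to picture this colour scheme is as a mapping from an expression value to an RGB triple. The sketch below assumes the values are centred so that 0 is the intermediate level (as with log ratios) and saturates at a hypothetical `limit`; neither detail is stated on the slide.

```python
def expression_to_rgb(value, limit=2.0):
    """Map a centred expression value to an (R, G, B) tuple.

    Positive values shade towards red, negative values towards green,
    and 0 is black. `limit` is a hypothetical saturation point.
    """
    clipped = max(-limit, min(limit, value))
    intensity = round(255 * abs(clipped) / limit)
    if clipped > 0:
        return (intensity, 0, 0)
    if clipped < 0:
        return (0, intensity, 0)
    return (0, 0, 0)


for v in (-2.0, 0.0, 2.0):
    print(v, expression_to_rgb(v))
```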
Hierarchical clustering
Experiment Control
Random 1 – randomized by rows.
Random 2 – randomized by columns.
Random 3 – randomized by both rows and columns.
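The three controls can be sketched as shuffles of the expression matrix. The reading assumed here is that "randomized by rows" means shuffling the values within each row independently, and likewise for columns; the slide does not spell this out, and the function names are illustrative.

```python
import random


def shuffle_within_rows(matrix, rng):
    """Random 1: permute the values inside each row independently."""
    return [rng.sample(row, len(row)) for row in matrix]


def shuffle_within_columns(matrix, rng):
    """Random 2: permute the values inside each column independently."""
    columns = [rng.sample(list(col), len(col)) for col in zip(*matrix)]
    return [list(row) for row in zip(*columns)]


def shuffle_both(matrix, rng):
    """Random 3: shuffle within rows, then within columns."""
    return shuffle_within_columns(shuffle_within_rows(matrix, rng), rng)


# Hypothetical 2 x 3 matrix standing in for genes x chips.
data = [[1, 2, 3], [4, 5, 6]]
rng = random.Random(0)  # fixed seed so the run is repeatable
print(shuffle_within_rows(data, rng))
print(shuffle_within_columns(data, rng))
print(shuffle_both(data, rng))
```

Clustering each shuffled matrix and comparing it with the clustered original shows whether the observed structure survives the loss of the real gene-to-chip relationships.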
Examples
Example I
We present here an experiment by Spellman
et al., published in Mol. Biol. Cell 9,
3273-3297 (1998).
Goals of the experiment:
Identify all cell cycle regulated genes in Yeast.
Show clustering at work.
Example I
Cell Cycle
Example I
Methods
DNA microarrays containing all
yeast genes.
Measure mRNA levels as a
function of time.
Example I
Methods
Synchronization:
α factor.
Elutriation – size based.
cdc15 – temperature-sensitive mutation.
Factors:
Cln3p, Clb2p – deletion strains
induced with these factors.
Data from a previously published study (Cho et
al. 1998)
Control sample: asynchronous cultures.
Example I
Methods
Measurements analyzed based on:
Fourier algorithm - assesses periodicity.
Correlation measurement - compared with
previously identified cell cycle regulated
genes.
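A periodicity score of the kind described above can be sketched as the magnitude of a single Fourier component at the cell-cycle frequency. This is only an illustration of the idea, not the exact scoring used in the paper; the function name, the 60-minute period, and the sample profiles are hypothetical.

```python
from math import cos, pi, sin, sqrt


def fourier_score(times, values, period):
    """Magnitude of the Fourier component of `values` at 1 / period."""
    omega = 2 * pi / period
    a = sum(v * sin(omega * t) for t, v in zip(times, values))
    b = sum(v * cos(omega * t) for t, v in zip(times, values))
    return sqrt(a * a + b * b)


# Hypothetical time course: samples every 10 minutes over two 60-minute cycles.
times = list(range(0, 120, 10))
periodic = [sin(2 * pi * t / 60) for t in times]  # oscillates with the cycle
flat = [1.0 for _ in times]                       # constant expression

print(fourier_score(times, periodic, period=60))
print(fourier_score(times, flat, period=60))
```

A gene whose expression oscillates with the cell cycle scores high, while a constant profile scores close to zero, which is what makes a threshold on such a score usable for selecting candidates.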
Example I
Methods
Calculate a score for each
gene - "CDC score".
Threshold CDC value.
91% of the genes previously
shown to be cell cycle
regulated are included.
About 800 genes were
identified as cell cycle
regulated.
Example I
Phasing
By time of
peak
expression:
Example I
Clustering
By similarity of
expression
across the
measurements:
Example I
Clustering
Hierarchical clustering.
Identified 9 clusters.
Genes in each cluster share:
Common upstream elements
Regulation by similar transcription factors.
Common function (only for known genes).
Cln3p and Clb2p have the same effect on the
genes in a cluster.
Example I
Clustering
Histone cluster:
A very tight cluster.
Repeated SCB motif in promoter.
Induced by Cln3.
Unaffected by Clb2.
Peak during S phase.
Example I
Results
Genes with known functionality:
Cell cycle regulated functions
The MET cluster.
Genes involved in secretion and lipid synthesis.
Known genes discovered as cell cycle
regulated.
Example I
Results
New binding sites for regulators.
The CLB cluster is highly regulated.
Aligning the genes in the cluster.
New consensus for MCM1+SFF binding site.
Example I
Results
MCM1:T-T-A-C-C-N-A-A-T-T-N-G-G-T-A-A
SFF: G-T-M-A-A-C-A-A
New motif:
T-T-W-C-C-Y-A-A-W-N-N-G-G-W-A-A-W-W-N-R-T-A-A-A-Y-A-A
Example II
Gasch AP. et al. Genomic expression
programs in the response of yeast cells to
environmental changes.
Mol Biol Cell. 2000; 11(12): 4241-57
Main Goal:
Characterize the yeast response to environmental
changes, and particularly to stress conditions.
Example II
Methods
Yeast cells responding to diverse
environmental stresses.
Microarray contained all yeast genes.
Results were organized by hierarchical
clustering.
Example II
General features of the stress response
Massive and rapid changes.
Transient changes.
Correlated with the magnitude of the shift:
Duration
Amplitude
Steady-state difference.
Example II
General features of the stress response
Some genes responded in a stereotypical
manner.
Some genes had unique response.
No two expression programs were identical.
Example II
The Environmental Stress Response (ESR)
About 900 genes responded in a
stereotypical manner.
ESR – Environmental Stress Response.
Two large clusters of genes:
repressed genes (~ 600)
induced genes (~ 300)
Showed reciprocal response.
Example II
The Environmental Stress Response (ESR)
Response to
different shifts in:
Temperature
Osmolarity.
Example II
The Environmental Stress Response (ESR)
Heat shock
osmolarity
The ESR is not a response to all
environmental changes.
Example II
The Environmental Stress Response (ESR)
Shift between two equally stressful
environments:
29°C and hyper-osmotic medium.
33°C with normal osmolarity.
Sum of the responses.
Independent response to each of the
changes.
Example II
The Environmental Stress Response (ESR)
Previously known:
STRE promoter.
Recognized by Msn2p and Msn4p.
One all-purpose regulatory system?
Example II
The Environmental Stress Response (ESR)
TRX2 cluster genes:
Dependent on Msn2/Msn4p in response to heat
shock.
Unaffected by Msn2p/Msn4p in response to H2O2.
Contained binding site for Yap1p.
Yap1p deletion strain.
Example II
The Environmental Stress Response (ESR)
Revealed that TRX2 cluster genes:
Induced by Yap1p in response to H2O2 treatment
Unaffected by the deletion in response to heat shock.
ESR regulated by different transcription
factors.
Regulation is condition-specific and gene-specific.
Example II
Specific Response
Response to stress:
Stereotypic response (ESR).
Specific response.
Characterizes the cell’s response to a specific stress.
Example: Heat-shock response
ESR initiated fast (minutes).
Induction of chaperones.
Alternative carbon source utilization.
Conclusions
Hierarchical clustering
Conclusion
Difficulty:
Post transcriptional regulation.
Solution:
Use the method in cases where the main regulation is
at the transcription level (example – yeast cell
cycle).
Hierarchical clustering
Conclusion
Difficulty:
No statistical foundation for deciding
where to cut the dendrogram.
Solution:
Split the tree in a way that produces
homogeneous clusters of genes.
Such a split is considered evidence that
the grouping was correct.
Hierarchical clustering
Conclusion
Difficulty:
The algorithm will produce clusters in any
case.
Solution:
Introduce a small amount of random noise into the
data, re-cluster the data and compare the
results to the original clustering.
If the results change substantially, the clustering
probably does not represent true biological meaning.
Hierarchical clustering
Conclusion
Discover gene’s function.
Status of cellular processes.
Information on regulatory mechanisms.
General cell behaviors.
Assign genes to pathways.
Unknown biological pathways.
References
Eisen M. B., Spellman P. T., Brown P. O., Botstein D. Cluster
analysis and display of genome-wide expression patterns. Proc. Natl.
Acad. Sci. USA 95: 14863-14868, 1998.
Spellman P. T. et al. Comprehensive identification of cell
cycle-regulated genes of the yeast Saccharomyces cerevisiae by
microarray hybridization. Mol. Biol. Cell 9: 3273-3297, 1998.
Gasch A. P., Spellman P. T., Kao C. M., Carmel-Harel O., Eisen M. B.,
Storz G., Botstein D., Brown P. O. Genomic expression programs in the
response of yeast cells to environmental changes.
Mol. Biol. Cell 11(12): 4241-4257, 2000.
Shannon W., Culverhouse R., Duncan J. Analyzing
microarray data using cluster analysis. Pharmacogenomics
4(1): 41-51, 2003. Review.
Kaminski N., Friedman N. Practical approaches to analyzing
results of microarray experiments. Am. J. Respir. Cell Mol. Biol.
27: 125-132, 2002. Review.