Gene Expression : Clustering

Gene Expression
Clustering
The Main Goal

Gain insight into the gene’s function.

Using:


Sequence
Transcription levels.
Microarray Technology




Microarray - standard laboratory technique.
Information about gene expression.
Tens of thousands of data points.
Analyze by computational methods.
Gene Clustering

To cluster genes means to group together
genes with similarity in their expression
patterns.
Why do we need to cluster genes?






Unknown gene function.
Common regulatory elements.
Pathways and biological processes.
Defining new disease subclasses.
Predict categorization of new samples.
Data reduction and visualization.
Gene Clustering

Clustering methods can be divided into two
major groups:


Supervised clustering – classify according to previous
knowledge (group prediction).
Unsupervised clustering – no previous knowledge is
used (pattern discovery).
Unsupervised clustering

In many cases we have little a priori knowledge
about genes.

There are many different methods of
unsupervised clustering.

We will present Hierarchical clustering.
The Method
Hierarchical clustering





All data instances start in their own clusters.
Two most closely related clusters are merged.
Repeated until a single cluster remains.
Arranges the data into a tree structure
Can be broken into the desired number of
clusters.
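
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the studies discussed later: it assumes Euclidean distance between profiles and average linkage between clusters, and the function names and the five two-chip profiles are hypothetical.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def average_distance(cluster_a, cluster_b, points):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(dist(points[i], points[j]) for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def agglomerate(points, n_clusters=1):
    """Merge the two closest clusters until n_clusters remain.

    Returns the surviving clusters and the list of merges, in order.
    """
    clusters = [(i,) for i in range(len(points))]  # every item starts alone
    merges = []
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_distance(pair[0], pair[1], points),
        )
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(tuple(sorted(a + b)))
        merges.append((a, b))  # the merge order defines the tree
    return clusters, merges


# Hypothetical expression profiles (two chips) for five genes.
profiles = [(-2.0, 1.0), (-1.5, -0.5), (1.0, 0.25), (1.2, 0.4), (5.0, 5.0)]
clusters, merges = agglomerate(profiles, n_clusters=2)
print(clusters)  # genes 0-3 end up together, gene 4 stays alone
```

Stopping the loop at `n_clusters` plays the role of breaking the tree into the desired number of clusters; running it down to one cluster records the full tree in `merges`.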
Hierarchical clustering
The raw data
Gene      Chip1      Chip2      …    Chip20
1         x1,1       x1,2       …    x1,20
2         x2,1       x2,2       …    x2,20
3         x3,1       x3,2       …    x3,20
.         .          .          .    .
.         .          .          .    .
.         .          .          .    .
12,000    x12000,1   x12000,2   …    x12000,20
Hierarchical clustering
Normalized data
Hierarchical clustering
Calculate the Distance Matrix
$$x = (x_1, x_2, \ldots, x_n), \qquad y = (y_1, y_2, \ldots, y_n)$$

Euclidean distance formula:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

Correlation coefficient (r):
$$r(X, Y) = \frac{\frac{1}{N}\sum_{i=1}^{N} (X_i - E(X))(Y_i - E(Y))}{\sqrt{V(X)\,V(Y)}}$$
$$E(X) = \frac{1}{N}\sum_{i=1}^{N} X_i, \qquad V(X) = \frac{1}{N}\sum_{i=1}^{N} (X_i - E(X))^2$$
[Figure: points A, B and C]
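
As a rough illustration of the two measures, the sketch below implements the Euclidean distance and the correlation coefficient using the population mean E(X) and variance V(X) defined on this slide. The function names and the two sample profiles are hypothetical.

```python
from math import sqrt


def euclidean_distance(x, y):
    """Square root of the summed squared differences between two profiles."""
    return sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def mean(x):
    """E(X): arithmetic mean of the measurements."""
    return sum(x) / len(x)


def variance(x):
    """V(X): population variance (divides by N, as on the slide)."""
    m = mean(x)
    return sum((xi - m) ** 2 for xi in x) / len(x)


def correlation(x, y):
    """Correlation coefficient r between two profiles of equal length."""
    mx, my = mean(x), mean(y)
    covariance = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / len(x)
    return covariance / sqrt(variance(x) * variance(y))


# Hypothetical profiles: gene_b has the same shape as gene_a at twice the scale.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]
print(euclidean_distance(gene_a, gene_b))  # large: the magnitudes differ
print(correlation(gene_a, gene_b))         # close to 1: the shapes agree
```

The example shows why the choice matters: the Euclidean distance is sensitive to magnitude, while the correlation only compares the shape of the two profiles. Note that r is a similarity; turning it into a distance (for example 1 - r) is a common convention, not something stated on the slide.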
Hierarchical clustering
Calculate the Distance Matrix



Average linkage - midpoint.
Single linkage – smallest distance.
Complete linkage - largest distance.
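
The three linkage rules differ only in how they summarize the gene-to-gene distances between two clusters. A minimal sketch, assuming that average linkage means the mean of all pairwise distances between members; `linkage_distance`, `lookup` and the distance values are hypothetical.

```python
def linkage_distance(cluster_a, cluster_b, distance, method="average"):
    """Distance between two clusters under the chosen linkage rule."""
    pairwise = [distance(a, b) for a in cluster_a for b in cluster_b]
    if method == "single":
        return min(pairwise)  # smallest member-to-member distance
    if method == "complete":
        return max(pairwise)  # largest member-to-member distance
    if method == "average":
        return sum(pairwise) / len(pairwise)  # mean of all pairs
    raise ValueError(f"unknown linkage method: {method}")


# Hypothetical gene-to-gene distances.
table = {frozenset("AB"): 1.5, frozenset("AC"): 3.0, frozenset("BC"): 2.0}


def lookup(a, b):
    """Symmetric lookup in the distance table."""
    return table[frozenset((a, b))]


for method in ("single", "complete", "average"):
    print(method, linkage_distance(["A", "B"], ["C"], lookup, method))
```

For the cluster {A, B} against C this gives 2.0 (single), 3.0 (complete) and 2.5 (average).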
Hierarchical clustering
Calculate the Distance Matrix
Gene   Chip1   Chip2
A      -2.0    1.0
B      -1.5    -0.5
C      1.0     0.25

       A      B      C
A      0.00   1.58   3.09
B      1.58   0.00   2.61
C      3.09   2.61   0.00
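
The distance matrix for genes A, B and C can be reproduced from their two chip values. The sketch below assumes the Euclidean distance was used, which matches the numbers shown; the variable names are illustrative.

```python
from math import dist  # Euclidean distance, Python 3.8+

# Chip1 and Chip2 values for each gene, taken from the table above.
expression = {"A": (-2.0, 1.0), "B": (-1.5, -0.5), "C": (1.0, 0.25)}
genes = sorted(expression)

# Symmetric matrix of pairwise distances, rounded to two decimals.
matrix = {
    g: {h: round(dist(expression[g], expression[h]), 2) for h in genes}
    for g in genes
}

for g in genes:
    print(g, [matrix[g][h] for h in genes])
```

For example, the A-B entry is sqrt(0.5^2 + 1.5^2) = sqrt(2.5), which rounds to 1.58.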
Hierarchical clustering
Average Linkage Algorithm
       A      B      C      D
A      0.00   1.58   3.09   4.74
B      1.58   0.00   2.61   5.00
C      3.09   2.61   0.00   2.70
D      4.74   5.00   2.70   0.00
[Figure: leaf order A D B C]
Hierarchical clustering
Average Linkage Algorithm
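A and B are the closest pair (1.58) and are merged into AB.
[Figure: leaf order A B D C]
       AB     C      D
AB     0.00   2.85   4.87
C      2.85   0.00   2.70
D      4.87   2.70   0.00
Hierarchical clustering
Average Linkage Algorithm
C and D are now the closest pair (2.70) and are merged into CD.
[Figure: leaf order A B C D]
       AB     CD
AB     0.00   3.86
CD     3.86   0.00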
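
The merge sequence for genes A to D can be traced with a short script. This is a sketch that assumes average linkage is the mean of all gene-to-gene distances between the two clusters, recomputed each round from the original four-gene matrix; the helper names are hypothetical.

```python
from itertools import combinations

# Gene-to-gene distances from the four-gene matrix (upper triangle).
distances = {
    ("A", "B"): 1.58, ("A", "C"): 3.09, ("A", "D"): 4.74,
    ("B", "C"): 2.61, ("B", "D"): 5.00, ("C", "D"): 2.70,
}


def gene_distance(g, h):
    """Symmetric lookup in the distance table."""
    return distances[(g, h)] if (g, h) in distances else distances[(h, g)]


def average_linkage(a, b):
    """Mean distance over every pair with one gene from each cluster."""
    return sum(gene_distance(g, h) for g in a for h in b) / (len(a) * len(b))


clusters = [("A",), ("B",), ("C",), ("D",)]
history = []  # (merged cluster name, height at which it was formed)
while len(clusters) > 1:
    a, b = min(combinations(clusters, 2), key=lambda pair: average_linkage(*pair))
    history.append(("".join(a + b), round(average_linkage(a, b), 2)))
    clusters = [c for c in clusters if c not in (a, b)] + [a + b]

print(history)  # [('AB', 1.58), ('CD', 2.7), ('ABCD', 3.86)]
```

The heights in `history` are the levels at which the branches join in the tree.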
Hierarchical clustering
Dendrogram
[Figure: dendrogram with leaves A, B, C, D]
Hierarchical clustering
heat maps



Red corresponds to high expression levels.
Green corresponds to low expression levels.
Black corresponds to intermediate expression levels.
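
The color scheme can be expressed as a small mapping from an expression value to an RGB triple. This is only a sketch: it assumes the values are centered so that 0 is intermediate (for example log ratios), and the `saturation` cutoff of 2.0 is an arbitrary illustrative choice.

```python
def expression_color(value, saturation=2.0):
    """Map an expression value to (red, green, blue), each 0-255.

    Positive values shade toward red, negative values toward green,
    and values near zero stay close to black.
    """
    level = max(-1.0, min(1.0, value / saturation))  # clamp to [-1, 1]
    intensity = round(abs(level) * 255)
    return (intensity, 0, 0) if level > 0 else (0, intensity, 0)


print(expression_color(2.0))   # (255, 0, 0): high expression, red
print(expression_color(-2.0))  # (0, 255, 0): low expression, green
print(expression_color(0.0))   # (0, 0, 0): intermediate, black
```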
Hierarchical clustering
Experiment Control
Random 1 – randomized by rows.
Random 2 – randomized by columns.
Random 3 – randomized by both rows and columns.
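
One way to build such randomized controls is sketched below. It assumes that "randomized by rows" means the values are shuffled independently within each row (and likewise for columns); the function names and the small matrix are hypothetical.

```python
import random


def shuffle_within_rows(matrix, rng):
    """Random 1: permute the values inside each row independently."""
    return [rng.sample(row, len(row)) for row in matrix]


def shuffle_within_columns(matrix, rng):
    """Random 2: permute the values inside each column independently."""
    columns = [rng.sample(list(col), len(col)) for col in zip(*matrix)]
    return [list(row) for row in zip(*columns)]


def shuffle_rows_and_columns(matrix, rng):
    """Random 3: apply both shuffles in turn."""
    return shuffle_within_columns(shuffle_within_rows(matrix, rng), rng)


rng = random.Random(0)  # fixed seed so the control is reproducible
data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # rows are genes, columns are chips
print(shuffle_within_rows(data, rng))
print(shuffle_within_columns(data, rng))
print(shuffle_rows_and_columns(data, rng))
```

Clustering each randomized matrix and comparing it with the clustered original gives a visual check that the structure seen in the real data is not produced by the algorithm alone.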
Examples
Example I

We present an experiment by Spellman
et al., published in Mol. Biol. Cell 9,
3273-3297 (1998).

Goals of the experiment:


Identify all cell cycle-regulated genes in yeast.
Show clustering at work.
Example I
Cell Cycle
Example I
Methods

DNA microarrays covered the entire
yeast genome.

Measure levels of mRNA as a
function of time.
Example I
Methods

Synchronization:

α factor.
Elutriation – size based.
cdc15 – temperature-sensitive mutation.

Factors:

Cln3p, Clb2p: deletion strains induced with these factors.

Data from a previously published study (Cho et
al. 1998).
Control sample: asynchronous cultures.
Example I
Methods

Measurements analyzed based on:

Fourier algorithm - assesses periodicity.

Correlation measurement - compared with
previously identified cell cycle regulated
genes.
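
To give a feel for how a Fourier-based measure can assess periodicity, the sketch below projects a time course onto a sine and a cosine of one assumed cell cycle period and reports the combined magnitude. It is an illustration only, not the scoring procedure of the study: the period, the sampling times and the two profiles are hypothetical.

```python
from math import cos, pi, sin, sqrt


def fourier_score(values, times, period):
    """Magnitude of the Fourier component at the given period.

    Profiles that oscillate with that period score high;
    flat or unrelated profiles score close to zero.
    """
    omega = 2 * pi / period
    a = sum(v * sin(omega * t) for v, t in zip(values, times))
    b = sum(v * cos(omega * t) for v, t in zip(values, times))
    return sqrt(a * a + b * b)


# Hypothetical time course: samples every 10 minutes, assumed 60-minute cycle.
times = list(range(0, 120, 10))
periodic = [sin(2 * pi * t / 60) for t in times]  # cycles once per period
flat = [0.5] * len(times)                          # no periodic signal

print(fourier_score(periodic, times, period=60))
print(fourier_score(flat, times, period=60))
```

Because the score combines the sine and cosine terms, it does not depend on where in the cycle a gene peaks, only on how strongly it oscillates at the chosen period.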
Example I
Methods

Calculate a score for each
gene - "CDC score".

Threshold CDC value.

91% of the genes previously
shown to be cell cycle
regulated are included.

About 800 genes were
identified as cell cycle
regulated.
Example I
Phasing
By time of
peak
expression:
Example I
Clustering
By similarity of
expression
across the
measurements:
Example I
Clustering

Hierarchical clustering.
Identified 9 clusters.

Genes in each cluster share:





Common upstream elements
Regulation by similar transcription factors.
Common function (only for known genes).
Cln3p and Clb2p have the same effect on the
genes in a cluster.
Example I
Clustering

Histone cluster:





A very tight cluster.
Repeated SCB motif in promoter.
Induced by Cln3.
Unaffected by Clb2.
Peak during S phase.
Example I
Results

Genes with known functionality:

Cell cycle regulated functions



The MET cluster.
Genes involved in secretion and lipid synthesis.
Known genes discovered as cell cycle
regulated.
Example I
Results

New binding sites for regulators.

The CLB cluster is highly regulated.
Aligning the genes in the cluster.
New consensus for MCM1+SFF binding site.


Example I
Results
MCM1: T-T-A-C-C-N-A-A-T-T-N-G-G-T-A-A
SFF: G-T-M-A-A-C-A-A
New motif:
T-T-W-C-C-Y-A-A-W-N-N-G-G-W-A-A-W-W-N-R-T-A-A-A-Y-A-A
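
The consensus sequences use IUPAC nucleotide codes (M = A or C, R = A or G, W = A or T, Y = C or T, N = any base). A minimal sketch of how such a consensus could be turned into a regular expression for scanning promoter sequences; the function name and the test sequences are hypothetical.

```python
import re

# IUPAC nucleotide codes needed for the motifs above.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "[AC]", "R": "[AG]", "W": "[AT]", "Y": "[CT]", "N": "[ACGT]",
}


def consensus_to_regex(consensus):
    """Compile a hyphen-separated IUPAC consensus into a regex."""
    return re.compile("".join(IUPAC[symbol] for symbol in consensus.split("-")))


mcm1 = consensus_to_regex("T-T-A-C-C-N-A-A-T-T-N-G-G-T-A-A")
sff = consensus_to_regex("G-T-M-A-A-C-A-A")

# Hypothetical sequences, used only to exercise the patterns.
print(bool(mcm1.search("GGTTACCGAATTAGGTAACC")))  # True
print(bool(sff.search("TTGTCAACAAGG")))           # True
print(bool(sff.search("TTGTGAACAAGG")))           # False: G is not allowed by M
```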

Example II

Gasch AP. et al. Genomic expression
programs in the response of yeast cells to
environmental changes.
Mol Biol Cell. 2000; 11(12): 4241-57

Main Goal:

Characterize the yeast response to environmental
changes, and particularly to stress conditions.
Example II
Methods

Yeast cells responding to diverse
environmental stresses.

Microarray contained all yeast genes.
Results were organized by hierarchical
clustering.

Example II
General features of the stress response

Massive and rapid changes.
Transient changes.

Correlated with the magnitude of the shift:




Duration
Amplitude
Steady-state difference.
Example II
General features of the stress response



Some genes responded in a stereotypical
manner.
Some genes had a unique response.
No two expression programs were identical.
Example II
The Environmental Stress Response (ESR)

About 900 genes responded in a
stereotypical manner.
ESR – Environmental Stress Response.

Two large clusters of genes:




repressed genes (~ 600)
induced genes (~ 300)
Showed reciprocal response.
Example II
The Environmental Stress Response (ESR)

Response to
different shifts in:


Temperature
Osmolarity.
Example II
The Environmental Stress Response (ESR)
Heat shock

osmolarity
The ESR is not a response to all
environmental changes.
Example II
The Environmental Stress Response (ESR)

Shift between two equally stressful
environments:


29°C and hyper-osmotic medium.
33°C with normal osmolarity.

The response was the sum of the responses to each change.

Independent response to each of the
changes.
Example II
The Environmental Stress Response (ESR)
Previously known:




STRE promoter.
Recognized by Msn2p and Msn4p.
One all-purpose regulatory system?
Example II
The Environmental Stress Response (ESR)

TRX2 cluster genes:




Dependent on Msn2p/Msn4p in response to heat
shock.
Unaffected by Msn2p/Msn4p in response to H2O2.
Contained a binding site for Yap1p.
Tested with a Yap1p deletion strain.
Example II
The Environmental Stress Response (ESR)

Revealed that TRX2 cluster genes:




Induced by Yap1p in response to H2O2 treatment.
Unaffected by the deletion in response to heat shock.
The ESR is regulated by different transcription
factors.
Regulation is condition-specific and gene-specific.
Example II
Specific Response

Response to stress:


Stereotypic response (ESR).
Specific response.

Characterizes the cell’s response to a specific stress.

Example: Heat-shock response



ESR is initiated quickly (within minutes).
Induction of chaperones.
Alternative carbon source utilization.
Conclusions
Hierarchical clustering
Conclusion
Difficulty:
Post-transcriptional regulation.
Solution:
Use the method when the main regulation is
at the transcription level (example: the yeast cell
cycle).
Hierarchical clustering
Conclusion
Difficulty:
There is no statistical foundation for deciding
where to cut the dendrogram.
Solution:
Split the tree in a way that produces
homogeneous clusters of genes.
Such a split is considered to be evidence that
the grouping is correct.
Hierarchical clustering
Conclusion
Difficulty:
The algorithm will produce clusters in any
case.
Solution:
Introduce a small amount of random noise into the
data, re-cluster the data and compare the
results to the original clustering.
If the results change substantially, the clustering
probably does not represent true biological meaning.
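
A perturbation check of this kind can be sketched as follows. The sketch makes several assumptions that are not in the slides: Gaussian noise, average-linkage clustering cut at a fixed number of clusters, and the Rand index (fraction of gene pairs grouped the same way in both clusterings) as the comparison. All names and data are hypothetical.

```python
import random
from itertools import combinations
from math import dist


def cluster(points, k):
    """Average-linkage agglomeration stopped at k clusters; one label per point."""
    clusters = [[i] for i in range(len(points))]

    def avg(a, b):
        return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        a, b = min(combinations(clusters, 2), key=lambda pair: avg(*pair))
        clusters = [c for c in clusters if c is not a and c is not b] + [a + b]
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def rand_index(labels_a, labels_b):
    """Fraction of point pairs that both labelings treat the same way."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j]) for i, j in pairs
    )
    return agree / len(pairs)


def stability(points, k, noise_sd, trials, seed=0):
    """Mean agreement between the original clustering and noisy re-clusterings."""
    rng = random.Random(seed)
    original = cluster(points, k)
    scores = []
    for _ in range(trials):
        noisy = [tuple(v + rng.gauss(0, noise_sd) for v in p) for p in points]
        scores.append(rand_index(original, cluster(noisy, k)))
    return sum(scores) / trials


# Hypothetical profiles forming two well-separated groups.
points = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1)]
print(stability(points, k=2, noise_sd=0.05, trials=5))  # 1.0
```

A score of 1.0 means every noisy re-clustering reproduced the original grouping; lower scores mean the grouping changed under perturbation.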
Hierarchical clustering
Conclusion

Discover a gene’s function.

Status of cellular processes.

Information on regulatory mechanisms.
General cell behaviors.
Assign genes to pathways.
Unknown biological pathways.



References






Eisen M. B., Spellman P. T., Brown P. O., Botstein D. Cluster
analysis and display of genome-wide expression patterns. Proc. Natl.
Acad. Sci. USA, 95: 14863-14868, 1998.
Spellman P. T. et al. Comprehensive identification of cell cycle-regulated
genes of the yeast Saccharomyces cerevisiae by
microarray hybridization. Mol. Biol. Cell 9, 3273-3297 (1998).
Gasch AP, Spellman PT, Kao CM, Carmel-Harel O, Eisen MB, Storz
G, Botstein D, Brown PO.
Genomic expression programs in the response of yeast cells to
environmental changes.
Mol Biol Cell. 2000; 11(12): 4241-57.
Shannon William, Culverhouse Robert, Duncan Jill. Analyzing
microarray data using cluster analysis. Pharmacogenomics, 2003,
4(1):41-51. Review.
Kaminski Naftali, Friedman Nir. Practical Approaches to Analyzing
Results of Microarray Experiments. American Journal of Respiratory
Cell and Molecular Biology, 2002, 27:125-132. Review.
