Molecular Biology of the Cell

Predicting Gene Expression
from Sequence
Michael A. Beer and Saeed Tavazoie
Cell 117, 185-198 (16 April 2004)
The Authors
Mike Beer
Postdoctoral Researcher
Ph.D., Princeton (1995)
Saeed Tavazoie (middle)
Professor
Dept. of Molecular Biology
The Question
• Transcription factor binding sites are relatively
well-characterized in Saccharomyces cerevisiae
• However, the presence of a TF binding site alone is
not sufficient to predict expression of a gene
• Multiple regulatory factors are often involved
• How do you identify the elaborate rules for gene
regulation?
Simple regulatory structures
Each possible combination of TFs must be tested in the lab,
which is a hugely time-consuming task.
Problems with predicting gene
regulation
Regulatory motif sequences are only loosely conserved,
e.g. the well-known “TATA box” has a
consensus of TATA(A/T)A(A/T)(A/G)
Numerous transcription factors can bind to any one motif
Many genes have multiple known motifs upstream of ATG
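A degenerate consensus like the TATA box above maps naturally onto a regular expression. The sketch below is only an illustration: the function name and the example sequence are made up, and real motif searches usually use position weight matrices rather than exact consensus matching.

```python
import re

# TATA(A/T)A(A/T)(A/G) written as a regex character-class pattern.
# Wrapped in a lookahead so that overlapping matches are also reported.
TATA_PATTERN = re.compile(r"(?=(TATA[AT]A[AT][AG]))")

def find_tata_boxes(sequence):
    """Return the 0-based start position of every consensus match."""
    return [m.start() for m in TATA_PATTERN.finditer(sequence.upper())]

# Hypothetical upstream fragment, invented for this example.
example = "GGCTATAAAAGGCCTATATATACC"
positions = find_tata_boxes(example)
print(positions)
```

Because each bracketed position accepts two bases, this one pattern covers 8 distinct sequences, which is why a consensus match alone is weak evidence of a functional site.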
Example of cis-regulatory logic
From Yuh et al (1998), Science 279, 1896-1902
The Approach
1. Using microarray expression data, the authors built
clusters of genes with similar expression patterns.
From brain expression data in Wen et al (1998), PNAS 95, 334-339
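Grouping genes by expression similarity can be sketched as follows. The slide does not say which clustering algorithm was used, so this greedy, correlation-based grouping is a stand-in, and the gene names, profiles and threshold are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles.
    Assumes neither profile is constant (non-zero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def greedy_clusters(profiles, threshold=0.9):
    """Assign each gene to the first cluster whose seed (first member)
    correlates with it at or above `threshold`; otherwise start a new cluster."""
    clusters = []
    for gene, profile in profiles.items():
        for members in clusters:
            if pearson(profile, profiles[members[0]]) >= threshold:
                members.append(gene)
                break
        else:
            clusters.append([gene])
    return clusters

# Hypothetical expression values across four conditions.
profiles = {
    "geneA": [1, 2, 3, 4],
    "geneB": [2, 4, 6, 8.5],
    "geneC": [4, 3, 2, 1],
    "geneD": [8, 6, 4, 2.2],
}
clusters = greedy_clusters(profiles)
print(clusters)
```

Correlation is used instead of absolute distance so that genes with the same pattern but different magnitudes land in the same cluster.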
The Approach, cont.
2. Within each cluster of genes with similar expression patterns,
the authors search for consensus sequence motifs
within 800 bp upstream of the ATG.
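As a rough illustration of step 2, the sketch below takes the upstream window and counts which short sequences (k-mers) are shared across a cluster. Plain k-mer counting is a simplification chosen for this example, not the motif finder used in the paper; the function names and toy sequences are assumptions.

```python
from collections import Counter

UPSTREAM_BP = 800  # window size quoted on the slide

def upstream_window(chromosome, atg_index, length=UPSTREAM_BP):
    """Return up to `length` bases immediately 5' of the ATG at `atg_index`
    (forward-strand genes only, for simplicity)."""
    start = max(0, atg_index - length)
    return chromosome[start:atg_index]

def shared_kmers(upstream_seqs, k=6):
    """For each k-mer, count how many sequences contain it at least once."""
    counts = Counter()
    for seq in upstream_seqs:
        # A set, so repeats within one sequence are counted only once.
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(kmers)
    return counts

# Toy "upstream regions" for one cluster, invented for illustration.
cluster = ["AAACACGTGTTT", "CCCACGTGGGA", "TTCACGTGCA"]
top = shared_kmers(cluster, k=6).most_common(1)
print(top)
```

A k-mer present in every member of a cluster but rare elsewhere in the genome is a candidate regulatory motif.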
The Approach, cont.
3. The authors built a Bayesian network using the TF sequence
motifs as parent nodes, and the expression pattern as the child node.
4. This can be applied to a gene of interest by identifying the
upstream TF motifs for that gene, and finding the model(s) that
best fit the known upstream TF motifs.
5. If the expression data is within the parameters predicted by the
model, then there is a decent chance that its associated gene
regulatory structure can be verified experimentally.
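The idea of motifs as parent nodes can be sketched as a conditional probability table: for each combination of motifs, estimate how often genes with that combination show the expression pattern. This toy version fixes two hypothetical motifs (A and B) and made-up training counts; it does not reproduce how the authors chose or scored their network structure.

```python
from collections import defaultdict

def learn_cpt(training, pseudocount=1.0):
    """Estimate P(gene in cluster | motif A present, motif B present).

    `training` is a list of ((has_a, has_b), in_cluster) pairs.
    The pseudocount smooths estimates for rarely seen combinations.
    """
    counts = defaultdict(lambda: [0.0, 0.0])  # [not in cluster, in cluster]
    for parents, in_cluster in training:
        counts[parents][int(in_cluster)] += 1
    return {
        parents: (n_in + pseudocount) / (n_in + n_out + 2 * pseudocount)
        for parents, (n_out, n_in) in counts.items()
    }

# Hypothetical training genes: motif presence and cluster membership.
training = (
    [((True, True), True)] * 3
    + [((True, True), False)] * 1
    + [((True, False), False)] * 2
    + [((False, False), False)] * 4
)
cpt = learn_cpt(training)
print(cpt[(True, True)], cpt[(True, False)])
```

In this toy data the pattern is likely only when both motifs are present, which is the kind of combinatorial rule a single-motif test would miss.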
Two examples from yeast
Both clusters have at least 10 genes each, and there is some
confidence that genes with the same upstream TFs will
exhibit the same expression pattern as these clusters.
Constructing the models
Using expression data from 30 microarrays, the authors identified
5547 genes with “significant” expression levels in yeast, and this
data was used to construct 49 models of expression patterns.
Predictive accuracy
These 49 models were applied to five test sets of expression
data, using only the upstream 800 bp region as input.
They found that the expression pattern was correctly predicted
for 1898 genes out of the test set(s) of 2587 genes.
This amounts to 73% accuracy (random would be 1/49, or 2%).
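The accuracy and the random baseline can be checked from the numbers on the slide (variable names are arbitrary):

```python
correct = 1898     # genes whose expression pattern was predicted correctly
total = 2587       # genes in the test set(s)
n_patterns = 49    # number of expression-pattern models

accuracy = correct / total   # fraction predicted correctly
chance = 1 / n_patterns      # expected fraction when guessing uniformly

print(f"accuracy: {accuracy:.1%}, chance: {chance:.1%}")
```

The baseline assumes a uniform guess among the 49 patterns; if some patterns contain many more genes than others, always guessing the largest one would score somewhat higher than 2%.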
Application to C. elegans
Given the larger amount of regulatory sequence in
higher organisms, and the potential for more
complex regulation, the authors had low expectations
for applying this model to C. elegans.
Using 2000 bp of upstream sequence, and microarray
expression data including Hill (2000), the authors
were surprised to learn that they could predict
expression patterns for roughly half of the genes in
the C. elegans dataset.
An example from C. elegans
Is it really so simple?
Gene regulation involves a complex combinatorial
dance of numerous factors aside from the presence or
absence of TF binding sites.
The authors have deliberately limited their scope to
cis-acting upstream factors, ignoring regulatory
elements in introns or downstream regions, as well as
the effects of operons, alternative splicing, histone
modifications, methylation, etc.
Model constraints
Several bits of information were found to be significant
factors in improving the predictive accuracy of the
models:
A. Motif orientation ( <--- or ---> )
B. Distance from the start codon
C. The particular order of various TFs
D. The presence of multiple copies of the same TF
All of those factors were included in the model as priors.
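The sketch below shows how these per-motif features could be read off an upstream sequence. It uses exact string matching and invented sequences, so it is a simplified illustration under those assumptions rather than the scoring used in the paper.

```python
def reverse_complement(seq):
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq))

def motif_hits(upstream, motif):
    """List (distance_to_start_codon, strand) for each exact motif match.

    `upstream` is assumed to end immediately before the ATG, so the
    distance is counted from the first base of the match to the ATG.
    A palindromic motif is reported on the "+" strand only.
    """
    rc = reverse_complement(motif)
    n = len(motif)
    hits = []
    for i in range(len(upstream) - n + 1):
        window = upstream[i:i + n]
        distance = len(upstream) - i
        if window == motif:
            hits.append((distance, "+"))
        elif window == rc:
            hits.append((distance, "-"))
    return hits

def summarize(upstream, motif):
    """Collect copy number, nearest distance and orientations for one motif."""
    hits = motif_hits(upstream, motif)
    return {
        "copies": len(hits),
        "nearest_distance": min((d for d, _ in hits), default=None),
        "strands": sorted({s for _, s in hits}),
    }

# Hypothetical upstream fragment and motifs, invented for illustration.
upstream = "CCAAAATTCCGATGAGCCAATTTTGG"
hits_a = motif_hits(upstream, "AAAATT")
summary_b = summarize(upstream, "GATGAG")
print(hits_a, summary_b)
```

The relative order of two motifs (item C) follows from comparing their distances: here one copy of the first motif lies farther from the ATG than the second motif, and the other copy lies closer.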
Why is distance from the start
codon significant?
From Harbison et al (2004), Nature 431, 99-104
The number of copies of a TF
binding site is relevant
From Molecular Biology of the Cell, 4th edition
Motif combinatorics and
predictive accuracy
Combinatoric models are more accurate
than single-TF models (unless a gene
is under the control of only one TF).
The order of various TFs is significant
Future directions
Because of the sensitivity of the model(s), even a very small
amount of ambiguity can yield meaningless results.
For this reason, SAGE data is not particularly suitable, as
only unique SAGE tags can be said to be unambiguous; this
in turn excludes all sorts of potentially useful data.
However, we could use the microarray-based predictions to
pick gene regulatory structures to investigate.