Promoter identification


COMPUTATIONAL ANALYSIS
OF PROMOTERS
Gene regulation
• Genomes usually contain several thousand different genes.
• Some of the gene products are required by the cell under
all growth conditions and are called housekeeping
genes.
• genes for DNA polymerase, RNA polymerase, rRNA, tRNA, …
• Many other gene products are required under specific
growth conditions.
• e.g. enzymes responding to a specific environmental condition
such as DNA damage
Gene regulation
• Housekeeping genes must be expressed at some level all
of the time.
• Frequently, as the cell grows faster, more of the housekeeping
gene products are needed.
• The gene products required for specific growth conditions
are not needed all of the time.
• These genes are frequently expressed at extremely low levels, or
not expressed at all when they are not needed and yet made when
they are needed.
• Gene expression must therefore be regulated so that the genes being expressed meet the needs of different cell types, developmental stages, or different external conditions.
Gene regulation
Gene regulation basically occurs at three different levels:
1. transcriptional regulation
• transcription of the gene is regulated
• control of transcription initiation – most important control mechanism
2. translational regulation
• translation of the gene is regulated
• How often the mRNA is translated influences the amount of gene product that is made.
3. post-transcriptional/post-translational regulation
• regulation of gene products after they are completely synthesized, e.g. degradation, chemical modifications (methylation, phosphorylation)
Transcriptional regulation
• Transcription control has two key features:
1. protein-binding regulatory DNA sequences (control elements) are associated with genes
2. specific proteins that bind to regulatory sequences determine where transcription will start, and either activate or repress it
• The DNA sequence specifying where RNA polymerase binds and initiates transcription of a gene is called a promoter.
• Transcription from a particular promoter is controlled by DNA-binding proteins, termed transcription factors.
• DNA control elements that bind transcription factors may be located very far from the promoter they regulate.
• As a result of this arrangement, transcription from a single promoter may be regulated by binding of multiple transcription factors to alternative control elements, permitting complex control of gene expression.
Three different polymerases
• RNA polymerase I synthesizes rRNA.
• RNA polymerase II synthesizes mRNA.
• RNA polymerase III synthesizes small RNAs and tRNA.
source: Molecular Biology of the Cell. 4th edition. Alberts B
Three parts of promoter
• core promoter
• responsible for actual binding of transcription apparatus
• very close upstream (~35 bp), may also be downstream, see later
• proximal promoter
• contains several regulatory elements
• a few hundred bases upstream of the transcription start site (TSS)
• distal promoter
• contains enhancers (upstream/downstream), silencers
• They are cis-acting: a cis-element regulates a gene on the same DNA molecule. Cis-acting sequences are bound by trans-acting (i.e. acting from a different molecule) regulatory proteins.
• However, the distinction between proximal elements and enhancers/silencers is not very clear.
Core promoter
• Eukaryotic RNAPII is not itself capable of transcriptional
initiation in vitro.
• It needs to be supplemented by general (basal)
transcription factors (GTFs).
• Factors are identified as TFIIX, where X is a letter. e.g. TFIIA,
TFIIB, …
• RNAPII + TFs form the pre-initiation complex (PIC). Only then can transcription commence.
• minimal (core) promoter – DNA sequence sufficient for
assembly of pre-initiation complex.
• Transcription initiated by the core promoter is called basal
transcription.
Core promoter elements
• Core promoter is usually located proximal to or overlapping
TSS.
• Contains several sequence motifs. TFs interact with them in
sequence-specific manner.
• The combination of TF-binding motifs varies depending on the gene.
Core promoter elements
• TATA box … ~30 bp upstream, consensus TATA(A/T)A(A/T)
• Instead of a TATA box, some eukaryotic (TATA-less) genes contain an initiator (Inr) … surrounds the TSS, extremely degenerate consensus sequence YYAN(T/A)YYY (A = TSS, N = any nucleotide, Y = C or T)
• Promoters with both TATA and Inr also exist.
• DPE (downstream promoter element) in TATA-less promoters
• Present in some TATA-, Inr+ promoters, 30 bp downstream. Consensus: RGWCGTG (R = A or G, W = A or T)
Butler JE, Kadonaga JT. The RNA polymerase II core promoter: a key component in the regulation of gene expression. Genes Dev. 2002; 16 (20):2583-92.
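The consensus sequences above can be turned into a naive motif scanner. The following is only a sketch: the function names (`iupac_to_regex`, `find_motif`) and the demo sequence are made up for illustration, and the consensus strings are transcribed from the slide with (A/T) written as the IUPAC code W. Because the motifs are short and degenerate, a scanner like this will also report many chance matches.

```python
import re

# IUPAC nucleotide codes needed for the consensus sequences on the slide.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "W": "[AT]", "N": "[ACGT]"}

# Consensus strings transcribed from the slide; (A/T) is written as W.
CONSENSUS = {
    "TATA": "TATAWAW",
    "Inr": "YYANWYYY",
    "DPE": "RGWCGTG",
}

def iupac_to_regex(consensus):
    """Translate an IUPAC consensus string into a regular expression."""
    return "".join(IUPAC[base] for base in consensus)

def find_motif(seq, consensus):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    # The lookahead lets matches overlap, which plain finditer would skip.
    pattern = re.compile("(?=(" + iupac_to_regex(consensus) + "))")
    return [m.start() for m in pattern.finditer(seq.upper())]

# Hypothetical demo sequence, not taken from any real promoter.
demo = "GGTATAAAAGG"
print(find_motif(demo, CONSENSUS["TATA"]))  # -> [2]
```

A scan of this kind ignores the expected position of each element relative to the TSS (about -30 for TATA, about +30 for DPE), so position constraints would have to be added before the hits mean much.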
Promoter proximal elements
• Found within 100 to 200 bp of the TSS.
• CAAT (CCAAT, CAT) box … consensus GGCCAATCT
• GC box … consensus G/T G/A GGCG G/T G/A G/A C/T.
• It is a GC-rich segment.
• A promoter may contain multiple GC boxes; such promoters usually lack a TATA box.
A hypothetical mammalian promoter region
[Figure labels: enhancers (-10 to -50 kb and +10 to +50 kb), promoter-proximal element (-200), TATA (-30), TSS (+1), exon, intron.]
CpG island
• Transcription of genes with TATA/Inr promoters begins at a well-defined site.
• However, transcription of many protein-coding genes has been shown to begin at any one of multiple possible sites over an extended region 20–200 bp long.
• As a result, such genes give rise to mRNAs with multiple alternative 5’ ends.
• These are housekeeping genes; they do not contain TATA or Inr.
• Most genes of this type contain a CG-rich stretch of several hundred nucleotides – a CpG island – within ≈100 base pairs upstream of the TSS.
• CpG islands are typical for vertebrates (including human). They are not common in lower eukaryotes.
CpG island
[Figure labels: CpG island, ~100 bp, multiple 5’ start sites, mRNA.]
• Computational analysis is based on CG dinucleotide imbalance.
• length ≥ 200 bp, C+G content min 50%, $\frac{\mathrm{CpG}_{observed}}{\mathrm{CpG}_{expected}} = \frac{p(\mathrm{CG})}{p(\mathrm{C})\,p(\mathrm{G})} > 0.60$
M. Gardiner-Garden, M. Frommer, CpG islands in vertebrate genomes, J. Mol. Biol. 1987, 196, 261-282.
• length ≥ 500 bp, C+G content min 55%, $\frac{\mathrm{CpG}_{observed}}{\mathrm{CpG}_{expected}} > 0.65$
D. Takai, P. A. Jones, Comprehensive analysis of CpG islands in human chromosomes 21 and 22, PNAS 2002, 99, 3740-45.
CpG island
len = 251, #C = 76, #G = 101, #CG = 30
$p_C = \frac{76}{251}$, $p_G = \frac{101}{251}$, $p_{CG} = \frac{30}{251}$
CG content = $p_C + p_G$ = 0.71, CpG o/e = $\frac{p_{CG}}{p_C\,p_G}$ = 0.98
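The arithmetic of the worked example (76 C, 101 G and 30 CG in 251 bases) can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the function names are hypothetical, all three frequencies are divided by the sequence length as on the slide, and the default thresholds are the Gardiner-Garden and Frommer values quoted earlier. It checks a single sequence and does not implement the sliding-window scan used by tools such as CpGPlot.

```python
def cpg_stats_from_counts(n_c, n_g, n_cg, length):
    """Return (C+G content, CpG observed/expected) from raw counts."""
    if length == 0 or n_c == 0 or n_g == 0:
        # The ratio is undefined without both C and G; report 0.0 by convention.
        gc = (n_c + n_g) / length if length else 0.0
        return gc, 0.0
    p_c, p_g, p_cg = n_c / length, n_g / length, n_cg / length
    return p_c + p_g, p_cg / (p_c * p_g)

def cpg_stats(seq):
    """Count C, G and CG in a sequence and compute the two statistics."""
    seq = seq.upper()
    return cpg_stats_from_counts(seq.count("C"), seq.count("G"),
                                 seq.count("CG"), len(seq))

def looks_like_cpg_island(seq, min_len=200, min_gc=0.50, min_oe=0.60):
    """Apply length, C+G content and o/e thresholds to one sequence."""
    if len(seq) < min_len:
        return False
    gc, oe = cpg_stats(seq)
    return gc >= min_gc and oe > min_oe

# Counts from the worked example on the slide.
gc, oe = cpg_stats_from_counts(76, 101, 30, 251)
print(round(gc, 2), round(oe, 2))  # -> 0.71 0.98
```

Passing `min_len=500, min_gc=0.55, min_oe=0.65` gives the stricter Takai and Jones variant.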
• simple methods based on the frequency of CG perform
remarkably well at correctly predicting regions containing
TSSs
• EMBOSS CpGPlot/CpGReport - http://www.ebi.ac.uk/Tools/emboss/cpgplot/
• CpG Island Searcher - http://cpgislands.usc.edu/ (IE only)
Promoter regions in human genes
Suzuki Y et al., Identification and characterization of the potential promoter regions of 1031 kinds of human genes.
Genome Res. 2001, 11(5):677-84.
Element            Frequency
TATA               32%
Inr                85%
GC box             97%
CAAT box           64%
located in CpG     48%

Combination        Frequency
TATA+ Inr+         28%
TATA+ Inr-         4%
TATA- Inr+         56%
TATA- Inr-         12%
Computational analysis of promoters
Introduction
• Regulatory regions typically contain several transcription
factor binding sites strung out over a large region.
• Which particular factor is used depends not only on the binding site, but also on which factors are available for binding in a given cell type at a given time.
• Any given gene will typically have its very own pattern of binding sites for transcriptional activators and repressors, ensuring that the gene is only transcribed in the proper cell type(s) and at the proper time during development.
Introduction
• Transcription factors themselves are also subject to
similar transcriptional regulation, thereby forming
transcriptional cascades and feed-back control loops.
• While all this is very nice and interesting from a biologist’s point of view, it spells big trouble for promoter prediction.
Computational difficulties
• There are thousands of transcriptional regulators, many of which have recognition sequences that are not yet characterized.
• Any given sequence element might be recognized by
different factors in different cell types.
• Core promoter regulatory elements are short and not
completely conserved ⟹ similar elements will be found
purely by chance all over the genome.
What do promoter prediction methods actually predict?
• 1st nucleotide copied at the 5’ end of the corresponding mRNA – transcription start site (TSS)
• The region around the TSS is often referred to as the core promoter.
• Owing to the strong link between TSS and core promoter, these terms are often used interchangeably.
• Three distinct types of features are used in promoter prediction:
1. signal features
2. context features
3. structure features
Evaluating predictions
• sensitivity (Se), recall, TPR
• proportion of correct predictions of TSSs relative to all experimental
TSSs
$\mathrm{Se} = \frac{TP}{TP + FN}$
• positive predictive value (PPV), precision
• proportion of correct predictions of TSSs out of all counted positive
predictions
$\mathrm{PPV} = \frac{TP}{TP + FP}$
Evaluating predictions
• How are TP, FP and FN obtained?
• You have a gene sequence for which you know the TSS location, and you make your prediction.
• If it falls within the region [-2000, +2000] relative to the annotated TSS, you have a TP.
• Predictions falling into the annotated part of the gene within [+2001, EndOfGene] are FPs.
• If you predict no promoter for this gene sequence, you have an FN.
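The counting rules above, together with the definitions of Se and PPV, can be sketched as follows. The data layout is an assumption made for illustration: each gene is a dict with a hypothetical `gene_end` offset and a list of predicted positions, both given relative to the annotated TSS. Predictions upstream of -2000 are not covered by the rules on the slide, so this sketch simply ignores them.

```python
def evaluate(genes, window=2000):
    """Count TP, FP, FN over a list of genes and derive Se and PPV.

    Each gene is a dict with:
      "gene_end"    -- offset of the gene end relative to the TSS (assumed layout)
      "predictions" -- list of predicted TSS offsets relative to the TSS
    """
    tp = fp = fn = 0
    for gene in genes:
        predictions = gene["predictions"]
        if not predictions:
            fn += 1  # no promoter predicted for this gene
            continue
        for pos in predictions:
            if -window <= pos <= window:
                tp += 1  # inside [-2000, +2000]
            elif window < pos <= gene["gene_end"]:
                fp += 1  # inside [+2001, EndOfGene]
            # Anything else is outside the slide's rules and is ignored here.
    se = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    return {"TP": tp, "FP": fp, "FN": fn, "Se": se, "PPV": ppv}

# Made-up toy data: three genes, one of them without any prediction.
toy = [
    {"gene_end": 30000, "predictions": [-150, 12000]},
    {"gene_end": 8000, "predictions": []},
    {"gene_end": 5000, "predictions": [300]},
]
print(evaluate(toy))
```

On the toy data this gives TP = 2, FP = 1 and FN = 1, so both Se and PPV equal 2/3.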
Signal features
• Recognize “conserved” signals such as TATA box, Inr,
DPE, BRE etc.
• Such motifs are highly variable and degenerate. This leads to a high false positive rate.
• Methods based on core promoter elements and other specific TFBSs (e.g. CAAT box) are far from accurate.
• A much more reliable signal is the CpG island. However, only ≈50% of human genes contain CpG islands.
⇓
CpG and non-CpG promoters are predicted with different success; prediction of non-CpG promoters is less accurate.
Context features
• Extracted from genomic context of promoters
• Represented by a set of n-mers (DNA sequences n bases long). Their statistics are estimated from training samples.
• n-mers can cover most biological signals (TFBS: TATAAA, CCAAT; CpG: GC-rich n-mers like CGGCG)
• The n-mer representation encodes contextual information of promoters and has the following advantages:
• contextual information is independent of any biological signals
• distribution of n-mers may have biological significance (TFBS,
CpG)
• n-mers may reveal details of yet unknown promoter regions
• n-mers reduce FPR while maintaining relatively high TPR
(i.e. Se)
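A minimal sketch of the n-mer representation, assuming simple overlapping counts over a single sequence; the function names are hypothetical. Real predictors estimate such statistics from many promoter and non-promoter training sequences and then compare the two distributions.

```python
from collections import Counter

def nmer_counts(seq, n):
    """Count all overlapping n-mers in a sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))

def nmer_frequencies(seq, n):
    """Relative frequency of each n-mer (counts divided by the total)."""
    counts = nmer_counts(seq, n)
    total = sum(counts.values())
    return {nmer: c / total for nmer, c in counts.items()} if total else {}

# Toy example: the dinucleotides of CGCGCG are CG, GC, CG, GC, CG.
print(nmer_counts("CGCGCG", 2))  # -> Counter({'CG': 3, 'GC': 2})
```

The frequency dictionary can serve as a feature vector for a classifier; for n = 6 it has at most 4^6 = 4096 entries.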
Structure features
• They originate from DNA 3D structures that characterize
proximal promoters.
• DNA actually encodes in its sequence at least two
independent levels of functional information
• DNA sequence – encodes proteins and their regulatory elements.
• Physical and structural properties of DNA itself.
• Example:
• dinucleotide properties – stacking energy, propeller twist
• trinucleotide – bendability, nucleosome position preference
• They have long-range interactions (up to 10 kbp), so they
can exhibit properties not visible in the sequence.
Model for cooperative assembly of an activated transcription-initiation complex.
This figure shows why structural features such as flexibility are important.
Molecular Cell Biology. 4th edition. Lodish H, Berk A, Zipursky SL, et al. New York: W. H. Freeman; 2000.
Werner T, Fessele S, Maier H, Nelson PJ. Computer modeling of promoter organization as a tool to study transcriptional coregulation. FASEB J. 2003; 17(10):1228-37.
Software
Signal features (two leading CpG predictors)
• FirstEF – different quadratic discriminant functions for CpG and non-CpG, slightly improves performance by concentrating on regions around the first exon
• Eponine – TATA and G+C rich domain, Relevance Vector
Machine
Context features
• PromoterInspector – IUPAC word groups with wildcards
Structure features
• McPromoter – DNA sequence, bending, DNA twist, ANN
• EP3 – features from [1], prediction based just on the threshold imposed on the structural profile.
[1] Florquin K et al., Large-scale structural analysis of the core promoter in mammalian and plant genomes. Nucleic Acids Res. 2005;33(13):4255
Integrated approaches
• combine sequence, context and structural features
• ARTS – SVM, sophisticated kernels, combines n-mers with structure features (e.g. twist angle, stacking energies)
• does not distinguish CpG-related promoters from unrelated ones; it is not clear how it performs on non-CpG promoters
• SCS – sequence (TATA, Inr, DPE, CpG), structure (flexibility), and context (6-mers) features are used in different prediction models; their outcomes are combined by a decision tree
• CoreBoost – boosting technique with stumps, integrates
core promoter signals, DNA flexibility, n-mer frequency, …
• CoreBoost_HM … adds experimental histone modification data
Boosting, stumps
• Boosting
• Belongs among the ensemble methods that produce a very accurate prediction rule (strong learner) by combining rough and moderately inaccurate (i.e. just a bit better than random guessing) rules (weak learners, WL).
• Iteratively learn weak classifiers and add them to a final strong classifier.
• When a WL is added, it is weighted based on its accuracy.
• After a WL is added, the data is reweighted: misclassified examples gain
weight and correctly classified examples lose weight.
• Thus, future WLs focus more on the examples that previous WLs
misclassified.
• Stump
• One-level decision tree (i.e. it has one root and
two terminal nodes)
source: wikipedia
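The boosting loop described above can be sketched in plain Python as AdaBoost with decision stumps. This is an illustrative toy under stated assumptions, not the CoreBoost implementation: labels are +1/-1, each stump thresholds a single feature, and all function names are hypothetical.

```python
import math

def fit_stump(X, y, w):
    """Return (weighted error, feature, threshold, sign) of the best stump."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for sign in (1, -1):
                pred = [sign if row[j] >= thr else -sign for row in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def stump_predict(stump, row):
    """Apply a one-level decision tree to a single example."""
    _, j, thr, sign = stump
    return sign if row[j] >= thr else -sign

def adaboost(X, y, rounds=10):
    """Train a weighted list of stumps on labels in {+1, -1}."""
    n = len(X)
    w = [1.0 / n] * n  # start with uniform example weights
    model = []
    for _ in range(rounds):
        stump = fit_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # more accurate WL, larger weight
        model.append((alpha, stump))
        # Misclassified examples gain weight, correct ones lose weight.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, yi, row in zip(w, y, X)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def predict(model, row):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in model)
    return 1 if score >= 0 else -1

# Toy 1-D data that no single stump can separate (+ + - - + +).
X = [[0], [1], [2], [3], [4], [5]]
y = [1, 1, -1, -1, 1, 1]
model = adaboost(X, y, rounds=3)
print([predict(model, row) for row in X])  # -> [1, 1, -1, -1, 1, 1]
```

A single stump misclassifies at least two of the six toy points, while three boosted stumps classify all of them correctly, which is the point of combining weak learners.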
Databases
• EPD – Eukaryotic Promoter Database
• http://epd.vital-it.ch
• manually annotated non-redundant collection of eukaryotic POL II
promoters
• DBTSS
• http://dbtss.hgc.jp/
• putative core promoter: e.g. -100 bp … +50 bp, -250 bp … +50 bp,
-200 … +200 bp
Current state of promoter prediction
• CpG island promoters are easier to predict than non-CpG promoters.
• CpG islands usually correspond to housekeeping genes. Promoters of housekeeping genes are easier to predict, but housekeeping genes are not regulated that strongly. So if a biologist wants to up- or down-regulate the expression and you tell them the gene has a CpG island promoter, they are usually not happy.
• Non-CpG promoters correspond to tissue-specific expression and are the bottleneck in accurate promoter prediction.
• The best approach is to use transcription data. Alignment of the 5’ ends of ESTs or full cDNAs can be indicative of the promoter sequence. However, conventional cDNAs often lack the complete 5’ UTR. This is overcome by new mRNA cap cloning techniques – DBTSS.
Future directions
• False positives are still the main problem.
• This is because information about chromatin structure is missing from the prediction models.
• Without knowing which regions of chromatin are open or closed (and to what degree), researchers have to assume that the whole genome is accessible for binding, which is obviously wrong and will lead to more FPs (and FNs because of the extra noise).
• Chromatin remodelling: enzyme-assisted movement of nucleosomes on DNA.
source: http://www.nida.nih.gov/NIDA_notes/NNvol21N4/gene.html