Promoter prediction (really)


10/26/05
Promoter Prediction
(really!)
10/26/05
D Dobbs ISU - BCB 444/544X: Promoter Prediction (really!)
1
Announcements
• BCB Link for Seminar Schedules (updated)
http://www.bcb.iastate.edu/seminars/index.html
Seminar (Fri Oct 28)
12:10 PM BCB Faculty Seminar in E164 Lagomarcino
Assembly and Alignment of Genomic DNA Sequence
Xiaoqiu Huang, ComS
http://www.bcb.iastate.edu/courses/BCB691-F2005.html#Oct%2028
Mark your calendars:
1:10 PM Nov 14 Baker Seminar in Howe Hall Auditorium
"Discovering transcription factor binding sites"
Douglas Brutlag, Dept of Biochemistry & Medicine
Stanford University School of Medicine
Announcements
BCB 544 Projects - Important Dates:
Nov 2 Wed noon - Project proposals due to David/Drena
Nov 4 Fri 10A
- Approvals/responses to students
Dec 2 Fri noon
- Written project reports due
Dec 5,7,8,9 class/lab
- Oral Presentations (20')
(Dec 15 Thurs = Final Exam)
Announcements
Lab 9 - due Wed noon (today)
Exam 2 - this Friday
Posted Online:
Exam 2 Study Guide
544 Reading Assignment (2 papers)
Lab Keys (today)
Thurs No Lab - Extra Office Hrs instead:
David 1-3 PM in 209 Atanasoff
Drena 1-3 PM in 106 MBB
Promoter Prediction
RNA Structure/Function Prediction
Mon: Quite a few more words re: gene prediction
Wed: Promoter prediction
Next Mon: RNA structure & function
• RNA structure prediction
• 2° & 3° structure prediction
• miRNA & target prediction
Optional - but very helpful reading:
(that's a hint!)
1)
Zhang MQ (2002) Computational prediction of eukaryotic protein-coding genes. Nat Rev Genet 3:698-709
http://proxy.lib.iastate.edu:2103/nrg/journal/v3/n9/full/nrg890_fs.html
2)
Wasserman WW & Sandelin A (2004) Applied bioinformatics for
identification of regulatory elements. Nat Rev Genet 5:276-287
http://proxy.lib.iastate.edu:2103/nrg/journal/v5/n4/full/nrg1315_fs.html
Check this out: http://www.phylofoot.org/NRG_testcases/
Reading Assignment (for Mon)
Mount Bioinformatics
• Chp 8 Prediction of RNA Secondary Structure
• pp. 327-355
• Ck Errata: http://www.bioinformaticsonline.org/help/errata2.html
Cates (Online) RNA Secondary Structure Prediction Module
• http://cnx.rice.edu/content/m11065/latest/
Review last lecture:
Flowchart for Gene Prediction
Performance Assessment Measures
Correction re: slide 10/24 # 27
Promoters
Gene prediction flowchart
Fig 5.15
Baxevanis &
Ouellette 2005
Evaluation of Splice Site Prediction
What do measures really mean?
Sp =
Fig 5.11
Baxevanis &
Ouellette 2005
Correction re: last lecture:
GeneSeqer Performance Graphs
Brendel et al (2004) Bioinformatics 20: 1157
Performance?
[Figure: four panels, each with a y-axis labeled Sn (0.00 to 1.00) and an x-axis running from -10 to 20: Human GT site, Human AG site, A. thaliana GT site, A. thaliana AG site]
• Note: these are not ROC curves (plots of Sn vs 1 - Sp, with Sp = TN/(TN+FP))
• But plots such as these (& ROCs) are much better than using a "single number" to compare different methods
• Both types of plots illustrate the trade-off: Sn vs Sp
Brendel 2005
Fig 2 - Brendel et al (2004) Bioinformatics 20: 1157
Bayes Factor as Decision Criterion
H0: H = T:
$$BF = \frac{p\{T \mid S\} \,/\, (1 - p\{T \mid S\})}{p\{T\} \,/\, (1 - p\{T\})}$$
2-class model:
$$BF = \frac{p\{S \mid T\}}{p\{S \mid F\}}$$
7-class model:
$$BF = \frac{\sum_{x=1,2,0} p\{S \mid T_x\}\, p\{T_x\} \,\Big/\, \sum_{x=1,2,0} p\{T_x\}}{\sum_{x=1,2,0,i} p\{S \mid F_x\}\, p\{F_x\} \,\Big/\, \sum_{x=1,2,0,i} p\{F_x\}}$$
Brendel 2005
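A small worked example may help with the Bayes factor definition: it is the posterior odds of a true site divided by the prior odds, and in the 2-class model it reduces to the likelihood ratio p{S|T} / p{S|F}. The sketch below checks this numerically. The function names and all probabilities are hypothetical values chosen only for illustration, not numbers from Brendel et al.

```python
def posterior_two_class(p_s_given_t, p_s_given_f, p_t):
    """Posterior p{T|S} from Bayes' rule with two classes (T and F)."""
    numerator = p_s_given_t * p_t
    return numerator / (numerator + p_s_given_f * (1.0 - p_t))


def bayes_factor(p_t_given_s, p_t):
    """BF = posterior odds / prior odds."""
    posterior_odds = p_t_given_s / (1.0 - p_t_given_s)
    prior_odds = p_t / (1.0 - p_t)
    return posterior_odds / prior_odds


# Hypothetical likelihoods and prior (illustration only).
p_s_given_t = 0.02    # probability of sequence S at a true site
p_s_given_f = 0.001   # probability of sequence S at a false site
p_t = 0.05            # prior probability that a candidate site is true

posterior = posterior_two_class(p_s_given_t, p_s_given_f, p_t)
bf = bayes_factor(posterior, p_t)
likelihood_ratio = p_s_given_t / p_s_given_f

print(f"posterior = {posterior:.4f}")
print(f"BF = {bf:.4f}, likelihood ratio = {likelihood_ratio:.4f}")
```

Note that the prior cancels out of the Bayes factor, which is why it can serve as a decision criterion independent of how rare true sites are.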
Evaluation of Splice Site Prediction
Predicted \ Actual | True | False | Total
True | TP | FP | PP = TP + FP
False | FN | TN | PN = FN + TN
Total | AP = TP + FN | AN = FP + TN |
• Misclassification rates: $\alpha = FN / AP$, $\beta = FP / AN$
• Sensitivity: $S_n = TP / AP = 1 - \alpha$ (= Coverage)
• Specificity: $S_p = TP / PP = \frac{1 - \alpha}{1 - \alpha + r\beta}$, with $r = AN / AP$
• Normalized specificity: $\sigma = \frac{1 - \alpha}{1 - \alpha + \beta}$
Brendel 2005
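To see why a normalized specificity is useful, the following sketch computes the measures on this slide from confusion-matrix counts, writing α = FN/AP and β = FP/AN for the two misclassification rates (symbol names assumed here). The counts are hypothetical; they mimic the typical situation where false candidate sites vastly outnumber true ones.

```python
def brendel_measures(tp, fp, fn, tn):
    """Sn, Sp and normalized specificity from confusion-matrix counts."""
    ap = tp + fn        # actual positives
    an = fp + tn        # actual negatives
    pp = tp + fp        # predicted positives
    alpha = fn / ap     # fraction of true sites missed
    beta = fp / an      # fraction of false sites predicted as true
    r = an / ap         # ratio of actual negatives to actual positives
    return {
        "Sn": tp / ap,
        "Sp": tp / pp,
        # Same Sp, rewritten in terms of the misclassification rates.
        "Sp_from_rates": (1 - alpha) / (1 - alpha + r * beta),
        # Normalized specificity: Sp as it would be if r were 1.
        "sigma": (1 - alpha) / (1 - alpha + beta),
    }


# Hypothetical counts: 100 true sites among 100,000 false candidates.
m = brendel_measures(tp=90, fp=900, fn=10, tn=99100)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With these counts Sp is low even though fewer than 1% of false sites are called true, because r is large; the normalized value removes that dependence on class imbalance.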
Careful: different definitions for "Specificity"
Predicted \ Actual | True | False | Total
True | TP | FP | PP = TP + FP
False | FN | TN | PN = FN + TN
Total | AP = TP + FN | AN = FP + TN |
Brendel definitions
• Sensitivity: Sn = TP / AP = 1 - α, with α = FN / AP
• Specificity: Sp = TP / PP = 1 - FP / PP
cf. Guigó definitions
Sn: Sensitivity = TP/(TP+FN)
Sp: Specificity = TN/(TN+FP)
AC: Approximate Correlation = 0.5 x ((TP/(TP+FN)) + (TP/(TP+FP)) + (TN/(TN+FP)) + (TN/(TN+FN))) - 1
Other measures? Predictive Values, Correlation Coefficient
Best measures for comparing different methods?
• ROC curves
(Receiver Operating Characteristic?!!)
http://www.anaesthetist.com/mnm/stats/roc/
"The Magnificent ROC" - has fun applets & quotes:
"There is no statistical test, however intuitive and simple,
which will not be abused by medical researchers"
• Correlation Coefficient
(Matthews correlation coefficient, MCC)
Do not memorize this!
$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$
1 for a perfect prediction
0 for a completely random assignment
-1 for a "perfectly incorrect" prediction
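The three reference values above (1, 0, -1) can be checked directly. The sketch below computes the TN-based specificity, the approximate correlation (AC) from the preceding slide, and the standard Matthews correlation coefficient from confusion-matrix counts. The function name and the example counts are hypothetical, and the code assumes all four marginal totals are nonzero.

```python
import math


def correlation_measures(tp, fp, fn, tn):
    """Sn, TN-based Sp, approximate correlation (AC) and MCC.

    Assumes TP+FN, TP+FP, TN+FP and TN+FN are all nonzero.
    """
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    ac = 0.5 * (tp / (tp + fn) + tp / (tp + fp)
                + tn / (tn + fp) + tn / (tn + fn)) - 1
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"Sn": sn, "Sp": sp, "AC": ac, "MCC": mcc}


# Hypothetical counts for the three reference cases.
perfect = correlation_measures(tp=50, fp=0, fn=0, tn=50)
random_like = correlation_measures(tp=25, fp=25, fn=25, tn=25)
inverted = correlation_measures(tp=0, fp=50, fn=50, tn=0)

for label, result in [("perfect", perfect),
                      ("random-like", random_like),
                      ("perfectly incorrect", inverted)]:
    print(label, {k: round(v, 3) for k, v in result.items()})
```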
Promoters
What signals are there?
Simple ones in prokaryotes
Brown Fig 9.17, BIOS Scientific Publishers Ltd, 1999
Prokaryotic promoters
• RNA polymerase complex recognizes promoter
sequences located very close to & on 5’ side
(“upstream”) of initiation site
• RNA polymerase complex binds directly to these,
with no requirement for “transcription factors”
• Prokaryotic promoter sequences are highly conserved
• -10 region
• -35 region
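Because the -10 and -35 regions are well conserved, even a naive consensus scan finds candidate prokaryotic promoters. The sketch below is only an illustration: it assumes the textbook E. coli σ70 consensus hexamers (TTGACA at -35, TATAAT at -10) and a spacer of 15 to 19 bp, none of which are stated on the slide, and the function names and test sequence are hypothetical.

```python
MINUS35 = "TTGACA"   # assumed sigma70 -35 consensus (textbook value)
MINUS10 = "TATAAT"   # assumed sigma70 -10 consensus (textbook value)


def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_promoter_like(seq, max_mm=2, spacers=range(15, 20)):
    """Return (pos35, pos10, spacer, total_mismatches) for each candidate.

    max_mm is the number of mismatches allowed in EACH hexamer.
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - len(MINUS35) + 1):
        mm35 = mismatches(seq[i:i + 6], MINUS35)
        if mm35 > max_mm:
            continue
        for spacer in spacers:
            j = i + 6 + spacer
            if j + 6 > len(seq):
                break
            mm10 = mismatches(seq[j:j + 6], MINUS10)
            if mm10 <= max_mm:
                hits.append((i, j, spacer, mm35 + mm10))
    return hits


# Hypothetical test sequence with a perfect match and a 17 bp spacer.
test_seq = "GG" + MINUS35 + "A" * 17 + MINUS10 + "GGGGGGG"
exact_hits = find_promoter_like(test_seq, max_mm=0)
print(exact_hits)
```

Real promoters rarely match the consensus exactly, so raising max_mm finds more true sites at the cost of more false positives, which is the Sn vs Sp trade-off again.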
What signals are there?
Complex ones in eukaryotes!
Fig 9.13
Mount 2004
Simpler view of complex promoters in eukaryotes:
Fig 5.12
Baxevanis &
Ouellette 2005
Eukaryotic genes are transcribed by
3 different RNA polymerases
Recognize different types of promoters & enhancers:
Brown Fig 9.18, BIOS Scientific Publishers Ltd, 1999
Eukaryotic promoters & enhancers
• Promoters located “relatively” close to initiation site
(but can be located within gene, rather than upstream!)
• Enhancers also required for regulated transcription
(these control expression in specific cell types, developmental
stages, in response to environment)
• RNA polymerase complexes do not specifically
recognize promoter sequences directly
• Transcription factors bind first and serve as
“landmarks” for recognition by RNA polymerase
complexes
Eukaryotic transcription factors
• Transcription factors (TFs) are DNA binding proteins
that also interact with RNA polymerase complex to
activate or repress transcription
• TFs contain characteristic “DNA binding motifs”
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=genomes.table.7039
• TFs recognize specific short DNA sequence motifs
“transcription factor binding sites”
• Several databases for these, e.g. TRANSFAC
http://www.generegulation.com/cgibin/pub/databases/transfac
Zinc finger-containing transcription factors
• Common in eukaryotic proteins
• Estimated 1% of mammalian
genes encode zinc-finger
proteins
• In C. elegans, there are 500!
• Can be used as highly specific DNA binding modules
• Potentially valuable tools for directed genome modification
(esp. in plants) & human gene therapy
Brown Fig 9.12, BIOS Scientific Publishers Ltd, 1999
New Today: Promoter Prediction
Predicting regulatory regions (focus on promoters)
• Brief review of promoters & enhancers
• Predicting promoters: eukaryotes vs prokaryotes
Next week:
• RNA structure & function
Predicting Promoters
• Overview of strategies
• What sequence signals can be used?
• What other types of information can be used?
• Algorithms
• Promoter prediction software
• 3 major types
• many, many programs!
Promoter prediction: Eukaryotes vs prokaryotes
Promoter prediction is easier in microbial genomes
Why?
Highly conserved
Simpler gene structures
More sequenced genomes!
(for comparative approaches)
Methods? Previously, again mostly HMM-based
Now: similarity-based comparative methods,
because so many genomes are available
Predicting promoters: Steps & Strategies
Closely related to gene prediction!
• Obtain genomic sequence
• Use sequence-similarity based comparison
(BLAST, MSA) to find related genes
But: "regulatory" regions are much less well-conserved than coding regions
• Locate ORFs
• Identify TSS (if possible!)
• Use promoter prediction programs
• Analyze motifs, etc. in sequence (TRANSFAC)
Predicting promoters: Steps & Strategies
Identify TSS --if possible?
• One of biggest problems is determining exact TSS!
Not very many full-length cDNAs!
• Good starting point? (human & vertebrate genes)
Use FirstEF
found within UCSC Genome Browser
or submit to FirstEF web server
Fig 5.10
Baxevanis &
Ouellette 2005
Automated promoter prediction strategies
1) Pattern-driven algorithms
2) Sequence-driven algorithms
3) Combined "evidence-based"
BEST RESULTS? Combined, sequential
Promoter Prediction: Pattern-driven algorithms
• Success depends on availability of collections of
annotated binding sites (TRANSFAC & PROMO)
• Tend to produce huge numbers of FPs
• Why?
  • Binding sites (BS) for specific TFs are often variable
  • Binding sites are short (typically 5-15 bp)
  • Interactions between TFs (& other proteins) influence
  affinity & specificity of TF binding
  • One binding site is often recognized by multiple TFs
  • Biology is complex: promoters are often specific to
  organism/cell/stage/environmental condition
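The effect of short binding sites on the FP count can be estimated with simple arithmetic: under a uniform random-sequence model, an exact k-bp site is expected about once every 4^k positions on each strand. The sketch below computes this expectation. The model (equal base frequencies, independent positions, exact matches only), the function name, and the round 3 x 10^9 bp genome size are all simplifying assumptions for illustration.

```python
def expected_random_hits(genome_len, site_len, both_strands=True):
    """Expected exact matches of one specific site in random sequence.

    Assumes independent positions with A, C, G, T each at frequency 0.25.
    """
    positions = max(genome_len - site_len + 1, 0)
    per_position = 0.25 ** site_len
    strands = 2 if both_strands else 1
    return strands * positions * per_position


GENOME_LEN = 3_000_000_000  # rough, hypothetical mammalian-scale genome

for k in (5, 10, 15):
    hits = expected_random_hits(GENOME_LEN, k)
    print(f"{k:2d} bp site: about {hits:,.0f} chance matches expected")
```

Allowing mismatches or degenerate positions, as real binding sites require, only increases these numbers.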
Promoter Prediction: Pattern-driven algorithms
Solutions to problem of too many FP predictions?
• Take sequence context/biology into account
  • Eukaryotes: clusters of TFBSs are common
  • Prokaryotes: knowledge of σ factors helps
• Probability of "real" binding site increases if
annotated transcription start site (TSS) nearby
  • But: What about enhancers? (no TSS nearby!)
  & Only a small fraction of TSSs have been
  experimentally mapped
• Do the wet lab experiments!
  • But: Promoter-bashing is tedious
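One way to use the observation that eukaryotic TFBSs tend to cluster is to keep only predicted sites that have neighbors nearby. The sketch below groups predicted site positions into clusters by a maximum gap and discards isolated hits. The function name, the thresholds (50 bp gap, at least 3 sites), and the example positions are hypothetical choices, not values from the lecture.

```python
def cluster_sites(positions, max_gap=50, min_sites=3):
    """Group site positions into clusters; return (start, end, n_sites).

    Consecutive sites closer than or equal to max_gap join one cluster.
    Clusters with fewer than min_sites sites are discarded.
    """
    clusters = []
    current = []
    for pos in sorted(positions):
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_sites:
                clusters.append((current[0], current[-1], len(current)))
            current = []
        current.append(pos)
    if len(current) >= min_sites:
        clusters.append((current[0], current[-1], len(current)))
    return clusters


# Hypothetical predicted binding-site positions along a sequence.
predicted = [100, 120, 150, 900, 2000, 2030, 2040, 2075]
clusters = cluster_sites(predicted)
print(clusters)
```

The isolated hit at position 900 is dropped, while the two dense groups survive as candidate regulatory regions.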
Promoter Prediction: Sequence-driven algorithms
• Assumption: common functionality can be deduced from
sequence conservation
• Alignments of co-regulated genes should highlight
elements involved in regulation
Careful: How to determine co-regulation?
• Orthologous genes from different species
• Genes experimentally determined to be
co-regulated (using microarrays??)
• Comparative promoter prediction:
"Phylogenetic footprinting" - more later…
Promoter Prediction: Sequence-driven algorithms
Problems:
• Need sets of co-regulated genes
• For comparative (phylogenetic) methods:
  • Must choose appropriate species
  • Different genomes evolve at different rates
  • Classical alignment methods have trouble with
  translocations, inversions in order of functional elements
  • If background conservation of the entire region is
  high, comparison is useless
  • Not enough data (Prokaryotes >>> Eukaryotes)
  • Biology is complex: many (most?) regulatory elements
  are not conserved across species!
Examples of promoter
prediction/characterization software
Lab: used MATCH, MatInspector
TRANSFAC
MEME & MAST
BLAST, etc.
Others?
FirstEF
Dragon Promoter Finder (these are links in PPTs)
also see Dragon Genome Explorer (has specialized
promoter software for GC-rich DNA, finding CpG
islands, etc)
JASPAR
TRANSFAC matrix entry: for TATA box
Fields:
• Accession & ID
• Brief description
• TFs associated with this entry
• Weight matrix
• Number of sites used to build (How many here?)
• Other info
Fig 5.13, Baxevanis & Ouellette 2005
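To make the "weight matrix" and "number of sites" fields concrete, the sketch below converts a count matrix into a log-odds position weight matrix and scans a sequence with it. The counts are a hypothetical toy TATA-like matrix, NOT the actual TRANSFAC entry, and the pseudocount of 1 and uniform background of 0.25 are assumptions. The code also assumes the sequence contains only A, C, G, T.

```python
import math

# Hypothetical toy counts (not the TRANSFAC TATA box matrix).
COUNTS = {
    "A": [1, 18, 0, 19, 17, 18],
    "C": [2, 0, 1, 0, 0, 0],
    "G": [1, 0, 0, 0, 0, 1],
    "T": [16, 2, 19, 1, 3, 1],
}


def sites_used(counts):
    """Number of aligned sites = sum of any one column of the counts."""
    return sum(counts[base][0] for base in "ACGT")


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Per-position log2(observed frequency / background frequency)."""
    width = len(counts["A"])
    pwm = []
    for i in range(width):
        total = sum(counts[b][i] for b in "ACGT") + 4 * pseudocount
        pwm.append({b: math.log2((counts[b][i] + pseudocount) / total
                                 / background) for b in "ACGT"})
    return pwm


def scan(seq, pwm):
    """Score every window of the sequence; return (position, score)."""
    seq = seq.upper()
    width = len(pwm)
    return [(i, sum(pwm[j][seq[i + j]] for j in range(width)))
            for i in range(len(seq) - width + 1)]


pwm = log_odds_matrix(COUNTS)
scores = scan("GGCGTATAAAGGCG", pwm)
best_pos, best_score = max(scores, key=lambda hit: hit[1])
print(f"sites used: {sites_used(COUNTS)}")
print(f"best window starts at {best_pos} with score {best_score:.2f}")
```

In practice a score threshold decides which windows are reported as sites, and that threshold sets the balance between missed sites and false positives.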
Global alignment of human & mouse obese
gene promoters (200 bp upstream from TSS)
Fig 5.14
Baxevanis &
Ouellette 2005
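A human/mouse promoter alignment like this one is the starting point for phylogenetic footprinting: windows that stay conserved against a diverged background are candidate regulatory elements. The sketch below slides a window along a pairwise alignment and reports windows at or above an identity threshold. The function name, window size, threshold, and the two toy aligned sequences are hypothetical, and gap columns are counted as mismatches by assumption.

```python
def conserved_windows(aln_a, aln_b, window=8, min_identity=1.0):
    """Return (start, identity) for alignment windows above a threshold.

    aln_a and aln_b are rows of a pairwise alignment ('-' marks a gap).
    Columns containing a gap never count as matches.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    hits = []
    for start in range(len(aln_a) - window + 1):
        pairs = zip(aln_a[start:start + window], aln_b[start:start + window])
        matches = sum(1 for x, y in pairs if x == y and x != "-")
        identity = matches / window
        if identity >= min_identity:
            hits.append((start, identity))
    return hits


# Toy aligned sequences sharing one conserved 8-column block.
seq_a = "GGCC" + "TATAAAGG" + "ACGT"
seq_b = "ATTA" + "TATAAAGG" + "TTAA"
footprints = conserved_windows(seq_a, seq_b)
print(footprints)
```

If the whole region were highly conserved, every window would pass the threshold and the comparison would say nothing, which is why the choice of species matters.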
Check out optional review &
try associated tutorial:
Wasserman WW & Sandelin A (2004) Applied bioinformatics for
identification of regulatory elements. Nat Rev Genet 5:276-287
http://proxy.lib.iastate.edu:2103/nrg/journal/v5/n4/full/nrg1315_fs.html
Check this out: http://www.phylofoot.org/NRG_testcases/
Annotated lists of promoter databases &
promoter prediction software
• URLs from Mount Chp 9, available online
Table 9.12 http://www.bioinformaticsonline.org/links/ch_09_t_2.html
• Table in Wasserman & Sandelin Nat Rev Genet article
http://proxy.lib.iastate.edu:2103/nrg/journal/v5/n4/full/nrg1315_fs.htm
• URLs for Baxevanis & Ouellette, Chp 5:
http://www.wiley.com/legacy/products/subject/life/bioinformatics/ch05.htm#links
More lists:
• http://www.softberry.com/berry.phtml?topic=index&group=programs&subgroup=promoter
• http://bioinformatics.ubc.ca/resources/links_directory/?subcategory_id=104
• http://www3.oup.co.uk/nar/database/subcat/1/4/