CSCE590/822 Data Mining Principles and Applications

CSCE555 Bioinformatics
Lecture 11 Promoter Prediction
HAPPY CHINESE NEW YEAR
Meeting: MW 4:00PM-5:15PM SWGN2A21
Instructor: Dr. Jianjun Hu
Course page: http://www.scigen.org/csce555
University of South Carolina
Department of Computer Science and Engineering
2008 www.cse.sc.edu.
Outline
• Introduction to DNA Motif
• Motif Representations (Recap)
• Motif database search
• Algorithms for motif discovery

Search Space
Motif width = W
Number of sequences = N
Sequence length = L
Size of search space = $(L - W + 1)^N$
L = 100, W = 15, N = 10 → size ≈ $10^{19}$
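A quick check of the arithmetic above, as a minimal sketch. The function name `search_space_size` and its parameter names are illustrative, not from the slides; it assumes exactly one motif occurrence per sequence and that all sequences have the same length.

```python
def search_space_size(seq_len: int, motif_width: int, num_seqs: int) -> int:
    """Number of ways to choose one motif start position in each sequence."""
    starts_per_seq = seq_len - motif_width + 1  # valid start positions per sequence
    return starts_per_seq ** num_seqs


size = search_space_size(seq_len=100, motif_width=15, num_seqs=10)
print(f"{size:.2e}")  # on the order of 10^19
```

Exhaustive enumeration is clearly out of reach at this size, which is why a sampling-based search is used instead.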
Worked Example
$$\text{score} = \sum_{k=1}^{W} \ln \frac{6 \prod_{i \in \{a,c,g,t\}} c_{ki}!}{(N+3)! \prod_{i \in \{a,c,g,t\}} p_i^{c_{ki}}}$$
$c_{ki}$ = count of nucleotide i at motif position k:

    k    1   2   3   4
    a    0   2   0   3
    c    4   0   2   1
    g    0   1   2   0
    t    0   1   0   0

N = 4
$p_i = 1/4$
$$\prod_i p_i^{c_{ki}} = \left(\frac{1}{4}\right)^N = \frac{1}{256}$$
$$\frac{6}{(N+3)! \prod_i p_i^{c_{ki}}} = \frac{32}{105}$$
Score = 1.99 - 0.50 + 0.20 + 0.60 = 2.29
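The worked example can be reproduced numerically. This is a minimal sketch assuming the score is the sum over columns of ln[6 ∏ c_ki! / ((N+3)! ∏ p_i^c_ki)], which is the form consistent with the 32/105 factor and the four per-column values above. The function names are illustrative, not from the slides.

```python
import math


def column_score(counts, background, n_seqs):
    """ln[ 6 * prod(c_i!) / ((N+3)! * prod(p_i ** c_i)) ] for one motif column."""
    log_num = math.log(6) + sum(math.lgamma(c + 1) for c in counts.values())
    log_den = math.lgamma(n_seqs + 4)  # lgamma(N + 4) == ln((N + 3)!)
    log_den += sum(c * math.log(background[b]) for b, c in counts.items())
    return log_num - log_den


def motif_score(columns, background):
    """Sum of column scores; N is taken from the first column's total count."""
    n_seqs = sum(columns[0].values())
    return sum(column_score(col, background, n_seqs) for col in columns)


background = {b: 0.25 for b in "acgt"}
columns = [  # count matrix from the worked example, one dict per motif position
    {"a": 0, "c": 4, "g": 0, "t": 0},
    {"a": 2, "c": 0, "g": 1, "t": 1},
    {"a": 0, "c": 2, "g": 2, "t": 0},
    {"a": 3, "c": 1, "g": 0, "t": 0},
]
per_column = [round(column_score(c, background, 4), 2) for c in columns]
total = motif_score(columns, background)
print(per_column, round(total, 2))
```

The per-column values round to 1.99, -0.50, 0.20 and 0.60. Summed at full precision the total is about 2.30; the 2.29 on the slide comes from adding the already-rounded terms.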
Gibbs Sampling Search
Suppose the search space is a 2D rectangle. (Typically, more than 2 dimensions!)
Start at a random point X.
Randomly pick a dimension.
Look at all points along this dimension.
Move to one of them randomly, with probability proportional to its score π.
Repeat.
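The steps above can be sketched on a small grid. This is only an illustration under stated assumptions: the grid size, the toy score function `toy_score`, and the function name `gibbs_2d` are all made up for the example, and scores must be positive so they can be used as sampling weights.

```python
import random


def gibbs_2d(score, width, height, steps, rng):
    """Gibbs sampling over a width x height grid; returns visit counts per point."""
    x, y = rng.randrange(width), rng.randrange(height)  # random starting point X
    visits = {}
    for _ in range(steps):
        if rng.random() < 0.5:  # randomly pick a dimension
            weights = [score(i, y) for i in range(width)]  # all points along x
            x = rng.choices(range(width), weights=weights)[0]
        else:
            weights = [score(x, j) for j in range(height)]  # all points along y
            y = rng.choices(range(height), weights=weights)[0]
        visits[(x, y)] = visits.get((x, y), 0) + 1
    return visits


def toy_score(x, y):
    """Hypothetical score: one strong peak at (7, 3), flat elsewhere."""
    return 100.0 if (x, y) == (7, 3) else 1.0


visits = gibbs_2d(toy_score, width=10, height=10, steps=20000, rng=random.Random(0))
most_visited = max(visits, key=visits.get)
print(most_visited)
```

Over many steps the sampler visits each point in proportion to its score, so the high-scoring point dominates the visit counts without the whole grid ever being enumerated at once.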
Gibbs Sampling for Motif Search
Choose a random starting state.
Randomly pick a sequence.
Look at all motif positions in this sequence.
Pick one randomly, with probability proportional to exp(score).
Repeat.
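A minimal sketch of this loop, assuming the worked-example score with a uniform background (p_i = 1/4) and one motif site per sequence. The toy sequences, the planted site "ttgacagc", and the function names are hypothetical and chosen only for illustration.

```python
import math
import random

BASES = "acgt"


def alignment_score(seqs, starts, width, bg=0.25):
    """Sum over columns of ln[6 * prod(c_i!) / ((N+3)! * bg**N)]."""
    n = len(seqs)
    total = 0.0
    for k in range(width):
        column = [s[p + k] for s, p in zip(seqs, starts)]
        total += math.log(6) - math.lgamma(n + 4) - n * math.log(bg)
        total += sum(math.lgamma(column.count(b) + 1) for b in BASES)
    return total


def gibbs_motif_search(seqs, width, iterations, rng):
    """Return (best score seen, start positions that produced it)."""
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]  # random state
    best = (alignment_score(seqs, starts, width), list(starts))
    for _ in range(iterations):
        i = rng.randrange(len(seqs))  # randomly pick a sequence
        candidates = range(len(seqs[i]) - width + 1)
        scores = []
        for p in candidates:  # look at all motif positions in it
            trial = starts[:i] + [p] + starts[i + 1:]
            scores.append(alignment_score(seqs, trial, width))
        top = max(scores)  # subtract the max before exp() for numerical safety
        weights = [math.exp(s - top) for s in scores]  # proportional to exp(score)
        starts[i] = rng.choices(candidates, weights=weights)[0]
        current = alignment_score(seqs, starts, width)
        if current > best[0]:
            best = (current, list(starts))
    return best


seqs = [  # toy data: "ttgacagc" planted at positions 3, 6, 10 and 0
    "gcgttgacagctcgatcgga",
    "atcggattgacagcctagct",
    "ccatagcgtattgacagcga",
    "ttgacagcgatcgtacgtca",
]
best_score, best_starts = gibbs_motif_search(seqs, 8, 200, random.Random(1))
print(round(best_score, 2), best_starts)
```

Whether a given run recovers the planted sites depends on the seed and the number of iterations; in practice the sampler is restarted several times and the best-scoring alignment is kept.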
Does it Work in Practice?
• Only successful cases get published!
• Seems more successful in microbes (bacteria & yeast) than in animals.
• The search algorithm seems to work quite well; the problem is the scoring scheme. Real motifs often don’t score higher than patterns found by chance in random sequences, i.e. the needle looks like hay.
• Attempts to deal with this:
◦ Assume the motif is an inverted palindrome (they often are).
◦ Only analyze sequence regions that are conserved in another species (e.g. human vs. mouse).
• As usual, repetitive sequences cause problems.
More powerful algorithm: MEME
1. Go to our MEME server: http://molgen.biol.rug.nl/meme/website/meme.html
2. Fill in your email address and a description of the sequences
3. Open the FASTA-formatted file you just saved with Genome2d (click “Browse”)
4. Select the number of motifs, the number of sites and the optimum width of the motif
5. Click “Search given strand only”
6. Click “Start search”
The results will arrive by email and are quite self-explanatory.
Promoter Prediction
• What are promoters?
• Three strategies for promoter prediction
◦ Signal based
◦ Comparative genomics/phylogenetic footprinting
◦ Expression-profile-based de novo motif discovery algorithms
What is a Promoter?
Region of gene that binds RNA polymerase and transcription
factors to initiate transcription
Promoters: What signals are there?
Simple ones in prokaryotes
Prokaryotic promoters
• RNA polymerase complex recognizes promoter sequences located very close to & on the 5’ side (“upstream”) of the initiation site
• RNA polymerase complex binds directly to these, with no requirement for “transcription factors”
• Prokaryotic promoter sequences are highly conserved
◦ -10 region
◦ -35 region
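A signal-based scan for the -35 and -10 regions can be sketched as follows. The slide does not give the consensus sequences; this example assumes the textbook E. coli σ70 consensus (TTGACA for -35, TATAAT for -10) and a 16-18 bp spacer between them. The function names, the mismatch limit, and the toy sequence are hypothetical.

```python
def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_promoter_candidates(seq, box35="TTGACA", box10="TATAAT",
                             spacers=range(16, 19), max_mismatches=2):
    """Return (pos35, pos10, total_mismatches) for each -35/-10 pair found.

    Assumed consensus boxes and spacer range; each box may differ from its
    consensus by at most max_mismatches bases.
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - len(box35) + 1):
        m35 = mismatches(seq[i:i + len(box35)], box35)
        if m35 > max_mismatches:
            continue
        for gap in spacers:
            j = i + len(box35) + gap  # start of the candidate -10 box
            window = seq[j:j + len(box10)]
            if len(window) < len(box10):
                break  # ran off the end; larger gaps will too
            m10 = mismatches(window, box10)
            if m10 <= max_mismatches:
                hits.append((i, j, m35 + m10))
    return hits


# Toy upstream region: perfect boxes separated by a 17 bp spacer.
upstream = "GGC" + "TTGACA" + "ATTAATCATCCGGCTCG" + "TATAAT" + "GCACC"
hits = find_promoter_candidates(upstream)
best = min(hits, key=lambda h: h[2])
print(best)
```

Allowing mismatches is what makes such a scan usable on real promoters, and it is also what makes it prone to spurious hits in longer sequences.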
What signals are there?
Complex ones in eukaryotes
Eukaryotic genes are transcribed by
3 different RNA polymerases
They recognize different types of promoters & enhancers.
Eukaryotic promoters & enhancers
• Promoters are located “relatively” close to the initiation site (but can be located within the gene, rather than upstream!)
• Enhancers are also required for regulated transcription (these control expression in specific cell types and developmental stages, and in response to the environment)
• RNA polymerase complexes do not specifically recognize promoter sequences directly
• Transcription factors bind first and serve as “landmarks” for recognition by RNA polymerase complexes
Eukaryotic transcription factors
• Transcription factors (TFs) are DNA-binding proteins that also interact with the RNA polymerase complex to activate or repress transcription
• TFs contain characteristic “DNA binding motifs”
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=genomes.table.7039
• TFs recognize specific short DNA sequence motifs, “transcription factor binding sites”
◦ Several databases for these, e.g. TRANSFAC
http://www.generegulation.com/cgibin/pub/databases/transfac
Zinc finger-containing transcription factors
• Common in eukaryotic proteins
• Estimated 1% of mammalian genes encode zinc-finger proteins
• In C. elegans, there are 500!
• Can be used as highly specific DNA binding modules
• Potentially valuable tools for directed genome modification (esp. in plants) & human gene therapy
Predicting Promoters
Overview of strategies
• What sequence signals can be used?
• What other types of information can be used?
• Algorithms
• Promoter prediction software
◦ 3 major types
◦ many, many programs
Promoter prediction:
Eukaryotes vs prokaryotes
Promoter prediction is easier in microbial genomes
Why?
Highly conserved
Simpler gene structures
More sequenced genomes!
(for comparative approaches)
Methods? Previously, again mostly HMM-based
Now:
• similarity-based
• comparative methods (because so many genomes are available)
• de novo motif discovery
Predicting promoters: Steps & Strategies
Closely related to gene prediction
• Obtain genomic sequence
• Use sequence-similarity based comparison (BLAST, MSA) to find related genes
  But: "regulatory" regions are much less well-conserved than coding regions
• Locate ORFs
• Identify TSS (if possible!), e.g. with FirstEF
• Use promoter prediction programs
• Analyze motifs, etc. in sequence (TRANSFAC)
Automated promoter prediction
strategies
1) Pattern-driven algorithms
2) Sequence-similarity based algorithms
3) Combined "evidence-based"
BEST RESULTS? Combined, sequential
1: Promoter Prediction: Pattern-driven algorithms
• Success depends on availability of collections of annotated binding sites (TRANSFAC & PROMO)
• Tend to produce huge numbers of false positives (FPs)
• Why?
◦ Binding sites (BS) for specific TFs are often variable
◦ Binding sites are short (typically 5-15 bp)
◦ Interactions between TFs (& other proteins) influence affinity & specificity of TF binding
◦ One binding site is often recognized by multiple TFs
◦ Biology is complex: promoters are often specific to organism/cell/stage/environmental condition
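A pattern-driven scan can be sketched with a log-odds weight matrix. The counts below are made up for illustration (8 hypothetical sites that all read TATAAT) and are not taken from TRANSFAC or PROMO; the function names, the pseudocount, and the thresholds are assumptions as well.

```python
import math


def log_odds_matrix(counts, background=0.25, pseudocount=1.0):
    """Turn per-position base counts into log2-odds scores against background."""
    matrix = []
    for col in counts:
        total = sum(col.values()) + 4 * pseudocount
        matrix.append({
            b: math.log2((col[b] + pseudocount) / total / background)
            for b in "ACGT"
        })
    return matrix


def scan(seq, matrix, threshold):
    """Return (position, score) for every window scoring at least threshold."""
    width = len(matrix)
    hits = []
    for i in range(len(seq) - width + 1):
        s = sum(matrix[k][seq[i + k]] for k in range(width))
        if s >= threshold:
            hits.append((i, round(s, 2)))
    return hits


# Hypothetical count matrix: 8 sites, all matching the consensus TATAAT.
consensus = "TATAAT"
counts = [{b: (8 if b == c else 0) for b in "ACGT"} for c in consensus]
matrix = log_odds_matrix(counts)

seq = "GGTATAATGG"
strict = scan(seq, matrix, threshold=9.0)       # only a perfect match passes
permissive = scan(seq, matrix, threshold=-4.0)  # tolerates several mismatches
print(strict, len(permissive))
```

Lowering the threshold to tolerate variable sites is what lets short matrices match all over a genome, which is the false-positive problem described above.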
Solutions to problem of too many FP predictions?
Take sequence context/biology into account
• Eukaryotes: clusters of TFBSs are common
• Prokaryotes: knowledge of σ factors helps
• Probability of a "real" binding site increases if an annotated transcription start site (TSS) is nearby
• But: What about enhancers? (no TSS nearby!) Also, only a small fraction of TSSs have been experimentally mapped
• CpG islands near the promoter, around the TSS
• TATA box, CCAAT box
• Content information: hexamer frequency
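One of the context signals listed above, CpG islands, can be checked with a simple content statistic. This is a minimal sketch: the thresholds (length at least 200 bp, GC content at least 50%, observed/expected CpG above 0.6) are the commonly used rule of thumb and are an assumption here, not something stated on the slide. Function names are illustrative.

```python
def cpg_stats(seq):
    """Return (GC content, observed/expected CpG ratio) for a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")  # CpG dinucleotides on this strand
    gc_content = (c + g) / n if n else 0.0
    # Expected CpG count under independence is (C * G) / n.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_content, obs_exp


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply the assumed rule-of-thumb thresholds to one candidate region."""
    gc, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc >= min_gc and obs_exp > min_obs_exp


print(cpg_stats("CG" * 100), looks_like_cpg_island("CG" * 100))
print(looks_like_cpg_island("AT" * 100))
```

In practice the test is applied in a sliding window along the region upstream of a candidate TSS rather than to one fixed sequence.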
Why can't we rely on consensus sequences?
• The Inr (Initiator) consensus sequence appears about once every 512 bp in random sequences
• For the TATA box, about once every 120 bp
• Short sequence patterns can appear by chance with high likelihood (false positives)
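How often a consensus matches by chance follows directly from its degeneracy. The sketch below assumes equal base frequencies, independent positions, and a single strand; the example patterns are hypothetical and used only to show the calculation, since the slide does not say which consensus definitions give its 512 bp and 120 bp figures.

```python
from fractions import Fraction

# Number of bases matched by each IUPAC nucleotide code.
IUPAC = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3,
    "N": 4,
}


def match_probability(consensus):
    """Chance that a random window matches, with all four bases equally likely."""
    p = Fraction(1)
    for symbol in consensus.upper():
        p *= Fraction(IUPAC[symbol], 4)
    return p


def expected_spacing(consensus):
    """Average number of bp between chance matches (1 / match probability)."""
    return 1 / match_probability(consensus)


for pattern in ("TATAAA", "YYANWYY", "YCANTYY"):  # illustrative patterns only
    print(pattern, expected_spacing(pattern))
```

A fully specified 6-mer is expected once per 4096 bp, while the degenerate 7-mers here come out at once per 128 bp and once per 512 bp. Degenerate positions shrink the spacing quickly, so short, loose patterns match far too often to be trusted on their own.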

2: Promoter Prediction: Phylogenetic Footprinting
• Assumption: common functionality can be deduced from sequence conservation
• Comparative promoter prediction: "phylogenetic footprinting"
• rVista, ConSite, PromH, FootPrinter
For comparative (phylogenetic) methods
• Must choose appropriate species
• Different genomes evolve at different rates
• Classical alignment methods have trouble with translocations and inversions in the order of functional elements
• If the entire region is highly conserved (high background conservation), the comparison is useless
• Not enough data (Prokaryotes >>> Eukaryotes)
Biology is complex: many (most?) regulatory elements are not conserved across species!
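The core of a footprinting scan is a sliding-window conservation profile over an existing pairwise alignment. This is a minimal sketch, not how rVista or ConSite are implemented; the function names, window size, identity cutoff, and toy alignment are all hypothetical, and the two input strings are assumed to be already aligned (equal length, "-" for gaps).

```python
def window_identity(aln_a, aln_b, window=20, step=1):
    """Fraction of identical, non-gap columns in each sliding window."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    results = []
    for start in range(0, len(aln_a) - window + 1, step):
        pairs = zip(aln_a[start:start + window], aln_b[start:start + window])
        matches = sum(x == y and x != "-" for x, y in pairs)
        results.append((start, matches / window))
    return results


def conserved_windows(aln_a, aln_b, window=20, min_identity=0.8):
    """Windows whose identity reaches the cutoff: candidate 'footprints'."""
    return [(s, ident) for s, ident in window_identity(aln_a, aln_b, window)
            if ident >= min_identity]


# Toy alignment: 10 diverged columns followed by 10 identical columns.
species_a = "ACGTACGTAC" + "TTGACATATA"
species_b = "TGCATGCATG" + "TTGACATATA"
footprints = conserved_windows(species_a, species_b, window=10, min_identity=1.0)
print(footprints)
```

If the whole region were equally conserved, every window would pass the cutoff and the profile would carry no information, which is the background-conservation caveat listed above.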
3: Promoter Prediction: Co-expression based algorithms
Problems:
• Need sets of co-regulated genes
• Genes experimentally determined to be co-regulated (using microarrays??)
• Careful: How to determine co-regulation?
Alignments of co-regulated genes should highlight elements involved in regulation
Algorithms:
MEME
AlignACE, PhyloCon
Examples of promoter
prediction/characterization software
MATCH, MatInspector
TRANSFAC
MEME & MAST
BLAST, etc.
Others?
FirstEF
Dragon Promoter Finder
also see Dragon Genome Explorer (has specialized promoter software for GC-rich DNA, finding CpG islands, etc.)
JASPAR
TRANSFAC matrix entry: for TATA box
Fields:
• Accession & ID
• Brief description
• TFs associated with this entry
• Weight matrix
• Number of sites used to build (How many here?)
• Other info
Global alignment of human & mouse obese gene
promoters (200 bp upstream from TSS)
Check out optional review &
try associated tutorial:
Wasserman WW & Sandelin A (2004) Applied bioinformatics for
identification of regulatory elements. Nat Rev Genet 5:276-287
http://proxy.lib.iastate.edu:2103/nrg/journal/v5/n4/full/nrg1315_fs.html
Check this out:
http://www.phylofoot.org/NRG_testcases/
D Dobbs ISU - BCB 444/544X:
Promoter Prediction (really!)
Annotated lists of promoter databases & promoter
prediction software
• URLs from Mount Chp 9, available online
Table 9.12 http://www.bioinformaticsonline.org/links/ch_09_t_2.html
• Table in Wasserman & Sandelin Nat Rev Genet article
http://proxy.lib.iastate.edu:2103/nrg/journal/v5/n4/full/nrg1315_fs.htm
• URLs for Baxevanis & Ouellette, Chp 5:
http://www.wiley.com/legacy/products/subject/life/bioinformatics/ch05.htm#links
More lists:
• http://www.softberry.com/berry.phtml?topic=index&group=programs&subgroup=promoter
• http://bioinformatics.ubc.ca/resources/links_directory/?subcategory_id=104
• http://www3.oup.co.uk/nar/database/subcat/1/4/
Summary
• Promoters & gene regulation
• 3 types of methods for promoter prediction
• Many programs have sensitivity and specificity less than 0.5
• Integrative algorithms are more promising
Acknowledgement

Zhiping Weng (Boston Uni.)
