genstat - University of Illinois at Urbana

Download Report

Transcript genstat - University of Illinois at Urbana

Gene Prediction:
Statistical Methods
(Lecture for CS498-CXZ Algorithms in Bioinformatics)
Sept. 20, 2005
ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign
Many slides are taken/adapted from
http://www.bioalgorithms.info/slides.htm and Ying Xu’s lecture
Approaches to Gene Prediction
•
•
•
Similarity-based approaches:
– Exploit the fact that many genes are conserved across
species
Given a new genome, identify “known genes” first
– Can be highly reliable
– Only good for finding unknown genes
Statistical approaches
– Exploit statistical characteristics of coding regions and noncoding regions and other knowledge about genes
– Can potentially detect new genes
– May not be reliable
Learn from “known genes” to identify new gene
They can/should be combined
– Currently no principled approaches for doing this
Gene Prediction Analogy
•
•
•
Newspaper written in unknown language
– Certain pages contain encoded message, say 99 letters on
page 7, 30 on page 12 and 63 on page 15.
How do you recognize the message? You could probably
distinguish between the ads and the story (ads contain the
“$” sign often)
Statistics-based approach to Gene Prediction tries to make
similar distinctions between exons and introns.
Statistical Approach: Metaphor in Unknown Language
Noting the differing frequencies of symbols (e.g. ‘%’, ‘.’, ‘-’)
and numerical symbols could you distinguish between a story
and the stock report in a foreign newspaper?
A Few Basic Questions
•
What is exactly a gene for the purpose of prediction?
– In Prokaryotes, gene = mRNAProtein
•
– In Eukaryotes, gene = Exon (coding region)
What does a gene look like?
– Where does it start?
– Where does it end?
– What is the codon usage inside a gene (exon)?
– What is the codon usage outside a gene (intron)?
•
– …
How do we exploit such knowledge to identify genes?
Statistical Characteristics of a Gene
• Gene starts with a start codon
• Gene ends at a stop codon
• Splicing signals
• Codon usage distributions
•…
Gene Structure
Genetic Code and Stop Codons
UAA, UAG and
UGA correspond to
3 Stop codons that
(together with Start
codon ATG)
delineate Open
Reading Frames
Six Frames in a DNA Sequence
CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC
CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC
CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC
CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC
GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG
GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG
GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG
GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG
• stop codons
• start codons
The “Sly Fox” & Effect of Base Deletion
• In the following string
THE SLY FOX AND THE SHY DOG
• Delete 1, 2, and 3 nucleotifes after the first ‘S’:
THE SYF OXA NDT HES HYD OG
THE SFO XAN DTH ESH YDO G
THE SOX AND THE SHY DOG
• Which of the above makes the most sense?
Splicing Signals and Exon Boudnaries
Exons are interspersed with introns and
typically flanked by GT and AG
Splicing mechanism
(http://genes.mit.edu/chris/)
Donor and Acceptor Sites:
GT and AG dinucleotides
•
•
The beginning and end of exons are signaled by donor and
acceptor sites that usually have GT and AC dinucleotides
Detecting these sites is difficult, because GT and AC appear
very often
Donor
Site
GT
exon 1
Acceptor
Site
AC
exon 2
Donor and Acceptor Sites: Motif Logos
Donor: 7.9 bits
Acceptor: 9.4 bits
(Stephens & Schneider, 1996)
(http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)
Codon Usage in Human Genome
Biased codon usage in exons allows us to distinguish exons from introns
Codon Frequencies
•
•
Coding sequences are translated into protein sequences
We found the following – the dimer frequency in protein
sequences is NOT evenly distributed
The average frequency is 5%
Some amino acids prefer to be
next to each other
Some other amino acids prefer
to be not next to each other
shewanella
Dicodon Frequencies
•
•
The biased (uneven) dimer frequencies are the foundation of
many gene finding programs!
Basic idea of gene finding – if a dimer has lower than average
dimer frequency; this means that proteins prefer not to have
such dimers in its sequence; otherwise proteins prefer to have
such dimers
Hence if we see a dicodon encoding this dimer, we
may want to bet against this dicodon being in a
coding region!
General Steps for Gene Prediction
• Identify candidate exons in Open Reading
Frames (ORFs)
– Determine ORFs: An ORF starts with a start codon
and ends at a stop codon
– Determine sites for receptors/donors
• Evaluate the potential of a candidate exon for
coding (Exploit codon usages)
Step 1: Identify ORFs
Long vs.Short ORFs
• Long open reading frames may be a gene
•
– At random, we should expect one stop codon
every (64/3) ~= 21 codons
– However, genes are usually much longer than this
A basic approach is to scan for ORFs whose length
exceeds certain threshold
– This is naïve because some genes (e.g. some
neural and immune system genes) are relatively
short
Prediction of Translation Starts
•
Translation start: ATG
ATG ……
•
How to predict a translation start
GCCATGGCGA …..
ACGATGCTGT ….
GACATGGTAC …
AGGATGGGCT …
GCGATGTGGC …
•
Collect a set of experimentally validated translation starts with flanking
regions and align them up ….
Prediction of Translation Starts
•
Certain nucleotides prefer to be in certain position around start “ATG”
and other nucleotides prefer not to be there
ATG
A
C
G
T
-4
•
•
-3
-2
-1
+3
+4
+5
+6
The “biased” nucleotide distribution is information! It is a basis for
translation start prediction
Question: which one is more probable to be a translation start?
CACC ATG GC
TCGA ATG TT
Prediction of Translation Starts
•
•
Mathematical model: Fi (X): frequency of X (A, C, G, T) in position i
Score a string by i log (Fi (X)/0.25)
CACC ATG GC
TCGA ATG TT
log (58/25) + log (49/25) + log (40/25) +
log (50/25) + log (43/25) + log (39/25) =
log (6/25) + log (6/25) + log (15/25) +
log (7/25) + log (13/25) + log (14/25) =
0.37 + 0.29 + 0.20 + 0.30 + 0.24 + 0.20
-(0.62 + 0.62 + 0.22 + 0.55 + 0.28 +
0.25)
= 1.60
= -2.54
The model captures our intuition!
A
C
G
T
Prediction of Translation Starts
•
•
Build a mathematical model, based on collected translation start
sequence
For each candidate translation start sequence, apply the model and get a
score
ATG ……
•
If the score if larger than zero, predict it is a “translation start”; the
higher score, the higher the probability the prediction is true
Step 2: Identify Exon
Boundaries
Prediction of Splice Junction Sites
•
•
•
A start exon starts with a translation start and ends with a donor site
An internal exon starts with an acceptor site and ends with a donor site
A terminal exon starts with an acceptor site and ends with a stop codon
exon
{ translation start,
acceptor site }
{ translation stop,
donor site }
Accurate prediction of exons/genes requires
accurate prediction of splice junctions
Prediction of Splice Junction Sites
•
Splice junctions:
– donor site: coding region | GT
– acceptor: AG | coding region
•
•
Like translation starts, the flanks of splice junctions (acceptors and
donors) show “biased” distributions of nucleotides in certain positions
These biased distributions of nucleotides are the basis for prediction of
splice junctions
Prediction of Acceptor Sites
•
Nucleotide distribution in the flanks of acceptors
Multiple positions have high “information content”
Information content:
 log (F (X)/0.25)
If every nucleotide has 0.25 frequency in a position,
then the position’s information content is ZERO.
Use “information content” as a criterion
for determining the length of flanks
Prediction of Acceptor Sites
•
•
•
Mathematical model: Fi (X): frequency of X (A, C, G, T) in position i
Score a segment as a candidate acceptor site by i log (Fi (X)/0.25)
For each candidate acceptor sequence, apply the model and get a score
AG
•
If the score if larger than zero, predict it is an “acceptor”; the higher
score, the higher the probability the prediction is true
Prediction of Donor Sites
•
Nucleotide distribution in the flanks of donors
•
•
Mathematical model: Fi (X): frequency of X (A, C, G, T) in position i
Score a segment as a possible donor site by i log (Fi (X)/0.25)
Prediction of Donor Sites
•
For each candidate donor sequence, apply the model and get a
score
GT
•
If the score is larger than zero, predict it is a “donor”; the higher
score, the higher the probability the prediction is true
Prediction of Donors/Acceptors
•
Position specific weight matrix model
•
Build a “position specific weight matrix model”
•
–
collect known {donor, acceptor} sequences and align them so that the GT or AG are
aligned
–
Calculate the percentage of each type of nucleotide at each position
There are more sophisticated models for capturing higher order relationships
between positions
Prediction of Exons
•
For each orf, find all donor and acceptor candidates by finding GT and
YAG motifs
CAG
•
•
•
GT
CAG GT
GT
Score each donor and acceptor candidate using our position-specific
weight matrix models
Find all pairs of (acceptor, donor) above some thresholds
Score the coding potential of the segment [donor, acceptor], using the
hexmer model
Step 3: Classify Candidate
Exons
Testing Exons: Codon Usage
• Create a 64-element hash table and count the
frequencies of codons in a candidate exon
• Amino acids typically have more than one
codon, but in nature certain codons are more
in use
• Uneven use of the codons may characterize
a real gene
Codon Usage and Likelihood Ratio
•
•
•
•
•
An ORF is more “believable” than another if it has more “likely”
codons
Do sliding window calculations to find ORFs that have the
“likely” codon usage
Allows for higher precision in identifying true ORFs; much
better than merely testing for length.
However, average vertebrate exon length is 130 nucleotides,
which is often too small to produce reliable peaks in the
likelihood ratio
Further improvement: in-frame hexamer count (frequencies of
pairs of consecutive codons)
Codon Usage in Mouse Genome
AA codon
Ser TCG
Ser TCA
Ser TCT
Ser TCC
Ser AGT
Ser AGC
/1000
4.31
11.44
15.70
17.92
12.25
19.54
frac
0.05
0.14
0.19
0.22
0.15
0.24
Pro
Pro
Pro
Pro
6.33
17.10
18.31
18.42
0.11
0.28
0.30
0.31
CCG
CCA
CCT
CCC
AA codon
Leu CTG
Leu CTA
Leu CTT
Leu CTC
/1000
39.95
7.89
12.97
20.04
frac
0.40
0.08
0.13
0.20
Ala
Ala
Ala
Ala
GCG
GCA
GCT
GCC
6.72
15.80
20.12
26.51
0.10
0.23
0.29
0.38
Gln
Gln
CAG
CAA
34.18
11.51
0.75
0.25
Exon Prediction Method 1:
TestCode
TestCode
• Statistical test described by James Fickett in
1982: tendency for nucleotides in coding regions
to be repeated with periodicity of 3
– Judges randomness instead of codon
frequency
– Finds “putative” coding regions, not introns,
exons, or splice sites
• TestCode finds ORFs/Exons based on
compositional bias with a periodicity of three
TestCode Statistics
•
Define a window size no less than 200 bp, slide the
window the sequence down 3 bases. In each window:
– Calculate for each base {A, T, G, C}
• max (n3k+1, n3k+2, n3k) / min ( n3k+1, n3k+2, n3k)
• Use these values to obtain a probability from a
lookup table (which was a previously defined and
determined experimentally with known coding and
noncoding sequences
TestCode Statistics (cont’d)
• Probabilities can be classified as indicative
of " coding” or “noncoding” regions, or “no
opinion” when it is unclear what level of
randomization tolerance a sequence carries
• The resulting sequence of probabilities can
be plotted
TestCode Sample Output
Coding
No opinion
Non-coding
Exon Prediction Method 2:
Likelihood Ratio/Suprevised
Learning
Prediction of Coding Regions
•
Build a coding-region predictor
– Collect coding and non-coding sequences from GenBank (NCBI)
– Calculate hexmer frequencies in coding and noncoding sequences (hexmer =
6-mer, or dimer of amino acids)
•
Application – coding-region prediction
– consider both strands: forward and reverse
– for each strand, identify all the orfs
– for prokaryotic genome, for each possible translation start ATG, evaluate the
coding of [ATG, STOP]
– for eukaryotic genome, for the segment defined by each pair of possible
start/acceptor & donor/stop, evaluate the coding potential
Prediction of Exons
•
•
For each segment [acceptor, donor], we get three scores (coding
potential, donor score, acceptor score)
Various possibilities
– all three scores are high – probably true exon
– all three scores are low – probably not a real exon
– all in the middle -- ??
– some scores are high and some are low -- ??
•
What are the rules for exon prediction?
Prediction of Exons
•
Learning to classify exons from nonexons
– Collect a set of exons and non-exons
– Score them using our scoring schemes
– Plot them as follows
– “draw” a separating line between exons and non-exons
•
Making a prediction based on which side of the separating line a new
point falls
coding:
noncoding:
Prediction of Exons
•
•
A “classifier” can be trained to separate exons from non-exons, based
on the three scores
Closer to reality – other factors could also help to distinguish exons
from non-exons
exon length
distribution
150 bp
coding density is different in regions
with different G+C contents
50% G+C
A practical gene finding software may use many
features to distinguish exons from non-exons
Prediction of Exons
Each box represents
a predicted exon
A true exon typically has more than one predicted
candidates, overlapping with each other
Gene Prediction in a New Genome
•
Dicodon (hexmer) frequencies are different from genome to genome –
gene finder for one genome cannot be directly applied to another
genome
bovine
shewanella
Popular Gene Prediction Algorithms
• GENSCAN: uses Hidden Markov Models
(HMMs)
• TWINSCAN
– Uses both HMM and similarity (e.g.,
between human and mouse genomes)
HMM will be covered later in the course
What You Should Know
• What is an open reading frame (ORF)
• What is splicing signal
• How the likelihood ratio method works