Eukaryotic Gene Finding

Download Report

Transcript Eukaryotic Gene Finding

Eukaryotic Gene Finding
Adapted in part from
http://online.itp.ucsb.edu/online/infobio01/burge/
Prokaryotic vs. Eukaryotic
Genes
Prokaryotes
small genomes
high gene density
no introns (or splicing)
no RNA processing
similar promoters
overlapping genes
Eukaryotes
large genomes
low gene density
introns (splicing)
RNA processing
heterogeneous
promoters
polyadenylation
Pre-mRNA Splicing
exon definition
intron definition
SR proteins
...
5 ’ splice signal
exonic repressor
branch signal
intronic enhancers
3 ’ splice signal
5 ’ splice signal
polyY
exonic enhancers
intronic repressor
(assembly of
spliceosome, catalysis)
...
Some Statistics
• On average, a vertebrate gene is about 30KB
long
• Coding region takes about 1KB
• Exon sizes can vary from double digit
numbers to kilobases
• An average 5’ UTR is about 750 bp
• An average 3’UTR is about 450 bp but both
can be much longer.
Human Splice Signal Motifs
5' splice signal
3' splice signal
Semi-Markov HMM Model
GHMM
•
•
•
•
A finite Set Q of states
Initial state distribution Π
Transition probabilities Ti,j for
Length distribution f of the states (fq is the
length distribution of state q)
• Probability model for each state
GHMM – contin.
• A parse Ф of a sequence S of length L is an ordered
sequence of states (q1, . . . , qt) with an associated duration
di to each state
The most probable pass Фopt can be computed as in Veterbi algorithm
Genscan HSMM
GenScan States
• N - intergenic region
• P - promoter
• F - 5’ untranslated region
• Esngl – single exon (intronless) (translation
start -> stop codon)
• Einit – initial exon (translation start ->
donor splice site)
• Ek – phase k internal exon (acceptor
splice site -> donor splice site)
• Eterm – terminal exon (acceptor splice site
-> stop codon)
• Ik – phase k intron: 0 – between codons; 1
– after the first base of a codon; 2 – after
the second base of a codon
GenScan features
• Model both strands at once
• Each state may output a string of symbols (according
to some probability distribution).
• Explicit intron/exon length modeling
• Advanced splice site modeling
• Complete intron/exon annotation for sequence
• Able to predict multiple genes and partial/whole
genes
• Parameters learned from annotated genes
• Separate parameter training for different CpG content
groups (< 43%, 43-51%, 51-57%,>57% CG content)
Various parameters in GENSCAN
GenScan Signal Modeling
• PSSM:
P(S) = P1(S1)•P2(S2) •…•Pn(Sn)
– PolyA signal
– Translation initiation/termination signal
– Promoters
• WAM: P(S) = P1(S1) •P2(S2|S1)•…•Pn(Sn|Sn-1)
– 5’ and 3’ splice sites
GENSCAN Performance
• > 80% correct exon predictions, and > 90%
correct coding/non coding predictions by
bp.
• BUT - the ability to predict the whole gene
correctly is much lower
HMM-based Gene Finding

GENSCAN (Burge 1997)

FGENESH (Solovyev 1997)

HMMgene (Krogh 1997)

GENIE (Kulp 1996)

GENMARK (Borodovsky & McIninch 1993)

VEIL (Henderson, Salzberg, & Fasman 1997)
Using Sequence Similarity for Gene Finding
1.
Compare genomic sequence with expressed sequence tags (ESTs)
(e.g. by BLASTN), to identify regions corresponding to processed
mRNA
2. Compare genomic sequence to Protein DB (e.g. by BLASTX), to
identify probably coding regions
3. “Spliced Alignment” of genomic sequence of a complete gene with
a homologous protein sequence (e.g. by PROCRUSTES) may
enable exon/intron reconstruction
4. Compare predicted peptides (e.g. by GENSCAN) with protein DB
to assign confidence to predictions and functional annotations
5. Compare Genomic sequence with homologous from close
organisms/species (e.g. by BLAST, CLASTW), to identify
conserved regions which might correspond to coding regions and
DNA signals
GenomeScan
• Idea: We can enhance our gene prediction by using
external information: DNA regions with homology to
known proteins are more likely to be coding exons.
• Combine probabilistic ‘extrinsic’ information (BLAST
hits) with a probabilistic model of gene
structure/composition (GenScan)
• Focus on ‘typical case’ when homologous but not identical
proteins are available.
GeneWise [Birney, Amitai]
• Motivation: Use good DB of protein
world (PFAM) to help us annotate
genomic DNA
• GeneWise algorithm aligns a profile
HMM directly to the DNA
Sample GeneWise Output
Developing GeneWise Model
• Start with a PFAM domain HMM
• Replace AA emissions with codon
emissions
P(codon | Mi )   P(codon | aa)P(aa | Mi )
•Allow for sequencing errors
(deletions/insertions)
•Add a 3-state intron model
GeneWise Model
GeneWise Intron Model
PY tract
central
5’ site
spacer
3’ site
GeneWise Model
• Viterbi algorithm -> “best” alignment of
DNA to protein domain
• Alignment gives exact exon-intron
boundaries
• Parameters learned from speciesspecific statistics
GeneWise problems
• Only provides partial prediction, and only
where the homology lies
– Does not find “more” genes
• Pseudogenes, Retrotransposons picked up
• CPU intensive
– Solution: Pre-filter with BLAST
Other Sequence Usage
• Search translated genomic sequences for the
occurrences of the shot peptide motifs that are
characteristic of common protein families (e.g.
zinc finger, ATP/GTP binding modifs etc.)
• Identify sequences which are probably NON
coding: identify known classes of interspersed
repeates (e.g LINE SINE) in none coding regions.
Can be essential to remove these before simple
BLAST is done against EST’s.
Summary
• Genes are complex structures which
are difficult to predict with the required
level of accuracy/confidence
• Different approaches to gene finding:
– Ab Initio : GenScan
– Ab Initio modified by BLAST homologies:
GenomeScan
– Homology guided: GeneWise
Future Directions
• Find genes not for proteins (tRNA, rRNA,
smRNA) – hard !
• Deal better with overlapping genes, multiple genes
in a single sequence
• Alternative splicing/transcription/translation – a
whole separate issue
– The mechanisms governing it, the signals
– predicting the various genes
Very important !