Gene Prediction in Eukaryotes


Genomics:
Gene prediction and Annotations
Kishor K. Shende
Information Officer
Bioinformatics Center,
Barkatullah University Bhopal
Gene Prediction Strategies
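Prokaryotes Gene Architecture
[Figure: a promoter with -35 and -10 elements precedes the initiation codon (ATG); a single transcript carries the coding regions for Protein 1, Protein 2 and Protein 3, each ending at a termination codon (TAA, TAG or TGA).]
Eukaryotes Gene Architecture
[Figure: a regulatory sequence precedes the gene; the initiation codon (ATG) opens Exon-1, which is followed by Intron-1 and Exon-2, with splicing sites at the exon-intron boundaries; the gene ends at a termination codon (TAA, TAG or TGA).]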
Codon Usage Tables
• Most amino acids can be encoded by several codons
• Each organism has a characteristic pattern of codon usage
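A codon usage table of this kind can be tallied directly from a known coding sequence. The following is a minimal Python sketch, assuming the standard genetic code and an in-frame sequence; the function name codon_usage and the example sequence are illustrative and do not come from the slides.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, built from the usual T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_usage(cds):
    """Return {amino_acid: {codon: fraction}} for an in-frame coding sequence."""
    cds = cds.upper()
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    per_aa = defaultdict(dict)
    for codon, n in counts.items():
        aa = CODON_TABLE.get(codon)
        if aa is None:  # skip codons containing ambiguous bases such as N
            continue
        per_aa[aa][codon] = n
    usage = {}
    for aa, codon_counts in per_aa.items():
        total = sum(codon_counts.values())
        usage[aa] = {c: n / total for c, n in codon_counts.items()}
    return usage


# Toy example: leucine is encoded twice by CTG and once by CTA.
usage = codon_usage("ATGCTGCTGCTATAA")
print(usage["L"])
```

In practice the table would be built from many known genes of one organism, so that the fractions reflect that organism's preferences rather than a single sequence.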
Problems in Gene Prediction
• Distinguishing pseudogenes from genes
• Exon-intron structure in eukaryotes: exon-flanking regions are not very well conserved
• Alternative splicing: shuffling of exons
• Genes can overlap each other and occur on different strands of DNA
Gene Identification
1. Homology-Based Gene Prediction
• Sequence similarity search against gene databases using BLAST and FASTA search tools
• EST (Expressed Sequence Tag) similarity search
2. Ab initio Gene Prediction
• Prokaryotes
- ORF finding
• Eukaryotes
- Promoter prediction
- Start-stop codon prediction
- Splice site prediction (exon-intron and intron-exon)
- PolyA signal prediction
ORF Finding in Prokaryotes
Easier due to:
• Small genomes with high gene density (Haemophilus influenzae: 85% genic)
• No introns or few introns
• Operons
- One transcript, many genes
• Open Reading Frames (ORFs)
- Contiguous set of codons that starts with the Met codon and ends with a stop codon
1. ORF Finding:
• Simplest method
• An ORF is a length of DNA sequence that contains a contiguous set of codons, each of which specifies an amino acid
• Six possible reading frames
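The six-frame scan can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard start codon ATG and stop codons TAA, TAG and TGA; the function names and the min_codons threshold are hypothetical choices, and real ORF finders add length and scoring filters.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the antisense strand read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=2):
    """Yield (strand, frame, start, end, orf) for each ATG...stop ORF.

    All six reading frames are scanned. start and end are 0-based
    positions on the strand that was scanned, and the stop codon is
    included in the reported ORF.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens the ORF
                elif start is not None and codon in STOPS:
                    if (i + 3 - start) // 3 >= min_codons:
                        yield strand, frame + 1, start, i + 3, s[start:i + 3]
                    start = None  # look for the next ORF in this frame


for orf in find_orfs("CCATGAAATGACC"):
    print(orf)
```

Only frame 3 of the sense strand contains a complete ORF in this toy sequence; the other five frames either lack a start codon or never reach an in-frame stop.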
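[Figure: the six possible reading frames. Frames 1, 2 and 3 on the sense strand (5’ to 3’) begin at positions 1, 2 and 3 of the sequence, and three more frames are read from the antisense strand; the start codon ATG is marked.]
Central Dogma
[Figure: DNA → mRNA → Protein, with the start codon marked on the mRNA (5’ to 3’).]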
ORF Prediction:
Based on the position of the start codon and stop codon
[Figure: an ORF (protein-coding region) runs from the start codon AUG to a stop codon, UGA or UAA or UAG. A reading frame with many in-frame stop codons does not code for a protein.]
Example of ORF
There are six possible reading frames in each sequence, three for each direction of transcription.
Difficulty in ORF Prediction:
1. Prokaryotes & Viruses: Multiple genes may be present on one mRNA, and genes may overlap, so that two different proteins are encoded in different reading frames of the same mRNA
2. Eukaryotes: Protein-coding regions (exons) are interrupted by non-coding regions (introns)
3. Differential mRNA splicing creates different mRNAs, hence different proteins
4. Variation of the genetic code from the universal code
Reliability of ORF Prediction: Characteristics of ORF regions
1. An ordered list of specific codons that reflects the evolutionary origin of the gene and the constraints associated with gene expression
2. A characteristic pattern of use of synonymous codons, i.e. codons that stand for the same amino acid
3. In eukaryotes, strong preferences for codon pairs at intron-exon or exon-intron junctions
4. Organisms with high genomic GC content have a strong bias toward G and C in the third codon position
Three Tests of an ORF
Tests have been devised to verify that a predicted ORF is in fact likely to encode a protein.
First Test: Based on an unusual type of sequence variation that is found in ORFs: every third base tends to be the same much more often than expected by chance.
Second Test: The ORF is analyzed to determine whether its codons correspond to those used in other genes of the same organism.
Third Test: The ORF may be translated into an amino acid sequence, and the resulting sequence then compared to the databases of existing sequences.
Repeated Sequence Elements and Nucleosome Structure
1. Eukaryotic DNA is wrapped around histone protein complexes
2. Some base pairs in the major or minor grooves of the DNA molecule face the nucleosome surface
3. Other base pairs face the outside of the structure
4. Nucleosomes located in promoter regions are remodeled in a manner that can make binding sites for regulatory proteins more or less available
Hidden Markov Model (HMM) of Eukaryotic Internal Exon
Computational Background: Repeated sequence patterns have been found in the introns and exons and near the transcription start site of eukaryotic genes
Bending Pattern: Bending is influenced by
1. A repeated pattern, i.e. (not T), (A or G), G
2. The AA/TT dinucleotide
Ab initio gene prediction
Predictions are based on the observation that gene DNA sequence is not random:
- Gene-coding sequence has start and stop codons.
- Each species has a characteristic pattern of synonymous codon usage.
- Non-coding ORFs are very short.
- A gene would correspond to the longest ORF.
These methods look for the characteristic features of genes and score them high.
Ab initio gene prediction methods
• GeneScan: Fourier transform of the DNA sequence to find characteristic patterns.
• GeneParser: predicts the most likely combination of exons/introns. Dynamic programming.
• GeneMark: mostly for prokaryotes, Hidden Markov Models. Also for eukaryotes.
• Grail II: predicts exons, promoters, Poly(A) sites. Neural network plus dynamic programming.
Gene Preference Score:
An important indicator of a coding region
Observation: frequencies of codons and codon pairs in coding and non-coding regions are different.
Given a sequence of codons
$C = c_1 c_2 \ldots c_n$
and assuming independence, the probability of finding it in a coding region is
$P(C) = \prod_{i=1}^{n} P(c_i)$
The probability of finding the sequence "C" in non-coding regions is
$P_0(C) = \prod_{i=1}^{n} P_0(c_i)$
The gene preference score:
$GPS = \log\left(\frac{P(C)}{P_0(C)}\right)$
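Under the independence assumption, the log of the ratio becomes a sum of per-codon log ratios, which is easy to compute. The Python sketch below shows this; the function name is illustrative and the codon frequencies are made-up placeholder values, not measured data.

```python
import math


def gene_preference_score(codons, p_coding, p_noncoding):
    """Return log(P(C) / P0(C)) for a codon sequence, assuming independent codons.

    p_coding and p_noncoding map each codon to its frequency in coding
    and non-coding regions respectively.
    """
    return sum(math.log(p_coding[c] / p_noncoding[c]) for c in codons)


# Hypothetical frequencies, for illustration only.
p_coding = {"ATG": 0.04, "GCC": 0.03, "TTT": 0.01}
p_noncoding = {"ATG": 0.02, "GCC": 0.015, "TTT": 0.02}

score = gene_preference_score(["ATG", "GCC", "GCC"], p_coding, p_noncoding)
print(score)  # positive: the codons are more typical of coding regions
```

A positive score means the sequence is more likely under the coding model, a negative score that it looks more like non-coding DNA. Summing logs rather than multiplying probabilities also avoids numerical underflow on long sequences.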
Confirming gene location using EST libraries
• Expressed Sequence Tags (ESTs): sequenced short segments of cDNA. They are organized in the database “UniGene”.
• If a region matches ESTs with high statistical significance, then it is a gene or pseudogene.
Gene prediction accuracy
True positives (TP) – nucleotides, which are
correctly predicted to be within the gene.
Actual positives (AP) – nucleotides, which
are located within the actual gene.
Predicted positives (PP) – nucleotides, which
are predicted in the gene.
Sensitivity = TP / AP
Specificity = TP / PP
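These two ratios can be computed from the coordinates of the annotated and predicted genes. The Python sketch below is a minimal illustration; it assumes half-open (start, end) intervals, and the function name and example coordinates are hypothetical.

```python
def nucleotide_accuracy(actual, predicted):
    """Return (sensitivity, specificity) at the nucleotide level.

    actual and predicted are lists of half-open (start, end) intervals.
    Sensitivity = TP / AP and specificity = TP / PP, as defined above.
    """
    actual_pos = set()
    for start, end in actual:
        actual_pos.update(range(start, end))
    predicted_pos = set()
    for start, end in predicted:
        predicted_pos.update(range(start, end))
    tp = len(actual_pos & predicted_pos)  # nucleotides correctly predicted as genic
    sensitivity = tp / len(actual_pos) if actual_pos else 0.0
    specificity = tp / len(predicted_pos) if predicted_pos else 0.0
    return sensitivity, specificity


# Actual gene at 100-200, predicted gene at 150-250: half of each overlaps.
print(nucleotide_accuracy([(100, 200)], [(150, 250)]))
```

Using sets of positions keeps the sketch short; for whole genomes an interval-based overlap calculation would use far less memory.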
Common Difficulties of Gene Prediction
• First and last exons are difficult to annotate because they contain UTRs.
• Smaller genes are not statistically significant, so they are thrown out.
• Algorithms are trained with sequences from known genes, which biases them against genes about which nothing is known.
Genome Analysis for Gene Prediction
Genome analysis
Genome: the sum of genes and intergenic sequences of a haploid cell.
The value of genome sequences lies in their annotation
• Annotation: characterizing genomic features using computational and experimental methods
Genes: levels of annotation
• Gene prediction: where are the genes?
• What do they encode?
• Which proteins/pathways are they involved in?
Flowchart: Gene Prediction Process
Start from the genomic DNA sequence.
Try this first, using BLAST & FASTA, PSI-BLAST, PHI-BLAST & other BLAST/FASTA programs & EST, cDNA database searches:
1. Translate in all six reading frames & compare to a protein sequence database
2. Perform a database similarity search of the EST database of the same organism
Further steps:
• Use a gene prediction program to locate genes: ORF finding, promoter, splicing sites, poly-A tail, 5’ UTR, 3’ UTR
• Analyze the regulatory sequences in the gene
• Compare with the genome of another organism
Let’s have some practice on gene finding using some gene finding programs
1. GeneMark (http://exon.gatech.edu/GeneMark/)
2. Genscan (http://genes.mit.edu/GENSCAN.html)
3. Grail II (http://compbio.ornl.gov/Grail-1.3/)
4. Gene Finder in GlimmerM (http://www.tigr.org/tdb/glimmerm/glmr_form.html)
HMMgene - Prediction of genes in vertebrate and C. elegans
Gene Discovery Page
FramePlot - protein-coding region prediction tool for high GC-content bacteria
tRNAscan-SE Search for transfer RNA genes in genomic sequence
NETGENE - Predict splice sites in human genes
ORF Finder
BCM Gene Finder
Grail
Genemark
Genie: A Gene Finder Based on Generalized Hidden Markov Models
GENSCAN - predict complete gene structures
Splice Site Prediction by Neural Network
Procrustes
GenePrimer
GenLang
MZEF Gene Finder
Webgene - Tools for prediction and analysis of protein-coding gene structure
MAR-Finder - Nuclear matrix attachment region prediction
Glimmer bacterial/archaeal gene finder
Promoter Region, Transcription Factor and Signals
TRANSFAC - Transcription Factor database
TFD Transcription Factor Database
TransTerm - A Translational Signal Database
PLACE - a database of plant cis-acting regulatory DNA elements
NNPP: Promoter Prediction by Neural Network
FastM/ModelInspector
TFSEARCH
MatInd and MatInspector
Transcription Element Search Software (TESS)
CorePromoter (Core-Promoter Prediction Program)
Gene Express - analysis of genomic regulatory sequences
Signal Scan
PromoterInspector
Promoter Scan II
Pol3scan
TargetFinder - finds DNA-binding proteins.
Overview
GENE PREDICTION TOOLS
GeneMark (http://exon.gatech.edu/GeneMark/)
Mark Borodovsky's Bioinformatics Group at the Georgia Institute of Technology, Atlanta, Georgia
GeneMark.hmm for Prokaryotes (Version 2.4)
Reference: Lukashin A. and Borodovsky M., GeneMark.hmm: new solutions for gene finding, NAR, 1998, Vol. 26, No. 4, pp. 1107-1115
For bacterial and archaeal gene prediction, you can use the parallel combination of the GeneMark and GeneMark.hmm programs
Heuristic Approach for Gene Prediction in Prokaryotes
If the DNA sequence of interest belongs to a species whose name is not in the list of available models, use the Heuristic models option
Self-Training Program GeneMarkS
If the sequence is longer than 1 Mb, generate models with the self-training program GeneMarkS
Gene Prediction in Eukaryotes
Eukaryotic gene prediction: use the parallel combination of GeneMark and GeneMark.hmm
Select the related organism from this list
Gene Prediction in EST and cDNA
To analyze ESTs and cDNAs
Gene Prediction in Viruses
Viral gene prediction through the virus database “VIOLIN”
GeneMark Output
New GENSCAN Web Server at MIT
GENSCAN Output
GrailEXP
1. Locate protein-coding genes within DNA sequence,
2. Locate EST/mRNA alignments,
3. Locate certain types of promoters, polyadenylation sites, CpG islands, and repetitive elements.
GrailEXP is a gene finder combined with:
1. an EST alignment utility,
2. an exon prediction program,
3. a promoter/polyA recognizer,
4. a CpG island finder,
5. a repeat masker.
GrailEXP predicts exons, genes, promoters, polyAs, CpG islands, EST similarities, and repetitive elements within DNA sequence
GlimmerM: http://www.tigr.org/tdb/glimmerm/glmr_form.html
A system for finding genes in microbial DNA, especially the genomes of bacteria and archaea. Glimmer (Gene Locator and Interpolated Markov Modeler) uses interpolated Markov models (IMMs) to identify the coding regions and distinguish them from noncoding DNA.
GlimmerHMM: for eukaryotic organisms
GeneSplicer: fast, flexible system for detecting splice sites in the genomic DNA of various eukaryotes.
GlimmerM Gene Finder