Advancing Science with DNA Sequence
Finding the genes in
microbial genomes
Natalia Ivanova
MGM Workshop
September 16, 2008
Outline
1. Introduction (who said annotating
prokaryotic genomes is easy?)
2. Tools out there
3. Basic principles behind tools
4. Known problems of the tools:
why you may need manual
curation
Finding the genes in microbial genomes
Sequence features in prokaryotic genomes:
• stable RNA-coding genes (rRNAs, tRNAs, RNA component of RNaseP, tmRNA)
• protein-coding genes (CDSs)
• transcriptional features (mRNAs, operons, promoters, terminators, protein-binding sites, DNA bends)
• translational features (RBS, regulatory antisense RNAs, mRNA secondary structures, translational recoding and programmed frameshifts, inteins)
• pseudogenes (tRNA and protein-coding genes)
• …
[Figure: well-annotated bacterial genome in the Artemis genome viewer. Labeled features: rRNA, tRNA, operon, promoter, terminator, protein-coding gene, CDS, protein-binding site]
Outline
1. Introduction (who said annotating
prokaryotic genomes is easy?)
2. Tools out there (don’t bother to write
down the names and links, all presentations
will be available on the web site)
3. Basic principles behind tools
4. Known problems of the tools:
why you may need manual
curation
Tools out there: servers for microbial genome annotation - I
• IMG-ER
http://img.jgi.doe.gov/er
IMG-ER submission page: http://durian.jgi-psf.org/~imachen/cgi-bin/Submission/main.cgi
Output: stable RNA-encoding genes, CDSs, functional annotations; output in GenBank format
• RAST
http://rast.nmpdr.org/
Output: rRNAs and tRNAs, CDSs, functional annotations; output in several formats
• JCVI Annotation Service
http://www.tigr.org/tigr-scripts/AnnotationEngine/ann_engine.cgi
Output: CDSs, stable RNAs?, functional annotations; format?
Tools out there: servers for microbial genome annotation - II
• REGANOR
https://www.cebitec.uni-bielefeld.de/groups/brf/software/reganor/
Output: rRNAs and tRNAs, CDSs; output in gff format
• RefSeq
http://www.ncbi.nlm.nih.gov/genomes/MICROBES/genemark.cgi
http://www.ncbi.nlm.nih.gov/genomes/MICROBES/glimmer_3.cgi
Output: CDSs; output in tbl format
• EasyGene
http://www.cbs.dtu.dk/services/EasyGene/
Output: CDSs; size restriction <1Mb
Tools out there: genome browsers for manual annotation of microbial genomes
• Artemis
http://www.sanger.ac.uk/Software/Artemis/
Windows and Linux versions; works with files in many formats, annotated by any pipeline
• Manatee
http://manatee.sourceforge.net/
Linux versions only; genome needs to be annotated by the JCVI Annotation Service
• Argo
http://www.broad.mit.edu/annotation/argo/
Windows and Linux; works with files in many formats
Major difference: viewer vs editor?
Tools out there: tools for finding stable (“non-coding”) RNAs - I
• Large structural RNAs (23S and 16S rRNAs)
The only known tool is search_for_RNAs script developed by Niels Larsen and available upon request; used by IMG, RAST and REGANOR annotation servers
• Small structural RNAs (5S rRNA, tRNAs, tmRNA, RNaseP RNA component)
Rfam database, INFERNAL search tool
http://www.sanger.ac.uk/Software/Rfam/
http://rfam.janelia.org/
http://infernal.janelia.org/
Web service: sequence search is limited to 2 kb
ARAGORN
http://130.235.46.10/ARAGORN1.1/HTML/aragorn1.2.html
Web service: sequence search is limited to 15 kb, finds tRNAs and tmRNAs only
tRNAScan-SE
http://lowelab.ucsc.edu/tRNAscan-SE/
Web service: sequence search is limited to 5 Mb, finds tRNAs only
Tools out there: tools for finding “non-coding” RNAs - II
• Short regulatory RNAs (riboswitches, etc.)
Rfam database, INFERNAL search tool
http://www.sanger.ac.uk/Software/Rfam/
http://rfam.janelia.org/
http://infernal.janelia.org/
Web service: sequence search is limited to 2 kb; provides list of precalculated RNAs for publicly available genomes
Other (less popular) tools:
Pipeline for discovering cis-regulatory ncRNA motifs:
http://bio.cs.washington.edu/supplements/yzizhen/pipeline/
RNAz
http://www.tbi.univie.ac.at/~wash/RNAz/
Tools out there: finding protein-coding genes (not ORFs!)
Reading frames: translations of the nucleotide sequence with an offset of 0, 1 and 2 nucleotides (three possible translations in each direction)
Open reading frame (ORF): reading frame between a start and stop codon
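The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not any tool's actual implementation: the function name `find_orfs`, the `min_codons` cutoff and the coordinate convention are assumptions made for the example. It also shows why an ORF is not yet a gene: the scan reports every start-to-stop stretch, coding or not.

```python
from typing import Iterator, Tuple

START_CODONS = {"ATG", "GTG", "TTG"}   # common prokaryotic start codons
STOP_CODONS = {"TAA", "TAG", "TGA"}    # standard stop codons
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 30) -> Iterator[Tuple[str, int, int, int]]:
    """Yield (strand, frame, start, end) for every ORF in all six reading frames.

    Coordinates are 0-based, end-exclusive, and refer to the scanned strand
    (for "-" they are positions on the reverse complement). The stop codon is
    included. For each stop, the first (most upstream) in-frame start is used.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):                      # offsets 0, 1 and 2
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in START_CODONS:
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        yield strand, frame, start, i + 3
                    start = None


if __name__ == "__main__":
    toy = "CC" + "ATG" + "AAA" * 3 + "TAA" + "G"
    for orf in find_orfs(toy, min_codons=2):
        print(orf)
```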
Tools out there: most popular
CDS-finding tools
• CRITICA
• Glimmer family (Glimmer2, Glimmer3, RBS finder)
http://glimmer.sourceforge.net/
• GeneMark family (GeneMark-hmm, GeneMarkS)
http://exon.gatech.edu/GeneMark/
• EasyGene
Combinations and variations of the above
• REGANOR (CRITICA + Glimmer3 + pre-processing)
• Old ORNL pipeline (CRITICA + Glimmer3)
• RAST (Glimmer2 + pre- and post-processing)
Tools out there: metagenome
annotation
• BLASTx
• Fgenesb
http://linux1.softberry.com/berry.phtml?topic=fgenesb&group=programs&subgroup=gfindb
• GeneMark (GeneMark-hmm for reads, GeneMarkS for longer
contigs)
http://exon.gatech.edu/GeneMark/
• MetaGene
http://metagene.cb.k.u-tokyo.ac.jp/metagene/
• GISMO ?
http://www.cebitec.uni-bielefeld.de/groups/brf/software/gismo/
Full-service servers
• IMG/M-ER – uses GeneMark for Sanger, proxygenes for 454
http://img.jgi.doe.gov/submit
• MG-RAST
http://metagenomics.nmpdr.org/
Outline
1. Introduction (who said annotating
prokaryotic genomes is easy?)
2. Tools out there (don’t bother to write
down the names and links, all presentations
will be available on the web site)
3. Basic principles behind tools (very
basic, see specific papers for details)
4. Known problems of the tools:
why you may need manual
curation
Basic principles: finding RNAs
• 16S and 23S rRNAs are too large to be represented as statistical models (cannot use secondary structure in prediction)
• Approximate location is identified by sequence similarity search (BLASTn) against a database of 16S and 23S rRNA sequences (but GenBank contains too many fragments)
• Boundaries of these RNAs are usually defined as the coordinates of a BLAST hit (but alignment is local)
• For small RNAs statistical models can be generated
• Both sequence similarity AND secondary structure are taken into account
• RNAs are grouped into families by sequence similarity
• Multiple sequence alignment is performed
• Secondary structure annotation is added
• Combined secondary structure and primary sequence profiles are captured by profile stochastic context-free grammars (statistical models)
• These models are stored in the database and new sequences are searched against them
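The remark that rRNA boundaries come from a local alignment can be made concrete. The sketch below is an illustration only and is not what any of the tools named in this talk is known to do: it takes a hypothetical BLASTn hit (the `BlastHit` fields are assumed names), measures how much of the reference rRNA was left unaligned at each end, and pushes the genome coordinates outward by that amount. It ignores indels, so the result is still an estimate to be checked by hand.

```python
from typing import NamedTuple, Tuple


class BlastHit(NamedTuple):
    """One local alignment of a reference rRNA (query) to a genome (subject)."""
    q_start: int  # first aligned base of the reference rRNA, 1-based
    q_end: int    # last aligned base of the reference rRNA
    q_len: int    # full length of the reference rRNA
    s_start: int  # genome coordinate aligned to q_start
    s_end: int    # genome coordinate aligned to q_end (< s_start on the minus strand)


def extrapolate_boundaries(hit: BlastHit) -> Tuple[int, int, str]:
    """Return (left, right, strand) after adding the unaligned reference ends."""
    missing_5p = hit.q_start - 1          # reference bases before the hit
    missing_3p = hit.q_len - hit.q_end    # reference bases after the hit
    if hit.s_start <= hit.s_end:          # gene on the plus strand
        return max(1, hit.s_start - missing_5p), hit.s_end + missing_3p, "+"
    # Minus strand: the 5' end of the gene sits at the higher coordinate.
    return max(1, hit.s_end - missing_3p), hit.s_start + missing_5p, "-"


if __name__ == "__main__":
    # A hit that covers bases 21..1480 of a 1500-nt reference 16S rRNA.
    print(extrapolate_boundaries(BlastHit(21, 1480, 1500, 1020, 2479)))
```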
Basic principles: finding CDSs using
evidence-based vs ab initio algorithms
Two major approaches to prediction of protein-coding genes:
• “evidence-based” (ORFs with translations
homologous to the known proteins are CDSs)
Advantages: finds “unusual” genes (e.g. horizontally transferred);
relatively low rate of false positive predictions
Limitations: cannot find “unique” genes; low sensitivity towards short
genes; prone to propagation of false positive results of ab initio
annotation tools
• ab initio (ORFs with nucleotide composition similar
to CDSs are also CDSs)
Advantages: finds “unique” genes; high sensitivity
Limitations: often misses “unusual” genes; high rate of false positives
Basic principles: finding protein-coding genes with ab initio methods
Example (figure): the overall HMM architecture used in EasyGene (from Larsen & Krogh, BMC Bioinformatics, 2003).
• An ORF that is likely to be protein-coding is found by searching for “coding potential”
  - “Coding potential” is defined by comparing nucleotide sequence of an ORF to a hidden Markov model (HMM)
  - HMM is generated using a training set from the genome or from average frequencies observed for multiple genomes
  - Probability that an ORF is a protein-coding gene is computed
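A much-simplified sketch of “coding potential”: instead of a full HMM, the example below uses two fixed-order Markov chains (one trained on coding sequence, one on non-coding sequence) and scores an ORF by the log-odds between them. This is a deliberate simplification chosen for brevity, not the model of any specific gene finder; the function names, the order of 2 and the pseudocount are assumptions.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable

Model = Dict[str, Dict[str, float]]
BASES = "ACGT"


def train_markov(seqs: Iterable[str], order: int = 2, pseudocount: float = 1.0) -> Model:
    """Estimate P(base | previous `order` bases) from training sequences."""
    counts = defaultdict(lambda: dict.fromkeys(BASES, pseudocount))
    for seq in seqs:
        seq = seq.upper()
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            if base in BASES and set(context) <= set(BASES):
                counts[context][base] += 1
    model: Model = {}
    for context, c in counts.items():
        total = sum(c.values())
        model[context] = {b: c[b] / total for b in BASES}
    return model


def log_odds(seq: str, coding: Model, noncoding: Model, order: int = 2) -> float:
    """Sum of log(P_coding / P_noncoding) over the sequence; > 0 favours coding."""
    uniform = dict.fromkeys(BASES, 0.25)   # fallback for unseen contexts
    seq = seq.upper()
    score = 0.0
    for i in range(order, len(seq)):
        context, base = seq[i - order:i], seq[i]
        if base not in uniform:
            continue                        # skip ambiguous bases such as N
        p_c = coding.get(context, uniform)[base]
        p_n = noncoding.get(context, uniform)[base]
        score += math.log(p_c / p_n)
    return score


if __name__ == "__main__":
    coding = train_markov(["ATGGCGGCGGCGGCGTAA"] * 5)
    noncoding = train_markov(["ATATATATATATATAT"] * 5)
    print(log_odds("GCGGCGGCG", coding, noncoding))
    print(log_odds("ATATATATA", coding, noncoding))
```

Real gene finders differ in exactly the places this sketch fixes by hand: how the training set is chosen, what order of model is used, and which parts of the gene get their own states.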
• N-terminal (5’) boundary is found by finding a start codon (ATG, GTG, TTG) next to a ribosomal binding site (RBS, Shine-Dalgarno sequence)
  - Different genomes have different frequencies of start codons
  - RBS is found by (Gibbs sampling) multiple sequence alignment of upstream sequences and represented by a weighted positional frequency matrix
  - Or RBS is found by multiple sequence alignment and represented as one of the states in an HMM model
  - Or ...
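The positional frequency matrix idea can be illustrated as follows. The sketch assumes the RBS sites are already aligned (the Gibbs sampling step is skipped), builds a log-odds position weight matrix from them, and picks the candidate start codon with the best-scoring RBS-like motif in the upstream window. All names, the 20-nt window, the pseudocount and the uniform background are assumptions for the example; real tools also model the spacer length between RBS and start codon.

```python
import math
from typing import Dict, List

PWM = List[Dict[str, float]]


def build_pwm(sites: List[str], pseudocount: float = 0.5, background: float = 0.25) -> PWM:
    """Build a log-odds position weight matrix from equal-length aligned sites."""
    width = len(sites[0])
    total = len(sites) + 4 * pseudocount
    pwm: PWM = []
    for pos in range(width):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def best_rbs_score(upstream: str, pwm: PWM) -> float:
    """Best PWM score over all positions of the upstream window."""
    width = len(pwm)
    scores = [
        sum(pwm[k].get(upstream[i + k], 0.0) for k in range(width))
        for i in range(len(upstream) - width + 1)
    ]
    return max(scores, default=float("-inf"))


def pick_start(seq: str, candidate_starts: List[int], pwm: PWM, window: int = 20) -> int:
    """Return the candidate start (0-based index) with the best upstream RBS score."""
    def score(start: int) -> float:
        return best_rbs_score(seq[max(0, start - window):start], pwm)
    return max(candidate_starts, key=score)


if __name__ == "__main__":
    pwm = build_pwm(["AGGAGG", "AGGAGG", "AGGAGA", "AGGTGG"])
    seq = "TTTTTTTTTTATGCCCAGGAGGTTTTTATGAAA"
    print(pick_start(seq, [10, 27], pwm))  # two in-frame ATG candidates
```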
Features and differences between gene finding tools
• Training set selection (evidence-based vs purely ab initio)
  - Example: CRITICA and EasyGene use evidence-based training sets (BLASTn with counting synonymous/non-synonymous codons in CRITICA, BLASTx in EasyGene); Glimmer uses an ab initio training set of long non-overlapping ORFs; GeneMark uses an ab initio heuristic model
• Statistical model of coding and non-coding regions (codon frequencies, dicodon frequencies, hidden Markov models)
  - Example: CRITICA uses dicodon frequencies to model coding regions; Glimmer uses interpolated Markov models (IMM) of up to 5th order; GeneMark uses an order 2 HMM for coding regions and an order 0 HMM for non-coding regions
• Statistical model architecture (i.e. which parts of the CDS are explicitly modeled; may include RBS, spacer region, start codon, second codon, internal codons, stop codon, etc.)
  - Example: EasyGene explicitly models RBS, spacer region, start codon, second codon, internal codons, stop codon, codons surrounding stop codon, non-coding sequence; all other tools have less comprehensive HMM architectures
• Additional algorithms for refinement of predictions (RBS finder, overlap resolution, etc.)
  - Example: Glimmer2.0 has a scoring scheme for overlap resolution; Glimmer3 uses a dynamic programming algorithm to select the highest-scoring set of predictions consistent with the maximum allowed overlap
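The overlap-resolution idea in the last example can be sketched as a small dynamic program. This is not Glimmer3's code; it is a generic weighted-interval selection with a tolerated overlap, written under simplifying assumptions: strands are ignored, scores are positive, and two predictions count as compatible when the later one starts no more than `max_overlap` bases before the earlier one ends. The `Prediction` type and function name are made up for the example.

```python
from typing import List, NamedTuple


class Prediction(NamedTuple):
    start: int    # 1-based, inclusive
    end: int      # 1-based, inclusive
    score: float  # assumed positive; higher means more likely coding


def select_predictions(preds: List[Prediction], max_overlap: int = 0) -> List[Prediction]:
    """Pick the highest-scoring subset whose members overlap by at most max_overlap."""
    ordered = sorted(preds, key=lambda p: (p.end, p.start))
    n = len(ordered)
    if n == 0:
        return []
    best = [0.0] * n   # best[i]: top total score of a valid chain ending with ordered[i]
    prev = [-1] * n    # back-pointer to the previous member of that chain
    for i, cur in enumerate(ordered):
        best[i] = cur.score
        for j in range(i):
            earlier = ordered[j]
            # Because the list is sorted by end, compatibility with the chain's
            # last member implies compatibility with all earlier members.
            compatible = earlier.end - cur.start + 1 <= max_overlap
            if compatible and best[j] + cur.score > best[i]:
                best[i] = best[j] + cur.score
                prev[i] = j
    i = max(range(n), key=best.__getitem__)
    chosen = []
    while i != -1:
        chosen.append(ordered[i])
        i = prev[i]
    return chosen[::-1]


if __name__ == "__main__":
    calls = [Prediction(1, 100, 5.0), Prediction(90, 200, 6.0), Prediction(150, 300, 4.0)]
    for allowed in (0, 30, 60):
        print(allowed, select_predictions(calls, max_overlap=allowed))
```

Changing the allowed overlap changes which genes survive, which is one reason two pipelines built on the same gene finder can return different gene sets.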
Outline
1. Introduction (who said annotating
prokaryotic genomes is easy?)
2. Tools out there (don’t bother to write
down the names and links, all presentations
will be available on the web site)
3. Basic principles behind tools (very
basic, see specific papers for details)
4. Known problems of the tools:
why you may need manual
curation (more on manual curation in the
next talk by Thanos Lykidis)
Known problems: RNAs

Genome                              Sequencing center   16S rRNA, nt
Synechococcus sp. CC9311            UCSD, TIGR          1477
Synechococcus sp. CC9605            JGI                 1440
Synechococcus elongatus PCC 7942    JGI                 1490
Synechococcus sp. JA-2-3B'a(2-13)   TIGR                1323
Synechococcus sp. JA-3-3Ab          TIGR                1324
Synechococcus sp. RCC307            Genoscope           1498
Synechococcus sp. WH7803            Genoscope           1497, 1464

• Large rRNAs: approximate position is correct, boundaries are inaccurate
  - variation of rRNA sizes in closely related strains
  - most 16S rRNA are missing anti-Shine-Dalgarno sequence
• Small structural RNAs: covariance models are generally accurate, but may miss some tRNAs in Archaea
  - Check for the full complement of tRNAs with all necessary anticodons
  - No model for pyrrolysine tRNA
• Small regulatory RNAs: search is accurate but slow (too many models)
  - Annotations of regulatory RNAs are missing from many genomes
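A first-pass version of the tRNA complement check can be scripted. The sketch below is coarser than an anticodon-level check: it only asks whether at least one tRNA was predicted for each of the 20 standard amino acids. The input is assumed to be a list of three-letter isotype labels collected from whatever tRNA finder was used; the function name is hypothetical, and special cases (initiator tRNA, selenocysteine, pyrrolysine) are left out on purpose.

```python
from typing import Iterable, List

STANDARD_AMINO_ACIDS = {
    "Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly", "His", "Ile",
    "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", "Tyr", "Val",
}


def missing_isotypes(predicted_isotypes: Iterable[str]) -> List[str]:
    """Return the standard amino acids for which no tRNA was predicted."""
    found = {name.strip().capitalize() for name in predicted_isotypes}
    return sorted(STANDARD_AMINO_ACIDS - found)


if __name__ == "__main__":
    predicted = sorted(STANDARD_AMINO_ACIDS - {"Trp"}) + ["ala", "LEU "]
    print(missing_isotypes(predicted))
```

An empty result does not prove the tRNA set is complete at the anticodon level; a non-empty one is a strong hint that a tRNA was missed.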
Known problems: CDSs
• Short CDSs: many are missed, others are overpredicted
  - short ribosomal proteins (30-40 aa long) are often missed
  - short proteins in the promoter region are often overpredicted
• N-terminal sequences are often inaccurate (many features of the sequence around start codon are not accounted for)
  - Glimmer2.0 is calling genes longer than they should be
  - GeneMark, Glimmer3.0 err both ways, but mostly call genes shorter
• Pseudogenes
  - all tools are looking for ORFs (needs valid start and stop codons)
  - “unique” genes are often predicted on the opposite strand of a pseudogene
• Proteins with unusual translational features (recoding, programmed frameshifts)
  - these genes are often mistaken for pseudogenes
  - see pseudogenes
Known problems: different gene finding tools applied to the same genome

Annotation  total features  total CDSs  non-pseudo CDSs  pseudo  total rRNA  total misc RNA  total tRNA
manual      7124            7042        6699             343     18          63              1
GeneMark    7059            6974        6974             0       18          62              2
ORNL        7076            6994        6994             0       18          63              1
RAST        5503            5422        5422             0       18          63              0
REGANOR     6420            6339        6339             0       18          63              0
Glimmer3    8218            8218        8218             0       0           0               0

            missed by automated       false positive (deleted   too short (extended       too long (truncated       total
            annotation                by manual curation)       by manual curation)       by manual curation)       modifications
Annotation  #       % CDSs            #       % CDSs            #       % CDSs            #       % CDSs            #       % CDSs
GeneMark    347     4.9               282     4.0               846     12.0              106     1.5               1581    22.4
ORNL        610     8.6               560     7.9               300     4.2               1152    16.3              2622    37.2
RAST        1783    25.3              167     2.3               83      1.1               2228    31.6              4261    60.5
REGANOR     904     12.8              203     2.8               207     2.9               1059    15.0              2373    33.6
Glimmer3    237     3.3               1408    19.9              500     7.1               542     7.6               2687    38.1
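The totals in the second table can be re-derived from the four error categories. The sketch below sums the categories per pipeline and expresses the total as a percentage of the 7042 manually curated CDSs. Two things are assumptions rather than statements from the slides: that 7042 is the denominator, and that percentages are truncated (not rounded) to one decimal. Both reproduce the total-modifications column.

```python
from typing import Dict, Tuple

MANUAL_TOTAL_CDS = 7042  # "total CDSs" of the manual annotation in the first table

# (missed, false positive, too short, too long), copied from the second table
MODIFICATIONS: Dict[str, Tuple[int, int, int, int]] = {
    "GeneMark": (347, 282, 846, 106),
    "ORNL": (610, 560, 300, 1152),
    "RAST": (1783, 167, 83, 2228),
    "REGANOR": (904, 203, 207, 1059),
    "Glimmer3": (237, 1408, 500, 542),
}


def summarize(counts: Tuple[int, ...], denominator: int = MANUAL_TOTAL_CDS) -> Tuple[int, int]:
    """Return (total modifications, percentage in tenths of a percent, truncated)."""
    total = sum(counts)
    tenths = total * 1000 // denominator   # integer math avoids float rounding
    return total, tenths


if __name__ == "__main__":
    for name, counts in MODIFICATIONS.items():
        total, tenths = summarize(counts)
        print(f"{name:<10}{total:>6}{tenths // 10:>5}.{tenths % 10}")
```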
Conclusions
• There are plenty of tools for automated annotation of microbial
genomes, including several “full-service” servers and annotation
pipelines
• Even “full-service” pipelines identify a limited range of features.
Development of automated or semi-automated tools for
identification of operons, promoters, terminators etc. is highly
desirable, but not likely in the absence of experimental data
• Nearly all of the annotation tools and servers are using different
strategies, algorithms, models, settings, etc., so the results may
and will vary
• Different automated gene finders have different advantages and
limitations; the best strategy is using any of them or a
combination followed by evidence-based manual curation