Advancing Science with DNA Sequence - C-MORE
Advancing Science with DNA Sequence
Microbial Genome Annotation
Nikos Kyrpides
DOE Joint Genome institute
Two main goals of genome analysis:
• Evolutionary analysis
– How does an organism compare to the
rest?
• Metabolic reconstruction
– What can an organism do and how?
Overview of Annotation Steps
DNA sequence → Gene Finding → Function Prediction
Example input:
>Contig1
ataacaacacattagcggc
asacacacaacaggatatt
aggagagagagaaagttac
Gene Finding:
• Identify genes (proteins, RNAs)
• Identify regulatory elements
• Identify repeat elements
• Gene QC
• Missing genes
Function Prediction:
• Blast
• Clusters (BBH, COGs, TIGRFam)
• Motifs (HMM, Pfam, InterPro)
• Gene context (fusions, operons, regulons)
Modes: Automatic, Manual
1. Finding the genes in microbial genomes
  1. Introduction
  2. Tools out there
  3. Basic principles behind tools
  4. Known problems of the tools: why you may need manual curation
Finding the genes in microbial genomes: features
Sequence features in prokaryotic genomes:
• stable RNA-coding genes (rRNAs, tRNAs, RNA component of RNaseP, tmRNA)
• protein-coding genes (CDSs)
• transcriptional features (mRNAs, operons, promoters, terminators, protein-binding sites, DNA bends)
• translational features (RBS, regulatory antisense RNAs, mRNA secondary structures, translational recoding and programmed frameshifts, inteins)
• pseudogenes (tRNA and protein-coding genes)
• …
Tools out there: finding protein-coding genes (not ORFs!)
Reading frames: translations of the nucleotide sequence with an offset of 0, 1 and 2 nucleotides (three possible translations in each direction)
Open reading frame (ORF): reading frame between a start and stop codon
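The two definitions above can be illustrated with a small sketch that scans all six reading frames for start-to-stop ORFs. It is only an illustration of the definition, not a gene finder: the start codon set (ATG, GTG, TTG), the stop codon set, the `min_len` cutoff and the function names are assumptions made for this example.

```python
# Minimal six-frame ORF scan (illustrative sketch, not a gene finder).
START_CODONS = {"ATG", "GTG", "TTG"}   # assumption: common bacterial start codons
STOP_CODONS = {"TAA", "TAG", "TGA"}    # assumption: standard stop codons

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case ACGT string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq, min_len=30):
    """Return a list of (strand, frame, start, end) for start..stop ORFs.

    Coordinates are 0-based, end-exclusive, and relative to the strand that
    was scanned. For each stop codon only the longest ORF (the first
    in-frame start codon) is reported.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):                      # offsets 0, 1 and 2
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in START_CODONS:
                    start = i                       # open an ORF
                elif start is not None and codon in STOP_CODONS:
                    if i + 3 - start >= min_len:
                        orfs.append((strand, frame, start, i + 3))
                    start = None                    # close it at the stop
    return orfs
```

An ORF found this way is only a candidate: whether it is a protein-coding gene (CDS) is exactly what the gene-finding tools try to decide.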
Finding features in microbial genomes
Well-annotated bacterial genome in Artemis genome viewer:
Tools out there: servers for microbial genome annotation - I
• IMG-ER
http://img.jgi.doe.gov/er
IMG-ER submission page: http://img.jgi.doe.gov/submit
Output: stable RNA-encoding genes, CDSs, functional annotations; output in GenBank format
• RAST
http://rast.nmpdr.org/
Output: rRNAs and tRNAs, CDSs, functional annotations; output in several formats
• JCVI Annotation Service
http://www.jcvi.org/cms/research/projects/annotation-service/overview/
Output: CDSs, stable RNAs?, functional annotations; format?
Tools out there: servers for microbial genome annotation - II
• AMIGENE
http://www.genoscope.cns.fr/agc/tools/amiga/Form/form.php
Output: CDSs; output in gff format
• RefSeq
http://www.ncbi.nlm.nih.gov/genomes/MICROBES/genemark.cgi
http://www.ncbi.nlm.nih.gov/genomes/MICROBES/glimmer_3.cgi
Output: CDSs; output in tbl format
• EasyGene
http://www.cbs.dtu.dk/services/EasyGene/
Output: CDSs; size restriction <1 Mb
Tools out there: genome browsers for manual annotation of microbial genomes
• Artemis
http://www.sanger.ac.uk/Software/Artemis/
Windows and Linux versions; works with files in many formats, annotated by any pipeline
• Manatee
http://manatee.sourceforge.net/
Linux versions only; genome needs to be annotated by the JCVI Annotation Service
• Argo
http://www.broad.mit.edu/annotation/argo/
Windows and Linux; works with files in many formats
Major difference: viewer vs editor?
Tools out there: tools for finding stable (“non-coding”) RNAs - I
• Large structural RNAs (23S and 16S rRNAs)
RNAmmer
http://www.cbs.dtu.dk/services/RNAmmer/
• Small structural RNAs (5S rRNA, tRNAs, tmRNA, RNaseP RNA component)
Rfam database, INFERNAL search tool
http://www.sanger.ac.uk/Software/Rfam/
http://rfam.janelia.org/
http://infernal.janelia.org/
Web service: sequence search is limited to 2 kb
ARAGORN
http://130.235.46.10/ARAGORN1.1/HTML/aragorn1.2.html
Web service: sequence search is limited to 15 kb, finds tRNAs and tmRNAs only
tRNAScan-SE
http://lowelab.ucsc.edu/tRNAscan-SE/
Web service: sequence search is limited to 5 Mb, finds tRNAs only
Tools out there: tools for finding “non-coding” RNAs - II
• Short regulatory RNAs
Rfam database, INFERNAL search tool
http://www.sanger.ac.uk/Software/Rfam/
http://rfam.janelia.org/
http://infernal.janelia.org/
Web service: sequence search is limited to 2 kb; provides a list of precalculated RNAs for publicly available genomes
Other (less popular) tools:
Pipeline for discovering cis-regulatory ncRNA motifs:
http://bio.cs.washington.edu/supplements/yzizhen/pipeline/
RNAz
http://www.tbi.univie.ac.at/~wash/RNAz/
Tools out there: most popular CDS-finding tools
• CRITICA
• Glimmer family (Glimmer2, Glimmer3, RBS finder)
http://glimmer.sourceforge.net/
• GeneMark family (GeneMark-hmm, GeneMarkS)
http://exon.gatech.edu/GeneMark/
• EasyGene
• AMIGENE
• PRODIGAL (default JGI gene finder)
http://compbio.ornl.gov/prodigal/
Combinations and variations of the above
• RAST (Glimmer2 + pre- and post-processing)
Basic principles: finding CDSs using evidence-based vs ab initio algorithms
Two major approaches to prediction of protein-coding genes:
• “evidence-based” (ORFs whose translations are homologous to known proteins are CDSs)
Advantages: finds “unusual” genes (e.g. horizontally transferred); relatively low rate of false positive predictions
Limitations: cannot find “unique” genes; low sensitivity on short genes; prone to propagating false positive results of ab initio annotation tools
• ab initio (ORFs with nucleotide composition similar to that of CDSs are also CDSs)
Advantages: finds “unique” genes; high sensitivity
Limitations: often misses “unusual” genes; high rate of false positives
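A toy version of the ab initio idea, scoring an ORF by how closely its composition resembles known CDSs, might look like the sketch below. It uses a simple codon-frequency log-odds score against an independent-nucleotide background. This is a deliberately simplified stand-in chosen for illustration and is not the model used by any of the tools listed; the function names and the pseudocount are assumptions.

```python
import math
from collections import Counter
from itertools import product


def codons(seq):
    """Split a sequence into consecutive in-frame codons (partial tail dropped)."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def train(known_cds, pseudocount=1.0):
    """Estimate codon and single-base frequencies from trusted CDSs.

    The pseudocount (an assumption) keeps unseen codons from getting zero
    probability.
    """
    codon_counts, base_counts = Counter(), Counter()
    for cds in known_cds:
        cs = codons(cds)
        codon_counts.update(cs)
        base_counts.update("".join(cs))
    all_codons = ["".join(p) for p in product("ACGT", repeat=3)]
    total_c = sum(codon_counts[c] for c in all_codons) + pseudocount * 64
    total_b = sum(base_counts[b] for b in "ACGT") + pseudocount * 4
    codon_freq = {c: (codon_counts[c] + pseudocount) / total_c for c in all_codons}
    base_freq = {b: (base_counts[b] + pseudocount) / total_b for b in "ACGT"}
    return codon_freq, base_freq


def coding_score(orf, codon_freq, base_freq):
    """Sum of log-odds (coding model vs. independent bases) over the ORF's codons."""
    score = 0.0
    for c in codons(orf):
        if c not in codon_freq:          # skip codons with ambiguous bases
            continue
        background = base_freq[c[0]] * base_freq[c[1]] * base_freq[c[2]]
        score += math.log(codon_freq[c] / background)
    return score
```

A positive score means the ORF looks more like the training CDSs than like random sequence of the same base composition. This also shows why “unusual” (e.g. horizontally transferred) genes can be missed: their composition differs from the training set.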
Known problems: CDSs
• Short CDSs: many are missed, others are overpredicted
– short ribosomal proteins (30-40 aa long) are often missed
– short proteins in the promoter region are often overpredicted
• N-terminal sequences are often inaccurate (many features of the sequence around the start codon are not accounted for)
– Glimmer2.0 calls genes longer than they should be
– GeneMark and Glimmer3.0 mostly call genes shorter
• Pseudogenes and sequencing errors (artificial frameshifts)
– all tools look for ORFs (which need valid start and stop codons)
– “unique” genes are often predicted on the opposite strand of a pseudogene or a gene with a sequencing error
• Proteins with unusual translational features (recoding, programmed frameshifts)
– these genes are often mistaken for pseudogenes
– see pseudogenes
Known problems: CDSs
Lack of Standards
Finding unique genes
B. mallei: obligate parasite of horses
B. pseudomallei: causes human disease in tropical areas (melioidosis)
• Phylogenetic profiler finds 548 unique genes in B. mallei
• However, 497 of them in fact exist in B. pseudomallei, but they have not been called as real genes.
• The difference in gene models reveals 89.2% error rate in unique genes
GenePRIMP
Gene Prediction Improvement Pipeline
http://geneprimp.jgi-psf.org
GenePRIMP is a pipeline that consists of a series of computational units that identify erroneous gene calls and missed genes and correct a subset of the identified defective features.
APPLICATIONS
• Identify gene prediction anomalies
• Benchmark the quality of gene prediction algorithms
• Benchmark the quality of combination / coverage of sequencing platforms
• Improve the sequence quality
Pati A. et al, Nature Methods June 2010
GenePRIMP steps
Intergenic regions identify missed ORFs …
Find missing genes
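As a rough illustration of the first step in looking for missed genes, the sketch below lists the intergenic regions left between existing gene calls, which could then be searched (for example by similarity search) for overlooked ORFs. How GenePRIMP does this internally is not described here; the function name, the linear-genome simplification and the 150 bp cutoff are assumptions for this example.

```python
def intergenic_regions(genes, genome_length, min_len=150):
    """Return (start, end) gaps between gene calls.

    genes: iterable of (start, end) pairs, 0-based and end-exclusive, on the
    forward coordinate system regardless of strand. Overlapping genes are
    handled. The genome is treated as linear (a simplification; a circular
    chromosome would also need the wrap-around gap).
    """
    gaps = []
    cursor = 0                                   # end of the region covered so far
    for start, end in sorted(genes):
        if start - cursor >= min_len:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if genome_length - cursor >= min_len:
        gaps.append((cursor, genome_length))     # tail after the last gene
    return gaps


# Hypothetical gene calls on a 2,100 bp contig.
calls = [(100, 400), (350, 900), (1500, 2000)]
print(intergenic_regions(calls, 2100))           # regions worth a second look
```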
… and wrong ORFs
or2654 is unique and hides a real CDS, which is acyl carrier protein
Everything looks perfect in this area
of Nitrobacter winogradskyi, but …
… hides a real ORF
Guinness Book of protein-coding genes
The longest human gene is 2,220,223 nucleotides long. It has 79 exons, with a total of only 11,058 nucleotides, which specify the sequence of 3,685 amino acids; it codes for the protein dystrophin. Dystrophin is part of a protein complex located in the cell membrane, which transfers the force generated by the actin-myosin structure inside the muscle fiber to the entire fiber.
The smallest human gene is 252 nucleotides long; it specifies a polypeptide of 67 amino acids and codes for insulin-like growth factor II.
The longest bacterial gene is 110,418 nucleotides long and specifies a sequence of 36,805 amino acids. Its function is unknown; it is most likely a surface protein.
The smallest bacterial gene is 54 nucleotides long; it specifies a polypeptide of 17 amino acids and codes for a regulatory protein in cyanobacteria.
False positives
Genome name                 CDSs with no hits < 100 aa   % with tBLASTn hit   % tBLASTn hits with frameshifts/stop codons
Prochlorococcus AS9601      18                           88.9                 68.8
Prochlorococcus MIT 9211    62                           40.3                 80
Prochlorococcus MIT 9215    24                           58.3                 64.2
Prochlorococcus MIT 9301    12                           75                   66.7
Prochlorococcus MIT 9303    501                          83                   61.8
Prochlorococcus MIT 9313    35                           8.6                  66.7
Prochlorococcus MIT 9515    32                           81.3                 50
Prochlorococcus NATL1A      209                          95.2                 48.2
Prochlorococcus CCMP1375    50                           34                   82.4
Synechococcus PCC 7942      39                           0                    0
Synechococcus CC9311        313                          11.5                 83.3
Synechococcus CC9605        83                           38.6                 81.3
Synechococcus CC9902        21                           57.1                 100
Synechococcus JA-2-3Ba      176                          26.7                 85.1
Synechococcus JA-3-3Ab      142                          35.2                 92
Synechococcus PCC 7002      93                           17.2                 56.3
Synechococcus RCC307        184                          10.3                 68.4
Synechococcus WH 7803       32                           18.8                 83.3
Synechococcus WH 8102       39                           38.4                 46.7
2. Finding the functions in microbial genomes
1. Introduction
2. Tools out there
3. Known problems
What is function?
Example: cobalamin biosynthetic enzyme, cobalt-precorrin-4 methyltransferase (CbiF)
• molecular/enzymatic (methyltransferase)
– Reaction (methylation)
– Substrate (cobalt-precorrin-4)
– Ligand (S-adenosyl-L-methionine)
• metabolic (cobalamin biosynthesis)
• physiological (maintenance of healthy nerve and red blood cells, through B12).
Functional characterization
Computational approaches to functional characterization
Sequence Homology
Two sequences are homologous, if there existed a
molecule in the past that is ancestral to both of the
sequences.
Types of Homology:
Orthology: bifurcation in molecular tree reflects speciation
Paralogy: bifurcation in molecular tree reflects gene duplication
Homology & analogy
• The term homology is confounded & abused in the literature!
– sequences are homologous if they’re related by divergence from a common ancestor
– analogy relates to the acquisition of common features from unrelated ancestors via convergent evolution
• e.g., β-barrels occur in soluble serine proteases & integral membrane porins; chymotrypsin & subtilisin share groups of catalytic residues, with near identical spatial geometries, but no other similarities
• Homology is not a measure of similarity & is not quantifiable
– it is an absolute statement that sequences have a divergent rather than a convergent relationship
– the phrases "the level of homology is high" or "the sequences show 50% homology", or any like them, are strictly meaningless!
Function prediction
• Function transfer by homology
• Homology
– implies a common evolutionary origin.
– does not imply retention of similarity in any of their properties.
• Homology ≠ similarity of function.
Punta & Ofran. PLOS Comp Biol. 2008
Dos and Don’ts
Type                       Don’t            Do
Homology                   Same function    Probability for same function
Orthology                  Same function    Probability for same function
Paralogy                   Same function    Probability for same function
Sequence similarity        Same function    Probability for same function
High sequence similarity   Same function    Probability for same function
Same sequence              Same function    Probability for same function
Application areas of analysis tools
• The scale indicates % identity between aligned sequences
• Alignment of 2 random seqs can produce ~20% identity
– less than 20% does not constitute a significant alignment
– around this threshold is the Twilight Zone, where alignments may appear plausible to the eye, but can’t be proved by conventional methods
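The % identity referred to above can be computed from a pairwise alignment as in the sketch below. Several conventions exist for the denominator; this sketch counts only columns where neither sequence has a gap, which is an assumption. The 20% threshold comes from the slide, while the width of the twilight band (`margin`) and the zone labels are assumptions for illustration.

```python
def percent_identity(aln_a, aln_b):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aln_a.upper(), aln_b.upper())
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


def zone(identity, threshold=20.0, margin=10.0):
    """Rough label for an identity value.

    threshold: ~20% identity, as on the slide.
    margin: assumed width of the band just above the threshold.
    """
    if identity < threshold:
        return "not significant"
    if identity < threshold + margin:
        return "twilight zone"
    return "above twilight zone"


# Hypothetical gapped alignment of two short peptides.
identity = percent_identity("ACDE-FG", "ACDQWFG")
print(round(identity, 1), zone(identity))
```

Even a high identity value is a measure of similarity only; whether the sequences are homologous is a separate inference.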
Function prediction
• Similarity searches (BLAST).
• Domain identification (Pfam).
• Small sequence identification (PROSITE).
What if nothing is similar?
• Subcellular localization
• Gene context
• Structure
• Prediction of binding residues (DISIS, bindN)
[Figure labels: S~S, Periplasm, Cytoplasm]
Annotation should make sense
Model pathway: Substrate A → (Enzyme 1) → Substrate B → (Enzyme 2) → Substrate C → (Enzyme 3) → Substrate D
Genome annotation: ?, Enzyme 1, Enzyme 2, ?, Enzyme 3
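One simple way to check that an annotation "makes sense" against a model pathway is to list the pathway steps for which no enzyme has been annotated. The sketch below does this for the three-step model pathway. The data structures and the example annotation set are hypothetical and only illustrate the consistency check.

```python
def pathway_gaps(pathway_steps, annotated_enzymes):
    """Return the pathway steps with no annotated enzyme, in pathway order.

    pathway_steps: list of (substrate, enzyme, product) tuples.
    annotated_enzymes: enzyme names found in the genome annotation.
    """
    annotated = set(annotated_enzymes)
    return [step for step in pathway_steps if step[1] not in annotated]


# Model pathway A -> B -> C -> D, as on the slide.
model = [
    ("A", "Enzyme 1", "B"),
    ("B", "Enzyme 2", "C"),
    ("C", "Enzyme 3", "D"),
]

# Hypothetical annotation in which the middle step was not found.
for substrate, enzyme, product in pathway_gaps(model, {"Enzyme 1", "Enzyme 3"}):
    print(f"missing: {enzyme} ({substrate} -> {product})")
```

A gap flanked by annotated steps is a natural candidate for a missed gene or a wrong functional assignment.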
Databases
• Databases are used for the analysis of biological molecules.
• Databases contain information organized in a way that allows users/researchers to retrieve and exploit it.
• Why bother?
– Store information.
– Organize data.
– Predict features (genes, functions ...).
– Predict the functional role of a feature (annotation).
– Understand relationships (metabolic reconstruction).
Primary nucleotide databases
EMBL/GenBank/DDBJ (http://www.ncbi.nlm.nih.gov/, http://www.ebi.ac.uk/embl)
Archive containing all sequences from:
• genome projects
• sequencing centers
• individual scientists
• patent offices
The sequences are exchanged between the three centers on a daily basis.
Database is doubling every 10 months.
Sequences from >140,000 different species.
1400 new species added every month.
Database name: nt / nr
Year   Base pairs        Sequences
2004   44,575,745,176    40,604,319
2005   56,037,734,462    52,016,762
2006   69,019,290,705    64,893,747
2007   83,874,179,730    80,388,382
2008   99,116,431,942    98,868,465
Primary protein sequence databases
• Contain coding sequences derived from the translation of nucleotide sequences
– GenBank
• Valid translations (CDS) from nt GenBank entries.
– UniProtKB/TrEMBL (1996)
• Automatic CDS translations from EMBL.
• TrEMBL Release 40.3 (26-May-2009) contains 7,916,844 entries.
RefSeq
• Curated transcripts and proteins: reviewed by NCBI staff.
• Model transcripts and proteins: generated by computer algorithms.
• Assembled genomic regions (contigs).
• Chromosome records.
Classification databases
Groups (families/clusters) of proteins based on…
Overall sequence similarity.
Local sequence similarity.
Presence / absence of specific features.
Structural similarity.
...
These groups contain proteins with similar properties.
Specific function, enzymatic activity.
Broad function.
Evolutionary relationship.
…
Overall sequence similarity
Clusters of orthologous groups (COGs)
• COGs were delineated by comparing protein
sequences encoded in 43 complete genomes
representing 30 major phylogenetic lineages.
– Each Cluster has representatives of at least 3 lineages
• A function (specific or broad) has been assigned to
each COG.
http://www.ncbi.nlm.nih.gov/COG/
How it works
[Figure: proteins linked by BLAST best hits and grouped into COG1 and COG2. Labels: reciprocal best hit, bidirectional best hit, BLAST best hit, unidirectional best hit]
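The difference between a unidirectional and a bidirectional (reciprocal) best hit can be sketched as follows. The input format, a dictionary from (query, subject) pairs to a similarity score such as a BLAST bit score, and the function names are assumptions. The sketch covers only the pairwise step; building COGs additionally requires consistent best hits across at least three lineages.

```python
def best_hits(scores):
    """Map each query to its single best-scoring subject.

    scores: dict mapping (query, subject) -> score. On ties the pair
    encountered first is kept.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical scores between proteins of genome A and genome B.
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 90, ("a2", "b1"): 150, ("a2", "b2"): 80}
b_vs_a = {("b1", "a1"): 200, ("b1", "a2"): 150, ("b2", "a1"): 90, ("b2", "a2"): 80}
print(bidirectional_best_hits(a_vs_b, b_vs_a))
```

Here a2 hits b1 best, but b1 prefers a1, so a2 to b1 is only a unidirectional best hit.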
Profiles & Pfam
• One method for classifying proteins into groups exploits regions of similarity, which contain valuable information (domains/profiles).
• These domains/profiles can be used to detect distant relationships, where only a few residues are conserved.
Region similarity
Pfam
http://pfam.sanger.ac.uk
HMMs of protein alignments: local (for domains) or global (covering the whole protein)
PROSITE
http://au.expasy.org/prosite/
R-Y-x-[DT]-W-x-[LIVM]-[ST]-T-P-[LIVM](3)
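A PROSITE pattern such as the one above can be turned into a regular expression for a quick scan of a protein sequence. The sketch below handles the common syntax elements (x for any residue, [..] for allowed residues, {..} for forbidden residues, (n) or (n,m) for repeats, < and > for terminus anchors). The converter function and the test peptide are hypothetical; this is not the PROSITE scanning software.

```python
import re

_ELEMENT = re.compile(
    r"(<?)(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern):
    """Convert a PROSITE pattern string into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported PROSITE element: {element!r}")
        anchor_n, core, lo, hi, anchor_c = m.groups()
        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # forbidden residues
        repeat = ""
        if lo is not None:
            repeat = "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(("^" if anchor_n else "") + core + repeat
                     + ("$" if anchor_c else ""))
    return "".join(parts)


regex = prosite_to_regex("R-Y-x-[DT]-W-x-[LIVM]-[ST]-T-P-[LIVM](3)")
print(regex)
# Made-up peptide containing the motif, for illustration only.
print(bool(re.search(regex, "MKRYADWGLSTPIVLQ")))
```

A pattern match only flags a candidate site; short patterns match by chance fairly often, so hits still need to be checked in context.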
KEGG orthology
Composite pattern databases
• To simplify sequence analysis, the family databases are being integrated to create a unified annotation resource: InterPro
– Release 32.0 (Apr 11) contains 21516 entries
– Central annotation resource, with pointers to its satellite dbs
http://www.ebi.ac.uk/interpro/
* It is up to the user to decide if the annotation is correct *
KEGG
• Contains information about biochemical pathways and protein interactions.
http://www.kegg.com
Summary
• We have main archives (GenBank), curated databases (RefSeq, SwissProt), and protein classification databases (COG, Pfam).
• This is the tip of the iceberg of databases.
• They help predict the function, or the network of functions.
• Systems that integrate the information from several databases, visualize it, and allow handling of data in an intuitive way are required.
Functional annotation in IMG
• Automated protein product assignment pipeline
• Functional context in IMG
– KEGG Pathways, Modules, KEGG Orthology
– MetaCyc Pathways
– IMG Pathways
No longer maintained:
– TIGR Role Categories
– TIGR Genome Properties
– COG Functional Categories
Lack of Standards