Transcript Document

Genome Annotation
Daniel Lawson
VectorBase @ EBI
August 2008
Bioinformatics tools for Comparative Genomics of Vectors
Genome annotation - building a pipeline
Genome sequence
Map repeats
Map ESTs
Map Peptides
Genefinding
nc-RNAs
Protein-coding genes
Functional annotation
Release
Repeat features
• Genomes contain repetitive sequences

Genome              Size (Mb)   % Repeat
Aedes aegypti       1,300       ~70
Anopheles gambiae   260         ~30
Culex pipiens       540         ~50
Repeat features: Tandem repeats
• Pattern of two or more nucleotides repeated where the repetitions are directly adjacent to each other
• Polymorphic between individuals/populations
• Example programs: Tandem, TRF
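As a rough illustration of the definition above, the following sketch finds perfect tandem repeats with a regular expression. It is not how Tandem or TRF work (those tools also handle imperfect copies); the function name and the default thresholds are assumptions chosen for the example.

```python
import re


def find_tandem_repeats(seq, min_unit=2, max_unit=6, min_copies=3):
    """Return (start, unit, copies) for perfect tandem repeats in seq.

    Illustrative only: real repeat finders also tolerate mismatches and indels.
    """
    seq = seq.upper()
    # A unit of min_unit..max_unit bases, followed directly by at least
    # (min_copies - 1) further copies of the same unit.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    hits = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits


# Example: a (CA)n microsatellite embedded in unique sequence.
print(find_tandem_repeats("GGATCACACACACATTG"))
```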
Repeat features: Interspersed elements
• Transposable elements (TEs)
• Transposons, retrotransposons, etc.
• Entire research field in itself
• Example programs: RepeatScout, RECON
Finding repeats as a preliminary to gene prediction
• Repeat discovery
  - Literature and public databanks
  - Automated approaches (e.g. RepeatScout or RECON)
• Generate a library of example repeat sequences (FASTA file with a defined header line format)
• Use RepeatMasker to search the genome and mask the sequence
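The library step above could be sketched as follows. The header layout used here (">name#class") is an assumption based on the convention commonly used for RepeatMasker custom libraries and should be checked against the RepeatMasker documentation; the repeat name, class and function name are made up for the example.

```python
import textwrap


def format_library_entry(name, repeat_class, sequence, width=60):
    """Format one repeat consensus sequence as a FASTA record.

    Assumed header layout: >name#class (for example >MyTE1#DNA/hAT).
    """
    header = ">%s#%s" % (name, repeat_class)
    # Wrap the sequence into fixed-width lines, as is usual for FASTA.
    body = textwrap.fill(sequence.upper(), width=width)
    return header + "\n" + body + "\n"


# Hypothetical repeat consensus, for illustration only.
record = format_library_entry("MyTE1", "DNA/hAT", "acgt" * 20)
print(record)
```

A file of such records can then be supplied to RepeatMasker as a custom library (its `-lib` option) in place of a built-in species library.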
Masked sequence
• Repeatmasked sequence is an artificial construct in which regions thought to be repetitive are replaced with X's
• Widely used to reduce the computational overhead of subsequent analyses and to reduce the impact of TEs on the final annotation set

>my sequence
atgagcttcgatagcgatcagctagcgatcaggct
actattggcttctctagactcgtctatctctatta
gctatcatctcgatagcgatcagctagcgatcagg
ctactattggcttcgatagcgatcagctagcgatc
aggctactattggcttcgatagcgatcagctagcg
atcaggctactattggctgatcttaggtcttctga
tcttct

>my sequence (repeatmasked)
atgagcttcgatagcgatcagctagcgatcaggct
actattxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxatctcgatagcgatcagctagcgatcagg
ctactattxxxxxxxxxxxxxxxxxxxtagcgatc
aggctactattggcttcgatagcgatcagctagcg
atcaggctxxxxxxxxxxxxxxxxxxxtcttctga
tcttct
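A minimal sketch of hard masking, assuming the repeat positions are already known as 0-based, half-open (start, end) intervals. The function names are illustrative and this is not RepeatMasker's own code.

```python
def hard_mask(seq, intervals, mask_char="x"):
    """Replace every base inside the given intervals with mask_char."""
    chars = list(seq)
    for start, end in intervals:
        # Clamp each interval to the sequence so bad coordinates cannot
        # wrap around or run past the end.
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)


def masked_fraction(seq, mask_char="x"):
    """Fraction of the sequence that has been masked."""
    return seq.count(mask_char) / len(seq) if seq else 0.0


masked = hard_mask("atgagcttcg", [(2, 5)])
print(masked, masked_fraction(masked))
```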
Masked sequence - Hard or Soft?
• Sometimes we want to mark up repetitive sequence but not exclude it from downstream analyses. This is achieved with soft-masking, in which repeats are written in lower case and the rest of the sequence in upper case

>my sequence
ATGAGCTTCGATAGCGCATCAGCTAGCGATCAGGC
TACTATTGGCTTCTCTAGACTCGTCTATCTCTATT
AGTATCATCTCGATAGCGATCAGCTAGCGATCAGG
CTACTATTGGCTTCGATAGCGATCAGCTAGCGATC
AGGCTACTATTGGCTTCGATAGCGATCAGCTAGCG
ATCAGGCTACTATTGGCTGATCTTAGGTCTTCTGA
TCTTCT

>my sequence (softmasked)
ATGAGCTTCGATAGCGCATCAGCTAGCGATCAGGC
TACTATTggcttctctagactcgtctatctctatt
agtatcATCTCGATAGCGATCAGCTAGCGATCAGG
CTACTATTggcttcgatagcgatcagcTAGCGATC
AGGCTACTATTggcttcgatagcgatcagcTAGCG
ATCAGGCTACTATTGGCTGATCTTAGGTCTTCTGA
TCTTCT
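Because soft-masking keeps the underlying bases, a soft-masked sequence can be converted to a hard-masked one, or the repeat coordinates recovered, at any later point. A small sketch, with illustrative function names:

```python
import re


def soft_to_hard(seq, mask_char="X"):
    """Convert a soft-masked sequence (repeats in lower case) to hard-masked."""
    return "".join(mask_char if c.islower() else c for c in seq)


def soft_masked_intervals(seq):
    """Return 0-based, half-open (start, end) intervals of lower-case runs."""
    return [(m.start(), m.end()) for m in re.finditer(r"[a-z]+", seq)]


print(soft_to_hard("ACGTacgtACGT"))
print(soft_masked_intervals("ACGTacgtACGT"))
```

The reverse conversion is not possible: once bases are replaced with X the original sequence is lost, which is one reason to store the soft-masked form.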
More terminology
• Gene prediction
  Predicted exon structure for the primary transcript of a gene
• CDS
  Coding sequence for a protein-coding gene prediction (not necessarily contiguous in a genomic context)
• ORF
  Open reading frame, a sequence devoid of stop codons
• Similarity
  Similarity between sequences, which does not necessarily imply any evolutionary relationship
• ab initio prediction
  Prediction of gene structure from first principles using only the genome sequence
• Hidden Markov Model (HMM)
  Statistical model (a dynamic Bayesian network) which can be used as a sensitive, statistically robust search algorithm. Use of profile HMMs to search sequence data is widespread
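To make the ORF definition concrete, here is a sketch that finds the longest stop-free run of codons in the three forward reading frames. It uses the definition given above (no requirement for a start codon) and the standard stop codons; the function name is an assumption and the reverse strand is ignored for brevity.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end, frame) of the longest stop-free run of codons.

    Coordinates are 0-based and half-open; only forward frames are scanned.
    """
    seq = seq.upper()
    best = (0, 0, 0)
    for frame in range(3):
        run_start = frame
        for pos in range(frame, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                if pos - run_start > best[1] - best[0]:
                    best = (run_start, pos, frame)
                run_start = pos + 3
        # Close the final run at the last complete codon in this frame.
        end = frame + ((len(seq) - frame) // 3) * 3
        if end - run_start > best[1] - best[0]:
            best = (run_start, end, frame)
    return best


print(longest_orf("ATGAAATAGCCC"))
```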
Eukaryote genome annotation
[Figure: flow from genome through transcription (primary transcript), RNA processing (processed mRNA with m7G cap, ATG, STOP and AAAn tail), translation (polypeptide) and protein folding (folded protein) to enzyme/functional activity. Annotation tasks are marked alongside: find locus (genome), find exons using transcripts (processed mRNA), find exons using peptides (polypeptide), find function (folded protein).]
Prokaryote genome annotation
[Figure: flow from genome through transcription (primary transcript), RNA processing (processed RNA carrying two START...STOP coding regions), translation (polypeptide) and protein folding (folded protein) to enzyme/functional activity. Annotation tasks are marked alongside: find locus (genome), find CDS (processed RNA), find function (folded protein).]
Genefinding
• ab initio
• similarity
Genefinding resources
• Transcript
  - cDNA sequences
  - EST sequences
  - Other (MPSS, SAGE, ditags)
• Peptide
  - Non-redundant (nr) protein database
  - Protein sequence data, mass spectrometry data
• Genome
  - Other genomic sequence
ab initio prediction
[Figure, shown in two build steps: the gene expression pathway from genome through transcription (primary transcript), RNA processing (processed mRNA with m7G cap, ATG, STOP and AAAn tail), translation (polypeptide) and protein folding (folded protein) to enzyme/functional activity.]
Genefinding - ab initio predictions
• Use compositional features of the DNA sequence to define coding segments (essentially exons)
  - ORFs
  - Coding bias
  - Splice site consensus sequences
  - Start and stop codons
• Each feature is assigned a log likelihood score
• Use dynamic programming to find the highest scoring path
• Need to be trained using a known set of coding sequences
• Examples: Genefinder, Augustus, Glimmer, SNAP, fgenesh
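The scoring and dynamic programming steps above can be illustrated with a deliberately tiny model: two states (coding and noncoding), log-likelihood scores, and the Viterbi algorithm to find the highest scoring path. All probabilities below are invented for the example; real gene finders are trained on known coding sequences and use far richer models (codon statistics, splice sites, exon and intron states).

```python
import math

STATES = ("noncoding", "coding")

# Toy parameters, invented for illustration only (not trained values).
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}


def viterbi(seq):
    """Return the highest scoring state path for seq (one state per base)."""
    seq = seq.upper()
    if not seq:
        return []
    # Log-likelihood of the best path ending in each state at the first base.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best previous state to arrive from when entering state s.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            )
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


path = viterbi("AAAAAAAAAA" + "GCGCGCGCGCGC" + "AAAAAAAAAA")
print("".join("C" if s == "coding" else "n" for s in path))
```

Working in log space turns products of probabilities into sums, which avoids numerical underflow on long sequences and is why features are scored as log likelihoods.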
ab initio prediction
[Figure, shown in three build steps: tracks along the genome for coding potential, ATG & stop codons, and splice sites; the final step is labelled "Find best prediction".]
Similarity prediction
[Figure, shown in two build steps: the gene expression pathway from genome through processed mRNA (m7G cap, ATG, STOP, AAAn tail) and polypeptide to folded protein and functional activity; the second step adds the labels "Find exons using transcripts" at the processed mRNA and "Find exons using peptides" at the polypeptide.]
Genefinding - similarity
• Use known coding sequence to define coding regions
  - EST sequences
  - Peptide sequences
• Needs to handle fuzzy alignment regions around splice sites
• Needs to attempt to find start and stop codons
• Examples: EST2Genome, exonerate, genewise
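One way to picture the "fuzzy alignment around splice sites" problem: when the bases at the end of one exon are identical to those just before the next exon, an aligner can place the intron at several equivalent positions. The sketch below slides a proposed intron by a few bases, only where the transcript sequence is unchanged, until it finds canonical GT...AG boundaries. It is an illustration under those assumptions, not the method used by EST2Genome, exonerate or genewise; the function names and the slack of 3 bases are made up.

```python
def is_canonical_intron(genome, start, end):
    """True if genome[start:end] begins with GT and ends with AG."""
    intron = genome[start:end].upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


def refine_intron(genome, start, end, slack=3):
    """Shift an intron (0-based, half-open) by up to `slack` bases to reach
    canonical GT...AG boundaries without changing the spliced transcript.

    Returns the refined (start, end), or None if no such placement exists.
    """
    genome = genome.upper()
    length = end - start
    # Try the smallest shifts first: 0, -1, +1, -2, +2, ...
    for shift in sorted(range(-slack, slack + 1), key=abs):
        s, e = start + shift, end + shift
        if s < 0 or e > len(genome):
            continue
        lo, hi = min(start, s), max(start, s)
        # Bases moved from one exon to the other must be identical,
        # otherwise the spliced sequence would change.
        if genome[lo:hi] != genome[lo + length:hi + length]:
            continue
        if is_canonical_intron(genome, s, e):
            return s, e
    return None


# Exon "ATGGCCAG", intron "GTAAATTTCCAG", exon "GTTTAA"; the proposed intron
# (6, 18) is two bases off but gives the same spliced sequence.
genome = "ATGGCCAG" + "GTAAATTTCCAG" + "GTTTAA"
print(refine_intron(genome, 6, 18))
```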
Similarity-based prediction
[Figure: a cDNA or peptide sequence is aligned to the genome and the alignment is used to create a prediction.]
Genefinding - comparative
• Use 2 or more genomic sequences to predict genes based on conservation of exon sequences
• Examples: Twinscan and SLAM
Genefinding - manual
• Manual annotation is time consuming
• Annotators use specialized utilities to view genomic regions with tiers/columns of data from which they construct a gene prediction
• Most decisions are subjective and tedious to document
• Avoids the systematic problems of ab initio predictors and automated annotation pipelines
Manual prediction
[Figure, shown in three build steps: tracks along the genome for EST similarity, coding potential, ATG & stop codons, and splice sites; the final step is labelled "Predict structure".]
Genefinding - non-coding RNA genes
• Non-coding RNA genes can be predicted using knowledge of their structure or by similarity with known examples
• tRNAscan - uses an HMM and covariance model for prediction of tRNA genes
• Rfam - a suite of HMMs trained against a large number of different RNA genes
Overview of current annotation system
Assembled genome
Sequencing centre gene predictions
VectorBase gene predictions
Merge into canonical set
Protein analysis
Display on genome browser
Release to GenBank/EMBL/DDBJ
VectorBase gene prediction pipeline
[Figure: evidence sources feeding the canonical predictions, with the tools used shown in parentheses]
• Blessed predictions
  - Manual annotations (Apollo)
  - Community submissions (Genewise, Exonerate, Apollo)
• Species-specific predictions (Genewise)
• Similarity predictions (Genewise)
• Protein family HMMs (Genewise)
• ncRNA predictions (Rfam)
• Transcript based predictions (Exonerate)
• Ab initio gene predictions (SNAP)
All of the above feed the canonical predictions
VectorBase curation database pipeline for manual/community annotation
[Figure: data flow linking manual annotation (Harvard) and community annotation (community representatives), both edited in Apollo, via Chado-XML to the Chado curation warehouse database, and via GFF3 to the Ensembl gene build database.]
Genefinding - Review
• Gene prediction relies heavily on similarity data
• EST/cDNA sequences are vital for genefinding
  - Training for ab initio approaches
  - Similarity builds
  - Validating predictions
• Protein data is the predominant supporting evidence for prediction in most vector genomes
  - Need to be wary of predicting from predictions
• Genefinding is still something of a dark art
  - Efforts to standardize and document supporting evidence for prediction and modifications are ongoing
Genefinding omissions
• Alternative splice forms
  - Currently there is no good method for predicting alternative isoforms
  - Only created where supporting transcript evidence is present
• Pseudogenes
  - Each genome project has a fuzzy definition of pseudogenes
  - Badly curated/described across the board
• Promoters
  - Rarely a priority for a genome project
  - Some algorithms exist but usually not integrated into an annotation set
Functional annotation
• Utilise known structure/function information to infer facts related to the predicted protein sequence
• Provide users with results from a number of standard algorithms/searches
• Provide users with cross-references (dbxrefs) to other resources
• Assign a simple one line description for each gene product

• This will never be comprehensive
• This will always be somewhat general
Genome annotation
[Figure: the gene expression pathway from genome through processed mRNA (m7G cap, ATG, STOP, AAAn tail) and polypeptide to folded protein and enzyme/functional activity, with the annotation task "Find function" marked at the folded protein.]
Functional annotation - protein similarities
• Predicted proteins are searched against the non-redundant protein database to look for similarities
• Visually assess the top 5-10 hits to identify whether these have been assigned a function
• It is important to check how the function of the top hits has been assigned in order not to transfer erroneous annotations
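Before the visual check described above, a curator might pre-filter the top hits to set aside descriptions that carry no functional information. The sketch below assumes hits are already parsed into (description, e-value) pairs sorted best-first; the list of uninformative terms and the function name are assumptions, and the filter does not replace checking how each remaining annotation was assigned.

```python
# Terms that suggest a hit has no assigned function (an assumed list).
UNINFORMATIVE_TERMS = (
    "hypothetical",
    "uncharacterized",
    "unnamed",
    "unknown",
    "predicted protein",
)


def informative_hits(hits, top_n=10):
    """Keep hits among the top_n whose description looks informative.

    hits: list of (description, e_value) tuples, sorted best-first.
    """
    kept = []
    for description, e_value in hits[:top_n]:
        text = description.lower()
        if not any(term in text for term in UNINFORMATIVE_TERMS):
            kept.append((description, e_value))
    return kept


# Made-up search results, for illustration only.
example_hits = [
    ("hypothetical protein", 1e-50),
    ("serine protease", 1e-40),
    ("conserved hypothetical protein", 1e-30),
]
print(informative_hits(example_hits))
```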
Functional annotation - Protein domains
• Protein domains have a number of definitions based on their size, folding and function/evolution
• Domains are a part of protein structure description
• Domains with a similar structure are likely to be related evolutionarily and have a similar function
• We can use this to infer function (& structure) for an unknown protein by comparison to known proteins
• The tool of choice here is a Hidden Markov Model (HMM)
Protein Domain databases
• InterPro
  - UniProt - protein database
  - Prosite - database of regular expressions
  - Pfam - profile HMMs
  - PRINTS - conserved protein signatures
  - Prodom - collection of multiple sequence alignments
  - SMART - HMMs
  - TIGRfams - HMMs
  - PIRSF
  - Superfamily
  - Gene3D
  - Panther - HMMs
Functional annotation - Other features
• Other features which can be determined
  - Signal peptides
  - Transmembrane domains
  - Low complexity regions
  - Various binding sites, glycosylation sites etc.
• See http://expasy.org/tools/ for a good list of possible prediction algorithms
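Short sequence motifs such as glycosylation sites are often described by simple patterns that translate directly into regular expressions. The sketch below scans a protein for the well-known N-glycosylation consensus, written in PROSITE-style notation as N-{P}-[ST]-{P} (Asn, any residue except Pro, Ser or Thr, any residue except Pro). The function name and the example peptide are made up, and a match is only a candidate site, not evidence of modification.

```python
import re

# PROSITE-style N-{P}-[ST]-{P} as a regular expression. The lookahead
# lets overlapping candidate sites be reported.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_n_glycosylation_sites(protein):
    """Return 1-based positions of the Asn in each candidate site."""
    return [m.start() + 1 for m in N_GLYCOSYLATION.finditer(protein.upper())]


# Made-up peptide: N3 and N12 match, N8 is followed by Pro and does not.
print(find_n_glycosylation_sites("MKNGSAANPTQNVTL"))
```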
Signal peptides
• Short peptide sequence found at the N-terminus of a pre-protein which marks the peptide for transport across one or more membranes
• e.g. SignalP
Transmembrane domains
• Simple hydrophobic regions which sit inside a membrane
• Transmembrane domains anchor proteins in a membrane and can orient other domains in the protein correctly
• Examples: Receptors, transporters, ion channels
• Identified based on the protein composition using a simple sliding window algorithm or an HMM
• e.g. Tmpred, TMHMM
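The sliding window approach can be sketched with the Kyte-Doolittle hydropathy scale: average the hydropathy over a window of residues and flag windows above a threshold. The window of 19 residues and threshold of 1.6 follow the commonly quoted Kyte-Doolittle guideline for membrane-spanning segments; the function names are assumptions, and this is not the algorithm used by Tmpred or TMHMM.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=19):
    """Mean hydropathy of each window, indexed by 0-based window start."""
    # Non-standard residues (e.g. X) are scored as neutral.
    scores = [KD.get(aa, 0.0) for aa in protein.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def candidate_tm_windows(protein, window=19, threshold=1.6):
    """Start positions of windows hydrophobic enough to span a membrane."""
    profile = hydropathy_profile(protein, window)
    return [i for i, mean in enumerate(profile) if mean >= threshold]


# Made-up sequence: a 19-residue poly-Leu stretch between charged flanks.
print(candidate_tm_windows("D" * 10 + "L" * 19 + "D" * 10))
```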
Ontologies
• Use of ontologies to annotate gene products
  - Gene Ontology (GO)
    - Cellular component
    - Molecular function
    - Biological process
  - Sequence Ontology (SO)
Other data to look at
• Enzyme classification (EC) numbers
• Phenotype information
  - Alleles
  - Gene knockouts
  - RNAi knockdowns
• Expression data
  - EST libraries (source of RNA material)
  - Microarrays
  - SAGE tags
Functional assignment
• The assignment of a function to a gene product can be made by a human curator by assessing all of the data (similarities, protein domains, signal peptide etc.)
• This is a labour intensive process and, like gene prediction, is subjective
• There are automated approaches (based on family and domain databases such as Panther or InterPro) but these are under-developed
• A large number of predictions from a genome project remain ‘hypothetical protein’ or ‘conserved hypothetical protein’
Caveats to genome annotation
• Annotation accuracy is only as good as the available supporting data at the time of annotation
• Gene predictions will change over time as new data becomes available (ESTs, related genomes)
• Functional assignments will change over time as new data becomes available (characterisation of hypothetical proteins)

• Gene predictions are ‘best guess’
• Functional annotations are not definitive and only a guide

• If you want the annotation to improve you should get involved with whoever is sequencing (or has sequenced) your genome of interest.

• For vectors you can mail [email protected] with suggestions and corrections.