Genome Annotation

Rosana O. Babu
Sequence to Annotation
Input1-Variant Annotation
Input2- Structural Annotation

Structural annotation was conducted using
AUGUSTUS (version 2.5.5) with the
Magnaporthe_grisea parameter set as the gene model

However, a gene model specific to Oomycetes
has to be developed to obtain accurate results
Input3-Functional Annotation
Genome Annotation

The process of identifying the locations of
genes and coding regions in a genome and
determining what those genes do

Finding the structural elements of a genome
and attaching the related function to each
genome location
Genome Annotation
Gene structure prediction
Identifying elements
(introns/exons, CDS, start, stop)
in the genome
Gene function prediction
Attaching biological information
to these elements, e.g. which
protein an exon codes for
Eukaryote genome annotation
[Figure: Genome → (transcription) → primary transcript → (RNA processing) → processed mRNA with m7G cap, ATG, STOP and AAAn tail → (translation) → polypeptide → (protein folding) → folded protein → enzyme activity / functional activity. Annotation steps shown alongside: find locus; find exons using transcripts; find exons using peptides; find function.]
Prokaryote genome annotation
[Figure: Genome → (transcription) → primary transcript → (RNA processing) → processed RNA carrying more than one START...STOP coding region → (translation) → polypeptide → (protein folding) → folded protein → enzyme activity / functional activity. Annotation steps shown alongside: find locus; find CDS; find function.]
Genome annotation - workflow
Genome sequence
Repeats
Masked or un-masked genome sequence
Structural annotation-Gene finding
nc-RNAs, Introns
Protein-coding genes
Functional annotation
Viewed & Released in Genome viewer
Genome Repeats & features
Polymorphic between individuals/populations
• Percentage of repetitive sequences in different organisms

Genome              Genome size (Mb)   % Repeat
Aedes aegypti       1,300              ~70
Anopheles gambiae   260                ~30
Culex pipiens       540                ~50

Microsatellite
Minisatellite
Tandem repeat
Short tandem repeat
SSR
Finding repeats as a preliminary to gene prediction
• Repeat discovery
• Literature and public databanks
Homology-based approaches
• Automated approaches (e.g. RepeatScout or RECON)
Tandem repeats: Tandem, TRF
Use RepeatMasker to search the genome and mask the sequence
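As a toy illustration of tandem-repeat discovery, short perfect repeats can be located with a regular expression. This is only a sketch and is not how TRF or RepeatMasker work internally; the function name `find_ssrs` and its thresholds are assumptions made for this example.

```python
import re


def find_ssrs(seq, min_unit=1, max_unit=6, min_copies=4):
    """Return (start, end, unit, copies) for perfect tandem repeats.

    Coordinates are 0-based and end-exclusive. Only uninterrupted repeats
    are reported, which is a deliberate simplification.
    """
    # A short unit (tried shortest first) followed by at least
    # min_copies - 1 further copies of that same unit.
    pattern = re.compile(
        r"([acgt]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1),
        re.IGNORECASE,
    )
    hits = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), match.end(), unit.lower(), copies))
    return hits


if __name__ == "__main__":
    example = "gg" + "ca" * 5 + "tt"
    for start, end, unit, copies in find_ssrs(example):
        print(start, end, unit, copies)
```

The intervals returned here have the same shape as the coordinates a masking step needs.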
Masked sequence


A repeat-masked sequence is an artificial construct in which regions thought
to be repetitive are replaced with x's
Masking is widely used to reduce the overhead of subsequent computational analyses
and to reduce the impact of transposable elements (TEs) in the final annotation set
>my sequence
atgagcttcgatagcgatcagctagcgatcaggct
actattggcttctctagactcgtctatctctatta
gctatcatctcgatagcgatcagctagcgatcagg
ctactattggcttcgatagcgatcagctagcgatc
aggctactattggcttcgatagcgatcagctagcg
atcaggctactattggctgatcttaggtcttctga
tcttct
>my sequence (repeatmasked)
atgagcttcgatagcgatcagctagcgatcaggct
actattxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxatctcgatagcgatcagctagcgatcagg
ctactattxxxxxxxxxxxxxxxxxxxtagcgatc
aggctactattggcttcgatagcgatcagctagcg
atcaggctxxxxxxxxxxxxxxxxxxxtcttctga
tcttct
Positions/locations are not affected by masking
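A minimal sketch of hard masking, assuming the repeat coordinates are already known as 0-based, end-exclusive intervals. The function name `hard_mask` is hypothetical. Each masked base is replaced one-for-one, so the sequence length and all downstream coordinates stay the same.

```python
def hard_mask(seq, intervals, mask_char="x"):
    """Replace every base inside the given intervals with mask_char.

    intervals: iterable of (start, end), 0-based and end-exclusive.
    The output has the same length as the input, so positions are preserved.
    """
    chars = list(seq)
    for start, end in intervals:
        start = max(start, 0)
        end = min(end, len(chars))  # clamp so the length can never change
        for i in range(start, end):
            chars[i] = mask_char
    return "".join(chars)


if __name__ == "__main__":
    seq = "atgagcttcgatagcg"
    masked = hard_mask(seq, [(3, 8)])
    print(seq)
    print(masked)
    # Bases after the masked block sit at the same index as before.
    print(masked[8:] == seq[8:])
```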
Types of Masking- Hard or Soft?

Sometimes we want to mark up repetitive sequence without excluding it from
downstream analyses. This is achieved with soft-masking, in which repeats are
written in lower case and the rest of the sequence in upper case
>my sequence
ATGAGCTTCGATAGCGCATCAGCTAGCGATCAGGC
TACTATTGGCTTCTCTAGACTCGTCTATCTCTATT
AGTATCATCTCGATAGCGATCAGCTAGCGATCAGG
CTACTATTGGCTTCGATAGCGATCAGCTAGCGATC
AGGCTACTATTGGCTTCGATAGCGATCAGCTAGCG
ATCAGGCTACTATTGGCTGATCTTAGGTCTTCTGA
TCTTCT
>my sequence (softmasked)
ATGAGCTTCGATAGCGCATCAGCTAGCGATCAGGC
TACTATTggcttctctagactcgtctatctctatt
agtatcATCTCGATAGCGATCAGCTAGCGATCAGG
CTACTATTggcttcgatagcgatcagcTAGCGATC
AGGCTACTATTggcttcgatagcgatcagcTAGCG
ATCAGGCTACTATTGGCTGATCTTAGGTCTTCTGA
TCTTCT
>my sequence (hardmasked)
atgagcttcgatagcgatcagctagcgatcaggct
actattxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxatctcgatagcgatcagctagcgatcagg
ctactattxxxxxxxxxxxxxxxxxxxtagcgatc
aggctactattggcttcgatagcgatcagctagcg
atcaggctxxxxxxxxxxxxxxxxxxxtcttctga
tcttct
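The relationship between the two masking styles can be sketched as follows, assuming repeat intervals are 0-based and end-exclusive. The function names `soft_mask` and `soft_to_hard` are hypothetical. A soft-masked sequence keeps the bases, so it can always be converted to a hard-masked one, but not the other way round.

```python
def soft_mask(seq, intervals):
    """Upper-case the whole sequence, then lower-case the repeat intervals."""
    chars = list(seq.upper())
    for start, end in intervals:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = chars[i].lower()
    return "".join(chars)


def soft_to_hard(soft_seq, mask_char="x"):
    """Replace every lower-case (soft-masked) base with mask_char."""
    return "".join(mask_char if c.islower() else c for c in soft_seq)


if __name__ == "__main__":
    soft = soft_mask("atgagctt", [(2, 5)])
    print(soft)                # repeats in lower case, bases still readable
    print(soft_to_hard(soft))  # same positions, bases discarded
```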
Genome annotation - workflow
Genome sequence
Map repeats
Masked or un-masked
Gene finding- structural annotation
nc-RNAs, Introns
Protein-coding genes
Functional annotation
Viewed & Released in Genome viewer
Structural annotation
Identification of genomic elements
• Open reading frames and their locations
• Coding regions
• Location of regulatory motifs
• Start/Stop
• Splice sites
• Non-coding regions/RNAs
Methods
• Similarity
  • Similarity between sequences, which does not necessarily imply any
    evolutionary linkage
• Ab initio prediction
  • Prediction of gene structure from first principles using only the genome
    sequence
Genefinding
ab initio
similarity
Gene-finding resources for homology-based methods
• Transcript
  • cDNA sequences
  • EST sequences
• Peptide
  • Non-redundant (nr) protein database
  • Protein sequence data, mass spectrometry data
• Genome
  • Other genomic sequence
ab initio prediction
[Figure: features scanned along the genome in ab initio prediction: coding potential, ATG and stop codons, splice sites.]
Genefinding - ab initio predictions
Use compositional features of the DNA sequence to define coding segments
(essentially exons)
• ORFs
• Coding bias
• Splice site consensus sequences
• Start and stop codons
Methods
• Training sets are required
• Each feature is assigned a log-likelihood score
• Dynamic programming is used to find the highest-scoring path through the
  candidate features, which is reported as the predicted gene structure
Examples: Genefinder, Augustus, Glimmer, SNAP, fgenesh
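The simplest of the features listed above, an ORF bounded by a start and a stop codon, can be sketched in a few lines. This toy scan covers the forward strand only and uses no training data, so it is far cruder than the gene finders named above. The function name `find_orfs` and the `min_codons` threshold are assumptions made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    Coordinates are 0-based and end-exclusive, and the stop codon is included.
    min_codons counts every codon in the ORF, including ATG and the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens the ORF in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    dna = "CCATGAAATGAGG"
    for start, end, frame in find_orfs(dna):
        print(start, end, frame, dna[start:end])
```

A real ab initio predictor would also score coding bias and splice sites and would combine the scores with dynamic programming, as the slide describes.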
Genefinding - similarity
• Use known coding sequence to define coding regions
  • EST sequences
  • Peptide sequences
• Difficulty: alignments are fuzzy around splice sites, so exon boundaries are hard to place exactly
• Examples: EST2Genome, exonerate, genewise
Gene-finding - comparative
• Use two or more genomic sequences to predict genes based on
  conservation of exon sequences
• Examples: Twinscan and SLAM
Genefinding - non-coding RNA genes
• Non-coding RNA genes can be predicted using knowledge of their
  structure or by similarity with known examples
• tRNAscan - uses an HMM and a covariance model for prediction of
  tRNA genes
• Rfam - a collection of RNA families, each represented by a covariance
  model built from known examples of that RNA gene
Gene-finding omissions
Alternative isoforms
Currently there is no good method for predicting alternative isoforms
Only created where supporting transcript evidence is present
Pseudogenes
Each genome project has a fuzzy definition of pseudogenes
Badly curated/described across the board
Promoters
Rarely a priority for a genome project
Some algorithms exist but usually not integrated into an annotation set
Practical- structural annotation
Eukaryotes- AUGUSTUS (gene model)
~/Programs/augustus.2.5.5/bin/augustus --strand=both --genemodel=partial \
  --singlestrand=true --alternatives-from-evidence=true \
  --alternatives-from-sampling=true --progress=true --gff3=on \
  --uniqueGeneId=true --species=magnaporthe_grisea \
  our_genome.fasta > structural_annotation.gff
Prokaryotes – PRODIGAL (Codon Usage table)
~/Programs/prodigal.v2_60.linux -a protein_file.fa -g 11 -d nucleotide_exon_seq.fa
-f gff -i contigs.fa -o genes_quality.txt -s genes_score.txt -t genome_training_file.txt
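Once either command has finished, a quick sanity check is to tally the feature types in the GFF output. The sketch below assumes the standard nine-column, tab-separated GFF layout. The function name `count_gff_features` is hypothetical, and the file name in the usage comment is taken from the AUGUSTUS command above.

```python
from collections import Counter


def count_gff_features(lines):
    """Count GFF records by feature type (column 3).

    Comment lines, blank lines and lines with fewer than nine
    tab-separated columns are skipped.
    """
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        counts[fields[2]] += 1
    return counts


# Usage (assumes the output file from the AUGUSTUS run exists):
# with open("structural_annotation.gff") as handle:
#     for feature, n in count_gff_features(handle).most_common():
#         print(feature, n)
```

A gene count far above or below what is expected for the organism is an early hint that the chosen gene model does not fit the genome.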
Structural Annotation
Structural annotation was conducted using AUGUSTUS (version
2.5.5) with the Magnaporthe_grisea parameter set as the gene model

However, a dedicated gene model has to be developed to obtain accurate
results
Functional annotation
Attaching biological information to genomic elements
• Biochemical function
• Biological function
• Regulation and interactions involved
• Expression
• Use known structural information to characterise the predicted protein sequence
Genome annotation
[Figure: Genome → (transcription) → primary transcript → (RNA processing) → processed mRNA with m7G cap, ATG, STOP and AAAn tail → (translation) → polypeptide → (protein folding) → folded protein → enzyme activity / functional activity. Annotation step highlighted: find function.]
Functional annotation – Homology Based

Predicted exons/CDS/ORFs are searched against non-redundant
protein databases (NCBI, SwissProt) to find similar sequences

Visually assess the top 5-10 hits to see whether they have
been assigned a function

Functions are assigned on the basis of these hits
Functional annotation - Other features

Other features which can be determined
• Signal peptides
• Transmembrane domains
• Low complexity regions
• Various binding sites, glycosylation sites etc.
• Protein domains
See http://expasy.org/tools/ for a good list of
possible prediction algorithms
Functional annotation - Other features (Ontologies)
• Use of ontologies to annotate gene products
• Gene Ontology (GO)
  • Cellular component
  • Molecular function
  • Biological process
Practical - functional annotation
Homology-based method
Set up a BLAST database (nucleotide/protein)
BLAST genome.fasta against the database for annotations (nucleotide/protein)
Filter BLAST hits by E-value (keep hits with E-value <= 0.01) for nucleotide/protein
Filter further for the best BLAST hits (top 5-15) and assign functions
Removing positive strand blast hits
Removing negative strand blast hits
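The E-value and best-hit filtering steps can be sketched as below. The sketch assumes BLAST tabular output with the twelve default columns (as written by BLAST+ with `-outfmt 6`), where column 11 is the E-value and column 12 the bit score. The function name `best_hits` and its defaults are assumptions, not part of the original workflow.

```python
from collections import defaultdict


def best_hits(lines, max_evalue=0.01, top_n=5):
    """Keep the top_n hits per query whose E-value does not exceed max_evalue.

    Returns {query_id: [(subject_id, evalue, bitscore), ...]} with hits
    sorted by ascending E-value, ties broken by descending bit score.
    """
    hits = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue <= max_evalue:
            hits[query].append((subject, evalue, bitscore))
    return {
        query: sorted(found, key=lambda h: (h[1], -h[2]))[:top_n]
        for query, found in hits.items()
    }


# Usage (the file name is a placeholder):
# with open("blast_results.tsv") as handle:
#     for query, found in best_hits(handle, top_n=10).items():
#         print(query, [subject for subject, _, _ in found])
```

The descriptions of the retained subject sequences are then inspected to assign a function to each query.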
Functional annotation- output
August 2008
Bioinformatics tools for Comparative Genomics
of Vectors
Conclusion

Annotation accuracy is only as good as the supporting data available at the
time of annotation, so annotations need to be updated regularly

Gene predictions will change over time as new data become available (ESTs,
related genomes) that are more similar to the target genome than the data used previously

Functional assignments will change over time as new data become available
(e.g. characterization of hypothetical proteins)
Thank You