Genome Annotation
Rosana O. Babu
Sequence to Annotation
Input1-Variant Annotation
Input2- Structural Annotation
Structural annotation was conducted using
AUGUSTUS (version 2.5.5), with
Magnaporthe_grisea as the gene model
However, an Oomycete-specific gene model has to be
developed to obtain accurate results
Input3-Functional Annotation
Genome Annotation
The process of identifying the locations of
genes and coding regions in a genome and
determining what those genes do
Finding the structural elements and attaching
their related functions to each genome
location
Genome Annotation
Gene structure prediction
Identifying elements (introns/exons, CDS, start, stop) in the genome
Gene function prediction
Attaching biological information to these elements, e.g. which protein an exon codes for
Eukaryote genome annotation
[Figure: eukaryote gene expression and the matching annotation steps]
Genome -> transcription -> primary transcript -> RNA processing -> processed mRNA (m7G cap, ATG, STOP, AAAn tail) -> translation -> polypeptide -> protein folding -> folded protein -> enzyme activity / functional activity
Annotation steps: find locus; find exons using transcripts; find exons using peptides; find function
Prokaryote genome annotation
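[Figure: prokaryote gene expression and the matching annotation steps]
Genome -> transcription -> primary transcript -> RNA processing -> processed RNA (several START ... STOP coding regions on one transcript) -> translation -> polypeptide -> protein folding -> folded protein -> enzyme activity / functional activity
Annotation steps: find locus; find CDS; find function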
Genome annotation - workflow
Genome sequence
Repeats
Masked or un-masked genome sequence
Structural annotation-Gene finding
nc-RNAs, Introns
Protein-coding genes
Functional annotation
Viewed & Released in Genome viewer
Genome Repeats & features
Polymorphic between individuals/populations
Percentage of repetitive sequences in different organisms
Genome               Genome size (Mb)   % Repeat
Aedes aegypti        1,300              ~70
Anopheles gambiae    260                ~30
Culex pipiens        540                ~50
Microsatellite
Minisatellite
Tandem repeat
Short tandem repeat
SSR
Finding repeats as a preliminary to gene prediction
Repeat discovery
Literature and public databanks
Homology based approaches
Automated approaches (e.g. RepeatScout or RECON)
Tandem repeats: Tandem, TRF
Use RepeatMasker to search the genome and mask the sequence
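As a rough illustration of what tandem-repeat discovery looks for, the sketch below finds perfect short tandem repeats (SSRs) with a regular expression. It is not how TRF or RepeatMasker work internally (those tools tolerate mismatches and indels and use repeat libraries); the function name `find_ssrs` and the thresholds are assumptions made for this example.

```python
import re


def find_ssrs(seq, min_unit=1, max_unit=6, min_copies=4):
    """Return (start, end, unit, copies) for perfect tandem repeats.

    Coordinates are 0-based and end-exclusive. Hypothetical helper,
    not part of any repeat-finding package.
    """
    seq = seq.upper()
    # Lazy unit match: the shortest repeating unit is tried first.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    hits = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        copies = (match.end() - match.start()) // len(unit)
        hits.append((match.start(), match.end(), unit, copies))
    return hits


if __name__ == "__main__":
    # Toy sequence: a dinucleotide repeat (AC x 5) flanked by unique bases.
    print(find_ssrs("GGT" + "AC" * 5 + "TTG"))
```

The intervals returned here use the same 0-based, end-exclusive convention that a masking step could consume directly.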
Masked sequence
A repeat-masked sequence is an artificial construct in which regions
thought to be repetitive are marked with X's
Widely used to reduce the overhead of subsequent computational analyses and
to reduce the impact of transposable elements (TEs) in the final annotation set
>my sequence
atgagcttcgatagcgatcagctagcgatcaggct
actattggcttctctagactcgtctatctctatta
gctatcatctcgatagcgatcagctagcgatcagg
ctactattggcttcgatagcgatcagctagcgatc
aggctactattggcttcgatagcgatcagctagcg
atcaggctactattggctgatcttaggtcttctga
tcttct
>my sequence (repeatmasked)
atgagcttcgatagcgatcagctagcgatcaggct
actattxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxatctcgatagcgatcagctagcgatcagg
ctactattxxxxxxxxxxxxxxxxxxxtagcgatc
aggctactattggcttcgatagcgatcagctagcg
atcaggctxxxxxxxxxxxxxxxxxxxtcttctga
tcttct
Positions/locations are not affected by masking
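A minimal sketch of why masking leaves coordinates untouched: each repeat interval is overwritten character for character, so the sequence length and every downstream position stay the same. The function `hard_mask` and the interval format (0-based, end-exclusive) are assumptions for illustration; RepeatMasker produces its masked output itself.

```python
def hard_mask(seq, intervals, mask_char="x"):
    """Overwrite each (start, end) interval with mask_char.

    Intervals are 0-based and end-exclusive. Hypothetical helper.
    """
    chars = list(seq)
    for start, end in intervals:
        if not 0 <= start <= end <= len(seq):
            raise ValueError("interval (%d, %d) is outside the sequence" % (start, end))
        # Same number of characters in and out, so the length is preserved.
        chars[start:end] = mask_char * (end - start)
    return "".join(chars)


if __name__ == "__main__":
    seq = "atgagcttcgatagc"
    masked = hard_mask(seq, [(3, 7)])
    print(masked)
    # A feature downstream of the masked region keeps its position.
    print(seq.find("gatagc"), masked.find("gatagc"))
```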
Types of Masking- Hard or Soft?
Sometimes we want to mark up repetitive sequence but not to exclude it from
downstream analyses. This is achieved using a format known as soft-masked
>my sequence
ATGAGCTTCGATAGCGCATCAGCTAGCGATCAGGC
TACTATTGGCTTCTCTAGACTCGTCTATCTCTATT
AGTATCATCTCGATAGCGATCAGCTAGCGATCAGG
CTACTATTGGCTTCGATAGCGATCAGCTAGCGATC
AGGCTACTATTGGCTTCGATAGCGATCAGCTAGCG
ATCAGGCTACTATTGGCTGATCTTAGGTCTTCTGA
TCTTCT
>my sequence (softmasked)
ATGAGCTTCGATAGCGCATCAGCTAGCGATCAGGC
TACTATTggcttctctagactcgtctatctctatt
agtatcATCTCGATAGCGATCAGCTAGCGATCAGG
CTACTATTggcttcgatagcgatcagcTAGCGATC
AGGCTACTATTggcttcgatagcgatcagcTAGCG
ATCAGGCTACTATTGGCTGATCTTAGGTCTTCTGA
TCTTCT
>my sequence (hardmasked)
atgagcttcgatagcgatcagctagcgatcaggct
actattxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxatctcgatagcgatcagctagcgatcagg
ctactattxxxxxxxxxxxxxxxxxxxtagcgatc
aggctactattggcttcgatagcgatcagctagcg
atcaggctxxxxxxxxxxxxxxxxxxxtcttctga
tcttct
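Because a soft-masked sequence keeps the bases and only changes their case, the repeat intervals can be recovered from it, and it can be converted to a hard-masked sequence whenever a tool needs one. The sketch below assumes the common convention of upper case for unique sequence and lower case for repeats; the function names and the choice of `N` as mask character are illustrative.

```python
import re


def soft_masked_intervals(seq):
    """Return 0-based, end-exclusive intervals of lower-case (masked) runs."""
    return [(m.start(), m.end()) for m in re.finditer(r"[a-z]+", seq)]


def soft_to_hard(seq, mask_char="N"):
    """Replace every lower-case (soft-masked) base with mask_char."""
    return re.sub(r"[a-z]", mask_char, seq)


if __name__ == "__main__":
    soft = "TACTATTggcttcTAGC"
    print(soft_masked_intervals(soft))
    print(soft_to_hard(soft))
    # Upper-casing recovers the unmasked sequence; hard masking cannot be undone.
    print(soft.upper())
```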
Genome annotation - workflow
Genome sequence
Map repeats
Masked or un-masked
Gene finding- structural annotation
nc-RNAs, Introns
Protein-coding genes
Functional annotation
Viewed & Released in Genome viewer
Structural annotation
Identification of genomic elements
Open reading frame and their localization
Coding regions
Location of regulatory motifs
Start/Stop
Splice Sites
Non coding Regions/RNA’s
Methods
Similarity
• Similarity between sequences, which does not necessarily imply any evolutionary linkage
Ab initio prediction
• Prediction of gene structure from first principles using only the genome sequence
Genefinding
ab initio
similarity
Gene-finding resources for homology-based methods
Transcript
cDNA sequences
EST sequences
Peptide
Non-redundant (nr) protein database
Protein sequence data, Mass spectrometry data
Genome
Other genomic sequence
ab initio prediction
[Figure: signals used along the genome for ab initio prediction: coding potential, ATG and stop codons, splice sites]
Genefinding - ab initio predictions
Use compositional features of the DNA sequence to define coding segments
(essentially exons)
ORFs
Coding bias
Splice site consensus sequences
Start and stop codons
Methods
Training sets are required
Each feature is assigned a log likelihood score
Use dynamic programming to find the highest scoring path for accuracy
Examples: Genefinder, Augustus, Glimmer, SNAP, fgenesh
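To make the scoring idea concrete, here is a toy sketch, not the algorithm of any gene finder listed above: it finds forward-strand ORFs (ATG to the next in-frame stop) and scores each one with codon log-odds learned from a training set against a uniform background. Splice sites and the dynamic-programming search over whole gene structures are left out, and every function name is hypothetical.

```python
import math
from collections import Counter
from itertools import product

STOPS = {"TAA", "TAG", "TGA"}
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def find_orfs(seq, min_codons=3):
    """Forward-strand ORFs as (start, end), 0-based, end-exclusive, stop included."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


def train_codon_log_odds(coding_seqs, pseudocount=1.0):
    """Log-odds of each codon in the training set versus a uniform 1/64 background."""
    counts = Counter()
    for s in coding_seqs:
        s = s.upper()
        counts.update(s[i:i + 3] for i in range(0, len(s) - 2, 3))
    total = sum(counts[c] for c in CODONS) + pseudocount * len(CODONS)
    return {c: math.log((counts[c] + pseudocount) / total * len(CODONS)) for c in CODONS}


def score_orf(seq, orf, log_odds):
    """Sum of per-codon log-odds over the ORF; higher means more 'coding-like'."""
    start, end = orf
    seq = seq.upper()
    return sum(log_odds.get(seq[i:i + 3], 0.0) for i in range(start, end, 3))


if __name__ == "__main__":
    model = train_codon_log_odds(["ATGGCTGCTGCTTAA"])  # made-up training sequence
    genome = "CCATGGCTGCTTAAGG"                       # made-up genome fragment
    for orf in find_orfs(genome):
        print(orf, round(score_orf(genome, orf, model), 3))
```

This also shows why training sets matter: the scores depend entirely on the codon frequencies learned from known coding sequence.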
Genefinding - similarity
Use known coding sequence to define coding regions
EST sequences
Peptide sequences
Difficulty: handling fuzzy alignment regions around splice sites
Examples: EST2Genome, exonerate, genewise
Gene-finding - comparative
Use two or more genomic sequences to predict genes based on
conservation of exon sequences
Examples: Twinscan and SLAM
Genefinding - non-coding RNA genes
Non-coding RNA genes can be predicted using knowledge of their
structure or by similarity with known examples
tRNAscan - uses an HMM and co-variance model for prediction of
tRNA genes
Rfam - a collection of covariance models (a structure-aware relative of HMMs)
built for a large number of different RNA families
Gene-finding omissions
Alternative isoforms
Currently there is no good method for predicting alternative isoforms
Only created where supporting transcript evidence is present
Pseudogenes
Each genome project has a fuzzy definition of pseudogenes
Badly curated/described across the board
Promoters
Rarely a priority for a genome project
Some algorithms exist but usually not integrated into an annotation set
Practical- structural annotation
Eukaryotes- AUGUSTUS (gene model)
~/Programs/augustus.2.5.5/bin/augustus --strand=both --genemodel=partial --singlestrand=true \
  --alternatives-from-evidence=true --alternatives-from-sampling=true --progress=true \
  --gff3=on --uniqueGeneId=true --species=magnaporthe_grisea \
  our_genome.fasta > structural_annotation.gff
Prokaryotes – PRODIGAL (Codon Usage table)
~/Programs/prodigal.v2_60.linux -a protein_file.fa -g 11 -d nucleotide_exon_seq.fa \
  -f gff -i contigs.fa -o genes_quality.txt -s genes_score.txt -t genome_training_file.txt
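Both commands write GFF output, so a quick sanity check is to count the predicted features by type. The sketch below assumes the standard nine tab-separated GFF columns; the example records are made up, and `parse_gff` and `count_feature_types` are hypothetical helper names.

```python
from collections import Counter


def parse_gff(lines):
    """Yield one dict per feature line, skipping comments and malformed lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue
        seqid, source, ftype, start, end, score, strand, phase, attrs = cols
        yield {
            "seqid": seqid,
            "type": ftype,
            "start": int(start),  # GFF coordinates are 1-based, inclusive
            "end": int(end),
            "strand": strand,
            "attributes": attrs,
        }


def count_feature_types(lines):
    """Count features by the type column (gene, CDS, intron, ...)."""
    return Counter(rec["type"] for rec in parse_gff(lines))


if __name__ == "__main__":
    example = [  # made-up records for illustration only
        "##gff-version 3",
        "contig1\tAUGUSTUS\tgene\t100\t900\t0.95\t+\t.\tID=g1",
        "contig1\tAUGUSTUS\tCDS\t100\t399\t0.95\t+\t0\tID=g1.t1.cds;Parent=g1.t1",
        "contig1\tAUGUSTUS\tCDS\t500\t900\t0.95\t+\t0\tID=g1.t1.cds;Parent=g1.t1",
    ]
    print(count_feature_types(example))
```

On real output the same function can be fed an open file handle, e.g. `count_feature_types(open("structural_annotation.gff"))`.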
Structural Annotation
Structural annotation was conducted using AUGUSTUS (version
2.5.5), with Magnaporthe_grisea as the gene model
However, a dedicated gene model has to be developed to obtain accurate
results
Functional annotation
Attaching biological information to genomic elements
Biochemical function
Biological function
Regulation and interactions involved
Expression
• Apply known structural information to the predicted protein sequence
Genome annotation
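[Figure: gene expression pathway, highlighting the functional annotation step]
Genome -> transcription -> primary transcript -> RNA processing -> processed mRNA (m7G cap, ATG, STOP, AAAn tail) -> translation -> polypeptide -> protein folding -> folded protein -> enzyme activity / functional activity
Annotation step: find function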
Functional annotation – Homology Based
Predicted exons/CDS/ORFs are searched against non-redundant
protein databases (NCBI nr, SwissProt) for similar sequences
Visually assess the top 5-10 hits to identify whether they have
been assigned a function
Functions are then assigned to the predictions accordingly
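The visual check of the top hits can be roughly approximated in code: walk down the ranked hit descriptions and take the first one that carries a real function, falling back to "hypothetical protein". This is only a sketch of that manual step; the keyword list, the function name and the example descriptions are all assumptions.

```python
# Words that usually mean a hit has no assigned function (assumed list).
UNINFORMATIVE = ("hypothetical", "uncharacterized", "unnamed", "predicted protein", "unknown")


def assign_function(hit_descriptions, top_n=10):
    """Return the first informative description among the top_n ranked hits."""
    for desc in hit_descriptions[:top_n]:
        if not any(word in desc.lower() for word in UNINFORMATIVE):
            return desc
    return "hypothetical protein"


if __name__ == "__main__":
    hits = [  # made-up descriptions, best hit first
        "hypothetical protein X1",
        "Uncharacterized protein",
        "glucan 1,3-beta-glucosidase",
    ]
    print(assign_function(hits))
```

A manual review is still advisable, since the best informative hit may be a weak or partial match.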
Functional annotation - Other features
Other features which can be determined
Signal peptides
Transmembrane domains
Low complexity regions
Various binding sites, glycosylation sites etc.
Protein Domain
See http://expasy.org/tools/ for a good list of
possible prediction algorithms
Functional annotation - Other features
(Ontologies)
Use of ontologies to annotate gene products
Gene Ontology (GO)
Cellular component
Molecular function
Biological process
Practical - functional annotation
Homology Based Method
Set up a BLAST database for nucleotide/protein
Run BLAST on genome.fasta for annotations (nucleotide/protein)
Filter BLAST hits with an E-value cutoff of 0.01 for nucleotide/protein
Further filter to the best BLAST hits (top 5-15) and assign functions
Removing Positive strand blast hits
Removing negative strand blast hits
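The filtering steps above can be sketched for BLAST tabular output (`-outfmt 6`, whose default columns end with `evalue` and `bitscore`). The sketch assumes that hits with an E-value at or below the cutoff are the ones to keep, and that in nucleotide searches a minus-strand hit is reported with subject start greater than subject end. Function names and the example rows are made up.

```python
from collections import defaultdict


def hit_strand(fields):
    """'plus' if sstart <= send, else 'minus' (columns 9 and 10, 1-based)."""
    return "plus" if int(fields[8]) <= int(fields[9]) else "minus"


def best_hits(lines, max_evalue=0.01, top_n=5, strand=None):
    """Map each query to its top_n subject IDs, lowest E-value first.

    strand may be None (keep all), 'plus' or 'minus'.
    """
    hits = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        if strand is not None and hit_strand(fields) != strand:
            continue
        # Sort key: E-value ascending, then bit score descending.
        hits[fields[0]].append((evalue, -bitscore, fields[1]))
    return {q: [s for _, _, s in sorted(h)[:top_n]] for q, h in hits.items()}


if __name__ == "__main__":
    rows = [  # made-up rows in outfmt 6 column order
        "g1\tP1\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200",
        "g1\tP2\t50.0\t100\t50\t0\t1\t100\t1\t100\t0.5\t30",
        "g1\tP3\t95.0\t100\t5\t0\t1\t100\t100\t1\t1e-60\t250",
        "g2\tP4\t40.0\t50\t30\t0\t1\t50\t1\t50\t2.0\t20",
    ]
    print(best_hits(rows))
    print(best_hits(rows, strand="plus"))
```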
Functional annotation- output
August 2008
Bioinformatics tools for Comparative Genomics of Vectors
Conclusion
Annotation accuracy is only as good as the supporting data available at the
time of annotation, so annotations need to be updated
Gene predictions will change over time as new data become available (ESTs,
related genomes) that are more similar than the data previously available
Functional assignments will change over time as new data become available
(characterization of hypothetical proteins)
Thank You