
Comparative Genomics of Viruses:
VirGen as a case study
Dr. Urmila Kulkarni-Kale
Bioinformatics Centre
University of Pune
Pune 411 007
Biodiversity
Data diversity
Data Diversity at various levels of Biocomplexity
Viral Comparative Genomics
• Viruses: Best represented taxa with respect
to complete genome sequences
• Viral genome sequences:
– ‘Entries’ in primary sequence databases such as
GenBank/EMBL/DDBJ
– Lack of annotations for genomic sequences
– Opportunity to develop ‘Derived’ database
VirGen: Comparative genomics & data mining of
viral genomes
© Bioinformatics Centre, University of Pune
Browse VirGen at
http://bioinfo.ernet.in/virgen/virgen.html
Public Repository → Database
Issues involved in data curation
• Source of data: GenBank
• Retrieval engine: Entrez
• Queries: well-designed Perl scripts
• Consistent with ICTV nomenclature
• Annotation including strain information
• Generation of representative list of genomes
• Sequence-based ontology for protein name
• Annotation of unannotated entries using representative genomes
What are complete and putative genome records?
• Complete genome
– is annotated as a ‘complete genome record’ by the primary sequence
databases available in the public domain.
• Putative genome
– is not annotated as a ‘complete genome record’ but is likely to be a
complete genome, as the sequence length is in the typical range of
the complete genome for the respective virus.
• As the database contains multiple genomic entries for
various strains/isolates of most viruses, a
‘representative genomic entry’ is identified for
every viral species. The representative entries provide a
non-redundant set of viral genome sequences, which are
subsequently used for annotation and for studying
phylogenetic relationships.
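The complete/putative distinction above can be sketched as a simple check on the record's definition line and its sequence length. This is only an illustrative Python sketch, not VirGen's curation code: the function name, the `typical_range` argument, the "other" label and the example length range are all assumptions made for illustration.

```python
from typing import Tuple


def classify_genome_record(definition: str, length: int,
                           typical_range: Tuple[int, int]) -> str:
    """Label a sequence record as 'complete', 'putative' or 'other'.

    definition    -- free-text definition line of the database entry
    length        -- sequence length in bases
    typical_range -- (min, max) complete-genome length assumed for the
                     virus in question (hypothetical, supplied by curator)
    """
    # Complete: the primary database itself says so.
    if "complete genome" in definition.lower():
        return "complete"
    # Putative: not labelled complete, but the length is in the typical
    # range of a complete genome for this virus.
    low, high = typical_range
    if low <= length <= high:
        return "putative"
    # Anything else (e.g. single genes, fragments); label is an assumption.
    return "other"


# Illustrative values only; the length range is not taken from VirGen.
print(classify_genome_record("Virus X strain Y genomic RNA", 10700, (10000, 11500)))
```

A length check alone cannot prove completeness, which is why such records are called putative rather than complete.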
Organisation of VirGen
Salient features of VirGen:
• Organizes genomic data in a structured fashion navigating from the
family to an isolate
• Full genomes of viruses
• Compilation of representative genome entries for every viral species
(Virus Taxonomy, 7th report of ICTV)
• Complete annotation of every genomic entry
• Graphical representation of genome organization using SVG
technology
• Generation of alternative names of proteins
• On-the-fly genome comparisons using BLAST2
• Multiple Sequence Alignment (MSA) of genomes, proteomes and
individual proteins
• Whole genome phylogeny
• Prediction of B-cell epitopes
Design & Implementation
OS: Microsoft Windows 2000 server
DBMS: MySQL™
Data processing & Query system: CGI Perl scripts and ASP
Graphical interface: SVG
Web interface: HTML implementing VB and Java scripts
Sequence analysis programs used
Sequence similarity search: BLAST v2.2.5 (Altschul et al., 1997)
Genome comparisons: BLAST2 v2.2.5
Multiple Sequence Alignment: Parallel version of ClustalW v1.8 (Chenna et al., 2003)
Phylogeny: Parallel version of PHYLIP v3.573 (Felsenstein & SGI)
B-cell epitope prediction: Kolaskar & Tongaonkar (1990)
VirGen home
Menu to browse viral families
Navigation bar
Search using Keywords & Motifs
Genome analysis & Comparative genomics resources
Guided tour & Help
Sample genome record in VirGen
Tabular display of genome annotation
Retrieve sequence in FASTA format
‘Alternate names’ of proteins
Graphical view of Genome Organization
Viral polyprotein along with the UTRs
Graphical view generated dynamically using Scalable Vector Graphics technology
Multiple Sequence Alignment
MSA
Link for batch retrieval of sequences
Dendrogram
Browsing the module of Whole Genome Phylogenetic trees
Most parsimonious tree of genus Flavivirus
Input data: Whole genome
Method: DNA parsimony
Bootstrapping: 1000
Browsing the module of Predicted epitopes
B-cell epitopes predicted using
Kolaskar & Tongaonkar method
VirGen: Structure bin
Links to CEP for precomputed sequential and conformational epitopes
CEP: Conformational Epitope Prediction Server
http://bioinfo.ernet.in/cep.htm
Precomputed CEs:
OCA browser (PDB)
links CEP predictions
Applications of VirGen
• Representative Genome list
– A curated and annotated data set for analyses
• Genome View
– Graphical representation of genome organisation
– Insertion/Deletion analysis
– Gene order
• MSA data
– Discovery of patterns: Diagnostics
– Primer design
• Predicted epitopes
– Vaccinome at a glance: DNA/peptide vaccine
• Whole Genome Phylogeny
– Evolution of strains/viruses
– Characterisation of virus
Case study: Whole Genome Phylogeny depicts clustering of viruses w.r.t. their vectors
Family: Flaviviridae
Pestivirus
Flavivirus: Tick-borne
Flavivirus: Mosquito-borne
Hepacivirus
Unassigned ?
Pestivirus
Case Study: Insertions in Pestivirus 1
The 891-1787 bp region remains unannotated when annotation is
transferred from the representative strain
What is the origin of the insert?
BLAST against VirGen confirmed the non-viral origin of the insert
BLAST against GenBank produced a significant match with the
Bos taurus J-domain protein
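Spotting a candidate insert such as the 891-1787 bp region comes down to finding stretches of the genome that no transferred annotation covers. The Python sketch below shows one way to do that. It is not VirGen's implementation: the function name, the `min_length` parameter, the genome length and the flanking feature coordinates are hypothetical, and only the 891-1787 gap itself comes from the slide.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end)


def unannotated_regions(genome_length: int,
                        annotated: List[Interval],
                        min_length: int = 1) -> List[Interval]:
    """Return regions of the genome not covered by any annotated feature."""
    gaps: List[Interval] = []
    next_free = 1  # first position not yet covered by a feature
    for start, end in sorted(annotated):
        if start > next_free:
            # Positions next_free..start-1 carry no annotation.
            gaps.append((next_free, start - 1))
        # Overlapping or nested features must not move the pointer back.
        next_free = max(next_free, end + 1)
    if next_free <= genome_length:
        gaps.append((next_free, genome_length))
    # Keep only gaps long enough to be worth a follow-up BLAST search.
    return [(s, e) for s, e in gaps if e - s + 1 >= min_length]


# Hypothetical features flanking the insert; genome length is illustrative.
features = [(1, 890), (1788, 12000)]
print(unannotated_regions(12000, features, min_length=100))
```

Each reported gap can then be extracted and searched against viral and non-viral databases, as was done for the insert in the case study.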
VirGen: current statistics