Transcript Slides

Swiss-Prot Fortaleza
• Sybil: Comparative Analysis System
• Gemina: Epidemiological Resource
Owen White
July 31st, 2006
ISMB POSTERS
• Sam Angiuoli Ergatis/Sybil Poster H-82
• Aaron Gussman Gemina Poster B-46
Swiss-Prot Fortaleza
• Sybil: Comparative Analysis System
• Gemina: Epidemiological Resource
Sybil searches and computes
• Searches:
– All v all blastp searches
– MUMmer: Nucleotide/Protein SNPs
• Clustering:
– Evaluation of protein-match networks
– Scoring system set by user
– COGs – bidirectional best hits
– Jaccard COG-clustering for transitive closure
– Also paralogs
• Syntenic block:
– Collections of J-COG between species
– Runs of J genes without K non-homologous intervening genes
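The "bidirectional best hits" step above can be sketched as follows. This is a minimal illustration, not Sybil's actual code: the function names, the `(query, subject, score)` tuple layout, and the `genome|gene` ID convention are all assumptions made for the example.

```python
def best_hits(hits):
    """Map (query, subject_genome) to the query's top-scoring subject.

    `hits` is an iterable of (query, subject, score) tuples, for example
    parsed from tabular blastp output. IDs are assumed (hypothetically)
    to look like 'genome|gene'.
    """
    best = {}
    for query, subject, score in hits:
        q_genome = query.split("|")[0]
        s_genome = subject.split("|")[0]
        if q_genome == s_genome:
            continue  # same-genome hits are paralog candidates, skipped here
        key = (query, s_genome)
        if key not in best or score > best[key][1]:
            best[key] = (subject, score)
    return best


def bidirectional_best_hits(hits):
    """Return the set of protein pairs that are each other's best hit."""
    best = best_hits(hits)
    pairs = set()
    for (query, _s_genome), (subject, _score) in best.items():
        q_genome = query.split("|")[0]
        reverse = best.get((subject, q_genome))
        if reverse is not None and reverse[0] == query:
            pairs.add(frozenset((query, subject)))
    return pairs
```

A pair is kept only when each protein is the other's top-scoring match in the opposite genome; ties are resolved in favor of the first hit seen.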
How Sybil Computes Are Performed
• Blast
• Position effect (conserved gene order)
• MUMmer
– SNPs
• PROmer
• Gene families
– COGs
– Paralogs
[Diagram: BSML Dumper; blastp, COG, PE, MUMmer, SNPs, PROmer; BSML Loader]
Primary output: BSML-XML
Data Prep For Comparative Analysis
[Diagram: GenBank Files, EMBL Files, Custom Files; BSML, GFF3; Chado; BSML Dumper; blastp, COG, PE, MUMmer, PROmer, SNPs; BSML Loader; GMOD Consortium]
Jaccard Clustering
Jaccard-filtered Orthologs
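One common reading of Jaccard filtering on a protein-match network is sketched below: an edge is kept only when the two proteins share most of their match neighbors, and the surviving edges are closed transitively into clusters. The 0.6 cutoff and the function names are assumptions for illustration, not Sybil's settings.

```python
def jaccard(a, b):
    """Jaccard coefficient of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def jaccard_clusters(edges, cutoff=0.6):
    """Cluster proteins joined by matches whose neighbor sets overlap enough.

    `edges` is a list of (protein_a, protein_b) matches. Each protein's
    neighbor set includes the protein itself.
    """
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, {u}).add(v)
        neighbors.setdefault(v, {v}).add(u)

    # Union-find gives the transitive closure of the edges that pass.
    parent = {n: n for n in neighbors}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for u, v in edges:
        if jaccard(neighbors[u], neighbors[v]) >= cutoff:
            parent[find(u)] = find(v)

    clusters = {}
    for n in neighbors:
        clusters.setdefault(find(n), set()).add(n)
    return sorted(clusters.values(), key=lambda c: sorted(c))
```

A protein attached to a cluster by a single weak link (few shared neighbors) ends up in its own cluster rather than being pulled in.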
Match Reduction
Fig. 6. Using a minimal spanning tree (MST) algorithm to remove redundant matches.
Protein cluster image before (left) and after (right) applying the MST filter.
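The match reduction in Fig. 6 can be illustrated with Kruskal's algorithm. The slide does not say how matches are weighted, so this sketch assumes each match carries a distance-like weight where lower means a stronger match; the function name and tuple layout are hypothetical.

```python
def minimum_spanning_tree(matches):
    """Keep the fewest matches that still connect every protein.

    `matches` is a list of (protein_a, protein_b, weight) tuples. Lower
    weight is assumed to mean a stronger match.
    """
    parent = {}

    def find(n):
        parent.setdefault(n, n)
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    kept = []
    # Kruskal: take edges from strongest to weakest, skipping any edge
    # whose endpoints are already connected by kept edges.
    for a, b, weight in sorted(matches, key=lambda m: m[2]):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            kept.append((a, b, weight))
    return kept
```

For a cluster of n mutually matching proteins this reduces up to n(n-1)/2 matches to n-1, which is what makes the "after" image readable.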
Sybil: Chromosomal summaries
Preferences for pop-up displays are user configurable.
Jaccard-filtered COGs
Syntenic blocks
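An earlier slide defines a syntenic block as a run of J genes without K non-homologous intervening genes. A minimal sketch of that rule follows, assuming gene positions are simple integer indices along each genome and that a run breaks when K or more unmatched genes intervene in either genome. The names `min_genes` (J) and `max_gap` (K) are illustrative.

```python
def syntenic_blocks(pairs, min_genes, max_gap):
    """Group orthologous gene pairs into syntenic blocks.

    `pairs` holds (index_in_genome_A, index_in_genome_B) for each
    orthologous pair. A block needs at least `min_genes` pairs (J), and a
    run ends when `max_gap` (K) or more genes intervene in either genome.
    """
    blocks, current = [], []
    for pair in sorted(pairs):
        if current:
            prev = current[-1]
            gap_a = pair[0] - prev[0] - 1
            gap_b = abs(pair[1] - prev[1]) - 1  # abs() tolerates inversions
            if gap_a >= max_gap or gap_b >= max_gap:
                if len(current) >= min_genes:
                    blocks.append(current)
                current = []
        current.append(pair)
    if len(current) >= min_genes:
        blocks.append(current)
    return blocks
```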
Fig. 1. Whole genome alignment of GBS strains
Tettelin, Hervé et al. (2005) Proc. Natl. Acad. Sci. USA 102, 13950-13955
Copyright ©2005 by the National Academy of Sciences
Fig. 2. GBS core genome
Tettelin, Hervé et al. (2005) Proc. Natl. Acad. Sci. USA 102, 13950-13955
Copyright ©2005 by the National Academy of Sciences
Fig. 3. GBS pan-genome
Tettelin, Hervé et al. (2005) Proc. Natl. Acad. Sci. USA 102, 13950-13955
Copyright ©2005 by the National Academy of Sciences
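As a reminder of what the two terms mean: the core genome is the set of genes shared by every strain, and the pan-genome is the union of genes across all strains. The toy sketch below computes both from per-strain sets of gene-cluster IDs. It is not the extrapolation analysis of Tettelin et al., and the strain and cluster names in the usage note are hypothetical.

```python
def core_and_pan_genome(strain_genes):
    """Return (core, pan) for a mapping of strain name -> gene cluster IDs."""
    gene_sets = [set(genes) for genes in strain_genes.values()]
    if not gene_sets:
        return set(), set()
    core = set.intersection(*gene_sets)  # present in every strain
    pan = set.union(*gene_sets)          # present in at least one strain
    return core, pan
```

For example, `core_and_pan_genome({"strain_1": {"c1", "c2"}, "strain_2": {"c1", "c3"}})` gives a core of `{"c1"}` and a pan-genome of `{"c1", "c2", "c3"}`.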
Other Sybil Features
• Open source.
• sybil.sourceforge.net
• Complete demo database
• Other packages:
– Chado relational database
– BSML XML (Bioinformatic Sequence Markup Language)
– Bioperl (Lincoln Stein's Bio::Graphics package)
– Apache Batik SVG toolkit
– MUMmer suffix-tree alignment tools
Important: To run Sybil
• You must load data into Chado.
• We have flat file to BSML parsers
• To be released as open source.
Ergatis: latest discussion.
Sam Angiuoli Ergatis/Sybil Poster H-82
• me: then when're we releasing ergatis?
• Sam: so, the plan was that all these scripts would just
come bundled with Ergatis
• me: right.
• Sam: we need a deadline
• me: oh. is this on the record? I think I'll just put this chat
in my power point for tomorrow.
• Sam: i really don't think there is that much we need to
do in order to release it. most of the concerns will be
about how a user can install and configure it to point to
their installs of all the 3rd party search tools they'd want
to use.
Swiss-Prot Fortaleza
• Sybil: Comparative Analysis System
• Gemina: Epidemiological Resource
Defining Infection Systems
Pathogen: Bacterial and Viral pathogens; NIAID Category A, B & C Priority Pathogens
Host: human and animal
Transmission Method: direct; indirect; mechanical; vector-borne
Anatomy: animal structure; body region; cardiovascular system; cell; digestive system; endocrine system
Disease: Blood and blood-forming organs diseases; Circulatory system diseases; Complications of pregnancy; Digestive system diseases; Genitourinary system diseases
Symptom
Reservoir
Geographic Location
Infection Systems distinguish modes of transmission, hosts, disease
Pathogen | Host | Transmission Method | Anatomy | Disease
Clostridium botulinum C | Bos taurus | indirect: vehicle-borne ingestion | gastrointestinal (GI) tract | Foodborne botulism
Clostridium botulinum F | Homo sapiens | indirect: vehicle-borne ingestion | gastrointestinal (GI) tract | Infant botulism
Clostridium botulinum B | Homo sapiens | direct: contact | skin | Wound botulism
Clostridium botulinum | Homo sapiens | indirect: airborne | respiratory tract | Botulism
Mycobacterium tuberculosis | Homo sapiens | direct: droplet spread | respiratory tract | Tuberculosis
Mycobacterium tuberculosis | Homo sapiens | indirect: airborne | respiratory tract | Tuberculosis
Mycobacterium tuberculosis | Homo sapiens | | brain | Meningitis
Mycobacterium tuberculosis | Pan troglodytes | | lymph nodes | Tuberculosis
Indirect: airborne
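An Infection System, as tabulated above, is one combination of pathogen, host, transmission method, anatomy and disease. A minimal sketch of how such records could be represented and filtered is given below. The class and function names are hypothetical and do not reflect Gemina's schema; the sample rows are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InfectionSystem:
    pathogen: str
    host: str
    transmission: str
    anatomy: str
    disease: str


# A few rows from the slide's table.
SYSTEMS = [
    InfectionSystem("Clostridium botulinum C", "Bos taurus",
                    "indirect: vehicle-borne ingestion",
                    "gastrointestinal (GI) tract", "Foodborne botulism"),
    InfectionSystem("Clostridium botulinum B", "Homo sapiens",
                    "direct: contact", "skin", "Wound botulism"),
    InfectionSystem("Mycobacterium tuberculosis", "Homo sapiens",
                    "direct: droplet spread", "respiratory tract",
                    "Tuberculosis"),
    InfectionSystem("Mycobacterium tuberculosis", "Homo sapiens",
                    "indirect: airborne", "respiratory tract",
                    "Tuberculosis"),
]


def find_systems(systems, **criteria):
    """Return the systems whose fields equal every given criterion."""
    return [
        s for s in systems
        if all(getattr(s, field) == value for field, value in criteria.items())
    ]
```

Because each row is a distinct record, the same pathogen can appear several times with different hosts or transmission modes, which is the distinction the slide title points at.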
Ontologies & Controlled Vocabularies in Gemina
• infectious disease and body system oriented
• hierarchical query and retrieval
• Mapping of terms from newly defined threat_systems and MRS terms
disease – anatomy – symptom – transmission method – reservoir – geographic location
(1667)
(1322)
(424)
disease
+diseases of the respiratory system
+infectious and parasitic diseases
+arthropod-borne viral disease
+intestinal infectious diseases
+other bacterial diseases
+bacterial infection
+gas gangrene
+staphylococcus infection
+tetanus
(16)
(243)
reservoir
+animal reservoir
+arthropod
+mollusc
+environmental reservoir
+soil
+food
+human reservoir
+blood
+respiratory tract
(964)
Anatomy Ontology
+Animal_structure
+Body_region
+Cardiovascular_system
+Cell
+Digestive_system
+Embryonic_structure
+Endocrine_system
+Fluids_and_secretions
+Hemic_and_immune_system
+Integumentary_system
+Musculoskeletal_system
+Nervous_system
+Respiratory_system
+Sense_organ
+Stomatognathic_system
+Tissue
+Urogenital_system
Respiratory_system
+ larynx
+ lung
+ pharynx
+ nasopharynx
+ oropharynx
+Africa
+Americas
+Caribbean
+Central America
+North America
+South America
+Argentina
+Bolivia
+Brazil
+North Region
+Northeast Region
+Rio Grande do Norte
+Sergipe
+ Fortaleza
+Central West Region
+Antarctic Regions
+Arctic Regions
+Asia
+Atlantic Islands
+Europe
+Indian Ocean Islands
+Oceania
+Oceans and Seas
+World Wide
Geographic location
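"Hierarchical query and retrieval" over these term trees can be sketched as expanding a selected term to include everything beneath it. The parent-child links below are an assumption: the slide's indentation was lost, so nasopharynx and oropharynx are placed under pharynx for illustration only.

```python
# Hypothetical fragment of the anatomy hierarchy (parent -> children).
CHILDREN = {
    "Respiratory_system": ["larynx", "lung", "pharynx"],
    "pharynx": ["nasopharynx", "oropharynx"],
}


def descendants(term, children=CHILDREN):
    """Return every term below `term` in the hierarchy."""
    found = []
    stack = list(children.get(term, []))
    while stack:
        node = stack.pop()
        found.append(node)
        stack.extend(children.get(node, []))
    return found


def expand(term, children=CHILDREN):
    """Return the term together with all of its descendants."""
    return {term, *descendants(term, children)}
```

A query for `Respiratory_system` would then also match records annotated with `lung` or `oropharynx`.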
Gemina query page: select topic tabs to add terms to Selection Summary
Scroll down the list of choices or click on Tree view to navigate the hierarchy of terms
Query Anatomy Ontology for terms including ‘tissue’
Identify Infection Systems involving nerve tissue, select, add to Selection Box
Gemina Search Results
View and sort Infection Systems by topic. Unique ID.
Navigate back to the Gemina Query Page
Curated GEMINA Infection Systems
(as of July 28th, 2006)
NIAID Category | Pathogen | Number of Infection Systems | Number of Geographic Locations
Total | 22 | 1616 | 3852
A | Bacillus anthracis | 18 | -
A | Clostridium botulinum | 61 | 257
A | Francisella tularensis | 44 | 18
A | Yersinia pestis | 33 | 48
B | Brucella abortus | 3 | -
B | Brucella canis | 7 | -
B | Brucella melitensis | 18 | -
B | Brucella spp. | 11 | -
B | Brucella suis | 15 | -
B | Burkholderia mallei | 55 | 47
B | Burkholderia pseudomallei | 210 | 108
B | Campylobacter jejuni | 42 | 148
B | Clostridium perfringens | 120 | 30
B | Coxiella burnetii | 67 | 69
B | Escherichia coli | 328 | 545
B | Listeria monocytogenes | 96 | 191
B | Rickettsia prowazekii | 10 | 74
B | Salmonella typhimurium | 105 | 89
B | Staphylococcus aureus | 100 | 86
B | Vibrio cholerae | 31 | 178
C | Influenza | 168 | 97
C | Mycobacterium tuberculosis | 74 | 1867
Microbial Rosetta Stone (MRS): a database that relates microorganism names,
taxonomic classifications, diseases, and scientific literature for the most important
human, animal and plant microbial pathogens, with links to public genomic sequence
databases
Applications of Gemina
• Pathogen Identification Applications:
– biodefense, animal health care, food safety,
diagnostics, pathology, clinical research,
forensics, drug discovery
• Under Open Access.
• Disease/Anatomy/Symptoms
– DNA sequence, genomes
– Physical resources
– Proteomic data
Case Study: Submit queries of multiple terms to
view related Infection Systems
Microbial Identification of Clinically Significant Microbes
NIH Clinical Center Collaboration: Dr. Patrick Murray
• Creation of Identification Clinical Reference Set
• Identify unique signature tags to distinguish organisms
• Goal: identify the minimum number of tests (50 bp unique signatures) needed to identify
a Gram-negative rod bacterium using pyrosequencing
• Genus-level identification
• Species, Strain-level identification
• Test Set: Clinical isolates of Gram-negative rods not reliably identified by
biochemical testing: 140 Proteobacteria
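The idea of a "unique signature" can be illustrated with a naive k-mer comparison: a signature is a substring of the target genome that occurs in none of the background genomes. This is a toy sketch under stated assumptions, not the project's pipeline. It ignores reverse complements, and the function names are hypothetical. The default `k=50` mirrors the 50 bp signatures mentioned above.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of `seq`."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_signatures(target, backgrounds, k=50):
    """Return k-mers found in `target` but in none of `backgrounds`."""
    background = set()
    for seq in backgrounds:
        background |= kmers(seq, k)
    return kmers(target, k) - background
```

Holding every k-mer in memory only works for small inputs; whole-genome comparisons need an index-based approach such as suffix trees.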
Case Study 2: Insignia Homeland Security
[Diagram labels: PANDA; DNA Sequence; Sequence Data Flow; Data Input: NCBI: Genomic Sequence, TIGR: Infection Systems; Annotation extractor; Chado genome annotation; Diagnostics: DNA Signatures: Univ. MD; MRS; Database schema: Pathogens and Disease; Epidemiology Data Flow; TAXON_ID; Web Interface; Gemina]
DNA datasets outside of GenBank that we have identified and included in PANDA.
Organism Name | Sequencing Center
Acidobacteria bacterium Ellin345 | DOE Joint Genome Institute
Acinetobacter baumannii | Genoscope
Actinobacillus actinomycetemcomitans HK1651 | University of Oklahoma
Bacteriovorax marinus | Wellcome Trust Sanger Institute
Bordetella avium | Wellcome Trust Sanger Institute
Burkholderia cenocepacia J2315 | Wellcome Trust Sanger Institute
Chromohalobacter salexigens | DOE Joint Genome Institute
Citrobacter rodentium | Wellcome Trust Sanger Institute
Clavibacter michiganensis subsp. sepedonicus | Wellcome Trust Sanger Institute
Clostridium botulinum A | Wellcome Trust Sanger Institute
Clostridium difficile 630 | Wellcome Trust Sanger Institute
Erwinia amylovora | Wellcome Trust Sanger Institute
Escherichia coli 042 | Wellcome Trust Sanger Institute
Escherichia coli E2348/69 | Wellcome Trust Sanger Institute
Francisella tularensis subsp. holarctica FSC200 | Baylor College of Medicine
Frankia sp. EAN1pec | DOE Joint Genome Institute
Geobacillus stearothermophilus 10 | University of Oklahoma
Helicobacter mustelae | Wellcome Trust Sanger Institute
Lactobacillus brevis | DOE Joint Genome Institute
Mannheimia haemolytica PHL213 | Baylor College of Medicine
Methylobacterium extorquens AM1 | Integrated Genomics
Mycobacterium marinum M | Wellcome Trust Sanger Institute
Mycobacterium microti | Wellcome Trust Sanger Institute
Neisseria meningitidis FAM18 | Wellcome Trust Sanger Institute
Paenibacillus larvae subsp. larvae | Baylor College of Medicine
Proteus mirabilis | Wellcome Trust Sanger Institute
Pseudomonas fluorescens SBW25 | Wellcome Trust Sanger Institute
Rhizobium leguminosarum bv. viciae 3841 | Wellcome Trust Sanger Institute
Rhodobacter capsulatus SB1003 | Integrated Genomics
Salmonella bongori 12149 | Wellcome Trust Sanger Institute
Salmonella typhimurium DT104 | Wellcome Trust Sanger Institute
Salmonella typhimurium SL1344 | Wellcome Trust Sanger Institute
Salmonella typhimurium TR7095 | Wellcome Trust Sanger Institute
Serratia marcescens subsp. marcescens Db11 | Wellcome Trust Sanger Institute
Shewanella baltica | DOE Joint Genome Institute
Shigella dysenteriae M131649 | Wellcome Trust Sanger Institute
Shigella sonnei 53G | Wellcome Trust Sanger Institute
Spiroplasma kunkelii CR2-3x | University of Oklahoma
Streptococcus equi | Wellcome Trust Sanger Institute
Streptococcus equi subsp. zooepidemicus | Wellcome Trust Sanger Institute
Streptococcus pneumoniae 23F | Wellcome Trust Sanger Institute
Streptococcus pyogenes Manfredo | Wellcome Trust Sanger Institute
Streptococcus suis P1/7 | Wellcome Trust Sanger Institute
Streptococcus uberis 0140J | Wellcome Trust Sanger Institute
Thermoanaerobacter ethanolicus | DOE Joint Genome Institute
Vibrio salmonicida LFI1238 | Wellcome Trust Sanger Institute
Wolbachia endosymbiont of Onchocerca volvulus | Wellcome Trust Sanger Institute
Wolbachia pipientis | Wellcome Trust Sanger Institute
Yersinia enterocolitica (type 0:8) | Wellcome Trust Sanger Institute
Ongoing Development
• Creating links to Insignia from Results page
• Enable choice of target and background
genomes from Gemina Search Results
• Links to Web resources for each pathogen
• Community involvement in development of
ontologies
• Workshop on Ontology of Diseases: Nov. 6-7, 2006
• Inclusion of additional datasets (Scotland –
Disease data)
Sam Angiuoli: Sybil/Ergatis, Poster H-82
Jonathan Crabtree: Sybil Interface
Aaron Gussman: Gemina, Poster B-46