Powerpoint File - Centre for Microbial Diseases and Immunity
IslandPath: A computational aid for identifying genomic islands
that may play a role in microbial pathogenicity
William Hsiao1*, Nancy Price2, Ivan Wan3, Steven J. Jones3, and Fiona S. L. Brinkman1.
1Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, 2Department of Medical Genetics, University of British Columbia, Vancouver, and
3Genome Sequence Centre, B.C. Cancer Agency, British Columbia, Canada
www.pathogenomics.bc.ca/brinkman
Abstract
As more genomes from bacterial pathogens are sequenced, it is becoming apparent that a significant proportion of virulence factors are encoded in
clusters of genes termed Pathogenicity Islands (reviewed in 1). These islands, like other genomic islands, tend to have atypical guanine and
cytosine content (%G+C), contain mobility genes (e.g. transposases and integrases), and are associated with tRNA sequences. We have
developed a web-based computational tool, IslandPath, that displays these features across a full genome in order to facilitate the
identification of genes in new genome sequences that may be involved in virulence or have horizontal origins. Visualizing these features
within their genomic context can also improve detection of genomic island borders and neighbouring genes. Atypical %G+C by itself is not
indicative of a horizontal origin for the sequence involved; however, the predictive power increases when such regions are associated with mobile
elements or direct repeats, or contain genes with similarity to known virulence factors. We are therefore incorporating into IslandPath algorithms to
detect partial tRNAs in new genomic sequences, which are likely to be remnants of phage insertion events, and are also comparing the genomic
sequences to a custom-built database of a subset of known virulence factors. Preliminary results are encouraging: IslandPath displays
known Pathogenicity Islands as distinct regions within the genomes. The tool also permitted a more in-depth analysis of
%G+C variance in genomes and enabled us to detect correlations not previously reported. As more
genome data become available, tools like IslandPath, which can be updated in an automated fashion, will become valuable for genomic research.
Horizontal Gene Transfer and Bacterial Pathogenicity:
Several types of mobile elements have been shown to carry virulence factors:
Transposons: ST enterotoxin genes in E. coli
Prophages: Shiga-like toxins in EHEC; diphtheria toxin gene; cholera toxin; botulinum toxins
Plasmids: Shigella, Salmonella, Yersinia
Pathogenicity Islands: Uro/Entero-pathogenic E. coli; Salmonella typhimurium; Yersinia spp.; Helicobacter pylori; Vibrio cholerae
Whole Genome (predicted) ORF Display:
Genome ORFs are displayed to allow interesting regions (rich in mobility genes, abnormal %G+C,
close to structural RNAs) to be viewed in a genome context. E.g. H. pylori 26695 genome
IslandPath Graphical Display:
Each dot in the graphic corresponds to a predicted protein-coding ORF in the genome.
Dot colours indicate whether an ORF has a higher or lower %G+C than the cutoffs you set
(default settings are +/- 3.48* of the mean %G+C).
You may click on a dot to view a portion of an annotation table presented below the graphic.
Several low %G+C regions can be seen in the graphic display:
CAG island
a region containing virB homologues, not present in strain J99
the plasticity zone (contains different genes in J99 and 26695)
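The dot-colouring rule described for the graphical display can be sketched as follows. This is only an illustration in Python, not IslandPath's own code; the function name classify_orf, the strict inequality at the cutoff, and the example values are assumptions.

```python
def classify_orf(orf_gc, genome_mean_gc, cutoff=3.48):
    """Label one ORF by how its %G+C compares with the genome mean.

    cutoff is in %G+C points; 3.48 is the default quoted on the poster.
    Treating values exactly at the cutoff as "typical" is an assumption.
    """
    delta = orf_gc - genome_mean_gc
    if delta > cutoff:
        return "high"
    if delta < -cutoff:
        return "low"
    return "typical"


# Hypothetical ORF values against a genome mean of 39.4 %G+C.
for gc in (35.0, 40.0, 43.0):
    print(gc, classify_orf(gc, 39.4))
```

A user-set cutoff would simply be passed as the third argument, for example classify_orf(35.0, 39.4, cutoff=5.0).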
Detection of Known Pathogenicity Islands:
Yersinia pestis strain CO92: High Pathogenicity Island core (in red rectangle)
Mean: 47.9; Std. Dev.: 4.9
%GC | S.D. | Location | Orientation | Product
56.48 | +1 | 2140840..2142861 | | pesticin/yersiniabactin receptor protein
58.81 | +2 | 2142992..2144569 | | yersiniabactin siderophore biosynthetic protein
58.33 | +2 | 2144573..2145376 | | yersiniabactin biosynthetic protein YbtT
60.40 | +2 | 2145373..2146473 | | yersiniabactin biosynthetic protein YbtU
60.79 | +2 | 2146470..2155961 | | yersiniabactin biosynthetic protein
60.15 | +2 | 2156049..2162156 | | yersiniabactin biosynthetic protein
56.35 | +1 | 2162347..2163306 | | transcriptional regulator YbtA
57.29 | +1 | 2163473..2165275 | + | lipoprotein inner membrane ABC-transporter
58.62 | +2 | 2165262..2167064 | + | inner membrane ABC-transporter YbtQ
59.48 | +2 | 2167057..2168337 | + | putative signal transducer
55.25 | +1 | 2168365..2169669 | + | putative salicylate synthetase
52.65 | | 2169863..2171125 | | integrase
* 3.48 = 1.5 S.D. of the mean for Chlamydia genomes, which are proposed to have undergone no recent horizontal gene transfer (data not shown).
Vibrio cholerae chromosome I: VPI (toxin-coregulated pilus)
The VPI is delineated as a stretch of low %G+C flanked by mobility genes.
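A region like the VPI, a run of atypical-%G+C ORFs with mobility genes in or next to it, could be flagged automatically along the following lines. This is a sketch under stated assumptions rather than a description of IslandPath: the Orf record, the candidate_islands function, and the min_run and flank parameters are all hypothetical, and high and low %G+C ORFs are treated alike for simplicity.

```python
from dataclasses import dataclass


@dataclass
class Orf:
    name: str
    gc: float           # %G+C of the ORF
    is_mobility: bool   # e.g. annotated as an integrase or transposase


def candidate_islands(orfs, mean_gc, cutoff=3.48, min_run=3, flank=2):
    """Return (start, end) index pairs, end exclusive, for runs of at least
    min_run consecutive atypical ORFs that have a mobility gene inside the
    run or within flank ORFs on either side (all thresholds hypothetical)."""
    islands = []
    n = len(orfs)
    i = 0
    while i < n:
        if abs(orfs[i].gc - mean_gc) <= cutoff:
            i += 1
            continue
        j = i
        while j < n and abs(orfs[j].gc - mean_gc) > cutoff:
            j += 1
        window = orfs[max(0, i - flank):min(n, j + flank)]
        if j - i >= min_run and any(o.is_mobility for o in window):
            islands.append((i, j))
        i = j
    return islands
```

Requiring both signals reflects the poster's point that atypical %G+C alone is weak evidence of horizontal origin.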
Detection of Proposed or Potential Genomic Islands:
Escherichia coli O157:H7:
The area displayed in the white rectangle is ~28 kb in size (from 3708 kbp to 3736 kbp) and contains the Type III
secretion proteins Eprs, Epas, and Eivs, and numerous hypothetical proteins of unknown function.
Vibrio cholerae chromosome I:
The area displayed in the red rectangle is ~34 kb in size (from 1896 kbp to 1930 kbp) and contains a tRNA-Ser in the
same orientation as the phage integrase downstream of it. The ORFs encode one putative helicase, one
chemotaxis protein MotB-related protein, one putative type I restriction enzyme HsdR, one putative DNA
methylase, one putative N-acetylneuraminate lyase, one C4-dicarboxylate-binding periplasmic protein, and
numerous hypothetical proteins and conserved hypothetical proteins.
A tRNA adjacent to an abnormal %G+C region is often observed to be in the same orientation as the region.
This might be an artefact of phage insertion and excision events, as the 3’ ends of tRNAs are common phage
attachment (att) sites.
Methods:
Core scripts written in Perl and CGI/Perl
Sequence data: NCBI Genome FTP site
Potential mobility elements: COG analysis (2, 3) plus keyword scan
RNA locations: NCBI data plus tRNAscan-SE (4)
%G+C calculated for each ORF
Mean and Std. Dev. calculated for all ORFs in the genome
File containing all ORF information used to generate a graphical representation
Virulence Gene Subset (VGS) database developed through literature analysis of genes identified as
virulence factors using the “Molecular Koch’s Postulates” (i.e. gene knockout affects virulence)
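The %G+C and keyword-scan steps listed under Methods can be sketched as below. The poster's scripts are in Perl; this Python version is only an illustration. The function names, the use of the population standard deviation, and the two-word keyword list (taken from the mobility-gene examples in the abstract) are assumptions.

```python
from statistics import mean, pstdev

# Hypothetical keyword list; the abstract names transposases and integrases
# as examples of mobility genes, and the full list used is not given.
MOBILITY_KEYWORDS = ("transposase", "integrase")


def gc_percent(seq):
    """Return %G+C of a nucleotide sequence, ignoring non-ACGT symbols."""
    seq = seq.upper()
    counts = {base: seq.count(base) for base in "ACGT"}
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * (counts["G"] + counts["C"]) / total


def genome_gc_stats(orf_seqs):
    """Return (per-ORF %G+C list, mean, population S.D.) for one genome."""
    values = [gc_percent(s) for s in orf_seqs]
    return values, mean(values), pstdev(values)


def is_mobility_gene(product):
    """Case-insensitive keyword scan of an annotation string."""
    text = product.lower()
    return any(word in text for word in MOBILITY_KEYWORDS)
```

Whether the sample or the population standard deviation is the better choice here is open; with thousands of ORFs per genome the two differ very little.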
%G+C Analysis for Complete Genome Sequences:
Bacterial pathogens:
Organism | Primary Diseases | Cellular Localization | # of ORFs | %G+C Mean (ORFs >300bp) | %G+C S.D. (ORFs >300bp)
Neisseria meningitidis serogroup B strain MC58 | meningitis | extracellular | 2025 | 52.4 | 6.9
Neisseria meningitidis serogroup A strain Z2491 | meningitis | extracellular | 2121 | 52.6 | 6.5
Xylella fastidiosa | citrus variegated chlorosis | extracellular | 2766 | 53.4 | 5.4
Escherichia coli O157:H7 (E. coli O157:H7_EDL933) | diarrhoea | facultative intracellular | 5361 (5349) | 51.1 (51.9) | 5.3 (5.3)
Mycoplasma pneumoniae M129 | mycoplasmal pneumonia ("walking pneumonia") | extracellular | 677 | 40.3 | 4.9
Yersinia pestis strain CO92 | bubonic plague and pneumonic plague | facultative intracellular | 3885 | 48.3 | 4.7
Streptococcus pneumoniae TIGR4 (S. pneumoniae R6) | bacterial pneumonia, meningitis, sepsis, and otitis media | extracellular | 2094 (2043) | 40.3 (40.4) | 4.4 (4.3)
Treponema pallidum Nichols | syphilis | extracellular | 1031 | 51.4 | 4.2
Mycoplasma pulmonis | murine respiratory mycoplasmosis | extracellular | 782 | 27.2 | 3.8
Pseudomonas aeruginosa PAO1 | variety of mucosal infections (opportunistic) | extracellular | 5565 | 67.0 | 3.8
Rickettsia conorii Malish 7 | Mediterranean spotted fever | obligate intracellular | 1374 | 32.4 | 3.8
Frequencies of ORF %G+C in Genomes:
Histograms of frequencies of %G+C were plotted for several organisms.
Observations:
The lowest kurtosis occurs most commonly when the mode of the ORF %G+C values in a genome
is 33.33% (e.g. M. jannaschii DSM2661).
This G+C value corresponds to maximum A/T in synonymous sites
for the standard codon usage table.
Long tails in the frequency plots occur more frequently downward
(e.g. H. pylori J99 and N. meningitidis) than upward.
These observations likely reflect either a bias in gene identification in
high G+C genomes or selection towards higher A+T content.
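The shape measures behind these observations (kurtosis, and the direction of the long tail) can be computed from the per-ORF %G+C values as sketched here. The function name distribution_shape and the choice of population moments with excess kurtosis are assumptions; the poster does not say which estimator was used.

```python
from statistics import mean


def distribution_shape(values):
    """Return (skewness, excess kurtosis) from population moments.

    Negative skewness indicates a longer downward (low %G+C) tail;
    excess kurtosis is 0 for a normal distribution.
    """
    n = len(values)
    mu = mean(values)
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    m4 = sum((v - mu) ** 4 for v in values) / n
    if m2 == 0:
        raise ValueError("all values are identical")
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


# Hypothetical ORF %G+C values with one low outlier (a downward tail).
skew, kurt = distribution_shape([40.0, 40.0, 40.0, 40.0, 20.0])
print(skew < 0)
```

Applied to a whole genome, the input would be the list of %G+C values for all ORFs, optionally restricted to ORFs longer than 300 bp as in the table.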
%G+C Analysis General Observations:
High %G+C variance is associated with species with evidence of recent
horizontal gene transfers (e.g. N. meningitidis).
Low %G+C variance is associated with highly clonal species and species
with no evidence of horizontal gene transfers (e.g. Chlamydia species,
which are obligate intracellular microbes thought to have been ecologically
isolated from other bacteria for a longer period than other obligate
intracellular bacteria).
%G+C variance is similar within a single species, with the exception of the two
V. cholerae chromosomes and the two E. coli strains. However, chromosome
II of V. cholerae appears to have originated from a megaplasmid captured
by Vibrio (5). For E. coli, the pathogenic strain O157:H7 has higher %G+C
variance. This might be due to the presence of PAIs and other potentially
horizontally transferred genetic elements.
Bacterial pathogens (continued):
Organism | Primary Diseases | Cellular Localization | # of ORFs | %G+C Mean (ORFs >300bp) | %G+C S.D. (ORFs >300bp)
Ureaplasma urealyticum serovar 3 | urethritis | extracellular | 613 | 25.8 | 3.8
Vibrio cholerae N16961 | cholera | extracellular | I: 2736; II: 1092 | I: 48.1; II: 46.9 | I: 3.7; II: 4.3
Borrelia burgdorferi B31 | Lyme disease | facultative intracellular | 851 | 28.7 | 3.6
Streptococcus pyogenes | scarlet fever, toxic shock-like syndrome | extracellular | 1696 | 38.9 | 3.6
Mycoplasma genitalium G37 | urethritis (opportunistic, usually HIV patients) | extracellular | 484 | 31.4 | 3.5
Campylobacter jejuni NCTC11168 | gastroenteritis | extracellular | 1654 | 30.6 | 3.5
Helicobacter pylori 26695 (H. pylori J99) | peptic ulcers and gastritis | extracellular | 1566 (1491) | 39.4 (39.7) | 3.4 (3.3)
Haemophilus influenzae Rd-KW20 | upper respiratory infection, meningitis | extracellular | 1709 | 38.5 | 3.4
Mycobacterium tuberculosis CDC1551 (M. tuberculosis H37Rv) | tuberculosis | facultative intracellular | 4187 (3918) | 65.5 (65.6) | 3.3 (3.3)
Pasteurella multocida PM70 | fowl cholera, cattle septicemia, etc. | extracellular | 2014 | 40.8 | 3.3
Rickettsia prowazekii Madrid E | epidemic typhus | obligate intracellular | 834 | 30.1 | 3.3
Staphylococcus aureus Mu50 (S. aureus N315) | food poisoning, toxic shock syndrome, necrotizing fasciitis | extracellular | 2714 (2595) | 33.3 (32.2) | 3.0 (3.0)
Mycobacterium leprae | leprosy | obligate intracellular | 2720 | 60.0 | 2.9
Agrobacterium tumefaciens C58 (Cereon) | crown gall (in plants) | extracellular | c: 2721; l: 1833 | c: 59.8; l: 59.7 | c: 2.7; l: 2.9
Chlamydophila pneumoniae AR39 (C. pneumoniae J138) [C. pneumoniae CWL029] | chlamydial pneumonia | obligate intracellular | 1110 (1070) [1052] | 41.1 (41.1) [41.1] | 2.6 (2.6) [2.6]
Chlamydia trachomatis D | chlamydia | obligate intracellular | 894 | 41.5 | 2.3
Chlamydia muridarum MoPn | chlamydia | obligate intracellular | 909 | 40.8 | 2.2
Non-pathogens:
Escherichia coli K12 | | | 4289 | 51.3 | 4.7
Discussion:
IslandPath appears to be an effective automated tool to
visualize and detect genomic islands. Previous reports
have expressed concern about the use of %G+C to detect
horizontal gene transfer (HGT); however, these reports examined %G+C for
individual genes. We propose that %G+C analysis is
effective if clusters of genes containing motifs associated
with mobility elements are considered.
Foreign genes with %G+C similar to that of the organism’s
genome are not detected and, because of gene amelioration,
only “recent” HGT can be detected. This tool represents
one approach, which can be complemented by others, to
prioritize particular genomic islands that merit further
research.
Future developments:
Virulence factor homology search (based on
comparison to our VGS dataset)
Alternative DNA signatures (e.g. codon usage)
Allow users to input their own sequences for analysis
References
1 Hacker J and Kaper JB, 2000, Annu Rev Microbiol. 54:641-79
2 Tatusov RL, et al., 1997, Science 278(5338):631-7
3 Tatusov RL, et al., 2001, Nucleic Acids Res. 29(1):22-8
4 Lowe TM and Eddy SR, 1997, Nucleic Acids Res. 25(5):955-64
5 Heidelberg JF, et al., 2000, Nature 406:477-84
Acknowledgements
This project is funded by the Peter Wall Institute for Advanced Studies.
We wish to thank Tatiana Tatusov of NCBI for providing helpful files for
IslandPath and acknowledge the efforts of the many genome projects that
have made our analysis possible.