dwuFamMasterSlide
Identify Archaeal and Bacterial Phylogenetic Markers
Dongying Wu
We have more than 700 complete genome sequences:
1. Select 100 representatives
2. Build gene families
3. Identify families that are present in all organisms in equal numbers
4. Build HMMs and run phylogenetic analysis to identify the true markers
[Figure: Phylogenetic Tree of Bacteria (built from 31 concatenated marker alignments), with Proteobacteria and Firmicutes labeled]
Gene Family Classification
Blastp: E value cutoff 1e-10, report 10000 hits
Only blastp hits that span 80% of the lengths of both genes are kept as links
313,139 genes from 100 genomes => 28,710,015 links
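The link rule above (E-value cutoff plus 80% span on both genes) can be sketched as a small filter. This is only an illustration: the `Hit` record, the `lengths` lookup, and the choice to drop self-hits are assumptions, not part of the original pipeline.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    """One blastp hit; field names are hypothetical."""
    query: str
    subject: str
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    evalue: float

def is_link(hit, lengths, max_evalue=1e-10, min_span=0.8):
    """Return True if the hit qualifies as a link between two genes."""
    # Assumption: self-hits are not counted as links.
    if hit.query == hit.subject or hit.evalue > max_evalue:
        return False
    # Fraction of each gene covered by the aligned region.
    q_cov = (abs(hit.q_end - hit.q_start) + 1) / lengths[hit.query]
    s_cov = (abs(hit.s_end - hit.s_start) + 1) / lengths[hit.subject]
    return q_cov >= min_span and s_cov >= min_span

lengths = {"a": 100, "b": 110, "c": 300}
good = Hit("a", "b", 5, 95, 10, 105, 1e-30)
short = Hit("a", "c", 1, 100, 1, 100, 1e-40)
weak = Hit("a", "b", 5, 95, 10, 105, 1e-5)
print(is_link(good, lengths), is_link(short, lengths), is_link(weak, lengths))
```

The second hit fails because it covers only a third of gene "c", even though its E-value is strong.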
MCL Clustering Algorithm
Links (matrix of sequence similarities) -> Expansion -> Inflation (I=2) -> equilibrium state
73,686 singletons; 23,336 families (239,453 genes)
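The expansion and inflation loop can be illustrated with a toy pure-Python version of Markov clustering. It is a minimal sketch for small graphs, not the MCL program itself: the self-loop weight, the convergence tolerance, and the cluster read-out rule are assumptions made for the example.

```python
def normalize_columns(m):
    """Scale every column so that it sums to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]

def expand(m):
    """Expansion: square the matrix (two-step random walk)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def inflate(m, power):
    """Inflation: raise entries to a power, then renormalize columns."""
    return normalize_columns([[x ** power for x in row] for row in m])

def mcl(links, n, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster nodes 0..n-1 from weighted links (a, b, weight)."""
    m = [[0.0] * n for _ in range(n)]
    for a, b, w in links:
        m[a][b] = m[b][a] = w
    for i in range(n):
        m[i][i] = 1.0  # assumption: unit self-loops
    m = normalize_columns(m)
    for _ in range(max_iter):
        new = inflate(expand(m), inflation)
        delta = max(abs(new[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = new
        if delta < tol:  # equilibrium state reached
            break
    # Rows with a non-zero diagonal are attractors; their support is a cluster.
    clusters = set()
    for i in range(n):
        if m[i][i] > 1e-6:
            clusters.add(frozenset(j for j in range(n) if m[i][j] > 1e-6))
    return sorted(sorted(c) for c in clusters)

links = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0), (3, 4, 1.0)]
print(mcl(links, n=6))
```

Node 5 has no links, so it comes out as a singleton, which mirrors the split between singletons and families reported on the slide.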
Rules for Families of Markers:
1.The family has to cover all 100 genomes (high universality)
2.Each genome has to have equal numbers (high evenness)
$$\text{Evenness} = 100 \times e^{-\frac{4}{N_g} \sum_i |N_i - N_m|}$$
Ni: the number of gene family members from genome i;
Nm: the median of Ni across the 100 genomes;
Ng: the total number of genomes;
Universality: the number of genomes in which a family is present
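A minimal sketch of the two scores, under stated assumptions: the evenness exponent is read as -4 times the summed absolute deviation from the median, divided by the genome number, and universality is expressed as a percentage of genomes. The function names are illustrative.

```python
import math
from statistics import median

def universality(counts):
    """Percentage of genomes with at least one family member."""
    present = sum(1 for n in counts if n > 0)
    return 100.0 * present / len(counts)

def evenness(counts):
    """100 * exp(-4 * sum(|Ni - Nm|) / Ng), with Nm the median copy number."""
    ng = len(counts)
    nm = median(counts)
    deviation = sum(abs(n - nm) for n in counts)
    return 100.0 * math.exp(-4.0 * deviation / ng)

# One copy in 99 genomes and two copies in one genome.
counts = [1] * 99 + [2]
print(universality(counts), round(evenness(counts), 2))
```

Under this reading a single extra copy in 100 genomes costs about four points of evenness, so a cutoff of 80 tolerates roughly five deviating genomes.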
Phylogenetic Marker Identification
Out of the 502 families with high universality:
* 31 phylogenetic markers from AMPHORA
* 39 marker candidates with a high evenness value (>=80)
(25 families are either single-copy in every genome, or have two copies in one genome that branch together in phylogenetic trees)
Build PHYML trees with the AMPHORA markers and the 25 marker candidates, and
compare the tree topologies with the genome tree
[Figure: two tree comparison measures. NODAL distance (TOPD/FMTS). Split (Robinson-Foulds) distance (TOPD/FMTS): the ratio of the internal edges being bad (0-1). Two example trees with taxa A to G, each with internal edges labeled "good edge" or "bad edge".]
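The "ratio of bad internal edges" idea can be sketched by comparing the splits (bipartitions) of two trees. This is an illustration, not the TOPD/FMTS implementation: trees are written as nested tuples of leaf names, which is a hypothetical input format, and both trees are assumed to share the same leaf set.

```python
def leaf_set(tree):
    """All leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def splits(tree):
    """Non-trivial bipartitions, each stored as the side without a fixed reference leaf."""
    all_leaves = leaf_set(tree)
    ref = min(all_leaves)
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(walk(child) for child in node))
        side = all_leaves - below if ref in below else below
        # Keep only internal edges: both sides need at least two leaves.
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return below

    walk(tree)
    return found

def bad_edge_ratio(gene_tree, genome_tree):
    """Fraction of the gene tree's internal edges absent from the genome tree."""
    gene, genome = splits(gene_tree), splits(genome_tree)
    return len(gene - genome) / len(gene) if gene else 0.0

genome_tree = ((("A", "B"), "C"), ("D", ("E", "F")))
gene_tree = ((("A", "C"), "B"), ("D", ("E", "F")))
print(bad_edge_ratio(gene_tree, genome_tree))
```

Here one of the three internal edges of the gene tree (the one grouping A with C) has no counterpart in the genome tree, so the ratio is one third.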
Distances between gene trees and the AMPHORA concatenated genome tree
[Figure: two bar charts, one per distance measure: NODAL distance (axis 0 to 6) and SPLIT distance (axis 0 to 0.9).]
Legend: AMPHORA marker; ribosomal protein; transcription/translation related protein; DNA repair protein; protein of other function; distance between the genome tree and 100 random trees (average and standard deviation)
Gene order, first panel: rpmA, coaE, trmD, rpsS, radA, rplD, tsf, frr, ttf, rplR, rplM, rplI, rpsB, rpsO, mraW, rpsH, rplQ, rplL, rplT, rplE, rpsP, rplC, rplV, rplS, infC, rpsM, rplO, rplU, rpsL, rpsQ, guaA, rpsG, smpB, priA, rpsK, rplK, serS, rplA, rplF, ruvA, rpsC, rplN, rplP, rpsE, pyrH, rpsI, secY, rpsJ, purA, rplB, nusA, ruvB, rRNA16S
Gene order, second panel: coaE, rpmA, rplL, rpsQ, rplR, rplQ, rpsH, smpB, rpsO, rplP, rpsS, rplV, rplT, rplO, rpsP, rpsK, rplU, tsf, trmD, rplS, ttf, rpsI, mraW, rpsL, rpsG, rplM, rplI, pyrH, rpsM, ruvA, radA, purA, rplK, rplD, infC, rplC, rplE, rplA, frr, rplF, serS, rplN, guaA, ruvB, rpsB, rpsJ, rRNA16S, secY, rplB, priA, rpsE, rpsC, nusA
SACCHAROPHAGUS DEGRADANS 2 40
PSEUDOMONAS SYRINGAE PV SYRINGAE B728A
HAHELLA CHEJUENSIS KCTC 2396
ACINETOBACTER SP ADP1
ALCANIVORAX BORKUMENSIS SK2
ESCHERICHIA COLI K12
PSYCHROMONAS INGRAHAMII 37
COLWELLIA PSYCHRERYTHRAEA 34H
gamma
THIOMICROSPIRA CRUNOGENA XCL 2
XYLELLA FASTIDIOSA TEMECULA1
NITROSOCOCCUS OCEANI ATCC 19707
METHYLOCOCCUS CAPSULATUS STR BATH
LEGIONELLA PNEUMOPHILA STR LENS
COXIELLA BURNETII RSA 493
FRANCISELLA TULARENSIS SUBSP TULARENSIS FSC198
NEISSERIA MENINGITIDIS Z2491
NITROSOMONAS EUTROPHA C91
beta
METHYLIBIUM PETROLEIPHILUM PM1
BURKHOLDERIA MALLEI ATCC 23344
MAGNETOSPIRILLUM MAGNETICUM AMB 1
GLUCONOBACTER OXYDANS 621H
ZYMOMONAS MOBILIS SUBSP MOBILIS ZM4
alpha
BARTONELLA HENSELAE STR HOUSTON 1
NITROBACTER HAMBURGENSIS X14
ROSEOBACTER DENITRIFICANS OCH 114
CAULOBACTER CRESCENTUS CB15
HYPHOMONAS NEPTUNIUM ATCC 15444
MYXOCOCCUS XANTHUS DK 1622
SORANGIUM CELLULOSUM SO CE 56
BDELLOVIBRIO BACTERIOVORUS HD100
delta
GEOBACTER SULFURREDUCENS PCA
SYNTROPHUS ACIDITROPHICUS SB
DESULFOTALEA PSYCHROPHILA LSV54
DESULFOVIBRIO VULGARIS SUBSP VULGARIS STR HILDENBOROUGH
NITRATIRUPTOR SP SB155 2
SULFURIMONAS DENITRIFICANS DSM 1251
ARCOBACTER BUTZLERI RM4018
SULFUROVUM SP NBC37 1
Epsilon
CAMPYLOBACTER JEJUNI SUBSP JEJUNI NCTC 11168
HELICOBACTER HEPATICUS ATCC 51449
HELICOBACTER PYLORI J99
TREPONEMA DENTICOLA ATCC 35405
Spirochaetes
LEPTOSPIRA BORGPETERSENII SEROVAR HARDJO BOVIS L550
Planctomycetes
RHODOPIRELLULA BALTICA SH 1
CANDIDATUS PROTOCHLAMYDIA AMOEBOPHILA UWE25
Chlamydiae
CHLOROBIUM TEPIDUM TLS
Chlorobi
SALINIBACTER RUBER DSM 13855
CYTOPHAGA HUTCHINSONII ATCC 33406
Bacteroidetes
PORPHYROMONAS GINGIVALIS W83
GRAMELLA FORSETII KT0803
FUSOBACTERIUM NUCLEATUM SUBSP NUCLEATUM ATCC 25586
Fusobacteria
CORYNEBACTERIUM EFFICIENS YS 314
NOCARDIA FARCINICA IFM 10152
SALINISPORA TROPICA CNB 440
Actinobacteria
FRANKIA ALNI ACN14A
STREPTOMYCES COELICOLOR A3 2
ARTHROBACTER AURESCENS TC1
CLAVIBACTER MICHIGANENSIS SUBSP MICHIGANENSIS NCPPB 382
PROPIONIBACTERIUM ACNES KPA171202
BIFIDOBACTERIUM LONGUM NCC2705
PROCHLOROCOCCUS MARINUS SUBSP PASTORIS STR CCMP1986
SYNECHOCYSTIS SP PCC 6803
Cyanobacteria
GLOEOBACTER VIOLACEUS PCC 7421
DEHALOCOCCOIDES SP CBDB1
Chloroflexi
CHLOROFLEXUS AURANTIACUS J 10 FL
CARBOXYDOTHERMUS HYDROGENOFORMANS Z 2901
PELOTOMACULUM THERMOPROPIONICUM SI
DESULFITOBACTERIUM HAFNIENSE Y51
SYMBIOBACTERIUM THERMOPHILUM IAM 14863
CLOSTRIDIUM DIFFICILE 630
CLOSTRIDIUM KLUYVERI DSM 555
THERMOANAEROBACTER TENGCONGENSIS MB4
Firmicutes
BACILLUS LICHENIFORMIS ATCC 14580
OCEANOBACILLUS IHEYENSIS HTE831
STAPHYLOCOCCUS SAPROPHYTICUS SUBSP SAPROPHYTICUS ATCC 15305
LISTERIA WELSHIMERI SEROVAR 6B STR SLCC5334
LACTOCOCCUS LACTIS SUBSP CREMORIS SK11
LACTOBACILLUS CASEI ATCC 334
LACTOBACILLUS HELVETICUS DPC 4571
LACTOBACILLUS REUTERI F275
OENOCOCCUS OENI PSU 1
THERMUS THERMOPHILUS HB27
DEINOCOCCUS RADIODURANS R1
AQUIFEX AEOLICUS VF5
THERMOTOGA MARITIMA MSB8
Genome Tree (leaf and clade labels listed above; scale bar 0.2)
[Figure: rpmA gene tree (scale bar 0.2); the leaf labels are the genome names already listed for the genome tree]
Better Tree Comparison is Needed
Not all edges are equal
We need to know how a marker performs at
different taxonomic levels and groups
Actinobacteria:
47 pre-GEBA genomes
26 GEBA genomes(16 completed)
Only 63 completed actinobacterial
genomes are included in this study
Basic rules:
Every genome should have only one
copy from a family for that family to
be counted as marker candidate
(plus/minus 1)
63 genomes (251585 proteins, 18534 large family-proteins)
BLASTP (cutoff 1e-10 over 80% span)
20460854 links
MCL (I=2)
38450 MCL clusters
818 clusters (>=62 members and <2000 members)
170 families with 62-64 members:
105 can be marker candidates
Universality 100 (size=63-64), 98 (size=62)
ISSUE ONE:
Are there any markers embedded in the larger clusters?
Example: ObgE and YchF (GTP-binding protein)
Automatic Tree Screening:
1. Pick clades with the desired number of taxa
2. Calculate universality and evenness
3. Generate families and build HMMs
4. Search the HMM profiles against all actinobacterial peptides to see if the families are distinct
818 trees (60-2000 genes/tree, built with MUSCLE/FastTree)
155 clades with leaf number=63, universality=100, evenness=100
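The clade-picking step can be sketched as a walk over a rooted tree that keeps every clade holding exactly one gene from each genome (universality 100 and evenness 100). The nested-tuple tree and the `genome|gene` leaf naming are assumptions for illustration; a real screen would also need to handle unrooted trees.

```python
def screen_clades(tree, genomes):
    """Return the leaf lists of clades with exactly one leaf per genome."""
    wanted = set(genomes)
    hits = []

    def walk(node):
        if isinstance(node, str):
            return [node]
        leaves = [leaf for child in node for leaf in walk(child)]
        owners = [leaf.split("|")[0] for leaf in leaves]
        # One copy per genome: leaf count equals genome count and all genomes appear.
        if len(leaves) == len(wanted) and set(owners) == wanted:
            hits.append(sorted(leaves))
        return leaves

    walk(tree)
    return hits

tree = (
    (("g1|a", "g2|a"), "g3|a"),
    (("g1|b", "g2|b"), ("g3|b", "g3|c")),
)
print(screen_clades(tree, ["g1", "g2", "g3"]))
```

The second subtree is rejected because genome g3 contributes two genes, which is the kind of clade that would score below 100 on evenness.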
The Good
murD UDP-N-acetylmuramoyl-L-alanyl-D-glutamate synthetase
The Bad
serS seryl-tRNA synthetase
HMMER3 hmmbuild into 155 profiles
Search against the actinobacterial genomes, only keep the following HMMs:
Scenario 1:
Seed hits E-value <= 1e-20, non-seed hits E-value > 1e-3
Scenario 2:
Best non-seed hit E-value (En) <= 1e-3
Worst seed hit E-value <= 1e-17*En
Worst seed bit-score is more than twice the non-seed bit-score
[Figure: extreme value distribution, with the best non-seed hit and the worst seed hit marked]
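The two keep-or-discard scenarios for an HMM profile can be written as one small predicate. This is a sketch under assumptions: hits are given as hypothetical `(evalue, bit_score)` pairs, and the bit-score comparison in scenario 2 is taken against the best-scoring non-seed hit.

```python
def keep_profile(seed_hits, non_seed_hits):
    """Decide whether an HMM separates its seeds from all other hits.

    Both arguments are lists of (evalue, bit_score) tuples.
    """
    worst_seed_e = max(e for e, _ in seed_hits)
    worst_seed_bits = min(b for _, b in seed_hits)
    best_other_e = min((e for e, _ in non_seed_hits), default=float("inf"))
    best_other_bits = max((b for _, b in non_seed_hits), default=0.0)

    if best_other_e > 1e-3:
        # Scenario 1: no significant non-seed hit, all seeds must be strong.
        return worst_seed_e <= 1e-20
    # Scenario 2: a non-seed hit is significant, so demand a wide gap.
    return (worst_seed_e <= 1e-17 * best_other_e
            and worst_seed_bits > 2 * best_other_bits)

# Hypothetical numbers, not taken from the slides.
clean = keep_profile([(1e-50, 170.0), (1e-40, 140.0)], [(1e-2, 12.0)])
gap = keep_profile([(1e-60, 200.0), (1e-45, 150.0)], [(1e-6, 25.0)])
crowded = keep_profile([(1e-57, 191.5), (1e-20, 73.0)], [(1e-6, 25.0)])
print(clean, gap, crowded)
```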
ISSUE TWO:
Are there any marker families torn apart in the clustering process?
BLASTP links
Exclude
(1) Marker candidates
(2) Large MCL family members (>=1000/family)
Single linkage clustering
Single Linkage Families
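Single linkage clustering over BLASTP links amounts to finding connected components, which a union-find structure does compactly. The sketch below assumes the excluded genes (marker candidates and members of large MCL families) have already been removed from `items` and `links`; both names are illustrative.

```python
def single_linkage(items, links):
    """Group items into families: any link joins two items into one family."""
    parent = {x: x for x in items}

    def find(x):
        # Follow parents to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for x in items:
        groups.setdefault(find(x), []).append(x)
    return sorted(sorted(g) for g in groups.values())

items = ["a", "b", "c", "d", "e", "f"]
links = [("a", "b"), ("b", "c"), ("d", "e")]
print(single_linkage(items, links))
```

Unlike MCL, a single link is enough to merge two groups, which is why this step can reunite a family that the earlier clustering split.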
ISSUE THREE: Misplaced deep branches
lepA GTP-binding protein
136 actinobacterial markers from 63 Actinobacterial genomes
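[Figure: breakdown of the 136 markers by source (original MCL clusters; tree-based picking from MCL clusters; single-linkage clusters; tree topology correction) and by copy pattern (one copy/genome; one deletion in one genome; one duplication in one genome). Values shown in the figure: 2, 9, 22, 18, 32, 93, 96.]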
Select completed genomes from IMG for the following groups
Archaea
Actinobacteria
Alphaproteobacteria
Bacteroidetes
Betaproteobacteria
Chlamydiae
Gammaproteobacteria
Chloroflexi
Deltaproteobacteria
Cyanobacteria
Epsilonproteobacteria
Firmicutes
Spirochaetes
Thermi
Thermotogae
Use wget to get the sequences from the website
Gene marker identification pipeline
(BLASTP,MCL clustering, tree building, clade evaluation)
Screen gene markers for any given taxonomic group
Phylogenetic group      Genome Number   Gene Number   Marker Candidates
Archaea                 62              145415        106
Actinobacteria          63              267783        136
Alphaproteobacteria     94              347287        121
Betaproteobacteria      56              266362        311
Gammaproteobacteria     126             483632        118
Deltaproteobacteria     25              102115        206
Epsilonproteobacteria   18              33416         455
Bacteroidetes           25              71531         286
Chlamydiae              13              13823         560
Chloroflexi             10              33577         323
Cyanobacteria           36              124080        590
Firmicutes              106             312309        87
Spirochaetes            18              38832         176
Thermi                  5               14160         974
Thermotogae             9               17037         684
Cluster HMM profiles
Hmm Profile
One Consensus Sequence
Consensus Sequences for all HMMs
(5133 lineage specific families + 56 Bacterial marker families)
All vs All BLASTP (E value cutoff = 1e-3)
Single Linkage Clustering -> 570 clusters (size >= 2) -> build 404 trees
Example of a tree
(SIN323: carbamoyl-phosphate synthase)
Sampling and Analyzing the Tree Automatically
Split the tree one edge at a time
Evaluate evenness
If evenness = 100 and single copied for each group
HMM profile building and search against all the consensus sequences
Are the seed peptides distinct? What cutoff to use?
A sample of HMM search output (the seeds are marked red)
Hit         HMM profile                        E-value   Score
THERMI894   RELAX5_SIN270.tre.ID542.faa.trim   3.8e-57   191.5
THERMO479   RELAX5_SIN270.tre.ID542.faa.trim   4.8e-56   187.9
EPSI254     RELAX5_SIN270.tre.ID542.faa.trim   7.1e-56   187.3
ALPHA58     RELAX5_SIN270.tre.ID542.faa.trim   4.2e-55   184.8
CHLOFL232   RELAX5_SIN270.tre.ID542.faa.trim   8.5e-52   174.0
BARIO164    RELAX5_SIN270.tre.ID542.faa.trim   4.5e-51   171.7
CHLAM424    RELAX5_SIN270.tre.ID542.faa.trim   1.3e-50   170.1
GAMMA93     RELAX5_SIN270.tre.ID542.faa.trim   1.2e-43   147.4
CYANO551    RELAX5_SIN270.tre.ID542.faa.trim   1.4e-42   144.0
ARCH63      RELAX5_SIN270.tre.ID542.faa.trim   7.2e-21   73.3
CYANO18     RELAX5_SIN270.tre.ID542.faa.trim   2.8e-06   25.7
THERMO26    RELAX5_SIN270.tre.ID542.faa.trim   3.1e-06   25.6
EPSI354     RELAX5_SIN270.tre.ID542.faa.trim   6.5e-05   21.3
Exponential curve fitting to identify the hmmsearch E-value cutoff
$$\lg\left(\frac{E_{seed\_cutoff}}{E_{top\_non\_seed}}\right) = A \, e^{B \cdot \lg(E_{top\_non\_seed})}$$
[Plot: lg(Eseed_cutoff/Etop_non_seed) against lg(Etop_non_seed)]
B=
A=e
Position 1: [x1=lg(1e-3) y1=lg(1e-15/1e-3)]
Position 2: [x2=lg(1e-250) y2=lg(1e-1000/1e-250)]
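As a worked example, the two anchor positions are enough to solve y = A * e^(B * x) for A and B. The sketch assumes lg means log base 10 and that the curve passes exactly through both positions; the function names are illustrative, and the resulting constants are computed here rather than read from the slide.

```python
import math

def fit_exponential(p1, p2):
    """Solve y = A * exp(B * x) through two points with same-sign y values."""
    (x1, y1), (x2, y2) = p1, p2
    b = math.log(y2 / y1) / (x2 - x1)
    a = y1 / math.exp(b * x1)
    return a, b

def seed_cutoff_lg(lg_top_non_seed, a, b):
    """lg of the seed E-value cutoff for a given lg of the top non-seed E-value."""
    return lg_top_non_seed + a * math.exp(b * lg_top_non_seed)

# Position 1: x = lg(1e-3) = -3,     y = lg(1e-15 / 1e-3)     = -12
# Position 2: x = lg(1e-250) = -250, y = lg(1e-1000 / 1e-250) = -750
a, b = fit_exponential((-3.0, -12.0), (-250.0, -750.0))
print(round(a, 4), round(b, 6))
print(seed_cutoff_lg(-3.0, a, b))
```

With these constants a top non-seed hit at 1e-3 gives a seed cutoff of 1e-15, and the required gap widens as the best non-seed hit becomes more significant.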
Get all the potential groups that can be marker candidates
A.One consensus sequence from one phylogenetic group in one clade
B.The sequences are distinct from other sequences
Overlap problems and solutions
Example of a tree
(SIN323: carbamoyl-phosphate synthase)
684 families that span multiple taxonomic groups
[Plot: family number against family size, shown as a cumulative distribution and a simple distribution]
383 families that span >=4 taxonomic groups
We have 382 clades (including whole trees) that are potential marker families that
each spans at least four different taxonomic groups
Example: Family 00001 (ribosomal protein S4)
[Figure: each taxonomic group marked as Included or Not included: Archaea, Betaproteobacteria, Alphaproteobacteria, Deltaproteobacteria, Gammaproteobacteria, Actinobacteria, Epsilonproteobacteria, Firmicutes, Bacteroidetes, Spirochaetes, Chlamydiae, Chloroflexi, Cyanobacteria, Thermi, Thermotogae]
Search a group of genomes using a distant HMM profile
A group of HMM profiles
Combine the seeds and build a new profile HMM
Search one group that is missing from the HMM list
Get a large number of hits from the top (2 x genome
number) and mark the very top hits
Search a group of genomes using an HMM profile of insiders
Tree building, and evaluate the clades:
(1)Must include the very top hits
(2)HMM building from the clades to estimate
uniqueness
*Manual examinations are required in some cases
All the peptides for a given family
MUSCLE
Alignment
Hmm Profile
Hmm search against all
the complete genome
database
Look through the hmm search results and determine
if the hmm can distinguish family members from others
Tree building
Alignment
ZORRO mask
Use 0.1 as the first round ZORRO cutoff
Trim the alignments and calculate the second ZORRO mask score
Build PHYML trees for all the families (alignments trimmed by the second ZORRO mask)
Ribosomal protein S4 PHYML tree (MF00001)
Monophyletic Analysis
A list of taxa that is assumed to be monophyletic can end up divided into separate clades in a tree
A monophyly value is designed to estimate quantitatively whether a given list of taxa is monophyletic
Shannon entropy measures uncertainty in a dataset
All taxa from a phylum form a monophyletic clade: 100% in one clade, uncertainty -> 0
All taxa from a phylum spread into N clades: proportions p1, p2, p3, with p1+p2+p3=1
Shannon entropy calculation:
$$H = -\sum_i p_i \log p_i$$
Uncertainty increases if
(1) the number of clades increases
(2) evenness increases
Calculate Shannon entropy for 100 taxa distributed in N bins (N=2..10)
(repeat the calculation for 10,000 random simulations for each N)
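The simulation described above can be sketched as follows. Two assumptions are made for the example: entropy uses the natural logarithm, and each taxon is assigned to one of the N bins uniformly at random. The slides do not say how the random distributions were drawn.

```python
import math
import random

def shannon_entropy(counts):
    """H = -sum(p * ln p) over the non-empty bins."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def simulate(n_taxa=100, n_bins=3, repeats=10000, seed=0):
    """Entropy values for random assignments of taxa to bins."""
    rng = random.Random(seed)
    values = []
    for _ in range(repeats):
        counts = [0] * n_bins
        for _ in range(n_taxa):
            counts[rng.randrange(n_bins)] += 1
        values.append(shannon_entropy(counts))
    return values

for n in range(2, 11):
    h = simulate(n_bins=n, repeats=200)
    print(n, round(sum(h) / len(h), 3))
```

A single clade gives an entropy of zero, and the upper bound for N equally filled clades is ln(N), so the simulated values show how entropy grows with the number of clades.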
[Plot: Shannon entropy H by sample number]
Monophyletic Value = 100 x
Shannon Entropy
[Plot: Monophyly Value against Shannon Entropy]
05/04/10
ribosomal protein PHYML tree (MF00001)
161 families are kept
For at least 4 taxonomic groups
Universality * Evenness * monophyly >= 90*90*90
PMPROK00023: ribosome recycling factor
LIST        UNIVERSALITY   EVENNESS   MONOPHYLY
ARCH        NA             NA         NA
BACT        99.67          98.68      NA
ACTINO      100.00         100.00     78.84
BARIO       100.00         100.00     59.78
CHLAM       100.00         100.00     100.00
CHLOFL      100.00         100.00     60.37
CYANO       100.00         81.04      100.00
FIRM        99.06          100.00     85.98
SPIRO       100.00         100.00     53.75
THERMI      100.00         100.00     100.00
THERMO      100.00         100.00     100.00
PROTEO      99.69          100.00     44.61
ALPHA       100.00         100.00     63.18
BETAGAMMA   99.45          100.00     97.47
BETA        98.21          100.00     100.00
GAMMA       100.00         100.00     79.67
DELTA       100.00         100.00     88.17
EPSI        100.00         100.00     100.00
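The family filter stated before this record (product of the three scores at least 90*90*90, in at least 4 taxonomic groups) can be sketched as below. The treatment of NA as a failed group and the counting of every list as its own group, including nested ones, are assumptions; the values are a subset of the PMPROK00023 record.

```python
def passes(universality, evenness, monophyly, threshold=90.0):
    """True if the product of the three scores reaches threshold cubed."""
    if None in (universality, evenness, monophyly):
        return False  # assumption: a group with an NA score does not count
    return universality * evenness * monophyly >= threshold ** 3

def keep_family(records, min_groups=4):
    """Keep a family when enough taxonomic groups pass the product test."""
    return sum(passes(*scores) for scores in records.values()) >= min_groups

# (universality, evenness, monophyly) per list; None stands for NA.
records = {
    "BACT": (99.67, 98.68, None),
    "ACTINO": (100.0, 100.0, 78.84),
    "BARIO": (100.0, 100.0, 59.78),
    "CHLAM": (100.0, 100.0, 100.0),
    "CYANO": (100.0, 81.04, 100.0),
    "FIRM": (99.06, 100.0, 85.98),
    "SPIRO": (100.0, 100.0, 53.75),
}
print(keep_family(records))
```

Because the test is on the product, a group such as CYANO can pass with an evenness of 81.04 as long as the other two scores are high.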
What is next:
1. Search IMG again to update the seqs and accessions
2. Develop CGI scripts to retrieve user defined markers
(calculate universality, evenness and monophyly on the fly)