Transcript lg 127
Phylogenomics
reconstruction of phyletic relationships based on the analysis of:
Phylogenetics
- several (to several dozen) genes
Phylogenomics
- complete genetic information (ideal)
- several dozen to hundreds of coding sequences (phylotranscriptomics)
Why?
a vast amount of genetic information should significantly improve the prediction of phylogenetic relationships and eliminate signal noise
... and sometimes it really works
Adl et al., 2012
... but sometimes it doesn’t
possible source of error: - incorrect sequence annotation
possible source of error: - paralogues
[Figure: phylogenetic tree; tip labels are species names with sequence accessions; two groups of sequences are labelled OCT and ATC]
possible source of error: - sins of the past
[Figure: schematic with the labels EGT, “LEUCA”, ANIMALS/FUNGI, PLANTS, RHODOPHYTES, ...]
18(16)S rRNA
+
-
- combination of variable and
conserved regions
- zero L/HGT
- exhaustive taxon sampling
- known secondary structure
- hundreds of copies per cell (allows single-cell PCR)
- cost per nt + speed
- ‘18S is always right’
- ~1800bp
- intraindividual paralogues
- lower branching support
MULTI-PROTEIN DATASETs
+
-
- large amount of information
- modular
- robust branching support (although often false)
- limited sampling
- variable quality of phylogenetic signal
- L/H(E)GT
- still costly and slow
- HW-demanding analysis
- stability of topologies (or lack thereof)
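DATABASE → PURIFIED DATABASE → HOMOLOGUES → DATASETS → n × (MSA → SGP) → CONCATENATION → MGP
(MSA: multiple sequence alignment; SGP: single-gene phylogeny; MGP: multi-gene phylogeny)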
- lots of redundancy in databases (duplicates, close paralogues...)
- usually it is better to get rid of them
sequence clustering: CD-HIT, USEARCH
pros (+): speed, relatively HW-friendly, accuracy
cons (-): accuracy, black-box
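To make "getting rid of redundancy" a bit more concrete, here is a minimal sketch that drops exact duplicates and sequences fully contained in a longer one. It is not how CD-HIT or USEARCH work internally (they cluster by a sequence-identity threshold); the function name `remove_redundant` and the toy records are made up for this illustration.

```python
def remove_redundant(records):
    """Drop exact duplicates and sequences contained in a longer kept sequence.

    records: dict mapping header -> sequence (hypothetical toy input).
    Returns a new dict holding only the longest representatives.
    """
    kept = {}
    # Process the longest sequences first, so that fragments are compared
    # against full-length representatives that were already kept.
    ordered = sorted(records.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for header, seq in ordered:
        if any(seq in representative for representative in kept.values()):
            continue  # identical to, or a fragment of, a kept sequence
        kept[header] = seq
    return kept


toy_db = {
    "seqA": "MKLVYTVASAFLVVLIAQSAYA",
    "seqA_copy": "MKLVYTVASAFLVVLIAQSAYA",
    "seqA_fragment": "VASAFLVVLI",
    "seqB": "MKNRIVVALGGNALGNSAKEQ",
}
purified = remove_redundant(toy_db)
print(sorted(purified))
```

For real databases with thousands of sequences, a dedicated clustering tool is the practical choice; the all-against-all substring test above is only meant to show the idea.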
DB editing
FASTA – universal and simple, but not unified
NCBI:
>gi|269120277|ref|YP_003308454.1| carbamate kinase [Sebaldella termitidis ATCC 33386]
MKNRIVVALGGNALGNSAKEQRDAVRETAIPIVDLIEAGHEVILAHGNGPQVGMINLAMDSATKNLPSFAEMPITECVAM
SQGYIGYHLQRFIRDELKRRNIDKEVATIVTEVLVDGDDPAFKSPNKPIGAFYTKEEAEKLEKQGYTMMEDAGRGYRRVV
ASPKPVDIVQKKTIKTLIDNSQIVITVGGGGIPVKYVEGKGTLGEFAVIDKDFASAKLAELIDADYLIILTAVEKIAINY
GKENEQWLDKLSIDDAKKYIKEGHFAPGSMLPKVEAALGFAASKQGRRALVTSLEKAKDGIAGLTGTVIVDEK
JGI:
>jgi|Dappu1|290510|JCO_fgenesh1_kg.C_scaffold_4000019
MKLVYTVASAFLVVLIAQSAYASEKLSAQDYAYNSTCLNHLRSHIKRELQAAVTYLAMGAWANHYSVQRP
GLANFFFDSASEEREHGLKLLGYLRMRGHNDLDILPSSLEPLNGKYEWENSLSALRQALKMEKDVTESIK
KIIDYCADAEDHQLADYLTGDFMEEQLKGQRNVAGLANTLQGVLRKQPRLGEWIFDNNLSKSMAV
manual editing is fine for several sequences, but for several thousand?
GBs of RAM
robust OS and text editor
!Regular expressions!
>gi|269120277|ref|YP_003308454.1| carbamate kinase [Sebaldella termitidis ATCC 33386]
MKNRIVVALGGNALGNSAKEQRDAVRETAIPIVDLIEAGHEVILAHGNGPQVGMINLAMDSATKNLPSFAEMPITECVAM
SQGYIGYHLQRFIRDELKRRNIDKEVATIVTEVLVDGDDPAFKSPNKPIGAFYTKEEAEKLEKQGYTMMEDAGRGYRRVV
ASPKPVDIVQKKTIKTLIDNSQIVITVGGGGIPVKYVEGKGTLGEFAVIDKDFASAKLAELIDADYLIILTAVEKIAINY
GKENEQWLDKLSIDDAKKYIKEGHFAPGSMLPKVEAALGFAASKQGRRALVTSLEKAKDGIAGLTGTVIVDEK
Find: >\w+\|\d+\|\w+\|(\w+).*\[(\w+\s\w+).*
Replace: >\2_\1
>Sebaldella termitidis_YP_003308454
MKNRIVVALGGNALGNSAKEQRDAVRETAIPIVDLIEAGHEVILAHGNGPQVGMINLAMDSATKNLPSFAEMPITECVAM
SQGYIGYHLQRFIRDELKRRNIDKEVATIVTEVLVDGDDPAFKSPNKPIGAFYTKEEAEKLEKQGYTMMEDAGRGYRRVV
ASPKPVDIVQKKTIKTLIDNSQIVITVGGGGIPVKYVEGKGTLGEFAVIDKDFASAKLAELIDADYLIILTAVEKIAINY
GKENEQWLDKLSIDDAKKYIKEGHFAPGSMLPKVEAALGFAASKQGRRALVTSLEKAKDGIAGLTGTVIVDEK
extremely powerful, easy to learn, fun to use:
!Regular expressions!
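The same find/replace can be scripted, which helps once the file has thousands of headers. Below is a small sketch using Python's standard `re` module with the pattern from the slide; the function name `rename_ncbi_header` is just an illustrative choice, and the pattern only handles NCBI-style headers like the one shown above (JGI headers would need their own pattern).

```python
import re

# Same pattern as in the Find box above: group 1 is the accession
# (without the version suffix), group 2 is "Genus species".
NCBI_HEADER = re.compile(r">\w+\|\d+\|\w+\|(\w+).*\[(\w+\s\w+).*")


def rename_ncbi_header(line):
    """Rewrite an NCBI-style FASTA header as '>Genus species_ACCESSION'.

    Lines that do not match (e.g. sequence lines) are returned unchanged.
    """
    return NCBI_HEADER.sub(r">\2_\1", line)


header = ">gi|269120277|ref|YP_003308454.1| carbamate kinase [Sebaldella termitidis ATCC 33386]"
print(rename_ncbi_header(header))
```

Applied line by line to a FASTA file, this renames the headers and leaves the sequence lines alone.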
BLAST vs annotation
BLAST (plus relatives) is the only reliable way to identify homologues; do not rely on annotation!
the more the better
beware of close paralogues! Meticulous SGP necessary
commercial vs. free
- both (shiny GUI/command-line scripts) will get you there relatively quickly and easily, but... beware of possible errors; there is no universal solution
Multiple alignment
- an important and necessary step in the identification and definition of DNA or protein domains, oligonucleotide design, phylogenetic analyses...
- most modern algorithms are iterative (they can self-improve during the iterations) and work reasonably well (really, don’t use Clustal unless you really have to); some of the most used are MAFFT, MUSCLE, Kalign and ProbCons (none of them miraculous, each makes mistakes, but it’s not that bad)
- all of the above are accessible online (follow the hyperlinks) or can be run locally... nevertheless, you’ll have to use an alignment viewer/editor to visualize the results
- several free options (depending on what OS you use):
MS Windows: BioEdit – ‘the living legend’, extensive features, user-friendly, can import from GenBank, align (also translation alignment, although with ), edit, annotate, translate, do phylogeny...
Mac: MacClade – great editing features and then some more, user-friendly, but doesn’t align, nor does it do phylogenies; currently works only up to OS X 10.6, not (Mountain) Lion
Multi-platform: MEGA – good for alignment, phylogenetic and molecular evolution analyses
Jalview – excellent for proteomics, passable alignment editor
SeaView – great aligner/editor (although it takes time to get used to it), excellent features for phylogenetics (inclusion sets, translation alignment); there’s no UNDO button!
... and then again, if you have access to/can afford Geneious (student licenses are cheap), you can skip everything listed above
Editing
- remember: the tree is only as good as the alignment; crap-in-crap-out!
- the goal is to keep only unambiguously aligned regions and relevant OTUs (remove duplicates or long-branchers)
site selection: AUTOMATED vs. MANUAL
automated: good as a starting point, reproducible, ‘objective’, transparent, but... crude
manual: subjective, often non-reproducible, needs ‘expertise’, but... better (usually), can be fine-tuned to each dataset
Example – SeaView
open the dataset (in this case apicomplexa_ssu1.fas) and align it... you already know how, right?
Some regions are conserved (i.e., not very divergent), so there’s little doubt about the correctness of the alignment. They should be kept for analysis as they carry vital information.
On the other hand, some regions are quite variable and could be aligned in several ways. Because we cannot be sure the information they contain is correct, we should exclude them prior to analysis in order not to introduce error (remember, crap-in-crap-out).
In some situations, especially when you’re new to the problem, it is not so clear which parts of the alignment should be kept and which excluded from analysis. Gblocks (or similar SW) can help you. Luckily, it is also implemented in SeaView:
- as it tends to remove too much, let’s keep the parameters the least strict
- regions with X are kept, those with dashes are excluded from the selection
- you can edit the selection afterwards and save it using Files-Save selection
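To give a feel for what automated site selection does, here is a deliberately crude sketch that keeps a column only if it has few gaps and a clear majority residue, and prints the result as an X/dash mask like the one described above. This is not the Gblocks algorithm (which also looks at blocks of neighbouring positions); the function names and both thresholds are assumptions chosen for illustration.

```python
def select_columns(alignment, max_gap_fraction=0.5, min_majority=0.5):
    """Return indices of alignment columns that pass two crude filters.

    alignment: list of equal-length aligned sequences ('-' marks a gap).
    A column is kept if its gap fraction is at most max_gap_fraction and
    its most common residue makes up at least min_majority of the
    non-gap residues.
    """
    n_seqs = len(alignment)
    kept = []
    for i in range(len(alignment[0])):
        residues = [seq[i] for seq in alignment if seq[i] != "-"]
        gap_fraction = 1 - len(residues) / n_seqs
        if not residues or gap_fraction > max_gap_fraction:
            continue
        top_count = max(residues.count(r) for r in set(residues))
        if top_count / len(residues) >= min_majority:
            kept.append(i)
    return kept


def mask_line(n_columns, kept):
    """Build a mask line: 'X' for kept columns, '-' for excluded ones."""
    kept = set(kept)
    return "".join("X" if i in kept else "-" for i in range(n_columns))


toy = [
    "ACGT-AGT",
    "ACGTTCGT",
    "ACG--TGT",
    "ACGA-GGT",
]
kept = select_columns(toy)
print(mask_line(len(toy[0]), kept))
```

In the toy alignment, the gappy fifth column and the fully variable sixth column are the ones excluded; everything else is kept.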
- you can also directly perform phylogenetic analysis by clicking on Trees
- you can choose from three different methods; PhyML represents Maximum likelihood
- for publication, you will also have to assess branching support
- and you may want to use a more thorough tree-search algorithm (check ‘Best of NNI and SPR’)
- the default settings are a reasonable compromise between speed and precision, so you can leave them on
- then hit Run... and wait... the time depends on the method and the size of the dataset (obviously, the bigger the longer)
SeaView also has a very decent tree viewer/editor. Terms labelled in the figure:
- node: represents the hypothetical ancestor of all taxa/branches stemming from it; also defines a clade
- branch
- scale bar: substitutions per site; the longer the branch, the more divergent the sequence
- clade: group of sequences sharing a common ancestor/stemming from a single node
- sister taxa: two taxa forming a clade
- sister clades
- INGROUP
- OUTGROUP (root)
you can also create several subsets of the alignment (inclusion sets) by clicking Sites-Create set and giving the set a name
parts of sequences above X (highlighted) are included in the selection. You can select the sites by a combination of right- and left-clicks (a left-click unselects single sites, a right-click removes the selection between two unselected regions, a single left-click selects a single site, and by holding the left button and moving the mouse you can re-select whole regions). I know, it sounds awkward... TRY TO PRACTICE IT!
you can then duplicate and rename the set to create different inclusion sets, and save just the selection, not the whole alignment. This feature can be extremely useful in phylogenies and sets SeaView apart from other alignment editors (we will get to it next time)
Coding sequences should be aligned in ‘translation’ mode: temporarily translated into amino acids, aligned as amino acids, and back-translated into nucleotides, keeping the alignment positions
- in SeaView click Props-View as proteins
- align the sequences
- uncheck View as proteins
- now the sequences are aligned according to the ORF
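The back-translation step can be sketched in a few lines: every aligned residue takes the next codon from the original coding sequence, and every gap becomes three gap characters. This is only an illustration of the idea, not what SeaView runs internally; the function name `backtranslate` and the toy sequences are made up, and the sketch assumes the coding sequence is in frame and matches the protein residue by residue.

```python
def backtranslate(protein_aln, cds):
    """Thread an unaligned coding sequence onto its aligned protein.

    protein_aln: aligned amino-acid sequence with '-' for gaps.
    cds: the matching unaligned nucleotide sequence (3 nt per residue).
    Each residue takes the next codon; each gap becomes '---'.
    """
    n_residues = sum(1 for aa in protein_aln if aa != "-")
    if len(cds) < 3 * n_residues:
        raise ValueError("coding sequence is too short for this protein")
    codons = (cds[i:i + 3] for i in range(0, 3 * n_residues, 3))
    return "".join("---" if aa == "-" else next(codons) for aa in protein_aln)


protein_alignment = {"taxon1": "MK-L", "taxon2": "MKNL"}
coding = {"taxon1": "ATGAAACTG", "taxon2": "ATGAAGAATTTA"}
codon_alignment = {
    name: backtranslate(protein_alignment[name], coding[name])
    for name in protein_alignment
}
for name, seq in codon_alignment.items():
    print(name, seq)
```

Because gaps are always inserted in multiples of three, the reading frame is preserved across the whole alignment.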
Phylogenetic inference
You don’t have to use state-of-the-art phylogenetic methods for the initial analyses, whose purpose is to (quickly) identify redundancy (duplicates and very similar sequences), aberrant and very divergent sequences, or the need to extend the dataset (quite often you realize you should have added some other taxa). For that, a simple neighbour-joining tree based on the JC, K2P or HKY model, or a stripped-down maximum likelihood run (without gamma categories and branching support), will suffice and do the job quickly even on older computers.
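For orientation, this is roughly what the two simplest distance corrections compute for a pair of aligned nucleotide sequences. The sketch uses the standard textbook formulas (JC: d = -3/4 ln(1 - 4p/3); K2P: d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)), with P and Q the proportions of transitions and transversions); the function name `jc69_k2p` is made up, and real programs handle gaps, ambiguity codes and saturated pairs with more care.

```python
import math

PURINES = {"A", "G"}


def jc69_k2p(seq1, seq2):
    """Return (JC, K2P) distances for two aligned nucleotide sequences.

    Sites with a gap or an ambiguous base in either sequence are skipped.
    math.log raises ValueError for saturated pairs (too many differences).
    """
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):
            transitions += 1    # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    P = transitions / compared
    Q = transversions / compared
    p = P + Q  # raw proportion of differing sites
    jc = -0.75 * math.log(1 - 4 * p / 3)
    k2p = -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))
    return jc, k2p


jc, k2p = jc69_k2p("ACGTACGTACGTACGTACGT", "GCGTACGTACGTACGTACGC")
print(round(jc, 4), round(k2p, 4))
```

Both corrected distances come out slightly larger than the raw proportion of differences (0.1 in this toy pair), because they account for multiple substitutions at the same site.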
On the other hand, for publication (or if you want to be sure), once you’ve polished your dataset, you should use the best (possible) methods. That usually means maximum likelihood with a gamma-corrected GTR (nucleotides) or LG or WAG (amino acids) substitution matrix (or model of evolution, if you wish... these matrices tell the computer how probable a change from one state to another is). But it all depends on the dataset... if the sequences are similar and/or there are just a few of them, it may be preferable to use simpler matrices/models. There are also models dedicated to organellar genomes and/or specific taxonomic groups (like mtArt, which is tailored for the analysis of mitochondrial genes of arthropods). Some programs can tell you which model suits your dataset best (for example jModelTest for nucleotides and ProtTest for proteins, the latter also available as a server).
The credibility of the topology should be ‘tested’ using (non-parametric) bootstrap analysis, during which the software creates pseudoreplicates by randomly resampling alignment columns with replacement (all taxa are included) and infers the topology from these pseudoreplicates instead of the original dataset. For publication, 100 replicates are a bare minimum; the reviewer will probably require 300 or more, though. If the analysis is meant just for you (or your boss), 100 is quite enough (in my opinion); alternatively, you can use an even faster method called the ‘approximate likelihood-ratio test’ (aLRT, implemented in some software).
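The resampling step of the bootstrap is simple enough to sketch: columns of the alignment are drawn at random with replacement until the replicate is as long as the original, and every taxon stays in. The tree inference on each replicate is left out here (that is the job of PhyML, RAxML and the like); the function names and the toy alignment are made up for illustration.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement; all taxa are kept.

    alignment: dict {taxon: aligned sequence}, all of equal length.
    rng: a random.Random instance (passed in so runs can be repeated).
    """
    n_sites = len(next(iter(alignment.values())))
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}


def bootstrap_support(n_replicates_with_clade, n_replicates):
    """Support for a clade: percentage of replicate trees that contain it."""
    return 100.0 * n_replicates_with_clade / n_replicates


alignment = {"taxonA": "ACGTACGT", "taxonB": "ACGTTCGT", "taxonC": "ACGAACGA"}
rng = random.Random(1)  # fixed seed so the sketch is repeatable
replicates = [bootstrap_replicate(alignment, rng) for _ in range(100)]
print(len(replicates), len(replicates[0]["taxonA"]))
```

A tree is then inferred from each replicate, and the support of a branch is the percentage of those trees in which it appears, which is what `bootstrap_support` expresses.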
Nowadays, most reviewers/editors will also require another type of phylogenetic analysis called Bayesian inference. Here you use the same (or similar) models, but the method of topology search is totally different; also, the branching support is expressed as a posterior probability (ranging from 0 to 1) instead of bootstrap values. Be careful with the interpretation of these two values. In bootstrap, everything higher than 50 (meaning the branching appeared in at least 50% of the replicates) is considered to be supported (although weakly); the closer you get to 100, the more confident you can be in the branching. OTOH, a posterior probability below 0.95 (some go to 0.90) should be considered unsupported! Only nodes with a PP value of 1.0 (or 0.99) are considered to be strongly supported.
Phylogenetic inference - software
A surprising amount of software is available (given the obscurity of the topic; an almost exhaustive list is to be found here), but most programs are either too specialized, slow, obsolete or not worth using for other reasons. Unfortunately, most (like 99%) are command-line based without any user-friendly graphical interface. But some of the good/passable ones are implemented in SW with a GUI (like SeaView or Geneious) or at least have a server version.
So, here is a short list of some recommended phylogenetic software:
Ambiguous regions detection/removal: several SW, but nothing exciting; try Aliscore or Gblocks (server)
Distance methods: PAUP (commercial), Phylip, BioNJ
Maximum Parsimony: PAUP (commercial), Phylip
Maximum likelihood: RAxML (server), PhyML (server), FastTree (REALLY fast, great for preliminary analyses), GARLI
Bayesian inference: MrBayes, PhyloBayes
Tree Viewer/Editor: NJplot (improved version also implemented in SeaView), FigTree, TreeView
this list is far from exhaustive, but the SW noted above should fit a general audience (like you) in terms of purpose and performance.
meticulous analysis of SGP is necessary!!!
you could also use the automated approach (Phylosorter), but the risk of error is quite significant and the parameters should be as strict as possible
‘clean’ datasets can be merged (concatenated) into a supermatrix
Scafos, phyutility, SeaView, MacClade, BioEdit?...
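Concatenation itself is mostly bookkeeping: the per-gene alignments are joined taxon by taxon, and a taxon missing from a gene is padded with missing-data characters so that all rows keep the same length. The sketch below shows that idea only; it is not how any of the listed programs is implemented, and the function name `concatenate` and the toy genes are made up.

```python
def concatenate(gene_alignments, missing="-"):
    """Join per-gene alignments into one supermatrix.

    gene_alignments: list of dicts {taxon: aligned sequence}, one per gene.
    Taxa absent from a gene are padded with the `missing` character.
    Returns (supermatrix, partitions); partitions are 1-based, inclusive
    (start, end) column ranges, one per gene.
    """
    taxa = sorted({taxon for gene in gene_alignments for taxon in gene})
    rows = {taxon: [] for taxon in taxa}
    partitions = []
    start = 1
    for gene in gene_alignments:
        lengths = {len(seq) for seq in gene.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within a gene must have equal length")
        length = lengths.pop()
        for taxon in taxa:
            rows[taxon].append(gene.get(taxon, missing * length))
        partitions.append((start, start + length - 1))
        start += length
    return {taxon: "".join(parts) for taxon, parts in rows.items()}, partitions


gene1 = {"A": "MKL", "B": "MRL"}
gene2 = {"A": "GGTA", "C": "GATA"}
supermatrix, partitions = concatenate([gene1, gene2])
for taxon, row in supermatrix.items():
    print(taxon, row)
print(partitions)
```

The partition ranges are worth keeping, because they allow each gene to be given its own model later on.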
Multi-Gene Phylogenies
- both SW and HW demanding
- due to the amount of data, the most complex models are necessary; prone to errors and time-consuming
+ SHOULD produce robust results
phylogenetic artifacts
why?
- poor taxon sampling
- too weak/strong phylogenetic signal
- violation of the model assumptions (different base composition, mutation rates...)
- inappropriate model used
Long-Branch Attraction (LBA)
- the most (in)famous and common artifact
- high evolutionary rates cause artificial grouping of long-branching taxa
Artifacts elimination
- adding more genes
same author, different datasets: 2008 - 135; 2009 - 127; 2012 - 258
- adding more taxa
  - poor taxon sampling is considered to be the most common reason
  - ideally, all taxa should be included
  - reasonably, all relevant and available taxa should be included
  - realistically, we have to work with the few available
- removal of problematic (fast-evolving) taxa
  - analysis of the dataset with different combinations of taxa and comparison of the resulting topologies
  - an efficient way to overcome LBA
- improving methodology
  - current HW and SW enable application of state-of-the-art models
  - LG4M, LG4X (RAxML)
  - CAT(+GTR): each position of the alignment has specific equilibrium and model parameters
  - covarion, non-homogeneous: each taxon has a specific rate of evolution
  - HW and time demanding!
- removal of fast-evolving genes
  - simple and fast way to reduce signal noise
  - for each gene, we compute the overall ML distance and remove the most divergent genes
  - TREE-PUZZLE, RAxML
- removal of fast-evolving sites
  - usually more efficient
  - each site of the alignment is assigned to a specific rate category (usually 8/16)
  - the highest category(ies) are removed
  - dependent on topology/model
  - TREE-PUZZLE, AIR-Remover
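Once each site has a rate category, the removal itself is a simple filter. The sketch below assumes the categories were already estimated by an external program and are given as one integer per column (1 = slowest); the function name `remove_fast_sites` and the toy data are made up, and this is not the interface of TREE-PUZZLE or any other listed tool.

```python
def remove_fast_sites(alignment, site_categories, n_categories=8, n_removed=1):
    """Drop alignment columns assigned to the fastest rate categories.

    alignment: dict {taxon: aligned sequence}.
    site_categories: one integer per column, 1..n_categories (higher = faster),
        estimated beforehand by an external program (an assumption here).
    n_removed: how many of the highest categories to discard.
    """
    n_sites = len(next(iter(alignment.values())))
    if len(site_categories) != n_sites:
        raise ValueError("need exactly one rate category per alignment column")
    threshold = n_categories - n_removed
    keep = [i for i, category in enumerate(site_categories) if category <= threshold]
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}


toy_alignment = {"t1": "ACGTAC", "t2": "ACGAAT"}
toy_categories = [1, 1, 2, 8, 3, 7]
trimmed_one = remove_fast_sites(toy_alignment, toy_categories, n_removed=1)
trimmed_two = remove_fast_sites(toy_alignment, toy_categories, n_removed=2)
print(trimmed_one, trimmed_two)
```

Comparing the trees obtained after removing one, two or more categories shows how much a grouping depends on the fastest sites.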
- recoding of aa
  - for datasets with a large proportion of saturated sites
  - amino acids are recoded according to their biochemical properties into four categories (Dayhoff matrix)
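Recoding is just a lookup table applied to every residue. The sketch below uses one four-category grouping that is often referred to as Dayhoff-4 (with cysteine treated as missing data); the grouping is quoted from memory and should be treated as an assumption, so check it against the scheme of the program you actually use. The names `DAYHOFF4_GROUPS`, `build_recoding_table` and `recode` are made up for illustration.

```python
# Assumed four-category grouping (label -> member amino acids).
# Cysteine is deliberately absent and ends up as missing data ('?').
DAYHOFF4_GROUPS = {
    "A": "AGPST",
    "D": "DENQ",
    "H": "HKR",
    "F": "FYWILMV",
}


def build_recoding_table(groups):
    """Turn {label: members} into a per-residue lookup {residue: label}."""
    return {aa: label for label, members in groups.items() for aa in members}


def recode(sequence, table, missing="?"):
    """Recode an aligned protein sequence; gaps are kept, unknowns become `missing`."""
    return "".join(c if c == "-" else table.get(c, missing) for c in sequence.upper())


table = build_recoding_table(DAYHOFF4_GROUPS)
print(recode("MKC-DEW", table))
```

After recoding, substitutions within a category become invisible, which is the point: only changes between biochemical classes are left to carry signal.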
- selection of genes with congruent signal
  - clever, but is it kosher? ... doesn’t work that well anyway
  - Concaterpillar
Artifacts elimination
- adding genes to MGP
- adding more taxa
- removal of problematic (fast-evolving) taxa
- improving methodology
- removal of fast-evolving genes
- removal of fast-evolving sites
- recoding of aa
- selection of genes with congruent signal
Phylogenomics is (not) surprisingly hard to publish; usually you have to combine at least a few of the above to satisfy the reviewers!
So... is it worth it when quite often you get the same topology as with SSU rRNA?