Assembly and Alignment-Free

ASSEMBLY AND ALIGNMENT-FREE
METHOD OF PHYLOGENY
RECONSTRUCTION FROM NGS DATA
Huan Fan, Anthony R. Ives, Yann Surget-Groba and
Charles H. Cannon
Traditional methods for building phylogeny
Requirements:
• High coverage
• Assembly
• Detection of putative orthologous genes
• Alignment
• Phylogeny from tiny portion of the whole genome
• Genome scale multi-sequence alignment is difficult
Alignment-free methods for building phylogeny
• Typically from assembled genomes
• De novo assembly with short reads?
• Mainly on closely related prokaryotic genomes
• No confidence assessment (e.g. bootstrapping)
Overview
• Assembly and Alignment-Free method (AAF)
• Calculate phylogenetic distances using whole genome short
read sequencing data
• Method validation
• Genome complexity
• Different genome sizes
• Sequencing errors
• Range of sequencing coverage
• 12 mammal species
• 21 tropical tree species
• Comparison with andi
AAF method
• Calculate the genetic distance between each pair of
samples from the number of evolutionary changes
between their genomes, represented by the number of
k-mers that differ between them.
• Phylogenetic relationships among the genomes are then
reconstructed from the pairwise distance matrix
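The first step above can be sketched in a few lines of Python. This is a minimal illustration, not the AAF implementation: the function names and toy sequences are invented here, and only the forward strand is considered (no reverse complements).

```python
from itertools import combinations


def kmer_set(sequence, k):
    """Return the set of distinct k-mers in a sequence (forward strand only)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def shared_kmer_table(genomes, k):
    """Map each pair of sample names to (n_shared, n_total_a, n_total_b)."""
    kmers = {name: kmer_set(seq, k) for name, seq in genomes.items()}
    table = {}
    for a, b in combinations(sorted(kmers), 2):
        table[(a, b)] = (len(kmers[a] & kmers[b]), len(kmers[a]), len(kmers[b]))
    return table


# Toy sequences, invented for illustration only.
toy = {
    "A": "ACGTACGGTCA",
    "B": "ACGTACGGTGA",
    "C": "ACGAACGGTGA",
}
counts = shared_kmer_table(toy, k=4)
```

The resulting shared and total counts per pair are the raw material for a pairwise distance matrix.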
AAF method - Evolutionary model
• The probability that no mutation will occur within a given
k-mer between species A and B is exp(−kd).
• If only substitutions occur and all k-mers are unique, then
all species have the same total number of k-mers, nt,
and the maximum likelihood estimate of exp(−kd) is
ns/nt.
• Mutations will decrease the number of shared k-mers, ns,
between species relative to the total number of k-mers, nt
• Insertion of length l: loss of (k − 1) or gain of (l + k − 1) k-mers
• Deletion of length l: loss of (l + k − 1) or gain of (k − 1) k-mers
• Greater effect
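Solving exp(−kd) = ns/nt for d gives d = −(1/k) ln(ns/nt). A minimal sketch of that calculation follows; the function name is hypothetical, and the substitution-only assumption from the slide (a single total nt shared by both species) is taken at face value.

```python
import math


def kmer_distance(n_shared, n_total, k):
    """Estimate d from exp(-k * d) = n_shared / n_total.

    Assumes only substitutions occurred, so both species have n_total k-mers.
    """
    if not 0 < n_shared <= n_total:
        raise ValueError("need 0 < n_shared <= n_total")
    # log(n_total / n_shared) equals -log(n_shared / n_total)
    return math.log(n_total / n_shared) / k


# Example with made-up counts: 6 of 8 k-mers shared at k = 4.
d = kmer_distance(6, 8, 4)
```

When indels are present the two totals differ, so a real implementation has to choose which total to use; that choice is not specified here.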
K-mer sensitivity and homoplasy
• No assembly -> not all indels identified
• If k-mer covers multiple substitutions
• Shorter k-mers -> better sensitivity
• Shorter k-mers -> same k-mers from evolutionarily
different regions
• Homoplasy
K-mer homoplasy
• k=15
• Genome size > 5 × 10^8 => same k-mers randomly in other
species
• May incorrectly inflate the proportion of shared k-mers
• The optimal k for phylogenetic reconstruction is the k
which is just large enough to greatly reduce k-mer
homoplasy for a given genome size
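To get a feel for how genome size and k interact, one can use a textbook random-sequence approximation. This is an assumption made for illustration, not necessarily the prediction used in the talk. With uniform base composition and independent positions, a given k-mer appears at least once in a genome of G positions with probability about 1 − exp(−G / 4^k).

```python
import math


def random_kmer_presence(genome_size, k):
    """Chance that a given k-mer occurs at least once in a random genome.

    Poisson approximation; assumes equal base frequencies and
    independent positions (both are simplifying assumptions).
    """
    rate = genome_size / 4 ** k
    return -math.expm1(-rate)  # equals 1 - exp(-rate)


def smallest_k_below(genome_size, threshold, k_max=64):
    """Smallest k whose chance-presence probability is below threshold."""
    for k in range(1, k_max + 1):
        if random_kmer_presence(genome_size, k) < threshold:
            return k
    return None


p15 = random_kmer_presence(5e8, 15)
k_needed = smallest_k_below(5e8, 0.01)
```

Under these assumptions, k = 15 with a genome of 5 × 10^8 bases gives a chance-presence probability of roughly 0.37. Biased GC content would make chance matches more likely than this uniform model suggests.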
p_h
• Prediction of the ratio ns/nt
• Large genomes and small k: p_h = 1
• All possible k-mers occur in both species. This problem is exacerbated if GC content is biased, which will inflate the average
similarity in genomic k-mer composition.
• GC content
• Sufficiently large k will overcome homoplasy, regardless
of the evolutionary distance between species.
Mathematical prediction
Random ancestral sequence
Real (non-random) sequence
Assembly-free
• Sampling error caused by low genome coverage
• The actual number of k-mers will be under-represented given low
sequencing coverage
• Sequencing errors
• Loss of true k-mers and the gain of false k-mers
• Filtering = remove singletons
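Singleton filtering can be sketched as follows: count every k-mer across the reads, then keep only those seen at least twice. The function names and toy reads are invented for illustration; a real pipeline would use a dedicated k-mer counter rather than an in-memory dictionary.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count occurrences of every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def filter_kmers(counts, min_count=2):
    """Keep k-mers seen at least min_count times (2 removes singletons)."""
    return {kmer for kmer, n in counts.items() if n >= min_count}


# Toy reads: the third read carries one simulated sequencing error.
reads = ["ACGTAC", "ACGTAC", "ACGTTC"]
kept = filter_kmers(count_kmers(reads, k=4))
```

The k-mers created by the error in the third read each occur once and are dropped, while the true k-mers survive because they are covered by more than one read.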
Seq errors
p=observed/true
=> Tip corrections
Coverage 5-8 sufficient to observe all true k-mers when filtering
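A rough way to reason about coverage and filtering is a Poisson model, used here as an illustrative assumption rather than the calculation behind the slide. It assumes reads start uniformly along the genome and contain no errors. The read length of 100 and k = 21 below are hypothetical values.

```python
import math


def kmer_retention_probability(coverage, read_length, k, min_count=2):
    """P(a true k-mer is observed at least min_count times).

    Poisson model: a k-mer is covered by reads at an average depth of
    coverage * (read_length - k + 1) / read_length.
    """
    lam = coverage * (read_length - k + 1) / read_length
    below = sum(
        math.exp(-lam) * lam ** i / math.factorial(i) for i in range(min_count)
    )
    return 1.0 - below


# Hypothetical settings: 100 bp reads, k = 21, singletons filtered.
p_low = kmer_retention_probability(5, 100, 21)
p_high = kmer_retention_probability(8, 100, 21)
```

Comparing the retained fraction across coverage values shows how quickly the loss of true k-mers shrinks as coverage grows.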
Filter only singletons?
Bootstrapping
Nonparametric bootstrap
1) Resample original reads with replacement
2) “Block bootstrap” – take rows with probability 1/k
OR
Two-stage parametric bootstrap
• Estimate the variances in distances between species
caused by sampling and evolutionary variation
• Independent of genome size
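The two nonparametric resampling steps can be sketched as below. The function names are hypothetical, and "rows" is assumed here to mean rows of the shared k-mer table. The two-stage parametric bootstrap is not shown.

```python
import random


def resample_reads(reads, rng):
    """Nonparametric bootstrap: draw len(reads) reads with replacement."""
    return rng.choices(reads, k=len(reads))


def block_bootstrap_rows(kmer_rows, k, rng):
    """Keep each row of a k-mer table independently with probability 1/k."""
    return [row for row in kmer_rows if rng.random() < 1.0 / k]


rng = random.Random(42)  # fixed seed so the sketch is repeatable
toy_reads = ["ACGTAC", "CGTACG", "GTACGT", "TACGTA"]
boot_reads = resample_reads(toy_reads, rng)
boot_rows = block_bootstrap_rows(list(range(1000)), k=21, rng=rng)
```

Each bootstrap replicate would then be run through the same distance and tree-building steps, and node support read off the resulting set of trees.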
Bushbaby (galago)
Tarsier
Recently published phylogeny of primates
Assembled genomes, k=19
Assembled genomes, k=21
Simulated reads
Real data – tropical trees
Intsia palembanica
Advantages
• Low coverage requirements
• Low computational demands
• 12 primates: 25 GB RAM, 12 threads
Limitations
• Loss of k-mer sensitivity
• Deep nodes
• Location of mutations
Distance computing for 73 Escherichia
strains
• AAF
• 32 min + 76 min = 1 h 48 min
• andi
• 21 min
AAF
andi