Phylogeny slides
Download
Report
Transcript Phylogeny slides
Phylogenetic Tree Construction
and Related Problems
Bioinformatics
Problems
Multiple Sequence Alignment
Construction of Phylogenetic Trees
Determining Genomic Distance through
Rearrangements
Alignment of Multiple
Sequences
We can extend the notion of alignment to
multiple strings:
-att-ct-ttactat-a-t
An alignment of strings S1…Sn is described
by strings S1’…Sn’
Each Si’ contains the characters of Si in order
interspersed with spaces (-)
No position exists that contain spaces for all Si’
Scoring a Multiple Alignment
Sum of pairs
Consider each pair in the set of sequences,
determine the similarity score (using gap, match,
and mismatch weights) for each pair, and add all
pair-wise scores
Distance from consensus
Consider all columns and count the total number
of differences from the consensus character
Variations: just count characters that differ from
consensus or have a difference score for each
differing character
Multiple Alignment Problem
Formulation: Given sequences S1…Sn,
obtain an optimal multiple alignment of
the sequences
“Optimal” depends on multiple
alignment scoring method
No known (correct) efficient algorithms
for this problem
Strategies
Brute force algorithm: consider all possible
alignments, then determine the one that
results in a best score (time complexity?)
Common Heuristic: Use regular (two-string)
alignment and then repeatedly add a string to
a growing alignment
O(nm2) – where m is the the max string length
Does not always produce an optimal alignment
Phylogenetic Tree Construction
Multiple alignments are often performed on
similar species
Next step: Construct an evolutionary tree of
these species
Input to problem: evolutionary distances
between each pair
Not the same as similarity score
Example: edit distance--count number of indels
needed to transform one string to the other
Problem Formulation
Given a set of species and evolutionary
distance between each pair (2d matrix
of numbers), construct a phylogenetic
tree consistent with the distances
Details needed:
Characteristics/constraints of a tree
Notion of “consistent”
Phylogenetic Tree
Rooted
Leaves are species
Internal nodes are speculated ancestors
Distances associated with each edge
Example
ancestor
1
4
ancestor
2
dog
bat
3
cat
Minimizing Distance Deviation
Phylogenetic tree implies pairwise
distances (sum all edge distances in
path)
Compare with input distances to assess
consistency
Sum of Differences
Alternative computations for deviation
(least squares)
Example
Suppose Input:
D(dog,cat): 4
D(dog,bat): 7
D(cat,bat): 10
ancestor
1
4
ancestor
2
dog
bat
3
cat
Implied by Tree:
D(dog,cat): 5
D(dog,bat): 7
D(cat,bat): 8
Difference = (5- 4) + (7-7) + (10-8) = 3
Character-based Tree
Construction
Input:
a set C of m characters
possible values for each character
a set S of n species where each element of S is associated
with a value for each character in C
Output: a phylogenetic tree T such that
Elements in S are the leaves of T
Each internal node of T has an assigned value per character
Each character value induces a connected subtree of T
Mutations on Genomes
Mutations can occur at the gene level
(indels of nucleotides) or at the genome
level (operations on genes)
At the genome level, for similar species,
the operations are rearrangements
Reversals
Transpositions
Translocations
Genomes, Chromosomes, and
Permutations
Genome: set of chromosomes
Chromosome: sequence of genes
Genes: unique across entire genome
Can simplify the representation of a
genome by arbitrarily concatenating the
chromosomes
Result: a permutation of a set of unique
genes
Determining Genomic Distance
Recall: Phylogenetic tree construction
need pair-wise distances
For similar species with the same set of
genes:
Can use number of rearrangement
operations for distance
Common distance model: obtain the
fewest (optimal) rearrangements necessary
to transform genome X to genome Y
Some Assumptions Made
Can limit allowed rearrangements to
some reasonable subset of possible
arrangements
e.g., reversals apparently most common for
plant species
Sometimes, gene-blocks instead of
individual genes are the elements that
are permuted
Problem Formulations
Given permutations (species) p and q of
the set {1…n}, find the shortest
sequence of rearrangements
(mutations) that transform p to q
p.r1. r2 … rs = q
Given a permutation p, find the shortest
sequence of rearrangements that sort p
p.r1. r2 … rs =
123…n