CSCE590/822 Data Mining Principles and Applications

Transcript CSCE590/822 Data Mining Principles and Applications

CSCE555 Bioinformatics
Lecture 13 Phylogenetics II
HAPPY CHINESE NEW YEAR
Meeting: MW 4:00PM-5:15PM SWGN2A21
Instructor: Dr. Jianjun Hu
Course page: http://www.scigen.org/csce555
University of South Carolina
Department of Computer Science and Engineering
2008 www.cse.sc.edu.
Outline
Review For Exams
 Data for Phylogenetic Tree inference
 Classification of Tree inference
approaches
 Neighbor-joining algorithm
 Parsimony-based tree reconstruction
 Least Square Best-fit reconstruction

7/6/2015
2
Midterm, Midterm
How to review: read slides and
textbooks, especially CG book.
 Format of problems: examples

◦ Brief questions: what is the difference
between global alignment and local alignment?
◦ calculation: build a HMM model for a multiple
seq alignment
◦ Definition: blasting, Motif, ORF
Covered Topics

Understand: concepts, algorithm ideas, tools
◦ Sequencing/blasting
◦ Gene finding
◦ Alignment algorithms and applications
◦ DNA motif search
◦ HMM profiles
◦ Gene prediction algorithms
◦ Promoter predictions
◦ Comparative genomics
◦ ……
Phylogenetic Reconstruction

There are essentially two types of data for phylogenetic
tree estimation:
◦ Distance data, usually stored in a distance matrix, e.g.
DNA×DNA hybridisation data, morphometric differences,
immunological data, pairwise genetic distances
◦ Character data, usually stored in a character array;
 e.g. multiple sequence alignment of DNA sequences, morphological
characters.
Characters
Distances
1
1234567890
Taxa
A
B
C
D
E
Taxa
A
B
C
D
E
1
0
0
0
0
0
1
0
0
0
0
0
1
0
0
0
0
0
1
0
1
1
0
0
0
1
1
0
0
0
0
1
1
0
0
1
1
1
0
0
1
1
1
0
0
1
1
1
1
0
Phylogenetic Reconstruction

Given the huge number of possible trees
even for small data sets, we have two
options:
◦ Build one according to some clustering
algorithm
◦ Assign a “goodness of fit” criterion (an
objective function) and find the tree(s) which
optimise(s) this criterion
Phylogenetic Reconstruction
Type of Data
Nucleotide
Sites
Clustering
Algorithm
Optimality
Criterion
Tree Building Method
Distances
UPGMA
Neighbor-Joining
Maximum
Parsimony
Minimum
Evolution
Maximum
Likelihood
CS369 2007
7
Phylogenetic Methods
Many different procedures exist. Three of the most
popular:
Neighbor-joining
• Minimizes distance between nearest
neighbors
Maximum parsimony
• Minimizes total evolutionary change
Maximum likelihood
• Maximizes likelihood of observed data
Distance based tree Construction
Given a set of species (leaves in a supposed tree), and
distances between them – construct a phylogeny which
best “fits” the distances.
Orc: ACAGTGACGCCCCAAACGT
Elf:
ACAGTGACGCTACAAACGT
Dwarf: CCTGTGACGTAACAAACGA
Hobbit: CCTGTGACGTAGCAAACGA
Human:CCTGTGACGTAGCAAACGA
Orc
Elf
Dwarf
Hobbit
Human
Distance Matrix



Given n species, we can compute the n x n
distance matrix Dij
Dij may be defined as the edit distance
between a gene in species i and species j,
where the gene of interest is sequenced for
all n species.
Dij can also be any other feature-based
distances
Distances in Trees
Edges may have weights reflecting:
◦ Number of mutations on evolutionary path
from one species to another
◦ Time estimate for evolution of one species
into another
 In a tree T, we often compute
dij(T) - the length of a path between leaves i
and j

Distances in Trees
Edges may have weights reflecting:
◦ Number of mutations on evolutionary path from
one species to another
◦ Time estimate for evolution of one species into
another
 In a tree T, we often compute
dij(T) - the length of a path between leaves i and j

Distance in Trees: an Exampe
j
i
d1,4 = 12 + 13 + 14 + 17 + 12 = 68
Fitting Distance Matrix
Given n species, we can compute the n x
n distance matrix Dij
 Evolution of these genes is described by a
tree that we don’t know.
 We need an algorithm to construct a tree
that best fits the distance matrix Dij

Reconstructing a 3 Leaved Tree
Tree reconstruction for any 3x3 matrix is
straightforward
 We have 3 leaves i, j, k and a center vertex c

Observe:
dic + djc = Dij
dic + dkc = Dik
djc + dkc = Djk
Reconstructing a 3 Leaved Tree
dic + djc = Dij
+ dic + dkc = Dik
2dic + djc + dkc = Dij + Dik
2dic +
Djk
= Dij + Dik
dic = (Dij + Dik – Djk)/2
Similarly,
djc = (Dij + Djk – Dik)/2
dkc = (Dki + Dkj – Dij)/2
Trees with > 3 Leaves

An tree with n leaves has 2n-3 edges

This means fitting a given tree to a
distance matrix D requires solving a
system of “n choose 2” equations with
2n-3 variables

This is not always possible to solve for n
>3
Additive Distance Matrices
Matrix D is
ADDITIVE if there
exists a tree T with
dij(T) = Dij
NON-ADDITIVE
otherwise
Distance Based Phylogeny Problem
Goal: Reconstruct an evolutionary tree
from a distance matrix
 Input: n x n distance matrix Dij
 Output: weighted tree T with n leaves
fitting D


If D is additive, this problem has a solution
and there is a simple algorithm to solve it
Using Neighboring Leaves to Construct the Tree
Find neighboring leaves i and j with parent k
 Remove the rows and columns of i and j
 Add a new row and column corresponding to
k, where the distance from k to any other leaf
m can be computed as:

Dkm = (Dim + Djm – Dij)/2
Compress i and j into
k, iterate algorithm for
rest of tree
Finding Neighboring Leaves
• To find neighboring leaves we simply
select a pair of closest leaves.
Finding Neighboring Leaves
• To find neighboring leaves we simply
select a pair of closest leaves.
WRONG
Finding Neighboring Leaves
• Closest leaves aren’t necessarily neighbors
• i and j are neighbors, but (dij = 13) > (djk = 12)
• Finding a pair of neighboring leaves is
a nontrivial problem!
Neighbor Finding: Seitou & Nei
algorithm (1987)
Definitions
For a leaf i, let ri 
 d (i, u).
u is a leaf
For leavesi, j :
D(i, j )  ( L  2)d (i, j )  ( ri  r j )
Theorem (Saitou & Nei) Assume all edge weights are
positive. If D(i,j) is minimal (among all pairs of leaves), then i
and j are neighboring leaves in the tree.
Neighbor Joining Algorithm

In 1987 Naruya Saitou and Masatoshi Nei
developed a neighbor joining algorithm for
phylogenetic tree reconstruction

Finds a pair of leaves that are close to
each other but far from other leaves:
implicitly finds a pair of neighboring leaves

Advantages: works well for additive and other
non-additive matrices, it does not have the
flawed molecular clock assumption
Neighbor-joining
Guaranteed to produce the correct tree if distance is
additive
 May produce a good tree even when distance is not
additive
Step 1: Finding neighboring leaves
1
3
Define
Dij = dij – (ri + rj)
0.1
0.1
0.1
Where
1
0.4
ri = –––––k dik
0.4
|L| - 2

2
4
Algorithm: Neighbor-joining
Initialization:
Define T to be the set of leaf nodes, one per sequence
Let L = T
Iteration:
 Pick i, j s.t. Dij is minimal
 Define a new node k, and set dkm = ½ (dim + djm – dij)
for all m  L
 Add k to T, with edges of lengths dik = ½ (dij + ri – rj)
 Remove i, j from L;
 Add k to L
Termination:
When L consists of two nodes, i, j, and the edge
between them of length dij
Rooting a tree, and definition of
outgroup
Neighbor-joining produces an unrooted tree
How do we root a tree between N species using n-j?
An outgroup is a species that we know to be more
distantly related to all remaining species, than they are
to one another
Example: Human, mouse, rat, pig, dog, chicken, whale
1
Which one is an outgroup?
Outgroup can act as a root
4
2
3
Neighbor Joining Algorithm-Widely Used

Applicable to matrices which are not additive

Known to work good in practice

The algorithm and its variants are the most
widely used distance-based algorithms today.
Maximum Parsimony Method for
Tree Inference
A Character-based method
Input: h sequences (one per species), all of length
k.
Goal: Find a tree with the input sequences at its
leaves, and an assignment of sequences to
internal nodes, such that the total number of
substitutions is minimized.
Two sub-problems:
1. Find the parsimony cost of a given tree (easy)
2. Search through all tree topologies (hard)
Example
Input: four nucleotide sequences: AAG, AAA, GGA,
AGA taken from four species.
By the parsimony principle, we seek a tree that has a
minimum total number of substitutions of symbols
between species and their originator in the
phylogenetic tree. Here is one possible tree.
AAA
AAA
1
AAG
2
GGA
AAA
AAA
1
AGA
Total #substitutions = 4
Least Squares Distance Phylogeny
Problem



If the distance matrix D is NOT additive, then we
look for a tree T that approximates D the best:
Squared Error : ∑i,j (dij(T) – Dij)2
Squared Error is a measure of the quality of the fit
between distance matrix and the tree: we want to
minimize it.
Least Squares Distance Phylogeny Problem:
finding the best approximation tree T for a nonadditive matrix D (NP-hard).
Search through tree topologies:
Branch and Bound
Observation: adding an edge to an existing tree can only increase the
parsimony cost
Enumerate all unrooted trees with at most n leaves:
[i3][i5][i7]……[i2N–5]]
where each ik can take values from 0 (no edge) to k
At each point keep C = smallest cost so far for a complete tree
Start B&B with tree [1][0][0]……[0]
Whenever cost of current tree T is > C, then:
◦ T is not optimal
◦ Any tree with more edges containing T, is not optimal:
Increment by 1 the rightmost nonzero counter
Comparison of Methods
Neighbor-joining
Maximum parsimony
Maximum likelihood
Very fast
Slow
Very slow
Easily trapped in local
optima
Assumptions fail when Highly dependent on
evolution is rapid
assumed evolution model
Good for generating
tentative tree, or choosing
among multiple trees
Best option when
tractable (<30 taxa,
strong conservation)
Good for very small data
sets and for testing trees
built using other methods
Summary
Category of phylogenetic inference algorithms
 Neighbor-joining algorithm

Acknowledgement

Anonymous authors

CSCE590/822 Data Mining Principles and Applications

Transcript CSCE590/822 Data Mining Principles and Applications

Directory