Transcript Document

COMP 538
Introduction of Bayesian networks
Lecture 16: Wrap-Up
Phylogeny / Slide 2
Nevin L. Zhang, HKUST
Recap

Latent class models




Clustering
Clustering criterion: conditional independence
Drawback: Assumption too strong
Hierarchical latent class (HLC) models


Identifiability issues: regularity, equivalence
Hill climbing algorithm
Phylogeny / Slide 3
Nevin L. Zhang, HKUST
Today

Phylogenetic (evolution) trees


Closely related to HLC models
An example of viewing existing models in the framework of BN
– Another example: HMM

Interesting because
– Ease understanding
– Techniques in one field applied to another
 Structural EM for phylogenetic trees
 Dynamic BNs for speech understanding
– Development of general purpose algorithms

Bayesian networks for classification

Hand waving only
Phylogeny / Slide 4
Nevin L. Zhang, HKUST
Phylogenetic Tree Outline

Introduction to phylogenetic trees

Probabilistic models of evolution

Tree reconstruction
Phylogeny / Slide 5
Nevin L. Zhang, HKUST
Phylogenetic Trees

Assumption



Phylogeny


All organisms on Earth have a common ancestor
This implies that any set of species is related.
The relationship between any set of species.
Phylogenetic tree

Usually, the relationship can be represented by a tree which is
called a phylogenetic (evolution) tree
– this is not always true
Phylogeny / Slide 6
Nevin L. Zhang, HKUST
Phylogenetic Trees

Phylogenetic trees
Time
giant
panda
lesser
panda
moose
goshawk
duck
vulture
alligator
Current-day species at bottom
Phylogeny / Slide 7
Nevin L. Zhang, HKUST
Phylogenetic Trees



TAXA (sequences) identify species
Edge lengths represent evoluation time
Assumption: bifurcating tree toplogy
AAGGCCT
AAGACTT
AGCACTT
AAGGCAT
AGGGCAT
AGCACAA
TAGACTT
TAGCCCA
AGCGCTT
Time
Phylogeny / Slide 8
Nevin L. Zhang, HKUST
Probabilistic Models of Evolution

Characterize relationship between taxa using substitution probability:
– P(x | y, t): probability that ancestral sequence y evolves into sequence x
along an edge of length t
t5
x5
t1
s1
x7
t2
s2
t3
s3
x6
t6
t4
s4
– P(X7), P(X5|X7, t5), P(X6|X7, t6), P(S1|X5, t1), P(S2|X5, t2), ….
Phylogeny / Slide 9
Nevin L. Zhang, HKUST
Probabilistic Models of Evolution


What should P(x|y, t) be?
Two assumptions of commonly used models

There are only substitutions, no insertions/deletions (aligned)
– One-to-one correspondence between sites in different sequences


Each site evolves independently and identically
P(x|y, t) = Pi=1 to m P(x(i) | y(i), t)

m is sequence length
AAGGCCT
AAGACTT
AGCACTT
AAGGCAT
AGGGCAT
TAGACTT
TAGCCCA
AGCACAA
AGCGCTT
Phylogeny / Slide 10
Nevin L. Zhang, HKUST
Probabilistic Models of Evolution

What should P(x(i)|y(i), t) be?

Jukes-Cantor (Character Evolution) Model [1969]
– Rate of substitution a (Constant or parameter?)

A
A
rt
C
st
G
st
T
st
C
st
rt
st
st
G
st
st
rt
st
T
st
st
st
rt
rt = 1/4 (1 + 3e-4at)
st = 1/4 (1 - e-4at)
Limit values when
t = 0 or t = infinity?
Multiplicativity (lack of memory)
P(c | a, t1  t2 )   P(b | a, t1 ) P(c | b, t2 )
b
Phylogeny / Slide 11
Nevin L. Zhang, HKUST
Tree Reconstruction

Given: collection of
current-day taxa

Find: tree



Tree topology: T
Edge lengths: t
Maximum likelihood

AGGGCAT, TAGCCCA, TAGACTT,
AGCACAA, AGCGCTT
Find tree to maximize
P(data | tree)
AGGGCAT
AGCACAA
TAGACTT
TAGCCCA
AGCGCTT
Phylogeny / Slide 12
Nevin L. Zhang, HKUST
Tree Reconstruction

When restricted to one particular site, a phylogenetic tree is an HLC
model where

The structure is a binary tree and variables share the same state space.
 The conditional probabilities are from the character evolution model,
parameterized by edge lengths instead of usual parameterization.
 The model is the same for different sites
AAGGCCT
AAGACTT
AGCACTT
AGGGCAT
AGCACAA
TAGACTT
TAGCCCA
AGCGCTT
Phylogeny / Slide 13
Nevin L. Zhang, HKUST
Tree Reconstruction

Current-day Taxa: AGGGCAT, TAGCCCA, TAGACTT, AGCACAA,
AGCGCTT

Samples for HLC model. One Sample per site. The samples are i.i.d.

1st site:
(A, T, T, A, A),
2nd site: (G, A, A, G, G),
 3rd site: (G, G, G, C, C),


…
AAGGCCT
AAGACTT
AGCACTT
AGGGCAT
AGCACAA
TAGACTT
TAGCCCA
AGCGCTT
Phylogeny / Slide 14
Nevin L. Zhang, HKUST
Tree Reconstruction

Finding ML phylogenetic tree == Finding ML HLC model

Model space:


Model structures: binary tree where all variables share the same
state space, which is known.
Parameterization: one parameter for each edge. (In general, P(x|y)
has |x||y|-1 parameters).
Phylogeny / Slide 15
Nevin L. Zhang, HKUST
Bayesian Networks for Classification

The problem:


Given data:
Find mapping
– (A1, A2, …, An) |- C

Possible solutions

ANN
 Decision tree (Quinlan)
 …
A1
A2
…
An
C
0
1
1
0
T
1
0
1
1
F
..
..
..
..
..
Phylogeny / Slide 16
Nevin L. Zhang, HKUST
Bayesian Networks for Classification

Naïve Bayes model

From data, learn
– P(C), P(Ai|C)

Classification
– arg max_c P(C=c|A1=a1, …, An=an)

Very good in practice
Phylogeny / Slide 17
Nevin L. Zhang, HKUST
Bayesian Networks for Classification

Drawback of NB:



Attributes mutually independent given class variable
Often violated, leading to doubling counting.
Fixes:




General BN classifiers
Tree augmented Naïve Bayes (TAN) models
Hierarchical NB
…
Phylogeny / Slide 18
Nevin L. Zhang, HKUST
Bayesian Networks for Classification

General BN classifier




Treat class variable just as another variable
Learn a BN.
Classify the next instance based on values of variables in the
Markov blanket of the class variable.
Pretty bad because it does not utilize all available information
Phylogeny / Slide 19
Nevin L. Zhang, HKUST
Bayesian Networks for Classification

TAN model

Friedman, N., Geiger, D., and Goldszmidt, M. (1997). Bayesian
networks classifiers. Machine Learning, 29:131-163.
 Capture dependence among attributes using a tree structure.
 During learning,
– First learn a tree among attributes: use Chow-Liu algorithm
– Add class variable and estimate parameters

Classification
– arg max_c P(C=c|A1=a1, …, An=an)
Phylogeny / Slide 20
Nevin L. Zhang, HKUST
Bayesian Networks for Classification

Hierarchical Naïve Bayes models

N. L. Zhang, T. D. Nielsen, and F. V. Jensen (2002). Latent variable
discovery in classification models. Artificial Intelligence in Medicine, to
appear.
 Capture dependence among attributes using latent variables
 Detect interesting latent structures besides classification
 Currently, slow