L11: Uses of Bayesian Networks
Nevin L. Zhang
Department of Computer Science & Engineering
The Hong Kong University of Science & Technology
http://www.cse.ust.hk/~lzhang/
Page 2
Outline

Traditional Uses

Structure Discovery

Density Estimation

Classification

Clustering

An HKUST Project
Page 3
Traditional Uses


Probabilistic Expert Systems

Diagnosis

Prediction
Example: BN for diagnosing
“blue baby” syndrome over the phone
at a London hospital
Comparable to a specialist,
better than non-specialists
Page 4
Traditional Uses

Language for describing probabilistic models in Science &
Engineering

Example: BN for turbo code
Page 5
Traditional Uses

Language for describing probabilistic models in Science &
Engineering

Example: BN from Bioinformatics
Page 6
Outline

Traditional Uses

Structure Discovery

Density Estimation

Classification

Clustering

An HKUST Project
Page 7
BN for Structure Discovery

Given: Data set D on variables X1, X2, …, Xn

Discover dependence, independence, and even causal
relationships among the variables.

Example: Evolution trees
Page 8
Phylogenetic Trees


Assumption

All organisms on Earth have a common ancestor

This implies that any set of species is related.
Phylogeny


The relationship between any set of species.
Phylogenetic tree

Usually, the relationship can be represented by a tree, which is called a
phylogenetic (evolution) tree
 (though this is not always true)
Page 9
Phylogenetic Trees

Phylogenetic trees
[Figure: phylogenetic tree over giant panda, lesser panda, moose, goshawk, duck, vulture, and alligator; time runs downward, with current-day species at the bottom]
Page 10
Phylogenetic Trees

TAXA (sequences) identify species

Edge lengths represent evolution time

Assumption: bifurcating tree topology
[Figure: bifurcating tree with a sequence at each node (AAGGCCT, AAGACTT, AGCACTT, AAGGCAT, AGGGCAT, AGCACAA, TAGACTT, TAGCCCA, AGCGCTT); the vertical axis is time]
Page 11
Probabilistic Models of Evolution

Characterize relationship between taxa using substitution probability:
P(x | y, t): probability that ancestral sequence y evolves into sequence x along
an edge of length t
[Figure: tree with root X7, internal nodes X5 and X6, leaves S1-S4, and edge lengths t1-t6]
Model parameters: P(X7), P(X5|X7, t5), P(X6|X7, t6), P(S1|X5, t1), P(S2|X5, t2), ….
Page 12
Probabilistic Models of Evolution

What should P(x|y, t) be?

Two assumptions of commonly used models

There are only substitutions, no insertions/deletions (aligned)
 One-to-one correspondence between sites in different sequences


Each site evolves independently and identically
P(x | y, t) = ∏_{i=1}^{m} P(x(i) | y(i), t)

m is sequence length
[Figure: sequence-labeled tree, as on Page 10]
Page 13
Probabilistic Models of Evolution

What should P(x(i)|y(i), t) be?

Jukes-Cantor (Character Evolution) Model [1969]
 Rate of substitution a (constant or parameter?)

Substitution probabilities P(x(i) | y(i), t) over {A, C, G, T}: rt on the diagonal, st off the diagonal:

      A    C    G    T
 A    rt   st   st   st
 C    st   rt   st   st
 G    st   st   rt   st
 T    st   st   st   rt

 rt = 1/4 (1 + 3e^(-4at))
 st = 1/4 (1 - e^(-4at))

Limit values when t = 0 or t = infinity? (At t = 0 the matrix is the identity; as t → infinity every entry tends to 1/4.)

Multiplicativity (lack of memory):
 P(c | a, t1 + t2) = Σ_b P(b | a, t1) P(c | b, t2)
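A minimal runnable sketch (Python; an illustration added here, not from the slides) of the Jukes-Cantor probabilities. The rate a, the edge lengths, and the example sequences are made-up values:

```python
import numpy as np

def jc_matrix(t, a=1.0):
    """4x4 Jukes-Cantor matrix P(x | y, t) over (A, C, G, T):
    rt = 1/4 (1 + 3 e^{-4at}) on the diagonal, st = 1/4 (1 - e^{-4at}) elsewhere."""
    e = np.exp(-4 * a * t)
    return (1 - e) / 4 * np.ones((4, 4)) + e * np.eye(4)

# Limit behaviour: identity at t = 0, uniform 1/4 as t -> infinity.
print(np.round(jc_matrix(0.0), 3))
print(np.round(jc_matrix(100.0), 3))

# Multiplicativity (lack of memory): P(. | ., t1) P(. | ., t2) = P(. | ., t1 + t2).
assert np.allclose(jc_matrix(0.3) @ jc_matrix(0.7), jc_matrix(1.0))

# Site independence: the probability of one aligned sequence evolving into
# another is a product over sites.
idx = {c: i for i, c in enumerate("ACGT")}

def seq_prob(x, y, t, a=1.0):
    P = jc_matrix(t, a)
    return np.prod([P[idx[xi], idx[yi]] for xi, yi in zip(x, y)])

print(seq_prob("AAGACTT", "AAGGCCT", t=0.5))  # two sequences from the slides
```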
Page 14
Tree Reconstruction



Given: collection of current-day taxa
AGGGCAT, TAGCCCA, TAGACTT,
AGCACAA, AGCGCTT
Find: tree

Tree topology: T

Edge lengths: t
Maximum likelihood

Find the tree that maximizes P(data | tree)
[Figure: the five current-day taxa at the leaves of a tree to be reconstructed]
Page 15
Tree Reconstruction

When restricted to one particular site, a phylogenetic tree is an LT
model where

The structure is a binary tree and variables share the same state space.

The conditional probabilities come from the character evolution model,
parameterized by edge lengths instead of the usual free parameters.

The model is the same for different sites
[Figure: sequence-labeled tree, as on Page 10]
Page 16
Tree Reconstruction

Current-day Taxa: AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT

Samples for the LT model: one sample per site. The samples are i.i.d.

1st site: (A, T, T, A, A),

2nd site: (G, A, A, G, G),

3rd site: (G, G, G, C, C),
[Figure: sequence-labeled tree, as on Page 10]
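The per-site samples can be read directly off the aligned taxa; a tiny illustrative snippet reproducing the samples above:

```python
taxa = ["AGGGCAT", "TAGCCCA", "TAGACTT", "AGCACAA", "AGCGCTT"]
sites = list(zip(*taxa))      # one i.i.d. sample of (S1, ..., S5) per site
print(sites[0])               # ('A', 'T', 'T', 'A', 'A')
print(sites[1])               # ('G', 'A', 'A', 'G', 'G')
print(sites[2])               # ('G', 'G', 'G', 'C', 'C')
```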
Page 17
Tree Reconstruction

Finding ML phylogenetic tree == Finding ML LT model

Model space:

Model structures: binary tree where all variables share the same state
space, which is known.

Parameterization: one parameter for each edge. (In general, a conditional
table P(x|y) has (|x|-1)|y| free parameters.)

The objective is to find relationships among variables.

Applying new LTM algorithms to Phylogenetic tree reconstruction?
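To make evaluating P(data | tree) concrete, here is a sketch (illustrative, not from the slides) of the standard pruning computation for one site on the Page 11 topology; the edge lengths and site values below are made up:

```python
import numpy as np

idx = {c: i for i, c in enumerate("ACGT")}

def jc_matrix(t, a=1.0):
    """Jukes-Cantor transition matrix, as in the Page 13 sketch."""
    e = np.exp(-4 * a * t)
    return (1 - e) / 4 * np.ones((4, 4)) + e * np.eye(4)

def leaf_msg(char):
    """Indicator vector for an observed leaf character."""
    v = np.zeros(4)
    v[idx[char]] = 1.0
    return v

def up_msg(child_msgs, ts, a=1.0):
    """Combine children's messages at an internal node (pruning step)."""
    m = np.ones(4)
    for msg, tt in zip(child_msgs, ts):
        m *= jc_matrix(tt, a) @ msg   # sum out the child's state
    return m

def site_likelihood(s, t):
    """P(one site | tree) for the Page 11 topology:
    X7 -> (X5 -> S1, S2), (X6 -> S3, S4)."""
    m5 = up_msg([leaf_msg(s['s1']), leaf_msg(s['s2'])], [t['t1'], t['t2']])
    m6 = up_msg([leaf_msg(s['s3']), leaf_msg(s['s4'])], [t['t3'], t['t4']])
    m7 = up_msg([m5, m6], [t['t5'], t['t6']])
    return 0.25 * m7.sum()            # uniform root prior P(X7) = 1/4

ts = dict(t1=0.1, t2=0.2, t3=0.15, t4=0.1, t5=0.3, t6=0.25)  # made-up lengths
print(site_likelihood(dict(s1='A', s2='T', s3='T', s4='A'), ts))
```

The full-data likelihood is the product of this quantity over sites; maximum likelihood then searches over topologies and edge lengths.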
Page 18
Outline

Traditional Uses

Structure Discovery

Density Estimation

Classification

Clustering

An HKUST Project
Page 19
BN for Density Estimation

Given: Data set D on variables X1, X2, …, Xn

Estimate: P(X1, X2, …, Xn) under some constraints

..

Uses of the estimate:

Inference

Classification
Page 20
BN Methods for Density Estimation


Chow-Liu tree with X1, X2, …, Xn as nodes

Easy to compute

Easy to use

Might not be a good estimate of the “true” distribution
BN with X1, X2, …, Xn as nodes

Can be a good estimate of the “true” distribution.

Might be difficult to find

Might be complex to use
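A sketch of the Chow-Liu construction (illustrative): empirical pairwise mutual information plus a maximum-weight spanning tree, here via the networkx library on random toy data:

```python
import numpy as np
import networkx as nx
from itertools import combinations

def mutual_info(a, b):
    """Empirical mutual information I(A; B) between two discrete columns."""
    mi = 0.0
    for va in np.unique(a):
        for vb in np.unique(b):
            pab = np.mean((a == va) & (b == vb))
            if pab > 0:
                mi += pab * np.log(pab / (np.mean(a == va) * np.mean(b == vb)))
    return mi

def chow_liu_tree(D):
    """Maximum-weight spanning tree over variables, weighted by pairwise MI."""
    G = nx.Graph()
    for i, j in combinations(range(D.shape[1]), 2):
        G.add_edge(i, j, weight=mutual_info(D[:, i], D[:, j]))
    return list(nx.maximum_spanning_tree(G).edges)

D = np.random.randint(0, 2, size=(500, 5))   # toy binary data
print(chow_liu_tree(D))
```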
Page 21
BN Methods for Density Estimation

LC model with X1, X2, …, Xn as manifest variables (Lowd
and Domingos 2005)

Determine the cardinality of the latent variable using hold-out
validation,

Optimize the parameters using EM.

..

Easy to compute

Can be a good estimate of the “true” distribution

Might be complex to use (cardinality of latent variable might be very
large)
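A minimal EM sketch for an LC model with binary manifest variables (illustrative: K is fixed here, whereas the method above would choose it by hold-out validation, and the data are random):

```python
import numpy as np

def lc_em(D, K, iters=100, seed=0):
    """EM for a latent class model: P(x) = sum_y P(y) prod_i P(x_i | y).
    D: (N, n) 0/1 matrix. Returns prior P(Y) and table P(X_i = 1 | Y)."""
    rng = np.random.default_rng(seed)
    N, n = D.shape
    pi = np.full(K, 1.0 / K)                  # P(Y = k)
    theta = rng.uniform(0.25, 0.75, (K, n))   # P(X_i = 1 | Y = k)
    for _ in range(iters):
        # E-step: responsibilities r[j, k] = P(Y = k | x_j)
        log_r = np.log(pi) + D @ np.log(theta).T + (1 - D) @ np.log(1 - theta).T
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate P(Y) and P(X_i | Y) from expected counts
        pi = r.mean(axis=0)
        theta = (r.T @ D) / r.sum(axis=0)[:, None]
        theta = theta.clip(1e-6, 1 - 1e-6)
    return pi, theta

D = np.random.randint(0, 2, size=(1000, 6))   # toy data
pi, theta = lc_em(D, K=3)
print(pi)
```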
Page 22
BN Methods for Density Estimation

LT model for density estimation

Pearl 1988: As models over the manifest variables, LTMs

Are computationally very simple to work with.

Can represent complex relationships among manifest variables.
Page 23
BN Methods for Density Estimation

New approximate inference algorithm for Bayesian networks (Wang, Zhang
and Chen, AAAI 08; JAIR 32:879-900, 2008)
[Figure: a dense Bayesian network is approximated by sampling from it (Sample) and learning a sparse model from the samples (Learn); inference is then carried out in the sparse model]
Page 24
Outline

Traditional Uses

Structure Discovery

Density Estimation

Classification

Clustering

An HKUST Project
Page 25
Bayesian Networks for Classification

The problem:


Given data:
Find mapping
 (A1, A2, …, An) ↦ C

Possible solutions

ANN

Decision tree (Quinlan)

…

(SVM: Continuous data)
A1  A2  …  An  |  C
 0   1  …   0  |  T
 1   0  …   1  |  F
 …   …  …   …  |  …
Page 26
Bayesian Networks for Classification

Naïve Bayes model

From data, learn
 P(C), P(Ai|C)

Classification
 arg max_c P(C=c|A1=a1, …, An=an)

Very good in practice
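A sketch of the NB computations (illustrative; the Laplace smoothing and the toy 0/1 data are additions, not from the slides):

```python
import numpy as np

def nb_fit(A, C, alpha=1.0):
    """Estimate P(C) and P(A_i | C) from 0/1 attributes, with smoothing."""
    classes = np.unique(C)
    prior = {c: np.mean(C == c) for c in classes}
    cond = {c: (A[C == c].sum(axis=0) + alpha) / ((C == c).sum() + 2 * alpha)
            for c in classes}                 # P(A_i = 1 | C = c)
    return prior, cond

def nb_classify(a, prior, cond):
    """arg max_c P(C=c | a) via log P(c) + sum_i log P(a_i | c)."""
    def score(c):
        p = cond[c]
        return np.log(prior[c]) + np.sum(a * np.log(p) + (1 - a) * np.log(1 - p))
    return max(prior, key=score)

A = np.random.randint(0, 2, size=(200, 4))    # toy attributes
C = np.random.randint(0, 2, size=200)         # toy class labels
prior, cond = nb_fit(A, C)
print(nb_classify(np.array([1, 0, 1, 1]), prior, cond))
```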
Page 27
Bayesian Networks for Classification


Drawback of NB:

Attributes mutually independent given class variable

Often violated, leading to double counting.
Fixes:

General BN classifiers

Tree augmented Naïve Bayes (TAN) models

Hierarchical NB

Bayes rule + Density Estimation

…
Page 28
Bayesian Networks for Classification

General BN classifier

Treat class variable just as another variable

Learn a BN.

Classify the next instance based on values of variables in the Markov
blanket of the class variable.

Often performs poorly: classification uses only the variables in the Markov
blanket of the class variable, so the remaining attributes are ignored.
Page 29
Bayesian Networks for Classification

TAN model

Friedman, N., Geiger, D., and Goldszmidt, M. (1997). Bayesian
network classifiers. Machine Learning, 29:131-163.

Capture dependence among attributes using a tree structure.

During learning,
 First learn a tree among attributes: use Chow-Liu algorithm
 Add class variable and estimate parameters

Classification
 arg max_c P(C=c|A1=a1, …, An=an)
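A sketch of the structure-learning step (illustrative): the Chow-Liu procedure applied with conditional mutual information I(Ai; Aj | C), as in Friedman et al.; the toy data and the networkx spanning-tree call are choices made for this sketch:

```python
import numpy as np
import networkx as nx
from itertools import combinations

def cond_mutual_info(a, b, c):
    """Empirical conditional mutual information I(A; B | C)."""
    mi = 0.0
    for vc in np.unique(c):
        pc = np.mean(c == vc)
        for va in np.unique(a):
            for vb in np.unique(b):
                pabc = np.mean((a == va) & (b == vb) & (c == vc))
                pac = np.mean((a == va) & (c == vc))
                pbc = np.mean((b == vb) & (c == vc))
                if pabc > 0:
                    mi += pabc * np.log(pc * pabc / (pac * pbc))
    return mi

def tan_attribute_tree(A, C):
    """Maximum spanning tree over attributes, weighted by I(A_i; A_j | C)."""
    G = nx.Graph()
    for i, j in combinations(range(A.shape[1]), 2):
        G.add_edge(i, j, weight=cond_mutual_info(A[:, i], A[:, j], C))
    return list(nx.maximum_spanning_tree(G).edges)

A = np.random.randint(0, 2, size=(300, 4))    # toy attributes
C = np.random.randint(0, 2, size=300)         # toy class labels
print(tan_attribute_tree(A, C))
```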
Page 30
Bayesian Networks for Classification

Hierarchical Naïve Bayes models

N. L. Zhang, T. D. Nielsen, and F. V. Jensen (2002). Latent variable
discovery in classification models. Artificial Intelligence in Medicine, to
appear.

Capture dependence among attributes using latent variables

Detect interesting latent structures besides classification

Algorithm in the spirit of DHC …

…
Page 31
Bayesian Networks for Classification

Bayes Rule + Density Estimation

 P(C = c | a1, …, an) ∝ P(C = c) P(a1, …, an | C = c)

Estimate the class-conditional distribution P(A1, …, An | C = c) with a density estimator:

Chow-Liu

LC model

LT model

Wang Yi: Bayes rule + LT model is far superior
Page 32
Outline

Traditional Uses

Structure Discovery

Density Estimation

Classification

Clustering

An HKUST Project
Page 33
BN for Clustering



Latent class (LC) model

One latent variable

A set of manifest variables
Conditional Independence Assumption:

Xi’s mutually independent given Y.

Also known as Local Independence Assumption
Used for cluster analysis of categorical data

Determine cardinality of Y: number of clusters

Determine P(Xi|Y): characteristics of clusters
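Once P(Y) and P(Xi|Y) are learned, the cluster membership of a data case is just the posterior over Y; a small illustrative snippet with made-up parameters:

```python
import numpy as np

def cluster_posterior(x, pi, theta):
    """P(Y = k | x) ∝ P(Y = k) * prod_i P(x_i | Y = k), for a 0/1 data case x."""
    p = pi * np.prod(theta ** x * (1 - theta) ** (1 - x), axis=1)
    return p / p.sum()

pi = np.array([0.5, 0.3, 0.2])                # made-up P(Y)
theta = np.array([[0.9, 0.8, 0.1],            # made-up P(X_i = 1 | Y = k)
                  [0.2, 0.5, 0.7],
                  [0.5, 0.5, 0.5]])
print(cluster_posterior(np.array([1, 0, 1]), pi, theta))
```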
Page 34
BN for Clustering
Clustering Criteria

Distance based clustering:


Minimizes intra-cluster variation and/or maximizes inter-cluster
variation
LC Model-based clustering:

The criterion follows from the conditional independence assumption

Divide data into clusters such that, in each cluster, manifest
variables are mutually independent under the empirical distribution.
Page 35
BN for Clustering

Local independence assumption often not true

LT models generalize LC models

Relax the independence assumption

Each latent variable gives a way to partition data… multidimensional
clustering
Page 36
ICAC Data
// 31 variables, 1200 samples
C_City: s0 s1 s2 s3 // very common, quite common, uncommon, ...
C_Gov: s0 s1 s2 s3
C_Bus: s0 s1 s2 s3
Tolerance_C_Gov: s0 s1 s2 s3 // totally intolerable, intolerable, tolerable, ...
Tolerance_C_Bus: s0 s1 s2 s3
WillingReport_C: s0 s1 s2 // yes, no, depends
LeaveContactInfo: s0 s1 // yes, no
I_EncourageReport: s0 s1 s2 s3 s4 // very sufficient, sufficient, average, ...
I_Effectiveness: s0 s1 s2 s3 s4 // very effective, effective, average, ineffective, very ineffective
I_Deterrence: s0 s1 s2 s3 s4 // very sufficient, sufficient, average, ...
…
-1 -1 -1 0 0 -1 -1 -1 -1 -1 -1 0 -1 -1 -1 0 1 1 -1 -1 2 0 2 2 1 3 1 1 4 1 0 1.0
-1 -1 -1 0 0 -1 -1 1 1 -1 -1 0 0 -1 1 -1 1 3 2 2 0 0 0 2 1 2 0 0 2 1 0 1.0
-1 -1 -1 0 0 -1 -1 2 1 2 0 0 0 2 -1 -1 1 1 1 0 2 0 1 2 -1 2 0 1 2 1 0 1.0
…
Page 37
Latent Structure Discovery
Y2: Demographic info; Y3: Tolerance toward corruption
Y4: ICAC performance; Y7: ICAC accountability
Y5: Change in level of corruption; Y6: Level of corruption
(Zhang, Poon, Wang and Chen 2008)
Page 38
Interpreting Partition

Y2 partitions the population into 4 clusters

What is the partition about? What is the “criterion”?

On what manifest variables do the clusters differ the most?

Mutual information:
 I(Y2; X) = Σ_{y, x} P(Y2 = y, X = x) log [ P(Y2 = y, X = x) / (P(Y2 = y) P(X = x)) ]

The larger I(Y2; X), the more the 4 clusters differ on X
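A sketch of ranking manifest variables by empirical mutual information with the partition; the assignments and data below are random placeholders, not the ICAC data:

```python
import numpy as np

def mutual_info(y, x):
    """Empirical I(Y; X) between cluster assignments y and a manifest variable x."""
    mi = 0.0
    for vy in np.unique(y):
        for vx in np.unique(x):
            pxy = np.mean((y == vy) & (x == vx))
            if pxy > 0:
                mi += pxy * np.log(pxy / (np.mean(y == vy) * np.mean(x == vx)))
    return mi

# Placeholder stand-ins for the real quantities: cluster assignments under Y2
# and the 31 ICAC manifest variables.
y2 = np.random.randint(0, 4, 1200)
D = np.random.randint(0, 4, size=(1200, 31))
ranking = sorted(range(D.shape[1]), key=lambda i: -mutual_info(y2, D[:, i]))
print(ranking[:5])   # the manifest variables this partition is "about"
```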
Page 39
Interpreting Partition

Information curves:

Partition of Y2 is based on Income, Age, Education, Sex

Interpretation: Y2 --- Represents a partition of the population based
on demographic information

Y3 --- Represents a partition based on Tolerance toward Corruption
Page 40
Interpreting Clusters
Y2=s0: Low income youngsters; Y2=s1: Women with no/low income
Y2=s2: people with good education and good income;
Y2=s3: people with poor education and average income
Page 41
Interpreting Clustering
Y3=s0: people who find corruption totally intolerable; 57%
Y3=s1: people who find corruption intolerable; 27%
Y3=s2: people who find corruption tolerable; 15%
Interesting finding:
Y3=s2: 29+19=48% find C-Gov totally intolerable or intolerable; 5% for C-Bus
Y3=s1: 54% find C-Gov totally intolerable; 2% for C-Bus
Y3=s0: Same attitude toward C-Gov and C-Bus
People who are tough on corruption are equally tough toward C-Gov and C-Bus.
People who are relaxed about corruption are more relaxed toward C-Bus than C-Gov.
Page 42
Relationship Between Dimensions
Interesting finding: Relationship between background and tolerance toward corruption
Y2=s2: (good education and good income) the least tolerant. 4% tolerable
Y2=s3: (poor education and average income) the most tolerant. 32% tolerable
The other two classes are in between.
Page 43
Result of LCA

Partition not meaningful

Reason:


Local Independence not true
Another way to look at it

LCA assumes that all the manifest variables jointly define a meaningful way to cluster
the data
 Obviously not true for the ICAC data
 Instead, one should look for subsets of variables that do define meaningful partitions
and perform cluster analysis on them
 This is what we do with LTA
Page 44
Finite Mixture Models

Y: discrete latent variable

Xi: continuous

P(X1, X2, …, Xn|Y): Usually
multivariate Gaussian

No independence assumption

Assume states of Y: 1, 2, …, k
P(X1, X2, …, Xn)
= Σ_{i=1}^{k} P(Y=i) P(X1, X2, …, Xn | Y=i):
Mixture of k Gaussian components
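A small sketch of evaluating the mixture density (illustrative parameters; uses scipy's multivariate normal):

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(x, weights, means, covs):
    """P(x) = sum_i P(Y=i) N(x; mu_i, Sigma_i) for a k-component Gaussian mixture."""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=S)
               for w, m, S in zip(weights, means, covs))

# Toy 2-component mixture in 2 dimensions (all parameters made up).
weights = [0.6, 0.4]
means = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), 2 * np.eye(2)]
print(mixture_density(np.array([1.0, 1.0]), weights, means, covs))
```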
Page 45
Finite Mixture Models

Used to cluster continuous data

Learning

Determine
 k: number of clusters
 P(Y)
 P(X1, …, Xn|Y)

Also assume: all attributes jointly define a coherent partition

Not realistic

LT models are a natural framework for clustering high-dimensional data
Page 46
Outline

Traditional Uses

Structure Discovery

Density Estimation

Classification

Clustering

An HKUST Project
Page 47
Observations on How the Human Brain Thinks

Human beings often invoke latent variables to explain regularities that
we observe.

Example 1

Observe Regularity:
 Beer and diapers are often bought together in the early evening

Hypothesize (latent) cause:
 There must be a common (latent) cause

Identify the cause and explain the regularity
 Fathers of babies shopping on their way home from work
 Based on our understanding of the world
Page 48
Observations on How the Human Brain Thinks

Example 2

Background: At night, watching lights through the windows of apartments
in big buildings

Observe Regularity:
 Lights from several apartments change in brightness and color
at the same time and in perfect synchrony

Hypothesize common (latent) cause:
 There must be a common (latent) cause

Identify the cause and explain the phenomenon:
 People watching the same TV channel.
 Based on understanding of the world
Page 49
Back to Ancient Time

Observe Regularity

Several symptoms often occur together
 ‘intolerance to cold’, ‘cold limbs’, and ‘cold lumbus and back’

Hypothesize common latent cause:


There must be a common latent cause
Identify the cause

Answer based on the (primitive) understanding of the world at that time

Conclusion: Yang deficiency (阳虚)
 Explanation: Yang is like the sun; it warms your body. If you don't have enough
of it, you feel cold.
Page 50
Back to Ancient Time

Regularity observed:

Several symptoms often occur together
 Tidal fever (潮热), heat sensation in palms and feet (手足心热), palpitations (心慌
心跳), thready and rapid pulse (脉细数)

Hypothesize common latent cause:


There must be a common latent cause
Identify the cause and explain the regularity

Yin deficiency causing internal heat (阴虚内热)
 Yin and Yang should be in balance. If Yin is deficient, Yang will be relatively
in excess and hence cause heat.
Page 51
Traditional Chinese Medicine (TCM)

Claim


TCM Theories = Statistical Regularities + Subjective Interpretations
How to justify the claim?
Page 52
A Case Study

We collected a data set about kidney deficiency (肾虚)

35 symptom variables, 2600 records
Result of Data Analysis

Y0-Y34: manifest variables from data

X0-X13: latent variables introduced by data analysis

Structure interesting, supports TCM’s theories about various symptoms.
Page 53
Page 54
Other TCM Data Sets

From Beijing U of TCM, 973 project

 Depression
 Hepatitis B
 Chronic Renal Failure
 COPD
 Menopause

China Academy of TCM

 Subhealth
 Diabetes

In all cases, results of LT analysis match relevant TCM Theories
Page 55
Result on the Depression Data
Page 56
Significance

Conclusion


TCM Theories = Statistical Regularities + Subjective Interpretations
Significance

TCM theories are partially based on objective facts
 Boost user confidence

Can help to lay a modern statistical foundation for TCM
 Systematically identify statistical regularities about occurrence of
symptoms, find natural partitions
 Establish objective and quantitative diagnosis standards
 Assist in double-blind experiments to evaluate and improve the efficacy
of TCM treatments
Page 57
Outline

Traditional Uses

Structure Discovery

Density Estimation

Classification

Clustering

An HKUST Project