Transcript View
Cognitive Computer Vision
Kingsley Sage
[email protected]
and
Hilary Buxton
[email protected]
Prepared under ECVision Specific Action 8-3
http://www.ecvision.org
Lecture 13
Learning Bayesian Belief Networks
Taxonomy of methods
Learning BBNs for the fully observable data
and known structure case
So why are BBNs relevant to
Cognitive CV?
Provides a well-founded methodology for
reasoning with uncertainty
These methods are the basis for our model of
perception guided by expectation
We can develop well-founded methods of
learning rather than just being stuck with hand-coded models
Reminder: What is a BBN?
[Figure: an example BBN drawn as a DAG with root nodes A and B, their common child O, and O's children C and N, annotated with its CPDs:
  p(a=detect) = 0.2,  p(b=detect) = 0.1
  p(o=T | a, b):  (a=T, b=T) 0.95;  (a=T, b=F) 0.6;  (a=F, b=T) 0.5;  (a=F, b=F) 0.01
  p(c=T | o):  (o=T) 0.7;  (o=F) 0.1
  p(n=T | o):  (o=T) 0.7;  (o=F) 0.2]
Compact representation of the joint probability
Each variable is represented as a node
Conditional independence assumptions are encoded using a set of arcs
Different types of graph exist. The one shown is a Directed Acyclic Graph (DAG)
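
As a concrete illustration (ours, not from the original slides), here is a minimal Python sketch of the example network above: the CPDs are stored as plain dictionaries and the joint probability of a full assignment is simply the product of one local factor per node. The function and variable names are our own.

# Minimal sketch of the example BBN above; the numbers come from the figure's CPDs.
p_a = 0.2                                          # p(a=detect)
p_b = 0.1                                          # p(b=detect)
p_o = {(True, True): 0.95, (True, False): 0.6,     # p(o=T | a, b)
       (False, True): 0.5, (False, False): 0.01}
p_c = {True: 0.7, False: 0.1}                      # p(c=T | o)
p_n = {True: 0.7, False: 0.2}                      # p(n=T | o)

def bernoulli(p_true, value):
    """Probability of a binary value given p(value=True)."""
    return p_true if value else 1.0 - p_true

def joint(a, b, o, c, n):
    """P(a,b,o,c,n) = P(a) * P(b) * P(o|a,b) * P(c|o) * P(n|o)."""
    return (bernoulli(p_a, a) * bernoulli(p_b, b) *
            bernoulli(p_o[(a, b)], o) *
            bernoulli(p_c[o], c) * bernoulli(p_n[o], n))

print(joint(True, False, True, True, False))       # one entry of the full joint table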
Why is learning important in the
context of BBNs?
Knowledge acquisition can be an expensive
process
Experts may not be readily available (scarce
knowledge) or simply not exist
But you might have a lot of data from (say) case
studies
Learning allows us to construct BBN models from
the data and in the process gain insight into the
nature of the problem domain
The process of learning
[Figure: the learning process takes two inputs: the data (which may be full or partial) and the model structure (if known)]
What do we mean by “partial” data?
Training data where there are missing values e.g.:
Discrete valued BBN with 3 nodes (A, B, O); each row is one training case:

  a=   b=   o=
  T    T    F
  F    T    F
  F    ?    T
  T    F    ?
  …    …    …
  F    T    F
What do we mean by “known” and
“unknown” structure?
[Figure: two versions of the 3-node network. Known structure: A and B with arcs into O. Unknown structure: the nodes A, B and O with the arcs yet to be determined]
Taxonomy of learning methods
Observability    Model structure: Known             Model structure: Unknown
Full             Maximum likelihood estimation      Search through model space
Partial          Expectation Maximisation (EM)      EM + search through model space
                 or gradient descent                (structural EM)

In this lecture we will look at the full observability and known model structure case in detail
In the next lecture we will take an overview of the other three cases
Full observability & known structure
Getting the notation right
The model parameters (CPDs) are represented as θ (example later)
Training data set D
We want to find the parameters θ that maximise P(θ|D):

  P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{P(D)}

The term P(D|θ) is the LIKELIHOOD; the likelihood function L(θ:D) is P(D|θ)
Full observability & known structure
Getting the notation right
Training data D for the 3-node BBN (A, B, O), one row per training case:

  D = \begin{pmatrix} A[1] & B[1] & O[1] \\ \vdots & \vdots & \vdots \\ A[M] & B[M] & O[M] \end{pmatrix}

  L(\theta : D) = P(D \mid \theta) = \prod_m P(A[m], B[m], O[m] : \theta)
Factorising the likelihood expression
  L(\theta : D) = P(D \mid \theta) = \prod_m P(A[m], B[m], O[m] : \theta)
                = \prod_m P(A[m] : \theta) \; \prod_m P(B[m] : \theta) \; \prod_m P(O[m] \mid A[m], B[m] : \theta)
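
A minimal Python sketch of this factorisation (the tiny data set and the helper names are illustrative, not from the lecture): each node contributes its own product over the training cases, so each factor can be computed, and maximised, separately.

from math import prod

# Hypothetical fully observed training data: tuples (a[m], b[m], o[m]).
data = [(True, True, True), (True, False, False), (False, True, True)]

def bernoulli(p_true, value):
    return p_true if value else 1.0 - p_true

def likelihood(theta_a, theta_b, theta_o, data):
    """L(theta:D) = prod_m P(a[m]) * prod_m P(b[m]) * prod_m P(o[m] | a[m], b[m])."""
    factor_a = prod(bernoulli(theta_a, a) for a, _, _ in data)          # root node A
    factor_b = prod(bernoulli(theta_b, b) for _, b, _ in data)          # root node B
    factor_o = prod(bernoulli(theta_o[(a, b)], o) for a, b, o in data)  # conditional node O
    return factor_a * factor_b * factor_o

# Example parameter set (arbitrary values for illustration):
theta_o = {(True, True): 0.95, (True, False): 0.6, (False, True): 0.5, (False, False): 0.01}
print(likelihood(0.2, 0.1, theta_o, data))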
Decomposition in general
  L(\theta : D) = \prod_m P(x_1[m], \ldots, x_n[m] : \theta)
                = \prod_i \prod_m P(x_i[m] \mid \mathrm{Parents}_i[m] : \theta_i)

All the parameters for each node can be estimated separately
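
A generic sketch of this decomposition (the structure, data format and function names are our own illustration): the log-likelihood splits into one term per node, so each θ_i can be fitted independently of the others.

from math import log

parents = {"A": [], "B": [], "O": ["A", "B"]}          # known structure

def log_likelihood(data, prob):
    """sum_i sum_m log P(x_i[m] | Parents_i[m] : theta_i).

    data: list of dicts {node: value};
    prob(node, value, parent_values): the node's CPD, e.g. a table lookup.
    """
    total = 0.0
    for node, pa in parents.items():                   # one independent term per node
        for case in data:
            pa_vals = tuple(case[p] for p in pa)
            total += log(prob(node, case[node], pa_vals))
    return total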
Example
Estimating parameter for root node
Let's say our training data D contains these values for A: {T, F, T, T, F, T, T, T}
We represent our single parameter θ as the probability that a=T
The likelihood for the sequence is:

  L(\theta : D) = \theta\,(1-\theta)\,\theta\,\theta\,(1-\theta)\,\theta\,\theta\,\theta = \theta^{6} (1-\theta)^{2}
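
The short sketch below (our illustration; the helper names are not from the lecture) evaluates this likelihood on a grid of θ values for the sequence above; the maximum sits at θ = 0.75, which is the MLE derived on the next slide.

data_a = [True, False, True, True, False, True, True, True]   # the A sequence above

def likelihood(theta, data):
    """L(theta:D) = theta^(#T) * (1 - theta)^(#F)."""
    n_true = sum(data)
    n_false = len(data) - n_true
    return theta ** n_true * (1.0 - theta) ** n_false

grid = [i / 100 for i in range(1, 100)]                        # theta in (0, 1)
best = max(grid, key=lambda t: likelihood(t, data_a))
print(best)                                                    # 0.75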
So what about the prior on ?
  P(\theta \mid a[1], \ldots, a[M]) = \frac{P(a[1], \ldots, a[M] \mid \theta)\,P(\theta)}{P(a[1], \ldots, a[M])}

We have an expression for P(a[1],…,a[M] | θ); all we need to do now is to say something about P(θ):

  P(\theta \mid a[1], \ldots, a[M]) \propto P(a[1], \ldots, a[M] \mid \theta)\,P(\theta)

If all values of θ were equally likely at the outset, then we have a MAXIMUM LIKELIHOOD ESTIMATE (MLE) for P(θ|a[1],…,a[M]), which for our example is θ = 0.75, i.e. p(a=T) = 0.75
So what about the prior on ?
  P(\theta \mid a[1], \ldots, a[M]) \propto P(a[1], \ldots, a[M] \mid \theta)\,P(\theta)

If P(θ) is not uniform, we need to take that into account when computing our estimate for a model parameter. In that case P(θ|x[1],…,x[M]) would be a MAXIMUM A POSTERIORI (MAP) estimate
There are many different forms of prior; one of the more common ones in this application is the DIRICHLET prior …
The Dirichlet prior
[Figure: plot of the Dirichlet prior density p(θ) over the parameter θ, Dirichlet(α_T, α_F)]
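
As a hedged sketch of how such a prior can be used (our own illustration; the function and hyperparameter names are not from the lecture): for a single binary parameter the posterior under a Dirichlet(α_T, α_F) prior is again Dirichlet, and taking its mode gives the MAP estimate. With a uniform prior this collapses to the MLE of 0.75 from the earlier example.

def map_estimate(data, alpha_T=1.0, alpha_F=1.0):
    """MAP estimate of p(a=T): the mode of the Dirichlet/Beta posterior."""
    n_T = sum(data)
    n_F = len(data) - n_T
    return (n_T + alpha_T - 1.0) / (n_T + n_F + alpha_T + alpha_F - 2.0)

data_a = [True, False, True, True, False, True, True, True]
print(map_estimate(data_a))                 # uniform prior -> MLE = 0.75
print(map_estimate(data_a, 5.0, 5.0))       # a stronger prior pulls the estimate towards 0.5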
Semantic priors
If the training data D is sorted into known classes, the priors can be estimated beforehand. These are called "semantic priors"
This involves an element of hand coding and loses the advantage of gaining some insight into the problem domain
It does, however, give the advantage of mapping onto expert knowledge of the classes in the problem
Summary
Estimation relies on sufficient statistics
For the ML estimate for discrete valued nodes, we use counts #(·) over the training examples:

  \hat\theta_{x_i \mid \mathrm{parents}_i} = \frac{\#(x_i, \mathrm{parents}_i)}{\#(\mathrm{parents}_i)}

i.e. the number of examples matching both x_i and its parent configuration, divided by the number matching that parent configuration
For the MAP estimate, we have to account for the prior
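
A minimal sketch of this count-based estimate for the conditional node O of the earlier example (the toy data set is our own): the sufficient statistics are just the counts of each parent configuration, with and without the child value.

from collections import Counter

# Hypothetical fully observed training cases (a[m], b[m], o[m]).
data = [(True, True, True), (True, True, False), (True, True, True),
        (False, True, False), (False, False, False)]

joint_counts = Counter((a, b, o) for a, b, o in data)   # #(x_i, parents_i)
parent_counts = Counter((a, b) for a, b, _ in data)     # #(parents_i)

def mle_o_given(a, b):
    """ML estimate of p(o=T | a, b) from the sufficient statistics (counts)."""
    return joint_counts[(a, b, True)] / parent_counts[(a, b)]

print(mle_o_given(True, True))   # 2/3 for this toy data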
Next time …
Overview of methods for learning BBNs:
– Full data and unknown structure
– Partial data and known structure
– Partial data and unknown structure
Excellent tutorial by Koller and Friedman at:
www.cs.huji.ac.il/~nir/Nips01-Tutorial/
Some of today’s slides were adapted from
that tutorial