PowerPoint Transcript
PGM: Tirgul 10
Parameter Learning and Priors

Why learning?
Knowledge acquisition bottleneck:
  Knowledge acquisition is an expensive process
  Often we don't have an expert
Data is cheap:
  Vast amounts of data become available to us
  Learning allows us to build systems based on the data

Learning Bayesian networks

[Figure: an Inducer takes data plus prior information and outputs a Bayesian network over E, B, R, A, C together with its CPTs, e.g. P(A | E,B):]

  E   B    P(A | E,B)   P(¬A | E,B)
  e   b    .9           .1
  e   ¬b   .7           .3
  ¬e  b    .8           .2
  ¬e  ¬b   .99          .01

Known Structure -- Complete Data

[Figure: complete data over E, B, A (e.g. <Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, ..., <N,Y,Y>) and the fixed network structure E -> A <- B with unknown CPT entries for P(A | E,B) are given to the Inducer, which outputs the same structure with the CPT values filled in.]

Network structure is specified
  Inducer needs to estimate parameters
Data does not contain missing values

Unknown Structure -- Complete Data

[Figure: complete data over E, B, A (e.g. <Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, ..., <N,Y,Y>) is given to the Inducer, which outputs a network over E, B, A with both its arcs and its CPT values, such as P(A | E,B), filled in.]

Network structure is not specified
  Inducer needs to select arcs & estimate parameters
Data does not contain missing values

Known Structure -- Incomplete Data

[Figure: data over E, B, A with missing entries (e.g. <Y,N,N>, <Y,?,Y>, <N,N,Y>, <N,Y,?>, ..., <?,Y,Y>) and the fixed network structure E -> A <- B with unknown CPT entries for P(A | E,B) are given to the Inducer, which outputs the structure with the CPT values filled in.]

Network structure is specified
Data contains missing values
  We consider assignments to missing values

Known Structure / Complete Data
Given:
  A network structure G
  A choice of parametric family for P(Xi | Pai)
Learn:
  Parameters for the network
Goal:
  Construct a network that is "closest" to the probability distribution that generated the data

Example: Binomial Experiment (Statistics 101)

[Figure: a thumbtack landing Head or Tail]

When tossed, the thumbtack can land in one of two positions: Head or Tail
We denote by θ the (unknown) probability P(H).
Estimation task:
  Given a sequence of toss samples x[1], x[2], ..., x[M] we want to estimate the probabilities P(H) = θ and P(T) = 1 - θ

Statistical Parameter Fitting
Consider instances x[1], x[2], ..., x[M] such that:
  The set of values that x can take is known
  Each is sampled from the same distribution
  Each is sampled independently of the rest
  (i.i.d. samples)
The task is to find a parameter θ so that the data can be summarized by a probability P(x[j] | θ).
  Depends on the given family of probability distributions: multinomial, Gaussian, Poisson, etc.
  For now, we focus on multinomial distributions

The Likelihood Function
How good is a particular θ?
It depends on how likely it is to generate the observed data:
  L(θ : D) = P(D | θ) = ∏_m P(x[m] | θ)
The likelihood for the sequence H,T,T,H,H is
  L(θ : D) = θ · (1-θ) · (1-θ) · θ · θ

[Figure: plot of L(θ : D) as a function of θ over [0, 1]]

Sufficient Statistics
To compute the likelihood in the thumbtack example we only require NH and NT
(the number of heads and the number of tails):
  L(θ : D) = θ^NH · (1-θ)^NT
NH and NT are sufficient statistics for the binomial distribution

Sufficient Statistics
A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood
Formally, s(D) is a sufficient statistic if for any two datasets D and D':
  s(D) = s(D')  ⇒  L(θ | D) = L(θ | D')

[Figure: many datasets map to the same statistics]

Maximum Likelihood Estimation
MLE Principle: Choose parameters that maximize the likelihood function
This is one of the most commonly used estimators in statistics
Intuitively appealing

Example: MLE in Binomial Data
Applying the MLE principle we get
  θ̂ = NH / (NH + NT)
(which coincides with what one would expect)
Example: (NH, NT) = (3, 2)
  MLE estimate is 3/5 = 0.6

[Figure: plot of L(θ : D) over [0, 1], peaking at 0.6]

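As a concrete illustration of the last few slides (not part of the original deck), here is a minimal Python sketch that evaluates the binomial likelihood from the sufficient statistics NH and NT and computes the MLE; the function and variable names are my own.

```python
def binomial_likelihood(theta, n_heads, n_tails):
    """L(theta : D) = theta^NH * (1 - theta)^NT."""
    return theta ** n_heads * (1.0 - theta) ** n_tails


def binomial_mle(n_heads, n_tails):
    """MLE for P(H): the fraction of heads in the data."""
    return n_heads / (n_heads + n_tails)


# The sequence H,T,T,H,H from the slides: sufficient statistics (NH, NT) = (3, 2)
tosses = ["H", "T", "T", "H", "H"]
n_heads, n_tails = tosses.count("H"), tosses.count("T")

theta_hat = binomial_mle(n_heads, n_tails)  # 3/5 = 0.6
print("MLE:", theta_hat)
print("L(0.6 : D) =", binomial_likelihood(0.6, n_heads, n_tails))
print("L(0.5 : D) =", binomial_likelihood(0.5, n_heads, n_tails))  # smaller than at the MLE
```

Evaluating the likelihood on a grid of θ values reproduces the curve sketched above, with its maximum at 0.6.
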
Learning Parameters for a Bayesian Network
Training data has the form:

        | E[1]  B[1]  A[1]  C[1] |
  D  =  |  .     .     .     .   |
        | E[M]  B[M]  A[M]  C[M] |

[Figure: the network E -> A <- B, A -> C]

Learning Parameters for a Bayesian Network
Since we assume i.i.d. samples, the likelihood function is
  L(θ : D) = ∏_m P(E[m], B[m], A[m], C[m] : θ)

[Figure: the network E -> A <- B, A -> C]

Learning Parameters for a Bayesian Network
By definition of the network, we get
  L(θ : D) = ∏_m P(E[m], B[m], A[m], C[m] : θ)
           = ∏_m P(E[m] : θ) · P(B[m] : θ) · P(A[m] | B[m], E[m] : θ) · P(C[m] | A[m] : θ)

[Figure: the network E -> A <- B, A -> C, and the data matrix D]

Learning Parameters for a Bayesian Network
Rewriting terms, we get
  L(θ : D) = ∏_m P(E[m], B[m], A[m], C[m] : θ)
           = ∏_m P(E[m] : θ) · ∏_m P(B[m] : θ) · ∏_m P(A[m] | B[m], E[m] : θ) · ∏_m P(C[m] | A[m] : θ)

[Figure: the network E -> A <- B, A -> C, and the data matrix D]

General Bayesian Networks
Generalizing for any Bayesian network:
  L(θ : D) = ∏_m P(x1[m], ..., xn[m] : θ)            (i.i.d. samples)
           = ∏_m ∏_i P(xi[m] | Pai[m] : θi)          (network factorization)
           = ∏_i ∏_m P(xi[m] | Pai[m] : θi)
           = ∏_i Li(θi : D)
The likelihood decomposes according to the structure of the network.

General Bayesian Networks (Cont.)
Decomposition ⇒ independent estimation problems
If the parameters for each family are not related, then they can be estimated independently of each other.

From Binomial to Multinomial
For example, suppose X can have the values 1, 2, ..., K
We want to learn the parameters θ1, θ2, ..., θK
Sufficient statistics:
  N1, N2, ..., NK - the number of times each outcome is observed
Likelihood function:
  L(θ : D) = ∏_{k=1}^{K} θk^Nk
MLE:
  θ̂k = Nk / N

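A short sketch of the multinomial MLE (my own illustration, not from the slides): the sufficient statistics are the counts Nk, and each parameter is estimated as Nk / N.

```python
from collections import Counter


def multinomial_mle(samples):
    """MLE for a multinomial: theta_k = N_k / N for each observed value k."""
    counts = Counter(samples)
    total = sum(counts.values())
    return {k: n / total for k, n in counts.items()}


# Hypothetical data over the values 1..3
data = [1, 2, 2, 3, 1, 2]
print(multinomial_mle(data))  # {1: 0.333..., 2: 0.5, 3: 0.166...}
```
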
Likelihood for Multinomial Networks
When we assume that P(Xi | Pai) is multinomial, we get further decomposition:
  Li(θi : D) = ∏_m P(xi[m] | Pai[m] : θi)
             = ∏_{pai} ∏_{m : Pai[m] = pai} P(xi[m] | pai : θi)
             = ∏_{pai} ∏_{xi} P(xi | pai : θi)^N(xi, pai)
             = ∏_{pai} ∏_{xi} θ_{xi|pai}^N(xi, pai)

Likelihood for Multinomial Networks
When we assume that P(Xi | Pai) is multinomial, we get further decomposition:
  Li(θi : D) = ∏_{pai} ∏_{xi} θ_{xi|pai}^N(xi, pai)
For each value pai of the parents of Xi we get an independent multinomial problem
The MLE is
  θ̂_{xi|pai} = N(xi, pai) / N(pai)

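Putting the decomposition to work, here is an illustrative sketch (not code from the course): with complete data, the MLE for a CPT is obtained by counting N(xi, pai) and N(pai) per parent configuration. The data layout and names below are assumptions made for the example.

```python
from collections import Counter


def cpt_mle(data, child, parents):
    """MLE for P(child | parents): N(x_i, pa_i) / N(pa_i) per parent configuration."""
    joint = Counter()   # counts N(x_i, pa_i)
    parent = Counter()  # counts N(pa_i)
    for row in data:
        pa = tuple(row[p] for p in parents)
        joint[(row[child], pa)] += 1
        parent[pa] += 1
    return {(x, pa): n / parent[pa] for (x, pa), n in joint.items()}


# Complete data over E, B, A, in the spirit of the records shown earlier
data = [
    {"E": "Y", "B": "N", "A": "N"},
    {"E": "Y", "B": "Y", "A": "Y"},
    {"E": "N", "B": "N", "A": "Y"},
    {"E": "N", "B": "Y", "A": "Y"},
    {"E": "N", "B": "Y", "A": "Y"},
]
print(cpt_mle(data, child="A", parents=["E", "B"]))
# e.g. the estimate for P(A=Y | E=N, B=Y) is 2/2 = 1.0
```

Because the likelihood decomposes, exactly the same routine can be run independently for every family in the network.
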
Maximum Likelihood Estimation
Consistency:
  The estimate converges to the best possible value as the number of examples grows
To make this formal, we need to introduce some definitions

KL-Divergence
Let P and Q be two distributions over X
A measure of distance between P and Q is the Kullback-Leibler divergence:
  KL(P || Q) = Σ_x P(x) log ( P(x) / Q(x) )
KL(P||Q) = 1 (when logs are in base 2) means that, on average, the probability P assigns to an instance is twice the probability Q assigns to it
KL(P||Q) ≥ 0
KL(P||Q) = 0 iff P and Q are equal

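A small sketch of the KL divergence for discrete distributions represented as dictionaries (base-2 logarithms, to match the remark above); this is my own illustration, not part of the slides.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) = sum_x P(x) * log2(P(x) / Q(x)); assumes Q(x) > 0 wherever P(x) > 0."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)


p = {"a": 0.5, "b": 0.5}
q = {"a": 0.25, "b": 0.75}
print(kl_divergence(p, p))  # 0.0: a distribution is at zero distance from itself
print(kl_divergence(p, q))  # > 0, and in general KL(P||Q) != KL(Q||P)
```
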
Consistency
Let P(X | θ) be a parametric family
  (we need various regularity conditions that we won't go into now)
Let P*(X) be the distribution that generates the data
Let θ̂_D be the MLE estimate given a dataset D
Thm: As N → ∞, θ̂_D → θ* with probability 1, where
  θ* = argmin_θ KL( P*(X) || P(X : θ) )

Consistency -- Geometric Interpretation

[Figure: the space of probability distributions; within it, the set of distributions that can be represented by P(X | θ); P* is the generating distribution and P(X | θ*) is the member of this set closest to it (in KL divergence).]

Is MLE all we need?
Suppose that after 10 observations, ML estimates P(H) = 0.7 for the thumbtack
  Would you bet on heads for the next toss?
Suppose now that after 10 observations, ML estimates P(H) = 0.7 for a coin
  Would you place the same bet?

Bayesian Inference
Frequentist approach:
  Assumes there is an unknown but fixed parameter θ
  Estimates θ with some confidence
  Prediction by using the estimated parameter value
Bayesian approach:
  Represents uncertainty about the unknown parameter θ
  Uses probability to quantify this uncertainty:
    Unknown parameters as random variables
  Prediction follows from the rules of probability:
    Expectation over the unknown parameters

Bayesian Inference (cont.)
We can represent our uncertainty about the sampling process using a Bayesian network:

[Figure: a node θ with children X[1], X[2], ..., X[m] (the observed data) and X[m+1] (the query)]

The values of X are independent given θ
The conditional probabilities, P(x[m] | θ), are the parameters in the model
Prediction is now inference in this network

Bayesian Inference (cont.)
Prediction as inference in this network:
  P(x[M+1] | x[1], ..., x[M])
    = ∫ P(x[M+1] | θ, x[1], ..., x[M]) P(θ | x[1], ..., x[M]) dθ
    = ∫ P(x[M+1] | θ) P(θ | x[1], ..., x[M]) dθ
where the posterior is
  P(θ | x[1], ..., x[M]) = P(x[1], ..., x[M] | θ) P(θ) / P(x[1], ..., x[M])
  (Likelihood × Prior / Probability of data)

Example: Binomial Data Revisited
Prior: uniform for θ in [0,1]
  P(θ) = 1
Then P(θ | D) is proportional to the likelihood L(θ : D):
  P(θ | x[1], ..., x[M]) ∝ P(x[1], ..., x[M] | θ) P(θ)
(NH, NT) = (4, 1)
  MLE for P(X = H) is 4/5 = 0.8
  Bayesian prediction is
    P(x[M+1] = H | D) = ∫ θ P(θ | D) dθ = 5/7 ≈ 0.7142

[Figure: plot of the posterior P(θ | D) over [0, 1]]

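The 5/7 above can be checked numerically: with a uniform prior the posterior is proportional to θ^NH (1-θ)^NT, and the Bayesian prediction is its mean. A rough sketch (my own, using simple grid integration rather than the closed form):

```python
def bayes_predict_uniform(n_heads, n_tails, steps=100_000):
    """E[theta | D] under a uniform prior, approximated by midpoint integration."""
    dt = 1.0 / steps
    num = den = 0.0
    for i in range(steps):
        theta = (i + 0.5) * dt
        post = theta ** n_heads * (1.0 - theta) ** n_tails  # unnormalized posterior
        num += theta * post * dt
        den += post * dt
    return num / den


print(bayes_predict_uniform(4, 1))   # ~0.714285..., i.e. 5/7
print((4 + 1) / (4 + 1 + 2))         # the closed-form answer (NH + 1) / (NH + NT + 2)
```
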
Bayesian Inference and MLE
In our example, MLE and Bayesian prediction differ
But... if the prior is well-behaved
  (does not assign 0 density to any "feasible" parameter value)
then both MLE and Bayesian prediction converge to the same value
  Both are consistent

Dirichlet Priors
Recall that the likelihood function is
  L(θ : D) = ∏_{k=1}^{K} θk^Nk
A Dirichlet prior with hyperparameters α1, ..., αK is defined as
  P(θ) ∝ ∏_{k=1}^{K} θk^(αk - 1)    for legal θ1, ..., θK
Then the posterior has the same form, with hyperparameters α1 + N1, ..., αK + NK:
  P(θ | D) ∝ P(θ) P(D | θ) ∝ ∏_{k=1}^{K} θk^(αk - 1) · ∏_{k=1}^{K} θk^Nk = ∏_{k=1}^{K} θk^(αk + Nk - 1)

Dirichlet Priors (cont.)
We can compute the prediction on a new event in closed form:
If P(θ) is Dirichlet with hyperparameters α1, ..., αK, then
  P(X[1] = k) = ∫ θk P(θ) dθ = αk / Σ_l αl
Since the posterior is also Dirichlet, we get
  P(X[M+1] = k | D) = ∫ θk P(θ | D) dθ = (αk + Nk) / Σ_l (αl + Nl)

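The closed-form prediction is a one-liner; here is an illustrative sketch (names are mine) that also recovers the uniform-prior example from the previous slides.

```python
def dirichlet_predict(alphas, counts):
    """Posterior predictive P(X[M+1] = k | D) = (alpha_k + N_k) / sum_l (alpha_l + N_l)."""
    total = sum(alphas[k] + counts.get(k, 0) for k in alphas)
    return {k: (alphas[k] + counts.get(k, 0)) / total for k in alphas}


alphas = {"H": 1, "T": 1}   # Dirichlet(1,1), i.e. the uniform prior
counts = {"H": 4, "T": 1}   # the (NH, NT) = (4, 1) data from the earlier example
print(dirichlet_predict(alphas, counts))  # {'H': 5/7, 'T': 2/7}
```
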
Dirichlet Priors -- Example

[Figure: densities of Dirichlet(0.5,0.5), Dirichlet(1,1), Dirichlet(2,2), and Dirichlet(5,5) over θ in [0,1]]

Prior Knowledge
The hyperparameters α1, ..., αK can be thought of as "imaginary" counts from our prior experience
Equivalent sample size = α1 + ... + αK
The larger the equivalent sample size, the more confident we are in our prior

Effect of Priors
Prediction of P(X = H) after seeing data with NH = 0.25·NT, for different sample sizes

[Figure: two plots of the prediction as a function of the number of samples (0 to 100): one with a fixed ratio αH / αT and different strengths αH + αT, the other with a fixed strength αH + αT and different ratios αH / αT.]

Effect of Priors (cont.)
In real data, Bayesian estimates are less sensitive to noise in the data

[Figure: P(X = 1 | D) as a function of N (1 to 50) for the MLE and for Dirichlet(.5,.5), Dirichlet(1,1), Dirichlet(5,5), and Dirichlet(10,10) priors, shown together with the corresponding toss results]

Conjugate Families
The property that the posterior distribution follows the same parametric form as the prior distribution is called conjugacy
  The Dirichlet prior is a conjugate family for the multinomial likelihood
Conjugate families are useful since:
  For many distributions we can represent them with hyperparameters
  They allow for sequential update within the same representation (see the sketch below)
  In many cases we have a closed-form solution for prediction

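Conjugacy is what makes the sequential update trivial: after each batch of data the posterior is again Dirichlet, so only the hyperparameters need to be stored. A minimal sketch (illustrative, names are my own):

```python
def dirichlet_update(alphas, batch):
    """Return the posterior hyperparameters after observing a batch of outcomes."""
    posterior = dict(alphas)
    for x in batch:
        posterior[x] += 1
    return posterior


alphas = {"H": 1, "T": 1}
alphas = dirichlet_update(alphas, ["H", "T", "T"])  # -> {'H': 2, 'T': 3}
alphas = dirichlet_update(alphas, ["H", "H"])       # -> {'H': 4, 'T': 3}
print(alphas)
```

Updating with the two batches in one go, or in the opposite order, gives the same posterior, which is the point of staying within one representation.
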
Bayesian Networks and Bayesian Prediction

[Figure: a Bayesian network with parameter nodes θX and θY|X shared across instances; X[1],Y[1], ..., X[M],Y[M] are the observed data and X[M+1],Y[M+1] is the query. The same model is shown in plate notation, with a plate over m containing X[m] and Y[m].]

Priors for each parameter group are independent
Data instances are independent given the unknown parameters

Bayesian Networks and Bayesian Prediction (Cont.)

[Figure: the same model, unrolled and in plate notation]

We can also "read" from the network:
  Complete data ⇒ posteriors on parameters are independent

Bayesian Prediction (cont.)
Since posteriors on parameters for each family are independent, we can compute them separately
Posteriors for parameters within families are also independent:

[Figure: the plate model with θX and θY|X, and a refined model in which θY|X is split into θY|X=0 and θY|X=1]

Complete data ⇒ independent posteriors on θY|X=0 and θY|X=1

Bayesian Prediction (cont.)
Given these observations, we can compute the posterior for each multinomial θ_{Xi | pai} independently
The posterior is Dirichlet with parameters
  α(Xi=1 | pai) + N(Xi=1 | pai), ..., α(Xi=k | pai) + N(Xi=k | pai)
The predictive distribution is then represented by the parameters
  θ̃_{xi|pai} = (α(xi, pai) + N(xi, pai)) / (α(pai) + N(pai))

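This mirrors the MLE computation for CPTs sketched earlier, with the counts smoothed by the Dirichlet hyperparameters. An illustrative sketch (my own; it uses one common hyperparameter value per cell, which is a simplifying assumption, not the general scheme of the next slide):

```python
from collections import Counter


def cpt_bayes(data, child, parents, child_values, alpha=1.0):
    """theta~(x|pa) = (alpha(x,pa) + N(x,pa)) / (alpha(pa) + N(pa)),
    with the same hyperparameter alpha for every (x, pa) cell."""
    joint, parent = Counter(), Counter()
    for row in data:
        pa = tuple(row[p] for p in parents)
        joint[(row[child], pa)] += 1
        parent[pa] += 1
    k = len(child_values)  # so alpha(pa) = k * alpha
    return {(x, pa): (alpha + joint[(x, pa)]) / (k * alpha + n_pa)
            for pa, n_pa in parent.items() for x in child_values}


data = [
    {"E": "Y", "B": "N", "A": "N"},
    {"E": "Y", "B": "Y", "A": "Y"},
    {"E": "N", "B": "N", "A": "Y"},
    {"E": "N", "B": "Y", "A": "Y"},
    {"E": "N", "B": "Y", "A": "Y"},
]
print(cpt_bayes(data, "A", ["E", "B"], child_values=["Y", "N"]))
# e.g. P(A=Y | E=N, B=Y) becomes (1 + 2) / (2 + 2) = 0.75 instead of the MLE 1.0
```
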
Assessing Priors for Bayesian Networks
We need the α(xi, pai) for each node Xi
We can use the initial parameters θ0 as prior information
  We also need an equivalent sample size parameter M0
  Then, we let α(xi, pai) = M0 · P(xi, pai | θ0)
This allows us to update a network using new data

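A tiny sketch (illustrative) of turning a prior network into imaginary counts: the joint probabilities P(xi, pai | θ0), however they are obtained from the prior network, are simply scaled by the equivalent sample size M0. The probabilities below are made-up values for a few configurations only.

```python
def prior_counts(joint_prior, m0):
    """alpha(x_i, pa_i) = M0 * P(x_i, pa_i | theta_0)."""
    return {key: m0 * p for key, p in joint_prior.items()}


# Hypothetical values of P(A, (E, B) | theta_0) for a few configurations
joint_prior = {
    ("Y", ("Y", "Y")): 0.05, ("N", ("Y", "Y")): 0.05,
    ("Y", ("N", "N")): 0.40, ("N", ("N", "N")): 0.50,
}
print(prior_counts(joint_prior, m0=10))  # the imaginary counts alpha(xi, pai)
```
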
Learning Parameters: Case Study
Experiment:
  Sample a stream of instances from the alarm network
  Learn parameters using:
    MLE estimator
    Bayesian estimator with uniform priors of different strengths

Learning Parameters: Case Study (cont.)

[Figure: KL divergence as a function of the number of samples M (0 to 5000), comparing the MLE with Bayesian estimators using uniform priors of strength M'=5, 10, 20, and 50]

Learning Parameters: Summary
Estimation relies on sufficient statistics
  For multinomials these are of the form N(xi, pai)
Parameter estimation:
  MLE:                  θ̂_{xi|pai} = N(xi, pai) / N(pai)
  Bayesian (Dirichlet): θ̃_{xi|pai} = (α(xi, pai) + N(xi, pai)) / (α(pai) + N(pai))
Bayesian methods also require a choice of priors
Both MLE and Bayesian estimates are asymptotically equivalent and consistent
Both can be implemented in an on-line manner by accumulating sufficient statistics