PowerPoint Transcript

PGM: Tirgul 10
Parameter Learning and Priors
Why learning?

Knowledge acquisition bottleneck:
• Knowledge acquisition is an expensive process
• Often we don't have an expert

Data is cheap:
• Vast amounts of data become available to us
• Learning allows us to build systems based on the data
Learning Bayesian networks

[Figure: Data + prior information are fed into an Inducer, which outputs a Bayesian network over E, B, R, A, C together with its CPTs, e.g. P(A | E,B):]

E   B   |  P(A | E,B)   P(¬A | E,B)
e   b   |     .9            .1
e   ¬b  |     .7            .3
¬e  b   |     .8            .2
¬e  ¬b  |     .99           .01
Known Structure -- Complete Data

Data over E, B, A:
<Y,N,N>
<Y,Y,Y>
<N,N,Y>
<N,Y,Y>
...
<N,Y,Y>

[Figure: the data and the fixed network structure (E → A ← B) are given to the Inducer, which fills in the unknown CPT entries, e.g. P(A | E,B).]

• Network structure is specified
  • Inducer needs to estimate parameters
• Data does not contain missing values
Unknown Structure -- Complete Data

Data over E, B, A:
<Y,N,N>
<Y,Y,Y>
<N,N,Y>
<N,Y,Y>
...
<N,Y,Y>

[Figure: only the data is given to the Inducer, which must output both the network structure over E, B, A and its CPTs, e.g. P(A | E,B).]

• Network structure is not specified
  • Inducer needs to select arcs & estimate parameters
• Data does not contain missing values
Known Structure -- Incomplete Data

Data over E, B, A (with missing values):
<Y,N,N>
<Y,?,Y>
<N,N,Y>
<N,Y,?>
...
<?,Y,Y>

[Figure: the data and the fixed network structure (E → A ← B) are given to the Inducer, which estimates the CPT entries, e.g. P(A | E,B).]

• Network structure is specified
• Data contains missing values
  • We consider assignments to missing values
Known Structure / Complete Data

• Given
  • a network structure G
  • a choice of parametric family for P(Xi | Pai)
• Learn parameters for the network

Goal:
• Construct a network that is "closest" to the distribution that generated the data
Example: Binomial Experiment (Statistics 101)

[Figure: a thumbtack that can land Head or Tail.]

• When tossed, it can land in one of two positions: Head or Tail
• We denote by θ the (unknown) probability P(H)

Estimation task:
• Given a sequence of toss samples x[1], x[2], …, x[M], we want to estimate the probabilities P(H) = θ and P(T) = 1 - θ
Statistical Parameter Fitting

• Consider instances x[1], x[2], …, x[M] such that
  • the set of values that x can take is known
  • each is sampled from the same distribution
  • each is sampled independently of the rest
  (i.e., i.i.d. samples)
• The task is to find a parameter θ so that the data can be summarized by a probability P(x[j] | θ)
  • This depends on the given family of probability distributions: multinomial, Gaussian, Poisson, etc.
  • For now, we focus on multinomial distributions
The Likelihood Function

• How good is a particular θ?
  It depends on how likely it is to generate the observed data:

  L(\theta : D) = P(D \mid \theta) = \prod_m P(x[m] \mid \theta)

• The likelihood for the sequence H, T, T, H, H is

  L(\theta : D) = \theta \cdot (1-\theta) \cdot (1-\theta) \cdot \theta \cdot \theta

[Figure: plot of L(θ : D) as a function of θ over [0, 1].]
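As a small illustration (not part of the original slides), the likelihood above can be evaluated directly for the H, T, T, H, H sequence at a few candidate values of θ; the function name and the grid of values are arbitrary:

```python
# Minimal sketch: likelihood of a binary toss sequence under a candidate theta.
def likelihood(theta, tosses):
    """L(theta : D) = prod_m P(x[m] | theta), with P(H) = theta and P(T) = 1 - theta."""
    L = 1.0
    for x in tosses:
        L *= theta if x == 'H' else (1.0 - theta)
    return L

data = ['H', 'T', 'T', 'H', 'H']              # the sequence from the slide
for theta in [0.2, 0.4, 0.6, 0.8]:            # a few candidate parameter values
    print(theta, likelihood(theta, data))     # e.g. L(0.6 : D) = 0.6^3 * 0.4^2 = 0.03456
```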
Sufficient Statistics

• To compute the likelihood in the thumbtack example we only require N_H and N_T (the number of heads and the number of tails):

  L(\theta : D) = \theta^{N_H} (1-\theta)^{N_T}

• N_H and N_T are sufficient statistics for the binomial distribution
Sufficient Statistics

• A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood
• Formally, s(D) is a sufficient statistic if for any two datasets D and D':

  s(D) = s(D') \;\Rightarrow\; L(\theta \mid D) = L(\theta \mid D')

[Figure: many datasets map onto a much smaller space of sufficient statistics.]
Maximum Likelihood Estimation

MLE Principle: choose the parameters that maximize the likelihood function

• This is one of the most commonly used estimators in statistics
• Intuitively appealing
Example: MLE in Binomial Data

• Applying the MLE principle we get

  \hat{\theta} = \frac{N_H}{N_H + N_T}

  (which coincides with what one would expect)

• Example: (N_H, N_T) = (3, 2)
  The MLE estimate is 3/5 = 0.6

[Figure: plot of L(θ : D), maximized at θ = 0.6.]
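A quick numerical check of this closed form (a sketch, not from the slides; the 1000-point grid is arbitrary): the count-based estimate coincides with a brute-force maximization of the likelihood.

```python
# Sketch: closed-form MLE N_H / (N_H + N_T) for the (3, 2) example,
# compared against a brute-force search over the likelihood.
data = ['H', 'T', 'T', 'H', 'H']
N_H = data.count('H')
N_T = data.count('T')
theta_mle = N_H / (N_H + N_T)                 # 3/5 = 0.6

# A coarse grid search over theta agrees with the closed form (up to grid resolution).
grid = [i / 1000 for i in range(1001)]
theta_grid = max(grid, key=lambda t: (t ** N_H) * ((1 - t) ** N_T))
print(theta_mle, theta_grid)                  # both 0.6
```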
Learning Parameters for a Bayesian Network

• Training data has the form:

  D = \begin{pmatrix} E[1] & B[1] & A[1] & C[1] \\ \vdots & \vdots & \vdots & \vdots \\ E[M] & B[M] & A[M] & C[M] \end{pmatrix}

[Figure: network structure E → A ← B, with A → C.]
Learning Parameters for a Bayesian Network

• Since we assume i.i.d. samples, the likelihood function is

  L(\Theta : D) = \prod_m P(E[m], B[m], A[m], C[m] : \Theta)

[Figure: the same network E → A ← B, A → C.]
Learning Parameters for a Bayesian Network

• By definition of the network, we get

  L(\Theta : D) = \prod_m P(E[m], B[m], A[m], C[m] : \Theta)
               = \prod_m P(E[m] : \Theta) \, P(B[m] : \Theta) \, P(A[m] \mid B[m], E[m] : \Theta) \, P(C[m] \mid A[m] : \Theta)

[Figure: the same network and the training data matrix as before.]
Learning Parameters for a Bayesian Network

• Rewriting terms, we get

  L(\Theta : D) = \prod_m P(E[m], B[m], A[m], C[m] : \Theta)
               = \prod_m P(E[m] : \Theta) \cdot \prod_m P(B[m] : \Theta) \cdot \prod_m P(A[m] \mid B[m], E[m] : \Theta) \cdot \prod_m P(C[m] \mid A[m] : \Theta)

[Figure: the same network and the training data matrix as before.]
General Bayesian Networks

Generalizing for any Bayesian network:

  L(\Theta : D) = \prod_m P(x_1[m], \ldots, x_n[m] : \Theta)                 (i.i.d. samples)
               = \prod_m \prod_i P(x_i[m] \mid Pa_i[m] : \theta_i)           (network factorization)
               = \prod_i \prod_m P(x_i[m] \mid Pa_i[m] : \theta_i)
               = \prod_i L_i(\theta_i : D)

• The likelihood decomposes according to the structure of the network.
General Bayesian Networks (Cont.)

Decomposition ⇒ independent estimation problems

If the parameters for each family are not related, then they can be estimated independently of each other (see the sketch below).
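To make the decomposition concrete, here is a minimal sketch (not from the slides) for the E, B, A, C network: the CPT numbers and the three data instances are invented for illustration, and the per-family log-likelihoods sum exactly to the log-likelihood of the joint.

```python
import math

# Sketch: the log-likelihood of complete data decomposes into one term per family
# (a node and its parents). Structure E -> A <- B, A -> C as in the slides;
# the CPT numbers and data instances below are invented purely for illustration.
parents = {'E': [], 'B': [], 'A': ['E', 'B'], 'C': ['A']}
cpts = {
    'E': {(): {0: 0.9, 1: 0.1}},
    'B': {(): {0: 0.8, 1: 0.2}},
    'A': {(0, 0): {0: 0.99, 1: 0.01}, (0, 1): {0: 0.2, 1: 0.8},
          (1, 0): {0: 0.3, 1: 0.7},   (1, 1): {0: 0.1, 1: 0.9}},
    'C': {(0,): {0: 0.95, 1: 0.05}, (1,): {0: 0.4, 1: 0.6}},
}
data = [  # complete instances over (E, B, A, C)
    {'E': 0, 'B': 1, 'A': 1, 'C': 1},
    {'E': 0, 'B': 0, 'A': 0, 'C': 0},
    {'E': 1, 'B': 0, 'A': 1, 'C': 0},
]

def log_p(node, inst):
    """log P(x_node[m] | Pa_node[m]) for one instance."""
    pa = tuple(inst[p] for p in parents[node])
    return math.log(cpts[node][pa][inst[node]])

# Per-family log-likelihoods log L_i(theta_i : D) ...
family_ll = {node: sum(log_p(node, inst) for inst in data) for node in parents}
# ... sum to the log-likelihood of the joint (same terms, products swapped).
joint_ll = sum(log_p(node, inst) for inst in data for node in parents)
print(sum(family_ll.values()), joint_ll)      # identical
```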
From Binomial to Multinomial

• For example, suppose X can take the values 1, 2, …, K
• We want to learn the parameters θ_1, θ_2, …, θ_K

• Sufficient statistics:
  N_1, N_2, …, N_K: the number of times each outcome is observed

• Likelihood function:

  L(\theta : D) = \prod_{k=1}^{K} \theta_k^{N_k}

• MLE:

  \hat{\theta}_k = \frac{N_k}{\sum_{\ell} N_\ell}
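A tiny sketch of these sufficient statistics and the resulting MLE (the observations are invented for illustration):

```python
from collections import Counter

# Sketch: multinomial sufficient statistics N_1..N_K and the MLE theta_k = N_k / sum_l N_l.
observations = [1, 3, 2, 1, 1, 3, 2, 1]       # X takes values in {1, 2, 3}; illustrative data
counts = Counter(observations)                # N_k for each outcome k
M = len(observations)
theta_hat = {k: counts[k] / M for k in sorted(counts)}
print(theta_hat)                              # {1: 0.5, 2: 0.25, 3: 0.25}
```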
Likelihood for Multinomial Networks

• When we assume that P(Xi | Pai) is multinomial, we get further decomposition:

  L_i(\theta_i : D) = \prod_m P(x_i[m] \mid Pa_i[m] : \theta_i)
                   = \prod_{pa_i} \; \prod_{m : Pa_i[m] = pa_i} P(x_i[m] \mid pa_i : \theta_i)
                   = \prod_{pa_i} \prod_{x_i} P(x_i \mid pa_i : \theta_i)^{N(x_i, pa_i)}
                   = \prod_{pa_i} \prod_{x_i} \theta_{x_i \mid pa_i}^{N(x_i, pa_i)}
Likelihood for Multinomial Networks

• When we assume that P(Xi | Pai) is multinomial, we get further decomposition:

  L_i(\theta_i : D) = \prod_{pa_i} \prod_{x_i} \theta_{x_i \mid pa_i}^{N(x_i, pa_i)}

• For each value pa_i of the parents of Xi we get an independent multinomial problem
• The MLE is

  \hat{\theta}_{x_i \mid pa_i} = \frac{N(x_i, pa_i)}{N(pa_i)}
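A sketch of this per-parent-configuration MLE on a tiny complete dataset over (E, B, A); the records are invented, and any zero-count configuration would need care in practice:

```python
from collections import Counter

# Sketch: MLE of the CPT P(A | E, B) from complete data,
# theta_hat(a | e, b) = N(a, e, b) / N(e, b). Records are illustrative.
records = [  # (E, B, A)
    ('Y', 'N', 'N'), ('Y', 'Y', 'Y'), ('N', 'N', 'Y'),
    ('N', 'Y', 'Y'), ('N', 'Y', 'Y'), ('Y', 'Y', 'N'),
]
joint  = Counter((e, b, a) for e, b, a in records)    # N(x_i, pa_i)
parent = Counter((e, b) for e, b, _ in records)       # N(pa_i)

cpt = {(e, b, a): n / parent[(e, b)] for (e, b, a), n in joint.items()}
for key, p in sorted(cpt.items()):
    print(key, round(p, 3))                           # e.g. P(A=Y | E=Y, B=Y) = 0.5
```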
Maximum Likelihood Estimation

Consistency:
• The estimate converges to the best possible value as the number of examples grows
• To make this formal, we need to introduce some definitions
KL-Divergence

• Let P and Q be two distributions over X
• A measure of distance between P and Q is the Kullback-Leibler divergence:

  KL(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}

• KL(P||Q) = 1 (when logs are in base 2) means that, on average, P assigns to an instance twice the probability that Q assigns to it
• KL(P||Q) ≥ 0
• KL(P||Q) = 0 iff P and Q are equal
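A direct sketch of this definition with base-2 logs (the two distributions are invented; it assumes Q(x) > 0 wherever P(x) > 0):

```python
import math

# Sketch: KL(P || Q) = sum_x P(x) * log2(P(x) / Q(x)).
# Assumes Q(x) > 0 wherever P(x) > 0; the distributions are illustrative.
def kl(P, Q):
    return sum(p * math.log2(p / Q[x]) for x, p in P.items() if p > 0)

P = {'a': 0.5, 'b': 0.5}
Q = {'a': 0.25, 'b': 0.75}
print(kl(P, Q))   # positive
print(kl(P, P))   # 0.0, since KL(P || P) = 0
```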
Consistency

• Let P(X | θ) be a parametric family
  (We need to impose various regularity conditions that we won't go into now)
• Let P*(X) be the distribution that generates the data
• Let \hat{\theta}_D be the MLE estimate given a dataset D

Theorem: As N → ∞, \hat{\theta}_D → \theta^* with probability 1, where

  \theta^* = \arg\min_{\theta} KL(P^*(X) \,\|\, P(X : \theta))
Consistency -- Geometric Interpretation

[Figure: the space of probability distributions; the distributions representable as P(X | θ) form a subset, and P(X | θ*) is the member of that subset closest to the true distribution P*.]
Is MLE all we need?

• Suppose that after 10 observations, the ML estimate is P(H) = 0.7 for the thumbtack
  Would you bet on heads for the next toss?
• Suppose now that after 10 observations, the ML estimate is P(H) = 0.7 for a coin
  Would you place the same bet?
Bayesian Inference

Frequentist approach:
• Assumes there is an unknown but fixed parameter θ
• Estimates θ with some confidence
• Prediction by using the estimated parameter value

Bayesian approach:
• Represents uncertainty about the unknown parameter
• Uses probability to quantify this uncertainty:
  • unknown parameters are treated as random variables
• Prediction follows from the rules of probability:
  • expectation over the unknown parameters
Bayesian Inference (cont.)

• We can represent our uncertainty about the sampling process using a Bayesian network:

[Figure: a parameter node θ with children X[1], X[2], …, X[M] (the observed data) and X[M+1] (the query).]

• The values of X are independent given θ
• The conditional probabilities, P(x[m] | θ), are the parameters in the model
• Prediction is now inference in this network
Bayesian Inference (cont.)

Prediction as inference in this network:

  P(x[M+1] \mid x[1], \ldots, x[M])
    = \int P(x[M+1] \mid \theta, x[1], \ldots, x[M]) \, P(\theta \mid x[1], \ldots, x[M]) \, d\theta
    = \int P(x[M+1] \mid \theta) \, P(\theta \mid x[1], \ldots, x[M]) \, d\theta

where the posterior is given by Bayes' rule (likelihood × prior over the probability of the data):

  P(\theta \mid x[1], \ldots, x[M]) = \frac{P(x[1], \ldots, x[M] \mid \theta) \, P(\theta)}{P(x[1], \ldots, x[M])}

[Figure: the same network θ → X[1], …, X[M], X[M+1].]
Example: Binomial Data Revisited

• Prior: uniform for θ in [0,1], i.e. P(θ) = 1
• Then P(θ | D) is proportional to the likelihood L(θ : D):

  P(\theta \mid x[1], \ldots, x[M]) \propto P(x[1], \ldots, x[M] \mid \theta) \, P(\theta)

• (N_H, N_T) = (4, 1)
  • The MLE for P(X = H) is 4/5 = 0.8
  • The Bayesian prediction is

  P(x[M+1] = H \mid D) = \int \theta \, P(\theta \mid D) \, d\theta = \frac{5}{7} \approx 0.7142

[Figure: plot of the posterior over θ on [0, 1].]
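For a uniform prior the integral above has a simple closed form, (N_H + 1)/(N_H + N_T + 2) (Laplace's rule of succession), which is where the 5/7 comes from; a one-line check against the MLE:

```python
# Sketch: Bayesian prediction under a uniform prior on theta,
# P(x[M+1] = H | D) = (N_H + 1) / (N_H + N_T + 2), versus the MLE N_H / (N_H + N_T).
N_H, N_T = 4, 1
mle   = N_H / (N_H + N_T)                     # 0.8
bayes = (N_H + 1) / (N_H + N_T + 2)           # 5/7, approximately 0.7142
print(mle, bayes)
```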
Bayesian Inference and MLE

• In our example, the MLE and the Bayesian prediction differ

But… if the prior is well-behaved
• i.e., it does not assign 0 density to any "feasible" parameter value
then both the MLE and the Bayesian prediction converge to the same value
• Both are consistent
Dirichlet Priors

• Recall that the likelihood function is

  L(\theta : D) = \prod_{k=1}^{K} \theta_k^{N_k}

• A Dirichlet prior with hyperparameters α_1, …, α_K is defined as

  P(\theta) \propto \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}      for legal θ_1, …, θ_K

• Then the posterior has the same form, with hyperparameters α_1 + N_1, …, α_K + N_K:

  P(\theta \mid D) \propto P(\theta) \, P(D \mid \theta) \propto \prod_{k=1}^{K} \theta_k^{\alpha_k - 1} \prod_{k=1}^{K} \theta_k^{N_k} = \prod_{k=1}^{K} \theta_k^{\alpha_k + N_k - 1}
Dirichlet Priors (cont.)

• We can compute the prediction on a new event in closed form:
• If P(θ) is Dirichlet with hyperparameters α_1, …, α_K, then

  P(X[1] = k) = \int \theta_k \, P(\theta) \, d\theta = \frac{\alpha_k}{\sum_{\ell} \alpha_\ell}

• Since the posterior is also Dirichlet, we get

  P(X[M+1] = k \mid D) = \int \theta_k \, P(\theta \mid D) \, d\theta = \frac{\alpha_k + N_k}{\sum_{\ell} (\alpha_\ell + N_\ell)}
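A sketch of this closed-form prediction; the hyperparameters and counts are invented for illustration:

```python
# Sketch: prediction with a Dirichlet prior,
# P(X[M+1] = k | D) = (alpha_k + N_k) / sum_l (alpha_l + N_l).
alpha = {'H': 2, 'T': 2}                      # Dirichlet(2, 2) prior (illustrative)
N     = {'H': 3, 'T': 2}                      # observed counts (illustrative)
total = sum(alpha[k] + N[k] for k in alpha)
prediction = {k: (alpha[k] + N[k]) / total for k in alpha}
print(prediction)                             # roughly {'H': 0.556, 'T': 0.444}
```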
Dirichlet Priors -- Example

[Figure: densities over θ ∈ [0, 1] of Dirichlet(0.5, 0.5), Dirichlet(1, 1), Dirichlet(2, 2), and Dirichlet(5, 5).]
Prior Knowledge

• The hyperparameters α_1, …, α_K can be thought of as "imaginary" counts from our prior experience
• Equivalent sample size = α_1 + … + α_K
• The larger the equivalent sample size, the more confident we are in our prior
Effect of Priors

Prediction of P(X = H) after seeing data with N_H = 0.25·N_T, for different sample sizes.

[Figure, two panels: one with priors of different strength α_H + α_T but a fixed ratio α_H/α_T, the other with a fixed strength α_H + α_T but different ratios α_H/α_T. Each panel plots the prediction against the number of samples, from 0 to 100.]
Effect of Priors (cont.)

• In real data, Bayesian estimates are less sensitive to noise in the data

[Figure: P(X = 1 | D) as a function of N (a stream of up to 50 toss results) for the MLE and for Bayesian estimates with Dirichlet(.5,.5), Dirichlet(1,1), Dirichlet(5,5), and Dirichlet(10,10) priors.]
Conjugate Families

• The property that the posterior distribution follows the same parametric form as the prior distribution is called conjugacy
  • The Dirichlet prior is a conjugate family for the multinomial likelihood
• Conjugate families are useful since:
  • for many distributions we can represent them with hyperparameters
  • they allow for sequential updates within the same representation
  • in many cases we have a closed-form solution for prediction
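A sketch of the sequential update this enables: observing outcome k just increments α_k, and the representation stays a Dirichlet throughout (the data stream is invented):

```python
# Sketch: sequential (on-line) updating with a conjugate Dirichlet prior.
# Observing outcome k increments hyperparameter alpha_k; the posterior
# remains Dirichlet after every observation.
alpha = {'H': 1, 'T': 1}                      # start from a uniform Dirichlet(1, 1)
for x in ['H', 'T', 'T', 'H', 'H']:           # illustrative stream of observations
    alpha[x] += 1                             # conjugate update
    total = sum(alpha.values())
    print(x, {k: round(a / total, 3) for k, a in alpha.items()})  # current prediction
```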
Bayesian Networks and Bayesian Prediction

[Figure: two views of the same model. Unrolled: parameter nodes θ_X and θ_{Y|X} with children X[1], …, X[M], X[M+1] and Y[1], …, Y[M], Y[M+1] (observed data plus the query instance). Plate notation: θ_X and θ_{Y|X} outside a plate over m that contains X[m] → Y[m].]

• Priors for each parameter group are independent
• Data instances are independent given the unknown parameters
Bayesian Networks and Bayesian Prediction (Cont.)

[Figure: the same model as on the previous slide, in both the unrolled and the plate notation.]

• We can also "read" from the network:
  Complete data ⇒ posteriors on parameters are independent
Bayesian Prediction (cont.)

• Since the posteriors on the parameters of each family are independent, we can compute them separately
• Posteriors for parameters within families are also independent:

[Figure: refined plate model in which θ_{Y|X} is split into θ_{Y|X=0} and θ_{Y|X=1}, alongside θ_X and a plate over m containing X[m] → Y[m].]

• Complete data ⇒ independent posteriors on θ_{Y|X=0} and θ_{Y|X=1}
Bayesian Prediction (cont.)

• Given these observations, we can compute the posterior for each multinomial θ_{Xi | pai} independently
• The posterior is Dirichlet with parameters
  α(Xi = 1 | pai) + N(Xi = 1 | pai), …, α(Xi = k | pai) + N(Xi = k | pai)
• The predictive distribution is then represented by the parameters

  \tilde{\theta}_{x_i \mid pa_i} = \frac{\alpha(x_i, pa_i) + N(x_i, pa_i)}{\alpha(pa_i) + N(pa_i)}
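A sketch of this Bayesian estimate for a single CPT, with a uniform pseudo-count per entry; the records and the pseudo-count value are invented:

```python
from collections import Counter

# Sketch: Bayesian estimate of P(A | E, B) with Dirichlet pseudo-counts,
# theta_tilde(a | e, b) = (alpha(a, e, b) + N(a, e, b)) / (alpha(e, b) + N(e, b)),
# where alpha(e, b) = sum_a alpha(a, e, b). Records and pseudo-count are illustrative.
records = [('Y', 'N', 'N'), ('Y', 'Y', 'Y'), ('N', 'N', 'Y'), ('N', 'Y', 'Y')]
values_A = ['Y', 'N']
alpha = 1.0                                            # pseudo-count per CPT entry

joint  = Counter((e, b, a) for e, b, a in records)     # N(a, e, b)
parent = Counter((e, b) for e, b, _ in records)        # N(e, b)

for (e, b) in sorted(parent):
    for a in values_A:
        num = alpha + joint[(e, b, a)]
        den = alpha * len(values_A) + parent[(e, b)]
        print(f"P(A={a} | E={e}, B={b}) = {num / den:.3f}")
```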
Assessing Priors for Bayesian Networks

• We need the α(xi, pai) for each node Xi
• We can use an initial network with parameters Θ0 as prior information
  • We also need an equivalent sample size parameter M0
  • Then we let α(xi, pai) = M0 · P(xi, pai | Θ0)
• This allows us to update a network using new data
Learning Parameters: Case Study

Experiment:
• Sample a stream of instances from the alarm network
• Learn parameters using
  • the MLE estimator
  • the Bayesian estimator with uniform priors of different strengths
Learning Parameters: Case Study (cont.)

[Figure: KL divergence between the learned and the true network as a function of the number of samples M (0 to 5000), for the MLE and for Bayesian estimation with uniform priors of strength M' = 5, 10, 20, 50.]
Learning Parameters: Summary

• Estimation relies on sufficient statistics
  • For multinomials these are of the form N(xi, pai)
• Parameter estimation:

  MLE:                  \hat{\theta}_{x_i \mid pa_i} = \frac{N(x_i, pa_i)}{N(pa_i)}

  Bayesian (Dirichlet): \tilde{\theta}_{x_i \mid pa_i} = \frac{\alpha(x_i, pa_i) + N(x_i, pa_i)}{\alpha(pa_i) + N(pa_i)}

• Bayesian methods also require a choice of priors
• Both MLE and Bayesian estimates are asymptotically equivalent and consistent
• Both can be implemented in an on-line manner by accumulating sufficient statistics