Probabilistic Inference
CS B553: ALGORITHMS FOR
OPTIMIZATION AND LEARNING
Parameter Learning
(From data to distributions)
AGENDA
Learning probability distributions from example
data
Generative vs. discriminative models
Maximum likelihood estimation (MLE)
Bayesian estimation
MOTIVATION
Past lectures have studied how to infer
characteristics of a distribution, given a fully-specified Bayes net
Next few lectures: where does the Bayes net come
from?
Setting for this lecture:
Given a set of examples drawn from a distribution
Each example is complete (fully observable)
BN structure is known, but the CPTs are unknown
DENSITY ESTIMATION
Given dataset D={d[1],…,d[M]} drawn from
underlying distribution P*
Find a distribution that matches P* as closely as
possible
High-level issues:
Usually, not enough data to get an accurate picture of
P*, which forces us to approximate.
Even if we did have P*, how do we measure
closeness?
How do we maximize closeness?
Two approaches: cast learning problems as
optimization problems, or as
Bayesian inference problems
KULLBACK-LEIBLER DIVERGENCE
Definition: given two probability distributions P
and Q over X, the KL divergence (or relative
entropy) from P to Q is given by:
D(P || Q) = E_{x~P}[ log ( P(x) / Q(x) ) ]
          = Σ_x P(x) log ( P(x) / Q(x) )
Properties:
D(P || Q) ≥ 0
D(P || Q) = 0 iff P = Q “almost everywhere”
Not a true “metric” – non-symmetric
APPLYING KL DIVERGENCE TO LEARNING
Approach: given underlying distribution P*, find
P (within a class of distributions) so KL
divergence is minimized
arg min_P D(P* || P) = arg min_P E_{x~P*}[ log P*(x) - log P(x) ]
= arg min_P E_{x~P*}[ log P*(x) ] - E_{x~P*}[ log P(x) ]
= arg max_P E_{x~P*}[ log P(x) ]
If we approximate P* with draws from D, we get
arg max_P (1/|D|) Σ_i log P(d[i])
Minimizing KL-divergence to the empirical
distribution is the same as maximizing the
empirical log-likelihood
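To make this concrete, here is a minimal Python sketch (not from the lecture; the data values are hypothetical) showing that, for a Bernoulli family, the parameter that minimizes KL divergence to the empirical distribution is the same one that maximizes the empirical log-likelihood.

import numpy as np

# Hypothetical draws (1 = cherry); the empirical distribution is just the
# observed frequency of 1s vs. 0s.
data = np.array([1, 1, 0, 1, 0, 1, 1, 0, 0, 1])
p_hat = data.mean()

qs = np.linspace(0.01, 0.99, 99)   # candidate Bernoulli parameters

# Empirical average log-likelihood under candidate parameter q
log_lik = p_hat * np.log(qs) + (1 - p_hat) * np.log(1 - qs)

# KL divergence from the empirical distribution to the model with parameter q
kl = p_hat * np.log(p_hat / qs) + (1 - p_hat) * np.log((1 - p_hat) / (1 - qs))

# Both criteria select the same q (the empirical frequency)
assert np.argmax(log_lik) == np.argmin(kl)
print(qs[np.argmax(log_lik)], p_hat)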
ANOTHER APPROACH: DISCRIMINATIVE
LEARNING
Do we really want to model P*? We may be more
concerned with predicting the values of some
subset of variables
E.g., for a Bayes net CPT, we want P(Y | Pa_Y) but
may not care about the distribution of Pa_Y
Generative model: estimate P(X,Y)
Discriminative model: estimate P(Y|X), ignore
P(X)
TRAINING DISCRIMINATIVE MODELS
Define a loss function l(y,x,P) that is given the
ground truth y,x
Measures the difference between the prediction
P(Y|x) and the ground truth
Examples:
Classification error: I[y ≠ arg max_y P(y|x)]
Conditional log-likelihood: -log P(y|x)
Strategy: minimize empirical loss
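As an illustration, the sketch below (hypothetical labels and model outputs, not from the lecture) computes both empirical losses for a binary Y given a model’s predicted probabilities P(Y=1|x).

import numpy as np

# Hypothetical ground-truth labels and model predictions P(Y=1 | x)
y = np.array([1, 0, 1, 1, 0])
p_y1 = np.array([0.9, 0.2, 0.6, 0.4, 0.1])

# Classification error: I[y != argmax_y P(y|x)]
y_pred = (p_y1 >= 0.5).astype(int)
zero_one_loss = np.mean(y_pred != y)

# Conditional log-likelihood loss: -log P(y|x) evaluated at the true y
p_true = np.where(y == 1, p_y1, 1 - p_y1)
nll = -np.mean(np.log(p_true))

print(zero_one_loss, nll)   # empirical losses a discriminative learner would minimize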
DISCRIMINATIVE VS GENERATIVE
Discriminative models:
Don’t model the input distribution, so may have more
expressive power for the same level of complexity
May learn more accurate predictive models for same
sized training dataset
Directly transcribe top-down evaluation of CPTs
Generative models:
More flexible, because they don’t require a priori
selection of the dependent variable Y
Bottom-up inference is easier
Both useful in different situations
WHAT CLASS OF PROBABILITY MODELS?
For small discrete distributions, just use a
tabular representation
Very efficient learning techniques
For large discrete distributions or continuous
ones, the choice of probability model is crucial
Increasing complexity =>
Can represent complex distributions more accurately
Need more data to learn well (risk of overfitting)
More expensive to learn and to perform inference
LEARNING COIN FLIPS
Let the unknown fraction of cherries be q
(hypothesis)
Probability of drawing a cherry is q
Suppose draws are independent and identically
distributed (i.i.d)
Observe that c out of N draws are cherries (data)
LEARNING COIN FLIPS
Let the unknown fraction of cherries be q
(hypothesis)
Intuition: c/N might be a good hypothesis
(or it might not, depending on the draw!)
MAXIMUM LIKELIHOOD
Likelihood of data d = {d_1,…,d_N} given q:
P(d|q) = ∏_j P(d_j|q)          (i.i.d. assumption)
       = q^c (1-q)^(N-c)       (gather the c cherry terms together, then the N-c lime terms)
MAXIMUM LIKELIHOOD
Likelihood of data d = {d_1,…,d_N} given q:
P(d|q) = q^c (1-q)^(N-c)
[Plot: P(data|q) vs. q for 1/1 cherry]
MAXIMUM LIKELIHOOD
Likelihood of data d = {d_1,…,d_N} given q:
P(d|q) = q^c (1-q)^(N-c)
[Plot: P(data|q) vs. q for 2/2 cherry]
MAXIMUM LIKELIHOOD
Likelihood of data d = {d_1,…,d_N} given q:
P(d|q) = q^c (1-q)^(N-c)
[Plot: P(data|q) vs. q for 2/3 cherry]
MAXIMUM LIKELIHOOD
Likelihood of data d = {d_1,…,d_N} given q:
P(d|q) = q^c (1-q)^(N-c)
[Plot: P(data|q) vs. q for 2/4 cherry]
MAXIMUM LIKELIHOOD
Likelihood of data d = {d_1,…,d_N} given q:
P(d|q) = q^c (1-q)^(N-c)
[Plot: P(data|q) vs. q for 2/5 cherry]
MAXIMUM LIKELIHOOD
Likelihood of data d = {d_1,…,d_N} given q:
P(d|q) = q^c (1-q)^(N-c)
[Plot: P(data|q) vs. q for 10/20 cherry]
MAXIMUM LIKELIHOOD
Likelihood of data d = {d_1,…,d_N} given q:
P(d|q) = q^c (1-q)^(N-c)
[Plot: P(data|q) vs. q for 50/100 cherry]
MAXIMUM LIKELIHOOD
Peaks of likelihood function seem to hover
around the fraction of cherries…
Sharpness indicates some notion of certainty…
[Plot: P(data|q) vs. q for 50/100 cherry]
MAXIMUM LIKELIHOOD
Let P(d|q) be the likelihood function
The quantity argmaxq P(d|q) is known as the
maximum likelihood estimate (MLE)
MAXIMUM LIKELIHOOD
l(q) = log P(d|q) = log [ q^c (1-q)^(N-c) ]
MAXIMUM LIKELIHOOD
l(q) = log P(d|q) = log [ q^c (1-q)^(N-c) ]
= log [ q^c ] + log [ (1-q)^(N-c) ]
MAXIMUM LIKELIHOOD
l(q) = log P(d|q) = log [ q^c (1-q)^(N-c) ]
= log [ q^c ] + log [ (1-q)^(N-c) ]
= c log q + (N-c) log (1-q)
MAXIMUM LIKELIHOOD
l(q) = log P(d|q) = c log q + (N-c) log (1-q)
Setting dl/dq(q) = 0 gives the maximum likelihood
estimate
MAXIMUM LIKELIHOOD
dl/dq(q) = c/q – (N-c)/(1-q)
At MLE, c/q – (N-c)/(1-q) = 0
=> q = c/N
c and N are known as sufficient
statistics for the parameter q – no other
values give additional information
about q
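A quick numerical sanity check (hypothetical counts, not from the lecture): maximizing l(q) numerically recovers the closed-form estimate c/N.

import numpy as np
from scipy.optimize import minimize_scalar

c, N = 7, 20   # hypothetical sufficient statistics

# Negative log-likelihood -l(q) = -(c log q + (N-c) log(1-q))
neg_l = lambda q: -(c * np.log(q) + (N - c) * np.log(1 - q))

res = minimize_scalar(neg_l, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, c / N)   # both approximately 0.35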
OTHER MLE RESULTS
Categorical distributions (Non-binary discrete
variables): take fraction of counts for each value
(histogram)
Continuous Gaussian distributions
Mean = average of the data
Standard deviation = standard deviation of the data
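A minimal sketch (hypothetical observations) of the Gaussian MLE: the sample mean and the divide-by-N standard deviation.

import numpy as np

x = np.array([2.1, 1.9, 2.4, 2.0, 1.8, 2.3])   # hypothetical observations

mu_ml = x.mean()
sigma_ml = x.std(ddof=0)   # ddof=0 divides by N, which gives the ML estimate

print(mu_ml, sigma_ml)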
MAXIMUM LIKELIHOOD FOR BN
For any BN, the ML parameters of any CPT can
be estimated as the fraction of observed values in
the data, conditioned on matching parent values
Example with N = 1000 complete samples:
E observed 500 times => P(E) = 0.5
B observed 200 times => P(B) = 0.2
Network: Earthquake -> Alarm <- Burglar
Counts for A, grouped by its parents:
A | E,B: 19/20
A | ¬E,B: 188/200
A | E,¬B: 170/500
A | ¬E,¬B: 1/380
Resulting ML estimates for the CPT:
E  B  P(A|E,B)
T  T  0.95
F  T  0.95
T  F  0.34
F  F  0.003
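A minimal sketch of CPT fitting by counting and normalizing; the data here are synthetic, generated from CPT values borrowed from the table above, so the estimates should land near those numbers.

import numpy as np
from collections import Counter, defaultdict

# Hypothetical generator: Earthquake and Burglar marginals and an Alarm CPT
# taken from the table above, used only to produce synthetic complete data.
rng = np.random.default_rng(0)

def sample_alarm(e, b):
    p = {(True, True): 0.95, (False, True): 0.95,
         (True, False): 0.34, (False, False): 0.003}[(e, b)]
    return bool(rng.random() < p)

samples = []
for _ in range(1000):
    e, b = bool(rng.random() < 0.5), bool(rng.random() < 0.2)
    samples.append((e, b, sample_alarm(e, b)))

# ML fitting: count occurrences of A grouped by the parent setting (E, B),
# then normalize within each group.
counts = defaultdict(Counter)
for e, b, a in samples:
    counts[(e, b)][a] += 1

cpt = {pa: c[True] / (c[True] + c[False]) for pa, c in counts.items()}
print(cpt)   # estimates of P(A = true | E, B), one entry per parent setting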
PROOF
Let BN have structure G over variables X1,…,Xn
and parameters q
Given dataset D
L(q; D) = ∏_m P_G(d[m]; q)
PROOF
Let BN have structure G over variables X1,…,Xn
and parameters q
Given dataset D
L(q; D) = ∏_m P_G(d[m]; q)
= ∏_m ∏_i P_G(x_i[m] | pa_Xi[m]; q)
FITTING CPTS
Each ML entry P(x_i | pa_Xi) is given by examining
counts of (x_i, pa_Xi) in D and normalizing across
rows of the CPT
Note that for large k = |Pa_Xi|, very few datapoints
will share the values of pa_Xi!
On the order of |D|/2^k per parent setting (for binary parents), and some settings may be even rarer
Large domains |Val(X_i)| can also be a problem
Data fragmentation
PROOF
Let BN have structure G over variables X1,…,Xn and
parameters q
Given dataset D
L(q; D) = ∏_m P_G(d[m]; q)
= ∏_m ∏_i P_G(x_i[m] | pa_Xi[m]; q)
= ∏_i [ ∏_m P_G(x_i[m] | pa_Xi[m]; q) ]
∏_m P_G(x_i[m] | pa_Xi[m]; q) is the likelihood of the local
CPT of X_i: L(q_Xi; D)
Each CPT depends on a disjoint set of parameters q_Xi
=> maximizing L(q; D) over all parameters q is equivalent
to maximizing L(q_Xi; D) over each individual q_Xi
AN ALTERNATIVE APPROACH: BAYESIAN
ESTIMATION
P(q|d) = 1/Z P(d|q) P(q) is the posterior
Distribution of hypotheses given the data
P(d|q) is the likelihood
P(q) is the hypothesis prior
[Figure: parameter node q with children d[1], d[2], …, d[M]]
ASSUMPTION: UNIFORM PRIOR,
BERNOULLI DISTRIBUTION
Assume P(q) is uniform
P(q|d) = (1/Z) P(d|q) = (1/Z) q^c (1-q)^(N-c)
What’s P(Y|D)?
[Figure: parameter node q with children d[1], …, d[M] and query node Y]
ASSUMPTION: UNIFORM PRIOR,
BERNOULLI DISTRIBUTION
Assume P(q) is uniform
P(q|d) = (1/Z) P(d|q) = (1/Z) q^c (1-q)^(N-c)
What’s P(Y|D)?
P(Y|D) = ∫_0^1 P(Y|q) P(q|D) dq = ∫_0^1 q · (1/Z) q^c (1-q)^(N-c) dq
[Figure: parameter node q with children d[1], …, d[M] and query node Y]
ASSUMPTION: UNIFORM PRIOR,
BERNOULLI DISTRIBUTION
∫_0^1 q^a (1-q)^b dq = a! b! / (a+b+1)!
=> Z = c! (N-c)! / (N+1)!
=> P(Y|D) = (1/Z) (c+1)! (N-c)! / (N+2)!
= (c+1) / (N+2)
Can think of this as a
“correction” using
“virtual counts”
[Figure: parameter node q with children d[1], …, d[M] and query node Y]
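A small numerical check (hypothetical counts): integrating the posterior predictive under a uniform prior reproduces the (c+1)/(N+2) formula above.

from scipy.integrate import quad

c, N = 3, 10   # hypothetical counts of cherries out of N draws

# Normalizer Z and posterior predictive P(next draw is cherry | D)
Z, _ = quad(lambda q: q**c * (1 - q)**(N - c), 0, 1)
pred, _ = quad(lambda q: q * q**c * (1 - q)**(N - c) / Z, 0, 1)

print(pred, (c + 1) / (N + 2))   # both approximately 0.3333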
NONUNIFORM PRIORS
P(q|d) ∝ P(d|q) P(q) = q^c (1-q)^(N-c) P(q)
Define, for all q, the probability P(q)
that I believe in hypothesis q
[Plot: a prior density P(q) over q ∈ [0, 1]]
BETA DISTRIBUTION
Beta_{a,b}(q) = γ q^(a-1) (1-q)^(b-1)
a, b are hyperparameters > 0
γ is a normalization
constant
a = b = 1 gives the uniform
distribution
POSTERIOR WITH BETA PRIOR
Posterior ∝ q^c (1-q)^(N-c) · Beta_{a,b}(q)
= γ q^(c+a-1) (1-q)^(N-c+b-1)
= Beta_{a+c, b+N-c}(q)
Prediction = posterior mean:
E[q] = (c+a)/(N+a+b)
POSTERIOR WITH BETA PRIOR
What does this mean?
The prior specifies a “virtual count”
of a-1 heads and b-1 tails
See heads, increment a
See tails, increment b
The effect of the prior diminishes
with more data
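A minimal sketch of the Beta update (hypothetical hyperparameters and counts): the posterior mean sits between the prior mean and the MLE c/N.

# Hypothetical hyperparameters (a weak prior centered at 0.5) and data
a, b = 2.0, 2.0
c, N = 9, 10

# Posterior is Beta(a+c, b+N-c): see a cherry, increment a; see a lime, increment b
a_post, b_post = a + c, b + (N - c)
posterior_mean = a_post / (a_post + b_post)   # E[q] = (c+a)/(N+a+b)

print(posterior_mean)   # 11/14 ~ 0.786, pulled toward 0.5 relative to the MLE 0.9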
CHOOSING A PRIOR
Part of the design process; must be chosen
according to your intuition
Uninformed belief => a = b = 1; strong belief => a, b
high
EXTENSIONS OF BETA PRIORS
Parameters of categorical distributions: Dirichlet
prior
Mathematical expression more complex, but in
practice still takes the form of “virtual counts”
Mean, standard deviation for Gaussian
distributions: Gamma prior
Conjugate priors preserve the representational form of the
prior and posterior distributions, but do not
necessarily exist for general distributions
DIRICHLET PRIOR
Categorical variable with |Val(X)| = k and P(X=i) = q_i
Parameter space q_1,…,q_k with q_i ≥ 0, Σ_i q_i = 1
Maximum likelihood estimate given counts
c_1,…,c_k in the data D:
q_i^ML = c_i / N
Dirichlet prior Dirichlet(a_1,…,a_k):
P(q_1,…,q_k) = (1/Z) q_1^(a_1-1) × ⋯ × q_k^(a_k-1)
Mean is (a_1/a_T, …, a_k/a_T) with a_T = Σ_i a_i
Posterior P(q|D) is Dirichlet(a_1+c_1, …, a_k+c_k)
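A minimal sketch (hypothetical hyperparameters and counts) of the Dirichlet update: the posterior mean is the observed count vector corrected by the virtual counts a_i.

import numpy as np

alpha = np.array([1.0, 1.0, 1.0])   # hypothetical hyperparameters (uniform prior)
counts = np.array([8, 1, 1])        # hypothetical observed counts, N = 10

mle = counts / counts.sum()                      # q_i^ML = c_i / N
alpha_post = alpha + counts                      # Dirichlet(a_1+c_1, ..., a_k+c_k)
posterior_mean = alpha_post / alpha_post.sum()   # (a_i + c_i) / (a_T + N)

print(mle, posterior_mean)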
RECAP
Learning => optimization problem (ML)
Learning => inference problem (Bayesian
estimation)
Learning parameters of Bayesian networks
Conjugate priors