#### Transcript Probabilistic Inference

CS B553: ALGORITHMS FOR OPTIMIZATION AND LEARNING
Parameter Learning
(From data to distributions)
AGENDA
- Learning probability distributions from example data
- Generative vs. discriminative models
- Maximum likelihood estimation (MLE)
- Bayesian estimation

MOTIVATION
- Past lectures have studied how to infer characteristics of a distribution, given a fully specified Bayes net
- Next few lectures: where does the Bayes net come from?
- Setting for this lecture:
  - Given a set of examples drawn from a distribution
  - Each example is complete (fully observable)
  - BN structure is known, but the CPTs are unknown

DENSITY ESTIMATION
- Given a dataset D = {d[1], …, d[M]} drawn from an underlying distribution P*
- Find a distribution that matches P* as closely as possible
- High-level issues:
  - There is usually not enough data to get an accurate picture of P*, which forces us to approximate.
  - Even if we did have P*, how do we measure closeness?
  - How do we maximize closeness?
- Two approaches: learning problems become either optimization problems or Bayesian inference problems

KULLBACK-LEIBLER DIVERGENCE
Definition: given two probability distributions P and Q over X, the KL divergence (or relative entropy) from P to Q is given by:

$$D(P \| Q) = E_{x \sim P}\left[\log \frac{P(x)}{Q(x)}\right] = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$

Properties:
- $D(P \| Q) \ge 0$
- $D(P \| Q) = 0$ iff P = Q "almost everywhere"
- Not a true "metric" – non-symmetric
APPLYING KL DIVERGENCE TO LEARNING
Approach: given the underlying distribution P*, find P (within a class of distributions) so that the KL divergence is minimized:

$$\arg\min_P D(P^* \| P) = \arg\min_P E_{x \sim P^*}[\log P^*(x) - \log P(x)]$$
$$= \arg\min_P E_{x \sim P^*}[\log P^*(x)] - E_{x \sim P^*}[\log P(x)]$$
$$= \arg\max_P E_{x \sim P^*}[\log P(x)]$$

If we approximate P* with draws from D, we get:

$$\arg\max_P \frac{1}{|D|} \sum_{i} \log P(d[i])$$

Minimizing the KL divergence to the empirical distribution is the same as maximizing the empirical log-likelihood.
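This equivalence can be checked numerically. A minimal sketch in plain Python (hypothetical data: 7 cherries out of 10 draws): scanning candidate Bernoulli parameters, the parameter that minimizes KL divergence from the empirical distribution is exactly the one that maximizes the empirical log-likelihood.

```python
import math

# Hypothetical data: 7 cherries (1) out of 10 draws.
data = [1] * 7 + [0] * 3
p_emp = {1: 7 / 10, 0: 3 / 10}  # empirical distribution

def kl(p, q):
    # KL divergence D(P || Q) over a shared finite support.
    return sum(p[x] * math.log(p[x] / q[x]) for x in p)

def log_likelihood(theta, data):
    return sum(math.log(theta if x == 1 else 1 - theta) for x in data)

# Scan candidate Bernoulli parameters on a grid.
candidates = [i / 100 for i in range(1, 100)]
best_kl = min(candidates, key=lambda t: kl(p_emp, {1: t, 0: 1 - t}))
best_ll = max(candidates, key=lambda t: log_likelihood(t, data))
print(best_kl, best_ll)  # both 0.7
```

Both criteria select q = c/N = 0.7, as the derivation above predicts.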
ANOTHER APPROACH: DISCRIMINATIVE LEARNING
Do we really want to model P*? We may be more concerned with predicting the values of some subset of variables.
- E.g., for a Bayes net CPT, we want P(Y|Pa_Y) but may not care about the distribution of Pa_Y
- Generative model: estimate P(X,Y)
- Discriminative model: estimate P(Y|X), ignore P(X)

TRAINING DISCRIMINATIVE MODELS
- Define a loss function l(y, x, P) that is given the ground truth y, x
- Measures the difference between the prediction P(Y|x) and the ground truth
- Examples:
  - Classification error $I[y \ne \arg\max_{y'} P(y'|x)]$
  - Conditional log likelihood $-\log P(y|x)$
- Strategy: minimize empirical loss
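A quick sketch of the two example losses on a single hypothetical prediction (the distribution P(Y|x) and the label are made up for illustration):

```python
import math

# Hypothetical prediction for one example: the model says P(Y=1 | x) = 0.8,
# and the ground truth is y = 1.
p_y_given_x = {1: 0.8, 0: 0.2}
y_true = 1

# Classification (0/1) error: I[y != argmax_y' P(y'|x)]
y_pred = max(p_y_given_x, key=p_y_given_x.get)
zero_one_loss = int(y_true != y_pred)

# Conditional log likelihood loss: -log P(y|x)
log_loss = -math.log(p_y_given_x[y_true])

print(zero_one_loss, round(log_loss, 3))  # 0 0.223
```

Note the difference in granularity: the 0/1 loss is already zero here, while the log loss still penalizes the model for its remaining uncertainty.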
DISCRIMINATIVE VS GENERATIVE
Discriminative models:
- Don't model the input distribution, so they may have more expressive power for the same level of complexity and the same-sized training dataset
- Directly transcribe top-down evaluation of CPTs

Generative models:
- More flexible, because they don't require a priori selection of the dependent variable Y
- Bottom-up inference is easier

Both are useful in different situations.
WHAT CLASS OF PROBABILITY MODELS?
For small discrete distributions, just use a tabular representation:
- Very efficient learning techniques

For large discrete distributions or continuous ones, the choice of probability model is crucial. Increasing complexity means:
- Can represent complex distributions more accurately
- Need more data to learn well (risk of overfitting)
- More expensive to learn and to perform inference
LEARNING COIN FLIPS
- Let the unknown fraction of cherries be q (hypothesis)
- Probability of drawing a cherry is q
- Suppose draws are independent and identically distributed (i.i.d.)
- Observe that c out of N draws are cherries (data)
- Intuition: c/N might be a good hypothesis (or it might not, depending on the draw!)

MAXIMUM LIKELIHOOD
Likelihood of data d = {d[1], …, d[N]} given q:

$$P(d|q) = \prod_j P(d_j|q) = q^c (1-q)^{N-c}$$

(The product form uses the i.i.d. assumption; gather the c cherry terms together, then the N−c lime terms.)
MAXIMUM LIKELIHOOD
Likelihood of data d = {d[1], …, d[N]} given q: $P(d|q) = q^c (1-q)^{N-c}$

[Series of plots of P(data|q) vs. q for growing datasets: 1/1 cherry, 2/2 cherry, 2/3 cherry, 2/4 cherry, 2/5 cherry, 10/20 cherry, 50/100 cherry.]
MAXIMUM LIKELIHOOD
- Peaks of the likelihood function seem to hover around the fraction of cherries…
- Sharpness indicates some notion of certainty…

[Plot: P(data|q) vs. q for 50/100 cherry.]
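The behavior described above can be reproduced numerically; this sketch (plain Python, grid search over q) finds the peak of P(d|q) for a few of the datasets from the plots:

```python
def likelihood(theta, c, n):
    # P(d|q) = q^c (1-q)^(N-c)
    return theta ** c * (1 - theta) ** (n - c)

grid = [i / 1000 for i in range(1, 1000)]
peaks = []
for c, n in [(2, 3), (10, 20), (50, 100)]:
    peak = max(grid, key=lambda t: likelihood(t, c, n))
    peaks.append(peak)
    print(f"{c}/{n} cherry: peak at q = {peak}")
```

Each peak lands at (or on the grid point nearest to) the observed fraction c/N: 0.667, 0.5, 0.5.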
MAXIMUM LIKELIHOOD
- Let P(d|q) be the likelihood function
- The quantity $\arg\max_q P(d|q)$ is known as the maximum likelihood estimate (MLE)
MAXIMUM LIKELIHOOD

$$l(q) = \log P(d|q) = \log\left[q^c (1-q)^{N-c}\right]$$
$$= \log\left[q^c\right] + \log\left[(1-q)^{N-c}\right]$$
$$= c \log q + (N-c) \log(1-q)$$

Setting dl/dq(q) = 0 gives the maximum likelihood estimate.
MAXIMUM LIKELIHOOD
- dl/dq(q) = c/q – (N-c)/(1-q)
- At the MLE, c/q – (N-c)/(1-q) = 0 => q = c/N
- c and N are known as sufficient statistics for the parameter q – no other statistics of the data are needed
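A numeric sanity check of the closed form, with hypothetical counts c = 7 out of N = 10: maximizing l(q) over a grid recovers q = c/N. Note that only the counts (c, N) enter the likelihood, which is what "sufficient statistics" means here.

```python
import math

def log_likelihood(q, c, n):
    # l(q) = c log q + (N - c) log(1 - q)
    return c * math.log(q) + (n - c) * math.log(1 - q)

# Hypothetical data: c = 7 cherries out of N = 10 draws. Only (c, N) matter;
# the order of the draws never appears in the likelihood.
c, n = 7, 10
grid = [i / 1000 for i in range(1, 1000)]
q_mle = max(grid, key=lambda q: log_likelihood(q, c, n))
print(q_mle, c / n)  # prints: 0.7 0.7
```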
OTHER MLE RESULTS
- Categorical distributions (non-binary discrete variables): take the fraction of counts for each value (histogram)
- Continuous Gaussian distributions:
  - Mean = average of the data
  - Standard deviation = standard deviation of the data

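A minimal sketch of the Gaussian case with made-up samples. One detail worth flagging: the MLE standard deviation divides by N, not the N−1 of the usual "sample" correction.

```python
import math

data = [2.1, 1.9, 2.4, 2.0, 1.6]  # hypothetical samples

n = len(data)
mu_mle = sum(data) / n  # MLE mean = average of the data
# MLE variance divides by N (not the N-1 "sample variance" correction).
sigma_mle = math.sqrt(sum((x - mu_mle) ** 2 for x in data) / n)
print(round(mu_mle, 3), round(sigma_mle, 3))  # 2.0 0.261
```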
MAXIMUM LIKELIHOOD FOR BN
For any BN, the ML parameters of any CPT can be derived from the fraction of observed values in the data, conditioned on matched parent values.

Example (Earthquake → Alarm ← Burglar): N = 1000 examples, with E: 500 and B: 200, so P(E) = 0.5 and P(B) = 0.2. Observed alarm counts: A|E,B: 19/20; A|¬E,B: 188/200; A|E,¬B: 170/500; A|¬E,¬B: 1/380.

| E | B | P(A\|E,B) |
|---|---|-----------|
| T | T | 0.95 |
| F | T | 0.94 |
| T | F | 0.34 |
| F | F | 0.003 |
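The CPT above is just counting and dividing; a sketch using the slide's own counts:

```python
# Counts from the slide (N = 1000 examples over Earthquake, Burglar, Alarm).
# ML estimate of each CPT row: alarms observed / examples with that parent
# configuration.
counts = {                     # (E, B): (alarm_count, total_count)
    (True,  True):  (19, 20),
    (False, True):  (188, 200),
    (True,  False): (170, 500),
    (False, False): (1, 380),
}

cpt = {config: a / total for config, (a, total) in counts.items()}
for (e, b), p in cpt.items():
    print(f"P(A | E={e}, B={b}) = {round(p, 3)}")
```

This reproduces the table entries 0.95, 0.94, 0.34, and 0.003.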
PROOF
- Let the BN have structure G over variables X1,…,Xn and parameters q
- Given dataset D:

$$L(q; D) = \prod_m P_G(d[m]; q) = \prod_m \prod_i P_G(x_i[m] \mid pa_{X_i}[m]; q)$$

FITTING CPTS
- Each ML entry P(x_i | pa_{X_i}) is given by examining counts of (x_i, pa_{X_i}) in D and normalizing across rows of the CPT
- Note that for a large number of parents k = |Pa_{X_i}|, very few datapoints will share the values of pa_{X_i}!
  - On average O(|D|/2^k) datapoints per parent assignment (for binary parents), but some values may be even rarer
  - Large domains |Val(X_i)| can also be a problem
  - This is known as data fragmentation
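A small simulation of data fragmentation, with hypothetical sizes (k = 6 binary parents, |D| = 200 examples): each of the 2^k parent configurations gets only about |D|/2^k ≈ 3 examples, and some may get none at all.

```python
import itertools
import random
from collections import Counter

random.seed(0)
k = 6             # number of binary parents (hypothetical)
n_examples = 200  # dataset size |D| (hypothetical)

# Each synthetic example assigns values to the k parents uniformly at random.
data = [tuple(random.randint(0, 1) for _ in range(k)) for _ in range(n_examples)]
counts = Counter(data)

# How many of the 2^k parent configurations received no data at all?
n_configs = 2 ** k
n_empty = sum(1 for cfg in itertools.product((0, 1), repeat=k)
              if cfg not in counts)
print(n_configs, n_empty)
```

CPT rows for empty configurations cannot be estimated from counts at all, which is one motivation for the priors introduced below.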

PROOF
- Let the BN have structure G over variables X1,…,Xn and parameters q
- Given dataset D:

$$L(q; D) = \prod_m P_G(d[m]; q) = \prod_m \prod_i P_G(x_i[m] \mid pa_{X_i}[m]; q) = \prod_i \left[\prod_m P_G(x_i[m] \mid pa_{X_i}[m]; q)\right]$$

- $\prod_m P_G(x_i[m] \mid pa_{X_i}[m]; q)$ is the likelihood of the local CPT of X_i: $L(q_{X_i}; D)$
- Each CPT depends on a disjoint set of parameters $q_{X_i}$
- => maximizing L(q; D) over all parameters q is equivalent to maximizing $L(q_{X_i}; D)$ over each individual $q_{X_i}$
AN ALTERNATIVE APPROACH: BAYESIAN ESTIMATION
- P(q|d) = 1/Z P(d|q) P(q) is the posterior: the distribution of hypotheses given the data
- P(d|q) is the likelihood
- P(q) is the hypothesis prior

[Diagram: Bayes net with q as parent of d[1], …, d[M].]
ASSUMPTION: UNIFORM PRIOR, BERNOULLI DISTRIBUTION
- Assume P(q) is uniform
- P(q|d) = 1/Z P(d|q) = 1/Z q^c (1-q)^{N-c}
- What's P(Y|D)?

$$P(Y|D) = \int_0^1 P(Y|\theta)\, P(\theta|D)\, d\theta = \int_0^1 \frac{1}{Z}\, \theta \cdot \theta^c (1-\theta)^{N-c}\, d\theta$$

[Diagram: Bayes net with q as parent of Y and of d[1], …, d[M].]
ASSUMPTION: UNIFORM PRIOR, BERNOULLI DISTRIBUTION
Using the identity

$$\int_0^1 \theta^a (1-\theta)^b\, d\theta = \frac{a!\, b!}{(a+b+1)!}$$

- => Z = c! (N-c)! / (N+1)!
- => P(Y|D) = (1/Z) (c+1)! (N-c)! / (N+2)! = (c+1) / (N+2)

Can think of this as a "correction" using "virtual counts".
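The closed form (c+1)/(N+2) (Laplace's rule of succession) can be checked against a direct numeric integration of the posterior, as in this sketch:

```python
def predict_next_cherry(c, n, steps=100_000):
    # Riemann-sum check of  P(Y|D) = ∫ θ · θ^c (1-θ)^(N-c) dθ / Z,
    # where Z = ∫ θ^c (1-θ)^(N-c) dθ.
    num = den = 0.0
    for i in range(1, steps):
        t = i / steps
        w = t ** c * (1 - t) ** (n - c)
        num += t * w
        den += w
    return num / den

c, n = 3, 4  # hypothetical data: 3 cherries in 4 draws
print(round(predict_next_cherry(c, n), 4))  # ≈ 0.6667
print((c + 1) / (n + 2))                    # closed form: 0.666...
```

Note the prediction 4/6 ≈ 0.667 is pulled toward 1/2 relative to the MLE c/N = 0.75, exactly the "virtual count" correction.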
NONUNIFORM PRIORS

$$P(q|d) \propto P(d|q)\, P(q) = q^c (1-q)^{N-c}\, P(q)$$

Define, for all q, the probability that I believe in q.

[Plot: a prior density P(q) over q ∈ [0, 1].]
BETA DISTRIBUTION

$$\mathrm{Beta}_{a,b}(q) = \gamma\, q^{a-1} (1-q)^{b-1}$$

- a, b are hyperparameters > 0
- $\gamma$ is a normalization constant
- a = b = 1 is the uniform distribution
POSTERIOR WITH BETA PRIOR
- Posterior $\propto q^c (1-q)^{N-c} \cdot \mathrm{Beta}_{a,b}(q) = \gamma\, q^{c+a-1} (1-q)^{N-c+b-1} = \mathrm{Beta}_{a+c,\, b+N-c}(q)$
- Prediction = mean: $E[q] = (c+a)/(N+a+b)$
POSTERIOR WITH BETA PRIOR
- What does this mean?
  - The prior specifies a "virtual count": see heads, increment a; see tails, increment b
- The effect of the prior diminishes with more data
CHOOSING A PRIOR
- Part of the design process; must be chosen
- Uninformed belief => a = b = 1; strong belief => a, b high

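A sketch of the Beta update and prediction with hypothetical numbers (prior Beta(2,2), data of 7 cherries in 10 draws), also showing the prior washing out as data grows:

```python
# Beta_{a,b} prior + Bernoulli data (c cherries in N draws)
# -> posterior Beta_{a+c, b+N-c}, predictive mean (c + a) / (N + a + b).
def posterior_params(a, b, c, n):
    return a + c, b + n - c

def predictive_mean(a, b, c, n):
    return (c + a) / (n + a + b)

a, b = 2, 2   # hypothetical mildly informed prior ("virtual counts")
c, n = 7, 10
print(posterior_params(a, b, c, n))            # (9, 5)
print(round(predictive_mean(a, b, c, n), 3))   # 0.643

# With much more data the prediction approaches the MLE c/N:
print(round(predictive_mean(a, b, 700, 1000), 3))  # 0.699
```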
EXTENSIONS OF BETA PRIORS
- Parameters of categorical distributions: Dirichlet prior
  - Mathematical expression is more complex, but in practice it still takes the form of "virtual counts"
- Mean, standard deviation for Gaussian distributions: Gamma prior
- Conjugate priors preserve the representation of the prior and posterior distributions, but do not necessarily exist for general distributions
DIRICHLET PRIOR
- Categorical variable with |Val(X)| = k and P(X=i) = q_i
- Parameter space q_1,…,q_k with q_i ≥ 0 and Σ_i q_i = 1
- Maximum likelihood estimate given counts c_1,…,c_k in the data D: $q_i^{ML} = c_i/N$
- Dirichlet prior Dirichlet(α_1,…,α_k):

$$P(\theta_1, \ldots, \theta_k) = \frac{1}{Z}\, \theta_1^{\alpha_1 - 1} \times \cdots \times \theta_k^{\alpha_k - 1}$$

- Mean is $(\alpha_1/\alpha_T, \ldots, \alpha_k/\alpha_T)$ with $\alpha_T = \sum_i \alpha_i$
- Posterior P(q|D) is Dirichlet($\alpha_1 + c_1, \ldots, \alpha_k + c_k$)

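Like the Beta case, the Dirichlet update is just adding observed counts to the hyperparameters. A sketch with a hypothetical 3-valued variable:

```python
# Dirichlet(a_1,...,a_k) prior over a k-valued categorical: the posterior
# adds the observed counts, and the posterior mean is a_i / a_T.
def dirichlet_posterior(alphas, counts):
    return [a + c for a, c in zip(alphas, counts)]

def dirichlet_mean(alphas):
    a_total = sum(alphas)
    return [a / a_total for a in alphas]

alphas = [1, 1, 1]   # uniform prior over 3 values
counts = [5, 3, 2]   # hypothetical observed counts c_1..c_k (N = 10)

post = dirichlet_posterior(alphas, counts)
print(post)                                          # [6, 4, 3]
print([round(p, 3) for p in dirichlet_mean(post)])   # [0.462, 0.308, 0.231]
print([c / sum(counts) for c in counts])             # MLE: [0.5, 0.3, 0.2]
```

The posterior mean is again the MLE smoothed toward uniform by the virtual counts.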
RECAP
- Learning => optimization problem (maximum likelihood)
- Learning => inference problem (Bayesian estimation)
- Learning parameters of Bayesian networks
- Conjugate priors