Statistical learning


Statistical Learning
(From data to distributions)
Reminders
• HW5 deadline extended to Friday
Agenda
• Learning a probability distribution from data
• Maximum likelihood estimation (MLE)
• Maximum a posteriori (MAP) estimation
• Expectation Maximization (EM)
Motivation
• Agent has made observations (data)
• Now it must make sense of them (hypotheses)
– Hypotheses alone may be important (e.g., in basic science)
– For inference (e.g., forecasting)
– To take sensible actions (decision making)
• A basic component of economics, the social and hard sciences, engineering, …
Candy Example
• Candy comes in 2 flavors, cherry and lime, with identical wrappers
• Manufacturer makes 5 (indistinguishable) bags:
– h1: 100% cherry, 0% lime
– h2: 75% cherry, 25% lime
– h3: 50% cherry, 50% lime
– h4: 25% cherry, 75% lime
– h5: 0% cherry, 100% lime
• Suppose we draw a sequence of candies from our bag (in the running example, all limes)
• What bag are we holding? What flavor will we draw next?
Machine Learning vs. Statistics
• Machine Learning ≈ automated statistics
• This lecture:
– Bayesian learning, the more “traditional” statistics (R&N 20.1-3)
– Learning Bayes Nets
Bayesian Learning
• Main idea: consider the probability of each hypothesis, given the data
• Data d: the sequence of candies drawn so far
• Hypotheses: compute the posterior P(hi|d) for each of h1,…,h5
Using Bayes’ Rule
• P(hi|d) = α P(d|hi) P(hi) is the posterior
– (Recall, 1/α = Σi P(d|hi) P(hi))
• P(d|hi) is the likelihood
• P(hi) is the hypothesis prior
Computing the Posterior
• Assume draws are independent
• Let P(h1),…,P(h5) = (0.1,0.2,0.4,0.2,0.1)
• d = 10 limes in a row

  hi    P(d|hi)     P(d|hi)P(hi)    P(hi|d)
  h1    0           0               0
  h2    0.25^10     ≈ 1.9e-7        ≈ 0.00
  h3    0.5^10      ≈ 4e-4          ≈ 0.00
  h4    0.75^10     ≈ 0.011         ≈ 0.10
  h5    1^10        0.1             ≈ 0.90

  Sum = 1/α ≈ 0.1114
Posterior Hypotheses
Predicting the Next Draw
• P(X|d) = Σi P(X|hi,d) P(hi|d) = Σi P(X|hi) P(hi|d)
– (The hypothesis H generates both the data D and the next draw X, so X is independent of D given H)
• Probability that the next candy drawn is a lime:

  hi    P(hi|d)     P(X=lime|hi)
  h1    0           0
  h2    ≈ 0.00      0.25
  h3    ≈ 0.00      0.5
  h4    ≈ 0.10      0.75
  h5    ≈ 0.90      1

  P(next candy is lime | d) ≈ 0.975
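A minimal Python sketch of this calculation, using the priors, the likelihood model, and the 10-lime data set from the slides above (the exact predictive value comes out near 0.97; the slide's 0.975 uses the rounded posteriors):

```python
# Posterior over bag hypotheses and predictive probability of lime.

priors = [0.1, 0.2, 0.4, 0.2, 0.1]        # P(h_i)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]      # P(lime | h_i)

N = 10
likelihoods = [p ** N for p in p_lime]    # P(d | h_i) for d = 10 limes (i.i.d. draws)

joint = [lik * pri for lik, pri in zip(likelihoods, priors)]
evidence = sum(joint)                     # 1/alpha = sum_i P(d|h_i) P(h_i)
posterior = [j / evidence for j in joint] # P(h_i | d)

p_next_lime = sum(p * post for p, post in zip(p_lime, posterior))
print(posterior)      # ~[0, 1.7e-6, 0.0035, 0.10, 0.90]
print(p_next_lime)    # ~0.97
```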
Other properties of Bayesian Estimation
• Any learning technique trades off between good fit and hypothesis complexity
• The prior can penalize complex hypotheses
– There are many more complex hypotheses than simple ones
– Ockham’s razor
Hypothesis Spaces often Intractable
• A hypothesis is a joint probability table over the state variables
– 2^n entries ⇒ hypothesis space is [0,1]^(2^n)
– 2^(2^n) deterministic hypotheses
– 6 boolean variables ⇒ 2^64 (over 10^19) deterministic hypotheses
• Summing over hypotheses is expensive!
Some Common Simplifications
• Maximum a posteriori (MAP) estimation
– hMAP = argmax_hi P(hi|d)
– P(X|d) ≈ P(X|hMAP)
• Maximum likelihood (ML) estimation
– hML = argmax_hi P(d|hi)
– P(X|d) ≈ P(X|hML)
• Both approximate the true Bayesian prediction as the amount of data grows large
Maximum a Posteriori
• hMAP = argmax_hi P(hi|d)
• P(X|d) ≈ P(X|hMAP)
(Plot: P(X|hMAP) vs. the true Bayesian P(X|d) as candies are drawn; hMAP shifts from h3 to h4 to h5.)
Maximum a Posteriori
• For large amounts of data, P(incorrect hypothesis | d) → 0
• For small sample sizes, MAP predictions are “overconfident”
(Plot: P(X|hMAP) vs. P(X|d) for small samples.)
Maximum Likelihood
• hML = argmax_hi P(d|hi)
• P(X|d) ≈ P(X|hML)
(Plot: P(X|hML) vs. P(X|d); with no data hML is undefined, then becomes h5 after the first lime.)
Maximum Likelihood
• hML = hMAP with a uniform prior
• The relevance of the prior diminishes with more data
• Preferred by some statisticians
– Are priors “cheating”?
– What is a prior anyway?
Advantages of MAP and MLE over Bayesian Estimation
• They involve an optimization rather than a large summation
– Local search techniques apply
• For some types of distributions, there are closed-form solutions that are easily computed
Learning Coin Flips (Bernoulli Distribution)
• Let the unknown fraction of cherries be θ
• Suppose draws are independent and identically distributed (i.i.d.)
• Observe that c out of N draws are cherries
Maximum Likelihood
• Likelihood of the data d = {d1,…,dN} given θ:
– P(d|θ) = Πj P(dj|θ) = θ^c (1-θ)^(N-c)
– (The product form uses the i.i.d. assumption; gathering the c cherry terms and the N-c lime terms gives the final expression)
Maximum Likelihood
• Same as maximizing the log likelihood
• L(d|θ) = log P(d|θ) = c log θ + (N-c) log(1-θ)
• max_θ L(d|θ):
  dL/dθ = 0
  ⇒ 0 = c/θ − (N-c)/(1-θ)
  ⇒ θ = c/N
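A quick sanity check of this result, as a small Python sketch (not from the slides): it compares the closed-form estimate θ = c/N against a brute-force maximization of the log likelihood, with made-up counts c and N.

```python
import math

def log_likelihood(theta, c, N):
    """c log(theta) + (N - c) log(1 - theta) for c cherries in N i.i.d. draws."""
    return c * math.log(theta) + (N - c) * math.log(1 - theta)

c, N = 7, 10                      # hypothetical: 7 cherries out of 10 draws
theta_mle = c / N                 # closed-form ML estimate

# Brute-force check on a fine grid over (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=lambda t: log_likelihood(t, c, N))

print(theta_mle, theta_grid)      # both ~0.7
```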
Maximum Likelihood for BN
• For any BN, the ML parameters of any CPT are just the fractions of the observed values in the data
• Example (BN: Earthquake → Alarm ← Burglar), N = 1000 samples:
– E observed 500 times ⇒ P(E) = 0.5
– B observed 200 times ⇒ P(B) = 0.2
– Alarm counts: A|E,B: 19/20; A|¬E,B: 188/200; A|E,¬B: 170/500; A|¬E,¬B: 1/380

  E   B   P(A|E,B)
  T   T   0.95    (= 19/20)
  F   T   0.94    (= 188/200)
  T   F   0.34    (= 170/500)
  F   F   0.003   (≈ 1/380)
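A minimal sketch of how such a CPT entry is estimated by counting; the list-of-dicts data format and the helper name are hypothetical:

```python
# ML estimation of P(A | E, B) from complete data by counting.
# Each sample is a dict of boolean values for E, B, A (hypothetical format).

samples = [
    {"E": True, "B": True, "A": True},
    {"E": False, "B": False, "A": False},
    # ... more complete-data samples ...
]

def ml_cpt_alarm(samples):
    counts = {}  # (e, b) -> (number of samples with A true, total samples)
    for s in samples:
        key = (s["E"], s["B"])
        n_true, total = counts.get(key, (0, 0))
        counts[key] = (n_true + int(s["A"]), total + 1)
    # ML estimate: fraction of samples with A true within each (E, B) group.
    return {key: n_true / total for key, (n_true, total) in counts.items()}

print(ml_cpt_alarm(samples))   # {(True, True): 1.0, (False, False): 0.0} for the toy data
```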
Maximum Likelihood for Gaussian Models
• Observe a continuous variable x1,…,xN
• Fit a Gaussian with mean μ and standard deviation σ
– Standard procedure: write the log likelihood
  L = N(C − log σ) − Σj (xj − μ)² / (2σ²)
– Set the derivatives to zero
Maximum Likelihood for Gaussian Models
• Observe a continuous variable x1,…,xN
• Results:
– μ = (1/N) Σj xj  (sample mean)
– σ² = (1/N) Σj (xj − μ)²  (sample variance)
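A small sketch of these closed-form estimates on synthetic data (the true mean and standard deviation below are arbitrary choices):

```python
import math, random

# Hypothetical data: draws from a Gaussian we pretend not to know.
xs = [random.gauss(3.0, 2.0) for _ in range(10_000)]

N = len(xs)
mu = sum(xs) / N                           # ML estimate of the mean
var = sum((x - mu) ** 2 for x in xs) / N   # ML estimate of the variance (1/N, not 1/(N-1))
sigma = math.sqrt(var)

print(mu, sigma)   # should come out close to 3.0 and 2.0
```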
Maximum Likelihood for Conditional Linear Gaussians
• Y is a child of X (BN: X → Y)
• Data: pairs (xj, yj)
• X is Gaussian; Y is a linear Gaussian function of X
– Y(x) ~ N(ax + b, σ)
• The ML estimates of a and b are given by least squares regression; σ by the standard deviation of the residuals
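A sketch of that fit with the closed-form least-squares formulas, on synthetic data (the true a, b, σ below are arbitrary):

```python
import math, random

# Hypothetical data generated from y = 2x + 1 plus Gaussian noise with sigma = 0.5.
xs = [random.gauss(0.0, 1.0) for _ in range(5_000)]
ys = [2.0 * x + 1.0 + random.gauss(0.0, 0.5) for x in xs]

N = len(xs)
mean_x = sum(xs) / N
mean_y = sum(ys) / N

# Least-squares slope and intercept.
a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
b = mean_y - a * mean_x

# ML estimate of sigma: root-mean-square residual.
sigma = math.sqrt(sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys)) / N)

print(a, b, sigma)   # close to 2.0, 1.0, 0.5
```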
Back to Coin Flips
• What about Bayesian or MAP learning?
• Motivation:
– I pick a coin out of my pocket
– 1 flip turns up heads
– What’s the MLE?
Back to Coin Flips
• Need some prior distribution P(θ)
• P(θ|d) ∝ P(d|θ) P(θ) = θ^c (1-θ)^(N-c) P(θ)
• The prior P(θ) encodes, for each value of θ, how strongly I believe in it
(Plot: a prior density P(θ) over θ ∈ [0, 1].)
MAP estimate
• Could maximize θ^c (1-θ)^(N-c) P(θ) using some optimization
• It turns out that for some families of P(θ), the MAP estimate is easy to compute (conjugate prior)
(Plot: several Beta distributions, a family of priors P(θ) over θ ∈ [0, 1].)
Beta Distribution
• Beta_a,b(θ) = α θ^(a-1) (1-θ)^(b-1)
– a, b are hyperparameters
– α is a normalization constant
– Mean is a/(a+b)
Posterior with Beta Prior
• Posterior ∝ θ^c (1-θ)^(N-c) P(θ) = α θ^(c+a-1) (1-θ)^(N-c+b-1)
• MAP estimate: θ = (c+a)/(N+a+b)
• The posterior is also a Beta distribution!
– See heads, increment a
– See tails, increment b
– The prior specifies a “virtual count” of a heads and b tails
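A tiny sketch of this conjugate update; the prior hyperparameters and the flip data are made up:

```python
# Beta-Bernoulli conjugate update: the prior's virtual counts (a heads, b tails)
# are simply incremented by the observed counts.

a, b = 2, 2                    # hypothetical Beta(2, 2) prior
flips = [1, 1, 0, 1, 1, 1]     # 1 = heads, 0 = tails (made-up data)

c = sum(flips)                 # observed heads
N = len(flips)

post_a = a + c                 # posterior is Beta(a + c, b + N - c)
post_b = b + (N - c)

theta_map = (c + a) / (N + a + b)   # the MAP estimate from the slide
theta_mle = c / N                   # compare: plain ML estimate

print(post_a, post_b, theta_map, theta_mle)
```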
Does this work in general?
• Only specific distributions have the right type of prior (a conjugate prior)
– Bernoulli, Poisson, geometric, Gaussian, exponential, …
• Otherwise, MAP needs a (often expensive) numerical optimization
How to deal with missing observations?
• A very difficult statistical problem in general
• E.g., surveys:
– Did the person skip the political-affiliation question at random?
– Or do independents skip it more often than people with a strong affiliation?
• The situation is better if a variable is completely hidden
Expectation Maximization for Gaussian Mixture Models
• Clustering with N Gaussian distributions
• Each data point has a label saying which Gaussian it belongs to, but the label is a hidden variable
• E step: compute the probability that each data point belongs to each Gaussian
• M step: compute the ML estimate of each Gaussian, weighting each sample by the probability that it belongs to that Gaussian
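A minimal EM sketch for a 1-D mixture of two Gaussians (synthetic data and 50 fixed iterations; a real implementation would check convergence):

```python
import math, random

def gauss_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical data: two 1-D clusters.
data = ([random.gauss(-2.0, 1.0) for _ in range(300)]
        + [random.gauss(3.0, 0.5) for _ in range(300)])

# Initialization: equal weights, means at random data points, unit std devs.
w = [0.5, 0.5]
mu = [random.choice(data), random.choice(data)]
sigma = [1.0, 1.0]

for _ in range(50):
    # E step: responsibility of each Gaussian for each point.
    resp = []
    for x in data:
        p = [w[k] * gauss_pdf(x, mu[k], sigma[k]) for k in range(2)]
        total = sum(p)
        resp.append([pk / total for pk in p])

    # M step: weighted ML estimates of weight, mean, std dev of each Gaussian.
    for k in range(2):
        nk = sum(r[k] for r in resp)
        w[k] = nk / len(data)
        mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
        sigma[k] = math.sqrt(sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk)

print(w, mu, sigma)   # should approach the two clusters' weights, means, std devs
```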
Learning HMMs
• Want to find the transition and observation probabilities
• Data: many observation sequences {O1:t(j) for 1 ≤ j ≤ N}
• Problem: we don’t observe the X’s!
(HMM: hidden states X0, X1, X2, X3, … with observations O1, O2, O3, …)
Learning HMMs
• Assume a stationary Markov chain with discrete states x1,…,xm
• Transition parameters: pij = P(Xt+1 = xj | Xt = xi)
• Observation parameters: φi = P(O | Xt = xi)
• Initial state parameters: λi = P(X0 = xi)
(State-transition diagram over x1, x2, x3, with an observation O emitted from each state.)
Expectation Maximization
• Initialize parameters randomly
• E-step: infer the expected probabilities of the hidden variables over time, given the current parameters
• M-step: maximize the likelihood of the data over the parameters
• Full parameter vector: θ = (λ1, λ2, λ3, p11, p12, …, p32, p33, φ1, φ2, φ3)
(Diagram: the state-transition graph annotated with P(initial state), P(transition i→j), and P(emission).)
Expectation Maximization
• θ = (λ1, λ2, λ3, p11, p12, …, p32, p33, φ1, φ2, φ3)
• Initialize θ^(0)
• E: compute E[P(Z = z | θ^(0), O)], where Z ranges over all combinations of hidden state sequences, e.g., (x1, x2, x3), (x2, x2, x1), (x1, x2, x2), (x1, x3, x2), …
– Result: a probability distribution over the hidden state at each time t
– This is the hard part…
• M: compute θ^(1) = ML estimate of the transition / observation distributions
E-Step on HMMs
• Computing the expectations can be done by:
– Sampling
– Running the forward/backward algorithm on the unrolled HMM (R&N p. 546)
• The latter gives the classic Baum-Welch algorithm
• Note that EM can still get stuck in local optima or even saddle points
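A compact sketch of the forward/backward pass that produces the E-step quantities for a small discrete HMM. The parameter names mirror the slides' λ, pij, φi; the numbers are made up, and for simplicity the first hidden state here emits the first observation:

```python
# Forward/backward smoothing for a small discrete HMM: computes
# P(X_t = x_i | O_1:T) for every t, the core E-step quantity behind Baum-Welch.

lam = [0.6, 0.4]            # initial state distribution lambda_i = P(X_0 = x_i)
p = [[0.7, 0.3],            # transition matrix p_ij = P(X_{t+1} = x_j | X_t = x_i)
     [0.2, 0.8]]
phi = [[0.9, 0.1],          # observation model phi_i[o] = P(O = o | X_t = x_i)
       [0.3, 0.7]]          # (two observation symbols: 0 and 1)

obs = [0, 0, 1, 1, 0]       # hypothetical observation sequence
T, m = len(obs), len(lam)

# Forward pass: alpha[t][i] proportional to P(O_1:t, X_t = x_i), normalized per step.
alpha = []
for t in range(T):
    if t == 0:
        unnorm = [lam[i] * phi[i][obs[0]] for i in range(m)]
    else:
        unnorm = [phi[i][obs[t]] * sum(alpha[t - 1][j] * p[j][i] for j in range(m))
                  for i in range(m)]
    z = sum(unnorm)
    alpha.append([u / z for u in unnorm])

# Backward pass: beta[t][i] proportional to P(O_{t+1:T} | X_t = x_i).
beta = [[1.0] * m for _ in range(T)]
for t in range(T - 2, -1, -1):
    unnorm = [sum(p[i][j] * phi[j][obs[t + 1]] * beta[t + 1][j] for j in range(m))
              for i in range(m)]
    z = sum(unnorm)
    beta[t] = [u / z for u in unnorm]

# Smoothed posteriors gamma[t][i] = P(X_t = x_i | O_1:T).
for t in range(T):
    unnorm = [alpha[t][i] * beta[t][i] for i in range(m)]
    z = sum(unnorm)
    print(t, [round(u / z, 3) for u in unnorm])
```

The M-step would then re-estimate pij and φi from these smoothed posteriors (plus the analogous pairwise state posteriors), which is exactly the Baum-Welch update.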
Next Time
• Machine learning