Lectures 13, 14 – Model Inference and Averaging
Rice ECE697
Farinaz Koushanfar
Fall 2006
Summary
• Bootstrap and maximum likelihood (ML)
• Bayesian methods
• The expectation maximization (EM) algorithm
• MCMC for sampling from posterior
• Bagging
• Model averaging
Empirical Distribution
• Suppose x1, ..., xN are the observed outcomes of N iid random variables following an unknown PDF
• The empirical distribution: P(X = a) = count(xi = a)/N
• The empirical estimate of a parameter θ is computed from the empirical distribution by the formula that defines the parameter in terms of its true distribution
• For example, the empirical estimate of the variance is
  \widehat{\mathrm{var}}(x) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2
• Empirical estimates are often biased; there is no guarantee that they have the best possible variance or other desirable properties
Bootstrap
• Let T(x1, ..., xN) be an estimator of θ
• The bootstrap generates repeated estimates by generating repeated "fake" outcomes. Each fake outcome is a random sample drawn according to the empirical distribution P(X = a) = count(xi = a)/N
• To generate fake outcomes, resample with replacement
• By taking R random samples with replacement, we get R different bootstrap estimates of θ; call these B1, ..., BR
• What do we use the R bootstrap estimates for? The most common use is constructing confidence intervals (CIs):
(1) Use the order statistics of Br. E.g., for a 95% confidence interval, use
B(2.5%) and B(97.5%).
(2) If we know that the sampling distribution of T(x1, ..., xN) is Gaussian, base the CI on the sample variance of the bootstrap estimates,
  \widehat{\mathrm{var}}(\hat{\theta}) = \frac{1}{R - 1} \sum_{r=1}^{R} (B_r - \bar{B})^2
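As an illustration of option (1), a minimal sketch of a percentile bootstrap CI, assuming NumPy; the data, the median estimator, and R = 1000 are made-up choices for the example:

```python
import numpy as np

def bootstrap_ci(x, estimator, R=1000, alpha=0.05, rng=None):
    """Percentile bootstrap CI: resample x with replacement R times, take order statistics."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(x)
    # R bootstrap estimates B_1, ..., B_R
    B = np.array([estimator(rng.choice(x, size=n, replace=True)) for _ in range(R)])
    lo, hi = np.quantile(B, [alpha / 2, 1 - alpha / 2])  # e.g. B_(2.5%) and B_(97.5%)
    return lo, hi

# Example: a 95% CI for the median of some made-up observations x1, ..., xN
x = np.random.default_rng(0).exponential(size=50)
print(bootstrap_ci(x, np.median))
```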
Bootstrap-Example
• Training data, Z={z1,z2,..,zN}, with zi=(xi,yi)
• Suppose we are fitting a cubic spline, with 3 knots placed at
quantiles (splines are one form of kernel basis function,
centered at knots)
• The basis functions are h1(x), h2(x), ..., h7(x); the model is the linear expansion
  \mu(x) = \sum_{j=1}^{7} \beta_j h_j(x)
Bootstrap-Example (cont’d)
• Let h(x)^T = (h1(x), ..., h7(x)) collect the basis functions, and let H be the N x 7 matrix with ij-th element h_j(x_i)
• Spline prediction: \hat{\mu}(x) = \sum_{j=1}^{7} \hat{\beta}_j h_j(x) = h(x)^T \hat{\beta}
• Can think of \mu(x) = E(Y \mid X = x)
• The usual least-squares estimate: \hat{\beta} = (H^T H)^{-1} H^T y
• The estimated covariance: \widehat{\mathrm{Var}}(\hat{\beta}) = (H^T H)^{-1} \hat{\sigma}^2
• Noise variance: \hat{\sigma}^2 = \sum_{i=1}^{N} (y_i - \hat{\mu}(x_i))^2 / N
• Pointwise standard error: \widehat{\mathrm{se}}[\hat{\mu}(x)] = \big[h(x)^T (H^T H)^{-1} h(x)\big]^{1/2}\, \hat{\sigma}
• How do we apply the bootstrap to this example?
Bootstrap-Example (cont’d)
• Draw B bootstrap datasets Z* by sampling the pairs zi = (xi, yi) with replacement
• To each bootstrap sample, fit a cubic spline \hat{\mu}^*(x)
• Example figure: 10 bootstrap spline fits (left), resulting confidence bands (right)
Least Squares and Bootstrap - Example
Least Squares, Bootstrap, and ML
• Previous example was nonparametric bootstrap
• Suppose that the error is Gaussian: \varepsilon \sim N(0, \sigma^2)
• In the parametric bootstrap, we draw samples by adding Gaussian noise to the predicted values:
  y_i^* = \hat{\mu}(x_i) + \varepsilon_i^*, \qquad \varepsilon_i^* \sim N(0, \hat{\sigma}^2), \quad i = 1, ..., N
• The process is repeated B times, re-computing the spline on each sample; as B grows, the CIs from this method equal the least-squares bands
• The function estimated from a bootstrap sample has the distribution
  \hat{\mu}^*(x) \sim N\big(\hat{\mu}(x),\; h(x)^T (H^T H)^{-1} h(x)\, \hat{\sigma}^2\big)
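A minimal sketch of the parametric bootstrap just described, assuming NumPy; the polynomial basis stands in for the cubic-spline basis h(x), and the data are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data and a simple polynomial basis standing in for the spline basis h(x)
x = np.sort(rng.uniform(0, 1, size=50))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)
H = np.vander(x, 4, increasing=True)            # N x p basis matrix

beta_hat, *_ = np.linalg.lstsq(H, y, rcond=None)
mu_hat = H @ beta_hat                           # fitted values mu_hat(x_i)
sigma2_hat = np.sum((y - mu_hat) ** 2) / len(y)

B = 200
boot_fits = []
for _ in range(B):
    # Parametric bootstrap: add Gaussian noise to the fitted values, then refit
    y_star = mu_hat + rng.normal(scale=np.sqrt(sigma2_hat), size=len(y))
    beta_star, *_ = np.linalg.lstsq(H, y_star, rcond=None)
    boot_fits.append(H @ beta_star)

bands = np.percentile(np.array(boot_fits), [2.5, 97.5], axis=0)  # pointwise 95% bands
```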
Maximum Likelihood (ML) inference
• In general, the bootstrap agrees not with least squares but with maximum likelihood
• Specify a probability density (or mass) function for the observations: z_i \sim g_\theta(z)
• ML is based on the likelihood function L(\theta; Z):
  L(\theta; Z) = \prod_{i=1}^{N} g_\theta(z_i)
• The logarithm of L(\theta; Z), denoted l(\theta; Z), is the log-likelihood function:
  l(\theta; Z) = \sum_{i=1}^{N} l(\theta; z_i) = \sum_{i=1}^{N} \log g_\theta(z_i)
• ML chooses the value of \theta that maximizes l(\theta; Z)
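A minimal sketch of ML estimation by direct maximization of l(θ; Z), assuming SciPy; the Gaussian model and made-up data are illustrative only:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
z = rng.normal(loc=2.0, scale=1.5, size=200)       # made-up observations z_1, ..., z_N

def neg_loglik(theta):
    mu, log_sigma = theta                           # parameterize sigma on the log scale
    return -np.sum(norm.logpdf(z, loc=mu, scale=np.exp(log_sigma)))

res = minimize(neg_loglik, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)                            # ML estimates of the Gaussian parameters
```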
ML – Some Definitions
• The score function is
  \dot{l}(\theta; Z) = \sum_{i=1}^{N} \dot{l}(\theta; z_i), \quad \text{where } \dot{l}(\theta; z_i) = \partial l(\theta; z_i)/\partial\theta
• Assuming that the maximum of l(\theta; Z) is in the interior of the parameter space, the score is zero there
• The information matrix is
  I(\theta) = -\sum_{i=1}^{N} \frac{\partial^2 l(\theta; z_i)}{\partial\theta\, \partial\theta^T}
• I(\theta) evaluated at \theta = \hat{\theta} is the observed information
• The Fisher information (or expected information) is i(\theta) = E_\theta[I(\theta)]
ML – Some More Results
• Assume independent sampling from g_\theta(z)
• The sampling distribution of the ML estimator has a limiting Normal distribution:
  \hat{\theta} \to N(\theta_0, i(\theta_0)^{-1}),
  which is approximated by N(\hat{\theta}, i(\hat{\theta})^{-1}) or N(\hat{\theta}, I(\hat{\theta})^{-1})
• The corresponding standard error estimates for \hat{\theta}_j are
  \sqrt{i(\hat{\theta})^{-1}_{jj}} \quad \text{and} \quad \sqrt{I(\hat{\theta})^{-1}_{jj}}
• A confidence interval for \theta_j therefore has the form
  \hat{\theta}_j \pm z^{(1-\alpha)} \sqrt{i(\hat{\theta})^{-1}_{jj}} \quad \text{or} \quad \hat{\theta}_j \pm z^{(1-\alpha)} \sqrt{I(\hat{\theta})^{-1}_{jj}}
ML for our smoothing example
• Parameters are \theta = (\beta, \sigma^2); the log-likelihood is
  l(\theta) = -\frac{N}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - h(x_i)^T \beta)^2
• The ML estimates are obtained by setting \partial l/\partial\beta = 0 and \partial l/\partial\sigma^2 = 0:
  \hat{\beta} = (H^T H)^{-1} H^T y, \qquad \hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{\mu}(x_i))^2
• The information matrix for \theta = (\beta, \sigma^2) is block-diagonal; the block corresponding to \beta is
  I(\beta) = (H^T H)/\sigma^2
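A minimal sketch of these ML computations for the linear basis expansion, assuming NumPy; H and y are placeholders standing in for the spline basis matrix and responses:

```python
import numpy as np

def ml_fit(H, y):
    """ML estimates and standard errors for the Gaussian linear-expansion model."""
    N = len(y)
    beta_hat = np.linalg.solve(H.T @ H, H.T @ y)     # (H^T H)^{-1} H^T y
    resid = y - H @ beta_hat
    sigma2_hat = np.sum(resid ** 2) / N              # ML noise variance (divides by N)
    cov_beta = np.linalg.inv(H.T @ H) * sigma2_hat   # inverse information: (H^T H)^{-1} sigma^2
    se_beta = np.sqrt(np.diag(cov_beta))
    return beta_hat, sigma2_hat, se_beta

# Placeholder basis matrix and response (made up for illustration)
rng = np.random.default_rng(2)
H = np.column_stack([np.ones(30), rng.uniform(size=30)])
y = H @ np.array([1.0, -2.0]) + rng.normal(scale=0.5, size=30)
print(ml_fit(H, y))
```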
Bayesian Approach to Inference
• Specify a sampling model Pr(Z | \theta): the pdf of the data given the parameters
• Also specify a prior distribution Pr(\theta), reflecting our knowledge about \theta before we see the new data
• The posterior distribution is
  \Pr(\theta \mid Z) = \frac{\Pr(Z \mid \theta)\, \Pr(\theta)}{\Pr(Z)} = \frac{\Pr(Z \mid \theta)\, \Pr(\theta)}{\int \Pr(Z \mid \theta)\, \Pr(\theta)\, d\theta}
• It corresponds to our updated knowledge about \theta after we see the new data
• The difference between Bayesian and standard inference is that the Bayesian approach explicitly expresses the uncertainty before seeing the data (the prior) and the uncertainty remaining afterwards (the posterior)
Bayesian Approach (Cont’d)
• Predict the value of a future observation via the predictive distribution
  \Pr(z^{new} \mid Z) = \int \Pr(z^{new} \mid \theta)\, \Pr(\theta \mid Z)\, d\theta
• ML would use \Pr(z^{new} \mid \hat{\theta}) to predict future data
• Unlike the predictive distribution, this does not account for the uncertainty in estimating \theta
Bayesian Approach on Our Example
• Parametric model: \mu(x) = \sum_{j=1}^{7} \beta_j h_j(x) = h(x)^T \beta
• Assume that \sigma^2 is known and that the randomness comes only from the variation of y around \mu(x)
• Assuming a finite number of basis functions, put the prior on the distribution of the coefficients: \beta \sim N(0, \tau\Sigma)
• The prior for \mu(x) is then Gaussian, with covariance kernel
  K(x, x') = \mathrm{cov}(\mu(x), \mu(x')) = \tau\, h(x)^T \Sigma\, h(x')
• The posterior distribution for \beta is also Gaussian, with
  E(\beta \mid Z) = \Big(H^T H + \frac{\sigma^2}{\tau}\Sigma^{-1}\Big)^{-1} H^T y
  \mathrm{cov}(\beta \mid Z) = \Big(H^T H + \frac{\sigma^2}{\tau}\Sigma^{-1}\Big)^{-1} \sigma^2
Example (Cont’d)
• The corresponding posterior values for \mu(x) are
  E(\mu(x) \mid Z) = h(x)^T \Big(H^T H + \frac{\sigma^2}{\tau}\Sigma^{-1}\Big)^{-1} H^T y
  \mathrm{cov}(\mu(x), \mu(x') \mid Z) = h(x)^T \Big(H^T H + \frac{\sigma^2}{\tau}\Sigma^{-1}\Big)^{-1} h(x')\, \sigma^2
• How to choose \Sigma? Take the prior correlation matrix to be \Sigma = I (and vary the prior variance \tau)
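A minimal sketch of the posterior mean and pointwise variance above, assuming NumPy, the choice Σ = I, and placeholder values for H, y, σ², and τ:

```python
import numpy as np

def posterior_mu(H, y, h_new, sigma2, tau):
    """Posterior mean and pointwise variance of mu(x) with prior beta ~ N(0, tau * I)."""
    p = H.shape[1]
    A = H.T @ H + (sigma2 / tau) * np.eye(p)         # H^T H + (sigma^2/tau) * Sigma^{-1}, Sigma = I
    A_inv = np.linalg.inv(A)
    beta_post_mean = A_inv @ H.T @ y
    mean = h_new @ beta_post_mean                     # E(mu(x) | Z) at the new points
    var = np.einsum('ij,jk,ik->i', h_new, A_inv, h_new) * sigma2  # pointwise posterior variance
    return mean, var

# Placeholders (made up for illustration)
rng = np.random.default_rng(3)
H = np.column_stack([np.ones(40), rng.uniform(size=40)])
y = H @ np.array([0.5, 1.5]) + rng.normal(scale=0.3, size=40)
mean, var = posterior_mu(H, y, h_new=H, sigma2=0.09, tau=10.0)
```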
Example (Cont’d)
• Let's take a look at the posterior curves (figure) and see the impact of the prior on the posterior
• For a large prior variance \tau, the posterior curves look like the bootstrap curves
Bayes Inference – Example
(from Wikipedia)
• Suppose we wish to know about the proportion r of voters in a
large population who will vote "yes" in a referendum
• Let n be the number of voters in a random sample (chosen
with replacement, so that we have statistical independence)
• Let m be the number of voters in that random sample who will
vote "yes“
• Suppose that we observe n = 10 voters and m = 7 say they will
vote yes. From Bayes theorem:
Example from Wikipedia (Cont’d)
• From this we see that from the prior probability density
function f(r) and the likelihood function L(r) = f(m = 7|r,
n = 10), we can compute the posterior pdf
• f(r) summarizes what we know about the distribution of r in
the absence of any observation.
• We provisionally assume in this case that the prior distribution
of r is uniform over the interval [0, 1]. That is, f(r) = 1.
• If some additional background information is found, we
should modify the prior accordingly. However before we have
any observations, all outcomes are equally likely.
Example from Wikipedia (Cont’d)
• Assuming random sampling, the likelihood function
L(r) = P(m = 7|r, n = 10) is just the probability of 7 successes
in 10 trials for a binomial pdf
• As with the prior, the likelihood is open to revision -- more
complex assumptions will yield more complex likelihood
functions. Maintaining the current assumptions, we compute the normalizing factor
  \Pr(m = 7) = \int_0^1 \binom{10}{7} r^7 (1 - r)^3\, dr = \frac{1}{11}
• For r \in [0, 1] inclusive, the posterior distribution for r is then
  f(r \mid m = 7, n = 10) = \frac{L(r)\, f(r)}{\Pr(m = 7)} = 1320\, r^7 (1 - r)^3
  (a Beta(8, 4) density)
Example from Wikipedia (Cont’d)
• One may be interested in the probability that more
than half the voters will vote "yes".
• The prior probability that more than half the voters
will vote "yes" is 1/2, by the symmetry of the uniform
distribution.
• In comparison, the posterior probability that more than half the voters will vote "yes", i.e., the conditional probability given the outcome of the opinion poll – that seven of the 10 voters questioned will vote "yes" – is
  \int_{0.5}^{1} 1320\, r^7 (1 - r)^3\, dr \approx 0.887
Expectation Maximization (EM)
Algorithm
• The EM algorithm is used for simplifying difficult ML problems; we will show an example first
• Assume we are looking for a simple mixture model
EM - Example
• Model: Y is a mixture of two normal distributions, Y_1 \sim N(\mu_1, \sigma_1^2) and Y_2 \sim N(\mu_2, \sigma_2^2)
• Y = (1 - \Delta)\, Y_1 + \Delta\, Y_2, where \Delta \in \{0, 1\} and \Pr(\Delta = 1) = \pi
• Let \varphi_\theta(y) be the Gaussian density with parameters \theta = (\mu, \sigma^2); the pdf of Y is
  g_Y(y) = (1 - \pi)\, \varphi_{\theta_1}(y) + \pi\, \varphi_{\theta_2}(y)
• The parameters of the mixture model to be estimated by ML are \theta = (\pi, \mu_1, \sigma_1^2, \mu_2, \sigma_2^2)
• The log-likelihood on N training cases is
  l(\theta; Z) = \sum_{i=1}^{N} \log\big[(1 - \pi)\, \varphi_{\theta_1}(y_i) + \pi\, \varphi_{\theta_2}(y_i)\big]
EM - Example
• Direct maximization is difficult numerically, because of the
sum of the terms inside the log
• A simpler approach is to set some of the values, find the others
and iteratively update
• This is the core of the EM algorithm
• Expectation step: Do a soft assignment of each observation to
one model: how much each model is responsible for
explaining a data point
– E.g., the responsibility of model 2 for observation i can be written as
  \hat{\gamma}_i(\theta) = E(\Delta_i \mid \theta, Z) = \Pr(\Delta_i = 1 \mid \theta, Z)
• Maximization step: The responsibilities are used in a weighted
ML fit to update the parameter estimates
EM Algorithm for 2-Component Gaussian Mixture Model
1. Take initial guesses for the parameters \hat{\pi}, \hat{\mu}_1, \hat{\sigma}_1^2, \hat{\mu}_2, \hat{\sigma}_2^2
2. Expectation step: compute the responsibilities
   \hat{\gamma}_i = \frac{\hat{\pi}\, \varphi_{\hat{\theta}_2}(y_i)}{(1 - \hat{\pi})\, \varphi_{\hat{\theta}_1}(y_i) + \hat{\pi}\, \varphi_{\hat{\theta}_2}(y_i)}, \quad i = 1, 2, ..., N
3. Maximization step: compute the weighted means and variances
   \hat{\mu}_1 = \frac{\sum_{i=1}^{N} (1 - \hat{\gamma}_i)\, y_i}{\sum_{i=1}^{N} (1 - \hat{\gamma}_i)}, \qquad
   \hat{\sigma}_1^2 = \frac{\sum_{i=1}^{N} (1 - \hat{\gamma}_i)(y_i - \hat{\mu}_1)^2}{\sum_{i=1}^{N} (1 - \hat{\gamma}_i)}
   \hat{\mu}_2 = \frac{\sum_{i=1}^{N} \hat{\gamma}_i\, y_i}{\sum_{i=1}^{N} \hat{\gamma}_i}, \qquad
   \hat{\sigma}_2^2 = \frac{\sum_{i=1}^{N} \hat{\gamma}_i (y_i - \hat{\mu}_2)^2}{\sum_{i=1}^{N} \hat{\gamma}_i}
   and the mixing probability \hat{\pi} = \sum_{i=1}^{N} \hat{\gamma}_i / N
4. Iterate steps 2 and 3 (E and M) until convergence
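A minimal sketch of this EM recipe for the two-component mixture, assuming NumPy and SciPy; the data and the number of iterations are made up for illustration:

```python
import numpy as np
from scipy.stats import norm

def em_two_gaussians(y, n_iter=50):
    """EM for a two-component Gaussian mixture, following the steps above."""
    rng = np.random.default_rng(0)
    # Step 1: initial guesses
    mu1, mu2 = rng.choice(y, 2, replace=False)
    var1 = var2 = np.var(y)
    pi = 0.5
    for _ in range(n_iter):
        # Step 2 (E): responsibilities of component 2 for each observation
        p1 = (1 - pi) * norm.pdf(y, mu1, np.sqrt(var1))
        p2 = pi * norm.pdf(y, mu2, np.sqrt(var2))
        gamma = p2 / (p1 + p2)
        # Step 3 (M): weighted means, variances, and mixing probability
        mu1 = np.sum((1 - gamma) * y) / np.sum(1 - gamma)
        var1 = np.sum((1 - gamma) * (y - mu1) ** 2) / np.sum(1 - gamma)
        mu2 = np.sum(gamma * y) / np.sum(gamma)
        var2 = np.sum(gamma * (y - mu2) ** 2) / np.sum(gamma)
        pi = np.mean(gamma)
    return pi, mu1, var1, mu2, var2

# Made-up bimodal data
rng = np.random.default_rng(4)
y = np.concatenate([rng.normal(1.0, 0.5, 60), rng.normal(4.0, 0.7, 40)])
print(em_two_gaussians(y))
```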
EM – Example (cont’d)
Selected iterations of the EM algorithm for the mixture example:

  Iteration    \hat{\pi}
  1            0.485
  5            0.493
  10           0.523
  15           0.544
  20           0.546
EM – When is it Useful?
• In the context of hidden Markov models, EM is known as the Baum-Welch algorithm
• EM is used for maximizing the likelihood in a certain class of problems where direct ML is difficult
• Data augmentation: the observed data is enlarged with latent (unobserved) variables
– In the previous example, the unobserved variable was \Delta_i
– EM is widely used when part of the actual data is missing: treat the missing data as latent variables
EM Algorithm – General Case
• Assume that the observed data is Z, with log-likelihood l(\theta; Z) depending on the parameter \theta
• The missing (latent) data is Z^m; the complete data is T = (Z, Z^m), with log-likelihood l_0(\theta; T) based on the complete-data density
• In the mixture problem, T = (Z, Z^m) = (y, \Delta)
• Since
  \Pr(Z^m \mid Z, \theta') = \frac{\Pr(Z^m, Z \mid \theta')}{\Pr(Z \mid \theta')}, \qquad
  \Pr(Z \mid \theta') = \frac{\Pr(T \mid \theta')}{\Pr(Z^m \mid Z, \theta')},
  we have
  l(\theta'; Z) = l_0(\theta'; T) - l_1(\theta'; Z^m \mid Z)
• Here l_1 is based on the conditional density \Pr(Z^m \mid Z, \theta)
EM Algorithm – General Case
1. Start with initial guesses for the parameter: \hat{\theta}^{(0)}
2. Expectation step: at the j-th step, compute
   Q(\theta', \hat{\theta}^{(j)}) = E[l_0(\theta'; T) \mid Z, \hat{\theta}^{(j)}]
   as a function of the dummy argument \theta'
3. Maximization step: determine the new estimate \hat{\theta}^{(j+1)} as the maximizer of Q(\theta', \hat{\theta}^{(j)}) over \theta'
4. Iterate steps 2 and 3 (E and M) until convergence
Why EM Works?
• Remember that in terms of the log-likelihoods
  l(\theta'; Z) = l_0(\theta'; T) - l_1(\theta'; Z^m \mid Z),
  where l_0 is based on the complete-data density and l_1 on the conditional density \Pr(Z^m \mid Z, \theta)
• Taking conditional expectations with respect to the distribution of T \mid Z governed by the parameter \theta gives
  l(\theta'; Z) = E[l_0(\theta'; T) \mid Z, \theta] - E[l_1(\theta'; Z^m \mid Z) \mid Z, \theta]
               = Q(\theta', \theta) - R(\theta', \theta)
• R(\theta^*, \theta) is the expectation of a log-likelihood of a density (indexed by \theta^*) with respect to the same density indexed by \theta; by Jensen's inequality it is maximized as a function of \theta^* when \theta^* = \theta
• Hence, if \theta' is chosen to maximize Q(\theta', \theta),
  l(\theta'; Z) - l(\theta; Z) = [Q(\theta', \theta) - Q(\theta, \theta)] - [R(\theta', \theta) - R(\theta, \theta)] \ge 0
• Thus, the EM iteration never decreases the log-likelihood!
MCMC Methods
• Slides (for MCMC) mostly borrowed from Sujit Sahu, www.maths.soton.ac.uk/staff/Sahu/
• Assume that we want to find
  I = \int_a^b f(x)\, dx
• If the interval [a, b] is divided so that a = x_0 < x_1 < ... < x_N = b, with h the width of each subinterval, the integral can be approximated by
  I = \int_a^b f(x)\, dx \approx \sum_{i=0}^{N-1} f\Big(\frac{x_i + x_{i+1}}{2}\Big)\, h
• Now assume we need to find an expectation
  E_\pi[h(X)] = \int h(x)\, \pi(x)\, dx, \qquad \int |h(x)|\, \pi(x)\, dx < \infty
• This can be difficult!
MCMC for Integration (Cont’d)
• If we can draw samples X^{(1)}, ..., X^{(N)} \sim \pi(x)
• Then we can estimate the expectation by the sample mean
  \widehat{E_\pi[h(X)]} = \frac{1}{N} \sum_{t=1}^{N} h(X^{(t)})
• This is Monte Carlo (MC) integration
• Note that we have changed notation: N is now the number of samples drawn
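A minimal sketch of plain Monte Carlo integration with independent samples, assuming NumPy; the target π (standard normal) and h(x) = x², for which E_π[h(X)] = 1, are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(10)

# Target pi(x): standard normal; function of interest h(x) = x^2, so E_pi[h(X)] = 1
N = 100_000
samples = rng.normal(size=N)          # independent draws X^(t) ~ pi(x)
estimate = np.mean(samples ** 2)      # (1/N) * sum_t h(X^(t))
print(estimate)                       # should be close to 1
```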
Consistency of the integral estimate
• For independent samples, by the Law of Large Numbers, the estimate converges to E_\pi[h(X)] as N \to \infty
• But independent sampling from \pi(x) can be difficult
• It turns out that the above convergence still holds if we generate the samples using a (suitable) Markov chain
Markov Chains (Review)
• A Markov chain is generated by sampling
  X^{(t+1)} \sim p(x \mid X^{(t)}), \quad t = 0, 1, 2, ...
• Here p is the transition kernel
• So X^{(t+1)} depends only on X^{(t)}, not on X^{(0)}, X^{(1)}, ..., X^{(t-1)}
• For example,
  X^{(t+1)} \mid X^{(t)} \sim N(0.5\, X^{(t)}, 1)
• This is a first-order autoregressive (AR(1)) process with lag-1 autocorrelation 0.5
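A minimal sketch simulating this AR(1) chain, assuming NumPy; the chain length and seed are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(5)

def ar1_chain(n_steps, x0=0.0, rho=0.5):
    """Simulate X^(t+1) ~ N(rho * X^(t), 1): an AR(1) Markov chain."""
    x = np.empty(n_steps)
    x[0] = x0
    for t in range(n_steps - 1):
        x[t + 1] = rng.normal(loc=rho * x[t], scale=1.0)
    return x

chain = ar1_chain(10_000)
# Stationary distribution is N(0, 1 / (1 - rho^2)); the sample variance should be near 1.33
print(chain.mean(), chain.var())
```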
Markov Chains (Stationary)
• As t, the Markov chain converges (in
distribution) to its stationary (invariant)
distribution
• In the example before, this is
• Which does not depend on x(0)
• Does this happen for all Markov chains?
Markov Chains (Irreducibility)
• Assuming that a stationary distribution exists,
it is unique if the chain is irreducible
• Irreducible means that any set of states can be
reached from any other state in a finite number
of moves
• An example of a reducible Markov chain:
– Suppose p(x | y) = 0 for x \in A and y \in B, and vice versa
Markov Chains (Aperiodicity)
• A Markov chain taking only a finite number of values is aperiodic if the greatest common divisor of the return times to any particular state, say i, is 1
• Think of recording the number of steps taken to return to state i; the g.c.d. of those numbers should be 1
• If the g.c.d. is bigger than 1, say 2, then the chain can only return in cycles of 2, 4, 6, ... steps; this is not allowed for aperiodicity
• Definition can be extended to general state-space case
Markov Chains (Ergodicity)
• Assume the Markov chain has stationary distribution \pi(x) and is aperiodic and irreducible
• Then the ergodic theorem applies: the sample average \frac{1}{N}\sum_{t=1}^{N} h(X^{(t)}) converges to E_\pi[h(X)]
• For such chains, under additional moment conditions on h, a central limit theorem also holds
• Convergence occurs geometrically
Numerical Standard Error (nse)
• In general, no simple closed-form expression exists for the numerical standard error (nse) of the MCMC estimate
Numerical Standard Error (nse)
• If \{h(X^{(t)})\} can be approximated as a first-order autoregressive process, then
  \mathrm{nse} \approx \sqrt{\frac{\mathrm{Var}[h(X)]}{N}} \cdot \sqrt{\frac{1+\rho}{1-\rho}},
  where \rho is the lag-1 autocorrelation of \{h(X^{(t)})\}
• The first factor is the usual term under independent sampling
• The second factor is usually > 1
• It is thus the penalty to be paid because a Markov chain has been used
More on nse
• The nse may not be finite in general
• It is finite if the chain converges geometrically
• If the nse is finite, then we can make it as
small as we like by increasing N
• The ‘obvious’ estimator of nse is not consistent
Markov Chains -- Summary
• A Markov chain may have a stationary
distribution
• The stationary distribution is unique if the chain is irreducible
• We can estimate nse’s if the chain is also
geometrically convergent
• Where does all this get us?
MCMC
• How do we construct a Markov chain whose stationary distribution is our target distribution \pi(x)?
• Metropolis et al. (1953) showed how
• The method was generalized by Hastings (1970)
• This approach is called Markov chain Monte Carlo (MCMC)
Metropolis-Hastings Algorithm
• At each iteration t:
• Step 1: Sample y \sim q(y \mid x^{(t)}), where y is the candidate point and q(\cdot \mid \cdot) is the proposal pdf
• Step 2: With probability
  \alpha(x^{(t)}, y) = \min\left\{1, \frac{\pi(y)\, q(x^{(t)} \mid y)}{\pi(x^{(t)})\, q(y \mid x^{(t)})}\right\}
  set x^{(t+1)} = y (acceptance); else set x^{(t+1)} = x^{(t)} (rejection)
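A minimal sketch of a random-walk Metropolis-Hastings sampler, assuming NumPy; the unnormalized target (a standard normal) and the symmetric Gaussian proposal are illustrative choices, so the q terms cancel in the ratio:

```python
import numpy as np

rng = np.random.default_rng(6)

def target_unnorm(x):
    """Target pi(x) up to a normalizing constant (standard normal here)."""
    return np.exp(-0.5 * x ** 2)

def metropolis_hastings(n_steps, x0=0.0, step=1.0):
    x = np.empty(n_steps)
    x[0] = x0
    for t in range(n_steps - 1):
        y = rng.normal(loc=x[t], scale=step)                       # symmetric proposal q(y | x)
        alpha = min(1.0, target_unnorm(y) / target_unnorm(x[t]))   # acceptance probability
        x[t + 1] = y if rng.uniform() < alpha else x[t]
    return x

samples = metropolis_hastings(20_000)
print(samples.mean(), samples.var())   # should be near 0 and 1
```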
Metropolis-Hastings Notes
• The normalizing constant in \pi(x) is not required to run the algorithm; it cancels in the acceptance ratio
• If q(y \mid x) = \pi(y), then we obtain independent samples
• Usually q is chosen so that q(y \mid x) is easy to sample from
• Theoretically, any proposal density q(\cdot \mid x) having the same support should work; however, some choices of q are better than others
• The induced Markov chains have the desirable properties under mild conditions on \pi(x)
Gibbs Sampler
• Gibbs sampling is a Monte-Carlo sampling method
• Suppose that x = (x_1, x_2, ..., x_k) is k (\ge 2) dimensional
• The Gibbs sampler uses what are called the full (or complete) conditional distributions
  \pi(x_j \mid x_1, ..., x_{j-1}, x_{j+1}, ..., x_k), \quad j = 1, ..., k
• Note that the full conditional is proportional to the joint:
  \pi(x_j \mid x_{-j}) \propto \pi(x)
• Often this helps in finding it
Gibbs Sampling (Cont’d)
• Sample or update each component in turn, always using the most recent values of the other components
Gibbs Sampling -- Algorithm
1. Take some initial values U_k^{(0)}, k = 1, 2, ..., K
2. Repeat for t = 1, 2, ...:
   For k = 1, 2, ..., K, generate U_k^{(t)} from
   \Pr(U_k^{(t)} \mid U_1^{(t)}, ..., U_{k-1}^{(t)}, U_{k+1}^{(t-1)}, ..., U_K^{(t-1)})
3. Continue step 2 until the joint distribution of (U_1^{(t)}, U_2^{(t)}, ..., U_K^{(t)}) does not change
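A minimal sketch of a Gibbs sampler, assuming NumPy; the target here is an illustrative bivariate normal with correlation rho, whose full conditionals are the univariate normals used below:

```python
import numpy as np

rng = np.random.default_rng(7)

def gibbs_bivariate_normal(n_steps, rho=0.8):
    """Gibbs sampling from a standard bivariate normal with correlation rho."""
    x = np.zeros((n_steps, 2))
    for t in range(1, n_steps):
        # Full conditional of x1 given the previous x2, then x2 given the new x1
        x1 = rng.normal(loc=rho * x[t - 1, 1], scale=np.sqrt(1 - rho ** 2))
        x2 = rng.normal(loc=rho * x1, scale=np.sqrt(1 - rho ** 2))
        x[t] = (x1, x2)
    return x

draws = gibbs_bivariate_normal(20_000)
print(np.corrcoef(draws[1000:].T))   # correlation should be near rho after burn-in
```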
Gibbs Sampling (Cont’d)
• In Bayesian Inference, the goal is to draw a sample
from the joint posterior of the parameters given the
data Z
• Gibbs sampling will be helpful if it is easy to sample
from the conditional distribution of each parameter
given the other parameters and Z
• An example – Gaussian mixture problem is described
on the next slide
Gibbs Sampling - Example
1. Take some initial values (0)= (1(0), 2(0))
2. Repeat for t=1,2,..
a) For i=1,2,…,N generate i(t){0,1} with
Pr(i(t)=1)=^i((t)), from the Eq. previously given
(t)
N
b) Set N
(t)
i 1  i .y i
i 1 (1   i ) y i , and, 
ˆ2 
(t)
ˆ1 

N
(
t
)
N
 (1  
i 1
i
)
 
i 1
And generate 1(t)~N(^1,^12) and 2(t)~ N(^2,^22)
3. Continue step 2 until the joint distribution of
((t),1(t),2(t)) does not change
i
Gibbs Sampling – Example (Cont’d)
• There is a close connection between Gibbs sampling and EM
• The key is to consider the latent variable Z^m from EM as another parameter for the Gibbs sampler
• For simplicity, fix the variances \sigma_1^2, \sigma_2^2 and the mixing proportion \pi, so the unknowns are \mu_1 and \mu_2
• The steps parallel EM, except that in step 2(a) Gibbs simulates the latent data \Delta_i from the distribution \Pr(\Delta_i \mid \theta, Z) instead of computing responsibilities
• In step 2(b), rather than maximizing the posterior \Pr(\mu_1, \mu_2, \Delta \mid Z), we simulate from the conditional distribution \Pr(\mu_1, \mu_2 \mid \Delta, Z)
• The example was simplified; in reality the variances and \pi should be taken into account as well
Gibbs Sampling – Example (Cont’d)
• 200 iterations of Gibbs sampling (figure)
• The values seem to stabilize quickly and are distributed evenly around the maximum likelihood values
Bagging
• Some classification and regression methods are unstable
– Small perturbations in their training sets or in their construction may result in large changes in the constructed predictor
– E.g., subset selection methods in regression, decision trees
• Unstable methods can improve their accuracy by perturbing
and combining
• Generate multiple versions of the predictor by perturbing the
training set or construction method and then combine the
versions into a single predictor
• E.g., the bagging estimate for regression is
  \hat{f}_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x),
  where \hat{f}^{*b} is the model fit to bootstrap sample Z^{*b}
• Bagging for classification: combine the B votes (e.g., by majority vote)
Bagging
• The bagging estimate will differ from the original
estimate f^(x) when the latter is a nonlinear or
adaptive function of the data
• Bagging can also be used to combine ("bag") different models for the same prediction task
• It is widely used with trees (discussed in the next section)
• If we think of the bootstrap values of an estimator as approximate posterior values of a corresponding parameter, from a nonparametric Bayesian analysis, then bagging is just the posterior Bayesian mean
Model Averaging and Stacking
• We have a set of candidate models M_m, m = 1, ..., M, for the training set Z
• The models may be of the same type with different parameter values, or different models for the same task
• Suppose \zeta is some quantity of interest, say, a prediction f(x) at some fixed feature value x
• The posterior distribution and posterior mean of \zeta are
  \Pr(\zeta \mid Z) = \sum_{m=1}^{M} \Pr(\zeta \mid M_m, Z)\, \Pr(M_m \mid Z)
  E(\zeta \mid Z) = \sum_{m=1}^{M} E(\zeta \mid M_m, Z)\, \Pr(M_m \mid Z)
• This Bayesian prediction is a weighted average of the individual predictions, with weights proportional to the posterior probability of each model
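A minimal sketch of this weighted average, assuming NumPy; the per-model predictions and the posterior model probabilities are placeholders that would come from the models themselves:

```python
import numpy as np

def model_average(predictions, model_posteriors):
    """E(zeta | Z) = sum_m E(zeta | model_m, Z) * Pr(model_m | Z)."""
    predictions = np.asarray(predictions, dtype=float)   # shape (M, n_points)
    w = np.asarray(model_posteriors, dtype=float)
    w = w / w.sum()                                      # normalize the model weights
    return w @ predictions

# Placeholder: three models' predictions at five feature values, with posterior probabilities
preds = [[1.0, 1.2, 0.9, 1.1, 1.0],
         [0.8, 1.0, 1.1, 0.9, 1.2],
         [1.1, 1.3, 1.0, 1.0, 0.9]]
print(model_average(preds, model_posteriors=[0.5, 0.3, 0.2]))
```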
Model Averaging and Stacking
• Committee methods take a simple unweighted
average of the predictions from each model, giving
equal probability to each of the models
• In cases where the different models arise from the same parametric model, with different parameter values, one could use BIC to derive the weights
– Recall that BIC gives weight to each model depending on how well it fits and how many parameters it uses
• How does this work for nonparametric (frequentist) models?
– Answer: Chapter 10; boosting is a strong method proposed for solving these cases
Stochastic Search: Bumping
• Bumping is a technique for finding a better single model; it again uses the bootstrap
• It is suitable for models where there are many local minima
• Like bagging, draw bootstrap samples, and fit a model to each
• Rather than averaging, choose the model estimated from the
bootstrap sample that best fits the training data
• Draw bootstrap samples Z*1,…,Z*B and fit our model to each,
giving predictions f^*b(x), b=1,..,B
• Choose the model that gives the smallest prediction error,
averaged over the original training set
• E.g., for squared error, choose the bootstrap sample \hat{b} given by
  \hat{b} = \arg\min_{b} \sum_{i=1}^{N} \big[y_i - \hat{f}^{*b}(x_i)\big]^2
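A minimal sketch of bumping, assuming NumPy; the base fit_predict function (a cubic polynomial fit) and the data are placeholders for whatever model is actually being bumped:

```python
import numpy as np

rng = np.random.default_rng(9)

def bump(x, y, fit_predict, B=50):
    """Fit the model to B bootstrap samples and keep the fit with the smallest
    squared training error on the ORIGINAL data (the original sample is also a candidate)."""
    n = len(x)
    best_err, best_pred = np.inf, None
    for b in range(B + 1):
        idx = np.arange(n) if b == 0 else rng.integers(0, n, size=n)  # b = 0: original data
        pred = fit_predict(x[idx], y[idx], x)            # predictions on the original inputs
        err = np.sum((y - pred) ** 2)                    # squared training error
        if err < best_err:
            best_err, best_pred = err, pred
    return best_pred

def fit_predict(x_train, y_train, x_new, degree=3):
    """Placeholder base learner: cubic polynomial least squares."""
    return np.polyval(np.polyfit(x_train, y_train, degree), x_new)

x = np.sort(rng.uniform(-1, 1, 40))
y = x ** 3 - x + rng.normal(scale=0.1, size=40)
best = bump(x, y, fit_predict)
```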