Statistical Estimation


Basics of Statistical Estimation
Learning Probabilities: Classical Approach
Simplest case: flipping a thumbtack
[Figure: a thumbtack landing heads or tails]
The true probability θ is unknown.
Given iid data, estimate θ using an estimator with good properties: low bias, low variance, consistency (e.g., the maximum likelihood estimate).
Maximum Likelihood Principle
Choose the parameters that maximize the probability of the observed data.
Maximum Likelihood Estimation
p(heads | θ) = θ
p(tails | θ) = 1 − θ
p(hhth…ttth | θ) = θ^{#h} (1 − θ)^{#t}
(The number of heads follows a binomial distribution.)
Computing the ML Estimate
• Use the log-likelihood
• Differentiate with respect to the parameter(s)
• Equate to zero and solve
• Solution: θ̂ = #h / (#h + #t)
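Carried out explicitly, these steps give the standard derivation of the solution above:

log L(θ) = #h log θ + #t log(1 − θ)
d/dθ log L(θ) = #h/θ − #t/(1 − θ) = 0
⟹ θ̂ = #h / (#h + #t)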
Sufficient Statistics
p(hhth…ttth | θ) = θ^{#h} (1 − θ)^{#t}
(#h, #t) are sufficient statistics.
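As a minimal Python sketch (the data string below is illustrative), the ML estimate depends on the data only through these counts:

    # ML estimate of theta = p(heads) from iid flips.
    # The sufficient statistics are just the counts (#h, #t).
    def ml_estimate(flips):
        num_heads = flips.count("h")
        num_tails = flips.count("t")
        return num_heads / (num_heads + num_tails)

    print(ml_estimate("hhthttth"))  # 4 heads, 4 tails -> 0.5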
Bayesian Estimation
[Figure: a thumbtack landing heads or tails]
The true probability θ is unknown.
Represent uncertainty about θ as a Bayesian probability density p(θ) over θ ∈ [0, 1].
[Figure: a density p(θ) plotted over θ from 0 to 1]
Use of Bayes’ Theorem
posterior ∝ prior × likelihood:

p(θ | heads) = p(θ) p(heads | θ) / ∫ p(θ) p(heads | θ) dθ ∝ p(θ) p(heads | θ)
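A small numerical sketch of this update, discretizing θ on a grid; the uniform prior here is an illustrative assumption:

    import numpy as np

    # Grid approximation of p(theta | heads) ∝ p(theta) p(heads | theta).
    theta = np.linspace(0.0, 1.0, 1001)   # grid over [0, 1]
    prior = np.ones_like(theta)           # uniform prior (illustrative choice)
    likelihood = theta                    # p(heads | theta) = theta
    posterior = prior * likelihood        # prior x likelihood
    posterior /= posterior.sum()          # normalize over the grid

    # Posterior mean after one observed head: 2/3 under a uniform prior
    print((theta * posterior).sum())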
Example: Application to Observation of a Single "Heads"

[Figure: the prior p(θ) times the likelihood p(heads | θ) = θ gives the posterior p(θ | heads), each plotted over θ from 0 to 1]
Probability of Heads on Next Toss
p(X_{N+1} = heads | d) = ∫ p(X_{N+1} = heads | θ) p(θ | d) dθ
                       = ∫ θ p(θ | d) dθ
                       = E_{p(θ|d)}(θ)
MAP Estimation
• Approximation:
– Instead of averaging over all parameter values
– Consider only the most probable value
(i.e., value with highest posterior probability)
• Usually a very good approximation,
and much simpler
• MAP value ≠ Expected value
• MAP → ML for infinite data
(as long as the prior is nonzero everywhere)
Prior Distributions for θ
• Direct assessment
• Parametric distributions
– Conjugate distributions (for convenience)
– Mixtures of conjugate distributions
Conjugate Family of Distributions

Beta distribution:
p(θ) = Beta(αh, αt) ∝ θ^{αh − 1} (1 − θ)^{αt − 1},   αh, αt > 0

Resulting posterior distribution:
p(θ | #h heads, #t tails) ∝ θ^{#h + αh − 1} (1 − θ)^{#t + αt − 1}
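A minimal sketch of this conjugate update with scipy; the hyperparameters and counts below are illustrative:

    from scipy.stats import beta

    # Prior Beta(a_h, a_t) plus observed counts (#h, #t)
    # gives posterior Beta(a_h + #h, a_t + #t).
    a_h, a_t = 2.0, 2.0        # illustrative prior hyperparameters
    n_heads, n_tails = 7, 3    # illustrative observed counts

    posterior = beta(a_h + n_heads, a_t + n_tails)
    print(posterior.mean())    # (a_h + #h) / (a_h + #h + a_t + #t) = 9/14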
Estimates Compared
• Prior prediction: E(θ) = αh / (αh + αt)
• Posterior prediction: E(θ | d) = (#h + αh) / (#h + αh + #t + αt)
• MAP estimate: θ̂ = (#h + αh − 1) / (#h + αh − 1 + #t + αt − 1)
• ML estimate: θ̂ = #h / (#h + #t)
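A short sketch computing the four estimates side by side, reusing the illustrative hyperparameters and counts from above:

    a_h, a_t = 2.0, 2.0   # prior hyperparameters (illustrative)
    n_h, n_t = 7, 3       # observed counts (illustrative)

    prior_pred = a_h / (a_h + a_t)                                  # 0.5
    posterior_pred = (n_h + a_h) / (n_h + a_h + n_t + a_t)          # ~0.643
    map_est = (n_h + a_h - 1) / (n_h + a_h - 1 + n_t + a_t - 1)     # ~0.667
    ml_est = n_h / (n_h + n_t)                                      # 0.7

    print(prior_pred, posterior_pred, map_est, ml_est)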
Intuition
• The hyperparameters αh and αt can be thought of as imaginary counts from our prior experience, starting from "pure ignorance"
• Equivalent sample size = αh + αt
• The larger the equivalent sample size, the more confident we are about the true probability
Beta Distributions
[Figure: example Beta densities: Beta(0.5, 0.5), Beta(1, 1), Beta(3, 2), Beta(19, 39)]
Assessment of a Beta Distribution

Method 1: Equivalent sample
- assess αh and αt, or
- assess αh + αt and αh/(αh + αt)

Method 2: Imagined future samples
p(heads) = 0.2 and p(heads | 3 heads) = 0.5  ⟹  αh = 1, αt = 4
check: 0.2 = 1/(1 + 4),  0.5 = (1 + 3)/(1 + 3 + 4)
Generalization to m Outcomes (Multinomial Distribution)

Dirichlet distribution:
p(θ1, …, θm) = Dirichlet(α1, …, αm) ∝ ∏_{i=1}^{m} θi^{αi − 1},   αi > 0,   ∑_{i=1}^{m} θi = 1

Properties:
E(θi) = αi / ∑_{k=1}^{m} αk

Resulting posterior distribution:
p(θ | N1, …, Nm) ∝ ∏_{i=1}^{m} θi^{αi + Ni − 1}
Other Distributions
Likelihoods from the exponential family
• Binomial
• Multinomial
• Poisson
• Gamma
• Normal