Lecture 4 (Nov 30)
Statistical Decision Theory
Bayes’ theorem:
For discrete events
Let B1, …, BM be mutually exclusive events with ∪_{j=1}^{M} Bj = S (the sample space). Then

Pr(Bi | A) = Pr(A | Bi)·Pr(Bi) / ∑_{j=1}^{M} Pr(A | Bj)·Pr(Bj)
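As a quick numerical illustration of the discrete version (all numbers below are made up for the example), a minimal Python sketch:

```python
# Discrete Bayes' theorem: Pr(B_i | A) = Pr(A | B_i) Pr(B_i) / sum_j Pr(A | B_j) Pr(B_j)
# Hypothetical numbers for three mutually exclusive events covering the sample space.
prior = [0.5, 0.3, 0.2]        # Pr(B_j)
lik = [0.10, 0.40, 0.80]       # Pr(A | B_j)

evidence = sum(l * p for l, p in zip(lik, prior))            # Pr(A)
posterior = [l * p / evidence for l, p in zip(lik, prior)]   # Pr(B_j | A)
print(posterior)  # the posterior probabilities sum to 1
```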
For probability density functions
f_{Y|X}(y | x) = f_{X|Y}(x | y)·f_Y(y) / ∫ f_{X|Y}(x | z)·f_Y(z) dz = f_{X|Y}(x | y)·f_Y(y) / f_X(x)
The Bayesian “philosophy”
The classical approach (frequentist’s view):
The random sample X = (X1, …, Xn) is assumed to come from a distribution with a probability density function f(x; θ), where θ is an unknown but fixed parameter.
The sample is investigated from its random-variable properties relating to f(x; θ). The uncertainty about θ is assessed solely on the basis of the sample properties.
The Bayesian approach:
The random sample X = (X1, …, Xn) is assumed to come from a distribution with a probability density function f(x; θ), where the uncertainty about θ is modelled with a probability distribution (i.e. a p.d.f.), called the prior distribution.
The obtained values of the sample, i.e. x = (x1, …, xn), are used to update the information from the prior distribution to a posterior distribution for θ.
Main differences:
In the classical approach, θ is fixed, while in the Bayesian approach θ is a random variable.
In the classical approach the focus is on the sampling distribution of X, while in the Bayesian approach the focus is on the variation of θ.
Bayesian: “What we observe is fixed, what we do not observe is
random.”
Frequentist: “What we observe is random, what we do not
observe is fixed.”
Concepts of the Bayesian framework
Prior density: p(θ)
Likelihood: L(θ; x) ("as before")
Posterior density: q(θ | x) = q(θ; x) (the book uses the second notation)
Relation through Bayes’ theorem:
q(θ | x) = f_X(x | θ)·p(θ) / ∫ f_X(x | λ)·p(λ) dλ = L(θ; x)·p(θ) / ∫ L(λ; x)·p(λ) dλ

In the book's notation, the marginal density f_X(x) = ∫ f_X(x; λ)·p(λ) dλ is written h(x), so the textbook writes

q(θ | x) = q(θ; x) = L(θ; x)·p(θ) / h(x)
Still, the posterior is referred to as the distribution of θ conditional on x.
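The relation q(θ | x) ∝ L(θ; x)·p(θ) is easy to use numerically: evaluate likelihood times prior over a grid of θ-values and normalize. A minimal sketch; the normal model and all numbers below are assumptions chosen only for illustration:

```python
import numpy as np

# Grid approximation of the posterior: q(theta | x) ∝ L(theta; x) * p(theta).
# Assumed setup: x_i ~ N(theta, 1) with a N(0, 2^2) prior on theta.
rng = np.random.default_rng(1)
x = rng.normal(loc=1.5, scale=1.0, size=20)        # simulated "observed" sample

theta = np.linspace(-5, 5, 2001)                   # grid over the parameter space
log_lik = np.array([-0.5 * np.sum((x - t) ** 2) for t in theta])  # log L(theta; x), constants dropped
log_prior = -0.5 * (theta / 2.0) ** 2              # log p(theta), constants dropped

log_post = log_lik + log_prior
post = np.exp(log_post - log_post.max())           # unnormalized posterior, numerically stable
post /= np.trapz(post, theta)                      # normalize: h(x) is just the normalizing constant

print("posterior mean:", np.trapz(theta * post, theta))
```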
Decision-theoretic elements
1. One of a number of actions should be decided on.
2. State of nature: A number of states possible. Usually represented by θ.
3. For each state of nature the relative desirability of each of the
different actions possible can be quantified
4. Prior information for the different states of nature may be available: a prior distribution of θ.
5. Data may be available. Usually represented by x. Can be used to
update the knowledge about the relative desirability of (each of) the
different actions.
In mathematical notation for this course:
True state of nature: θ. The uncertainty about it is described by the prior p(θ).
Data: x, an observation of X, whose p.d.f. depends on θ (data is thus assumed to be available).
Decision procedure: δ. The decision procedure becomes an action when applied to given data x.
Action: δ(x).
Loss function: L_S(θ, δ(x)), which measures the loss from taking action δ(x) when θ holds.
Risk function:

R(θ, δ) = ∫ L_S(θ, δ(x))·L(θ; x) dx = E_X[ L_S(θ, δ(X)) ]
Note that the risk function is the expected loss with respect to the joint distribution of X1, …, Xn.
Note also that the risk function is defined for the decision procedure δ, and not for the particular action δ(x).
Admissible procedures:
A procedure δ1 is inadmissible if there exists another procedure δ2 such that R(θ, δ2) ≤ R(θ, δ1) for all values of θ, with strict inequality for at least one θ.
A procedure which is not inadmissible (i.e. no other procedure with a lower risk function for any θ can be found) is said to be admissible.
Minimax procedure:
A procedure δ* is a minimax procedure if

R(θ, δ*) = min_δ max_θ R(θ, δ)

i.e. θ is chosen to be the "worst" possible value, and under that value the procedure that gives the lowest possible risk is chosen.
The minimax procedure uses no prior information about θ; thus it is not a Bayesian procedure.
Example
Suppose you are about to make a decision on whether you should buy or
rent a new TV.
δ1 = "Buy the TV"
δ2 = "Rent the TV"
Now, assume θ is the mean time until the TV breaks down for the first time.
Let θ assume three possible values: 6, 12 and 24 months.
The cost of the TV is $500 if you buy it and $30 per month if you rent it
If the TV breaks down after 12 months you will have to replace it at the same cost as you bought it for, if you bought it. If you rented it, you will get a new TV at no cost provided you continue your contract.
Let X be the time in months until the TV breaks down, and assume this variable is exponentially distributed with mean θ.
A loss function for an ownership of maximum 24 months may be
defined as
L_S(θ, δ1(X)) = 500 + 500·H(X − 12) and
L_S(θ, δ2(X)) = 30·24 = 720

where H(u) is the Heaviside step function: H(u) = 1 if u > 0 and H(u) = 0 otherwise.
Then

R(θ, δ1) = E_X[ 500 + 500·H(X − 12) ] = 500 + 500·∫_{12}^{∞} (1/θ)·e^{−x/θ} dx = 500 + 500·e^{−12/θ} = 500·(1 + e^{−12/θ})

R(θ, δ2) = 720
Now compare the risks for the three possible values of θ:

θ      R(θ, δ1)   R(θ, δ2)
6        568        720
12       684        720
24       803        720
Clearly the risk for the first procedure increases with θ, while the risk for the second is constant. In searching for the minimax procedure we therefore focus on the largest possible value of θ, where δ2 has the smallest risk.
⟹ δ2 is the minimax procedure.
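The comparison is easy to reproduce numerically. A minimal Python sketch using the loss functions and the exponential model stated above (nothing beyond the example's own numbers is assumed):

```python
import math

# Risks in the TV example: X ~ Exp(mean theta), loss 500 + 500*H(X - 12) for "buy",
# constant 720 for "rent", so R(theta, d1) = 500*(1 + exp(-12/theta)) and R(theta, d2) = 720.
thetas = [6, 12, 24]
R1 = {t: 500 * (1 + math.exp(-12 / t)) for t in thetas}
R2 = {t: 720.0 for t in thetas}

for t in thetas:
    print(t, round(R1[t]), round(R2[t]))     # reproduces the table: 568/684/803 vs 720

# Minimax: pick the procedure with the smallest worst-case (maximum over theta) risk.
worst = {"d1 (buy)": max(R1.values()), "d2 (rent)": max(R2.values())}
print("minimax procedure:", min(worst, key=worst.get))   # -> d2 (rent)
```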
Bayes procedure
Bayes risk:

B(δ) = ∫ R(θ, δ)·p(θ) dθ

It uses the prior distribution of the unknown parameter θ.
A Bayes procedure is a procedure that minimizes the Bayes risk:

δ_B = arg min_δ ∫ R(θ, δ)·p(θ) dθ
Example cont.
Assume the three possible values of θ (6, 12 and 24) have the prior probabilities 0.2, 0.3 and 0.5.
Then

B(δ1) = 500·[ (1 − e^{−12/6})·0.2 + (1 − e^{−12/12})·0.3 + (1 − e^{−12/24})·0.5 ] ≈ 280
B(δ2) = 720 (since R(θ, δ2) does not depend on θ)
Thus the Bayes risk is minimized by δ1, and therefore δ1 is the Bayes procedure.
Decision theory applied on point estimation
The action is a particular point estimator θ̂ = θ̂(x).
The state of nature is the true value of θ.
The loss function is a measure of how good (desirable) the estimator is of θ: L_S = L_S(θ, θ̂).
Prior information is quantified by the prior distribution (p.d.f.) p(θ).
Data is the random sample x from a distribution with p.d.f. f(x; θ).
Three simple loss functions
Zero-one loss:
L_S(θ, θ̂) = 0 if |θ̂ − θ| ≤ b, and = a if |θ̂ − θ| > b   (a, b > 0)

Absolute error loss:
L_S(θ, θ̂) = a·|θ̂ − θ|   (a > 0)

Quadratic (error) loss:
L_S(θ, θ̂) = a·(θ̂ − θ)²   (a > 0)
Minimax estimators:
Find the value of θ that maximizes the expected loss with respect to the sample values, i.e. that maximizes

E_X[ L_S(θ, θ̂(X)) ]

over the set of estimators θ̂(X).
Then, the particular estimator that minimizes the risk for that value of θ is the minimax estimator.
Not so easy to find!
Bayes estimators
A Bayes estimator is the estimator θ̂ that minimizes the Bayes risk

∫ R(θ, θ̂)·p(θ) dθ = ∫ [ ∫ L_S(θ, θ̂(x))·L(θ; x) dx ]·p(θ) dθ
= ∫ ∫ L_S(θ, θ̂(x))·L(θ; x)·p(θ) dθ dx
= ∫ ∫ L_S(θ, θ̂(x))·q(θ; x)·h(x) dθ dx
= ∫ h(x)·[ ∫ L_S(θ, θ̂(x))·q(θ; x) dθ ] dx

For any given value of x, what has to be minimized is therefore

∫ L_S(θ, θ̂(x))·q(θ; x) dθ
The Bayes philosophy is that the data (x) should be considered to be given, and therefore the minimization is carried out conditionally on the observed x.
Now, minimization with respect to the different loss functions will result in measures of location in the posterior distribution of θ:

Zero-one loss: θ̂(x) is the posterior mode of θ given x
Absolute error loss: θ̂(x) is the posterior median of θ given x
Quadratic loss: θ̂(x) is the posterior mean of θ given x
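Given a posterior evaluated on a grid (as in the earlier sketch), the three estimators can be read off directly. A minimal sketch; the Beta-shaped posterior used at the end is just a made-up example:

```python
import numpy as np

def bayes_estimates(theta, post):
    """Posterior mode, median and mean from a grid `theta` and a normalized density `post`."""
    mode = theta[np.argmax(post)]               # zero-one loss  -> posterior mode
    cdf = np.cumsum(post)
    cdf /= cdf[-1]
    median = theta[np.searchsorted(cdf, 0.5)]   # absolute error -> posterior median
    mean = np.trapz(theta * post, theta)        # quadratic loss -> posterior mean
    return mode, median, mean

# Example with a Beta(3, 5)-shaped posterior on (0, 1):
theta = np.linspace(0, 1, 2001)
post = theta ** 2 * (1 - theta) ** 4
post /= np.trapz(post, theta)
print(bayes_estimates(theta, post))   # roughly: mode 1/3, median 0.36, mean 3/8
```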
About prior distributions
Conjugate prior distributions
Example: Assume the parameter of interest is π, the proportion of some property of interest in the population (i.e. the probability for this property to occur).
A reasonable prior density for π is the Beta density:

p(π; α, β) = π^{α−1}·(1 − π)^{β−1} / B(α, β),   0 ≤ π ≤ 1

where α > 0 and β > 0 are two (constant) parameters and

B(α, β) = ∫₀¹ x^{α−1}·(1 − x)^{β−1} dx

is the so-called Beta function.
[Figure: Beta(α, β) densities on (0, 1) for (α, β) = (1,1), (5,5), (1,5), (5,1), (2,5), (5,2), (0.5, 0.5), (0.3, 0.7), (0.7, 0.3)]
Now, assume a sample of size n from the population in which y of the
values possess the property of interest.
The likelihood becomes

L(π; y) = (n choose y)·π^y·(1 − π)^{n−y}

and the posterior density

q(π; y) = L(π; y)·p(π) / ∫₀¹ L(x; y)·p(x) dx
= [ (n choose y)·π^y·(1 − π)^{n−y} · π^{α−1}·(1 − π)^{β−1} / B(α, β) ] / [ ∫₀¹ (n choose y)·x^y·(1 − x)^{n−y} · x^{α−1}·(1 − x)^{β−1} / B(α, β) dx ]
= π^{y+α−1}·(1 − π)^{n−y+β−1} / ∫₀¹ x^{y+α−1}·(1 − x)^{n−y+β−1} dx
= π^{y+α−1}·(1 − π)^{n−y+β−1} / B(y + α, n − y + β)
Thus, the posterior density is also a Beta density, with parameters y + α and n − y + β.
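A minimal numerical check of this conjugate update (the prior parameters and the data below are made-up numbers; scipy is assumed to be available just for the Beta density):

```python
import numpy as np
from scipy import stats

# Conjugate Beta-Binomial update: prior Beta(alpha, beta) and y successes in n trials
# give the posterior Beta(y + alpha, n - y + beta).
alpha, beta = 2.0, 5.0
n, y = 20, 8

post = stats.beta(y + alpha, n - y + beta)
print("conjugate posterior mean:", post.mean())   # (y + alpha) / (n + alpha + beta)

# Cross-check against a brute-force grid posterior (likelihood * prior, normalized):
pi = np.linspace(1e-6, 1 - 1e-6, 4001)
unnorm = pi ** y * (1 - pi) ** (n - y) * stats.beta(alpha, beta).pdf(pi)
grid_mean = np.trapz(pi * unnorm, pi) / np.trapz(unnorm, pi)
print("grid posterior mean:", grid_mean)          # agrees with the conjugate result
```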
Prior distributions that, combined with the likelihood, give a posterior in the same distributional family are called conjugate priors.
(Note that by a distributional family we mean distributions that go under
a common name: Normal distribution, Binomial distribution, Poisson
distribution etc. )
A conjugate prior always goes together with a particular likelihood to produce the posterior.
We sometimes refer to a conjugate pair of distributions meaning
(prior distribution, sample distribution = likelihood)
In particular, if the sample distribution, i.e. f(x; θ), belongs to the k-parameter exponential family of distributions:

f(x; θ) = exp{ ∑_{j=1}^{k} A_j(θ)·B_j(x) + C(x) + D(θ) }

we may put

p(θ) = exp{ ∑_{j=1}^{k} A_j(θ)·η_j + η_{k+1}·D(θ) + K(η_1, …, η_{k+1}) } ∝ exp{ ∑_{j=1}^{k} A_j(θ)·η_j + η_{k+1}·D(θ) }

where η_1, …, η_{k+1} are parameters of this prior distribution and K(η_1, …, η_{k+1}) is a function of η_1, …, η_{k+1} only (the normalizing constant).
Then

q(θ; x) ∝ L(θ; x)·p(θ)
= exp{ ∑_{j=1}^{k} A_j(θ)·∑_{i=1}^{n} B_j(x_i) + ∑_{i=1}^{n} C(x_i) + n·D(θ) } · exp{ ∑_{j=1}^{k} A_j(θ)·η_j + η_{k+1}·D(θ) + K(η_1, …, η_{k+1}) }
= exp{ ∑_{i=1}^{n} C(x_i) + K(η_1, …, η_{k+1}) } · exp{ ∑_{j=1}^{k} A_j(θ)·(η_j + ∑_{i=1}^{n} B_j(x_i)) + (η_{k+1} + n)·D(θ) }
∝ exp{ ∑_{j=1}^{k} A_j(θ)·(η_j + ∑_{i=1}^{n} B_j(x_i)) + (η_{k+1} + n)·D(θ) }
i.e. the posterior distribution is of the same form as the prior distribution but with parameters

η_1 + ∑_{i=1}^{n} B_1(x_i), …, η_k + ∑_{i=1}^{n} B_k(x_i), η_{k+1} + n

instead of

η_1, …, η_k, η_{k+1}
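As a worked instance of this result (the hyperparameter symbols η_1, η_2 follow the general form above), the Gamma-Poisson pair in the table below can be derived directly. The Poisson p.m.f. is a one-parameter exponential family,

f(x; λ) = e^{−λ}·λ^x / x! = exp{ x·ln λ − ln x! − λ }

so A_1(λ) = ln λ, B_1(x) = x, C(x) = −ln x! and D(λ) = −λ. The prior of the general form is then

p(λ) ∝ exp{ η_1·ln λ − η_2·λ } = λ^{η_1}·e^{−η_2·λ}

i.e. a Gamma(η_1 + 1, η_2) density, and the updating rule gives posterior parameters η_1 + ∑ x_i and η_2 + n, i.e. a Gamma(η_1 + 1 + ∑ x_i, η_2 + n) density. With α = η_1 + 1 and β = η_2 this is exactly the Gamma-Poisson row of the table below.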
Some common cases:
Conjugate prior                      Sample distribution                     Posterior
Beta: π ~ Beta(α, β)                 Binomial: X ~ Bin(n, π)                 π | x ~ Beta(α + x, β + n − x)
Normal: θ ~ N(μ, τ²)                 Normal, known σ²: X_i ~ N(θ, σ²)        θ | x ~ N( (σ²·μ + n·τ²·x̄)/(σ² + n·τ²), (σ²·τ²)/(σ² + n·τ²) )
Gamma: λ ~ Gamma(α, β)               Poisson: X_i ~ Po(λ)                    λ | x ~ Gamma(α + ∑ x_i, β + n)
Pareto: p(θ) ∝ θ^{−α}, θ ≥ β         Uniform: X_i ~ U(0, θ)                  Pareto: q(θ; x) ∝ θ^{−(n+α)}, θ ≥ max(β, x_(n))
Example
Assume we have a sample x = (x1, …, xn) from U(0, θ) and that the prior density for θ is the Pareto density

p(θ) = (α − 1)·β^{α−1}·θ^{−α},   θ ≥ β   (α > 1, β > 0)

What is the Bayes estimator of θ under quadratic loss?
The Bayes estimator is the posterior mean.
The posterior distribution is also Pareto, with

q(θ; x) = (n + α − 1)·max(β, x_(n))^{n+α−1}·θ^{−(n+α)},   θ ≥ max(β, x_(n))
E[θ | x] = ∫_{max(β, x_(n))}^{∞} θ·(n + α − 1)·max(β, x_(n))^{n+α−1}·θ^{−(n+α)} dθ
= (n + α − 1)·max(β, x_(n))^{n+α−1}·∫_{max(β, x_(n))}^{∞} θ^{−(n+α−1)} dθ
= (n + α − 1)·max(β, x_(n))^{n+α−1}·[ −θ^{−(n+α−2)}/(n + α − 2) ]_{max(β, x_(n))}^{∞}
= (n + α − 1)·max(β, x_(n))^{n+α−1}·( max(β, x_(n))^{−(n+α−2)}/(n + α − 2) − 0 )
= ((n + α − 1)/(n + α − 2))·max(β, x_(n)) = θ̂_B

Compare with θ̂_ML = x_(n).
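A small numerical illustration of the two estimators; the simulated data and the prior parameters α, β below are assumptions made only for the example:

```python
import numpy as np

# Bayes estimator (posterior mean under quadratic loss) for U(0, theta) data with a
# Pareto prior p(theta) ∝ theta^(-alpha), theta >= beta:
#   theta_B = (n + alpha - 1) / (n + alpha - 2) * max(beta, x_(n))
# compared with the ML estimator theta_ML = x_(n), the sample maximum.
rng = np.random.default_rng(2)
true_theta = 10.0
x = rng.uniform(0, true_theta, size=15)

alpha, beta = 3.0, 1.0                    # assumed prior parameters
n, xmax = len(x), x.max()

theta_ml = xmax
theta_bayes = (n + alpha - 1) / (n + alpha - 2) * max(beta, xmax)
print("ML:", theta_ml, "  Bayes:", theta_bayes)   # the Bayes estimate sits above the sample maximum
```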
Non-informative priors (uninformative)
A prior distribution that gives no more information about θ than possibly the parameter space is called a non-informative or uninformative prior.
Example: Beta(1,1) for an unknown proportion simply says that the
parameter can be any value between 0 and 1 (which coincides with its
definition)
A non-informative prior is characterized by the property that all values in
the parameter space are equally likely.
Proper non-informative priors:
The prior is a true density or mass function.
[Figure: examples of proper non-informative priors]

Improper non-informative priors:
The prior is a constant value over R^k.
Example: a "normal prior with infinite variance", i.e. a constant density over the whole real line, for the mean of a normal population.
Decision theory applied on hypothesis testing
Test of H0: θ = θ0 vs. H1: θ = θ1.
Decision procedure: δ_C = use a test with critical region C.
Action: δ_C(x) = "Reject H0 if x ∈ C, otherwise accept H0".
Loss function:

            H0 true    H1 true
Accept H0      0          b
Reject H0      a          0
Risk function:

R(δ_C; θ) = E_X[ L_S(θ, δ_C(X)) ]
= (loss when rejecting H0 for true value θ)·Pr(X ∈ C | θ) + (loss when accepting H0 for true value θ)·Pr(X ∉ C | θ)

⟹ R(δ_C; θ0) = a·α + 0·(1 − α) = a·α
   R(δ_C; θ1) = 0·(1 − β) + b·β = b·β

where α = Pr(X ∈ C | θ0) is the size of the test and β = Pr(X ∉ C | θ1) is the probability of a type II error.
Assume a prior setting p0 = Pr(H0 is true) = Pr(θ = θ0) and p1 = Pr(H1 is true) = Pr(θ = θ1).
The prior expected risk becomes

E_θ[ R(δ_C; θ) ] = a·α·p0 + b·β·p1
Bayes test:

δ_B = arg min_C E_θ[ R(δ_C; θ) ] = arg min_C (a·α·p0 + b·β·p1)

Minimax test:

δ* = arg min_C max_θ R(δ_C; θ) = arg min_C max(a·α, b·β)
Lemma 6.1: Bayes tests and most powerful tests (Neyman-Pearson
lemma) are equivalent in that
every most powerful test is a Bayes test for some values of p0 and p1 and
every Bayes test is a most powerful test with critical region

C = { x : L(θ1; x)/L(θ0; x) > p0·a/(p1·b) }
Example:
Assume x = (x1, x2) is a random sample from Exp(θ), i.e.

f(x; θ) = (1/θ)·e^{−x/θ},   x ≥ 0;  θ > 0

We would like to test H0: θ = 1 vs. H1: θ = 2 with a Bayes test with losses a = 2 and b = 1 and with prior probabilities p0 and p1.
The likelihood ratio is

L(θ1; x)/L(θ0; x) = [ (1/2)·e^{−x1/2} · (1/2)·e^{−x2/2} ] / [ e^{−x1} · e^{−x2} ] = e^{x1+x2} / (4·e^{(x1+x2)/2}) = (1/4)·e^{(x1+x2)/2}

The Bayes test rejects H0 when this ratio exceeds p0·a/(p1·b) = 2·p0/p1, i.e. when

(1/4)·e^{(x1+x2)/2} > 2·p0/p1  ⟺  x1 + x2 > 2·ln(8·p0/p1)
Now, with t denoting the critical value for x1 + x2,

P(X1 + X2 > t) = ∫_{x1=0}^{t} ∫_{x2=t−x1}^{∞} (1/θ)·e^{−x1/θ}·(1/θ)·e^{−x2/θ} dx2 dx1 + ∫_{x1=t}^{∞} (1/θ)·e^{−x1/θ} dx1
= ∫_{x1=0}^{t} (1/θ)·e^{−x1/θ}·e^{−(t−x1)/θ} dx1 + e^{−t/θ}
= e^{−t/θ}·∫_{x1=0}^{t} (1/θ) dx1 + e^{−t/θ}
= (t/θ)·e^{−t/θ} + e^{−t/θ} = (1 + t/θ)·e^{−t/θ}
The size of the test is therefore

α = P(X1 + X2 > 2·ln(8·p0/p1) | θ = θ0 = 1) = (1 + 2·ln(8·p0/p1))·e^{−2·ln(8·p0/p1)} = (1 + 2·ln(8·p0/p1))·(p1/(8·p0))²

A fixed size α gives a condition on p0 and p1, and a certain choice will then give a minimized β.
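A minimal sketch that evaluates this test numerically; the prior probabilities below are assumptions, and the threshold is the critical value derived above:

```python
import math

# Bayes test of H0: theta = 1 vs H1: theta = 2 for x = (x1, x2) from Exp(mean theta),
# with losses a = 2, b = 1: reject H0 when x1 + x2 > t, where t = 2*ln(8*p0/p1).
p0, p1 = 0.5, 0.5                       # assumed prior probabilities of H0 and H1

t = 2 * math.log(8 * p0 / p1)           # critical value for x1 + x2
size = (1 + t) * math.exp(-t)           # alpha = P(X1 + X2 > t | theta = 1)
power = (1 + t / 2) * math.exp(-t / 2)  # P(X1 + X2 > t | theta = 2)

print(f"t = {t:.3f}, size = {size:.4f}, power = {power:.4f}")
```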
Sequential probability ratio test (SPRT)
Suppose that we consider the sampling to be the observation of values in
a “stream” x1, x2, … , i.e. we do not consider a sample with fixed size.
We would like to test H0: θ = θ0 vs. H1: θ = θ1.
After n observations have been taken we have x_n = (x1, …, xn), and we put

LR(n) = L(θ1; x_n) / L(θ0; x_n)

as the current test statistic.
The frequentist approach:
Specify two numbers K1 and K2, not depending on n, such that 0 < K1 < K2 < ∞.
Then
If LR(n) ≤ K1 ⟹ Stop sampling, accept H0
If LR(n) ≥ K2 ⟹ Stop sampling, reject H0
If K1 < LR(n) < K2 ⟹ Take another observation
Usual choice of K1 and K2 (Property 6.3):
If the size α and the power 1 − β are pre-specified, put

K1 = β/(1 − α) and K2 = (1 − β)/α

This gives approximate true size α and approximate true power 1 − β.
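A minimal sketch of the frequentist SPRT for the exponential hypotheses used in the earlier example (H0: θ = 1 vs. H1: θ = 2); the chosen α and β and the simulated data stream are assumptions for illustration:

```python
import math
import random

# Wald's SPRT with K1 = beta/(1 - alpha), K2 = (1 - beta)/alpha (Property 6.3),
# applied to a stream of Exp(mean theta) observations, f(x; theta) = (1/theta) e^(-x/theta).
alpha, beta = 0.05, 0.10
log_K1 = math.log(beta / (1 - alpha))
log_K2 = math.log((1 - beta) / alpha)

theta0, theta1 = 1.0, 2.0
random.seed(3)

log_lr, n = 0.0, 0
while log_K1 < log_lr < log_K2:
    x = random.expovariate(1 / theta0)   # observations simulated under H0 here
    # log of f(x; theta1)/f(x; theta0) for the mean-parameterized exponential density
    log_lr += math.log(theta0 / theta1) + x * (1 / theta0 - 1 / theta1)
    n += 1

decision = "reject H0" if log_lr >= log_K2 else "accept H0"
print(f"stopped after n = {n} observations: {decision}")
```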
The Bayesian approach:
The structure is the same, but the choices of K1 and K2 are different.
Let c be the cost of taking one observation, and let, as before, a and b be the loss values for taking the wrong decisions, and p0 and p1 be the prior probabilities of H0 and H1 respectively.
Then the Bayesian choices of K1 and K2 are
(K1, K2) = arg min_{(k1, k2)} { p0·[ a·(1 − k1)/(k2 − k1) + (c/μ0)·((k2 − 1)·ln k1 + (1 − k1)·ln k2)/(k2 − k1) ]
+ p1·[ b·k1·(k2 − 1)/(k2 − k1) + (c/μ1)·(k1·(k2 − 1)·ln k1 + k2·(1 − k1)·ln k2)/(k2 − k1) ] }
where

μ0 = E[ ln( f(X; θ1)/f(X; θ0) ) | H0 ] and μ1 = E[ ln( f(X; θ1)/f(X; θ0) ) | H1 ]