Basic principles of probability theory



Resampling techniques
• Why resampling?
• Jackknife
• Cross-validation
• Bootstrap
• Exercises
Why resampling?
The purpose of statistics is to estimate some parameter(s) and their reliability. Since estimators are
functions of the sample points, they are random variables. If we could find the distribution of
this random variable (the sample statistic) then we could estimate the reliability of the estimators.
Unfortunately, apart from the simplest cases, the sampling distribution is not easy to derive.
There are several techniques to approximate these distributions, including the Edgeworth
series, the Laplace approximation, saddle-point approximations and others. These
approximations give an analytical form for the approximate distributions. With the advent of
computers, more computationally intensive methods are emerging. They work satisfactorily in many
cases.
If we had the sampling distribution of a sample statistic then we could estimate the variance
of the estimator, construct interval estimates and even test hypotheses. Examples of the simplest cases where the sampling
distribution is known include:
1) Sample mean when the sample is from a normal distribution: it has a normal distribution with
mean equal to the population mean and variance equal to the population variance divided
by the sample size, if the population variance is known. If the population variance is not known, the
variance of the sample mean is estimated by the sample variance divided by n (a simulation sketch follows this list).
2) Sample variance: it has the distribution of a multiple of the χ² distribution. Again this is valid if the
population distribution is normal.
3) Sample mean divided by the square root of the sample variance: it has a multiple of the t
distribution, again in the normal case.
4) The ratio of two sample variances has a multiple of the F-distribution.
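As a minimal illustration of case 1), the following simulation sketch (Python with numpy; the parameter values and sample size are assumptions chosen for this example) checks that the variance of the sample mean of normal data is close to the population variance divided by n:

```python
import numpy as np

# A minimal sketch: verify by simulation that the sample mean of normal data
# has variance sigma^2 / n (case 1 above). The values of mu, sigma, n and the
# number of repeats are illustrative assumptions.
rng = np.random.default_rng(0)
mu, sigma, n, n_repeats = 5.0, 2.0, 25, 10_000

# Draw many samples of size n and record each sample mean
sample_means = rng.normal(mu, sigma, size=(n_repeats, n)).mean(axis=1)

print("empirical variance of the sample mean:", sample_means.var(ddof=1))
print("theoretical value sigma^2 / n:        ", sigma**2 / n)
```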
Jackknife
The jackknife is used for bias removal. As we know, the mean-square error is the sum of the
squared bias and the variance of the estimator. If the bias is much larger than the variance
then, under some circumstances, the jackknife can be used.
Description of the jackknife: Let us assume that we have a sample of size n. We estimate
some sample statistic using all the data, giving t_n. Then, removing one point at a time,
we estimate t_{n-1,i}, where the subscripts indicate the size of the sample and the index of
the removed sample point. The new estimator is then derived as:
t' = n t_n - (n-1)\,\bar{t}_{n-1}, \qquad \text{where } \bar{t}_{n-1} = \frac{1}{n}\sum_{i=1}^{n} t_{n-1,i}
If the order of the bias of the statistic t_n is O(n^{-1}), then after the jackknife the order of the bias
becomes O(n^{-2}).
Variance is estimated using:
\hat{V}_J = \frac{n-1}{n}\sum_{i=1}^{n}\left(t_{n-1,i} - \bar{t}_{n-1}\right)^2
This procedure can be applied iteratively, i.e. the jackknife can be applied again to the new
estimator. The first application of the jackknife can reduce the bias without changing the
variance of the estimator, but its second and higher-order applications can in
general increase the variance of the estimator.
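A minimal sketch of the jackknife bias correction and variance estimate described above (Python with numpy; the statistic used here, the biased maximum-likelihood variance, and the simulated data are illustrative assumptions):

```python
import numpy as np

def jackknife(data, statistic):
    """Jackknife bias-corrected estimate and variance of a statistic."""
    data = np.asarray(data)
    n = len(data)
    t_n = statistic(data)                        # estimate t_n from the full sample
    # Leave-one-out estimates t_{n-1,i}
    t_loo = np.array([statistic(np.delete(data, i)) for i in range(n)])
    t_bar = t_loo.mean()                         # \bar{t}_{n-1}
    t_jack = n * t_n - (n - 1) * t_bar           # bias-corrected estimator t'
    var_jack = (n - 1) / n * np.sum((t_loo - t_bar) ** 2)
    return t_jack, var_jack

rng = np.random.default_rng(1)
sample = rng.normal(0.0, 1.0, size=30)
biased_var = lambda x: x.var()                   # divides by n, so it has O(1/n) bias
print(jackknife(sample, biased_var))
```

For this particular choice of statistic the bias-corrected value coincides with the usual unbiased sample variance, which is the classical textbook example of the jackknife.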
Cross-validation
Cross-validation is a resampling technique to overcome overfitting.
Let us consider the least-squares technique. Assume that we have a sample of size n,
y = (y_1, y_2, ..., y_n). We want to estimate parameters θ = (θ_1, θ_2, ..., θ_m). Now let us further assume
that the mean value of the observations is a function of these parameters (we may not know
the form of this function). Then we can postulate that the function has a form g, and we can find the
values of the parameters using the least-squares technique:
h = \sum_{i=1}^{n}\left(y_i - g(x_{i1}, x_{i2}, \ldots, x_{im}, \theta_1, \theta_2, \ldots, \theta_m)\right)^2
where x is a fixed matrix or random variables. After applying this technique we will have values of the
parameters and therefore the form of the function. The form of the function g defines the model we want
to use. We may have several candidate forms of the function. Obviously, if we have more parameters the
fit will be “better”. The question is what would happen if we were to observe new values of the
observations. Using the estimated values of the parameters we could calculate the squared
differences. Let us say we have new observations (y_{n+1}, ..., y_{n+l}). Can our function predict the new
observations? Which function predicts better? To answer these questions we can
calculate the new differences:
PE = \sum_{i=1}^{l}\left(y_{n+i} - g(x_{(n+i)1}, \ldots, x_{(n+i)m}, \theta_1, \ldots, \theta_m)\right)^2
where PE is the prediction error. The function g that gives the smallest value of PE has the higher
predictive power. A function that gives a smaller h but a larger PE is called an overfitted
function.
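A minimal sketch of this comparison (Python with numpy; the simulated data and the two polynomial models are assumptions made for illustration), in which the more flexible model typically attains a smaller h on the original sample but a larger PE on new observations:

```python
import numpy as np

# Illustrative sketch: compare a low- and a high-degree polynomial fit.
# The data-generating model, noise level and degrees are assumptions.
rng = np.random.default_rng(2)
x = np.linspace(0, 1, 20)
y = 1.0 + 2.0 * x + rng.normal(0, 0.2, size=x.size)              # original sample
x_new = np.linspace(0, 1, 10)
y_new = 1.0 + 2.0 * x_new + rng.normal(0, 0.2, size=x_new.size)  # new observations

for degree in (1, 9):
    coeffs = np.polyfit(x, y, degree)                      # least-squares fit
    h = np.sum((y - np.polyval(coeffs, x)) ** 2)           # fit error on the sample
    pe = np.sum((y_new - np.polyval(coeffs, x_new)) ** 2)  # prediction error PE
    print(f"degree {degree}: h = {h:.3f}, PE = {pe:.3f}")
```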
Cross-validation: Cont.
When we choose the function using the current sample, how can we avoid overfitting? Cross-validation is an approach to deal with this problem.
Description of cross-validation: We have a sample of size n.
1) Divide the sample into K roughly equal-sized parts.
2) For the kth part, estimate the parameters using the K−1 other parts, i.e. excluding the kth part. Calculate the
prediction error for the kth part.
3) Repeat this for all k = 1, 2, ..., K and combine all prediction errors to get the cross-validation
prediction error.
If K = n then we have the leave-one-out cross-validation technique. Let us denote the estimate at the
kth step by θ^k (we will use the vector form). Let the kth subset of the sample be A_k and the number
of points in this subset be N_k. Then the prediction error calculated per observation is:
PE = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{N_k}\sum_{i \in A_k}\left(y_i - g(x_i, \theta^{k})\right)^2
Then we choose the function that gives the smallest prediction error. We can expect that when we
have new observations in the future, this function will give the smallest prediction error.
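A minimal K-fold cross-validation sketch following the three steps above (Python with numpy; the candidate polynomial models, K = 5 and the simulated data are illustrative assumptions):

```python
import numpy as np

def cv_prediction_error(x, y, fit, predict, K=5, seed=0):
    """K-fold cross-validation prediction error per observation."""
    n = len(y)
    indices = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(indices, K)                # K roughly equal parts A_k
    pe = 0.0
    for test_idx in folds:
        train_idx = np.setdiff1d(indices, test_idx)   # the other K-1 parts
        params = fit(x[train_idx], y[train_idx])      # estimate theta^k
        resid = y[test_idx] - predict(x[test_idx], params)
        pe += np.mean(resid ** 2)                     # (1/N_k) * sum of squared errors
    return pe / K

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 40)
y = 1.0 + 2.0 * x + rng.normal(0, 0.2, size=x.size)

# Compare two candidate models by their cross-validation prediction error
for degree in (1, 9):
    pe = cv_prediction_error(
        x, y,
        fit=lambda xs, ys, d=degree: np.polyfit(xs, ys, d),
        predict=lambda xs, p: np.polyval(p, xs),
    )
    print(f"degree {degree}: CV prediction error = {pe:.4f}")
```

The model with the smaller cross-validation prediction error would be selected.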
This technique is widely used in modern statistical analysis. It is not restricted to the least-squares
technique: instead of least squares we could use any other criterion that depends on the
distribution of the observations. It can in principle be applied to various maximum-likelihood and other estimators.
Cross-validation is useful for model selection, i.e. if we have several models we can use cross-validation to select one of them.
Bootstrap
The bootstrap is a computationally very expensive technique. In a very simple form it
works as follows.
We have a sample of size n. We want to estimate some parameter θ. The estimator for this
parameter gives t. To each sample point we assign a probability (usually 1/n, i.e. all sample
points have equal probability). Then from this sample we draw, with replacement,
another random sample of size n and estimate θ. Let us denote the estimate of the parameter
at the jth resampling stage by t_j^*. The bootstrap estimator of θ and its variance are calculated
as:
t_B^* = \frac{1}{B}\sum_{j=1}^{B} t_j^* \qquad \text{and} \qquad V_B(t_B^*) = \frac{1}{B-1}\sum_{j=1}^{B}\left(t_j^* - t_B^*\right)^2
This is a very simple form of the application of bootstrap resampling. For parameter estimation
the number of bootstrap resamples B is usually chosen to be around 200.
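A minimal sketch of this simple non-parametric bootstrap (Python with numpy; the data, the choice of statistic and B = 200 are illustrative assumptions consistent with the text):

```python
import numpy as np

def bootstrap_estimate(data, statistic, B=200, seed=0):
    """Bootstrap estimate t_B* of a statistic and its variance V_B."""
    data = np.asarray(data)
    rng = np.random.default_rng(seed)
    n = len(data)
    # Draw B resamples of size n with replacement (each point has probability 1/n)
    t_star = np.array([
        statistic(rng.choice(data, size=n, replace=True)) for _ in range(B)
    ])
    t_b = t_star.mean()                                # bootstrap estimator t_B*
    v_b = np.sum((t_star - t_b) ** 2) / (B - 1)        # its variance V_B
    return t_b, v_b

rng = np.random.default_rng(4)
sample = rng.exponential(scale=2.0, size=50)
print(bootstrap_estimate(sample, np.median))           # e.g. bootstrap for the median
```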
Let us analyse how the bootstrap works in one simple case. Consider a random variable X with
sample space x = (x_1, ..., x_M). Each point has probability f_j, i.e.
P(X = x_j) = f_j
f = (f_1, ..., f_M) represents the distribution of the population. A sample of size n will have relative
frequencies for each sample point,
\hat{f} = (\hat{f}_1, \ldots, \hat{f}_M)
Bootstrap: Cont.
Then the distribution of \hat{f} conditional on f is a multinomial distribution:
\hat{f} \mid f \sim Mn(n, f)
The multinomial distribution is the extension of the binomial distribution and is expressed as:
P\left(X = (x_1, x_2, \ldots, x_M)\right) = \frac{n!}{x_1! \cdots x_M!}\, f_1^{x_1} \cdots f_M^{x_M}, \qquad \sum_{j=1}^{M} x_j = n, \qquad \sum_{j=1}^{M} f_j = 1
The limiting distribution of \hat{f} - f is a multinormal distribution. If we resample from the given sample then we should consider
the conditional distribution of the following (which is also a multinomial distribution):
\hat{f}^* \mid \hat{f} \sim Mn(n, \hat{f})
The limiting distribution of
\hat{f}^* - \hat{f}
is the same as that of \hat{f} - f for the original sample. Since these two distributions
converge to the same distribution, well-behaved functions of them will also have the
same limiting distributions. Thus, if we use the bootstrap to derive the distribution of a sample
statistic, we can expect that in the limit it will converge to the distribution of the sample
statistic, i.e. the following two functions will have the same limiting distributions:
t(\hat{f}^*, \hat{f}) \quad \text{and} \quad t(\hat{f}, f)
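A minimal simulation sketch of this argument (Python with numpy; the discrete population, the sample size and the statistic, here the centred mean, are assumptions made for illustration). It compares the spread of t(f̂*, f̂), computed from resamples of one observed sample, with that of t(f̂, f), computed from fresh samples drawn from the population:

```python
import numpy as np

rng = np.random.default_rng(5)
support = np.array([0.0, 1.0, 2.0, 3.0])
f = np.array([0.4, 0.3, 0.2, 0.1])              # population distribution f
n, B = 200, 2000

sample = rng.choice(support, size=n, p=f)       # one observed sample (gives f_hat)

# t(f_hat, f): sample mean minus population mean, over fresh samples from f
t_pop = [rng.choice(support, size=n, p=f).mean() - support @ f for _ in range(B)]
# t(f_hat*, f_hat): resample mean minus sample mean, over resamples from the sample
t_boot = [rng.choice(sample, size=n, replace=True).mean() - sample.mean()
          for _ in range(B)]

print("std of t(f_hat , f    ):", np.std(t_pop))
print("std of t(f_hat*, f_hat):", np.std(t_boot))
```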
Bootstrap: Cont.
If we could enumerate all possible resamples from our sample then we could build
the “ideal” bootstrap distribution. In practice, even with modern computers, this is
impossible to achieve. Instead, Monte Carlo simulation is used. It usually
works as follows:
1) Draw a random sample of size n with replacement from the given sample.
2) Estimate the parameter and get the estimate t_j.
3) Repeat this B times and build the frequency and cumulative distributions of t.
Bootstrap: Cont.
How do we build the cumulative distribution (it approximates our distribution function)? Consider
a sample of size n, x = (x_1, x_2, ..., x_n). Then the cumulative distribution is:
\hat{F}(x) = \frac{1}{n}\sum_{j=1}^{n} I(x_j \leq x)
where I denotes the indicator function:
I(x_j \leq x) = \begin{cases} 1 & \text{if } x_j \leq x \\ 0 & \text{otherwise} \end{cases}
Another way of building the cumulative distribution is to sort the data first so that
x_1 \leq x_2 \leq \cdots \leq x_n
and then build the cumulative distribution as:
F(t) = \frac{\max(j : x_j \leq t)}{n}
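A minimal sketch of both constructions of the empirical cumulative distribution (Python with numpy; the function names and the small example sample are chosen here for illustration):

```python
import numpy as np

def ecdf_indicator(data, x):
    """F_hat(x) = (1/n) * sum over j of I(x_j <= x)."""
    return np.mean(np.asarray(data) <= x)

def ecdf_sorted(data, t):
    """F(t) = max(j : x_(j) <= t) / n, using the sorted sample."""
    xs = np.sort(np.asarray(data))
    return np.searchsorted(xs, t, side="right") / len(xs)

sample = np.array([0.3, 1.2, -0.5, 0.9, 2.1])
print(ecdf_indicator(sample, 1.0), ecdf_sorted(sample, 1.0))   # both give 0.6
```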
We can also build a histogram that approximates the density of the distribution. First we divide an
interval that contains our data into equal subintervals of length \Delta t. Assume that the centre of
the i-th subinterval is t_i. Then the histogram can be calculated using the formula:
h(t_i) = \frac{\min(k : x_k \geq t_i + \Delta t/2) - \max(j : x_j \leq t_i - \Delta t/2)}{n}
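A minimal sketch of building such a histogram from the sorted sample (Python with numpy; the bin width, the counting convention at the interval endpoints and the simulated data are assumptions made for illustration):

```python
import numpy as np

def histogram_counts(data, n_bins=10):
    """Relative frequency of points in equal-width intervals centred at t_i."""
    xs = np.sort(np.asarray(data))
    n = len(xs)
    dt = (xs[-1] - xs[0]) / n_bins                    # interval length Delta t
    centres = xs[0] + dt * (np.arange(n_bins) + 0.5)  # interval centres t_i
    counts = np.array([
        np.searchsorted(xs, t + dt / 2, side="right")
        - np.searchsorted(xs, t - dt / 2, side="right")
        for t in centres
    ])
    # Relative frequency per interval; dividing also by dt would give a density
    return centres, counts / n

rng = np.random.default_rng(6)
centres, h = histogram_counts(rng.normal(size=500))
print(np.round(h, 3))
```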
Once we have the distribution of the statistic we can use it for various purposes. Bootstrap
estimation of the parameter and its variance was one possible application. We can also
use this distribution for hypothesis testing, interval estimation, etc. For pure parameter
estimation we need to resample around 200 times. For interval estimation we might need
to resample around 2000 times. The reason is that for interval estimation and hypothesis
testing we need a more accurate distribution.
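A minimal sketch of a bootstrap percentile confidence interval built from the resampled distribution (Python with numpy; B = 2000 and the 95% level follow the text, while the data and the statistic are illustrative assumptions):

```python
import numpy as np

def bootstrap_percentile_ci(data, statistic, B=2000, alpha=0.05, seed=0):
    """Percentile confidence interval from the bootstrap distribution of a statistic."""
    data = np.asarray(data)
    rng = np.random.default_rng(seed)
    t_star = np.array([
        statistic(rng.choice(data, size=len(data), replace=True)) for _ in range(B)
    ])
    # Take the alpha/2 and 1 - alpha/2 quantiles of the resampled statistics
    return np.quantile(t_star, [alpha / 2, 1 - alpha / 2])

rng = np.random.default_rng(7)
sample = rng.normal(10.0, 3.0, size=40)
print(bootstrap_percentile_ci(sample, np.mean))   # approximate 95% interval for the mean
```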
Bootstrap: Cont.
Since in the resampling we did not use any assumptions about the population distribution, this
bootstrap is called the non-parametric bootstrap. If we have some idea about the population
distribution then we can use it in the resampling, i.e. when we draw randomly we can draw
from the assumed population distribution rather than from the sample. For example, if we know that the population
distribution is normal then we can estimate its parameters using our sample (sample
mean and variance). Then we can approximate the population distribution with this fitted
distribution and use it to draw new samples. As can be expected, if the assumption about the
population distribution is correct then the parametric bootstrap will perform better. If it is
not correct then the non-parametric bootstrap will outperform its parametric counterpart.
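A minimal sketch of the parametric bootstrap under a normal assumption, as described above (Python with numpy; the data and the choice of statistic are illustrative assumptions):

```python
import numpy as np

def parametric_bootstrap_normal(data, statistic, B=200, seed=0):
    """Parametric bootstrap: fit a normal distribution, then resample from it."""
    data = np.asarray(data)
    rng = np.random.default_rng(seed)
    mu_hat, sigma_hat = data.mean(), data.std(ddof=1)   # fitted population parameters
    t_star = np.array([
        statistic(rng.normal(mu_hat, sigma_hat, size=len(data))) for _ in range(B)
    ])
    return t_star.mean(), t_star.var(ddof=1)            # estimate and its variance

rng = np.random.default_rng(8)
sample = rng.normal(5.0, 2.0, size=30)
print(parametric_bootstrap_normal(sample, np.mean))
```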
Other applications of the bootstrap and cross-validation will be discussed in future lectures.
Exercises 2a
These data have been taken from Box, Hunter & Hunter: Statistics for Experimenters.
You should use SPSS or other statistics packages.
The following are results from a larger study on the pharmacological effects of nalbuphine. The
measured response obtained from 11 subjects was the change in pupil diameter (in
millimeters) after 28 doses of nalbuphine (B) or morphine (A):
A: 2.4  0.08  0.8  2.0  1.9  1.0
B: 0.4  0.2  -0.3  0.8  0.0
Assume that the subjects were randomly allocated to the drugs. Find the sample means and variances for
A and B. Test whether the differences between the two drugs are significant. What are the 95% and
90% confidence intervals for the difference? What distribution would you use for testing the
difference between means if we assume that the variances are equal? What conclusions would
you draw from these results? Do the treatments have significantly different effects?
Hint: the statistic we want to use for the two-sample test (if the variances are equal) is:
t = \frac{\bar{x} - \bar{y}}{s\sqrt{1/n_x + 1/n_y}}, \qquad \text{where } s^2 = \frac{n_x s_x^2 + n_y s_y^2}{n_x + n_y - 2}
What is the distribution of this statistic? What are the degrees of freedom?
When the variances are different you should use Welch’s two-sample test.
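If you prefer a scripting environment to SPSS, a minimal sketch of the pooled two-sample t-test for these data (Python with numpy and scipy.stats; the confidence interval uses the standard t quantile with n_x + n_y − 2 degrees of freedom):

```python
import numpy as np
from scipy import stats

a = np.array([2.4, 0.08, 0.8, 2.0, 1.9, 1.0])    # morphine (A)
b = np.array([0.4, 0.2, -0.3, 0.8, 0.0])          # nalbuphine (B)

print("means:", a.mean(), b.mean(), " variances:", a.var(ddof=1), b.var(ddof=1))

# Pooled (equal-variance) two-sample t-test; use equal_var=False for Welch's test
t_stat, p_value = stats.ttest_ind(a, b, equal_var=True)
print("t =", t_stat, " p =", p_value)

# 95% confidence interval for the difference of means using the pooled variance
n_x, n_y = len(a), len(b)
sp2 = ((n_x - 1) * a.var(ddof=1) + (n_y - 1) * b.var(ddof=1)) / (n_x + n_y - 2)
se = np.sqrt(sp2 * (1 / n_x + 1 / n_y))
diff = a.mean() - b.mean()
t_crit = stats.t.ppf(0.975, n_x + n_y - 2)
print("95% CI:", (diff - t_crit * se, diff + t_crit * se))
```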
Write a small report summarising your conclusions about these data.
Exercise 2b
These data have been taken from the book: Box, Hunter & Hunter: Statistics for Experimenters.
Use SPSS or other statistics packages.
Given the following data on egg production from 12 hens randomly allocated to two different
diets, estimate the mean difference produced by the diets and obtain 95% and 90%
confidence intervals. Are these differences significant? Can you write a small report
analysing these data?
Diet A: 166  174  150  166  165  178
Diet B: 158  159  142  163  161  157
What distribution would you use if the variances were equal?
Write a small report analysing these data.