Basic principles of probability theory


Resampling techniques
• Why resampling?
• Jackknife
• Cross-validation
• Bootstrap
• Examples of application of bootstrap
Why resampling?
One of the purposes of statistics is to estimate parameters and their reliability. Since estimators are functions of the sample points, they are random variables. If we could find the distribution of such a random variable (the sampling distribution of the statistic), then we could estimate the reliability of the estimator. Unfortunately, apart from the simplest cases, the sampling distribution is not easy to derive. There are several techniques to approximate it, including the Edgeworth series, the Laplace approximation and saddle-point approximations. These give analytical forms for the approximate distributions. With the advent of computers, computationally intensive methods are emerging; they work satisfactorily in many cases.
Examples of the simplest cases where the sampling distributions are known include:
1) The sample mean, when the sample is from a population with a normal distribution, has a normal distribution with mean equal to the population mean and variance equal to the population variance divided by the sample size, if the population variance is known. If the population variance is not known, then the variance of the sample mean is approximated by the sample variance divided by n.
2) The sample variance has the distribution of a multiple of the χ2 (chi-squared) distribution. Again, this is valid if the population distribution is normal and the sample points are independent.
3) The sample mean (centred at the population mean) divided by the square root of the sample variance has the distribution of a multiple of the t distribution – again in the normal and independent case.
4) For two independent samples from normal distributions, the ratio of the sample variances has the distribution of a multiple of the F distribution.
Simulation techniques
One simple yet very powerful technique for generating the distribution of a statistic is simulation. It is done by making assumptions about the population, generating random numbers under these assumptions and calculating the statistic for the generated random samples. This is usually done 1000-10000 times. Using the results we can build the density of the distribution, the cumulative distribution, and design tests or make any other inference we would like.
The procedure works as follows:
Repeat N times:
1) generate a sample of the required size under the assumptions
2) calculate the desired statistic and store it
After this procedure we have, in effect, repeated the experiment/observations N times. The procedure will work if the assumptions are valid, or if the distribution of the statistic is insensitive to departures from the assumptions.
Simulation techniques
We can do this type of simulation using R. Let us write a small generic function for simulating a desired statistic:
fsimul = function(n, fstat, ...){
  # Generate n values of the statistic computed by fstat, passing any extra arguments through
  res = vector(length = n)
  for(i in 1:n){
    res[i] = fstat(...)
  }
  res
}
The only thing remaining is to write a function for the statistic whose distribution we want to generate. Let us say that we want to simulate the distribution of the ratio of the variances of two independent random samples drawn from normal distributions:
vrstat = function(k1, k2, mn1, mn2, sdd1, sdd2){
  var(rnorm(k1, mean = mn1, sd = sdd1)) / var(rnorm(k2, mean = mn2, sd = sdd2))
}
Once we have both functions we can generate the distribution of the ratio-of-variances statistic (e.g. for sample sizes of 10 and 15 from N(0,1) populations):
vrdist = fsimul(10000,vrstat,10,15,0.0,0.0,1.0,1.0)
Simulation techniques
Using the 10000 values we can build the desired distribution. From theory we know that the ratio of variances of independent random samples of sizes 10 and 15 should follow an F distribution with (9,14) degrees of freedom. The figure shows that the theoretical and simulated distributions are very similar.
We can use this distribution for tests and other purposes. To find quantiles we can use the R function quantile:
quantile(vrdist, c(0.025, 0.975))
To generate the cumulative distribution function we can use ecdf – the empirical cumulative distribution function:
ecv = ecdf(vrdist)   # generate the empirical cumulative distribution function
plot(ecv)            # plot the ecdf
ecv(value)           # calculate the probability P(X < value) for a given value
[Figure: bars – simulated distribution; red line – theoretical distribution]
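One way to reproduce such a figure (a minimal sketch, assuming the vrdist vector generated above) is to draw a histogram of the simulated values and overlay the theoretical F(9,14) density:

# Simulated distribution (bars) versus the theoretical F(9,14) density (red line)
hist(vrdist, breaks = 100, freq = FALSE, xlim = c(0, 6),
     main = "Ratio of sample variances", xlab = "variance ratio")
curve(df(x, df1 = 9, df2 = 14), from = 0, to = 6, col = "red", add = TRUE)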
Simulation techniques
Two small exercises (a small hint sketch for the first one is given below):
1) Generate the simulated distribution (density and cumulative) for a) the sample maximum, b) the sample minimum, c) the sample range (the distributions of the sample maximum and the sample minimum are known as extreme value distributions).
2) Generate the simulated density for the sample range divided by the sample standard deviation.
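As a hint for exercise 1a), a minimal sketch reusing fsimul (the choice of samples of size 10 from N(0,1) is an arbitrary assumption):

# Statistic: the maximum of a random sample of size k from N(mn, sdd)
maxstat = function(k, mn, sdd){ max(rnorm(k, mean = mn, sd = sdd)) }

maxdist = fsimul(10000, maxstat, 10, 0.0, 1.0)   # simulated distribution of the sample maximum
hist(maxdist, freq = FALSE)                      # density
plot(ecdf(maxdist))                              # cumulative distribution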
Resampling techniques
Resampling techniques use the data themselves to calculate bias, prediction error and distribution functions.
Three of the popular computer-intensive resampling techniques are:
1) Jackknife. It is a useful tool for bias removal. It may work fine for medium and large samples.
2) Cross-validation. A very useful technique for model selection. It may help to choose the "best" model among those under consideration.
3) Bootstrap. Perhaps the most important resampling technique. It can reduce bias as well as give the variance of an estimator. Moreover, it can give the distribution of the statistic under consideration. This distribution can be used for a wide variety of purposes, such as interval estimation and hypothesis testing.
Jackknife
The jackknife is used for bias removal. As we know, the mean-square error of an estimator is equal to the square of the bias plus the variance of the estimator. If the bias is much higher than the variance, then under some circumstances the jackknife can be used.
Description of the jackknife: Let us assume that we have a sample of size n. We estimate some sample statistic using all the data – tn. Then, removing one point at a time, we estimate tn-1,i, where the subscript indicates the size of the sample and the index of the removed sample point. The new estimator is derived as:
t' = n t_n - (n-1)\,\bar{t}_{n-1}, \qquad \text{where } \bar{t}_{n-1} = \frac{1}{n}\sum_{i=1}^{n} t_{n-1,i}

If the order of the bias of the statistic t_n is O(n^{-1}), then after the jackknife the order of the bias becomes O(n^{-2}).
The variance is estimated using:

\hat{V}_J = \frac{n-1}{n}\sum_{i=1}^{n}\left(t_{n-1,i} - \bar{t}_{n-1}\right)^2
This procedure can be applied iteratively, i.e. the jackknife can be applied again to the new estimator. The first application of the jackknife can reduce bias without changing the variance of the estimator, but the second and higher-order applications can in general increase the variance of the estimator.
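A minimal sketch of one round of this procedure in R, for an arbitrary statistic (the function name jackknife1 and its interface are our own illustration, not from any package):

# One round of the jackknife for a univariate sample x and an arbitrary statistic fstat
jackknife1 = function(x, fstat){
  n    = length(x)
  tn   = fstat(x)                                  # estimate from the full sample
  tn1  = sapply(1:n, function(i) fstat(x[-i]))     # leave-one-out estimates t_{n-1,i}
  tbar = mean(tn1)
  list(est = n*tn - (n-1)*tbar,                    # bias-corrected estimator t'
       var = (n-1)/n * sum((tn1 - tbar)^2))        # jackknife variance estimate
}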
Jackknife: An example
Let us take a data set of size 12 and perform the jackknife for the mean value.
     data                                                   mean
0)   368 390 379 260 404 318 352 359 216 222 283 332        323.5833

     Jackknife samples (point i removed)                    mean
1)       390 379 260 404 318 352 359 216 222 283 332        319.5455
2)   368     379 260 404 318 352 359 216 222 283 332        317.5455
3)   368 390     260 404 318 352 359 216 222 283 332        318.5455
4)   368 390 379     404 318 352 359 216 222 283 332        329.3636
5)   368 390 379 260     318 352 359 216 222 283 332        316.2727
6)   368 390 379 260 404     352 359 216 222 283 332        324.0909
7)   368 390 379 260 404 318     359 216 222 283 332        321.0000
8)   368 390 379 260 404 318 352     216 222 283 332        320.3636
9)   368 390 379 260 404 318 352 359     222 283 332        333.3636
10)  368 390 379 260 404 318 352 359 216     283 332        332.8182
11)  368 390 379 260 404 318 352 359 216 222     332        327.2727
12)  368 390 379 260 404 318 352 359 216 222 283            322.8182
t_jack = 12*323.5833 - 11*(mean of the leave-one-out means) = 12*323.5833 - 11*323.5833 = 323.5833. It is equal to the sample mean, which is not surprising since the mean is an unbiased estimator.
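The same numbers can be reproduced in R (a sketch reusing the jackknife1 helper sketched above):

d1 = c(368, 390, 379, 260, 404, 318, 352, 359, 216, 222, 283, 332)

n    = length(d1)
tn1  = sapply(1:n, function(i) mean(d1[-i]))   # the 12 leave-one-out means listed above
tbar = mean(tn1)
n*mean(d1) - (n-1)*tbar                        # jackknife estimate: 323.5833, equal to the sample mean

jackknife1(d1, mean)                           # same calculation using the helper above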
Cross-validation
Cross-validation is a resampling technique to overcome overfitting.
Let us consider the least-squares technique. Assume that we have a sample of size n, y = (y_1, y_2, ..., y_n). We want to estimate the parameters θ = (θ_1, θ_2, ..., θ_m). Now let us further assume that the mean value of the observations is a function of these parameters (we may not know the form of this function). We can postulate that the function has a form g, and then find the values of the parameters using the least-squares technique:
h = \sum_{i=1}^{n}\left(y_i - g(x_{i1}, x_{i2}, \ldots, x_{im}, \theta_1, \theta_2, \ldots, \theta_m)\right)^2 \;\to\; \min
where X is a fixed (design) matrix. After minimisation of h we have values of the parameters, and therefore a complete definition of the function. The form of the function g defines the model we want to use. We may have several candidate forms of the function. Obviously, if we have more parameters, the fit will be "better". The question is what would happen if we had new observations. Using the estimated values of the parameters we could compute the squared differences. Let us say we have new observations (y_{n+1}, ..., y_{n+l}). Can our function predict these new observations? Which function predicts future observations better? To answer these questions we can calculate the new differences:
PE = \frac{1}{l}\sum_{i=1}^{l}\left(y_{n+i} - g(x_{(n+i)1}, \ldots, x_{(n+i)m}, \theta_1, \ldots, \theta_m)\right)^2
where PE is the prediction error. The function g that gives the smallest value of PE has the higher predictive power. A model that gives a smaller h but a larger PE is an overfitted model.
Cross-validation: Cont.
If we have a sample of observations, can we use this sample to choose among the given models? Cross-validation attempts to reduce overfitting and thus helps model selection.
Description of cross-validation: We have a sample of size n – (y_i, x_i).
1) Divide the sample into K roughly equal-size parts.
2) For the k-th part, estimate the parameters using the other K-1 parts, i.e. excluding the k-th part. Calculate the prediction error for the k-th part.
3) Repeat this for all k = 1, 2, ..., K and combine all prediction errors to get the cross-validation prediction error.
If K = n then we have the leave-one-out cross-validation technique.
Let us denote the estimate at the k-th step by θ^(k) (it is a vector of parameters). Let the k-th subset of the sample be A_k and the number of points in this subset be N_k. Then the prediction error per observation is:
PE = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{N_k}\sum_{i \in A_k}\left(y_i - g(x_i, \theta^{(k)})\right)^2
Then we would choose the function that gives the smallest prediction error. We can expect that in the future, when we have new observations, this function will give the smallest prediction error.
This technique is widely used in modern statistical analysis. It is not restricted to the least-squares technique: instead of least squares we could use other techniques such as maximum likelihood, Bayesian estimation or M-estimation.
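A minimal sketch of K-fold cross-validation for least squares in R (the function cvpe, the data frame dd and the two candidate models are our own illustration):

# K-fold cross-validation prediction error for a least-squares model given as a formula.
# Assumes the response column of the data frame dd is named y.
cvpe = function(form, dd, K = 5){
  n    = nrow(dd)
  fold = sample(rep(1:K, length.out = n))       # split the data into K roughly equal parts
  pe   = numeric(K)
  for(k in 1:K){
    fit   = lm(form, data = dd[fold != k, ])    # estimate parameters without the k-th part
    pred  = predict(fit, newdata = dd[fold == k, ])
    pe[k] = mean((dd$y[fold == k] - pred)^2)    # prediction error for the k-th part
  }
  mean(pe)                                      # combined (per-observation) prediction error
}

# Compare a linear and a more flexible polynomial model on simulated data
dd = data.frame(x = runif(100))
dd$y = 1 + 2*dd$x + rnorm(100, sd = 0.3)
cvpe(y ~ x, dd)
cvpe(y ~ poly(x, 5), dd)   # a smaller cross-validation PE indicates better predictive power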
Bootstrap
The bootstrap is one of the computationally expensive techniques. Its simplicity, together with increasing computational power, makes it the method of choice in many applications. In a very simple form it works as follows.
We have a sample of size n. We want to estimate some parameter θ. The estimator for this parameter gives t. We want the distribution of t. To each sample point we assign a probability (usually 1/n, i.e. all sample points have equal probability). Then from this sample we draw, with replacement, another random sample of size n and estimate θ. This procedure is repeated B times. Let us denote the estimate of the parameter at the j-th resampling stage by t_j*. The bootstrap estimator for θ and its variance are calculated as:
\bar{t}^{\,*}_{B} = \frac{1}{B}\sum_{j=1}^{B} t^{*}_{j}, \qquad V_B(\bar{t}^{\,*}_{B}) = \frac{1}{B-1}\sum_{j=1}^{B}\left(t^{*}_{j} - \bar{t}^{\,*}_{B}\right)^2
This is a very simple form of bootstrap resampling. For parameter estimation, the number of bootstrap samples is usually chosen to be around 200. When the distribution itself is desired, the recommended number is around 1000-2000.
Let us analyse how the bootstrap works in one simple case. Consider a random variable X with sample (outcome) space x = (x_1, ..., x_M). Each point has probability f_j, i.e.

P(X = x_j) = f_j

f = (f_1, ..., f_M) represents the distribution of the population. A sample of size n will have relative frequencies for each sample point:

\hat{f} = (\hat{f}_1, \ldots, \hat{f}_M)
Bootstrap: Cont.
Then the distribution of \hat{f} conditional on f is a multinomial distribution:

\hat{f} \mid f \sim \mathrm{Mn}(n, f)

The multinomial distribution is the extension of the binomial distribution and is expressed as:

P(X = (x_1, x_2, \ldots, x_M)) = \frac{n!}{x_1!\cdots x_M!}\, f_1^{x_1}\cdots f_M^{x_M}, \qquad \sum_{j=1}^{M} x_j = n, \quad \sum_{j=1}^{M} f_j = 1

The limiting distribution of \hat{f} - f is a multinormal distribution. Now, if we resample from the sample, then we should consider the conditional distribution of the following (which is also a multinomial distribution):

\hat{f}^{*} \mid \hat{f} \sim \mathrm{Mn}(n, \hat{f})

The limiting distribution of \hat{f}^{*} - \hat{f} is the same as that of the original sample. Since these two distributions converge to the same distribution, well-behaved functions of the sample will also have the same limiting distributions. Thus, if we use the bootstrap to derive the distribution of a sample statistic, we can expect that in the limit it will converge to the distribution of the sample statistic, i.e. the following two functions will have the same limiting distributions:

t(\hat{f}^{*}, \hat{f}) \quad \text{and} \quad t(\hat{f}, f)
Bootstrap: Cont.
If we could enumerate all possible resamples from our sample then we could build the "ideal" bootstrap distribution (the number of samples is n^n). In practice, even with modern computers, this is impossible to achieve. Usually from a few hundred to a few thousand bootstrap samples are used.
The bootstrap usually works as follows (a minimal sketch in R is given after the list):
1) Draw a random sample of size n with replacement from the given sample of size n.
2) Estimate the parameter and get the estimate t_j*.
3) Repeat steps 1) and 2) B times and build the frequency and cumulative distributions for t*.
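A minimal sketch of these three steps in R, for a univariate sample and an arbitrary statistic (the function name bootdist is our own; the jackknife example data are reused):

# Non-parametric bootstrap: resample with replacement and recompute the statistic B times
bootdist = function(x, fstat, B = 1000){
  n = length(x)
  replicate(B, fstat(sample(x, size = n, replace = TRUE)))
}

tb = bootdist(c(368, 390, 379, 260, 404, 318, 352, 359, 216, 222, 283, 332), mean, 1000)
hist(tb)          # frequency distribution of t*
plot(ecdf(tb))    # cumulative distribution of t*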
Bootstrap: Cont.
While resampling we did not use any assumption about the population distribution, so this bootstrap is a non-parametric bootstrap. If we have some idea about the population distribution then we can use it in resampling, i.e. we draw randomly from an assumed population distribution rather than from the sample itself. For example, if we know that the population distribution is normal, then we can estimate its parameters from our sample (the sample mean and variance), approximate the population distribution with this fitted distribution, and use it to draw new samples. As can be expected, if the assumption about the population distribution is correct then the parametric bootstrap will perform better; if it is not correct then the non-parametric bootstrap will outperform its parametric counterpart.
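For example, a parametric bootstrap under a normal assumption could be sketched as follows (the normality assumption and the reuse of the jackknife example data are ours, for illustration only):

# Parametric bootstrap: fit the assumed normal model to the sample,
# then draw bootstrap samples from N(mean, sd) instead of from the data themselves
x  = c(368, 390, 379, 260, 404, 318, 352, 359, 216, 222, 283, 332)
mn = mean(x)
s  = sd(x)
tb.par = replicate(1000, mean(rnorm(length(x), mean = mn, sd = s)))
hist(tb.par)   # parametric bootstrap distribution of the sample mean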
Balanced bootstrap
One variation of bootstrap resampling is the balanced bootstrap. In this case, when resampling, one makes sure that the number of occurrences of each sample point is the same, i.e. if we generate B bootstrap samples, the total number of occurrences of x_i across all bootstrap samples is made equal to B. Of course, in each individual sample some observations will be present several times and others will be missing. It can be achieved as follows:
Let us assume that the number of sample points is n.
1) Repeat the numbers from 1 to n, B times.
2) Find a random permutation of the numbers from 1 to nB. Call it a vector N of length nB.
3) Take the first n entries of N; each entry indexes the repeated list from step 1 and thus identifies a sample point. Estimate the parameter of interest from these points. Then take the second n entries (from n+1 to 2n) and the corresponding sample points and estimate again. Repeat this B times and find the bootstrap estimators and distributions.
Balanced bootstrap: Example.
Let us assume that we have 3 sample points and the number of bootstrap samples we want is 3.
Our observations are: (x1, x2, x3)
Then we repeat the numbers from 1 to 3 three times:
1 2 3 1 2 3 1 2 3
Then we take one of the random permutations of the numbers from 1 to 3x3 = 9, e.g.
4 3 9 5 6 1 2 8 7
First we take the observations x1, x3, x3 and estimate the parameter.
Then we take x2, x3, x1 and estimate the parameter.
Then we take x2, x2, x1 and estimate the parameter.
As can be seen, each observation is present 3 times.
This technique is meant to improve the results of bootstrap resampling.
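The same scheme can be sketched in R for a general sample (the function name balboot is our own illustration):

# Balanced bootstrap: across all B samples each observation occurs exactly B times
balboot = function(x, fstat, B = 1000){
  n   = length(x)
  idx = sample(rep(1:n, B))                            # steps 1) and 2): repeat 1..n B times, permute
  idx = matrix(idx, nrow = B, ncol = n, byrow = TRUE)  # step 3): row j holds the j-th bootstrap sample
  apply(idx, 1, function(ii) fstat(x[ii]))
}

tb.bal = balboot(c(368, 390, 379, 260, 404, 318, 352, 359, 216, 222, 283, 332), mean, 1000)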
Bootstrap in R
We can either write our own bootstrap resampling functions or use what is available in R. There is a generic function in R, from the package boot, that can do bootstrap sampling. It is perhaps worth spending some time studying how this function works.
To use the boot function for a given statistic (let us take the mean as an example) we need to write a function that calculates it for given sample points. For example:
mnboot = function(d,nn){mean(d[nn])}  # nn is an integer vector of indices whose length equals the length of d
Now we can use the boot function from R (make sure that the boot package has been loaded):
require(boot)
mnb = boot(del,mnboot,10000)  # calculate the bootstrap estimate for the data del 10000 times
Bootstrap: Example.
Let us take the example we used for the jackknife. We generate 10000 (simple) bootstrap samples and estimate the mean value for each of them. The figure shows the bootstrap distribution of the estimated parameter. This distribution can now be used for various purposes (variance estimation, interval estimation, hypothesis testing and so on). For comparison, the normal distribution with mean equal to the sample mean and variance equal to the sample variance divided by the number of elements is also shown (black line).
It seems that the approximation with the normal distribution is very good.
Bootstrap: Example.
Once we have the bootstrap estimates we can use them for bias removal, variance estimation, interval estimation etc. The sequence of commands in R would be as follows:
# read or prepare the data and write a function for the statistic you want to estimate
d1 = c(368, 390, 379, 260, 404, 318, 352, 359, 216, 222, 283, 332)
mnboot = function(d,nn){mean(d[nn])}  # this function defines the statistic you want to calculate
require(boot)
nb = 10000
mnb = boot(d1, mnboot, nb)
# calculate the mean value and variance of the bootstrap estimates
mean(mnb$t)
var(mnb$t)
hist(mnb$t)
# calculate a 95% interval from the bootstrap quantiles
quantile(mnb$t, c(0.025, 0.975))
# estimate the empirical cumulative distribution function
ecv = ecdf(mnb$t)
plot(ecv)
Bootstrap: intervals
The results of the boot command can be used to estimate confidence intervals (i.e. the interval where the statistic would fall if we were to repeat the experiment many times). They can be calculated using boot.ci (from the boot package). It calculates simple percentile intervals, normal-approximation intervals, and intervals corrected for skewness and median bias. If bootstrap variances are available, then studentised (t-distribution approximated) intervals are also given.
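For example, continuing with the mnb object computed in the previous example (a sketch; the interval types requested here are just a common subset of what boot.ci supports):

require(boot)
# 95% confidence intervals for the mean from the bootstrap replicates stored in mnb
boot.ci(mnb, conf = 0.95, type = c("norm", "perc", "bca"))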
Bootstrap: Warnings.
1) The bootstrap technique can be used for well-behaved statistics (functions of the observations). For example, the bootstrap does not seem to be good for extreme value estimation; simulation may be used to derive those distribution functions instead.
2) The bootstrap can be sensitive to extreme outliers. It may be a good idea to deal with outliers before applying the bootstrap (or calculating any statistic), or to generate more bootstrap samples (say, (1+a)*B instead of B) and then deal with outliers after the bootstrap estimation.
3) For complicated statistics and a large number of observations the bootstrap may be very time consuming. Normal approximations to the statistic may give reasonable results.
Exercise 2
Differences between the mean values of two samples; bootstrap confidence intervals.
The difference between the wear of shoes made using two materials, A and B, is to be tested. 10 boys were selected and for each boy a pair of shoes was made, one shoe from each material. The experiment should be considered as a paired design.

A: 13.2  8.2 10.9 14.3 10.7  6.6  9.5 10.8  8.8 13.3
B: 14.0  8.8 11.2 14.2 11.8  6.4  9.8 11.3  9.3 13.6

Test the hypothesis H0: the means are equal, against H1: the means are not equal.
Use var.test for equality of variances and t.test for equality of means (assuming a paired design).
Use bootstrap distributions and define confidence intervals.
Calculate the power of the test.
Write a report. (A possible starting sketch is given below.)
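A possible starting point in R (a sketch only, under the paired-design assumption stated above; completing the bootstrap intervals, the power calculation and the report is left as the exercise):

A = c(13.2, 8.2, 10.9, 14.3, 10.7, 6.6, 9.5, 10.8, 8.8, 13.3)
B = c(14.0, 8.8, 11.2, 14.2, 11.8, 6.4, 9.8, 11.3, 9.3, 13.6)

var.test(A, B)                 # equality of variances
t.test(A, B, paired = TRUE)    # paired t-test of H0: means are equal

# Bootstrap distribution of the mean within-pair difference
require(boot)
dif  = A - B
difb = boot(dif, function(d, nn) mean(d[nn]), 10000)
quantile(difb$t, c(0.025, 0.975))   # simple percentile confidence interval

# Power of the paired t-test at the observed effect size
power.t.test(n = length(dif), delta = abs(mean(dif)), sd = sd(dif), type = "paired")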