#### Transcript Concepts of Sampling and Sampling Distributions

##### Topic 9: Estimation Bias, Standard Error and Sampling Distribution

###### From sample to population

Inductive (inferential) statistical methods make inferences about a population based on information from a sample drawn from that population.

[Diagram: a sample is drawn from the population; inductive statistical methods lead from the sample back to the population.]

###### Statistical Concepts of Sampling

• Suppose we want to estimate the mean birthweight of Malay male live births in Singapore, 1992.

• Due to logistical constraints, we decide to take a random sample of 50 live births from the records of all Malay male live births for that year.

###### Sampling from Target Population

Target population: all Malay male live births in Singapore, 1992.

Sample: a random sample of 50 Malay male live births in Singapore, 1992.

Suppose the sample mean is 3.55 kg and the sample SD ($S$) is 0.92 kg. What can we say about the population mean?

###### Statistical Modeling

• Assume the population values follow a normal or some other appropriate distribution. This means a relative frequency histogram of the population values will look like a normal or that appropriate distribution.

• Assume we have a random sample, i.e., we sample $n$ (= 50 in the example) values independently from the population.

###### Notation

Sample data: $X_1, \ldots, X_n$

Assume $X_1, \ldots, X_n$ are independent and each is distributed according to, say, a normal distribution.

Population parameters:

• Population mean $\mu$ = mean of the normal population

• Population variance $\sigma^2$ = variance of the normal population

• Population standard deviation $\sigma$

###### Statistical Inference

Two general areas:

(a) Statistical estimation, i.e., estimating population parameters based on sample statistics.

(b) Hypothesis testing, i.e., testing certain assumptions about the population. Also called a test of statistical significance.

###### Statistical Estimation

There are two ways by which a population parameter can be estimated from a sample:

(1) Point estimate

(2) Interval estimate

###### Point Estimate

Estimate the population parameter by a single value:

| Sample statistic | Population parameter it estimates |
|---|---|
| Sample mean | Population mean |
| Sample median | Population median |
| Sample variance | Population variance |
| Sample SD | Population SD |
| Sample proportion | Population proportion |

If the average birthweight for a random sample of Malay male births was 3.55 kg and we use it to estimate $\mu$, the mean birthweight of all Malay male births in the population, we would be making a point estimate for $\mu$.

• It is poor practice to report just the point estimate, because people cannot judge how good the estimate is.

• We should also report the accuracy of the estimate.

• Remember that the quality of an estimator is judged by its performance over REPEATED SAMPLING, although we have just one sample in hand.

• Inference for a population parameter should make allowance for sampling error.

###### Accuracy of statistical estimation

Two types of error:

(a) Sampling error or fluctuation: "random" error or fluctuation that is due entirely to chance in the process of sampling. Minimizing the sampling error maximizes the precision of a statistical estimate.

(b) Systematic error or bias: non-random error or bias which is either a property of the estimator itself or due to bias in the sampling or measurement process. Minimizing the systematic error maximizes the validity of a statistical estimate.
Systematic errors can be minimized by making efforts to reduce bias in sampling and measurement (e.g., non-random sampling, nonresponse and non-coverage, untruthful answers, unreliable calibration, errors in data recording and coding).

###### Unbiased estimation of the mean

$E(\bar{X}) = \mu$, i.e., the sample mean equals the population mean when averaged over repeated samples.

Hypothetical results of repeated sampling:

| Sample | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Mean (kg) | 3.55 | 3.59 | 3.48 | 3.51 | 3.49 | 3.46 | 3.48 | 3.52 | 3.51 | 3.49 |

• Unbiasedness means the sample mean equals the population mean when averaged over repeated samples.

• However, there is fluctuation from sample to sample.

• Variance of the sample mean = ?

###### Standard Error (SE) of an estimator

• The SE of an estimator (e.g., the sample mean) is just the standard deviation (SD) of the estimator. It measures the variability of the estimator under "repeated" sampling.

• SE is just a special case of SD.

• The standard deviation of an estimator is called the standard error because it measures the magnitude of the estimation error due to sampling fluctuation.

###### Standard Deviation vs Standard Error

• The population standard deviation (SD) measures the amount of variation among the individual measurements that make up the population and can be estimated from a sample using the sample standard deviation.

• The standard error (e.g., of the sample mean), on the other hand, measures how much the value of the estimator changes from sample to sample under repeated sampling.

• As we take only one sample rather than repeated samples in practice, it seems impossible at first to estimate the standard error, which is defined with reference to repeated sampling.

• Fortunately, the standard error of the sample mean is a function of the population SD. As the latter is estimable from a single sample, so is the standard error.
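The difference between SD and SE can be illustrated by simulating repeated sampling. The following is a minimal sketch, not part of the original slides. It assumes a normal population with a hypothetical mean of 3.5 kg (`MU`) and treats the 0.92 kg from the birthweight example as if it were the population SD (`SIGMA`). It draws many samples of size 50 and compares the SD of the resulting sample means with the population SD divided by $\sqrt{n}$.

```python
import math
import random
import statistics

# Assumed population values, for illustration only: MU is hypothetical, and
# SIGMA borrows the sample SD from the birthweight example.
MU = 3.5        # population mean (kg), hypothetical
SIGMA = 0.92    # population SD (kg), assumed
N = 50          # sample size, as in the example
REPEATS = 10_000

rng = random.Random(1992)  # fixed seed so the run is reproducible


def draw_sample_mean(n):
    """Draw one random sample of size n and return its mean."""
    return statistics.fmean(rng.gauss(MU, SIGMA) for _ in range(n))


sample_means = [draw_sample_mean(N) for _ in range(REPEATS)]

# Unbiasedness: the sample means average out close to the population mean.
mean_of_means = statistics.fmean(sample_means)

# Standard error: the SD of the sample means under repeated sampling.
empirical_se = statistics.stdev(sample_means)
theoretical_se = SIGMA / math.sqrt(N)

print(f"average of sample means: {mean_of_means:.3f} (population mean {MU})")
print(f"SD of sample means:      {empirical_se:.3f}")
print(f"sigma / sqrt(n):         {theoretical_se:.3f}")
```

The individual birthweights vary with SD 0.92 kg, while the sample means vary far less. That smaller spread is what the standard error describes.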
###### Estimated standard error of the sample mean

• Let $\sigma$ denote the population SD.

• It was shown earlier that $\mathrm{SE} = \mathrm{SD}(\text{sample mean}) = \sigma/\sqrt{n}$, where $n$ is the sample size.

• Since $\sigma$ can be estimated by the sample standard deviation $S$, we can estimate the standard error by $\widehat{\mathrm{SE}} = S/\sqrt{n}$.

Note that SE decreases with $n$ at the rate $1/\sqrt{n}$, i.e., the precision of the sample mean improves as the sample size increases.

Knowing the mean and standard error of an estimator still doesn't tell us the whole story. The whole story is told by the sampling distribution, since that helps in calculating probabilities.

###### Sampling distribution of the sample mean

• The distribution of the sample mean under "repeated" sampling from the population.

| Sample | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Mean (kg) | 3.55 | 3.59 | 3.48 | 3.51 | 3.49 | 3.46 | 3.48 | 3.52 | 3.51 | 3.49 |

• This is the distribution of the sample mean rather than of individual measurements.

• In practice, we take only one sample, not repeated samples, so the sampling distribution is unobserved. Fortunately, it can often be derived theoretically.

Demo: http://www.ruf.rice.edu/~lane/stat_sim/index.html

###### Exact result when sampling from a normal population

• If the population is normal with mean $\mu$ and variance $\sigma^2$, then the sample mean based on a random sample of size $n$ is also normal with mean $\mu$ and variance $\sigma^2/n$, i.e.,

$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$

• Note how we can derive theoretically the distribution of the sample mean under repeated sampling without actually drawing repeated samples.

• This is important because we usually only have one sample at our disposal in practice.

##### Topic 10: Interval Estimate

• Provides an estimate of the population parameter by defining an interval or range of plausible values within which the population parameter could be found with a given confidence.

• This interval is called a confidence interval.

• The sampling distribution is used in constructing confidence intervals.
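Because the sampling distribution of the sample mean is known in the normal case, probabilities about the sample mean can be computed directly, and this is the step that confidence intervals build on. Here is a minimal sketch, assuming a hypothetical population mean `mu` of 3.5 kg and treating 0.92 kg as the population SD `sigma`. Both values are illustrative assumptions, not figures stated in the slides. It uses `statistics.NormalDist` from the Python standard library.

```python
import math
from statistics import NormalDist

mu = 3.5       # population mean (kg), hypothetical
sigma = 0.92   # population SD (kg), assumed for illustration
n = 50         # sample size

# Sampling distribution of the sample mean: N(mu, sigma^2 / n).
se = sigma / math.sqrt(n)
xbar_dist = NormalDist(mu, se)

# Probability that the sample mean lands within 1.96 SE of mu.
prob_within = xbar_dist.cdf(mu + 1.96 * se) - xbar_dist.cdf(mu - 1.96 * se)

# Probability that the sample mean is off by more than 0.2 kg.
prob_off = 1 - (xbar_dist.cdf(mu + 0.2) - xbar_dist.cdf(mu - 0.2))

# SE shrinks at the rate 1/sqrt(n): quadrupling n halves it.
se_4n = sigma / math.sqrt(4 * n)

print(f"SE for n = {n}: {se:.3f} kg")
print(f"P(within 1.96 SE of mu): {prob_within:.3f}")
print(f"P(off by more than 0.2 kg): {prob_off:.3f}")
print(f"SE for n = {4 * n}: {se_4n:.3f} kg")
```

None of these probabilities requires drawing repeated samples. They follow from the theoretical distribution of the sample mean alone.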
###### Confidence interval for the mean of a normal population

Fact: With probability 0.95, a normally distributed variable is within 1.96 standard deviations of its mean.

Now $\bar{X} \sim N(\mu, \sigma^2/n)$, with $\mathrm{SD}(\bar{X}) = \mathrm{SE} = \sigma/\sqrt{n}$.

• It follows that the sample mean must be within 1.96 standard errors of the population mean with probability 0.95.

• Equivalently, the population mean is within 1.96 standard errors of the sample mean:

$$0.95 = P\left(\bar{X} - 1.96\,\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + 1.96\,\frac{\sigma}{\sqrt{n}}\right)$$

We call

$$\left(\bar{X} - 1.96\,\frac{\sigma}{\sqrt{n}},\ \bar{X} + 1.96\,\frac{\sigma}{\sqrt{n}}\right) = \bar{X} \pm 1.96\,\mathrm{SE}(\bar{X})$$

a 95% confidence interval for the population mean.

If $\sigma$ is unknown, replace it by the sample SD $S$ and replace 1.96 by the upper 2.5-percentile of a t-distribution with $n-1$ degrees of freedom to yield

$$\left(\bar{X} - t_{n-1,\,0.025}\,\frac{S}{\sqrt{n}},\ \bar{X} + t_{n-1,\,0.025}\,\frac{S}{\sqrt{n}}\right) = \bar{X} \pm t_{n-1,\,0.025}\,\widehat{\mathrm{SE}}(\bar{X})$$

as a 95% confidence interval for the population mean.

###### The t densities

• t densities are symmetric and similar in appearance to the $N(0,1)$ density but with heavier tails.

• Tables for t distributions are widely available.

• As the d.f. increases, the t distribution converges to the standard normal distribution.

Demo: http://www.isds.duke.edu/sites/java.html

###### 95% confidence interval for the population mean

$$\left(\bar{X} - t_{n-1,\,0.025}\,\frac{S}{\sqrt{n}},\ \bar{X} + t_{n-1,\,0.025}\,\frac{S}{\sqrt{n}}\right) = \bar{X} \pm t_{n-1,\,0.025}\,\widehat{\mathrm{SE}}(\bar{X})$$

Birthweight data revisited:

• $n = 50$, sample mean = 3.55 kg, $S$ = 0.92 kg

• $\widehat{\mathrm{SE}} = 0.92/\sqrt{50} = 0.13$ kg

• d.f. = 49, upper 2.5-percentile of t = 2.01

• The 95% C.I. for the mean Malay male birthweight is $3.55 \pm 2.01 \times 0.13$ = (3.29 kg, 3.81 kg)

###### The meaning of confidence interval

Under repeated sampling, $\bar{X} \pm t_{n-1,\,0.025}\, S/\sqrt{n}$ will contain the true mean 95% of the time.

Demo: http://www.isds.duke.edu/sites/java.html
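The birthweight interval and its repeated-sampling interpretation can both be checked numerically. The sketch below is illustrative and not part of the original slides. It recomputes the interval from the figures in the text (a sample of 50 births, sample mean 3.55 kg, S = 0.92 kg, t value 2.01). It then simulates repeated sampling from an assumed normal population (hypothetical mean `MU` = 3.5 kg, SD `SIGMA` = 0.92 kg) and counts how often the interval contains the true mean. The helper name `t_interval` is made up for this example.

```python
import math
import random
import statistics

T_CRIT = 2.01  # upper 2.5-percentile of t with 49 d.f., as quoted in the text


def t_interval(mean, s, n, t_crit):
    """Return (lower, upper) for mean +/- t_crit * s / sqrt(n)."""
    half_width = t_crit * s / math.sqrt(n)
    return mean - half_width, mean + half_width


# Birthweight example from the text.
lower, upper = t_interval(3.55, 0.92, 50, T_CRIT)
print(f"95% CI: ({lower:.2f} kg, {upper:.2f} kg)")

# Coverage under repeated sampling from an assumed normal population.
MU, SIGMA, N, REPEATS = 3.5, 0.92, 50, 10_000  # MU and SIGMA are hypothetical
rng = random.Random(49)  # fixed seed for reproducibility

hits = 0
for _ in range(REPEATS):
    sample = [rng.gauss(MU, SIGMA) for _ in range(N)]
    lo, hi = t_interval(
        statistics.fmean(sample), statistics.stdev(sample), N, T_CRIT
    )
    if lo <= MU <= hi:
        hits += 1

coverage = hits / REPEATS
print(f"fraction of intervals containing the true mean: {coverage:.3f}")
```

The first line reproduces the interval (3.29 kg, 3.81 kg). The simulated fraction should come out close to 0.95, which is the sense in which the procedure, rather than any single interval, carries the 95% confidence.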