Quantitative Methods – Week 6:
Inductive Statistics I:
Standard Errors and Confidence Intervals
Roman Studer
Nuffield College
[email protected]
Repetition: Fitting the Regression Line
The regression line predicts the values of Y based on the values
of X. Thus, the best line will minimise the deviation between
the predicted and the actual values (the error, e)
Regression line: IP = a + b·Wage
Error for the UK: e_UK = Y_UK - Ŷ_UK
Repetition: The Goodness of Fit
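[Figure: scatter plot with the regression line; for an observation (xi, yi), TSS and ESS are marked relative to the means of x and y]
Total variation = explained variation + unexplained variation
TSS = ESS + USS
R² = ESS/TSS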
Homework
Country       Y       Yi-Mean   X    Xi-Mean   Product of Deviations   Deviation squared (X)
Norway        54360     24534   81        30                  736020                     900
Switzerland   49660     19834   49        -2                  -39668                       4
US            39430      9604   83        32                  307328                    1024
Brazil         3340    -26486   21       -30                  794580                     900
Iran           2340    -27486   21       -30                  824580                     900
Sum                         0              0                 2622840                    3728
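This means that the regression coefficient is b = 703.551502, or about 703.55
(the sum of the products of the deviations divided by the sum of the squared deviations of X: 2622840/3728).
Therefore a = -6055.12661 (29826 - (703.55 x 51)), where 29826 is the mean of Y and 51 is the mean of X.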
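The arithmetic in the table can be checked with a short script. This is a minimal sketch using only the five observations from the slide; the variable names are illustrative and not part of the original homework.

```python
# Homework data as given on the slide: Y and X for five countries.
countries = ["Norway", "Switzerland", "US", "Brazil", "Iran"]
y = [54360, 49660, 39430, 3340, 2340]
x = [81, 49, 83, 21, 21]

mean_y = sum(y) / len(y)  # 29826
mean_x = sum(x) / len(x)  # 51

# Sum of the products of deviations, and sum of squared deviations of X.
sum_products = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sum_sq_x = sum((xi - mean_x) ** 2 for xi in x)

b = sum_products / sum_sq_x  # regression coefficient (slope)
a = mean_y - b * mean_x      # intercept

print(f"sum of products = {sum_products:.0f}, sum of squares (X) = {sum_sq_x:.0f}")
print(f"b = {b:.4f}, a = {a:.4f}")
```

The intercept follows from the fact that the fitted line passes through the point of means (51, 29826).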
Homework (II)
Total variation
Country       Xi   Yi      Yi-Mean   Squares
Norway        81   54360     24534   601917156
Switzerland   49   49660     19834   393387556
US            83   39430      9604    92236816
Brazil        21    3340    -26486   701508196
Iran          21    2340    -27486   755480196
Total Sum of Squares                 2544529920

Explained variation
Country       Xi   Yi      Yi-Mean   Y Predicted   Explained variation
Norway        81   54360     24534    50932.5451   445486244.6
Switzerland   49   49660     19834    28418.897      1979938.865
US            83   39430      9604    52339.6481   506864349.4
Brazil        21    3340    -26486     8719.45494  445486244.6
Iran          21    2340    -27486     8719.45494  445486244.6
Explained sum of squares                           1845303022

Unexplained variation
Country       Xi   Yi      Yi-Mean   Y Predicted   Unexplained variation
Norway        81   54360     24534    50932.5451    11747447.34
Switzerland   49   49660     19834    28418.897    451184456.8
US            83   39430      9604    52339.6481   166659013.3
Brazil        21    3340    -26486     8719.45494   28938535.4
Iran          21    2340    -27486     8719.45494   40697445.28
Residual sum of squares                            699226898.1
The coefficient of determination is therefore
1845303022/2544529920 = 0.7252039
This means that education accounts for about 72.5% of the
variation in GDP per person.
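The decomposition of the total variation can be reproduced from the same five observations. The sketch below recomputes the fitted values and the three sums of squares; names such as `tss`, `ess` and `uss` are illustrative labels that follow the slide's abbreviations.

```python
# Homework data from the slide.
y = [54360, 49660, 39430, 3340, 2340]
x = [81, 49, 83, 21, 21]

mean_y = sum(y) / len(y)
mean_x = sum(x) / len(x)

# Least-squares slope and intercept, as in the first homework table.
b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
     / sum((xi - mean_x) ** 2 for xi in x))
a = mean_y - b * mean_x

# Predicted values on the regression line.
y_hat = [a + b * xi for xi in x]

tss = sum((yi - mean_y) ** 2 for yi in y)               # total variation
ess = sum((yh - mean_y) ** 2 for yh in y_hat)           # explained variation
uss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # unexplained variation

r_squared = ess / tss

print(f"TSS = {tss:.1f}, ESS = {ess:.1f}, USS = {uss:.1f}")
print(f"R-squared = {r_squared:.7f}")
```

Because TSS = ESS + USS, the same R² can also be obtained as 1 - USS/TSS.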
Complete data set:
Coefficient of determination:
R-squared = 0.5446
a= -2523.28; b= 467.31
Inductive Statistics: Introduction
So far, we have only looked at samples, and we will most often
only have samples and not entire populations
We have described and analysed these samples and computed means,
standard deviations, correlation coefficients, regression coefficients, etc.
However, because of "the luck of the draw", the estimated parameters will
deviate from the ‘true’ parameters of the whole population (sampling error)
We now move from descriptive statistics to inductive statistics…
We no longer only describe samples, but we now draw conclusions about
characteristics of the entire statistical population based on our sample
Chapters 5 & 6 provide the tools necessary to make inferences from a
sample
Inductive Statistics: Introduction (II)
What can we infer from a sample?
If we know the sample mean, how good an estimator is it of the population
mean?
If we calculated the correlation and regression coefficients from a sample of
observations, how good are they as estimators of the ‘true’ correlation and
regression coefficients?
How reliable are our estimates?
Sample Biases
In a first step, especially when working with historical data, we need
to ascertain whether our sample is likely to be representative or
whether it may suffer from some serious bias problems…
Is the sample of records that has survived representative of the full set of
records that was originally recorded?
• Business records, household inventories
• Did all records have an equal chance to make their way to the archive (success
bias)?
Is the sample drawn from the records representative of the information in
those records?
• Should you computerise information of people whose surname begins with
W? Is B possibly a better choice?
• Rate of return on equity (survivorship bias)
Is the information in the records representative of a wider population than
that covered by the records?
• Height records of recruits
• Tax records (selection bias)
Sampling will affect the inferences we (can) draw
Sampling Distribution
Sampling distribution refers to the distribution of the parameters that would be
obtained if a large number of random samples of a given size were drawn from a given
population; it is a hypothetical distribution
Example: We draw a sample of 20 rabbits and then we calculate the mean ear length.
After this we let the rabbits free. We repeat this 100 times. We get 100 estimates of
mean ear length based on 100 samples of 20 rabbits. The distribution may look like
this:
[Figure: histogram of the 100 sample means; x-axis: Mean Ear Lengths of Brown Hare Rabbits (in cm), y-axis: Frequency]
• 4 times, we calculated 52.5<mean<=55
• 15 times, we calculated 55<mean<=57.5
• 34 times, we calculated 57.5<mean<=60
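The rabbit experiment can be imitated on a computer. The sketch below is only an illustration: the population mean (58 cm) and standard deviation (9 cm) are hypothetical values chosen for the simulation, not figures from the slide, and the population is assumed to be normal.

```python
import math
import random
import statistics
from collections import Counter

rng = random.Random(42)           # fixed seed so the run is repeatable

POP_MEAN, POP_SD = 58.0, 9.0      # hypothetical population of ear lengths (cm)
N_SAMPLES, SAMPLE_SIZE = 100, 20  # 100 samples of 20 rabbits, as on the slide

# Draw the samples and keep only the mean of each one.
sample_means = []
for _ in range(N_SAMPLES):
    sample = [rng.gauss(POP_MEAN, POP_SD) for _ in range(SAMPLE_SIZE)]
    sample_means.append(statistics.mean(sample))

# Tally the means into 2.5 cm wide bins of the form lower < mean <= upper.
counts = Counter()
for m in sample_means:
    upper = math.ceil(m / 2.5) * 2.5
    counts[upper] += 1

for upper in sorted(counts):
    print(f"{upper - 2.5:.1f} < mean <= {upper:.1f}: {counts[upper]} times")
```

The printed tallies play the role of the bars in the histogram: most sample means fall close to the population mean, and few fall far from it.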
Sampling Distribution (II)
[Figure: sampling distribution of the sample mean; x-axis: sample mean estimates X̄, y-axis: probability; the curve is centred on μ, with SE(X̄) marked as its spread]
• μ: population mean
• X̄: sample means
The standard error is the estimated standard deviation of the
sampling distribution
Central Limit Theorem
1. Regardless of the shape of the population distribution, as the sample
size (of samples used to create the sampling distribution of the
mean) increases, the shape of the sampling distribution becomes
normal
2. The mean of the sampling distribution will be equal to the ‘true’ but
unknown population mean. On average, the known sample mean X̄
will be equal to μ, the unknown population mean
3. The standard deviation of the sample (s) can be taken as the best
estimate of the population standard deviation (σ). The standard
error (SE) of the sample mean, i.e. the standard deviation of the
sampling distribution is therefore
\[ SE(\bar{X}) = \sigma_{\bar{X}} = \frac{s}{\sqrt{N}} \]
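The three statements can be illustrated by simulation. In this sketch the population is assumed to be exponential with mean 1 and standard deviation 1 (a deliberately skewed, hypothetical choice); the sample size and number of repetitions are also arbitrary. Note that `statistics.stdev` divides by N-1 when computing s.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable

N = 50          # size of each sample (illustrative)
REPEATS = 2000  # number of samples drawn (illustrative)


def draw_sample(n):
    """Draw n values from a skewed population (exponential, mean 1, sd 1)."""
    return [rng.expovariate(1.0) for _ in range(n)]


# Build the sampling distribution of the mean.
sample_means = [statistics.mean(draw_sample(N)) for _ in range(REPEATS)]

mean_of_means = statistics.mean(sample_means)  # should be close to mu = 1
sd_of_means = statistics.stdev(sample_means)   # spread of the sampling distribution
theoretical_se = 1.0 / N ** 0.5                # sigma / sqrt(N) with sigma = 1

# In practice we have one sample only and estimate SE as s / sqrt(N).
one_sample = draw_sample(N)
se_single = statistics.stdev(one_sample) / N ** 0.5

print(f"mean of sample means:        {mean_of_means:.3f}")
print(f"sd of sample means:          {sd_of_means:.3f}")
print(f"sigma / sqrt(N):             {theoretical_se:.3f}")
print(f"s / sqrt(N) from one sample: {se_single:.3f}")
```

The standard deviation of the simulated sample means should lie close to σ/√N, and the estimate from a single sample should be of the same order.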
Standard Normal Probability Distribution
With the mean (X̄) and the standard deviation (SE) of the sampling
distribution, we have all the information about the distribution
However, we now want to standardise this sampling distribution
using
\[ Z = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} \quad \text{with} \quad \sigma_{\bar{X}} = SE(\bar{X}) = \frac{s_x}{\sqrt{N}} \]
The distribution of Z always has a mean of zero and a standard
deviation of 1
The proportion of the area under the curve up to or beyond any specific value
of Z can now be obtained from a published table
Standard Normal Probability Distribution (II)
A standard normal distribution is a normal distribution N(0,1) with
mean μ=0 and standard deviation σ=1
[Figure: standard normal curve with σ=1; 95% of cases lie between -1.96 and +1.96, with 2.5% of cases in each tail]
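Instead of a published table, the areas under the standard normal curve can be looked up with Python's standard library. This is a small sketch using `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

z = NormalDist(mu=0.0, sigma=1.0)  # standard normal distribution N(0,1)

lower_tail = z.cdf(-1.96)               # proportion of cases below -1.96
upper_tail = 1 - z.cdf(1.96)            # proportion of cases above +1.96
middle = z.cdf(1.96) - z.cdf(-1.96)     # proportion between -1.96 and +1.96

# Going the other way: the Z value that leaves 2.5% in the upper tail.
z_crit = z.inv_cdf(0.975)

print(f"below -1.96: {lower_tail:.4f}")
print(f"above +1.96: {upper_tail:.4f}")
print(f"in between:  {middle:.4f}")
print(f"Z for 2.5% in the upper tail: {z_crit:.3f}")
```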
Student’s t-distribution
Student’s t-distribution is very similar to the standard normal Z-distribution, but adjusts for the degrees of freedom (df)
\[ Z = \frac{\bar{X} - \mu_{\bar{X}}}{s_X / \sqrt{N}} \qquad t = \frac{\bar{X} - \mu_{\bar{X}}}{s_X / \sqrt{N-1}} \]
As the sample size N tends to infinity the t-distribution approximates
the standard normal Z-distribution
We know the proportion of cases beyond a certain t-value, e.g. 2.5%
of the cases lie above t=1.98 for N-1=120 and above t=1.96 when N
approaches infinity
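The tail proportions quoted for the t-distribution can be checked numerically. The sketch below uses the standard density of Student's t-distribution and integrates it with Simpson's rule; the function names `t_pdf` and `upper_tail` are hypothetical helpers written for this illustration, not part of any library.

```python
import math


def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(t * t / df))


def upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, using Simpson's rule on [0, t] and symmetry."""
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


print(f"df = 120,     t = 1.98: {upper_tail(1.98, 120):.4f}")
print(f"df = 1000000, t = 1.96: {upper_tail(1.96, 1_000_000):.4f}")
print(f"df = 10,      t = 1.98: {upper_tail(1.98, 10):.4f}")
```

With few degrees of freedom the tails are heavier, so the same t-value leaves more than 2.5% of the cases beyond it.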
Confidence Intervals
We now come back to the question asked before: how good are
our estimates of some parameters obtained from the sample? How
good an estimator is, say, the sample mean, X̄, of what we
really want to know, which is the population mean μ?
The sample mean can be taken as an estimate of the unknown
population mean
Though correct on average, a single estimate from an individual
sample might differ from the true mean to some extent
We can generate an interval in which the "true" (population) mean
is located with a specified probability
• 90% CI: With a probability of 90%, the interval includes μ
• 95% CI: In 95 times out of 100, the interval includes μ
• 99% CI: There is a 99% probability that the interval includes μ
Confidence Intervals (II)
How many standard errors either side of the sample mean do we have to
add to achieve a degree of confidence of 95%?
The t-distribution gives the exact value!
We know the proportion of cases beyond a certain t-value, e.g.
2.5% of the cases lie above t=1.98 for N-1=120 and above t=1.96 when
N approaches infinity
\[ \bar{X} \pm t_{0.025} \cdot SE(\bar{X}) \]
Example: Birth rate in English parishes
• N = 214 parishes
• The mean is 15.636 births per 100 families
• The standard error, SE(X̄) = s_x/√N, is 0.308 births
• The t0.025 value for the t-distribution for 213 degrees of freedom is 1.971
The 95% confidence interval for the mean birth rate of the population is
therefore: 15.636 ± (1.971 x 0.308) = 15.636 ± 0.607
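The interval for the parish example can be computed directly from the three numbers on the slide. This is a minimal sketch; the t-value is taken from the slide rather than computed, and the variable names are illustrative.

```python
# Values given on the slide for the English parishes example.
mean_births = 15.636   # sample mean, births per 100 families
se = 0.308             # standard error of the mean
t_crit = 1.971         # t0.025 for 213 degrees of freedom (from the slide)

margin = t_crit * se               # half-width of the interval
lower = mean_births - margin
upper = mean_births + margin

print(f"margin = {margin:.3f}")
print(f"95% CI: {lower:.3f} to {upper:.3f}")
```

Replacing `t_crit` with the t-value for another confidence level (for example 90% or 99%) gives the corresponding narrower or wider interval.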
Computer Class: Repetition & Confidence Intervals
Exercises
Weimar elections: Unemployment and votes for the Nazi party
Get the dataset about the Weimar elections of 1930-33 at
http://www.nuff.ox.ac.uk/users/studer/teaching.htm
• Look at the variables (votes for the Nazi party, level of unemployment) in turn
• Get a first visualisation of the data; does it look normally distributed?
• Compute the mean, median, standard deviation, coefficient of variation, kurtosis
and skewness for voting share of the Nazi party and the level of unemployment
• Estimate the following regression for each of the first two of the four elections
(09/30, 03/33): Nazi=a+bUnemployment
• Explain in words what the two regressions tell you
• Draw the respective scatter plots and draw the regression lines
• Calculate the 90%, 95% and 99% confidence intervals for a and b
• Are b and the explanatory power of the regression the same for the election
in 1930 and the one in 1933?
Homework
Readings:
• Feinstein & Thomas, Ch. 6
• Repeat what we have learned today
Problem Set 5:
Finish the exercises from today’s computer class if you haven’t done so already.
Include all the results and answers in the file you send me.