Week 4 Lecture PowerPoint (Transcript)

University of Warwick, Department of Sociology, 2012/13
SO 201: SSAASS (Surveys and Statistics) (Richard Lampard)
Week 4
Regression Analysis:
Some Preliminaries...
Lecture content
• Standard deviation
• Normal distribution
• z-scores
• t-statistics and tests
• Correlation (r)
• Variation explained (r²)
• F-test
• Partial correlation
A key measure of spread
• Standard Deviation (SD) - and Variance
• This is linked with the mean, as it is based on
averaging [squared] deviations from it.
• The variance is simply the standard deviation
squared.
Why is the
standard deviation so important?
• The standard deviation (or, more precisely, the
variance) is important because it introduces
the idea of summarising variation in terms of
summed, squared deviations.
• And it is also central to some of the statistical
theory used in statistical testing/statistical
inference...
An example of the calculation of a standard deviation
• Number of seminars attended by a sample of undergraduates:
5, 4, 4, 7, 9, 8, 9, 4, 6, 5
• Mean = 61/10 = 6.1
• Variance = ((5 − 6.1)² + (4 − 6.1)² + (4 − 6.1)² + (7 − 6.1)² + (9 − 6.1)² + (8 − 6.1)² + (9 − 6.1)² + (4 − 6.1)² + (6 − 6.1)² + (5 − 6.1)²)/(10 − 1) = 36.9/9 = 4.1
• Standard deviation = Square root of variance = 2.025
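If you want to check the arithmetic, a short Python sketch (not part of the original slides) reproduces the calculation; note that Python's statistics.variance() and statistics.stdev() divide by n − 1, just as in the example above:

import statistics

# Number of seminars attended by the sample of ten undergraduates
seminars = [5, 4, 4, 7, 9, 8, 9, 4, 6, 5]

mean = statistics.mean(seminars)          # 6.1
variance = statistics.variance(seminars)  # 36.9 / 9 = 4.1 (divides by n - 1)
sd = statistics.stdev(seminars)           # square root of 4.1, approximately 2.025

print(mean, variance, sd)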
Formula
• The standard deviation is given by the following
formula
s = √( Σ(xᵢ − x̄)² / n )

• Like the preceding example, the formula shows it to be the square root of the average of the squared differences of the values from the mean. (Note that the example above divided by n − 1 rather than n, which is the usual convention when estimating the standard deviation from a sample.)
The Standard Normal (z) Distribution
Mean = 0
Standard deviation = 1
Total area under the curve = 1 or 100%
[Figure: the standard normal curve, with the horizontal axis marked from −3 to +3 standard deviations]
Question: How often are values in the standard normal
distribution below a particular number, e.g. 1.18 (SDs)?
The relative frequency of such values is equal to the area
under the curve for the relevant range of values.
The shaded area under the curve to the left of 1.18 is the relative frequency of the event z < 1.18. This area appears to be approximately 85%-90%.
(Part of) a table of z-scores
Tables of z-scores are laid out in different ways, but they all tell you the same thing. A common layout is shown below.
If we go across from the row for 1.1 to the column for 0.08, we find the value for z = 1.18: 0.8810.
If we want to know the proportion of cases that fall beyond 1.18, we can work it out as 1 - 0.8810 = 0.1190.
z     0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.0   0.5000  0.5040  0.5080  0.5120  0.5160  0.5199  0.5239  0.5279  0.5319  0.5359
0.1   0.5398  0.5438  0.5478  0.5517  0.5557  0.5596  0.5636  0.5675  0.5714  0.5753
0.2   0.5793  0.5832  0.5871  0.5910  0.5948  0.5987  0.6026  0.6064  0.6103  0.6141
0.3   0.6179  0.6217  0.6255  0.6293  0.6331  0.6368  0.6406  0.6443  0.6480  0.6517
0.4   0.6554  0.6591  0.6628  0.6664  0.6700  0.6736  0.6772  0.6808  0.6844  0.6879
0.5   0.6915  0.6950  0.6985  0.7019  0.7054  0.7088  0.7123  0.7157  0.7190  0.7224
0.6   0.7257  0.7291  0.7324  0.7357  0.7389  0.7422  0.7454  0.7486  0.7517  0.7549
0.7   0.7580  0.7611  0.7642  0.7673  0.7704  0.7734  0.7764  0.7794  0.7823  0.7852
0.8   0.7881  0.7910  0.7939  0.7967  0.7995  0.8023  0.8051  0.8078  0.8106  0.8133
0.9   0.8159  0.8186  0.8212  0.8238  0.8264  0.8289  0.8315  0.8340  0.8365  0.8389
1.0   0.8413  0.8438  0.8461  0.8485  0.8508  0.8531  0.8554  0.8577  0.8599  0.8621
1.1   0.8643  0.8665  0.8686  0.8708  0.8729  0.8749  0.8770  0.8790  0.8810  0.8830
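Rather than reading the table, the same figures can be obtained from software. A minimal sketch, assuming the scipy library is available:

from scipy.stats import norm

# Proportion of the standard normal distribution below z = 1.18
p_below = norm.cdf(1.18)        # approximately 0.8810, matching the table entry
p_beyond = 1 - p_below          # approximately 0.1190

print(round(p_below, 4), round(p_beyond, 4))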
However …
• So far we’ve been working with a standard normal
distribution.
• However, this is not in itself all that useful, since most quantities do not have a mean of 0 or a standard deviation of 1.
• But we can convert quantities from more specific forms of normal distributions into z-scores.
• For instance: How likely is it for someone to get a
mark of at least 75 in an exam, if the mean is 60
and the standard deviation is 10?
z-scores (e.g. z-statistics)
In order to address these types of problems we use z-scores.
z-scores are calculated by subtracting the mean from any value and
dividing it by the standard deviation.
z = (x − mean) / s
z-scores will always have a mean of 0 and a standard deviation of 1.
We can quickly see that the first part of this is true, since when the value
of x = the mean, the numerator will equal 0, and therefore z must = 0.
It may be a little less clear that the second part of this is true.
However, consider the case where x is one standard deviation bigger than the mean (i.e. x = mean + s):

z = ((mean + s) − mean) / s = s / s = 1
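Applying this to the exam-mark question above (mean = 60, standard deviation = 10, mark of at least 75), a sketch using scipy gives the corresponding probability:

from scipy.stats import norm

z = (75 - 60) / 10                 # z = 1.5 standard deviations above the mean
p_at_least_75 = 1 - norm.cdf(z)    # approximately 0.067, i.e. a chance of roughly 7%

print(z, round(p_at_least_75, 3))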
Judging whether differences occur by chance…
How do we judge whether it is plausible that two population
means are the same and that any difference between them
simply reflects sampling error?
Example: Household size of minority ethnic groups
(HOH = Head of household; data adapted from 1991 Census)
1. The size of the difference between the two sample means

Indian HOH: Mean = 3.0; Bangladeshi HOH: Mean = 5.0
Indian HOH: Mean = 3.0; Pakistani HOH: Mean = 4.0
The first difference is more ‘convincing’
The logic of a statistical test…
The statistical test that we have looked at so far:
Chi-square test: Are the observed frequencies in a table sufficiently different from what one would have expected to see if there were no relationship in the population (i.e. the expected frequencies) for the idea that there is no relationship in the population to be implausible?
The more general testing approach asks whether the
difference between the actual (observed) data and what
one would have expected to have seen given a particular
hypothesis is sufficiently large that the hypothesis is
implausible.
The logic of a statistical test…
…e.g. this logic applies to comparing sample means.
If the two samples came from populations with identical
population means, then one would expect the difference
between the sample means to be (close to) zero. Thus,
• The larger the difference between the two sample means, the
more implausible is the idea that the two population means are
identical.
The plausibility of the idea is also affected by:
• Sample size
• The extent to which there is variation within each group (or
sample).
Judging whether differences occur by chance…
2. The sample sizes of the two samples
Pakistani HOH: 3 4 5 (Mean = 4.0)
Bangladeshi HOH: 4 5 6 (Mean = 5.0)

Pakistani HOH: 2 2 3 4 4 4 5 5 5 6 (Mean = 4.0)
Bangladeshi HOH: 2 3 4 4 5 5 6 6 7 8 (Mean = 5.0)
The second difference is more ‘convincing’
Judging whether differences occur by chance…
3. The amount of variation in each of the two groups (samples)
Pakistani HOH: 2 2 3 4 4 4 5 5 5 6 (Mean = 4.0)
Bangladeshi HOH: 2 3 4 4 5 5 6 6 7 8 (Mean = 5.0)

Pakistani HOH: 4 4 4 4 4 4 4 4 4 4 (Mean = 4.0)
Bangladeshi HOH: 5 5 5 5 5 5 5 5 5 5 (Mean = 5.0)
The second difference is more ‘convincing’.
Example continued: the impact of variability on
the difference of means.
The three graphs each show
two groups with the same
mean difference.
However, the groups in each of the three graphs have different levels of variability.
Where there is lower
variability there is less
cross-over between the
groups, and so the
difference of the means
reflects a more pertinent
difference (virtually no one
in group A has the same
score as anyone in group B).
Judging whether differences occur by chance…
As we’ll see, these three things – the size of the
difference between the means, the sample sizes,
and the amount of variation (measured by the
standard deviations) within the samples – are
critical to our determination (test) of whether
the difference we observe between samples is
likely to reflect no actual difference in the
population between the groups being compared.
But first let’s think about how single sample
means behave...
A formal theorem:
“If repeated (simple random) samples of size N are drawn from a normally distributed population, the means of such samples will be normally distributed with mean μ and standard error σ/√N... if the N of each sample drawn is large, then regardless of the shape of the population distribution the sample means will tend to distribute themselves normally with mean μ and standard error σ/√N”.
Reminder:
μ = population mean
σ = population standard deviation
N = number in sample
Combining that with what we know about the normal distribution:
• 95% of cases lie within +/- 1.96 standard deviations of the mean in a normal
distribution.
• The distribution of sample means is normal.
• And the standard error of sample means is approximately σ/√n.
[Figure: the sampling distribution of the sample mean (vertical axis: frequency; horizontal axis: sample mean), centred on μ, the population mean. 95% of sample means lie within 1.96σ/√n either side of μ, with 2.5% of sample means in each tail.]
Therefore 95% of sample means fall into the range:
μ − 1.96(σ/√n) to μ + 1.96(σ/√n)
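This claim is easy to check by simulation. The sketch below (an illustration added to this transcript, with arbitrary values of μ, σ and n) draws repeated samples and counts how many sample means fall within ±1.96σ/√n of the population mean:

import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 50.0, 10.0, 100, 10_000   # arbitrary illustrative values

# Draw 'reps' samples of size n and compute each sample mean
sample_means = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

half_width = 1.96 * sigma / np.sqrt(n)
proportion_inside = np.mean(np.abs(sample_means - mu) < half_width)

print(proportion_inside)   # close to 0.95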
What do we do if the sample size is too small for us to safely
assume that the sample mean has a normal distribution?
• When a sample is small (i.e. less than about 25) the
assumption that the sample mean is normally
distributed is not reasonable.
• In fact, regardless of sample size, the sample mean can
be assumed to have a t-distribution; the precise shape
of a t-distribution depends on the sample size, and for
moderate to large sample sizes the t-distribution is
very similar to (and, as the sample size approaches
infinity, eventually converges with) the normal
distribution.
Small samples continued…
• The procedure for identifying a statistic which has a statistically significant
value remains very similar to the one for the normal distribution.
• The only difference is that the ‘magic number’ 1.96 is replaced by a slightly
larger number, whose magnitude gets bigger as the sample size gets
smaller.
• Thus for a sample size of 25, 1.96 is replaced by 2.06, and for a sample size of 15 it is replaced by 2.13. (You can find a table of the t-distribution in many statistics textbooks, or obtain the values directly, as in the sketch below.)
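The sketch below (assuming scipy, and taking the degrees of freedom as n − 1 for a single sample mean, which gives values very close to the ones quoted above) shows how these critical values can be obtained directly, and how they approach 1.96 as the sample size grows:

from scipy.stats import norm, t

# Two-tailed 5% critical values of t for various sample sizes
for n in (15, 25, 100, 1000):
    print(n, round(t.ppf(0.975, df=n - 1), 2))   # 2.14, 2.06, 1.98, 1.96

print(round(norm.ppf(0.975), 2))                  # 1.96 for the normal distribution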
However another problem arises with small samples. The distribution of
sample means can be asymmetric; in fact, the assumption that the sample
mean has a t-distribution is only reasonable for small samples if the
distribution of the variable under consideration is approximately normal.
What does a two sample t-test measure?
Note: In some presentations of this test the two groups are labelled T (treatment group) and C (control group), as in experimental research. In most discussions they are just shown as groups 1 and 2, indicating different groups, as in the formula below.
If we take a large number of random samples from two
groups with the same population mean and calculate the
difference between each pair of sample means, we will end
up with a sampling distribution with the following properties:
It will be a t-distribution
The mean of the difference between sample means will be
zero
Mean M1 - M2 = 0
The spread of scores around this mean of zero (the standard
error) will be defined by the formula:
SE(M1 − M2) = √( [ ((N1 − 1)s1² + (N2 − 1)s2²) / (N1 + N2 − 2) ] × [ 1/N1 + 1/N2 ] )

The left-hand square brackets contain the pooled variance estimate.
Example
• We want to compare the average amount of television watched per night by Australian and by British children.
• We are making an inference from two samples, which are being compared in terms of an interval-ratio variable – hours of TV watched. Therefore we select the two sample t-test for the equality of means as the relevant test of significance.

Table 1. Descriptive statistics for the samples

Descriptive statistic   Australian sample   British sample
Mean                    166 minutes         187 minutes
Standard deviation      29 minutes          30 minutes
Sample size             20                  20
t-test of independent means: formulae
t = (M1 − M2) / SDM

SDM = √( [ ((N1 − 1)s1² + (N2 − 1)s2²) / (N1 + N2 − 2) ] × [ 1/N1 + 1/N2 ] )

df = N1 + N2 − 2

Note: 1/N1 + 1/N2 = (N1 + N2)/(N1 × N2)
Where:
M = mean
SDM = Standard error of the difference between means
N = number of subjects in group
s = Standard Deviation of group
df = degrees of freedom
Calculating the t-statistic
Using the figures from Table 1 (means of 166 and 187 minutes, standard deviations of 29 and 30 minutes, sample sizes of 20 and 20):

SDM = √( [ ((20 − 1)29² + (20 − 1)30²) / (20 + 20 − 2) ] × [ (20 + 20) / (20 × 20) ] ) = 9.3

t = (166 − 187) / 9.3 = −2.3
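The same calculation in Python (a sketch reproducing the figures above, using only the summary statistics from Table 1):

from math import sqrt

m1, s1, n1 = 166, 29, 20   # Australian sample: mean, SD, sample size
m2, s2, n2 = 187, 30, 20   # British sample: mean, SD, sample size

pooled_variance = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
sd_m = sqrt(pooled_variance * (1 / n1 + 1 / n2))   # approximately 9.3
t_sample = (m1 - m2) / sd_m                        # approximately -2.3

print(round(sd_m, 1), round(t_sample, 1))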
Obtaining a p-value for a t-statistic
• To obtain the p-value for this t-statistic we could consult a
table of critical values for the t-distribution
• The number of degrees of freedom we refer to in the table is the combined sample size minus two:
df = N1 + N2 – 2
• Here that is 20 + 20 – 2 = 38
• If a table doesn’t have a row of probabilities for 38, we can
refer to the row for the nearest reported number of degrees
of freedom below the desired number, which might be 35.
• For 35 degrees of freedom and a two-tailed test, the t-statistic falls between the critical values of t for p=0.05 and p=0.01, i.e. 2.03 and 2.72.
• The p-value is therefore probably (given the 35 for 38
swap) between 0.01 and 0.05, and certainly less than 0.05.
• Therefore the p-value is statistically significant at a 0.05
(5%) level but not at a 0.01 (1%) level.
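For comparison, scipy's ttest_ind_from_stats() carries out the same test directly from the summary statistics and returns an exact p-value (a sketch, assuming scipy is available):

from scipy.stats import ttest_ind_from_stats

result = ttest_ind_from_stats(mean1=166, std1=29, nobs1=20,
                              mean2=187, std2=30, nobs2=20)

# t is approximately -2.25 with 38 degrees of freedom; p is approximately 0.03
print(round(result.statistic, 2), round(result.pvalue, 3))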
Reporting the results
We can say that:
• The mean number of minutes of TV watched by
the sample of 20 British children is 187 minutes,
which is 21 minutes higher than the mean for the
sample of 20 Australian children, and this
difference is statistically significant at the 0.05
(5%) level (t(38)= -2.3, p = 0.03, two-tailed).
• Based on these results we can reject the
hypothesis that British and Australian children
watch the same average amount of television per
night.
Moving on to correlation
• Measures of association quantify the (strength
of the) association between two variables.
• The most well-known of these is the Pearson
correlation coefficient, often referred to as
‘the correlation coefficient’, or even ‘the
correlation’.
• This quantifies the closeness of the
relationship between two interval-level
variables (scales).
Spearman correlation coefficient
• If we have measures that are clearly ordinal but
not necessarily interval-level, we can use the
Spearman correlation coefficient instead, as this
looks at the relationship between the ranks of the
values rather than the values per se.
• This is also a non-parametric measure, which
does not make assumptions about distributions
when used as the basis of a test for a
relationship.
Other measures of association
(for cross-tabulations)
• In a literature review more than thirty years ago,
Goodman and Kruskal identified several dozen of
these, including Cramér’s V:
Goodman, L.A. and Kruskal, W.H. (1979) Measures of Association for Cross Classifications. New York: Springer-Verlag.
• … and I added one of my own, Tog, which measures
inequality (in a particular way) where both variables
are ordinal…
[Image: one of Tog’s (distant) relatives]
The Correlation Coefficient (r)
This shows the strength/closeness of a relationship
[Figure: scatterplot of Age at First Childbirth against Age at First Cohabitation, illustrating r = 0.5 (or perhaps less…), together with smaller plots illustrating r = −1, r = +1 and r = 0.]
Working out the correlation coefficient
(Pearson’s r)
• Pearson’s r tells us how much one variable changes as the values of another change – their covariation.
• Variation is measured with the standard deviation. This measures
average variation of each variable from the mean for that variable.
• Covariation is measured by calculating the amount by which each
value of X varies from the mean of X, and the amount by which
each value of Y varies from the mean of Y and multiplying the
differences together and finding the average (by dividing by n-1).
• Pearson’s r is calculated by dividing this by (SD of x) times (SD of
y) in order to standardize it.
r = Σ(x − X̄)(y − Ȳ) / ((n − 1) sx sy)
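A direct translation of this formula into Python may help make the steps concrete. The sketch below uses a small made-up data set purely for illustration (it is not the 27-student data used on the next slide):

import statistics

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    # Covariation: sum of the products of the deviations from each mean
    covariation = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Standardise by (n - 1) times the two standard deviations
    return covariation / ((n - 1) * statistics.stdev(x) * statistics.stdev(y))

seminar_attendance = [3, 5, 6, 8, 9, 10]    # hypothetical values
essay_marks = [52, 58, 55, 63, 61, 68]      # hypothetical values

print(round(pearson_r(seminar_attendance, essay_marks), 3))   # a strong positive correlation here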
Example of calculating a
correlation coefficient
(corresponding to the last slide)
• X = 9, 10, 11... 19, 19; Mean(X) = 15.3
• Y = 55, 59, 60... 59, 60; Mean(Y) = 60.1
• (9 − 15.3)(55 − 60.1) + (10 − 15.3)(59 − 60.1) ... etc.
• 32.0 + 5.8 + ... = 85.2 (Covariation)
• S.D.(X) = 2.71; S.D.(Y) = 3.65
• (85.2 / (27 − 1)) / (2.71 x 3.65) = 0.331
So what?
• The correlation coefficient varies between -1 and 1, so
0.331 signifies a moderately strong (!!), positive
relationship between essay mark and seminar
attendance. However, how do we know whether a
correlation coefficient of 0.331 is convincing evidence
of a genuine underlying relationship?
• The answer is, as usual, that we need to test how likely
a correlation of 0.331 would be to occur given that
there was no underlying relationship (i.e. how likely it
would be that a value as big as 0.331 would occur
simply as a consequence of sampling error).
Variation explained
• One can view seminar attendance as ‘explaining’,
in part, the variation in the essay mark. In fact,
the proportion of the variation in essay mark
explained by seminar attendance is equal to the
correlation coefficient squared, i.e. 0.331x0.331 =
0.110.
• A reminder that the correlation coefficient is
often labelled r, hence the variation explained is
often referred to as r2, or r-squared, or, more
specifically, the coefficient of determination.
Variation and its sources
We can now split up the variation in essay mark
0.110 of the variation in essay mark is explained by seminar
attendance
This is explained variation.
Hence 0.890 of the variation is left unexplained (1 - 0.110 = 0.890)
This is unexplained variation
We have used 1 variable (seminar attendance) to explain essay mark,
hence we have ‘used up’ 1 degree of freedom (source of variation).
There are 27 cases, hence there are 26 degrees of freedom overall.
Since we have used 1 degree of freedom when we explained some of
the variation in essay mark using seminar attendance, there are 25
degrees of freedom corresponding to the unexplained variation.
Using r as the basis of a test statistic
Hence the explained proportion (0.110) of the
variation corresponds to 1 degree of freedom,
and the unexplained proportion (0.890) of the
variation corresponds to 25 degrees of freedom.
              Variation   d.f.   Variation/d.f.   Ratio
Explained     0.110       1      0.110            3.08
Unexplained   0.890       25     0.0356
...which turns out to be an F-statistic
• The ratio of 0.110 to 0.0356 is 3.08. If there were no underlying
relationship between essay mark and seminar attendance, then we
would expect the 1 degree of freedom corresponding to seminar
attendance to have explained approximately its ‘fair share’ of the
variation in essay mark, rather than 3.08 times its ‘fair share’!
• The figure of 3.08 is in fact an F-statistic, since the ratio of the
variation per degree of freedom for explained variation to the
variation per degree of freedom for unexplained variation has an
F-distribution (assuming that there is no underlying relationship).
• The relevant F-distribution is the one corresponding to the degrees
of freedom of the explained variation and the degrees of freedom
of the unexplained variation (i.e. 1 and 25 in this example).
F-distributions
F-distributions for 1 d.f. and N d.f. do not look dissimilar for values of N of 10 upwards, but the right-hand tail becomes slightly more shallow.
Doing an F-test
• We now need to compare the figure of 3.08 with the
critical value at the 5% level for an F-statistic with 1
degree of freedom and 25 degrees of freedom, i.e. 4.24.
(Such values can be obtained from tables of critical
values of F, at the back of many stats textbooks).
• Since 3.08 < 4.24, there is more than a 5% chance that a
correlation coefficient as large as 0.331 would have
occurred if there were no underlying relationship
between essay mark and seminar attendance (In fact,
SPSS would tell us that p=0.092 > 0.05).
• Hence the data for the 27 students does not provide
statistically significant evidence of a relationship between
essay mark and seminar attendance.
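The whole F-test can be reproduced with a few lines of code (a sketch assuming scipy, with r = 0.331 and n = 27 as above):

from scipy.stats import f

r, n = 0.331, 27
r2 = r**2                                     # approximately 0.110 of the variation explained
F = (r2 / 1) / ((1 - r2) / (n - 2))           # approximately 3.08

critical_5pc = f.ppf(0.95, dfn=1, dfd=n - 2)  # approximately 4.24
p_value = f.sf(F, dfn=1, dfd=n - 2)           # approximately 0.09, not significant at the 5% level

print(round(F, 2), round(critical_5pc, 2), round(p_value, 3))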
Correlation… and Regression
• r measures correlation in a linear way
• … and is connected to linear regression
• More precisely, it is r2 (r-squared) that is of
relevance
• It is the ‘variation explained’ by the regression
line...
• Although the relationship between r-squared
and the basic correlation coefficient
disappears in multiple regression.
Partial correlations
• Partial correlations allow us to measure the
association between two variables controlling for
one or more others.
• They allow us to test whether variables are
significantly correlated net of such other factors.
• As such they allow us to see whether one variable
adds significantly to the variation explained in a
focal variable of interest by other factors.
• More on this topic on Wednesday!