t Test for a Single Group Mean (Part 3), Effect Size

Psych 5500/6500
The t Test for a Single Group Mean (Part 3):
Effect Size
Fall, 2008
1
Effect Size
In the t test for a single group the ‘effect size’ is the
difference between the actual value of the population
mean (the population from which the sample was
drawn) and the value of the population mean proposed
by H0.
For example, say math scores in some class have traditionally
had a mean of 50, and a new teaching program is tested to
see whether it changes the math scores.
H0 can be written as μH0=50 (no effect due to new
program).
But say that the new program actually leads to an
improvement in math scores such that now μY=55.
The effect of the new program was to raise the scores by 5:
μY − μH0 = 5
2
p Values
Until fairly recently the report of the statistical analysis of
experiments in psychology focused primarily on the ‘p
values’ that were obtained. In the example of the math
scores and the effect of the new teaching program the
results of the analysis would be presented as
t(23) = 2.45, p = .02; from this it can be concluded that
the effect of the new teaching program was statistically
significant (H0 is rejected).
With p values the focus is on whether or not the effect is
statistically significant, with a p value of .051 (don't
reject H0) being treated as fundamentally different from
a p value of .049 (reject H0).
3
p Values and Effect Size
p values simply tell us whether the effect was
statistically significant (i.e. unlikely to have
occurred due to chance alone). p values are a
poor indication of the size of the effect in an
experiment as the value of p is influenced by a
variety of things, including the effect size, the
size of the sample, and the variance of the
populations.
The trend in psychology is to report both the size of
the effect and its p value.
4
“Until quite recently in the history of experimental
psychology, when researchers spoke of ‘the results of a
study,’ they almost invariably were referring to whether
they had been able to ‘reject the null hypothesis,’ that is,
to whether the p values of their tests of significance were
.05 or less. Spurred on by a spirited debate over the
failings and limitations of the ‘accept/reject’ rhetoric of
the paradigm of null hypothesis testing, the American
Psychological Association (APA) created a task force to
review journal practices and to propose guidelines for
the reporting of statistical results. Among the ensuing
recommendations were that effect sizes...be reported.”
(Rosnow & Rosenthal, 2003, p. 221)
5
“It is no longer considered sufficient to ask of
an effect or relationship: ‘Is it there?’ It is
increasingly considered essential to also ask
‘How much is there?’ and sometimes even
‘Is it enough to care?’”
(McGrath & Meyer, 2006, p. 386)
6
Measures of Effect Size
There are many ways of measuring and
reporting effect size, and various authors
provide various ways of clumping these
approaches into categories. We will
consider three categories:
1. Simply reporting ‘raw’ effect size.
2. Standardized effect size.
3. Strength of Association.
7
1) Simply Reporting ‘Raw’ Effect Size
If the measures are easily comprehensible, then you
can simply state the effect size.
In our math example you can report that the expected
value of μ for the math scores given H0 was 50
while the estimate of μ given your sample was 55,
or you could simply state that the mean math
score in the sample was 5 points greater than what was
predicted by H0.
Belying the concept that if it isn’t complicated it
can’t be good, this is actually the approach favored
by the APA.
8
2) Standardized Effect Size
If the measure is something that is hard to grasp
(e.g., inverse reaction times, where an effect of
‘0.2’ would be hard to understand intuitively)
or if you want to do a meta-analysis
(comparing results across several studies) then
a standardized effect size may be more useful.
In a standardized effect size you are turning the
effect size into something that is similar to a
standard score.
9
Cohen’s d (population)
d = (μY − μH0) / σY
This formula is for computing the actual effect size
as it occurs in the population from which we
sampled. The difference between the mean
proposed by H0 and the actual mean of the
population is divided by the standard deviation of
the population from which the sample was drawn.
10
Example
Say that in our example the actual mean score of the
population of students taught using the new method
was 56 (slightly higher than the sample mean we
happened to get) with a population standard deviation
of 7.3. The formula turns the difference of 6 between
H0 and the actual population mean into a standard
score of 0.82. Is that a big effect or a small effect? We
will cover that in a minute.
d = (μY − μH0) / σY = (56 − 50) / 7.3 = 0.82
11
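As a quick check of that arithmetic, here is a minimal Python sketch of the population version of d (the variable names are mine; the values are the ones from this example):

    # Population Cohen's d, using the values from the example above.
    mu_Y = 56      # actual mean of the population taught with the new method
    mu_H0 = 50     # mean proposed by H0
    sigma_Y = 7.3  # population standard deviation

    d = (mu_Y - mu_H0) / sigma_Y
    print(round(d, 2))  # 0.82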
Cohen’s d (sample)
d = (Ȳ − μH0) / SY
where SY = √(S²Y) = √(SS / N)
This formula is for computing the effect size as it
occurs in the sample. The difference between the
mean proposed by H0 and the mean of the sample
is divided by the standard deviation of the
sample.
12
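A short Python sketch of the sample version of d; the scores below are made up purely to illustrate the computation (note the divide-by-N in the standard deviation of the sample):

    # Cohen's d computed from a (hypothetical) sample of math scores.
    scores = [52, 58, 49, 61, 55, 57, 53, 60]
    mu_H0 = 50

    N = len(scores)
    Y_bar = sum(scores) / N
    SS = sum((y - Y_bar) ** 2 for y in scores)  # sum of squared deviations
    S_Y = (SS / N) ** 0.5                       # standard deviation of the sample

    d = (Y_bar - mu_H0) / S_Y
    print(round(d, 2))  # about 1.45 for this made-up sample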
Cohen’s d (sample)
Alternative Formula
d = tobt. / √df
This is an easy way to compute d if you have the
tobtained value and df from the t test for a single
group mean. It has the disadvantage of not
making it clear what d is actually measuring (i.e.
the standardized effect size).
13
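A sketch of the shortcut in Python, plugging in the t(23) = 2.45 result used earlier in the lecture:

    # Cohen's d (sample) from the t test output: d = t_obt / sqrt(df).
    import math

    t_obt = 2.45
    df = 23

    d = t_obt / math.sqrt(df)
    print(round(d, 2))  # 0.51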
Hedges’s g (estimate of the effect
size in population)
g = (est.μY − μH0) / est.σY = (Ȳ − μH0) / est.σY
where est.σY = √(est.σ²Y) = √(SS / (N − 1))
This formula uses the data from the sample to
estimate the effect size in the population.
14
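A sketch of Hedges' g using the same hypothetical scores as in the Cohen's d sketch above; the only change is that SS is divided by N − 1 rather than N:

    # Hedges' g: estimates the population effect size from the sample data.
    scores = [52, 58, 49, 61, 55, 57, 53, 60]
    mu_H0 = 50

    N = len(scores)
    Y_bar = sum(scores) / N
    SS = sum((y - Y_bar) ** 2 for y in scores)
    est_sigma_Y = (SS / (N - 1)) ** 0.5   # estimate of the population standard deviation

    g = (Y_bar - mu_H0) / est_sigma_Y
    print(round(g, 2))  # about 1.36, a bit smaller than the d of about 1.45 above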
Hedges’ g and Cohen’s d
g = (Ȳ − μH0) / est.σY
d = (Ȳ − μH0) / SY
For large samples the difference between est.σY and SY (the
estimate of the population std dev and the std dev of the sample)
will be quite small, and thus the values of Hedges’s g and Cohen’s
d will be quite close.
15
Interpreting ‘d’
Cohen proposed a simple way to evaluate the size
of an effect based upon the value of ‘d’ (and as
there is only a small difference between ‘d’ and
‘g’ it could apply to Hedges’s g as well). Note:
take the absolute value of d; whether it is
negative or positive is irrelevant to the strength
of the effect.
|d| = .2   a ‘small’ effect
|d| = .5   a ‘medium’ effect
|d| = .8   a ‘large’ effect
16
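If you wanted to automate the labeling, a small helper like the one below would do it; keep in mind that these cutoffs are conventions, not hard rules (see the next slides):

    # Cohen's rough benchmarks for |d|.
    def label_effect_size(d):
        d = abs(d)   # the sign is irrelevant to the strength of the effect
        if d >= 0.8:
            return "large"
        if d >= 0.5:
            return "medium"
        if d >= 0.2:
            return "small"
        return "below Cohen's 'small' benchmark"

    print(label_effect_size(0.82))  # large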
Interpreting ‘d’ (cont.)
Where did this come from? According to Cohen an
effect size of d=.5 (a ‘medium’ effect size) is
usually noticeable to someone looking at graphs
of the data. Subsequent surveys of the literature
have found that the average size of effects
reported in various fields is approximately equal
to a d of .5. A small effect (.2) is smaller than
that but still not too trivial, and a large effect (.8)
is the same distance above a medium effect as a
small effect is below it.
17
Interpreting ‘d’ (cont.)
Cohen offered these criteria with some misgivings. His
goal was to make the value of ‘d’ more meaningful but
he was worried that people would take them too
seriously (he was right). These criteria are fairly
arbitrary and are based upon what might be considered
the size of the effect viewed purely through the lens of
statistics. A ‘small’ effect might still be of great
theoretical interest, and a ‘small’ effect in the field of
medicine might lead to saving tens of thousands of lives
(giving it great social or pragmatic importance). A ‘large’
effect might be of little theoretical or practical
significance.
18
Interpreting ‘d’ (cont.)
The real value of Cohen’s effect size values (small,
medium, and large) will be seen when we discuss
‘power’. When computing the possible power of
an experiment that you are designing, you need
to guess what the effect size will be. Cohen’s
criteria provide one way to help you guess. If
you anticipate that the effect you will be looking
at will be small, then plug in a value of .2 for d,
etc. We will take a look at this later.
19
More on Standardized Effect Sizes
1) These formulas are for the context of testing a
single mean versus what is predicted by H0; different
forms of the formula are necessary for
other experimental designs (we will cover these
later).
20
More on Standardized Effect Sizes
2) Beware that there is a bewildering lack of consistency
in the literature on how to compute Cohen’s d. Often
the formula for finding the effect size in the population
will be given, followed by an example where the mean
and standard deviation of the sample are plugged into
the formula (under the assumption, I assume, that by
not generalizing to a larger population we are treating
the sample as our population of interest). One of the
reasons I like the way I use the symbols in this class is
that it makes it easy to see exactly what each formula
is accomplishing.
21
More on Standardized Effect Sizes
3) What does SPSS provide? None of these. SPSS
will provide the mean of Y and the ‘standard
deviation’ of Y (which is actually the est. σY),
making it a simple process to calculate Hedges’s
g. If you want to calculate Cohen’s d (for the
sample) you can either translate est. σY into S
using the formula given earlier this semester
(provided again below) or you can use the formula
for computing ‘d’ from ‘t’ (as SPSS will give you
both tobt and df).
S = est.σY · √((N − 1) / N), then d = (Ȳ − μH0) / S
or, d = tobt. / √df
22
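Here is a sketch of those conversions in Python, starting from numbers of the kind you would copy off the SPSS output; the specific values below are made up for illustration:

    # Converting what SPSS reports into g and d.
    import math

    Y_bar = 55.0         # mean of Y from SPSS
    est_sigma_Y = 8.2    # SPSS "standard deviation" (actually est. sigma_Y)
    N = 24
    mu_H0 = 50.0

    g = (Y_bar - mu_H0) / est_sigma_Y           # Hedges' g, straight from the output

    S = est_sigma_Y * math.sqrt((N - 1) / N)    # translate est. sigma_Y into S
    d_from_S = (Y_bar - mu_H0) / S              # Cohen's d (sample)

    t_obt = (Y_bar - mu_H0) / (est_sigma_Y / math.sqrt(N))
    d_from_t = t_obt / math.sqrt(N - 1)         # same d, via the t formula

    print(round(g, 2), round(d_from_S, 2), round(d_from_t, 2))  # 0.61 0.62 0.62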
More on Standardized Effect Sizes
4) Advantages of using standardized effect sizes:
a) If the effect size involves some metric that is hard
to conceptualize (e.g., an effect size of -0.2 in a
measure of inverse reaction times) then turning it
into a standard score will help. Cohen’s criteria
for what constitutes a small, medium, and large
effect size can give the standardized effect size
some level of meaning.
23
More on Standardized Effect Sizes
4) Advantages of using standardized effect sizes:
b) Standardizing the effect size makes it easier to do
meta-analysis (where you compare the effect size
of several different studies) particularly when the
studies are examining the same topic but with
different measures. By translating the effect sizes
found in all of the studies into standardized effect
sizes you turn them into essentially the same
metric so that they can be directly compared.
24
Example
Say one study used inches to measure the variable ‘length’
and found an effect size of 24 inches. Say another
study measured exactly the same subjects but used feet
to measure length and found an effect size of 2 feet.
The standard deviation of scores in the first study was
15, which would make the standard deviation of the
second study 1.25 (i.e., 15/12 = 1.25). While the
first study had an effect size of 24 (inches) and the
second study had an effect size of 2 (feet) we can see by
computing the d’s that they found the same effect.
d = (Ȳ − μH0) / S = (60 − 36) / 15 = 24/15 = 1.6
d = (5 − 3) / 1.25 = 2/1.25 = 1.6
25
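The equivalence is easy to confirm in a couple of lines (values taken from the example above):

    # Same subjects, different units: standardizing puts both studies on one metric.
    d_inches = (60 - 36) / 15    # effect of 24 inches, standard deviation of 15 inches
    d_feet = (5 - 3) / 1.25      # same effect in feet: 2 feet, standard deviation of 1.25 feet
    print(d_inches, d_feet)      # 1.6 1.6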
Example (cont.)
While it is obvious in the example that the effect size should
be the same, let’s apply the idea to a more realistic
scenario. In this scenario ‘Study A’ measures intelligence
using one IQ test and finds an effect size of 4. ‘Study B’
measures intelligence using a different IQ test and finds an
effect size of 6. The two IQ tests have different means and
different variances and it is hard to know how the two
effect sizes really compare, but if we change each effect
size to standardized differences we can compare them
directly.
dA = (Ȳ − μH0) / S = (112 − 116) / 11 = −4/11 = −.36
dB = (103 − 109) / 29 = −6/29 = −.21
26
More on Standardized Effect Sizes
5) Disadvantages of using standardized effect
sizes:
a) The problem with standard scores is that they take
you away from the units of measure that you used
in the study. It might be more useful to know that
the fertilizer increased growth rate by 24 inches a
year than to know that d=0.3.
27
More on Standardized Effect Sizes
5) Disadvantages of using standardized effect sizes:
b) Standardized effect sizes bring the standard deviation of the
scores into the expression of effect size, which in some
cases can obscure a clear understanding of the effect. Say
that the math teaching method raised the scores of students
on the average by 5, and these students were a pretty varied
lot (differed a lot in terms of math ability). Now say that in
another study the teaching method raised scores again by 5,
but in a class where the students were similar in math
ability. Even though the teaching method had the same
effect in both classes the values of d would differ in the two
studies (as the denominator would differ in the d formulas).
dA = (Ȳ − μH0) / S = (55 − 50) / 23 = 0.22
dB = (Ȳ − μH0) / S = (55 − 50) / 14 = 0.36
28
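The same point in a couple of lines of Python (values from the example above):

    # Identical raw effect (5 points), different spread, so different values of d.
    raw_effect = 55 - 50
    d_varied_class = raw_effect / 23     # S = 23 in the more varied class
    d_similar_class = raw_effect / 14    # S = 14 in the more homogeneous class
    print(round(d_varied_class, 2), round(d_similar_class, 2))  # 0.22 0.36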
More on Standardized Effect Sizes
6) Which to use, ‘g’ or ‘d’?
‘d’ gets more press; ‘g’ seems to be of more
interest (to me); you can use either, and with
any reasonably large N they will be very close in
value. If you want to compare your study to
other similar studies, see which one most of
them use so you can more easily compare.
29
3) Strength of Association
This category of effect size measures is also called
‘correlation’ or ‘amount of variance accounted for’.
Everything we do next semester will automatically
crank these out and in that context they will be quite
understandable. In the context of what we are doing
this semester (ANOVA) standardized measures
(such as ‘d’) are often used, consequently we will
hold off discussion of ‘Strength of Association’
measures of effect size until next semester.
30