Stats Refresher 2
Research Methods in
Psychology
Data Analysis and Interpretation: Part II.
Tests of Statistical Significance and the
Analysis Story
Null Hypothesis Significance Testing
(NHST)
Null hypothesis testing is used to determine
whether mean differences among groups in an
experiment are greater than the differences that are
expected simply because of error variation (chance).
Null Hypothesis Significance Testing
(NHST)
The first step in null hypothesis testing is to assume
that the groups do not differ — that is, that the
independent variable did not have an effect (i.e., the
null hypothesis — H0).
Probability theory is used to estimate the likelihood
of the experiment’s observed outcome, assuming
the null hypothesis is true.
NHST (continued)
A statistically significant outcome is one that has a
small likelihood of occurring if the null hypothesis
is true.
When the outcome is statistically significant, we reject
the null hypothesis and conclude that the independent
variable did have an effect on the dependent variable.
A statistically significant outcome indicates that
the difference between means obtained in an
experiment is larger than would be expected if
error variation alone (i.e., chance) were
responsible for the outcome.
NHST (continued)
How small does the probability have to be in order
to decide that a finding is statistically significant?
Consensus among members of the scientific
community is that outcomes with a probability of
less than 5 in 100 (p < .05), if the null hypothesis
were true, are judged to be statistically significant.
This is called alpha (α) or the level of significance.
NHST (continued)
What does a statistically significant outcome
tell us?
An outcome with a probability just below .05 (and thus
statistically significant) has about a 50/50 chance of being
repeated in an exact replication of the experiment.
As the probability of the outcome of the experiment
decreases (e.g., p = .025, p = .01, p = .005), the likelihood
of observing a statistically significant outcome (p < .05) in
an exact replication increases.
APA recommends reporting the exact probability of the
outcome.
NHST (continued)
What do we conclude when a finding is not
statistically significant?
We do not reject the null hypothesis of no difference
between the groups.
However, we don’t necessarily accept the null hypothesis
either — that is, we don’t conclude that the independent
variable did not have an effect.
We cannot make a conclusion about the effect of the
independent variable. Some factor in the experiment may
have prevented us from observing an effect of the
independent variable (e.g., too few participants).
NHST (continued)
Because decisions about the outcome of an
experiment are based on probabilities, Type I or
Type II errors may occur.
A Type I error occurs when the null hypothesis is
rejected, but the null hypothesis is true.
That is, we claim that the independent variable is
statistically significant (because we observed an outcome
with p < .05) when there really is no effect of the
independent variable.
The probability of a Type I error is alpha, the level
of significance (α = .05).
NHST (continued)
A Type II error occurs when the null hypothesis is
false, but it is not rejected.
That is, we claim that the independent variable is not
statistically significant (because we observed an outcome
with p > .05) when there really is an effect of the
independent variable that our experiment missed.
Because of the possibility of Type I and Type II
errors, researchers are tentative in their claims. We
use words such as “support for the hypothesis” or
“consistent with the hypothesis” rather than stating
that a hypothesis has been “proven.”
NHST: Comparing Two Means
The appropriate inferential statistical test when
comparing two means obtained from different groups
of participants is a t -test for independent groups.
The appropriate test when comparing two means
obtained from the same participants (or matched
groups) is a repeated measures (within-subjects) t-test.
A measure of effect size should be reported when
NHST is used.
Comparing Two Means (continued)
Independent Groups t-test
The t-test for independent groups is defined as the
difference between two sample means (e.g., treatment
group and control group) divided by the standard error
of the mean difference, $s_{M_1 - M_2}$.
The calculation formula is:

$$t = \frac{M_1 - M_2}{\sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$
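To make the computation concrete, here is a brief sketch in Python (not part of the original slides). The scores are hypothetical; `scipy.stats.ttest_ind` with `equal_var=True` computes the same pooled-variance t as the formula above.

```python
# Sketch: independent groups t-test (hypothetical scores).
import numpy as np
from scipy import stats

treatment = np.array([12, 15, 11, 14, 16, 13])  # hypothetical treatment-group scores
control   = np.array([10,  9, 12, 11,  8, 10])  # hypothetical control-group scores

# scipy's pooled-variance t-test corresponds to the formula above.
t, p = stats.ttest_ind(treatment, control, equal_var=True)

# The same t computed directly from the calculation formula.
n1, n2 = len(treatment), len(control)
pooled = ((n1 - 1) * treatment.var(ddof=1) + (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2)
t_manual = (treatment.mean() - control.mean()) / np.sqrt(pooled * (1 / n1 + 1 / n2))

print(t, t_manual, p)  # t and t_manual agree; df = n1 + n2 - 2 = 10
```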
Comparing Two Means (continued)
Standard deviation and Variance.
The standard deviation (SD or s) is a measure of how
far on the average a score (X) is from the mean.
Formula:

$$s = \sqrt{\dfrac{\sum (X - M)^2}{N - 1}}$$
The variance (s2) is a measure of variability; it is the
square of the standard deviation.
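A minimal sketch of these two measures in Python (illustration only; the scores are hypothetical):

```python
# Sketch: sample standard deviation and variance.
import numpy as np

scores = np.array([4, 7, 6, 5, 8])
M = scores.mean()

sd = np.sqrt(((scores - M) ** 2).sum() / (len(scores) - 1))  # formula above
print(sd, np.std(scores, ddof=1))        # identical; ddof=1 gives the N - 1 denominator
print(sd ** 2, np.var(scores, ddof=1))   # the variance is the squared standard deviation
```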
Comparing Two Means (continued)
Using a calculator or statistical software, obtain a
value for the t statistic. Next, identify the probability
associated with the outcome.
If a computer and statistical software are used, the
probability of the outcome will be presented with
the value for t as part of the output.
If the value for t is calculated using the formula, the
probability of the outcome can be found by using
the t table (Table A.2 of the Appendix) with df = N –
2.
Comparing Two Means (continued)
If the probability of the outcome is less than .05 (p
< .05), reject the null hypothesis of no difference
between the means, and conclude that the
independent variable had a statistically significant
effect on the dependent variable.
If the probability of the outcome is greater than .05
(p > .05), do not reject the null hypothesis of no
difference between the means. With a
nonsignificant outcome, we withhold judgment
about the effect of the independent variable.
Calculate the effect size.
Determine the power of the statistical test.
Comparing Two Means (continued)
A measure of effect size should always be
calculated.
For two means, Cohen’s d can be calculated using
values from the t-test:

$$d = \frac{2t}{\sqrt{df}}$$
Sometimes a large effect size can be observed with
an outcome that is not statistically significant. This
can occur when there is not sufficient power to
detect the effect of the independent variable (e.g.,
too few participants).
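As a quick illustration of the d-from-t formula above (the t and df values here are hypothetical, not taken from the slides):

```python
# Sketch: Cohen's d from a reported t value and its degrees of freedom.
import math

t_value = 2.50   # hypothetical t from an independent groups t-test
df = 18          # hypothetical df (n1 + n2 - 2)

d = (2 * t_value) / math.sqrt(df)
print(round(d, 2))  # ~1.18
```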
Comparing Two Means (continued)
A repeated measures (within-subjects) t-test is used to test
the difference between performance in the
treatment condition and the control condition in a
repeated measures design or matched groups
design.
$$t = \frac{\bar{D}}{s_{\bar{D}}}$$

where $\bar{D}$ is the mean of the difference scores between the
treatment and control conditions (one difference score for each
participant) and $s_{\bar{D}}$ is the standard error of the difference scores.
Comparing Two Means (continued)
The standard error of the mean formula:

$$s_M = \frac{s}{\sqrt{N}}$$

The standard error of the difference scores, $s_{\bar{D}}$, is computed
the same way: the standard deviation of the difference scores
divided by $\sqrt{N}$.
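The following sketch (hypothetical data, illustration only) computes the repeated measures t both from the difference scores, as in the formulas above, and with `scipy.stats.ttest_rel`:

```python
# Sketch: repeated measures (paired) t-test via difference scores.
import numpy as np
from scipy import stats

treatment = np.array([14, 16, 13, 17, 15, 18])  # same participants, treatment condition
control   = np.array([12, 15, 11, 14, 13, 16])  # same participants, control condition

D = treatment - control                  # one difference score per participant
se_D = D.std(ddof=1) / np.sqrt(len(D))   # standard error of the difference scores, s / sqrt(N)
t_manual = D.mean() / se_D               # D-bar divided by its standard error

t, p = stats.ttest_rel(treatment, control)  # scipy's paired t-test gives the same t
print(t_manual, t, p)                        # df = N - 1 pairs
```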
Comparing Two Means (continued)
Although the formula for calculating the t statistic in
the repeated measures design is slightly different, the
procedures for NHST are the same.
The t value is obtained, followed by the associated
probability value.
If p < .05, reject the null hypothesis, and conclude the
independent variable had a statistically significant effect
on the dependent variable.
If p > .05, do not reject the null hypothesis; the
outcome of the statistical test was not significant.
Data Analysis Involving More Than
Two Conditions
An experiment can have one independent variable
with more than two levels, or an experiment might
have two or more independent variables (each with
at least two levels) in a complex design.
The most frequently used statistical procedure for
experiments with more than two conditions is
analysis of variance (ANOVA), which uses null
hypothesis significance testing (NHST).
ANOVA
Analysis of variance (ANOVA) is an inferential
statistics test used to determine whether an
independent variable has had a statistically significant
effect on a dependent variable.
The logic of ANOVA is based on identifying sources
of error variation and systematic variation in the data.
In a properly conducted experiment, the
differences among participants should be the only
source of error variation within each group.
The experimental procedures should be held
constant within each condition to decrease error
variation.
ANOVA (continued)
The second source of variation in a random groups
design is variation between the groups.
If the null hypothesis is true (no difference between
the groups), any observed difference among the
means of the groups can be attributed to error
variation — the differences among people in the
groups.
Thus, when the null hypothesis is assumed to be
true, any differences among the means in the
experiment are attributed to error variation; both the
within-groups and between-groups variation estimate
error variation.
ANOVA (continued)
When the null hypothesis is false (the independent
variable has an effect), the means for the conditions
of the experiment should be different.
An independent variable that has an effect on
behavior should produce systematic differences in the
means across the conditions of the experiment.
Therefore, when the independent variable has an
effect on behavior, the differences among group means
can be attributed to the effect of the independent
variable (systematic variation) plus error variation.
ANOVA (continued)
The F-test is a statistical test that allows us to determine
whether the variation due to the independent variable is
larger than what would be expected based on error
variation alone.
The conceptual definition of the F-test is:

$$F = \frac{\text{Variation between groups}}{\text{Variation within groups}}$$
ANOVA (continued)
Because “variation between groups” can be attributed
to error variation plus systematic variation, and
“variation within groups” is attributed to error
variation, the F ratio can be re-written as:
$$F = \frac{\text{Error variation} + \text{Systematic variation}}{\text{Error variation}}$$
ANOVA (continued)
If the null hypothesis is true, there is no systematic
variation between groups (no effect of the independent
variable), and the F ratio has an expected value of 1.00.
Error variation divided by error variation would
equal 1.0.
$$F = \frac{\text{Error variation} + \text{Systematic variation (zero)}}{\text{Error variation}} = 1.00$$
ANOVA (continued)
As the amount of systematic variation increases (due to the
effect of the independent variable), the expected value of the F
ratio becomes greater than 1.00.
$$F = \frac{\text{Error variation} + \text{Systematic variation (effect of the IV)}}{\text{Error variation}}$$
How much greater than 1.00 does the F ratio have to be before
we can be confident that it reflects true systematic variation due
to the independent variable (and not simply chance factors)?
This is where NHST comes in: To be statistically significant, the
F value needs to be large enough so that its probability of
occurrence if the null hypothesis were true is less than our level
of significance (p < .05).
ANOVA (continued)
The logic of statistical inference with ANOVA is
similar to that used with the t-test.
The first step is to assume no difference among the
means of the experiment.
If the omnibus F-test is statistically significant, we
reject the null hypothesis of no difference among
means.
A statistically significant F-test indicates that there
is a difference somewhere among the means in the
experiment.
The statistically significant omnibus, or overall, F-test does not indicate which means are different.
ANOVA (continued)
The ANOVA Summary Table provides the information for
estimating the sources of variance: between groups (systematic +
error variation) and within groups (error variation).
Source            SS      df   MS      F      p
Group (between)   54.55    3   18.18   7.80   .002
Error (within)    37.20   16    2.33
The Mean Square for the “Group” independent variable provides
an estimate of systematic variation plus error variation.
The Mean Square for “Error” provides an estimate of error
variation.
The F-test is the Group MS divided by the Error MS (18.18 ÷ 2.33
= 7.80).
This F-test is statistically significant because .002 < .05.
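The arithmetic behind the summary table can be sketched in Python (illustration only; `scipy.stats.f.sf` supplies the probability that a table or software would report). Small differences from the tabled values reflect rounding.

```python
# Sketch: from sums of squares and df to MS, F, and p.
from scipy import stats

ss_between, df_between = 54.55, 3   # Group (between) row
ss_within,  df_within  = 37.20, 16  # Error (within) row

ms_between = ss_between / df_between          # ~18.18
ms_within  = ss_within / df_within            # ~2.33
F = ms_between / ms_within                    # ~7.8

p = stats.f.sf(F, df_between, df_within)      # probability of an F this large if H0 were true
print(ms_between, ms_within, F, round(p, 3))  # p ≈ .002 < .05, statistically significant
```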
ANOVA (continued)
The between-groups sum of squares is the sum of the
squared differences between each group mean and the
overall (grand) mean, with each squared difference
weighted by the number of scores in that group.
The within-groups sum of squares is the sum of the
squared differences between each individual score and
the mean of that score’s group.
The total sum of squares is equal to the sum of the
between-groups SS and the within-groups SS.
The mean square (MS) is simply SS divided by its df.
ANOVA (continued)
Because the F-test is statistically significant, we
reject the null hypothesis, and conclude that the
independent variable had a statistically significant
effect on the dependent variable.
The significant F-test tells us that the group means
in the experiment are different — but it doesn’t tell
us which means in the experiment are different.
It is essential to examine the means to interpret the
effect of the independent variable. Simply finding
out whether the effect was statistically significant or
not is not sufficient when analyzing data.
Calculating Effect Size for Designs with
Three or More Independent Groups
The effect size for experiments with three or more
groups is based on measures of “strength of
association.”
These measures allow researchers to estimate the
amount of variability (variance) in participants’
scores that can be attributed to the effect of the
independent variable.
Larger effect sizes indicate that the independent
variable can account for or “explain” participants’
performance more than smaller effect sizes.
In ANOVA, a popular measure of association is eta
squared (η2).
Effect Size (continued)
Eta squared is easily calculated from values found in
the ANOVA summary table:

$$\eta^2 = \frac{SS_{\text{between groups}}}{SS_{\text{total}}}$$

Eta squared can also be calculated simply from the
report of an F-test:

$$\eta^2 = \frac{F \times df_{\text{effect}}}{(F \times df_{\text{effect}}) + df_{\text{error}}}$$
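Both versions of the eta squared calculation, using the summary-table values shown earlier (illustration only):

```python
# Sketch: eta squared computed two ways.
ss_between, ss_within = 54.55, 37.20
F, df_effect, df_error = 7.80, 3, 16

eta_sq_from_ss = ss_between / (ss_between + ss_within)         # SS between / SS total
eta_sq_from_F  = (F * df_effect) / ((F * df_effect) + df_error)
print(round(eta_sq_from_ss, 2), round(eta_sq_from_F, 2))       # both ≈ .59
```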
Effect Size (continued)
Another measure of effect size for use with three or
more groups is Cohen’s f.
Calculate f using the value for eta squared:

$$f = \sqrt{\frac{\eta^2}{1 - \eta^2}}$$
Cohen’s suggested guidelines for interpreting effect
sizes using f:
Small: f = .10
Medium: f = .25
Large: f = .40
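A one-line conversion from eta squared to Cohen’s f (illustration only; the eta squared value continues the example above):

```python
# Sketch: Cohen's f from eta squared.
import math

eta_sq = 0.59
f = math.sqrt(eta_sq / (1 - eta_sq))
print(round(f, 2))  # ≈ 1.20, well above the f = .40 "large" benchmark
```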
Assessing Power for Independent Groups
Designs
Suppose a researcher observes an effect size of f = .40
(a “large” effect), but the effect of the independent
variable is not statistically significant. Suppose there
were 5 participants in each of four conditions (df = 3
for the effect of the independent variable).
By referring to Power tables (Table A.5 in the
Appendix), the researcher discovers that the power
was .26.
Power (continued)
When power = .26, a statistically significant outcome
would be expected in only about one-fourth of attempts
to conduct this experiment under these circumstances
(i.e., with 5 participants in each of 4 conditions and an
effect size of f = .40).
Typically, before beginning their research, researchers
identify the number of participants they would need to
achieve power = .80 (a statistically significant outcome
would be expected in 80% of attempts to conduct the
experiment).
In this example, the power table indicates we would need 18
participants in each of the 4 conditions for a total of N =
72.
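The power table’s values can be approximated directly with the noncentral F distribution in Python. This is a sketch, not the textbook’s procedure; it assumes Cohen’s noncentrality parameter λ = f² × N, and small discrepancies from the printed table are expected.

```python
# Sketch: power of the one-way ANOVA F-test via the noncentral F distribution.
from scipy import stats

k, n, f_effect, alpha = 4, 5, 0.40, .05       # 4 conditions, 5 participants each, f = .40
N = k * n
df1, df2 = k - 1, N - k                       # 3 and 16
nc = (f_effect ** 2) * N                      # noncentrality parameter, lambda = f^2 * N

f_crit = stats.f.ppf(1 - alpha, df1, df2)         # critical F for alpha = .05
power = 1 - stats.ncf.cdf(f_crit, df1, df2, nc)   # P(significant outcome | effect of size f)
print(round(power, 2))                             # roughly .26-.27, close to the tabled .26
```

Increasing n in this calculation until the power reaches .80 gives roughly 18 participants per condition, consistent with the power table cited above.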
Comparisons of Two Means
When an independent variable with three or more
levels is statistically significant, the next step is to
identify which of the group means in the
experiment are different: These are called
“comparisons of two means.”
These comparisons focus on a particular difference
between two means.
For example, suppose that an experiment has two
control groups and one treatment group, and that
the F-test for this independent variable with three
levels is statistically significant.
Comparisons of Two Means (continued)
One comparison, in this example, would be to determine
whether the mean for the treatment group is significantly
different from the average of the means for the two control
groups.
A t-test can be used to compare the means using the
following formula:

$$t = \frac{M_1 - M_2}{\sqrt{MS_{\text{error}}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

The $MS_{\text{error}}$ comes from the ANOVA summary table; $n_1$ and
$n_2$ are the sample sizes associated with each mean in the
comparison.
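A sketch of such a comparison in Python (the two means and sample sizes are hypothetical; the MS error and its df come from the earlier summary table):

```python
# Sketch: comparison of two means after a significant omnibus F-test.
import math
from scipy import stats

M1, M2 = 7.4, 4.9               # hypothetical treatment mean and (combined) control mean
n1, n2 = 5, 10                  # sample sizes associated with each mean
ms_error, df_error = 2.33, 16   # from the ANOVA summary table

t = (M1 - M2) / math.sqrt(ms_error * (1 / n1 + 1 / n2))
p = 2 * stats.t.sf(abs(t), df_error)    # two-tailed probability, in place of Table A.2
print(round(t, 2), round(p, 3))
```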
Comparisons of Two Means (continued)
The statistical significance of the t-test can be obtained by
checking a t-test table (Table A.2 in the text), or by
using a computer program in which the observed t value and
df are entered, and the exact probability of the result is
obtained. One Web site to check is:
http://math.uc.edu/~brycw/classes/148/tables.htm
Cohen’s d can be calculated for the comparison using
the following formula:
$$d = \frac{2t}{\sqrt{df_{\text{error}}}}$$
Repeated Measures Analysis of Variance
The general procedures and logic for null
hypothesis testing using repeated measures analysis
of variance are similar to those used for
independent groups analysis of variance.
Before beginning the ANOVA for a complete
repeated measures design, a summary score (e.g.,
mean) for each participant must be computed for
each condition.
Descriptive data are calculated to summarize
performance for each condition of the independent
variable across all participants.
Repeated Measures ANOVA (continued)
The primary way that ANOVA differs for repeated
measures is in estimation of error variation or
residual variation.
Residual variation is the variation that remains
when systematic variation due to the independent
variable and participants is removed from the
estimate of total variation.
Repeated Measures ANOVA (continued)
Variation due to different participants in conditions is
eliminated in repeated measures designs because the
same individuals participate in each condition.
Because this source of variation is eliminated, repeated
measures designs are more sensitive than independent
groups designs — they are better able to detect the
effect of an independent variable when that effect is
present.
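For reference only (the slides do not mention any particular software), a one-way repeated measures ANOVA can be run with statsmodels’ `AnovaRM`; the long-format data and column names below are hypothetical.

```python
# Sketch: one-way repeated measures ANOVA (hypothetical long-format data).
import pandas as pd
from statsmodels.stats.anova import AnovaRM

data = pd.DataFrame({
    "subject":   [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "condition": ["a", "b", "c"] * 4,
    "score":     [5, 7, 9, 4, 6, 8, 6, 7, 10, 5, 8, 9],
})

# Variation due to participants is removed; the F-test uses the residual (error) variation.
result = AnovaRM(data, depvar="score", subject="subject", within=["condition"]).fit()
print(result)
```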
Two-Factor Analysis of Variance for
Independent Groups Designs
Complex designs have two or more independent
variables — each with two or more levels.
The ANOVA indicates the statistical significance of
main effects of each independent variable and the
interaction effect(s) between variables.
The analysis of complex designs differs depending on
whether an interaction effect is statistically significant or
not.
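As an illustration only (the factor names and data are hypothetical), a two-factor ANOVA for an independent groups design can be obtained with statsmodels’ formula interface; the resulting table contains the two main effects, the interaction, and the error term.

```python
# Sketch: two-factor (A x B) ANOVA for a between-subjects design.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.DataFrame({
    "a":     ["low", "low", "low", "low", "high", "high", "high", "high"] * 2,
    "b":     ["old"] * 8 + ["young"] * 8,
    "score": [3, 4, 5, 4, 6, 7, 6, 8, 4, 3, 4, 5, 9, 10, 9, 11],
})

model = smf.ols("score ~ C(a) * C(b)", data=df).fit()   # main effects and A x B interaction
print(anova_lm(model, typ=2))                            # F and p for each effect
```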
Analysis of a Complex Design with an
Interaction Effect
If the omnibus (overall) ANOVA reveals a statistically
significant interaction effect, the source of the interaction is
identified using simple main effects analyses and comparisons
of two means.
A simple main effect is the effect of one independent variable
at one level of a second independent variable.
If an independent variable has three or more levels,
comparisons of two means can be used to examine the source
of a simple main effect by comparing means two at a time.
After the simple main effects are analyzed, researchers examine
the main effects of the independent variables.
Analysis of a Complex Design with an
Interaction Effect (continued)
Confidence intervals may be drawn around group
means to provide information regarding how
precisely the population means have been
estimated.
The wider the intervals around the sample means, the
less precise the estimate of the population means.
A rule of thumb for interpreting confidence intervals is
that if the intervals around the means do not overlap,
then a difference between the population means is likely.
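A sketch of a t-based 95% confidence interval around a single group mean (hypothetical scores; illustration only):

```python
# Sketch: 95% confidence interval around a group mean.
import numpy as np
from scipy import stats

group = np.array([6, 8, 7, 9, 5, 8, 7, 6])
M = group.mean()
sem = group.std(ddof=1) / np.sqrt(len(group))     # standard error of the mean, s / sqrt(N)
t_crit = stats.t.ppf(0.975, df=len(group) - 1)    # two-tailed critical t for 95%

lower, upper = M - t_crit * sem, M + t_crit * sem
print(round(M, 2), (round(lower, 2), round(upper, 2)))
```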
Analysis with No Interaction Effect
If an omnibus ANOVA indicates the interaction effect
between independent variables is not statistically
significant, the next step is to determine whether the
main effects of the independent variables are
statistically significant.
The source of a statistically significant main effect can
be specified more precisely by performing comparisons
that compare means two at a time and by constructing
confidence intervals.
Reporting Results of a Complex Design
The following should be included when describing the
results of a complex design experiment:
Description of variables and definition of levels
(conditions) of each;
Summary statistics for cells of the design in text,
table, or figure; including, when appropriate,
confidence intervals for group means;
Report of F-tests for main effects and interaction
effects with exact probabilities.
Reporting Results (continued)
The following should be included when describing the
results of a complex design experiment:
Effect size measure for each effect;
Statement of power for nonsignificant effects;
Simple main effects analysis when interaction effect
is statistically significant and comparisons of means
two at a time, if appropriate;
Verbal description of statistically significant
interaction effect (when present), referring reader to
differences between cell means across levels of the
independent variables.
Reporting Results (continued)
The following should be included when describing the
results of a complex design experiment:
Comparisons of two means, when appropriate, to
clarify sources of systematic variation among means
contributing to main effect;
Conclusion that you wish reader to make from the
results of this analysis.