Inferential statistics
PALA Summer School
2014
Inferential Statistics
Willie van Peer
Ludwig-Maximilians-University
Munich
[email protected]
Inferential statistics
• Also ‘test statistics’
• sample ---> population?
• Tests whether observed results in a sample
may be generalized to the population.
• Not as ‘yes’ or ‘no’, but as a probability.
• Statistics is a discipline in which such
probabilities are investigated.
Sample vs. Population
• Suppose you wish to investigate whether ‘free’ or
‘guided’ reading lessons in school yield different
pedagogical results.
• You will have to make observations, ask
questions, in one word: collect data.
• But impossible to ask ALL pupils in your country.
• So you make a SELECTION: a sample.
• But you are not interested in this sample only:
• You want to go beyond the sample: to the
population (here in the statistical sense!)
This is a generalization
• Beyond the sample.
• But this can be tricky / dangerous / fatal (!)
• Gunners in WWII bombers mostly said the attacks
came from behind and above.
• Can one generalize these answers?
• In order to be able to, the sample has to be
representative of the population.
• Is that condition fulfilled here?
• Of course it is NOT. Think why!
• Suppose you have followed the two instruction
methods for reading (‘free’ or ‘guided’) in 4
Maribor schools for 2 months.
• Your data suggest that the ‘guided’ method yields
superior pedagogic effects for boys.
• Are you able to generalize your findings to:
– All boys in Maribor schools? (Most probably)
– All pupils in Maribor schools? (Certainly not)
– All Slovene boys? (Difficult, maybe)
– All Slovene pupils? (Definitely not)
– All pupils? (no way)
A paradox
• The paradox of sampling: you need to know what
you are in fact trying to find out…
• If your sample is not representative, its data will
be misleading, but how do you know whether it is
representative?
• Another serious problem: self-selection of
participants!
• To avoid sampling problems when asking people
in the street: use random numbers, or accost
every 4th or 5th person who walks by.
Errors
• 2 types:
– constant errors (E-group in Maribor, C-group in
Ljubljana!)
– random errors (the weather, the time of the day/year,
the general mood in the country, …)
• Constant errors must be kept under control at all costs, e.g.
through randomization. This does not eliminate errors, but
turns them into random errors.
• Random errors cannot be avoided!
• When we nevertheless find an effect in the E-group: a
‘robust’ effect!
Hence
• We must estimate how great the probability
is that the effect came about through
random errors.
• I.e.: how probable is it that only the
unavoidable random errors created the
observed effect of the IV?
• When this is not particularly probable, we
decide that the IV had an effect on the DV.
• But when do we judge something ‘not
particularly probable’?
An example
• We wish to know whether reading a story
with a sad ending is judged more rewarding
than reading a story with a ‘happy end’.
• Imagine we asked 8 people, 7 of whom said
they preferred the sad version, and only 1
the happy version.
• How probable is such a result?
• To investigate this, let us start from the fact
that every informant had two possible
choices (prefer ‘sad’, or prefer ‘happy’),
both therefore having a probability of 50 %.
Probability
• Each informant (VP) can answer + or - [+ here means: prefer the sad version]
• With 2 informants (VP1 and VP2) there are 4 possible results (2²):
VP1 + VP2 +
VP1 + VP2 -
VP1 - VP2 +
VP1 - VP2 -
Out of these 4 possibilities
• 2 x + occurs once:
p = 1/4 = 0.25
• 1 x + occurs twice:
p = 2/4 = 0.50
• 0 x + occurs once:
p = 1/4 = 0.25
Tree structure for 4 Ss
[Tree diagram: each of the four subjects S1, S2, S3, S4 branches into + and -, giving every possible answer pattern]
• Now there are 16 possibilities in all: 2⁴
The 16 possibilities are distributed as follows:
No. of +   Fraction   Probability
4          1/16       0.0625
3          4/16       0.2500
2          6/16       0.3750
1          4/16       0.2500
0          1/16       0.0625
For 8 Ss
No. of +   Fraction   Probability
8          1/256      0.004
7          8/256      0.031
6          28/256     0.109
5          56/256     0.219
4          70/256     0.273
3          56/256     0.219
2          28/256     0.109
1          8/256      0.031
0          1/256      0.004
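The fractions in this table are binomial coefficients: with n informants, there are n-choose-k ways to obtain exactly k ‘+’ answers, out of 2ⁿ equally likely answer patterns. A minimal Python sketch (not part of the original slides) that reproduces the table for 8 subjects:

```python
from math import comb

n = 8  # number of informants
for k in range(n, -1, -1):
    ways = comb(n, k)      # number of answer patterns with exactly k '+'
    p = ways / 2 ** n      # probability if every answer is a 50/50 coin flip
    print(f"{k} x +: {ways}/{2 ** n} = {p:.3f}")
```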
p
• 7 out of 8 informants said they preferred the
‘sad’ version of the story.
• The probability of this we now know: p = 0.031
• p = probability, varying between 0 (NEVER
happens) and 1 (ALWAYS happens).
• Moving the decimal point 2 places to the right gives the percentage.
• p = 0.031 means: 3.1 % probability
• = the probability that the results came about by
random errors!
• I.e. the error probability.
• Which must be as low as possible!
Because random errors will
NEVER go away
• p means the probability that we falsely conclude
that the IV had an effect on the DV
• How certain are we therefore? 100 - 3.1 = 96.9 %
• When we repeat this experiment 100 times, we will on
average find the same result 96.9 times.
• In such a situation it is allowed to say that the
ending of a story has an effect on readers’
preference.
Graphically
• Distribution of the number of + when only
random errors are at stake.
• Both 0 + and 8 + are rare (p = 0.004 = 0.4 %)
• Also 1 + and 7 + (p = 0.031)
• How low must p be?
• No ultimate answer, because random errors
remain!
[Histogram of the number of + over all possible answer patterns: Mean = 4.0, Std. Dev = 1.44]
H0 vs. Ha
• We are testing a hypothesis.
• Usually a hypothesis of difference (between
groups) = Ha, the alternative hypothesis.
• Its logical opposite is the hypothesis of no
difference = H0 (the ‘null hypothesis’).
• We try to REJECT Ha.
• If we do not succeed, we reject the null hypothesis =
H0.
• But watch out: in science, we have to be
cautious!
Alpha (α)
• Choose a ‘significance level’ (= α).
• α high: high probability of making a Type 1 error:
concluding that the IV had an effect on the DV,
when in reality it did not.
• α small: high chance of making a Type 2 error:
accepting H0 although it is wrong.
Error types and alpha
• α HIGH: Type 1 Error: we falsely accept the Ha
(we think the IV had an influence, but it does not).
• α LOW: Type 2 Error: we falsely accept the H0
(we think the IV had no influence, but it does).
Decision matrix
                    H0 is true         H0 is false
Fail to reject H0   correct decision   Type 2 error
Reject H0           Type 1 error       correct decision
However,
• We have no means of knowing whether the
H0 is really true or false.
• All we can do is reduce the uncertainty of our
decision,
• and thereby reduce the chance of making a Type
1 error.
• There are no certainties, only probabilities in
statistics.
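To see what this error probability means in practice, one can simulate many experiments in which H0 is true by construction and count how often a test nevertheless comes out ‘significant’. A rough sketch (not part of the original slides; assumes numpy and scipy, all data invented):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_experiments = 10_000
false_positives = 0

for _ in range(n_experiments):
    # Both groups are drawn from the SAME population, so H0 is true here.
    a = rng.normal(loc=0.0, scale=1.0, size=30)
    b = rng.normal(loc=0.0, scale=1.0, size=30)
    p = stats.ttest_ind(a, b).pvalue
    if p < alpha:
        false_positives += 1   # Type 1 error: H0 rejected although it is true

print(false_positives / n_experiments)   # close to 0.05, i.e. close to alpha
```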
Compare to a case in court
• When very weak evidence for a crime is accepted
by a court of law, then a lot of (innocent) people
are going to be convicted.
• If a court accepts only the strongest forms of
evidence, then a lot of criminals will go free
without a conviction.
• So … some kind of balance is needed.
• And this balance can best be provided if you know
a bit about statistics.
A memory enhancing drug
• We select 100 students, 50 of whom get the drug,
the other 50 a placebo (without them knowing
who got what!)
• We then give them some exam which is heavily
dependent on memory.
• Results are scored by examiners who do not know
which student got which pill.
• This is called a double blind design:
• Neither observer nor observed know who is who.
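A small sketch, in Python, of how the randomized, double-blind assignment described above could be organized (the student codes and group sizes are invented for illustration):

```python
import random

random.seed(42)
students = [f"S{i:03d}" for i in range(1, 101)]  # 100 anonymous student codes
random.shuffle(students)                         # randomization: constant errors become random errors
drug_group, placebo_group = students[:50], students[50:]

# Only a neutral third party keeps this mapping; students and examiners see
# the codes only, which is what makes the design double blind.
assignment = {code: "drug" for code in drug_group}
assignment.update({code: "placebo" for code in placebo_group})
print(len(drug_group), len(placebo_group))
```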
How big must the difference be?
• This is a somewhat misleading question.
• It is like asking “How tall must you be to become
a good basketball-player?”
• Well, ideally as tall as possible.
• But there is no clear-cut height below which you
cannot dream of it.
• So it is a sliding scale.
• So it is with p-values: the lower the better!
• But statisticians have established a conventional
level: p < .05
• But does this mean that p = .049 is significant,
while p = .051 is not?
• That is to fundamentally misunderstand the
nature of p-values.
• The criterion of .05 is merely a convention.
• The lower it is, the more confident we are that
we may reject the H0.
• If that level is marginally above the .05 criterion,
it does not mean that the Ha has no plausibility.
• It is exactly this sliding scale that makes
significance values so informative.
• BTW: .05 means one in twenty!
• Within this range: 95 % of all observations.
• Outside this range
remains 5 % of all
observations.
• This is the level of
random errors we are
ready to accept.
• Here we say that the IV
had an effect on the DV
• Therefore we reject the
H0
• Since we know about
the normal distribution
• We know that 68 % of
all values lie within 1
SD of the mean
• 95 % within 2 SD.
• Mean = 4.00, SD = 1.44
• 2 SD = 2.88.
• I.e. between 1.12 and
6.88.
• Our observations lie
outside this range.
• Hence: significant!
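A quick check of the arithmetic on this slide (the second part assumes scipy and uses the exact 1.96 cut-off rather than the rounded ‘2 SD’ rule):

```python
mean, sd = 4.00, 1.44
lower, upper = mean - 2 * sd, mean + 2 * sd
print(round(lower, 2), round(upper, 2))   # 1.12 6.88: roughly 95 % of values lie here

from scipy import stats
print(stats.norm.interval(0.95, loc=mean, scale=sd))   # about (1.18, 6.82)
```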
Region of rejection
• p < 0.05 = ‘significant’ / p < 0.01 = ‘highly significant’ / p < 0.001 = ‘very highly significant’
• The significance level separates: 1) the area where only random errors had an effect, from 2) the area where the IV had an effect on the DV (the critical region)
• This is where we reject the H0
NB
• A significant difference does NOT imply a value
judgment
• It merely tells us how likely it is that the results are due to
chance.
• Whether this leads to any change (for instance in
instruction methods, a new medicine, etc.) has to
be decided on other than statistical grounds,
• e.g. how much it costs (in time, money,
learning curve, …), what the consequences are of
not changing anything, etc.
Comparison of means
• In general: between E- and C-groups
• or between 2 E-groups
• To compare them: certain statistical techniques
(= tests --> test statistics = inferential statistics)
• A matrix of measurement level + normal
distribution (yes / no) + type of sample (independent /
dependent) tells you which test to use
• Better still: a decision chart (see Scientific Methods
for the Humanities, p. 231 ff.)
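To make the comparison of means concrete, here is a minimal sketch of an independent-samples t-test (the E- and C-group scores below are invented, not data from the slides; assumes scipy):

```python
from scipy import stats

# Invented exam scores for an experimental (E) and a control (C) group.
e_group = [78, 85, 90, 72, 88, 95, 81, 79, 87, 91]
c_group = [70, 75, 80, 68, 74, 77, 72, 69, 73, 76]

result = stats.ttest_ind(e_group, c_group)
print(result.statistic, result.pvalue)   # p < .05 would let us reject H0
```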
Levels of measurement
• Nominal: putting a variable into a category, e.g.
gender, place of living, political preference, etc.
• Ordinal: these are ordered categories, e.g.
education level, EFL proficiency level, preference
of musical composer, price of a car, …. [lacking is
the distance between ranks: is the 2nd composer
only half as good as the first one? And the 3rd one?]
• Interval: scaled order, with equal distances.
• Ratio: likewise, but now with a zero-point. E.g.
age, dividing 100 points among 4 authors, …
[Flowchart: decision chart for choosing a test]
Three possibilities
1. The means of the samples differ
2. The variance of the samples differs
3. Both the mean and the variance differ
• In each case, we apply statistical tests to
estimate the significance of the differences.
• When p is below the conventional level of 5 %
(error probability), we accept that the sample
differences may be generalized to the
population.
Variables
• Attributes, characteristics, qualities, etc.
• E.g. gender, age, nationality, but above all:
‘treatment’ (what you think exerts an influence)
• = independent variable.
• ‘Reactions’ to the treatment (what you expect the
influence to be)
• = dependent variable.
• The IV causes the DV; the DV is caused by the IV.
Kinds of tests
• T-test: 1 IV (2 groups), 1 DV
• ANOVA: 1 or more IVs (more than 2 groups), 1 DV
• MANOVA: 1 or more IVs, more than 1 DV (= GLM)
• But these are parametric tests: they presuppose
that your data are normally distributed, and
measured at least at interval level.
• How to know whether my data follow a normal
distribution?
The Kolmogorov-Smirnov test
• This test takes an ideal distribution
• And projects your distribution on it
• And then gauges whether the two differ
significantly from each other.
• ‘Significance’ here in the statistical sense!
• Meaning: the error probability < .05
• Or, in the table of the results: p < .05.
• In that case, your data are NOT normally
distributed!
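A minimal sketch of such a normality check with scipy (the data are invented; note that estimating the mean and SD from the same sample makes the KS test approximate, which is why the Shapiro-Wilk test is often preferred for small samples):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(loc=4.0, scale=1.44, size=100)   # invented scores

# Compare the data to a normal distribution with the sample's own mean and SD.
stat, p = stats.kstest(data, "norm", args=(data.mean(), data.std(ddof=1)))
print(stat, p)   # p < .05 would suggest the data are NOT normally distributed

print(stats.shapiro(data))   # alternative normality test
```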
What to do in that case?
• The parametric tests assume a number of things
(e.g. interval measurement, normal distribution,
etc.)
• When these assumptions are not fulfilled: use
non-parametric tests!
• 2 independent / dependent samples
• k [ > 2] independent / dependent samples
• Independent samples: no overlap between the
samples.
• Dependent samples: the same people.
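A sketch of one common non-parametric choice for 2 independent samples, the Mann-Whitney U test (the ratings are invented for illustration; assumes scipy):

```python
from scipy import stats

# Invented preference ratings (ordinal scale) from two independent groups.
sad_ending = [5, 4, 5, 3, 4, 5, 4, 2]
happy_ending = [3, 2, 4, 2, 3, 1, 2, 3]

stat, p = stats.mannwhitneyu(sad_ending, happy_ending, alternative="two-sided")
print(stat, p)

# For 2 dependent samples (the same people rated twice), stats.wilcoxon
# would be the usual non-parametric counterpart.
```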
One or two-tailed?
• Ha says only THAT there will be a difference
between 2 groups (without direction of the
difference) = two-tailed.
• WITH a direction (e.g.: E > C) = one-tailed.
• In case of one-tailed: divide p-values by 2.
• But note that this is a controversial issue among
statisticians.
• E.g.: if you know the direction of the hypothesis,
then why do you need to test for significance?
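The story-ending example from earlier can illustrate the difference. A sketch using scipy's exact binomial test (assumes scipy ≥ 1.7; note that this test counts all outcomes at least as extreme, so its one-tailed value of about .035 is slightly larger than the 0.031 quoted earlier for exactly 7 out of 8):

```python
from scipy import stats

# 7 of 8 informants preferred the sad ending; H0: each choice is a 50/50 coin flip.
two_tailed = stats.binomtest(7, n=8, p=0.5, alternative="two-sided")
one_tailed = stats.binomtest(7, n=8, p=0.5, alternative="greater")

print(two_tailed.pvalue)   # about 0.070
print(one_tailed.pvalue)   # about 0.035, roughly half the two-tailed value
```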
F-ratio
• F = (between-groups variance) / (within-groups variance)
• When H0 is true, F ≈ 1
• F > 1 means an effect
• Very high F-values mean very low p-values!
• I.e. very unlikely to be the result of chance!
• Hence accept Ha
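A short sketch of a one-way ANOVA, which reports exactly this F-ratio together with its p-value (the group scores are invented; assumes scipy):

```python
from scipy import stats

# Invented exam scores under three hypothetical teaching methods.
free = [12, 15, 14, 10, 13, 11]
guided = [16, 18, 15, 17, 19, 16]
mixed = [13, 14, 12, 15, 14, 13]

f_value, p_value = stats.f_oneway(free, guided, mixed)
print(f_value, p_value)   # a large F and a small p: unlikely to be chance alone
```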