Expected frequency (E i )

Using Statistics To Make
Inferences 8
Summary
Contingency tables.
Goodness of fit test.
Friday, 01 April 2016
11:32 PM
Goals
To assess contingency tables for
independence.
To perform and interpret a
goodness of fit test.
Practical
Construct and analyse contingency
tables.
Recall
To compare a population and a sample
variance, what did we employ? χ²
Today
The probability approach from last
week is employed to tell if “observed”
data conforms to the pattern
“expected” under a given model.
Categorical Data - Example
Assessed intelligence of athletic and
non-athletic schoolboys.
            bright   stupid   Total
athletic    581      567      1148
lazy        209      351      560
Total       790      918      1708
K. Pearson “On The Relationship Of Intelligence To Size And Shape
Of Head, And To Other Physical And Mental Characters”, Biometrika,
1906, 5, 105-146, data on page 144.
Procedure
1. Formulate a null hypothesis. Typically the
null hypothesis is that there is no association
between the factors.
2. Calculate expected frequencies for the cells
in the table on the assumption that the null
hypothesis is true.
3. Calculate the chi-squared statistic. This is
$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(\mathrm{observed}_{i,j} - \mathrm{expected}_{i,j})^2}{\mathrm{expected}_{i,j}}$
for an r x c table with entries in row i and
column j.
4. Compare the calculated statistic with
tabulated values of the chi-squared
distribution with ν degrees of freedom.
ν = (rows - 1)(columns - 1) = (r - 1)(c - 1)
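The four steps of the procedure can be sketched in a few lines of Python. This is only an illustration, not the module's own code: the function name `chi_squared_independence` is an assumption, and the counts are the schoolboy data from the earlier example.

```python
def chi_squared_independence(observed):
    """Return (statistic, degrees of freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Step 2: expected frequency if there is no association.
            expected = row_totals[i] * col_totals[j] / grand_total
            # Step 3: add this cell's (observed - expected)^2 / expected.
            statistic += (obs - expected) ** 2 / expected
    # Step 4: degrees of freedom, (r - 1)(c - 1).
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, dof


# Rows: athletic, lazy. Columns: bright, stupid.
stat, dof = chi_squared_independence([[581, 567], [209, 351]])
print(f"chi-squared = {stat:.3f} with {dof} degree(s) of freedom")
```

The statistic is then compared with the tabulated chi-squared value for the same degrees of freedom.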
Key Assumptions
1. Independence of the observations. The data
found in each cell of the contingency table used
in the chi-squared test must be independent
observations and non-correlated.
2. Large enough expected cell counts. As
described by Yates et al., "No more than 20%
of the expected counts are less than 5 and all
individual expected counts are 1 or greater"
(Yates, Moore & McCabe, 1999, The Practice of
Statistics, New York: W.H. Freeman p. 734).
3. Randomness of data. The data in the table
should be randomly selected.
4. Sufficient Sample Size. It is also generally
assumed that the sample size for the entire
contingency table is sufficiently large to
prevent accepting the null hypothesis
when the null hypothesis is false.
Example
Assessed intelligence of athletic
and non athletic schoolboys.
Observed
            bright   stupid   Total
athletic    581      567      1148
lazy        209      351      560
Total       790      918      1708
Probabilities
The probability a random boy is athletic is 1148 / 1708 = 0.6721
The probability a random boy is bright is 790 / 1708 = 0.4625
Assuming independence, the probability a random boy is both
athletic and bright is (1148 / 1708) × (790 / 1708) = 0.3109
For 1708 respondents the expected number of athletic
bright boys is 1148 × 790 / 1708 = 530.98
            bright   stupid   Total
athletic    581      567      1148
lazy        209      351      560
Total       790      918      1708
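The arithmetic on this slide can be checked with a short Python sketch. The variable names are illustrative assumptions; the numbers are the marginal totals from the observed table.

```python
n = 1708                      # all schoolboys
p_athletic = 1148 / n         # row total / grand total
p_bright = 790 / n            # column total / grand total

# Multiplying the two probabilities is only valid under independence,
# which is exactly the null hypothesis being tested.
p_both = p_athletic * p_bright
expected_athletic_bright = n * p_both

print(round(p_athletic, 4), round(p_bright, 4), round(p_both, 4))
print(round(expected_athletic_bright, 2))
```

Note that n × (1148 / n) × (790 / n) simplifies to 1148 × 790 / 1708, the row total times the column total over the grand total.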
Expected
The expected number of athletic bright boys is
1148 × 790 / 1708 = 530.98
The expected number of athletic stupid boys is
1148 - 530.98 = 617.02
The expected number of lazy bright boys is
790 - 530.98 = 259.02
The expected number of stupid lazy boys is
918 - 617.02 = 300.98
            bright   stupid   Total
athletic    530.98   617.02   1148
lazy        259.02   300.98   560
Total       790      918      1708
χ²
$\chi^2 = \sum \frac{(\mathrm{Observed} - \mathrm{Expected})^2}{\mathrm{Expected}}$
Observed                               Expected
            bright  stupid  Total                  bright  stupid  Total
athletic    581     567     1148       athletic    530.98  617.02  1148
lazy        209     351     560        lazy        259.02  300.98  560
Total       790     918     1708       Total       790     918     1708
$\chi^2_{calc} = \frac{(581 - 530.98)^2}{530.98} + \frac{(567 - 617.02)^2}{617.02} + \frac{(209 - 259.02)^2}{259.02} + \frac{(351 - 300.98)^2}{300.98} = 26.73$
$\nu = (r - 1)(c - 1) = 1$. Only one cell is free.
χ²
As a general rule, to employ this statistic
all expected frequencies should exceed 5.
If this is not the case, categories are pooled
(merged) to achieve this goal. See the Prussian
data later.
Conclusion
$\chi^2_{calc} = 26.73$, ν = 1
ν    p=0.1    p=0.05   p=0.025  p=0.01   p=0.005  p=0.002
1    2.706    3.841    5.024    6.635    7.879    9.550
$\chi^2_{1}(0.05) = 3.84$
The result is significant (26.73 > 3.84) at the 5% level.
So we reject the hypothesis of independence between
athletic prowess and intelligence.
We have observed more athletic and bright boys than
expected by independence.
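Instead of reading the table, the tail probability can be computed directly. The sketch below relies on the fact that, for one degree of freedom only, P(χ² > x) = erfc(√(x/2)); the function name `p_value_df1` is an assumption made for illustration.

```python
from math import erfc, sqrt


def p_value_df1(x):
    """Upper-tail probability of chi-squared with 1 degree of freedom."""
    # A chi-squared variable with 1 d.f. is the square of a standard normal.
    return erfc(sqrt(x / 2))


print(p_value_df1(3.841))   # close to 0.05, the tabulated critical value
print(p_value_df1(26.73))   # far below 0.001
```

This agrees with the tabulated value for ν = 1 and shows how far into the tail the calculated statistic lies.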
SPSS
Raw data
Note v1 are the row labels
v2 are the column labels
v3 is the frequency
for each cell
SPSS
Data > Weight Cases
Since frequency data has been input, it is
necessary to weight the cases by the frequency (v3).
This is essential; do not use percentages.
SPSS
Analyze > Descriptive Statistics > Crosstabs
Set row and
column
variables.
Frequencies
already set.
SPSS
Select
chi-square
SPSS
Select
Observed – input data
Expected – output data,
under the model
SPSS
Expected cell frequencies
V1 * V2 Crosstabulation
                                V2
V1                              bright   stupid   Total
athletic    Count               581      567      1148
            Expected Count      531.0    617.0    1148.0
lazy        Count               209      351      560
            Expected Count      259.0    301.0    560.0
Total       Count               790      918      1708
            Expected Count      790.0    918.0    1708.0
Expected under the model.
SPSS
Pearson Chi Square is the required statistic
Chi-Square Tests
                           Value      df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                           (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square         26.736(b)  1    .000
Continuity Correction(a)   26.204     1    .000
Likelihood Ratio           26.973     1    .000
Fisher's Exact Test                                      .000         .000
N of Valid Cases           1708
a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 259.02.
Do not report p = .000, rather p < .001
Note Fisher’s exact test is only available
in SPSS for 2x2 tables (see next slide).
What If We Have Small Cell Counts?
Fisher's exact test
The Fisher's exact test is used when you want to
conduct a chi-square test but one or more of your
cells has an expected frequency of five or less.
Remember that the chi-square test assumes that
each cell has an expected frequency of five or
more, but the Fisher's exact test has no such
assumption and can be used regardless of how small
the expected frequency is. In SPSS, unless you
have the SPSS Exact Test Module, you can only
perform a Fisher's exact test on a 2x2 table, and
these results are presented by default.
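For a 2x2 table the exact test can be worked by hand or with a short script. The following is a minimal sketch, not SPSS's implementation: the function name is an assumption, the 2x2 counts are hypothetical (they are not from the lecture), and the two-sided p value uses one common convention, summing the probabilities of every table no more likely than the one observed.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell
        # when all row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum over every table at least as unlikely as the observed one
    # (small relative tolerance guards against floating-point ties).
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))


# Hypothetical small-count table, for illustration only.
p = fisher_exact_two_sided(3, 1, 1, 3)
print(p)
```

No expected frequencies are needed, which is why the test remains usable when cell counts are small.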
Aside
Two dials were compared. A subject was asked
to read each dial many times, and the
experimenter recorded his errors. Altogether 7
subjects were tested. The data shows how many
errors each subject produced. Do the two
conditions differ at the 0.05 significance level
(give the appropriate p value)?
Observed data
Subject   Dial 1   Dial 2
1         36       29
2         31       35
3         31       34
4         29       35
5         32       34
6         25       35
7         26       30
What key word describes this data?
Aside
What tests are available for paired
data?
One sample t test
Sign test
Wilcoxon Signed Ranks Test
Aside
What tests are available for paired
data? What assumptions are made?
One sample t test: normality
Sign test: no assumption of normality
Wilcoxon Signed Ranks Test: resembles the Sign Test in scope, but it is much
more sensitive. In fact, for large numbers it is
almost as sensitive as the Student t-test
Aside
What tests are available for paired
data?
One sample t test
Wilcoxon Signed Ranks Test
Sign test
Sign test answers the question How Often?,
whereas other tests answer the question How Much?
One sample t test – mean
Wilcoxon Signed Ranks Test - median
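As an illustration of the "How Often?" idea, here is a sketch of a two-sided sign test on the dial data. It assumes the seven pairs read off the flattened table in the order subject, first dial, second dial; that pairing and the variable names are assumptions, not something the slides state explicitly.

```python
from math import comb

# Errors per subject on each dial (pairing assumed from the slide table).
dial_1 = [36, 31, 31, 29, 32, 25, 26]
dial_2 = [29, 35, 34, 35, 34, 35, 30]

# Only the sign of each paired difference is used; ties are dropped.
diffs = [x - y for x, y in zip(dial_1, dial_2) if x != y]
n = len(diffs)
n_pos = sum(d > 0 for d in diffs)

# Under the null hypothesis each sign is + or - with probability 1/2,
# so the count of + signs is Binomial(n, 0.5).
k = min(n_pos, n - n_pos)
p_two_sided = min(1.0, 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n)

print(n_pos, n, p_two_sided)
```

The test ignores how large each difference is, which is why the t test and the Wilcoxon test are more sensitive when their assumptions hold.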
Example
The table is based on case-records
of women employees in Royal
Ordnance factories during 1943-6.
The same test was carried out on
the left eye (columns) and right eye
(rows).
Stuart “The estimation and comparison of
strengths of association in contingency tables”,
Biometrika, 1953, 40, 105-110.
Observed
           Highest   Second   Third   Lowest   Total
Highest    1520      266      124     66       1976
Second     234       1512     432     78       2256
Third      117       362      1772    205      2456
Lowest     36        82       179     492      789
Total      1907      2222     2507    841      7477
Is there any obvious structure?
Expected
In general to find the expected frequency in a
particular cell the equation is
Row total x Column total / Grand total
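Applied to every cell at once, the rule can be sketched as follows. The variable names are illustrative assumptions; the counts are the observed eye-test table (rows: right eye, columns: left eye).

```python
observed = [
    [1520, 266, 124, 66],    # Highest
    [234, 1512, 432, 78],    # Second
    [117, 362, 1772, 205],   # Third
    [36, 82, 179, 492],      # Lowest
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Row total x Column total / Grand total, for every cell.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

for row in expected:
    print([round(e, 2) for e in row])
```

Each row and column of the expected table has the same total as the observed table, which is why the last entry in any row or column can also be found by subtraction.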
Expected
In general to find the expected frequency in a
particular cell the equation is
Row total x Column total / Grand total
So for highest right eye and highest left eye (the top left cell)
the equation becomes
1976 x 1907 / 7477 = 503.98
The other cells in the first three rows and columns are found
in the same way. The missing values are simply found by subtraction,
for example
1976 - 503.98 - 587.22 - 662.54 = 222.26
Similarly for the remaining cells.
           Highest   Second   Third    Lowest   Total
Highest    503.98    587.22   662.54   222.26   1976
Second     575.39    670.43   756.43   253.75   2256
Third      626.40    729.87   823.48   276.25   2456
Lowest     201.23    234.47   264.55   88.75    789
Total      1907      2222     2507     841      7477
Short Cut
Contributions to the χ² statistic,
$\frac{(\mathrm{observed} - \mathrm{expected})^2}{\mathrm{expected}}$
for the top left cell the contribution is
$\frac{(1520 - 503.98)^2}{503.98} = 2048.32$
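The cell-by-cell contributions can be tabulated with a short sketch like the one below. The names `contributions` and `total` are assumptions for illustration; the data are the observed eye-test counts.

```python
observed = [
    [1520, 266, 124, 66],
    [234, 1512, 432, 78],
    [117, 362, 1772, 205],
    [36, 82, 179, 492],
]
rows = [sum(r) for r in observed]
cols = [sum(c) for c in zip(*observed)]
n = sum(rows)

contributions = []
for i, row in enumerate(observed):
    line = []
    for j, obs in enumerate(row):
        exp_ij = rows[i] * cols[j] / n                # expected frequency
        line.append((obs - exp_ij) ** 2 / exp_ij)     # this cell's share
    contributions.append(line)

total = sum(sum(line) for line in contributions)
print(round(contributions[0][0], 2), round(total, 2))
```

Because every contribution is non-negative, a single large cell is already enough to exceed the critical value; that is the short cut.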
Conclusion
$\chi^2_{calc} = 2048.32$ (for the top left cell only), ν = (r - 1)(c - 1) = 9
Nine cells are free.
ν    p=0.1    p=0.05   p=0.025  p=0.01   p=0.005  p=0.002
9    14.684   16.919   19.023   21.666   23.589   26.056
$\chi^2_{9}(0.05) = 16.92$
The above statistic makes it very clear that
there is some relationship between the quality of
the right and left eyes.
Highest
Second
Third
Lowest
437.75
109.86
202.55 1056.38 139.14
121.73
Highest 2048.32 175.72
Second
Third
414.25
Lowest
135.67
Total
2
χ
185.41 1092.53
99.15
27.66
Total
18.38
1832.37
8097
8.47
47
Conclusion
$\chi^2_{calc} = 8096.87$ (for all cells), ν = (r - 1)(c - 1) = 9
Nine cells are free.
ν    p=0.1    p=0.05   p=0.025  p=0.01   p=0.005  p=0.002
9    14.684   16.919   19.023   21.666   23.589   26.056
$\chi^2_{9}(0.05) = 16.92$
The above statistic makes it very clear that
there is some relationship between the quality of
the right and left eyes.
SPSS
Raw data
SPSS
Expected cell frequencies
V1 * V2 Crosstabulation
                              V2
V1                            Highest   Lowest   Second   Third    Total
Highest   Count               1520      36       234      117      1907
          Expected Count      504.0     201.2    575.4    626.4    1907.0
Lowest    Count               66        492      78       205      841
          Expected Count      222.3     88.7     253.8    276.2    841.0
Second    Count               266       82       1512     362      2222
          Expected Count      587.2     234.5    670.4    729.9    2222.0
Third     Count               124       179      432      1772     2507
          Expected Count      662.5     264.5    756.4    823.5    2507.0
Total     Count               1976      789      2256     2456     7477
          Expected Count      1976.0    789.0    2256.0   2456.0   7477.0
SPSS
Pearson Chi Square is the required statistic
Chi-Square Tests
                      Value         df   Asymp. Sig. (2-sided)
Pearson Chi-Square    8096.877(a)   9    .000
Likelihood Ratio      6671.512      9    .000
N of Valid Cases      7477
a. 0 cells (.0%) have expected count less than 5. The
minimum expected count is 88.75.
Poisson Distribution
The Poisson distribution is a discrete probability distribution that
expresses the probability of a given number of events occurring in a
fixed interval of time and/or space if these events occur with a known
average rate and independently of the time since the last event. The
Poisson distribution can also be used for the number of events in other
specified intervals such as distance, area or volume.
Typical applications are to queues/arrivals.
The number of phone calls received per day.
The occurrence of accidents/industrial injuries.
More exotically, birth defects and the number of genetic mutations.
The occurrence of rare diseases.
Poisson Distribution
1. Discrete events which are independent.
2. Events occur at a fixed rate λ (lambda) per unit continuum.
Poisson Distribution
For x successes
$\mathrm{Prob}(x; \lambda) = \frac{\lambda^x e^{-\lambda}}{x!}$
e is approximately equal to 2.718
λ is the rate per unit continuum
the mean is λ
the variance is λ
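The formula and the two properties (mean λ, variance λ) can be checked numerically. This is a sketch only: the function name `poisson_pmf` and the rate λ = 2 are hypothetical choices made for illustration.

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """Probability of exactly x events when the rate is lam."""
    return lam ** x * exp(-lam) / factorial(x)


lam = 2.0  # hypothetical rate, for illustration only
# Terms beyond x = 59 are negligibly small for this rate.
probs = [poisson_pmf(x, lam) for x in range(60)]

total = sum(probs)
mean = sum(x * p for x, p in enumerate(probs))
variance = sum((x - mean) ** 2 * p for x, p in enumerate(probs))
print(total, mean, variance)
```

The probabilities sum to one, and both the mean and the variance come out equal to the chosen rate.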
Casio 83ES
exp or “e”
exp(1) = 2.7182818
exp(2) = 7.389056
Its inverse, on the same key, is ln, so
ln(2.7182818) = 1
ln(7.389056) = 2
Alternate applications
A similar approach may be employed
to test if simple models are plausible.
χ² Goodness of Fit Test
$\chi^2_{calc} = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$
The degrees of freedom are ν = m – n – 1,
where there are m frequencies left in the
problem, after pooling, and n parameters
have been fitted from the raw data.
For example…
Example
The number of Prussian army corps in
which soldiers died from the kicks of
a horse in a year.
Typical “industrial injury” data
Which distribution is appropriate?
Is the data discrete or continuous?
Discrete, since a simple count
Check list of distributions
Discrete     Continuous
Binomial     Normal
Poisson      Exponential
Check list of distribution parameters
Discrete          Continuous
Binomial: n, p    Normal: μ, σ²
Poisson: λ        Exponential: λ
Discrete, no “n” implies Poisson
Observed Data
Number of deaths in a corps   Observed frequency (Oi)
0                             144
1                             91
2                             32
3                             11
4                             2
5 or more                     0
Total                         280
We need to estimate the Poisson parameter λ,
which is the mean of the distribution.
Mean
Mean = (0 × 144 + 1 × 91 + 2 × 32 + 3 × 11 + 4 × 2) / (144 + 91 + 32 + 11 + 2) = 0.7
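The mean of a frequency table is a weighted average, which might be sketched as below. The variable names are assumptions; the counts are the observed frequencies for 0 to 4 deaths (the "5 or more" row is empty).

```python
deaths = [0, 1, 2, 3, 4]
observed = [144, 91, 32, 11, 2]

# Weighted mean: each number of deaths is weighted by how often it occurred.
lam_hat = sum(k * o for k, o in zip(deaths, observed)) / sum(observed)
print(lam_hat)
```

This estimate of λ is the one parameter fitted from the raw data, which matters later when counting degrees of freedom.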
Expected
λ = 0.7 and “e” is a constant on your calculator
Number of deaths in a corps   Poisson model                   Expected probability
0                             $e^{-\lambda}$                  0.4966
1                             $\lambda e^{-\lambda}$          0.3476
2                             $\lambda^2 e^{-\lambda} / 2!$   0.1217
3                             $\lambda^3 e^{-\lambda} / 3!$   0.0284
4                             $\lambda^4 e^{-\lambda} / 4!$   0.0050
5 or more                     By subtraction                  0.0008
Total                                                         1
Expected Frequency
Expected frequency for no deaths 280 x 0.4966 = 139.04
Expected frequency for remaining rows
280 × probability = frequency
Number of deaths in a corps   Expected probability   Expected frequency (Ei)
0                             0.4966                 139.04
1                             0.3476                 97.33
2                             0.1217                 34.07
3                             0.0284                 7.95
4                             0.0050                 1.39
5 or more                     0.0008                 0.22
Total                         1                      280
Note the two expected frequencies less than 5!
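The expected probabilities and frequencies can be reproduced with the sketch below, assuming λ = 0.7 and 280 observations as on the slides. The variable names are illustrative assumptions.

```python
from math import exp, factorial

lam = 0.7        # estimated mean number of deaths
n_total = 280    # total number of observations

# Poisson probabilities for 0, 1, 2, 3 and 4 deaths.
probs = [lam ** k * exp(-lam) / factorial(k) for k in range(5)]
# "5 or more" is found by subtraction so the probabilities sum to 1.
probs.append(1 - sum(probs))

expected = [n_total * p for p in probs]
small = [e for e in expected if e < 5]

print([round(p, 4) for p in probs])
print([round(e, 2) for e in expected])
```

The `small` list holds the two expected frequencies below 5 noted on the slide.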
χ² Calculation
Pool to ensure all expected frequencies exceed 5
Number of deaths in a corps   Observed frequency (Oi)   Expected frequency (Ei)   (Oi - Ei)² / Ei
0                             144                       139.04                    0.18
1                             91                        97.33                     0.41
2                             32                        34.07                     0.13
3 or more                     13                        9.56                      1.24
Total                         280                       280                       1.95
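Pooling and the final statistic might be sketched as follows. This is a simplified illustration that only merges categories from the upper tail, which is all this data set needs; the variable names are assumptions.

```python
from math import exp, factorial

observed = [144, 91, 32, 11, 2, 0]   # 0, 1, 2, 3, 4, "5 or more" deaths
lam = 0.7
n_total = sum(observed)

probs = [lam ** k * exp(-lam) / factorial(k) for k in range(5)]
probs.append(1 - sum(probs))
expected = [n_total * p for p in probs]

# Merge the last category into its neighbour until the last
# expected frequency is at least 5.
pooled_obs, pooled_exp = observed[:], expected[:]
while len(pooled_exp) > 1 and pooled_exp[-1] < 5:
    last_exp = pooled_exp.pop()
    last_obs = pooled_obs.pop()
    pooled_exp[-1] += last_exp
    pooled_obs[-1] += last_obs

chi_sq = sum((o - e) ** 2 / e for o, e in zip(pooled_obs, pooled_exp))
print(pooled_obs, [round(e, 2) for e in pooled_exp], round(chi_sq, 2))
```

After pooling, four frequencies remain, with the last one covering "3 or more" deaths.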
Conclusion
Here m (frequencies) = 4,
n (fitted parameters) = 1
then ν = m – n – 1 = 4 – 1 – 1 = 2
ν    p=0.1    p=0.05   p=0.025  p=0.01   p=0.005  p=0.002
2    4.605    5.991    7.378    9.210    10.597   12.429
$\chi^2_{2}(0.05) = 5.991$
$\chi^2_{calc} = 1.95$
The hypothesis that the data come from a Poisson
distribution is not rejected (1.95 < 5.991).
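The degrees of freedom and the tail probability can be checked with a few lines. The sketch uses the fact that, for two degrees of freedom only, P(χ² > x) = exp(-x/2); the variable names are assumptions.

```python
from math import exp

m = 4          # frequencies left after pooling
n_fitted = 1   # parameters estimated from the data (lambda)
dof = m - n_fitted - 1

chi_sq_calc = 1.95
# Valid for 2 degrees of freedom only.
p_value = exp(-chi_sq_calc / 2)
p_at_critical = exp(-5.991 / 2)   # should be close to 0.05

print(dof, round(p_value, 3), round(p_at_critical, 3))
```

A p value well above 0.05 is consistent with the slide's conclusion that the Poisson model fits.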
Next Week
Bring your calculators next week
Read
Read Howitt and Cramer pages 134-152
Read Howitt and Cramer (e-text) pages
125-134
Read Russo (e-text) pages 100-119
Read Davis and Smith pages 434-448
Practical 8
This material is available from the
module web page.
http://www.staff.ncl.ac.uk/mike.cox
Module Web Page
Practical 8
This material for the practical is
available.
Instructions for the practical
Practical 8
Material for the practical
Practical 8
Assignment 2
You will find submission details on the
module web site
Note the dialers lower down the
page give access to your individual
assignment. It is necessary to enter
your student number exactly as it
appears on your smart card.
Assignment 2
As a general rule make sure you can
perform the calculations manually.
It does no harm to check your
calculations using a software package.
Some software packages employ non-standard
definitions and should be used with
caution.
Assignment 2
All submissions must be typed.
Whoops!
Researchers at Cardiff University School of
Social Science claim errors made by the Hawk-Eye
line-calling technology can be greater
than 3.6mm - the average error quoted by the
manufacturers.
Teletext, p388
12 June 2008
Whoops!
Kate Middleton 'marries Prince Harry' on souvenir mug
The Telegraph - Thursday 17 March 2011
Whoops!
Poldark - BBC - 8 March 2015