Statistical Analysis & Design in Research
Categorical Data Analysis
PGRM 14
Statistics
in
Science
What is categorical data?
The measurement scale for the response consists of a number of categories.
Variable        Measurement Scale
Farm system     Dairy, Beef, Tillage etc.
Mortality       Dead, alive
Food texture    Very soft, Soft, Hard, Very hard
Litter size     0, 1, 2, 3 and >3
Data Analysis considered:
• Response variable(s) is categorical
• Explanatory variable(s) may be categorical or continuous
Example: Does post-operative survival (categorical response) depend on the explanatory variables?
Sex (categorical)
Age (continuous)
Example: In a random sample of Irish farmers, is there a relationship between attitudes to the EU and farm system?
Farm system (categorical)
Attitude to EU (categorical/ordinal)
(Two response variables - no explanatory variables)
Could one of these be regarded as explanatory?
Measurement scales for categorical data
Nominal - no underlying order
Variable        Measurement Scale
Farm system     Dairy, Beef, Tillage etc.
Weed Species    Stellaria media, Poa annua, etc.
Ordinal - underlying order in the scale
Variable             Measurement Scale
Food texture         Very soft, Soft, Hard, Very hard
Disease diagnosis    Very likely, Likely, Unlikely
Education            Primary, Secondary, Tertiary
Interval - underlying numerical distance between scale points
Variable       Measurement Scale
Litter size    0, 1, 2, 3 and >3
Age class      <1, 1-2, 2-3.5, 3.5-5, >5
Education      years in education
Tables reporting categorical data: 1-, 2- & 3-way
Tables reporting count data: single level
Example:
A geneticist carries out a crossing experiment between F1 hybrids of a wild type and a mutant genotype and obtains an F2 progeny of 90 offspring with the following characteristics.
Wild Type    Mutant    Total
80           10        90
Is there evidence that the wild type is dominant, giving on average an 8:1 offspring phenotype ratio in its favour?
Tables for count data: two-way
Example:
A sample of 124 mice was divided into two groups, 84 receiving a standard dose of pathogenic bacteria followed by an antiserum and a control group of 40 not receiving the antiserum. After 3 weeks the numbers dead and alive in each group were counted.
             Outcome
             Dead    Alive    Total    % dead
antiserum    19      65       84       23
control      18      22       40       45
Total        37      87       124
Association between mortality and treatment?
Tables for count data: two-way
Example (Snedecor & Cochran):
The table below shows the number of aphids alive and dead after spraying with four concentrations of solutions of sodium oleate.
Concentration of sodium oleate (%)
          0.65    1.10    1.6     2.1     Total
Dead      55      62      100     72      289
Alive     22      13      12      5       52
Total     77      75      112     77      341
% Dead    71.4    82.7    89.3    93.5    84.8
• Has the higher concentration given a significantly different percentage kill?
• Is there a relationship between concentration and mortality?
Is this the relationship?
Note: categorical response, interval categorical explanatory variable
Tables for count data: two-way
Example (Cornfield 1962)
Blood pressure (BP) was measured on a sample of males aged 40-59, who were also classified by whether they developed coronary heart disease (CHD) in a 6-year follow-up period.
BP: interval categorical variable in 8 classes
CHD: CHD or No-CHD
BP           CHD    No CHD    Total    % CHD
<117         3      153       156      1.9
117 - 126    17     235       252      6.7
127 - 136    12     272       284      4.2
137 - 146    16     255       271      5.9
147 - 156    12     127       139      8.6
157 - 166    8      77        85       9.4
167 - 186    16     83        99       16.2
>186         8      35        43       18.6
Total        92     1237      1329
1. Is the incidence of CHD independent of BP?
2. Is there a simple relationship between the probability of CHD and the level of BP?
Figure: CHD v BP relationship
3-way table
Example: Grouped binomial (response has 2 categories) data - patterns of psychotropic drug consumption in a sample from West London (Murray et al 1981, Psy Med 11, 551-60)
Sex    Age Group    Psych. case    On drugs    Total
M      1            No             9           531
M      2            No             16          500
M      3            No             38          644
M      4            No             26          275
M      5            No             9           90
M      1            Yes            12          171
M      2            Yes            16          125
M      3            Yes            31          121
M      4            Yes            16          56
M      5            Yes            10          26
F      1            No             12          588
F      2            No             42          596
F      3            No             96          765
F      4            No             52          327
F      5            No             30          179
F      1            Yes            33          210
F      2            Yes            47          189
F      3            Yes            71          242
F      4            Yes            45          98
F      5            Yes            21          60
Non-tabulated data
Example: Individual Legousia plants were monitored in an experiment to see whether they survived after 3 months.
Survived - yes is scored 1
Survived - no is scored 0
Also recorded were:
CO2 treatment – 2 levels, low and high
Density of Legousia
Density of companion species
Height of the plant (mm) two weeks after planting
Most individuals will have a unique profile in these three additional variables and so tabulation of the data by them is not feasible. The individual data is presented.
Non-tabulated data
                                Density
Subject    Surv    CO2    Ht    Leg.    Comp
1          0       L      35    20      30
2          1       L      68    22      27
3          1       H      43    16      33
4          0       L      27    4       16
…          …       …      …     …       …
…          …       …      …     …       …
Response: Surv
1. Is survival related to the explanatory variables: CO2, Height, density-self, density-companions?
2. Can the probability of survival be predicted from the subject’s profile?
Fixed and non-fixed margins
• One margin fixed: Samples of fixed size are selected for one or more categories and individuals are classified by the other category(s).
• No margin fixed: Individuals in a single sample are simultaneously classified by several categorical variables.
The difference between these depends on the experimental design and how it specified that the data should be collected.
The method of analysis is the same.
Asking the right question
• Data summarized by counts
• Questions usually relate to %s (equivalently proportions)
Hypotheses for Categorical Data
• Categorical data is summarised by counting individuals falling into the various combinations of categories
• Hypotheses relate to the probability of an individual being in a particular category
• These probabilities are estimated by the observed proportions in the data
• Using a sample proportion, p, from a sample of size n, to estimate a population proportion, the standard error is
SE = √(p(1 – p)/n)
eg with p = 0.5, n = 1100, 2×SE = 0.03: the often mentioned 3% margin of error
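The margin-of-error arithmetic can be checked with a few lines of Python. This is only a sketch: the function name standard_error is an illustrative choice and is not part of the course material.

```python
import math

def standard_error(p: float, n: int) -> float:
    """Standard error of a sample proportion p from a sample of size n."""
    return math.sqrt(p * (1.0 - p) / n)

# Values quoted on the slide: p = 0.5, n = 1100
se = standard_error(0.5, 1100)
margin = 2 * se  # approximate 95% margin of error

print(f"SE = {se:.4f}, 2 x SE = {margin:.3f}")
```

With p = 0.5 and n = 1100 this gives 2×SE of about 0.03, the 3% figure quoted above.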
Example
             Outcome
             Dead    Alive    Total    % dead
antiserum    19      65       84       23
control      18      22       40       45
Total        37      87       124
Does % dead depend on antiserum?
Equivalently:
1. Is there an association between mortality and antiserum?
2. Is mortality independent of antiserum?
Example
             Outcome
             Dead    Alive    Total    % dead
antiserum    19      65       84       23
control      18      22       40       45
Total        37      87       124
• As usual we set up a null hypothesis and measure the extent to which the data conflicts with this
• Here H0: prob of death for antiserum = prob of death for control
• Equivalently H0:
– no association between mortality and antiserum
– mortality and antiserum are independent
Example
             Outcome
             Dead    Alive    Total    % dead
antiserum    19      65       84       23
control      18      22       40       45
Total        37      87       124
Expected counts when H0 is true:
The overall % dead (37/124) would apply to antiserum & control
For the 84 antiserum this would give (84×37)/124 dead and (84×87)/124 alive
For the 40 control this would give (40×37)/124 dead and (40×87)/124 alive
E = (row total)(column total)/(table total)
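As a rough illustration of the rule E = (row total)(column total)/(table total), the sketch below computes the expected counts for the mice data in Python. The dictionary layout and variable names are assumptions made for this example only.

```python
# Observed counts from the antiserum example
observed = {
    "antiserum": {"dead": 19, "alive": 65},
    "control": {"dead": 18, "alive": 22},
}

# Row totals (one per group) and column totals (one per outcome)
row_totals = {group: sum(cells.values()) for group, cells in observed.items()}
col_totals = {}
for cells in observed.values():
    for outcome, count in cells.items():
        col_totals[outcome] = col_totals.get(outcome, 0) + count
table_total = sum(row_totals.values())

# E = (row total)(column total)/(table total) for every cell
expected = {
    group: {
        outcome: row_totals[group] * col_totals[outcome] / table_total
        for outcome in col_totals
    }
    for group in observed
}

for group, cells in expected.items():
    print(group, {outcome: round(e, 3) for outcome, e in cells.items()})
```

The expected counts keep the same row and column totals as the observed table, which is a quick check on the arithmetic.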
Observed and expected counts
Observed
             Outcome
             Dead    Alive    Total    % dead
antiserum    19      65       84       23
control      18      22       40       45
Total        37      87       124
Expected
             Outcome
             Dead    Alive    Total    % dead
antiserum    25.1    58.9     84       29.9
control      11.9    28.1     40       29.8
Total        37      87       124
Note: some rounding error
Chi-squared statistic: X²
• X² measures the difference between observed counts, O, and expected (when H0 holds) counts, E
• X² = ∑(O – E)²/E
• If LARGE, X² provides evidence against H0, ie evidence for an association (dependence) of mortality on antiserum.
• Here SAS/FREQ gives:
X² = 6.48
p = Prob(X² > 6.48 when H0 is true) = 0.0109
• Conclusion: there is evidence (p < 0.05) that mortality depends on antiserum
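The X² calculation for the mice table can be sketched in Python as a cross-check on the SAS/FREQ figures. The helper names are illustrative. The p-value uses the identity P(χ² with 1 DF > x) = erfc(√(x/2)), which holds for 1 DF only.

```python
import math

def pearson_chi2(table):
    """Return (X2, df) for a table of observed counts given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / total  # expected count under H0
            x2 += (obs - exp) ** 2 / exp
    df = (len(table) - 1) * (len(col_totals) - 1)
    return x2, df

def chi2_pvalue_1df(x2):
    """Upper-tail probability of a chi-squared variable with 1 DF."""
    return math.erfc(math.sqrt(x2 / 2.0))

# Rows: antiserum, control; columns: dead, alive
mice = [[19, 65], [18, 22]]
x2, df = pearson_chi2(mice)
p = chi2_pvalue_1df(x2)
print(f"X2 = {x2:.4f}, df = {df}, p = {p:.4f}")
```

The result should agree with the SAS/FREQ values of X² = 6.48 and p = 0.0109 to the precision shown.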
Practical Exercise
Use Excel to calculate X² and p
Lab Session 5 exercise 5.1 (a)
SAS/FREQ OUTPUT
Description of cell contents:
O = Frequency
E = Expected
X² = ∑(O – E)²/E
Row Percents make most sense here (% alive/dead in each antiserum group)
Table of antiserum by dead
antiserum     dead
Frequency
Expected
Row Pct       0         1         Total
antiserum     65        19        84
              58.935    25.065
              77.38     22.62
control       22        18        40
              28.065    11.935
              55.00     45.00
Total         87        37        124
SAS/FREQ OUTPUT
Slide annotations:
DF = (r–1)×(c–1)
X² = ∑(O – E)²/E
Ignore!
Statistic                      DF    Value     Prob
Chi-Square                     1     6.4833    0.0109
Likelihood Ratio Chi-Square    1     6.2846    0.0122
Continuity Adj. Chi-Square     1     5.4583    0.0195
Mantel-Haenszel Chi-Square     1     6.4310    0.0112
Phi Coefficient                      0.2287
Contingency Coefficient              0.2229
Cramer's V                           0.2287
P = 0.01 with X² = 6.48 (the SAS/FREQ output gives Prob = 0.0109)
Figure: χ² distribution with 1 DF, with X² = 6.48 marked and tail areas labelled 0.05 and 0.001; 68% of values are < 1 (not shown)
Aphid example (SAS/FREQ OUTPUT)
Table of status by conc
status (Outcome) by conc (Sodium oleate concentration (%))
Cell contents: Frequency, Expected, Cell Chi-Square, Col Pct
                     0.65      1.1       1.6       2.1       Total
Alive
  Frequency          22        13        12        5         52
  Expected           11.742    11.437    17.079    11.742
  Cell Chi-Square    8.9617    0.2136    1.5105    3.8711
  Col Pct            28.57     17.33     10.71     6.49
Dead
  Frequency          55        62        100       72        289
  Expected           65.258    63.563    94.921    65.258
  Cell Chi-Square    1.6125    0.0384    0.2718    0.6965
  Col Pct            71.43     82.67     89.29     93.51
Total                77        75        112       77        341
X² = 17.18
p = 0.0007 (3 df)
Note the largest contributions (O – E)²/E to X² (8.96 & 3.87) are in the top corners
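The cell contributions in the aphid output can be reproduced with a short Python sketch. The variable names are illustrative, and the p-value uses a closed form that is valid for 3 DF only, chosen here to avoid a statistics library.

```python
import math

# Counts from the aphid table (columns are concentrations 0.65, 1.10, 1.6, 2.1)
alive = [22, 13, 12, 5]
dead = [55, 62, 100, 72]
observed = [alive, dead]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# E = (row total)(column total)/(table total), then (O - E)^2 / E for each cell
expected = [[r * c / total for c in col_totals] for r in row_totals]
cell_chi2 = [
    [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]
x2 = sum(sum(row) for row in cell_chi2)
df = (len(observed) - 1) * (len(col_totals) - 1)

def chi2_upper_tail_3df(x):
    """P(chi-squared with 3 DF > x); closed form valid for 3 DF only."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

p = chi2_upper_tail_3df(x2)
for label, row in zip(("Alive", "Dead"), cell_chi2):
    print(label, [round(v, 4) for v in row])
print(f"X2 = {x2:.2f}, df = {df}, p = {p:.4f}")
```

Printing the contributions cell by cell makes it easy to see that the first and last concentrations in the Alive row dominate X².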
Locating the concentration effect
Table of Outcome by Sodium oleate (%): 0.65 v 1.1
Cell contents: Frequency, Col Pct
         0.65     1.1      Total
Alive    22       13       35
         28.57    17.33
Dead     55       62       117
         71.43    82.67
Total    77       75       152
X² = 2.71
p = 0.10
Table of Outcome by Sodium oleate (%): 1.6 v 2.1
Cell contents: Frequency, Col Pct
         1.6      2.1      Total
Alive    12       5        17
         10.71    6.49
Dead     100      72       172
         89.29    93.51
Total    112      77       189
X² = 0.99
p = 0.32
Locating the concentration effect
Table of Outcome by Sodium oleate (%)
Cell contents: Frequency, Col Pct
         <1.5%    >1.5%    Total
Alive    35       17       52
         23.03    8.99
Dead     117      172      289
         76.97    91.01
Total    152      189      341
X² = 12.83
p = 0.0003
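The three 2×2 comparisons used to locate the concentration effect can be recomputed with a small Python sketch. The helper and table names are assumptions made for this illustration.

```python
def pearson_chi2(table):
    """Pearson X2 for a table of observed counts given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return sum(
        (obs - row_totals[i] * col_totals[j] / total) ** 2
        / (row_totals[i] * col_totals[j] / total)
        for i, row in enumerate(table)
        for j, obs in enumerate(row)
    )

# Rows: alive, dead
low = [[22, 13], [55, 62]]       # 0.65 v 1.1
high = [[12, 5], [100, 72]]      # 1.6 v 2.1
pooled = [[35, 17], [117, 172]]  # <1.5% v >1.5%

x2_low = pearson_chi2(low)
x2_high = pearson_chi2(high)
x2_pooled = pearson_chi2(pooled)
print(f"within low: {x2_low:.2f}, within high: {x2_high:.2f}, low v high: {x2_pooled:.2f}")
```

The three components add to roughly 16.5, close to but not exactly the overall X² of 17.18, because Pearson's X² does not partition exactly.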
SAS – data format for FREQ procedure
Concentration of sodium oleate (%)
          0.65    1.10    1.6     2.1     Total
Dead      55      62      100     72      289
Alive     22      13      12      5       52
Total     77      75      112     77      341
% Dead    71.4    82.7    89.3    93.5    84.8
In the SAS data set, 2 cols identify the cell. The final column is the ‘response’ – the frequency count for the cell.
Conc    status    number
0.65    d         55
0.65    a         22
1.10    d         62
1.10    a         13
1.60    d         100
1.60    a         12
2.10    d         72
2.10    a         5
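The reshaping from the two-way table to one row per cell can be sketched as below. This is an illustration in Python rather than SAS; the names wide and long_rows are hypothetical.

```python
# Two-way table keyed by concentration, with counts for dead (d) and alive (a)
wide = {
    0.65: {"d": 55, "a": 22},
    1.10: {"d": 62, "a": 13},
    1.60: {"d": 100, "a": 12},
    2.10: {"d": 72, "a": 5},
}

# One row per cell: two columns identify the cell, the third is its frequency
long_rows = [
    (conc, status, number)
    for conc, cells in wide.items()
    for status, number in cells.items()
]

print("Conc status number")
for conc, status, number in long_rows:
    print(f"{conc:.2f} {status} {number}")
```

Each of the 8 cells becomes one data row, and the frequencies still add to the table total of 341.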
Validity of chi-squared (χ²) test
• Test is based on an approximation leading to use of the χ² distribution to calculate p-values
• With several DF and E ≥ 5 the approximation is ok
• If E < 1 in any cell the approximation may be bad
• With a number of cells in the table, perhaps a third or a quarter can have E between 1 & 5 without serious departures from χ² based p-values. (PGRM pg 14-11)
• In cases where good approximation is in doubt use Fisher’s exact test (SAS/FREQ tables option exact)
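For a 2×2 table, both the smallest expected count and a Fisher exact p-value can be sketched in plain Python. This is a minimal illustration, assuming the common two-sided convention of summing the probabilities of all tables no more likely than the observed one. It has not been checked against the SAS output, and the function names are hypothetical.

```python
from math import comb

def expected_counts(table):
    """Expected counts under independence for a list-of-rows table."""
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more probable than the observed one
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

# Rows: antiserum, control; columns: dead, alive
mice = [[19, 65], [18, 22]]
min_expected = min(min(row) for row in expected_counts(mice))
p_exact = fisher_exact_two_sided(19, 65, 18, 22)
print(f"smallest E = {min_expected:.3f}, Fisher exact p = {p_exact:.4f}")
```

For the mice data every expected count is well above 5, so the χ² approximation is not in doubt and the exact test is shown only for comparison.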
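Code: SAS/FREQ
/* Chi-squared test for the status by conc table; data set has one row per cell */
proc freq data = conc;
  weight number;               /* number holds the frequency count for the cell */
  tables status*conc
    / chisq cellchi2 expected  /* tests, cell contributions, expected counts */
      norow nopercent nocum;   /* omit row %, overall %, cumulative frequencies */
run;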
Option              To Do
chisq               Test statistics (chi-squared etc)
cellchi2            Contribution to X² from each cell
expected            Expected values for each cell
norow, nopercent    Omit row/overall percentages
nocum               Omit cumulative frequencies
Practical Exercise
SAS/FREQ procedure
Lab Session 5 exercise 5.1 (b) – (d)
Logistic Regression
Is this the relationship?
Note: categorical response, interval categorical explanatory variable
Why logistic and not just χ²?
• For sparse data (eg where individuals will have unique profiles)
• With many categorical explanatory variables
• With quantitative explanatory variables
In the case of a continuous response we have looked to see if the mean, μ, can be expressed as
μ = a + bx
With categorical data we want an expression for p (the probability of the response in one of the 2 response categories) but
p = a + bx
may give values outside the range 0 to 1!
eg p = 0.1 + 0.2x gives p = 1.1 for x = 5
A solution: TRANSFORM
• Use the transformation:
p = exp(a + bx)/(1 + exp(a + bx))
• i.e. log(p/(1 – p)) = a + bx
log(Odds) = a + bx
where Odds = p/(1 – p)
LOGIT:
logit(p) = log(p/(1 – p))
Note: exp(x) = e^x
Plot is for: a = 0, b = 1
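A short Python sketch shows why the transformation helps: the straight line can leave the range 0 to 1, while the logistic form cannot, and logit undoes it. The function names are illustrative only.

```python
import math

def linear_p(x, a, b):
    """Straight-line model for p; not restricted to the range 0 to 1."""
    return a + b * x

def logistic_p(x, a, b):
    """p = exp(a + bx)/(1 + exp(a + bx)), always strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))

def logit(p):
    """log(p/(1 - p)), the log odds."""
    return math.log(p / (1.0 - p))

# The slide's example: p = 0.1 + 0.2x at x = 5
print("linear:", round(linear_p(5, 0.1, 0.2), 3))
print("logistic:", round(logistic_p(5, 0.1, 0.2), 3))

# logit recovers the linear predictor a + bx (here a = 0, b = 1, x = 2)
print("logit of logistic:", round(logit(logistic_p(2, 0, 1)), 6))
```

The last line illustrates that logit(p) = a + bx is simply the transformation read in reverse.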
SAS/GPLOT
logit(p) = −0.119 + 1.25 conc
Figure: Logistic Estimate of Death Probability, p (0.6 to 1.0) against Sodium oleate (%) (0.6 to 2.2)
LD50 – lethal dose for 50%
p = 0.5
p/(1 – p) = 1
logit(p) = 0 (since log(1) = 0, WNF!)
0 = −0.119 + 1.25 conc
conc = 0.119/1.25 = 0.095
Odds Ratio (OR)
Increasing conc by 1 (ie by 1% sodium oleate) increases logit(p) by 1.25
log(Odds2) – log(Odds1) = 1.25
log(OR) = 1.25 (since log(a) – log(b) = log(a/b))
OR = exp(1.25) = 3.49
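The LD50 and odds ratio follow directly from the fitted coefficients, as this Python sketch shows. It takes a = −0.119 and b = 1.25 from the fitted equation; the variable names are illustrative.

```python
import math

# Fitted logistic model: logit(p) = a + b * conc
a, b = -0.119, 1.25

# LD50: the concentration at which logit(p) = 0, ie p = 0.5
ld50 = -a / b

# Odds ratio for a one-unit increase in conc
odds_ratio = math.exp(b)

# Check that the predicted probability at the LD50 really is 0.5
p_at_ld50 = 1.0 / (1.0 + math.exp(-(a + b * ld50)))

print(f"LD50 = {ld50:.3f}, OR = {odds_ratio:.2f}, p at LD50 = {p_at_ld50:.2f}")
```

Note that the estimated LD50 of about 0.095 lies well below the lowest concentration tested (0.65), so it is an extrapolation.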
SAS/GENMOD
conc    dead    total
0.65    53      77
1.10    57      75
1.60    95      112
2.10    73      77
proc genmod data = log;
  model dead/total = conc /
    pred
    link = logit
    dist = binomial;
  output
    out = p
    predicted = p;
run;
Term               Function
dead/total         the proportion to be estimated
conc               the explanatory variable
pred               include predicted p’s in OUTPUT
link = logit       for modelling log(p/(1-p)), the log(ODDS)
dist = binomial    the data consists of counts out of a total
out = p            output will also go to a data set work.p
predicted = p      in work.p a column named p will contain predicted values
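To see roughly what the fitting step does, the sketch below fits logit(p) = a + b conc to the dead/total counts listed for GENMOD. It uses a plain Newton-Raphson iteration in Python. This is an illustration of maximum likelihood for grouped binomial data, not a description of the SAS internals, and the function name is hypothetical.

```python
import math

# Grouped binomial data as listed for the GENMOD run
conc = [0.65, 1.10, 1.60, 2.10]
dead = [53, 57, 95, 73]
total = [77, 75, 112, 77]

def fit_logistic(x, y, n, iterations=25):
    """Maximum-likelihood fit of logit(p) = a + b*x by Newton-Raphson."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        p = [1.0 / (1.0 + math.exp(-(a + b * xi))) for xi in x]
        # Score vector: gradient of the binomial log-likelihood
        g0 = sum(yi - ni * pi for yi, ni, pi in zip(y, n, p))
        g1 = sum(xi * (yi - ni * pi) for xi, yi, ni, pi in zip(x, y, n, p))
        # Information matrix entries, with weights n * p * (1 - p)
        w = [ni * pi * (1.0 - pi) for ni, pi in zip(n, p)]
        i00 = sum(w)
        i01 = sum(wi * xi for wi, xi in zip(w, x))
        i11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = i00 * i11 - i01 * i01
        # Newton step: solve the 2x2 system (information) * step = score
        a += (i11 * g0 - i01 * g1) / det
        b += (i00 * g1 - i01 * g0) / det
    return a, b

a_hat, b_hat = fit_logistic(conc, dead, total)
predicted = [1.0 / (1.0 + math.exp(-(a_hat + b_hat * c))) for c in conc]
print(f"logit(p) = {a_hat:.3f} + {b_hat:.3f} conc")
print([round(p, 3) for p in predicted])
```

The estimates should land close to the fitted equation logit(p) = −0.119 + 1.25 conc, and the list of predicted values plays the role of the column p in work.p.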
Practical Exercise
SAS/GENMOD of Logistic Regression
Lab Session 5 exercise 5.2 (a) – (g)
Modelling needs biological insight!
Stability analysis (Ex 2 pg 14-15)
Height, diameter and whether they fell over were recorded for 545 plants.
Aim: model the probability of stability (not falling over) as a function of height and diameter.
diameter    height    stable    n
.0016       0.057     1         1
.0018       0.084     0         1
.0018       0.221     0         1
.0018       0.038     1         1
.0019       0.058     1         1
.0019       0.067     1         1
…           …         …         …
Explanatory terms
Model 1: h, d, h², d², hd
hopefully high order terms will not be needed!
Model 2: h/d²
biologist suggests this!
Model 1: h, d, h², d², hd
Analysis Of Parameter Estimates
                               Standard    Wald 95%
Parameter    DF    Estimate    Error       Confidence Limits       Chi-Square    Pr > ChiSq
Intercept    1     -5.3801     0.9402      -7.2228     -3.5374     32.75         <.0001
height       1     -39.1639    4.1510      -47.2998    -31.0280    89.01         <.0001
diameter     1     4958.358    654.0395    3676.464    6240.252    57.47         <.0001
h2           1     10.0396     5.0747      0.0934      19.9859     3.91          0.0479
d2           1     -560913     120280.4    -796659     -325168     21.75         <.0001
hd           1     4206.787    1502.453    1262.033    7151.540    7.84          0.0051
Scale        0     1.0000      0.0000      1.0000      1.0000
How can I describe this!
Model 2: h/d²
Analysis Of Parameter Estimates
                               Standard    Wald 95%
Parameter    DF    Estimate    Error       Confidence Limits     Chi-Square    Pr > ChiSq
Intercept    1     3.3235      0.3212      2.6940     3.9529     107.09        <.0001
h_d2         1     -1.7884     0.1583      -2.0987    -1.4780    127.56        <.0001
Scale        0     1.0000      0.0000      1.0000     1.0000
Can understand & even plot this!
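The Chi-Square and confidence limit columns in the Model 2 output are related to the estimate and standard error in a simple way, sketched below in Python. This assumes the usual Wald definitions: (estimate/SE)² and estimate ± 1.96×SE. Small differences from the printed values come from rounding in the output.

```python
# Estimates and standard errors as printed for Model 2
estimates = {
    "Intercept": (3.3235, 0.3212),
    "h_d2": (-1.7884, 0.1583),
}

Z_975 = 1.959964  # 97.5th percentile of the standard normal distribution

wald = {}
for name, (est, se) in estimates.items():
    chi_square = (est / se) ** 2   # Wald chi-square on 1 DF
    lower = est - Z_975 * se       # Wald 95% confidence limits
    upper = est + Z_975 * se
    wald[name] = (chi_square, lower, upper)
    print(f"{name}: chi-square = {chi_square:.2f}, 95% CI = ({lower:.4f}, {upper:.4f})")
```

Neither interval includes 0, which matches the very small Pr > ChiSq values in the table.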
SAS/GRAPH
But!
Linear v Quadratic in x = h/d²?
Finally!
Modelling counts
Poisson Regression
For count data
- where eg we count all – not a subset out of a total
To estimate the mean, μ, and its relationship with an explanatory variable x use a log link (usually):
log(μ) = a + bx
ie
μ = exp(a + bx) (which will be > 0)
  = e^a e^(bx)
SAS/GENMOD
model count = x /
  link = log
  dist = poisson;
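The log link has a simple multiplicative reading, which the Python sketch below illustrates. The coefficients a = 0.5 and b = 0.2 are hypothetical values chosen only for the example, not estimates from any data set in these notes.

```python
import math

# Hypothetical coefficients, for illustration only
a, b = 0.5, 0.2

def poisson_mean(x):
    """Mean under the log link: mu = exp(a + bx) = e^a * e^(bx)."""
    return math.exp(a + b * x)

# The mean is always positive, even for very negative x
always_positive = poisson_mean(-100) > 0

# Increasing x by 1 multiplies the mean by e^b
ratio = poisson_mean(3) / poisson_mean(2)

print(f"mean at x=2: {poisson_mean(2):.3f}, mean at x=3: {poisson_mean(3):.3f}")
print(f"ratio = {ratio:.4f}, exp(b) = {math.exp(b):.4f}")
```

So under the log link b is read as a multiplicative effect on the mean count, much as exp(b) was an odds ratio in the logistic model.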
Example: Horseshoe crabs & satellites
Each female crab had an attached male (in her nest) & other males (satellites) residing nearby.
• Data recorded
– No. of satellites (response)
– Color (light medium, medium, dark medium, dark)
– Spine condition (both good, one worn/broken, both worn/broken)
– Carapace width (cm)
– Weight (kg)
• Poisson Models:
– Log link: log(μ) = a + bx
– Identity link: μ = a + bx
Figure: Effect of width and colour
Figure: Grouping weight & number values
Figure: Variation in no. satellites
Practical exercise
SAS/GENMOD for Poisson Regression
Lab Session 5 Exercise 5.3 (a) – (e)