Statistics for Medical Researchers
Hongshik Ahn
Professor
Department of Applied Math and Statistics
Stony Brook University
Biostatistician, Stony Brook GCRC
Contents
1. Experimental Design
2. Descriptive Statistics and Distributions
3. Comparison of Means
4. Comparison of Proportions
5. Power Analysis/Sample Size Calculation
6. Correlation and Regression
2
1. Experimental Design
Experiment
Treatment: something that researchers
administer to experimental units
Factor: controlled independent variable
whose levels are set by the experimenter
Experimental design
Control
Treatment
Placebo effect
Blind
single blind, double blind, triple blind
3
1. Experimental Design
Randomization
Completely randomized design
Randomized block design: if there are
specific differences among groups of subjects
Permuted block randomization: used for small
studies to maintain reasonably good balance
among groups
Stratified block randomization: matching
4
1. Experimental Design
Completely randomized design
The computer generated sequence:
4,8,3,2,7,2,6,6,3,4,2,1,6,2,0,…….
Two Groups (criterion: even-odd):
AABABAAABAABAAA……
Three Groups:
(criterion:{1,2,3}~A, {4,5,6}~B, {7,8,9}~C; ignore 0’s)
BCAACABBABAABA……
Two Groups, different randomization ratio (e.g., 2:3):
(criterion:{0,1,2,3}~A, {4,5,6,7,8,9}~B)
BBAABABBABAABAA……..
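The digit-to-group rules above can be sketched in Python. This is a minimal illustration: the function name `allocate` and the mapping dictionaries are hypothetical, and the even-odd rule is taken as even = A, odd = B because that reproduces the allocation string on the slide.

```python
def allocate(digits, mapping):
    """Translate random digits into group labels; unmapped digits are skipped."""
    return "".join(mapping[d] for d in digits if d in mapping)

# The computer-generated sequence shown on the slide.
digits = [4, 8, 3, 2, 7, 2, 6, 6, 3, 4, 2, 1, 6, 2, 0]

# Two groups, even-odd criterion (even -> A, odd -> B).
even_odd = {d: "A" if d % 2 == 0 else "B" for d in range(10)}
# Three groups: {1,2,3} -> A, {4,5,6} -> B, {7,8,9} -> C; 0 is not mapped, so it is ignored.
three_way = {d: "ABC"[(d - 1) // 3] for d in range(1, 10)}
# Two groups with a 2:3 ratio: four digits map to A, six digits map to B.
two_to_three = {d: "A" if d <= 3 else "B" for d in range(10)}

two_groups = allocate(digits, even_odd)
three_groups = allocate(digits, three_way)
ratio_groups = allocate(digits, two_to_three)
print(two_groups)
print(three_groups)
print(ratio_groups)
```

In a real trial the digits would come from a random number generator rather than a fixed list.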
5
1. Experimental Design
Permuted block randomization
With a block size of 4 for two groups(A,B), there are 6
possible permutations and they can be coded as:
1=AABB, 2=ABAB, 3=ABBA, 4=BAAB, 5=BABA, 6=BBAA
Each number in the random number sequence in turn
selects the next block, determining the next four participant
allocations (ignoring numbers 0,7,8 and 9).
e.g., The sequence 67126814…. will produce BBAA AABB
ABAB BBAA AABB BAAB.
In practice, a block size of four is too small since
researchers may crack the code and risk selection bias.
Mixing block sizes of 4 and 6 is better, with the size kept
unknown to the investigator.
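The block-selection rule can be sketched as follows. The names `BLOCKS` and `permuted_blocks` are illustrative; the six permutations are generated in sorted order, which happens to match the 1 to 6 coding on the slide.

```python
from itertools import permutations

# The six distinct orderings of two A's and two B's, coded 1..6 in sorted order.
BLOCKS = {
    code: "".join(p)
    for code, p in enumerate(sorted(set(permutations("AABB"))), start=1)
}

def permuted_blocks(digits):
    """Each digit 1..6 selects the next block of four; other digits are ignored."""
    return [BLOCKS[d] for d in digits if d in BLOCKS]

# The example sequence 67126814 from the slide.
blocks = permuted_blocks([6, 7, 1, 2, 6, 8, 1, 4])
allocation = " ".join(blocks)
print(allocation)
```

Every block contains two A's and two B's, so the group sizes never differ by more than two at any point in enrollment.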
6
1. Experimental Design
Methods of Sampling
Random sampling
Systematic sampling
Convenience sampling
Stratified sampling
7
1. Experimental Design
Random Sampling
Selection so that each individual member has an
equal chance of being selected
Systematic Sampling
Select some starting point and then select every
kth element in the population
8
1. Experimental Design
Convenience Sampling
Use results that are easy to get
9
1. Experimental Design
Stratified Sampling
Draw a sample from each stratum
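The random, systematic, and stratified schemes from the preceding slides can be sketched with the standard library. The population of subject IDs, the strata names, and the function names are all hypothetical; the fixed seed is only there to make the sketch repeatable.

```python
import random

def simple_random_sample(population, n, rng):
    """Every member has the same chance of being selected (without replacement)."""
    return rng.sample(population, n)

def systematic_sample(population, k, start):
    """Begin at index `start`, then take every k-th element."""
    return population[start::k]

def stratified_sample(strata, n_per_stratum, rng):
    """Draw a simple random sample of fixed size from each stratum."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

rng = random.Random(0)                 # fixed seed, for repeatability only
population = list(range(1, 101))       # hypothetical subject IDs 1..100

srs = simple_random_sample(population, 10, rng)
sys_sample = systematic_sample(population, k=10, start=3)
strata = {"stratum_1": population[:60], "stratum_2": population[60:]}
strat = stratified_sample(strata, 5, rng)
print(srs, sys_sample, strat, sep="\n")
```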
10
2.
Descriptive Statistics & Distributions
Parameter: population quantity
Statistic: summary of the sample
Inference for parameters: use sample
Central Tendency
Mean (average)
Median (middle value)
Variability
Variance: measure of variation
Standard deviation (sd): square root of variance
Standard error (se): sd of the estimate
Median, quartiles, min., max, range, boxplot
Proportion
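As a small illustration of these summaries, the sketch below uses Python's `statistics` module on a made-up data set. The standard error shown is that of the sample mean (sd divided by the square root of n), which is one common reading of "sd of the estimate".

```python
import math
import statistics

x = [2, 4, 4, 4, 5, 5, 7, 9]          # hypothetical measurements

mean = statistics.mean(x)
median = statistics.median(x)
variance = statistics.variance(x)      # sample variance (divides by n - 1)
sd = statistics.stdev(x)               # square root of the sample variance
se_mean = sd / math.sqrt(len(x))       # standard error of the sample mean
q1, q2, q3 = statistics.quantiles(x, n=4)   # quartile cut points
data_range = max(x) - min(x)

print(mean, median, variance, sd, se_mean, (q1, q2, q3), data_range)
```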
11
2.
Descriptive Statistics & Distributions
Normal distribution
12
2.
Descriptive Statistics & Distributions
Standard normal distribution:
Mean 0, variance 1
13
2.
Descriptive Statistics & Distributions
Z-test for means
T-test for means if sd is unknown
14
3.
Inference for Means
Two-sample t-test
Two independent groups: Control and treatment
Continuous variables
Assumption: populations are normally distributed
Checking normality
Histogram
Normal probability plot (Q-Q plot): straight?
Shapiro-Wilk test, Kolmogorov-Smirnov test,
Anderson-Darling test
If the normality assumption is violated
T-test is not appropriate.
Possible transformation
Use non-parametric alternative: Mann-Whitney U-test (Wilcoxon rank-sum test)
15
3.
Inference for Means
A clinical trial on effectiveness of drug A in
preventing premature birth
30 pregnant women are randomly assigned to
control and treatment groups of size 15 each
Primary endpoint: weight of the babies at birth
        Treatment   Control
n       15          15
mean    7.08        6.26
sd      0.90        0.96
16
3.
Inference for Means
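Hypothesis: The group means are different
Null hypothesis (Ho): $\mu_1 = \mu_2$
Alternative hypothesis (H1): $\mu_1 \neq \mu_2$
Significance level: $\alpha = 0.05$ (Type I error rate, i.e., false positive rate)
Assumption: Equal variance
Degrees of freedom (df): $n_1 + n_2 - 2$
Calculate the T-value (test statistic)
$T = \dfrac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_p \sqrt{(1/n_1) + (1/n_2)}}$
P-value: probability, under Ho, of a test statistic at least as extreme as the one observed
Reject Ho if p-value < $\alpha$
Do not reject Ho if p-value ≥ $\alpha$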
17
3.
Inference for Means
Previous example: Test at $\alpha = 0.05$
$s_p^2 = \dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \dfrac{14(.90)^2 + 14(.96)^2}{15 + 15 - 2} = 0.866$
$t = \dfrac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_p \sqrt{(1/n_1) + (1/n_2)}} = \dfrac{7.08 - 6.26}{\sqrt{0.866\,[(1/15) + (1/15)]}} = 2.413$
P-value: 0.026 < 0.05
Reject the null hypothesis that there is no drug effect.
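The pooled-variance calculation for the birth-weight example can be checked with a short sketch. The function name `pooled_t` is hypothetical, and because the standard library has no t distribution, the decision is made by comparing against the tabulated critical value t(0.025, 28) = 2.048 rather than by computing a p-value.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t statistic under the equal-variance assumption (null difference = 0)."""
    df = n1 + n2 - 2
    sp2 = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df   # pooled variance
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))                  # se of the mean difference
    return (mean1 - mean2) / se, df, sp2

# Summary statistics from the example: treatment vs. control.
t_stat, df, sp2 = pooled_t(7.08, 0.90, 15, 6.26, 0.96, 15)

T_CRIT = 2.048                      # t(0.025, 28) from a t table
reject_h0 = abs(t_stat) > T_CRIT    # two-sided test at the 0.05 level
print(round(sp2, 3), round(t_stat, 3), df, reject_h0)
```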
18
3.
Inference for Means
Confidence interval (CI):
An interval of values used to estimate the true
value of a population parameter.
Confidence level $1 - \alpha$: the proportion of
times that the CI actually contains the population
parameter, assuming that the estimation process
is repeated a large number of times.
Common choices: 90% CI ($\alpha$ = 10%), 95% CI ($\alpha$ = 5%), 99% CI ($\alpha$ = 1%)
19
3. Inference for Means
CI for a comparison of two means:
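$(\bar{x}_1 - \bar{x}_2) - E < \mu_1 - \mu_2 < (\bar{x}_1 - \bar{x}_2) + E$
where
$E = t_{\alpha/2,\,n_1+n_2-2}\; s_p \sqrt{(1/n_1) + (1/n_2)}$
A 95% CI for the previous example:
$E = t_{.025,\,28}\; s_p \sqrt{(1/15) + (1/15)} = (2.048)\sqrt{.866\,[(1/15) + (1/15)]} = .70$
$(7.08 - 6.26) \pm .70 = (.12,\ 1.52)$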
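The interval for the birth-weight example can be reproduced from the summary statistics. This is a sketch: `mean_diff_ci` is an illustrative name, and the critical value 2.048 is taken from the slide rather than computed.

```python
import math

def mean_diff_ci(mean1, sd1, n1, mean2, sd2, n2, t_crit):
    """CI for mu1 - mu2 using the pooled standard deviation."""
    sp2 = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    margin = t_crit * math.sqrt(sp2 * (1 / n1 + 1 / n2))   # the margin of error E
    diff = mean1 - mean2
    return diff - margin, diff + margin, margin

lower, upper, margin = mean_diff_ci(7.08, 0.90, 15, 6.26, 0.96, 15, t_crit=2.048)
print(round(margin, 2), (round(lower, 2), round(upper, 2)))
```

The interval excludes 0, which agrees with rejecting the null hypothesis at the 0.05 level.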
3.
Inference for Means
SAS programming for Two-Sample T-test
Data steps :
Click ‘File’
Click ‘Import Data’
Select a data source
Click ‘Browse’ and find the path of the data file
Click ‘Next’
Fill the blank of ‘Member’ with the name of the SAS data set
Click ‘Finish’
Procedure steps :
Click ‘Solutions’
Click ‘Analysis’
Click ‘Analyst’
Click ‘File’
Click ‘Open By SAS Name’
Select the SAS data set and Click ‘OK’
Click ‘Statistics’
Click ‘ Hypothesis Tests’
Click ‘Two-Sample T-test for Means’
Select the independent variable as ‘Group’ and the dependent variable as
‘Dependent’
Choose the hypothesis of interest and Click ‘OK’
21
3.
Inference for Means
Click ‘File’ to import data and
create the SAS data set.
Click ‘Solutions’ to create a
project to run statistical test
Click ‘File’ to open the SAS data
set.
Click ‘Statistics’ to select the
statistical procedure.
22
3.
Inference for Means
Mann-Whitney U-Test (Wilcoxon Rank-Sum
Test)
Nonparametric alternative to two-sample t-test
The populations don’t need to be normal
H0: The two samples come from populations
with equal medians
H1: The two samples come from populations
with different medians
23
3.
Inference for Means
Mann-Whitney U-Test Procedure
Temporarily combine the two samples into
one big sample, then replace each sample
value with its rank
Find the sum of the ranks for either one of
the two samples
Calculate the value of the z test statistic
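The ranking step can be sketched as below. The two small samples are made up for illustration, and tied values receive the average of the ranks they occupy, which is the usual convention though the slide does not state it.

```python
def rank_sums(sample1, sample2):
    """Rank the combined sample (ties get the average rank) and sum ranks per sample."""
    combined = sorted(sample1 + sample2)
    rank_of = {}
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j] == combined[i]:
            j += 1
        # Positions i+1 .. j share this value; assign their average rank.
        rank_of[combined[i]] = (i + 1 + j) / 2
        i = j
    r1 = sum(rank_of[v] for v in sample1)
    r2 = sum(rank_of[v] for v in sample2)
    return r1, r2

# Hypothetical data with one tie (2.5 appears in both samples).
r1, r2 = rank_sums([1.1, 2.5, 3.0], [2.5, 4.2])
print(r1, r2)
```

The two rank sums always add up to N(N+1)/2 for a combined sample of size N, which is a quick check on the ranking.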
24
3.
Inference for Means
Mann-Whitney U-Test,
Example
Numbers in parentheses
are their ranks beginning
with a rank of 1 assigned
to the lowest value of
17.7.
R1 and R2: sum of ranks
25
3.
Inference for Means
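Hypothesis: The group medians are different
Ho: Men and women have the same median BMI
H1: Men and women have different median BMIs
$\mu_R = \dfrac{n_1(n_1 + n_2 + 1)}{2} = \dfrac{13(13 + 12 + 1)}{2} = 169$
$\sigma_R = \sqrt{\dfrac{n_1 n_2 (n_1 + n_2 + 1)}{12}} = \sqrt{\dfrac{(13)(12)(13 + 12 + 1)}{12}} = 18.385$
$z = \dfrac{R - \mu_R}{\sigma_R} = \dfrac{187 - 169}{18.385} = 0.98$
p-value 0.33, thus we do not reject H0 at $\alpha$ = 0.05.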
There is no significant difference in BMI between
men and women.
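The z statistic and two-sided p-value for the BMI example can be recomputed from the rank sum R = 187 with n1 = 13 and n2 = 12. The sketch uses the normal approximation via `statistics.NormalDist`; the function name `rank_sum_z` is hypothetical.

```python
import math
from statistics import NormalDist

def rank_sum_z(rank_sum, n1, n2):
    """Normal approximation for the rank sum of the sample of size n1."""
    mu_r = n1 * (n1 + n2 + 1) / 2
    sigma_r = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum - mu_r) / sigma_r
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided
    return mu_r, sigma_r, z, p_value

mu_r, sigma_r, z, p_value = rank_sum_z(187, 13, 12)
print(mu_r, round(sigma_r, 3), round(z, 2), round(p_value, 2))
```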
26
3.
Inference for Means
SAS Programming for Mann-Whitney U-Test
Procedure
Data steps :
The same as slide 21.
Procedure steps :
Click ‘Solutions’
Click ‘Analysis’
Click ‘Analyst’
Click ‘File’
Click ‘Open By SAS Name’
Select the SAS data set and Click ‘OK’
Click ‘Statistics’
Click ‘ ANOVA’
Click ‘Nonparametric One-Way ANOVA’
Select the ‘Dependent’ and ‘Independent’ variables respectively
and choose the interested test
Click ‘OK’
27
3.
Inference for Means
Click ‘File’ to open the SAS
data set.
Click ‘Statistics’ to select the
statistical procedure.
Select the dependent and independent variables:
28
3.
Inference for Means
Paired t-test
Mean difference of matched pairs
Test for changes (e.g., before & after)
The measures in each pair are correlated.
Assumption: population is normally distributed
Take the difference in each pair and perform one-sample t-test.
Check normality
If the normality assumption is violated
T-test is not appropriate.
Use non-parametric alternative: Wilcoxon signed-rank
test
29
3.
Inference for Means
Notation for paired t-test
d = individual difference between the two
values of a single matched pair
µd = mean value of the differences d for the
population of paired data
$\bar{d}$ = mean value of the differences d for the
paired sample data
sd = standard deviation of the differences d
for the paired sample data
n = number of pairs
30
3.
Inference for Means
Example: Systolic Blood Pressure
ID    Without OC’s    With OC’s    Difference
1     115             128          13
2     112             115          3
3     107             106          -1
4     119             128          9
5     115             122          7
6     138             145          7
7     126             132          6
8     105             109          4
9     104             102          -2
10    115             117          2
OC: Oral contraceptive
31
3.
Inference for Means
Hypothesis: The group means are different
Ho: $\mu_d = 0$ vs. H1: $\mu_d \neq 0$
Significance level: $\alpha$ = 0.05
Degrees of freedom (df): $n - 1 = 9$
Test statistic
$t = \dfrac{\bar{d} - \mu_d}{s_d / \sqrt{n}} = \dfrac{4.8}{4.57 / \sqrt{10}} = 3.32$
P-value: 0.009, thus reject Ho at $\alpha$ = 0.05
The data support the claim that oral
contraceptives affect the systolic bp.
32
3.
Inference for Means
Confidence interval for matched pairs
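100(1-$\alpha$)% CI:
$\left( \bar{d} - t_{\alpha/2,\,n-1} \dfrac{s_d}{\sqrt{n}},\ \ \bar{d} + t_{\alpha/2,\,n-1} \dfrac{s_d}{\sqrt{n}} \right)$
95% CI for the mean difference of the systolic bp:
$\bar{d} \pm t_{0.025,\,9} \dfrac{s_d}{\sqrt{n}} = 4.8 \pm 2.26 \dfrac{4.57}{\sqrt{10}} = 4.8 \pm 3.27$
(1.53, 8.07)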
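The paired t statistic and the 95% interval for the systolic blood pressure example can be recomputed from the ten pairs. This is a sketch: the variable names are illustrative, and the critical value t(0.025, 9) = 2.26 is taken from the slide because the standard library has no t distribution.

```python
import math
import statistics

# Systolic blood pressure without and with oral contraceptives (IDs 1..10).
without_oc = [115, 112, 107, 119, 115, 138, 126, 105, 104, 115]
with_oc = [128, 115, 106, 128, 122, 145, 132, 109, 102, 117]

diffs = [w - wo for wo, w in zip(without_oc, with_oc)]
n = len(diffs)
d_bar = statistics.mean(diffs)            # mean difference
s_d = statistics.stdev(diffs)             # sd of the differences
se = s_d / math.sqrt(n)

t_stat = d_bar / se                       # test of mu_d = 0, df = n - 1
T_CRIT = 2.26                             # t(0.025, 9) as quoted on the slide
lower, upper = d_bar - T_CRIT * se, d_bar + T_CRIT * se
print(round(d_bar, 1), round(s_d, 2), round(t_stat, 2),
      (round(lower, 2), round(upper, 2)))
```

Working from the raw differences gives limits that agree with the slide to within rounding of the sd.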
33
3.
Inference for Means
SAS Programming for Paired T-test
Data steps :
The same as slide 21.
Procedure steps :
Click ‘Solutions’
Click ‘Analysis’
Click ‘Analyst’
Click ‘File’
Click ‘Open By SAS Name’
Select the SAS data set and Click ‘OK’
Click ‘Statistics’
Click ‘ Hypothesis tests’
Click ‘Two-Sample Paired T-test for means’
Select the ‘Group1’ and ‘Group2’ variables respectively
Click ‘OK’
(Note: You can also calculate the difference, and use it as the
dependent variable to run the one-sample t-test)
34
3.
Inference for Means
Click ‘File’ to open the SAS
data set.
Click ‘Statistics’ to select the
statistical procedure.
Put the two group variables into ‘Group 1’ and ‘Group 2’
35
3.
Inference for Means
Comparison of more than two means:
ANOVA (Analysis of Variance)
One-way ANOVA: One factor, e.g., control, drug
1, drug 2
Two-way ANOVA: Two factors, e.g., drugs, age
groups
Repeated measures: If there are repeated
measures within subject, such as time points
36
3.
Inference for means
Example: Pulmonary disease
Endpoint: Mid-expiratory flow (FEF) in L/s
6 groups: nonsmokers (NS), passive smokers (PS),
noninhaling smokers (NI), light smokers (LS),
moderate smokers (MS) and heavy smokers (HS)
Group name    Mean FEF    SD FEF    n
NS            3.78        0.79      200
PS            3.30        0.77      200
NI            3.32        0.86      50
LS            3.23        0.78      200
MS            2.73        0.81      200
HS            2.59        0.82      200
37
3.
Inference for means
Example: Pulmonary disease
Ho: group means are the same
H1: not all the groups means are the same
Source     SS        df      MS        F statistic    P-value
Between    184.38    5       36.875    58.0           <0.001
Within     663.87    1044    0.636
Total      848.25
P-value<0.001
There is a significant difference in the mean FEF
among the groups.
Comparison of specific groups: linear contrast
Multiple comparison: Bonferroni adjustment ($\alpha$/n, where n is the number of comparisons)
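The ANOVA table can be rebuilt from the group means, standard deviations, and sizes alone. The sketch below does this for the FEF example; the Bonferroni line assumes all 15 pairwise comparisons among the six groups are of interest, which the slide does not specify.

```python
# (mean FEF, sd FEF, n) for each smoking group, from the example table.
groups = {
    "NS": (3.78, 0.79, 200), "PS": (3.30, 0.77, 200), "NI": (3.32, 0.86, 50),
    "LS": (3.23, 0.78, 200), "MS": (2.73, 0.81, 200), "HS": (2.59, 0.82, 200),
}

k = len(groups)
n_total = sum(n for _, _, n in groups.values())
grand_mean = sum(m * n for m, _, n in groups.values()) / n_total

ss_between = sum(n * (m - grand_mean) ** 2 for m, _, n in groups.values())
ss_within = sum((n - 1) * sd ** 2 for _, sd, n in groups.values())
df_between, df_within = k - 1, n_total - k

ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_stat = ms_between / ms_within

# Bonferroni: divide alpha by the number of comparisons (assumed: all pairs).
n_comparisons = k * (k - 1) // 2
alpha_per_test = 0.05 / n_comparisons

print(round(ss_between, 2), round(ss_within, 2), df_between, df_within,
      round(f_stat, 1), round(alpha_per_test, 4))
```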
38
3.
Inference for Means
SAS Programming for One-Way ANOVA
Data steps :
The same as slide 21.
Procedure steps :
Click ‘Solutions’
Click ‘Analysis’
Click ‘Analyst’
Click ‘File’
Click ‘Open By SAS Name’
Select the SAS data set and Click ‘OK’
Click ‘Statistics’
Click ‘ ANOVA’
Click ‘One-Way ANOVA’
Select the ‘Independent’ and ‘Dependent’ variables respectively
Click ‘OK’
39
3. Inference for Means
Click ‘File’ to open the SAS
data set.
Click ‘Solutions’ to select the
statistical procedure.
Select the dependent and Independent variables:
40
4.
Inference for Proportions
Chi-square test
Testing difference of two proportions
n: sample size, p: success rate
Requirement: $np \geq 5$ & $n(1 - p) \geq 5$
H0: p1 = p2
H1: $p_1 \neq p_2$ (for two-sided test)
If the requirement is not satisfied, use Fisher’s
exact test.
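A chi-square test for two proportions can be sketched as follows. The counts (30 of 100 versus 45 of 100) are made up for illustration, no continuity correction is applied, and the sample-size requirement is checked on the expected cell counts, which is one common form of the rule. With 1 degree of freedom the p-value can be obtained from the normal distribution.

```python
import math
from statistics import NormalDist

def chi_square_2x2(successes1, n1, successes2, n2):
    """Pearson chi-square (no continuity correction) for a 2x2 table."""
    observed = [[successes1, n1 - successes1], [successes2, n2 - successes2]]
    total = n1 + n2
    row_totals = [n1, n2]
    col_totals = [successes1 + successes2, total - successes1 - successes2]
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    chi2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
               for i in range(2) for j in range(2))
    # A chi-square variable with 1 df is the square of a standard normal.
    p_value = 2 * (1 - NormalDist().cdf(math.sqrt(chi2)))
    return chi2, p_value, expected

chi2, p_value, expected = chi_square_2x2(30, 100, 45, 100)   # hypothetical counts
requirement_met = min(min(row) for row in expected) >= 5
print(round(chi2, 2), round(p_value, 3), requirement_met)
```

If `requirement_met` were false, Fisher's exact test would be the alternative named on the slide.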
41
5.
Power/Sample Size Calculation
Decide significance level (e.g., 0.05)
Decide desired power (e.g., 80%)
One-sided or two-sided test
Comparison of means: two-sample t-test
Need to know sample means in each group
Need to know sample sd’s in each group
Calculation: use software (Nquery, power, etc)
Comparison of proportions: Chi-square test
Need to know sample proportions in each group
Continuity correction
Small sample size: Fisher’s exact test
Calculation: use software
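For a rough sense of what such software computes, the sketch below uses the textbook normal-approximation formula for comparing two means with equal group sizes, n = 2 (z_{α/2} + z_β)² σ² / δ² per group. This is an approximation, not the exact t-based calculation a dedicated package would perform; the function name and the inputs (difference 0.5, sd 1.0) are hypothetical.

```python
import math
from statistics import NormalDist

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided two-sample comparison of means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)            # about 0.84 for 80% power
    return math.ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)

# Hypothetical planning values: detect a mean difference of 0.5 when sd = 1.0.
n = n_per_group(delta=0.5, sd=1.0)
print(n)
```

Halving the detectable difference roughly quadruples the required sample size, since n scales with (σ/δ)².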
42
6.
Correlation and Regression
Correlation
Pearson correlation for continuous variables
Spearman correlation for ranked variables
Chi-square test for categorical variables
Pearson correlation
Correlation coefficient (r): $-1 \leq r \leq 1$
Test for coefficient: t-test
A larger sample gives a more significant result for the same
value of the correlation coefficient
Thus it is not meaningful to judge by the
magnitude of the correlation coefficient alone.
Judge the significance of the correlation by p-value
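The dependence of significance on sample size can be seen from the usual test statistic for a Pearson correlation, t = r √(n − 2) / √(1 − r²) with n − 2 degrees of freedom. The sketch below is illustrative: the function names are hypothetical and the value r = 0.3 is chosen only as an example.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_t(r, n):
    """t statistic for testing zero correlation (df = n - 2); requires |r| < 1."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# The same r = 0.3 yields very different test statistics at n = 10 and n = 100.
t_small = correlation_t(0.3, 10)
t_large = correlation_t(0.3, 100)
print(round(t_small, 2), round(t_large, 2))
```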
43
6.
Correlation and Regression
Regression
Objective
Find out whether a significant linear relationship exists
between the response and independent variables
Use it to predict a future value
Notation
X: independent (predictor) variable
Y: dependent (response) variable
Multiple linear regression model
$y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \varepsilon$
where $\varepsilon$ is the random error
Checking the model (assumption)
Normality: q-q plot, histogram, Shapiro-Wilk test
Equal variance: plot of error (residual) vs. predicted y shows a band shape
Linear relationship: predicted y vs. each x
44
6.
Correlation and Regression
Weight (x1) in LB    Age (x2)    Blood pressure (y)
152                  50          120
183                  20          141
171                  20          124
165                  30          126
158                  30          117
161                  50          129
149                  60          123
158                  50          125
170                  40          132
153                  55          123
164                  40          132
190                  40          155
185                  20          147
45
6.
Correlation and Regression
The regression equation is
$\hat{y} = -65.1 + 1.08 x_1 + 0.425 x_2$
The mean blood pressure increases by 1.08 if weight (x1)
increases by one pound and age (x2) remains fixed.
Similarly, a 1-year increase in age with the weight held
fixed will increase the mean blood pressure by 0.425.
Predictor    Coefficient    se       T-ratio    P-value
Constant     -65.10         14.94    -4.36      0.001
x1           1.077          0.077    13.98      0.000
x2           0.425          0.073    5.82       0.000
s=2.509    $R^2$=95.8%
Error sd is estimated as 2.509 with df=13-3=10
95.8% of the variation in y can be explained by the
regression.
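The fitted coefficients, error sd, and R² can be recomputed from the 13 observations by ordinary least squares. The sketch below solves the normal equations in plain Python so it has no dependencies; the helper `solve` and the variable names are illustrative, and a statistics package would normally be used instead.

```python
import math

# (weight in lb, age, blood pressure) for the 13 subjects in the example.
data = [(152, 50, 120), (183, 20, 141), (171, 20, 124), (165, 30, 126),
        (158, 30, 117), (161, 50, 129), (149, 60, 123), (158, 50, 125),
        (170, 40, 132), (153, 55, 123), (164, 40, 132), (190, 40, 155),
        (185, 20, 147)]

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

X = [[1.0, w, a] for w, a, _ in data]        # design matrix with intercept column
y = [bp for _, _, bp in data]
p = 3                                         # number of estimated coefficients

# Normal equations: (X'X) b = X'y
xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
b0, b1, b2 = solve(xtx, xty)

fitted = [b0 + b1 * w + b2 * a for _, w, a in X]
y_mean = sum(y) / len(y)
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
sst = sum((yi - y_mean) ** 2 for yi in y)
s = math.sqrt(sse / (len(y) - p))             # error sd with df = 13 - 3 = 10
r_squared = 1 - sse / sst

print(round(b0, 2), round(b1, 3), round(b2, 3), round(s, 3), round(r_squared, 3))
```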
46
6. Correlation and Regression
SAS Programming for Linear Regression
Data steps :
The same as slide 21.
Procedure steps :
Click ‘Solutions’
Click ‘Analysis’
Click ‘Analyst’
Click ‘File’
Click ‘Open By SAS Name’
Select the SAS data set and Click ‘OK’
Click ‘Statistics’
Click ‘ Regression’
Click ‘Linear’
Select the ‘Dependent’ (Response) variable and the ‘Explanatory’
(Predictor) variable respectively
Click ‘OK’
47
6.
Correlation and Regression
Click ‘File’ to open the SAS
data set.
Click ‘Solutions’ to select the
statistical procedure.
Select the dependent and explanatory variables:
48
6.
Correlation and Regression
Other regression models
Polynomial regression
Transformation
Logistic regression
49