Transcript Document
Statistics for clinicians
Biostatistics course by Kevin E. Kip, Ph.D., FAHA
Professor and Executive Director, Research Center
University of South Florida, College of Nursing
Professor, College of Public Health
Department of Epidemiology and Biostatistics
Associate Member, Byrd Alzheimer’s Institute
Morsani College of Medicine
Tampa, FL, USA
SECTION 1.1
Module Overview
and Introduction. Part 1
Introduction to biostatistics,
descriptive statistics,
SPSS, and Power Point.
Module 1 Learning Objectives:
1. Describe the characteristics of different types of variables
(e.g. nominal, ordinal, continuous, etc.)
2. Calculate proportions, ratios, and percentages
3. Calculate and interpret the measures prevalence, incidence,
and relative risk
4. Calculate and interpret descriptive statistics: mean, median,
mode, range, percentiles, variance, standard deviation, etc.
5. Explain the properties of skewness and kurtosis
6. Identify the data structure and basic features of the SPSS
software program.
7. Generate plots to depict frequency distributions (bar charts,
box plots, line graphs, scatter plots, histograms).
8. Demonstrate the basics of SPSS data manipulation and use
of the syntax editor.
Assigned Reading:
Module 1
Textbook: Essentials of Biostatistics in
Public Health
Section 1.1
Chapter 3
Chapter 4
Biostatistics:
The application of statistical principles in medicine,
public health (e.g. nursing), or biology.
• Make health-related inferences about a population
(i.e. we can’t study everyone in the population).
• Use biostatistical principles grounded in mathematical
and probability theory for “best” estimates of health
summaries and measures of effect or association.
Key Terms and Concepts:
Variable types (dichotomous, nominal,
ordinal, continuous)
Proportions and percentages
Ratios
Prevalence, incidence, and relative risk
Mean, median, mode, and range
Percentiles and interquartile range
Variance, standard deviation, and standard
error of mean
Coefficient of variation
Skewness and kurtosis
Introduction to SPSS
Introduction to Power Point
Variable Types:
Dichotomous (2 categories, may signify order)
Male / female
Low / high
Nominal (2 or more categories, no order)
Male / female
Unmarried / married / divorced / widowed
Ordinal (categorical variable, with categories
ordered in a meaningful sequence)
Strongly agree / agree / undecided /
disagree / strongly disagree
Continuous (can assume one of a large or
infinite number of values)
e.g. financial gain from $0 to $50,000
Variable Types (Identify the correct type(s)):
Variable          Scoring                                                    Dichot.  Nominal  Ordinal  Contin.
Quality of life   1 = Poor, 2 = Fair, 3 = Average, 4 = Good, 5 = Very Good
Ethnicity         1 = Non-Hispanic, 2 = Hispanic
Race              1 = African American, 2 = Caucasian, 3 = Other
Diabetes          1 = Absent, 2 = Present
Systolic BP       Ranges from 95 to 190 mmHg
Variable Types (Identify the correct type(s)):
Variable          Scoring                                                    Dichot.  Nominal  Ordinal  Contin.
Quality of life   1 = Poor, 2 = Fair, 3 = Average, 4 = Good, 5 = Very Good
Ethnicity         1 = Non-Hispanic, 2 = Hispanic
Race              1 = African American, 2 = Caucasian, 3 = Other
Diabetes          1 = Absent, 2 = Present
Systolic BP       Ranges from 95 to 190 mmHg
●
●
●
●
●
●
Proportions and Percentages:
Persons included in the numerator are always
included in the denominator:
Proportion (P) = A / (A + B)
where B = all remaining
Indicates the magnitude of a part, related to the total.
Tells us the fraction of the population that is affected.
Percentage = proportion x 100
Proportion range: 0 to 1.0 Percentage range: 0 to 100
Proportions and Percentages:
Smoking Status    N      P        %
Never             60     0.462    46.2
Former            45     0.346    34.6
Current           25     0.192    19.2
Total             130    1.0      100.0

P(Never)   = 60 / (60 + 45 + 25)
P(Former)  = 45 / (60 + 45 + 25)
P(Current) = 25 / (60 + 45 + 25)
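The smoking-status table above can be reproduced with a few lines of Python. This is only a minimal sketch: the names `counts` and `proportion` are illustrative and not part of the course materials (the course itself uses SPSS).

```python
# Counts taken from the smoking-status example above.
counts = {"Never": 60, "Former": 45, "Current": 25}
total = sum(counts.values())  # denominator: everyone, including the numerator


def proportion(part, whole):
    """Return part / whole; persons in the numerator are also in the denominator."""
    return part / whole


proportions = {status: proportion(n, total) for status, n in counts.items()}
percentages = {status: 100 * p for status, p in proportions.items()}

for status, n in counts.items():
    print(f"{status}: N={n}, P={proportions[status]:.3f}, %={percentages[status]:.1f}")
```

Because every person falls into exactly one category, the proportions sum to 1.0 and the percentages to 100.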
Proportions and Percentages (Calculate):
Blood Pressure Status      N      P      %
Normal                     40
Pre-hypertensive           75
Stage I hypertension       25
Stage II hypertension      10
Total
Proportions and Percentages (Calculate):
Blood Pressure Status      N      P        %
Normal                     40     0.267    26.7
Pre-hypertensive           75     0.500    50.0
Stage I hypertension       25     0.167    16.7
Stage II hypertension      10     0.067    6.7
Total                      150    1.0      100

P(Normal)   = 40 / (40 + 75 + 25 + 10)
P(pre-hyp)  = 75 / (40 + 75 + 25 + 10)
P(stage I)  = 25 / (40 + 75 + 25 + 10)
P(stage II) = 10 / (40 + 75 + 25 + 10)
Ratios:
Like a proportion, a ratio is a fraction, but without a
specified relationship between the numerator and
denominator
Example: Occurrence of Major Depression
Female cases = 240
Male cases = 120
Ratio = 240 / 120 = 2:1 female to male
SECTION 1.2
Epidemiological
Measures
Prevalence (proportion):
The presence (proportion) of disease or condition in a
population (generally irrespective of the duration
of the disease)
Prevalence: Quantifies the “burden” of disease.
P = (Number of existing cases) / (Total population)
At a set point in time (i.e. September 30, 1999)
Prevalence (proportion):
Example:
On June 30, 1999, neighborhood A has:
• Population of 1,600
• 29 current cases of hepatitis B
• 1,571 individuals without hepatitis B
So, P = 29 / 1600 = 0.018 or 1.8%
Cumulative Incidence (CI)
CI = (No. of new cases of disease during a given period) /
     (Total population at risk during the given period)
Example: During a 1-year period, 10 out of 100 “at
risk” persons develop the disease of interest.
CI = 10 / 100 = 0.10 or 10.0%
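Prevalence and cumulative incidence are both simple proportions; they differ in what goes into the numerator (existing versus new cases) and the denominator (total population versus population at risk). A short Python sketch using the hepatitis B and 1-year follow-up examples above (the function names are illustrative only):

```python
def prevalence(existing_cases, total_population):
    """Existing cases divided by the total population at a set point in time."""
    return existing_cases / total_population


def cumulative_incidence(new_cases, population_at_risk):
    """New cases during a period divided by the population at risk during that period."""
    return new_cases / population_at_risk


p = prevalence(29, 1600)            # neighborhood A, hepatitis B
ci = cumulative_incidence(10, 100)  # 10 of 100 at-risk persons in 1 year

print(f"Prevalence = {p:.3f} ({100 * p:.1f}%)")
print(f"Cumulative incidence = {ci:.2f} ({100 * ci:.1f}%)")
```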
Relative Risk (RR)
Compares the incidence of disease (risk)
among the exposed with the incidence of
disease (risk) among the non-exposed
(“reference”) by means of a ratio.
The reference group assumes a value of 1.0
(the “null” value)
The ‘null’ value (1.0)
• If the relative risk estimate is > 1.0,
  the exposure appears to be a risk
  factor for disease.
• If the relative risk estimate is < 1.0,
  the exposure appears to be protective
  of disease occurrence.
Risk Ratio = Incidence(E+) / Incidence(E-)
Where E = exposure status
Hypothesis: Being subject to physical abuse in childhood is
associated with lifetime risk of attempted suicide
Results:
Of 2,240 children not subject to physical abuse,
16 have attempted suicide.
Of 840 children subjected to physical abuse, 10
have attempted suicide.

         D+     D-       Total
E+       10     830      840
E-       16     2,224    2,240

RR = I(E+) / I(E-)
RR = (10 / 840) / (16 / 2,240)
RR = 0.0119 / 0.00714 = 1.67
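The risk ratio calculation can be checked with a small Python sketch (the function name and argument names are illustrative, not from the course). Carrying full precision through the division gives about 1.67; rounding the two incidences before dividing can shift the second decimal place.

```python
def risk_ratio(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Incidence among the exposed divided by incidence among the non-exposed."""
    incidence_exposed = exposed_cases / exposed_total
    incidence_unexposed = unexposed_cases / unexposed_total
    return incidence_exposed / incidence_unexposed


# Childhood physical abuse (exposure) and attempted suicide (outcome).
rr = risk_ratio(exposed_cases=10, exposed_total=840,
                unexposed_cases=16, unexposed_total=2240)
print(f"RR = {rr:.2f}")
```

A value above the null of 1.0 suggests the exposure is a risk factor; a value below 1.0 suggests it is protective.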
Risk Ratio = Incidence(E+) / Incidence(E-)
Where E = exposure status
Practice:
Hypothesis: Being subject to physical abuse in childhood is
associated with lifetime risk of attempted suicide
Results:
Of 1,750 children not subject to physical abuse,
14 have attempted suicide.
Of 620 children subjected to physical abuse, 12
have attempted suicide.

         D+     D-       Total
E+
E-

RR = I(E+) / I(E-)
RR =
Risk Ratio = Incidence(E+) / Incidence(E-)
Where E = exposure status
Practice:
Hypothesis: Being subject to physical abuse in childhood is
associated with lifetime risk of attempted suicide
Results:
Of 1,750 children not subject to physical abuse,
14 have attempted suicide.
Of 620 children subjected to physical abuse, 12
have attempted suicide.

         D+     D-       Total
E+       12     608      620
E-       14     1,736    1,750

RR = I(E+) / I(E-)
RR = (12 / 620) / (14 / 1,750)
RR = 0.0194 / 0.008 = 2.42
Odds Ratio = Odds of Exposure(D+) / Odds of Exposure(D-)
Where D = disease (outcome) status
Hypothesis: Eating chili peppers is associated with development
of gastric cancer.
Cases: 21
  12 ate chili peppers
  9 did not eat chili peppers
Controls: 479
  88 ate chili peppers
  391 did not eat chili peppers

         E+         E-          Total
D+       12 (a)     9 (c)       21
D-       88 (b)     391 (d)     479

OR = (a / c) / (b / d)
OR = (12 / 9) / (88 / 391)
OR = 1.333 / 0.225 = 5.92
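The same arithmetic can be written as a short Python sketch. The cell labels a, b, c, d follow the 2x2 table in the example above; the function name is illustrative only.

```python
def odds_ratio(a, b, c, d):
    """Odds of exposure among cases (a / c) divided by odds of exposure among controls (b / d)."""
    odds_cases = a / c
    odds_controls = b / d
    return odds_cases / odds_controls


# Chili pepper consumption (exposure) and gastric cancer (outcome).
or_estimate = odds_ratio(a=12, b=88, c=9, d=391)
print(f"OR = {or_estimate:.2f}")
```

Algebraically this equals the cross-product (a x d) / (b x c), which is a convenient check on hand calculations.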
Odds Ratio = Odds of Exposure(D+) / Odds of Exposure(D-)
Where D = disease (outcome) status
Practice:
Hypothesis: Eating chili peppers is associated with development
of gastric cancer.
Cases: 44
  14 ate chili peppers
  30 did not eat chili peppers
Controls: 610
  100 ate chili peppers
  510 did not eat chili peppers

         E+         E-          Total
D+       (a)        (c)
D-       (b)        (d)

OR = (a / c) / (b / d)
OR =
Odds Ratio = Odds of Exposure(D+) / Odds of Exposure(D-)
Where D = disease (outcome) status
Practice:
Hypothesis: Eating chili peppers is associated with development
of gastric cancer.
Cases: 44
  14 ate chili peppers
  30 did not eat chili peppers
Controls: 610
  100 ate chili peppers
  510 did not eat chili peppers

         E+         E-          Total
D+       14 (a)     30 (c)      44
D-       100 (b)    510 (d)     610

OR = (a / c) / (b / d)
OR = (14 / 30) / (100 / 510)
OR = 0.467 / 0.196 = 2.38
SECTION 1.3
Descriptive
Statistics
Mean, Median, Mode, and Range
• Mean, median, and mode are 3 kinds of "averages".
• “Mean" is the "average" where you add up all the
numbers and then divide by the number of numbers.
• “Median" is the "middle" value in the list of numbers.
• “Mode" is the value that occurs most often.
• “Range" is the difference between largest and smallest
values.
Formula for the population mean
The population mean is calculated using a formula:
$$\mu = \frac{\sum x}{n}$$
μ (mu) is the symbol for the population mean
“sum all the observations of x, and divide by n”
Formula for the sample mean
The sample mean is calculated using a formula:
$$\bar{x} = \frac{\sum x}{n}$$
x̄ (x bar) is the symbol for the sample mean
“sum all the observations of x, and divide by n”
The mean and the median are summary measures used to
describe the most "typical" value in a set of values.
Statisticians refer to the mean and median as measures of
central tendency.
Mean, Median, Mode, and Range
Child            1    2    3    4    5    6    7    8    9    10
Age (years)      8    10   9    9    10   9    11   11   11   11
Weight (lbs.)    52   64   65   70   72   76   80   84   88   94

1. Calculate Mean Weight
X̄ = (52+64+65+70+72+76+80+84+88+94) / 10 =
745 / 10 = 74.5
2. What is the median weight? Note that Child 5 and 6 are
both in the “middle”
Median = (72 + 76) / 2 = 74
3. What is the mode age?
= 11 (4 occurrences)
4. What is the weight range?
= 94 – 52 = 42
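The four summary measures in this example can be verified with Python's standard `statistics` module. This is a minimal sketch; the variable names are illustrative.

```python
import statistics

# Data for the 10 children in the example above.
ages = [8, 10, 9, 9, 10, 9, 11, 11, 11, 11]
weights = [52, 64, 65, 70, 72, 76, 80, 84, 88, 94]

mean_weight = statistics.mean(weights)      # sum of values / number of values
median_weight = statistics.median(weights)  # average of the two middle values when n is even
mode_age = statistics.mode(ages)            # most frequent value
weight_range = max(weights) - min(weights)  # largest minus smallest

print(mean_weight, median_weight, mode_age, weight_range)
```

Note that `statistics.median` sorts the data internally, so the values do not need to be reordered first as they do for a hand calculation.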
Mean, Median, Mode, and Range (Calculate):
Child            1    2    3    4    5    6    7    8    9    10
Age (years)      11   10   8    11   12   10   12   8    10   10
Weight (lbs.)    72   63   59   94   88   72   88   50   58   66

Note: To determine the median and range, you need to reorder
weight from lowest to highest.
Weight (lbs.)
1. Calculate Mean Weight
X̄ =
2. What is the median weight?
=
3. What is the mode age?
=
4. What is the weight range?
=
Mean, Median, Mode, and Range (Calculate):
Child            1    2    3    4    5    6    7    8    9    10
Age (years)      11   10   8    11   12   10   12   8    10   10
Weight (lbs.)    72   63   59   94   88   72   88   50   58   66

Note: To determine the median, you need to reorder weight
from lowest to highest
Weight (lbs.)    50   58   59   63   66   72   72   88   88   94

1. Calculate Mean Weight
X̄ = (50+58+59+63+66+72+72+88+88+94) / 10 = 71.0
2. What is the median weight?
= (66 + 72) / 2 = 69
3. What is the mode age?
= 10 (4 occurrences)
4. What is the weight range?
= 94 – 50 = 44
Percentiles and Interquartile Range
• Percentile (or centile): Value of a variable below
which a certain percent of observations fall.
• Example: The 20th percentile is the value (or score)
below which 20 percent of the observations are found.
• The 25th percentile is known as first quartile (Q1);
50th percentile as median or second quartile (Q2);
and the 75th percentile as the third quartile (Q3).
• The interquartile range is equal to Q3 minus Q1.
Percentiles and Interquartile Range
• Tertiles: 3 equal parts
  Percentile points = 33.3, 66.7
• Quartiles: 4 equal parts
  Percentile points = 25, 50, 75
• Quintiles: 5 equal parts
  Percentile points = 20, 40, 60, 80
• Deciles: 10 equal parts
  Percentile points = 10, 20, 30, 40, 50, 60, 70, 80, 90
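As a rough illustration of quartiles and the interquartile range, the sketch below applies `statistics.quantiles` (Python 3.8+) to the ten weights from the earlier mean/median example. Several conventions exist for interpolating percentiles in small samples; the `method="inclusive"` choice here is an assumption for illustration, and other software (including SPSS) may report slightly different cut points.

```python
import statistics

# Weights (lbs.) of the 10 children from the earlier example.
weights = [52, 64, 65, 70, 72, 76, 80, 84, 88, 94]

# n=4 requests the three cut points that split the data into 4 equal parts.
q1, q2, q3 = statistics.quantiles(weights, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range = Q3 minus Q1

print(f"Q1={q1}, Q2 (median)={q2}, Q3={q3}, IQR={iqr}")
```

Passing `n=3`, `n=5`, or `n=10` instead would return the tertile, quintile, or decile cut points.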
Percentiles and Interquartile Range (Example)
[Figure: distribution of age in years (x-axis, 45 to 75) by % of participants (y-axis, 0 to 12)]
                                 Cumulative   Cumulative
AGE     Frequency    Percent     Frequency    Percent
----------------------------------------------------------
45      34           1.70        34           1.70
46      39           1.95        73           3.65
47      42           2.10        115          5.75
48      40           2.00        155          7.75
49      54           2.70        209          10.45
50      57           2.85        266          13.30
51      96           4.80        362          18.10
52      82           4.10        444          22.20
53      87           4.35        531          26.55
54      76           3.80        607          30.35
55      107          5.35        714          35.70
56      96           4.80        810          40.50
57      94           4.70        904          45.20
58      108          5.40        1012         50.60
59      85           4.25        1097         54.85
60      81           4.05        1178         58.90
61      60           3.00        1238         61.90
62      84           4.20        1322         66.10
63      75           3.75        1397         69.85
64      73           3.65        1470         73.50
65      75           3.75        1545         77.25
66      62           3.10        1607         80.35
67      60           3.00        1667         83.35
68      51           2.55        1718         85.90
69      66           3.30        1784         89.20
70      50           2.50        1834         91.70
71      49           2.45        1883         94.15
72      45           2.25        1928         96.40
73      38           1.90        1966         98.30
74      32           1.60        1998         99.90
75      2            0.10        2000         100.00
Percentile Points/Groups
Tertiles:
  0 – 33.3%        45 to 54
  >33.3 – 66.7%    55 to 62
  >66.7 to 100%    63 to 75
Quartiles:
  0 - 25%          45 to 52
  >25 – 50%        53 to 57
  >50 – 75%        58 to 64
  >75 to 100%      65 to 75
Quintiles:
  0 - 20%          45 to 51
  >20 – 40%        52 to 55
  >40 – 60%        56 to 60
  >60 to 80%       61 to 65
  >80 – 100%       66 to 75
Percentiles and Interquartile Range (Identify)
[Figure: distribution of diastolic blood pressure in mmHg (x-axis, 42 to 138) by % of participants (y-axis, 0 to 20)]
                                        Cumulative   Cumulative
PE_DIASTOLIC   Frequency    Percent     Frequency    Percent
-----------------------------------------------------------------
40             1            0.05        1            0.05
46             1            0.05        2            0.10
50             1            0.05        3            0.15
52             1            0.05        4            0.20
54             1            0.05        5            0.25
56             5            0.25        10           0.50
58             7            0.35        17           0.85
60             23           1.15        40           2.00
62             35           1.75        75           3.76
64             55           2.76        130          6.51
66             32           1.60        162          8.12
68             74           3.71        236          11.82
70             113          5.66        349          17.48
72             115          5.76        464          23.25
74             112          5.61        576          28.86
75             4            0.20        580          29.06
76             101          5.06        681          34.12
78             170          8.52        851          42.64
79             1            0.05        852          42.69
80             169          8.47        1021         51.15
82             162          8.12        1183         59.27
83             3            0.15        1186         59.42
84             154          7.72        1340         67.13
85             2            0.10        1342         67.23
86             137          6.86        1479         74.10
87             2            0.10        1481         74.20
88             113          5.66        1594         79.86
90             104          5.21        1698         85.07
91             1            0.05        1699         85.12
92             81           4.06        1780         89.18
94             49           2.45        1829         91.63
95             1            0.05        1830         91.68
96             39           1.95        1869         93.64
97             1            0.05        1870         93.69
98             43           2.15        1913         95.84
100            26           1.30        1939         97.14
102            17           0.85        1956         98.00
104            7            0.35        1963         98.35
105            1            0.05        1964         98.40
106            3            0.15        1967         98.55
108            6            0.30        1973         98.85
110            12           0.60        1985         99.45
112            4            0.20        1989         99.65
114            2            0.10        1991         99.75
116            1            0.05        1992         99.80
118            2            0.10        1994         99.90
130            1            0.05        1995         99.95
137            1            0.05        1996         100.00
Percentile Points/Groups
Tertiles:
  0 – 33.3%        ________
  >33.3 – 66.7%    ________
  >66.7 to 100%    ________
Quartiles:
  0 - 25%          ________
  >25 – 50%        ________
  >50 – 75%        ________
  >75 to 100%      ________
Quintiles:
  0 - 20%          ________
  >20 – 40%        ________
  >40 – 60%        ________
  >60 to 80%       ________
  >80 – 100%       ________
Percentile Points/Groups
Tertiles:
  0 – 33.3%        40 to 75
  >33.3 – 66.7%    76 to 83
  >66.7 to 100%    84 to 137
Quartiles:
  0 - 25%          40 to 72
  >25 – 50%        73 to 79
  >50 – 75%        80 to 87
  >75 to 100%      88 to 137
Quintiles:
  0 - 20%          40 to 70
  >20 – 40%        71 to 76
  >40 – 60%        77 to 83
  >60 to 80%       84 to 88
  >80 – 100%       89 to 137
Variance, SD, and SE of Mean
Population variance: Average squared deviation from the
population mean, as defined by the following formula:
$$\sigma^2 = \frac{\sum_{i=1}^{n} (X_i - \mu)^2}{n}$$
σ² is the population variance
μ is the population mean
Xi is the ith element from the population
n is the number of elements in the population.
Observations from a simple random sample can be used to
estimate the variance of a population. For this purpose,
sample variance is defined by a slightly different formula,
and uses a slightly different notation:
Sample variance is calculated using a formula:
$$S_X^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$
Variance is the mean of the squared deviations
of the observations
Sample Variance Calculation
ID       Age     X̄        (X – X̄)²
1        43      49.5     42.25
2        25      49.5     600.25
3        31      49.5     342.25
4        55      49.5     30.25
5        45      49.5     20.25
6        62      49.5     156.25
7        41      49.5     72.25
8        58      49.5     72.25
9        38      49.5     132.25
10       52      49.5     6.25
11       70      49.5     420.25
12       74      49.5     600.25
Total    Mean: 49.5       ∑ = 2,495

S²X = 2,495 / (12 - 1)
S²X = 226.8
Range = (74 – 25) = 49 years
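The column-by-column calculation above maps directly onto a few lines of Python. The sketch below recomputes the sum of squared deviations by hand and then cross-checks it against `statistics.variance`, which uses the same n - 1 denominator. Variable names are illustrative.

```python
import statistics

# Ages of the 12 participants in the sample variance example above.
ages = [43, 25, 31, 55, 45, 62, 41, 58, 38, 52, 70, 74]

n = len(ages)
mean_age = sum(ages) / n
squared_deviations = [(x - mean_age) ** 2 for x in ages]
sum_sq_dev = sum(squared_deviations)

sample_variance = sum_sq_dev / (n - 1)  # divide by n - 1 for a sample
age_range = max(ages) - min(ages)

print(f"mean={mean_age}, sum of squares={sum_sq_dev}, "
      f"variance={sample_variance:.1f}, range={age_range}")

# Cross-check against the standard library implementation.
assert abs(sample_variance - statistics.variance(ages)) < 1e-9
```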
Sample Variance Calculation (Practice)
ID       Age     X̄        (X – X̄)²
1        45      47.83
2        38      47.83
3        32      47.83
4        57      47.83
5        43      47.83
6        64      47.83
7        48      47.83
8        55      47.83
9        32      47.83
10       60      47.83
11       54      47.83
12       46      47.83
Total    Mean: 47.83      ∑ =

S²X =
Range =
Sample Variance Calculation (Practice)
ID       Age     X̄        (X – X̄)²
1        45      47.83    8.03
2        38      47.83    96.69
3        32      47.83    250.69
4        57      47.83    84.03
5        43      47.83    23.36
6        64      47.83    261.36
7        48      47.83    0.03
8        55      47.83    51.36
9        32      47.83    250.69
10       60      47.83    148.03
11       54      47.83    38.03
12       46      47.83    3.36
Total    Mean: 47.83      ∑ = 1,215.67

S²X = 1,215.67 / (12 - 1)
S²X = 110.5
Range = (64 – 32) = 32 years
Sample A: S²X = 226.8
Sample B: S²X = 110.5
Question: Why is the variance for Sample A much
larger than the variance for Sample B?
Standard Deviation (SD): Square root of the variance.
σ = sqrt [ σ² ]
• The standard deviation is a measure of variation
• Unlike variance, the SD is in the same scale as the
variable of interest (i.e. age in this example)
Sample A: S²X = 226.8, so σ = sqrt [226.8] = 15.1
Sample B: S²X = 110.5, so σ = sqrt [110.5] = 10.5
Standard Error of the Mean (SEM)
• SEM: Standard deviation (s) of the error in a sample
mean relative to the true mean, per the formula below:
$$SEM = \frac{s}{\sqrt{n}}$$
• Represents how close to the population mean the
sample mean is likely to be
• Decreases with larger sample sizes, as the estimate of
the population mean improves
• Note: Standard deviation is the degree to which
individuals within a sample differ from the sample
mean – hence, not affected by sample size
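To make the distinction between SD and SEM concrete, the sketch below computes both for the 12 ages of Sample A from the variance example, assuming the usual definition SEM = s / sqrt(n). Variable names are illustrative.

```python
import math
import statistics

# Sample A ages from the sample variance example.
ages = [43, 25, 31, 55, 45, 62, 41, 58, 38, 52, 70, 74]

n = len(ages)
sd = statistics.stdev(ages)  # square root of the sample variance
sem = sd / math.sqrt(n)      # assumed definition: SEM = s / sqrt(n)

print(f"SD = {sd:.1f} years, SEM = {sem:.2f} years")

# With the SD held fixed, quadrupling n halves the SEM.
sem_if_n_quadrupled = sd / math.sqrt(4 * n)
```

The last line illustrates the point in the bullets above: the SEM shrinks as the sample size grows, whereas the SD describes spread among individuals and does not systematically shrink with n.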
Coefficient of Variation (CV)
CV: When a sample of data from the population is available, the
population CV is estimated as the ratio of the sample standard deviation
to the sample mean (see formula below)
$$CV = \frac{s}{\bar{x}}$$
• Provides an indication of the size of the standard deviation relative
to the mean
• Independent of the unit in which measurement was taken (i.e. mean)
• Thus, is dimensionless and can be used to compare between datasets
with widely different means
Coefficient of Variation (CV)
            Age       BMI
1           45        28
2           26        23
3           55        48
4           46        31
5           61        36
6           57        22
7           39        40
8           50        25
9           72        32
10          38        42
N           10        10
Mean        48.90     32.70
Variance    172.10    75.34
SD          13.12     8.68
SEM         4.15      2.74
CV          0.27      0.27
• Note the same coefficient
of variation for age and
BMI despite much
different mean, variance,
SD, and SEM
• CV is dimensionless and
can be used to compare
between datasets or
variables with widely
different means
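The summary rows of the age and BMI table can be recomputed with the short Python sketch below. It assumes the sample (n - 1) variance and SEM = SD / sqrt(n); the helper name `summarize` is illustrative.

```python
import math
import statistics

age = [45, 26, 55, 46, 61, 57, 39, 50, 72, 38]
bmi = [28, 23, 48, 31, 36, 22, 40, 25, 32, 42]


def summarize(values):
    """Return N, mean, sample variance, SD, SEM, and CV for a list of numbers."""
    n = len(values)
    mean = statistics.mean(values)
    variance = statistics.variance(values)  # n - 1 denominator
    sd = math.sqrt(variance)
    return {
        "N": n,
        "Mean": mean,
        "Variance": variance,
        "SD": sd,
        "SEM": sd / math.sqrt(n),
        "CV": sd / mean,  # unitless: SD relative to the mean
    }


age_summary = summarize(age)
bmi_summary = summarize(bmi)

for label, summary in (("Age", age_summary), ("BMI", bmi_summary)):
    print(label, {key: round(value, 2) for key, value in summary.items()})
```

Both coefficients of variation round to 0.27 even though the means, variances, and SDs differ substantially.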
Skewness and Kurtosis
Skewness: Measure of asymmetry of the distribution of a continuous variable.
• Can be positive or negative, or even undefined.
• Negative skew indicates that the tail on the left side of distribution is longer than
the right side and bulk of the values lie to the right of the mean.
• Positive skew indicates that the tail on the right side is longer than the left side
and bulk of the values lie to the left of the mean.
• General guideline for skewed distribution: absolute value > 1.
For a sample of n values,
the sample skewness is:
Skewness and Kurtosis
Kurtosis: Measure of "peakedness" of the distribution of a continuous variable, as
compared to a normal distribution.
• Similar to skewness, kurtosis is a descriptor of the shape of a probability
distribution.
For a sample of n values the sample
excess kurtosis is
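Several estimators of sample skewness and excess kurtosis are in use. The sketch below uses the simple moment-based versions (third and fourth central moments scaled by powers of the second central moment), which is an assumption and may not be the exact formula shown on the original slide. Statistical packages such as SPSS typically apply small-sample adjustments, so their output can differ. Function names and the two small data sets are illustrative.

```python
def central_moment(values, k):
    """k-th central moment: the average of (x - mean) ** k."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** k for x in values) / n


def sample_skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5 (0 for a symmetric sample)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def sample_excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2 ** 2 - 3 (0 for a normal distribution)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


symmetric = [1, 2, 3, 4, 5]      # no skew expected
right_tailed = [1, 1, 1, 2, 10]  # long right tail, so positive skew expected

print(sample_skewness(symmetric))
print(sample_skewness(right_tailed))
print(sample_excess_kurtosis(symmetric))
```

The right-tailed sample has an absolute skewness above 1, which would be flagged as skewed under the general guideline stated earlier.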