Lecture 04. Organization of statistical research


Organization of
statistical research
The role of Biostatisticians
Biostatisticians play essential roles in designing studies, analyzing data, and creating methods to attack research problems as diverse as:
- determination of major risk factors for heart disease, lung disease, and cancer
- testing of new drugs to combat AIDS
- evaluation of potential environmental factors harmful to human health, such as tobacco smoke, asbestos, or pollutants
Applications of Biostatistics
- Public health, including epidemiology, health services research, nutrition, and environmental health
- Design and analysis of clinical trials in medicine
- Genomics, population genetics, and statistical genetics, which link variation in genotype with variation in phenotype in populations. This has been used in agriculture to improve crops and farm animals. In biomedical research, this work can help identify candidate gene alleles that cause or influence predisposition to disease in humans.
- Ecology
- Biological sequence analysis
Statistical methods are beginning to be integrated into:
- medical informatics
- public health informatics
- bioinformatics
Types of Data

Categorical data: values belong to categories
- Nominal data: there is no natural order to the categories, e.g. blood groups
- Ordinal data: there is a natural order, e.g. adverse events (Mild/Moderate/Severe/Life Threatening)
- Binary data: there are only two possible categories, e.g. alive/dead

Numerical data: the value is a number (either measured or counted)
- Continuous data: measurement is on a continuum, e.g. height, age, haemoglobin
- Discrete data: a "count" of events, e.g. number of pregnancies
Measures of Frequency of Events

Incidence
- The number of new events (e.g. death or a particular disease) that occur during a specified period of time in a population at risk of developing the event.

Incidence Rate
- A term related to incidence: the number of new events divided by the total time that individuals in the population were at risk of having the event (e.g. events per person-year).

Prevalence
- The number of persons in the population affected by a disease at a specific time divided by the number of persons in the population at that time.
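The three frequency measures above can be sketched in a few lines of Python. This is a minimal illustration, not a method taken from the lecture: all numbers are hypothetical, and `cumulative_incidence` assumes the common convention of dividing the count of new events by the population at risk, which the slide does not spell out.

```python
def cumulative_incidence(new_events, population_at_risk):
    """New events during the period divided by the population at risk at its start."""
    return new_events / population_at_risk


def incidence_rate(new_events, person_time_at_risk):
    """New events divided by the total time at risk (e.g. person-years)."""
    return new_events / person_time_at_risk


def prevalence(existing_cases, population):
    """People with the disease at one point in time divided by the population then."""
    return existing_cases / population


# Hypothetical numbers, for illustration only.
new_cases = 24          # new events observed during follow-up
at_risk = 1000          # disease-free people at the start of follow-up
person_years = 4800.0   # summed time each person spent at risk
existing_cases = 50     # people with the disease on the survey date
population = 1000       # everyone in the population on the survey date

ci = cumulative_incidence(new_cases, at_risk)
ir = incidence_rate(new_cases, person_years)
prev = prevalence(existing_cases, population)

print(f"cumulative incidence: {ci:.3f}")
print(f"incidence rate: {ir:.4f} events per person-year")
print(f"prevalence: {prev:.3f}")
```

Note that the incidence rate has units (events per unit of person-time), while prevalence is a dimensionless proportion.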
Measures of Association

Relative risk and cohort studies
- The relative risk (or risk ratio) is defined as the incidence of disease in the exposed group divided by the corresponding incidence of disease in the unexposed group.

Odds ratio and case-control studies
- The odds ratio is defined as the odds of exposure in the group with disease divided by the odds of exposure in the control group.
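Both measures can be computed from a 2x2 table of exposure by disease. The sketch below is illustrative only: the cell labels `a`, `b`, `c`, `d` and the counts are assumptions, not data from the lecture.

```python
def relative_risk(a, b, c, d):
    """Risk ratio from a cohort-style 2x2 table.

    a: exposed, diseased      b: exposed, not diseased
    c: unexposed, diseased    d: unexposed, not diseased
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return risk_exposed / risk_unexposed


def odds_ratio(a, b, c, d):
    """Odds of exposure among the diseased divided by odds of exposure among controls."""
    odds_exposure_cases = a / c
    odds_exposure_controls = b / d
    return odds_exposure_cases / odds_exposure_controls


# Hypothetical counts, for illustration only.
a, b, c, d = 30, 70, 10, 90

rr = relative_risk(a, b, c, d)
oratio = odds_ratio(a, b, c, d)

print(f"relative risk: {rr:.2f}")
print(f"odds ratio: {oratio:.2f}")
```

The odds ratio reduces algebraically to (a*d)/(b*c), the familiar cross-product ratio.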
Absolute risk
- The relative risk and odds ratio measure risk relative to a standard; they do not express risk in absolute terms.
- Attributable risk (or risk difference) is a measure of absolute risk. It represents the excess risk of disease in those exposed, taking into account the background rate of disease. It is defined as the difference between the incidence rates in the exposed and non-exposed groups.
- Population attributable risk describes the excess rate of disease in the total study population of exposed and non-exposed individuals that is attributable to the exposure.
Number needed to treat (NNT)
- The number of patients who would need to be treated to prevent one adverse outcome is often used to present the results of randomized trials.
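A short sketch of these absolute measures follows. The incidence values are hypothetical, and two formulas are stated here as standard conventions rather than taken from the slide: population attributable risk as total incidence minus unexposed incidence, and NNT as the reciprocal of the absolute risk reduction.

```python
def attributable_risk(incidence_exposed, incidence_unexposed):
    """Risk difference: excess incidence among the exposed."""
    return incidence_exposed - incidence_unexposed


def population_attributable_risk(incidence_total, incidence_unexposed):
    """Excess incidence in the whole population that is due to the exposure."""
    return incidence_total - incidence_unexposed


def number_needed_to_treat(control_event_rate, treated_event_rate):
    """Reciprocal of the absolute risk reduction (ARR)."""
    arr = control_event_rate - treated_event_rate
    if arr <= 0:
        raise ValueError("no risk reduction; NNT is undefined")
    return 1.0 / arr


# Hypothetical incidences, for illustration only.
ar = attributable_risk(0.30, 0.10)
par = population_attributable_risk(0.20, 0.10)
nnt = number_needed_to_treat(0.20, 0.15)

print(f"attributable risk: {ar:.2f}")
print(f"population attributable risk: {par:.2f}")
print(f"NNT: {nnt:.1f}")
```

In practice the NNT is usually rounded up to the next whole patient.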
Terms Used To Describe The Quality Of Measurements
- Reliability is the variability between subjects divided by the sum of the between-subject variability and the measurement error.
- Validity refers to the extent to which a test or surrogate is measuring what we think it is measuring.

Measures Of Diagnostic Test Accuracy
- Sensitivity is defined as the ability of the test to identify correctly those who have the disease.
- Specificity is defined as the ability of the test to identify correctly those who do not have the disease.
- Predictive values are important for assessing how useful a test will be in the clinical setting at the individual patient level. The positive predictive value is the probability of disease in a patient with a positive test. Conversely, the negative predictive value is the probability that the patient does not have the disease given a negative test result.
- Likelihood ratio indicates how much a given diagnostic test result will raise or lower the odds of having a disease relative to the prior (pre-test) odds of disease.
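The accuracy measures above all derive from the four cells of a test-versus-disease table. The sketch below uses hypothetical counts and variable names (`tp`, `fp`, `fn`, `tn`); it also shows how a likelihood ratio converts a pre-test probability into a post-test probability via odds.

```python
def diagnostic_measures(tp, fp, fn, tn):
    """Return common accuracy measures from true/false positive/negative counts."""
    sensitivity = tp / (tp + fn)   # diseased patients correctly identified
    specificity = tn / (tn + fp)   # non-diseased patients correctly identified
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),     # P(disease | positive test)
        "npv": tn / (tn + fn),     # P(no disease | negative test)
        "lr_positive": sensitivity / (1 - specificity),
        "lr_negative": (1 - sensitivity) / specificity,
    }


def post_test_probability(pre_test_probability, likelihood_ratio):
    """Convert probability to odds, apply the likelihood ratio, convert back."""
    pre_odds = pre_test_probability / (1 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)


# Hypothetical counts, for illustration only: 100 diseased, 200 non-diseased.
m = diagnostic_measures(tp=90, fp=30, fn=10, tn=170)
pre_test = 100 / 300  # prevalence of disease in this hypothetical sample
post_test = post_test_probability(pre_test, m["lr_positive"])

for name, value in m.items():
    print(f"{name}: {value:.3f}")
print(f"post-test probability after a positive test: {post_test:.3f}")
```

When the pre-test probability equals the prevalence in the same table, the post-test probability after a positive result matches the positive predictive value.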
Expressions Used When
Making Inferences About Data

Confidence Intervals
- The results of any study sample are an estimate of the true value in the entire population. The true value may actually be greater or less than what is observed. A 95% confidence interval is a range, computed from the sample, that would contain the true value in about 95% of repeated samples.

Type I error (alpha) is the probability of incorrectly concluding there is a statistically significant difference in the population when none exists.
Type II error (beta) is the probability of incorrectly concluding that there is no statistically significant difference in a population when one exists.
Power is a measure of the ability of a study to detect a true difference; it equals 1 - beta.
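To make the link between alpha, beta, and power concrete, here is a minimal sketch for a two-sided z-test comparing two proportions. It assumes the normal approximation and equal group sizes; the function name and the example proportions are hypothetical, not from the lecture.

```python
import math
from statistics import NormalDist


def power_two_proportions(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test for p1 versus p2.

    Assumes equal group sizes and the normal approximation.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    se = math.sqrt(p1 * (1 - p1) / n_per_group + p2 * (1 - p2) / n_per_group)
    effect = abs(p1 - p2) / se
    # Probability of landing in either rejection region.
    return std_normal.cdf(effect - z_crit) + std_normal.cdf(-effect - z_crit)


# With no true difference, the rejection probability is the type I error rate.
false_positive_rate = power_two_proportions(0.10, 0.10, 100)

# With a true difference, power grows with sample size; beta = 1 - power.
power_small = power_two_proportions(0.10, 0.20, 100)
power_large = power_two_proportions(0.10, 0.20, 400)
beta_large = 1 - power_large

print(f"rejection rate under no difference: {false_positive_rate:.3f}")
print(f"power with n=100 per group: {power_small:.3f}")
print(f"power with n=400 per group: {power_large:.3f}")
```

The same true difference is much easier to detect in the larger study, which is the point the small and large trials later in the lecture illustrate.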
Kaplan-Meier Survival Curves
Why Use Statistics?
[Figure: Cardiovascular Mortality in Males. Standardized mortality ratio (SMR, axis 0 to 1.2) by decade, '35-'44 through '75-'84, for Bangor and Roseto.]
Percentage of Specimens Testing Positive for RSV (respiratory syncytial virus)

Region     Jul  Aug  Sep  Oct  Nov  Dec  Jan  Feb  Mar  Apr  May  Jun
South        2    2    5    7   20   30   15   20   15    8    4    3
Northeast    2    3    5    3   12   28   22   28   22   20   10    9
West         2    2    3    3    5    8   25   27   25   22   15   12
Midwest      2    2    3    2    4   12   12   12   10   19   15    8
Descriptive Statistics
[Figure: Percentage of Specimens Testing Positive for RSV, 1998-99. Monthly percentages (vertical axis 0 to 35) for the South, Northeast, West, and Midwest.]
Distribution of Course Grades
[Figure: bar chart of the number of students (axis 0 to 14) by grade, from A through F.]
The Normal Distribution




- Mean = median = mode
- Skew is zero
- About 68% of values fall within 1 SD of the mean
- About 95% of values fall within 2 SDs of the mean
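The 68% and 95% figures can be checked directly from the standard normal cumulative distribution function. This is a small verification sketch using Python's standard library; it is not part of the original slides.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

# Probability mass within 1 and 2 standard deviations of the mean.
within_1_sd = std_normal.cdf(1) - std_normal.cdf(-1)
within_2_sd = std_normal.cdf(2) - std_normal.cdf(-2)

print(f"within 1 SD: {within_1_sd:.4f}")
print(f"within 2 SDs: {within_2_sd:.4f}")
```

The exact values are closer to 68.3% and 95.4%; exactly 95% of values lie within about 1.96 SDs of the mean.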
Mean, Median, Mode
Hypertension Trial

Drug   Baseline mean SBP   Follow-up mean SBP
A      150                 130
B      150                 125
30 Day % Mortality

Study      IC STK   Control   p      N
Khaja      5.0      10.0      0.55   40
Anderson   4.2      15.4      0.19   50
Kennedy    3.7      11.2      0.02   250
95% Confidence Intervals
[Figure: 95% confidence intervals for the Khaja (n=40), Anderson (n=50), and Kennedy (n=250) studies, plotted on an axis running from -0.40 to 0.20.]
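The effect of sample size on interval width can be sketched with the 30-day mortality figures from the three trials. Two things here are assumptions and not stated in the lecture: each trial's N is split equally between the IC STK and control arms, and the interval is a simple Wald interval for the difference in proportions. The result is therefore only a rough approximation of the plotted intervals.

```python
import math

# (study, IC STK mortality %, control mortality %, total N) from the mortality table.
trials = [
    ("Khaja", 5.0, 10.0, 40),
    ("Anderson", 4.2, 15.4, 50),
    ("Kennedy", 3.7, 11.2, 250),
]


def risk_difference_ci(p_treated, p_control, n_treated, n_control, z=1.96):
    """Wald 95% confidence interval for p_treated - p_control."""
    diff = p_treated - p_control
    se = math.sqrt(
        p_treated * (1 - p_treated) / n_treated
        + p_control * (1 - p_control) / n_control
    )
    return diff - z * se, diff + z * se


intervals = {}
for study, pct_treated, pct_control, n_total in trials:
    n_arm = n_total // 2  # ASSUMPTION: equal allocation to the two arms
    intervals[study] = risk_difference_ci(
        pct_treated / 100, pct_control / 100, n_arm, n_arm
    )

for study, (low, high) in intervals.items():
    print(f"{study}: ({low:+.3f}, {high:+.3f})")
```

Under these assumptions, the intervals for the two small trials include zero while the interval for the largest trial does not, which is consistent with the p-values reported in the table.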
Types of Errors

                            Truth: No difference   Truth: Difference
Conclusion: No difference   correct                TYPE II ERROR (β)
Conclusion: Difference      TYPE I ERROR (α)       correct

Power = 1 - β
ERROR ANALYSIS
Suppose we made three more series of draws, and the results were +16%, +0%, and +12%. Writing $e_1, e_2, e_3, e_4$ for the random sampling errors of the four simulations, their average is $(e_1 + e_2 + e_3 + e_4)/4$.

Note that the cancellation of the positive and negative random errors results in a small average. In fact, with more trials, the average of the random sampling errors tends to zero.

So, to measure the "typical size" of a random sampling error, we have to ignore the signs. We could just take the mean of the absolute values (MA) of the random sampling errors. For the four random sampling errors above, $\mathrm{MA} = (|e_1| + |e_2| + |e_3| + |e_4|)/4$.

The MA is difficult to deal with theoretically because the absolute value function is not differentiable at 0. So in statistics, and in error analysis generally, the root mean square (RMS) of the random sampling errors is used instead. For the four random sampling errors above, $\mathrm{RMS} = \sqrt{(e_1^2 + e_2^2 + e_3^2 + e_4^2)/4}$.

The RMS is a more conservative measure of the typical size of the random sampling errors in the sense that MA ≤ RMS.
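The three summaries of the random sampling errors (plain average, MA, and RMS) can be compared in a short sketch. The last three error values are the ones quoted in the text; the first is a hypothetical placeholder, because the first simulation's error is not given in this section.

```python
import math

# Random sampling errors in percentage points.
# -20.0 is a hypothetical placeholder for the first simulation's error;
# 16.0, 0.0 and 12.0 are the three values quoted in the text.
errors = [-20.0, 16.0, 0.0, 12.0]


def mean_error(errs):
    """Signed average: positive and negative errors cancel."""
    return sum(errs) / len(errs)


def mean_absolute(errs):
    """MA: average of the absolute values of the errors."""
    return sum(abs(e) for e in errs) / len(errs)


def root_mean_square(errs):
    """RMS: square root of the average squared error."""
    return math.sqrt(sum(e * e for e in errs) / len(errs))


avg = mean_error(errors)
ma = mean_absolute(errors)
rms = root_mean_square(errors)

print(f"average: {avg:.2f}")
print(f"MA: {ma:.2f}")
print(f"RMS: {rms:.2f}")
```

With these illustrative values the signed average is small, while MA and RMS both reflect the typical size of the errors, and MA does not exceed RMS.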
For a given experiment, the RMS of all possible random sampling errors is called the standard error (SE). For example, whenever we use a random sample of size n and its percentage p to estimate the population percentage π, we have $\mathrm{SE} = \sqrt{\pi(1 - \pi)/n}$, with p and π expressed as proportions.
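For a sample proportion from a simple random sample, the textbook standard error is the square root of π(1 - π)/n. The sketch below assumes that formula and uses hypothetical values of π and n; the function name is illustrative only.

```python
import math


def standard_error_of_proportion(pi, n):
    """Standard error of a sample proportion under simple random sampling.

    pi is the population proportion (between 0 and 1) and n is the sample size.
    """
    if not 0.0 <= pi <= 1.0:
        raise ValueError("pi must be a proportion between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    return math.sqrt(pi * (1 - pi) / n)


# Hypothetical example: population proportion 0.5.
se_100 = standard_error_of_proportion(0.5, 100)
se_400 = standard_error_of_proportion(0.5, 400)

print(f"SE with n=100: {se_100:.3f}")
print(f"SE with n=400: {se_400:.3f}")
```

Quadrupling the sample size halves the standard error, since the SE shrinks with the square root of n.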