EPI-820 Evidence-Based Medicine
(EBM)
LECTURE 2: MEDICAL MEASUREMENT
Mat Reeves BVSc, PhD
Department of Epidemiology
Michigan State University
Objectives:
• 1. Understand biological and measurement variation
and its effects on precision and validity.
• 2. Understand the components of variability
– biological and measurement
– between- and within-person/observer
• 3. Understand measures of variation and measures
of agreement.
• 4. Understand the calculation and application of kappa (K).
• 5. Understand the consequences of variability in
clinical data and possible remedies to ameliorate them.
• 6. Understand regression to the mean.
I. Variation in Clinical Data
• 1. Biologic Variation = variation in the actual
entity being measured
• derives from the dynamic nature of physiology,
homeostasis and pathophysiology.
• within (intra-person) biologic variability, and
• between (inter-person) biologic variability
Within (day-to-day variation) and Between Person
Biological Variation: Coefficient of Variation (%) (see
Winkel et al, 1974)
Variable        CV (Within)    CV (Between)
Na              0.7%           0.8%
K               4.3%           4.3%
Cl              2.1%           1.2%
Ca              1.7%           2.8%
BUN             12.3%          16.4%
Creatinine      4.3%           9.5%
Cholesterol     5.3%           13.6%
SGOT (ALT)      24.2%          24.8%
TP              2.9%           5.7%
I. Variation in Clinical Data
• 2. Measurement Variation = variation due to
the measurement process
• inaccuracy of the instrument (instrument error),
and/or,
• inaccuracy of the person (operator error)
• can introduce both random error and bias
Analytical Variation - Coefficient of Variation
(%) of Duplicate Samples
Variable        CV (Analytical)
Na              1.1%
K               2.6%
Cl              2.1%
Ca              2.1%
BUN             2.2%
Creatinine      3.4%
Cholesterol     3.1%
SGOT (ALT)      7.3%
TP              1.7%
Validity
• Degree to which a measurement process measures
what it is intended to measure, i.e., accuracy.
• Lack of systematic error or bias.
• A valid instrument will, on average, be close to the
underlying true value.
• Assessment of validity requires a “gold standard” (a
reference).
What if no gold standard?
(e.g., pain, nausea or anxiety)
• Use instrument or clinical scale to measure a specific
phenomenon or construct.
• Criterion Validity - the degree to which the scale predicts a
directly observable phenomenon e.g. APGAR score and
neonatal survival.
• Content Validity - the extent to which the instrument includes
all of the dimensions of the construct being measured e.g.
does APGAR include all relevant patho-physiological
parameters?
• Construct Validity - the degree to which the scale correlates
with other known measures of the phenomenon, e.g., how
well does a new “Neonatal assessment scale” correlate with
the APGAR score?
How do you measure validity?
• Dichotomous data
• sensitivity, specificity, and predictive values.
• Continuous data
• mean and standard deviation of the differences
between the surrogate measure and the gold standard
(see Bland and Altman, 1986); a sketch follows below.
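As a rough illustration of the Bland and Altman approach, the sketch below (not from the lecture; the paired values are invented) computes the mean difference (bias) and the 95% limits of agreement between a surrogate measure and a gold standard.

```python
import numpy as np

# Hypothetical paired measurements: surrogate method vs. gold standard
# (values invented for illustration only).
gold      = np.array([4.1, 5.0, 6.2, 7.5, 5.8, 6.9, 4.7, 8.1])
surrogate = np.array([4.4, 4.8, 6.6, 7.9, 5.5, 7.4, 4.9, 8.6])

# Bland-Altman works with the differences between the two methods.
diff = surrogate - gold
bias = diff.mean()            # mean difference = systematic error (validity)
sd_diff = diff.std(ddof=1)    # SD of differences = random disagreement (precision)

# 95% limits of agreement: mean difference +/- 1.96 SD
loa_low, loa_high = bias - 1.96 * sd_diff, bias + 1.96 * sd_diff

print(f"Bias (mean difference): {bias:.2f}")
print(f"95% limits of agreement: {loa_low:.2f} to {loa_high:.2f}")
```

If the bias is near zero and the limits of agreement are clinically acceptable, the surrogate measure can stand in for the gold standard.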
Precision
(or reliability or reproducibility)
• the extent to which repeated measurements of a
phenomenon tend to yield the same results
(regardless of their accuracy!).
• Precision refers to the lack of random error
• Precision ~ 1 / random error
Hard versus Soft Data?
• Blood chloride level
• Degree of depression
• Left ventricular ejection volume
• Alzheimer severity
• Migraine severity
• Self-reported ability to do domestic chores
• 28-d stroke case-fatality rate
• Indirect costs of school absenteeism
• Direct costs of school absenteeism
• Self-reported ability to climb stairs
• Patient preferences for induced labour
• Self-reported assessment of health
Hard versus Soft Data
• No specific criteria define “hard” data;
attributes include:
• Consistency: the ability to preserve basic
evidence (repeated observations are consistent)
(most important attribute).
• Objectivity: observations are free of subjective
influences.
• Quantifiable: the ability to express the result as a
number.
Hard versus Soft Data
• Usually hard data are numeric measures, such as
lab data, but not always (e.g., histology, cancer
stage)
• Hard (numeric) data are preferred to softer
(qualitative) measures because they are more
objective and reliable? (but see Feinstein AR et al,
1985, the Will Rogers phenomenon)
Between and Within Person Variation
• Four categories of clinical variability:
• 1. Between-person biological variability
• 2. Within-person biological variability
• 3. Between-observer measurement variability
• 4. Within-observer measurement variability
ANOVA Model Conceptualization
• yijkl = μi + βij + γik + εil
• where:
– yijkl = the observed measurement for individual i, measured at
time j, by the kth observer at the lth replication.
– μi = individual i's usual true mean (between-person biological
variation).
– βij = perturbation due to biological variation at time j (within-person
biologic variation).
– γik = perturbation due to measurement error by the kth
observer (between-observer measurement variation).
– εil = perturbation due to measurement error at the lth
replication (within-observer measurement variation); a simulation
sketch of these components follows below.
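A minimal simulation sketch of this four-component model; only the additive structure comes from the slide, and the component standard deviations (10, 4, 2 and 1) are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed component SDs: between-person 10, within-person 4,
# between-observer 2, within-observer (replication) 1.
n_subjects, n_times, n_observers, n_reps = 50, 3, 2, 2
mu_i = rng.normal(100, 10, size=n_subjects)        # each person's usual true mean

values = []
for i in range(n_subjects):
    o_ik = rng.normal(0, 2, size=n_observers)       # between-observer error for person i
    for j in range(n_times):
        b_ij = rng.normal(0, 4)                      # within-person biological variation
        for k in range(n_observers):
            for l in range(n_reps):
                e_il = rng.normal(0, 1)              # within-observer (replication) error
                values.append(mu_i[i] + b_ij + o_ik[k] + e_il)

y = np.array(values)
# The variance components add: total variance ~ 10^2 + 4^2 + 2^2 + 1^2 = 121
print(f"Observed total variance: {y.var(ddof=1):.1f}")
```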
II. Statistical aspects of variability
• A. Measures of Variation
• 1. Variance and Standard Deviation
SD = √[ Σ(xi - x̄)² / (n - 1) ]
• SD ≈ the average absolute difference of individual
values from the overall mean.
• Normal (empirical) rule: ~68%, ~95% and ~99.7% of values
fall within ±1, ±2 and ±3 SD of the mean.
• Example (see the sketch below):
– Av. US cholesterol = 220 mg/dl, SD = 15 mg/dl
– Individual readings expected to vary 190-250 mg/dl (±2 SD)
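A quick sketch of the SD calculation and the ±2 SD range from the cholesterol example; the sample readings are invented, while the 220/15 mg/dl figures come from the slide.

```python
import statistics

# Hypothetical cholesterol readings (mg/dl), invented to illustrate the SD formula
readings = [212, 205, 230, 241, 218, 199, 226, 235]
mean = statistics.mean(readings)
sd = statistics.stdev(readings)   # sample SD: uses the (n - 1) denominator shown above
print(f"mean = {mean:.1f} mg/dl, SD = {sd:.1f} mg/dl")

# Slide's example: mean 220 mg/dl, SD 15 mg/dl -> ~95% of readings within +/- 2 SD
low, high = 220 - 2 * 15, 220 + 2 * 15
print(f"~95% of individual readings expected between {low} and {high} mg/dl")
```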
A. Measures of Variation
• 2. Coefficient of Variation (CV)
CV = (SD / x̄) × 100%
• represents the % variation of a set of
measurements around their mean
• conceptualized as a “noise-to-signal ratio”
• a useful index for comparing the precision of
different instruments, individuals and/or
laboratories (see the sketch below).
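A minimal sketch of the CV as a noise-to-signal index for comparing precision; the two sets of repeated measurements are invented for illustration.

```python
import statistics

# Hypothetical repeated measurements of the same specimen on two instruments
instrument_a = [98, 101, 99, 102, 100]
instrument_b = [90, 108, 95, 110, 97]

def cv_percent(values):
    """Coefficient of variation: SD expressed as a percentage of the mean."""
    return 100 * statistics.stdev(values) / statistics.mean(values)

print(f"Instrument A CV: {cv_percent(instrument_a):.1f}%")   # lower CV = more precise
print(f"Instrument B CV: {cv_percent(instrument_b):.1f}%")
```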
B. Measures of Agreement
• 1. Correlation (r)
• Pearson product moment correlation and
Spearman’s rank correlation
• measures the degree of linear relationship
between two variables (range -1 to +1)
• the correlation between two sets of continuous
measurements (= reliability), i.e., the extent of replication
1. Correlation (Cont’d)
• Two observers, same time period = inter-rater
reliability.
• Single observer, two time periods = intra-rater
reliability (test-retest reliability).
• Can have very high values of r, but little direct
agreement between raters or instruments (see the sketch below).
• Can only be used as a test of validity if the actual
true values are known.
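The sketch below illustrates that r can be perfect while direct agreement is poor; the ratings are invented, with observer B reading a constant 10 units higher than observer A.

```python
import numpy as np

# Invented ratings: observer B reads consistently 10 units higher than observer A
obs_a = np.array([50.0, 55, 60, 65, 70, 75, 80])
obs_b = obs_a + 10

r = np.corrcoef(obs_a, obs_b)[0, 1]
mean_diff = (obs_b - obs_a).mean()
print(f"Pearson r = {r:.2f}")            # 1.00: perfect linear relationship
print(f"Mean difference = {mean_diff}")  # 10.0: yet the raters never agree
```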
B. Measures of Agreement
2. Intra-class Correlation Coefficient
(R or reliability)
• a measure of reliability for continuous or quantitative data
• an observed value (X) consists of two parts:
• X=T+e
– where:
• T = the “True” unknown level or “error-free” score or
“steady state” or “signal”
• e = error (whether “biologic” or “measurement” error)
• the true error-free value varies about some unknown mean (μ)
with a variance of σ²T.
2. R (Cont’d)
• the error term is regarded as iid (mean = 0, variance σ²e).
• Variance of X: σ²X = σ²T + σ²e
• the relative size of the error variance (σ²e) in relation to the
variance of the true value (σ²T) is a measure of the
imprecision.
• R = σ²T / (σ²T + σ²e)
• R = the proportion of the total variance due to subject-to-subject (or between-person) variability in the “true” value.
• As random error decreases, the value of R increases (see the sketch below)
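A simulation sketch of R: data are generated under assumed values of σT and σe (invented), and R is estimated from two replicate measurements per subject using the usual one-way ANOVA estimator.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumptions (invented): 200 subjects, sigma_T = 10, sigma_e = 4, 2 replicates each
n, sigma_T, sigma_e = 200, 10.0, 4.0
true = rng.normal(100, sigma_T, size=n)                    # error-free "steady state" values
x = true[:, None] + rng.normal(0, sigma_e, size=(n, 2))    # two observed replicates per subject

# One-way ANOVA estimate of the intra-class correlation, k = 2 replicates
k = x.shape[1]
grand = x.mean()
msb = k * ((x.mean(axis=1) - grand) ** 2).sum() / (n - 1)                # between-subject MS
msw = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum() / (n * (k - 1))   # within-subject MS
icc = (msb - msw) / (msb + (k - 1) * msw)

print(f"True R      = {sigma_T**2 / (sigma_T**2 + sigma_e**2):.2f}")
print(f"Estimated R = {icc:.2f}")
```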
3. Categorical data – Kappa (K)
• A measure of reliability for categorical or qualitative
data.
• Kappa corrects for the degree of chance in the
overall level of agreement, and is preferred over
other measures (like overall percent agreement).
• K = (Po - Pe) / (1 - Pe) = actual agreement beyond chance /
potential agreement beyond chance
• Po = the total proportion of observations on which there is
agreement
• Pe = the proportion of agreement expected by chance alone.
Agreement matrix for kappa statistic
(inter-rater agreement, 2 observers, dichotomous data)
                OBSERVER A
OBSERVER B      Yes     No      TOTALS
Yes             a       b       f1
No              c       d       f2
TOTALS          n1      n2      N
Agreement matrix for kappa statistic
(2 observers, dichotomous data)
                OBSERVER A
OBSERVER B      Yes     No      TOTALS
Yes             69      15      84
No              18      48      66
TOTALS          87      63      150
K (Cont’d)
• Observed agreement (Po) = 78%
• (69 + 48)/150 = 0.78 or 78%.
• Agreement expected due to chance (Pe) = 51%.
• Calculated from the products of the marginal totals
corresponding to cells a and d: [87 x 84/150 = 48.72] and
[63 x 66/150 = 27.72].
• Then divide the sum [76.44] by 150 to get Pe = 0.51 or
51%.
K (Cont’d)
• K = (Po - Pe) / (1 - Pe) = (0.78 - 0.51) / (1 - 0.51)
= 0.27 / 0.49 = 0.55 or 55% (see the sketch below)
• Kappa varies from -1 to +1, with a value of zero denoting
agreement no better than chance (negative values denote
agreement worse than chance!)
Value of K      Strength of agreement
< 0             Poor
0 - 0.20        Slight
0.21 - 0.40     Fair
0.41 - 0.60     Moderate
0.61 - 0.80     Substantial
0.81 - 1.0      Almost perfect
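A minimal sketch of the kappa calculation, using the cell counts from the worked example above.

```python
# 2x2 agreement table from the example: a = 69, b = 15, c = 18, d = 48
a, b, c, d = 69, 15, 18, 48
N = a + b + c + d

po = (a + d) / N                                        # observed agreement
pe = ((a + b) * (a + c) + (c + d) * (b + d)) / N**2     # agreement expected by chance
kappa = (po - pe) / (1 - pe)

print(f"Po = {po:.2f}, Pe = {pe:.2f}, kappa = {kappa:.2f}")   # 0.78, 0.51, 0.55
```

Note that Pe depends only on the marginal totals, which is why prevalence affects chance agreement (next slide).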
K - Issue of Prevalence
• The prevalence of the condition affects the
likelihood that observers will agree purely due
to chance - hence the importance of using
kappa.
Example:
• Observer A classified 120/150 patients as “Yes”
• Observer B classified 130/150 patients as “Yes”
• Pe is now (120/150 x 130/150) + (30/150 x 20/150) = 0.72 or 72%.
K - More Complicated Scenarios
• Overall (summary) kappa:
• several observers or raters and/or where the subjects are
classified into several different categories.
• Weighted kappa:
• measuring the relative degree of disagreement when
subjects are classified into several ordinal categories (e.g.,
normal, slightly abnormal and very abnormal).
• Maclure and Willett (1987):
• Use kappa for dichotomous data or nominal polytomous data
only.
• For ordinal data use either Spearman’s rank correlation or R.
IV. Consequences of variability of
clinical data
• A. Clinical impact
• Errors in diagnosis, prognosis and even treatment.
• Clinical disagreement between clinicians.
• B. Research Impact
• Between-person biological variability is a prerequisite for
etiologic studies.
• Random within-person variability (a form of unreliability) results
in non-differential misclassification - with a resulting dilution
or attenuation of effect.
B. Research impact
• Generally, imprecision has less impact in the research
setting than in the individual clinical setting because one can
average over a large number of observations (but the
measure must still be valid).
• Variability and misclassification result in the need
for larger sample sizes (and increased costs).
• Measurement errors can introduce bias if they do
not occur at random (i.e., differential, rather than
non-differential, misclassification).
Regression Dilution Bias
• Example: MacMahon et al., (1990)
• imprecision resulting from a single measurement
of diastolic blood pressure resulted in a 60%
attenuation of RRs (for the effect of elevated
blood pressure on stroke and MI).
• “regression dilution bias”.
C. Regression towards the mean
• A group of individuals selected on the basis of an
“abnormal” test result can be divided into:
• a) those with a true underlying abnormal value, and
• b) those with a true underlying normal value (but random
fluctuations resulted in an outlying [abnormal] value).
• On retesting, patients in group b are closer to their
typical (normal) values, so, the overall mean is less
extreme (= regression to the mean).
• Occurs when repeated observations are performed
on a variable that is inherently variable.
C. RTTM
• Often interpreted as a sign of clinical improvement,
regardless of effectiveness of treatment (an important
explanation for the placebo effect)
• If the first reading is d units higher than the true value (μ),
then on average, the next value will be closer to the
mean by d(1 - r) units,
• where r is the correlation between the two measurements
• RTTM increases if d is large and r is small (see the sketch below).
• RTTM is a general tendency for describing the
average behaviour of a group, not necessarily
individuals!!
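A simulation sketch of regression to the mean under invented parameters: patients selected because of a high first reading have, on average, a lower second reading even though nothing about them has changed.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumptions (invented): true diastolic BP ~ N(80, 8); each measurement adds
# independent within-person/measurement noise ~ N(0, 6).
true_bp = rng.normal(80, 8, size=100_000)
first  = true_bp + rng.normal(0, 6, size=true_bp.size)
second = true_bp + rng.normal(0, 6, size=true_bp.size)

# Select "abnormal" patients on the basis of the first reading only
selected = first >= 95
print(f"Mean first reading (selected):  {first[selected].mean():.1f}")
print(f"Mean second reading (selected): {second[selected].mean():.1f}")   # closer to 80
```

On average the group's second reading moves back toward the mean, even though no treatment was given.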
V. Remedies for variability of clinical
data
• A. Within-person biologic variation
• Standardized measurements: use a standard protocol, e.g.,
time of day, body position, etc.
• Average repeated tests, e.g., take several blood pressure
readings.
• Use a less variable test, e.g., for diabetes use glycosylated
Hb rather than blood glucose.
• Plot the data - what is the trend?
• Develop reference values for each individual - especially if:
– within-person variability <<< between-person variability
– large between-person variability produces a wide population reference
range, which makes it difficult to identify individual deviations
– e.g., body weight, PSA, EKG
B. Measurement Error
• Measurement imprecision is corrected by
adjusting the machine or re-training the
tester (or by averaging several values?).
• Measurement error that causes bias requires
quality assurance testing. Fix by re-calibration (don't average!!).
Sackett - Six strategies for preventing or
minimizing clinical disagreements
• 1. Match diagnostic environment to the diagnostic
task.
• 2. Corroborate key findings by:
– repeating observations and questions
– confirming information with other sources (e.g., family members)
– confirming key findings using appropriate diagnostic tests
– seeking confirmation from “blinded” colleagues
• 3. Report actual findings first, then report inferences.
• 4. Use appropriate technical aids to avoid
imprecision (e.g., a ruler).
• 5. “Blinded” assessments of diagnostic findings.
• 6. Apply the skills of the social sciences
– establish understanding, follow a logical order, listen, observe,
interrupt only where necessary.