Lecture notes January 13 and 15


Week 2
An overview
• Exposure and outcome (dependent and independent variables)
• Reliability and validity
• What is “statistical significance”?
• Relationships between variables
-continuous variables (t-tests and z-tests)
-continuous variables (correlations)
-the normal (Gaussian) distribution
-categorical variables (chi-square tests)
• Two by two tables and confidence intervals
• Review of the articles
• Example 1: Children crossing streets
• Measures of association between variables
• For next week
A somewhat advanced society has figured out how to package
basic knowledge in pill form. A student, needing some
learning, goes to the pharmacy and asks what kind of
knowledge pills are available. The pharmacist says "Here's a
pill for English literature." The student takes the pill and
swallows it and has new knowledge about English literature!
"What else do you have?" asks the student. "Well, I have pills
for art history, biology, and world history," replies the
pharmacist. The student asks for these, and swallows them
and has new knowledge about those subjects!
Then the student asks, "Do you have a pill for statistics?" The
pharmacist says "Wait just a moment", and goes back into the
storeroom and brings back a whopper of a pill that is about
twice the size of a jawbreaker and plunks it on the counter. "I
have to take that huge pill for statistics?" inquires the student.
The pharmacist understandingly nods his head and replies
"Well, you know statistics always was a little hard to swallow."
Epidemiologic study designs
1. Randomized controlled trial
• Considered the ‘gold standard’
• Exposure is assigned randomly
• Participants followed over time to assess outcome
• Analytic comparison of risk or benefit in exposed vs. not exposed
• Can be applied to program evaluation
Epidemiologic study designs 2
2. Cohort study
• One group exposed
• Other group unexposed
• Participants followed over time to assess outcome
• Analytic comparison of risk in exposed vs. not exposed
• Can be applied to program evaluation
Epidemiologic study designs 3
3. Case-control study
• Based on outcome
• Exposure is compared in those with and without outcome
• Analytic comparison of risk in exposed vs. not exposed
4. Descriptive study
• Provides descriptive statistics of problem under study
• No analytic comparison of risk / benefit
• Often precedes analytic studies
Dependent vs independent variables
• Remember the exposure/outcome relationship
• Another way to describe it is in terms of
dependent and independent variables: the
outcome (dependent variable) depends on the
exposure (independent variable)
• It is the association between these variables that
leads us to statistical tests
• The test we use depends on the type of variable
Statistical significance
• What is statistical significance?
• The probability that the observed relationship
could have happened by chance
• The p-value and confidence interval are the usual
measures of significance
• Threshold set by tradition at p < 0.05 (or a 95% confidence interval)
• The higher the p value, the more likely it could
have happened by chance
• The wider the confidence interval, the more likely
it could have happened by chance
• Both driven by variability in the data and sample
size
Types of variables
• Continuous variables
– Variables for which there is a range of responses
– e.g., age, blood pressure, weight
• Categorical variables
– Variables that fall into categories
– e.g., gender, smoking status
Hypothesis testing for continuous
variables
• Mean (the average number)
-calculated by summing all the numbers and dividing by n
-hypothesis testing usually done using a t-test to compare the 2 means
-significance of t-test based on sample size and variability within the data
• Median (the number in the middle)
-not usually tested
• Mode (the most frequent response)
-not usually tested
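
As a rough illustration of the three summary measures and of a t statistic for comparing 2 means, here is a minimal Python sketch. The blood pressure values are hypothetical, and the unequal-variance (Welch) form of the t statistic is an assumption; the slides do not say which form of the t-test is intended.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return the Welch (unequal-variance) t statistic for two samples."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    # The standard error shrinks as sample size grows and grows with variability.
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error


# Hypothetical systolic blood pressures (mmHg) for two groups.
group_a = [118, 122, 125, 130, 130, 135]
group_b = [110, 112, 115, 118, 118, 121]

print("mean:", round(statistics.mean(group_a), 1))
print("median:", statistics.median(group_a))
print("mode:", statistics.mode(group_a))
print("t statistic:", round(welch_t(group_a, group_b), 2))
```

The same difference in means gives a larger t statistic when the samples are bigger or less variable, which is the point made on the slide.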
Hypothesis testing for categorical
variables
• Counts (how many fall within each category)
-compare using 2X2 table
• Proportions (what percentage fall within each category)
-compare 2 proportions
• Frequency distributions (comparing counts and percentages between categories)
-compare using chi-square test
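
A minimal sketch of a chi-square test on a 2X2 table of counts, written in plain Python so each step is visible. The cell names a, b, c, d (exposed with outcome, exposed without, not exposed with, not exposed without) and the counts are hypothetical, and the version shown is the uncorrected Pearson statistic, which is an assumption about what a statistical package would report.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for a 2X2 table."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if exposure and outcome were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed[i][j] - expected) ** 2 / expected
    return statistic


# Hypothetical counts: 30 of 100 exposed and 15 of 100 unexposed have the outcome.
stat = chi_square_2x2(30, 70, 15, 85)

# 3.84 is the 0.05 critical value for chi-square with 1 degree of freedom.
print(round(stat, 2), "significant at 0.05:", stat > 3.84)
```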
2X2 tables: the foundation
              Disease or      No disease or
              other outcome   other outcome
Exposed       a               b
Not exposed   c               d
2X2 tables: estimating associations
              Disease or      No disease or   Total
              other outcome   other outcome
Exposed       a               b               a+b
Not exposed   c               d               c+d
Total         a+c             b+d             a+b+c+d
Odds ratios and relative risks
• Odds ratio = (a x d) / (b x c): compares the odds
of an outcome in the exposed group with the
odds in the non-exposed group
• Relative risk = (a/(a+b)) / (c/(c+d)): compares
the risk of an outcome in the exposed group
with the risk in the non-exposed group
• Statistical packages calculate
confidence intervals
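
The two measures can be sketched in Python directly from the cells of a 2X2 table, with a = exposed with outcome, b = exposed without outcome, c = not exposed with outcome, d = not exposed without outcome. The counts below are hypothetical and the function names are illustrative only.

```python
def odds_ratio(a, b, c, d):
    """Odds of the outcome in the exposed divided by the odds in the unexposed."""
    return (a * d) / (b * c)


def relative_risk(a, b, c, d):
    """Risk of the outcome in the exposed divided by the risk in the unexposed."""
    return (a / (a + b)) / (c / (c + d))


# Hypothetical counts: 30 of 100 exposed and 15 of 100 unexposed have the outcome.
a, b, c, d = 30, 70, 15, 85

print("odds ratio:", round(odds_ratio(a, b, c, d), 2))
print("relative risk:", round(relative_risk(a, b, c, d), 2))
```

When the outcome is common, as in this made-up table, the odds ratio is noticeably further from 1 than the relative risk; the two are close only when the outcome is rare.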
Confidence intervals
• Confidence intervals are used for hypothesis
testing in 2X2 tables (and others)
• The width of a confidence interval is based on
the variability within the data and the sample
size
• An OR or RR of 1 = no association
• A confidence interval that crosses 1 is NOT
statistically significant
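
A sketch of how a 95% confidence interval for an odds ratio might be computed and checked against 1. It assumes the common log-based (Woolf) approximation; a statistical package may use a different method, and the counts are hypothetical.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with an approximate 95% CI using the log (Woolf) method."""
    estimate = (a * d) / (b * c)
    # Standard error of log(OR); small cell counts make it large.
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - z * se_log)
    upper = math.exp(math.log(estimate) + z * se_log)
    return estimate, lower, upper


def crosses_one(lower, upper):
    """True when the interval includes 1, i.e. NOT statistically significant."""
    return lower <= 1 <= upper


# Hypothetical tables: a larger study and a much smaller one.
for cells in [(30, 70, 15, 85), (3, 7, 2, 8)]:
    estimate, lower, upper = odds_ratio_ci(*cells)
    print(cells, round(estimate, 2), round(lower, 2), round(upper, 2),
          "crosses 1:", crosses_one(lower, upper))
```

The smaller table produces a much wider interval that crosses 1, which illustrates the effect of sample size on the width of the interval.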
Regression lines and correlation
• Correlation is the measure of the way
one variable is associated with another
• Can be done with 2 continuous
variables
• The regression line is the best fit
between 2 variables
• The correlation coefficient (r) ranges from -1 to 1
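
A minimal sketch of the correlation coefficient and the least-squares regression line for 2 continuous variables. It assumes Pearson's r and ordinary least squares, which the slides do not name explicitly, and the data points are hypothetical.

```python
import math


def correlation_and_line(xs, ys):
    """Return Pearson's r plus the slope and intercept of the best-fit line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    r = sxy / math.sqrt(sxx * syy)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return r, slope, intercept


# Hypothetical paired measurements.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

r, slope, intercept = correlation_and_line(xs, ys)
print("r:", round(r, 3), "R squared:", round(r * r, 3))
print("line: y =", round(slope, 2), "* x +", round(intercept, 2))
```

For a simple regression with one predictor, the R squared reported on a scatterplot is the square of r.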
Article review
Questions to consider:
• What is the research question?
• What is their study design?
• What is the exposure variable(s)?
• What is the outcome variable?
• What are the strengths and limitations?
• Who funded the study?
• How compelling are the findings?
Example # 1
Statistical associations of the
number of streets crossed by
children and:
-socio-economic indicators
-child pedestrian injury rate
Background
• Child pedestrian injury rate has been
declining in many countries, including
Canada
• Concern has been expressed that the
decline is due to a reduction in
exposure to traffic (i.e., children are
driven or bussed rather than walking)
Objective
• The objective of this study was to
measure the number of streets children
cross on one day
• To see if the number of streets crossed
varies by socio-economic status
• To see if the child pedestrian injury rate
is associated with the number of streets
crossed
Variables
• Number of streets crossed as reported by
parents from a random sample of schools in
Montreal
• Socio-economic status measured by:
-car ownership
-parental education
-home ownership
• Injury rate in police district as reported by the
police
Methods
• Frequency distribution of average # of
streets crossed presented by age and
SES
• Statistical testing for the differences
between means for categorical
variables
• Scatterplot generated and regression
line calculated
Table 1 Number of Streets Crossed by Age and Socio-economic Indicators*

                    N       Mean    SD
Age
  5 & 6             487     3.8     4.2
  7                 730     4.2     5.0
  8 & 9             519     4.8     5.3
  10                657     5.5     5.8
  11 & 12           108     6.6     6.3
Number of cars
  0                 467     5.9     5.8
  1                 1191    4.8     5.3
  2+                815     3.8     4.8
Home Ownership
  Rent home         1213    5.5     5.6
  Own home          1210    3.8     4.7
Comparing average streets crossed by car ownership

                                   No car    1 car
Average streets crossed (Mean)     5.9       4.8
Standard deviation                 5.8       5.3
Sample size                        467       1171

Z Test for difference between means 13.8, p<0.001
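
For illustration, a z statistic for the difference between two means can be sketched from the summary figures in the comparison above. The unpooled standard-error formula used here is an assumption, since the slide does not state how its statistic was calculated, and the inputs are the rounded values shown on the slide.

```python
import math


def z_for_two_means(mean1, sd1, n1, mean2, sd2, n2):
    """z statistic for a difference in means, using an unpooled standard error."""
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / standard_error


def two_sided_p(z):
    """Two-sided p-value from the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# No car: mean 5.9, SD 5.8, n 467. One car: mean 4.8, SD 5.3, n 1171.
z = z_for_two_means(5.9, 5.8, 467, 4.8, 5.3, 1171)
p = two_sided_p(z)

print("z:", round(z, 2), "p < 0.001:", p < 0.001)
```

With these rounded inputs the sketch gives a smaller z than the figure reported on the slide, although the p-value is still below 0.001; the reported statistic may come from a different formula or from the unrounded data.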
Figure 2: Ecologic Analysis
Average Number of Main Streets Crossed and Injury Rate By Police District
[Scatterplot. x-axis: Average Number of Main Streets Crossed (0 to 8). y-axis: Injury Rate per 1,000 (0 to 5). Each point is a police district; the plot shows the linear regression line with its 95% confidence interval (minimum and maximum). R2 = 0.62]
Measures of association
between variables
• Tied in to the concept of reliability and validity
• Sometimes we need to test a new variable in
relation to an old one
• For example, a new questionnaire, faster
blood test, etc.
• Several ways to measure association:
• Cronbach’s alpha, kappa, sensitivity,
specificity, positive predictive value, negative
predictive value
Cronbach’s alpha
• Measures the reliability of a psychometric
instrument
• Assesses the extent to which a set of test items
can be treated as measuring a single latent
variable
• Based on the mean correlation of each item with
the mean of all the other items
• Looks at variation between individuals compared
to variation due to items
• Can range from negative infinity to 1 (although usually
only between 0 and 1)
• Usually considered ‘good’ if > 0.8
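
A minimal sketch of Cronbach's alpha using the usual variance formula: alpha = k/(k-1) x (1 - sum of item variances / variance of total scores), where k is the number of items. The questionnaire scores are hypothetical and the function name is illustrative.

```python
import statistics


def cronbach_alpha(item_scores):
    """item_scores: one list per item, each holding every respondent's score."""
    k = len(item_scores)
    # Variation attributable to each item on its own.
    item_variances = [statistics.variance(item) for item in item_scores]
    # Each respondent's total score across all items.
    totals = [sum(scores) for scores in zip(*item_scores)]
    total_variance = statistics.variance(totals)
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical 3-item questionnaire answered by 5 respondents.
items = [
    [4, 3, 5, 2, 4],
    [5, 3, 4, 2, 4],
    [4, 2, 5, 3, 5],
]

print("alpha:", round(cronbach_alpha(items), 2))
```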
Kappa
• Measures the extent to which ratings
given by 2 raters agree
• Often used when experts are assigning
scores based on opinions (e.g.,
medication errors)
• Gives credit when scores match exactly,
takes away agreement when they don’t
• Can be between -1 and 1 (although usually between 0 and 1;
0 means agreement no better than chance)
• Usually considered ‘good’ if > 0.7
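
A sketch of kappa for 2 raters, assuming the unweighted Cohen's kappa: observed agreement is compared with the agreement expected by chance from each rater's own category frequencies. The ratings are hypothetical.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two raters scoring the same cases."""
    n = len(ratings_a)
    # Proportion of cases where the two raters match exactly.
    observed = sum(x == y for x, y in zip(ratings_a, ratings_b)) / n
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    # Agreement expected by chance given each rater's category frequencies.
    expected = sum(counts_a[cat] * counts_b[cat] for cat in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical reviews of 10 charts for medication errors.
rater_a = ["error", "error", "ok", "ok", "ok", "error", "ok", "ok", "ok", "ok"]
rater_b = ["error", "ok", "ok", "ok", "ok", "error", "ok", "error", "ok", "ok"]

print("kappa:", round(cohen_kappa(rater_a, rater_b), 2))
```

Here the raters agree on 8 of 10 charts, but kappa is lower than 0.8 because some of that agreement would be expected by chance alone.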
Sensitivity and specificity
Sensitivity
• Measures the extent to which a test agrees with a
‘gold standard’
• Often used when trying out a new diagnostic test
• Reports how often the new test agrees with the
old when positive
• Captures the false negatives
• Calculated using a 2 X 2 table
• Acceptability of score depends on test qualities
Sensitivity and specificity
Specificity
• Measures the extent to which a test agrees with a
‘gold standard’
• Often used when trying out a new diagnostic test
• Captures the ‘false positives’
• Reports how often the new test agrees with the
old when negative (i.e., accurately reports the
absence of the condition)
• Calculated using a 2 X 2 table
• Acceptability of score depends on test qualities
2X2 tables revisited
             Gold standard +    Gold standard –
             (has condition)    (does not have condition)
New test +   a                  b
New test -   c                  d
Calculating sensitivity and
specificity
Sensitivity = number who are both disease
positive and test positive / number who are
disease positive
= a/(a+c)
Specificity = number who are both disease
negative and test negative / number who are
disease negative
= d/(d+b)
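
The two formulas can be sketched in Python using the cells of the 2X2 table (a = true positives, b = false positives, c = false negatives, d = true negatives). The counts are hypothetical.

```python
def sensitivity(a, c):
    """Share of people with the condition whom the new test flags as positive."""
    return a / (a + c)


def specificity(b, d):
    """Share of people without the condition whom the new test reports negative."""
    return d / (d + b)


# Hypothetical validation study: 100 with the condition, 200 without.
a, b, c, d = 90, 20, 10, 180

print("sensitivity:", round(sensitivity(a, c), 2))
print("specificity:", round(specificity(b, d), 2))
```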
Understanding sensitivity and
specificity
Sensitivity is high when the test picks up a lot of
the true disease (has few false negatives)
High sensitivity is important for infectious
diseases (e.g., HIV)
Specificity is high when the test does not have
false positives. This is important when the
consequences of treating the disease are
significant (e.g., cancer)
Positive and negative predictive
value
• Tells you how good a test is at
predicting whether a patient actually has
the disease
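• Positive predictive value is the
probability that the patient has the
disease given a positive test
• Negative predictive value is the
probability that the patient does not have
the disease given a negative test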
• Depends on sensitivity, specificity and
the prevalence of the disease
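
To show how the predictive values depend on prevalence, here is a small sketch that applies Bayes' rule to a test with fixed sensitivity and specificity. The 0.90 values and the three prevalence levels are hypothetical.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Same hypothetical test (90% sensitive, 90% specific) at three prevalence levels.
for prevalence in (0.01, 0.10, 0.50):
    ppv, npv = predictive_values(0.90, 0.90, prevalence)
    print(f"prevalence={prevalence:.2f}  PPV={ppv:.2f}  NPV={npv:.2f}")
```

The same test gives a much lower positive predictive value when the disease is rare, because false positives from the large disease-free group outnumber the true positives.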
Overview
• Different types of variables are measured and
presented differently
• P values and confidence intervals are the
measure of statistical significance
• Tell us the probability that these results could
have happened by chance
• Cronbach’s alpha, kappa, sensitivity and
specificity tell us about relationships between
measurements
For next week 1
• Read Chapter 3 in the text
• Read the ICES privacy document
(www.ices.on.ca)
• Think about privacy and confidentiality
• What issues are relevant to you in your
current research?
For next week 2
• Identify your data set
• Where did it come from?
• How was it collected?
• What type of variables does it include?
• What is your research question?
• What are your exposure variables?
• What is your outcome variable?
• If you are not familiar with SPSS it is STRONGLY recommended that you complete the tutorial