This evening..... - School of Mathematics and Statistics


Introductory Statistics
John Matthews, Professor of Medical Statistics, School of Mathematics and Statistics
Janine Gray, Senior Lecturer and Deputy Director, Newcastle Clinical Trials Unit
University of Newcastle-upon-Tyne
Course Outline

 Data Description
– Mean, Median, Standard Deviation
– Graphs
 The Normal Distribution
 Populations and Samples
 Confidence intervals and p-values
 Estimation and Hypothesis testing
– Continuous data
– Categorical data
 Regression and Correlation
Course Objectives



 To have an understanding of the Normal distribution and its relationship to common statistical analyses
 To have an understanding of basic statistical concepts such as confidence intervals and p-values
 To know which analysis is appropriate for different types of data
Recommended Textbooks

 Swinscow TDV and Campbell MJ. Statistics at Square One (10th edn). BMJ Books
 Altman DG. Practical Statistics for Medical Research. Chapman and Hall
 Bland M. An Introduction to Medical Statistics. Oxford Medical Publications
 Campbell MJ & Machin D. Medical Statistics: A Commonsense Approach. Wiley


Other reading
 Chinn S. Statistics for the European Respiratory Journal. Eur Respir J 2001; 18:393-401
 www.mas.ncl.ac.uk/~njnsm/medfac/MDPhD/notes.htm
 BMJ statistics notes
Types of Data
 Numerical Data
– discrete
 number of lesions
 number of visits to GP
– continuous
 height
 lesion area
Types of Data
 Categorical
– unordered
 Pregnant/Not pregnant
 married/single/divorced/separated/widowed
– ordered (ordinal)
 minimal/moderate/severe/unbearable
 Stage of breast cancer: I II III IV
Exercise
 What type are the following variables?
a) sex
b) diastolic blood pressure
c) diagnosis
d) height
e) family size
f) cancer stage
Types of Data
 Outcome/Dependent variable
– outcome of interest
– e.g. survival, recovery
 Explanatory/Independent variable
– treatment group
– age
– sex
Histogram of Birthweight
(grams) at 40 weeks GA
Summary Statistics
 Location
– Mean (average value)
– Median (middle value)
– Mode (most frequently occurring value)
 Variability
– Variance/SD
– Range
– Centiles
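The summary statistics listed above can be computed with Python's standard `statistics` module. This is a minimal sketch, not part of the original course material: it reuses the seven hours-of-relief values quoted later in these notes, and the `visits` list is a hypothetical set of counts added only to illustrate the mode.

```python
import statistics

# Hours of relief for 7 arthritic patients (values quoted later in these notes).
hours = [2.2, 2.4, 4.9, 3.3, 2.5, 3.7, 4.3]

# Location
mean = statistics.mean(hours)
median = statistics.median(hours)

# Variability
variance = statistics.variance(hours)   # sample variance (divides by N - 1)
sd = statistics.stdev(hours)            # sample standard deviation
data_range = max(hours) - min(hours)
# Quartiles (25th, 50th and 75th centiles). The exact values depend on the
# interpolation rule, and different packages use different conventions.
quartiles = statistics.quantiles(hours, n=4)

# The mode is most useful for discrete data.
visits = [0, 1, 1, 2, 1, 3]             # hypothetical numbers of visits to GP
mode = statistics.mode(visits)

print(f"mean={mean:.2f} median={median:.2f} sd={sd:.2f} range={data_range:.1f}")
print("quartiles:", quartiles, "mode of visits:", mode)
```

The mean and SD agree with the values quoted for these data in the t-distribution example (3.33 and 1.03).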
Birthweights (g) at 40 weeks Gestation
 mean = 3441g
 median = 3428g
 sd = 434g
 min = 2050g
 max = 4975g
 range = 2925g
Boxplot
Figure: boxplots of T4 cells/mm3 blood sample by group: Hodgkin's (N = 20) and Non-Hodgkin's (N = 20)
Symmetric Data
 mean = median (approx) ✓
 standard deviation ✓

Skew Data
 median = "typical" value ✓
 mean affected by extreme values - larger than median (for right-skewed data)
 SD fairly meaningless
 centiles (less affected by extreme values/outliers) ✓

Half of all doctors are below average….
 Even if all surgeons are equally good, about half will have below average results, one will have the worst results, and the worst results will be a long way below average
 Ref. BMJ 1998; 316:1734-1736
Discrete Data
Principal diagnosis of patients in Tooting Bec Hospital
Diagnosis                 Number of patients
Schizophrenia              474 (32%)
Affective Disorders        277 (19%)
Organic Brain Syndrome     405 (28%)
Subnormality                58 (4%)
Alcoholism                  57 (4%)
Other/Not Known            196 (13%)
Total                     1467
Bar Chart
Figure: bar chart of the principal diagnosis of patients in Tooting Bec Hospital (x-axis: Diagnosis; y-axis: Count)
Summarising data - Summary
 Choosing the appropriate summary statistics and graph depends upon the type of variable you have
 Categorical (unordered/ordered)
 Continuous (symmetric/skew)
The Normal Distribution




 $N(\mu, \sigma^2)$
 $\mu$ = unknown population mean - estimate using sample mean
 $\sigma$ = unknown population SD - estimate using sample SD
 Birthweight is $N(3441, 434^2)$
N(0,1) - Standard Normal Distribution
 $z = (x - \mu)/\sigma$, where z is measured in SD units
 68% within ± 1
 95% within ± 1.96
 99% within ± 2.58
Birthweight (g) at 40 weeks
95% within 1.96 SDs
2590 - 4292 grams
99% within 2.58 SDs
2321 - 4561 grams
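The birthweight ranges above follow directly from the Normal distribution. The sketch below, which is an illustration rather than part of the original slides, uses `statistics.NormalDist` from the Python standard library; the helper name `central_range` is hypothetical. Because it uses the exact multiplier for 99% (about 2.576 rather than the rounded 2.58), its 99% limits differ from the slide by a couple of grams.

```python
from statistics import NormalDist

mean, sd = 3441, 434   # birthweight mean and SD from the slide (grams)

def central_range(mean, sd, coverage):
    """Return the interval mean ± z*sd covering `coverage` of a Normal distribution."""
    # z is the standard Normal multiplier: about 1.96 for 95%, 2.576 for 99%.
    z = NormalDist().inv_cdf(0.5 + coverage / 2)
    return mean - z * sd, mean + z * sd

lo95, hi95 = central_range(mean, sd, 0.95)
lo99, hi99 = central_range(mean, sd, 0.99)
print(f"95% within {lo95:.0f} - {hi95:.0f} grams")
print(f"99% within {lo99:.0f} - {hi99:.0f} grams")
```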
Further Reading
 http://www.mas.ncl.ac.uk/~njnsm/medfac/docs/intro.pdf
 Altman DG, Bland JM (1996) Presentation of numerical data. BMJ 312, 572
 Altman DG, Bland JM (1995) The normal distribution. BMJ 310, 298
Samples and Populations
 Use samples to estimate population quantities (parameters) such as disease prevalence, mean cholesterol level etc
 Samples are not interesting in their own right - only to infer information about the population from which they are drawn
 Sampling Variation
 Populations are unique - samples are not.
Samples and Populations
 How much might these estimates vary from sample to sample?
 Determine precision of estimates (how close to/far away from the population value?)
(Artificial) example

Have 5000 measurements of diastolic blood pressure from
airline pilots. This accounts for ALL airline pilots and is
the population of airline pilots.

(Artificial example - if we had the whole population we
wouldn’t need to sample!!)

Since we have the population, we know the true
population characteristics. It is these we are trying to
estimate from a sample.
Population distribution of diastolic BP
from Airline Pilots (in mmHg)
True mean = 78.2
True SD = 9.4
Example

Write each measurement on a piece of paper and put
into a hat.

Draw 5 pieces of paper and calculate the mean of the
BP.

replace and repeat 49 more times

End up with 50 (different) estimates of mean BP
Sampling Distribution
 Each
estimate of the mean will be different.
 Treat this as a random sample of means
 Plot a histogram of the means.
 This is an estimate of the sampling distribution
of the mean.
 Can get the sampling distribution of any
parameter in a similar way.
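The "hat" experiment can be imitated in a few lines of Python. This is a sketch under stated assumptions: the 5000 pilot measurements are not available here, so the population is simulated as Normal with the true mean (78.2) and SD (9.4) quoted above, and the function name `sample_means` is hypothetical.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Assumption: stand-in for the 5000 pilots, simulated from a Normal
# distribution with the true mean and SD given on the slide.
population = [random.gauss(78.2, 9.4) for _ in range(5000)]

def sample_means(population, n, repeats=50):
    """Draw `repeats` samples of size n from the 'hat' and return their means."""
    return [statistics.mean(random.sample(population, n)) for _ in range(repeats)]

for n in (5, 10, 100):
    means = sample_means(population, n)
    print(f"N={n:3d}: mean of means = {statistics.mean(means):.1f}, "
          f"SD of means = {statistics.stdev(means):.2f}")
```

The SD of the 50 means shrinks as the sample size grows, which is the behaviour the histograms of sample means are meant to show.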
Distribution of the mean
Population: $\mu$ = 78.2, $\sigma$ = 9.4
Figure: histograms of the population and of the means of 50 samples with N=5, N=10 and N=100
Distribution of the Mean
 BUT! Don’t need to take multiple samples
 Standard error of the mean = $\sqrt{\text{Sample SD}^2 / N}$
 SE of the mean is the SD of the distribution of the sample mean
Distribution of Sample Mean
 Distribution of sample mean is approximately Normal regardless of the distribution of the data (unless the sample is small or the data are very skew)
 SO can apply Normal theory to sample mean also
Distribution of Sample Mean
 i.e. 95% of sample means lie within 1.96 SEs of (unknown) true mean
 This is the basis for a 95% confidence interval (CI)
 95% CI is an interval which on 95% of occasions includes the population mean
Example

57 measurements of FEV1 in male medical
students
Example
 $\bar{X} = 4.06$ litres, SD = 0.67 litres
 95% of population lie within $\bar{X} \pm 1.96$ SDs
i.e. within 4.06 ± 1.96 × 0.67, from 2.75 to 5.38 litres
Example
 SE = $\sqrt{0.67^2 / 57} = 0.09$
 Thus for FEV1 data, 95% chance that the interval 4.06 ± 1.96 × 0.09 contains the true population mean, i.e. between 3.89 and 4.23 litres
 This is the 95% confidence interval for the mean
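The FEV1 calculation can be checked with a short Python sketch. It uses only the summary figures from the example (N = 57, mean 4.06, SD 0.67) and the Normal multiplier 1.96; the variable names are illustrative.

```python
import math

n, mean, sd = 57, 4.06, 0.67     # FEV1 summary statistics from the example

se = sd / math.sqrt(n)           # identical to sqrt(sd**2 / n)
lower = mean - 1.96 * se
upper = mean + 1.96 * se
print(f"SE = {se:.2f}, 95% CI = {lower:.2f} to {upper:.2f} litres")
```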
Confidence Intervals
 The confidence interval (CI) measures uncertainty. The 95% confidence interval is the range of values within which we can be 95% sure that the true value lies for the whole of the population of patients from whom the study patients were selected. The CI narrows as the number of patients on which it is based increases.
Standard Deviations & Standard Errors
 The SE is the SD of the sampling distribution (of the mean, say)
 SE = SD/$\sqrt{N}$
 Use SE to describe the precision of estimates (for example Confidence intervals)
 Use SD to describe the variability of samples, populations or distributions (for example reference ranges)
The t-distribution
 When N is small, estimate of SD is particularly unreliable and the distribution of sample mean is not Normal
 Distribution is more variable - longer tails
 Shape of distribution depends upon sample size
 This distribution is called the t-distribution
Figure: t-distribution compared with N(0,1) for three sample sizes
 N=2: t(1), 95% within ± 12.7
 N=10: t(9), 95% within ± 2.26
 N=30: t(29), 95% within ± 2.04
t-distribution
 As N becomes larger, t-distribution becomes more similar to Normal distribution
 Degrees of Freedom (DF) = sample size - 1
 DF is a measure of the amount of information contained in data set
Implications

Confidence interval for the mean
» Sample size < 30
Use t-distribution
» Sample size > 30
Use either Normal or t distribution

Note: Stats packages (generally) will
automatically use the correct distribution for
confidence intervals
Example
 Numbers of hours of relief obtained by 7 arthritic patients after receiving a new drug: 2.2, 2.4, 4.9, 3.3, 2.5, 3.7, 4.3
 Mean = 3.33, SD = 1.03, DF = 6, t(5%) = 2.45
 95% CI = 3.33 ± 2.45 × 1.03/$\sqrt{7}$
2.38 to 4.28 hours
 Normal 95% CI = 3.33 ± 1.96 × 1.03/$\sqrt{7}$
2.57 to 4.09 hours
TOO NARROW!!
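The two intervals in this example can be reproduced from the raw data. In this sketch the t multiplier (2.45 for 6 DF) is taken from the slide rather than computed, since the Python standard library has no t-distribution; the constant names are illustrative. Working from unrounded mean and SD, the limits agree with the slide to within rounding.

```python
import math
import statistics

hours = [2.2, 2.4, 4.9, 3.3, 2.5, 3.7, 4.3]   # data from the example
n = len(hours)
mean = statistics.mean(hours)
sd = statistics.stdev(hours)                   # sample SD (divides by N - 1)
se = sd / math.sqrt(n)

T_CRIT = 2.45   # 5% point of t with 6 DF, as quoted on the slide
Z_CRIT = 1.96   # 5% point of the standard Normal distribution

t_ci = (mean - T_CRIT * se, mean + T_CRIT * se)
z_ci = (mean - Z_CRIT * se, mean + Z_CRIT * se)

print(f"t-based 95% CI:      {t_ci[0]:.2f} to {t_ci[1]:.2f} hours")
print(f"Normal-based 95% CI: {z_ci[0]:.2f} to {z_ci[1]:.2f} hours")
```

The Normal-based interval is the narrower of the two, which is the point made on the slide.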

Hypothesis Testing
 Enables us to measure the strength of evidence supplied by the data concerning a proposition of interest
 In a trial comparing two treatments there will ALWAYS be a difference between the estimates for each treatment - a real difference or random variation?
Null Hypothesis
 Study hypothesis - hypothesis in the mind of the investigator (patients with diabetes have raised blood pressure)
 Null hypothesis is the converse of the study hypothesis - aim to disprove it (patients with diabetes do not have raised blood pressure)
 Hypothesis of no effect/difference
Two-Sample t-test
 Two independent samples
 Can the two samples be considered to be the same with respect to the variable you are measuring or are they different?
 Sample means will ALWAYS be different - real difference or random variation?
 ASSUMPTION: Data are normally distributed and SD in each group similar
Two-Sample t-test
 24 hour total energy expenditure (MJ/day) in groups of lean and obese women
 Do the women differ in their energy expenditure?
 Null hypothesis: energy expenditure in lean and obese women is the same
Boxplot of energy expenditure
Figure: boxplots of energy expenditure (MJ/day) by group: lean (N = 13) and obese (N = 9)
Two-sample t-test
 Summary statistics
         lean    obese
Mean     8.1     10.3
SD       1.2     1.4
N        13      9
 Difference in means = 10.3 - 8.1 = 2.2
 SE difference = 0.57 (weighted average)
Two Sample t-test
 Test statistic is 2.2/0.57 = 3.9
 N1 + N2 - 2 DF (= 20)
 Calculate the probability of observing a value at least as extreme as 3.9 if the null hypothesis is true
 If the null hypothesis is true, the test statistic should have a t-distribution with 20 df (df = N1+N2-2)

Two Sample t-test
 95% of values from t-distribution with 20 DF lie between -2.09 and +2.09
 Probability of observing a value as extreme or more extreme than 3.9 in a t-distribution with 20 df is 0.001
 Only a very small probability that the value of 3.9 fits reasonably with a t-distribution with 20 df
 Conclude that energy expenditure is significantly different between lean and obese women
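The two-sample t statistic can be sketched from the summary statistics alone. The function name `pooled_t` is hypothetical, and the pooled-variance formula is the standard one for the equal-SD t-test, which is assumed to be what "weighted average" refers to. Because the inputs are the rounded figures from the slide, the results differ slightly from the quoted SE of 0.57 and test statistic of 3.9.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t statistic assuming a common SD in both groups."""
    df = n1 + n2 - 2
    # Pooled variance: average of the two variances weighted by their DF.
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean2 - mean1) / se, se, df

# Lean: mean 8.1, SD 1.2, N 13.  Obese: mean 10.3, SD 1.4, N 9.
t, se, df = pooled_t(8.1, 1.2, 13, 10.3, 1.4, 9)
print(f"SE difference = {se:.2f}, t = {t:.2f}, DF = {df}")

T_CRIT_20 = 2.09  # 5% point of t with 20 DF, as quoted on the slide
print("significant at the 5% level:", abs(t) > T_CRIT_20)
```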

The P-value

The P-value is the probability of observing a test
statistic at least as extreme as that observed if the null
hypothesis is true
Figure: t distribution with 20 df (probability density plotted against x, from -4 to 4)
Confidence Interval for the difference in two means
 95% CI = 2.2 - 2.09 × 0.57 to 2.2 + 2.09 × 0.57, or from 1.05 to 3.41 MJ/day
(the quoted limits appear to come from unrounded values; the rounded figures shown give about 1.01 to 3.39)
 Thus we are 95% confident that obese women use between 1.05 and 3.41 MJ/day more energy than lean women
Confidence Interval or P-value?
 Confidence interval!!!
 P-value will tell you whether or not there is a statistically significant difference
 Confidence interval will give information about the size of the difference and the strength of the evidence
Paired t-test
 Obvious pairing between observations
– two measurements on each subject (before-after study)
– case-control pairs
 Assumption - differences between the paired observations are normally distributed
 Example - Systolic blood pressure (SBP) measured in 16 middle aged men before and after a standard exercise. Post-exercise SBP - Pre-exercise SBP calculated for each man

Boxplot of differences
Figure: boxplot of the 16 differences (post-exercise minus pre-exercise SBP)
Paired t-test
 Mean difference = 6.6
 SE(Mean) = 1.5
 t = 6.6/1.5 = 4.4
 Compare with t(15)
 P < 0.001
 Conclusion - mean systolic blood pressure is higher after exercise than before
Paired t-test
 95% confidence interval for the mean difference
 6.6 ± 2.13 × 1.5 = 3.4 to 9.8
Categorical Variables
 To investigate the relationship between two categorical variables form contingency table
 Hypothesis tests
– Chi-squared test ($\chi^2$ test)
– Fisher’s exact test (small samples)
– McNemar’s test (paired data)
Chi-squared test
 Used to test for associations between categorical variables (2 or more distinct outcomes)
 Example - a comparison between psychotherapy and usual care for major depression in primary care
Patient Reported Recovery at 8 months
                 Recovered    Not Recovered    Total
Psychotherapy    47 (51%)     46 (49%)         93
Usual Care       18 (20%)     73 (80%)         91
Total            65 (35%)     119 (65%)        184
P<0.001, Chi-square test
Patient Reported Recovery at 8 months
 Difference between proportions recovered 30.8%
 95% confidence interval for difference 17.7% to 43.8%
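The recovery comparison can be sketched in Python using only the standard library. Two assumptions are made here and are not stated on the slides: the chi-squared statistic is computed without a continuity correction, and the confidence interval uses the usual Normal approximation for a difference in proportions. The function names are hypothetical. For a 2×2 table (1 DF) the P-value can be obtained from `math.erfc`, since a chi-squared variable with 1 DF is the square of a standard Normal variable.

```python
import math

def chi2_2x2(a, b, c, d):
    """Chi-squared statistic (no continuity correction) and P-value for a 2x2 table."""
    observed = ((a, b), (c, d))
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    n = a + b + c + d
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    p = math.erfc(math.sqrt(stat / 2))   # upper tail of chi-squared with 1 DF
    return stat, p

def diff_in_proportions(x1, n1, x2, n2, z=1.96):
    """Difference p1 - p2 with a Normal-approximation 95% CI."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff, diff - z * se, diff + z * se

stat, p = chi2_2x2(47, 46, 18, 73)       # psychotherapy row, usual care row
diff, lo, hi = diff_in_proportions(47, 93, 18, 91)
print(f"chi-squared = {stat:.1f}, P < 0.001: {p < 0.001}")
print(f"difference = {diff:.1%}, 95% CI {lo:.1%} to {hi:.1%}")
```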
Larger tables
 Similar methods can be applied to larger tables to test the association between two categorical variables
 Example - Is there an association between housing tenure and time of delivery of baby (preterm/term)?
 Null hypothesis: There is no relationship between housing tenure and time of delivery
Relationship between housing tenure and time of delivery
Observed counts (expected counts in brackets)
Housing Tenure        Preterm      Term           Total
Owner-occupier        50 (61.7)    849 (837.3)    899
Council Tenant        29 (17.7)    229 (240.3)    258
Private Tenant        11 (12.0)    164 (163.0)    175
Lives with Parents    6 (4.9)      66 (67.1)      72
Other                 3 (2.7)      36 (36.3)      39
Total                 99           1344           1443
Relationship between housing tenure and time of delivery
 Test Statistic
$\chi^2 = \frac{(50 - 61.7)^2}{61.7} + \frac{(849 - 837.3)^2}{837.3} + \dots + \frac{(3 - 2.7)^2}{2.7} + \frac{(36 - 36.3)^2}{36.3} = 10.5$
 DF = (5-1)(2-1) = 4
 P = 0.03
 Thus we have evidence of a relationship between housing tenure and time of delivery
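The test statistic for the housing tenure table can be reproduced by computing each expected count as (row total × column total) / grand total. This is a sketch with hypothetical function names. The P-value helper uses a closed form that holds only for 4 degrees of freedom, chosen here to avoid depending on a statistics package.

```python
import math

# Observed counts: (preterm, term) for each housing tenure category.
observed = [
    [50, 849],   # Owner-occupier
    [29, 229],   # Council Tenant
    [11, 164],   # Private Tenant
    [6, 66],     # Lives with Parents
    [3, 36],     # Other
]

def chi_squared(table):
    """Return the chi-squared statistic and its degrees of freedom."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for row, rt in zip(table, row_totals):
        for obs, ct in zip(row, col_totals):
            expected = rt * ct / n
            stat += (obs - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df

def p_value_4df(stat):
    """Upper-tail probability of chi-squared with exactly 4 DF (closed form)."""
    return math.exp(-stat / 2) * (1 + stat / 2)

stat, df = chi_squared(observed)
print(f"chi-squared = {stat:.1f}, DF = {df}, P = {p_value_4df(stat):.2f}")
```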

Notes
 Chi-squared test not valid if expected values are small (<5)
– Combine rows or columns to obtain a smaller table with larger expected values
– Use Fisher’s exact test for small tables
McNemar’s test
 Appropriate for use with paired or matched (case-control) data with a dichotomous outcome
Example - McNemar’s test
 Skaane compared the use of mammography and ultrasound in the assessment of 327 (228 palpable and 99 non-palpable) consecutive malignant tumours confirmed at histology. Acta radiologica vol 40;486-490 (1999)
McNemar’s test - example
            Mammogram
US          Yes     No     Tot.
Yes         267     11     278
No          41      8      49
Tot.        308     19     327
McNemar’s test - example
 308/327 (94%) were picked up by mammography compared with 278/327 (85%) picked up by ultrasound
 P<0.001
 Conclusion: Mammography is significantly more sensitive in diagnosing tumours than ultrasound in a population of mixed malignant tumours
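McNemar's test uses only the discordant pairs, here the 41 tumours detected by mammography alone and the 11 detected by ultrasound alone. The sketch below uses the chi-squared form of the test without a continuity correction; the slides do not say which version was used, so this choice and the function name `mcnemar` are assumptions.

```python
import math

def mcnemar(b, c):
    """McNemar chi-squared statistic (1 DF, no continuity correction) and P-value.

    b and c are the two discordant counts from the paired 2x2 table.
    """
    stat = (b - c) ** 2 / (b + c)
    p = math.erfc(math.sqrt(stat / 2))   # upper tail of chi-squared with 1 DF
    return stat, p

# Ultrasound yes / mammogram no = 11; ultrasound no / mammogram yes = 41.
stat, p = mcnemar(11, 41)
print(f"McNemar chi-squared = {stat:.1f}, P < 0.001: {p < 0.001}")
```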
Hypothesis testing - summary
Type of data                    Paired Design                          Unpaired Design
Continuous/Quantitative data    Paired (one-sample) t-test             Unpaired (independent samples) t-test
                                Wilcoxon signed rank test              Mann-Whitney U test
Ordered Categorical data        Wilcoxon signed rank test              Mann-Whitney U test
Unordered Categorical data      McNemar's test (2 categories only)     Chi-squared test
                                                                       Fisher's exact test
Adapted from Chinn S. Statistics for the European Respiratory Journal.
Correlation and Regression
 Relationship between two continuous variables
– regression
– correlation
Relationship between two continuous variables
 3 main purposes for doing this
– to assess whether the two variables are associated (correlation)
– to enable the value of one variable to be predicted from any known value of the other variable (regression)
– to assess the amount of agreement between two variables (method comparison study)
Example
 Women from a pre-defined geographical area were invited to have their haemoglobin (Hb) level and packed cell volume measured. They were also asked their age.
Haemoglobin and packed cell volume
Figure: scatter plot of haemoglobin against Packed Cell Volume (%)
Example - relationships between variables
 Association between Hb and PCV?
Hb affects PCV or PCV affects Hb?
 Use correlation to measure the strength of an association
 Association between Hb and age?
age must affect Hb and not vice versa
 Use regression to predict Hb from age
Correlation
 Not interested in causation
i.e. does a high PCV cause a high Hb level?
 Interested in association
i.e. is a high PCV associated with a high Hb level?
 sample correlation coefficient
– summarises strength of relationship
– can be used to test the hypothesis that the population correlation coefficient is 0
Correlation Coefficient
 dimensionless, from -1 to 1
 measures the strength of a linear relationship
 +ve - high value of one variable associated with high value of the other
 -ve - high value of one variable associated with low value of the other
 +1 or -1 = exact linear relationship
 strictly called Pearson correlation coefficient
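The properties listed above can be seen by computing the Pearson coefficient directly from its definition. This is a minimal sketch: the function name `pearson_r` is illustrative and the PCV and Hb values are hypothetical, made up for the illustration rather than taken from the study.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical packed cell volume (%) and haemoglobin values.
pcv = [35, 38, 40, 42, 45, 47, 50]
hb = [11.5, 12.4, 13.0, 13.1, 14.8, 15.2, 16.0]
print(f"r = {pearson_r(pcv, hb):.2f}")

# Exact linear relationships give r = +1 or r = -1.
print(pearson_r([1, 2, 3], [2, 4, 6]), pearson_r([1, 2, 3], [6, 4, 2]))
```

Because r is dimensionless, changing the units of either variable leaves it unchanged.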
Example Data
Figure: four scatter plots of Y against X, with r = -0.4, r = 1, r = 0 and r = 0.7
When not to use the correlation coefficient
 If the relationship is non-linear
 Use with caution in the presence of outliers
 When the variables are measured over more than one distinct group (e.g. disease groups)
 When one of the variables is fixed in advance
 When assessing agreement
Correlation - example data
Figure: four scatter plots (y1 against x1, y2 against x2, y3 against x3, y4 against x4)
Is there an alternative?
 If the data are non-linear or there is an outlier
– use Spearman rank correlation coefficient
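The Spearman coefficient is the Pearson coefficient applied to the ranks of the data, which is why it is less affected by non-linearity and outliers. The sketch below is self-contained, with hypothetical function names and made-up data showing a monotonic but non-linear relationship.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman_r(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Made-up data: y increases with x, but not in a straight line.
x = [1, 2, 3, 4, 5, 6]
y = [1, 4, 9, 16, 25, 36]
print(f"Pearson = {pearson_r(x, y):.3f}, Spearman = {spearman_r(x, y):.3f}")
```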
Haemoglobin and Packed Cell Volume
Figure: scatter plot of haemoglobin against Packed Cell Volume (%)
 Without outlier: Pearson=0.67, Spearman=0.63
 With outlier: Pearson=0.34, Spearman=0.48
Regression
 Assume a change in x will cause a change in y
 predict y for a given value of x
 usually not logical to believe y causes x
 y is the dependent variable (vertical axis)
 x is the independent variable (horizontal axis)
Example - Haemoglobin vs Age
Figure: scatter plot of haemoglobin against Age (Years)
Regression
 Logical to assume that increasing age leads to increasing Hb
 Not logical to assume Hb affects age!
 Assume underlying true linear relationship
 Make an estimate of what that true linear relationship is
Estimating a regression line
 How do I identify the ‘best’ straight line?
 least squares estimate
 straight line determined by slope and intercept
 y = a + bx
 a and b are estimates of the true intercept and slope and are subject to sampling variation
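The least squares estimates of a and b have simple closed forms: the slope is the sum of cross-products of deviations divided by the sum of squared deviations of x, and the intercept makes the line pass through the point of means. The sketch below illustrates this with hypothetical age and Hb values; the function name `least_squares` and the data are not from the study.

```python
def least_squares(x, y):
    """Return (a, b), the intercept and slope of the least squares line y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx                 # slope
    a = mean_y - b * mean_x       # line passes through (mean_x, mean_y)
    return a, b

# Hypothetical ages (years) and haemoglobin values.
age = [20, 30, 40, 50, 60]
hb = [11.0, 12.1, 13.8, 14.9, 16.2]
a, b = least_squares(age, hb)
print(f"Hb = {a:.2f} + {b:.3f} x age")
```

A different sample would give different values of a and b, which is the sampling variation referred to above.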
Regression line of haemoglobin on age
Figure: scatter plot of haemoglobin against Age (years) with the fitted regression line
Regression of haemoglobin on age

Variable(s) Entered on Step Number 1..  AGE  Age (Years)

Multiple R            .87959
R Square              .77367
Adjusted R Square     .76110
Standard Error       1.17398

Analysis of Variance
              DF    Sum of Squares    Mean Square
Regression     1          84.80397       84.80397
Residual      18          24.80803        1.37822

F = 61.53133      Signif F = .0000
Regression of haemoglobin on age

---------------------- Variables in the Equation ----------------------
Variable             B        SE B     95% Confdnce Intrvl B         T    Sig T
AGE            .134251     .017115      .098295      .170208     7.844    .0000
(Constant)    8.239786     .794261     6.571104     9.908467    10.374    .0000
What does this tell us?
 Hb = 8.2 + 0.13 × AGE
 95% CI for the slope goes from 0.098 to 0.170
 P < 0.0001
 Significant relationship between Hb and age
 77% of the variability in Hb can be accounted for by age
How can it be used?
 Predict mean Hb for a given age
 E.g. What is the mean Hb of a 50 year old?
 Mean Hb = 8.2 + 0.13 × 50 = 14.7 g/dl (14.95 g/dl using the unrounded coefficients)
 95% CI for the estimate from 14.4 to 15.5 g/dl
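The prediction and the slope's confidence interval can be checked from the coefficients in the regression output. In this sketch the constant and function names are illustrative, and the multiplier 2.101 (the 5% point of t with 18 DF, matching the residual DF in the output) is supplied by hand as an assumption rather than computed.

```python
# Coefficients and standard error taken from the regression output.
INTERCEPT = 8.239786
SLOPE = 0.134251
SE_SLOPE = 0.017115
T_CRIT_18 = 2.101   # assumed 5% point of t with 18 DF (residual DF)

def predict_mean_hb(age):
    """Predicted mean haemoglobin (g/dl) at a given age in years."""
    return INTERCEPT + SLOPE * age

unrounded = predict_mean_hb(50)
rounded = 8.2 + 0.13 * 50        # the slide's calculation with rounded values
slope_ci = (SLOPE - T_CRIT_18 * SE_SLOPE, SLOPE + T_CRIT_18 * SE_SLOPE)

print(f"predicted mean Hb at 50: {unrounded:.2f} g/dl (rounded coefficients: {rounded:.1f})")
print(f"95% CI for slope: {slope_ci[0]:.3f} to {slope_ci[1]:.3f}")
```

Rounding the coefficients before predicting shifts the estimate by about 0.25 g/dl, so the unrounded value sits at the centre of the quoted 14.4 to 15.5 interval while the rounded one does not.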
How can it be used?
 To calculate reference ranges for the population
 E.g. What range would you expect 95% of 50 year olds to lie within? (reference range)
 Between 12.4 and 17.5 g/dl
95% Confidence Interval for the Mean & 95% prediction interval for individuals
Figure: regression line with both intervals, plotted against Age (years)
Definitions
 Predicted value
– the value predicted by the regression line
– an estimate of the mean value
 Residual
– Observed value - predicted value
What assumptions have I made?
 The relationship is approximately linear
 The residuals have a normal distribution
Multiple Regression
 One outcome variable with multiple predictor variables
 Residuals assumed to be normally distributed
 Predictor variables can be continuous or categorical
 No assumptions made about distribution of continuous predictor variables
Multiple Regression
 Example. Does the value of packed cell volume improve the prediction of Hb?
 Model fitted
Mean Hb = 5.2 + 0.1 × age (years) + 0.1 × packed cell volume (%)
$R^2$ = 83%
 Knowledge of packed cell volume improves the prediction of haemoglobin
Summary
 Regression can be used to estimate the numerical relationship between an outcome variable and one or more predictor variables
 Correlation coefficient alone is of limited use