Transcript PPT

CTSI BERD Research Methods Seminar Series
Statistical Analysis I
Mosuk Chow, PhD
Senior Scientist and Professor
Department of Statistics
December 8, 2015
Biostatistics, Epidemiology, Research Design (BERD)



BERD Goals:
Match the needs of investigators to the
appropriate
biostatisticians/epidemiologists/methodologists
Provide BERD support to investigators
Offer BERD education to students and
investigators via in-person, videoconferenced,
and on-line classes
http://ctsi.psu.edu/ctsi-programs/biostatisticsepidemiologyresearch-design/
Statistics Encompasses
Study design
    Selection of efficient design (cohort study/case-control study)
    Sample size
    Randomization
Data collection
Summarizing data
    Important first step in understanding the data collected
Analyzing data to draw conclusions
Communicating the results of analyses
Keys to Successful Collaboration Between Statistician
and Investigator: A Two-Way Street

Involve statistician at beginning of project
(planning/design phase)

Specific objectives

Communication

avoid jargon

willingness to explain details
Keys to Successful Collaboration: A Two-Way Street

Respect
    Knowledge
    Skills
    Experience
    Time

Embrace statistician as a member of the
research team

Fund statistician on grant application for best
collaboration

Most statisticians are supported by grants, not by
Institutional funds
Statistical Analysis

Describing data
    Numeric or graphic
Statistical Inference
    Estimation of parameters of interest
    Hypothesis testing
    Regression modeling
Interpretation and presentation of the results
Describing data: Basic Terms





Measurement – assignment of a number to
a characteristic of an object or event
Data – collection of measurements
Sample – collected data
Population – all possible data
Variable – a property or characteristic of the
population/sample – e.g., gender, weight,
blood pressure.
Example of data set/sample
Data on albumin and bilirubin levels before and after treatment with a study drug

ID   DRUG   BILI   ALBUMIN   BASE_BIL   BASE_ALB
 6   0      0.7    4.2       0.8        3.98
 7   0      1.2    3.59      1          4.09
 8   0      1.3    3.08      0.3        4
11   0      2.1    3.58      1.4        4.16
13   0      1.1    3.39      0.7        3.85
16   0      0.6    3.8       0.7        3.66
21   0      1.7    3.22      0.6        3.83
 2   1      3.6    2.92      1.1        4.14
15   1      1.2    3.72      0.8        3.87
19   1      0.4    3.92      0.7        3.56
24   1      3.6    3.66      2.1        4
34   1      0.8    3.85      0.8        3.7
43   1      0.7    3.78      1.1        3.64
Describing Data



Types of data
Summary measures (numeric)
Visually describing data (graphical)
Types of Variables

Qualitative or Categorical
    Binary (or dichotomous): True/False, Yes/No
    Nominal: no natural ordering (e.g., ethnicity)
    Ordinal: categories have natural ranks
        Degree of agreement (strong, modest, weak)
        Size of tumor (small, medium, large)
Quantitative
    Ratio: ordered, constant scale, natural zero (age, weight)
    Interval: ordered, constant scale, no natural zero
        Differences make sense, but ratios do not
        Temperature in Celsius (30° − 20° = 20° − 10°, but 20° is not twice as hot as 10°)
Types of Measurements for Quantitative
Variables


Continuous: weight, height, age
Discrete: a countable number of values
    The number of births, age in years
Likert scale: “agree”, “strongly agree”, etc.
    Somewhere between ordinal and discrete
    Scales with <= 4 possibilities are usually considered to be ordinal.
    Scales with >= 7 possibilities are usually considered to be discrete.
Descriptive Statistics
Quantitative variable
 Measure(s) of central location/tendency




Mean
Median
Mode
Measure(s) of variability (dispersion)

describe the spread of the distribution
Descriptive Statistics (cont.)

Summary Measures of dispersion/variation



Minimum and Maximum
Range = Maximum – Minimum
Sample variance (abbreviated s²) and
standard deviation (s or SD), with
denominator n − 1
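The measures of central location and dispersion listed so far can be computed with Python's standard statistics module. This is only a minimal sketch; the readings in sbp are hypothetical values chosen for illustration, not data from the seminar.

```python
import statistics

# Hypothetical systolic blood pressure readings (mmHg), for illustration only.
sbp = [110, 118, 120, 120, 125, 131, 144]

mean = statistics.mean(sbp)      # arithmetic average
median = statistics.median(sbp)  # middle value of the sorted data
mode = statistics.mode(sbp)      # most frequent value

data_range = max(sbp) - min(sbp)     # Range = Maximum - Minimum
variance = statistics.variance(sbp)  # sample variance s^2, denominator n - 1
sd = statistics.stdev(sbp)           # sample standard deviation s

print(f"mean={mean}, median={median}, mode={mode}")
print(f"range={data_range}, s^2={variance}, s={sd:.2f}")
```

Note that statistics.variance and statistics.stdev divide by n − 1; the population versions (statistics.pvariance, statistics.pstdev) divide by n instead.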
Other Measures of Variation



Interquartile range (IQR): 75th percentile – 25th percentile
MAD: median absolute deviation
CV: coefficient of variation
    CV = (s / X̄) × 100%
    Ratio of SD over sample mean
    Measures relative variability
    Independent of measurement units
    Useful for comparing two or more sets of data
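As a rough illustration of these three measures, the sketch below computes the IQR, MAD, and CV for a small set of hypothetical values. Percentile conventions differ between software packages, so the "inclusive" method used here is one assumption among several reasonable ones.

```python
import statistics

# Hypothetical measurements, for illustration only.
values = [110, 118, 120, 120, 125, 131, 144]

# With n=4, statistics.quantiles returns the 25th, 50th and 75th percentiles.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# MAD: median of the absolute deviations from the median.
med = statistics.median(values)
mad = statistics.median([abs(v - med) for v in values])

# CV: sample SD divided by the sample mean, expressed as a percentage.
cv = statistics.stdev(values) / statistics.mean(values) * 100

print(f"IQR={iqr}, MAD={mad}, CV={cv:.1f}%")
```

Because the CV is a ratio, it stays the same if every value is converted to another unit by multiplication, which is why it can be used to compare variability across data sets measured on different scales.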
Describing data graphically
Tell whole story of data, detect outliers
 Histogram
 Stem and Leaf Plot

Box Plot
Histogram
The height represents the number of individuals in that range of SBP.
Each bar spans a width of 5 mmHg.
113 men
[Figure: histogram with systolic BP (mmHg) on the horizontal axis and number of men on the vertical axis]
Divide range of data into intervals (bins) of equal width.
Count the number of observations in each class.
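The binning and counting steps can be sketched in a few lines. The readings below are hypothetical, and the 5 mmHg bin width simply mirrors the one mentioned for the slide's histogram.

```python
from collections import Counter

# Hypothetical SBP readings (mmHg), for illustration only.
sbp = [92, 101, 104, 108, 110, 113, 114, 117, 118, 121, 122, 126, 133, 147]
bin_width = 5  # each bar spans 5 mmHg

# Assign each observation to the lower edge of its bin, e.g. 113 -> 110.
counts = Counter((x // bin_width) * bin_width for x in sbp)

# Text histogram: one '*' per observation in the bin (empty bins are shown too).
for lower in range(min(counts), max(counts) + bin_width, bin_width):
    upper = lower + bin_width - 1
    print(f"{lower:3d}-{upper:3d} | {'*' * counts[lower]}")
```

Changing bin_width to a much larger or much smaller value shows how the choice of bin width changes the apparent shape of the same data.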
Histogram of SBP
[Figure: two histograms of systolic BP (mmHg) against number of men, one with bin width = 20 mmHg and one with bin width = 1 mmHg]
Stem and Leaf Plot


Provides a good summary of data structure
Easy to construct and much less prone to error
than the tally method of finding a histogram
2 | 889
3 | 01112334455556667777899
4 | 001111122333444455567789
5 | 011234
“stem”: the first digit or digits of the number.
“leaf” : the trailing digit.
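A stem and leaf plot is simple enough to build by hand, and the same construction can be sketched in code. The helper stem_and_leaf below is a hypothetical name introduced for illustration; it assumes non-negative whole numbers and uses the last digit as the leaf.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return {stem: leaves} where the leaf is the trailing digit."""
    plot = defaultdict(str)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # 28 -> stem 2, leaf 8
        plot[stem] += str(leaf)
    return dict(plot)

# Hypothetical two-digit observations, for illustration only.
data = [45, 28, 31, 50, 29, 42, 28, 30, 47, 31, 51]
for stem, leaves in stem_and_leaf(data).items():
    print(f"{stem} | {leaves}")
```

In this sketch a stem with no observations is simply omitted; a fuller version would print empty stems so that gaps in the data remain visible.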
Box Plot: SBP for 113 Males
[Figure: boxplot of systolic blood pressures (mmHg) for a sample of 113 men, marking from top to bottom the largest observation, 75th percentile, sample median, 25th percentile, and smallest observation]
Descriptive Statistics (cont.)
Categorical variable
 Frequency (counts) distribution
 Relative frequency (percentages)
 Pie chart
 Bar graph
Describe relationship between two variables
One quantitative and one categorical
 Descriptive statistics within each category
 Side by side boxplots/histograms
Both quantitative
 Scatter plot
Both categorical
 Contingency table
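For categorical variables, the frequency distribution, relative frequencies, and a two-way contingency table all come down to counting. The sketch below uses made-up records with two hypothetical variables (treatment group and outcome) purely for illustration.

```python
from collections import Counter

# Hypothetical records: (treatment group, outcome), for illustration only.
records = [
    ("drug", "improved"), ("drug", "improved"), ("drug", "not improved"),
    ("placebo", "improved"), ("placebo", "not improved"), ("placebo", "not improved"),
]

# Frequency (counts) and relative frequency (percentages) of one variable.
outcome_counts = Counter(outcome for _, outcome in records)
outcome_pct = {k: 100 * v / len(records) for k, v in outcome_counts.items()}

# Contingency table for two categorical variables: one count per (row, column) pair.
table = Counter(records)

for group in ("drug", "placebo"):
    row = [table[(group, outcome)] for outcome in ("improved", "not improved")]
    print(group, row)
```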
Statistical Inference
A process of making inference (an estimate, prediction, or decision)
about a population (parameters) based on a sample (statistics) drawn
from that population.
[Figure: a sample (histogram of systolic BP in mmHg, number of men) is used to make inference about the population (distribution of systolic BP in mmHg, percentage)]
Sample: statistics (vary from sample to sample)
Population: parameters (fixed, unknown)
Statistical Inference
Questions to ask in selecting appropriate methods







Are observation units independent?
How many variables are of interest?
Type and distribution of variable(s)?
One-sample or two-sample problem?
Are samples independent?
Parameters of interest (mean, variance, proportion)?
Sample size sufficient for the chosen method?
(see decision making flow chart in the handout)
Estimation of population mean






We don’t know the population mean μ but
would like to estimate it.
We draw a sample from the population.
We calculate the sample mean X̄.
How close is X̄ to μ?
Statistical theory will tell us how close X̄ is to μ.
Statistical inference is the process of trying to
draw conclusions about the population from the
sample.
Key Statistical Concept


Question: How close is the sample
mean to the population mean?
Statistical Inference for sample mean



Sample mean will change from sample to
sample
We need a statistical model to quantify the
distribution of sample means (Sampling
distribution)
Sometimes, need “normal distribution” for
the population data
Normal Distribution

Normal distribution, denoted by N(µ, σ²), is characterized by
two parameters
µ: The mean is the center.
σ: The standard deviation measures the spread
(variability).
[Figure: probability density function curves, annotated with the mean and the standard deviation]
Distribution of Blood Pressure in Men (population)
Y: Blood pressure
Y ~ N(µ, σ²)
Parameters:
Mean, µ = 125 mmHg
SD, σ = 14 mmHg
[Figure: normal density curve with 68% of the area between 111 and 139 mmHg, 95% between 97 and 153 mmHg, and 99.7% between 83 and 167 mmHg]
The 68-95-99.7 rule for normal distribution applied to the
distribution of systolic blood pressure in men.
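The 68-95-99.7 rule can be checked numerically for this population model (mean 125 mmHg, SD 14 mmHg). The sketch below uses NormalDist from Python's standard library; the variable names are chosen only for this illustration.

```python
from statistics import NormalDist

# Population model for blood pressure in men: mean 125 mmHg, SD 14 mmHg.
bp = NormalDist(mu=125, sigma=14)

# Probability of falling within k standard deviations of the mean.
probs = {}
for k in (1, 2, 3):
    lower = bp.mean - k * bp.stdev
    upper = bp.mean + k * bp.stdev
    probs[k] = bp.cdf(upper) - bp.cdf(lower)
    print(f"within {k} SD ({lower:.0f} to {upper:.0f} mmHg): {probs[k]:.1%}")
```

The intervals printed are 111 to 139, 97 to 153, and 83 to 167 mmHg, with probabilities of roughly 68%, 95%, and 99.7%.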
Sampling Distribution



The sampling distribution refers to the distribution of
the sample statistics (e.g. sample means) over all
possible samples of size n that could have been
selected from the study population.
If the population data follow a normal distribution N(µ,
σ²), then the sample means follow a normal
distribution N(µ, σ²/n).
What if the population data do not come from
normal distribution?
Central Limit Theorem (CLT)


If the sample size is large, the distribution of sample
means approximates a normal distribution:
X̄ ~ N(µ, σ²/n)
The Central Limit Theorem works even when the
population is not normally distributed (or even not
continuous).
http://onlinestatbook.com/stat_sim/sampling_dist/index.html
For sample means, the standard rule is n > 60 for the Central Limit Theorem
to kick in, depending on how “abnormal” the population distribution is. 60
is a worst-case scenario.
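A small simulation can make the Central Limit Theorem concrete. The sketch below assumes a deliberately skewed population (exponential with mean 1 and SD 1, an arbitrary choice for illustration), draws many samples of size n = 60, and compares the spread of the sample means with σ/√n.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

n = 60                # size of each sample
n_samples = 5000      # number of repeated samples (arbitrary choice)
mu, sigma = 1.0, 1.0  # mean and SD of an exponential population with rate 1

# Draw many samples from the skewed population and record each sample mean.
sample_means = [
    sum(random.expovariate(1.0) for _ in range(n)) / n
    for _ in range(n_samples)
]

se = sigma / math.sqrt(n)  # theoretical SD of the sample means
mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
within_2se = sum(abs(m - mu) < 2 * se for m in sample_means) / n_samples

print(f"mean of sample means: {mean_of_means:.3f} (population mean {mu})")
print(f"SD of sample means:   {sd_of_means:.3f} (sigma/sqrt(n) = {se:.3f})")
print(f"share within 2 SE:    {within_2se:.1%}")
```

Even though individual observations are strongly right-skewed, the sample means should cluster around µ with a spread close to σ/√n, and roughly 95% of them should fall within two standard errors of µ.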
Sampling Distribution




By CLT, about 95% of the time, the sample mean
will be within two standard errors of the population
mean.
 This tells us how “close” the sample statistic
should be to the population parameter.
Standard errors (SE) measure the precision of
your sample statistic.
A small SE means it is more precise.
The SE is the standard deviation of the sampling
distribution of the statistic.
Standard Error of Sample Mean

The standard error of sample mean
(SEM) is a measure of the precision of
the sample mean.

SEM = σ / √n
σ: standard deviation (SD) of population
distribution.
The standard deviation is not the
standard error of a statistic!
Example

Measure systolic blood pressure on random sample
of 100 students
Sample size: n = 100
Sample mean: x̄ = 125 mm Hg
Sample SD: s = 14.0 mm Hg
SEM = 14 / √100 = 1.4 mmHg
Population SD (σ) can be replaced by sample
SD for large sample
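The arithmetic in this example is short enough to verify directly. The sketch below also shows, as an added illustration with a hypothetical larger sample size, how the SEM shrinks as n grows while the SD stays the same.

```python
import math

n = 100   # sample size
s = 14.0  # sample SD in mmHg, used in place of the population SD

sem = s / math.sqrt(n)
print(f"SEM = {sem:.1f} mmHg")  # prints: SEM = 1.4 mmHg

# Hypothetical comparison: quadrupling the sample size halves the SEM.
sem_400 = s / math.sqrt(400)
print(f"SEM with n = 400: {sem_400:.1f} mmHg")  # prints: SEM with n = 400: 0.7 mmHg
```

This is the distinction the previous slide warns about: the SD describes the spread of individual measurements, while the SEM describes the precision of the sample mean.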
Confidence Interval for population
mean

An approximate 95% confidence interval for population mean
µ is:
X̄ ± 2×SEM, or more precisely X̄ ± 1.96×SEM



X̄ is a random variable (it varies from sample to sample), so the
confidence interval is random, and it has a 95% chance of
covering µ before a sample is selected.
Once a sample is taken, we observe X̄ = x̄; then either µ is
within the calculated interval or it is not.
The confidence interval gives the range of plausible values
for µ.
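Applying the interval formula to the blood pressure example from the earlier slide (sample mean 125 mmHg, SEM 1.4 mmHg) gives a concrete interval. The sketch below takes the 1.96 multiplier from the standard normal distribution rather than hard-coding it; the variable names are illustrative.

```python
from statistics import NormalDist

x_bar = 125.0  # observed sample mean (mmHg)
sem = 1.4      # standard error of the mean (mmHg)

z = NormalDist().inv_cdf(0.975)  # about 1.96 for a 95% interval
lower = x_bar - z * sem
upper = x_bar + z * sem

print(f"95% CI for the mean: {lower:.1f} to {upper:.1f} mmHg")
# prints: 95% CI for the mean: 122.3 to 127.7 mmHg
```

So the plausible values for the population mean in that example run from about 122.3 to 127.7 mmHg.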