
Hygiene and Preventive Medicine Institute
University of Sassari Medical School
Italy
Simple statistics for clinicians on
respiratory research
By
Giovanni Sotgiu
What are your expectations?
Too difficult to explain medical
statistics in 30 min…..
What is medical statistics?
• “..Discipline concerned with the treatment of numerical
data derived from groups of individuals..” P Armitage
• “..Art of dealing with variation in data through collection,
classification and analysis in such a way as to obtain
reliable results..” JM Last
What is medical statistics?
Collection of statistical procedures
well-suited to the analysis of healthcare-related data
Why we need to study statistics in the
field of medicine……..
Why we need to study statistics…
1) Basic requirement of medical research
2) Update your medical knowledge
3) Data management and treatment
Road map
1) Basic concepts
2) Sample and population
3) Probability
4) Data description
5) Measures of disease
Basic concepts
1. Homogeneity
All individuals have similar values or belong to the same category
Ex.: all individuals are Chinese,
….women,
….middle-aged (30-40 years old),
….work in the same factory
homogeneity in nationality, gender, age and occupation
Basic concepts
1. Variation
Differences in height, weight, treatment…
1. Variation
• Toss a coin: the marked face may land up or down
• Treat patients suffering from TB with the same antibiotics: some of them recover and others do not
1. Variation
no variation, no statistics
What is the target of our studies?
Population
2. Population
the whole collection of individuals that one
intends to study
2. Population
Limits to studying the whole population:
• economic issues
• short time
2. Population and sample
2. Sample
a representative part of the population
Sampling
By chance!
Random
• Random event
→ the event may or may not occur in one experiment
→ before an experiment, nobody can be sure whether the event will occur or not
Random
Please, give some examples of random event…
The mathematical procedures whereby we convert information about the sample into
intelligent guesses about the population fall under the heading of inferential
statistics (generalization)
Probability
3. Probability
Measures the possibility of occurrence of a random event
P(A) = (number of ways event A can occur) / (total number of possible outcomes)
Estimation of probability → frequency
Number of observations: n (large enough)
Number of occurrences of random event A: m
P(A) ≈ m/n (relative frequency theory)
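The relative frequency idea can be tried out with a short simulation. This is only a sketch: the fair coin, the seed, and the function name relative_frequency are assumptions made for illustration, not part of the lecture.

```python
import random


def relative_frequency(n_tosses, seed=0):
    """Toss a simulated fair coin n_tosses times and return m/n for 'heads'."""
    rng = random.Random(seed)  # fixed seed so that the run is reproducible
    # m = number of occurrences of the random event A ("heads")
    m = sum(1 for _ in range(n_tosses) if rng.random() < 0.5)
    return m / n_tosses


# As n grows, m/n is expected to settle near the true probability (0.5 here).
for n in (10, 100, 10_000):
    print(n, relative_frequency(n))
```

With few tosses the relative frequency can be far from 0.5; this is why the slide asks for n to be "large enough".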
3. Probability
A random event
P(A) Probability of the random event A
P(A) = 1, if an event always occurs
P(A) = 0, if an event never occurs
Please, give some examples for
probability of a random event and frequency of
that random event
Parameters and statistics
4. Parameter
A measurement describing some characteristic of a population
or
A measurement of the distribution of a characteristic of a
population
Greek letter (μ,π, etc.)
Usually unknown
To estimate the parameter of a population
we need a sample
4. Statistic
A measurement describing some characteristic of a sample
or
A measurement of the distribution of a characteristic of a
sample
Latin letter (s, p, etc.)
4. Statistic
Please give an example for parameter and statistics
Does a parameter vary?
Does a statistic vary?
Sampling Error
5. Sampling Error
Difference between observed value and true value
5. Sampling Error
1) Systematic error (fixed)
2) Measurement error (random)
3) Sampling error (random)
Sampling error
• The statistics different from the parameter!
• The statistics of different samples from same
population different each other!
Sampling error
The sampling error exists in any sampling research
It cannot be avoided but may be estimated
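A small simulation can make sampling error visible. The sketch below assumes a hypothetical population of 10,000 heights (mean 170 cm, SD 10 cm) and samples of 50; none of these numbers come from the lecture.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so that the run is reproducible

# Hypothetical population; its mean plays the role of the parameter.
population = [rng.gauss(170, 10) for _ in range(10_000)]
mu = statistics.mean(population)


def sample_mean(size):
    """Draw a simple random sample without replacement and return its mean."""
    return statistics.mean(rng.sample(population, size))


# Five samples from the same population give five different statistics.
means = [sample_mean(50) for _ in range(5)]
errors = [m - mu for m in means]  # sampling error of each sample mean

for m, e in zip(means, errors):
    print(f"sample mean = {m:.2f}, sampling error = {e:+.2f}")
```

Each sample mean differs from the population mean and from the other sample means, which is the point made by the two bullets above.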
Nature of data
Variables and data
• Variables are labels whose values can literally vary
• Data are the values you get from observing,
measuring, counting, assessing, etc.
Data
• Categorical data: nominal data, ordinal data
• Metric data: discrete data, continuous data
Nominal or categorical data
• It can be allocated into one of a number of
categories
• Blood type, sex, Linezolid treatment (y/n)
• Data cannot be arranged in an ordering scheme
Ordinal categorical data
• It can be allocated to one of a number of categories
but it has to be put in meaningful order
• Differences cannot be determined or are meaningless
• Very satisfied, satisfied, neutral, unsatisfied, very
unsatisfied (new treatment)
Discrete metric data
• Countable variables → the number of possible values
is a finite number
• Numbers of days of hospitalization
• Numbers of men treated with isoniazid
Continuous metric data
• Measurable variables
• Infinitely many possible values → continuous
scale covering a range of values without gaps
• Kg, m, mmHg, years
Describing data…..
with tables
Describing data with tables
1) actual frequency
2) relative and cumulative frequency
3) grouped frequency
4) open- ended groups
5) cross-tabulation
1) Frequency table
Frequency distribution: variable (TB mortality) and frequency (No. of wards)
TB mortality (%)   Tally                        No. of wards
11.2-15.1          1, 1, 1, 1, 1, 1, 1, 1, 1    9
15.2-20.1          1, 1, 1, 1, 1, 1, 1, 1       8
20.2-25.1          1, 1, 1, 1, 1                5
25.2-30.1          1, 1, 1                      3
30.2-35.1          1                            1
2) Relative frequency, cumulative frequency
Relative frequency → proportion of the total
No. of resistances   No. of patients   Relative frequency (%)   Cumulative frequency (%)
0                    5                 12.5                     12.5
1                    6                 15                       27.5
2                    14                35                       62.5
3                    10                25                       87.5
4                    3                 7.5                      95
7                    1                 2.5                      97.5
8                    1                 2.5                      100
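The relative and cumulative percentages can be recomputed from the patient counts. The sketch below uses the counts from the slide; the variable names are chosen for illustration.

```python
from itertools import accumulate

# Counts taken from the table: number of resistances and number of patients.
resistances = [0, 1, 2, 3, 4, 7, 8]
patients = [5, 6, 14, 10, 3, 1, 1]

total = sum(patients)  # 40 patients in all

# Relative frequency: each count as a percentage of the total.
relative = [count * 100 / total for count in patients]

# Cumulative frequency: running total of the counts, as a percentage.
cumulative = [running * 100 / total for running in accumulate(patients)]

for row in zip(resistances, patients, relative, cumulative):
    print(*row, sep="\t")
```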
3) Grouped frequency
Grouped frequency → works for continuous metric data
Group width of 300 g; each group runs from the class lower limit to the class upper limit
Birth weight (g)   No. of infants born from mothers with TB
2700-2999          2
3000-3299          3
3300-3599          9
3600-3899          9
3900-4199          4
4200-4499          3
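Grouping raw values into classes of equal width can be sketched as follows. The individual birth weights are invented for illustration (the slide reports only the grouped counts), and grouped_frequency is a hypothetical helper that assumes whole-gram values.

```python
from collections import Counter


def grouped_frequency(values, lowest, width):
    """Count integer values per class, each class spanning `width` units."""
    # Map every value to the lower limit of the class it falls into.
    counts = Counter(lowest + ((v - lowest) // width) * width for v in values)
    # Label each class "lower-upper", as in the slide (e.g. 2700-2999).
    return {f"{lo}-{lo + width - 1}": counts[lo] for lo in sorted(counts)}


# Hypothetical birth weights in grams (not the lecture's raw data).
weights = [2750, 2980, 3010, 3150, 3290, 3300, 3599, 3600, 4200]
table = grouped_frequency(weights, lowest=2700, width=300)

for label, count in table.items():
    print(label, count)
```

Classes with no observations are simply left out of the result in this sketch.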
General rules
• Frequency table
nominal, ordinal and discrete metric data
• Grouped frequency table
continuous metric data
4) Open-ended group
• One or more values, called outliers, lie
far away from the general mass of the data
• Use ≤ or ≥
5) Cross-tabulation
• Two variables within a single group of individuals
                 TB/HIV+
Pulmonary mass   Yes   No    Totals
Benign           21    11    32
Malignant        4     4     8
Totals           25    15    40
Describing data…..
with charts
3. Describing data with charts
1) Charting nominal data
a) pie chart
b) simple bar chart
c) cluster bar chart
d) stacked bar chart
2) Charting ordinal data
a) pie chart
b) bar chart
c) dotplot
3) Charting discrete metric data
4) Charting continuous metric data: histogram
5) Charting cumulative ordinal or discrete metric data: step chart
6) Charting cumulative metric continuous data: cumulative frequency curve or ogive
7) Charting time-based data: time-series chart
1-a) Pie chart
• 4-5 categories
• One variable
• Start at 0° in the same order as the table
[Pie chart: Adverse events of ethionamide. Cough 55 (55%); Hepatitis 21 (21%); Rash 20 (20%); Neuropathy 4 (4%)]
1-b) Simple bar chart
• Same widths, equal spaces between bars
1-c) Clustered bar chart
1-d) Stacked bar chart
2-3) Dot-plot
Useful with ordinal variables if the number of
categories is too large for a bar chart
4) Histogram
[Histogram: percentage age distribution of pregnant women with TB. x-axis: age group (<19, 20-24, 25-29, 30-34, >35); y-axis: TB cases (%), 0 to 40]
6) Cumulative frequency curve
[Figure: percentage cumulative frequency curves of age for males and females who develop TB. x-axis: age group (15-24 to >85); y-axis: cumulative frequency (%), 0 to 100]
Describing data from its distributional shape
Symmetric mound-shaped distributions
Skewed distributions
[Figure: age distribution for migrants who develop TB. x-axis: age group (15-24, 25-34, 35-44, 45-54, 55-64, 65-74, 75-84, >85); y-axis: 0 to 160]
Bimodal distributions
A bimodal distribution is one with two distinct humps
Normal-ness
• Symmetric
• Same mean, median, mode
Describing data with numeric summary values
• 1. numbers, proportions (percentages)
• 2. summary measures of location
• 3. summary measures of spread
Numbers and proportions
• Numbers → actual frequencies
• Percentage is a proportion multiplied by 100
1) Prevalence
2) Incidence
Prevalence
- Nature: relative frequency
- Number of existing cases in some population at a given time
[Diagram: disease and health status in a population at t0]
Prevalence = (No. of existing cases of a disease at t0) / (total population)
Range: 0 to 1
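Population A (N=6): absolute frequency fa = 1, relative frequency fr = 0.17
Population B (N=4): absolute frequency fa = 1, relative frequency fr = 0.25
Absolute frequencies → no comparison
Relative frequencies → comparison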
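The comparison between populations A and B can be reproduced in a few lines. This is a minimal sketch; the function name prevalence is chosen for illustration.

```python
def prevalence(existing_cases, total_population):
    """Existing cases at t0 divided by the total population (between 0 and 1)."""
    if total_population <= 0:
        raise ValueError("total_population must be positive")
    return existing_cases / total_population


# Both populations have one existing case, but their sizes differ.
prev_a = prevalence(1, 6)  # population A, N = 6
prev_b = prevalence(1, 4)  # population B, N = 4

print(round(prev_a, 2), round(prev_b, 2))
```

The absolute frequency is the same in both populations, so only the relative frequency shows that the disease is more common in B.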
Prevalence
[Diagram: three populations with P = 0, P = 0.25 and P = 1; legend: disease, health]
Prevalence
Prevalence data:
- Highlight the time of the evaluation
Example:
P(2010) = 0.17
P(2010) = 17 per 100 individuals
Incidence
estimates the risk of developing disease
[Diagram: people at risk (healthy) at t0 are followed until t1; some move from health to disease]
Incidence
Incidence = (No. of new cases during the period t0-t1) / (total population at risk)
- Measures the probability or risk of developing disease during a given time period
- Absolute risk → probability of developing an adverse event
Incidence
- Assess the health status at baseline
→ exclude prevalent cases at t0
- Define a follow-up for the cohort
→ healthy people followed up for a given time period
Cohort
Closed Population
adds no new members over time, and loses
members only to disease/death
Open Population
may gain members over time, through
immigration or birth, or lose members through
emigration
Cumulative incidence
- Closed population
- Individual time period at risk → same period for all the members
[Diagram: people A to E, each followed from t0 (time 0) to t1 (time 3)]
Cumulative incidence
Cumulative incidence = (No. of new cases during the period t0-t1) / (total population at risk)
Cumulative incidence
Example: population at risk at t0 = 24; new cases = 3; follow-up = 3 years
CI in 3 years = 3/24 = 0.125 new cases per individual at risk enrolled at t0
→ 12.5 new cases per 100 individuals at risk enrolled at t0
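The cumulative incidence example can be checked with a short calculation. This is a sketch only; the function name cumulative_incidence is chosen for illustration and assumes a closed population in which everyone is followed for the same period.

```python
def cumulative_incidence(new_cases, at_risk_at_t0):
    """New cases during t0-t1 divided by the population at risk at t0."""
    if at_risk_at_t0 <= 0:
        raise ValueError("at_risk_at_t0 must be positive")
    return new_cases / at_risk_at_t0


# Values from the example: 24 people at risk at t0, 3 new cases in 3 years.
ci = cumulative_incidence(new_cases=3, at_risk_at_t0=24)

print(ci)        # new cases per individual at risk
print(ci * 100)  # new cases per 100 individuals at risk
```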
Cumulative incidence…critical features
- Closed populations are rare
- Short follow-up and enrollment of a few individuals
- Open population
Open population
-Non cases (drop-out) and cases during the follow-up
- Enrollment of new individuals during the follow-up
- Length of follow-up not uniform
Open population
[Diagram: people A to I followed between t0 and t1 along the time axis, with follow-up periods of different lengths; markers show drop-outs and cases]
Dynamic cohort
Individual time period at risk not uniform
→ Estimate the population at risk:
- Total person-time
- Estimate of the total person-time
Dynamic cohort
Total person-time = Σ individual time periods at risk
Person-time: person-days, person-months, person-years
Density of incidence
Density of incidence = (No. of new cases during the period t0-t1) / (total person-time)
Density of incidence
N             Individual time period at risk (years)   Calculation             Person-years
1 (A)         5                                        1 person x 5 years      5 person-years
3 (B, C, D)   2                                        3 persons x 2 years     6 person-years
2 (E, F)      2.5                                      2 persons x 2.5 years   5 person-years
2 (G, H)      1.5                                      2 persons x 1.5 years   3 person-years
1 (I)         3                                        1 person x 3 years      3 person-years
Total person-time                                                              22 person-years
Density of incidence
Density of incidence = 1 new case / 22 person-years ≈ 0.045 new cases per person-year
→ about 45 per 1000 person-years
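The person-time total and the density of incidence can be recomputed from the table above. The sketch below uses the group sizes and follow-up times given there; the variable names are chosen for illustration.

```python
# (number of people, years at risk per person), one tuple per table row:
# A; B, C, D; E, F; G, H; I
groups = [(1, 5), (3, 2), (2, 2.5), (2, 1.5), (1, 3)]

# Total person-time is the sum of the individual time periods at risk.
total_person_years = sum(people * years for people, years in groups)

new_cases = 1
density = new_cases / total_person_years  # new cases per person-year
per_1000 = density * 1000                 # new cases per 1000 person-years

print(total_person_years, round(density, 3), round(per_1000, 1))
```

Without intermediate rounding the rate is about 45.5 per 1000 person-years; the figure of 45 follows from rounding the density to 0.045 first.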
Open population
Estimate of the total person-time
→ Individual time period at risk not known for all
- Migration
Assumption: movements in and out of the cohort occur in the middle of the follow-up
Estimate of the total person-time
(P0 + Pt)/2 x follow-up
Estimate of the total person-time
At t0: 100 people
Follow-up: 3 years
New cases: 3
Drop-out: 17
Enrollment during the follow-up: 16
→ P0 = 100; Pt = 100 - 3 - 17 + 16 = 96
(P0 + Pt)/2 x follow-up
(100 + 96)/2 x 3 = 294 person-years
Estimate of the total person-time
At t0: 100 people
Follow-up: 3 years
New cases: 3
Drop-out: 17
Enrollment during the follow-up: 16
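Test the estimate:
People followed for the whole period: 100 - 3 - 17 = 80
80 people x 3 years = 240 person-years
Movement of the cohort (drop-outs, cases and new enrollments each count for half the follow-up, 1.5 years):
(17 x 1.5) + (3 x 1.5) + (16 x 1.5) = 54 person-years
240 + 54 = 294 person-years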
Incidence rate
Incidence rate = (No. of new cases during the period t0-t1) / (estimate of total person-time)
3 new cases / 294 person-years x 1000 = 10.2 new cases per 1000 person-years
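The estimate of person-time and the resulting incidence rate can be reproduced as follows. This is a sketch using the numbers of the example; the function name estimated_person_time is chosen for illustration.

```python
def estimated_person_time(p0, pt, follow_up):
    """Approximate person-time as the mean of the start and end populations
    multiplied by the length of follow-up."""
    return (p0 + pt) / 2 * follow_up


# Values from the example.
p0 = 100
new_cases, drop_outs, enrolled = 3, 17, 16
follow_up_years = 3

# Population at the end of follow-up: cases and drop-outs leave, enrollees join.
pt = p0 - new_cases - drop_outs + enrolled

person_years = estimated_person_time(p0, pt, follow_up_years)
rate_per_1000 = new_cases / person_years * 1000

print(pt, person_years, round(rate_per_1000, 1))
```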
Summary measures of location
1) mode: the category or value that occurs most often (typicalness).
Categorical and discrete metric data
2) median: the middle value when the values are in ascending order (central-ness).
Ordinal and metric data
3) mean (average): the sum of the values divided by the
number of values
4) percentiles: the values that divide the ordered data into 100
equal-sized groups.
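The four measures can be computed with Python's standard statistics module. The data below are hypothetical days of hospitalization, invented for illustration; note that software packages use slightly different conventions for percentiles.

```python
import statistics

# Hypothetical days of hospitalization for seven patients (right-skewed).
days = [2, 3, 3, 4, 5, 7, 18]

mode = statistics.mode(days)      # most frequent value
median = statistics.median(days)  # middle value of the ordered data
mean = statistics.mean(days)      # sum of the values / number of values

# 99 cut points that divide the ordered data into 100 groups.
percentiles = statistics.quantiles(days, n=100)
p50 = percentiles[49]             # the 50th percentile

print(mode, median, mean, p50)
```

The single long stay of 18 days pulls the mean above the median, which is why the median is often preferred for markedly skewed data.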
Choosing the most appropriate measure
                    Mode   Median                       Mean
Nominal             yes    no                           no
Ordinal             yes    yes                          no
Metric discrete     yes    yes, when markedly skewed    yes
Metric continuous   yes    yes, when markedly skewed    yes
Summary measures of spread
• Range
→ distance from the smallest value to the largest
• IQR (interquartile range)
→ spread of the middle half of the values
• Boxplot
→ graphical summary of the three quartile values,
the minimum and maximum values, and outliers.
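Range and IQR can be computed as sketched below. The data are hypothetical, and the quartiles use the default method of statistics.quantiles; other software may give slightly different quartile values.

```python
import statistics

# Hypothetical days of hospitalization for seven patients.
days = [2, 3, 3, 4, 5, 7, 18]

# Range: distance from the smallest value to the largest.
value_range = max(days) - min(days)

# The three quartiles split the ordered data into four groups.
q1, q2, q3 = statistics.quantiles(days, n=4)

# IQR: spread of the middle half of the values.
iqr = q3 - q1

print(value_range, q1, q2, q3, iqr)
```

The range is dominated by the single value of 18, while the IQR is not, which makes the IQR the more robust of the two.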
Standard deviation
• Average distance of all the data values from
the mean value
• The smaller the average distance is, the
narrower the spread, and vice versa
• Used with metric data only
1. Subtract the mean from each of
the n values in the sample, to give
the difference values
2. Square each of these differences
3. Add these squared values together
(sum of squares)
4. Divide the sum of squares by 1 less
than the sample size (n-1)
5. Take the square root
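The five steps can be followed line by line in code. This is a sketch with hypothetical data; the function name sample_sd is chosen for illustration, and the result is compared with statistics.stdev from the standard library.

```python
import math
import statistics


def sample_sd(values):
    """Sample standard deviation, computed with the five steps listed above."""
    n = len(values)
    mean = sum(values) / n
    differences = [v - mean for v in values]  # step 1: subtract the mean
    squared = [d * d for d in differences]    # step 2: square each difference
    sum_of_squares = sum(squared)             # step 3: add the squares
    variance = sum_of_squares / (n - 1)       # step 4: divide by n - 1
    return math.sqrt(variance)                # step 5: take the square root


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical metric data
sd = sample_sd(data)

print(round(sd, 3), round(statistics.stdev(data), 3))
```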
Standard deviation and the normal
distribution
The Basic Steps of Statistical Work
1. Design of study
• Professional design:
Research aim
Subjects,
Measures, etc.
• Statistical design:
Sampling or allocation method,
Sample size,
Randomization,
Data processing, etc.
2. Collection of data
• Source of data
Government report system
Registration system
Routine records
Ad hoc survey
• Data collection → complete, timely, accurate
Protocol: Place, subjects, timing;
training; pilot; questionnaire;
instruments; sampling method and
sample size; budget
Procedure: observation, interview
filling form, letter
telephone, web
3. Data Sorting
• Checking
By hand or with computer software
• Amend
• Missing data?
• Grouping
According to categorical variables (sex,
occupation, disease…)
According to numerical variables (age, income,
blood pressure …)
4. Data Analysis
• Descriptive statistics (show the sample)
-- mean, incidence rate …
-- Table and plot
• Inferential statistics (towards the population)
-- Estimation
-- Hypothesis test (comparison)
Definition of Selection Bias
Selection bias:
Selection biases are distortions that result from
procedures used to select subjects and from factors
that influence study participation. The common
element of such biases is that the association
between exposure and disease is different for those
who participate and those who should be
theoretically eligible for study, including those who
do not participate.
Definition of Selection Bias
It is sometimes (but not always) possible to
disentangle the effects of participation from
those of disease determinants using standard
methods for the control of confounding. One
example is the bias introduced by matching in
case-control studies.
Definition of Confounding
Confounding:
bias in estimating an epidemiologic measure
of effect resulting from an imbalance of other
causes of disease in the compared groups.
(mixing of effects)
Characteristics of a Confounder
• associated with disease (in non-exposed)
• associated with exposure (in source population)
• not an intermediate cause