Descriptive Statistics
F. Farrokhyar, MPhil, PhD, PDoc
Department of Surgery
Department of Clinical Epidemiology and Biostatistics
March 18, 2009
Objectives
To understand and recognize different types of variables
To learn how to explore your data
◙ How to display data with numbers and tables
◙ How to display data using graphs
To understand the fundamental concept of variability
To learn the notion of the distribution of a variable
Why and how are statistics relevant to medicine?
Prevention – What causes a disease?
Diagnosis – What symptoms and signs do patients with a given
disease present with?
Treatment – What treatments are effective for a given disease and
for which patients?
Prognosis – How will specific patients with a given disease fare in
the long term?
Statistics – Why do we need it?
[Figure: pyramid of scattered letters]
Descriptive and Inferential statistics?
Descriptive statistics are concerned with the
presentation, organization, and
summarization of data
Inferential statistics allow us to
generalize from a sample to a larger
group of subjects.
What is data?
Data is collected for some purpose, and each
piece of collected information has a meaning in some
context.
Data is a set of information or observations about a
group of individuals or subjects.
This information is organized in the form of variables.
A variable is any characteristic of a person or a
subject that can be measured or categorized and its
value varies from individual to individual.
Dependent and Independent Variables?
Dependent variable
Is the outcome of interest, which changes in response to some
intervention or exposure.
mortality, survival, post-op pain, quality of life, post-op
complications
Independent variable
Is the explanatory variable that explains the changes in the
dependent variable
demographics (age, gender, height), risk factors (diabetes, CAD)
Is the intervention or exposure that causes the changes in the
dependent variable.
drug, surgery, radiation, smoking …
Types of variables …?
Qualitative or attribute variable
Categorical variables…
Nonnumeric
gender, severity of injury, type of injury, tumour grade
Quantitative variable
Numeric
Discrete variable can assume only whole numbers: number of
accidents, number of injuries, pain score
Continuous variable may take any value, within a defined range:
weight, height, age, blood pressure, level of cholesterol, pain
score
Level of measurement …
There are four levels of measurement:
◙ Nominal (qualitative/categorical)
◙ Ordinal (qualitative/categorical)
◙ Interval (quantitative/numeric)
◙ Ratio (quantitative/numeric)
Level of measurement … cont’d
Variable type    Assumptions
◙ Nominal        Named categories
◙ Ordinal        Same as nominal plus ordered categories
◙ Interval       Same as ordinal plus equal intervals
◙ Ratio          Same as interval plus meaningful zero
Level of measurement … cont’d
A nominal variable: consists of named categories,
with no implied order among the categories.
- gender, mortality (dichotomous or binary)
- type of injury, type of fracture, blood type
An ordinal variable: consists of ordered categories,
where the differences between categories cannot be
considered to be equal.
- Tumour stage – I, II, III, IV; tumour grade – I, II, III, IV
- Likert scale – excellent, very good, good, fair, poor
Level of measurement … cont’d
An interval variable: has equal distances between
values with no meaningful ‘zero’ value.
- IQ test (the differences between numbers are meaningful
but the ratios between them are not)
A ratio variable: has equal intervals between values
and a meaningful zero point. The ratio between them
makes sense.
- height, weight, laboratory test values, age
For example
Primary objective: To compare the post-operative pain
between laparoscopic and open surgery in
patients with colorectal cancer
Secondary objective: To compare the post-operative
complications between laparoscopic and
open surgery in patients with colorectal
cancer
Independent (explanatory) variables: age, sex, pre-op pain severity
Independent (comparison) variable: type of surgery (laparoscopic vs. open)
Dependent/outcome variables: changes in pain, complications
Data Editing
Validity edits: Ensure that:
◘ essential fields have been completed and there is no
missing information
◘ specified units of measure have been properly used and
the measurements are within the acceptable range.
Duplication edits: Ensure that each case/patient has been
entered into the database only once.
Statistical edits: Identify and double-check all extreme
values, suspicious data and outliers.
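The three kinds of edit can be sketched in a few lines of Python. This is only an illustration: the records, the field names and the acceptable age range are assumptions made up for the example, not part of the lecture.

```python
# Hypothetical patient records; field names and values are illustrative assumptions.
records = [
    {"id": 1, "age": 64, "pain_score": 3},
    {"id": 2, "age": None, "pain_score": 5},   # missing essential field
    {"id": 3, "age": 640, "pain_score": 2},    # outside the acceptable range
    {"id": 1, "age": 64, "pain_score": 3},     # same patient entered twice
]
AGE_RANGE = (18, 110)  # assumed acceptable range, in years

# Validity edit 1: essential fields must not be missing.
missing = [r["id"] for r in records if r["age"] is None]

# Validity edit 2: measurements must lie within the acceptable range.
out_of_range = [
    r["id"] for r in records
    if r["age"] is not None and not AGE_RANGE[0] <= r["age"] <= AGE_RANGE[1]
]

# Duplication edit: each patient id should appear only once.
seen, duplicates = set(), []
for r in records:
    if r["id"] in seen:
        duplicates.append(r["id"])
    seen.add(r["id"])

print("missing:", missing)
print("out of range:", out_of_range)
print("duplicates:", duplicates)
```

Flagged records would then be checked against the source documents rather than silently corrected or deleted.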
Descriptive Statistics
… are a means of organizing and summarizing observations.
We examine variables in order to describe their main features.
These basic strategies help us organize our exploration of
a set of data:
◙ Begin by examining each variable.
◙ Examine the distribution of each variable by creating
frequency tables, numerical summaries and graphs.
◙ Study the relationships between the variables.
Examining Distributions: Categorical …
Numbers
Frequencies (counts), cumulative frequencies
Relative frequencies (%), cumulative relative
frequencies (%)
Graphs
Bar charts
Pie charts
Cross-tabulation of categorical data
Severity of disease
Value    Frequency    Percent    Valid Percent    Cumulative Percent
0                7       23.3             23.3                  23.3
1               13       43.3             43.3                  66.7
2               10       33.3             33.3                 100.0
Total           30      100.0            100.0
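As a rough sketch, the percent and cumulative percent columns can be reproduced from the three counts in the table above (7, 13 and 10). The variable names are illustrative.

```python
# Counts per severity category, taken from the frequency table above.
counts = {0: 7, 1: 13, 2: 10}

total = sum(counts.values())

rows = []
cumulative = 0
for category, frequency in counts.items():
    cumulative += frequency
    percent = 100 * frequency / total              # relative frequency (%)
    cumulative_percent = 100 * cumulative / total  # cumulative relative frequency (%)
    rows.append((category, frequency, round(percent, 1), round(cumulative_percent, 1)))

for category, frequency, percent, cumulative_percent in rows:
    print(f"{category:>5} {frequency:>9} {percent:>7.1f} {cumulative_percent:>10.1f}")
print(f"Total {total:>9} {100.0:>7.1f}")
```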
Cross-tabulation of categorical data
Complications by type of surgery
                 Open                     Lap
Complications    Count   Column N %      Count   Column N %
No                  13        86.7%         11        73.3%
Yes                  2        13.3%          4        26.7%
Total               15       100.0%         15       100.0%
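The column percentages in the cross-tabulation can be checked from the four cell counts. A minimal sketch, using the counts shown above and illustrative names:

```python
# Cell counts from the cross-tabulation above: complications by type of surgery.
table = {
    "Open": {"No": 13, "Yes": 2},
    "Lap": {"No": 11, "Yes": 4},
}

column_percent = {}
for surgery, cells in table.items():
    column_total = sum(cells.values())  # 15 patients in each surgery group
    column_percent[surgery] = {
        outcome: round(100 * count / column_total, 1)
        for outcome, count in cells.items()
    }

for surgery, percents in column_percent.items():
    print(surgery, percents)
```

Each percentage is taken within a surgery group, which is what allows the complication rates of the two groups to be compared directly.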
Bar Charts
Bar charts …
A bar chart can be used to depict any level of
measurement (nominal, ordinal, interval, or ratio).
It is a series of separated bars (vertical or horizontal), one per
category.
Bars represent frequency (counts) or relative frequency
(percent or proportion) of each category.
A bar chart is also useful for showing data for more than
one group.
Pie Charts
Pie charts …
Used primarily for nominal and ordinal data.
Used to display relative frequency distribution.
The circle is divided proportionally using relative frequency
of each category.
A pie chart is useful for showing data for one group but it is
poorly suited to comparing two or more groups.
Examining Distributions: Quantitative …
Numbers
Measures of central tendency – mean, median, mode
Measures of variation around mean – variance, standard
deviation, standard error of mean
Measures of variation around median – percentiles, quintiles,
quartiles
Graphs
Histograms
Box plots (the five-number summary)
Measures of central tendency
Mean: sum of observations divided by number of
observations
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Median: is a midpoint of a distribution after
arranging all observations in order of size, from
smallest to largest.
Mode: most frequent value – the highest peak
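The three measures can be computed with Python's standard `statistics` module. The ages below are hypothetical values chosen for illustration, not data from the lecture.

```python
import statistics

# Hypothetical ages (years) for seven patients; not data from the lecture.
ages = [58, 61, 64, 64, 66, 70, 72]

mean_age = statistics.mean(ages)      # sum of observations / number of observations
median_age = statistics.median(ages)  # middle value after sorting
mode_age = statistics.mode(ages)      # most frequent value

print(mean_age, median_age, mode_age)
```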
Properties of mean …
It is used for interval or ratio data.
A set of data has only one mean.
All values are included in the computation.
It is the only measure of central tendency where the sum of
deviations of each value from the mean will always be zero:
$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$
The mean is a useful measure for comparing two or more sets of
data.
The mean is sensitive toward extreme values.
Properties of median …
It is used for ordinal, interval or ratio data.
There is a unique median for each data set.
The median is not necessarily equal to one of the sample
values.
It is resistant (insensitive) toward extreme values.
It is useful for summarising skewed data.
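A small sketch of the properties listed for the mean and the median: deviations from the mean sum to zero, one extreme value moves the mean, and the median stays put. The lengths of stay are hypothetical numbers chosen so the arithmetic is easy to follow.

```python
import statistics

# Hypothetical lengths of stay (days); not data from the lecture.
stays = [2, 4, 4, 5, 6, 7, 7]
# Same data with the largest value replaced by one extreme observation.
stays_with_extreme = stays[:-1] + [63]

mean_before = statistics.mean(stays)
mean_after = statistics.mean(stays_with_extreme)
median_before = statistics.median(stays)
median_after = statistics.median(stays_with_extreme)

# Deviations of each value from the mean always sum to zero.
sum_of_deviations = sum(x - mean_before for x in stays)

print("mean:", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
print("sum of deviations:", sum_of_deviations)
```

With these numbers the mean rises from 5 to 13 days while the median remains 5 days, which is why the median is preferred for skewed data.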
Measures of variation around mean
Variance: the average of the squares of the deviations of
the data from their mean
$\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Standard deviation:
square root of variance
$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
Standard error:
$\text{s.e.} = \frac{\sigma}{\sqrt{n}}$
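The formulas above translate directly into code. The sketch below uses hypothetical ages and the n - 1 denominator shown in the formulas, then compares the results with the standard `statistics` module, which uses the same definition.

```python
import math
import statistics

# Hypothetical ages (years); not data from the lecture.
ages = [58, 61, 64, 64, 66, 70, 72]
n = len(ages)
mean_age = sum(ages) / n

# Variance: sum of squared deviations from the mean, divided by n - 1.
variance = sum((x - mean_age) ** 2 for x in ages) / (n - 1)
# Standard deviation: square root of the variance.
sd = math.sqrt(variance)
# Standard error of the mean: standard deviation divided by sqrt(n).
se = sd / math.sqrt(n)

# Cross-check against the standard library implementations.
assert math.isclose(variance, statistics.variance(ages))
assert math.isclose(sd, statistics.stdev(ages))

print(f"variance = {variance:.3f}, sd = {sd:.3f}, s.e. = {se:.3f}")
```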
Properties of variance …
All values are used in the calculation.
The units are not the same as the data; they are the square of
the original units.
Properties of standard deviation …
The units are the same as the data.
It is used for the Empirical Rule.
For any symmetric, bell-shaped (approximately normal) distribution:
◘ About 68% of the observations will lie within 1 s.d. of the mean.
◘ About 95% of the observations will lie within 2 s.d. of the mean.
◘ About 99.7% of the observations will lie within 3 s.d. of the
mean.
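For an exactly normal distribution these proportions can be computed rather than memorized. A minimal sketch using `statistics.NormalDist` from the Python standard library; the results are roughly 68.3%, 95.4% and 99.7%.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

coverage = {}
for k in (1, 2, 3):
    # Probability that an observation lies within k standard deviations of the mean.
    coverage[k] = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"within {k} s.d.: {100 * coverage[k]:.1f}%")
```

Real data are only approximately normal, so observed proportions will differ somewhat from these values.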
The Empirical Rule
Measures of variation around median
Percentiles:
Arrange the observations from smallest to largest.
Divide into 100 equal parts;
for example, the 5th percentile of a distribution is the value
below which 5% of the observations fall and above which 95% fall.
Quartiles: 25th, 50th and 75th percentiles
Quintiles: 20th, 40th, 60th, and 80th percentiles
Deciles: 10th, 20th, 30th, 40th, 50th, …, 90th percentiles
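Quartiles, quintiles and deciles are all cut points of the sorted data, so one function covers them. The sketch below uses `statistics.quantiles` with hypothetical ages. Statistical packages use slightly different interpolation rules, so values from other software may differ a little.

```python
import statistics

# Hypothetical ages (years); not data from the lecture.
ages = [40, 52, 58, 59, 61, 64, 66, 69, 70, 75, 88]

quartiles = statistics.quantiles(ages, n=4)   # 25th, 50th, 75th percentiles
quintiles = statistics.quantiles(ages, n=5)   # 20th, 40th, 60th, 80th percentiles
deciles = statistics.quantiles(ages, n=10)    # 10th, 20th, ..., 90th percentiles

print("Quartiles:", quartiles)
print("Quintiles:", quintiles)
print("Deciles:  ", deciles)
```

The second quartile and the fifth decile are both the median, so they coincide.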
Statistics: Age
N               Valid       30
                Missing     0
Mean                        63.87
Std. Error of Mean          1.494
Median                      64.00
Mode                        58 (a)
Std. Deviation              8.182
Variance                    66.947
Range                       38
Minimum                     44
Maximum                     82
Percentiles     25          58.75
                50          64.00
                75          69.50
a. Multiple modes exist. The smallest value is shown.
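Several entries in this output are linked by the definitions given earlier, which makes a quick consistency check possible. A small sketch using only the figures reported for age; the tolerances allow for rounding in the printed output.

```python
import math

# Values copied from the summary statistics for age above.
n = 30
sd = 8.182
variance = 66.947
std_error = 1.494
minimum, maximum, value_range = 44, 82, 38

# Variance is the square of the standard deviation.
variance_ok = math.isclose(sd ** 2, variance, abs_tol=0.01)
# Standard error of the mean is the standard deviation divided by sqrt(n).
std_error_ok = math.isclose(sd / math.sqrt(n), std_error, abs_tol=0.001)
# Range is maximum minus minimum.
range_ok = (maximum - minimum) == value_range

print(variance_ok, std_error_ok, range_ok)
```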
Histogram
Histograms …
Used for interval and ratio data.
A histogram is a graph in which each bar spans a range of
values on the horizontal axis, called the interval width. The
vertical axis represents the frequency of each interval.
There are no spaces between bars.
A histogram is useful for graphic illustration of one group.
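Behind every histogram is a simple count of observations per interval. The sketch below bins hypothetical ages into 10-year intervals and prints a text histogram. The data, the interval width and the starting point are all assumptions chosen for illustration.

```python
# Hypothetical ages (years); not data from the lecture.
ages = [44, 52, 55, 58, 58, 59, 61, 63, 64, 64, 65, 67, 69, 70, 72, 75, 82]

bin_width = 10   # interval width, chosen for illustration
bin_start = 40   # lower edge of the first interval

# Count how many observations fall in each interval [lower, lower + width).
frequencies = {}
for age in ages:
    lower = bin_start + ((age - bin_start) // bin_width) * bin_width
    frequencies[lower] = frequencies.get(lower, 0) + 1

# Text histogram: one row per interval, bar length equals the frequency.
for lower in sorted(frequencies):
    print(f"{lower}-{lower + bin_width - 1}: {'#' * frequencies[lower]}")
```

Changing `bin_width` changes the apparent shape of the distribution, so it is worth trying more than one width.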
Box plot: 5 – number summary
[Figure: annotated box plot marking the 1st and 100th percentiles, Q1, median (Q2), Q3, whiskers, inner fences and outliers, with Range = Max - Min and IQR = Q3 – Q1]
Box plot of change in pain score
Box Plots …
Used for interval and ratio data.
Uses the five-number summary measures:
median, Q1, Q3, minimum and maximum.
It is useful in detecting outliers.
It is useful to illustrate the distribution of more than
one group.
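The five numbers that a box plot draws can be obtained directly. A minimal sketch with hypothetical changes in pain score; quartile conventions vary between packages, so other software may give slightly different Q1 and Q3.

```python
import statistics

# Hypothetical changes in pain score; not data from the lecture.
changes = [-5, -4, -4, -3, -3, -2, -2, -1, 0, 1, 6]

# quantiles() with n=4 returns the three quartiles Q1, Q2 (median) and Q3.
q1, median, q3 = statistics.quantiles(changes, n=4)
five_number_summary = (min(changes), q1, median, q3, max(changes))

print("min, Q1, median, Q3, max =", five_number_summary)
```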
What are outliers … ?
Outliers are extreme data values that fall outside
the overall pattern of the distribution of the data set.
1.5 IQR Criterion for Outliers
Interquartile range (IQR) is the distance between the
first and third quartiles. IQR = Q3 – Q1
From data
Q1 = 59 yrs, Q3 = 70 yrs,
IQR = 70 – 59 = 11
1.5 × IQR = 1.5 × 11 = 16.5
Lower inner fence: Q1 – 1.5 × IQR = 59 – 16.5 = 42.5
Upper inner fence: Q3 + 1.5 × IQR = 70 + 16.5 = 86.5
From data: Min = 44 and Max = 82, both inside the fences, so no
observation is flagged as an outlier.
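The same check can be written as a short calculation. The sketch uses the quartiles and extremes given in the example above; the variable names are illustrative.

```python
# Quartiles and extremes for age, as given in the example above.
q1, q3 = 59, 70
minimum, maximum = 44, 82

iqr = q3 - q1                   # interquartile range
lower_fence = q1 - 1.5 * iqr    # values below this are flagged as outliers
upper_fence = q3 + 1.5 * iqr    # values above this are flagged as outliers

has_low_outlier = minimum < lower_fence
has_high_outlier = maximum > upper_fence

print(f"IQR = {iqr}, fences = ({lower_fence}, {upper_fence})")
print("outliers present:", has_low_outlier or has_high_outlier)
```

Because the smallest and largest ages both lie inside the fences, no observation needs to be tested individually.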
Properties of quartiles, quintiles…
They are used for interval or ratio data.
They are resistant (insensitive) to extreme values.
They are useful for summarising skewed data.
How to deal with skewed data
Transform the data:
Square/square root – (Poisson) count data
Log(x) or ln(x) – data skewed toward the right
Reciprocal (1/x) – data skewed toward the left
Transformation:
Makes skewed data more symmetric
Makes the distribution more normal
Stabilizes variability
Linearizes a relationship between two or more variables
Report summary statistics in the original units but analyse the transformed data
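The effect of a log transformation on right-skewed data can be illustrated numerically. The sketch below uses hypothetical positive measurements and a simple moment-based skewness coefficient. The helper `skewness` is written here for the example and is not a library function.

```python
import math

def skewness(values):
    """Moment-based skewness: third central moment over the 1.5 power of the second."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed measurements (all positive); not data from the lecture.
raw = [1, 2, 2, 3, 4, 5, 8, 15, 40]
logged = [math.log(x) for x in raw]  # natural log, ln(x)

skew_raw = skewness(raw)
skew_log = skewness(logged)

print(f"skewness before: {skew_raw:.2f}, after ln(x): {skew_log:.2f}")
```

A skewness closer to zero after the transformation indicates a more symmetric distribution. The log is only defined for positive values, so data containing zeros or negatives need a different approach.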
Summary of what we have learned ….
Always plot your data: make a graph, e.g. histogram, box plot
Look for overall pattern (shape, centre and spread) and for striking
deviations such as outliers
Check to see if the overall pattern of the distribution can be described by
a normal distribution.
If not, transform the data to make skewed data more symmetric
Calculate an appropriate numerical summary to describe centre and
spread