data_259_2007
Describing Data
Where does data come from?
• Chapter 2 of Van Belle has a nice description of
study types.
• In theory, if you can gather data on EVERYONE
of interest (e.g., all people with a disease), you
are not doing statistics, you are describing
parameters in the population.
• In reality, you only sample a fraction of the
population of interest. The people who could
have been included in your sample are called
the sampling frame.
Abusing a Sampling Frame
• Look VERY carefully at the selection criteria for a study.
• If you randomize enough people into drug and placebo
groups, you can find effects if they exist in the
population. Right? Wrong!
• If the sampling frame for a study does not include people
at risk or only includes people who are at far less risk
than the population in general, you cannot find
differences, regardless of randomization.
– People with high risk of cardiovascular problems should be kept
out of the study so the differences in the rate of cardiovascular
problems between the Vioxx patients and the others “would not
be evident”
• Mathews A, Martinez B (November 1, 2004) E-mails suggest Merck
knew Vioxx's dangers at early stage. Wall Street Journal.
Organizing Data
• When you collect data, you store it in a
grid/matrix where each row represents one
measurement time on one individual and
the columns represent different types of
information. You may have a column for
last name and another for CD4 count. The
values in the columns vary from row to row
(aka from record to record). Therefore, the
columns are called variables.
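The grid described above maps directly onto an R data frame. This is a minimal sketch; the names and CD4 values are hypothetical and only illustrate the "one row per measurement time per individual" layout.

```r
# Hypothetical records: one row per measurement time per individual.
visits <- data.frame(
  last_name = c("Garcia", "Garcia", "Lee", "Lee", "Lee"),
  visit     = c(1, 2, 1, 2, 3),
  cd4_count = c(350, 410, 520, 480, 505),
  stringsAsFactors = FALSE
)

print(visits)
print(nrow(visits))     # number of records (rows)
print(names(visits))    # the variables (columns)
```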
Types of Variables
• Computer programmers differentiate between
lots of different types of variables.
– They pay attention to the differences between whole
numbers vs. lots of decimals and single letters vs.
long strings of characters because they want to make
the columns use as little space as possible.
• Statistical programmers and statisticians think
about character variables (letters and words
which they call strings of letters) vs. categorical
factors vs. numeric variables because there are
some things you just don’t want to do to a bunch
of letters (like get an average).
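A small sketch of the three kinds of variables a statistical programmer cares about, using made-up values. It shows why the distinction matters: R will average numbers, count categories, and refuse to average letters.

```r
# All values below are hypothetical.
last_name <- c("Garcia", "Lee", "Okafor")                   # character strings
practice  <- factor(c("oncologist", "family practice",
                      "oncologist"))                        # categorical factor
cd4_count <- c(350, 520, 410)                               # numeric

avg_cd4 <- mean(cd4_count)      # a sensible average
print(avg_cd4)
print(table(practice))          # counting is what makes sense for categories

# Averaging letters is not meaningful: R returns NA and warns.
avg_name <- suppressWarnings(mean(last_name))
print(avg_name)
```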
Taxonomy of Variables
• In 1946 Stevens suggested a taxonomy of variable
types. Each type affords different summary
statistics and graphics.
– Nominal
• named categories
– Ordinal
• ordered categories but distances between categories are not equal
– Interval
• ordered categories with equal distance between the points
– Ratio
• continuous scale with meaningful ratios and a meaningful zero
• You will think a lot about nominal, ordinal and
continuous variables.
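Another Popular Taxonomy
(Diagram: variable types run from Categorical to Quantitative; each type adds a property to the one before it.)
– binary: 2 categories
– nominal: + more categories
– ordinal: + order matters
– discrete: + numerical
– continuous: + uninterrupted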
Describing Data
• For every variable you play with, you want
to know two things: its variability and its
central tendency.
• Never EVER use a numeric summary of
data without a plot. A good plot shows
you both the variability and central
tendency at once.
Same Mean, Different Variability
(Figure: three dot plots on a shared axis running from 11 to 21.)
– Data A: Mean = 15.5, S = 3.338
– Data B: Mean = 15.5, S = 0.926
– Data C: Mean = 15.5, S = 4.570
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
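The point of the slide can be checked numerically. In this sketch the three vectors are illustrative values, chosen as an assumption so that they reproduce the means and standard deviations printed on the slide; the points actually plotted on the slide are not recoverable from this transcript.

```r
# Illustrative data (an assumption): chosen to match the slide's summaries.
data_a <- c(11, 12, 13, 16, 16, 17, 18, 21)
data_b <- c(14, 15, 15, 15, 16, 16, 16, 17)
data_c <- c(11, 11, 11, 12, 19, 20, 20, 20)

datasets <- list(A = data_a, B = data_b, C = data_c)

# Same mean for all three sets ...
means <- sapply(datasets, mean)
# ... but very different sample standard deviations (N - 1 denominator).
sds <- sapply(datasets, sd)

print(round(rbind(mean = means, S = sds), 3))
```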
Central Tendency
• Mean
– The arithmetic mean is the “add up the values and
divide by N” formula (number of records). There are
other means!
• Median
– Order the data from low to high and take the middle
value or the average of the middle 2 values if you
have an even number of records.
• Mode
– The most frequently occurring value
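The three measures above can be sketched in R as follows. The data are the eleven values from the "Joys of Excel" slide; `stat_mode` is a helper written for this example, not a base R function.

```r
# Example values: the eleven numbers used on the "Joys of Excel" slide.
x <- c(13, 7, 5, 12, 9, 15, 6, 11, 9, 7, 12)

arithmetic_mean <- sum(x) / length(x)   # same as mean(x)
middle_value <- median(x)               # sorts, then takes the middle

# Base R's mode() reports the storage type, not the statistical mode,
# so count the values and keep every value tied for the highest count.
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}

print(arithmetic_mean)
print(middle_value)
print(stat_mode(x))
```

Note that these data have three values tied for most frequent, which is one reason the mode is rarely used on its own.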
Variability
• The actual values…
• Range
• Limits
• IQR
– Difference between 75th and 25th percentiles
• The absolute deviation
• The standard deviation/variance
Rat brain weights in 4 treatments (Original plot)
Rat brain weights in 4 treatments (alternate plot)
Bars show the mean and dots indicate each animal.
The Average Variability
• It frequently makes sense to use the mean to
describe the average value, but the average
deviation from the mean is always zero (give or take
rounding error) because the positive and negative
differences cancel. There are alternatives.
– First, calculate the differences between the observed
and mean values and then take the absolute value
(strip off the negative signs). The average of
those values is the mean absolute deviation.
– First, calculate the differences between the observed
and mean values and square these differences.
The average of those values is the variance
(for a sample, divide by N - 1 rather than N).
The Joys of Excel
Value        Difference      Absolute Difference   Squared Difference
13            3.363636364    3.363636364           11.31404959
7            -2.636363636    2.636363636            6.950413223
5            -4.636363636    4.636363636           21.49586777
12            2.363636364    2.363636364            5.58677686
9            -0.636363636    0.636363636            0.404958678
15            5.363636364    5.363636364           28.76859504
6            -3.636363636    3.636363636           13.2231405
11            1.363636364    1.363636364            1.859504132
9            -0.636363636    0.636363636            0.404958678
7            -2.636363636    2.636363636            6.950413223
12            2.363636364    2.363636364            5.58677686
Average
9.636363636  -3.22974E-16    2.694214876            9.32231405
Variance: 10.25454545
Standard Deviation: 3.2
(Callout on the average difference: all those lovely extra digits and still rounding error.)
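The spreadsheet columns can be reproduced in a few lines of R. This is a sketch using the eleven values from the slide; it also confirms that base R's `var()` and `sd()` divide by N - 1.

```r
x <- c(13, 7, 5, 12, 9, 15, 6, 11, 9, 7, 12)
n <- length(x)

deviation <- x - mean(x)          # signed differences from the mean
abs_dev   <- abs(deviation)       # strip off the negative signs
sq_dev    <- deviation^2          # squared differences

avg_dev      <- mean(deviation)         # zero, give or take rounding error
mean_abs_dev <- mean(abs_dev)           # average absolute difference
avg_sq_n     <- sum(sq_dev) / n         # divide by N
variance     <- sum(sq_dev) / (n - 1)   # divide by N - 1 (sample variance)
std_dev      <- sqrt(variance)

print(c(avg_dev = avg_dev, mean_abs_dev = mean_abs_dev,
        avg_sq_n = avg_sq_n, variance = variance, std_dev = std_dev))

# Base R's var() and sd() use the N - 1 denominator.
print(c(var = var(x), sd = sd(x)))
```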
Errr, ummm… Why the N-1?
• The denominator is actually the degrees of
freedom.
– It considers the fact that you have already included
one estimate (the mean) in the formula for the
variance. Basically, you bump up the estimated
variability a bit because you guessed on the mean.
– You use up one DF for every parameter estimate in a
formula.
– Why call it degrees of freedom? You can vary most of
the data going into a formula and still get the same
answer.
Why call it degrees of freedom?
• Say you have 5
numbers and the
mean is 10. What
must the total have
been? The sum is
50.
Degrees of freedom is the
sample size, N, minus the
number of parameters, P,
estimated from the data.
(Animation: the numbers are filled in one at a time and the total must stay at 50.)
5 + 12 + 8 + 14 + 11 = 50
We can freely vary 4 of the 5
numbers and still come up
with the same mean. The DF
on a mean with sample size
N is N - 1
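The same argument as a short R sketch, using the numbers from the slide: once the mean is fixed, four values can be anything, and the fifth is forced.

```r
# Five numbers with a mean of 10, as on the slide.
n <- 5
target_mean <- 10
required_total <- n * target_mean      # 50

free_values <- c(5, 12, 8, 14)         # any four numbers you like
last_value <- required_total - sum(free_values)   # forced by the mean

all_values <- c(free_values, last_value)
print(last_value)
print(mean(all_values))
```

Change any of the four free values and `last_value` adjusts to keep the mean at 10, which is why only N - 1 of the values are free.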
The Variance Formula
variance = sum of squares / degrees of freedom
or, if you prefer hieroglyphics…
$$ s^2 = \frac{\sum (y - \bar{y})^2}{n - 1} $$
A bar over a variable means the mean.
Secret Decoder Ring
• S² = Sample variance
• S = Sample standard dev.
• σ² = Population (true or theoretical) variance
• σ = Population standard dev.
• X̄ = Sample mean
• µ = Population mean
• IQR = interquartile range (middle 50%)
Nominal Data
• If a variable represents categories,
summarize with frequency counts.
• Graph it with a dot plot or bar graph.
• Pie charts are all bad. Waffle plots are better.
(Figure: data on the number of hospice referrals received from physicians after a visit by a hospice marketing nurse.)
Bar plots are not too good.
• Look at the ink-to-information ratio….
Three numbers are shown with LOTS of
ink.
Dot Plots in R
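# Read the spreadsheet (read.xls comes from the gdata package)
library(gdata)
hospice <- read.xls("C:\\Projects\\classes\\hrp223-2007\\hospice.xls")
# Enlarge the plotting symbols and text so the plot is readable on a slide
library(lattice)
trellis.par.set(list(fontsize = list(points = 20, text = 25)))
# Dot plot of the frequency count for each practice type
dotplot(table(hospice$Practice), xlim = c(-1, 21), xlab = "Frequency Count")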
(Dot plot output: Frequency Count, axis from 0 to 20, for oncologist, internal medicine and family practice.)
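If the spreadsheet is not at hand, the same idea can be tried with base R alone. In this sketch the practice names come from the slide but the counts are hypothetical, and `dotchart()` from base graphics stands in for lattice's `dotplot()`.

```r
# Hypothetical referral records; the practice names come from the slide,
# the counts do not.
practice <- c(rep("oncologist", 4),
              rep("internal medicine", 9),
              rep("family practice", 15))

counts <- table(practice)   # frequency count per category
print(counts)

# Base-graphics dot plot: one dot per category, very little ink.
dotchart(as.numeric(counts), labels = names(counts),
         xlim = c(0, 20), xlab = "Frequency Count", pch = 19)
```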
Bad Plots
• Pies are great for twisting the truth. The
false 3rd dimension makes the front piece
look bigger. I can’t tell if there is a
difference in the sizes. Rotating the pie
can affect your judgment of the piece
sizes.
NEVER trust a glossy pie.
Ordinal Data
• Summary tables can include cumulative percentages and similar plots.
• The data is ordered, so get your figure categories in the same order.
(Figure: Serum Samples in Each Trimester.)
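A sketch of a summary table with cumulative percentages for an ordinal variable. The trimester labels are hypothetical stand-ins for the serum sample data, which are not included in this transcript.

```r
# Hypothetical trimester labels for a handful of serum samples.
trimester <- c("second", "first", "third", "first", "second",
               "first", "third", "second", "first", "second")

# Listing the levels explicitly fixes the category order. Left alone, R
# sorts levels alphabetically, which scrambles labels such as
# low/medium/high.
trimester <- factor(trimester, levels = c("first", "second", "third"),
                    ordered = TRUE)

counts <- table(trimester)
percent <- 100 * prop.table(counts)
summary_table <- data.frame(
  trimester = names(counts),
  n = as.vector(counts),
  percent = as.vector(percent),
  cumulative_percent = cumsum(as.vector(percent))
)
print(summary_table)
```

Because the factor is ordered, tables and plots built from it keep the first, second, third sequence without any extra sorting.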
Interval and Ratio Data
• People automatically draw histograms to
describe data that is on a continuous
scale. Histograms show you the shape of
the empirical distribution but they do
nothing to convey things like the mean,
median or quantiles. They also have
issues where re-binning the data changes
perception.
Mean, median, mode?
The same data rendered by R and SAS
affords different interpretations about a
bimodal distribution, and good luck finding
the median or mean.
(Figure: Histogram of drug$BPChange; x-axis drug$BPChange from -10 to 50, y-axis Frequency from 0 to 12.)
Use Boxplots
(Annotated box plot, top to bottom:)
– upper fence = 75th percentile + 1.5 * IQR
– 75th percentile
– Median
– Mean
– 25th percentile
– lower fence = 25th percentile - 1.5 * IQR
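The pieces of a box plot can be computed by hand. This is a sketch with hypothetical measurements; note that `boxplot()` itself uses Tukey's hinges, which can differ slightly from the `quantile()` defaults on other data sets.

```r
# Hypothetical measurements with one unusually large value.
x <- c(0.45, 0.50, 0.55, 0.60, 0.62, 0.66, 0.70, 0.75, 0.80, 0.90, 1.70)

q1 <- unname(quantile(x, 0.25))
q3 <- unname(quantile(x, 0.75))
iqr <- q3 - q1                      # same as IQR(x)

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Points beyond the fences are drawn individually as outliers.
outliers <- x[x < lower_fence | x > upper_fence]

print(c(q1 = q1, median = median(x), q3 = q3, iqr = iqr))
print(c(lower_fence = lower_fence, upper_fence = upper_fence))
print(outliers)

boxplot(x, horizontal = TRUE, xlab = "x")
```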
Box Plots and Histograms: for
Continuous Variables
• To show the distribution (shape, center,
range, variation) of continuous variables,
use both box plots and histograms.
Histogram of SI
• Bins of size 0.1
• Note the “right skew”
(Figure: x-axis SI from 0.0 to 2.0, y-axis Percent from 0.0 to 25.0.)
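Box Plot: Shock Index
(Figure: box plot of SI in Shock Index Units, axis from 0.0 to 2.0, annotated top to bottom:)
– maximum (1.7)
– outliers: points beyond the upper “whisker”
– upper “whisker” limit: Q3 + 1.5 IQR = 0.8 + 1.5(0.25) = 1.175
– 75th percentile (0.8)
– median (0.66)
– 25th percentile (0.55)
– interquartile range (IQR) = 0.8 - 0.55 = 0.25
– minimum (or Q1 - 1.5 IQR)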
Histogram
• 100 bins (too much detail)
(Figure: x-axis SI from 0.0 to 2.0, y-axis Percent from 0.0 to 6.0.)
Histogram
• 2 bins (too little detail)
(Figure: x-axis SI from 0.0 to 2.0, y-axis Percent from 0.0 to 200.0.)
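The effect of re-binning is easy to try for yourself. In this sketch the data are simulated right-skewed values standing in for the shock index (an assumption; the real SI data are not in this transcript), and the same values are drawn with 2, 20 and 100 bins.

```r
set.seed(1)
# Hypothetical right-skewed data standing in for the shock index.
x <- rlnorm(500, meanlog = log(0.65), sdlog = 0.3)

# Same data, three different bin counts over the same range.
upper <- ceiling(max(x))
bin_counts <- c(2, 20, 100)
histograms <- lapply(bin_counts, function(k) {
  hist(x, breaks = seq(0, upper, length.out = k + 1), plot = FALSE)
})

# Draw the three versions side by side.
par(mfrow = c(1, 3))
for (i in seq_along(histograms)) {
  plot(histograms[[i]], main = paste(bin_counts[i], "bins"), xlab = "x")
}
```

Every panel contains the same 500 values; only the bin width changes what the eye sees.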
Box Plot: Shock Index
• Also shows the “right skew”
(Figure: SI in Shock Index Units, axis from 0.0 to 2.0.)
Box Plot: Age
• More symmetric
(Figure: AGE in Years, axis from 0.0 to 100.0, labeled with the maximum, 75th percentile, median, 25th percentile, minimum and interquartile range.)
Histogram: Age
• Not skewed, but not bell-shaped either…
(Figure: x-axis AGE (Years) from 0.0 to 100.0, y-axis Percent from 0.0 to 14.0.)
Numeric Summaries
• You can always calculate the mean,
median, mode and standard deviation on
continuous data but you don’t want to.
• The mean and standard deviation may not
be good descriptions of the data if you
have outliers, skewed data or a bimodal
distribution.
Leukemia Onset Age
• Say you are studying a disease whose age of onset is bimodal, like leukemia. You can describe it with a mean but you are not representing the data.
(Figure: density of leuk2, x-axis from 20 to 60, y-axis from 0.00 to 0.08, with the mean marked.)
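A simulated illustration of why the mean misrepresents bimodal data. The ages below are hypothetical (two clusters around 25 and 55), not the leuk2 data from the slide.

```r
set.seed(2)
# Hypothetical bimodal ages of onset: two clusters, around 25 and 55.
age <- c(rnorm(200, mean = 25, sd = 3), rnorm(200, mean = 55, sd = 3))

m <- mean(age)
# Share of observations within 5 years of the mean.
near_mean <- mean(abs(age - m) <= 5)

print(m)
print(near_mean)

plot(density(age), main = "Bimodal age of onset (simulated)")
abline(v = m, lty = 2)   # the mean sits in the gap between the peaks
```

The mean lands near 40, an age at which almost nobody in the simulated sample actually has onset.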
Density Function
• In theory, there is a continuous density
function that describes the pattern in the
histogram. The most famous is the bell
shaped curve but there are others that are
at least as important.
– Is the density shape Gaussian, skewed,
bimodal, exponential or something weirder?
– Does it contain outliers?
– Are there data points that don’t make sense?
Thoughts on Outliers
• Work like crazy to identify them.
• Do analyses with and without them and see if
the inferences change.
• If one data point changes the inferences and
you decide to exclude it, be sure to include the
value in your plots with a special plotting symbol.
• True outlier values bring Nobel prizes.
• Statistics based on ranks or percentiles are
relatively insensitive to outliers. The median
income for Washington state was $48,397 in
2000 but the mean was $96,200.
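A small sketch of how one extreme value moves the mean but barely touches the median. The incomes are hypothetical.

```r
# Hypothetical incomes (in dollars).
income <- c(31000, 38000, 42000, 47000, 52000, 58000, 66000)
# The same incomes plus one extreme value.
with_outlier <- c(income, 5000000)

print(c(mean = mean(income), median = median(income)))
print(c(mean = mean(with_outlier), median = median(with_outlier)))
```

Adding the single large income multiplies the mean many times over, while the median shifts only to the midpoint of the two middle values.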
Mean and SD
• The mean and the SD play a huge role in
statistics because they describe the normal
curve. Much more on this later, but…
• No matter what µ and σ are, the area between
µ - σ and µ + σ is about 68%; the area between µ - 2σ and µ + 2σ is about 95%; and the area
between µ - 3σ and µ + 3σ is about 99.7%. Almost
all values fall within 3 standard deviations of the mean.
68-95-99.7 Rule
68% of
the data
95% of the data
99.7% of the data
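The rule can be verified with the normal cumulative distribution function, `pnorm()`. The mean of 120 and SD of 15 in the second half are hypothetical values used only to show that the areas do not depend on them.

```r
# Area under the standard normal curve within k standard deviations.
k <- 1:3
area <- pnorm(k) - pnorm(-k)
print(round(100 * area, 1))

# The result does not depend on the particular mean and SD chosen.
mu <- 120       # hypothetical mean
sigma <- 15     # hypothetical standard deviation
area_shifted <- pnorm(mu + k * sigma, mean = mu, sd = sigma) -
  pnorm(mu - k * sigma, mean = mu, sd = sigma)
print(round(100 * area_shifted, 1))
```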
Huff – How to Lie with Statistics
• Worry about stretched or broken/split axes.
• If people use “images” to display numbers, they are
trying to exaggerate. They increase the vertical height
of the image to match the number, but what the eye sees is the increase in AREA.
• Nobody would use areas to show a one-dimensional
measurement like size. Nobody would design a program
that represents data like this. Right? Nobody…
…except Microsoft.
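The arithmetic behind the exaggeration, as a sketch with hypothetical numbers. It assumes the picture keeps its aspect ratio, so its width grows along with its height.

```r
# Hypothetical values: the second number is twice the first.
values <- c(thing1 = 100, thing2 = 200)

# A pictogram scaled by height keeps its aspect ratio,
# so its width grows by the same factor as its height.
height_ratio <- values[["thing2"]] / values[["thing1"]]
area_ratio <- height_ratio^2

print(height_ratio)   # how much bigger the number is
print(area_ratio)     # how much bigger the picture looks
```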
• Expect lies when you see 3D effects on plots or pie charts. Exploded pie charts are great for lying.
• Area/bubble charts are GREAT for hiding differences.
(Figure: chart of thing1 to thing4 with a y-axis running from 8400000000 to 10000000000.)
• Read William Cleveland's books Visualizing Data and The Elements of Graphing Data.
Trust nothing you can’t see.
• If a study has a clinically interesting effect
with a statistically interesting p-value, it
had better have a clear graphic!
– Lots more on p-values later.
• A good graphic will show the effect with a
point estimate (mean, for example) and
the variation (standard deviation).