Transcript Chapter 2
Stat 281: Ch. 2--Presenting Data
An engineer, consultant and statistician were driving down a steep
mountain road. Suddenly, the brakes failed and the car careened
down the road out of control. But half way down, the driver
somehow managed to stop the car by running it against the
embankment, narrowly avoiding going over a very steep cliff. They
all got out, shaken, but otherwise unharmed.
The consultant said: "To fix this problem we need to organize a
committee, have meetings, write several interim reports and
develop a solution through a continuous improvement process."
The engineer said: "No! That would take too long, and besides
that method has never really worked. I have my trusty penknife
here and will take apart the brake system, isolate the problem and
correct it."
The statistician said: "No - you're both wrong! Let's all push the
car back up the hill and see if it happens again. We only have a
sample size of 1 here!!"
Fizzy Cola Sales
(Showing first 8 of 50)
Employee   Gallons Sold
P.P.        95.00
S.M.       100.75
P.T.       126.00
P.U.       114.00
M.S.       134.25
F.K.       116.75
L.Z.        97.50
F.E.       102.25
The Goal
Display data in ways that elucidate
the information contained in them
Raw Data actually contains all the
information available, but it may not
be easy to understand
It’s not so much the information
available that counts—it’s the
information you get out!
Ranked Fizzy Cola Sales
Rank  Empl.  Gal. Sold     Rank  Empl.  Gal. Sold
1     T.T.    82.50        43    R.O.   133.25
2     A.D.    88.50        44    M.S.   134.25
3     E.I.    91.00        45    O.U.   135.00
4     A.S.    93.25        46    G.H.   135.50
5.5   P.P.    95.00        47    R.T.   136.00
5.5   E.Y.    95.00        48    A.T.   137.00
7     L.Z.    97.50        49    O.O.   144.00
8     T.N.    99.50        50    R.N.   148.00
Viewing Data Directly
Ranked Data (aka an Array)
– Still contains all the information
– Can quickly see range (max and min)
– May also easily determine median, quartiles, etc.
Stem and Leaf
– Arranges ranked data into chart-like form
Fizzy Cola Stem & Leaf
8 28
9 135579
10 0234556789
11 02344555667889
12 124455688
13 2345567
14 48
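The display above can be reproduced mechanically. Here is a minimal Python sketch, using only the 16 sales figures visible in the ranked table (ranks 1 to 8 and 43 to 50), so the rows for stems 10 to 12 are missing and the row for stem 13 is one leaf short. The function name `stem_and_leaf` is an illustrative choice, and dropping the decimals (rather than rounding) is an assumption that matches the leaves shown on the slide.

```python
from collections import defaultdict

# The 16 sales figures visible in the ranked table (ranks 1-8 and 43-50).
sales = [82.50, 88.50, 91.00, 93.25, 95.00, 95.00, 97.50, 99.50,
         133.25, 134.25, 135.00, 135.50, 136.00, 137.00, 144.00, 148.00]

def stem_and_leaf(values):
    """Map each stem (tens) to a string of leaves (units digit, decimals dropped)."""
    rows = defaultdict(str)
    for v in sorted(values):
        whole = int(v)                      # drop the decimal part
        rows[whole // 10] += str(whole % 10)
    return dict(rows)

plot = stem_and_leaf(sales)
for stem, leaves in plot.items():
    print(f"{stem:>2} | {leaves}")
```

The rows for stems 8, 9 and 14 match the slide exactly because all of those employees appear in the ranked table.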
More Complex Stem & Leaf
(MiniTab Style)
Stem-and-Leaf of C1 N=16
Leaf Unit=0.010
1 59 7
4 60 148
(5) 61 02669
7 62 0247
3 63 58
1 64 3
Dot Plot for Fizzy Cola Sales
Dot plots display vertically stacked dots
for each data value.
They tend to bring out any “clustering”
behavior in the data.
Stem & Leaf and Dot Plots begin to give
us a picture of the Distribution of Data.
Summarized Data
Frequency Tables
– Grouped or ungrouped
– Frequency Distribution
– Relative Frequency Distribution
Bar Graphs
– Histogram (Numeric Data Only)
Pie Charts
– Often used for Categorical Data
Fizzy Cola Frequency Table
Number of Employees in each Sales Range
Gallons Sold   Employees
80-90           2
>90-100         6
>100-110       10
>110-120       14
>120-130        9
>130-140        7
>140-150        2
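A relative frequency distribution divides each class count by the total. The sketch below does this in Python with the counts from the table above; the variable names are illustrative only.

```python
# Class labels and counts copied from the Fizzy Cola frequency table.
classes = ["80-90", ">90-100", ">100-110", ">110-120",
           ">120-130", ">130-140", ">140-150"]
counts = [2, 6, 10, 14, 9, 7, 2]

n = sum(counts)                        # total number of employees
relative = [c / n for c in counts]     # fraction of employees in each class

for label, c, r in zip(classes, counts, relative):
    print(f"{label:>9}  {c:>3}  {r:.2f}")
```

The relative frequencies sum to 1, and the largest class (>110-120) holds 28% of the employees.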
Histogram of Fizzy Cola Sales
Constructing a Histogram
1. Identify the high (H) and low (L) scores. Find the range.
Range = H - L.
2. Select a number of classes and a class width so that the
product is a bit larger than the range.
3. Pick a starting point a little smaller than L. Count from the
starting point by the width to obtain the class boundaries.
Observations that fall on class boundaries are placed into the
class interval to the right.
Note:
1. The class width is the difference between the upper- and
lower-class boundaries.
2. There is no best choice for class widths, number of classes,
or starting points.
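The three steps can be sketched in Python. For the Fizzy Cola data, H = 148.00 and L = 82.50, so the range is 65.5; seven classes of width 10 give a product of 70, a bit larger than the range, and 80 is a starting point a little below L. The function names are hypothetical, and only the 16 sales figures visible in the ranked table are counted. The boundary rule follows step 3 (boundary values go to the class on the right), which differs from the ">90-100" labelling in the frequency table.

```python
def class_boundaries(low, high, width, start):
    """Return boundaries start, start + width, ... that cover low..high."""
    assert start <= low, "start should be at or a little below the low score"
    bounds = [start]
    while bounds[-1] <= high:          # keep adding until the high score is inside
        bounds.append(bounds[-1] + width)
    return bounds

def class_counts(values, bounds):
    """Count values per class; a value on a boundary goes to the class on its right."""
    counts = [0] * (len(bounds) - 1)
    for v in values:
        for i in range(len(counts)):
            if bounds[i] <= v < bounds[i + 1]:
                counts[i] += 1
                break
    return counts

# The 16 sales figures visible in the ranked table (ranks 1-8 and 43-50).
sales = [82.50, 88.50, 91.00, 93.25, 95.00, 95.00, 97.50, 99.50,
         133.25, 134.25, 135.00, 135.50, 136.00, 137.00, 144.00, 148.00]

bounds = class_boundaries(low=82.50, high=148.00, width=10, start=80)
counts = class_counts(sales, bounds)
print(bounds)
print(counts)
```

Changing `width` or `start` gives a different but equally valid histogram, which is the point of the second note above.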
Terms Used With Histograms
Symmetrical: The sides of the distribution are
mirror images. There is a line of symmetry.
Uniform (rectangular): Every value appears with
equal frequency.
Skewed: One tail is stretched out longer than the
other. The direction of skewness is on the side of
the longer tail (Positively vs. negatively skewed).
J-shaped: There is no tail on the side of the class
with the highest frequency.
Bimodal: The two largest classes are separated by
one or more classes. Often implies two
populations are sampled.
Normal: The distribution is symmetric about the
mean and bell-shaped.
Bimodal Distribution
[Figure: histogram of Blood Test values (horizontal axis, 4.2 to 9.2) against Frequency (vertical axis, 0 to 15)]
Left-Skewed Distribution
Ages of Nuns
[Figure: histogram of Age (horizontal axis, 25 to 85) against Frequency (vertical axis, 0 to 200)]
Distribution of Categorical Data
Cars Sold in One Week
Day         Number Sold
Monday      15
Tuesday     23
Wednesday   35
Thursday    11
Friday      12
Saturday    42
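Charts of categorical data usually show each class as a share of the whole. A short Python sketch of that calculation for the table above follows; the dictionary name `cars_sold` is an illustrative choice.

```python
# Counts copied from the "Cars Sold in One Week" table.
cars_sold = {"Monday": 15, "Tuesday": 23, "Wednesday": 35,
             "Thursday": 11, "Friday": 12, "Saturday": 42}

total = sum(cars_sold.values())        # all cars sold during the week

# Share of the week's sales for each day, rounded to a whole percent.
percent = {day: round(100 * n / total) for day, n in cars_sold.items()}

for day, p in percent.items():
    print(f"{day:<10} {p:>3}%")
```

These rounded shares are the percentages that label the slices of a pie chart of this table.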
Basic Pie Chart
Cars Sold in One Week
[Figure: pie chart with slices Monday 11%, Tuesday 17%, Wednesday 25%, Thursday 8%, Friday 9%, Saturday 30%]
Pie Charts focus our attention on fractions
of the whole, especially for the largest classes.
Three-D Pie Chart
Cars Sold in One Week
[Figure: three-dimensional pie chart with slices Monday 11%, Tuesday 17%, Wednesday 25%, Thursday 8%, Friday 9%, Saturday 30%]
Three-D Pie Charts are “pretty” but can
also be used to distort the image.
Manipulating 3-D Pie Charts
Cars Sold in One Week
[Figure: the same three-dimensional pie chart, tilted and rotated; slices Monday 11%, Tuesday 17%, Wednesday 25%, Thursday 8%, Friday 9%, Saturday 30%]
Changing the angle or turning the pie
may affect our perception of size.
Bar Charts for Categorical Data
Cars Sold in One Week
[Figure: bar chart of cars sold by day, Monday through Saturday; vertical axis from 0 to 45]
(Bar charts for categorical data are drawn with bars
separated, while bars in histograms touch.)
Manipulating Bar Charts
Cars Sold in One Week
[Figure: the same bar chart, Monday through Saturday, with the vertical axis running from 10 to 40]
Cutting off the vertical axis distorts our
perception of the differences between bars.
Manipulating Bar Charts
Cars Sold in One Week
[Figure: the same bar chart with no vertical axis; bars labeled with their values: Monday 15, Tuesday 23, Wednesday 35, Thursday 11, Friday 12, Saturday 42]
Removal of labels on the vertical axis allows bars
to be stretched upward to hide the differences.
Hmmm…
It is proven that the celebration of
birthdays is healthy. Statistics show
that people who celebrate the most
birthdays become the oldest.
In earlier times, they had no
statistics, so they had to fall back on
lies. (Stephen Leacock)
Measures of Central Tendency
Statistics used to locate the middle of a set of
numeric data, or where the data is clustered.
The term average is often associated with all
measures of central tendency (book, not me).
The mode for discrete data is the value that
occurs with greatest frequency.
The modal class of a histogram is the class with
the greatest frequency.
A bimodal distribution has two high-frequency
classes separated by classes with lower
frequencies.
Summation Notation
$\sum_{i=1}^{5} i = 1 + 2 + 3 + 4 + 5 = 15$
$\sum_{i=1}^{5} i^2 = 1 + 4 + 9 + 16 + 25 = 55$
$\sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n$
The Mean
Mean: The “regular” average. The sum of all the
values divided by the total number of values.
The population mean, μ (lowercase Greek mu), is
the mean of all x values for the population. It
is a parameter of the distribution.
$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i = \frac{1}{N}(x_1 + x_2 + \cdots + x_N)$
We usually cannot measure μ but would like to
estimate its value.
The Sample Mean
The sample mean, x̄ (read x-bar), is the
mean of all x values for the sample. It is
a statistic.
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{1}{n}(x_1 + x_2 + \cdots + x_n)$
The mean can be greatly influenced by
outliers.
E.g. Bill Gates moves to town.
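A small Python sketch of the outlier effect. The income figures are invented purely for illustration (they are not from the slides), and `mean` is a hand-written helper that follows the formula above.

```python
def mean(values):
    """Sum of all the values divided by the number of values."""
    return sum(values) / len(values)

# Hypothetical annual incomes in a small town (made-up numbers).
incomes = [40_000, 50_000, 60_000, 70_000]
before = mean(incomes)

# One hypothetical billionaire moves to town.
after = mean(incomes + [1_000_000_000])

print(before, after)
```

A single extreme value moves the mean from 55,000 to over 200 million, even though four of the five residents are unchanged.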
Median
Median: The value of the data that occupies the middle
position when the data are ranked according to size.
The sample median (statistic) may be denoted by “x tilde”: $\tilde{x}$.
The population median (parameter), M (uppercase Greek
mu), is the data value in the middle of the population.
To find the median:
1. Rank the data.
2. Determine the depth of the median: $d(\tilde{x}) = \frac{n+1}{2}$
3. Determine the value of the median.
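The three steps can be written as a short Python sketch. It assumes the usual convention that when the depth ends in .5 (n is even), the median is halfway between the two middle values. The two test data sets are the small examples used later in this chapter.

```python
def median(values):
    """Median found by ranking the data and locating depth (n + 1) / 2."""
    ranked = sorted(values)                 # step 1: rank the data
    n = len(ranked)
    depth = (n + 1) / 2                     # step 2: depth of the median
    if depth.is_integer():                  # odd n: a single middle value
        return ranked[int(depth) - 1]       # depth is 1-based, lists are 0-based
    lower = ranked[int(depth) - 1]          # even n: average the two middle values
    upper = ranked[int(depth)]
    return (lower + upper) / 2

print(median([5, 7, 1, 3, 8]))   # n = 5, depth 3
print(median([2, 4, 5, 9]))      # n = 4, depth 2.5
```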
Mode
Mode: The mode is the value of x that
occurs most frequently.
Note: If two or more values in a sample are
tied for the highest frequency (number of
occurrences), there is no mode.
Midrange: The number exactly midway
between the lowest data value L and the
highest data value H. It is found by
averaging the low and the high values.
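Both definitions are easy to sketch in Python. The helper names are illustrative. Returning `None` for a tie follows the note above (no mode when two or more values share the highest frequency); other texts report every tied value instead.

```python
from collections import Counter

def mode(values):
    """Most frequent value, or None when the highest frequency is tied."""
    counts = Counter(values)
    top = max(counts.values())
    winners = [v for v, c in counts.items() if c == top]
    return winners[0] if len(winners) == 1 else None

def midrange(values):
    """Average of the lowest value L and the highest value H."""
    return (min(values) + max(values)) / 2

# Eight lowest Fizzy Cola sales from the ranked table.
lowest_eight = [82.50, 88.50, 91.00, 93.25, 95.00, 95.00, 97.50, 99.50]
print(mode(lowest_eight))          # 95.0 occurs twice
print(mode([2, 4, 5, 9]))          # every value occurs once, so no mode
print(midrange([82.50, 148.00]))   # L and H for all 50 employees
```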
Dispersion
How spread apart are the data?
Two populations with the same mean can
have very different distributions, so we
would like to measure spread somehow.
Range (max-min)
– Values in middle are ignored
– Dispersion of middle could be very different
Use the idea of deviation from the mean:
– MAD
– Variance
– Standard Deviation
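Deviations from the Mean
[Figure: x-values (vertical axis, 0 to 8) plotted against Observation Number (horizontal axis, 0 to 11), with a horizontal line at the mean and the deviations marked between each point and that line]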
Some example data
Obs     Data x
1       2
2       4
3       5
4       9
Total
Calculate the mean
Obs     Data x   Mean x̄
1       2        5
2       4        5
3       5        5
4       9        5
Total   20
Deviation From the Mean
Obs     Data x   Mean x̄   Deviation x - x̄
1       2        5        -3
2       4        5        -1
3       5        5         0
4       9        5         4
Total   20       20        0
Mean Absolute Deviation (MAD)
Obs   Data x   Mean x̄   Deviation x - x̄   Absolute Deviation |x - x̄|
1     2        5        -3                 3
2     4        5        -1                 1
3     5        5         0                 0
4     9        5         4                 4
Sum of Absolute Deviations: 8
MAD (divide sum by n): 2
Formula
Mean Absolute Deviation $= \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}|$
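The MAD formula translates directly into Python. This sketch uses the example data 2, 4, 5, 9 from the table above; the function name is an illustrative choice.

```python
def mean_absolute_deviation(values):
    """Average distance of the values from their mean (divide by n)."""
    n = len(values)
    x_bar = sum(values) / n
    return sum(abs(x - x_bar) for x in values) / n

data = [2, 4, 5, 9]
mad = mean_absolute_deviation(data)
print(mad)   # absolute deviations 3, 1, 0, 4 sum to 8; 8 / 4 = 2.0
```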
Use of Squared Deviations
Obs   Data x   Mean x̄   Deviation x - x̄   Squared Deviation (x - x̄)^2
1     2        5        -3                 9
2     4        5        -1                 1
3     5        5         0                 0
4     9        5         4                 16
Sum of Squared Deviations, SS(x): 26
Variance (divide sum by n-1): 8.67
Standard Deviation (take square root): 2.94
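The same three numbers can be checked with a few lines of Python. This is a sketch for the example data 2, 4, 5, 9, treating it as a sample (dividing by n - 1).

```python
import math

data = [2, 4, 5, 9]
n = len(data)
x_bar = sum(data) / n                          # mean = 5.0

ss_x = sum((x - x_bar) ** 2 for x in data)     # sum of squared deviations
variance = ss_x / (n - 1)                      # sample variance
std_dev = math.sqrt(variance)                  # sample standard deviation

print(ss_x, round(variance, 2), round(std_dev, 2))
```

Note that the variance is in squared units of the data, while the standard deviation is back in the original units.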
Sums of Squares
The sum of squared deviations is
denoted by SS(x) and often called the
“Sum of Squares for x.”
There are also other notations used,
including SSx and Sxx
$SS(x) = \sum_{i=1}^{n} (x_i - \bar{x})^2$
Variance
The Variance is the statistician’s favorite
measure of dispersion, but in reports or
“everyday use” the standard deviation is
more commonly given.
The Standard Deviation is the square root
of the variance.
The Variance may be thought of as the
average squared deviation from the mean.
For a sample, divide by n-1.
For a population, divide by N.
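Formulas
$SS(x) = \sum (x - \bar{x})^2 = \sum x^2 - n\bar{x}^2 = \sum x^2 - \frac{(\sum x)^2}{n}$
Sample Variance: $s^2 = \frac{SS(x)}{n-1}$
Alternately: $s^2 = \frac{1}{n-1} \sum (x - \bar{x})^2 = \frac{\sum x^2 - n\bar{x}^2}{n-1} = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n-1}$
Sample Standard Deviation: $s = \sqrt{s^2}$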
Formulas
Population Variance: $\sigma^2 = \frac{1}{N} \sum (x - \mu)^2$
Population Standard Deviation: $\sigma = \sqrt{\frac{1}{N} \sum (x - \mu)^2}$
Example: Find the variance and standard deviation for
the data {5, 7, 1, 3, 8}.
$\bar{x} = \frac{1}{5}(5 + 7 + 1 + 3 + 8) = 4.8$
        x     x^2    x - x̄    (x - x̄)^2
        5     25      0.2      0.04
        7     49      2.2      4.84
        1      1     -3.8     14.44
        3      9     -1.8      3.24
        8     64      3.2     10.24
Sum    24    148      0       32.80
$s^2 = \frac{1}{4}\left(148 - \frac{24^2}{5}\right) = 8.2$
$s^2 = \frac{1}{4}\left(148 - 5(4.8)^2\right) = 8.2$
$s^2 = \frac{1}{4}(32.8) = 8.2$
$s = \sqrt{8.2} \approx 2.86$
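The three routes to SS(x) used in this example can be checked numerically. The Python sketch below computes each one for {5, 7, 1, 3, 8}; the variable names are illustrative, and the results agree only up to floating-point rounding.

```python
import math

data = [5, 7, 1, 3, 8]
n = len(data)
sum_x = sum(data)                        # 24
sum_x2 = sum(x * x for x in data)        # 148
x_bar = sum_x / n                        # 4.8

# Three equivalent ways to get the sum of squares SS(x).
ss_definition = sum((x - x_bar) ** 2 for x in data)   # sum of (x - x_bar)^2
ss_mean_form = sum_x2 - n * x_bar ** 2                # sum x^2 - n * x_bar^2
ss_sum_form = sum_x2 - sum_x ** 2 / n                 # sum x^2 - (sum x)^2 / n

variance = ss_sum_form / (n - 1)         # sample variance
std_dev = math.sqrt(variance)            # sample standard deviation

print(round(ss_definition, 2), round(ss_mean_form, 2), round(ss_sum_form, 2))
print(round(variance, 2), round(std_dev, 2))
```

The last form needs only the running totals of x and x^2, which is why it is convenient for hand calculation.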
Interpretation of s
Need to get a sense of the meaning
of different values of dispersion
measures.
Are units same as data or squared?
Empirical Rule: 68%, 95%, 99.7%
Test of Normality
Range as estimator of s
z-Scores
Also “standardized scores” or just
“standard scores.”
Expresses a quantity in terms of its
distance from the mean in standard
deviation units.
$z = \frac{\text{value} - \text{mean}}{\text{st. dev.}} = \frac{x - \bar{x}}{s}$
More z-Scores
The z-score measures the number of standard
deviations away from the mean.
z-scores typically range from -3.00 to +3.00.
z-scores may be used to make comparisons of
raw scores.
You can calculate back from z-score to raw data
value by using the inverse:
$z = \frac{x - \bar{x}}{s} \;\Rightarrow\; sz = x - \bar{x} \;\Rightarrow\; x = sz + \bar{x}$
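A short Python sketch of the z-score and its inverse. The mean 5 and standard deviation 2.94 are the rounded values from the earlier example data 2, 4, 5, 9; the function names are illustrative.

```python
def z_score(x, mean, sd):
    """Distance of x from the mean, in standard deviation units."""
    return (x - mean) / sd

def raw_score(z, mean, sd):
    """Inverse of z_score: recover the raw value from a z-score."""
    return sd * z + mean

x_bar, s = 5.0, 2.94            # from the example data 2, 4, 5, 9 (s rounded)

z = z_score(9, x_bar, s)        # the largest observation
print(round(z, 2))              # about 1.36 standard deviations above the mean
print(raw_score(z, x_bar, s))   # converts back to the raw value, up to rounding
```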
Percentiles
Values of the variable that divide a
set of ranked data into 100 equal
subsets.
– Each set of data has 99 percentiles.
– The kth percentile, Pk, is a value such
that at most k% of the data are smaller
than Pk and at most (100 - k)% are
larger.
Procedure for finding Pk
1. Rank the n observations, lowest to highest.
2. Compute A = (nk)/100.
3. If A is an integer:
d(Pk) = A + 0.5 (depth)
Pk is halfway between the value of the datum in the
Ath position and the value of the next datum.
If A is a fraction:
d(Pk) = B, the next larger integer.
Pk is the value of the datum in the Bth position.
Some programs, like Excel, also use interpolation.
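The depth procedure can be sketched in Python as follows. The function name `percentile` is an illustrative choice, and k is assumed to be a whole number from 1 to 99. Integer arithmetic (`divmod`) is used to test whether A = nk/100 is an integer, which avoids floating-point surprises. This is the slide's procedure, not the interpolation method used by spreadsheet software, so results may differ from those programs.

```python
def percentile(values, k):
    """kth percentile by the depth procedure (k a whole number, 1 to 99)."""
    ranked = sorted(values)                    # step 1: rank the observations
    a, remainder = divmod(len(ranked) * k, 100)   # step 2: A = nk / 100
    if remainder == 0:
        # A is an integer: halfway between the Ath and (A + 1)th values.
        return (ranked[a - 1] + ranked[a]) / 2
    # A is a fraction: use the value in position B, the next larger integer.
    b = a + 1
    return ranked[b - 1]

print(percentile([5, 7, 1, 3, 8], 25))   # A = 1.25, so B = 2
print(percentile([2, 4, 5, 9], 50))      # A = 2, halfway between 2nd and 3rd
```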
Quartiles
Like percentiles except dividing the data
set into 4 equal subsets.
The first quartile, Q1, is the same as the
25th percentile, and
The third quartile, Q3, is the same as the
75th percentile.
The second quartile is the 50th percentile,
which is the median.
Sometimes finding Q1 and Q3 is described
as finding the medians of the bottom half
and top half of the data, respectively.
Five Number Summary
The Min, Q1, Median, Q3, and Max
indicate how the data is spread out
in each quarter.
Interquartile Range is the distance
between Q1 and Q3.
The Midquartile is the average of Q1
and Q3, another measure of central
tendency.
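Here is a minimal Python sketch of the five-number summary, interquartile range and midquartile for the data {5, 7, 1, 3, 8}. It assumes the quartiles are found as the 25th, 50th and 75th percentiles by the depth procedure; the "medians of the halves" method or software interpolation can give slightly different quartiles. All function names are illustrative.

```python
def percentile(values, k):
    """kth percentile by the depth procedure (k a whole number, 1 to 99)."""
    ranked = sorted(values)
    a, remainder = divmod(len(ranked) * k, 100)
    if remainder == 0:                       # A is an integer
        return (ranked[a - 1] + ranked[a]) / 2
    return ranked[a]                         # position B = A rounded up

def five_number_summary(values):
    """Return (Min, Q1, Median, Q3, Max)."""
    return (min(values), percentile(values, 25), percentile(values, 50),
            percentile(values, 75), max(values))

data = [5, 7, 1, 3, 8]
low, q1, med, q3, high = five_number_summary(data)
iqr = q3 - q1                    # interquartile range
midquartile = (q1 + q3) / 2      # average of Q1 and Q3

print(low, q1, med, q3, high)
print(iqr, midquartile)
```

These five values are exactly what a box and whisker plot draws: the box runs from Q1 to Q3 with a line at the median, and the whiskers reach toward the Min and Max.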
Box and Whisker Plots
Weights from Sixth Grade Class
[Figure: box-and-whisker plot of Weight; horizontal axis from 60 to 110]
Hmmm…
What did the Box Plot say to the
outlier?
“Don’t you dare get close to my
whisker!”