Looking at data: distributions
- Describing distributions with numbers
IPS section 1.2
© 2006 W.H. Freeman and Company (authored by
Brigitte Baldi, University of California-Irvine; adapted by
Jim Brumbaugh-Smith, Manchester College)
Objectives
Describing distributions with numbers
Describe center of a set of data
Describe positions within a set of data
Represent quartiles graphically
Identify outliers mathematically
Describe amount of variation (or “spread”) in a set of data
Choose appropriate summary statistics
Describe effects of linear transformations
Terminology
Measures of center
mean (x̄)
median (M)
mode
Measures of position
percentiles
quartiles (Q1 and Q3)
Five-number summary
Boxplot (regular and modified)
Measures of spread
range
interquartile range (IQR)
variance (s²)
standard deviation (s)
Measure of center: the mean
The mean (or arithmetic average)
To calculate the mean (x̄), add all
values, then divide by the number of
observations.
Sum of heights is 1598.3
divided by 25 women = 63.9 inches
Heights (inches) of the 25 women, sorted:
58.2 59.5 60.7 60.9 61.9 61.9 62.2 62.2 62.4 62.9 63.1 63.9 63.9
64.0 64.1 64.5 64.8 65.2 65.7 66.2 66.7 67.1 67.8 68.9 69.6
Mathematical notation
n: number of values (i.e., observations) in data set
xi: data value number i (x1, x2, …, xn)
Σ: sum up the expression that follows
(Σ is the Greek upper case “sigma”)
woman (i)   height (x)     woman (i)   height (x)
i = 1       x1 = 58.2      i = 14      x14 = 64.0
i = 2       x2 = 59.5      i = 15      x15 = 64.1
i = 3       x3 = 60.7      i = 16      x16 = 64.5
i = 4       x4 = 60.9      i = 17      x17 = 64.8
i = 5       x5 = 61.9      i = 18      x18 = 65.2
i = 6       x6 = 61.9      i = 19      x19 = 65.7
i = 7       x7 = 62.2      i = 20      x20 = 66.2
i = 8       x8 = 62.2      i = 21      x21 = 66.7
i = 9       x9 = 62.4      i = 22      x22 = 67.1
i = 10      x10 = 62.9     i = 23      x23 = 67.8
i = 11      x11 = 63.1     i = 24      x24 = 68.9
i = 12      x12 = 63.9     i = 25      x25 = 69.6
i = 13      x13 = 63.9
n = 25                     Σ = 1598.3
Mathematical notation:
\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i \]
\[ \bar{x} = \frac{1598.3}{25} = 63.9 \]
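As a quick check of the arithmetic above, here is a minimal Python sketch that adds the 25 heights from the slide and divides by n. The function name `mean` and the variable names are illustrative choices, not part of the original slides.

```python
# Heights (inches) of the 25 women, copied from the slide.
heights = [58.2, 59.5, 60.7, 60.9, 61.9, 61.9, 62.2, 62.2, 62.4, 62.9,
           63.1, 63.9, 63.9, 64.0, 64.1, 64.5, 64.8, 65.2, 65.7, 66.2,
           66.7, 67.1, 67.8, 68.9, 69.6]


def mean(values):
    """Add all values, then divide by the number of observations."""
    return sum(values) / len(values)


n = len(heights)        # number of observations
total = sum(heights)    # the sum written as Σ on the slide
x_bar = mean(heights)   # the mean, x̄

print(f"n = {n}, sum = {total:.1f}, mean = {x_bar:.1f}")
```

Rounded to one decimal place, the result agrees with the slide's value of 63.9 inches.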
Your numerical summary must be meaningful.
[Figure: histogram, Height of 25 women in a class; x̄ = 63.9]
The distribution of women’s heights appears coherent and
fairly symmetrical. The mean is a good numerical summary.
[Figure: histogram, Height of Plants by Color; horizontal axis: Height in centimeters (58 to 84); vertical axis: Number of Plants (0 to 5); legend: red, pink, blue; marked means: x̄ = 69.6, x̄ = 63.9, x̄ = 70.5, x̄ = 78.3]
Here the shape of the distribution is wildly irregular. Why?
Could we have more than one plant species or phenotype?
A single numerical summary here would not make sense.
Measure of center: the median
The median (M) is the midpoint of a distribution—the number such
that half of the observations are smaller and half are larger.
Sort observations in increasing order
n = number of observations
______________________________
If n is odd, the median is the
exact middle value.
n = 25
(n + 1)/2 = 26/2 = 13
Median = 3.4 (the 13th sorted observation)
Sorted data (n = 25):
0.6 1.2 1.5 1.6 1.9 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4
3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 5.6 6.1
If n is even, the median is the
mean of the two middle observations.
n = 24
n/2 = 12
Median = (3.3 + 3.4)/2 = 3.35 (mean of the 12th and 13th sorted observations)
Sorted data (n = 24):
0.6 1.2 1.5 1.6 1.9 2.1 2.3 2.3 2.5 2.8 2.9 3.3
3.4 3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 5.6
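The odd and even cases above can be sketched in a few lines of Python. This is a minimal illustration using the two data sets from the slide; the function name `median` and the variable names are assumptions made for the example.

```python
def median(values):
    """Median by the slide's rule: sort, then take the middle value
    (n odd) or the mean of the two middle values (n even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # 1-based position (n + 1)/2 is 0-based index n // 2
        return ordered[mid]
    # 1-based positions n/2 and n/2 + 1 are 0-based indices mid - 1 and mid
    return (ordered[mid - 1] + ordered[mid]) / 2


data_25 = [0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
           3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]
data_24 = data_25[:-1]  # the same data without the largest value

m_odd = median(data_25)
m_even = median(data_24)
print(f"n = 25: median = {m_odd:.2f}")
print(f"n = 24: median = {m_even:.2f}")
```

The two results correspond to the slide's medians of 3.4 and 3.35.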
Comparing the mean and the median
The mean and the median are approximately equal if the distribution is
roughly symmetrical. The median is resistant to skewness and outliers,
staying near the main peak. The mean is not resistant, being pulled in
the direction of outliers or skewness.
[Figure: mean and median for a symmetric distribution; Mean and Median marked at the same point]
[Figure: mean and median for skewed distributions; two panels, Left skew and Right skew, each with Mean and Median marked]
Mean and median of a distribution with outliers
[Figure: two histograms; vertical axis: Percent of people dying]
Without the outliers: x̄ = 3.3
With the outliers, 14 and 14: x̄ = 4.1
The mean is pulled quite a bit to the right by the two high
outliers (from 3.3 up to 4.1).
The median is only slightly pulled to the right by the outliers
(from 3.4 up to 3.6).
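The shift described above can be reproduced with Python's standard `statistics` module. This sketch assumes the two outliers (14 and 14) are added to the 25 Disease X values used in the median example; that pairing is an assumption, though it does reproduce the slide's numbers.

```python
from statistics import mean, median

# The 25 values from the median example (assumed to be the
# "without the outliers" data set on this slide).
data = [0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
        3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]
with_outliers = data + [14, 14]  # the two high outliers named on the slide

mean_before, median_before = mean(data), median(data)
mean_after, median_after = mean(with_outliers), median(with_outliers)

print(f"mean:   {mean_before:.1f} -> {mean_after:.1f}")
print(f"median: {median_before:.1f} -> {median_after:.1f}")
```

The mean moves by about 0.8 while the median moves by 0.2, which is the resistance property the slide describes.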
Impact of skewed data
Mean and median of symmetric data
Disease X:
M = 3.4
x̄ = 3.3
Mean and median are nearly the same.
… and for right-skewed distribution
Multiple myeloma:
M = 2.5
x̄ = 3.4
Mean is pulled toward the skewness (i.e., longer tail).
Measure of spread: the quartiles
The first quartile, Q1, is a value that has
25% (one fourth) of the data at or below it
(it is the median of the lower half of the
sorted data, excluding M).
The third quartile, Q3, is a value that has
75% (three fourths) of the data at or
below it (it is the median of the upper half
of the sorted data, excluding M).
Sorted data (n = 25):
0.6 1.2 1.5 1.6 1.9 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4
3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 5.6 6.1
Q1 = first quartile = 2.2
M = median = 3.4
Q3 = third quartile = 4.35
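The "median of each half, excluding M" rule can be sketched as follows. The function names are illustrative, and the rule implemented is the one stated on the slide; statistical software often uses slightly different quartile conventions, so library results may not match exactly.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, M, Q3), where Q1 and Q3 are the medians of the lower
    and upper halves of the sorted data, excluding M when n is odd."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # When n is odd, skip the middle observation (the median itself).
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(ordered), median(upper)


data = [0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
        3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]

q1, m, q3 = quartiles(data)
print(f"Q1 = {q1:.2f}, M = {m:.2f}, Q3 = {q3:.2f}")
```

With n = 25, each half holds 12 observations, so each quartile is the mean of the 6th and 7th values in its half.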
Five-number summary and boxplot
[Figure: boxplot of the Disease X data; vertical axis: Years until death (0 to 7)]
Largest = max = 6.1 (end of upper “whisker”)
Q3 = third quartile = 4.35
M = median = 3.4
Q1 = first quartile = 2.2
Smallest = min = 0.6 (end of lower “whisker”)
Five-number summary:
min Q1 M Q3 max
Boxplots for skewed data
Comparing boxplots for a symmetric
and a right-skewed distribution
[Figure: side-by-side boxplots for Disease X and Multiple Myeloma; vertical axis: Years until death (0 to 15)]
Boxplots remain true to the data and
clearly depict symmetry or skewness.
IQR Test for Outliers (or “1.5 IQR Criterion”)
Outliers are troublesome data points; it is important to be able to identify them.
In a boxplot, outliers are far beneath or far above the box (i.e., far below Q1 or
above Q3). Define the interquartile range (IQR) to be the height of the box:
IQR = Q3 − Q1 (distance between Q1 and Q3).
We identify an observation as an outlier if it falls more than 1.5 times the
interquartile range (IQR) below the first quartile or above the third quartile.
If X < Q1 − 1.5(IQR) then X is considered a low outlier
If X > Q3 + 1.5(IQR) then X is considered a high outlier
Create a modified boxplot by plotting outliers separately and extending the
whiskers to the lowest and highest non-outliers.
[Figure: modified boxplot of the Disease X data; vertical axis: Years until death (0 to 8); upper whisker ends at 6.1; upper cutoff marked at 7.575]
Sorted data (n = 25):
0.6 1.2 1.5 1.6 1.9 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4
3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 5.3 5.6 6.1 7.9
Q1 = 2.2
Q3 = 4.35
Interquartile range
IQR = Q3 − Q1
= 4.35 − 2.2 = 2.15
1.5(IQR) = 1.5(2.15)
= 3.225
Observation #25 has a value of 7.9 years, a possible high outlier.
Q3 + 1.5(IQR) = 4.35 + 3.225 = 7.575
Since 7.9 > 7.575 it is considered an outlier, so use modified plot.
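The 1.5 IQR criterion is easy to check in code. This sketch takes Q1 and Q3 as given on the slide (2.2 and 4.35) rather than recomputing them; the function name `iqr_outliers` and its return layout are hypothetical choices for illustration.

```python
def iqr_outliers(values, q1, q3, k=1.5):
    """Apply the 1.5 IQR criterion.

    Returns the low and high cutoffs, the low and high outliers, and the
    whisker ends (lowest and highest non-outliers) for a modified boxplot.
    """
    iqr = q3 - q1
    low_cut = q1 - k * iqr
    high_cut = q3 + k * iqr
    low = [x for x in values if x < low_cut]
    high = [x for x in values if x > high_cut]
    inside = [x for x in values if low_cut <= x <= high_cut]
    return low_cut, high_cut, low, high, (min(inside), max(inside))


# Disease X data from this slide (observation #25 is 7.9).
data = [0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
        3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 5.3, 5.6, 6.1, 7.9]

low_cut, high_cut, low, high, whiskers = iqr_outliers(data, q1=2.2, q3=4.35)
print(f"cutoffs: {low_cut:.3f} and {high_cut:.3f}")
print("low outliers:", low)
print("high outliers:", high)
print("whisker ends:", whiskers)
```

Only 7.9 exceeds the upper cutoff, so the upper whisker of the modified boxplot stops at 6.1, the highest non-outlier.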
Measures of spread: the standard deviation
Measures of variation or spread answer the question,
“How much is the data set as a whole spread out?”
Range – distance from smallest data value to largest
range = max – min
Highly sensitive to outliers since it depends solely on
the two most extreme values.
Interquartile range
IQR = Q3 − Q1
Better than overall range since it is resistant to outliers.
Variance and standard deviation
Each measures variation from the mean.
Standard deviation
The standard deviation (s) describes variation above and below the
mean. Like the mean, it is not resistant to skewness or outliers.
1. First calculate the variance s².
\[ s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 \]
2. Then take the square root to get
the standard deviation s.
\[ s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2} \]
Calculations …
\[ s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2} \]
Mean = 63.4
Sum of squared deviations from mean = 85.2
Degrees of freedom (df) = n − 1 = 13
s² = variance = 85.2/13 = 6.55 inches squared
s = standard deviation = √6.55 = 2.56 inches
Women’s height (inches)
i      xi     x̄       (xi − x̄)   (xi − x̄)²
1      59     63.4    −4.4       19.0
2      60     63.4    −3.4       11.3
3      61     63.4    −2.4       5.6
4      62     63.4    −1.4       1.8
5      62     63.4    −1.4       1.8
6      63     63.4    −0.4       0.1
7      63     63.4    −0.4       0.1
8      63     63.4    −0.4       0.1
9      64     63.4    0.6        0.4
10     64     63.4    0.6        0.4
11     65     63.4    1.6        2.7
12     66     63.4    2.6        7.0
13     67     63.4    3.6        13.3
14     68     63.4    4.6        21.6
Sum                   0.0        85.2
Mean   63.4
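The two-step calculation (variance first, then its square root) can be sketched directly from the table's data. Variable names are illustrative; the final line cross-checks against `statistics.stdev`, which also uses the n − 1 divisor.

```python
from math import sqrt
import statistics

# Heights (inches) of the 14 women in the table above.
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

n = len(heights)
x_bar = sum(heights) / n                         # mean
sq_dev = sum((x - x_bar) ** 2 for x in heights)  # sum of squared deviations
df = n - 1                                       # degrees of freedom
variance = sq_dev / df                           # step 1: s squared
s = sqrt(variance)                               # step 2: s

print(f"mean = {x_bar:.1f}, sum of squared deviations = {sq_dev:.1f}")
print(f"variance = {variance:.2f}, standard deviation = {s:.2f}")
print(f"statistics.stdev gives {statistics.stdev(heights):.2f}")
```

The squared deviations here use the unrounded mean, as the table on the slide does.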
SPSS output for summary statistics:
From menu: Analyze > Descriptive Statistics > Explore
Displays common statistics of your sample data: x̄, M, s², s,
min, max, range, IQR

Descriptives: Height
                                       Statistic   Std. Error
Mean                                   63.3571     .68426
95% Confidence      Lower Bound        61.8789
Interval for Mean   Upper Bound        64.8354
5% Trimmed Mean                        63.3413
Median                                 63.0000
Variance                               6.555
Std. Deviation                         2.56026
Minimum                                59.00
Maximum                                68.00
Range                                  9.00
Interquartile Range                    3.50
Skewness                               .177        .597
Kurtosis                               -.360       1.154
Comments on standard deviation
Standard deviation is generally positive (and never negative!)
(s = 0 only when data values are identical— not very interesting data!)
Larger standard deviation means more variation in the data
(i.e., data is spread out farther from the mean)
Standard deviation has the same units as the original data
(while variance does not)
Choosing measures of center and spread:
Mean and standard deviation are more precise (since based on actual
data values); have nice mathematical properties but not resistant.
Median and IQR are less precise (since based only on positions); are
resistant to outliers, errors and skewness.
Choosing among summary statistics
Since the mean and std. deviation
are not resistant, use only to
describe distributions that are
fairly symmetrical with no outliers.
If clear outliers or strong
skewness are present use the
median and IQR.
Don’t mix & match; use either
x̄ and s, or M and IQR.
Similar to a boxplot representing
median and quartiles, the mean
and std. dev. can be represented
by using error bars.
[Figure: Height of 25 Women; vertical axis: Height in Inches (58 to 69); left panel: Boxplot; right panel: Mean +/− SD, with the mean x̄ and error bars at x̄ − s and x̄ + s]
Which should you use (and why) – mean or median?
Middletown is considering imposing an income tax on citizens. City
hall wants a numerical summary of its citizens’ income to estimate
the total tax base.
In a study of standard of living of families in Middletown, a
sociologist desires a numerical summary of “typical” family income in
that city.
Mean or Median #2
You are planning to buy a home in Middletown. You ask your real
estate agent what the “average” home value is in the neighborhood
you are considering.
Which would be more useful to you as the home buyer – the
mean or the median?
Which might the real estate agent be tempted to tell you is the
“average” home value? Why?
Changing the unit of measurement
Variables can be recorded in different units of measurement. Most
often, one measurement unit is a linear transformation of another
measurement unit:
xnew = a + bx.
Temperatures can be expressed in degrees Fahrenheit (F) or degrees
Celsius (C).
C = (5/9)* F − 160/9
Linear transformations do not change the basic shape of a distribution
(skewness, symmetry, modes, outliers). But they do change the
measures of center and spread:
Multiplying each observation by a positive number b multiplies both
measures of center (mean, median) and spread (IQR, s) by b.
Adding the same number a (positive or negative) to each observation
adds a to all measures of center and quartiles but it does not change
measures of spread (IQR, s).
Changing degrees Fahrenheit to Celsius
           Fahrenheit   Celsius
Mean       25.73        (5/9)*25.73 − 160/9 = −3.48
Std Dev    5.12         (5/9)*5.12 = 2.84
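The rules for linear transformations can be checked numerically. In this sketch the Fahrenheit readings in `temps_f` are hypothetical values invented for illustration (the slide gives only the summary statistics); the conversion formula and the two summary values 25.73 and 5.12 come from the slides.

```python
from statistics import mean, stdev


def f_to_c(f):
    """Linear transformation x_new = a + b*x with a = -160/9, b = 5/9."""
    return (5 / 9) * f - 160 / 9


# Hypothetical Fahrenheit readings, not from the slides.
temps_f = [18.0, 22.5, 25.0, 27.5, 31.0, 34.0]
temps_c = [f_to_c(f) for f in temps_f]

# Center: the mean transforms like a single observation (a + b*mean).
mean_c_direct = mean(temps_c)
mean_c_rule = f_to_c(mean(temps_f))

# Spread: only the multiplier b matters; the shift a drops out.
sd_c_direct = stdev(temps_c)
sd_c_rule = (5 / 9) * stdev(temps_f)

print(f"mean: {mean_c_direct:.4f} vs {mean_c_rule:.4f}")
print(f"sd:   {sd_c_direct:.4f} vs {sd_c_rule:.4f}")

# The slide's own summary values, converted with the same rules.
slide_mean_c = f_to_c(25.73)
slide_sd_c = (5 / 9) * 5.12
print(f"slide: mean = {slide_mean_c:.2f} C, sd = {slide_sd_c:.2f} C")
```

Transforming every observation and then summarizing gives the same answer as transforming the summaries directly, which is why the table can convert the mean and standard deviation without the raw data.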