x - Columbus State University
Chapter 3
Describing Distributions
with Numbers
Overview
Measures of Center
Measures of Variation
Measures of Relative Standing
Exploratory Data Analysis (EDA)
Thinking Challenge
$400,000
$70,000
$50,000
$30,000
$20,000
... employees cite low pay
-- most workers earn only
$20,000.
... President claims
average pay is $70,000!
Numerical Data Properties
Central Tendency
(Center)
Variation
(Spread)
Shape
Numerical Data Properties and Measures
Measures of Center: Mean, Median, Mode
Measures of Variation: Range, Interquartile Range, Variance, Standard Deviation
Shape: Symmetric, Skew
Mean
Measure of the center or central tendency
Most common measure
Affected by extreme values (‘outliers’)
Example
Raw Data: 10.3 4.9 8.9 11.7 6.3 7.7
$$\text{Mean} = \frac{x_1 + x_2 + x_3 + x_4 + x_5 + x_6}{6} = \frac{10.3 + 4.9 + 8.9 + 11.7 + 6.3 + 7.7}{6} = 8.30$$
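The same arithmetic can be checked with a short Python sketch. The variable names are illustrative, and the value 100.0 is a made-up observation added only to show how a single extreme value pulls the mean.

```python
from statistics import mean

data = [10.3, 4.9, 8.9, 11.7, 6.3, 7.7]

# Mean of the six raw values from the example.
x_bar = mean(data)
print(round(x_bar, 2))

# HYPOTHETICAL outlier (not in the slides): one extreme value shifts the mean a lot.
with_outlier = data + [100.0]
x_bar_outlier = mean(with_outlier)
print(round(x_bar_outlier, 2))
```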
Median
Measure of the center or central tendency
Middle value in an ordered sequence
If Odd n, Middle Value of Sequence
If Even n, Average of 2 Middle Values
Position of median in the sequence:
$$\text{Positioning point} = \frac{n + 1}{2}$$
Not affected by extreme values
Median of a Data Set
Median Odd-Sized Sample
Raw Data: 24.1 22.6 21.5 23.7 22.6
Ordered: 21.5 22.6 22.6 23.7 24.1
Position: 1 2 3 4 5
$$\text{Position} = \frac{n + 1}{2} = \frac{5 + 1}{2} = 3$$
Median = 22.6
Median Even-Sized Sample
Raw Data: 10.3 4.9 8.9 11.7 6.3 7.7
Ordered: 4.9 6.3 7.7 8.9 10.3 11.7
Position: 1 2 3 4 5 6
$$\text{Position} = \frac{n + 1}{2} = \frac{6 + 1}{2} = 3.5$$
$$\text{Median} = \frac{7.7 + 8.9}{2} = 8.3$$
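The positioning-point rule can be sketched in Python as follows. The function name `median_by_position` is an illustrative choice, not something from the slides; it simply follows the (n + 1)/2 rule for odd and even n.

```python
def median_by_position(values):
    """Median via the positioning point (n + 1) / 2 on the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2                 # 1-based position in the ordered list
    if n % 2 == 1:
        return ordered[int(position) - 1]  # odd n: the single middle value
    lower = ordered[int(position) - 1]     # even n: e.g. position 3.5 -> 3rd value
    upper = ordered[int(position)]         # ... and the 4th value
    return (lower + upper) / 2

odd_median = median_by_position([24.1, 22.6, 21.5, 23.7, 22.6])
even_median = median_by_position([10.3, 4.9, 8.9, 11.7, 6.3, 7.7])
```

Applied to the two samples above, this should reproduce the slide results of 22.6 and 8.3 (up to floating-point rounding).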
Mode
Measure of the center or central tendency
Value that occurs most often
Not affected by extreme values
May be no mode or several modes
May be used for numerical and categorical
data
Mode
Mode Example
No Mode
Raw Data: 10.3 4.9 8.9 11.7 6.3 7.7
One Mode
Raw Data: 6.3 4.9 8.9 6.3 4.9 4.9
More Than 1 Mode
Raw Data: 21 28 28 41 43 43
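A minimal Python sketch of the three cases above. The helper `modes` is hypothetical; it assumes the convention used in these examples, where data with no repeated value are reported as having no mode.

```python
from collections import Counter

def modes(values):
    """Return all values tied for the highest count, or [] if nothing repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # every value occurs once: treat as "no mode"
    return sorted(v for v, c in counts.items() if c == top)

no_mode = modes([10.3, 4.9, 8.9, 11.7, 6.3, 7.7])
one_mode = modes([6.3, 4.9, 8.9, 6.3, 4.9, 4.9])
two_modes = modes([21, 28, 28, 41, 43, 43])
```

Because the function only counts occurrences, the same idea applies to categorical data such as a list of strings.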
Mean versus Median
Selecting an Appropriate
Measure of Center
a) A student takes four exams in a biology class. His grades are 88,
40, 95, and 100. If asked for his grade in the class, which
measure of center is the student likely to report?
b) The National Association of REALTORS publishes data on
resale prices of U.S. homes. Which measure of center is most
appropriate for such resale prices?
c) In the 2003 Boston Marathon, there were two categories of
official finishers: male and female, of which there were 10,737
and 6,309, respectively. Which measure of center should be used
here?
Population Mean - Sample Mean
Possible interpretations for the mean of a data set
Notation for Sample Mean
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + x_3 + \cdots + x_{n-1} + x_n}{n}$$
Notation
Σ denotes the sum of a set of values.
x is the variable usually used to represent the
individual data values.
n represents the number of values in a
sample.
N represents the number of values in a
population.
Notation for Population Mean
Notation used for a sample
and for the population
Best Measure of Center
Measuring Spread or Variation
Range
Measure of spread, variation or dispersion
Difference between largest and smallest
observations
$$\text{Range} = X_{\text{largest}} - X_{\text{smallest}}$$
Ignores how data are distributed
[Figure: two data sets plotted on the same scale, 7 to 10]
Quartiles and Boxplots
Quartiles
Measure of Spread, variation or dispersion
Split Ordered Data Set into 4 Quarters
Min | 25% | Q1 | 25% | Q2 | 25% | Q3 | 25% | Max
How To Calculate the Quartiles
Quartile (Q2) Example
Raw Data: 10.3 4.9 8.9 11.7 6.3 7.7
Ordered: 4.9 6.3 7.7 8.9 10.3 11.7
Position: 1 2 3 4 5 6
Q2 = 8.3
Quartile (Q1) Example
Raw Data: 10.3 4.9 8.9 11.7 6.3 7.7
Ordered: 4.9 6.3 7.7 8.9 10.3 11.7
Position: 1 2 3 4 5 6
Q1 = 6.3
Quartile (Q3) Example
Raw Data: 10.3 4.9 8.9 11.7 6.3 7.7
Ordered: 4.9 6.3 7.7 8.9 10.3 11.7
Position: 1 2 3 4 5 6
Q3 = 10.3
Notice that,
Q1 (First Quartile) separates the bottom 25%
of sorted values from the top 75%.
Q2 (Second Quartile) same as the median;
separates the bottom 50% of sorted values
from the top 50%.
Q3 (Third Quartile) separates the bottom 75%
of sorted values from the top 25%.
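The quartile values in the three examples above are consistent with taking Q2 as the median and Q1 and Q3 as the medians of the lower and upper halves of the ordered data. A minimal Python sketch of that approach follows. The function name is illustrative, and leaving the middle value out of both halves when n is odd is an assumption, since the slides only show an even-sized sample and other conventions exist.

```python
from statistics import median

def quartiles_by_halves(values):
    """Q1, Q2, Q3 as medians of the lower half, the whole set, and the upper half."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # ASSUMPTION: for odd n the middle value is left out of both halves.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower), median(ordered), median(upper)

q1, q2, q3 = quartiles_by_halves([10.3, 4.9, 8.9, 11.7, 6.3, 7.7])
```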
Percentiles
Just as there are three quartiles separating data
into four parts, there are 99 percentiles, denoted
P1, P2, . . . P99, which partition the data into 100
groups.
The kth percentile, Pk
is the value for which k % of all observations
are below that value.
For instance, Q1= P25 , Q2= P50 , and Q3= P75
Finding the Percentile
of a Given Score
The following formula gives the percentile that
a given score represents.
Notice that the data set must be ordered.
Round the result to the nearest integer
$$\text{Percentile of value } x = \frac{\text{number of values less than } x}{\text{total number of values}} \cdot 100$$
Example: Ages of Best Actresses
Original Data
Sorted Data
$$\text{Percentile of value } 30 = \frac{\text{number of values less than } 30}{\text{total number of values}} \cdot 100 = \frac{26}{76} \cdot 100 = 34\%$$
Interpretation: The age of 30 years is the 34th
percentile, that is, P34 = 30
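The percentile formula can be sketched in Python as below. The Best Actress ages are not reproduced in this transcript, so the sketch uses the six raw values from the earlier median example. The function name is hypothetical.

```python
def percentile_of_value(values, x):
    """Percentile of x: share of values strictly below x, as a rounded percentage."""
    below = sum(1 for v in values if v < x)
    # Note: Python's round() sends exact .5 ties to the nearest even integer.
    return round(100 * below / len(values))

data = [10.3, 4.9, 8.9, 11.7, 6.3, 7.7]
p = percentile_of_value(data, 8.9)   # three of the six values are below 8.9

# The arithmetic of the slide example: 26 of 76 values below 30.
p_age_30 = round(100 * 26 / 76)
```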
Converting from the kth Percentile
to the Corresponding Data Value
then ask the question,
Example: Ages of Best Actresses
Refer to the sorted ages of Best Actresses given
below to find the value of the 20th percentile, P20
Original Data
Sorted Data
P20 is the value for which 20 % of all
observations are below that value.
$$L = \frac{k}{100} \cdot n = \frac{20}{100} \cdot 76 = 15.2$$
Because L is not a whole number, round it up. Therefore, L = 16,
and the 16th value in the sorted list is P20.
Example: Ages of Best Actresses
Refer to the sorted ages of Best Actresses given
below to find the value of the 75th percentile, P75
Original Data
Sorted Data
$$L = \frac{k}{100} \cdot n = \frac{75}{100} \cdot 76 = 57$$
Therefore, L = 57. Because L is a whole number, P75 is the
average of the 57th and 58th values in the sorted list: P75 = 39.5.
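The locator rule used in the two percentile examples (round L up when it is not a whole number; otherwise average the L-th and (L+1)-th values) can be sketched as follows. The function name is hypothetical, k is assumed to be an integer from 1 to 99, and the six raw values from the earlier examples stand in for the actress ages, which are not reproduced in this transcript.

```python
def value_at_percentile(values, k):
    """Value of the k-th percentile using the locator L = (k / 100) * n."""
    ordered = sorted(values)
    n = len(ordered)
    # Integer arithmetic avoids floating-point trouble when testing "is L whole?".
    whole, remainder = divmod(k * n, 100)
    if remainder == 0:
        # L is whole: average the L-th and (L+1)-th values (1-based positions).
        return (ordered[whole - 1] + ordered[whole]) / 2
    # L is not whole: round up, so take the (whole + 1)-th value (index `whole`).
    return ordered[whole]

data = [10.3, 4.9, 8.9, 11.7, 6.3, 7.7]
p50 = value_at_percentile(data, 50)   # L = 3, whole: average of 3rd and 4th values
p20 = value_at_percentile(data, 20)   # L = 1.2, round up to 2: the 2nd value
```

With n = 76, the same locator gives L = 15.2 for k = 20 (so the 16th value) and L = 57 for k = 75 (so the average of the 57th and 58th values), matching the steps in the examples.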
The Interquartile Range IQR
Measure of spread, variation or dispersion
Also called midspread
Difference between third and first quartiles
IQR = Q3 − Q1
Spread in middle 50%
Not affected by extreme values
The Interquartile Range IQR
Preferred measure of variation when the median is
used as the measure of center.
Like the median, the interquartile range is a resistant
measure.
Outliers
The Five-number Summary
Example:
Supermarket Spending
Min = 3
Q1 = 19
M = Q2 = 28
Q3 = 45
Max = 93
The Five-number Summary is: $3, $19, $28, $45, $93
Boxplot
[Figure: boxplot labeling Min, Q1, Median, Q3, and Max; the axis values shown are 5, 6, 7, 9, and 10]
Example
20 customer satisfaction ratings:
1 3 5 5 7 8 8 8 8 8 8 9 9 9 9 9 10 10 10 10
M = (8+8)/2 = 8
Q1 = (7+8)/2 = 7.5
Q3 = (9+9)/2 = 9
IQR = Q3 - Q1 = 9 - 7.5 = 1.5
Boxplot
Distribution shapes and boxplots
Modified Boxplots
Some statistical packages provide modified boxplots
which represent outliers as special points. A modified
boxplot is constructed with these specifications:
A special symbol (such as an asterisk) is used to
identify outliers.
The solid horizontal line extends only as far as the
minimum data value that is not an outlier and the
maximum data value that is not an outlier.
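The slides do not state how outliers are identified. A minimal Python sketch, assuming the common 1.5 × IQR rule (an assumption, not taken from the source) and using the 20 customer satisfaction ratings from the earlier example, shows which points a modified boxplot would flag and where the line would end.

```python
from statistics import median

ratings = [1, 3, 5, 5, 7, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10]

ordered = sorted(ratings)
half = len(ordered) // 2          # n is even here, so the halves split cleanly
q1 = median(ordered[:half])       # 7.5
q3 = median(ordered[half:])       # 9.0
iqr = q3 - q1                     # 1.5

# ASSUMPTION: 1.5 * IQR fences; the slides do not define the outlier rule.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr

outliers = [v for v in ordered if v < low_fence or v > high_fence]
inside = [v for v in ordered if low_fence <= v <= high_fence]

# The line extends only to the most extreme values that are not outliers.
line_ends = (min(inside), max(inside))
```

Under that assumption, the ratings 1, 3, 5, and 5 would be drawn as special points and the line would run from 7 to 10.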
Example
Variance and
Standard Deviation
Measures of spread, variation or dispersion
Most common measures
Consider how data are distributed
Show variation about mean
Sample Variance and
Sample Standard Deviation
Properties of the Standard Deviation
The idea behind the variance and the standard
deviation as measures of spread is as follows:
The deviations $x_i - \bar{x}$ display the spread of the
values $x_i$ about their mean $\bar{x}$.
Some of these deviations will be positive and
some negative because some of the observations
fall on each side of the mean. In fact, the sum of
the deviations of the observations from their
mean will always be zero.
Properties of the Standard Deviation
Squaring the deviations makes them all positive,
so that observations far from the mean in either
direction have large positive squared deviations.
The variance is the average squared deviation.
Therefore, both $s^2$ and $s$ will be large if the
observations are widely spread about their mean,
and small if the observations are all close to the
mean.
Properties of the Standard Deviation
s measures spread about the mean and should be
used only when the mean is chosen as the measure
of center.
s = 0 only when there is no spread. This happens
only when all observations have the same value.
Otherwise, s > 0. As the observations become more
spread out about their mean, s gets larger.
s is not resistant. A few outliers can make s very
large.
Example - Metabolic Rate
A person’s metabolic rate is the rate at which the
body consumes energy. This rate is important in
studies of weight gain, dieting, and exercise.
Here are the metabolic rates of 7 men who took
part in a study of dieting. The units are calories
per 24 hours. These are the same calories used to
describe the energy content of foods.
1792 1666 1362 1614 1460 1867 1439
Notice that $\bar{x} = 1600$.
Example - Metabolic Rate
The table shows the observations xi , their deviations
from the mean and the square of these deviations.
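  x_i     x_i − x̄    (x_i − x̄)²
  1792      192        36864
  1666       66         4356
  1362     −238        56644
  1614       14          196
  1460     −140        19600
  1867      267        71289
  1439     −161        25921
$$\bar{x} = 1600, \qquad \sum_{i=1}^{7} (x_i - \bar{x})^2 = 214870$$
$$s^2 = \frac{1}{6} \sum_{i=1}^{7} (x_i - \bar{x})^2 = \frac{214870}{6} = 35811.67$$
$$s = \sqrt{\frac{1}{6} \sum_{i=1}^{7} (x_i - \bar{x})^2} = \sqrt{35811.67} = 189.24$$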
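The calculation can be reproduced step by step in Python. This is a minimal sketch with illustrative variable names; it follows the definitions used here (deviations from the mean, squared, summed, and divided by n − 1).

```python
from math import sqrt

rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]

n = len(rates)
x_bar = sum(rates) / n                      # sample mean
deviations = [x - x_bar for x in rates]     # x_i - x_bar; these sum to zero
sum_sq = sum(d ** 2 for d in deviations)    # sum of squared deviations

s2 = sum_sq / (n - 1)                       # sample variance (divide by n - 1)
s = sqrt(s2)                                # sample standard deviation

print(x_bar, sum(deviations), sum_sq)
print(round(s2, 2), round(s, 2))
```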
Example - Metabolic Rate
The figure plots these data as dots on the calorie scale,
with their mean marked by an asterisk (∗).
The arrows mark two of the deviations from the mean.
[Figure: metabolic rates for seven men, with the mean (∗) and the deviations of two observations from the mean]
Choosing measures of center and
spread
How do we choose between the five-number
summary and $\bar{x}$ and $s$ to describe the center and
spread of a distribution?
Because the two sides of a strongly skewed
distribution have different spreads, no single
number such as s describes the spread well.
The five-number summary, with its two
quartiles and two extremes, does a better job.
Choosing a summary
The five-number summary is usually better than
the mean and standard deviation for describing a
skewed distribution or a distribution with strong
outliers.
Use $\bar{x}$ and $s$ only for reasonably symmetric
distributions that are free of outliers.
Remarks
The idea of the variance is straightforward: it is
the average of the squares of the deviations of the
observations from their mean. The details we have
just presented, however, raise some questions.
Why do we square the deviations?
Why do we emphasize the standard deviation
rather than the variance?
Why do we average by dividing by n − 1 rather
than n in calculating the variance?
Remarks
Why do we square the deviations? Why not just
average the distances of the observations from
their mean? There are two reasons, neither of them
obvious.
First, the sum of the squared deviations of any set
of observations from their mean is the smallest that
the sum of squared deviations from any number
can possibly be. This is not true of the unsquared
distances. So squared deviations point to the mean
as center in a way that distances do not.
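This claim can be checked numerically on the seven metabolic rates from the example. The sketch below searches whole-number candidate centers between 1300 and 1900 (an arbitrary illustrative range). The remark that the unsquared distances are minimized at the median is a standard fact added here for comparison, not something stated in the slides.

```python
from statistics import mean, median

rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]

def sum_squared(center):
    """Sum of squared deviations of the rates from a candidate center."""
    return sum((x - center) ** 2 for x in rates)

def sum_absolute(center):
    """Sum of unsquared distances of the rates from a candidate center."""
    return sum(abs(x - center) for x in rates)

x_bar = mean(rates)      # 1600
mid = median(rates)      # 1614

# Search whole-number candidate centers (illustrative range only).
candidates = range(1300, 1901)
best_for_squares = min(candidates, key=sum_squared)
best_for_distances = min(candidates, key=sum_absolute)
```

The squared deviations are smallest at the mean, while the unsquared distances are smallest at a different point (the median for these data).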
Remarks
Second, the standard deviation turns out to be the
natural measure of spread for a particularly
important class of symmetric unimodal
distributions, the normal distributions. We will
meet the normal distributions in a later section. We
commented earlier that the usefulness of many
statistical procedures is tied to distributions of
particular shapes. This is distinctly true of the
standard deviation.
Remarks
Why do we emphasize the standard deviation rather
than the variance? One reason is that $s$, not $s^2$, is the
natural measure of spread for normal distributions.
There is also a more general reason to prefer $s$ to $s^2$.
Because the variance involves squaring the deviations,
it does not have the same unit of measurement as the
original observations. The variance of the metabolic
rates, for example, is measured in squared calories.
Taking the square root remedies this. The standard
deviation s measures spread about the mean in the
original scale.
Remarks
Why do we average by dividing by n − 1 rather than
n in calculating the variance? Because the sum of
the deviations is always zero, the last deviation can
be found once we know the other n − 1. So we are
not averaging n unrelated numbers. Only n − 1 of the
squared deviations can vary freely, and we average
by dividing the total by n − 1. The number n − 1 is
called the degrees of freedom of the variance or
standard deviation. Many calculators offer a choice
between dividing by n and dividing by n − 1, so be
sure to use n − 1.
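Software offers the same choice as calculators. As a hedged illustration, Python's standard `statistics` module provides `variance` and `stdev` (dividing by n − 1) alongside `pvariance` and `pstdev` (dividing by n). The sketch below compares them on the seven metabolic rates.

```python
from statistics import variance, pvariance, stdev, pstdev

rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]

s2 = variance(rates)       # sample variance: divides by n - 1
s2_n = pvariance(rates)    # divides by n instead
s = stdev(rates)           # square root of the n - 1 version
s_n = pstdev(rates)        # square root of the n version

print(round(s2, 2), round(s2_n, 2))
print(round(s, 2), round(s_n, 2))
```

Dividing by n always gives the smaller result, so it matters which function (or calculator key) is used.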
Population Standard Deviation
Standardized Variables
We can associate with any variable x a new
variable z, called the standardized version of x
or the standardized variable, defined as follows:
$$z = \frac{x - \mu}{\sigma}$$
Example
Consider a simple variable x, namely, one with
possible observations shown in the first row of
the following table.
Example - Continued
a. Determine the standardized version of x.
b. Find the observed value of z corresponding to an
observed value of x of 5.
c. Obtain all possible observations of z.
d. Find the mean and standard deviation of z.
e. Obtain dotplots of the distributions of both x and
z. Interpret the results.
Example - Continued
a. Determine the standardized version of x.
Using the definitions of µ and σ we find that the
mean and standard deviation of the variable x are
µ = 3 and σ = 2. Therefore, the standardized
version of x is
$$z = \frac{x - 3}{2}$$
Example - Continued
b. Find the observed value of z corresponding to an
observed value of x of 5.
The observed value of z corresponding to an
observed value of x of 5 is
$$z = \frac{5 - 3}{2} = 1$$
Example - Continued
c. Obtain all possible observations of z.
Applying the formula z = (x − 3)/2 to each
observation of the variable x shown in the first
row of the table, we obtain each observation of
the standardized variable z shown in the second
row.
Example - Continued
d. Find the mean and standard deviation of z.
From the second row of the table
we get $\mu_z = 0$ and $\sigma_z = 1$.
Example - Continued
e. Obtain dotplots of the distributions of both x and
z. Interpret the results.
The dotplots of the distributions of x and z are
Standard Scores or z-Scores
An important concept associated with standardized
variables is that of the z-score, or standard score,
which we now define.
Standard Scores or z-Scores
The standard score or z-score, represents the number
of standard deviations that a data value, x, falls from
the mean, µ.
$$z = \frac{x - \mu}{\sigma}$$
That is,
$$x = \mu + z\sigma$$
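The two directions of this relationship can be sketched in Python. The function names are hypothetical, and the numbers µ = 3, σ = 2, and x = 5 are taken from the standardized-variable example earlier in this chapter.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    return (x - mu) / sigma

def from_z(z, mu, sigma):
    """Recover the data value x from its z-score."""
    return mu + z * sigma

z = z_score(5, mu=3, sigma=2)    # x = 5 lies one standard deviation above the mean
x = from_z(z, mu=3, sigma=2)     # converting back returns the original value
```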
Empirical Rule (68-95-99.7%)
For data with a (symmetric) bell-shaped distribution,
the standard deviation has the following characteristics.
1) About 68% of the data lie within one standard
deviation of the mean.
2) About 95% of the data lie within two standard
deviations of the mean.
3) About 99.7% of the data lie within three standard
deviations of the mean.
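For an exactly normal distribution, these three percentages can be checked with Python's standard-library `statistics.NormalDist` (available from Python 3.8). This is only a check against the idealized bell curve; real data will match the rule approximately.

```python
from statistics import NormalDist

std_normal = NormalDist()   # standard normal: mean 0, standard deviation 1

# Proportion of the distribution within k standard deviations of the mean.
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, share in within.items():
    print(k, round(100 * share, 1))
```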
Empirical Rule (68-95-99.7%)
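[Figure: bell-shaped curve with the horizontal axis marked from −4 to 4 standard deviations. 68% of the data lie within 1 standard deviation (34% on each side of the mean), 95% within 2 standard deviations (a further 13.5% on each side), and 99.7% within 3 standard deviations (a further 2.35% on each side).]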
Interpreting z-Scores
Ordinary values: z-score between -2 and 2
Unusual Values: z-score < -2 or z-score > 2
Using the Empirical Rule
The mean value of homes on a street is $125 thousand with
a standard deviation of $5 thousand. The data set has a bell
shaped distribution. Estimate the percent of homes
between $120 and $130 thousand.
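[Figure: bell-shaped curve over home values from 105 to 145 (thousands of dollars), with µ − σ = 120, µ = 125, and µ + σ = 130 marked and 68% labeled between 120 and 130.]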
68% of the houses have a value between $120 and $130 thousand.
Standard Scores – Example 1
The weight data for the 2003 U.S. Women’s World Cup soccer team
is given in the fourth column of the following table.
Standard Scores – Example 1
So, in this case, with µ = 62.55 kg and σ = 4.9 kg, the standardized variable is
$$z = \frac{x - 62.55}{4.9}$$
a. Find and interpret the z-score of Tiffany Roberts’s
weight of 51 kg.
b. Find and interpret the z-score of Cindy Parlow’s
weight of 70 kg.
c. Construct a graph showing the results obtained in
parts (a) and (b).
Standard Scores – Example 1
a. The z-score for Tiffany’s weight of 51 kg is
$$z = \frac{51 - 62.55}{4.9} = -2.36$$
which means that Tiffany’s weight is 2.36 standard deviations
below the mean.
b. The z-score for Cindy’s weight of 70 kg is
$$z = \frac{70 - 62.55}{4.9} = 1.52$$
which means that Cindy’s weight is 1.52 standard deviations
above the mean.
Standard Scores – Example 1
c. In the figure, we marked Tiffany’s weight of 51 kg
with a color dot and Cindy’s weight of 70 kg with a
black dot. Additionally, we located the mean, µ =
62.55 kg, and measured intervals equal in length to
the standard deviation, σ = 4.9 kg.
Dotplot for the weight data for the Women’s World Cup soccer team
Raw Data (kg): 68 59 61 68 51 58 67 61 61 61 59 70 57 61 61 61 66 64 64 73
Ordered (kg): 51 57 58 59 59 61 61 61 61 61 61 61 64 64 66 67 68 68 70 73
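The mean, standard deviation, and z-scores for this example can be recomputed from the 20 weights listed above. The sketch below assumes σ is the population standard deviation (dividing by N), which is what reproduces the stated 4.9 kg. It also rounds σ to one decimal place before computing z, as the example appears to do; with the unrounded σ the z-scores come out slightly smaller in magnitude.

```python
from statistics import mean, pstdev

weights = [68, 59, 61, 68, 51, 58, 67, 61, 61, 61,
           59, 70, 57, 61, 61, 61, 66, 64, 64, 73]

mu = mean(weights)              # population mean of the 20 weights
sigma = pstdev(weights)         # ASSUMPTION: population SD (divides by N)
sigma_rounded = round(sigma, 1) # the example works with sigma rounded to 4.9

z_tiffany = (51 - mu) / sigma_rounded
z_cindy = (70 - mu) / sigma_rounded

print(round(mu, 2), sigma_rounded)
print(round(z_tiffany, 2), round(z_cindy, 2))
```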
Standard Scores – Example 2
John received a 75 on a test whose class mean was 73.2 with a
standard deviation of 4.5. Samantha received a 68.6 on a test
whose class mean was 65 with a standard deviation of 3.9.
Which student had the better test score?
John’s z-score:
$$z = \frac{x - \mu}{\sigma} = \frac{75 - 73.2}{4.5} = 0.4$$
Samantha’s z-score:
$$z = \frac{x - \mu}{\sigma} = \frac{68.6 - 65}{3.9} = 0.92$$
John’s score was 0.4 standard deviations higher than the mean,
while Samantha’s score was 0.92 standard deviations higher
than the mean. Samantha’s test score was better than John’s.
Shape
Skewness