Chapter 3: Displaying and Describing Categorical Data

Chapter 5: Describing Distributions Numerically
Important Values
• The Minimum and Maximum (the Extremes)
• The Midrange is the average of the maximum and
minimum values
– Do not use the midrange to describe the distribution.
• The range of the data is defined as the difference
between the maximum and minimum values
• The Median is the middle value that divides the
histogram into two equal areas
• The quartiles are the points that divide the data
into quarters.
• Interquartile Range (IQR) of a data set is a
measure of variation that gives the range of the
middle portion (about half) of the data.
More on Quartiles
• One quarter of the data lies below the lower quartile
(also known as the 25th percentile)
• One quarter of the data lies above the upper quartile
(also known as the 75th percentile)
• Half the data lies between the lower quartile and the
upper quartile
• The difference between the quartiles is called the
interquartile range (IQR)
Finding the important numerical
values of a Data Set
1.) Order the data set numerically
2.) Find the Extremes, Midrange and range.
3.) Find the median, 𝑄2 (pronounced “cue-two”).
4.) Find 𝑄1 - It is the median of the data entries to
the left of 𝑄2 .
5.) Find 𝑄3 - It is the median of the data entries to
the right of 𝑄2 .
6.) Find the IQR, 𝐼𝑄𝑅 = 𝑄3 − 𝑄1 and any outliers.
Identify Outliers
A data entry is an outlier if
data entry < 𝑄1 − 1.5(𝐼𝑄𝑅)
or
data entry > 𝑄3 + 1.5(𝐼𝑄𝑅)
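The steps above can be sketched in Python. This is a minimal sketch, assuming the median-of-halves rule described in steps 4 and 5 (other textbooks and software use slightly different quartile rules). The function names `quartiles` and `find_outliers` and the `sample` data are illustrative and do not come from the slides.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves rule."""
    data = sorted(values)                      # step 1: order the data
    n = len(data)
    half = n // 2
    lower = data[:half]                        # entries to the left of Q2
    # For an odd count the median itself is excluded from both halves.
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(lower), median(data), median(upper)

def find_outliers(values):
    """Return entries below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in values if x < low_fence or x > high_fence]

sample = [4, 7, 1, 9, 5, 30, 6, 8, 3]   # hypothetical data, not from the slides
q1, q2, q3 = quartiles(sample)
print(q1, q2, q3)               # 3.5 6 8.5
print(find_outliers(sample))    # [30]
```

Here the IQR is 5, so the cutoffs are 3.5 − 7.5 = −4 and 8.5 + 7.5 = 16, and only 30 falls outside them.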
Practice
The numbers of nuclear power plants in the top 15
nuclear power-producing countries in the world
are listed. Find the five-number summary.
7 20 16 6 58 9 20 50 23 33 8 10 15 16 104
Reorder
7 20 16 6 58 9 20 50 23 33 8 10 15 16 104
6 7 8 9 10 15 16 16 20 20 23 33 50 58 104
Minimum: 6
Maximum: 104
Midrange: 55
Range: 98
6 7 8 9 10 15 16 16 20 20 23 33 50 58 104
𝑄2 = 16
𝑄1 = 9
𝑄3 = 33
𝐼𝑄𝑅 = 24
Outliers
No data entries are less than -27.
104 > 69
The country with 104 nuclear power plants is an
outlier.
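The two cutoffs used here, -27 and 69, come from applying the outlier rule with 𝑄1 = 9, 𝑄3 = 33, and 𝐼𝑄𝑅 = 24. Written out as a worked calculation:

```latex
\begin{align*}
Q_1 - 1.5\,(IQR) &= 9 - 1.5(24) = 9 - 36 = -27 \\
Q_3 + 1.5\,(IQR) &= 33 + 1.5(24) = 33 + 36 = 69
\end{align*}
```

Every data entry is above -27, and 104 is the only entry greater than 69.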
Use a box plot to display your data
6 7 8 9 10 15 16 16 20 20 23 33 50 58 104
By hand first, then by calculator, and by Alcula.
Interpret
• The box represents about half of the data,
which means about 50% of the data entries
are between 9 and 33.
• The left whisker represents about one-quarter of
the data, so about 25% of the data entries are less
than 9.
• The right whisker represents about one-quarter of
the data, so about 25% of the data entries are
greater than 33. These are the data entries above
the 75th percentile.
More Interpretations
• The length or height of the box is the IQR.
• If the median is roughly in the middle of the
box, then the middle half of the data is roughly
symmetric. If not, the distribution is skewed.
Summary
The number of power plants in the top 15 nuclear
power producing countries in the world.
As of May 2016, 30 countries worldwide are
operating 444 nuclear reactors for electricity
generation and 63 new nuclear plants are under
construction in 15 countries.
In 2015, 13 countries relied on nuclear energy to
supply at least one-quarter of their total
electricity.
Choose the top 15 to compare.
Country               Number of Operating Nuclear Power Plants
USA                   104
France                59
Japan                 45
Russian Federation    43
China                 55
Republic of Korea     28
India                 27
Canada                19
Ukraine               17
United Kingdom        15
Sweden                10
Germany               8
Belgium               7
Spain                 7
Czech Rep.            6
Homework Answers: Dollars for Students
Comparing Groups with Boxplots
• When we plot two (or more) boxplots side-by-side on the same axis, we can “see” a lot
– Which group has the greater median?
– Which group has the higher IQR?
– Which group has the bigger range?
– Do the groups have similar spreads?
• Symmetry?
• Spread?
• Outliers?
2006
Min: 6
Q1: 9
Q2: 16
Q3: 33
Max: 104
Outlier: 104
2016
Min: 6
Q1: 8
Q2: 19
Q3: 45
Max: 104
Outlier: 104
Comparing the number of power plants in the top 15
nuclear power-producing countries in the world in
2006 and 2016.
The distributions are skewed to the right because the USA has
104 nuclear power plants.
In 2016 the middle half of the countries have between 8 and 45
nuclear power plants (IQR = 37), a noticeably wider spread than
in 2006 (IQR = 24). This is because China had 33 and now has 55
nuclear power plants.
Removing Outliers
2006
Min: 6
Q1: 9
Q2: 16
Q3: 23
Max: 58
Outliers: 50, 58
2016
Min: 6
Q1: 8
Q2: 18
Q3: 43
Max: 59
When the outlier, the USA, is removed from the 2006 data,
France and Japan stand out above the 75th percentile as
outliers. In 2016, with the USA removed, France and Japan
are still above the 75th percentile but are not considered
outliers. In conclusion, countries have built more nuclear
power plants within the past 10 years.
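The summaries for both years, with and without the USA, can be checked with a short script. This is a sketch under two assumptions: quartiles use the median-of-halves rule from the steps earlier in the chapter, and the 2006 counts are the Practice list while the 2016 counts are the country table. The helper name `summarize` is illustrative.

```python
from statistics import median

def summarize(values):
    """Return (min, Q1, Q2, Q3, max, outliers) using median-of-halves quartiles."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[half + 1:] if len(data) % 2 else data[half:]
    q1, q2, q3 = median(lower), median(data), median(upper)
    iqr = q3 - q1
    # Flag anything beyond 1.5 IQRs from the nearer quartile.
    flagged = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
    return data[0], q1, q2, q3, data[-1], flagged

plants = {
    2006: [7, 20, 16, 6, 58, 9, 20, 50, 23, 33, 8, 10, 15, 16, 104],
    2016: [104, 59, 45, 43, 55, 28, 27, 19, 17, 15, 10, 8, 7, 7, 6],
}

for year, counts in plants.items():
    without_usa = [x for x in counts if x != 104]   # the USA's count is 104 in both years
    print(year, "all 15:     ", summarize(counts))
    print(year, "without USA:", summarize(without_usa))
```

With the USA removed, the 2006 list flags 50 and 58 (the upper cutoff is 23 + 1.5(14) = 44), while the 2016 list flags nothing (the upper cutoff is 43 + 1.5(35) = 95.5).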
Comparing the number of power plants in the top 15
nuclear power-producing countries in the world in 2006
and 2016, there is evidence that countries have built, and
are continuing to build, more.
The distributions of both data sets are skewed to the right
because the USA has 104 nuclear power plants.
In 2016 the middle half of the countries have between 8 and 45
nuclear power plants (IQR = 37), a noticeably wider spread than
the 9 to 33 of 2006 (IQR = 24). This is because China had 33 and
now has 55 nuclear power plants.
When the outlier, the USA, is removed from the 2006 data,
France and Japan stand out above the 75th percentile as outliers.
In 2016, with the USA removed, France and Japan are still above
the 75th percentile but are not considered outliers. In
conclusion, these top 15 countries have built more nuclear power
plants within the past 10 years.
We may expect an even higher count in the next 10 years,
considering those countries that have not yet begun
building nuclear power plants.
The Formula for Averaging
• While we know how to find the mean, the
notation here is key:
$$\bar{y} = \frac{\sum y}{n} = \frac{\text{Total}}{n}$$
• Sigma ($\sum$) means to sum the observations.
• $\bar{y}$ is pronounced “y-bar”. In general, a bar over any
symbol/variable denotes finding its mean.
The mean is located at the balancing point of the
histogram. Since the distribution is skewed to the left,
the mean is lower than the median.
Mean or Median?
[Histogram: # of Countries (vertical axis) by HALE (yr) (horizontal axis)]
When data is skewed, it’s better to report the
median than the mean as a measure of center
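A quick check of this advice, as a sketch using the 2006 power plant counts from the Practice problem. Those counts are skewed to the right rather than to the left, so here the mean is pulled above the median instead of below it. The variable names are illustrative.

```python
from statistics import mean, median

plants_2006 = [7, 20, 16, 6, 58, 9, 20, 50, 23, 33, 8, 10, 15, 16, 104]

# The single large value (104) drags the mean upward; the median barely notices it.
print(round(mean(plants_2006), 1), median(plants_2006))    # 26.3 16

# Dropping the USA changes the mean noticeably but leaves the median at 16.
without_usa = [x for x in plants_2006 if x != 104]
print(round(mean(without_usa), 1), median(without_usa))    # 20.8 16.0
```

The mean moves by more than five plants when one country is removed, while the median does not move at all, which is why the median is the safer summary of center for skewed data.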
Standard Deviation
• The IQR is a reasonable summary of spread, but
because it uses only the two quartiles of data, it
ignores much of the information about how
individual values vary.
• The standard deviation takes into account how
far each value is from the mean.
• Just like the mean, the standard deviation is only
appropriate for symmetric data.
Variance
If we summed the deviations from the mean,
however, we would get 0 (which won’t help
much).
However, when we add the squared deviations
from the center and find their average (almost –
we divide by n – 1 instead of n), we call the
result the variance.
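Both claims can be checked numerically. The sketch below uses the 2006 power plant counts from the Practice problem and Python's `fractions.Fraction`, so the deviations sum to exactly 0 instead of a tiny rounding error. The variable names are illustrative.

```python
from fractions import Fraction

plants_2006 = [7, 20, 16, 6, 58, 9, 20, 50, 23, 33, 8, 10, 15, 16, 104]
n = len(plants_2006)

y_bar = Fraction(sum(plants_2006), n)           # exact mean, 395/15
deviations = [y - y_bar for y in plants_2006]   # distance of each value from the mean
print(sum(deviations))                          # 0

# Squaring removes the signs, so the deviations no longer cancel.
squared = [d ** 2 for d in deviations]
variance = sum(squared) / (n - 1)               # divide by n - 1, not n
print(float(variance))
```

The positive and negative deviations cancel exactly, which is why the deviations are squared before averaging.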
Some Formulas
• Variance:
$$s^2 = \frac{\sum (y - \bar{y})^2}{n - 1}$$
Subtract the mean from each data value and square the result.
Then, sum the squared differences.
We use n - 1 instead of n because estimating the mean uses up
1 degree of freedom. Degrees of freedom come up in depth in a
later chapter.
• Standard Deviation:
$$s = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}}$$
Standard deviation is the square root of the variance.
Guidelines
Finding the Sample Variance and Standard Deviation
1.) Find the mean of the sample data set: $\bar{x} = \frac{\sum x}{n}$
2.) Find the deviation of each entry: $x - \bar{x}$
3.) Square each deviation: $(x - \bar{x})^2$
4.) Add to get the sum of squares: $SS_x = \sum (x - \bar{x})^2$
5.) Divide by n-1 to get the sample variance: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
6.) Find the square root of the variance to get the sample
standard deviation: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
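The six guideline steps translate directly into code. This is a minimal sketch: the function name `sample_variance_and_sd` and the `sample` values are hypothetical, chosen so the arithmetic is easy to follow by hand.

```python
import math

def sample_variance_and_sd(xs):
    """Follow the six guideline steps and return (variance, standard deviation)."""
    n = len(xs)
    x_bar = sum(xs) / n                       # 1) mean of the sample
    deviations = [x - x_bar for x in xs]      # 2) deviation of each entry
    squared = [d ** 2 for d in deviations]    # 3) square each deviation
    ss_x = sum(squared)                       # 4) sum of squares
    variance = ss_x / (n - 1)                 # 5) divide by n - 1
    sd = math.sqrt(variance)                  # 6) square root of the variance
    return variance, sd

sample = [4, 8, 6, 2, 10]    # hypothetical data, not from the slides
variance, sd = sample_variance_and_sd(sample)
print(variance)              # 10.0
print(round(sd, 3))          # 3.162
```

For this sample the mean is 6, the deviations are -2, 2, 0, -4, 4, and the sum of squares is 40, so the variance is 40 / 4 = 10.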
Thinking about Variance
• Always report the spread along with any
summary of the center
• If data values are close to the center, the
measures of spread (variance and standard
deviation) will be small
• If data values are far from the center, the
measures of spread will be large
Just Checking
1. The U.S. Census Bureau reports the median family
income in its summary of census data. Why do you
suppose they use the median instead of the mean?
What might be the disadvantages of reporting the
mean?
2. You’ve just bought a new car that claims to get a
highway fuel efficiency of 31 mpg. Of course, your
mileage will “vary.” If you had to guess, would you
expect the IQR of gas mileage attained by all cars like
yours to be 30 mpg, 3 mpg, or 0.3 mpg? Why?
Just Checking
3. A company selling a new MP3 player
advertises that the player has a mean
lifetime of 5 years. If you were in charge of
quality control at the factory, would you
prefer that the standard deviation of
lifespans of the players you produce be 2
years or 2 months? Why?
More S.O.C.S.
How do we know which “center” to report?
• If the shape is skewed, report the median and IQR. You
may want to include the mean and standard deviation,
but you should point out why the mean and median
differ. In fact, when the mean and median differ, it’s a
sign that the distribution may be skewed.
• If the shape is symmetric, report the mean, standard
deviation, and possibly the median and IQR. For
symmetric data, the IQR is usually a little larger than
the standard deviation.
More S.O.C.S.
• If there are any clear outliers and you are reporting the
mean and standard deviation, report them with the
outliers present and with the outliers removed. The
differences may be revealing. The median and IQR are
less sensitive to the outliers.
• We always pair the median with the IQR and the mean
with the standard deviation. It’s not useful to report
one without the other.
• Reporting a center without a spread (and vice versa) is
dangerous.
What Can Go Wrong?
• Don’t forget to do a reality check.
– Verify that your results make sense; it’s easy to make a quick
calculator error!
• Don’t forget to sort the values when finding the median or
percentiles by hand.
• Don’t compute numerical summaries of a categorical variable
(even if it has numbers in it!).
• Watch out for multiple modes
– Consider separating the data into different groups
What (Else) Could Go Wrong?
• Be aware of slightly different methods
– While it won’t make a difference in our course, different
statistics packages/books do things differently
• Beware of outliers
• ALWAYS make a picture!!
• Be careful when comparing groups that have very
different spreads
– We can re-express data to address major differences in
spread as well as shape