
FIND THE STANDARD DEVIATION FOR THE
FOLLOWING DATA OF GPA: 4, 3, 2, 3.5, 3
1. 3.1
2. 1.48
3. 2.2
4. 0.74
5. 0.44
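As a check on the arithmetic, here is a minimal Python sketch that works through the GPA data step by step. It assumes the sample standard deviation (dividing by n - 1) is the one intended; dividing by n instead would give a smaller value.

```python
import math
import statistics

gpa = [4, 3, 2, 3.5, 3]

n = len(gpa)
mean = sum(gpa) / n                              # 15.5 / 5 = 3.1
squared_devs = [(x - mean) ** 2 for x in gpa]    # squared deviations from the mean
variance = sum(squared_devs) / (n - 1)           # sample variance: divide by n - 1
sd = math.sqrt(variance)

print(round(mean, 2))                    # 3.1
print(round(sum(squared_devs), 2))       # 2.2
print(round(sd, 2))                      # 0.74
print(round(statistics.stdev(gpa), 2))   # 0.74 (the library uses n - 1 as well)
```

Several of the other answer choices match intermediate quantities: 3.1 is the mean and 2.2 is the sum of squared deviations.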
UPCOMING IN CLASS

Homework #2 due Monday at 5pm

Quiz #1 in class Jan. 26th (open book)

Part 1 of the Data Project due Jan. 31st
CHAPTER 5
Understanding and Comparing
Distributions
WHY DO WE NEED TO UNDERSTAND AND
COMPARE DISTRIBUTIONS?

To examine your model, you need to
know what your data look like. The distribution is the connection
between your data and your statistical results.
Understanding distributions gives us
preliminary descriptive information about the data and
helps you get a sense of which models to use for further
explanation.
THE BIG PICTURE


We can answer much more interesting questions about variables
when we compare distributions for different groups.
Below is a histogram of the Average Wind Speed for every day in
1989.
THE FIVE-NUMBER SUMMARY

The five-number
summary of a
distribution reports
its median, quartiles,
and extremes
(maximum and
minimum).

Example: The five-number summary for
the daily wind speed
is:
• Max: 8.67
• Q3: 2.93
• Median: 1.90
• Q1: 1.15
• Min: 0.20
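A five-number summary can be computed with a few lines of Python. The sketch below is illustrative only: the function name `five_number_summary` and the data are made up, and it assumes one particular quartile convention (medians of the lower and upper halves, leaving the overall median out of both halves when n is odd). Other conventions give slightly different quartiles.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of at least two numbers.

    Assumption: Q1 and Q3 are the medians of the lower and upper halves,
    with the overall median left out of both halves when n is odd.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]          # the smallest half of the values
    upper = data[n - half:]      # the largest half of the values
    return (data[0], statistics.median(lower), statistics.median(data),
            statistics.median(upper), data[-1])

# Hypothetical wind speeds, not the 1989 data summarized on the slide.
speeds = [0.5, 1.1, 1.4, 1.9, 2.2, 3.0, 4.6]
print(five_number_summary(speeds))   # (0.5, 1.1, 1.9, 3.0, 4.6)
```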
COMPARING GROUPS
• It is always more interesting to compare groups.
• With histograms, note the shapes, centers, and spreads
of the two distributions.


What does this graphical display tell you?
DAILY WIND SPEED: MAKING BOXPLOTS


A boxplot is a graphical display of the five-number summary.
Boxplots are particularly useful when comparing
groups.
MEN VS WOMEN ECO 138


Men
• Min: 0.059
• Q1: 0.895
• Q2 (Median): 0.962
• Q3: 1
• Max: 1

Women
• Min: 0.775
• Q1: 0.993
• Q2 (Median): 1
• Q3: 1
• Max: 1
MEN VS WOMEN STARTING SALARIES


Men
• Min: 18,000
• Q1: 25,000
• Q2 (Median): 45,000
• Q3: 65,000
• Max: 70,000

Women
• Min: 18,000
• Q1: 35,000
• Q2 (Median): 42,000
• Q3: 45,000
• Max: 50,000
CONSTRUCTING BOXPLOTS
1. Draw a single
vertical axis
spanning the range
of the data. Draw
short horizontal
lines at the lower
and upper quartiles
and at the median.
Then connect them
with vertical lines
to form a box.
CONSTRUCTING BOXPLOTS (CONT.)
2. Erect “fences” around the
main part of the data.

The upper fence is 1.5
IQRs above the upper
quartile.

Q3 + 1.5*IQR

The lower fence is 1.5
IQRs below the lower
quartile.

Q1 - 1.5*IQR

Note: the fences only
help with
constructing the
boxplot and should
not appear in the
final display.
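As a worked example of the fence formulas, the sketch below plugs in the quartiles from the daily wind speed summary (Q1 = 1.15, Q3 = 2.93). The function name `fences` is just an illustrative choice.

```python
def fences(q1, q3):
    """Return (lower fence, upper fence) using the 1.5 * IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Quartiles from the daily wind speed five-number summary.
lower, upper = fences(q1=1.15, q3=2.93)
print(round(lower, 2), round(upper, 2))   # -1.52 5.6
```

The maximum wind speed (8.67) lies above the upper fence, so the boxplot has at least one high outlier. The minimum (0.20) is above the lower fence, so there are no low outliers.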
CONSTRUCTING BOXPLOTS (CONT.)
3. Use the fences to
grow “whiskers.”


Draw lines from the
ends of the box up
and down to the
most extreme data
values found within
the fences.
If a data value falls
outside one of the
fences, we do not
connect it with a
whisker.
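The whisker rule can be sketched in Python as follows. The data and the helper name `whisker_ends` are hypothetical. The fences are taken as given here: they correspond to Q1 = 2.5 and Q3 = 6.5 (IQR = 4) for this small data set.

```python
def whisker_ends(values, lower_fence, upper_fence):
    """Return the most extreme data values that lie within the fences."""
    inside = [x for x in values if lower_fence <= x <= upper_fence]
    return min(inside), max(inside)

# Hypothetical data. Fences: 2.5 - 1.5*4 = -3.5 and 6.5 + 1.5*4 = 12.5.
data = [1, 2, 3, 4, 5, 6, 7, 15]
print(whisker_ends(data, lower_fence=-3.5, upper_fence=12.5))   # (1, 7)
```

The whiskers stop at actual data values (1 and 7), not at the fences themselves. The value 15 lies beyond the upper fence, so no whisker reaches it.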
CONSTRUCTING BOXPLOTS (CONT.)
4. Add the outliers by
displaying any data
values beyond the
fences with special
symbols.

We often use a
different symbol for
“far outliers” that
are farther than 3
IQRs from the
quartiles.
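One way to separate ordinary outliers from far outliers is sketched below. The function name and data are made up for illustration, and the quartiles are passed in directly (Q1 = 2 and Q3 = 7 are the medians of the lower and upper halves of this data set).

```python
def classify_outliers(values, q1, q3):
    """Split values beyond the fences into (outliers, far outliers)."""
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # the fences
    outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr       # 3 IQRs from the quartiles
    outliers, far_outliers = [], []
    for x in values:
        if x < outer_low or x > outer_high:
            far_outliers.append(x)
        elif x < inner_low or x > inner_high:
            outliers.append(x)
    return outliers, far_outliers

# Hypothetical data with Q1 = 2 and Q3 = 7, so IQR = 5.
data = [1, 2, 2, 3, 4, 5, 6, 6, 7, 16, 30]
print(classify_outliers(data, q1=2, q3=7))   # ([16], [30])
```

Here the fences are at -5.5 and 14.5, so 16 is an outlier. The value 30 is more than 3 IQRs above Q3 (beyond 22), so it would get the "far outlier" symbol.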
WIND SPEED: MAKING BOXPLOTS (CONT.)


Compare the histogram and boxplot for daily wind
speeds:
How does each display represent the distribution?
COMPARING GROUPS (CONT)



Boxplots offer an ideal balance of information and simplicity,
hiding the details while displaying the overall summary
information.
We often plot them side by side for groups or categories we
wish to compare.
What do these boxplots tell you?
WHAT ABOUT OUTLIERS?


If there are any clear outliers and you are
reporting the mean and standard deviation,
report them with the outliers present and with
the outliers removed. The differences may be
quite revealing.
Note: The median and IQR are not likely to be
affected by the outliers.
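A small numerical illustration of this advice, using made-up data in which the last value is a clear outlier. The helper names are hypothetical, and the IQR uses the medians of the lower and upper halves as quartiles (one of several conventions).

```python
import statistics

def iqr(values):
    """IQR with quartiles taken as medians of the lower and upper halves."""
    data = sorted(values)
    half = len(data) // 2
    return statistics.median(data[len(data) - half:]) - statistics.median(data[:half])

def describe(values):
    """Return (mean, SD, median, IQR) for the data."""
    return (statistics.mean(values), statistics.stdev(values),
            statistics.median(values), iqr(values))

# Hypothetical data: the value 40 is a clear outlier.
with_outlier = [2, 3, 3, 4, 4, 5, 5, 6, 40]
without_outlier = with_outlier[:-1]

for label, values in [("with", with_outlier), ("without", without_outlier)]:
    mean, sd, median, spread = describe(values)
    print(f"{label:8s} mean={mean:.2f} SD={sd:.2f} median={median:.2f} IQR={spread:.2f}")
```

For this data set, removing the outlier cuts the mean from 8 to 4 and the SD from about 12.06 to about 1.31, while the median stays at 4 and the IQR only moves from 2.5 to 2.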
WHAT CAN GO WRONG? (CONT.)
• Beware of outliers.
• Be careful when
comparing groups that
have very different
spreads.

Consider these side-by-side boxplots of
cotinine levels:
A CLASS OF FOURTH GRADERS TAKES A
DIAGNOSTIC READING TEST, AND THE SCORES
ARE REPORTED BY READING GRADE LEVEL.
THE 5-NUMBER SUMMARIES FOR THE BOYS AND
GIRLS ARE SHOWN BELOW.
Girls
• Min: 2.5
• Q1: 3.7
• Q2: 4.3
• Q3: 4.7
• Max: 5.8

Boys
• Min: 2.7
• Q1: 4.1
• Q2: 4.5
• Q3: 4.9
• Max: 5.9

WHICH GROUP HAD THE HIGHEST SCORE?
1. Girls
2. Boys
WHICH GROUP HAD THE GREATEST RANGE?
1. Girls
2. Boys
3. They are the same
WHICH GROUP HAD THE GREATEST IQR?
1. Girls
2. Boys
3. They are the same
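The range and IQR follow directly from the five-number summaries (range = max - min, IQR = Q3 - Q1). A short sketch, with the tuple layout and the function name `spread` chosen here for illustration:

```python
# Five-number summaries from the reading test: (min, Q1, median, Q3, max)
girls = (2.5, 3.7, 4.3, 4.7, 5.8)
boys = (2.7, 4.1, 4.5, 4.9, 5.9)

def spread(summary):
    """Return (range, IQR), rounded to one decimal place."""
    low, q1, _median, q3, high = summary
    return round(high - low, 1), round(q3 - q1, 1)

print("Girls:", spread(girls))   # Girls: (3.3, 1.0)
print("Boys:", spread(boys))     # Boys: (3.2, 0.8)
```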
WHICH GROUP’S SCORES APPEAR MORE
SKEWED?
1. The boys’ scores are more skewed. The quartiles
are the same distance from the mean.
2. The girls’ scores are more skewed. The quartiles
are not the same distance from the median.
3. The boys’ scores are more skewed. The quartiles
are not the same distance from the median.
4. The girls’ scores are more skewed. The quartiles
are the same distance from the median.
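One rough check for skewness from a five-number summary is to compare how far each quartile sits from the median. Unequal gaps suggest the middle half of the data is not symmetric. A sketch of that comparison (the function name `quartile_gaps` is hypothetical):

```python
# Five-number summaries from the reading test: (min, Q1, median, Q3, max)
girls = (2.5, 3.7, 4.3, 4.7, 5.8)
boys = (2.7, 4.1, 4.5, 4.9, 5.9)

def quartile_gaps(summary):
    """Return (median - Q1, Q3 - median), rounded to one decimal place."""
    _low, q1, median, q3, _high = summary
    return round(median - q1, 1), round(q3 - median, 1)

print("Girls:", quartile_gaps(girls))   # Girls: (0.6, 0.4)
print("Boys:", quartile_gaps(boys))     # Boys: (0.4, 0.4)
```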
WHICH GROUP GENERALLY DID BETTER ON
THE TEST?
1. Girls did better b/c the mean
for girls was higher.
2. Girls did better b/c the
median for girls was higher.
3. Boys did better b/c the mean
for boys was higher.
4. Boys did better b/c the
median for boys was higher.
TIMEPLOTS: ORDER, PLEASE!

For some data sets, we are interested in how the data
behave over time. In these cases, we construct
timeplots of the data.
WHAT CAN GO WRONG? (CONT.)
• Avoid inconsistent
scales, either within
the display or when
comparing two
displays.
• Label clearly so a
reader knows what
the plot displays.


Good intentions, bad
plot:
*RE-EXPRESSING SKEWED DATA TO
IMPROVE SYMMETRY
TRANSFORMING DATA
• y = log(x)
  To get the original data back: x = 10^y
• y = sqrt(x)
  To get the original data back: x = y^2 = y*y
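Both re-expressions and their inverses can be checked with a short round trip in Python. This sketch assumes the logarithm is base 10, as the inverse 10^y implies, and uses an arbitrary example value.

```python
import math

x = 250.0                      # arbitrary positive example value

y_log = math.log10(x)          # re-express with the base-10 logarithm
x_from_log = 10 ** y_log       # undo it: x = 10^y

y_sqrt = math.sqrt(x)          # re-express with the square root
x_from_sqrt = y_sqrt ** 2      # undo it: x = y^2

print(math.isclose(x_from_log, x), math.isclose(x_from_sqrt, x))   # True True
```

Note that the logarithm needs positive data and the square root needs non-negative data.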
*RE-EXPRESSING SKEWED DATA TO
IMPROVE SYMMETRY (CONT.)


One way to make a skewed
distribution more symmetric
is to re-express or transform
the data by applying a simple
function (e.g., logarithmic
function).
Note the change in skewness
from the raw data (previous
slide) to the transformed data
(right):
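To see the same effect numerically, here is a small sketch with made-up right-skewed data. In a right-skewed distribution the mean is pulled well above the median. After taking base-10 logs the two agree, which is one sign of symmetry.

```python
import math
import statistics

# Hypothetical right-skewed data (each value is 10 times the previous one).
raw = [1, 10, 100, 1000, 10000]
logged = [math.log10(x) for x in raw]

# Raw data: the mean is far above the median.
print(statistics.mean(raw), statistics.median(raw))          # 2222.2 100

# Log-transformed data: the mean and median coincide.
print(round(statistics.mean(logged), 2), round(statistics.median(logged), 2))   # 2.0 2.0
```

Real data will rarely become perfectly symmetric, so it is worth comparing histograms or boxplots before and after the transformation.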
NEXT TIME

Chapter 6

How we use the Standard Deviation to make
comparisons….