IT 130 – Internet and the Web Louis Ibarra Winter 2008


Module #1 contd
Center of a distribution
Spread of a distribution
Quartiles
5-Number Summary and Boxplot
Outliers
Learning Objectives
By the end of this lecture, you should be able to:
– Recognize how scales, mislabeled axes, etc. on charts can be misleading
– Describe the two most common statistics to describe the center of a
dataset, and when they should be used
– Describe two common statistics used to describe the spread of a
dataset, and when they should be used
– Understand boxplots and the 5-number summary
– Describe what is meant by an outlier and describe two techniques for
identifying outliers.
– Describe and apply the 1.5*IQR rule for outliers
Misleading chart through poor choice of scale/axis
Scales matter
How you stretch the axes and choose your scales can
give a different impression.
[Figure: four plots of the same series, each titled "Death rates from cancer (US, 1945-95)". x-axis: Years; y-axis: Death rate (per thousand). The panels differ in how the axes are stretched, and one y-axis runs from 120 to 220 instead of 0 to 250.]
A picture is worth a thousand words, BUT there is nothing like hard numbers.
Look at the scales.
Outliers
• This is a very important topic.
• Outliers refer to values that seem somehow ‘extreme’ or
well outside the typical range of values in your dataset.
• How to deal with outliers is a very involved subject, and while
it certainly merits much discussion, we will not delve into it
too much today.
• Your goal for today is to identify outliers. That is, to develop
some ability to look at a number and make a reasonably
educated decision as to whether or not that value is an
outlier.
• We will discuss two techniques for doing so shortly:
– Examination of a histogram
– Using the “1.5 * IQR” Rule
Describing the center and spread of a distribution
• A distribution is best described through a combination of visuals (e.g.
graphs), and numbers.
• Two key numeric descriptions are:
– Center: e.g. the mean
– Spread (aka Variation)
• Center:
– Statistics for describing the center: Mean, Median, Mode
• Mean: Most of us are familiar with the ‘mean’ (average). However, we should typically only use
the mean if the dataset has no outliers, and is not highly skewed.
• Median: a better choice for the center of a distribution that has outliers, or is skewed
• Mode: Will discuss later
• Spread (Variation)
– Statistics for describing the spread: Percentiles, Quartiles, Standard Deviation
– We will discuss these shortly
Measure of center: the mean
The mean or arithmetic average
To calculate the average, or mean, add
all values, then divide by the number of
individuals. It is the “center of mass.”
Sum of heights is 1598.3
divided by 25 women = 63.9 inches
Heights of 25 women in inches
58.2 59.5 60.7 60.9 61.9
61.9 62.2 62.2 62.4 62.9
63.9 63.1 63.9 64.0 64.5
64.1 64.8 65.2 65.7 66.2
66.7 67.1 67.8 68.9 69.6
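As a quick check of the arithmetic on this slide, here is a minimal Python sketch that recomputes the mean from the 25 heights listed above. The variable names are illustrative and not part of the lecture.

```python
# Heights of 25 women in inches, as listed on the slide.
heights = [
    58.2, 59.5, 60.7, 60.9, 61.9,
    61.9, 62.2, 62.2, 62.4, 62.9,
    63.9, 63.1, 63.9, 64.0, 64.5,
    64.1, 64.8, 65.2, 65.7, 66.2,
    66.7, 67.1, 67.8, 68.9, 69.6,
]

total = sum(heights)                 # add all values
mean_height = total / len(heights)   # divide by the number of individuals

print(f"sum = {total:.1f}, n = {len(heights)}, mean = {mean_height:.1f}")
```

Rounded to one decimal place, the result agrees with the 63.9 inches quoted on the slide.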
Another measure of center: the median
The median is the midpoint of a distribution—the number such
that half of the observations are smaller and half are larger.
Survival years for Disease X (n = 25):
0.6 1.2 1.6 1.9 1.5 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4
3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 5.6 6.1
1. Sort observations by size.
n = number of observations
2.a. If n is odd, the median is observation (n+1)/2 down the list.
n = 25
(n+1)/2 = 26/2 = 13
Median = 3.4
2.b. If n is even, the median is the mean of the two middle observations.
n = 24 (the same list without the largest value, 6.1)
n/2 = 12
Median = (3.3+3.4)/2 = 3.35
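The odd and even cases above can be sketched in Python as follows. This is a minimal illustration rather than the lecture's own code; the function name `median_by_position` is made up for this example, and the data are the Disease X survival years from the slide.

```python
def median_by_position(values):
    """Return the median using the odd/even cases described on the slide."""
    ordered = sorted(values)          # step 1: sort observations by size
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty list is undefined")
    if n % 2 == 1:
        # step 2.a: n is odd, take observation (n + 1) / 2 (counting from 1)
        return ordered[(n + 1) // 2 - 1]
    # step 2.b: n is even, average the two middle observations
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


survival_years = [
    0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
    3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1,
]

median_25 = median_by_position(survival_years)               # n = 25 (odd)
median_24 = median_by_position(sorted(survival_years)[:-1])  # n = 24, largest value dropped

print(median_25, round(median_24, 2))
```

Sorting inside the function means the caller does not have to supply the values in order.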
‘Resistant’ is an important term. We say that the median is ‘resistant’ to outliers
because the presence of 1 or 2 outliers does not affect the median dramatically.
Conversely, the mean is not resistant to outliers.
Consider a series of incomes (in thousands) taken from a graduate classroom:
18, 24, 37, 41, 62, 63, 2000
The median income is the middle value in the dataset: $41,000
However, the mean is dramatically higher: about $321,000, since the one individual who
made $2 million pulls the mean disproportionately in the high direction. As a
result, we end up with a ‘center’ value that probably does not truly represent the
‘average’ income of our sample.
So we say that:
• The median is resistant to outliers
• The mean is not resistant to outliers
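To see the difference in resistance numerically, the sketch below computes the mean and median of the classroom incomes with and without the 2000 value. It uses Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean, median

incomes = [18, 24, 37, 41, 62, 63, 2000]   # in thousands of dollars
without_outlier = incomes[:-1]             # the same incomes minus the 2000 value

# Median barely moves; the mean changes by a factor of roughly eight.
print("with outlier:   ", median(incomes), round(mean(incomes), 1))
print("without outlier:", median(without_outlier), round(mean(without_outlier), 1))
```

Dropping the single extreme income moves the median only from 41 to 39, while the mean falls from about 320.7 to about 40.8.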
Effect of outliers on the mean and median
[Figure: two histograms of percent of people dying. Without the outliers: x̄ = 3.4. With the outliers: x̄ = 4.2.]
Note the presence of outliers: those two fortunate people who managed to live several
years longer than the others. These two large values moved the mean up from 3.4 to 4.2.
However, the median, the number of years it takes for half the people to die, only went from
3.4 to 3.6.
Note that this says that the median is fairly resistant, but not 100% resistant. The median is
not sensitive to the size of the outliers; rather, it is sensitive to the number of outliers.
This is typical behavior for the mean and median. The mean is sensitive to outliers because,
when you add all the values up to get the mean, the outliers are weighted disproportionately
by their large size.
However, when you get the median, they are just another two points to count; the actual
size of those values does not affect things.
Measures of spread / variation
• Most people intuitively ‘get’ the benefit of knowing the center of a
distribution (e.g. the ‘average’ salary of first-year doctors). However, the spread
of the data (also known as the variation) is sadly neglected, even though it is
EVERY bit as important.
• Just as there are different ways of describing the center of a distribution
(e.g. mean, median, mode), there are different techniques for describing the
spread of a distribution.
• As with the center, you must know which description of the spread is the
best or most accurate tool for the dataset at hand.
• Common techniques for describing the variation in a dataset:
– Range: the distance between the lowest and highest values in the dataset. Important, but
outliers can give people a highly inaccurate picture (imagine if you
looked at the range of salaries).
– Quartiles: the values that divide the sorted data into four equal-sized groups.
– Standard Deviation / Variance: this is one of the most effective means
of describing the spread, and a tool that we will come back to constantly
throughout this course.
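As a small illustration of how an outlier distorts the range, the sketch below reuses the classroom incomes from the earlier slide (in thousands). The helper name `value_range` is hypothetical and chosen only for this example.

```python
def value_range(values):
    """Return (lowest, highest, highest - lowest) for a non-empty list."""
    low, high = min(values), max(values)
    return low, high, high - low


incomes = [18, 24, 37, 41, 62, 63, 2000]   # in thousands of dollars

print(value_range(incomes))        # range including the 2000 value
print(value_range(incomes[:-1]))   # range of the remaining six incomes
```

With the extreme income included the range is 1982; without it the range is 45, which is a far better picture of how the other six incomes vary.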
Percentiles and Quartiles
• The xth percentile (e.g. the 38th percentile) is the value below which ‘x’ percent
of observations fall.
– Example: If your height is said to be in the 80th percentile, it means that 80%
of the people measured were shorter than you.
• Two commonly used percentiles are the first quartile and the third
quartile. These refer to the 25th and 75th percentiles respectively.
– Q1 (first quartile): Refers to the 25th percentile, i.e. 25% of observations are
below this value.
– Q2 (second quartile): Refers to the 50th percentile. In other words, the
median!
– Q3 (third quartile): Refers to the 75th percentile, i.e. 75% of observations fall
below this value.
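The percentile idea can be sketched by counting how many observations fall below a given value. The helper `percent_below` is hypothetical, and it assumes "below" means strictly smaller; textbooks and software differ on how ties are handled. The data are the 25 heights from the earlier slide.

```python
def percent_below(values, x):
    """Percent of observations strictly smaller than x."""
    if not values:
        raise ValueError("need at least one observation")
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)


heights = [
    58.2, 59.5, 60.7, 60.9, 61.9, 61.9, 62.2, 62.2, 62.4, 62.9,
    63.9, 63.1, 63.9, 64.0, 64.5, 64.1, 64.8, 65.2, 65.7, 66.2,
    66.7, 67.1, 67.8, 68.9, 69.6,
]

print(percent_below(heights, 66.7))
```

Under this definition, a woman who is 66.7 inches tall is taller than 20 of the 25 women, which places her at the 80th percentile of this sample.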
5-Number Summary and Box Plot
• Once you have divided your dataset into quartiles, you now
have one technique for creating a neat little summary. It is
called the ‘5 Number Summary’ and is made up of:
– Lowest number
– First (lower) quartile
– Median (not the mean!)
– Third (upper) quartile
– Highest number
• Once you have this summary in hand, you can even ‘draw’ it
using a simple (but very convenient) plot known as a box plot.
Determining the quartiles:
Start by finding the median. (This is Q2).
Then find the middle value between the lowest
number and the median (excluding the median
itself). This is the first quartile, Q1. It is the
value in the sample that has 25% of the
observations (data points) at or below it.
Then find the middle value between the
median and the highest number. This is the
third quartile, Q3. It is the value in the sample
that has 75% of the data at or below it. (It is
the median of the upper half of the sorted data,
excluding M).
Survival time (years), n = 25:
0.6 1.2 1.6 1.9 1.5 2.1 2.3 2.3 2.5 2.8 2.9 3.3 3.4
3.6 3.7 3.8 3.9 4.1 4.2 4.5 4.7 4.9 5.3 5.6 6.1
Q1 = first quartile = 2.2
M = median = 3.4
Q3 = third quartile = 4.35
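The procedure above (find the median, then take the median of each half while excluding the median itself) can be sketched as follows. The function name `quartiles_by_halves` is made up for this example. The slide only shows an odd-sized dataset, so splitting an even-sized dataset into two equal halves is an assumption of this sketch.

```python
from statistics import median


def quartiles_by_halves(values):
    """Return (Q1, M, Q3) using the median-of-halves method from the slide.

    Needs at least two observations so that each half is non-empty.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                 # observations below the median position
    # For odd n, skip the median itself; for even n (assumption), use the top half.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(ordered), median(upper)


survival_years = [
    0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
    3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1,
]

q1, m, q3 = quartiles_by_halves(survival_years)
print(round(q1, 2), m, round(q3, 2))
```

Other textbooks and software packages use slightly different quartile conventions, so results on other datasets may not match this method exactly.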
Determining the Five Number Summary
The five number summary is made up of:
1. Minimum number
2. Q1
3. Median (Q2)
4. Q3
5. Maximum number
For this dataset, the summary is:
0.6, 2.2, 3.4, 4.35, 6.1
Again, the five number summary is a
good tool for summarizing the center
and spread of skewed distributions.
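As a cross-check, the five-number summary can also be assembled with the standard library. This sketch assumes Python 3.8 or later, where `statistics.quantiles` is available. Its `method="exclusive"` setting is a library convention and not the slide's procedure; for these 25 values it gives the same quartiles, but that is not guaranteed for every dataset.

```python
from statistics import quantiles

survival_years = [
    0.6, 1.2, 1.6, 1.9, 1.5, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
    3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1,
]

# n=4 asks for the three cut points that split the data into quarters.
q1, med, q3 = quantiles(survival_years, n=4, method="exclusive")

five_number_summary = (min(survival_years), q1, med, q3, max(survival_years))
print([round(x, 2) for x in five_number_summary])
```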
The boxplot is a graph of the 5-Number summary
[Figure: BOXPLOT of years until death for Disease X, drawn next to the sorted data. Labeled points: Largest = max = 6.1; Q3 = third quartile = 4.35; M = median = 3.4; Q1 = first quartile = 2.2; Smallest = min = 0.6.]
Five-number summary:
min Q1 M Q3 max
Boxplots for skewed data
Comparing box plots for a normal
and a right-skewed distribution
[Figure: side-by-side boxplots of years until death for Disease X and Multiple Myeloma; y-axis from 0 to 15.]
Boxplots remain true to the data and clearly depict symmetry or skew.
OUTLIERS – Identification of the Outlier
At what point do we typically label a datapoint as an outlier? We will discuss
two methods here:
1. One way is to look at a chart and see if any values appear to be “off the
chart” relative to the large majority of values.
2. Another tool is the “1.5 IQR” Rule for outliers.
Identifying outlier(s) on a histogram
The overall pattern is fairly symmetrical except for 2 states that clearly do not
belong to the main trend. Alaska and Florida have unusual representation of
the elderly in their population.
[Figure: histogram with the bars for Alaska and Florida labeled.]
A large gap in the distribution is typically a sign of an outlier.
Again, we are NOT currently interested in what to do with outliers; merely in how to
identify them.
Identification of outliers using the 1.5 IQR Rule
1. Determine the distance between Q1 and Q3. This is called the
interquartile range, or IQR.
2. Multiply the IQR by 1.5.
3. Determine the distance from the suspicious data point to the nearest
quartile (Q1 or Q3).
4. We call an observation a suspected outlier if that distance is larger than
1.5 * IQR, that is, if the observation falls more than 1.5 times the size of the
interquartile range (IQR) below the first quartile or above the third quartile.
This technique is called the “1.5 * IQR rule for outliers.”
Example of the 1.5 IQR Rule
Here is the 5-number summary for the
dataset discussed earlier:
0.6, 2.2, 3.4, 4.35, 6.1
Would a value of 7.5 be an outlier?
What about 8?
• IQR = 4.35-2.2 = 2.15
• 1.5*IQR = 3.225, or about 3.23
• For a number to be an outlier on the high side, it must be greater than
4.35 + 3.23 = 7.58
• So, 7.5 would not be considered an outlier by this criterion. However, 8
would.
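The rule and the worked example above can be sketched in Python. The function names and the word "fences" for the two cutoffs are choices made for this illustration and are not from the lecture; Q1 = 2.2 and Q3 = 4.35 come from the five-number summary of the survival data.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower, upper) cutoffs: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1          # distance between Q1 and Q3
    margin = k * iqr       # 1.5 times the IQR by default
    return q1 - margin, q3 + margin


def is_suspected_outlier(value, q1, q3, k=1.5):
    """True if value lies more than k*IQR below Q1 or above Q3."""
    lower, upper = iqr_fences(q1, q3, k)
    return value < lower or value > upper


lower, upper = iqr_fences(2.2, 4.35)
print(round(lower, 3), round(upper, 3))

for candidate in (7.5, 8):
    print(candidate, is_suspected_outlier(candidate, 2.2, 4.35))
```

The upper cutoff works out to 7.575 (7.58 after rounding), so 7.5 is not flagged and 8 is. The lower cutoff is negative here, so no survival time can be a suspected outlier on the low side.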
Remember that a histogram does not give you ALL the data - it is merely a
summary (albeit a good one!) of the distribution.
However, to be able to do statistics using specific numbers (e.g. to calculate a
5-number summary) you would need to see the actual dataset.
For this example, I will provide you with Q1 and Q3:
Q1: 19.27
Q3: 45.40
IQR = 45.40 – 19.27 = 26.13
1.5*IQR = 39.2
Q3 + 1.5*IQR = 45.40 + 39.2 = 84.60
Any amount more than 84.60 is a suspected outlier.
How to deal with OUTLIERS
Outliers are data points that require some thought. The first step is to decide whether a data point should indeed
be labeled as an outlier. We will discuss this momentarily. Once you have decided that it is an outlier, the next
question is what you want to do with it.
There are two options for dealing with outliers – you can include them in your analysis, or you can leave them
out.
• Exclude outliers: Suppose you have a datapoint that is extremely high – and you think it was recorded in
error. In this case, you would not want to include this value in your calculations since values like mean and
standard deviation would be thrown off by this bad datapoint.
• However, if you choose to leave out a datapoint, you MUST include in your paper a discussion of your
reasons for doing so.
• Include outliers: The other option, of course, is to include the outlier(s) in your calculations and analysis. In
this case, you have to decide which statistics to use (mean vs median, etc.)
• Discussion question: Suppose we wanted to determine the average height of DePaul students and we use
our class as a sample. However, that particular day, we are being visited by an incoming freshman who just
happens to be the tallest person in the world. Would you include him/her in your analysis?
– I would probably leave him out of the analysis since he does not represent the ‘typical’ DePaul student.
– However, when reporting my decision, I MUST report that I did so, and explain my decision.