STA 291-021 Summer 2007 - University of Kentucky
Lecture 5
Dustin Lueker
Mean - Arithmetic average
◦ Mean of a sample - x̄
◦ Mean of a population - μ
Median - Midpoint of the observations when they are arranged in increasing order
Mode - Most frequent value
Notation: Subscripted variables
◦ n = # of units in the sample
◦ N = # of units in the population
◦ x = Variable to be measured
◦ x_i = Measurement of the ith unit
STA 291 Summer 2008 Lecture 5
The pth percentile (Lp) is a number such that
p% of the observations take values below it,
and (100-p)% take values above it
◦ 50th percentile = median
◦ 25th percentile = lower quartile
◦ 75th percentile = upper quartile
The index of Lp
◦ (n+1)p/100
25th percentile
◦ lower quartile
◦ Q1
◦ (approximately) median of the observations
below the median
75th percentile
◦ upper quartile
◦ Q3
◦ (approximately) median of the observations
above the median
Find the 25th percentile of this data set
◦ {3, 7, 12, 13, 15, 19, 24}
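As a worked check of the index rule (n+1)p/100 on this data set, with n = 7 and p = 25 (the notation x_(2) for the 2nd smallest observation is an assumption introduced here, not the lecture's):

```latex
\[
\text{index} = \frac{(n+1)\,p}{100} = \frac{(7+1)(25)}{100} = 2
\qquad\Rightarrow\qquad
L_{25} = x_{(2)} = 7
\]
```

The index is a whole number, so the 25th percentile is simply the 2nd value in the ordered list and no interpolation is needed.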
Interpolation: use when the index is not a whole number
Go to the closest lower index, then move the fractional part of the distance toward the next value
If the index is found to be 5.4, go to the 5th value, then add 0.4 of the difference between the 6th value and the 5th value
◦ In essence we are going to the 5.4th value
Find the 40th percentile of the same data set
◦ {3, 7, 12, 13, 15, 19, 24}
Must use interpolation
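The interpolation rule can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function name `percentile` is hypothetical, and it assumes the data are already sorted and that the index (n+1)p/100 falls between 1 and n, since the lecture does not say what to do outside that range.

```python
import math

def percentile(sorted_data, p):
    """Return the pth percentile using the index rule (n+1)p/100.

    Assumes sorted_data is in increasing order and that the index
    falls between 1 and n (an assumption; other cases raise an error).
    """
    n = len(sorted_data)
    index = (n + 1) * p / 100          # 1-based position in the ordered data
    lower = math.floor(index)          # closest whole index at or below
    fraction = index - lower           # decimal part of the index
    if lower < 1 or lower > n or (lower == n and fraction > 0):
        raise ValueError("index falls outside the data")
    lower_value = sorted_data[lower - 1]
    if fraction == 0:
        return lower_value             # whole-number index: no interpolation
    upper_value = sorted_data[lower]
    # Move the fractional distance from the lower value toward the next one.
    return lower_value + fraction * (upper_value - lower_value)

data = [3, 7, 12, 13, 15, 19, 24]
l40 = percentile(data, 40)             # index = 8 * 40 / 100 = 3.2
print(round(l40, 4))
```

Here the index is 3.2, so the result lies 0.2 of the way from the 3rd value (12) to the 4th value (13), which is about 12.2.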
Five Number Summary
◦ Minimum
◦ Lower Quartile
◦ Median
◦ Upper Quartile
◦ Maximum
Example
◦ minimum = 4
◦ Q1 = 256
◦ median = 530
◦ Q3 = 1105
◦ maximum = 320,000
What does this suggest about the shape of the distribution?
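One possible way to approach the question is to compare how far the summary values stretch on each side of the median. The sketch below is only an illustration; the variable names are hypothetical.

```python
# Five-number summary from the example above.
minimum, q1, median, q3, maximum = 4, 256, 530, 1105, 320_000

lower_half_spread = median - minimum   # distance covered by the lower 50%
upper_half_spread = maximum - median   # distance covered by the upper 50%
lower_box = median - q1                # Q1 to median
upper_box = q3 - median                # median to Q3

print(lower_half_spread, upper_half_spread)
print(lower_box, upper_box)
```

The upper half of the data covers a far longer stretch than the lower half (319,470 versus 526), and the same holds inside the box (575 versus 274), which is consistent with a distribution that is strongly skewed to the right.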
The Interquartile Range (IQR) is the difference
between the upper and lower quartiles
◦ IQR = Q3 – Q1
◦ IQR = Range of values that contains the middle 50%
of the data
◦ IQR increases as variability increases
Murder Rate Data
◦ Q1= 3.9
◦ Q3 = 10.3
◦ IQR =
A box plot displays the five number summary (and
more) graphically
It consists of
◦ A box that contains the central 50% of the distribution
(from lower quartile to upper quartile)
◦ A line within the box that marks the median
◦ Whiskers that extend to the maximum and minimum values
This assumes there are no outliers in the data set
An observation is an outlier if it falls
◦ more than 1.5 IQR above the upper quartile
or
◦ more than 1.5 IQR below the lower quartile
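The 1.5 IQR rule can be sketched as a pair of small Python functions. The names `outlier_fences` and `is_outlier`, and the word "fence" for the cutoffs, are assumptions made for this illustration. The quartiles come from the murder rate example, while the two test values are hypothetical.

```python
def outlier_fences(q1, q3):
    """Return (lower_fence, upper_fence) for the 1.5 IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def is_outlier(value, q1, q3):
    """True if value lies more than 1.5 IQR beyond either quartile."""
    lower_fence, upper_fence = outlier_fences(q1, q3)
    return value < lower_fence or value > upper_fence

# Quartiles from the murder rate example; the test values 20.3 and 12.0
# are hypothetical and not taken from the lecture's data.
q1, q3 = 3.9, 10.3
lower_fence, upper_fence = outlier_fences(q1, q3)
print(round(lower_fence, 2), round(upper_fence, 2))
print(is_outlier(20.3, q1, q3), is_outlier(12.0, q1, q3))
```

With IQR = 6.4, the cutoffs are about -5.7 and 19.9, so a value of 20.3 would be flagged as an outlier while 12.0 would not.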
Whiskers only extend to the most extreme
observations within 1.5 IQR beyond the
quartiles
If an observation is an outlier, it is marked by
an x, +, or some other identifier
Values
Min = 148
Q1 = 158
Median = Q2 = 162
Q3 = 182
Max = 204
Create a box plot
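Before drawing the plot, it may help to check whether any observation counts as an outlier under the 1.5 IQR rule. A worked calculation with the values above:

```latex
\[
\begin{aligned}
\text{IQR} &= Q_3 - Q_1 = 182 - 158 = 24 \\
1.5 \times \text{IQR} &= 36 \\
Q_1 - 1.5 \times \text{IQR} &= 158 - 36 = 122 \\
Q_3 + 1.5 \times \text{IQR} &= 182 + 36 = 218
\end{aligned}
\]
```

Since the minimum (148) is not below 122 and the maximum (204) is not above 218, no observation is an outlier. The box runs from 158 to 182 with the median line at 162, and the whiskers extend to 148 and 204.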
Mode: the value that occurs most frequently
◦ Does not need to be near the center of the distribution
Not really a measure of central tendency
◦ Can be used for all types of data (nominal, ordinal, interval)
Special Cases
◦ Data Set
{2, 2, 4, 5, 5, 6, 10, 11}
Mode =
◦ Data Set
{2, 6, 7, 10, 13}
Mode =
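The two special cases can be explored with a short Python sketch. The function name `modes` is hypothetical, and treating a data set in which every value occurs once as having no mode is a common convention assumed here rather than something the slide states.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs most often, or [] if no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []                      # every value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([2, 2, 4, 5, 5, 6, 10, 11]))   # two values tie for most frequent
print(modes([2, 6, 7, 10, 13]))            # no value repeats
```

In the first data set both 2 and 5 occur twice, so the mode is not unique. In the second, no value repeats, so under the assumed convention there is no mode.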
Mean
◦ Interval data with an approximately symmetric
distribution
Median
◦ Interval or ordinal data
Mode
◦ All types of data
Mean is sensitive to outliers
◦ Median and mode are not
Why?
In general, the median is more appropriate
for skewed data than the mean
◦ Why?
In some situations, the median may be too
insensitive to changes in the data
The mode may not be unique
“How often do you read the newspaper?”
Response                  Frequency
every day                 969
a few times a week        452
once a week               261
less than once a week     196
Never                     76
TOTAL                     1954
• Identify the mode
• Identify the median response
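For ordinal data in a frequency table, the mode is the category with the largest count, and the median response is the category that contains the middle observation(s). A minimal sketch, with the hypothetical helper `response_at` and the assumption that the total is even (as it is here):

```python
# Ordered response categories with their frequencies from the table above.
responses = [
    ("every day", 969),
    ("a few times a week", 452),
    ("once a week", 261),
    ("less than once a week", 196),
    ("never", 76),
]

total = sum(count for _, count in responses)

# Mode: the category with the highest frequency.
mode_response = max(responses, key=lambda item: item[1])[0]

def response_at(position):
    """Return the category holding the observation at a 1-based position."""
    cumulative = 0
    for label, count in responses:
        cumulative += count
        if position <= cumulative:
            return label
    raise ValueError("position exceeds the number of observations")

# With an even total, the median sits between positions n/2 and n/2 + 1.
middle = (response_at(total // 2), response_at(total // 2 + 1))
print(total, mode_response, middle)
```

The 977th and 978th ordered responses both fall in "a few times a week" (the first category only covers positions 1 to 969), so that is the median response, while the mode is "every day".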
Statistics that describe variability
◦ Two distributions may have the same mean
and/or median but different variability
Mean and Median only describe a typical value, but
not the spread of the data
◦ Range
◦ Variance
◦ Standard Deviation
◦ Interquartile Range
All of these can be computed for the sample or
population
Range: the difference between the largest and smallest
observations
◦ Very much affected by outliers
A misrecorded observation may lead to an outlier, and
affect the range
The range does not always reveal different
variation about the mean
Sample 1
◦ Smallest Observation: 112
◦ Largest Observation: 797
◦ Range =
Sample 2
◦ Smallest Observation: 15033
◦ Largest Observation: 16125
◦ Range =
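Applying the definition (largest observation minus smallest observation) to the two samples gives:

```latex
\[
\begin{aligned}
\text{Range}_1 &= 797 - 112 = 685 \\
\text{Range}_2 &= 16125 - 15033 = 1092
\end{aligned}
\]
```

Sample 2 has the larger range even though its observations are all much larger in magnitude; the range depends only on the gap between the extremes, not on where the data are located.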
The deviation of the ith observation x_i from
the sample mean x̄ is the difference between
them, (x_i − x̄)
◦ Sum of all deviations is zero
◦ Therefore, we use either the sum of the absolute
deviations or the sum of the squared deviations as
a measure of variation
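The claim that the deviations always sum to zero follows directly from the definition of the sample mean. A short derivation:

```latex
\[
\sum_{i=1}^{n} (x_i - \bar{x})
= \sum_{i=1}^{n} x_i - n\bar{x}
= n\bar{x} - n\bar{x}
= 0,
\qquad \text{since } \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i .
\]
```

Positive and negative deviations cancel exactly, which is why the raw sum cannot measure spread and absolute or squared deviations are used instead.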
Variance of n observations is the sum of the
squared deviations, divided by n-1
\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \]
Observation    Mean    Deviation    Squared Deviation
1
3
4
7
10
Sum of the Squared Deviations
n-1
Sum of the Squared Deviations / (n-1)
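The worksheet above can be filled in step by step with a short script. This is a sketch for checking hand calculations, using the five observations listed in the table; the variable names are illustrative.

```python
observations = [1, 3, 4, 7, 10]

n = len(observations)
mean = sum(observations) / n

print("Observation  Mean  Deviation  Squared Deviation")
sum_squared = 0.0
for x in observations:
    deviation = x - mean               # x_i minus the sample mean
    squared = deviation ** 2
    sum_squared += squared
    print(f"{x:>11}  {mean:>4}  {deviation:>9}  {squared:>17}")

variance = sum_squared / (n - 1)       # sample variance divides by n - 1
print("Sum of the Squared Deviations:", sum_squared)
print("n - 1:", n - 1)
print("Sum of the Squared Deviations / (n - 1):", variance)
```

The mean is 5, the squared deviations are 16, 4, 1, 4 and 25, their sum is 50, and dividing by n - 1 = 4 gives a sample variance of 12.5.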
The variance is approximately the average of the squared deviations
◦ “average squared distance from the mean”
Unit
◦ Square of the unit for the original data
Difficult to interpret
◦ Solution
Take the square root of the variance, so that the unit is
the same as for the original data
This is the standard deviation
s≥0
◦ s = 0 only when all observations are the same
If data is collected for the whole population
instead of a sample, then n-1 is replaced by n
s is sensitive to outliers
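Sample
◦ Variance
\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \]
◦ Standard Deviation
\[ s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} \]
Population
◦ Variance
\[ \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} \]
◦ Standard Deviation
\[ \sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}} \]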
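Python's standard `statistics` module provides both versions, which makes the effect of the divisor easy to see. In this sketch the same five observations from the variance table are treated first as a sample and then as a whole population, purely for illustration.

```python
import math
import statistics

data = [1, 3, 4, 7, 10]   # the observations from the variance table

sample_var = statistics.variance(data)    # divides by n - 1
sample_sd = statistics.stdev(data)        # square root of the sample variance
pop_var = statistics.pvariance(data)      # divides by N
pop_sd = statistics.pstdev(data)          # square root of the population variance

print(sample_var, pop_var)
print(round(sample_sd, 3), round(pop_sd, 3))
```

The sum of squared deviations is 50 in both cases; dividing by 4 gives 12.5, while dividing by 5 gives 10, so the sample formula always yields the slightly larger value.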
Population mean and population standard deviation
are denoted by the Greek letters μ (mu) and σ
(sigma)
◦ They are unknown constants that we would like to estimate
Sample mean and sample standard deviation are
denoted by x̄ and s
◦ They are random variables, because their values vary
according to the random sample that has been selected
If the data is approximately symmetric and
bell-shaped then
◦ About 68% of the observations are within one
standard deviation from the mean
◦ About 95% of the observations are within two
standard deviations from the mean
◦ About 99.7% of the observations are within three
standard deviations from the mean
SAT scores are scaled so that they have an
approximate bell-shaped distribution with a
mean of 500 and standard deviation of 100
◦ About 68% of the scores are between
◦ About 95% of the scores are between
◦ If you have a score above 700, you are in the top ___%
What percentile would this be?
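The blanks can be worked out from the 68% and 95% statements for bell-shaped data. A minimal sketch, assuming the distribution is symmetric so that the scores outside two standard deviations split evenly between the two tails:

```python
mean, sd = 500, 100   # SAT scale given in the example

within_one = (mean - sd, mean + sd)          # about 68% of scores
within_two = (mean - 2 * sd, mean + 2 * sd)  # about 95% of scores

# About 95% lie within two standard deviations, so about 5% lie outside,
# split evenly between the two tails because the distribution is symmetric.
top_percent_above_700 = (100 - 95) / 2
percentile_of_700 = 100 - top_percent_above_700

print(within_one, within_two)
print(top_percent_above_700, percentile_of_700)
```

About 68% of scores fall between 400 and 600 and about 95% between 300 and 700, so a score above 700 is in roughly the top 2.5%, which corresponds to approximately the 97.5th percentile.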
According to the National Association of Home
Builders, the U.S. nationwide median selling price
of homes sold in 1995 was $118,000
◦ Would you expect the mean to be larger, smaller, or equal
to $118,000?
◦ Which of the following is the most plausible value for the
standard deviation?
(a) –15,000, (b) 1,000, (c) 45,000, (d) 1,000,000