Chapter 3 - Dalton State College
PROBABILITY &
STATISTICS FOR P-8
TEACHERS
Chapter 3
Data Description
WHAT IS NEXT?
Now that we know how to organize
the data and create nice graphs to
present the results, we need to
focus on describing patterns in the
data.
Summarizing
data sets numerically
Are there certain values that seem
more typical for the data?
How typical are they?
A number that helps describe a set of
data is an AVERAGE!
Sometimes called a
MEASURE OF CENTRAL
TENDENCY
NUMERICAL MEASURES OF
DATA
Central Tendency is the value or values around which the data tend to cluster
Variability shows how strongly the data cluster around that value
FINDING THE CENTER
All of these are Measures of Central Tendency
o MEAN
o MEDIAN
o MODE
o MIDRANGE
The question
“What’s my average?”
has many meanings
What we should say is
“What’s my mean?”
WHAT DO THEY ALL MEAN?
MEAN
Arithmetic Mean (Mean)
the measure of center obtained by adding
the values and dividing the total by the
number of values
What most people call an
average.
NOTATION
$\sum$ denotes the sum of a set of values.
x is the variable usually used to represent the individual data values.
n represents the number of data values in a sample.
N represents the number of data values in a population.
MEAN
The sample mean is computed using sample
data.
o Denoted by $\bar{x}$
The sample mean is a statistic.
If x1, x2, …, xn are the n observations of a variable from a sample, then the sample mean, $\bar{x}$, is
$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$
MEAN
The population mean is computed using all
data points in a population.
o Denoted by µ
The population mean is a parameter.
If x1, x2, …, xN are the N observations of a variable from a population, then the population mean, µ, is
$\mu = \frac{x_1 + x_2 + \cdots + x_N}{N}$
MEAN
$\bar{x}$ is pronounced ‘x-bar’ and denotes the mean of a set of sample values
$\bar{x} = \frac{\sum x}{n}$
µ is pronounced ‘mu’ and denotes the mean of all values in a population
$\mu = \frac{\sum x}{N}$
COMPUTING SAMPLE MEAN
The following data represent the travel
times (in minutes) to work for a sample of
seven employees of an insurance company.
23, 36, 23, 18, 5, 26, 43
Compute the sample mean.
COMPUTING SAMPLE MEAN
$\bar{x} = \frac{\sum x}{n} = \frac{23 + 36 + 23 + 18 + 5 + 26 + 43}{7} = \frac{174}{7} \approx 24.9$ minutes
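The same computation can be sketched in a few lines of Python. This is only an illustration of the formula; the names `travel_times` and `sample_mean` are made up for this example.

```python
# Travel times (in minutes) for the seven employees in the example.
travel_times = [23, 36, 23, 18, 5, 26, 43]


def sample_mean(values):
    """Add the values and divide the total by the number of values."""
    return sum(values) / len(values)


x_bar = sample_mean(travel_times)
print(sum(travel_times))   # 174
print(round(x_bar, 1))     # 24.9
```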
MEAN
Regardless of the shape of the distribution, the mean is the point at which a histogram of the data would balance.
MEDIAN
The median represents the middle value
when the original data values are arranged in
increasing or decreasing order
The median will be one of the data values if there
is an odd number of values.
The median will be the average of two data values
if there is an even number of values.
MEDIAN
The median is the value with exactly half
the data values below it and half above it.
It is the middle data
value (once the data
values have been
ordered) that divides
the histogram into
two equal areas.
It has the same
units as the data.
COMPUTING THE MEDIAN
The following data represent the travel times
(in minutes) to work for a sample of seven
employees of an insurance company.
23, 36, 23, 18, 5, 26, 43
Determine the median of this data.
COMPUTING THE MEDIAN
23, 36, 23, 18, 5, 26, 43
Step 1: Order the data:
5, 18, 23, 23, 26, 36, 43
Step 2: Locate the middle data point
Median = 23
COMPUTING THE MEDIAN
Suppose the insurance company hires a new
employee. The travel time of the new
employee is 70 minutes. Determine the
median of the “new” data set.
23, 36, 23, 18, 5, 26, 43, 70
COMPUTING THE MEDIAN
23, 36, 23, 18, 5, 26, 43, 70
Step 1: Order the data:
5, 18, 23, 23, 26, 36, 43, 70
Step 2: Locate the middle data point
Step 3: Find the mean of the two middle data points
Median = (23 + 26) / 2 = 24.5
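The two cases (odd and even number of values) can be sketched as a small Python function. The function name `median_of` is hypothetical; it follows the steps shown above: order the data, locate the middle, and average the two middle points when there is no single middle value.

```python
def median_of(values):
    """Return the middle value of the ordered data.

    With an even number of values there is no single middle value,
    so the mean of the two middle values is returned instead.
    """
    ordered = sorted(values)          # Step 1: order the data
    n = len(ordered)
    mid = n // 2                      # Step 2: locate the middle
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # Step 3: average the two


seven_employees = [23, 36, 23, 18, 5, 26, 43]
eight_employees = seven_employees + [70]

print(median_of(seven_employees))   # 23
print(median_of(eight_employees))   # 24.5
```

Note that adding the 70-minute commute moves the median only from 23 to 24.5.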
DESCRIBE THE DISTRIBUTION
The following data represent the asking price of homes
for sale in Lincoln, NE.
79,995   99,899   105,200  111,000  120,000  121,700  125,950  126,900
128,950  130,950  131,800  132,300  134,950  135,500  138,500  147,500
149,900  151,350  154,900  159,900  163,300  165,000  174,850  180,000
189,900  203,950  217,500  260,000  284,900  299,900  309,900  349,900
Source: http://www.homeseekers.com
DESCRIBE THE DISTRIBUTION
Find the mean and median. Use the mean and
median to identify the shape of the distribution.
Verify your result by drawing a histogram of the
data.
The mean asking price is $168,320
The median asking price is $148,700
Therefore, we would conjecture that the
distribution is skewed right.
[Histogram: Asking Price of Homes in Lincoln, NE. Horizontal axis: Asking Price (100,000 to 350,000). Vertical axis: Frequency (0 to 12).]
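As a check on the figures quoted for the Lincoln data, here is a minimal Python sketch using the standard-library `statistics` module. The variable names are made up for this example, and comparing the mean with the median gives only a hint about skewness, which is why the slide suggests verifying with a histogram.

```python
from statistics import mean, median

# The 32 asking prices listed above, in dollars.
asking_prices = [
    79995, 99899, 105200, 111000, 120000, 121700, 125950, 126900,
    128950, 130950, 131800, 132300, 134950, 135500, 138500, 147500,
    149900, 151350, 154900, 159900, 163300, 165000, 174850, 180000,
    189900, 203950, 217500, 260000, 284900, 299900, 309900, 349900,
]

mean_price = mean(asking_prices)
median_price = median(asking_prices)

print(round(mean_price))   # 168320
print(median_price)        # 148700.0

# A mean noticeably above the median suggests a longer right tail.
if mean_price > median_price:
    print("conjecture: skewed right")
```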
MODE
The mode is the value that occurs most often in a data set.
There may be no mode, one mode (unimodal), two modes (bimodal), or many modes (multimodal).
MODE
NFL Signing Bonuses:
Find the mode of the signing bonuses of
eight NFL players for a specific year. The
bonuses in millions of dollars are
18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10
You may find it easier to sort first.
10, 10, 10, 11.3, 12.4, 14.0, 18.0, 34.5
Select the value that occurs the most.
The mode is 10 million dollars.
MODE
Coal Employees in Pennsylvania
Find the mode for the number of coal
employees per county for 10 selected counties
in southwestern Pennsylvania.
110, 731, 1031, 84, 20, 118, 1162, 1977, 103, 752
No value occurs more than once.
There is no mode.
MODE
Licensed Nuclear Reactors
The data show the number of licensed nuclear reactors in the United States for a recent 15-year period. Find the mode.
104  104  104  104  104
107  109  109  109  110
109  111  112  111  109
104 and 109 both occur the most. The data set
is said to be bimodal.
The modes are 104 and 109.
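The three mode examples can be reproduced with a short Python sketch. The helper `modes` is hypothetical; it adopts the convention used on these slides that a data set in which no value repeats has no mode. The reactor list below is a reading of the slide's data (five 104s and five 109s); the order of the values does not affect the result.

```python
from collections import Counter


def modes(values):
    """Return all values tied for the highest count (non-empty input).

    Following the slides, an empty list is returned when no value
    occurs more than once ("there is no mode").
    """
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []
    return sorted(v for v, c in counts.items() if c == highest)


bonuses = [18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10]
coal_employees = [110, 731, 1031, 84, 20, 118, 1162, 1977, 103, 752]
reactors = [104, 104, 104, 104, 104, 107, 109, 109, 109, 110,
            109, 111, 112, 111, 109]

print(modes(bonuses))         # [10]
print(modes(coal_employees))  # []
print(modes(reactors))        # [104, 109]
```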
MODAL CLASS
Miles Run per Week
Find the modal class for the frequency
distribution of miles that 20 runners ran in one
week.
Class          Frequency
5.5 – 10.5     1
10.5 – 15.5    2
15.5 – 20.5    3
20.5 – 25.5    5
25.5 – 30.5    4
30.5 – 35.5    3
35.5 – 40.5    2
The modal class is 20.5 – 25.5.
The mode, the midpoint of the modal class, is 23 miles per week.
MIDRANGE
The midrange is the average of the lowest and highest values in a data set.
$MR = \frac{\text{Lowest} + \text{Highest}}{2}$
MIDRANGE
Water-Line Breaks
In the last two winter seasons, the city of
Brownsville, Minnesota, reported these
numbers of water-line breaks per month. Find
the midrange.
2, 3, 6, 8, 4, 1
$MR = \frac{1 + 8}{2} = \frac{9}{2} = 4.5$
The midrange is 4.5.
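A minimal Python sketch of the midrange, with a made-up function name, shows that only the smallest and largest values enter the calculation.

```python
def midrange(values):
    """Average of the lowest and highest values in the data set."""
    return (min(values) + max(values)) / 2


water_line_breaks = [2, 3, 6, 8, 4, 1]
print(midrange(water_line_breaks))   # 4.5
```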
PROPERTIES OF THE MEAN
Uses all data values.
Varies less than the median or mode when samples are taken from the same population
Used in computing other statistics, such as the
variance
Unique, usually not one of the data values
Cannot be used with open-ended classes
Affected by extremely high or low values, called
outliers
PROPERTIES OF THE MEDIAN
Gives the midpoint
Used when it is necessary to find out whether
the data values fall into the upper half or lower
half of the distribution.
Can be used for an open-ended distribution.
Affected less than the mean by extremely high
or extremely low values.
PROPERTIES OF THE MODE
Used when the most typical case is desired
Easiest average to compute
Can be used with nominal data
Not always unique or may not exist
PROPERTIES OF THE MIDRANGE
Easy to compute.
Gives the midpoint.
Affected by extremely high or low values in a
data set
DISTRIBUTIONS
MEASURE OF DISPERSION
The mean, median and mode give us an idea of the central tendency, or where the “middle” of the data is located
Variability gives us an idea of how spread out the data are around that middle
The combination of central tendency and dispersion provides a more complete picture of the data
MEASURE OF DISPERSION
Without knowing something about
how data is dispersed, measures
of central tendency may be
misleading.
For Example:
A residential street with 20 homes on it
having a mean value of $200,000 where all
the homes are in a similar price range would
be very different from a street with the same
mean value but with 3 homes having a value
of $1 million and the other 17 clustered
around $60,000.
MEASURES OF VARIATION
How Can We Measure Variability?
Range
Variance
Standard Deviation
Coefficient of Variation
Chebyshev’s Theorem
Empirical Rule (Normal)
RANGE
The range is the difference between the highest and lowest values in a data set.
$R = \text{Highest} - \text{Lowest}$
Find the range in the following test
scores.
100, 68, 74, 56, 57, 68
Range = High - Low = 100 - 56
= 44
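As a small illustration (the names are made up for this sketch), the range depends only on the two extreme values, so two very different data sets can share the same range.

```python
def value_range(values):
    """Difference between the highest and lowest values."""
    return max(values) - min(values)


test_scores = [100, 68, 74, 56, 57, 68]
print(value_range(test_scores))      # 44

# Same extremes, very different scores in between: the range is unchanged.
other_scores = [100, 56, 56, 56, 56, 56]
print(value_range(other_scores))     # 44
```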
RANGE IN A HISTOGRAM
RANGE
Disadvantages:
Easy to compute, but not very informative
Considers only two observations (the smallest and largest)
VARIANCE & STANDARD DEVIATION
The variance is the average of the squares of the distance each value is from the mean.
The standard deviation is the square root of the variance.
The standard deviation is a measure of how spread out your data are.
VARIANCE & STANDARD DEVIATION
The population variance is
$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$
The population standard deviation is
$\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$
VARIANCE & STANDARD DEVIATION
Find the variance and standard deviation for the data set for how long paint lasts before it fades
Months, X    µ     X - µ    (X - µ)²
10           35    -25      625
60           35     25      625
50           35     15      225
30           35     -5       25
40           35      5       25
20           35    -15      225
                   Total:  1750
$\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{1750}{6} \approx 291.7$
$\sigma = \sqrt{\frac{1750}{6}} \approx 17.1$
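The paint example can be worked step by step in Python. This is a sketch of the population formulas (dividing by N), with illustrative variable names.

```python
from math import sqrt

months = [10, 60, 50, 30, 40, 20]    # how long each paint lasted

N = len(months)
mu = sum(months) / N                                  # population mean: 35.0
squared_deviations = [(x - mu) ** 2 for x in months]  # (X - mu)^2 column

variance = sum(squared_deviations) / N   # divide by N for a population
std_dev = sqrt(variance)

print(sum(squared_deviations))   # 1750.0
print(round(variance, 1))        # 291.7
print(round(std_dev, 1))         # 17.1
```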
VARIANCE & STANDARD DEVIATION
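The sample variance is
$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$
The sample standard deviation is
$s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$
COMPUTATIONAL FORMULA
The sample variance is
$s^2 = \frac{\sum X^2 - (\sum X)^2 / n}{n - 1}$
The sample standard deviation is
$s = \sqrt{\frac{\sum X^2 - (\sum X)^2 / n}{n - 1}}$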
WHY N - 1?
s is an estimate of the population standard deviation (σ).
A sample standard deviation computed with n in the denominator tends to underestimate the population standard deviation.
To correct for this, subtract one from the denominator: dividing by n - 1 makes the sample variance an unbiased estimate of the population variance.
EUROPEAN AUTO SALES
Find the variance and standard deviation for
the amount of European auto sales for a
sample of 6 years. The data are in millions of
dollars.
X        X²
11.2     125.44
11.9     141.61
12.0     144.00
12.8     163.84
13.4     179.56
14.3     204.49
Totals:  ΣX = 75.6, ΣX² = 958.94
$s^2 = \frac{958.94 - (75.6)^2 / 6}{6 - 1} \approx 1.28$
$s = \sqrt{1.28} \approx 1.13$
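The arithmetic can be checked with a short Python sketch. It applies the computational formula used in the slide and, as a cross-check, compares the result with `statistics.variance` from the standard library, which also divides by n - 1. The variable names are illustrative.

```python
from math import sqrt
from statistics import variance

sales = [11.2, 11.9, 12.0, 12.8, 13.4, 14.3]   # millions of dollars

n = len(sales)
sum_x = sum(sales)                    # about 75.6
sum_x2 = sum(x * x for x in sales)    # about 958.94

# Computational formula: (sum of squares - (sum)^2 / n) / (n - 1)
s2 = (sum_x2 - sum_x ** 2 / n) / (n - 1)
s = sqrt(s2)

print(round(s2, 2))   # 1.28
print(round(s, 2))    # 1.13

# The definitional formula (via the standard library) agrees.
assert abs(s2 - variance(sales)) < 1e-9
```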
COMPARING STANDARD
DEVIATIONS
Data A: Mean = 15.5, s = 3.338
Data B: Mean = 15.5, s = 0.926 (least variable)
Data C: Mean = 15.5, s = 4.570 (most variable)
[Figure: dot plots of Data A, Data B and Data C on a common axis from 11 to 21.]
COEFFICIENT OF VARIATION
The measures discussed so far are primarily useful when comparing members from the same population, or comparing similar populations.
When looking at two or more dissimilar populations, it doesn’t make any more sense to compare standard deviations than it does to compare means.
COEFFICIENT OF VARIATION
The coefficient of variation is the standard
deviation divided by the mean, expressed as a
percentage.
$\text{CVar} = \frac{s}{\bar{X}} \cdot 100\%$
Use CVar to compare standard deviations when the units are different.
SALES OF AUTOMOBILES
The mean of the number of sales of cars over
a 3-month period is 87, and the standard
deviation is 5. The mean of the commissions
is $5225, and the standard deviation is $773.
Compare the variations of the two.
Sales: $\text{CVar} = \frac{5}{87} \cdot 100\% \approx 5.7\%$
Commissions: $\text{CVar} = \frac{773}{5225} \cdot 100\% \approx 14.8\%$
Commissions are more variable than sales.
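A minimal Python sketch of the comparison follows; the function name is made up for this example. Because the coefficient of variation is a percentage, the two results can be compared even though one is measured in cars and the other in dollars.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation divided by the mean, as a percentage."""
    return std_dev / mean * 100


cv_sales = coefficient_of_variation(5, 87)
cv_commissions = coefficient_of_variation(773, 5225)

print(round(cv_sales, 1))         # 5.7
print(round(cv_commissions, 1))   # 14.8
```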
RANGE RULE OF THUMB
The Range Rule of Thumb approximates the standard deviation as
$s \approx \frac{\text{Range}}{4}$
when the distribution is unimodal and approximately symmetric.
RANGE RULE OF THUMB
The shortest home-run hit by Mark
McGwire was 340 ft and the longest was
550 ft. Use the range rule of thumb to
estimate the standard deviation.
Range = 550 – 340 = 210 ft
Standard deviation approximation:
s ≈ range / 4
= 210 / 4
= 52.5 ft
CHEBYSHEV’S THEOREM
The proportion of values from any data set that fall within k standard deviations of the mean will be at least $1 - 1/k^2$, where k is a number greater than 1 (k is not necessarily an integer).
# of standard     Minimum proportion within    Minimum percentage within
deviations, k     k standard deviations        k standard deviations
2                 1 - 1/4 = 3/4                75%
3                 1 - 1/9 = 8/9                88.89%
4                 1 - 1/16 = 15/16             93.75%
MEASURES OF VARIATION:
CHEBYSHEV’S THEOREM
PRICES OF HOMES
The mean price of houses in a certain
neighborhood is $50,000, and the standard
deviation is $10,000. Find the price range for
which at least 75% of the houses will sell.
Chebyshev’s Theorem states that at least 75%
of a data set will fall within 2 standard
deviations of the mean.
50,000 – 2(10,000) = 30,000
50,000 + 2(10,000) = 70,000
At least 75% of all homes sold in the area will have a price between $30,000 and $70,000.
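Chebyshev’s Theorem can be sketched in Python as follows. Both function names are hypothetical. For a mean of $50,000 and a standard deviation of $10,000, two standard deviations on either side of the mean give the interval from $30,000 to $70,000.

```python
def chebyshev_min_proportion(k):
    """Minimum proportion of values within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's Theorem requires k > 1")
    return 1 - 1 / k ** 2


def interval_around_mean(mean, std_dev, k):
    """Endpoints of the interval mean - k*s to mean + k*s."""
    return mean - k * std_dev, mean + k * std_dev


print(chebyshev_min_proportion(2))              # 0.75
print(interval_around_mean(50_000, 10_000, 2))  # (30000, 70000)
```

Since k does not have to be an integer, `chebyshev_min_proportion(1.5)` is also valid and gives roughly 0.556.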
EMPIRICAL RULE
(NORMAL DISTRIBUTION)
The percentage of values from a data set that fall within k standard deviations of the mean in a normal (bell-shaped) distribution is listed below.
# of standard     Proportion within k
deviations, k     standard deviations
1                 68%
2                 95%
3                 99.7%
EMPIRICAL RULE (NORMAL)
MEASURES OF POSITION
z-score
Percentile
Quartile
Outlier
MEASURES OF POSITION: Z-SCORE
A z-score or standard score for a value is obtained by subtracting the mean from the value and dividing the result by the standard deviation.
Sample: $z = \frac{x - \bar{x}}{s}$
Population: $z = \frac{x - \mu}{\sigma}$
A z-score represents the number of standard deviations a value is above or below the mean.
TEST SCORES
A student scored 65 on a calculus test that had
a mean of 50 and a standard deviation of 10;
she scored 30 on a history test with a mean of
25 and a standard deviation of 5. Compare her
relative positions on the two tests.
$z = \frac{x - \bar{x}}{s}$
Calculus Test: $z = \frac{65 - 50}{10} = 1.5$
History Test: $z = \frac{30 - 25}{5} = 1.0$
She has a higher relative position in the Calculus class.
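The comparison can be sketched in Python with a small, hypothetical `z_score` helper. The raw scores (65 and 30) are on different scales, so only the z-scores are directly comparable.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations a value lies above or below the mean."""
    return (value - mean) / std_dev


z_calculus = z_score(65, 50, 10)
z_history = z_score(30, 25, 5)

print(z_calculus)   # 1.5
print(z_history)    # 1.0
```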
MEASURES OF POSITION:
PERCENTILES
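Percentiles separate the data set into 100 equal groups.
A percentile rank for a data value represents the percentage of data values below it:
$\text{Percentile} = \frac{(\text{\# of values below } X) + 0.5}{\text{total \# of values}} \cdot 100\%$
The position c, in the ordered data, of the value corresponding to the pth percentile is
$c = \frac{n \cdot p}{100}$
where n is the total number of values.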
PERCENTILES
Measures of location
There are 99 percentiles denoted
P1, P2, . . . P99
which divide a set of data into 100 groups
with about 1% of the values in each group
Use cumulative data to keep track of relative
positions
MEASURES OF POSITION: EXAMPLE OF
A PERCENTILE GRAPH
PERCENTILES FOR TEST SCORES
A teacher gives a 20-point test to 10 students.
Find the percentile rank of a score of 12.
18, 15, 12, 6, 8, 2, 3, 5, 20, 10
Sort in ascending order.
2, 3, 5, 6, 8, 10, 12, 15, 18, 20
6 values are below 12.
$\text{Percentile} = \frac{(\text{\# of values below } X) + 0.5}{\text{total \# of values}} \cdot 100\% = \frac{6 + 0.5}{10} \cdot 100\% = 65\%$
A student whose score was 12 did better than 65% of the class.
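The percentile-rank formula can be sketched in Python as below. The function name is made up for this example, and the data do not need to be sorted first because the function simply counts the values below the given score.

```python
def percentile_rank(values, x):
    """Percentile rank of x: (number of values below x + 0.5) / total * 100."""
    below = sum(1 for v in values if v < x)
    return (below + 0.5) * 100 / len(values)


scores = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]
print(percentile_rank(scores, 12))   # 65.0
```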
PERCENTILES FOR TEST SCORES
A teacher gives a 20-point test to 10 students.
Find the value corresponding to the 25th
percentile.
18, 15, 12, 6, 8, 2, 3, 5, 20, 10
Sort in ascending order.
2, 3, 5, 6, 8, 10, 12, 15, 18, 20
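$c = \frac{n \cdot p}{100} = \frac{10 \cdot 25}{100} = 2.5$, which rounds up to 3
Count to the 3rd value in the ordered list.
The value 5 corresponds to the 25th percentile.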
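Going the other way, from a percentile to a data value, can be sketched as follows. The function name is hypothetical. The example above only shows the case where c is not a whole number (round up and count to that position); for a whole-number c this sketch assumes a common textbook convention of averaging the values in positions c and c + 1, which is not stated in the slides.

```python
from math import ceil


def value_at_percentile(values, p):
    """Data value corresponding to the pth percentile (p from 1 to 99)."""
    ordered = sorted(values)
    c = len(ordered) * p / 100          # position c = n * p / 100
    if c != int(c):
        # Not a whole number: round up and take the value in that position.
        return ordered[ceil(c) - 1]
    # Whole number (assumed convention): average positions c and c + 1.
    c = int(c)
    return (ordered[c - 1] + ordered[c]) / 2


scores = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]
print(value_at_percentile(scores, 25))   # 5
```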
QUARTILES
Quartiles separate the data set into 4
equal groups
Q1 = P25,
Q2 = P50 (median)
Q3 = P75
We can easily find the quartiles by separating
the sorted data into two halves
Q2 = median of all data points
Q1 = median of lower half
Q3 = median of upper half
QUARTILES
For quartiles, we want to divide our data into 4 equal pieces.
Consider the following data set (already in order)
1 1 2 | 2 2 3 | 4 5 6 | 6 7 8
The cut points, from left to right, are Q1, Q2 and Q3.
The quartiles will divide the data into 4 groups, each with three elements.
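The "median of each half" method can be sketched in Python. The function name is made up, and for an odd number of values this sketch assumes the overall median is left out of both halves, which is one common convention; the slides only show an even-sized example.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Assumed convention: with an odd count, skip the middle value itself.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower), median(ordered), median(upper)


data = [1, 1, 2, 2, 2, 3, 4, 5, 6, 6, 7, 8]
print(quartiles(data))   # (2.0, 3.5, 6.0)
```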
BOX PLOTS
The Five-Number Summary is composed of
the following numbers:
Minimum Value
Q1
Median
Q3
Maximum Value
The Five-Number Summary can be
graphically represented using a Boxplot.
BOX PLOT
The box plot is sometimes called a box and whisker plot.
[Figure: box plot labeled with the 5-Number Summary.]
PROCEDURE TABLE
Constructing Boxplots
1. Find the five-number summary.
2. Draw a horizontal axis with a scale that includes the
maximum and minimum data values.
3. Draw a box with vertical sides through Q1 and Q3,
and draw a vertical line through the median.
4. Draw a line from the minimum data value to the left
side of the box and a line from the maximum data
value to the right side of the box.
METEORITES (BOX PLOT)
The number of meteorites found in 10 U.S.
states is shown. Construct a boxplot for the
data.
89, 47, 164, 296, 30, 215, 138, 78, 48, 39
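Sorted: 30, 39, 47, 48, 78, 89, 138, 164, 215, 296
5-Number Summary:
Min = 30
Q1 = 47
Q2 (Median) = 83.5
Q3 = 164
Max = 296
[Boxplot: box from Q1 = 47 to Q3 = 164 with a vertical line at the median, 83.5; lines extend left to the minimum, 30, and right to the maximum, 296.]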
OUTLIERS
An outlier is an extremely high or low data value when compared with the rest of the data values.
The Interquartile Range, IQR = Q3 – Q1.
Range of middle 50% of data
Lower bound = Q1 – 1.5(IQR)
Upper bound = Q3 + 1.5(IQR)
An outlier is any value less than the lower bound or more than the upper bound.
USING IQR TO FIND OUTLIERS
The red lines are 1.5 times the IQR.
Starting from Q1 going left, and starting from Q3
going right 1.5(IQR) we establish limits. All numbers
smaller on the left, and larger on the right are
outliers.
OUTLIERS
Check the meteorite example for outliers
30, 39, 47, 48, 78, 89, 138, 164, 215, 296
Step 1: The first and third quartiles are Q1 = 47 and Q3 = 164
Step 2: The interquartile range is 164 – 47 = 117
Step 3: The boundaries are
Lower Bound = Q1 – 1.5(IQR) = 47 – 1.5(117) = -128.5
Upper Bound = Q3 + 1.5(IQR) = 164 + 1.5(117) = 339.5
Step 4: No value is less than -128.5 or greater than 339.5.
Therefore, the data set has no outliers. The largest value, 296, is below the upper bound.
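The outlier check can be recomputed directly from the data with a short Python sketch (variable names are illustrative). Since 1.5 × 117 = 175.5, the upper bound works out to 164 + 175.5 = 339.5, so every value, including 296, falls inside the bounds.

```python
from statistics import median

meteorites = [89, 47, 164, 296, 30, 215, 138, 78, 48, 39]

ordered = sorted(meteorites)
half = len(ordered) // 2           # 10 values: two halves of 5

q1 = median(ordered[:half])        # median of lower half: 47
q3 = median(ordered[half:])        # median of upper half: 164
iqr = q3 - q1                      # 117

lower_bound = q1 - 1.5 * iqr       # -128.5
upper_bound = q3 + 1.5 * iqr       # 339.5

outliers = [x for x in ordered if x < lower_bound or x > upper_bound]

print(lower_bound, upper_bound)    # -128.5 339.5
print(outliers)                    # []
```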
OUTLIERS
o An outlier can have a dramatic effect on the mean.
o An outlier can have a dramatic effect on the standard deviation.
o An outlier can have a dramatic effect on the scale of the histogram so that the true nature of the distribution is totally obscured.
SUMMARY
o Measures of Central Tendency
o Mean, Median, Mode
o Measures of Dispersion
o Range, Variance, Standard Deviation
o Measures of Position
o Percentiles, Quartiles
o 5-Number Summary, Box Plots,
Outliers