Chapter Three Numerically Summarizing Data
Download
Report
Transcript Chapter Three Numerically Summarizing Data
Chapter 3
Numerically
Summarizing Data
3.1
Measures of Central Tendency
1
The following chart gives a summary of some background information
on 5 students at Joliet Junior College (JJC).
Name
Age Gender Number of Semesters
Completed At JJC
Jennifer 21 Female
1
Amy
19 Female
2
Brian
18 Male
4
Mark 18 Male
2
Jim
24 Male
3
Overall GPA
3.5
2.75
3.25
3.0
4.0
Which of the above data would be qualitative?
2
Answer:Gender
A parameter is a descriptive measure of
a population.
A statistic is a descriptive measure of a
sample.
A statistic is an unbiased estimator of a parameter if it
does not consistently over- or underestimate the
parameter.
3
The arithmetic mean of a variable is
computed by determining the sum of all the
values of the variable in the data set divided by
the number of observations.
4
The population arithmetic mean, is computed
using all the individuals in a population.
The population mean is a parameter.
The population arithmetic mean is denoted
by
5
6
The sample arithmetic mean, is computed
using sample data.
The sample mean is a statistic that is an
unbiased estimator of the population mean.
The sample arithmetic mean is
denoted by x
7
8
EXAMPLE
Computing a Population Mean and a Sample
Mean
The following chart gives a summary of some background
information on a Calculus class.
Name Age Gender
Overall GPA
Jennifer 21 Female
3.5
Amy
19 Female
2.75
Brian
25 Male
3.95
Jane
19 Female
2.75
Mark 18 Male
3.0
Julie
19 Female
3.85
Jim
24
Male
4.0
Ted
25 Male
3.7
Michel 19 Female
3.75
Amanda 19 Female
3.65
Linda 19 Female
4.0
9
Compute the Arithmetic Mean
Treat
the students in this class as a
population. Compute the population mean
of the GPA.
Then
take a simple random sample of n =
5 students. Compute the sample mean of
the GPA. Obtain a second simple random
sample of n = 5 students. Again compute
the sample mean of the GPA.
10
The population(size of 11) mean
3.5+2.75+ 3.95+2.75+ 3.0+ 3.85+ 4.0
+ 3.7+ 3.75+ 3.65+ 4.0=34.9
34.9
X
3.54
11
11
The median of a variable is the value that lies in
the middle of the data when arranged in ascending
order. That is, half the data is below the median
and half the data is above the median. We use M
to represent the median.
12
13
EXAMPLE
Computing the Median of Data
Find the population median of the total GPA from
the earlier example.
2.75 2.75 3.0 3.5 3.95 3.65 3.7 3.75 3.85
1 2
3 4
5
6
7 8
9
4.0 4.0
10 11
11 1
6
2
14
The mode of a variable is the most frequent
observation of the variable that occurs in the
data set.
If there is no observation that occurs with the
most frequency, we say the data has no
mode.
15
EXAMPLE
Finding the Mode of a Data Set
The data on the next slide represent the
Vice Presidents of the United States and
their state of birth. Find the mode.
16
17
18
The mode is
New York.
19
The arithmetic mean is sensitive to
extreme (very large or small) values in the
data set, while the median is not. We say
the median is resistant to extreme
values, but the arithmetic mean is not.
20
When data sets have unusually large or
small values relative to the entire set of
data or when the distribution of the data
is skewed, the median is the preferred
measure of central tendency over the
arithmetic mean because it is more
representative of the typical
observation.
21
22
23
24
25
EXAMPLE Identifying the Shape of the Distribution
Based on the Mean and Median
The following data represent the asking price of
homes for sale in Lincoln, NE.
26
Source: http://www.homeseekers.com
Find the mean and median. Use the mean
and median to identify the shape of the
distribution. Verify your result by drawing a
histogram of the data.
27
Find the mean and median. Use the mean
and median to identify the shape of the
distribution. Verify your result by drawing a
histogram of the data.
Using MINITAB/Excel/Spss, we find that the
mean asking price is $143,509 and the median
asking price is $131,825. Therefore, we would
conjecture that the distribution is skewed right.
28
29
30
3.2
Measures of Dispersion
31
To order food at a McDonald’s Restaurant, one must
choose from multiple lines, while at Wendy’s
Restaurant, one enters a single line. The following
data represent the wait time (in minutes) in line for a
simple random sample of 30 customers at each
restaurant during the lunch hour. For each sample,
answer the following:
(a) What was the mean wait time?
(b) Draw a histogram of each restaurant’s wait time.
(c ) Which restaurant’s wait time appears more
dispersed? Which line would you prefer to wait in?
Why?
32
Wait Time at Wendy’s
1.50
2.53
1.88
3.99
0.90
0.79
1.20
2.94
1.90
1.23
1.01
1.46
1.40
1.00
0.92
1.66
0.89
1.33
1.54
1.09
0.94
0.95
1.20
0.99
1.72
0.67
0.90
0.84
0.35
2.00
Wait Time at McDonald’s
3.50
0.00
1.97
0.00
3.08
0.00
0.26
0.71
0.28
2.75
0.38
0.14
2.22
0.44
0.36
0.43
0.60
4.54
1.38
3.10
1.82
2.33
0.80
0.92
2.19
3.04
2.54
0.50
1.17
0.23
33
The mean wait time in each line is 1.39
minutes.
34
35
The range, R, of a variable is the difference
between the largest data value and the
smallest data values. That is
Range = R = Largest Data Value – Smallest Data Value
36
EXAMPLE Finding the Range of a Set of Data
Find the range of the student GPA collected from
Section 3.1
37
The population variance of a variable is the
sum of squared deviations about the
population mean divided by the number of
observations in the population, N.
38
The population variance is symbolically
represented by lower case Greek sigma squared.
Note: When using the above formula, do not round until
the last computation. Use as many decimals as allowed
by your calculator in order to avoid round off errors.
39
EXAMPLE
Computing a Population Variance
Compute the population variance of the
population data collected in Section 3.1.
40
The sample variance is computed by
determining the sum of squared deviations
about the sample mean and then dividing this
result by n – 1.
41
Note: Whenever a statistic consistently overestimates or
underestimates a parameter, it is called biased. To obtain
an unbiased estimate of the population variance, we
divide the sum of the squared deviations about the mean
by n - 1.
42
EXAMPLE Computing a Sample Variance
Compute the sample variance using the sample
data from Section 3.1
43
The population standard deviation is denoted by
It is obtained by taking the square root of the
population variance, so that
44
EXAMPLE
Computing a Population Standard
Deviation and Sample Standard
Deviation
Compute the population and sample standard
deviation for the data obtained in Section 3.1
45
EXAMPLE
Comparing Standard Deviations
Determine the standard deviation waiting
time for Wendy’s and McDonald’s. Which
is larger? Why?
46
EXAMPLE
Comparing Standard Deviations
Determine the standard deviation waiting
time for Wendy’s and McDonald’s. Which
is larger? Why?
Sample standard deviation for Wendy’s:
0.738 minutes
Sample standard deviation for McDonald’s:
1.265 minutes
47
48
49
EXAMPLE Using the Empirical Rule
The following data represent the serum HDL
cholesterol of the 54 female patients of a family
doctor.
41
62
67
60
54
45
48
75
69
60
54
47
43
77
69
60
55
47
38
58
70
61
56
48
35
82
65
62
56
48
37
39
72
63
56
50
44
85
74
64
57
52
44
55
74
64
58
52
44
54
74
64
59
53
50
(a) Compute the population mean and standard
deviation.
(b) Draw a histogram to verify the data is bellshaped.
(c) Determine the percentage of patients that have
serum HDL within 3 standard deviations of the
mean according to the Empirical Rule.
(d) Determine the percentage of patients that have
serum HDL between 34 and 80.8 according to the
Empirical Rule.
(e) Determine the actual percentage of patients
that have serum HDL between 34 and 80.8.
51
(a) Using a TI83 plus graphing calculator, we find
(b)
57.4 and 11.7
52
57.4 and 11.7
(c) According to the Empirical Rule, approximately
99.7% of the patients will have serum HDL
cholesterol levels within 3 standard deviations of the
mean. That is, approximately 99.7% of the patients
will have serum HDL cholesterol levels greater than
or equal to 57.4 - 3(11.7) = 22.3 and less than or
equal to 57.4 + 3(11.7) = 92.5.
53
57.4 and 11.7
(d) Because 33.8 is 2 standard deviations below the
mean (57.4 - 2(11.7) = 34) and 81 is 2 standard
deviations above the mean (57.4 + 2(11.7) = 80.8),
the Empirical Rule states that approximately 95% of
the data will lie between 34 and 80.8.
(e) There are no observations below 34. There are
2 observations greater than 80.8. Therefore, 52/54
= 96.3% of the data lie between 34 and 80.8.
54
55
EXAMPLE Using Chebyshev’s Theorem
Using the data from the previous example, use
Chebyshev’s Theorem to
(a) determine the percentage of patients that have
serum HDL within 3 standard deviations of the
mean.
(b) determine the percentage of patients that have
serum HDL between 34 and 80.8.
56
Answer:
(a) (1-1/9)*100%=88.9%
(b) 57.4-34=23.4
80.8-57.4=23.4
23.4/11.7=2 two standard deviations, so the
percentage of patients that have serum HDL between
two stand deviations is at least
(1-1/4)*100%=75%
57