Center and Spread of the Data


Introduction
A Review of Descriptive Statistics
Charts
• When dealing with a larger set of data values,
it may be clearer to summarize the data by
presenting a graphical image
Intervals
• Numerical data values may be grouped or classified
by defining “class intervals”:
Suppose the following data values represent the ACT
test scores for 30 individuals.
8, 10, 11, 13, 13, 14, 14, 15, 15, 16,
16, 17, 17, 18, 18, 18, 18, 19, 20, 20,
21, 21, 21, 22, 22, 23, 25, 26, 28, 30
Define intervals so that each of the values falls into
exactly one of the intervals.
Frequency
• Determine how many data scores fall in each
of the intervals (the "frequency").
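The counting step can be sketched in a few lines of Python. The interval boundaries below (7-10, 11-14, and so on) are an assumption made for illustration; the text does not list the class intervals it used.

```python
# ACT scores for the 30 individuals in the example.
scores = [8, 10, 11, 13, 13, 14, 14, 15, 15, 16,
          16, 17, 17, 18, 18, 18, 18, 19, 20, 20,
          21, 21, 21, 22, 22, 23, 25, 26, 28, 30]

# Assumed class intervals with inclusive endpoints (not given in the source).
intervals = [(7, 10), (11, 14), (15, 18), (19, 22), (23, 26), (27, 30)]

# Frequency = number of scores that fall in each interval.
frequencies = [sum(low <= s <= high for s in scores) for low, high in intervals]

for (low, high), count in zip(intervals, frequencies):
    print(f"{low:2d}-{high:2d}: {count}")
```

Because the intervals do not overlap and together cover every score, the frequencies add up to 30.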
“Histogram”
Draw a bar chart (or "histogram")
with the height of the bar on each
interval determined by the frequency
Relative Frequency
• Alternatively, give the proportion (or percentage) of scores
in each interval, the "relative frequency".
• That is, if 5 of the 30 values fall in the interval,
then the relative frequency is 5/30 ≈ 0.1667.
Relative to each other, the bars are the same height and
the histograms have the same shape.
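As a small sketch, the relative frequencies can be computed by dividing each count by the total. The counts below assume the illustrative class intervals 7-10, 11-14, 15-18, 19-22, 23-26 and 27-30 for the 30 ACT scores; those intervals are an assumption, not taken from the text.

```python
# Frequencies per interval for the 30 ACT scores, under the assumed
# intervals 7-10, 11-14, 15-18, 19-22, 23-26, 27-30.
frequencies = [2, 5, 10, 8, 3, 2]
total = sum(frequencies)  # 30 scores in all

# Relative frequency = count in the interval / total number of values.
relative = [count / total for count in frequencies]

for count, rel in zip(frequencies, relative):
    print(f"{count:2d} of {total} -> {rel:.4f}")
```

Every count is divided by the same total, which is why the relative-frequency histogram keeps the shape of the frequency histogram.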
Cumulative Frequency
• Or we could “keep a running total”, called a
“cumulative frequency”, as we go from one interval to
the next.
• For example, if there are 2 values in the first interval and 5 in
the next, then the cumulative frequency is 2 + 5 = 7 for the second
interval.
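The running total can be sketched with `itertools.accumulate` from the Python standard library. The frequency list is an assumed example (it starts with the 2 and 5 mentioned above; the remaining counts are illustrative).

```python
from itertools import accumulate

# Assumed frequencies for six consecutive intervals (illustrative values).
frequencies = [2, 5, 10, 8, 3, 2]

# Each entry is the sum of all frequencies up to and including that interval.
cumulative = list(accumulate(frequencies))

print(cumulative)  # [2, 7, 17, 25, 28, 30]
```

The last cumulative frequency always equals the total number of data values.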
Cumulative Graph
• The increase in the height of the bar shows how
many data values were contributed by a given
interval.
The Middle
• In addition to the graphical summary, we can
also give numerical measurements which
describe the distribution of the data.
The middle?
Set of Heights
• The heights (in inches) of 30 third graders:
47.5 48.5 50 52 52 53 53 54 54 54
54.5 54.5 55 55 55 55.5 55.5 55.5 56 56
56 56.5 56.5 57 57 57 57 57.5 58 58
• How should we describe the "middle height"?
• For numerical data, we commonly compute the
"arithmetic average" of the values, also called the
mean value.
The Mean Value
• To compute the average:
find the sum of the values and
divide by the number of values in the set.
• For our 30 third-graders, we find the sum of the
30 heights and then divide by 30:
(47.5 + 48.5 + ... + 58 + 58)/30 = 1641/30 = 54.7 inches.
Compare this to “the middle” of the histogram.
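The same arithmetic can be checked with a short Python sketch; the variable names are illustrative.

```python
# Heights (in inches) of the 30 third graders.
heights = [47.5, 48.5, 50, 52, 52, 53, 53, 54, 54, 54,
           54.5, 54.5, 55, 55, 55, 55.5, 55.5, 55.5, 56, 56,
           56, 56.5, 56.5, 57, 57, 57, 57, 57.5, 58, 58]

total = sum(heights)          # sum of the values
mean = total / len(heights)   # divide by the number of values

print(total, len(heights), round(mean, 2))
```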
The “Middle Height”
• Looks to be in the middle!
Mean = 54.7
Sampling a Population
• We distinguish between a sample
and the entire population.
• A population consists of all the members of
the set under consideration
(e.g., all third-graders in the United States).
• A sample consists of a subset of members
selected from a population
(e.g., the 30 third-graders in our example).
Notation
The notation used depends on whether we’re using the
entire population or a sample.
If a selected sample is representative of the
population, we expect the mean of the sample
to be nearly equal to the mean for the population.
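The symbols themselves do not appear in the text above. A widely used convention, assumed here rather than taken from the source, writes the sample mean as $\bar{x}$ (with $n$ values in the sample) and the population mean as $\mu$ (with $N$ members in the population):

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
\qquad\text{(sample mean)}
\qquad\qquad
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
\qquad\text{(population mean)}
```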
Median Value
• The median value is literally defined to be the
middle data value. You may need to "split the
difference" by averaging two middle values.
• Half the data lies at or below the median and
the other half lies at or above the median.
• Median is another “measure of the middle”
but is less affected by non-typical data
values.
Median third-grader?
• Consider our previous data for 30 third-graders.
47.5 48.5 50 52 52 53 53 54 54 54
54.5 54.5 55 55 55 55.5 55.5 55.5 56 56
56 56.5 56.5 57 57 57 57 57.5 58 58
• An even number of data values, so we average
the two middle values.
• The median is (55 + 55.5)/2 = 55.25 inches.
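A minimal Python sketch of the same steps: sort the values, pick the two middle ones, and average them. The result is compared with `statistics.median` from the standard library.

```python
from statistics import median

# Heights (in inches) of the 30 third graders.
heights = [47.5, 48.5, 50, 52, 52, 53, 53, 54, 54, 54,
           54.5, 54.5, 55, 55, 55, 55.5, 55.5, 55.5, 56, 56,
           56, 56.5, 56.5, 57, 57, 57, 57, 57.5, 58, 58]

ordered = sorted(heights)
n = len(ordered)

# With an even count, the middle values are the 15th and 16th
# (list indices 14 and 15 when counting from zero).
middle_pair = (ordered[n // 2 - 1], ordered[n // 2])
by_hand = sum(middle_pair) / 2

print(middle_pair, by_hand, median(heights))
```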
Mean vs. Median
• In smaller samples, the median value is often a
better measure; it is less affected by a non-typical
score and is more representative of the middle.
• Suppose test scores were
23, 58, 64, 68, 75, 79, 83, 85, 87, 91, 94
• The median is 79.
• The mean is about 73.36, pulled down by the non-typical score of 23.
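To see how much one non-typical score matters, the sketch below computes both measures with and without the low score of 23. Dropping the score is a hypothetical comparison added here for illustration.

```python
from statistics import mean, median

scores = [23, 58, 64, 68, 75, 79, 83, 85, 87, 91, 94]

# All eleven scores.
print(median(scores), round(mean(scores), 2))

# Hypothetical comparison: the same scores without the non-typical 23.
without_low = [s for s in scores if s != 23]
print(median(without_low), round(mean(without_low), 2))
```

Removing the single low score moves the mean by about 5 points (73.36 to 78.4) but the median by only 2 (79 to 81).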
The Spread
• Another characteristic of a data set is how
widely the data values are spread.
• Find a way to measure how widely the values
vary.
• The measurement we use is called the
"standard deviation".
The Deviations
• Having determined the mean value $\bar{x}$, we can
measure how far each data value $x_i$ varies
from the middle.
• The difference or "deviation" from the middle
is computed as $x_i - \bar{x}$.
• Our goal is to compute a sort of average of
these deviations from the middle.
“16 ounce drink”
• Suppose a sample of 8 medium colas was
measured. The volumes, measured in ounces,
are given by the data below.
16.2 16.5 15.9 15.7
15.9 16.1 16.3 15.8
Volumes have an average or
mean value of 16.05 ounces.
Deviations in Colas
• Recall the contents of our 8 colas
where the mean value is 16.05 ounces.
data value    deviation from middle
15.7          15.7 - 16.05 = -0.35
15.8          15.8 - 16.05 = -0.25
15.9          15.9 - 16.05 = -0.15
15.9          15.9 - 16.05 = -0.15
16.1          16.1 - 16.05 =  0.05
16.2          16.2 - 16.05 =  0.15
16.3          16.3 - 16.05 =  0.25
16.5          16.5 - 16.05 =  0.45
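The deviations can be reproduced with a short Python sketch. Results are rounded to two decimals because floating-point subtraction leaves tiny errors in the last digits.

```python
# Volumes (in ounces) of the 8 colas, in increasing order.
volumes = [15.7, 15.8, 15.9, 15.9, 16.1, 16.2, 16.3, 16.5]

mean = sum(volumes) / len(volumes)

# Deviation of each value from the mean, rounded for display.
deviations = [round(v - mean, 2) for v in volumes]

print(round(mean, 2))
print(deviations)
print(round(sum(deviations), 2))  # negative and positive deviations cancel
```

The deviations from the mean always add up to zero (apart from rounding), so a plain average of them says nothing about the spread.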
Squared Deviations
• To prevent the negative and positive values from cancelling
each other out, we square them.
data value    deviation from middle    deviation squared
15.7          15.7 - 16.05 = -0.35     (-0.35)^2 = 0.1225
15.8          15.8 - 16.05 = -0.25     (-0.25)^2 = 0.0625
15.9          15.9 - 16.05 = -0.15     (-0.15)^2 = 0.0225
15.9          15.9 - 16.05 = -0.15     (-0.15)^2 = 0.0225
16.1          16.1 - 16.05 =  0.05     (0.05)^2  = 0.0025
16.2          16.2 - 16.05 =  0.15     (0.15)^2  = 0.0225
16.3          16.3 - 16.05 =  0.25     (0.25)^2  = 0.0625
16.5          16.5 - 16.05 =  0.45     (0.45)^2  = 0.2025
Avg. of Squared Deviations
• To average the squared deviations:
add the squared deviations and
divide by one less than the number of data
values in the sample.
• Finally, we "undo the squaring" by computing
the square root.
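Written as a single formula, the steps above (square each deviation, add, divide by one less than the sample size, take the square root) give the sample standard deviation. The symbols $n$ for the number of values and $\bar{x}$ for the sample mean follow common convention and are assumed here.

```latex
s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}
```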
data value    deviation squared
15.7          0.1225
15.8          0.0625
15.9          0.0225
15.9          0.0225
16.1          0.0025
16.2          0.0225
16.3          0.0625
16.5          0.2025
total = 0.5200 = sum of squared deviations
Average Spread
$s = \sqrt{0.5200 / 7} \approx 0.2726$ ounces (7 is one less than the 8 data
values) is a sort of average of how far the
data values vary from the middle.
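The whole calculation can be sketched step by step in Python and checked against `statistics.stdev`, which also divides by one less than the number of values.

```python
from math import sqrt
from statistics import stdev

# Volumes (in ounces) of the 8 colas.
volumes = [15.7, 15.8, 15.9, 15.9, 16.1, 16.2, 16.3, 16.5]

n = len(volumes)
mean = sum(volumes) / n

# Square each deviation, add them up, divide by n - 1, take the square root.
squared = [(v - mean) ** 2 for v in volumes]
total = sum(squared)
s = sqrt(total / (n - 1))

print(round(total, 4), round(s, 4), round(stdev(volumes), 4))
```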
Notation
• As with the mean value, notation depends
on whether the data represents the
population or a sample.
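The symbols are not shown in the text; by common convention (an assumption here), $s$ denotes the standard deviation of a sample and $\sigma$ that of a population. The sketch below treats the 8 cola volumes both ways, using `stdev` (divides by n - 1) and `pstdev` (divides by n) from Python's `statistics` module. Treating the colas as a whole population is hypothetical and only serves the comparison.

```python
from statistics import pstdev, stdev

# Volumes (in ounces) of the 8 colas.
volumes = [15.7, 15.8, 15.9, 15.9, 16.1, 16.2, 16.3, 16.5]

s = stdev(volumes)       # sample standard deviation: divide by n - 1
sigma = pstdev(volumes)  # population standard deviation: divide by n

print(round(s, 4), round(sigma, 4))
```

Dividing by n instead of n - 1 always gives the slightly smaller value.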
Compare
• The standard deviation describes the
“distribution of the data”.
• Which of the following distributions would you
expect to have the larger standard deviation?
Match the statistics
with the histograms
Bell-shaped Distribution
• For reasonably large random samples, we
often observe a "bell-shaped" distribution.
• In such cases, we expect to find about 68% of
the data within one std. dev. of the mean.
Also, about 95% of the data is expected to lie
within 2 standard deviations of the mean.
“Empirical Rule”
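The rule can be checked informally on simulated data. The sketch below draws a hypothetical sample of 10,000 values from a normal distribution (the mean of 100 and standard deviation of 15 are arbitrary choices, not from the text) and counts the share of values within one and two standard deviations of the sample mean.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical bell-shaped data: 10,000 draws from a normal distribution.
data = [random.gauss(100, 15) for _ in range(10_000)]

m = mean(data)
s = stdev(data)

# Share of values within 1 and within 2 standard deviations of the mean.
within_1 = sum(abs(x - m) <= s for x in data) / len(data)
within_2 = sum(abs(x - m) <= 2 * s for x in data) / len(data)

print(f"within 1 std. dev.: {within_1:.3f}")
print(f"within 2 std. dev.: {within_2:.3f}")
```

For a sample this large the two shares should land close to 0.68 and 0.95; small samples or data that are not bell-shaped can differ noticeably.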