Transcript classes

2.1 Frequency Distribution
A table that shows classes or intervals of data with a count of the number of entries
in each class. The frequency, f, of a class is the number of data entries in the class.
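Class       Frequency, f
1 – 5       5
6 – 10      8
11 – 15     6
16 – 20     8
21 – 25     5
26 – 30     4
Lower & upper class limits: each class runs from its lower limit to its upper limit (for example, 1 and 5 for the first class).
Class width = 5 (the difference between consecutive lower limits, such as 6 and 1).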
Constructing a Frequency Distribution
1. Decide on the number of classes (usually 5 to 20).
2. Find the class width.
   • Determine the range of the data (Xmax – Xmin).
   • Divide the range by the number of classes.
   • Round up to the next convenient number (*).
3. Find the class limits.
   • The minimum data entry can be used as the lower limit of the first class.
   • Remaining lower limits: lower limit of the preceding class + class width.
   • Upper limits: lower limit of the class + class width – 1.
4. Make a tally mark for each data entry in the row of the appropriate class.
5. Count the tally marks to find the total frequency f for each class.
(*) Report the class width as the next successive whole number (Ex: 7.3 becomes 8, 7 becomes 8, 7.9 becomes 8).
Larson/Farber
Example: Constructing a Frequency Distribution
The data set below lists the number of minutes 50 Internet subscribers spent on the
Internet during their most recent session. Construct a frequency distribution that has
seven classes.
50 40 41 17 11 7 22 44 28 21 19 23 37 51 54 42 86
41 78 56 72 56 17 7 69 30 80 56 29 33 46 31 39 20
18 29 34 59 73 77 36 39 30 62 54 67 39 31 53 44
1. Number of classes = 7 (given)
2. Find the class width: Range / number of classes = (86 – 7) / 7 ≈ 11.3, rounded up to 12
3. Find the lower & upper limits of each class.
4. Tally the frequencies.
5. Write the frequency for each class.
Class (minutes online)   Tally            Frequency, f (# of subscribers)
7 – 18                   IIII I           6
19 – 30                  IIII IIII        10
31 – 42                  IIII IIII III    13
43 – 54                  IIII III         8
55 – 66                  IIII             5
67 – 78                  IIII I           6
79 – 90                  II               2
                                          Σf = 50
Larson/Farber 4th ed.
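The construction steps can be sketched in Python. This is a minimal illustration, not part of the original slides: the function name `frequency_distribution` is hypothetical, and the width rule assumes whole-number data so that "next successive whole number" can be computed with integer division.

```python
def frequency_distribution(data, num_classes):
    """Return (class_width, [(lower, upper, frequency), ...]) for whole-number data."""
    lo, hi = min(data), max(data)
    # Next successive whole number above range / num_classes (7 -> 8, 7.3 -> 8).
    width = (hi - lo) // num_classes + 1
    classes = []
    for i in range(num_classes):
        lower = lo + i * width          # lower limit of the preceding class + width
        upper = lower + width - 1       # lower limit of the class + width - 1
        freq = sum(1 for x in data if lower <= x <= upper)
        classes.append((lower, upper, freq))
    return width, classes


# Minutes online for the 50 Internet subscribers in the example.
minutes = [
    50, 40, 41, 17, 11, 7, 22, 44, 28, 21, 19, 23, 37, 51, 54, 42, 86,
    41, 78, 56, 72, 56, 17, 7, 69, 30, 80, 56, 29, 33, 46, 31, 39, 20,
    18, 29, 34, 59, 73, 77, 36, 39, 30, 62, 54, 67, 39, 31, 53, 44,
]

width, classes = frequency_distribution(minutes, 7)
print("class width:", width)
for lower, upper, freq in classes:
    print(f"{lower:>2} - {upper:>2}  {freq}")
```

Running it on the example data gives a class width of 12 and the same seven classes and frequencies as the table.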
Frequency Distribution
(with additional data features)
Class              Frequency, f         Midpoint   Relative    Cumulative
(minutes online)   (# of subscribers)              frequency   frequency
7 – 18             6                    12.5       0.12        6
19 – 30            10                   24.5       0.20        16
31 – 42            13                   36.5       0.26        29
43 – 54            8                    48.5       0.16        37
55 – 66            5                    60.5       0.10        42
67 – 78            6                    72.5       0.12        48
79 – 90            2                    84.5       0.04        50
                   Σf = 50                         Σ(f/n) = 1
Midpoint calculation:
\(\text{Midpoint} = \frac{(\text{Lower class limit}) + (\text{Upper class limit})}{2}\)
Relative frequency of a class: the portion (percentage) of the data in a class.
\(\text{Relative frequency} = \frac{\text{class frequency}}{\text{sample size}} = \frac{f}{n}\)
Cumulative frequency of a class: the sum of the frequency for that class and all previous classes.
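As a rough sketch, the three added columns can be computed from the class limits and frequencies in the table. The variable names are illustrative and not from the slides.

```python
from itertools import accumulate

# Class limits and frequencies taken from the Internet usage table.
limits = [(7, 18), (19, 30), (31, 42), (43, 54), (55, 66), (67, 78), (79, 90)]
freqs = [6, 10, 13, 8, 5, 6, 2]
n = sum(freqs)  # sample size

midpoints = [(lower + upper) / 2 for lower, upper in limits]
relative = [f / n for f in freqs]        # f / n for each class
cumulative = list(accumulate(freqs))     # running total of the frequencies

for (lower, upper), f, m, r, c in zip(limits, freqs, midpoints, relative, cumulative):
    print(f"{lower:>2} - {upper:>2}  {f:>2}  {m:>5}  {r:.2f}  {c:>2}")
```

The last cumulative frequency equals the sample size, which is a quick check that no class was skipped.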
Class Boundaries
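• Class boundaries are the numbers that separate classes without leaving gaps between them. Here each boundary lies halfway between the upper limit of one class and the lower limit of the next.
Class      Class boundaries   Frequency, f
7 – 18     6.5 – 18.5         6
19 – 30    18.5 – 30.5        10
31 – 42    30.5 – 42.5        13
43 – 54    42.5 – 54.5        8
55 – 66    54.5 – 66.5        5
67 – 78    66.5 – 78.5        6
79 – 90    78.5 – 90.5        2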
Frequency Histogram
• A bar graph that represents the frequency distribution.
• The horizontal scale is quantitative and measures the data values.
• The vertical scale measures the frequencies of the classes.
• Consecutive bars must touch.
[Figure: frequency histogram of the Internet usage data. Horizontal axis: data values, labeled either using class boundaries (6.5, 18.5, 30.5, 42.5, 54.5, 66.5, 78.5, 90.5) or using class midpoints. Vertical axis: frequency.]
More than half of the subscribers spent between 19 and 54 minutes on the Internet during their most recent session.
Frequency Polygon
• A line graph that emphasizes the continuous change in frequencies.
[Figure: frequency polygon titled "Internet Usage". Horizontal axis: time online (in minutes), marked at 0.5, 12.5, 24.5, 36.5, 48.5, 60.5, 72.5, 84.5, 96.5. Vertical axis: frequency, from 0 to 14.]
Class      Midpoint   Frequency, f
7 – 18     12.5       6
19 – 30    24.5       10
31 – 42    36.5       13
43 – 54    48.5       8
55 – 66    60.5       5
67 – 78    72.5       6
79 – 90    84.5       2
The graph should begin and end on the horizontal axis,
so extend the left side to one class width before the first
class midpoint and extend the right side to one class
width after the last class midpoint.
You can see that the frequency of subscribers increases up
to 36.5 minutes and then decreases.
Relative Frequency Histogram
• Same shape and same horizontal scale as corresponding frequency histogram.
• The vertical scale measures the relative frequencies, not frequencies.
[Figure: relative frequency histogram of the Internet usage data. Horizontal axis: class boundaries 6.5, 18.5, 30.5, 42.5, 54.5, 66.5, 78.5, 90.5. Vertical axis: relative frequency.]
Class      Class boundaries   Frequency, f   Relative frequency
7 – 18     6.5 – 18.5         6              0.12
19 – 30    18.5 – 30.5        10             0.20
31 – 42    30.5 – 42.5        13             0.26
43 – 54    42.5 – 54.5        8              0.16
55 – 66    54.5 – 66.5        5              0.10
67 – 78    66.5 – 78.5        6              0.12
79 – 90    78.5 – 90.5        2              0.04
From this graph you can see that 20% of Internet subscribers spent between 18.5 minutes and 30.5 minutes online.
2.2 More Graphs and Displays
Graphing for Quantitative Data
Stem-and-leaf plot
• Each number separated into a stem & a leaf.
• Still contains original data values.
Data: 21, 25, 25, 26, 27, 28, 30, 36, 36, 45
2 | 1 5 5 6 7 8
3 | 0 6 6
4 | 5
Key: 2 | 1 = 21
Dot plot
• Each data entry is plotted, using a point, above a horizontal axis
Data: 21, 25, 25, 26, 27, 28, 30, 36, 36, 45
[Figure: dot plot of the data above a horizontal axis running from 20 to 45.]
Examples: Graphing Quantitative Data
The following are the numbers of text messages sent last month by the
cellular phone users on one floor of a college dormitory.
155 159 144 129 105 145 126 116 130 114 122 112 112 142 126 118
118 108 122 121 109 140 126 119 113 117 118 109 109 119 148 147 126
139 139 122 78 133 126 123 145 121 134 124 119 132 133 124 129 112
[Figures: stem-and-leaf plots and a dot plot of the text message data.]
More than 50% of the cellular phone users sent between 110 and 130 text messages.
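Since the plots themselves did not survive in this transcript, here is a minimal Python sketch of how a stem-and-leaf plot could be built for the text message data. The helper name `stem_and_leaf` is hypothetical, and it assumes whole-number data with the last digit used as the leaf.

```python
from collections import defaultdict


def stem_and_leaf(data):
    """Map each stem (all digits but the last) to its sorted leaves (the last digit)."""
    plot = defaultdict(list)
    for value in sorted(data):
        plot[value // 10].append(value % 10)
    return dict(plot)


# Text messages sent last month by the 50 cellular phone users.
texts = [
    155, 159, 144, 129, 105, 145, 126, 116, 130, 114, 122, 112, 112, 142, 126, 118,
    118, 108, 122, 121, 109, 140, 126, 119, 113, 117, 118, 109, 109, 119, 148, 147, 126,
    139, 139, 122, 78, 133, 126, 123, 145, 121, 134, 124, 119, 132, 133, 124, 129, 112,
]

plot = stem_and_leaf(texts)
# Print every stem from smallest to largest, including stems with no leaves.
for stem in range(min(plot), max(plot) + 1):
    leaves = " ".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem:>2} | {leaves}")
```

Stems 8 and 9 print with no leaves, which makes the entry 78 stand out from the rest of the data.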
More Graphs for Qualitative Data Sets
Pareto Chart
• A vertical bar graph in which the height of each bar represents frequency or relative frequency.
• The bars are positioned in order of decreasing height, with the tallest bar positioned at the left.
[Figure: Pareto chart. Horizontal axis: categories. Vertical axis: frequency.]
Pie Chart
• A circle is divided into sectors that represent categories.
• The area of each sector is proportional to the frequency of each category.
Example: Pie Chart (Qualitative Data)
The numbers of motor vehicle occupants killed in crashes in 2005 are shown in
the table. A pie chart is used to organize the data. (Source: U.S. Department of
Transportation, National Highway Traffic Safety Administration)
Vehicle type   Killed (frequency)   Relative frequency      Central angle
Cars           18,440               18440/37594 ≈ 0.49      360°(0.49) ≈ 176°
Trucks         13,778               13778/37594 ≈ 0.37      360°(0.37) ≈ 133°
Motorcycles    4,553                4553/37594 ≈ 0.12       360°(0.12) ≈ 43°
Other          823                  823/37594 ≈ 0.02        360°(0.02) ≈ 7°
               Σf = 37,594
From the pie chart, you can see that most fatalities in motor vehicle crashes were those involving the occupants of cars.
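The relative frequencies and central angles in the table can be reproduced with a short Python sketch. The names are illustrative, and the relative frequency is rounded to two decimals before multiplying by 360, as the table does.

```python
# Occupants killed in crashes in 2005, by vehicle type (from the table).
killed = {"Cars": 18440, "Trucks": 13778, "Motorcycles": 4553, "Other": 823}
total = sum(killed.values())

angles = {}
for vehicle, f in killed.items():
    rel = round(f / total, 2)            # relative frequency, rounded as in the table
    angles[vehicle] = round(360 * rel)   # central angle in degrees
    print(f"{vehicle:<12} {f:>6}  {rel:.2f}  {angles[vehicle]:>3} degrees")
```

Because of rounding, the four angles add up to 359° rather than exactly 360°.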
Example: Pareto Chart (Qualitative Data)
In a recent year, the retail industry lost $41.0 million in inventory shrinkage.
Inventory shrinkage is the loss of inventory through breakage, pilferage,
shoplifting, and so on. The causes of the inventory shrinkage are
administrative error ($7.8 million), employee theft ($15.6 million), shoplifting
($14.7 million), and vendor fraud ($2.9 million). Use a Pareto chart to organize
this data. (Source: National Retail Federation and Center for Retailing
Education, University of Florida)
Cause             $ (million)
Admin. error      7.8
Employee theft    15.6
Shoplifting       14.7
Vendor fraud      2.9
From the graph, it is easy to see that the causes of inventory shrinkage that should be addressed first are employee theft and shoplifting.
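The key step in a Pareto chart is ordering the categories by decreasing height. A minimal sketch with the shrinkage figures follows; the text bars are only a stand-in for the chart, which is not reproduced here.

```python
# Inventory shrinkage by cause, in millions of dollars (from the table).
shrinkage = {
    "Admin. error": 7.8,
    "Employee theft": 15.6,
    "Shoplifting": 14.7,
    "Vendor fraud": 2.9,
}

# Tallest bar first, as a Pareto chart requires.
pareto_order = sorted(shrinkage.items(), key=lambda item: item[1], reverse=True)
for cause, amount in pareto_order:
    print(f"{cause:<15} {amount:>5.1f}  " + "#" * round(amount))
```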
2.2 More Graphs for Paired Data Sets
(Each entry in one data set corresponds to one entry in a second data set.)
Scatter Plot
• The ordered pairs are graphed as points in a coordinate plane.
• Used to show the relationship between two quantitative variables.
[Figure: scatter plot with axes x and y.]
Time Series
• Data set is composed of quantitative entries taken at regular intervals over a period of time.
• Example: The amount of precipitation measured each day for one month.
[Figure: time series chart. Horizontal axis: time. Vertical axis: quantitative data.]
Example: Scatter Plot (Paired Data)
The British statistician Ronald Fisher introduced a famous data set called
Fisher's Iris data set. This data set describes various physical
characteristics, such as petal length and petal width (in millimeters), for
three species of iris. The petal lengths form the first data set and the petal
widths form the second data set. (Source: Fisher, R. A., 1936)
Interpretation
As the petal length increases,
the petal width also tends to
increase.
Each point in the scatter plot represents the
petal length and petal width of one flower.
Example: Time Series Chart (Paired Data)
The table lists the number of cellular telephone subscribers (in millions) for the
years 1995 through 2005. Construct a time series chart for the number of cellular
subscribers. (Source: Cellular Telecommunication & Internet Association)
The graph shows that the number of subscribers has been increasing since 1995, with greater increases from 2003 to 2005.
2.3 Measures of Central Tendency
(Typical or central entry of a data set)
Mean
Median
Mode
Mean (average)
• The sum of all the data entries divided by the number of entries.
• Sigma notation: Σx = add all of the data entries (x) in the data set.
• Population mean: \(\mu = \frac{\sum x}{N}\)
• Sample mean: \(\bar{x} = \frac{\sum x}{n}\)
Example: The prices (in dollars) for a sample of roundtrip flights from
Chicago, Illinois to Cancun, Mexico are listed. Find the mean flight price.
872 432 397 427 388 782 397
Σx = 872 + 432 + 397 + 427 + 388 + 782 + 397 = 3695
\(\bar{x} = \frac{\sum x}{n} = \frac{3695}{7} \approx 527.9\)
Mean flight price is about $527.90.
Median
• The value that lies in the middle of the data when the data set is ordered.
• Measures the center of an ordered data set by dividing it into two equal parts.
• If the data set has an
  - odd number of entries: the median is the middle data entry.
  - even number of entries: the median is the mean of the two middle data entries.
Example 1: The prices (in dollars) for a sample of roundtrip flights from Chicago, Illinois to Cancun, Mexico are listed. Find the median of the flight prices.
872 432 397 427 388 782 397
Order the data and find the middle entry: 388 397 397 427 432 782 872, so the median is 427.
Example 2: The flight priced at $432 is no longer available. What is the median price of the remaining flights?
388 397 397 427 782 872
\(\text{Median} = \frac{397 + 427}{2} = 412\)
Mode
• The data entry that occurs with the greatest frequency.
• If no entry is repeated the data set has no mode.
• If two entries occur with the same greatest frequency, each entry is a mode
(bimodal).
Example 1: The prices (in dollars) for a sample of roundtrip flights from Chicago, Illinois to Cancun, Mexico are listed. Find the mode of the flight prices.
872 432 397 427 388 782 397
• Ordering the data helps to find the mode.
388 397 397 427 432 782 872
The mode of the flight prices is $397.
Example 2: At a political debate a sample of audience members was asked to name the political party to which they belong. Their responses are shown in the table. What is the mode of the responses?
Political Party    Frequency, f
Democrat           34
Republican         56
Other              21
Did not respond    9
The mode is Republican, the response with the greatest frequency.
Comparing the Mean, Median, and Mode
The mean is a reliable measure; it takes into account every entry of a data set,
BUT, the mean is greatly affected by outliers (a data entry that is far removed
from the other entries in the data set).
Example: Find the mean, median, and mode of the sample ages of a class
shown. Which measure of central tendency best describes a typical entry of
this data set? Are there any outliers?
Ages in a class:
20 20 20 20 20 20 21 21 21 21 22 22 22 23 23 23 23 24 24 65
Mean: \(\bar{x} = \frac{\sum x}{n} = \frac{20 + 20 + \dots + 24 + 65}{20} \approx 23.8\) years
Median: \(\frac{21 + 22}{2} = 21.5\) years
Mode: 20 years (the entry occurring with the greatest frequency)
• The mean takes every entry into account, but is influenced by the outlier of 65.
• The median also takes every entry into account, and it is not affected by the outlier.
• In this case the mode exists, but it doesn't appear to represent a typical entry.
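The three measures for the class ages can be checked with Python's standard `statistics` module. This is an illustrative sketch; dropping the entry 65 at the end is an added experiment, not part of the slide, to show how much each measure depends on the outlier.

```python
from statistics import mean, median, multimode

ages = [20, 20, 20, 20, 20, 20, 21, 21, 21, 21,
        22, 22, 22, 23, 23, 23, 23, 24, 24, 65]

print("mean:  ", mean(ages))        # sum of entries / number of entries
print("median:", median(ages))      # mean of the two middle entries (n is even)
print("mode:  ", multimode(ages))   # list of the most frequent entries

# Same data without the outlier of 65.
without_outlier = [a for a in ages if a != 65]
print("mean without 65:  ", round(mean(without_outlier), 1))
print("median without 65:", median(without_outlier))
```

Removing the outlier moves the mean by about two years, while the median moves by only half a year.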
Example: Finding a Weighted Mean
You are taking a class in which your grade is determined from five sources:
50% from your test mean, 15% from your midterm, 20% from your final
exam, 10% from your computer lab work, and 5% from your homework. Your
scores are 86 (test mean), 96 (midterm), 82 (final exam), 98 (computer lab),
and 100 (homework). What is the weighted mean of your scores? If the
minimum average for an A is 90, did you get an A?
Source          Score, x   Weight, w   x∙w
Test Mean       86         0.50        86(0.50) = 43.0
Midterm         96         0.15        96(0.15) = 14.4
Final Exam      82         0.20        82(0.20) = 16.4
Computer Lab    98         0.10        98(0.10) = 9.8
Homework        100        0.05        100(0.05) = 5.0
                           Σw = 1      Σ(x∙w) = 88.6
The data have varying weights.
\(\bar{x} = \frac{\sum (x \cdot w)}{\sum w} = \frac{88.6}{1} = 88.6\)
Your weighted mean is 88.6, which is below 90, so you did not get an A.
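A small Python sketch of the weighted mean calculation follows. The function name `weighted_mean` is illustrative. Dividing by the sum of the weights keeps the formula valid even when the weights do not add up to 1.

```python
def weighted_mean(pairs):
    """Weighted mean of (score, weight) pairs: sum(x * w) / sum(w)."""
    total_weight = sum(w for _, w in pairs)
    return sum(x * w for x, w in pairs) / total_weight


# (score, weight) for test mean, midterm, final exam, computer lab, homework.
scores_and_weights = [(86, 0.50), (96, 0.15), (82, 0.20), (98, 0.10), (100, 0.05)]

grade = weighted_mean(scores_and_weights)
print("weighted mean:", round(grade, 1))
print("earned an A:", grade >= 90)
```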
The Shape of Distributions
Symmetric Distribution
• A vertical line can be drawn through
the middle of a graph of the
distribution and the resulting halves
are approximately mirror images.
Uniform Distribution (rectangular)
• All entries or classes in the
distribution have equal or
approximately equal frequencies.
• Symmetric.
Skewed Left Distribution (negative skew)
• "Tail" of the graph elongates more to the left.
• The mean is to the left of the median.
Skewed Right Distribution (positive skew)
• "Tail" of the graph elongates more to the right.
• The mean is to the right of the median.
2.4 Measures of Deviation
• Variation in data
• How individual data values vary within a given data set
Range
• Quantitative data only
• The difference between the
maximum and minimum data
entries in the set.
• Range = (Xmax - Xmin)
• Advantage: Easy to compute
• Disadvantage: Only uses 2
data entries (not all)
Example: Corporation A hired 10 graduates.
The starting salaries for each graduate are
shown. Find the range of the starting salaries.
Starting salaries (1000s of dollars)
41 38 39 45 47 41 44 41 37 42
Xmax = 47
Xmin = 37
Range = 47 – 37 = 10
Corporation B's starting salaries are below:
40 23 41 50 49 32 41 29 52 58
Xmax = 58
Xmin = 23
Range = 58 – 23 = 35
Note: Both corporation data sets have the same mean, median & mode. The range shows us how 'varied' the data is!
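The note about the two corporations can be verified with a few lines of Python. This is a sketch with an illustrative helper name, `data_range`.

```python
from statistics import mean, median, mode

# Starting salaries in thousands of dollars.
corp_a = [41, 38, 39, 45, 47, 41, 44, 41, 37, 42]
corp_b = [40, 23, 41, 50, 49, 32, 41, 29, 52, 58]


def data_range(data):
    """Range = maximum entry - minimum entry."""
    return max(data) - min(data)


for name, salaries in (("A", corp_a), ("B", corp_b)):
    print(name, mean(salaries), median(salaries), mode(salaries), data_range(salaries))
```

Both rows show the same mean, median and mode, and only the range (10 versus 35) distinguishes the two data sets.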
Deviation, Variance, and Standard
Deviation
Deviation
• The difference between the data entry, x, and the mean of the data set.
• Population data set: Deviation of x = \(x - \mu\)
• Sample data set: Deviation of x = \(x - \bar{x}\)
Deviations for all data entries in Corporation A's starting salary data set.
Mean: \(\mu = \frac{\sum x}{N} = \frac{415}{10} = 41.5\)
Salary ($1000s), x   Deviation, x – μ
41                   41 – 41.5 = –0.5
38                   38 – 41.5 = –3.5
39                   39 – 41.5 = –2.5
45                   45 – 41.5 = 3.5
47                   47 – 41.5 = 5.5
41                   41 – 41.5 = –0.5
44                   44 – 41.5 = 2.5
41                   41 – 41.5 = –0.5
37                   37 – 41.5 = –4.5
42                   42 – 41.5 = 0.5
Σx = 415             Σ(x – μ) = 0
The sum of deviations = 0. This is true for any data set, so we use the squares of the deviations instead.
Deviation, Variance, and Standard
Deviation
Population Variance
\(\sigma^2 = \frac{\sum (x - \mu)^2}{N}\)
Population Standard Deviation
\(\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (x - \mu)^2}{N}}\)
Sample Variance
\(s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}\)
Sample Standard Deviation
\(s = \sqrt{s^2} = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}\)
Note: For 'grouped data' organized into a frequency distribution use:
\(s = \sqrt{s^2} = \sqrt{\frac{\sum (x - \bar{x})^2 f}{n - 1}}\)
(Population) Standard Deviation
Step 1: Find the mean of the data set: \(\mu = \frac{\sum x}{N}\)
Step 2: Find the deviation of each entry: \(x - \mu\)
Step 3: Square each deviation: \((x - \mu)^2\)
Step 4: Add to get the sum of squares, SSx: \(SS_x = \sum (x - \mu)^2\)
Step 5: Divide by N to get the variance: \(\sigma^2 = \frac{\sum (x - \mu)^2}{N}\)
Step 6: Take the square root to get the standard deviation: \(\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}\)
**Question**
How would the directions change for a SAMPLE Standard Deviation?
Standard Deviation
The following data represents the
midterm grade percentages of all
students in an algebra class. Find the
standard deviation of the data.
57 55 72 75 84 69 69 90 68 76 85 50
56 13 76 49 93 78 73 60 62 70 38
Number of data values: N = 23
Mean: \(\mu = \frac{\sum x}{N} = \frac{1518}{23} = 66\)
Variance: \(\sigma^2 = \frac{\sum (x - \mu)^2}{N} = \frac{7030}{23} \approx 305.65\)
Standard deviation: \(\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (x - \mu)^2}{N}} \approx 17.5\)
Grades (x)   Deviation (x – μ)   (x – μ)²
57           -9                  81
55           -11                 121
72           6                   36
75           9                   81
84           18                  324
69           3                   9
69           3                   9
90           24                  576
68           2                   4
76           10                  100
85           19                  361
50           -16                 256
56           -10                 100
13           -53                 2809
76           10                  100
49           -17                 289
93           27                  729
78           12                  144
73           7                   49
60           -6                  36
62           -4                  16
70           4                   16
38           -28                 784
             Σ(x – μ) = 0        SSx = Σ(x – μ)² = 7030
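The six steps can be followed line by line in Python. This is a sketch rather than the slide's own method: the variable names are illustrative, and the last line shows the only change needed for a sample (divide by n – 1 instead of N), which addresses the question posed earlier. The standard-library functions `pstdev` and `stdev` are used only as a cross-check.

```python
from statistics import pstdev, stdev

grades = [57, 55, 72, 75, 84, 69, 69, 90, 68, 76, 85, 50,
          56, 13, 76, 49, 93, 78, 73, 60, 62, 70, 38]

N = len(grades)
mu = sum(grades) / N                          # Step 1: mean
deviations = [x - mu for x in grades]         # Step 2: deviation of each entry
squares = [d ** 2 for d in deviations]        # Step 3: square each deviation
ss_x = sum(squares)                           # Step 4: sum of squares
variance = ss_x / N                           # Step 5: population variance
sigma = variance ** 0.5                       # Step 6: population standard deviation

# If the grades were a sample instead, divide by N - 1.
s = (ss_x / (N - 1)) ** 0.5

print(N, mu, ss_x, round(variance, 2), round(sigma, 2), round(s, 2))
print(pstdev(grades), stdev(grades))          # library cross-check
```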
Using Technology for Calculations
The TI-83/84 calculator can do some of this work for you.
1. <STAT> <ENTER>
2. Choose a column such as L3 and enter data.
3. <STAT>, Arrow over to <CALC> <ENTER>
4. See: 1-Var Stats <2nd> <L3> <ENTER>
5. See a readout of the summary statistics.
[Figure: calculator screen showing the 1-Var Stats readout.]
Note: You can also do these functions separately using <LIST> <MATH>.
Interpreting Standard Deviation
• Standard deviation is a measure of the typical amount an entry deviates from the mean.
• The more the entries are spread out, the greater the standard deviation.
Empirical Rule (68 – 95 – 99.7 Rule)
For data with a (symmetric) bell-shaped distribution, the standard deviation has the
following characteristics:
• About 68% of the data lie within one standard deviation of the mean.
• About 95% of the data lie within two standard deviations of the mean.
• About 99.7% of the data lie within three standard deviations of the mean.
Interpreting Standard Deviation:
Empirical Rule (68 – 95 – 99.7 Rule)
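[Figure: bell-shaped curve with the horizontal axis marked at \(\bar{x} - 3s\), \(\bar{x} - 2s\), \(\bar{x} - s\), \(\bar{x}\), \(\bar{x} + s\), \(\bar{x} + 2s\), \(\bar{x} + 3s\).]
• 68% within 1 standard deviation: 34% on each side of the mean.
• 95% within 2 standard deviations: a further 13.5% on each side, between 1 and 2 standard deviations from the mean.
• 99.7% within 3 standard deviations: a further 2.35% on each side, between 2 and 3 standard deviations from the mean.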
Example: Using the Empirical Rule
In a survey conducted by the National Center for Health Statistics, the sample
mean height of women in the United States (ages 20-29) was 64 inches, with a
sample standard deviation of 2.71 inches. Estimate the percent of the women
whose heights are between 64 inches and 69.42 inches.
• Because the distribution is bell-shaped, you can use the Empirical Rule.
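\(\bar{x} - 3s\) = 55.87
\(\bar{x} - 2s\) = 58.58
\(\bar{x} - s\) = 61.29
\(\bar{x}\) = 64
\(\bar{x} + s\) = 66.71
\(\bar{x} + 2s\) = 69.42
\(\bar{x} + 3s\) = 72.13
[Figure: bell-shaped curve with 34% of the area between \(\bar{x}\) and \(\bar{x} + s\) and 13.5% between \(\bar{x} + s\) and \(\bar{x} + 2s\).]
34% + 13.5% = 47.5% of women are between 64 and 69.42 inches tall.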
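The arithmetic in this example can be sketched in Python. The band percentages are the ones from the bell-curve slide. The helper `percent_between` is hypothetical and only works for whole numbers of standard deviations between -3 and 3.

```python
x_bar, s = 64, 2.71  # sample mean and standard deviation of the heights (inches)

# Heights at -3, -2, ..., +3 standard deviations from the mean.
marks = {k: round(x_bar + k * s, 2) for k in range(-3, 4)}

# Percent of data in each band between consecutive marks, left to right.
bands = [2.35, 13.5, 34, 34, 13.5, 2.35]


def percent_between(k_low, k_high):
    """Approximate percent of data between k_low and k_high standard deviations."""
    return sum(bands[k_low + 3:k_high + 3])


print(marks)
print(percent_between(0, 2), "% of heights between", marks[0], "and", marks[2])
```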
Chebychev’s Theorem
For data with any shape distribution:
• The portion of any data set lying within k standard deviations (k > 1) of the mean is at least \(1 - \frac{1}{k^2}\).
• 2 standard deviations (k = 2): at least \(1 - \frac{1}{2^2} = \frac{3}{4}\), or 75%, of the data lie within 2 standard deviations of the mean.
• 3 standard deviations (k = 3): at least \(1 - \frac{1}{3^2} = \frac{8}{9}\), or 88.9%, of the data lie within 3 standard deviations of the mean.
Example: The age distribution for Florida is shown in the histogram. Apply
Chebychev’s Theorem to the data using k = 2. What can you conclude?
k = 2: μ – 2σ = 39.2 – 2(24.8) = -10.4 (Use 0 - age is non-negative)
μ + 2σ = 39.2 + 2(24.8) = 88.8
Conclusion: At least 75% of the population of
Florida is between 0 and 88.8 years old.
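A minimal Python sketch of the bound and of the Florida example follows. The function name `chebychev_bound` is illustrative. The lower end of the interval is clipped at 0 because an age cannot be negative, as the example notes.

```python
def chebychev_bound(k):
    """Minimum portion of any data set within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("Chebychev's Theorem requires k > 1")
    return 1 - 1 / k ** 2


mu, sigma = 39.2, 24.8   # mean and standard deviation of the Florida ages
k = 2
low = max(0, mu - k * sigma)   # mu - 2*sigma is negative, so use 0
high = mu + k * sigma

print(f"At least {chebychev_bound(k):.0%} of ages lie between {low} and {high:.1f}")
print(f"k = 3: at least {chebychev_bound(3):.1%}")
```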
2.5 Measures of Position
• Fractiles are numbers that partition (divide) an ordered data set into equal parts.
• Quartiles approximately divide an ordered data set into four equal parts.
  - First quartile, Q1: About ¼ of the data fall on or below Q1.
  - Second quartile, Q2: About ½ of the data fall on or below Q2 (median).
  - Third quartile, Q3: About ¾ of the data fall on or below Q3.
  - Interquartile Range (IQR): Q3 – Q1
Example: The test scores of 15 employees enrolled in a CPR training course are listed. Find the first, second, and third quartiles of the test scores.
13 9 18 15 14 21 7 10 11 20 5 18 37 16 17
Step 1: Order the data:
5 7 9 10 11 13 14 | 15 | 16 17 18 18 20 21 37
(lower half | Q2 | upper half)
Step 2: Find the median (Q2): Q2 = 15
Step 3: Find Q1 & Q3 (medians of the lower & upper halves respectively): Q1 = 10, Q3 = 18
About ¼ of the employees scored 10 or less.
Percentiles: Divide a data set into 100 equal parts.
• Often used in education & health fields. Ex: A student scored in the 95th percentile on the math test, better than 95% of the other students.
• Q1 = 25th percentile, Q2 = 50th percentile, Q3 = 75th percentile
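The three steps of the quartile example can be sketched in Python. The function name `quartiles` is illustrative, and it follows the method used above (medians of the lower and upper halves, excluding the middle entry when n is odd). Other quartile conventions exist and can give slightly different values.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, Q2, Q3) using medians of the lower and upper halves."""
    ordered = sorted(data)                       # Step 1: order the data
    half = len(ordered) // 2
    lower = ordered[:half]
    # Skip the middle entry when the number of entries is odd.
    upper = ordered[half + 1:] if len(ordered) % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)


scores = [13, 9, 18, 15, 14, 21, 7, 10, 11, 20, 5, 18, 37, 16, 17]
q1, q2, q3 = quartiles(scores)
iqr = q3 - q1
print("Q1 =", q1, " Q2 =", q2, " Q3 =", q3, " IQR =", iqr)
```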
Box-and-Whisker Plot
• Exploratory data analysis tool that highlights important features of a data set.
• Requires the five-number summary: minimum entry, Q1, Q2 (median), Q3, and maximum entry.
Creating a Box-and-whisker plot
1. Find the five-number summary of the data set.
2. Construct a horizontal scale that spans the range of the data.
3. Plot the five numbers above the horizontal scale.
4. Draw a box above the horizontal scale from Q1 to Q3 and draw a vertical line in the box at Q2.
5. Draw whiskers from the box to the minimum and maximum entries.
Example: Draw a box-and-whisker plot.
Minimum value = 6, Q1 = 10, Q2 = 18, Q3 = 31, Maximum value = 104
[Figure: box-and-whisker plot labeled, from left to right: minimum entry, whisker, Q1, box, median (Q2), Q3, whisker, maximum entry.]
About half the scores are between 10 & 31.
There is a possible outlier of 104.
The Standard Score (Z-Score)
• The number of standard deviations a given value x falls from the mean μ:
  \(z = \frac{\text{value} - \text{mean}}{\text{standard deviation}} = \frac{x - \mu}{\sigma}\)
• Negative z: The x-value is below the mean.
• Positive z: The x-value is above the mean.
• Zero z: The x-value is equal to the mean.
Example: In 2007, Forest Whitaker won the Best Actor Oscar at age 45 for his role in the movie The Last King of Scotland. Helen Mirren won the Best Actress Oscar at age 61 for her role in The Queen. The mean age of all best actor winners is 43.7, with a standard deviation of 8.8. The mean age of all best actress winners is 36, with a standard deviation of 11.5. Find the z-score that corresponds to the age for each actor or actress. Compare results.
• Forest Whitaker: \(z = \frac{x - \mu}{\sigma} = \frac{45 - 43.7}{8.8} \approx 0.15\) (0.15 standard deviations above the mean)
• Helen Mirren: \(z = \frac{x - \mu}{\sigma} = \frac{61 - 36}{11.5} \approx 2.17\) (2.17 standard deviations above the mean)
Comparing the two, Helen Mirren's age was much further above the mean for her category than Forest Whitaker's was for his.
[Figure: bell-shaped curve marking the usual range and the unusual range of z-scores.]
Unusual scores occur about 5% of the time.
Very unusual scores occur about 0.3% of the time.
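The two z-scores can be reproduced with a short Python sketch. The function name `z_score` is illustrative. Treating |z| > 2 as unusual is an assumption here, based on the Empirical Rule's 95% within two standard deviations.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    return (x - mu) / sigma


whitaker = z_score(45, 43.7, 8.8)   # Best Actor winners: mean 43.7, std. dev. 8.8
mirren = z_score(61, 36, 11.5)      # Best Actress winners: mean 36, std. dev. 11.5

for name, z in (("Forest Whitaker", whitaker), ("Helen Mirren", mirren)):
    # Assumption: a score more than 2 standard deviations from the mean is unusual.
    label = "unusual" if abs(z) > 2 else "usual"
    print(f"{name}: z = {z:.2f} ({label})")
```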