Chapter 3 - UniMAP Portal

Chapter 3
EXPLORATORY DATA ANALYSIS
3.1 GRAPHICAL DISPLAY OF DATA
3.2 MEASURES OF CENTRAL TENDENCY
3.3 MEASURES OF DISPERSION
3.1 Graphical Display of Data

Most of the statistical information in newspapers, magazines, company
reports and other publications consists of data that are summarized and
presented in a form that is easy for the reader to understand.

In this chapter we discuss and display several graphical tools for
summarizing and presenting data, including the histogram, frequency
polygon, ogive, dot plot, bar chart, pie chart and the scatter plot for
two-variable numerical data.
3.1 Graphical Display of Data:
Ungrouped Versus Grouped Data

Ungrouped data

  have not been summarized in any way

  are also called raw data

Grouped data

  logical groupings of data exist, e.g. age ranges (20-29, 30-39, etc.)

  have been organized into a frequency distribution
3.1 Graphical Display of Data
Example of Ungrouped Data

Ages of a Sample of Managers from Urban Child Care Centers in the United States

42  26  32  34  57  30  58  37  50  30
53  40  30  47  49  50  40  32  31  40
52  28  23  35  25  30  36  32  26  50
55  30  58  64  52  49  33  43  46  32
61  31  30  40  60  74  37  29  43  54
3.1 Graphical Display of Data
Frequency Distribution

Frequency Distribution – summary of data presented in the form of class
intervals and frequencies

Vary in shape and design

Constructed according to the individual researcher's preferences
Frequency Distribution

Steps in Frequency Distribution

Step 1 - Determine the range of the frequency distribution

  Range is the difference between the highest and the lowest numbers

Step 2 - Determine the number of classes

  Don't use too many, or too few, classes

Step 3 - Determine the width of the class interval

  Approximate class width can be calculated by dividing the range
  by the number of classes

  Each value fits into only one class
Frequency Distribution of Child Care Managers' Ages

Class Interval    Frequency
20-under 30            6
30-under 40           18
40-under 50           11
50-under 60           11
60-under 70            3
70-under 80            1
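The three steps above can be sketched in Python. This is a minimal illustration, not part of the original slides: the function name `frequency_distribution` and its parameters are made up for this example, and the class limits (starting at 20, width 10, six classes) are taken from the table above.

```python
# Ages of the 50 child care managers, copied from the ungrouped data example.
ages = [
    42, 26, 32, 34, 57, 30, 58, 37, 50, 30,
    53, 40, 30, 47, 49, 50, 40, 32, 31, 40,
    52, 28, 23, 35, 25, 30, 36, 32, 26, 50,
    55, 30, 58, 64, 52, 49, 33, 43, 46, 32,
    61, 31, 30, 40, 60, 74, 37, 29, 43, 54,
]


def frequency_distribution(values, start, width, num_classes):
    """Count the values in each class [lower, lower + width).

    Hypothetical helper for illustration; each value lands in exactly one class.
    """
    counts = [0] * num_classes
    for v in values:
        index = (v - start) // width  # 0-based class number
        if 0 <= index < num_classes:
            counts[index] += 1
    return counts


counts = frequency_distribution(ages, start=20, width=10, num_classes=6)
for i, count in enumerate(counts):
    lower = 20 + 10 * i
    print(f"{lower}-under {lower + 10}: {count}")
```

Because each class includes its lower endpoint and excludes its upper endpoint, a value such as 30 is counted in "30-under 40" only.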
3.1 Graphical Display of Data
Relative Frequency

The relative frequency is the proportion of the total frequency that is in
any given class interval in a frequency distribution.

Class Interval    Frequency    Relative Frequency
20-under 30            6         .12  (6/50)
30-under 40           18         .36  (18/50)
40-under 50           11         .22
50-under 60           11         .22
60-under 70            3         .06
70-under 80            1         .02
Total                 50        1.00
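As a small sketch of the definition, each relative frequency is a class frequency divided by the total frequency. The variable names below are chosen for this example only.

```python
# Class frequencies for the six age classes (20-under 30 through 70-under 80).
frequencies = [6, 18, 11, 11, 3, 1]

total = sum(frequencies)
relative = [f / total for f in frequencies]

for f, r in zip(frequencies, relative):
    print(f"{f:>2} / {total} = {r:.2f}")
```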
3.1 Graphical Display of Data
Cumulative Frequency

The cumulative frequency is a running total of frequencies through the
classes of a frequency distribution.

Class Interval    Frequency    Cumulative Frequency
20-under 30            6          6
30-under 40           18         24  (18 + 6)
40-under 50           11         35  (11 + 24)
50-under 60           11         46
60-under 70            3         49
70-under 80            1         50
Total                 50
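The running total can be sketched with the standard-library function `itertools.accumulate`. The variable names are illustrative.

```python
from itertools import accumulate

# Class frequencies for the six age classes.
frequencies = [6, 18, 11, 11, 3, 1]

# accumulate yields the running sums 6, 6 + 18, 6 + 18 + 11, ...
cumulative = list(accumulate(frequencies))
print(cumulative)
```

The last cumulative frequency always equals the total number of observations.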
Common Statistical Graphs –
Quantitative Data

Histogram -- vertical bar chart of frequencies

Frequency Polygon -- line graph of frequencies

Ogive -- line graph of cumulative frequencies

Stem and Leaf Plot – Like a histogram, but shows individual data
values. Useful for small data sets.

Pareto Chart -- type of chart which contains both bars and a line
graph.

The bars display the values in descending order, and the line graph shows
the cumulative totals of each category, left to right.

The purpose is to highlight the most important among a (typically large) set
of factors.
3.1 Graphical Display of Data
Histogram

A histogram is a graphical summary of a frequency distribution

The number and location of bins (bars) should be determined based on
the sample size and the range of the data
Data Range

Among the 50 ages of child care managers, the smallest value is 23 and the
largest value is 74.

Range = Largest - Smallest
      = 74 - 23
      = 51
Number of Classes and Class Width


The number of classes should be between 5 and 15.

Fewer than 5 classes cause excessive summarization.

More than 15 classes leave too much detail.

Or use the formula: number of classes = 1 + 3.3 log n, where log is the base-10 logarithm and n is the number of data values
Class Width

Divide the range by the number of classes for an approximate class width

Round up to a convenient number
$$\text{Approx. Class Width} = \frac{\text{Range}}{\text{Number of Classes}} = \frac{51}{6} = 8.5$$

Class Width = 10
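The rule of thumb and the class-width calculation can be checked with a short sketch. The function names are hypothetical, and the logarithm is assumed to be base 10, which is the reading that gives about six classes for n = 50.

```python
import math


def suggested_classes(n):
    """Rule of thumb from the slide: 1 + 3.3 * log10(n) (base 10 assumed)."""
    return 1 + 3.3 * math.log10(n)


def approx_class_width(smallest, largest, num_classes):
    """Range divided by the number of classes."""
    return (largest - smallest) / num_classes


k = suggested_classes(50)               # roughly 6.6 for the 50 ages
width = approx_class_width(23, 74, 6)   # 51 / 6
print(f"suggested classes: {k:.1f}, approximate width: {width}")
```

The approximate width of 8.5 is then rounded up to a convenient number, 10 in this chapter.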
Class Midpoint
The midpoint of each class interval is called the
class midpoint or the class mark.
$$\text{Class Midpoint} = \frac{\text{beginning class endpoint} + \text{ending class endpoint}}{2} = \frac{30 + 40}{2} = 35$$

$$\text{Class Midpoint} = \text{class beginning point} + \frac{1}{2}(\text{class width}) = 30 + \frac{1}{2}(10) = 35$$
Midpoints for Age Classes

Class Interval    Frequency    Midpoint    Relative Frequency    Cumulative Frequency
20-under 30            6          25             .12                      6
30-under 40           18          35             .36                     24
40-under 50           11          45             .22                     35
50-under 60           11          55             .22                     46
60-under 70            3          65             .06                     49
70-under 80            1          75             .02                     50
Total                 50                        1.00
Histogram

[Figure: histogram of the age frequency distribution. Horizontal axis: Years (0 to 80 in steps of 10). Vertical axis: Frequency (0 to 20). The bars for the classes 20-under 30 through 70-under 80 have heights 6, 18, 11, 11, 3 and 1.]
Frequency Polygon

[Figure: frequency polygon of the age frequency distribution. Horizontal axis: Years (0 to 80 in steps of 10). Vertical axis: Frequency (0 to 20). The plotted frequencies for the classes 20-under 30 through 70-under 80 are 6, 18, 11, 11, 3 and 1.]
Ogive

Class Interval    Cumulative Frequency
20-under 30            6
30-under 40           24
40-under 50           35
50-under 60           46
60-under 70           49
70-under 80           50

[Figure: ogive of the age data. Horizontal axis: Years (0 to 80 in steps of 10). Vertical axis: Cumulative Frequency (0 to 60).]
Stem and Leaf plot:
Safety Examination Scores for Plant Trainees

Raw Data
86  77  91  60  55
76  92  47  88  67
23  59  72  75  83
77  68  82  97  89
81  75  74  39  67
79  83  70  78  91
68  49  56  94  81

Stem    Leaf
2       3
3       9
4       79
5       569
6       07788
7       0245567789
8       11233689
9       11247
Construction of Stem and Leaf Plot

Each raw score is split into a stem (the tens digit) and a leaf (the units
digit). For example, 89 has stem 8 and leaf 9, and 94 has stem 9 and leaf 4.

The leaves are written in ascending order beside their stem, which gives the
plot shown above for the 35 safety examination scores.
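The construction can be sketched in Python as follows. The function name `stem_and_leaf` is made up for this example, and it assumes two-digit scores so that the stem is the tens digit and the leaf is the units digit.

```python
from collections import defaultdict

# Safety examination scores for the 35 plant trainees.
scores = [
    86, 77, 91, 60, 55,
    76, 92, 47, 88, 67,
    23, 59, 72, 75, 83,
    77, 68, 82, 97, 89,
    81, 75, 74, 39, 67,
    79, 83, 70, 78, 91,
    68, 49, 56, 94, 81,
]


def stem_and_leaf(values):
    """Map each stem (tens digit) to a string of its sorted leaves (units digits)."""
    leaves_by_stem = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves_by_stem[stem].append(str(leaf))
    return {stem: "".join(leaves) for stem, leaves in leaves_by_stem.items()}


plot = stem_and_leaf(scores)
for stem in sorted(plot):
    print(stem, plot[stem])
```

Sorting the values first means the leaves within each stem come out already in ascending order.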
Common Statistical Graphs –
Qualitative Data

Pie Chart -- proportional representation for categories of a whole

Bar Chart – frequency or relative frequency of one or more categorical
variables
Complaints by Amtrak Passengers

COMPLAINT            NUMBER    PROPORTION    DEGREES
Stations, etc.       28,000       .40         144.0
Train Performance    14,700       .21          75.6
Equipment            10,500       .15          54.0
Personnel             9,800       .14          50.4
Schedules, etc.       7,000       .10          36.0
Total                70,000      1.00         360.0

[Figure: pie chart of complaints by Amtrak passengers.]
Second Quarter U.S. Truck Production

Second Quarter Truck Production in the U.S. (Hypothetical values)

Company    2d Quarter Truck Production
A                357,411
B                354,936
C                160,997
D                 34,099
E                 12,747
Totals           920,190

[Figure: pie chart of second quarter U.S. truck production. Shares: A 39%, B 39%, C 17%, D 4%, E 1%.]
Pie Chart Calculations for Company A

$$\text{Proportion} = \frac{357{,}411}{920{,}190} = .388, \qquad \text{Degrees} = .388 \times 360 \approx 140$$

Company    2d Quarter Truck Production    Proportion    Degrees
A                357,411                    .388          140
B                354,936                    .386          139
C                160,997                    .175           63
D                 34,099                    .037           13
E                 12,747                    .014            5
Totals           920,190                   1.000          360
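The same calculation can be repeated for every company with a short sketch. The dictionary and variable names are illustrative; rounding the proportion to three decimals before multiplying by 360 mirrors the table above.

```python
# Hypothetical second quarter truck production values from the table.
production = {"A": 357_411, "B": 354_936, "C": 160_997, "D": 34_099, "E": 12_747}

total = sum(production.values())
rows = {}
for company, units in production.items():
    proportion = round(units / total, 3)   # share of the whole
    degrees = round(proportion * 360)      # slice angle in degrees
    rows[company] = (proportion, degrees)
    print(f"{company}: {proportion:.3f} -> {degrees} degrees")
```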
3.2 Measures of Central Tendency:
Ungrouped Data

Measures of central tendency yield information about “particular
places or locations in a group of numbers.”

Common Measures of Location

Mode

Median

Mean

Percentiles

Quartiles
Mode

Mode - the most frequently occurring value in a data set

Applicable to all levels of data measurement (nominal, ordinal, interval,
and ratio)

Can be used to determine what categories occur most frequently

Sometimes, no mode exists (no duplicates)

Bimodal – In a tie for the most frequently occurring value, two modes
are listed

Multimodal -- Data sets that contain more than two modes
Median

Median - middle value in an ordered array of numbers.

Half the data are above it, half the data are below it

Mathematically, it's the (n+1)/2 th ordered observation

For an array with an odd number of terms, the median is the middle
number

  n = 11 => (n+1)/2 th = 12/2 th = 6th ordered observation

For an array with an even number of terms, the median is the average
of the middle two numbers

  n = 10 => (n+1)/2 th = 11/2 th = 5.5th = average of 5th and 6th ordered
  observations
Arithmetic Mean

Mean is the average of a group of numbers

Applicable for interval and ratio data

Not applicable for nominal or ordinal data

Affected by each value in the data set, including extreme values

Computed by summing all values in the data set and dividing the sum
by the number of values in the data set
Demonstration Problem 3.1
The number of U.S. cars in service by top car rental companies in a
recent year according to Auto Rental News follows.
Company           Number of Cars in Service
Enterprise              643,000
Hertz                   327,000
National/Alamo          233,000
Avis                    204,000
Dollar/Thrifty          167,000
Budget                  144,000
Advantage                20,000
U-Save                   12,000
Payless                  10,000
ACE                       9,000
Fox                       9,000
Rent-A-Wreck              7,000
Triangle                  6,000
Compute the mode, the median, and the mean.
Demonstration Problem 3.1
Solutions
Mode: 9,000 (two companies with 9,000 cars in service)
Median: With 13 different companies in this group, N = 13. The
median is located at the (13 +1)/2 = 7th position. Because the
data are already ordered, median is the 7th term, which is 20,000.
Mean: μ = ∑x/N = (1,791,000/13) = 137,769.23
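The three answers can be cross-checked with Python's standard `statistics` module. This is only a verification sketch; the list name `cars` is chosen for this example.

```python
import statistics

# Number of cars in service for the 13 rental companies.
cars = [
    643_000, 327_000, 233_000, 204_000, 167_000, 144_000, 20_000,
    12_000, 10_000, 9_000, 9_000, 7_000, 6_000,
]

mode = statistics.mode(cars)      # most frequently occurring value
median = statistics.median(cars)  # middle value of the ordered data
mean = statistics.mean(cars)      # sum of the values divided by their count

print(f"mode = {mode:,}, median = {median:,}, mean = {mean:,.2f}")
```

`statistics.median` sorts the data itself, so the list does not need to be ordered beforehand.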
Which Measure Do I Use?



Which measure of central tendency is most appropriate?

In general, the mean is preferred, since it has nice mathematical properties (in
particular, see chapter 7)

The median and quartiles are resistant to outliers
Consider the following three datasets

1, 2, 3 (median=2, mean=2)

1, 2, 6 (median=2, mean=3)

1, 2, 30 (median=2, mean=11)

All have median=2, but the mean is sensitive to the outliers
In general, if there are outliers, the median is preferred to the mean
Calculation of Grouped Mean
Sometimes data are already grouped, and you are
interested in calculating summary statistics

$$\mu = \frac{\sum fM}{\sum f} = \frac{2150}{50} = 43.0$$

Interval          Frequency (f)    Midpoint (M)    f*M
20-under 30             6               25          150
30-under 40            18               35          630
40-under 50            11               45          495
50-under 60            11               55          605
60-under 70             3               65          195
70-under 80             1               75           75
Total                  50                          2150
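A minimal sketch of the grouped mean follows. The `(lower, upper, frequency)` tuple layout and the function name `grouped_mean` are assumptions made for this example.

```python
# Each class is (lower endpoint, upper endpoint, frequency).
classes = [
    (20, 30, 6),
    (30, 40, 18),
    (40, 50, 11),
    (50, 60, 11),
    (60, 70, 3),
    (70, 80, 1),
]


def grouped_mean(classes):
    """Sum of frequency times class midpoint, divided by the total frequency."""
    total_f = sum(f for _, _, f in classes)
    total_fm = sum(f * (lower + upper) / 2 for lower, upper, f in classes)
    return total_fm / total_f


print(grouped_mean(classes))
```

The grouped mean treats every observation in a class as if it sat at the class midpoint, so it generally differs a little from the mean of the raw data.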
Median of Grouped Data - Example
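Class Interval    Frequency    Cumulative Frequency
20-under 30            6            6
30-under 40           18           24
40-under 50           11           35
50-under 60           11           46
60-under 70            3           49
70-under 80            1           50
                  N = 50

$$Md = L + \frac{\frac{N}{2} - cf_p}{f_{med}}(W) = 40 + \frac{\frac{50}{2} - 24}{11}(10) = 40.909$$

Here L is the lower limit of the median class (40-under 50), cf_p is the
cumulative frequency of the classes preceding the median class, f_med is the
frequency of the median class, W is the class width and N is the total
number of observations.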
Mode of Grouped Data

Midpoint of the modal class

Modal class has the greatest frequency
Class Interval    Frequency
20-under 30            6
30-under 40           18
40-under 50           11
50-under 60           11
60-under 70            3
70-under 80            1

$$\text{Mode} = \frac{30 + 40}{2} = 35$$
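The grouped median and grouped mode calculations can be sketched together. The function names and the `(lower, upper, frequency)` tuple layout are assumptions for this example; the median follows the interpolation formula used in the grouped median example, and the mode is the midpoint of the class with the greatest frequency.

```python
# Each class is (lower endpoint, upper endpoint, frequency).
classes = [
    (20, 30, 6),
    (30, 40, 18),
    (40, 50, 11),
    (50, 60, 11),
    (60, 70, 3),
    (70, 80, 1),
]


def grouped_median(classes):
    """Interpolate inside the class that contains the N/2-th observation."""
    n = sum(f for _, _, f in classes)
    if n == 0:
        raise ValueError("classes must contain at least one observation")
    half = n / 2
    cumulative = 0  # cumulative frequency of the classes before the current one
    for lower, upper, f in classes:
        if f > 0 and cumulative + f >= half:
            return lower + (half - cumulative) / f * (upper - lower)
        cumulative += f


def grouped_mode(classes):
    """Midpoint of the modal class (the class with the greatest frequency)."""
    lower, upper, _ = max(classes, key=lambda c: c[2])
    return (lower + upper) / 2


print(round(grouped_median(classes), 3), grouped_mode(classes))
```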
3.3 Measures of Dispersion :
Range

The difference between the largest and the smallest values in a set of data

Advantage – easy to compute

Disadvantage – is affected by extreme values
3.3 Measures of Dispersion :
Sample Variance

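Sample Variance - average of the squared deviations from the arithmetic
mean, computed with n - 1 rather than n in the denominator

Sample Variance - denoted by s²

$$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$$

X          X - X̄      (X - X̄)²
2,398        625       390,625
1,844         71         5,041
1,539       -234        54,756
1,311       -462       213,444
Sum                    663,866

$$\bar{X} = 1{,}773, \qquad s^2 = \frac{663{,}866}{3} = 221{,}289$$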
3.3 Measures of Dispersion :
Sample Standard Deviation

Sample standard deviation is the square root of the sample variance

Same units as original data

$$s = \sqrt{s^2} = \sqrt{221{,}289} = 470.4$$
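The variance and standard deviation calculations can be reproduced step by step in Python. This is a sketch with illustrative variable names; the result is compared with `statistics.variance`, which also divides by n - 1.

```python
import math
import statistics

# The four sample values from the variance example.
x = [2398, 1844, 1539, 1311]

mean = sum(x) / len(x)
squared_deviations = [(value - mean) ** 2 for value in x]

# Sample variance: sum of squared deviations divided by n - 1.
variance = sum(squared_deviations) / (len(x) - 1)
std_dev = math.sqrt(variance)

print(f"mean = {mean}, s^2 = {variance:.0f}, s = {std_dev:.1f}")
print("statistics.variance agrees:", math.isclose(variance, statistics.variance(x)))
```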