Describing Data: class 2

DESCRIBING DATA: 2
Numerical summaries of data using measures of central tendency and dispersion
Central tendency--Mode
Table 1. Undergraduate Majors
Major               Frequency (f)
Anthropology        97
Economics           104
Geography           57
Political Science   110
Sociology           82
Bimodal Distributions
Table 1. Undergraduate Majors
Major               Frequency (f)
Anthropology        97
Economics           110
Geography           57
Political Science   110
Sociology           82
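As a rough illustration (not part of the original slides), the mode of a frequency table can be found by picking every category that shares the highest frequency. The sketch below uses the counts from the two versions of Table 1; the helper name `modal_categories` is made up for this example.

```python
# Frequencies copied from the two versions of Table 1.
table_1 = {"Anthropology": 97, "Economics": 104, "Geography": 57,
           "Political Science": 110, "Sociology": 82}
# Bimodal version: Economics is raised to 110, tying Political Science.
table_1_bimodal = {**table_1, "Economics": 110}

def modal_categories(freq):
    """Return every category whose frequency equals the maximum (hypothetical helper)."""
    top = max(freq.values())
    return [category for category, f in freq.items() if f == top]

print(modal_categories(table_1))          # ['Political Science']
print(modal_categories(table_1_bimodal))  # ['Economics', 'Political Science']
```

Returning a list rather than a single value keeps the bimodal case visible instead of silently choosing one of the tied categories.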
Mode for Grouped Frequency Distributions based on Interval Data
Mean daily temp. (degrees)   Place A (f)   Place B (f)
10-19.9                      5             0
20-29.9                      5             5
30-39.9                      20            10
40-49.9                      30            15
50-59.9                      20            30
60-69.9                      20            40
Midpoint of the modal class interval
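A minimal sketch of this rule, assuming each class runs from its lower limit up to the next class's lower limit (so 10-19.9 has midpoint 15, matching the midpoints used later for the grouped mean). The function name `grouped_mode` is illustrative only.

```python
# Class limits (lower, upper) and frequencies from the temperature table.
intervals = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70)]
place_a = [5, 5, 20, 30, 20, 20]
place_b = [0, 5, 10, 15, 30, 40]

def grouped_mode(intervals, freqs):
    """Midpoint of the class with the highest frequency.

    If several classes tie, this sketch returns the first one only.
    """
    modal_index = max(range(len(freqs)), key=lambda i: freqs[i])
    lower, upper = intervals[modal_index]
    return (lower + upper) / 2

print(grouped_mode(intervals, place_a))  # 45.0 (modal class 40-49.9)
print(grouped_mode(intervals, place_b))  # 65.0 (modal class 60-69.9)
```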
Median
• The point in the distribution above which and below which exactly half the observations lie (50th percentile)
• Calculation depends on whether the number of observations is odd or even.
Distribution 1 (n=5): 198, 179, 172, 167, 154
Median = 172 (the middle score)
Distribution 2 (n=6): 197, 193, 189, 187, 183, 179
Median = (189 + 187) / 2 = 188 (the mean of the two middle scores)
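The odd/even rule can be sketched in a few lines. This is only an illustration using the two distributions from the slide; the function name `median_of` is an assumption, not something from the course.

```python
def median_of(values):
    """Median of ungrouped scores: middle score if n is odd,
    mean of the two middle scores if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

distribution_1 = [198, 179, 172, 167, 154]       # n = 5 (odd)
distribution_2 = [197, 193, 189, 187, 183, 179]  # n = 6 (even)

print(median_of(distribution_1))  # 172
print(median_of(distribution_2))  # 188.0
```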
MEDIAN for grouped frequency distributions based on interval data
Mean daily temp. (degrees)   f     Cumulative f
10-19.9                      5     5
20-29.9                      5     10
30-39.9                      20    30
40-49.9                      30    60
50-59.9                      20    80
60-69.9                      20    100
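The median (the 50th of 100 observations) falls in the 40-49.9 class: 30 cases lie below that class, so 20 of its 30 cases are needed, spread across a class width of 10.
Median = 40 + ((20/30) * 10) = 40 + 6.67 = 46.67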
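The interpolation above can be written as a small function. This is a sketch under the assumption that all classes share the same width and that lower limits are 10, 20, ..., 60; the name `grouped_median` is hypothetical.

```python
def grouped_median(lower_limits, freqs, width):
    """Interpolated median for a grouped frequency distribution.

    Finds the class containing the (n/2)th case, then moves into that
    class in proportion to how many of its cases are still needed.
    """
    n = sum(freqs)
    if n == 0:
        raise ValueError("no observations")
    half = n / 2
    cumulative = 0  # cases below the current class
    for lower, f in zip(lower_limits, freqs):
        if cumulative + f >= half:
            return lower + (half - cumulative) / f * width
        cumulative += f

lower_limits = [10, 20, 30, 40, 50, 60]
freqs = [5, 5, 20, 30, 20, 20]

print(round(grouped_median(lower_limits, freqs, 10), 2))  # 46.67
```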
ARITHMETIC MEAN
$\bar{Y} = \sum Y_i / n$
$\bar{Y} = (1 + 1 + 3 + 3 + 6 + 7 + 7) / 7 = 28 / 7 = 4$
Mean for Grouped Data
Mean daily temp. (degrees)   f     Midpoint of interval   f times midpoint
10-19.9                      5     15                     75
20-29.9                      5     25                     125
30-39.9                      20    35                     700
40-49.9                      30    45                     1350
50-59.9                      20    55                     1100
60-69.9                      20    65                     1300
Totals                       100                          4650
Mean = sum of weighted midpoints / n = 4650/100=46.5
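The same calculation as a short sketch, using the frequencies and midpoints from the table; the variable names are chosen for this example only.

```python
# Frequencies and class midpoints from the grouped temperature table.
freqs = [5, 5, 20, 30, 20, 20]
midpoints = [15, 25, 35, 45, 55, 65]

# Weight each midpoint by the number of cases in its class.
weighted = [f * m for f, m in zip(freqs, midpoints)]
grouped_mean = sum(weighted) / sum(freqs)

print(weighted)       # [75, 125, 700, 1350, 1100, 1300]
print(grouped_mean)   # 46.5
```

Because every case is treated as if it sat at its class midpoint, the result approximates the mean of the raw data rather than reproducing it exactly.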
Mean is the balancing point of the distribution
[Figure: dot plot of seven scores (X) on an axis from 0 to 9, with the mean marked as the balancing point]
Key Properties of the Mean
• Sum of the differences between the individual scores and the mean equals 0:
$\sum (Y - \bar{Y}) = 0$
• Sum of the squared differences between the individual scores and the mean equals a minimum value:
$\sum (Y - \bar{Y})^2 = \text{the minimum value}$
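Both properties can be checked numerically. The sketch below reuses the seven scores from the arithmetic mean example (1, 1, 3, 3, 6, 7, 7); the alternative centers tried in the last step are arbitrary values picked for illustration.

```python
scores = [1, 1, 3, 3, 6, 7, 7]
mean = sum(scores) / len(scores)  # 4.0

# Property 1: deviations from the mean sum to zero.
sum_of_deviations = sum(y - mean for y in scores)

def sum_of_squares(center):
    """Sum of squared differences between each score and `center`."""
    return sum((y - center) ** 2 for y in scores)

print(sum_of_deviations)     # 0.0
print(sum_of_squares(mean))  # 42.0

# Property 2: any other center gives a larger sum of squares.
for other in (3, 3.5, 4.5, 5):
    print(other, sum_of_squares(other))
```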

Weaknesses of each measure of central tendency
• MODE: ignores all information about the values except which one is most frequent
• MEDIAN: ignores the LOCATION of scores above or below the midpoint
• MEAN: is the most sensitive to extreme values
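A quick way to see the last two points: change one score to an extreme value and compare how the mean and median respond. The first list is the set of scores from the arithmetic mean example; the value 70 is a made-up extreme score for illustration.

```python
from statistics import mean, median

scores = [1, 1, 3, 3, 6, 7, 7]
with_extreme = [1, 1, 3, 3, 6, 7, 70]  # hypothetical: top score changed to 70

print(mean(scores), median(scores))              # mean 4, median 3
print(mean(with_extreme), median(with_extreme))  # mean 13, median 3
```

The median does not move because it depends only on the middle position, while the mean more than triples.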
Impacts of skewed distributions
[Figure: a skewed distribution with the positions of the mean, mode, and median marked]
Measures of Dispersion
Poverty Households (%) in 2 suburbs by tract
Suburb A (less dispersion)   Suburb B (more dispersion)
24                           28
23                           25
22                           22
21                           19
20                           16
Mean=22                      Mean=22
Range
• Highest value minus the lowest value
• problem: ignores all the other values
between the two extreme values
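For example, applying the range to the two suburbs from the poverty table (a small sketch; the function name is chosen for this example):

```python
suburb_a = [24, 23, 22, 21, 20]
suburb_b = [28, 25, 22, 19, 16]

def value_range(values):
    """Highest value minus lowest value."""
    return max(values) - min(values)

print(value_range(suburb_a))  # 4
print(value_range(suburb_b))  # 12
```

Both suburbs have a mean of 22, but the range separates them. Only the two extreme tracts enter the calculation, which is the weakness noted above.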
Interquartile range
• Based on the quartiles (25th percentile
and 75th percentile of a distribution)
• Interquartile range = Q3-Q1
• Semi-interquartile range = (Q3-Q1)/2
• eliminates the effect of extreme scores by
excluding them
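A sketch of both quantities for the two suburbs in the poverty table. Quartile conventions differ between textbooks and software; this example assumes the `inclusive` method of Python's `statistics.quantiles`, which may not match the convention used in the course.

```python
from statistics import quantiles

suburb_a = [24, 23, 22, 21, 20]
suburb_b = [28, 25, 22, 19, 16]

def iqr(values):
    """Interquartile range, Q3 - Q1 (inclusive quartile method assumed)."""
    q1, _q2, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

print(iqr(suburb_a), iqr(suburb_a) / 2)  # 2.0 1.0
print(iqr(suburb_b), iqr(suburb_b) / 2)  # 6.0 3.0
```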
Graphic representation: Box Plot
[Figure: box plots of infant mortality rate by region: Africa (N=52), Asia (N=44), Latin America (N=37)]
Variance
• A measure of dispersion based on the second property of the mean we discussed earlier:
$\sum (Y - \bar{Y})^2 = \text{minimum}$
Step 1: Calculate the total sum of squares around the mean
Y     $Y - \bar{Y}$   $(Y - \bar{Y})^2$
10    -5              25
12    -3              9
14    -1              1
15    0               0
16    +1              1
18    +3              9
20    +5              25
Mean = 105/7 = 15     Sum = 70
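Step 1 can be reproduced in a few lines, using the seven Y values from the table (variable names are illustrative):

```python
y = [10, 12, 14, 15, 16, 18, 20]
mean = sum(y) / len(y)                     # 105 / 7 = 15.0

deviations = [v - mean for v in y]         # Y minus the mean
squared = [d ** 2 for d in deviations]     # squared deviations
total_sum_of_squares = sum(squared)

print(deviations)            # [-5.0, -3.0, -1.0, 0.0, 1.0, 3.0, 5.0]
print(total_sum_of_squares)  # 70.0
```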
Step 2: Take an average of this total variation
$s^2 = \sum (Y - \bar{Y})^2 / (n - 1)$
Why n-1 rather than simply n?
The normal procedure involves estimating variance for a population using data from a sample.
Samples, especially small samples, are less likely to include the extreme scores in the population, so the variation in a sample tends to underestimate the variation in the population.
Dividing by n-1 instead of n compensates for this underestimate.
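To see the size of the adjustment, the sketch below divides the sum of squares from Step 1 (70, with n = 7) by n-1 and by n, and compares the results with Python's `statistics.variance` (which divides by n-1) and `statistics.pvariance` (which divides by n).

```python
from statistics import variance, pvariance

y = [10, 12, 14, 15, 16, 18, 20]
n = len(y)
mean = sum(y) / n
sum_of_squares = sum((v - mean) ** 2 for v in y)  # 70.0

sample_variance = sum_of_squares / (n - 1)   # divide by n-1
divide_by_n = sum_of_squares / n             # divide by n

print(round(sample_variance, 2))  # 11.67
print(divide_by_n)                # 10.0
print(variance(y), pvariance(y))  # library equivalents
```

Dividing by n-1 always gives the larger value, and the gap shrinks as n grows.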
Step 3: Take the square root of variance
$s = \sqrt{\sum (Y - \bar{Y})^2 / (n - 1)}$
Purpose: expresses dispersion in the
original units of measurement--not units
of measurement squared
Like variance: the larger the value the
greater the variability
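Continuing with the same seven Y values, the square root of the variance (the standard deviation) can be computed as follows; `statistics.stdev` is shown only as a cross-check.

```python
import math
from statistics import stdev

y = [10, 12, 14, 15, 16, 18, 20]
n = len(y)
mean = sum(y) / n
sample_variance = sum((v - mean) ** 2 for v in y) / (n - 1)  # 70 / 6

s = math.sqrt(sample_variance)  # back in the original units of Y

print(round(s, 2))         # 3.42
print(round(stdev(y), 2))  # 3.42
```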
Coefficient of Variation (V)
V = (standard deviation / mean)
Value: To allow you to make comparisons of
dispersion across groups with very different mean
values or across variables with very different
measurement scales.
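A small sketch of why V helps with comparisons. The first list is the Y data from the variance example; the second is the same data multiplied by 10, a hypothetical rescaling added here only to mimic a variable on a different measurement scale.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """V = standard deviation / mean (sample standard deviation, n-1)."""
    return stdev(values) / mean(values)

y = [10, 12, 14, 15, 16, 18, 20]
y_rescaled = [v * 10 for v in y]  # hypothetical: same data in units 10x smaller

print(round(stdev(y), 2), round(stdev(y_rescaled), 2))  # 3.42 34.16
print(round(coefficient_of_variation(y), 3))            # 0.228
print(round(coefficient_of_variation(y_rescaled), 3))   # 0.228
```

The standard deviations differ by a factor of 10, yet V is the same for both, since it expresses dispersion relative to the mean.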