Numeric Summaries and Descriptive Statistics


populations vs. samples
• we want to describe both samples and
populations
• the latter is a matter of inference…
“outliers”
• minority cases, so different from the majority
that they merit separate consideration
– are they errors?
– are they indicative of a different pattern?
• think about possible outliers with care, but
beware of mechanical treatments…
• significance of outliers depends on your
research interests
summaries of distributions
• graphic vs. numeric
– graphic may be better for visualization
– numeric are better for statistical/inferential
purposes
• resistance to outliers is usually an advantage
in either case
general characteristics
• kurtosis [“peakedness”]
– ‘leptokurtic’
– ‘platykurtic’
• skew (skewness)
– right (positive) skew
– left (negative) skew
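Skew and kurtosis can also be put into numbers. The sketch below uses moment-based definitions, applied to the unit 1 rim diameters tabulated later in these slides. Base R has no built-in skewness or kurtosis function, so `central_moment`, `skewness`, and `excess_kurtosis` are illustrative helper names, not part of the slides or of R.

```r
# Unit 1 rim diameters (cm), from the table later in these slides.
unit1 <- c(12.6, 11.6, 16.3, 13.1, 12.1, 26.9, 9.7, 11.5, 14.8, 13.5, 12.4, 13.6)

# Central moment of order k (divides by n, not n - 1).
central_moment <- function(x, k) mean((x - mean(x))^k)

# Moment-based skewness: > 0 means a longer right tail, < 0 a longer left tail.
skewness <- function(x) central_moment(x, 3) / central_moment(x, 2)^1.5

# Excess kurtosis: > 0 is leptokurtic, < 0 is platykurtic (0 for a normal curve).
excess_kurtosis <- function(x) central_moment(x, 4) / central_moment(x, 2)^2 - 3

skewness(unit1)         # positive: the 26.9 cm rim stretches the right tail
excess_kurtosis(unit1)  # positive: the single extreme value makes a heavy tail
```

Other definitions (for example with small-sample corrections) give somewhat different values, so the signs are more informative here than the exact numbers.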
central tendency
• measures of central tendency
– provide a sense of the value expressed by
multiple cases, overall…
• mean
• median
• mode
mean
• center of gravity
• evenly partitions the sum of all
measurements among all cases; average of
all measures
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
mean – pro and con
• crucial for inferential statistics
• mean is not very resistant to outliers
• a “trimmed mean” may be better for
descriptive purposes
mean
rim diameter (cm)
         unit 1  unit 2
           12.6    16.2
           11.6    16.4
           16.3    13.8
           13.1    13.2
           12.1    11.3
           26.9    14.0
            9.7     9.0
           11.5    12.5
           14.8    15.6
           13.5    11.2
           12.4    12.2
           13.6    15.5
                   11.7
n            12      13
total     168.1   172.6
total/n    14.0    13.3
R: mean(x)
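back-to-back stem-and-leaf plot (stem = cm, leaf = tenths of a cm)
unit 1 | stem | unit 2
     9 |  26  |
       |  25  |
       |  24  |
       |  23  |
       |  22  |
       |  21  |
       |  20  |
       |  19  |
       |  18  |
       |  17  |
     3 |  16  | 24
       |  15  | 56
     8 |  14  | 0
   651 |  13  | 28
   641 |  12  | 25
    65 |  11  | 237
       |  10  |
     7 |   9  | 0
mean: 14.0 (unit 1), 13.3 (unit 2)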
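As a quick check of the totals in the slide above, the rim diameters can be entered in R and averaged both by hand and with `mean()`. The vector names `unit1` and `unit2` are chosen here for illustration.

```r
# Rim diameters (cm) from the slide's table.
unit1 <- c(12.6, 11.6, 16.3, 13.1, 12.1, 26.9, 9.7, 11.5, 14.8, 13.5, 12.4, 13.6)
unit2 <- c(16.2, 16.4, 13.8, 13.2, 11.3, 14.0, 9.0, 12.5, 15.6, 11.2, 12.2, 15.5, 11.7)

# The mean is the total divided by the number of cases ...
sum(unit1) / length(unit1)
sum(unit2) / length(unit2)

# ... which is what mean() computes.
mean(unit1)   # about 14.0
mean(unit2)   # about 13.3
```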
trimmed mean
rim diameter (cm), sorted; values in parentheses are trimmed
         unit 1  unit 2
          (9.7)   (9.0)
           11.5    11.2
           11.6    11.3
           12.1    11.7
           12.4    12.2
           12.6    12.5
           13.1    13.2
           13.5    13.8
           13.6    14.0
           14.8    15.5
           16.3    15.6
         (26.9)    16.2
                 (16.4)
n            10      11
total     131.5   147.2
total/n    13.2    13.4
R: mean(x, trim=.1)
[back-to-back stem-and-leaf plot of unit 1 and unit 2; trimmed means marked at 13.2 and 13.4]
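A minimal sketch of the trimmed mean for the same data, using the `trim` argument cited in the slide. The by-hand version at the end is only there to show what the trimming does; `s` is an illustrative name.

```r
# Rim diameters (cm) from the slide's table.
unit1 <- c(12.6, 11.6, 16.3, 13.1, 12.1, 26.9, 9.7, 11.5, 14.8, 13.5, 12.4, 13.6)
unit2 <- c(16.2, 16.4, 13.8, 13.2, 11.3, 14.0, 9.0, 12.5, 15.6, 11.2, 12.2, 15.5, 11.7)

# trim = 0.1 drops floor(n * 0.1) cases from each end before averaging:
# one case from each end for both units here (n = 12 and n = 13).
mean(unit1, trim = 0.1)   # 13.15, shown as 13.2 in the slide
mean(unit2, trim = 0.1)   # about 13.4

# The same result by hand for unit 1: sort, drop the lowest and highest, average.
s <- sort(unit1)
mean(s[2:(length(s) - 1)])
```

Trimming pulls the unit 1 mean down from about 14.0 to about 13.2, while unit 2, which has no extreme value, barely moves.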
median
• 50th percentile…
• less useful for inferential purposes
• more resistant to effects of outliers…
median
rim diameter (cm), sorted
         unit 1  unit 2
            9.7     9.0
           11.5    11.2
           11.6    11.3
           12.1    11.7
           12.4    12.2
           12.6    12.5
           13.1    13.2
           13.5    13.8
           13.6    14.0
           14.8    15.5
           16.3    15.6
           26.9    16.2
                   16.4
median    12.85   13.20
[back-to-back stem-and-leaf plot of unit 1 and unit 2; medians marked at 12.85 and 13.20]
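The medians above can be reproduced with `median()`. The second half of this sketch illustrates resistance to outliers by swapping the 26.9 cm rim for a hypothetical, more typical value of 14.9 cm; that replacement value is an assumption made only for the comparison.

```r
# Rim diameters (cm) from the slide's table.
unit1 <- c(12.6, 11.6, 16.3, 13.1, 12.1, 26.9, 9.7, 11.5, 14.8, 13.5, 12.4, 13.6)
unit2 <- c(16.2, 16.4, 13.8, 13.2, 11.3, 14.0, 9.0, 12.5, 15.6, 11.2, 12.2, 15.5, 11.7)

median(unit1)  # 12.85: n is even, so the two middle values (12.6, 13.1) are averaged
median(unit2)  # 13.2: n is odd, so the middle (7th) sorted value is taken

# Hypothetical comparison: replace the outlier and see which summary moves.
unit1_alt <- replace(unit1, unit1 == 26.9, 14.9)
mean(unit1) - mean(unit1_alt)      # the mean shifts by about 1 cm
median(unit1) - median(unit1_alt)  # the median does not move
```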
mode
• the most numerous category
• for ratio data, often implies that data have
been grouped in some way
• can be more or less created by the grouping
procedure
• for theoretical distributions, simply the
location of the peak of the frequency
distribution
site categories:
– regional centers
– villages
– hamlets
– isolated scatters
modal class = ‘hamlets’
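To illustrate how a mode can be "more or less created by the grouping procedure", the sketch below groups the unit 2 rim diameters (from the earlier slides) into 2 cm classes in two different ways. The class width and starting points are arbitrary choices made for this example.

```r
# Unit 2 rim diameters (cm) from the earlier slides.
unit2 <- c(16.2, 16.4, 13.8, 13.2, 11.3, 14.0, 9.0, 12.5, 15.6, 11.2, 12.2, 15.5, 11.7)

# Group the measurements into 2 cm classes; right = FALSE makes each class [a, b).
bins_from_9 <- cut(unit2, breaks = seq(9, 17, by = 2), right = FALSE)
bins_from_8 <- cut(unit2, breaks = seq(8, 18, by = 2), right = FALSE)

# The modal class is the most numerous category.
modal_from_9 <- names(which.max(table(bins_from_9)))
modal_from_8 <- names(which.max(table(bins_from_8)))

modal_from_9   # "[11,13)"
modal_from_8   # "[12,14)"
```

Shifting the class boundaries by 1 cm moves the modal class, even though the measurements themselves are unchanged.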
dispersion
• measures of dispersion
– summarize degree of clustering of cases, esp.
with respect to central tendency…
• range
• variance
• standard deviation
range
rim diameter (cm), sorted
         unit 1  unit 2
            9.7     9.0
           11.5    11.2
           11.6    11.3
           12.1    11.7
           12.4    12.2
           12.6    12.5
           13.1    13.2
           13.5    13.8
           13.6    14.0
           14.8    15.5
           16.3    15.6
           26.9    16.2
                   16.4
R: range(x)
[back-to-back stem-and-leaf plot of unit 1 and unit 2; range marked from minimum to maximum: 9.7 to 26.9 and 9.0 to 16.4]
• would be better to use midspread…
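A short sketch comparing the range with the midspread for the two units. It assumes the midspread is taken as the interquartile range as computed by R's `IQR()` with its default quantile method; other quartile conventions give slightly different values.

```r
# Rim diameters (cm) from the slide's table.
unit1 <- c(12.6, 11.6, 16.3, 13.1, 12.1, 26.9, 9.7, 11.5, 14.8, 13.5, 12.4, 13.6)
unit2 <- c(16.2, 16.4, 13.8, 13.2, 11.3, 14.0, 9.0, 12.5, 15.6, 11.2, 12.2, 15.5, 11.7)

# range() returns the minimum and maximum; diff() turns them into a width.
range(unit1)        # 9.7 26.9
diff(range(unit1))  # about 17.2, driven largely by the single 26.9 cm rim
range(unit2)        # 9.0 16.4
diff(range(unit2))  # about 7.4

# Midspread: the distance between the 25th and 75th percentiles.
IQR(unit1)
IQR(unit2)
```

Under these assumptions unit 1 has the wider range but the narrower midspread, which is the kind of reversal an outlier can produce.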
variance
• analogous to average deviation of cases from
mean
• in fact, based on sum of squared deviations from
the mean (“sum-of-squares”)
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
R: var(x)
variance
• computational form:
$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{n - 1}$$
• note: units of variance are squared…
• this makes variance hard to interpret
• ex.: projectile point sample:
mean = 22.6 mm
variance = 38 mm²
• what does this mean???
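The computational form needs only the sum of the values and the sum of their squares. A minimal check against `var()`, again using the unit 1 rim diameters from the earlier slides (`s2_computational` is an illustrative name):

```r
# Unit 1 rim diameters (cm) from the earlier slides.
unit1 <- c(12.6, 11.6, 16.3, 13.1, 12.1, 26.9, 9.7, 11.5, 14.8, 13.5, 12.4, 13.6)

n <- length(unit1)

# Sum of squared values minus (sum of values)^2 / n, all over n - 1.
s2_computational <- (sum(unit1^2) - sum(unit1)^2 / n) / (n - 1)

s2_computational
var(unit1)   # agrees with the computational form
```

The result is in squared centimetres, which is the interpretive difficulty the slide points to.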
standard deviation
• square root of variance:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
$$s = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{n - 1}}$$
standard deviation
• units are in same units as base measurements
• ex.: projectile point sample:
mean = 22.6 mm
standard deviation = 6.2 mm
• mean ± sd (16.4 to 28.8 mm)
– should give at least some intuitive sense of where most
of the cases lie, barring major effects of outliers
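A sketch of the same idea in R. The first part redoes the projectile point arithmetic from the slide; the second applies `sd()` to the unit 2 rim diameters from the earlier slides. The names `lo`, `hi`, and `share_within` are illustrative.

```r
# Projectile point example from the slide: mean 22.6 mm, sd 6.2 mm.
22.6 + c(-1, 1) * 6.2   # 16.4 28.8
sqrt(38)                # about 6.2, the square root of the variance

# Unit 2 rim diameters (cm) from the earlier slides.
unit2 <- c(16.2, 16.4, 13.8, 13.2, 11.3, 14.0, 9.0, 12.5, 15.6, 11.2, 12.2, 15.5, 11.7)

sd(unit2)          # in cm, the same units as the measurements
sqrt(var(unit2))   # identical: sd is the square root of the variance

# Interval mean - sd to mean + sd, and the share of cases inside it.
lo <- mean(unit2) - sd(unit2)
hi <- mean(unit2) + sd(unit2)
share_within <- mean(unit2 >= lo & unit2 <= hi)
share_within
```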