RM_Descriptive_stats_II


VARIABILITY
Review: Distribution
An arrangement of cases according to their
score or value on one or more variables
• Categorical variable
• Continuous variable
Case no.  Age  Height  M/F
1         23   68      M
2         22   64      F
3         23   69      F
4         25   71      M
5         27   64      F
6         22   72      M
7         24   65      F
8         23   66      M
9         23   66      F
10        25   68      F
11        21   68      M
12        21   62      F
13        24   71      M
14        27   66      F
15        21   62      F
16        25   56      F
17        22   71      M
18        22   70      M
19        25   66      F
20        26   60      F
21        21   52      F
22        31   70      F
23        24   71      M
24        31   61      F
25        23   72      M
26        27   71      F
27        25   71      M
28        26   64      F
29        22   66      F
30        29   69      M
31        24   67      F

Summary statistics
Age: mean = 24
Height: mean = 67
M/F: %M 39, %F 61
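As a minimal sketch of how the summary statistics above are obtained, the Python below recomputes the mean age and the male/female percentages from the 31 cases. The variable names are illustrative assumptions, and the sketch assumes the values are listed in case order.

```python
from statistics import mean

# Ages and sexes for the 31 cases, copied in case order from the table.
ages = [23, 22, 23, 25, 27, 22, 24, 23, 23, 25, 21, 21, 24, 27, 21, 25,
        22, 22, 25, 26, 21, 31, 24, 31, 23, 27, 25, 26, 22, 29, 24]
sexes = list("MFFMFMFMFFMFMFFFMMFFFFMFMFMFFMF")

# Continuous variable: summarize with the mean.
mean_age = mean(ages)

# Categorical variable: summarize with the percentage in each category.
pct_male = 100 * sexes.count("M") / len(sexes)
pct_female = 100 * sexes.count("F") / len(sexes)

print(f"mean age = {mean_age:.0f}")
print(f"%M {pct_male:.0f}  %F {pct_female:.0f}")
```

A continuous variable such as age can be averaged, while a categorical variable such as M/F can only be counted and expressed as percentages.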
Dispersion and the mean
• Dispersion: How scores or values arrange themselves around the mean
• If most scores cluster about the mean the shape of the distribution is peaked
– This is the so-called “normal” distribution
– In social science the scores or values for many variables are normally or near-normally distributed
– This allows use of the mean to describe the dataset (that’s why it’s called a “summary statistic”)
• When scores are more dispersed a distribution’s shape is flatter
– Distance between most scores and the mean is greater
– Many scores are at a considerable distance from the mean
– The mean loses value as a summary statistic
[Figure: Normal distribution of arrests. Mean = 3.0, a good descriptor]
[Figure: “Flat” distribution of arrests. Mean = 3.65, a poor descriptor]
Normal distributions
• Characteristics:
– Unimodal and symmetrical: shapes on both sides of the mean are identical
– 68.26 percent of the area “under” the curve – meaning 68.26 percent of the cases – falls within one “standard deviation” (+/- 1 SD) from the mean
– NOTE: The fact that a distribution is “normal” or “near-normal” does NOT imply that the mean is of any particular value. All it implies is that scores distribute themselves around the mean “normally”.
• Means depend on the data. In this distribution the mean could be any value.
• By definition, the standard score (z-score) that corresponds with the mean of a normal distribution - whatever that mean might be - is zero.
[Figure: Normal curve. The mean (whatever it is) sits at the center, where the standard score is always 0]
Measuring dispersion
• Average deviation: $\frac{\sum |x - \bar{x}|}{n}$
– Average distance between the mean and the values (scores) for each case
– Uses absolute distances (no + or -)
– Affected by extreme scores
• Variance ($s^2$): A sample’s cumulative dispersion
$s^2 = \frac{\sum (x - \bar{x})^2}{n}$ (use $n - 1$ for small samples)
• Standard deviation ($s$): A standardized form of variance, comparable between samples
$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$ (use $n - 1$ for small samples)
– Square root of the variance
– Expresses dispersion in units of equal size for that particular distribution
– Less affected by extreme scores
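The three measures can be sketched in a few lines of Python. This is only an illustration: the function names and the five scores are hypothetical, not taken from the slides, and the `small_sample` flag is an assumed way of switching between the n and n-1 denominators.

```python
from math import sqrt

def average_deviation(scores):
    """Mean absolute distance between each score and the mean."""
    m = sum(scores) / len(scores)
    return sum(abs(x - m) for x in scores) / len(scores)

def variance(scores, small_sample=True):
    """Sum of squared differences divided by n-1 (small samples) or n."""
    n = len(scores)
    m = sum(scores) / n
    sum_of_squares = sum((x - m) ** 2 for x in scores)
    return sum_of_squares / (n - 1 if small_sample else n)

def standard_deviation(scores, small_sample=True):
    """Square root of the variance."""
    return sqrt(variance(scores, small_sample))

# Hypothetical scores, chosen only so the arithmetic is easy to follow.
scores = [2, 4, 4, 6, 9]

print(average_deviation(scores))             # uses absolute distances
print(variance(scores))                      # n-1 denominator
print(variance(scores, small_sample=False))  # n denominator
print(standard_deviation(scores))
```

With these scores the mean is 5, so the absolute distances are 3, 1, 1, 1 and 4, and the squared distances are 9, 1, 1, 1 and 16.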
How well do means represent
(summarize) a sample?
If the variable “no. of tickets” were “normally” distributed, most cases would fall inside the bell-shaped curve. Here they don’t.
[Figure: Frequency by number of tickets, officers A-M plotted against a bell-shaped curve. Marked values: 2.13 (-1 SD), 4.46 (mean), 6.79 (+1 SD)]
13 officers scored on numbers
of tickets written in one week
Officer A: 1 ticket
Officers B & C: 2 tickets each
Officers D & E: 3 tickets each
Officers F & G: 4 tickets each
Officers H & I: 5 tickets each
Officer J: 6 tickets
Officers K & L: 7 tickets each
Officer M: 9 tickets
Mean = 4.46
SD = 2.33
In a normal distribution about 68% of
cases fall within 1 SD of the mean.
.68 X 13 cases = 8.84, or about 9 cases
But here only 7 cases (Officers D-J) fall
within 1 SD of the mean. Six officers
wrote very few or very many tickets,
making the distribution considerably
more dispersed than “normal.”
So…for this sample, the mean does NOT
seem to be a good summary statistic. It
is NOT a good shortcut for describing
how officers in this sample performed.
13 officers scored on numbers
of tickets written in one week
If the variable “no. of tickets” were “normally” distributed, most cases would fall inside the bell-shaped curve. Here they do!
Officer A: 1 ticket
Officer B: 2 tickets
Officer C: 3 tickets
Officers D, E, F: 4 tickets each
Officers G, H, I: 5 tickets each
Officers J & K: 6 tickets each
Officer L: 7 tickets
Officer M: 9 tickets
Mean = 4.69
SD = 2.1
In a normal distribution about 68
percent of the cases fall within
1 SD of the mean
.68 X 13 = 8.84, or about 9 cases
[Figure: Frequency by number of tickets, officers A-M plotted against a bell-shaped curve. Marked values: 2.59 (-1 SD), 4.69 (mean), 6.79 (+1 SD)]
Here, 9 of the 13 cases (officers C-K)
do fall within 1 SD of the mean.
The distribution is near-normal because
most officers wrote close to the same
number of tickets, so the cases
“clustered” around the mean.
So, for this sample the mean is a good
summary statistic - a good shortcut for
describing officer performance
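The check applied to both ticket samples (count the cases within 1 SD of the mean and compare that count with what a normal distribution would predict) can be sketched as follows. The function name is an assumption for illustration; `statistics.stdev` uses the n-1 denominator, which reproduces the SD values on the slides.

```python
from statistics import mean, stdev

def cases_within_one_sd(scores):
    """Return the mean, the sample SD (n-1), and how many scores
    lie between mean - 1 SD and mean + 1 SD."""
    m = mean(scores)
    s = stdev(scores)
    inside = [x for x in scores if m - s <= x <= m + s]
    return m, s, len(inside)

# Tickets written by officers A through M in each example.
first_sample = [1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 7, 7, 9]
second_sample = [1, 2, 3, 4, 4, 4, 5, 5, 5, 6, 6, 7, 9]

for label, tickets in (("First", first_sample), ("Second", second_sample)):
    m, s, inside = cases_within_one_sd(tickets)
    # About 68 percent of cases fall within 1 SD in a normal distribution.
    expected = 0.68 * len(tickets)
    print(f"{label} sample: mean = {m:.2f}, SD = {s:.2f}, "
          f"{inside} of {len(tickets)} cases within 1 SD "
          f"(normal distribution: about {expected:.0f})")
```

The first sample has 7 cases inside the band and the second has 9, which is why the mean summarizes the second sample better than the first.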
Going beyond description…
• As we’ve seen, when variables are normally or near-normally distributed, the mean, variance and standard deviation can help describe datasets
• But they are also useful in explaining why things change; that is, in testing hypotheses
• For example, assume that patrol officers in the XYZ police dept. were tested for effectiveness, and that on a scale of 1 (least eff.) to 5 (most eff.) their mean score was 3.2, distributed about normally
• You want to use XYZ P.D. to test the hypothesis that college-educated cops are more effective: college → greater effectiveness
– Independent variable: college (Y/N)
– Dependent variable: effectiveness (scale 1-5)
• You draw two officer samples (we’ll cover this later in the term) and compare their mean effectiveness scores
– 10 college grads (mean 3.7)
– 10 non-college (mean 2.8)
• On its face, the difference between means is in the hypothesized direction: college grads seem more effective. But that’s not the end of it. Each group’s variance would then be used to determine whether the difference in scores is “statistically significant.” Don’t worry - we’ll cover this later!
[Figure: Are college-educated cops more effective? College grads vs. non-college grads]
Variability exercise
Sample 1 (n=10)
Random sample of patrol officers, each scored 1-5 on a cynicism scale

Officer  Score  Mean  Diff.  Sq.
1        3      2.9   .1     .01
2        3      2.9   .1     .01
3        3      2.9   .1     .01
4        3      2.9   .1     .01
5        3      2.9   .1     .01
6        3      2.9   .1     .01
7        3      2.9   .1     .01
8        1      2.9   -1.9   3.61
9        2      2.9   -.9    .81
10       5      2.9   2.1    4.41
____________________________________________________
Sum                                          8.90
Variance (sum of squares / n-1)         s²   .99
Standard deviation (sq. root of variance)  s    .99

This is not an acceptable graph – it’s only to illustrate dispersion
Sample 2 (n=10)
Another random sample of patrol officers, each scored 1-5 on a cynicism scale

Officer  Score  Mean  Diff.  Sq.
1        2      ___   ___    ___
2        1      ___   ___    ___
3        1      ___   ___    ___
4        2      ___   ___    ___
5        3      ___   ___    ___
6        3      ___   ___    ___
7        3      ___   ___    ___
8        3      ___   ___    ___
9        4      ___   ___    ___
10       2      ___   ___    ___

Sum ____
Variance s² ____
Standard deviation s ____
Compute ...
Two random samples of patrol officers, each scored 1-5 on a cynicism scale

Sample 1 (n=10)

Officer  Score  Mean  Diff.  Sq.
1        3      2.9   .1     .01
2        3      2.9   .1     .01
3        3      2.9   .1     .01
4        3      2.9   .1     .01
5        3      2.9   .1     .01
6        3      2.9   .1     .01
7        3      2.9   .1     .01
8        1      2.9   -1.9   3.61
9        2      2.9   -.9    .81
10       5      2.9   2.1    4.41

Sum                                          8.90
Variance (sum of squares / n-1)         s²   .99
Standard deviation (sq. root of variance)  s    .99

Sample 2 (n=10)

Officer  Score  Mean  Diff.  Sq.
1        2      2.4   -.4    .16
2        1      2.4   -1.4   1.96
3        1      2.4   -1.4   1.96
4        2      2.4   -.4    .16
5        3      2.4   .6     .36
6        3      2.4   .6     .36
7        3      2.4   .6     .36
8        3      2.4   .6     .36
9        4      2.4   1.6    2.56
10       2      2.4   -.4    .16

Sum                                          8.40
Variance (sum of squares / n-1)         s²   .93
Standard deviation (sq. root of variance)  s    .97

These are not acceptable graphs – they’re only used here to illustrate how the scores disperse around the mean
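The manual worksheet procedure (score, mean, difference, squared difference, then sum, variance and standard deviation) can be mirrored in a short Python sketch. The function name `worksheet` and the print layout are illustrative assumptions; the scores are the two cynicism samples.

```python
from math import sqrt

def worksheet(scores):
    """Build the worksheet rows and the three summary figures."""
    n = len(scores)
    m = sum(scores) / n
    # One row per officer: score, mean, difference, squared difference.
    rows = [(x, m, x - m, (x - m) ** 2) for x in scores]
    sum_sq = sum(row[3] for row in rows)
    s2 = sum_sq / (n - 1)   # variance: sum of squares / n-1
    s = sqrt(s2)            # standard deviation: sq. root of variance
    return rows, sum_sq, s2, s

sample_1 = [3, 3, 3, 3, 3, 3, 3, 1, 2, 5]
sample_2 = [2, 1, 1, 2, 3, 3, 3, 3, 4, 2]

for label, scores in (("Sample 1", sample_1), ("Sample 2", sample_2)):
    rows, sum_sq, s2, s = worksheet(scores)
    print(label)
    print("Officer  Score  Mean  Diff.   Sq.")
    for officer, (x, m, diff, sq) in enumerate(rows, start=1):
        print(f"{officer:>7}  {x:>5}  {m:>4.1f}  {diff:>5.1f}  {sq:>4.2f}")
    print(f"Sum {sum_sq:.2f}   s2 {s2:.2f}   s {s:.2f}")
```

Rounded to two decimals, the summary figures agree with the worked answers: 8.90, .99 and .99 for Sample 1, and 8.40, .93 and .97 for Sample 2.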
z-score (a “standard” score)
• If the distribution of a variable (e.g., number of arrests) is approximately normal, we can estimate where any score would fall in relation to the mean.
• We first convert the sample score into a z-score using the sample standard deviation: $z = \frac{x - \bar{x}}{s}$
[Figure: z-score scale running from -3 to +3]
• We then look up the z-score in a table. It gives the proportion of cases in the distribution…
– Between a case and the mean
– Beyond the case, away from the mean (left for negative z’s, right for positive z’s)
• Z-scores can be used to identify the percentile bracket into which a case falls (e.g., bottom ten percent)
• Since z-scores are standardized like percentages, they can be used to compare samples
• The z-table indicates the proportion of the area under the curve (the proportion of scores) between the mean and any z score, and the proportion of the area beyond that score (to the left or right)
• In a normal distribution 95 percent of all z-scores fall between +/- 1.96
• In a normal distribution 5 percent of all z-scores fall beyond +/- 1.96
[Figure: Proportion of area “under the curve” where cases lie. 95 percent of cases (.475 + .475) fall between z = -1.96 and z = +1.96; the rare/unusual cases lie in the two tails beyond, 2½ pct. (.025) each; the whole curve holds 100 percent of cases]
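For readers without a printed z-table, the two proportions it reports can be approximated in Python. This is a sketch, not a replacement for the table required in class: it assumes Python 3.8 or later for `statistics.NormalDist`, and the function name `z_table_lookup` is hypothetical.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def z_table_lookup(z):
    """Return (proportion between the mean and z, proportion beyond z)."""
    between = abs(STANDARD_NORMAL.cdf(z) - 0.5)
    beyond = 0.5 - between
    return between, beyond

between, beyond = z_table_lookup(1.96)
print(f"between mean and z = +1.96: {between:.3f}")
print(f"beyond z = +1.96:           {beyond:.3f}")
print(f"between -1.96 and +1.96:    {2 * between:.2f}")
```

Because the curve is symmetrical, a negative z gives the same two proportions as its positive counterpart; only the side of the mean changes.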
Variability exercise
Sample of twenty officers drawn from the Anywhere police department, each measured for number of arrests
Unit of analysis: officers
Case: one officer
Variable: number of arrests
[Figure: Frequency (1 to 6) by number of arrests (0 to 6)]
Number of arrests is presumably normally distributed in the population of officers, meaning the whole police department. That is, most officers make about the same number of arrests; a few make fewer, and a few make more.
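Assignment
1. Compute the sample standard deviation
2. Obtain the z-score for 0, 1, 2, 3, 4, 5 and 6 arrests: $z = \frac{x - \bar{x}}{s}$
NOTE: There are only seven values: 0, 1, 2, 3, 4, 5, 6. Only need to compute their statistics once.

Officer   #Arrests  Mean  Diff.  Diff. Squared  Z-score
1         2         ___   ___    ___            ___
2         4         ___   ___    ___            ___
3         5         ___   ___    ___            ___
4         3         ___   ___    ___            ___
5         1         ___   ___    ___            ___
6         3         ___   ___    ___            ___
7         2         ___   ___    ___            ___
8 (Jay)   0         ___   ___    ___            ___
9         3         ___   ___    ___            ___
10        4         ___   ___    ___            ___
11        5         ___   ___    ___            ___
12        3         ___   ___    ___            ___
13        2         ___   ___    ___            ___
14        1         ___   ___    ___            ___
15        4         ___   ___    ___            ___
16        6         ___   ___    ___            ___
17        3         ___   ___    ___            ___
18        4         ___   ___    ___            ___
19        2         ___   ___    ___            ___
20        3         ___   ___    ___            ___

Sum of squared differences ____
Variance (sum of squares/n-1) ____
Standard deviation (sq root var) ____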
Ofcr      #Arr  Mean  Diff.  Diff. Sq
1         2     3     -1     1
2         4     3     1      1
3         5     3     2      4
4         3     3     0      0
5         1     3     -2     4
6         3     3     0      0
7         2     3     -1     1
8 (Jay)   0     3     -3     9
9         3     3     0      0
10        4     3     1      1
11        5     3     2      4
12        3     3     0      0
13        2     3     -1     1
14        1     3     -2     4
15        4     3     1      1
16        6     3     3      9
17        3     3     0      0
18        4     3     1      1
19        2     3     -1     1
20        3     3     0      0

Sum of squared differences      42
Variance (sum of squares/n-1)   2.21
Standard Deviation (sq. root)   1.49
Arrests     Calculate       z      Prop. between mean and z  Prop. beyond z
0 (Jay)     (0 - 3) / 1.49  -2.01  48% (.4778)               2% (.0222)
1           (1 - 3) / 1.49  -1.34  41% (.4099)               9% (.0901)
2           (2 - 3) / 1.49  -.67   25% (.2486)               25% (.2514)
3           (3 - 3) / 1.49  0      0                         50% (.50)
4           (4 - 3) / 1.49  +.67   25% (.2486)               25% (.2514)
5           (5 - 3) / 1.49  +1.34  41% (.4099)               9% (.0901)
6 (Dudley)  (6 - 3) / 1.49  +2.01  48% (.4778)               2% (.0222)

[Figure: No. of officers by no. of arrests (0 to 6), with the z-score scale (-2 to +2) shown beneath the arrests axis]
Jay’s score falls in the bottom two percent of a normal distribution
Dudley’s score falls in the top two percent of a normal distribution
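The whole exercise can be reproduced with the sketch below. It makes two assumptions so that it follows the hand calculation: the standard deviation is rounded to two decimals before dividing, and each z-score is rounded to two decimals before the proportions are computed, as a printed z-table would require. `statistics.NormalDist` (Python 3.8+) stands in for the z-table, and the helper name `z_row` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

# Number of arrests for officers 1 through 20 (officer 8 is Jay).
arrests = [2, 4, 5, 3, 1, 3, 2, 0, 3, 4, 5, 3, 2, 1, 4, 6, 3, 4, 2, 3]

n = len(arrests)
mean = sum(arrests) / n
sum_sq = sum((x - mean) ** 2 for x in arrests)
variance = sum_sq / (n - 1)
# Rounded to two decimals to match the hand calculation.
s = round(sqrt(variance), 2)

def z_row(x):
    """Return z, the proportion between the mean and z, and the proportion beyond z."""
    z = round((x - mean) / s, 2)
    between = abs(NormalDist().cdf(z) - 0.5)
    return z, between, 0.5 - between

print(f"sum of squares = {sum_sq:.0f}, variance = {variance:.2f}, s = {s}")
for x in range(7):  # only seven distinct values: 0 through 6
    z, between, beyond = z_row(x)
    print(f"{x} arrests: z = {z:+.2f}, between = {between:.4f}, beyond = {beyond:.4f}")
```

Without the intermediate rounding the z-scores shift slightly in the second decimal, so a small mismatch with the table is to be expected if full precision is kept.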
Exam information
• You must bring a z-table and a regular, non-scientific calculator with no functions
beyond a square root key.
• You need to understand the concept of a distribution.
• You will be given data and asked to create graph(s) depicting the
distribution of a single variable.
• You will compute basic statistics, including mean, median, mode,
standard deviation and z-score. All computations must be shown on
the answer sheet.
• You will be given the formulas for variance (s²) and z. You must use and
display the procedure described in the slides and practiced in class for
manually calculating variance (s²) and standard deviation (s).
• You will use the z-table to calculate where cases from a given sample
would fall in a normal distribution.
• This is a relatively brief exam. You will have one hour to complete it.
We will then take a break and move on to the next topic.