Measures of variability

Measures of Variability
• Chapter 5 of Howell (except 5.3 and 5.4)
• People are all slightly different (that’s what
makes it fun)
• Not everyone scores the same on the same scale
• This is interesting for us - must take it into
account
• The variation tells us about the people we
studied
Example of variability
• Imagine this variable:
• 5 7 3 8 2 2 9 1 9 3
• The mean is 4.9
• We sort of expect 4.9 to be representative of the
scores, but:
[Figure: histogram of the sample; x-axis: scores 1 to 9, y-axis: frequency]
The data is at the edges - not at all close to 4.9!
A second sample:
• Look at this one:
• 4 4 4 5 5 5 5 6 6
• The mean is also 4.9 (4.89, rounded)
• But the distribution:
[Figure: histogram of the second sample; x-axis: scores 1 to 9, y-axis: frequency]
Same mean as before, but the numbers are very clustered close to the mean!
How do we explain this difference?
[Figure: the two histograms side by side; left: first sample, right: second sample]
• Both have the same mean
• the mean obviously doesn’t tell the whole
story!
• What is the actual difference between those
data sets?
• The left one is more “spread out” than the one
on the right
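To see the point in numbers, here is a small sketch that computes the mean of both samples and prints a crude text histogram of each. The names `sample_a` and `sample_b` are labels chosen here for illustration; the slides do not name the samples.

```python
from collections import Counter
from statistics import mean

# The two samples from the slides above (labels A and B are ours).
sample_a = [5, 7, 3, 8, 2, 2, 9, 1, 9, 3]
sample_b = [4, 4, 4, 5, 5, 5, 5, 6, 6]

for name, sample in (("A", sample_a), ("B", sample_b)):
    counts = Counter(sample)
    print(f"Sample {name}: mean = {mean(sample):.1f}")
    # One row per possible score (1 to 9), one '#' per occurrence.
    for score in range(1, 10):
        print(f"  {score}: {'#' * counts[score]}")
```

Both means print as 4.9, yet the rows of `#` marks look very different, which is the difference the mean alone cannot show.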
Variability
• Measures of variability capture this
“spreadness” of the data
• not applicable to nominal variables
• Various ways to measure it
• How far does the data stretch?
• How far, on average, is it spread from the
mean?
Extents of the data - the range
• The range is the total width of the data
• Consider x, with a sample
• 7 4 3 4 5 6 3
• These values range all the way from 3 (the
smallest value) to 7 (the biggest value) - its
range is 4
• Easy to calculate:
• range_x = max(x) - min(x)
• (the largest value of x minus the smallest value of x)
• A high range value means the data is very
spread
Example: calculating the range
• Calculate the range for x, from the sample:
• 26 28 32 15 25 12
• Step 1 - find the largest value of x
• in this sample, it is 32
• Step 2 - find the smallest value of x
• in this sample, it is 12
• Step 3 - biggest minus smallest
• 32 - 12 = 20
The range is 20
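The same three steps can be sketched in a couple of lines of Python. The function name `value_range` is a hypothetical helper chosen here (it avoids shadowing Python's built-in `range`).

```python
def value_range(x):
    """Return the range of a sample: largest value minus smallest value."""
    return max(x) - min(x)

# The worked example: 32 - 12
print(value_range([26, 28, 32, 15, 25, 12]))  # 20

# The earlier sample: 7 - 3
print(value_range([7, 4, 3, 4, 5, 6, 3]))  # 4
```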
Why the range is cool/ why it sucks
• Gives an idea of how far spread the data is
• a higher range number means the data is more
spread apart
• Can compare various samples’ ranges to see
which is spread the most
• But: can’t distinguish between these two
samples (both have range = 10)
[Figure: two histograms (left and right) of samples that both have range = 10; x-axis: scores, y-axis: frequency]
A better idea of variation
• The right histogram shows more clustering,
but has a few values which “throw off” the
range
• Range can be fooled by “extreme values” (outliers)
• There exist better measures which are “outlier
proof”
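A quick sketch of the problem, using made-up numbers (these two samples are invented for illustration and are not the ones plotted on the slide). Both have a range of 10, but one is evenly spread and the other is clustered with two extreme values. The variance, which the following slides introduce, tells them apart.

```python
from statistics import variance

# Made-up samples for illustration only (not from the slides).
spread_out = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
clustered = [0, 5, 5, 5, 5, 5, 5, 5, 5, 5, 10]

for name, sample in (("spread out", spread_out), ("clustered", clustered)):
    sample_range = max(sample) - min(sample)
    # Same range for both, but the variance is smaller for the clustered one.
    print(f"{name}: range = {sample_range}, variance = {variance(sample):.1f}")
```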
Outlier proofing - Variance
• The variance presents a better measure of
data spread
• not as easily influenced by outliers as the range,
because every score counts (extreme values still
affect it)
• Variance is based on the average squared distance
of the scores from the mean
• It is not on the variable’s scale
• the variance is not in the same units as the
variable
• Still useful - bigger values mean more
spread
Calculating variance (brace yourself)
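• Variance is calculated using a formula:
$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
Variance is (roughly) the mean of the squared deviations
of the observations from the mean; for a sample we divide by n - 1 rather than n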
Calculating variance (in English)
• Easy if broken down into 5 small steps!
• Step 1: Work out the mean of x, and n
• Step 2: For each data point, work out the
deviation (x minus the mean of x)
• Step 3: For each data point, square the
deviations you got above
• Step 4: Add all the squared deviations
together
• Step 5: Divide your sum by n minus 1
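The five steps map almost line for line onto code. This is a minimal sketch; the function name `sample_variance` and the tiny input `[2, 4, 6]` are chosen here for illustration only.

```python
def sample_variance(x):
    """Sample variance, following the five steps on the slide."""
    # Step 1: work out n and the mean of x
    n = len(x)
    mean_x = sum(x) / n
    # Step 2: deviation of each data point (x minus the mean of x)
    deviations = [value - mean_x for value in x]
    # Step 3: square each deviation
    squared = [d ** 2 for d in deviations]
    # Step 4: add all the squared deviations together
    total = sum(squared)
    # Step 5: divide the sum by n minus 1
    return total / (n - 1)

# Deviations are -2, 0, 2; squares are 4, 0, 4; sum is 8; 8 / 2 = 4
print(sample_variance([2, 4, 6]))  # 4.0
```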
Example: working out s²
• Work out the variance for x, based on the
sample:
• 16, 12, 15, 14, 20
• By the numbers!
• Step 1: work out the mean and n
• n is 5
• 16+12+15+14+20 = 77
• 77 / 5 = 15.4
• The mean is 15.4
Example: working out s²
• For the remaining steps, make yourself a
table:
x      x - x̄     (x - x̄)²
Each column is a step - fill in one at a time
Example: working out s²
• Step 2: Work out the deviation (x minus
mean of x)
x      x - x̄     (x - x̄)²
16      0.6
12     -3.4
15     -0.4
14     -1.4
20      4.6
Example: working out the variance
• Step 3: Square the deviations (column 2
times column 2)
x      x - x̄     (x - x̄)²
16      0.6       0.36
12     -3.4      11.56
15     -0.4       0.16
14     -1.4       1.96
20      4.6      21.16
Example: working out the variance
• Step 4: sum the squared deviations
• 0.36+11.56+0.16+1.96+21.16 = 35.2
• Step 5: divide the sum by (n-1)
• n=5
• n-1 = 4
• 35.2 / 4 = 8.8
• The variance of this data set is 8.8
• Simple, but tedious!
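Since the arithmetic is tedious, it is worth checking by machine. The sketch below rebuilds the table for the sample 16, 12, 15, 14, 20 and cross-checks the result against `statistics.variance` from the Python standard library, which also divides by n - 1. Variable names are our own.

```python
from statistics import variance

x = [16, 12, 15, 14, 20]
mean_x = sum(x) / len(x)  # Step 1: 77 / 5 = 15.4

print(f"{'x':>4} {'x - mean':>9} {'squared':>8}")
total = 0.0
for value in x:
    deviation = value - mean_x   # Step 2
    total += deviation ** 2      # Steps 3 and 4
    print(f"{value:>4} {deviation:>9.1f} {deviation ** 2:>8.2f}")

s2 = total / (len(x) - 1)        # Step 5
print(f"sum of squares = {total:.1f}, variance = {s2:.1f}")
print(f"statistics.variance gives {variance(x):.1f}")
```

Both the hand-rolled calculation and the library call should agree with the 8.8 worked out above.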
Variance: The bad news
• Variance is a good measure of spread, but it
is in odd units
• A bigger number means more spread, but the
number itself means very little
• Because we square in the formula, we cause the
numbers to lose their scale
• The variance of an IQ scale is not in IQ points
(it is in squared IQ points)
• Would be nice to have a measure of variation
which is in the correct units!
The Standard Deviation
• The standard deviation is a measure of
variation
• Has all the good properties of the variance
• PLUS it is in the same scale as the variable
• Standard deviation of IQ scores is expressed in IQ
points
• Gives an intuitive understanding of how far apart
the scores truly are spread
– “Scores were centered at 100 and spread by 15”
Calculating the standard deviation
• Very simple formula:
$s = \sqrt{s^2}$
• To work it out, calculate variance and then
take its square root
Example: working out s
• Work out the standard deviation for x, based on the
sample:
• 16, 12, 15, 14, 20
• Step 1: Work out the variance
• s² = 8.8 (from the previous example)
• Step 2: find the square root:
• √8.8 = 2.966
The standard deviation is 2.966
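As a check, here is a short sketch that takes the square root of the variance and compares it with `statistics.stdev` from the Python standard library, which computes the sample standard deviation directly from the data.

```python
import math
from statistics import stdev

x = [16, 12, 15, 14, 20]

s2 = 8.8            # variance from the previous example
s = math.sqrt(s2)   # standard deviation = square root of the variance
print(round(s, 3))  # 2.966

# The library computes the same quantity straight from the data.
print(round(stdev(x), 3))
```

Going the other way is just as short: `s ** 2` gives back the variance.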
Variance and standard deviation
• If you have variance, it is easy to work out
standard deviation
• Square root the variance
• If you have the standard deviation, it is
easy to work out the variance
• Square it
Using the standard deviation with the mean
• By looking at the mean and std deviation at
the same time, we can get a good idea of a
variable:
A - Mean: 5.35, Std dev: 2.3
B - Mean: 5.35, Std dev: 1.008
[Figure: histograms of A and B; x-axis: scores 1 to 8, y-axis: frequency]
Understanding distributions
• The mean tells us the “middle” of the
distribution
• The standard dev tells us the “spreadness”
of the data
• From this we can derive a lot
• A low std dev means that everyone scored
almost the same
• A high std dev tells you there was a lot of
disagreement
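To tie this back to the two samples from the start of the lecture, the sketch below reports the mean and standard deviation of each. The labels `sample_a` and `sample_b` are ours; the data are the two opening samples.

```python
from statistics import mean, stdev

# The two opening samples: same mean, very different spread.
sample_a = [5, 7, 3, 8, 2, 2, 9, 1, 9, 3]   # data at the edges
sample_b = [4, 4, 4, 5, 5, 5, 5, 6, 6]      # data clustered near the mean

sd_a = stdev(sample_a)
sd_b = stdev(sample_b)

print(f"A: mean = {mean(sample_a):.1f}, std dev = {sd_a:.2f}")
print(f"B: mean = {mean(sample_b):.1f}, std dev = {sd_b:.2f}")
```

The means match, but the standard deviation of A comes out several times larger than that of B, which is the “spreadness” the histograms showed.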