
Variability
Introduction to Statistics
Chapter 4
Jan 22, 2009
Class #4
Describing Variability


Variability describes, as an exact quantitative measure, how spread
out or clustered together the scores are
Variability is usually defined in terms of distance
 How far apart scores are from each other
 How far apart scores are from the mean
 How representative a score is of the data set as a whole
Describing Variability: the
Range

Simplest and most obvious way of describing
variability
Range = Highest - Lowest
(real limits)
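As a rough illustration of the range, the sketch below computes it both ways for a small set of scores. The scores are illustrative, and the real-limits version assumes the scores are measured in whole units, so that each score X is taken to cover the interval from X - 0.5 to X + 0.5.

```python
# Illustrative scores (assumed to be measured in whole units).
scores = [18, 33, 58, 67, 73, 93, 147]

# Simple range: highest score minus lowest score.
simple_range = max(scores) - min(scores)

# Range with real limits: upper real limit of the highest score
# minus lower real limit of the lowest score (assumes a unit of 1).
real_limit_range = (max(scores) + 0.5) - (min(scores) - 0.5)

print(simple_range)      # 129
print(real_limit_range)  # 130.0
```

Either way, only the two extreme scores enter the calculation.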

The range only takes into account the two extreme
scores and ignores any values in between. To counter
this, the distribution is divided into quarters
(quartiles): Q1 = 25th percentile, Q2 = 50th percentile, Q3 = 75th percentile
 The Interquartile range: the distance between the first and third
quartiles (Q3 - Q1)
 The Semi-Interquartile range: one half of the Interquartile
range, (Q3 - Q1) / 2
Interquartile range (IQR)
The most common percentiles are quartiles. Quartiles divide
data sets into fourths or four equal parts.
• The 1st quartile, denoted Q1, divides the bottom 25% of the
data from the top 75%. Therefore, the 1st quartile is
equivalent to the 25th percentile.
• The 2nd quartile divides the bottom 50% of the data from the
top 50% of the data, so that the 2nd quartile is equivalent to
the 50th percentile, which is equivalent to the median.
• The 3rd quartile divides the bottom 75% of the data from the
top 25% of the data, so that the 3rd quartile is equivalent to
the 75th percentile.
Interquartile range (IQR)
 The interquartile range (IQR) is the distance
between the 75th percentile and the 25th
percentile
 The IQR is essentially the range of the middle
50% of the data
 Because it uses the middle 50%, the IQR is not
affected by outliers (extreme values)
Interquartile range (IQR)
 Example:
 Compute the interquartile range for the sorted data
 18, 33, 58, 67, 73, 93, 147
 The 25th and 75th percentiles are at positions
.25*(7+1) = 2 and .75*(7+1) = 6, i.e. the 2nd and 6th
observations (33 and 93), respectively.
 IQR = 93 - 33 = 60.
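The example can be checked with a short sketch. It uses the p*(n+1) position rule from the slide; the helper name value_at_position and the linear interpolation for positions that fall between two observations are assumptions added here (the slide's example only needs whole-number positions).

```python
def value_at_position(ordered, position):
    """Score at a 1-based position in sorted data; positions that fall
    between two observations are linearly interpolated (an assumption,
    not something the slide specifies)."""
    lower = int(position)            # whole part of the position
    fraction = position - lower      # fractional part (0 when exact)
    if lower < 1:
        return float(ordered[0])
    if lower >= len(ordered):
        return float(ordered[-1])
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)

scores = [18, 33, 58, 67, 73, 93, 147]
ordered = sorted(scores)
n = len(ordered)

q1 = value_at_position(ordered, 0.25 * (n + 1))   # position 2
q3 = value_at_position(ordered, 0.75 * (n + 1))   # position 6
iqr = q3 - q1

print(q1, q3, iqr)   # 33.0 93.0 60.0
```

Other software may use a different quartile convention and give slightly different values on other data sets; Python's statistics.quantiles with method="exclusive" should agree with the rule used here.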
Describing Variability:
Deviation in a Population

A more sophisticated measure of variability is one
that shows how scores cluster around the mean

Deviation is the distance of a score from the mean:
X - μ, e.g. 11 - 6.35 = 4.65, 3 - 6.35 = -3.35

A measure representative of the variability of all the
scores would be the mean of the deviation scores:
add all the deviations and divide by N
$\frac{\sum (X - \mu)}{N}$
 However, the deviation scores always add up to zero (the mean
serves as the balance point for the scores), so this average is always zero
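A quick way to see the balance-point property is to compute the deviations for the 20 scores used in the variance table in these notes (mean 6.35). This is only a sketch; the variable names are illustrative.

```python
# The 20 scores from the variance table in these notes.
scores = [3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 8, 8, 9, 10, 10, 11]

mean = sum(scores) / len(scores)           # 127 / 20 = 6.35
deviations = [x - mean for x in scores]    # X - mean for each score

# Negative and positive deviations cancel, so the total is zero
# (up to floating-point rounding).
total_deviation = sum(deviations)

print(mean)
print(abs(total_deviation) < 1e-9)   # True
```

Because the total is always zero, the plain average of the deviations cannot distinguish a tightly clustered data set from a widely spread one.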
Describing Variability:
Variance in a Population
  X     X - μ   (X - μ)²
  3     -3.35      11.22
  3     -3.35      11.22
  4     -2.35       5.52
  4     -2.35       5.52
  4     -2.35       5.52
  5     -1.35       1.82
  5     -1.35       1.82
  5     -1.35       1.82
  6     -0.35       0.12
  6     -0.35       0.12
  6     -0.35       0.12
  6     -0.35       0.12
  7      0.65       0.42
  7      0.65       0.42
  8      1.65       2.72
  8      1.65       2.72
  9      2.65       7.02
 10      3.65      13.32
 10      3.65      13.32
 11      4.65      21.62
Sum         0     106.55

To remove the +/- signs we
simply square each deviation
before finding the average. This
is called the Variance:
$\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{106.55}{20} = 5.33$
The numerator is referred to as the
Sum of Squares (SS): the sum of the
squared deviations around the mean.
SS is a basic component of
variability
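The calculation for the 20 scores can be reproduced with a minimal sketch like the one below, treating the scores as a complete population. The variable names are illustrative.

```python
# The 20 scores from the table, treated as a population.
scores = [3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 8, 8, 9, 10, 10, 11]

n = len(scores)
mean = sum(scores) / n                                # 6.35

# Sum of Squares (SS): sum of squared deviations around the mean.
ss = sum((x - mean) ** 2 for x in scores)

# Population variance: SS divided by N.
population_variance = ss / n

print(round(ss, 2))                    # 106.55
print(round(population_variance, 2))   # 5.33
```

Squaring makes every term non-negative, so the deviations no longer cancel.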
Variability:
Variance in a Population



$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$

let X = [3, 4, 5, 6, 7]
Mean = 5

(X - Mean) = [-2, -1, 0, 1, 2]
subtract Mean from each number in X

(X - Mean)² = [4, 1, 0, 1, 4]
squared deviations from the mean

Σ(X - Mean)² = 10
sum of squared deviations from the mean (SS)

Σ(X - Mean)²/N = 10/5 = 2
average squared deviation from the mean
Variability:
Variance in a Population



$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$

let X = [1, 3, 5, 7, 9]
Mean = 5

(X - Mean) = [-4, -2, 0, 2, 4]
subtract Mean from each number in X

(X - Mean)² = [16, 4, 0, 4, 16]
squared deviations from the mean

Σ(X - Mean)² = 40
sum of squared deviations from the mean (SS)

Σ(X - Mean)²/N = 40/5 = 8
average squared deviation from the mean
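The two small examples, X = [3, 4, 5, 6, 7] and X = [1, 3, 5, 7, 9], share the same mean but differ in spread. A short sketch makes the comparison explicit; the function name population_variance is illustrative.

```python
def population_variance(scores):
    """Average squared deviation from the mean: SS / N."""
    n = len(scores)
    mean = sum(scores) / n
    ss = sum((x - mean) ** 2 for x in scores)   # sum of squares
    return ss / n

narrow = [3, 4, 5, 6, 7]   # scores close to the mean of 5
wide = [1, 3, 5, 7, 9]     # same mean, scores twice as far away

print(population_variance(narrow))   # 2.0
print(population_variance(wide))     # 8.0
```

Doubling every distance from the mean multiplies the variance by four, because the distances are squared.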
Variability:
Variance in a Population
 Variance can be calculated as the sum of
squares (SS) divided by N
$\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{SS}{N}$
Variability: Variance in a Sample
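 Variance in a sample
$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$
n is the number of scores; the denominator is n - 1
SS is the Sum of Squared Deviations From the Mean

$SS = \sum (X - \bar{X})^2$

So, sample variance (s²) is approximately the average squared deviation
from the mean (SS is divided by n - 1 rather than n)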
Describing Variability:
Population and Sample Variance

Population variance is designated by σ²
$\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{SS}{N}$

Sample Variance is designated by s²


Samples tend to be less variable than the populations they are drawn from: dividing SS by n
would therefore give a biased (too small) estimate of population variability
Degrees of Freedom (df): the number of independent (free to
vary) scores. In a sample, the sample mean must be known
before the variance can be calculated, so once n - 1 scores are known the final score is
determined: df = n - 1
$s^2 = \frac{\sum (X - M)^2}{n - 1} = \frac{SS}{n - 1} = \frac{106.55}{20 - 1} = 5.61$
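To compare the two denominators, the sketch below treats the same 20 scores first as a population (divide by N) and then as a sample (divide by n - 1). It also checks the results against Python's statistics module, where pvariance divides by N and variance divides by n - 1.

```python
import statistics

scores = [3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 8, 8, 9, 10, 10, 11]

n = len(scores)
mean = sum(scores) / n
ss = sum((x - mean) ** 2 for x in scores)   # SS = 106.55

population_variance = ss / n        # divide by N
sample_variance = ss / (n - 1)      # divide by df = n - 1

print(round(population_variance, 2))   # 5.33
print(round(sample_variance, 2))       # 5.61

# Cross-check with the standard library.
print(abs(sample_variance - statistics.variance(scores)) < 1e-9)        # True
print(abs(population_variance - statistics.pvariance(scores)) < 1e-9)   # True
```

Dividing by the smaller number n - 1 makes the sample variance slightly larger, which compensates for the tendency of samples to underestimate population variability.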
Describing Variability: the
Standard Deviation



Variance is a measure based on squared distances, so it is expressed in squared units
In order to get around this, we can take the square
root of the variance, which gives us the standard
deviation
Population (σ) and Sample (s) standard deviation
$\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$
$s = \sqrt{\frac{\sum (X - M)^2}{n - 1}}$
Variability:
Standard Deviation of a Sample
 The square root of Variance is called the
Standard Deviation
Variance: $s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$
Standard Deviation: $s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$
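As a sketch of the last step, the code below takes the square root of the population and sample variances for the 20 scores used earlier in these notes (SS = 106.55) and compares the results with statistics.pstdev and statistics.stdev.

```python
import math
import statistics

scores = [3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 8, 8, 9, 10, 10, 11]

n = len(scores)
mean = sum(scores) / n
ss = sum((x - mean) ** 2 for x in scores)

population_sd = math.sqrt(ss / n)        # square root of SS / N
sample_sd = math.sqrt(ss / (n - 1))      # square root of SS / (n - 1)

print(round(population_sd, 2))   # 2.31
print(round(sample_sd, 2))       # 2.37

# Cross-check with the standard library.
print(abs(population_sd - statistics.pstdev(scores)) < 1e-9)   # True
print(abs(sample_sd - statistics.stdev(scores)) < 1e-9)        # True
```

Unlike the variance, the standard deviation is in the same units as the original scores.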
Variability: Standard Deviation
 “The Standard Deviation tells us
approximately how far the scores vary from
the mean on average”
 It is approximately the average deviation of
scores from the mean
The Standard Deviation and the
Normal Distribution
 There are known percentages of
scores above or below any given
point on a normal curve
 34% of scores between the
mean and 1 SD above or
below the mean
 An additional 14% of scores
between 1 and 2 SDs above or
below the mean
 Thus, about 96% of all scores
are within 2 SDs of the mean
(34% + 34% + 14% + 14% =
96%)
 Note: 34% and 14% figures can
be useful to remember
[Figure: normal curve; vertical axis: Probability Density]
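The 34% and 14% figures are rounded. A minimal sketch using Python's statistics.NormalDist (a standard normal curve with mean 0 and SD 1) shows the more precise proportions; the variable names are illustrative.

```python
from statistics import NormalDist

z = NormalDist()   # standard normal: mean 0, SD 1

# Proportion of scores between the mean and 1 SD above it.
mean_to_1sd = z.cdf(1) - z.cdf(0)       # about 0.3413

# Additional proportion between 1 SD and 2 SDs above the mean.
from_1sd_to_2sd = z.cdf(2) - z.cdf(1)   # about 0.1359

# Proportion within 2 SDs of the mean on both sides.
within_2sd = z.cdf(2) - z.cdf(-2)       # about 0.9545

print(round(mean_to_1sd, 4), round(from_1sd_to_2sd, 4), round(within_2sd, 4))
```

The exact proportion within 2 SDs is closer to 95.4%; the 96% on the slide comes from adding the rounded 34% and 14% figures.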
Describing Variability

The standard deviation is the most common
measure of variability, but the others can be used.
A good measure of variability must:

Be stable and reliable: not be greatly affected by
little details in the data
 Extreme scores
 Multiple sampling from the same population
 Open-ended distributions

Both the variance and SD are related to other
statistical techniques
SS Computational Formula
 Note this formula on page 93. In later
chapters, we will be using this alternate SS
formula.
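The formula on page 93 is not reproduced in these notes. Assuming it is the usual computational form, SS = ΣX² - (ΣX)²/N, the sketch below checks that it gives the same SS as the definitional formula for the 20 scores used earlier. Treat this as an illustration under that assumption rather than a restatement of the textbook.

```python
scores = [3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 8, 8, 9, 10, 10, 11]
n = len(scores)

# Definitional formula: sum of squared deviations from the mean.
mean = sum(scores) / n
ss_definitional = sum((x - mean) ** 2 for x in scores)

# Computational formula (assumed form): sum of X squared minus
# (sum of X) squared divided by N. No deviations are needed.
sum_x = sum(scores)                       # 127
sum_x_squared = sum(x * x for x in scores)  # 913
ss_computational = sum_x_squared - sum_x ** 2 / n

print(round(ss_definitional, 2))    # 106.55
print(round(ss_computational, 2))   # 106.55
```

The computational form avoids working with decimal deviations when the mean is not a whole number.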
Credits
 http://www.le.ac.uk/pc/sk219/introtostats1.ppt#259,4,Plotting Data:
describing spread of data
 http://math.usask.ca/~miket/Sullivan_PP/Chapter_3/sec3_4.ppt#24