Class 3 Lecture: Descriptive Statistics 2


Sociology 5811:
Lecture 3: Measures of Central Tendency and Dispersion
Copyright © 2005 by Evan Schofer
Do not copy or distribute without permission
Announcements
• First math problem set will be handed out in Lab
on Monday…
• Due September 20
Today’s Class:
• The Mean (and relevant mathematical notation)
• Measures of Dispersion
Review: Variables / Notation
• Each column of a dataset is considered a variable
• We’ll refer to a column generically as “Y”
Person    # Guns owned (the variable “Y”)
1         0
2         3
3         0
4         1
5         1
Note: The total number of cases in the dataset is referred to as “N”. Here, N=5.
Equation of Mean: Notation
• Each case can be identified by a subscript
• Yi represents “ith” case of
variable Y
• i goes from 1 to N
• Y1 = value of Y for first
case in spreadsheet
• Y2 = value for second
case, etc.
• YN = value for last case
Person    # Guns owned (Y)
1         Y1 = 0
2         Y2 = 3
3         Y3 = 0
4         Y4 = 1
5         Y5 = 1
Calculating the Mean
• Equation:
$$\bar{Y} = \frac{1}{N} \sum_{i=1}^{N} Y_i$$
• 1. Mean of variable Y represented by Y with a
line on top – called “Y-bar”
• 2. Equals sign means equals: “is calculated
by the following…”
• 3. N refers to the total number of cases for
which there is data
• Summation (Σ) – will be explained next…
Equation of Mean: Summation
• Sigma (Σ): Summation
– Indicates that you should add up a series of numbers
$$\sum_{i=1}^{N} Y_i$$
– The things on top and bottom tell you how many times to add up Y-sub-i… AND what numbers to substitute for i.
– The thing on the right is the item to be added repeatedly
Equation of Mean: Summation
$$\sum_{i=1}^{N} Y_i = Y_1 + Y_2 + Y_3 + Y_4 + Y_5$$
• 1. Start with bottom: i = 1.
– The first number to add is Y-sub-1
• 2. Then, allow i to increase by 1
– The second number to add is Y-sub-2 (i = 2), then Y-sub-3 (i = 3)
• 3. Keep adding numbers until i = N
– In this case N=5, so stop at 5
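The three steps above can be sketched as a loop. This is only an illustration in Python (not part of the original lecture), using the gun ownership values from the notation slide; the names Y, N and total are chosen here for convenience.

```python
# Guns owned by each of the N = 5 people (Y-sub-1 through Y-sub-5).
Y = [0, 3, 0, 1, 1]
N = len(Y)

total = 0
# i runs from 1 to N, as written below and above the sigma.
for i in range(1, N + 1):
    # Python lists start counting at 0, so Y-sub-i is stored at Y[i - 1].
    total += Y[i - 1]

print(total)  # 5
```

Python's built-in sum(Y) does the same thing in one step.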
Equation of the Mean: Example 2
• Can you calculate mean for gun ownership?
Person    # Guns owned (Y)
1         Y1 = 0
2         Y2 = 3
3         Y3 = 0
4         Y4 = 1
5         Y5 = 1
$$\bar{Y} = \frac{1}{N} \sum_{i=1}^{N} Y_i$$
• Answer:
$$\bar{Y} = \frac{1}{5} \times 5 = 1$$
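As a check on the answer, here is a minimal Python sketch of the mean formula: add up all cases, then divide by N. The function name mean and the variable guns_owned are assumptions made for this example, not something from the lecture.

```python
def mean(values):
    """Return the sum of the values divided by the number of cases (N)."""
    n = len(values)
    if n == 0:
        raise ValueError("the mean is undefined when there are no cases")
    return sum(values) / n

guns_owned = [0, 3, 0, 1, 1]
print(mean(guns_owned))  # 1.0
```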
Properties of the Mean
• The mean takes into account the value of every
case to determine what is “typical”
– In contrast to the mode & median
– Probably the most commonly used measure of
“central tendency”
• But, it is often good to look at median & mode also!
• Disadvantages
– Every case influences outcome… even unusual ones
– Extreme cases affect results a lot
– The mean doesn’t give you any information on the
shape of the distribution
• Cases could be very spread out, or very tightly clustered
The Mean and Extreme Values
• Extreme values affect the mean a lot:
Case    Num CD’s    Num CD’s 2
1       20          20
2       40          40
3       0           0
4       70          1000
Mean    32.5        265
Changing this one case (case 4) really affects the mean a lot
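The effect can be reproduced with a short Python sketch using the standard statistics module. The comparison with the median is added here for illustration; the list names are assumptions.

```python
from statistics import mean, median

cds_original = [20, 40, 0, 70]
cds_changed = [20, 40, 0, 1000]   # only case 4 differs

# The mean jumps when one case becomes extreme...
print(mean(cds_original), mean(cds_changed))      # 32.5 265

# ...while the median (the middle of the sorted cases) does not move.
print(median(cds_original), median(cds_changed))  # 30.0 30.0
```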
Example 1
• And, very different groups can have (nearly) the same mean:
[Histogram: Number of CDs (Group 1). Mean = 101, Std. Dev = 21.72, N = 23]
Example 2
[Histogram: Number of CDs (Group 2). Mean = 100.0, Std. Dev = 67.62, N = 23]
Example 3
[Histogram: Number of CDs (Group 3). Mean = 104, Std. Dev = 102.15, N = 23]
Interpreting Dispersion
• Question: What are possible social
interpretations of the different distributions (all
with the same mean)?
• Example 1: Individuals cluster around 100
• Example 2: Individuals distributed sporadically
over range 0-200
• Example 3: Individuals in two groups – near zero
and near 200
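To make the three patterns concrete, here is a small Python sketch. The values are hypothetical, invented for this illustration; they are NOT the data behind the histograms. It uses the standard deviation (the "Std. Dev" figure reported with each histogram, defined later in the lecture) as the measure of spread.

```python
from statistics import mean, stdev

# Hypothetical values chosen for illustration only.
clustered = [90, 95, 100, 105, 110]   # everyone near 100
spread_out = [0, 50, 100, 150, 200]   # scattered over the range 0-200
two_groups = [0, 5, 195, 200]         # near zero or near 200

groups = [("clustered", clustered),
          ("spread out", spread_out),
          ("two groups", two_groups)]

for name, values in groups:
    # Same mean in every group, but a different standard deviation.
    print(name, mean(values), round(stdev(values), 1))
```

All three groups have a mean of 100, yet the standard deviation grows from the first group to the third.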
Measures of Dispersion
• Remember: Goal is to understand your
variable…
• Center of the distribution is only part of the story
• Important issue:
• How “spread out” are the cases around the
mean?
– How “dispersed”, “varied” are your cases?
– Are most cases like the “typical” case? Or not?
Measures of Dispersion
• Some measures of dispersion:
• 1. Range
– Also related: Minimum and Maximum
• 2. Average Absolute deviation
• 3. Variance
• 4. Standard deviation
Minimum and Maximum
• Minimum: the lowest value of a variable
represented in your data
• Maximum: the highest value of a variable
represented in your data
• Example: In previous histograms about number
of CDs owned, the minimum was 0, the
maximum was 200.
The Range
• The Range is calculated as the maximum minus
the minimum
– In case of CD ownership, 200 - 0 = 200
• Advantage:
– Easy
• Disadvantage:
– 1. Easily influenced by extreme values… may not
be representative
– 2. Doesn’t tell you anything about the middle cases
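A minimal sketch of the minimum, maximum and range in Python. Since the data behind the histograms are not listed, this assumes the four-case CD example from the earlier table instead.

```python
cds = [20, 40, 0, 70]  # the four-case CD example

minimum = min(cds)
maximum = max(cds)
value_range = maximum - minimum
print(minimum, maximum, value_range)  # 0 70 70

# Disadvantage 1: a single extreme case changes the range drastically.
cds_extreme = [20, 40, 0, 1000]
print(max(cds_extreme) - min(cds_extreme))  # 1000
```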
The Idea of Deviation
• Deviation: How much a particular case differs
from the mean of all cases
• Deviation of zero indicates the case has the same
value as the mean of all cases
– Positive deviation: case has higher value than mean
– Negative deviation: case has lower value than mean
• Extreme positive/negative indicates cases further
from mean.
Deviation of a Case
• Formula:
$$d_i = Y_i - \bar{Y}$$
• Literally, it is the distance from the mean (Y-bar)
Deviation Example
Case    Num CD’s    Deviation from mean (32.5)
1       20          -12.5
2       40          7.5
3       0           -32.5
4       70          37.5
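The deviations in this table can be computed with a short Python sketch (the variable names are assumptions for this example): subtract the mean from each case.

```python
cds = [20, 40, 0, 70]

# The mean (Y-bar) of the four cases.
y_bar = sum(cds) / len(cds)
print(y_bar)  # 32.5

# Deviation of each case: d-sub-i = Y-sub-i minus Y-bar.
deviations = [y - y_bar for y in cds]
print(deviations)  # [-12.5, 7.5, -32.5, 37.5]
```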
Turning the Deviation into a Useful Measure of Dispersion
• Idea #1: Add it all up
– The sum of deviation for all cases:
$$\sum_{i=1}^{N} d_i$$
• What is sum of the following? -12.5, 7.5, -32.5, 37.5
• Problem: Sum of deviation is always zero
– Because mean is the exact center of all cases
– Cases equally deviate positively and negatively
– Conclusion: You can’t measure dispersion this way
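Why the sum is always zero can be shown in one line. This short derivation is added for illustration; it uses only the definition of the deviation and the fact that the mean is the sum of all cases divided by N, so the sum of all cases equals N times Y-bar.

```latex
\sum_{i=1}^{N} d_i
  = \sum_{i=1}^{N} (Y_i - \bar{Y})
  = \sum_{i=1}^{N} Y_i - N\bar{Y}
  = N\bar{Y} - N\bar{Y}
  = 0
```

For the CD example: -12.5 + 7.5 - 32.5 + 37.5 = 0.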
Turning the Deviation into a Useful Measure of Dispersion
• Idea #2: Sum up “absolute value” of deviation
– Absolute value makes negative values positive
– Designated by vertical bars:
$$\sum_{i=1}^{N} |d_i|$$
• What is sum? -12.5, 7.5, -32.5, 37.5
• Answer: 90
– In total, these 4 cases deviate by 90 CDs from the mean
• Problem: Sum of Absolute Deviation grows
larger if you have more cases…
– Doesn’t allow comparison across samples
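The problem can be seen in a small Python sketch. The "doubled" sample is hypothetical: it assumes the same four cases were simply observed twice, so the mean and every deviation stay the same.

```python
deviations = [-12.5, 7.5, -32.5, 37.5]

# Sum of the absolute deviations for the four cases.
total_abs = sum(abs(d) for d in deviations)
print(total_abs)  # 90.0

# Hypothetical larger sample: the same four cases observed twice.
# The cases are no more spread out than before, but the sum doubles.
doubled = deviations + deviations
print(sum(abs(d) for d in doubled))  # 180.0
```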
Turning the Deviation into a Useful Measure of Dispersion
• Idea #3: The Average Absolute Deviation
– Calculate the sum, divide by total N of cases
– Gives the average size of a case’s deviation from the mean
• Formula:
$$AAD = \frac{\sum_{i=1}^{N} |d_i|}{N} = \frac{\sum_{i=1}^{N} |Y_i - \bar{Y}|}{N}$$
Turning the Deviation into a Useful Measure of Dispersion
• Digression: Here we have used the mean to
determine “typical” size of case deviations
– Originally, I introduced the mean as a way to analyze actual case values (e.g. # of CDs owned)
– Now: Instead of looking at typical case values, we
want to know what sort of deviation is typical
• In other words a statistic, the mean, is being used to analyze
another statistic – a deviation
– This is a general principle that we will use often:
statistics can help us understand our raw data and
also further summarize our statistical calculations!
Average Absolute Deviation
• Example: Total Absolute Deviation = 90, N=4
– What is Average absolute deviation?
– Answer: 22.5
• Advantages
– Very intuitive interpretation:
• Tells you how much cases differ from the mean, on average
• Disadvantages
– Has non-ideal mathematical properties, according to statisticians (e.g. absolute values are harder to work with algebraically than squares)
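A minimal Python sketch of the average absolute deviation, applied to the four-case CD example. The function name average_absolute_deviation is an assumption made here, not a standard library function.

```python
def average_absolute_deviation(values):
    """Sum of absolute deviations from the mean, divided by N."""
    n = len(values)
    y_bar = sum(values) / n
    return sum(abs(y - y_bar) for y in values) / n

cds = [20, 40, 0, 70]
print(average_absolute_deviation(cds))  # 22.5
```

Unlike the raw sum of absolute deviations, this value does not grow just because more cases are added.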
Turning the Deviation into a Useful Measure of Dispersion
• Idea #4: Square the deviation to avoid problem
of negative values
– Sum of “squared” deviation
– Divide by “N-1” (instead of N) to get the average
• Result: The “variance”:
$$s_Y^2 = \frac{\sum_{i=1}^{N} d_i^2}{N-1} = \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N-1}$$
Calculating the Variance 1
Case    Num CD’s (Y)
1       20
2       40
3       0
4       70
Calculating the Variance 2
Case    Num CD’s (Y)    Mean (Y bar)
1       20              32.5
2       40              32.5
3       0               32.5
4       70              32.5
Calculating the Variance 3
Case    Num CD’s (Y)    Mean (Y bar)    Deviation (d)
1       20              32.5            -12.5
2       40              32.5            7.5
3       0               32.5            -32.5
4       70              32.5            37.5
Calculating the Variance 4
Case    Num CD’s (Y)    Mean (Y bar)    Deviation (d)    Squared Deviation (d^2)
1       20              32.5            -12.5            156.25
2       40              32.5            7.5              56.25
3       0               32.5            -32.5            1056.25
4       70              32.5            37.5             1406.25
Calculating the Variance 5
• Variance = Average of “squared deviation”
– Average = mean = sum up, divide by N
– In this case, use N-1
• Sum of 156.25 + 56.25 + 1056.25 + 1406.25 = 2675
• Divide by N-1
– N-1 = 4-1 = 3
• Compute variance:
• 2675 / 3 = 891.7 = variance = s^2
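The whole calculation can be checked with a short Python sketch (variable names are assumptions for this example). Note that the squared deviation of -12.5 is 156.25, so the squared deviations sum to 2675. The result is compared against variance from Python's standard statistics module, which also divides by N-1.

```python
from statistics import variance

cds = [20, 40, 0, 70]
n = len(cds)
y_bar = sum(cds) / n  # 32.5

# Square each deviation from the mean.
squared_deviations = [(y - y_bar) ** 2 for y in cds]
print(squared_deviations)   # [156.25, 56.25, 1056.25, 1406.25]

# Add them up, then divide by N-1 (here 3).
sum_of_squares = sum(squared_deviations)
print(sum_of_squares)       # 2675.0

s_squared = sum_of_squares / (n - 1)
print(round(s_squared, 1))  # 891.7
```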
The Variance
• Properties of the variance
– Zero if all points cluster exactly on the mean
– Increases the further points lie from the mean
– Comparable across samples of different size
• Advantages
– 1. Provides a good measure of dispersion
– 2. Better mathematical characteristics than the AAD
• Disadvantages:
– 1. Not as easy to interpret as AAD
– 2. Values get large, due to “squaring”
Turning the Deviation into a Useful Measure of Dispersion
• Idea #5: Take square root of Variance to shrink it
back down
• Result: Standard Deviation
– Denoted by lower-case s
– Most commonly used measure of dispersion
• Formula:
$$s_Y = \sqrt{s_Y^2} = \sqrt{\frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N-1}}$$
Calculating the Standard Deviation
• Simply take the square root of the variance
• Example:
– Variance = 891.7 (2675 / 3)
– Square root of 891.7 = 29.9
• Properties:
– Similar to Variance
– Zero for perfectly concentrated distribution
– Grows larger if cases are spread further from the mean
– Comparable across different sample sizes
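A minimal Python sketch of the standard deviation for the four-case CD example, checked against stdev from the standard statistics module. The function name standard_deviation is an assumption made for this example.

```python
import math
from statistics import stdev

def standard_deviation(values):
    """Square root of the variance (sum of squared deviations over N-1)."""
    n = len(values)
    y_bar = sum(values) / n
    s_squared = sum((y - y_bar) ** 2 for y in values) / (n - 1)
    return math.sqrt(s_squared)

cds = [20, 40, 0, 70]
s = standard_deviation(cds)
print(round(s, 2))  # 29.86

# Zero for a perfectly concentrated distribution (all cases equal).
print(standard_deviation([7, 7, 7]))  # 0.0
```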
Example 1: s = 21.72
[Histogram: Number of CDs (Group 1). Mean = 101, Std. Dev = 21.72, N = 23]
Example 2: s = 67.62
[Histogram: Number of CDs (Group 2). Mean = 100.0, Std. Dev = 67.62, N = 23]
Example 3: s = 102.15
[Histogram: Number of CDs (Group 3). Mean = 104, Std. Dev = 102.15, N = 23]
Thinking About Dispersion
• Suppose we observe that the standard deviation of
wealth is greater in the U.S. than in Sweden…
– What can we conclude about the two countries?
• Guess which group has a higher standard deviation for
income: Men or Women? Why?
• The standard deviation of a stock’s price is sometimes
considered a measure of “risk”. Why?
• Suppose we polled people on two political issues and the
S.D. was much higher for one
• What are some possible interpretations?
• What are some other examples where the deviation
would provide useful information?