Measures of Central Tendency

Describing Distributions with
Numbers
Chapter 2
What we will do
• We are continuing our exploration of data.
• In the last chapter we graphically depicted
data
• Now we are going to look at how we can
describe data using “summary” statistics
• We will look at statistics that provide
measures of central tendency
• We will also look at statistics that provide
measures of dispersion
Sometimes Statistics are So
Simple…
• Sometimes statistics are so simple we
have to do something to make them look
fancier than they are. Enter “The Mean”.
• The mean simply means taking the
average of something.
• You all know how to do this. You add up
the group, then you divide it by the number
of items in the group.
But just to make sure you know I know what
I am doing I have a formula
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i
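As a minimal sketch of what the formula says, here is the mean computed by hand in Python. The list `scores` is made-up illustration data, not something from the course.

```python
# Hypothetical data, for illustration only.
scores = [4, 8, 15, 16, 23, 42]

n = len(scores)          # number of items in the group
mean = sum(scores) / n   # add up the group, then divide by n

print(mean)  # 18.0
```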
We may talk about these formulas
but…
• Don’t worry, we may talk about the
formulas that mathematically describe
statistics so you can get a better
understanding of how they work.
• I might also hand calculate a few to
demonstrate this
• But no one today hand-calculates real data.
• Neither should you; that is why we have
software.
The Median
• The Median is the mid point of a
distribution. Half the observations have
values less than the median, half have
values more
• The formula looks like this
• Note the formula gives the location of the
median (the observation which has a
value equal to the median) not its value
M = (N + 1) / 2
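A small sketch of the difference between the location of the median and its value. The function names and the sample lists are hypothetical, chosen only for illustration. When the location lands between two observations (as it does when N is even), the usual convention is to average the two neighbours.

```python
def median_location(n):
    # Position of the median in the sorted data: (N + 1) / 2
    return (n + 1) / 2

def median(values):
    ordered = sorted(values)
    loc = median_location(len(ordered))
    if loc.is_integer():
        # Odd N: the location points at one observation (1-based position).
        return ordered[int(loc) - 1]
    # Even N: the location falls halfway between two observations.
    lower = ordered[int(loc) - 1]
    upper = ordered[int(loc)]
    return (lower + upper) / 2

print(median_location(20))      # 10.5, halfway between the 10th and 11th
print(median([7, 1, 3, 9, 5]))  # 5
print(median([1, 3, 5, 7]))     # 4.0
```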
Here is where Stem & Leaf Graphs
can come in handy (N=20)
Mean and Median which one?
• In general the Mean is more susceptible to
distortion by
– abnormally large cases, in the language of the
book a distribution skewed to the right
– or abnormally small cases, in the language of
the book a distribution skewed to the left.
• For example, one Bill Gates among a
thousand people will seriously distort the
“Mean” income of this sample. However,
it will have little or no impact on the
“Median” Income
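To see the Bill Gates effect in numbers, here is a rough sketch with invented incomes: 999 people earning 50,000 each plus one person earning a billion. The figures are assumptions for illustration, not real data.

```python
import statistics

# Hypothetical sample: 999 ordinary incomes and one enormous one.
incomes = [50_000] * 999 + [1_000_000_000]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

# The mean is pulled above a million; the median stays at 50,000.
print(mean_income, median_income)
```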
Level of Measure Matters Also
• You cannot take the mean of a categorical
variable (one measured at the nominal or ordinal
level).
• You can however calculate the median of a
variable measured at the ordinal level.
• This is a good point to stop and remind you
about the stupidity of machines.
• Unless the variables are tagged in the data set
as to level of measure, your computer really
won’t care and will happily chug along
calculating even meaningless statistics such as
the mean of your categorical variables.
One more
• The Mode is the measure of central
tendency for nominal data. It is simply the
category with the largest number of cases.
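Here is a short sketch of both points: the computer will happily average category codes, while the mode is the statistic that actually makes sense for nominal data. The coding scheme (1 = car, 2 = bus, 3 = bicycle) and the responses are hypothetical.

```python
import statistics
from collections import Counter

# Hypothetical nominal variable stored as numeric codes.
labels = {1: "car", 2: "bus", 3: "bicycle"}
codes = [1, 1, 2, 3, 1, 2, 1, 3]

# The machine does not care: it returns 1.75, which means nothing.
meaningless_mean = statistics.mean(codes)

# The mode is simply the category with the most cases.
mode_code, mode_count = Counter(codes).most_common(1)[0]

print(meaningless_mean)               # 1.75
print(labels[mode_code], mode_count)  # car 4
```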
If all we knew was how well the
data clumped together…
• Even though the Median is less
susceptible to distortion by an abnormally
large or small case, it can still provide a
very weak description of your data if the
observations are widely dispersed.
• This is why we are often interested in the
Quartiles
Just like the Median only smaller
• Quartiles are just like the Median only on a
smaller scale. Instead of defining the midpoint
of the distribution they define the
break-points between:
– the first quarter and the second quarter
– the second quarter and the third quarter
(which is the Median, by the way)
– the third quarter and the fourth quarter
The Five-Number Summary
• Moore is very big on the use of the five-number
summary to describe data concisely.
• Minimum value
• Q1
• M
• Q3
• Maximum value
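A minimal sketch of a five-number summary in Python. It assumes the common textbook rule that Q1 is the median of the observations below the median's location and Q3 is the median of those above it; software packages often use slightly different quartile rules, so their answers can differ a little. The function name is hypothetical.

```python
import statistics

def five_number_summary(values):
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]          # observations below the median's location
    upper = ordered[half + n % 2:]  # observations above the median's location
    return (ordered[0],
            statistics.median(lower),    # Q1
            statistics.median(ordered),  # M
            statistics.median(upper),    # Q3
            ordered[-1])

# The whole numbers 0 through 10:
print(five_number_summary(range(11)))  # (0, 2, 5, 8, 10)
```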
You can graphically depict this with
a box plot
• Fortunately all the computer programs we
are employing can easily generate both
the numerical summary and the
accompanying box plots
• SPSS can generate all this and more
using its “Frequencies” and “Explore”
commands. Excel does the job just as
nicely.
Here is an example of an SPSS Box plot for before
tax income for men and women in Ontario from the
Survey of Household Spending
• Notice on the previous slide how the
distance from the first quartile to the
median and then to the third quartile is not
necessarily symmetrical and then that the
whiskers on the box plot are also not
symmetrical. This is an indication of skew
• Unlike the example in the book, my
whiskers indicate not the max and min values
but percentiles.
Here is the five number summary
for Men and Women
Spotting outliers
• Obviously our box plots provide an
excellent way to spot outliers.
• A statistic that can also help is the
“interquartile range”. This is just the
distance between quartile one and quartile three.
• When an observation lies more than 1½ times the
interquartile range above quartile three or
below quartile one, it is often considered to
be an outlier.
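The 1½ × IQR rule can be sketched in a few lines. The data and the quartile values below are hypothetical; Q1 and Q3 are passed in so the sketch does not depend on any particular quartile rule.

```python
def find_outliers(values, q1, q3):
    iqr = q3 - q1                  # interquartile range
    low_fence = q1 - 1.5 * iqr     # anything below this is suspect
    high_fence = q3 + 1.5 * iqr    # anything above this is suspect
    return [v for v in values if v < low_fence or v > high_fence]

# Hypothetical data with Q1 = 4.5 and Q3 = 8.5, so IQR = 4
# and the fences sit at -1.5 and 14.5.
data = [2, 4, 5, 6, 7, 8, 9, 30]
print(find_outliers(data, q1=4.5, q3=8.5))  # [30]
```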
While I used ratio level data…
• While I used ratio level data for my
example of the five-number summary, it
should be noted that there is nothing here
(quartiles, Median, maximum, minimum
value) that would not work with data
measured at the interval or ordinal level
Range
• Along with quartiles (which work when
data are measured at least at the ordinal
level) we must also remember to look at the
“Range”, the distance between the smallest
and largest values. Like quartiles, it needs
ordered values, so it does not work at the
nominal level.
Standard Deviation
• The best way to describe Standard Deviation
(notation S) is that it is the square root of
Variance (notation S²)
• So why do you need variance? A bit of math if
you look at the formula in your book.
The Formula for S²
• Variance is the sum
of the squared
distances of each
observation from the
mean over N-1 (N-1
being the degrees of
freedom).
S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
The Formula for S² involves a squaring
• We have to square these distances because,
otherwise, the positive and negative distances
from the mean would cancel out (they always sum
to zero) and there would appear to be no variance.
• The problem with variance is all that squaring
produces numbers that are very large and not
too intuitive to read on their own (though you will
see later that variance is an important tool and
even a building block for other things).
• Taking the square root produces a much
more usable number (S).
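Here is a hand-rolled sketch of variance and standard deviation that follows the steps described: take each distance from the mean, square it, add them up, divide by N-1, and then take the square root. The function names and the data are hypothetical.

```python
import math

def variance(values):
    n = len(values)
    mean = sum(values) / n
    deviations = [x - mean for x in values]
    # Unsquared, these deviations sum to zero, which is why we square them.
    squared = [d ** 2 for d in deviations]
    return sum(squared) / (n - 1)   # N-1 degrees of freedom

def std_dev(values):
    return math.sqrt(variance(values))

data = [1, 3, 5, 7, 9]           # hypothetical data, mean = 5
print(sum(x - 5 for x in data))  # 0
print(variance(data))            # 10.0
print(round(std_dev(data), 2))   # 3.16
```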
• Quite simply, when you know the mean (X̄)
and S, you can go up and down a list of numbers
and figure out which list is more
concentrated about its mean, which is
more diffuse, and which are similar.
If you want a quick example
Frequency   Value       Frequency   Value
1           0           1           0
1           1           1           2
1           2           1           4
1           3           1           6
1           4           1           8
1           5           1           10
1           6           1           12
1           7           1           14
1           8           1           16
1           9           1           18
1           10          1           20
N = 11                  N = 11
∑ = 55                  ∑ = 110
Mean = 5                Mean = 10
S² = 11                 S² = 44
S = 3.3                 S = 6.6
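If you would rather let software check the quick example, this sketch rebuilds the two lists and recomputes the summary figures with Python's standard `statistics` module. The variable names `first` and `second` are just labels chosen here.

```python
import statistics

first = list(range(0, 11))      # 0, 1, 2, ..., 10
second = list(range(0, 21, 2))  # 0, 2, 4, ..., 20

for data in (first, second):
    n = len(data)
    total = sum(data)
    mean = statistics.mean(data)
    s2 = statistics.variance(data)  # sample variance, divides by N-1
    s = statistics.stdev(data)      # square root of the variance
    print(n, total, mean, s2, round(s, 1))
```

Doubling every value doubles the mean and S, but it quadruples S², which is the squaring at work.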
But once again, keep in mind…
If the mean is susceptible to distortion from
extreme values, S is doubly so due to
all those squarings.
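A rough illustration of that warning, using made-up numbers: take the values 0 through 10 and swap the largest one for 100. Both the mean and S get pulled upward, but S is inflated by a much larger factor.

```python
import statistics

data = list(range(0, 11))         # 0, 1, ..., 10
with_extreme = data[:-1] + [100]  # hypothetical: replace 10 with 100

# How many times larger each statistic becomes after the swap.
mean_ratio = statistics.mean(with_extreme) / statistics.mean(data)
s_ratio = statistics.stdev(with_extreme) / statistics.stdev(data)

print(round(mean_ratio, 1))  # 2.6
print(round(s_ratio, 1))     # 8.7
```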