central tendency & variability

A bar chart of a quantitative variable with only a few
categories (called a discrete variable) communicates the
relative number of subjects with each of the possible
responses.
However, the bar chart does not graphically distinguish
between quantitative and qualitative variables.
Once we looked at the variable label and the values, we
would realize that this is a quantitative variable, but it
would take that extra work to understand it.
4/8/2016
Slide 1
If the quantitative variable has a large number of
categories (called a continuous variable), the bar chart
provides little information beyond the fact that there are
a lot of different values, and some occur more
frequently than others.
Slide 2
Histograms are the preferred graph
for quantitative variables. While the bars
resemble those of a bar chart, histograms
are distinguished by the absence of gaps
between consecutive bars.
For continuous variables, values
are grouped in equally spaced
intervals to convey a sense of
what the distribution looks like.
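The grouping step can be sketched in a few lines. This is only an illustration, not the procedure SPSS uses: the function name equal_width_bins, the choice of four intervals, and the ages data are all made up for the example.

```python
def equal_width_bins(values, n_bins):
    """Count how many values fall into each of n_bins equally spaced intervals.

    Assumes the values are not all identical (otherwise the width is zero).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The largest value is counted in the last interval
        # instead of opening an extra interval of its own.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


# Hypothetical ages, invented for this sketch.
ages = [18, 19, 21, 22, 22, 25, 31, 34, 35, 41, 47, 58]
edges, counts = equal_width_bins(ages, 4)
print(edges)
print(counts)
```

Each count becomes the height of one histogram bar, and because the intervals share their edges the bars touch with no gaps.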
Slide 3
While we used counts and percents to describe
the distribution of a qualitative variable, we use
statistical measures to describe the center,
spread, and shape of a quantitative variable.
Measures of central
tendency identify a
value in the center
of the distribution.
Measures of variability or
dispersion summarize how
the values for individual cases
are spread out around the
measure of central tendency.
Slide 4
There are two measures of the shape of the
distribution: skewness and kurtosis.
Many of the statistics we will use assume
that the distribution of a variable is bell-shaped, i.e. the normal distribution.
Skewness measures the symmetry of
the distribution on both sides of the
average score for the distribution.
Having overlaid a blue normal curve
on the distribution of this variable,
we can see that the bars on either
side of the red center line are similar
as one moves away from the center.
Kurtosis measures the degree to which the
distribution is peaked or flat compared to
the normal distribution. In this example, the
bars at the center of the distribution are
close to what would be expected for a
normal distribution and the frequencies
decrease as we move away from the center.
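As a rough sketch of what the two shape statistics compute, the function below uses the simple moment-based formulas. This is an assumption made for illustration: statistical packages such as SPSS typically report bias-adjusted sample versions, so their values will differ somewhat, especially in small samples. The function name and the data are hypothetical.

```python
def shape_statistics(values):
    """Return (skewness, excess kurtosis) from the central moments.

    Both are 0 for a normal distribution. Assumes the values are not all equal.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance (n denominator)
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skewness, excess_kurtosis


# Made-up data: one evenly spread set, one with a large atypical score.
flat_skew, flat_kurt = shape_statistics([1, 2, 3, 4, 5])
tail_skew, tail_kurt = shape_statistics([1, 1, 1, 2, 10])
print(flat_skew, flat_kurt)
print(tail_skew, tail_kurt)
```

The evenly spread scores give zero skewness and a negative kurtosis value, while the set with one large score gives positive skewness.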
Slide 5
Both of these variables have a problem with
skewness, caused by atypical scores at one
end of the distribution.
Skewness is characterized as negative or
positive, depending on which side, or tail, of
the distribution has the unusual scores.
This is an example of negative skewness,
where a few small scores have
elongated the left tail of the distribution.
The tail on the right is truncated.
This is an example of positive skewness,
where a few large scores have elongated
the right tail of the distribution. The tail
to the left is truncated.
Slide 6
Both of these variables have a problem with
kurtosis, caused by either too few cases in
the center of the distribution, or too many
cases in the center of the distribution.
This is an example of negative kurtosis,
where the scores are uniformly distributed
through the range of scores. The kurtosis
statistic will have a negative value.
This is an example of positive kurtosis,
where the scores are heavily concentrated
in the center of the distribution. The
kurtosis statistic will have a positive value.
Slide 7
There are two measures of central
tendency for quantitative variables: the
mean and the median.
The mean is the average score.
The median is the middle score, i.e. half of
the scores are higher and half are lower.
When the distribution has
minimal skewness and is
symmetric, both the red
mean line and the green
median line fall in the
center of the distribution.
While both measures reflect the center of
the distribution, the mean is the preferred
measure because it uses information for
all of the cases in the distribution.
For each measure of centrality, there is
a corresponding measure of spread.
The standard deviation is used with
the mean, and the interquartile range
is used with the median.
Slide 8
When skewing is present, the red
mean line moves away from the
center of the distribution as
identified by the green median line
in the direction of the skewness.
At some level of skewness, the median becomes more
effective at representing the center of the distribution.
The issue is selecting a defensible rule for deciding the
dividing line between acceptable skewness and
problematic skewness.
The rule of thumb that we will use
is that skewness less than -1.0 or
greater than +1.0 is problematic
and indicates that the median is
the preferred measure.
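The rule of thumb amounts to a simple decision, sketched below. The function name preferred_center and the example scores are hypothetical, and the skewness value is assumed to be read from the statistical output rather than computed here.

```python
import statistics


def preferred_center(values, skewness, cutoff=1.0):
    """Pick the measure of center using the +/-1.0 skewness rule of thumb."""
    if skewness < -cutoff or skewness > cutoff:
        # Problematic skewness: report the median.
        return "median", statistics.median(values)
    # Acceptable skewness: report the mean.
    return "mean", statistics.mean(values)


# Made-up scores; the skewness values are supplied by hand for illustration.
skewed_choice = preferred_center([1, 2, 3, 4, 100], skewness=1.5)
symmetric_choice = preferred_center([1, 2, 3, 4, 5], skewness=0.0)
print(skewed_choice)
print(symmetric_choice)
```

A skewness of exactly -1.0 or +1.0 is treated as acceptable in this sketch, since the rule as stated flags only values beyond those limits.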
Slide 9
Kurtosis does not affect the
location of the measure of
central tendency.
Kurtosis indicates that there are
either more cases than expected
in the middle of the distribution
(positive kurtosis), or fewer cases
than expected (negative kurtosis).
When the bars fall below the center
of the normal curve overlay, the
distribution has negative kurtosis, and
is referred to as a flat distribution.
The bars extending above the normal curve
overlay indicate that there is positive
kurtosis. A distribution with positive kurtosis
is characterized as a “peaked distribution.”
Slide 10
• The homework problems on central tendency and variability
focus on describing the distribution of quantitative variables.
• The counts and percents that we used for qualitative variables
are not effective for quantitative variables that can have many
different scores in the distribution.
• We describe the distribution of quantitative variables with
summary statistics that try to communicate the value on
which the distribution is centered, the spread of the values
from the center of the distribution, the symmetry of the
distribution around the center measure, and the degree to
which the distribution is bell-shaped or flat.
Slide 11
• The center, or central tendency, of the distribution is usually
represented by the mean (average score) or the median
(middle score) of the distribution.
• The standard deviation is used as the measure of spread
(variability or dispersion) that is paired with the mean. It
measures the typical distance between the mean and each of
the scores in the distribution (the square root of the average
squared deviation from the mean).
• The range and interquartile range are used to measure the
spread around the median. The range is the difference
between the highest score and lowest score. The interquartile
range is the difference between the highest and lowest score
when the smallest 25% and the largest 25% of the scores are
removed from the distribution.
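The measures in these bullets can be computed with Python's standard statistics module, as in the sketch below. The scores are made up, and the quartiles use that module's default method, which is an assumption here: other software, including SPSS, may use a slightly different quartile convention and report a slightly different interquartile range.

```python
import statistics

# Made-up scores for illustration.
scores = [2, 4, 4, 5, 6, 7, 8, 9, 10, 15]

mean = statistics.mean(scores)
sd = statistics.stdev(scores)  # sample standard deviation (n - 1 denominator)

median = statistics.median(scores)
score_range = max(scores) - min(scores)  # highest score minus lowest score

# Cut points that split the sorted scores into four equal groups.
q1, q2, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1  # spread of the middle 50% of the scores

print(mean, sd)
print(median, score_range, iqr)
```

The mean and standard deviation are reported together, and the median is reported with the range or interquartile range.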
Slide 12
• Both the mean and the median can be computed for the
values in the distribution of any quantitative variable.
• However, the degree to which one or the other is a “good”
measure or indicator of the central tendency of a distribution
differs with the shape of the distribution, specifically the
symmetry of the distribution as measured by skewness.
• If the distribution is symmetric, both the mean and the
median fall in the center of the distribution. The mean is the
preferred measure because it uses all of the cases in the
distribution in its calculation, and because it can be used in a
broader range of statistical tests.
• If the distribution is not symmetric, the median stays in the
middle of the distribution, but the mean is pulled away from
the center toward one of the tails of the distribution.
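A small made-up example shows the pull on the mean. The two lists below are hypothetical and differ only in their largest score.

```python
import statistics

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 50]  # same scores, but the largest is atypical

sym_mean = statistics.mean(symmetric)
sym_median = statistics.median(symmetric)
skew_mean = statistics.mean(right_skewed)
skew_median = statistics.median(right_skewed)

print(sym_mean, sym_median)
print(skew_mean, skew_median)
```

The median is the same in both lists, while the mean of the second list moves toward the large score in the right tail.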
Slide 13
• The degree of symmetry of a distribution of scores for a
quantitative variable can vary quite widely.
These six histograms
show progressively
increasing skewness.
At what point do we
choose the median
over the mean?
Slide 14
• There is no universally accepted criterion for the amount of
skewness that dictates a preference for the median.
• Most agree that we should be concerned with substantial
skewness and ignore minor departures from symmetry, but there
is no agreement on what counts as substantial.
• One rule of thumb indicates that a distribution has a
substantial skewness problem when the size of the skew
statistic is more than twice its standard error (in the SPSS output).
• The rule of thumb that I have used and which will be used for
the problems is that skewness is a problem if it is less than -1
for negatively skewed distributions or greater than +1 for
positively skewed.
Slide 15
Skewness of the six histograms:
Top row: 0.35, 0.84, 0.94
Bottom row: 1.09, 1.33, 1.86
By my rule of thumb, we would use the mean as the measure of
central tendency for the top row, and the median for the bottom
row. That the rule is arbitrary is shown by the similarity of the
last chart on the top row to the first chart on the bottom row.
Slide 16
One rule of thumb suggests that when the size of
the skewness statistic is more than 2 times the value of the
skewness standard error, the median is preferred.
For this variable, the statistic (.401) is more than
twice the standard error (.153), so the median
would be preferred.
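The comparison can be sketched as below, using the values quoted on this slide. The standard error formula is the one commonly given for sample skewness under normality. It is assumed, not confirmed by the slides, that this is what the SPSS output reports. Both function names are hypothetical.

```python
import math


def skewness_standard_error(n):
    """Commonly used standard error of sample skewness for n cases (n > 2).

    Assumption: this matches the standard error printed in the output.
    """
    return math.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))


def exceeds_twice_se(skewness, standard_error):
    """True when the size of the skewness is more than twice its standard error."""
    return abs(skewness) > 2 * standard_error


# Values quoted on the slide: statistic .401, standard error .153.
slide_result = exceeds_twice_se(0.401, 0.153)
print(slide_result)
print(skewness_standard_error(100))
```

Because the standard error shrinks as the number of cases grows, this rule flags quite modest skewness in large samples. The same skewness of about 0.40 is well inside the -1.0 to +1.0 range.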
Slide 17
Another rule of thumb uses only the value of the
skewness statistic. When the skewness is smaller
than -1.0 or larger than +1.0, the distribution is
badly skewed and the median is a better measure
of central tendency. This is the rule of thumb
used in our problems.
The skewness of this distribution (0.40) is
in the allowable range, making the mean
and standard deviation the preferred
measures of center and spread.
Slide 18