Measures of Central Tendency
Descriptive Statistics
the everyday notions of
central tendency
Usual
Customary
Most
Standard
Expected
normal
Ordinary
Medium
commonplace
NY Times, 10/24/2010
Stories vs. Statistics
By JOHN ALLEN PAULOS
Overview
What are descriptive statistics?
A bit of terminology/notation
Measures of Central Tendency
Measures of Variability
Mean, Mode, Median
Ranges, Standard Deviations
The Normal Curve
Terminology/Notation
A data distribution = A set of data/scores
(the whole thing)
Ex. 1, 2, 4, 7
X = A raw, single score (e.g., 2 from the set above)
∑ = Summation (added up)
∑X = 14 (each individual score added up)
n = sample size (distribution size, or
number of scores)
n = 4 (from above)
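The notation above maps directly onto two built-in operations. A minimal sketch, assuming Python (the slides do not name a language) and with illustrative variable names:

```python
# The example distribution from the slide.
scores = [1, 2, 4, 7]

sum_x = sum(scores)  # ∑X: every individual score added up
n = len(scores)      # n: the number of scores in the distribution

print(sum_x, n)  # 14 4
```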
Descriptive Statistics
Descriptive statistics are the side of
statistics we most often use in our
everyday lives
Realize that most observations/data are
too “large” for a human to take in and
comprehend – we must “reduce” them
How can we summarize what we see?
Example – Grades/Registrar
Descriptive Statistics
Descriptive statistics = describing the
data
n = 50, a test score of 83%
Where does it fit in the class??
Making sense out of chaos
Descriptive Statistics
Transform a set of numbers or
observations into indices that describe
or characterize the data
“Summary statistics”
A large group of statistics that are used in
all research manuscripts
Even the most complex statistical tests and
studies start with descriptive statistics
Descriptive Statistics
Measurement Scales
• Nominal
• Ordinal
• Interval
• Ratio
Graphic Portrayals
• Frequencies
• Histograms
• Bar graphs
• Normal distribution
Central Tendency
• Mean
• Median
• Mode
Relationship
• Scatterplot
• Correlation
• Regression
Variability
• Range
• Standard deviation
• Standardized scores
Descriptive Statistics
Descriptive statistics usually accomplish
two major goals:
1) Describe the central location of the data
2) Describe how the data are dispersed
about that point
In other words, they provide:
1) Measures of Central Tendency
2) Measures of Variability
Measures of Central Tendency
What SINGLE summary value best
describes the CENTRAL location of an
entire distribution?
Mode: which value occurs most often
Median: the value above and below which
50% of the cases fall (the middle; 50th
percentile)
Mean: mathematical balance point;
arithmetic/mathematical average
Mode
Most frequent occurrence
What if the data were…?
17, 19, 20, 20, 22, 23, 25, 28
17, 19, 20, 20, 22, 23, 23, 28
Problem: a set of numbers can be
bimodal or trimodal, depending on the
scores
Not a stable measure
Ex. 17, 19, 20, 22, 23, 28, 28
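The three data sets on this slide can be checked with a short sketch. It assumes Python's standard `statistics` module (not something the slides use); the variable names are illustrative.

```python
from statistics import multimode

# multimode returns every value tied for the highest frequency,
# so a bimodal set yields two values instead of one.
one_mode = multimode([17, 19, 20, 20, 22, 23, 25, 28])
two_modes = multimode([17, 19, 20, 20, 22, 23, 23, 28])
shifted = multimode([17, 19, 20, 22, 23, 28, 28])

print(one_mode)   # [20]
print(two_modes)  # [20, 23]
print(shifted)    # [28]
```

Changing one or two scores moves the mode from 20 to 28, which is the instability the slide refers to.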
Median
Rank numbers, pick middle one
What if data were…?
17, 19, 20, 23, 23, 28
Solution: add up two middle scores, divide
by 2 (=21.5)
Best measure for an asymmetrical (i.e., skewed)
distribution; not sensitive to extreme scores
Ex. 17, 19, 20, 23, 23, 428
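A brief sketch of the even-count rule and the outlier example, assuming Python's `statistics.median` (an illustration, not part of the original slides):

```python
from statistics import median

even_count = [17, 19, 20, 23, 23, 28]
with_outlier = [17, 19, 20, 23, 23, 428]

# With an even number of scores, median averages the two middle ones.
median_plain = median(even_count)      # (20 + 23) / 2
median_outlier = median(with_outlier)  # the extreme score 428 has no effect

print(median_plain, median_outlier)  # 21.5 21.5
```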
Mean = X̄
Add up the numbers and divide by the
sample size (the number of numbers!)
X̄ = ∑X / n
Try this one…
2,3,5,6,9
(2+3+5+6+9) / 5 = 25 / 5 = 5
(Usually) best measure of the three –
uses the most information (all values
from distribution contribute)
Characteristics of the Mean
Balance point
Point around which deviations sum to zero
Deviation = X – X̄
For instance, if scores are 2,3,5,6,9
Mean is 5
Sum of deviations: (-3)+(-2)+0+1+4=0
∑ (X – X̄) = 0
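The balance-point property can be verified numerically. This is a minimal sketch in Python with illustrative names; for data whose mean is not exactly representable in floating point, the sum may come out as a tiny nonzero value rather than exactly 0.

```python
scores = [2, 3, 5, 6, 9]

mean = sum(scores) / len(scores)         # ∑X / n
deviations = [x - mean for x in scores]  # X minus the mean, for each score
total = sum(deviations)                  # the deviations cancel out

print(mean)        # 5.0
print(deviations)  # [-3.0, -2.0, 0.0, 1.0, 4.0]
print(total)       # 0.0
```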
Characteristics of the Mean
Affected by extreme scores
Example 1
Scores 7, 11, 11, 14, 17
Mean = 12, Mode and Median = 11
Example 2
Scores 7, 11, 11, 14, 170
Mean = 42.6, Mode & Median = 11
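The two examples can be compared side by side. The sketch below assumes Python's `statistics` module; `summarize` is a hypothetical helper introduced only for this illustration.

```python
from statistics import median, mode

def summarize(scores):
    """Return (mean, median, mode) for a list of scores."""
    return sum(scores) / len(scores), median(scores), mode(scores)

example_1 = summarize([7, 11, 11, 14, 17])
example_2 = summarize([7, 11, 11, 14, 170])  # one extreme score

print(example_1)  # (12.0, 11, 11)
print(example_2)  # (42.6, 11, 11)
```

Only the mean moves when 17 becomes 170; the median and mode stay at 11.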
Characteristics of the Mean
Balance point
Affected by extreme scores
Appropriate for use with interval or ratio
scales of measurement
More stable than Median or Mode when
multiple samples drawn from the same
population
Basis for inferential stats
Guidelines to Choose Measure
of Central Tendency
Mean is preferred because it is the basis
of inferential statistics
Median may be better for skewed data
Distribution of wealth in the US – ex.
annual household income in Washington
state for 2000: mean=$76,818;
median=$42,024
Mode to describe average of nominal
data (eye color, hair color, etc…)
Normal Distribution
[Figure: frequency distribution of MLB batting averages over a 3-year span (min. 100 AB); x-axis: Scores; y-axis: Frequency (how often a score occurs); Mean = 0.267, n = 1291]
Normal Distribution
[Figure: normal curve with the Mode, Median, and Mean marked; x-axis: Scores]
“Normal” distribution
indicates the data are
perfectly symmetrical
Positively skewed distribution
[Figure: positively skewed curve of NFL salaries, 2011, with the Mode, Median, and Mean marked; x-axis: Scores]
Negatively skewed distribution
[Figure: negatively skewed curve with the Mode, Median, and Mean marked; x-axis: Scores]
Relationship among the measures of central
tendency (MCT) & shape of distribution
Alaska’s average elevation of
1900 feet is less than that of Kansas.
Nothing in that average suggests
the 16 highest mountains in
the United States are in Alaska.
Averages mislead, don’t they?
Grab Bag, Pantagraph, 08/03/2000
Variability
Measures of dispersion or spread
The only thing
constant is variation.
the notions of variability
• Unusual
• Peculiar
• Strange
• Original
• Extreme
• Special
• Unlike
• Deviant
• Dissimilar
• Different
NY Times, 10/24/2010
Stories vs. Statistics
By JOHN ALLEN PAULOS
Variability defined
Measures of Central Tendency provide a
summary level of the data
Variability recognizes that scores vary across
individual cases
i.e., the mean or median may not be an actual
score in your distribution
Variability quantifies the spread of
performance
How scores vary around mean/mode/median
To describe a distribution
1) Measure of Central Tendency
Mean, Mode, Median
2) Measure of Variability
Multiple measures
Range, Interquartile range, Semi-Interquartile
Range
Standard Deviation
Range
Range = Difference between the high and low scores
Range = Max score – Min score
Ex. # of hours spent watching TV/week:
2, 5, 7, 7, 8, 8, 10, 12, 12, 15, 17, 20
Range = 20 – 2 = 18
Very susceptible to outliers
Doesn’t indicate anything about variability
around the mean/central point
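A short sketch of the range calculation and its sensitivity to outliers, in Python. The helper name `score_range` and the extra score of 60 hours are hypothetical, added only to illustrate the point.

```python
tv_hours = [2, 5, 7, 7, 8, 8, 10, 12, 12, 15, 17, 20]

def score_range(scores):
    """Range = maximum score minus minimum score."""
    return max(scores) - min(scores)

print(score_range(tv_hours))  # 18

# Hypothetical: one viewer reporting 60 hours/week more than triples the range,
# even though the other twelve scores are unchanged.
print(score_range(tv_hours + [60]))  # 58
```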
Semi-Interquartile range
What is a quartile??
Divide sample into 4 parts of equal size
Q1, Q2, Q3 = Quartile Points
Interquartile Range (IQR) = Q3 – Q1
Difference between the highest and lowest
quartile points
SIQR = IQR / 2
Related to the Median…prevents
outliers from overly skewing the measure
For ordinal data or skewed interval/ratio data
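As a rough sketch, the quartile points, IQR, and SIQR can be computed with Python's `statistics.quantiles`. Reusing the TV-hours data from the Range slide is an assumption made for illustration. Textbooks and software differ in how they interpolate quartile points, so other methods may give slightly different values for the same data.

```python
from statistics import quantiles

tv_hours = [2, 5, 7, 7, 8, 8, 10, 12, 12, 15, 17, 20]

# n=4 asks for the three cut points that split the data into quarters.
# The default method ("exclusive") interpolates between neighboring scores.
q1, q2, q3 = quantiles(tv_hours, n=4)

iqr = q3 - q1   # interquartile range
siqr = iqr / 2  # semi-interquartile range

print(q1, q2, q3)  # 7.0 9.0 14.25
print(iqr, siqr)   # 7.25 3.625
```

Note that Q2 is the median, which is why these measures pair naturally with it.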
BMD and walking
Quartiles based on miles
walked/week
Krall et al, 1994, Walking is related to
bone density and rates of bone loss.
AJSM, 96:20-26
Notes:
Skewed Distribution?
95th Percentile?
50th Percentile vs Median?
Variation itself is nature's only irreducible essence.
Stephen Jay Gould
Standard Deviation
Most commonly accepted measure of
spread
1. Compute the deviations of all numbers from
the mean
2. Square each of the deviations and THEN sum the squares
3. Divide by the number of deviations
4. Finally, take the square root
SD = √( ∑(X – X̄)² / n )
Standard Deviation
Distribution = 1, 3, 5, 7
X̄ = 16 / 4 = 4
1) Compute Deviations = -3, -1, 1, 3
2) Square Deviations = 9, 1, 1, 9
3) Sum Squared Deviations = 20
4) Divide by n = 20 / 4 = 5
5) Take square root = √5 ≈ 2.24
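The worked example can be reproduced step by step. This is a sketch in Python with illustrative names. It divides by n, as the slide does; `statistics.pstdev` uses the same divisor, whereas `statistics.stdev` divides by n – 1 and would give a larger value.

```python
from math import sqrt
from statistics import pstdev

scores = [1, 3, 5, 7]
n = len(scores)
mean = sum(scores) / n                   # 16 / 4 = 4.0

deviations = [x - mean for x in scores]  # step 1: -3, -1, 1, 3
squared = [d ** 2 for d in deviations]   # step 2: 9, 1, 1, 9
sum_sq = sum(squared)                    # step 3: 20
variance = sum_sq / n                    # step 4: 20 / 4 = 5
sd = sqrt(variance)                      # step 5: square root of 5

print(round(sd, 2))              # 2.24
print(round(pstdev(scores), 2))  # 2.24, same divide-by-n definition
```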
Key points about SD
SD small → data clustered around the mean
SD large → data scattered from the mean
Affected by extreme scores (just like
mean)…oftentimes called “outliers”
Consistent (more stable) across samples from
the same population
Just like the mean - so it works well with
inferential stats (where repeated samples are
taken)
SD Example
Three NFL quarterbacks with similar QB
ratings in 2006:
Matt Hasselbeck (SEA) = 76.0
Rex Grossman (CHI) = 73.9
Brett Favre (GB) = 72.7
Note: QB rating involves a complex formula accounting for passing
attempts, completions, yards, touchdowns, and
interceptions…100+ is considered outstanding & 70-80 is average
All appear to have had very similar,
somewhat mediocre seasons as QB’s
SD Example
Let’s look at the SD of their game-by-game QB ratings:
Matt Hasselbeck (SEA) = 29.97
Rex Grossman (CHI) = 47.60
Brett Favre (GB) = 27.81
Grossman had, by far, the most
variability (i.e. inconsistency) in his
game-by-game performances…is this
good or bad?
Clinical Use of SD
SD and the normal curve
The following concepts are critical to
your understanding of how descriptive
statistics work
Remember – a “normal” curve is
perfectly symmetrical. This is not
typical, but usually data are almost
normal…
SD and the normal curve
X̄ = 70, SD = 10
About 68% of scores fall within 1 SD of the mean
[Figure: normal curve with axis marks at 60, 70, and 80; 34.1% of scores lie on each side of the mean]
The standard deviation and
the normal curve
About 68% of scores fall between 60 and 80
X̄ = 70, SD = 10
[Figure: normal curve with axis marks at 60, 70, and 80; 34% of scores lie on each side of the mean]
The standard deviation and
the normal curve
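About 95% of scores fall within 2 SD of the mean
X̄ = 70, SD = 10
[Figure: normal curve with axis marks at 50, 60, 70, 80, and 90; 34.1% within 1 SD on each side of the mean, a further 13.6% between 1 and 2 SD on each side]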
The standard deviation and
the normal curve
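About 95% of scores fall between 50 and 90
X̄ = 70, SD = 10
[Figure: normal curve with axis marks at 50, 60, 70, 80, and 90; 34.1% within 1 SD on each side of the mean, a further 13.6% between 1 and 2 SD on each side]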
The standard deviation and
the normal curve
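About 99.7% of scores fall within 3 SD of the mean
X̄ = 70, SD = 10
[Figure: normal curve with axis marks at 40, 50, 60, 70, 80, 90, and 100; 34.1% within 1 SD on each side of the mean, 13.6% between 1 and 2 SD on each side, 2.3% beyond 2 SD on each side]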
The standard deviation and
the normal curve
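About 99.7% of scores fall between 40 and 100
X̄ = 70, SD = 10
[Figure: normal curve with axis marks at 40, 50, 60, 70, 80, 90, and 100; 34.1% within 1 SD on each side of the mean, 13.6% between 1 and 2 SD on each side, 2.3% beyond 2 SD on each side]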
What about X̄ = 70, SD = 5?
What approximate percentage of scores
fall between 65 & 75?
…1SD below + 1SD above = 68%
What range includes about 99.7% of all
scores?
…3SD below to 3SD above = 55 to 85
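Both answers can be checked with a small sketch. It assumes the scores are exactly normally distributed and uses Python's `statistics.NormalDist`; `sd_band` and `share_within` are hypothetical helper names introduced here.

```python
from statistics import NormalDist

def sd_band(mean, sd, k):
    """Scores lying k standard deviations below and above the mean."""
    return mean - k * sd, mean + k * sd

def share_within(mean, sd, low, high):
    """Proportion of a normal distribution falling between low and high."""
    dist = NormalDist(mu=mean, sigma=sd)
    return dist.cdf(high) - dist.cdf(low)

band_1 = sd_band(70, 5, 1)  # (65, 75)
band_3 = sd_band(70, 5, 3)  # (55, 85)

share_1 = share_within(70, 5, *band_1)  # roughly 0.68
share_3 = share_within(70, 5, *band_3)  # roughly 0.997

print(band_1, f"{share_1:.1%}")
print(band_3, f"{share_3:.1%}")
```

The proportions depend only on the number of SDs, not on the particular mean and SD, which is why the 68% and 99.7% figures carry over from the SD = 10 slides.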
Interpreting The Normal Table
Area under Normal Curve
Specific SD values (z) include certain
percentages of the scores
Values of Special Interest
1.96 SD = 47.5% of scores (47.5 + 47.5 = 95%)
2.58 SD = 49.5% of scores (49.5 + 49.5 = 99%)
i.e., 95% of scores fall within 1.96 standard deviations
of the mean (1.96 above and 1.96 below)
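The two table values can be reproduced without a printed table. The sketch below uses Python's `statistics.NormalDist` as a stand-in for the normal table; `area_mean_to_z` is a hypothetical helper name.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, SD 1, so scores are already z values

def area_mean_to_z(z):
    """Proportion of scores between the mean and z SDs above it."""
    return standard_normal.cdf(z) - 0.5

for z in (1.96, 2.58):
    one_side = area_mean_to_z(z)
    print(f"z = {z}: {one_side:.1%} on one side, {2 * one_side:.1%} on both")

# Going the other way: which z leaves 2.5% in the upper tail?
z_95 = standard_normal.inv_cdf(0.975)
print(round(z_95, 2))  # 1.96
```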
IQ
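68% have an IQ between 85 and 115
X̄ = 100, SD = 15
[Figure: normal curve with axis marks at 55, 70, 85, 100, 115, 130, and 145; 34.1% within 1 SD on each side of the mean, 13.6% between 1 and 2 SD on each side, 2.3% beyond 2 SD on each side]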
MLB players’
batting averages
over a 3-year
span (min. 100
at bats)
~95% of players
have an average
between 0.196
and 0.337
Next Week…
We will utilize our understanding of
descriptive statistics concepts, including
central tendency, variability, and the
normal curve, to examine standardized
scores
Homework = Cronk 3.1 – 3.4
Bring calculator to class
In-class activity 2…