Transcript
Welcome to the statistics module series.
Foundations
Of Research
1
This series has five modules:
Using numbers in science (you are here)
Central tendency and variance
Z and the normal distribution
Calculating a t score
Testing t: the Central Limit Theorem
[Figure: bar chart with a 0 to 40 vertical axis comparing substance-use categories (any substance, alcohol, marijuana, other drug, ...) across three groups: African-Am., n = 430; Latino, n = 130; White, n = 183.]
Dr. David J. McKirnan
The University of Illinois
Chicago
[email protected]
Want to print a copy of a module for notes?
When you open a PowerPoint module it saves a copy in
your downloads folder in "show" format (ppsx).
To convert it to a printing format (pptx):
Click ‘esc’ to leave the open module
Open PowerPoint, then browse to and open the module
Click “Print”; in the dialogue box click “Print what?”.
Select “Handouts (3 slides per page)”
Then, come back and re-run the show
Numbers and Science
Numbers in science
Number scales & distributions
Central Tendency: Mode, Median, Mean
This should open as a PowerPoint “Show”.
If it does not, please go to “slide show” and click “run show”.
Click through it by pressing any key.
Focus & think about each point; do not just passively click.
You will have a quiz at the end of each section.
Psychology 242, Dr. McKirnan
Why are Numbers so important to Science?
Numbers make our measures clear and specific
They help us Operationally Define our variables
They are the common language of science, commerce, and the like.
They allow us to compare research outcomes
Across groups or conditions within a study
Across studies
They allow us to perform statistical operations
We will describe basic statistics as we go…
Costs & benefits of numbers
Virtues of numerical data:
Representing data by numbers helps clarify and
simplify communication
Can show general trend or “common denominator” in
the data
Danger of numerical data:
Can over-simplify / ignore individual differences
Reducing complex natural processes – physical or
behavioral – to simple numbers may misrepresent
reality.
The danger of reducing complex processes to simple numbers:
A. Simple descriptive numbers are less important than testing hypotheses & theory
• Testing why or how something works is more important than
collecting simple numbers.
B. Where does the single number come from?
• The choice to emphasize one statistic over another can reflect
personal bias or social pressure.
• Selection of some statistics (such as the average value in highly
skewed data such as income) can be politically biased.
C. Biased search for confirmatory measure.
• “Cherry picking” results that support your hypothesis, ignoring
disconfirming data (e.g., Pharmaceutical Co. research)
Political (mis)uses of scientific data
D. Ambiguity in the interpretation of data:
Diagnoses of autism have markedly increased. Is this a shift in the
actual disorder, or a lower threshold for reporting cases?
E. Simply ignoring “inconvenient” data:
Project DARE continues as a politically favored drug
abuse prevention program, despite data showing it to be
ineffective.
Data indicating climate change are ignored or distorted by
many political commentators.
Thus…
Representing nature in terms of numbers is
important to scientific progress
Key for operational definitions
Provide powerful analytic & communication tools
Understanding ANY statistic requires that we understand how it was derived
What context was the statistic drawn from?
What else do we need to know to evaluate it?
Statistical statements can be literally / technically “true” but misleading.
“There are lies, damned lies, and statistics.” -- Mark Twain
Political or commercial misuse of numerical / statistical information is common and problematic.
Some basic terms:
Variable: Characteristic or attribute with different levels or qualities (e.g., age, speed, effort, attention).

Participant      Age
Jo               32
Sally            45
Bill             22
Sam              37
Mary Louise…     34…
Value: One level or state of a variable (e.g., age = 21, ethnicity = Latino).
Distribution: Set of scores (each with its own value) for one variable (e.g., distribution of student ages).
[Figure: frequency distribution of participant ages. X axis = Age (10 to 47…); Y axis = Number of participants (1 to 8). Each participant from the table (Jo, Sally, Bill, Sam, Mary Louise…) is plotted as an X at his or her age.]
Central Tendency: Primary “drift” of a set of scores (Mean, Mode, or Median).
Variance: Measure of how much the scores in a distribution differ from each other.
[Figure: frequency distribution of ages built up from X marks. X axis = Age (10 to 47…); Y axis = Number of participants (1 to 8).]
Some basic terms:
Variable: Characteristic or attribute with different levels or qualities (e.g., age, speed, effort, attention).
Value: One level or state of a variable (e.g., age = 21, ethnicity = Latino).
Distribution: Set of scores (each with its own value) for one variable (e.g., distribution of student ages).
Central Tendency: Primary “drift” of a set of scores (Mean, Mode, or Median).
Variance: Measure of how much the scores in a distribution differ from each other.
Parameter: Mathematical characteristic of (variable that describes an aspect of…) a population.
Statistic: Mathematical characteristic of a sample.
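The parameter / statistic distinction can be made concrete with a small sketch. The population below is hypothetical (randomly generated ages, not data from this module); the point is only that the parameter is computed on the whole population, and the statistic on a sample drawn from it.

```python
import random
import statistics

# Hypothetical population of ages (illustrative values, not data from the module).
random.seed(242)  # fixed seed so the sketch is repeatable
population = [random.randint(18, 60) for _ in range(10_000)]

# Parameter: a characteristic of the whole population.
population_mean = statistics.mean(population)

# Statistic: the same characteristic computed on a sample drawn from it.
sample = random.sample(population, 50)
sample_mean = statistics.mean(sample)

print(f"Parameter (population mean age): {population_mean:.2f}")
print(f"Statistic (sample mean age, n = {len(sample)}): {sample_mean:.2f}")
```

The two numbers will usually be close but not identical; inferential statistics deals with that gap.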
Research questions, hypotheses & designs
Numbers in science
Number scales & distributions
Central Tendency: Mode, Median, Mean
Types of numerical scales
Continuous scales:
Ratio: Scales of physical properties. Used for physical description: temperature, elapsed time, height
Interval: Arbitrary or relative ψ scales. Common in behavioral research, e.g., attitude or rating scales.
Ordinal: Rank order with non-equal intervals. Simple finish place, rank in organization, most, 2nd most, 3rd most...
Categorical: Categories only. Typical of inherent categories: ethnic group, gender, zip code
Scale details
Ratio Scales; e.g., temperature, height, blood pressure...
Each scale value describes a physical reality
The zero point is grounded in a physical property
Batting average: % hits / at bats; 0 hits is meaningful
Water always freezes at 32°F
Each scale point is absolute
32° has an absolute meaning, not just relative to other temperatures.
Each interval is continuous & exactly equal:
30°F to 40°F ≡ 90°F to 100°F
Used for correlational designs, or as dependent variable in experiments.
Interval scales
Interval Scales; e.g., attitude rating scale, IQ…
Not grounded in physical reality; “ψ reality” only
Correlational designs, dependent variable in experiments.
Does not have a “0” point; each scale point is relative
Intervals are designed to be the same… but may not be.
[Figure: 7-point rating scale for agreeing with an attitude statement (“Science has changed my life…”), from 1 = “Does not agree at all” to 7 = “Strongly agree”. One step on the scale is marked as a bigger ψ step than another step of the same nominal size.]
Ordinal Scales; e.g., rank order
Also not grounded in physical reality
Relative “strength” or placement on a scale
Primarily correlational / measurement designs.
No “0” point; each scale point is relative
Provides no information about intervals
Intervals between each scale point are arbitrary…
(Many interval scales may actually be ordinal…)
[Figure: course components (Text, discussions, lecture, Instructor) ranked from “Least preferred” to “Most preferred”.]
Categorical Scales
Simple groups / categories
Measurement studies (e.g., epidemiology)
Independent Variable in an experiment.
Binary (“nominal”) measures: 2 levels only
Experimental v. control group
“Drug user” v. not
Categorical: >2 levels
Ethnic group, City neighborhood, primary drug used…
Groups may be inherent (gender, ethnicity) or
arbitrary (experimental group…).
Two central ways of using numbers.
Descriptive Statistics:
Simple quantitative description or summary.
Batting average in baseball
Grade-point average
Univariate Analysis: examines cases in terms of a
single variable.
Frequency distributions group data into categories.
E.g., drug use by Age categories, Ethnic groups…
Inferential Statistics:
Conduct analyses on samples
Compare groups (experimental v. control…)
Characterize a sample in epidemiological research
Use statistical operations to generalize the results
to a population.
What is a Frequency Distribution?
We plot each data point (a score for one person…) on an axis of scores.
Scores are on the “X” axis
Frequencies on the “Y” axis.
We show the shape of a distribution by drawing a curve over the cluster of data points…
[Figure: frequency distribution built from X marks. X axis = Scores (1 to 7); Y axis = Frequency (1 to 17). Callouts give the number of participants who got a particular score.]
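A frequency distribution like the one pictured can be sketched in a few lines. The scores here are hypothetical (a 1-to-7 scale chosen for illustration, not the data behind the slide); each X is one participant, stacked over his or her score.

```python
from collections import Counter

# Hypothetical scores on a 1-to-7 scale (illustrative, not the slide's data).
scores = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6, 6, 7]

frequencies = Counter(scores)          # score -> number of participants
tallest = max(frequencies.values())    # highest frequency sets the Y axis

# Build rows from the top down: an X wherever a score has at least that many cases.
rows = []
for level in range(tallest, 0, -1):
    marks = " ".join(
        "X" if frequencies.get(score, 0) >= level else " " for score in range(1, 8)
    )
    rows.append(f"{level:>2} | {marks}")
rows.append("   +" + "-" * 14)
rows.append("     " + " ".join(str(score) for score in range(1, 8)))

print("\n".join(rows))
```

Frequencies run up the left side (the “Y” axis) and scores along the bottom (the “X” axis).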
The normal distribution
In a “normal” distribution:
Scores are symmetrical around the mid-point
The variance in scores shows the classic “Bell Shape”
[Figure: bell-shaped frequency distribution built from X marks. X axis = Scores (1 to 7); Y axis = Frequency (0 to 15).]
Less “normal” distributions
Score distributions may depart from the Bell Curve
Scores here are still symmetrical
The variance is “flat” or irregular
[Figure: symmetrical but flat, irregular frequency distribution. X axis = Scores (1 to 7); Y axis = Frequency (0 to 15).]
Skewed distributions
Distributions may be very “non-normal”
Here scores are not symmetrical
This is called a “skewed” distribution; scores load up on one side of the scale.
This is a positive skew; the “tail” of the distribution goes toward higher values
[Figure: positively skewed frequency distribution with the center of the distribution marked. X axis = Scores (1 to 7); Y axis = Frequency (0 to 15).]
Less “normal” distributions
Other data may show a negative skew.
[Figure: negatively skewed frequency distribution, with the tail toward lower values. X axis = Scores (1 to 7); Y axis = Frequency (0 to 15).]
Descriptive / Univariate Statistics
How angry are you at the banks that crashed the economy?
1. Gather raw attitude data: clicker responses in class
2. Show the number of students at each level of the variable “angry”…
Descriptive statistics example
3. Compile the descriptive statistics:
4. Show the data as a frequency distribution…
Summary: Introduction & number scales
Numbers are crucial to science
Operational definitions
Communication
Statistical analyses
Types of numerical scales
Ratio
Interval
Ordinal
Categorical
Numbers fall into regular frequency distributions
Research questions, hypotheses & designs
Using numbers in science
Number scales & frequency distributions
Central Tendency: Mode, Median, Mean
Images from the FACE RESEARCH LAB, Ben Jones and Lisa DeBruine, School of Psychology, University of Aberdeen.
Central tendency
Central tendency – the general “drift” in a set
of scores or values – can reflect any physical or ψ
process or numerical scale.
Central tendency reduces the variance in a
sample of values to one core value.
The following is an example from the FACE
RESEARCH LAB; Ben Jones and Lisa DeBruine, School
of Psychology, University of Aberdeen.
What is Central Tendency (Mean or average…)
[Figure: faces of the different lab members, showing the variance among them.]
[Figure: the average (“Mean” or M) face of the lab members.]
Describing data
We characterize the general trend or character of data using two key statistics:
1. Central tendency or general “drift” of the scores.
Mode: most common score
Median: middle of the distribution
Mean: average score
2. Variance: how diverse the scores are (how much they vary from each other).
Range: …from the highest to lowest score
Standard deviation: “average” amount the scores vary from the Mean score
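As a rough illustration of the two variance measures, the sketch below uses the five ages listed in the Participant / Age table earlier in the module. It assumes the sample form of the standard deviation (dividing by n - 1); the module has not yet said which form it uses.

```python
import statistics

# The five ages listed in the Participant / Age table earlier in the module.
ages = [32, 45, 22, 37, 34]

mean_age = statistics.mean(ages)
score_range = max(ages) - min(ages)   # highest score minus lowest score

# Sample standard deviation (divides by n - 1); this choice is an assumption.
# statistics.pstdev(ages) would divide by n instead.
sd = statistics.stdev(ages)

print(f"Mean = {mean_age}, Range = {score_range}, SD = {sd:.2f}")
```

The range uses only the two most extreme scores, while the standard deviation uses every score's distance from the Mean.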
The Mode
Most frequent score in the distribution.
Example: scores = 15, 20, 21, 20, 36, 15, 25, 15, 12
Show scores as a frequency distribution:

score    frequency    % of cases
12       1            11%
15       3            33%
20       2            22%
21       1            11%
25       1            11%
36       1            11%
15 is most common, and is considered the mode.
Characteristics:
used for all numerical scales, particularly categorical
insensitive to extreme values or range of scores
unstable; sensitive to small shifts in number of cases
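The frequency table and the mode for the example scores can be reproduced with a short sketch. This is one simple way to do the tally, not necessarily how the slide's table was produced.

```python
from collections import Counter

scores = [15, 20, 21, 20, 36, 15, 25, 15, 12]  # example scores from the slide

frequencies = Counter(scores)   # score -> how many times it occurs
n = len(scores)

print("score  frequency  % of cases")
for score in sorted(frequencies):
    count = frequencies[score]
    print(f"{score:>5}  {count:>9}  {count / n:>9.0%}")

# The mode is the score with the highest frequency.
mode, mode_count = frequencies.most_common(1)[0]
print(f"Mode = {mode} (appears {mode_count} times)")
```

Changing a single 15 to a 20 would tie the two scores at three cases each, which is the instability noted above.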
Median
Mid-point of a distribution of scores:
Half are above, half are below
Only for continuous (interval or ratio) scales
List all the scores in numerical order:
12, 15, 15, 15, 20, 20, 21, 25, 36
Locate the score in the center of the order
For an even number of scores the Median is the average of the middle two scores
12, 15, 15, 15, [20], 20, 21, 25, 36: four below, four above
The Mode = 15, whereas Median = 20
For this distribution the median best reflects the center of the distribution
Median is not sensitive to extreme values
Adding ‘1,000’ would not matter; ‘20’ would still be the center of the distribution:
12, 15, 15, 15, 20, 20, 21, 25, 36, 1,000
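The median steps can be checked with a brief sketch using the same example scores; the only addition is the extreme value of 1,000 mentioned on the slide.

```python
import statistics

scores = [15, 20, 21, 20, 36, 15, 25, 15, 12]  # example scores from the slide

ordered = sorted(scores)                 # step 1: numerical order
median = statistics.median(scores)       # step 2: the score in the center

# Add one extreme value. Now n is even, so the median is the
# average of the two middle scores (20 and 20).
with_outlier = scores + [1000]
median_with_outlier = statistics.median(with_outlier)

print(ordered)
print(f"Median = {median}; with 1,000 added = {median_with_outlier}")
```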
Mean (M)
The “average” score in a sample: M = ΣX / n
Most common measure of central tendency
Total all scores: 12+15+20+21+20+36+15+25+15 = 179
Divide by “n” of scores: 179 / 9 = 19.9 (round to 20).
Characteristics:
Good for Ratio or interval scales
Sensitive to all observed values
Highly stable; with larger n is insensitive to subtle
changes in values
Can be highly sensitive to extreme values.
12+15+20+21+20+36+15+25+15+1000 = 1,179; 1,179 / 10 = 117.9 (about 118)
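The same arithmetic as a minimal sketch; the function name `mean` is just an illustrative helper.

```python
scores = [12, 15, 20, 21, 20, 36, 15, 25, 15]  # example scores from the slide


def mean(values):
    """M = (sum of scores) / (n of scores)."""
    return sum(values) / len(values)


m = mean(scores)
m_with_outlier = mean(scores + [1000])   # one extreme value added

print(f"M = {m:.1f}")
print(f"M with 1,000 added = {m_with_outlier:.1f}")
```

One extreme score moves the Mean from about 20 to about 118, while the Median for the same ten scores stays at 20.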
The Mean: Statistical notation
Some basic statistical notation:
X: Score on one variable for one participant
n: Number of scores in the sample
Σ: Sum of a set of scores
M or X̄: Mean; sum of scores divided by n of scores: M = ΣX / n
Central tendency: Normal Distributions
Many variables in nature (and science) are normally distributed; their frequency distribution is bell shaped.
For a normal distribution the mean, mode, and median are all the same -- the center of the distribution
[Figure: bell curve with the Mode, Median, and Mean all marked at the center.]
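A small check of this claim, using hypothetical scores that are symmetrical and roughly bell shaped (illustrative values only, not data from the module):

```python
import statistics

# Hypothetical symmetrical, bell-shaped scores on a 1-to-7 scale.
scores = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6, 6, 7]

mean = statistics.mean(scores)
median = statistics.median(scores)
mode = statistics.mode(scores)

print(f"Mean = {mean}, Median = {median}, Mode = {mode}")
```

All three measures land on 4, the center of the scale.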
The distribution of student ages
[Figure: bar chart of percent of students (0 to 50) by age category (Under 19, 19 to 21, 22 to 25, 26 to 30, Over 30), with the Mode, Median, and Mean of the age distribution marked.]
Age is a good example of a variable that is normally
distributed
Central tendency: Bimodal Distributions
A bimodal distribution has two modes, at the outsides of the distribution.
The mean & median are similar, at the center.
[Figure: bimodal curve with a Mode at each peak and the Mean and Median at the center.]
Common examples are:
Highly polarized political attitudes (i.e., where few are “in
the middle”).
Some personality variables…
Those who fear and loathe statistics v. those who love statistics and are popular, mentally healthy, and happy.
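To see why the mean and median can mislead here, consider a hypothetical set of polarized attitude ratings (invented for illustration). `statistics.multimode` returns every value tied for most frequent.

```python
import statistics

# Hypothetical polarized attitude ratings on a 1-to-7 scale.
ratings = [1, 1, 1, 1, 2, 2, 4, 6, 6, 7, 7, 7, 7]

modes = statistics.multimode(ratings)   # all values tied for most frequent
mean = statistics.mean(ratings)
median = statistics.median(ratings)

print(f"Modes = {modes}, Mean = {mean}, Median = {median}")
```

The mean and median both sit at 4, a rating that only one respondent actually gave; the two modes (1 and 7) describe the data far better.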
Central tendency: Skewed Distributions
A skewed distribution has extreme scores in one direction.
The extreme scores make the median higher than the mode. (The high scores to the right move the 50% point that direction…).
The Mean gets pulled even higher. (Adding in some very high scores raises the average…).
[Figure: positively skewed curve with the Mode at the peak, the Median to its right, and the Mean (M) further right.]
Common examples:
Behaviors such as alcohol or drug use:
Most people use none or moderate
A diminishing number use higher levels
Demographic variables such as income
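The ordering Mode < Median < Mean in a positive skew can be seen in a small sketch. The numbers are hypothetical (invented drinks-per-week reports), chosen so that most values are low and a few are very high.

```python
import statistics

# Hypothetical drinks-per-week reports: most are low, a few are very high.
drinks = [0, 0, 0, 0, 1, 1, 2, 2, 3, 4, 6, 10, 30]

mode = statistics.mode(drinks)
median = statistics.median(drinks)
mean = statistics.mean(drinks)

print(f"Mode = {mode}, Median = {median}, Mean = {mean:.2f}")
```

The few heavy users pull the Mean well above the Median, which in turn sits above the Mode.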
Central tendency; normal distribution
Measures of Central Tendency: A normal distribution
Age, Chicago community sample

N Valid           793
N Missing         24
Mean              33.1967
Median            32.0000
Mode              30.00
Std. Deviation    9.54557
Variance          91.118
Skewness          .633
Range             50.00

Scores for age from a large community sample form a largely symmetrical distribution.
The Mean, Median, and Mode are similar.
Any measure of central tendency well represents the data.
[Figure: histogram of age2 (X axis 10 to 60; frequency 0 to 200). Mean = 33.1967, Std. Dev. = 9.54557, N = 793.]
Local examples of distributions, 1
How many packs of cigarettes do you smoke each week?
A = I do not smoke
B = 1 pack or less
C = 2 packs or so
D = 3 to 5 packs
E = A pack a day or more
Cigarette smoking forms a bimodal distribution
People do not smoke at all…
…or smoke a lot.
Few are “light” smokers.
Mean = 1.1 packs per week clearly misrepresents the data.
The Mode or Median are better indicators.
[Figure: bar chart of packs of cigarettes per week: None, 1, 2, 4 or 5, 7 or more.]

Statistics: Packs of cigarettes per week
N Valid                   789
N Missing                 28
Mean                      1.1305
Std. Error of Mean        .05915
Median                    .0000
Mode                      .00
Std. Deviation            1.66
Variance                  2.761
Skewness                  .978
Std. Error of Skewness    .087
Kurtosis                  -.871
Std. Error of Kurtosis    .174
Range                     4.00
Minimum                   .00
Maximum                   4.00
Sum                       892.00
Local examples of distributions, 2b
During the last week on how many days did you have at least one drink of alcohol?
A = 0
B = 1
C = 2 or 3
D = 4 or 5
E = 6 or 7
Drug or alcohol use exemplifies positively skewed data
Most people use no or few substances
Smaller & smaller #s use more…
This strong positive skew is reflected in a skewness statistic
[Figure: histogram of number of alcohol or drugs used > rarely (X axis .00 to 9.00; frequency 0 to 400). Data from a community survey sample.]

Statistics: Number of alcohol or drugs used > rarely
N Valid           766
N Missing         51
Mean              1.11
Median            1.00
Mode              .00
Std. Deviation    1.59
Variance          2.54
Skewness          2.46
Minimum           .00
Maximum           9.00
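A skewness statistic can be sketched as follows. This uses the simple moment-based definition, which is an assumption: statistics packages may apply a bias-adjusted formula, so the result need not match a table like the one above. The data sets are hypothetical.

```python
def skewness(values):
    """Moment-based skewness: third central moment / (second central moment ** 1.5)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - m) ** 3 for v in values) / n   # third central moment
    return m3 / m2 ** 1.5


# Hypothetical data sets (illustrative only).
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]
right_tailed = [0, 0, 0, 0, 1, 1, 2, 2, 3, 4, 6, 10, 30]
left_tailed = [-v for v in right_tailed]   # mirror image of the right-tailed set

print(f"symmetric:    {skewness(symmetric):.2f}")
print(f"right-tailed: {skewness(right_tailed):.2f}")
print(f"left-tailed:  {skewness(left_tailed):.2f}")
```

A value near zero indicates symmetry, a positive value a tail toward higher scores, and a negative value a tail toward lower scores.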
Example of a strong positive skew: American incomes.
This figure represents the bottom 99% of American household incomes (2000).
[Figure: distribution of household incomes from $0 to $300K; each symbol = 100,000 households. Mode ~ $20K; Median = $40,700; Mean = $57,000; bottom 80% < $80,500.]
Income in the U.S. is highly skewed: the M household income is over twice the modal income.
American income distribution (2000)
This dramatic positive skew makes “M family income” a very poor indicator of most people’s financial resources
99% of families fit into the very bottom of the complete U.S. income distribution
The super-rich 1% extends far further…
[Figure: the complete income scale, marked at $50M, $100M, and $200M, with Tiger Woods, the CEO of TimeWarner, the CEO of Citigroup, and George Lucas placed along it.]
Deceptiveness of the M in skewed data:
Example from the 2004-2008 tax cut data
Did all Americans benefit equally from the Bush era tax cuts?
The M annual tax cut for all tax payers is $1,629 per year from 2001 to 2010... a pretty good number.
The extreme positive skew in incomes & tax legislation make this highly deceptive.
[Figure: bar chart of annual estimated tax cuts ($0 to $10,000 axis) by income group: Lowest 20%, 2nd 20%, 3rd 20%, 4th 20%, Next 10%, Next 4%, Top 1%. Bar values rise from $98 for the lowest 20% through $508, $791, $1,081, $1,225, and $2,780 to $85,002 for the top 1%.]
Source: Citizens for Tax Justice, 6/12/06
Deceptive Means in skewed data
The M for tax cuts is pulled upward by a small number of very high values
Almost 90% of tax payers got less than the M tax cut of $1,629
In data this skewed a single measure of central tendency cannot be accurate.
It is more accurate to show income groups separately.
The bottom 66% of tax payers got M = $461 in cuts; the top 10% got M = $10,203.
Mean v. Median in skewed data
The difference between the Mode, Median & Mean in skewed data can be crucial: Example from men’s v. women’s number of sex partners.
M number of partners is
much higher for men
than for women.
However, both men &
women have highly
skewed distributions:
Most at the low end
A long tail to the right
A small mode at very
high numbers
Expressing the data as Ms greatly exaggerates differences
between men & women.
For both genders:
mode = 1,
medians are similar (5 v. 3),
This suggests only modest
gender differences.
Men’s data are far more skewed – some men report A LOT of
partners – their M is very high (6 times their median…).
The genders are similar,
except for a few men at
the very top of the scale,
who pull their M
upwards.
Simply taking the Ms at
face value would be
deceiving.
Bottom line:
Many natural processes or variables show a (more or less)
normal distribution…
Age, Height, IQ, most attitude scales…
Many important behavioral variables have a highly
skewed distribution.
When a distribution is normal the Mode = Mean =
Median, all at the center of the distribution.
When a distribution is bimodal or skewed the M can be
deceptive.
Mode: Best for categorical data or a simple summary
Median: Best for highly skewed data
Mean: Best for normally distributed data.
Summary: Introduction & number scales
Primary measures of Central Tendency
Mode: Most common value
Median: Middle of the distribution
Mean: Average
Types of frequency distributions
“Normal”: Best characterized by the Mean
Bimodal: Best characterized by the Modes
Skewed: Best characterized by the Median or Mode