Transcript The “X”

Welcome to the statistics module series.
Foundations
Of Research
1
This series has five modules:
1. Using numbers in science (You are here)
2. Central tendency and variance
3. Z and the normal distribution
4. Calculating a t score
5. Testing t: the Central Limit Theorem
Click any module title to open it.
[Figure: bar chart (vertical axis 0 to 40) comparing three groups: African-Am., n=430; Latino, n = 130; White, n = 183. Horizontal-axis categories include any substance, alcohol, marijuana, and other drug.]
Dr. David J. McKirnan
The University of Illinois
Chicago
Want to print a copy of a module for notes?

When you open a PowerPoint module it saves a copy in
your downloads folder in "show" format (ppsx).

To convert it to a printing format (pptx):
 Click ‘esc’ to leave the open module
 Open PowerPoint, then browse to and open the module
 Click “print”; in the dialogue box click “print what?”.
 Select “Handouts (3 slides per page)”

Then, come back and re-run the show
2
3
Numbers and Science

 Numbers in science

Number scales & distributions

Central Tendency: Mode, Median,
Mean
This should open as a PowerPoint
“Show”.
If it does not, please go to “slide show” and
click “run show”.
 Click through it by pressing any key.
 Focus & think about each point; do not
just passively click.
 You will have a quiz at the end of each
section.
Psychology 242, Dr. McKirnan
Why are Numbers so important to Science?
4
 Numbers make our measures clear and specific
 They help us Operationally Define our variables
 They are the common language of science, commerce,
and the like.


They allow us to compare research outcomes

Across groups or conditions within a study

Across studies
They allow us to perform statistical operations

We will describe basic statistics as we go…
Costs & benefits of numbers
5
Virtues of numerical data:


Representing data by numbers helps clarify and
simplify communication
Can show general trend or “common denominator” in
the data
Danger of numerical data:


Can over-simplify / ignore individual differences
Reducing complex natural processes – physical or
behavioral – to simple numbers may misrepresent
reality.
The danger of reducing complex
processes to simple numbers:
6
A. Simple descriptive numbers are less important than
testing hypotheses & theory
• Testing why or how something works is more important than
collecting simple numbers.
B. Where does the single number come from?
• The choice to emphasize one statistic over another can reflect
personal bias or social pressure.
• Selection of some statistics (such as the average value in highly
skewed data such as income) can be politically biased.
C. Biased search for confirmatory measures.
• “Cherry picking” results that support your hypothesis, ignoring
disconfirming data (e.g., Pharmaceutical Co. research)
7
Political (mis)uses of scientific data
D. Ambiguity in the interpretation of data:
Diagnoses of autism have markedly increased. Is this a shift in the
actual disorder, or a lower threshold for reporting cases?
E. Simply ignoring “inconvenient” data:
Project DARE continues as a politically favored drug
abuse prevention program, despite data showing it to be
ineffective.
Data indicating climate change are ignored or distorted by
many political commentators.
Thus…


8
Representing nature in terms of numbers is
important to scientific progress

Key for operational definitions

Provide powerful analytic & communication tools
Understanding ANY statistic requires that we
understand how it was derived


Statistical statements can be literally / technically “true” but
misleading.
• What context was the statistic drawn from?
• What else do we need to know to evaluate it?
“There are lies, damned lies, and statistics.” -- Mark Twain
Political or commercial misuse of numerical / statistical
information is common and problematic.

9
Some basic terms:
Variable → Characteristic or attribute with different levels or qualities (e.g., age, speed, effort, attention).

Participant     Age
Jo              32
Sally           45
Bill            22
Sam             37
Mary Louise…    34…
10
Some basic terms:
Variable → Characteristic or attribute with different levels or qualities (e.g., age, speed, effort, attention).
Value → One level or state of a variable (e.g., age = 21, ethnicity = Latino).
11
Some basic terms:
Variable → Characteristic or attribute with different levels or qualities (e.g., age, speed, effort, attention).
Value → One level or state of a variable (e.g., age = 21, ethnicity = Latino).
Distribution → Set of scores (each with its own value) for one variable (e.g., distribution of student ages).


12
Some basic terms:
Variable → Characteristic or attribute with different levels or qualities (e.g., age, speed, effort, attention).
Value → One level or state of a variable (e.g., age = 21, ethnicity = Latino).
Distribution → Set of scores (each with its own value) for one variable (e.g., distribution of student ages).
[Figure: frequency plot of participant ages; X axis: Age; Y axis: Number of participants (1 to 8); each X marks one participant (Jo, Sally, Bill, Sam, Mary Louise…)]
13
Some basic terms:
Central Tendency → Primary “drift” of a set of scores (Mean, Mode, or Median).
Variance → Measure of how much the scores in a distribution differ from each other.
[Figure: frequency plot of participant ages; X axis: Age; Y axis: Number of participants (1 to 8)]
14
Some basic terms:
Variable → Characteristic or attribute with different levels or qualities (e.g., age, speed, effort, attention).
Value → One level or state of a variable (e.g., age = 21, ethnicity = Latino).
Distribution → Set of scores (each with its own value) for one variable (e.g., distribution of student ages).
Central Tendency → Primary “drift” of a set of scores (Mean, Mode, or Median).
Variance → Measure of how much the scores in a distribution differ from each other.
Parameter → Mathematical characteristic of (variable that describes an aspect of…) a population.
Statistic → Mathematical characteristic of a sample.
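The difference between a parameter and a statistic can be sketched in a few lines of Python. This is only an illustration: the population of ages below is made up (it is not course data), and the names `population_ages` and `sample_ages` are assumptions introduced for the example.

```python
import random
import statistics

# Hypothetical population: ages of 1,000 people (made-up values, not course data).
rng = random.Random(242)  # fixed seed so the sketch is repeatable
population_ages = [rng.randint(18, 60) for _ in range(1000)]

# Parameter: a mathematical characteristic of the whole population.
population_mean = statistics.mean(population_ages)

# Statistic: the same characteristic computed on a sample drawn from that population.
sample_ages = rng.sample(population_ages, 50)
sample_mean = statistics.mean(sample_ages)

print(f"Parameter (population mean age): {population_mean:.2f}")
print(f"Statistic (sample mean age, n = {len(sample_ages)}): {sample_mean:.2f}")
```

The two numbers will usually be close but not identical, which is why a statistic is treated as an estimate of the parameter.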

Research questions, hypotheses &
designs
Numbers in science

 Number scales &
distributions

Central Tendency: Mode, Median,
Mean
Psychology 242, Dr. McKirnan
Week 3; Experimental
15
Types of numerical scales
16
Continuous scales:
Ratio → Scales of physical properties. Used for physical description: temperature, elapsed time, height.
Interval → Arbitrary or relative ψ scales. Common in behavioral research, e.g., attitude or rating scales.
Other scales:
Ordinal → Rank order with non-equal intervals. Simple finish place, rank in organization, most, 2nd most, 3rd most...
Categorical → Categories only. Typical of inherent categories: ethnic group, gender, zip code.
17
Scale details
Ratio Scales; e.g., temperature, height, blood pressure...
Each scale value describes a physical reality
• The zero point is grounded in a physical property
  • 0 hits is meaningful
  • Water always freezes at 32°F
• Each scale point is absolute
  • 32° has an absolute meaning, not just relative to other temperatures.
• Each interval is continuous & exactly equal:
  • Batting average: % hits / at bats
  • 30°F → 40°F ≡ 90°F → 100°F
Used for correlational designs, or as dependent variable in experiments.
18
Interval scales
Interval Scales; e.g., attitude rating scale, IQ…
• Not grounded in physical reality; “ψ reality” only
• Correlational designs, dependent variable in experiments.
• Does not have a “0” point; each scale point is relative
• Intervals are designed to be the same… but may not be.
[Figure: 7-point rating scale for agreeing with an attitude statement (“Science has changed my life…”), running from 1 = “Does not agree at all” to 7 = “Strongly agree”; two steps on the scale are marked to show that one is a bigger ψ step than the other.]
19
Ordinal Scales; e.g., rank order
• Also not grounded in physical reality
• Relative “strength” or placement on a scale
• Primarily correlational / measurement designs.
• No “0” point; each scale point is relative
• Provides no information about intervals
[Figure: preference ranking running from “Least preferred” to “Most preferred”: Text, discussions, lecture, Instructor]
Intervals between each scale point are arbitrary… (Many interval scales may actually be ordinal…)
20
Categorical Scales
Simple groups / categories
• Measurement studies (e.g., epidemiology)
• Independent Variable in an experiment.
Binary (“nominal”) measures: 2 levels only
• Experimental v. control group
• “Drug user” v. not
Categorical: >2 levels
• Ethnic group, City neighborhood, primary drug used…
Groups may be inherent (gender, ethnicity) or arbitrary (experimental group…).
Two central ways of using numbers.
21
Descriptive Statistics:
• Simple quantitative description or summary.
  • Batting average in baseball
  • Grade-point average
• Univariate Analysis: examines cases in terms of a single variable.
• Frequency distributions group data into categories.
  • E.g., drug use by Age categories, Ethnic groups…
Inferential Statistics:
• Conduct analyses on samples
  • Compare groups (experimental v. control…)
  • Characterize a sample in epidemiological research
• Use statistical operations to generalize the results to a population.
What is a Frequency Distribution?
22
• We plot each data point (a score for one person…) on an axis of scores.
• Scores are on the “X” axis
• Frequencies on the “Y” axis.
• We show the shape of a distribution by drawing a curve over the cluster of data points…
[Figure: frequency plot; the “X” axis shows Scores (1 to 7), the “Y” axis shows Frequency (1 to 17); each X marks one participant's score, and callouts report how many participants got a given score.]
Statistics introduction 1
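A frequency distribution like the one plotted on this slide can be tallied in a few lines of Python. This is a minimal sketch: the scores below are hypothetical (they are not the values in the slide's figure), and the plot is turned on its side so that each row is one score.

```python
from collections import Counter

# Hypothetical scores on a 1-to-7 scale, one per participant (illustrative only).
scores = [4, 3, 5, 4, 2, 4, 3, 6, 4, 5, 3, 4, 1, 5, 7]

# Frequency distribution: how many participants got each score.
frequencies = Counter(scores)

# One row per score, one X per participant. The slide's plot shows the same
# counts with scores on the X axis and frequency on the Y axis.
for score in range(1, 8):
    print(f"{score} | {'X' * frequencies[score]}")
```

Most of the Xs pile up at the middle score and thin out toward either end, which is the shape the curve on the slide is drawn over.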
The normal distribution
23
In a “normal” distribution:
• Scores are symmetrical around the mid-point
• The variance in scores shows the classic “Bell Shape”
[Figure: frequency plot of a normal distribution; X axis: Scores (1 to 7); Y axis: Frequency (0 to 15)]
Less “normal” distributions
24
Score distributions may depart from the Bell Curve
• Scores here are still symmetrical
• The variance is “flat” or irregular
[Figure: frequency plot of a flat, symmetrical distribution; X axis: Scores (1 to 7); Y axis: Frequency (0 to 15)]
25
Skewed distributions
Distributions may be very “non-normal”
• Here scores are not symmetrical
• This is called a “skewed” distribution; scores load up on one side of the scale.
• This is a positive skew; the “tail” of the distribution goes toward higher values
[Figure: frequency plot of a positively skewed distribution with the center of the distribution marked; X axis: Scores (1 to 7); Y axis: Frequency (0 to 15)]
Less “normal” distributions
26
Other data may show a negative skew.
[Figure: frequency plot of a negatively skewed distribution; X axis: Scores (1 to 7); Y axis: Frequency (0 to 15)]
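The direction of a skew can also be checked numerically. The sketch below uses the moment coefficient of skewness, which is an assumption about how to quantify skew (the slides only show the shapes), and the three small data sets are hypothetical. Statistical packages often report a sample-adjusted version, so their values can differ slightly from this one.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical scores on a 1-to-7 scale (illustrative only).
symmetrical = [1, 2, 2, 3, 3, 3, 4, 4, 5]
positive_skew = [1, 1, 1, 1, 2, 2, 3, 4, 7]      # tail toward higher values
negative_skew = [8 - x for x in positive_skew]   # mirror image: tail toward lower values

for name, data in [("symmetrical", symmetrical),
                   ("positive skew", positive_skew),
                   ("negative skew", negative_skew)]:
    print(f"{name}: {skewness(data):+.2f}")
```

A value near zero goes with a symmetrical distribution, a positive value with a tail toward higher scores, and a negative value with a tail toward lower scores.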
Descriptive / Univariate Statistics
How angry are you at the banks that crashed the economy?
1. Gather raw attitude data:
2. Show the number of
students at each level of
the variable “angry”…
Clicker
responses
in class
27
28
Descriptive statistics example
3. Compile the
descriptive
statistics:
4. Show the data as a
frequency distribution…
Summary: Introduction & number scales
29
Numbers are crucial to science
• Operational definitions
• Communication
• Statistical analyses
Types of numerical scales
• Ratio
• Interval
• Ordinal
• Categorical
Numbers fall into regular frequency distributions
Research questions, hypotheses &
designs

Using numbers in science

Number scales & frequency
distributions


30
Central Tendency: Mode,
Median, Mean
Images from the FACE RESEARCH
LAB, Ben Jones and Lisa DeBruine,
School of Psychology, University of
Aberdeen.
Central tendency
 Central tendency – the general “drift” in a set
of scores or values – can reflect any physical or ψ
process or numerical scale.
 Central tendency reduces the variance in a
sample of values to one core value.
 The following is an example from the FACE
RESEARCH LAB; Ben Jones and Lisa DeBruine, School
of Psychology, University of Aberdeen.
31
What is Central Tendency (Mean or average…)
Variance in
the faces of the
different lab
members.
32
The average
(”Mean” or M)
face of the lab
members
Describing data
33
We characterize the general trend or character of data using two key statistics:
1. Central tendency or general “drift” of the scores.
  • Mode → most common score
  • Median → middle of the distribution
  • Mean → average score
2. Variance: how diverse the scores are (how much they vary from each other).
  • Range → …from the highest to lowest score
  • Standard deviation → “average” amount the scores vary from the Mean score
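The two variance measures named above can be sketched in Python using the nine example scores from this module's Mode slide. The choice of the sample standard deviation (dividing by n - 1) is an assumption; the slide does not say which form it intends, so the population form is shown alongside it.

```python
import statistics

# The nine example scores used on the Mode, Median, and Mean slides.
scores = [15, 20, 21, 20, 36, 15, 25, 15, 12]

# Range: distance from the lowest to the highest score.
score_range = max(scores) - min(scores)  # 36 - 12 = 24

# Standard deviation: roughly the "average" amount scores vary from the Mean.
sample_sd = statistics.stdev(scores)       # divides by n - 1 (assumed here)
population_sd = statistics.pstdev(scores)  # divides by n

print(f"Range: {score_range}")
print(f"Sample SD: {sample_sd:.2f}")
print(f"Population SD: {population_sd:.2f}")
```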
34
The Mode
Most frequent score in the distribution.
Example: scores = 15, 20, 21, 20, 36, 15, 25, 15, 12
Show scores as a frequency distribution:

score   frequency   % of cases
12      1           11%
15      3           33%
20      2           22%
21      1           11%
25      1           11%
36      1           11%

15 is most common, and is considered the mode.
Characteristics:
• used for all numerical scales, particularly categorical
• insensitive to extreme values or range of scores
• unstable; sensitive to small shifts in number of cases
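The frequency table and the mode for these nine scores can be reproduced with a short Python sketch. The column layout is just one convenient way to print it.

```python
from collections import Counter

scores = [15, 20, 21, 20, 36, 15, 25, 15, 12]

# Tally how often each score occurs.
frequencies = Counter(scores)
n = len(scores)

print(f"{'score':>5} {'frequency':>9} {'% of cases':>10}")
for score in sorted(frequencies):
    count = frequencies[score]
    print(f"{score:>5} {count:>9} {count / n:>10.0%}")

# The mode is the score with the highest frequency.
mode, mode_count = frequencies.most_common(1)[0]
print(f"Mode = {mode} (occurs {mode_count} times)")
```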
35
Median
Mid-point of a distribution of scores:
• Half are above, half are below
• Only for continuous (interval or ratio) scales
• List all the scores in numerical order:
  12, 15, 15, 15, 20, 20, 21, 25, 36
36
Median
Mid-point of a distribution of scores:
• Half are above, half are below
• Only for continuous (interval or ratio) scales
• List all the scores in numerical order
• Locate the score in the center of the order
• For an even number of scores the Median is the average of the middle two scores
Ordered scores: 12, 15, 15, 15, 20, 20, 21, 25, 36 (four below and four above the middle score)
The Mode = 15, whereas Median = 20
• For this distribution the median best reflects the center of the distribution
Median is not sensitive to extreme values
• Adding ‘1,000’ would not matter; ‘20’ would still be the center of the distribution:
  12, 15, 15, 15, 20, 20, 21, 25, 36, 1,000
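The steps for finding the median, and the effect of adding an extreme score, can be sketched in Python with the same nine scores.

```python
import statistics

scores = [15, 20, 21, 20, 36, 15, 25, 15, 12]

# Step 1: list the scores in numerical order.
ordered = sorted(scores)
print(ordered)

# Step 2: locate the center of the order (the 5th of 9 scores).
median = statistics.median(scores)
print(f"Median: {median}")

# Adding an extreme value gives an even number of scores, so the median is
# the average of the middle two, (20 + 20) / 2, and the center does not move.
with_outlier = scores + [1000]
median_with_outlier = statistics.median(with_outlier)
print(f"Median with 1,000 added: {median_with_outlier}")
```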
37
Mean (M)
The “average” score in sample: $M = \frac{\sum X}{n}$
Most common measure of central tendency
• Total all scores: 12+15+20+21+20+36+15+25+15 = 179
• Divide by “n” of scores: 179 / 9 = 19.9 (round to 20).
Characteristics:
• Good for Ratio or interval scales
• Sensitive to all observed values
• Highly stable; with larger n is insensitive to subtle changes in values
• Can be highly sensitive to extreme values:
  (12+15+20+21+20+36+15+25+15+1000) / 10 = 1,179 / 10 = 117.9 (about 118)
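The same arithmetic, written as a short Python sketch, shows how a single extreme score moves the mean.

```python
scores = [12, 15, 20, 21, 20, 36, 15, 25, 15]

# Total all scores, then divide by the number of scores.
total = sum(scores)
mean = total / len(scores)
print(f"Sum = {total}, n = {len(scores)}, M = {mean:.1f}")

# One extreme value pulls the mean far away from the bulk of the scores.
with_outlier = scores + [1000]
mean_with_outlier = sum(with_outlier) / len(with_outlier)
print(f"With 1,000 added: M = {mean_with_outlier:.1f}")
```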
38
The Mean: Statistical notation
Some basic statistical notation:
X → Score on one variable for one participant
n → Number of scores in the sample
Σ → Sum of a set of scores
M or $\bar{X}$ → Mean; sum of scores divided by n of scores: $M = \frac{\sum X}{n}$
39
Central tendency: Normal Distributions
Many variables in nature (and science) are normally distributed; their frequency distribution is bell shaped.
For a normal distribution the mean, mode, and median are all the same: the center of the distribution.
[Figure: bell curve with the Mode, Median, and Mean all marked at the center]
40
The distribution of student ages
Age is a good example of a variable that is normally distributed
[Figure: bar chart of student ages; X axis: Age category (Under 19, 19 to 21, 22 to 25, 26 to 30, Over 30); Y axis: Percent of students (0 to 50); the Mode, Median, and Mean of the age distribution are marked]
Central tendency: Bimodal Distributions
41
A bimodal distribution has two modes, at the outsides of the distribution.
The mean & median are similar, at the center.
[Figure: bimodal curve with a Mode at each side and the Mean and Median at the center]
Common examples are:
• Highly polarized political attitudes (i.e., where few are “in the middle”).
• Some personality variables...
• Those who fear and loathe statistics v. those who love statistics and are popular, mentally healthy, and happy.
42
Central tendency: Skewed Distributions
A skewed distribution has extreme scores in one direction.
The extreme scores make the median higher than the mode. (The high scores to the right move the 50% point that direction…).
The Mean gets pulled even higher. (Adding in some very high scores raises the average…).
[Figure: positively skewed curve with the Mode, Median, and Mean (M) marked]
Common examples:
• Behaviors such as alcohol or drug use:
  • Most people use none or moderate
  • A diminishing number use higher levels
• Demographic variables such as income
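The ordering described on this slide (mode lowest, median higher, mean highest) can be seen in a small Python sketch. The data are hypothetical: ten made-up answers to a "drinks per week" question, not values from the course surveys.

```python
import statistics

# Hypothetical positively skewed data: most people report none or few,
# and a diminishing number report more (illustrative values only).
drinks_per_week = [0, 0, 0, 0, 1, 1, 2, 3, 5, 12]

mode = statistics.mode(drinks_per_week)      # most common value
median = statistics.median(drinks_per_week)  # 50% point
mean = statistics.mean(drinks_per_week)      # pulled upward by the high scores

print(f"Mode = {mode}, Median = {median}, Mean = {mean}")
```

The one respondent reporting 12 barely affects the median but more than doubles the mean relative to it.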
43
Central tendency; normal distribution
Measures of Central Tendency: A normal distribution

Age, Chicago community sample
N Valid           793
N Missing         24
Mean              33.1967
Median            32.0000
Mode              30.00
Std. Deviation    9.54557
Variance          91.118
Skewness          .633
Range             50.00

• Scores for age from a large community sample form a largely symmetrical distribution.
• The Mean, Median, and mode are similar.
• Any measure of central tendency well represents the data.
[Figure: histogram of age (variable age2); X axis: age (10 to 60); Y axis: frequency (0 to 200); Mean = 33.1967, Std. Dev. = 9.54557, N = 793]
Local examples of distributions, 1
How many packs of
cigarettes do you smoke
each week?
A = I do not smoke
B= 1 pack or less
C = 2 packs or so
D = 3 to 5 packs
E = A pack a day or more
44
45
Cigarette smoking forms a bimodal distribution
• People do not smoke at all…
• …or smoke a lot.
• Few are “light” smokers.
• Mean = 1.1 packs per week clearly misrepresents the data.
• The Mode or Median are better indicators.
[Figure: bar chart of packs of cigarettes per week; categories: None, 1, 2, 4 or 5, 7 or more]

Statistics: Packs of cigarettes per week
N Valid                   789
N Missing                 28
Mean                      1.1305
Std. Error of Mean        .05915
Median                    .0000
Mode                      .00
Std. Deviation            1.66
Variance                  2.761
Skewness                  .978
Std. Error of Skewness    .087
Kurtosis                  -.871
Std. Error of Kurtosis    .174
Range                     4.00
Minimum                   .00
Maximum                   4.00
Sum                       892.00
Local examples of distributions, 2b
During the last week on
how many days did you have
at least one drink of alcohol?
A=0
B= 1
C = 2 or 3
D = 4 or 5
E = 6 or 7
46
47
Drug or alcohol use exemplifies positively skewed data
• Most people use no or few substances
• Smaller & smaller #s use more…
• This strong positive skew is reflected in a skewness statistic
[Figure: histogram; X axis: Number of alcohol or drugs used > rarely (.00 to 9.00); Y axis: Frequency (0 to 400). Data from a community survey sample.]

Statistics: Number of alcohol or drugs used > rarely
N Valid           766
N Missing         51
Mean              1.11
Median            1.00
Mode              .00
Std. Deviation    1.59
Variance          2.54
Skewness          2.46
Minimum           .00
Maximum           9.00
48
Example of a strong positive skew: American incomes.
This figure represents the bottom 99% of American household incomes (2000).
[Figure: distribution of household incomes; each symbol = 100,000 households; X axis marked at $100K, $200K, $300K; Mode ~ $20K, Median = $40,700, Mean = $57,000; bottom 80% < $80,500]
Income in the U.S. is highly skewed: the M household income is over twice the modal income.
49
American income distribution (2000)
This dramatic positive skew makes “M family income” a very poor indicator of most people’s financial resources
99% of families fit into the very bottom of the complete U.S. income distribution
The super-rich 1% extends far further…
[Figure: the full income scale, marked at $50M, $100M, and $200M, with labels for Tiger Woods, the CEO of TimeWarner, the CEO of Citigroup, and George Lucas]
Deceptiveness of the M in skewed data:
Example from the 2004-2008 tax cut data
50
• Did all Americans benefit equally from the Bush era tax cuts?
• The M annual tax cut for all tax payers is $1629 per year from 2001 to 2010... a pretty good number.
• The extreme positive skew in incomes & tax legislation make this highly deceptive.
[Figure: bar chart of annual estimated tax cuts by income group; Y axis: $0 to $10,000, with the Top 1% bar running off the scale]

Income group    Annual estimated tax cut
Lowest 20%      $98
2nd 20%         $508
3rd 20%         $791
4th 20%         $1,081
Next 10%        $1,225
Next 4%         $2,780
Top 1%          $85,002

Source: Citizens for Tax Justice, 6/12/06
51
Deceptive Means in skewed data
• The M for tax cuts is pulled upward by a small number of very high values
• Almost 90% of tax payers got less than the M tax cut of $1,629
[Figure: bar chart of annual estimated tax cuts by income group, from the Lowest 20% ($98) to the Top 1% ($85,002)]
52
Deceptive Means in skewed data
• In data this skewed a single measure of central tendency cannot be accurate.
• It is more accurate to show income groups separately.
• The bottom 66% of tax payers got M = $461 in cuts,
• Top 10% M = $10,203
[Figure: bar chart of annual estimated tax cuts by income group, from the Lowest 20% ($98) to the Top 1% ($85,002)]
Mean v. Median in skewed data
53
The difference between the Mode,
Median & Mean in skewed data can
be crucial: Example from men’s v.
women’s number of sex partners.

M number of partners is
much higher for men
than for women.

However, both men &
women have highly
skewed distributions:

Most at the low end

A long tail to the right

A small mode at very
high numbers
Mean v. Median in skewed data
54
Expressing the data as Ms greatly exaggerates differences
between men & women.
For both genders:
mode = 1,
medians are similar (5 v. 3),
This suggests only modest
gender differences.
Men’s data are far more skewed – some men report A LOT of
partners – their M is very high (6 times their median…).
The genders are similar,
except for a few men at
the very top of the scale,
who pull their M
upwards.
Simply taking the Ms at
face value would be
deceiving.
55
Bottom line:
• Many natural processes or variables show a (more or less) normal distribution…
  • Age, Height, IQ, most attitude scales…
• Many important behavioral variables have a highly skewed distribution.
• When a distribution is normal the Mode = Mean = Median, all at the center of the distribution.
• When a distribution is bimodal or skewed the M can be deceptive.
Mode → Best for categorical data or a simple summary
Median → Best for highly skewed data
Mean → Best for normally distributed data.
Summary: Introduction & number scales
58
Primary measures of Central Tendency
• Mode → Most common value
• Median → Middle of the distribution
• Mean → Average
Types of frequency distributions
• “Normal” → Best characterized by the Mean
• Bimodal → Best characterized by the Modes
• Skewed → Best characterized by the Median or mode