Transcript Chapter 2

Chapter 2
Frequency Distributions, Stem-and-leaf Displays, and Histograms
Where have we been?
To calculate SS, the variance, and the standard
deviation: find the deviations from μ, square and
sum them (SS), divide by N (σ²) and take a square
root (σ).
Example: Scores on a Psychology quiz
Student     X     X - μ    (X - μ)²
John        7     +1.00      1.00
Jennifer    8     +2.00      4.00
Arthur      3     -3.00      9.00
Patrick     5     -1.00      1.00
Marie       7     +1.00      1.00

ΣX = 30     N = 5     μ = 6.00
Σ(X - μ) = 0.00
Σ(X - μ)² = SS = 16.00
σ² = SS/N = 3.20
σ = √3.20 = 1.79
Ways of showing how
scores are distributed
around the mean
• Frequency Distributions,
• Stem-and-leaf displays
• Histograms
Some definitions
• Frequency Distribution - a tabular display of the way
scores are distributed across all the possible values of a
variable
• Absolute Frequency Distribution - displays the count
(how many there are) of each score.
• Cumulative Frequency Distribution - displays the
total number of scores at and below each score.
• Relative Frequency Distribution - displays the
proportion of each score.
• Relative Cumulative Frequency Distribution - displays the proportion of scores at and below each score.
Example Data
Traffic accidents by bus drivers
• Studied 708 bus drivers, all of whom had
worked for the company for the past 5 years or
more.
• Recorded all accidents for the last 4 years.
• Data look like:
3, 0, 6, 0, 0, 2, 1, 4, 1, … 6, 0, 2
Frequency distributions –
Absolute & cumulative frequency

# of accidents   Absolute Frequency   Cumulative Frequency
       0                117                   117
       1                157                   274
       2                158                   432
       3                115                   547
       4                 78                   625
       5                 44                   669
       6                 21                   690
       7                  7                   697
       8                  6                   703
       9                  1                   704
      10                  3                   707
      11                  1                   708
   Total                708
To calculate absolute frequencies, tally and count the number of each kind of score.
To calculate cumulative frequencies, add up the absolute frequencies of scores at or below each score (or possible score, if a score is missing).
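As a rough sketch of the tally-and-add procedure, the Python below uses only the nine scores visible before the "…" in the sample data line, so its counts are illustrative and do not match the 708-driver table. The names `sample`, `absolute`, and `cumulative` are our own.

```python
from collections import Counter
from itertools import accumulate

# Only the nine scores shown before the "…" in the sample data;
# an illustrative subset, NOT the full set of 708 drivers.
sample = [3, 0, 6, 0, 0, 2, 1, 4, 1]

tally = Counter(sample)
# Walk every possible score, so a missing score (here, 5) still gets a row.
possible_scores = list(range(min(sample), max(sample) + 1))
absolute = [tally[s] for s in possible_scores]   # Counter gives 0 for a missing score
cumulative = list(accumulate(absolute))          # running total at or below each score

for s, f, cf in zip(possible_scores, absolute, cumulative):
    print(f"{s:>2}  {f:>3}  {cf:>3}")
```

Note that the missing score of 5 has an absolute frequency of 0, and its cumulative frequency simply repeats the value for 4.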
Frequency distributions – relative frequencies

# of accidents   Absolute Frequency   Cumulative Frequency   Relative Frequency
       0                117                   117                  .165
       1                157                   274                  .222
       2                158                   432                  .223
       3                115                   547                  .162
       4                 78                   625                  .110
       5                 44                   669                  .062
       6                 21                   690                  .030
       7                  7                   697                  .010
       8                  6                   703                  .008
       9                  1                   704                  .001
      10                  3                   707                  .004
      11                  1                   708                  .001
   Total                708                                        .998
Calculate relative
frequencies by dividing
each absolute frequency
by N, the total number
of scores. (For example
117/708 = .165.)
Relative frequencies
show the proportion of
scores at each point.
Note the rounding error: the rounded relative frequencies sum to .998, not 1.000.
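A short Python sketch of the division, using the absolute frequencies from the accident table. It also shows where the .998 comes from: the exact proportions sum to 1, but the three-decimal rounded ones do not. Variable names are our own.

```python
# Absolute frequencies for 0 through 11 accidents, copied from the table.
absolute = [117, 157, 158, 115, 78, 44, 21, 7, 6, 1, 3, 1]

n = sum(absolute)                                 # total number of scores (708)
relative = [round(f / n, 3) for f in absolute]    # proportions rounded to 3 places

rounded_total = round(sum(relative), 3)           # sum of the ROUNDED proportions
exact_total = sum(f / n for f in absolute)        # sum of the unrounded proportions

print(relative)
print(rounded_total, exact_total)
```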
What pops out of such a display
• Number of accidents = 0(117) + 1(157) + 2(158) + 3(115) +
4(78) + 5(44) + 6(21) + 7(7) + 8(6) + 9(1) + 10(3) + 11(1) = 1623
• 18 drivers (about 2.5% of the drivers) had 7 or
more accidents during the 4 years just before
the study.
• Those 18 drivers caused 147 of the 1623
accidents or very close to 9% of the accidents
• Maybe they should be given eye/reflex
exams?
What pops out of such a display
• 5 drivers (about .7% of the drivers) had 9 or
more accidents during the 4 years just before
the study.
• Those 5 drivers caused 50 of the 1623
accidents or a little over 3% of the accidents
• They should be given eye/reflex exams!
• Probably, they should be given desk jobs.
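The arithmetic behind these two slides can be checked with a small Python sketch. The helper `drivers_and_accidents_at_or_above` is a hypothetical name of our own; the frequencies come from the accident table.

```python
# Absolute frequencies from the accident table; list index = number of accidents.
absolute = [117, 157, 158, 115, 78, 44, 21, 7, 6, 1, 3, 1]

total_drivers = sum(absolute)
total_accidents = sum(k * f for k, f in enumerate(absolute))

def drivers_and_accidents_at_or_above(cutoff):
    """Return (number of drivers, number of accidents) for drivers with >= cutoff accidents."""
    drivers = sum(absolute[cutoff:])
    accidents = sum(k * f for k, f in enumerate(absolute) if k >= cutoff)
    return drivers, accidents

for cutoff in (7, 9):
    drivers, accidents = drivers_and_accidents_at_or_above(cutoff)
    print(f"{cutoff}+ accidents: {drivers} drivers ({drivers / total_drivers:.1%}), "
          f"{accidents} accidents ({accidents / total_accidents:.1%})")
```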
Frequency distributions – cumulative relative
frequencies

# of accidents   Absolute Frequency   Cumulative Frequency   Cumulative Relative Frequency
       0                117                   117                      .165
       1                157                   274                      .387
       2                158                   432                      .610
       3                115                   547                      .773
       4                 78                   625                      .883
       5                 44                   669                      .945
       6                 21                   690                      .975
       7                  7                   697                      .984
       8                  6                   703                      .993
       9                  1                   704                      .994
      10                  3                   707                      .999
      11                  1                   708                     1.000
   Total                708
Calculate cumulative
relative frequencies by
dividing the number of
scores at or below each
possible score by N, the
total number of scores.
For example: cumulative
relative frequency of a
score of 3 is 547/708 =
.773.
Cumulative relative
frequencies show the
proportion of scores at or
below each score.
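One way to sketch this in Python is to build the running totals first and then divide each by N. The function name `proportion_at_or_below` is our own, chosen for illustration.

```python
from itertools import accumulate

# Absolute frequencies for 0 through 11 accidents, from the table.
absolute = [117, 157, 158, 115, 78, 44, 21, 7, 6, 1, 3, 1]

n = sum(absolute)
cumulative = list(accumulate(absolute))                  # scores at or below each score
cum_relative = [round(cf / n, 3) for cf in cumulative]   # divide each running total by N

def proportion_at_or_below(score):
    """Cumulative relative frequency for a given number of accidents (0 to 11)."""
    return cum_relative[score]

print(proportion_at_or_below(3))   # 547/708, the slide's worked example
```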
Grouped Frequencies
Needed when
– number of values is large OR
– values are continuous.
To calculate group intervals
– First find the range.
– Determine a “good” interval based on
• number of resulting intervals,
• meaning of data, and
• common, regular numbers.
– List intervals from largest to smallest.
Grouped Frequency Example
100 High school students’ average time in seconds to read
ambiguous sentences.
Values range between 2.50 seconds and 2.99 seconds.
2.72  2.58  2.87  2.85  2.83  2.83  2.87  2.88  2.84  2.60
2.87  2.61  2.79  2.96  2.84  2.85  2.63  2.63  2.74  2.54
2.76  2.93  2.84  2.51  2.62  2.70  2.73  2.75  2.89  2.80
2.54  2.73  2.52  2.96  2.86  2.92  2.65  2.98  2.80  2.75
2.90  2.58  2.98  2.70  2.61  2.79  2.99  2.75  2.87  2.59
2.61  2.93  2.96  2.66  2.76  2.89  2.81  2.89  2.87  2.58
2.58  2.93  2.89  2.78  2.83  2.76  2.50  2.71  2.64  2.52
2.95  2.85  2.58  2.82  2.51  2.85  2.59  2.96  2.52  2.66
2.83  2.87  2.70  2.54  2.95  2.66  2.86  2.90  2.87  2.56
2.54  2.56  2.74  2.86  2.91  2.75  2.51  2.85  2.59  2.73
Determining “i” (the size of the
interval)
• WHAT IS THE RULE FOR DETERMINING
THE SIZE OF INTERVALS TO USE IN WHICH
TO GROUP DATA?
• Whatever intervals seem appropriate to most
informatively present the data. It is a matter of
judgment. Usually we use 6 to 12 same-size
intervals, each of which uses an intuitively obvious
endpoint such as 0 or 5.
Grouped Frequencies
Range = 2.995 - 2.495 = .50 (see real/apparent class limits--discussed infra)

i = .1
#i = 5
Reading Time   Frequency
2.90-2.99         16
2.80-2.89         31
2.70-2.79         20
2.60-2.69         12
2.50-2.59         21

i = .05
#i = 10
Reading Time   Frequency
2.95-2.99          9
2.90-2.94          7
2.85-2.89         20
2.80-2.84         11
2.75-2.79         10
2.70-2.74         10
2.65-2.69          4
2.60-2.64          8
2.55-2.59         10
2.50-2.54         11
Either is acceptable.
• Use whichever display seems most
informative.
• In this case, the smaller intervals and 10-category
table seem more informative.
• Sometimes it goes the other way and less
detailed presentation is necessary to prevent
the reader from missing the forest for the
trees.
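Both grouped tables can be reproduced from the 100 reading times with a sketch like the one below. The function `grouped_frequencies` and its parameters are our own; it works in whole hundredths of a second (an implementation choice, not something from the slides) so that floating-point error cannot push a value such as 2.60 into the wrong interval.

```python
from collections import Counter

# The 100 reading times (seconds) from the example data.
times = [
    2.72, 2.58, 2.87, 2.85, 2.83, 2.83, 2.87, 2.88, 2.84, 2.60,
    2.87, 2.61, 2.79, 2.96, 2.84, 2.85, 2.63, 2.63, 2.74, 2.54,
    2.76, 2.93, 2.84, 2.51, 2.62, 2.70, 2.73, 2.75, 2.89, 2.80,
    2.54, 2.73, 2.52, 2.96, 2.86, 2.92, 2.65, 2.98, 2.80, 2.75,
    2.90, 2.58, 2.98, 2.70, 2.61, 2.79, 2.99, 2.75, 2.87, 2.59,
    2.61, 2.93, 2.96, 2.66, 2.76, 2.89, 2.81, 2.89, 2.87, 2.58,
    2.58, 2.93, 2.89, 2.78, 2.83, 2.76, 2.50, 2.71, 2.64, 2.52,
    2.95, 2.85, 2.58, 2.82, 2.51, 2.85, 2.59, 2.96, 2.52, 2.66,
    2.83, 2.87, 2.70, 2.54, 2.95, 2.66, 2.86, 2.90, 2.87, 2.56,
    2.54, 2.56, 2.74, 2.86, 2.91, 2.75, 2.51, 2.85, 2.59, 2.73,
]

def grouped_frequencies(values, width, lowest=250):
    """Count values per class interval.

    width and lowest are in hundredths of a second (width=10 means i = .1).
    Returns a Counter keyed by each interval's lower apparent limit, in hundredths.
    """
    counts = Counter()
    for v in values:
        h = round(v * 100)                               # e.g. 2.58 -> 258
        lower = lowest + (h - lowest) // width * width   # lower limit of h's interval
        counts[lower] += 1
    return counts

for width in (10, 5):                                    # i = .1, then i = .05
    table = grouped_frequencies(times, width)
    print(f"i = {width / 100:.2f}, #i = {len(table)}")
    for lower in sorted(table, reverse=True):            # largest interval first
        upper = lower + width - 1
        print(f"  {lower / 100:.2f}-{upper / 100:.2f}  {table[lower]:>3}")
```

Changing `width` is all it takes to compare candidate interval sizes before deciding which table is more informative.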
How you organize the data is up
to you.
• When engaged in this kind of thing, there is
often more than one way to organize the data.
• You should organize the data so that people
can easily understand what is going on.
• Thus, the point is to use the grouped frequency
distribution to provide a simplified description
of the data.
Stem and Leaf Displays
• Used when seeing all of the values is
important.
• Shows
– data grouped
– all values
– visual summary
Stem and Leaf Display
• Reading time data
i = .05
#i = 10

Reading Time   Leaves
2.9            5,5,6,6,6,6,8,8,9
2.9            0,0,1,2,3,3,3
2.8            5,5,5,5,5,6,6,6,7,7,7,7,7,7,7,8,9,9,9,9
2.8            0,0,1,2,3,3,3,3,4,4,4
2.7            5,5,5,5,6,6,6,8,9,9
2.7            0,0,0,1,2,3,3,3,4,4
2.6            5,6,6,6
2.6            0,1,1,1,2,3,3,4
2.5            6,6,8,8,8,8,8,9,9,9
2.5            0,1,1,1,2,2,2,4,4,4,4
Stem and Leaf Display
• Reading time data
i = .1
#i = 5

Reading Time   Leaves
2.9            0,0,1,2,3,3,3,5,5,6,6,6,6,8,8,9
2.8            0,0,1,2,3,3,3,3,4,4,4,5,5,5,5,5,6,6,6,7,7,7,7,7,7,7,8,9,9,9,9
2.7            0,0,0,1,2,3,3,3,4,4,5,5,5,5,6,6,6,8,9,9
2.6            0,1,1,1,2,3,3,4,5,6,6,6
2.5            0,1,1,1,2,2,2,4,4,4,4,6,6,8,8,8,8,8,9,9,9
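A minimal sketch of how a stem-and-leaf display with i = .1 can be built: split each score into a stem (everything but the last digit) and a leaf (the last digit), then sort the leaves within each stem. To keep it short, the sketch uses only the first ten reading times from the data list, so its rows are much shorter than the full display. The function name `stem_and_leaf` is our own.

```python
from collections import defaultdict

# First ten reading times from the example data (an illustrative subset).
sample = [2.72, 2.58, 2.87, 2.85, 2.83, 2.83, 2.87, 2.88, 2.84, 2.60]

def stem_and_leaf(values):
    """Map each stem (in tenths, e.g. 28 for 2.8) to its sorted list of leaves."""
    rows = defaultdict(list)
    for v in values:
        stem, leaf = divmod(round(v * 100), 10)   # 2.87 -> 287 -> stem 28, leaf 7
        rows[stem].append(leaf)
    return {stem: sorted(leaves) for stem, leaves in rows.items()}

display = stem_and_leaf(sample)
for stem in sorted(display, reverse=True):        # largest stem first, as on the slide
    leaves = ",".join(str(leaf) for leaf in display[stem])
    print(f"{stem / 10:.1f} | {leaves}")
```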
Purely figural displays of
frequency data
Bar graphs
• Bar graphs are used to show frequency of scores
when you have a discrete variable.
• Discrete data can only take on a limited number of
values.
• Numbers between adjoining values of a discrete
variable are impossible or meaningless.
• Bar graphs show the frequency of specific scores
or ranges of scores of a discrete variable.
• The proportion of the total area of the figure taken
by a specific bar equals the proportion of that kind
of score.
• Note, in this context proportion and relative
frequency are synonymous.
The results of rolling a six-sided
die 120 times
120 rolls – and it came out 20 ones, 20 twos, etc.
[Bar graph: frequency (0 to 100) on the vertical axis; die faces 1 through 6 on the horizontal axis.]
Bar graphs and Histograms
• Use bar graphs, not histograms, for discrete data.
(The bars don’t touch in a bar graph, they do in a
histogram.)
• You rarely see data that is really discrete.
• Discrete data are almost always categories or
rankings. ANYTHING ELSE IS ALMOST
CERTAINLY A CONTINUOUS VARIABLE.
• Use histograms for continuous variables.
• AGAIN, almost every score you will obtain reflects
the measurement of a continuous variable.
A stem and leaf display turned on its side
shows the transition to purely figural
displays of a continuous variable
[Figure: the i = .05 stem-and-leaf display of the reading times rotated 90 degrees. The class intervals 2.50–2.54 through 2.95–2.99 run along the horizontal axis, and the leaves for each interval are stacked vertically above it.]
Histogram of reading times –
notice how the bars touch at the
real limits of each class!
[Histogram: Frequency (0 to 20) on the vertical axis; Reading Time (seconds) on the horizontal axis, in class intervals 2.50–2.54 through 2.95–2.99.]
Histogram concepts - 1
• Histograms must be used to display continuous data.
• Most scores obtained by psychologists are
continuous, even if the scores are integers.
• WHAT COUNTS IS WHAT YOU ARE
MEASURING, NOT THE PRECISION OF
MEASUREMENT.
• INTEGER SCORES IN PSYCHOLOGY ARE
USUALLY ROUGH MEASUREMENTS OF
CONTINUOUS VARIABLES.
Example and question
• You give a Psych Quiz with ten questions.
Scores can be 0,1,2,3,4,5,6,7,8,9, or 10.
• Are the resulting scores discrete or
continuous data?
Answer to example
• While scores on a ten-question multiple-choice
intro psych quiz (0, 1, 2, … 10) are integers, you are
measuring knowledge, which is a continuous
variable that could be measured with 10,000
questions, each counting .001 points. Or 1,000,000
questions each worth .00001 points.
• You measure at a specific level of precision,
because that’s all you need or can afford.
Logistics, not the nature of the variable, constrains
the measurement of a continuous variable.
Histogram concepts - 2
• If you have continuous data, you can use
histograms, but remember real class limits.
• Histograms can be used for relative
frequencies as well.
• Histograms can be used to describe theoretical
distributions as well as actual distributions.
Theoretical Histograms
Displaying theoretical
distributions is the most
important function of
histograms.
• Theoretical distributions
show how scores can be
expected to be distributed
around the mean.
TYPES OF THEORETICAL
DISTRIBUTIONS
• Distributions are named after the shapes of
their histograms. For psychologists, the
most important are:
– Rectangular
– J-shaped
– Bell (Normal)
– t distributions - Close to Bell shaped, but
a little flatter
Rectangular Distribution of
scores
The rectangular distribution is the “know
nothing” distribution
• Our best prediction is that everyone will score
at the mean.
• But in a rectangular distribution, scores far
from the mean occur as often as do scores
close to the mean.
• So the mean tells us nothing about where the
next score will fall (or how the next person
will behave).
• We know nothing in that case.
Flipping a coin: Rectangular distributions
are frequently seen in games of chance, but rarely
elsewhere.
100 flips - how many heads and tails do you expect?
[Bar graph: frequency (0 to 100) on the vertical axis; Heads and Tails on the horizontal axis.]
Rolling a die
120 rolls - how many of each number do you expect?
[Bar graph: frequency (0 to 100) on the vertical axis; die faces 1 through 6 on the horizontal axis.]
Which distribution is this?
[Bar graph: frequency (0 to 100) on the vertical axis; die faces 1 through 6 on the horizontal axis.]
RECTANGULAR!
What happens when you sample two
scores at a time?
• All of a sudden things change.
• The distribution of scores begins to
resemble a normal curve!!!!
• The normal curve is the “we know
something” distribution, because most
scores are close to the mean.
Rolling 2 dice

Dice Total   Absolute Freq.   Relative Frequency
     1              0               .000
     2              1               .028
     3              2               .056
     4              3               .083
     5              4               .111
     6              5               .139
     7              6               .167
     8              5               .139
     9              4               .111
    10              3               .083
    11              2               .056
    12              1               .028
 Total             36              1.001
Look at the histogram to
see how this resembles a
bell shaped curve.
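The two-dice table can be derived by listing all 36 equally likely outcomes and counting how many produce each total. The Python below is a sketch of that enumeration, with names of our own choosing; the total of 1 is included, as on the slide, even though no outcome produces it.

```python
from collections import Counter
from itertools import product

# Every ordered pair (first die, second die): 6 x 6 = 36 outcomes.
totals = Counter(a + b for a, b in product(range(1, 7), repeat=2))

n_outcomes = sum(totals.values())
relative = {t: round(totals[t] / n_outcomes, 3) for t in range(1, 13)}

for t in range(1, 13):
    print(f"{t:>2}  {totals[t]:>2}  {relative[t]:.3f}")

# The rounded column sums to slightly more than 1 (rounding error again).
print(round(sum(relative.values()), 3))
```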
Rolling 2 dice
360 rolls
[Histogram: frequency (0 to 100) on the vertical axis; dice totals 1 through 12 on the horizontal axis.]
Normal Curve
J Curve
Occurs when socially normative behaviors are measured.
Most people follow the norm,
but there are always a few outliers.
What does the J shaped distribution represent?
• The J shaped distribution represents situations in
which most everyone does about the same thing.
These are unusual social situations with very clear
contingencies.
• For example, how long do cars without
handicapped plates park in a handicapped spot
when there is a cop standing next to the spot?
• Answer: Zero minutes!
• So, the J shaped distribution is the “we know
almost everything” distribution, because we
can predict how a large majority of people will
behave.
Principles of Theoretical Curves
Expected frequency = Theoretical relative
frequency × N
Expected frequencies are your best estimates
because they are closer, on the average, than
any other estimate when we square the
difference between observed and predicted
frequencies.
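As a small worked example of the expected-frequency formula, the sketch below applies it to the two-dice distribution with N = 360 rolls (the N used on the two-dice histogram slide). The function name `expected_frequencies` is our own, and exact fractions are used only to avoid rounding in the illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def expected_frequencies(n_rolls):
    """Expected frequency of each two-dice total: theoretical relative frequency x N."""
    ways = Counter(a + b for a, b in product(range(1, 7), repeat=2))
    return {total: Fraction(count, 36) * n_rolls for total, count in sorted(ways.items())}

expected = expected_frequencies(360)
for total, freq in expected.items():
    print(f"{total:>2}  {float(freq):.0f}")
```

With 360 rolls, the expected frequency of a 7 is 60 and of a 2 is 10; the expected frequencies add back up to N.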
Law of Large Numbers - The more observations
that we have, the closer the relative frequencies
we actually observe should come to the
theoretical relative frequency distribution.
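A quick simulation can illustrate the Law of Large Numbers for a single die, whose theoretical relative frequency is 1/6 for every face. This is only a sketch: the function `largest_gap`, the seed, and the sample sizes are our own choices, and any one run at a small N can land closer or farther than the trend suggests.

```python
import random

def largest_gap(n_rolls, rng):
    """Roll a fair die n_rolls times; return the largest distance between
    any face's observed relative frequency and the theoretical 1/6."""
    counts = [0] * 6
    for _ in range(n_rolls):
        counts[rng.randrange(6)] += 1
    return max(abs(c / n_rolls - 1 / 6) for c in counts)

rng = random.Random(0)   # fixed seed so the run is repeatable (an arbitrary choice)
gaps = {n: largest_gap(n, rng) for n in (120, 12_000, 120_000)}

for n, gap in gaps.items():
    print(f"N = {n:>7}: largest gap from 1/6 = {gap:.4f}")
```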