Chapter 2: Frequency Distributions


Chapter 2: Describing & Exploring Data
As mentioned in the introduction, the purpose of using statistics in research is to
aid in the summary of data. Two primary methods are used to accomplish such a
summary: graphical and numerical.
Graphical methods summarize data visually via graphs, charts, and tables.
• frequency distributions
• histograms
• stem-and-leaf displays
• boxplots
• (dotplots)
• scatterplots
Numerical methods summarize data using numbers to describe various
characteristics or trends within the distribution.
• measures of central tendency
• measures of variability
• measures of association
Chapter 2: Example Data Set
Consider the following data (Chap2Ex.sav). If I were to tell you that these are test
scores, you would have a difficult time summarizing them by simply eyeballing the
data.
GRE SCORES
Obs  Person  Score    Obs  Person  Score    Obs  Person  Score    Obs  Person  Score
  1     1     680      26    26     405      51    51     535      76    76     572
  2     2     492      27    27     595      52    52     494      77    77     638
  3     3     540      28    28     539      53    53     377      78    78     401
  4     4     392      29    29     492      54    54     515      79    79     416
  5     5     724      30    30     622      55    55     416      80    80     580
  6     6     438      31    31     437      56    56     490      81    81     638
  7     7     551      32    32     436      57    57     414      82    82     401
  8     8     491      33    33     466      58    58     576      83    83     437
  9     9     441      34    34     492      59    59     505      84    84     488
 10    10     503      35    35     597      60    60     693      85    85     574
 11    11     426      36    36     378      61    61     465      86    86     590
 12    12     475      37    37     618      62    62     661      87    87     459
 13    13     569      38    38     466      63    63     398      88    88     489
 14    14     420      39    39     609      64    64     473      89    89     604
 15    15     426      40    40     486      65    65     518      90    90     400
 16    16     420      41    41     364      66    66     418      91    91     495
 17    17     534      42    42     267      67    67     547      92    92     650
 18    18     470      43    43     459      68    68     346      93    93     438
 19    19     365      44    44     565      69    69     578      94    94     369
 20    20     543      45    45     540      70    70     586      95    95     455
 21    21     631      46    46     453      71    71     535      96    96     405
 22    22     643      47    47     587      72    72     534      97    97     252
 23    23     458      48    48     408      73    73     394      98    98     417
 24    24     661      49    49     628      74    74     446      99    99     482
 25    25     394      50    50     355      75    75     557     100   100     630
Chapter 2: Frequency Distributions
A first step in summarizing data is to create a frequency distribution showing
the number of observations for each observed data value. (In my experience,
frequency distributions more often group the data, with counts or percentages
of observations reported within various ranges.)
Sometimes frequency distributions include cumulative frequencies so that you can
determine the counts and percentages of observations at or below any given data
point or range. Percentile ranks are determined this way for educational tests.
Alternatively, you might be interested in reporting fewer cut points than
percentiles. For example, a common quantile (AKA fractile) is the quartile.
• Quartiles: The data values at or below which fall 25% (Q1), 50% (Q2 =
median), and 75% (Q3) of the observations, corresponding to the first,
second, and third quartiles. The fourth quartile is the maximum value.
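The construction of a frequency distribution with cumulative percentages can be sketched in Python. This is only an illustration, not the SPSS procedure; the function name `frequency_table` is made up for the example, and the ten scores are the first ten observations of the GRE data set.

```python
from collections import Counter

def frequency_table(scores):
    """Return rows of (value, frequency, percent, cumulative percent)."""
    counts = Counter(scores)
    n = len(scores)
    rows = []
    cumulative = 0
    for value in sorted(counts):
        cumulative += counts[value]
        rows.append((value, counts[value],
                     100 * counts[value] / n,
                     100 * cumulative / n))
    return rows

# First ten GRE scores from the example data set (observations 1-10).
scores = [680, 492, 540, 392, 724, 438, 551, 491, 441, 503]
for value, freq, pct, cum_pct in frequency_table(scores):
    print(f"{value:5d} {freq:3d} {pct:6.1f} {cum_pct:6.1f}")
```

The cumulative percent column is what lets you read off the percentage of observations at or below a given score.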
Chapter 2: Frequency Procedure
Frequency procedure:
In SPSS, frequency distributions for specified variables can be obtained by
choosing the following from the menu bar:
Analyze -> Descriptive Statistics -> Frequencies.
In the pop-up dialog box, choose GRE in the left box, and then move it to
the right box by clicking the arrow (►) button.
This procedure generates a frequency distribution table for the variable GRE.
Chapter 2: Frequency Table
GRE (Valid)
Value   Frequency   Percent   Valid Percent   Cumulative Percent
252         1         1.0          1.0               1.0
267         1         1.0          1.0               2.0
346         1         1.0          1.0               3.0
355         1         1.0          1.0               4.0
364         1         1.0          1.0               5.0
365         1         1.0          1.0               6.0
369         1         1.0          1.0               7.0
377         1         1.0          1.0               8.0
378         1         1.0          1.0               9.0
392         1         1.0          1.0              10.0
394         2         2.0          2.0              12.0
398         1         1.0          1.0              13.0
400         1         1.0          1.0              14.0
401         2         2.0          2.0              16.0
405         2         2.0          2.0              18.0
408         1         1.0          1.0              19.0
414         1         1.0          1.0              20.0
416         2         2.0          2.0              22.0
417         1         1.0          1.0              23.0
418         1         1.0          1.0              24.0
420         2         2.0          2.0              26.0
426         2         2.0          2.0              28.0
436         1         1.0          1.0              29.0
437         2         2.0          2.0              31.0
438         2         2.0          2.0              33.0
Q1 falls in this part of the table (the cumulative percent passes 25% at a score of 420).
Chapter 2: Frequency Table (Cont’d)
Output Continued
Value   Frequency   Percent   Valid Percent   Cumulative Percent
441         1         1.0          1.0              34.0
446         1         1.0          1.0              35.0
453         1         1.0          1.0              36.0
455         1         1.0          1.0              37.0
458         1         1.0          1.0              38.0
459         2         2.0          2.0              40.0
465         1         1.0          1.0              41.0
466         2         2.0          2.0              43.0
470         1         1.0          1.0              44.0
473         1         1.0          1.0              45.0
475         1         1.0          1.0              46.0
482         1         1.0          1.0              47.0
486         1         1.0          1.0              48.0
488         1         1.0          1.0              49.0
489         1         1.0          1.0              50.0
490         1         1.0          1.0              51.0
491         1         1.0          1.0              52.0
492         3         3.0          3.0              55.0
494         1         1.0          1.0              56.0
495         1         1.0          1.0              57.0
503         1         1.0          1.0              58.0
505         1         1.0          1.0              59.0
515         1         1.0          1.0              60.0
518         1         1.0          1.0              61.0
534         2         2.0          2.0              63.0
535         2         2.0          2.0              65.0
539         1         1.0          1.0              66.0
Q2 (median) falls in this part of the table (the cumulative percent reaches 50% at a score of 489).
Chapter 2: Frequency Table (Cont’d)
Output Continued
Value   Frequency   Percent   Valid Percent   Cumulative Percent
540         2         2.0          2.0              68.0
543         1         1.0          1.0              69.0
547         1         1.0          1.0              70.0
551         1         1.0          1.0              71.0
557         1         1.0          1.0              72.0
565         1         1.0          1.0              73.0
569         1         1.0          1.0              74.0
572         1         1.0          1.0              75.0
574         1         1.0          1.0              76.0
576         1         1.0          1.0              77.0
578         1         1.0          1.0              78.0
580         1         1.0          1.0              79.0
586         1         1.0          1.0              80.0
587         1         1.0          1.0              81.0
590         1         1.0          1.0              82.0
595         1         1.0          1.0              83.0
597         1         1.0          1.0              84.0
604         1         1.0          1.0              85.0
609         1         1.0          1.0              86.0
618         1         1.0          1.0              87.0
622         1         1.0          1.0              88.0
628         1         1.0          1.0              89.0
630         1         1.0          1.0              90.0
631         1         1.0          1.0              91.0
638         2         2.0          2.0              93.0
643         1         1.0          1.0              94.0
650         1         1.0          1.0              95.0
661         2         2.0          2.0              97.0
680         1         1.0          1.0              98.0
693         1         1.0          1.0              99.0
724         1         1.0          1.0             100.0
Total     100       100.0        100.0
Q3 falls in this part of the table (the cumulative percent reaches 75% at a score of 572).
Chapter 2: Histograms
The graphic version of a frequency distribution is a histogram (also
known as a bar chart*).
We examine frequency distributions and histograms for each variable of
interest to get an impression of the overall shape of the data and to see
whether there are outliers in the data. Both of these features may
influence your choice of a statistic for summarizing your data.
[Figure: Histogram of Scores. x-axis: Score; y-axis: Frequency]
*The difference between a histogram and a bar chart is that the histogram is for
continuous variables and the bar chart is for discrete (categorical) variables.
Thus, to create a histogram, we need to decide on the number of class intervals.
SPSS does this for you, but it is generally recommended to use from 5 to 20 class
intervals.
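To make the idea of class intervals concrete, here is a small Python sketch that counts scores in equal-width intervals and prints a text histogram. The scores, the starting point, the width, and the function name `bin_counts` are all hypothetical choices for illustration; SPSS picks its own intervals.

```python
def bin_counts(scores, start, width, n_bins):
    """Count scores in n_bins equal-width class intervals [lower, lower + width)."""
    counts = [0] * n_bins
    for x in scores:
        index = int((x - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts

# Hypothetical scores, grouped into 5 class intervals of width 10 starting at 20.
scores = [21, 25, 33, 38, 39, 41, 47, 52, 55, 68]
counts = bin_counts(scores, start=20, width=10, n_bins=5)
for i, count in enumerate(counts):
    lower = 20 + 10 * i
    print(f"{lower}-{lower + 9}: {'*' * count}")
```

Changing `width` (and therefore the number of intervals) changes how smooth or ragged the histogram looks, which is why the choice matters.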
Chapter 2: Distribution Shapes
As mentioned previously, graphic representations of data help us get an idea
of the shape of the data and identify outliers, because these characteristics may
influence the choice of summary statistic. A variety of terms are used to
describe the shape of data.
• Normal: Unimodal (one distinct peak), symmetrical (i.e., you can draw a
line down the middle such that each side is nearly a mirror image of the other
side) & bell-shaped (peaked in the middle and tapering in the tails). There is
actually a mathematical formula for a truly normal distribution.
• Bimodal: Having two distinct peaks (definitely NOT “normal”).
Chapter 2: Distribution Shapes
• Skewed: Having most data points cluster tightly with a few data points
being extreme. Distributions may be negatively skewed (having a few data
points in the extreme low direction) or positively skewed (having a few
data points in the extreme high direction). The mnemonic is that you can
draw an arrow on the tail of the distribution, and the skew is in the
direction of the arrow.
[Figure: two panels, “Negatively (or left) skewed distribution” and “Positively (or right) skewed distribution”]
Chapter 2: Distribution Shapes
• Kurtosis: The relative concentration of data points in the center, shoulders, and
tails of the distribution.
• Mesokurtic: A description of the relative concentration of data points in a
normal distribution.
• Platykurtic: A distribution that tends to have fewer data points in the
center and tails (and more in the shoulders) than a normal distribution.
Remember that plateaus are flat.
• Leptokurtic: A distribution that tends to have more data points in the
center and tails (and fewer in the shoulders) than a normal distribution.
Remember that to “leap” is to jump up: the distribution jumps up in the middle.
[Figure: two panels, each compared with a normal distribution: “Platykurtic (negative kurtosis)” and “Leptokurtic (positive kurtosis)”]
Chapter 2: Chart Procedure
The chart procedure:
The chart procedure is an SPSS procedure that requests bar charts or histograms for
specific variables. In the data screen of SPSS, we choose the following option from
the menu bar:
Graphs -> Histogram, and move the variable GRE from the left box to the
Variable box on the upper right by clicking the right arrow in the middle.
(Another way to create a histogram: after the sequence Analyze -> Descriptive
Statistics -> Frequencies, move the GRE variable to the right box and click the
Charts button at the bottom. Then choose Histogram and click Continue.)
This tells SPSS that we want to generate a plot of the data. In this example,
we request a chart for the variable GRE.
Chapter 2: Chart Procedure
[Figure: Histogram for GRE scores. x-axis: GRE; y-axis: Frequency. Mean = 497.02, Std. Dev. = 95.325, N = 100]
Q. How many class intervals does this histogram have and what is the
width for each class interval (AKA bin size)?
A. 19, 200/8 = 25
Chapter 2: Chart Procedure
Sometimes it is a good idea to overlay the normal curve, because the normal
distribution can serve as a reference distribution that is unimodal, symmetric (i.e.,
skewness = 0), and mesokurtic (kurtosis = 0). We can do this by selecting the
Display normal curve option (or With normal curve under Charts in Frequencies).
[Figure: Histogram for GRE scores with the normal curve overlaid, annotated “Modality?”, “Skewness?”, “Kurtosis?”. x-axis: GRE; y-axis: Frequency. Mean = 497.02, Std. Dev. = 95.325, N = 100]
Chapter 2: Chart Procedure
Sometimes it is hard to figure out these characteristics of the
shape of the distribution by eyeballing the chart. In this case, we
get the values by choosing the Statistics option in the pop-up dialog
box of the Frequencies procedure. Choose Mean, Median, and
Mode from the Central Tendency section, and Skewness and Kurtosis
from the Distribution section.
Statistics: GRE
N  Valid                    100
N  Missing                  0
Mean                        497.02
Median                      489.50
Mode                        492
Skewness                    .113
Std. Error of Skewness      .241
Kurtosis                    -.387
Std. Error of Kurtosis      .478
Note: A rule of thumb for skewness and kurtosis would be:
Within ±1 (i.e., absolute value less than 1) --- slightly
Between ±1 and ±2 (i.e., -2 to -1, or 1 to 2) --- moderately
Outside of ±2 (i.e., less than -2 or larger than 2) --- heavily
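The rule of thumb can be written as a tiny Python helper. This is a sketch only: the function name `describe_departure` is invented, and treating a value of exactly 1 or exactly 2 as “moderately” is an assumption, since the rule does not say how to handle the boundaries.

```python
def describe_departure(value):
    """Label a skewness or kurtosis value using the rule of thumb above."""
    size = abs(value)
    if size < 1:
        return "slightly"
    if size <= 2:  # assumption: boundary values count as "moderately"
        return "moderately"
    return "heavily"

# Values reported by SPSS for the GRE variable.
print("skewness:", describe_departure(0.113))
print("kurtosis:", describe_departure(-0.387))
```

Both GRE values fall well inside ±1, so the distribution departs only slightly from the normal reference on both counts.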
Chapter 2: Chart Procedure
Also, there is a relationship among the mean $\bar{X}$, the median Md, and the mode
Mo for a given distribution if the distribution is unimodal.
(a) When the distribution is symmetric, $\bar{X}$ = Md = Mo.
(b) When the distribution is negatively skewed, $\bar{X}$ < Md < Mo.
(c) When the distribution is positively skewed, Mo < Md < $\bar{X}$.
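The ordering for a positively skewed distribution can be checked on a small made-up data set with Python's standard `statistics` module. The scores below are hypothetical and chosen only to have a long right tail.

```python
import statistics

# Hypothetical positively skewed scores: most values are low, a few are high.
scores = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]

mode = statistics.mode(scores)
median = statistics.median(scores)
mean = statistics.mean(scores)
print(mode, median, mean)
```

Here the two large values (9 and 15) pull the mean above the median, while the mode stays at the hump of the distribution.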
Chapter 2: Summation Notation
For simplicity of expression, we use symbols to represent various concepts in
statistics.
Variables: The codes (often numerical codes) we use to describe the constructs
we’re interested in. Variables are indicated by upper-case letters (X, Y). Individual
values are represented using subscripts ($X_i$, $Y_j$).
Summation: We frequently need to add a series of observations for a variable.
The Greek upper-case sigma ($\Sigma$) is used to symbolize this. For example,
$\sum_{i=1}^{N} X_i$ is read as “the sum of the values of X, with i ranging from 1 to N.”
$\Sigma$  stands for “the sum of”
X  stands for the variable we sum
i  referred to as a subscripting index, stands for the individual values of X
N  stands for the highest value of i we sum across (usually the number of cases).
N could be replaced by a number, but we usually use a letter like N to
indicate that we’re summing across all values of X (i.e., there are N values
of the X variable).
Chapter 2: Summation Notation
Some examples
Say the data are these pretest scores: 8, 7, 5, 10
X1 would be the score of the first person in the data set. Here X1 = 8. The
first score is not necessarily the largest (or smallest) score, because we don’t
assume the scores are ordered.
Xi is the “ith” score -- here you select what value of i you are interested in:
If i = 3, X3 = 5. If i = N, then here N = 4, so X4 = 10.
Saying “Xi, for i = 1 to N” means “the set of all N scores”.
Chapter 2: Summation Notation
Frequently, it is clear that we want to sum all values of X, so we can simply write
$\sum X$ ($= \sum_{i=1}^{N} X_i$), which equals $(X_1 + X_2 + X_3 + \dots + X_N)$.
That is, we omit all the subscripts, which Howell does. But in my opinion, it is always a
good idea to keep them, because they will help you.
Other common summations are
• The sum of the squared values of X:
$\sum_{i=1}^{N} X_i^2 = X_1^2 + X_2^2 + X_3^2 + \dots + X_N^2$
• The square of the sum of X:
$\left(\sum_{i=1}^{N} X_i\right)^2 = (X_1 + X_2 + X_3 + \dots + X_N)^2$
Note that $\sum_{i=1}^{N} X_i^2 \neq \left(\sum_{i=1}^{N} X_i\right)^2$. (Check this with the
example above.)
In general, you perform the functions within parentheses prior to performing the
functions outside of the parentheses.
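The check suggested above can be done in a few lines of Python, using the pretest scores 8, 7, 5, 10 from the earlier example. The variable names are just for illustration.

```python
scores = [8, 7, 5, 10]  # pretest scores from the example

sum_of_squares = sum(x ** 2 for x in scores)  # square each score, then add
square_of_sum = sum(scores) ** 2              # add the scores, then square

print(sum_of_squares, square_of_sum)  # 238 900
```

The two results differ because the order of operations differs: squaring happens inside the summation in the first case and outside it in the second.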
Chapter 2: Summation Notation
• The sum of X added to a constant C:
$\sum (X + C) = (X_1 + C) + (X_2 + C) + (X_3 + C) + \dots + (X_N + C) = \sum X + NC$
• The sum of X multiplied by a constant C:
$\sum XC = X_1 C + X_2 C + X_3 C + \dots + X_N C = C \sum X$
• The sum of the products of matched pairs of X and Y:
$\sum XY = X_1 Y_1 + X_2 Y_2 + X_3 Y_3 + \dots + X_N Y_N$
• The sum of a difference between X and Y:
$\sum (X - Y) = (X_1 - Y_1) + (X_2 - Y_2) + (X_3 - Y_3) + \dots + (X_N - Y_N) = \sum X - \sum Y$
• Note that $\sum (X - Y) = \sum (X_i - Y_i) = \sum X_i - \sum Y_i$
It is easier to tell if a variable is being summed when it has a subscript, but
sometimes, as above, the subscript is dropped.
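These summation rules can be verified numerically. The sketch below reuses the pretest scores 8, 7, 5, 10 as X; the matched Y scores and the constant C are hypothetical values picked for the example.

```python
x = [8, 7, 5, 10]  # pretest scores from the earlier example
y = [6, 9, 4, 7]   # hypothetical matched scores
c = 3              # hypothetical constant
n = len(x)

sum_plus_c = sum(xi + c for xi in x)                 # sum of (X + C)
sum_times_c = sum(xi * c for xi in x)                # sum of XC
sum_products = sum(xi * yi for xi, yi in zip(x, y))  # sum of XY
sum_diffs = sum(xi - yi for xi, yi in zip(x, y))     # sum of (X - Y)

print(sum_plus_c, sum(x) + n * c)   # both 42
print(sum_times_c, c * sum(x))      # both 90
print(sum_diffs, sum(x) - sum(y))   # both 4
print(sum_products, sum(x) * sum(y))  # 201 vs 780: not equal
```

The last line is a reminder that the sum of products is not the product of the sums.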
Chapter 2: Summation Notation
In crosstabulated or two-way tables, it is common to use two subscripts (one for the
rows and one for the columns). Hence, to sum across both rows and columns, we
would write
$\sum_{i=1}^{I} \sum_{j=1}^{J} X_{ij}$
which is read, “the sum of X over the subscripts i and j, where i ranges from 1 to I
and j ranges from 1 to J.”
Ex. We have a cross-tabulated table on political party preference crossed by
gender. Each cell represents the number of people.
              Political preference
Gender        1=D          2=R          3=I
1=Male        3 (= X11)    5 (= X12)    2 (= X13)
2=Female      5 (= X21)    3 (= X22)    2 (= X23)
Chapter 2: Summation Notation
In this example, I = 2, J = 3. Thus we have
$\sum_{i=1}^{I} \sum_{j=1}^{J} X_{ij} = \sum_{i=1}^{2} \sum_{j=1}^{3} X_{ij}$.
$\sum_{i=1}^{2} \sum_{j=1}^{3} X_{ij}$ simply says add up all the numbers in the table.
We can go with either the columns first or the rows first. If we fix the row, add up
the numbers in that row, and then move to the second row, then we do:
$\sum_{i=1}^{2} \sum_{j=1}^{3} X_{ij} = \sum_{i=1}^{2} \left(\sum_{j=1}^{3} X_{ij}\right) = \sum_{i=1}^{2} (X_{i1} + X_{i2} + X_{i3}) = (X_{11} + X_{12} + X_{13}) + (X_{21} + X_{22} + X_{23})$
$= 10 + 10 = X_{1\cdot} + X_{2\cdot} = 20$
              Political preference
Gender        1=D          2=R          3=I          Row Total
1=Male        3 (= X11)    5 (= X12)    2 (= X13)    10 (= X1·)
2=Female      5 (= X21)    3 (= X22)    2 (= X23)    10 (= X2·)
Column Total  8 (= X·1)    8 (= X·2)    4 (= X·3)    20 (= X··)
Chapter 2: Summation Notation
Or we can go the other way around. That is, fixing the column, add up the numbers
in that column, and then do the same thing for the next column.
$\sum_{i=1}^{2} \sum_{j=1}^{3} X_{ij} = \sum_{j=1}^{3} \left(\sum_{i=1}^{2} X_{ij}\right) = \sum_{j=1}^{3} (X_{1j} + X_{2j}) = (X_{11} + X_{21}) + (X_{12} + X_{22}) + (X_{13} + X_{23})$
$= 8 + 8 + 4 = X_{\cdot 1} + X_{\cdot 2} + X_{\cdot 3} = 20$
              Political preference
Gender        1=D          2=R          3=I          Row Total
1=Male        3 (= X11)    5 (= X12)    2 (= X13)    10 (= X1·)
2=Female      5 (= X21)    3 (= X22)    2 (= X23)    10 (= X2·)
Column Total  8 (= X·1)    8 (= X·2)    4 (= X·3)    20 (= X··)
We now see that the summations are exchangeable. That is,
$\sum_{i=1}^{I} \sum_{j=1}^{J} X_{ij} = \sum_{i=1}^{I} \left(\sum_{j=1}^{J} X_{ij}\right) = \sum_{j=1}^{J} \left(\sum_{i=1}^{I} X_{ij}\right) = \sum_{j=1}^{J} \sum_{i=1}^{I} X_{ij}$.
Chapter 2: Measures of Central Tendency
One characteristic of a distribution that we may wish to summarize is its location on
the underlying continuum. For example, in the following plot, the blue and red
distributions are identical, but the red one is shifted to the right on the horizontal
axis (AKA X-axis).
We refer to such a difference in the positions of distributions as a location shift.
We depict such differences in location using statistics that are called
measures of central tendency (AKA measures of location).
Chapter 2: Measures of Central Tendency
There are three primary measures of central tendency:
1. Mode (Mo): The most frequently occurring data value.
2. Median (Med): When the data are rank ordered, the middle value (or the average
of the two middle values when there is an even number of observations). The median,
therefore, represents the 50th percentile of the data values.
3. Mean (also $\bar{X}$ or “X-bar”): The arithmetic average. Obtained by adding all
data values and dividing by the number of observations:
$\bar{X} = \frac{\sum_{i=1}^{N} X_i}{N}$
Q. There are 7 observations and they are [3 5 7 5 6 8 9].
What are the mean, median, and the mode of this
distribution?
A. Mean = 6.14, Median = 6, Mode = 5.
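The answer can be reproduced with Python's standard `statistics` module; this is a quick check, not part of the SPSS workflow.

```python
import statistics

data = [3, 5, 7, 5, 6, 8, 9]  # the seven observations from the question

mean = statistics.mean(data)      # sum of the values divided by N
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most frequently occurring value

print(round(mean, 2), median, mode)  # 6.14 6 5
```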
Chapter 2: Measures of Central Tendency
Each of the three measures of central tendency is more appropriate for some types
of data than for others.
1. Mode: Nominal, ordinal, interval, ratio -- i.e., can be used for all types of
variables.
2. Median: Ordinal, interval, ratio -- also, frequently used when there are
outliers in the data. No good for nominal variables, which are not ordered
at all.
3. Mean: Interval, ratio -- this is used when the numbers themselves have
meaning beyond just ordering the data.
Chapter 2: Questions on Measures of Central Tendency
Why does SPSS report the following?
The mean of the 100 GRE scores is 497.02.
The median of the 100 GRE scores is 489.50.
The mode of the 100 GRE scores is 492.
• Can you explain how the mean, 497.02, was obtained?
• Can you explain how the median, 489.50, was obtained using the frequency
table shown before? Since N = 100 (an even number), the median location is
(100+1)/2 = 50.5, i.e., halfway between the 50th and 51st observations:
(489+490)/2 = 489.50.
• Can you explain how the mode, 492, was obtained using the frequency table
shown before?
Chapter 2: Measures of Central Tendency
Mean, Med, & Mo
The mean, median, and mode are equal ONLY when the distribution
is symmetrical and unimodal.
When the distribution is skewed and unimodal, the mode will be
at the hump in the distribution. The mean will be pulled out
toward the tail of the skew. The median will be between the other two values.
[Figure: skewed unimodal distribution with Mo, Med, and Mean marked]
Chapter 2: Measures of Variability
Another characteristic of a distribution that we may wish to summarize is its
dispersion or spread on the underlying continuum. For example, in this plot, the
blue and red distributions have the same measure of central tendency, but the red
one is more widely dispersed (wider and flatter) along the X-axis.
We refer to such a difference in the spread of distributions as a difference
in dispersion or variability, and we depict such differences in spread using
statistics that are called measures of variability (AKA measures of
dispersion). A distribution with a small measure of variability has more
homogeneous members than one with greater variability (which has more
heterogeneous members).
Chapter 2: Measures of Variability
There are five primary measures of variability:
1. Range: The difference between the two most extreme data points (maximum –
minimum).
2. Interquartile Range (IQR): The difference between the 25th (Q1) and 75th (Q3)
percentiles.
3. Variance ($s_X^2$ or “s-squared sub X”): The average squared deviation of
scores from the mean (with N − 1 rather than N in the denominator):
$s_X^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N - 1}$
Chapter 2: Measures of Variability
4. Standard Deviation ($s_X$ or “s-sub X”): The square root of the variance.
Roughly, it is the typical distance of scores from the mean (it is not the same
as the average absolute deviation):
$s_X = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N - 1}}$
5. Coefficient of Variation (CV): An index that rescales the standard deviations
from two groups that are measured on the same scale but have very different
means (useful for comparing group variability). Thus, the CV measures the
variability relative to the magnitude of the mean.
$CV = \frac{s_X}{\bar{X}}$
[Figure: two distributions, 1 and 2, with standard deviations $s_1$ and $s_2$ and means $\bar{X}_1$ and $\bar{X}_2$. In the figure, CV1 > CV2. A large CV indicates a potential “floor effect”.]
Chapter 2: Measures of Variability
Like the measures of central tendency, measures of variability are influenced by
certain characteristics of the data:
• Range: sensitive to outliers
• IQR: insensitive to outer 50% of the data
• $s_X^2$ & $s_X$: very sensitive to outliers
Also, the measures of variability are more appropriate for some types of data than
others (none are suitable for nominal data).
• IQR: Ordinal, interval, ratio
• $s_X^2$, $s_X$, CV, & Range: Interval, ratio
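Several of the measures of variability can be computed with Python's standard `statistics` module. The sketch below reuses the seven observations from the central tendency question; the IQR is left out because its value depends on which quartile method is used.

```python
import statistics

data = [3, 5, 7, 5, 6, 8, 9]  # the seven observations used earlier

data_range = max(data) - min(data)
variance = statistics.variance(data)   # sample variance, divides by N - 1
std_dev = statistics.stdev(data)       # square root of the sample variance
cv = std_dev / statistics.mean(data)   # variability relative to the mean

print(data_range, round(variance, 3), round(std_dev, 3), round(cv, 3))
```

Note that `statistics.variance` and `statistics.stdev` use N − 1 in the denominator, matching the formulas above; `statistics.pvariance` would divide by N instead.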
Chapter 2: SPSS descriptive statistics for continuous
variables
The descriptives procedure:
The Descriptives procedure in SPSS produces simple descriptive statistics for
specific variables. In SPSS, choose Analyze -> Descriptive Statistics -> Descriptives.
In the pop-up dialog box, choose GRE in the left box, and then move it to
the right box by clicking the arrow (►) button.
This procedure generates a table of descriptive statistics for the variable GRE.
Chapter 2: Descriptive Statistics Procedure
Descriptive Statistics
                      N      Minimum   Maximum   Mean      Std. Deviation
GRE                   100    252       724       497.02    95.325
Valid N (listwise)    100
Chapter 2: Transformations and Statistics
Frequently, we wish to transform data from their original scale to one that has more
meaning to us. For example, we might want to transform test scores to an IQ
scale (with a mean of 100 and standard deviation of 15) or an SAT/GRE scale
(with a mean of 500 and standard deviation of 100). Similarly, we might wish to
transform temperature from Celsius to Fahrenheit (with a freezing point of 32
rather than 0).
These are all examples of linear transformations in which a new mean and
standard deviation are applied to a scale. We can use linear transformations to
transform a variable to have any desired mean and standard deviation.
Linear transformation: X′ = bX + a, where b is the scaling factor and a is the
location factor.
There are a few general rules that allow us to make such transformations without
losing the meaning of the variable in question.
Chapter 2: Transformations and Statistics
For example:
• Adding a constant to all values in a dataset ($X_i' = X_i + a$ for all i) increases the
mean of the distribution by the value of the constant and leaves the variance
and standard deviation unchanged.
$X_i' = X_i + a$ for all i:  $\bar{X}' = \bar{X} + a$,  $S_{X'} = S_X$,  $S_{X'}^2 = S_X^2$
• Multiplying all values in a dataset ($X_i' = bX_i$ for all i) multiplies the mean and
the standard deviation of the distribution by the value of the constant, and the
variance is multiplied by the square of the constant. (This assumes b > 0; in
general the standard deviation is multiplied by |b|.)
$X_i' = bX_i$ for all i:  $\bar{X}' = b\bar{X}$,  $S_{X'} = bS_X$,  $S_{X'}^2 = b^2 S_X^2$
• Linear transformation is a combination of both addition and multiplication. That
is, $X_i' = bX_i + a$ for all i.
$X_i' = bX_i + a$ for all i:  $\bar{X}' = b\bar{X} + a$,  $S_{X'} = bS_X$,  $S_{X'}^2 = b^2 S_X^2$
Chapter 2: Transformations and Statistics
A common linear transformation is standardization, in which scores are scaled to
have a mean of 0 and standard deviation of 1. A variable (or score) that has
a mean of 0 and standard deviation of 1 is frequently referred to as a
standardized variable (or score) and is denoted by the symbol z.
What would be the values of a (intercept) and b (slope) in the general formula for
the linear transformation, i.e., $X' = bX + a$?
If we want to center X′ on 0, then we can subtract the mean of X from all of the
observed values of X (why would this work?). Similarly, if we want X′ to have a
standard deviation of 1, we can divide all of the centered values by the standard
deviation of X (why would this work?). Combining the two steps gives
$X' = (X - \bar{X}) / S_X$, so b would equal $1 / S_X$ and a would equal $-\bar{X} / S_X$.
Chapter 2: Transformations and Statistics
Hence, to get X′ scores with a new mean equal to 0 and standard deviation of 1, we
use the following version of the linear equation:
$X' = \frac{X - \bar{X}}{S_X}$
Thus, the standardized score Z for variable X ($Z_X$) can be obtained by the formula:
$Z_X = \frac{X - \bar{X}}{S_X}$, or for the ith observation, $Z_{X_i} = \frac{X_i - \bar{X}}{S_X}$ for i = 1,…,N.
More generally, to get X′ scores with a new mean $\bar{X}'$ and standard deviation
$S_{X'}$, we use the following transformation formula:
$X' = S_{X'} \left(\frac{X - \bar{X}}{S_X}\right) + \bar{X}'$  or  $X' = S_{X'} Z_X + \bar{X}'$,  where $Z_X = \frac{X - \bar{X}}{S_X}$.
Or, for each observation i,
$X_i' = S_{X'} \left(\frac{X_i - \bar{X}}{S_X}\right) + \bar{X}'$  or  $X_i' = S_{X'} Z_{X_i} + \bar{X}'$,  where $Z_{X_i} = \frac{X_i - \bar{X}}{S_X}$ for i = 1,…,N.
Chapter 2: Transformations and Statistics
An example:
Our GRE variable has a mean of 497.02 and $s_X$ = 95.32.
Suppose we want a new mean = 100 and new SD = 15.
We compute the following transformed score for the original score of 500:
$X' = s_{X'} \left(\frac{X - \bar{X}}{s_X}\right) + \bar{X}' = 15 \left(\frac{500 - 497.02}{95.32}\right) + 100 = 100.47$
So an original score X = 500 would be X′ = 100.47.
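The same calculation can be written as a small Python function. The name `rescale` and its parameter names are made up for this sketch; the numbers are the ones from the example.

```python
def rescale(x, old_mean, old_sd, new_mean, new_sd):
    """Move a score to a scale with the requested mean and standard deviation."""
    z = (x - old_mean) / old_sd   # standardize first
    return new_sd * z + new_mean  # then stretch and shift

# GRE score of 500 moved to a scale with mean 100 and SD 15.
print(round(rescale(500, 497.02, 95.32, 100, 15), 2))  # 100.47
```

Going through the z score makes the two steps explicit: standardize with the old mean and SD, then apply the new SD and mean.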
Chapter 2: Transformation in SPSS
You can transform variables in SPSS by using the COMPUTE command, which
creates a new variable and computes its value for each case by following the
formula you provide. There are a variety of SPSS functions that may be useful
when doing transformations. Below is an example showing how to create three
transformations of the GRE variable (Chap2Ex.sav): adding a constant,
multiplying by a constant, and transforming to a scale with a mean of 100 and
standard deviation of 15.
Transform -> Compute. In the pop-up window, type SCORE_PLUS in the
Target Variable box and GRE + 500 in the Numeric Expression box, then click OK.
For the other variables we do the same thing. (Here I used the syntax window by
clicking Paste.)
COMPUTE SCORE_PLUS = GRE + 500 .
COMPUTE SCORE_TIMES = GRE * 100 .
COMPUTE SCORE_SCALED =(15* (( GRE - 497.02) / 95.32)) + 100 .
EXECUTE .
Now we can compute the descriptive statistics.
DESCRIPTIVES
VARIABLES=GRE SCORE_PLUS SCORE_TIMES SCORE_SCALED
/STATISTICS=MEAN STDDEV MIN MAX .
Chapter 2: Transformation in SPSS
The output looks like this.
Descriptive Statistics
                      N      Minimum     Maximum     Mean        Std. Deviation
GRE                   100    252         724         497.02      95.325
SCORE_PLUS            100    752.00      1224.00     997.0200    95.32464
SCORE_TIMES           100    25200.00    72400.00    49702.00    9532.46425
SCORE_SCALED          100    61.44       135.72      100.0000    15.00073
Valid N (listwise)    100
Q. Can you state the general rule of linear transformation?
If we perform the linear transformation X′ = bX + a on a variable X,
then the mean, standard deviation, and variance of the new
variable X′ are:
$\bar{X}' = b\bar{X} + a$
$S_{X'} = bS_X$
$S_{X'}^2 = b^2 S_X^2$
Chapter 2: Transformation in SPSS
Ex. X has a mean of 50 and an SD of 10. We create a new variable Y by
multiplying X by the scaling factor 4 and adding the location factor 20
(i.e., b = 4, a = 20 in Y = bX + a). What would be the mean
and the standard deviation of the new variable Y?
We can check this empirically by generating 1000 X’s from a normal distribution
with a mean of 50 and an SD of 10.
The output looks like this.
Descriptive Statistics
                      N       Minimum   Maximum   Mean        Std. Deviation
X                     1000    23.10     81.08     50.0951     10.01557
Y                     1000    112.41    344.31    220.3805    40.06227
Valid N (listwise)    1000
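A similar empirical check can be run in Python instead of SPSS. This is a sketch under stated assumptions: the seed is arbitrary, so the simulated values will not match the SPSS output above, but the relationship between the statistics of X and Y holds for any sample.

```python
import random
import statistics

random.seed(1)  # arbitrary seed so the run is repeatable
x = [random.gauss(50, 10) for _ in range(1000)]  # 1000 draws, mean 50, SD 10
y = [4 * xi + 20 for xi in x]                    # Y = bX + a with b = 4, a = 20

mean_x, sd_x = statistics.mean(x), statistics.stdev(x)
mean_y, sd_y = statistics.mean(y), statistics.stdev(y)

print(round(mean_x, 2), round(sd_x, 2))
print(round(mean_y, 2), round(sd_y, 2))
print(round(4 * mean_x + 20, 2), round(4 * sd_x, 2))  # matches the line above
```

By the rule, the expected answers are a mean of 4(50) + 20 = 220 and an SD of 4(10) = 40; the sample values land close to these and satisfy the rule exactly relative to the sample mean and SD of X.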
Chapter 2: Stem-and-Leaf Displays
Another graphical method for summarizing data is the stem-and-leaf display,
which gives you a visual display of the shape of the distribution while preserving
the actual values for every data point in the data set.
The stem of a stem-and-leaf display contains the leading digits (or most
significant digits) of the data points (e.g., the 3s in 31, 34, 37, and 39).
The leaves (or trailing digits or less significant digits) of the display contain the
remaining portions of the data points, allowing you to identify individual data points
(e.g., the 1, 4, 7, and 9 of 31, 34, 37, and 39).
The display below summarizes these data points:
11, 15, 18, 21, 22, 22, 23, 25, 28, 30, 31, 32, 33, 33, 33, 34, 45, 51.
1|158
2|122358
3|0123334
4|5
5|1
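The display above can be generated with a short Python function. This is a minimal sketch that assumes non-negative two-digit integers, using the tens digit as the stem and the units digit as the leaf; the function name `stem_and_leaf` is invented for the example.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return display lines with the tens digit as stem and units digit as leaf."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    lines = []
    # Include every stem in the range, even one with no leaves.
    for stem in range(min(leaves), max(leaves) + 1):
        digits = "".join(str(d) for d in leaves.get(stem, []))
        lines.append(f"{stem}|{digits}")
    return lines

data = [11, 15, 18, 21, 22, 22, 23, 25, 28, 30, 31, 32, 33, 33, 33, 34, 45, 51]
lines = stem_and_leaf(data)
print("\n".join(lines))
```

Because every leaf is an actual digit from the data, the original values can be read back from the display, which a histogram does not allow.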
Chapter 2: Boxplots
Another graphical method for summarizing data is the boxplot (AKA box-and-whisker
plot), which gives you a summary of the data.
The following quantities are contained in a boxplot:
• Median: The middlemost data value (50th percentile, i.e., Q2) when the
data are ordered.
• Median Location: ML = (N + 1) / 2, where N is the number of scores. ML
tells us where, in the rank ordered data, the median lies.
• Hinge: The median values of the upper and lower halves of the data when
the data are rank ordered. The Upper Hinge (UH) represents the data
value of the 75th percentile, and the Lower Hinge (LH) represents the data
value of the 25th percentile. Thus, UH = Q3 and LH = Q1.
• Hinge Location: HL = (ML + 1) / 2. HL tells us where, in the rank ordered
data, the hinges lie.
• H-Spread: HS = UH – LH, a value comparable to the IQR.
• Inner Fence: UIF = Upper Hinge + (1.5 x HS) and LIF = Lower Hinge –
(1.5 x HS).
• Adjacent Values: The most extreme data values that are no more extreme than
the inner fences. LAV is the smallest data value at or above the LIF, and UAV is
the largest data value at or below the UIF.
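The boxplot quantities can be computed step by step in Python. This sketch makes two assumptions that the list does not spell out: any fraction in ML is dropped before computing HL, and the adjacent values are taken as the most extreme observed data values inside the inner fences. The function name `boxplot_summary` is hypothetical, and the data are the 18 points from the stem-and-leaf example.

```python
def boxplot_summary(values):
    """Compute the boxplot quantities listed above for a list of numbers."""
    data = sorted(values)
    n = len(data)

    def value_at(location, ordered):
        # 1-based location; a location ending in .5 averages its two neighbours.
        lower = int(location)
        if location == lower:
            return ordered[lower - 1]
        return (ordered[lower - 1] + ordered[lower]) / 2

    median_location = (n + 1) / 2
    # Assumption: drop any fraction from ML before computing HL.
    hinge_location = (int(median_location) + 1) / 2
    lower_hinge = value_at(hinge_location, data)
    upper_hinge = value_at(hinge_location, data[::-1])  # count down from the top
    h_spread = upper_hinge - lower_hinge
    lower_fence = lower_hinge - 1.5 * h_spread
    upper_fence = upper_hinge + 1.5 * h_spread
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    return {
        "median": value_at(median_location, data),
        "hinges": (lower_hinge, upper_hinge),
        "h_spread": h_spread,
        "fences": (lower_fence, upper_fence),
        "adjacent": (inside[0], inside[-1]),
        "outliers": [x for x in data if x < lower_fence or x > upper_fence],
    }

data = [11, 15, 18, 21, 22, 22, 23, 25, 28, 30, 31, 32, 33, 33, 33, 34, 45, 51]
summary = boxplot_summary(data)
for name, value in summary.items():
    print(name, value)
```

For these data the value 51 lies beyond the upper inner fence, so it would be drawn as an outlier and the upper whisker would stop at 45.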
Chapter 2: Boxplots
More simply, a boxplot represents the median and IQR of a data set. The median is
represented by the line in the middle of the box. The upper and lower quartiles are
represented by the outer edges of the box. The maximum and minimum reasonable
values (approximately the lower and upper 2.5% of the data) are represented by
the ends of the lines on each side of the box. Asterisks are used to represent data
points that lie outside of these “reasonable” limits.
[Figure: boxplot with LAV, LH, ML, UH, and UAV marked, the H-spread (H-S) spanning the box, and outliers shown as asterisks]
Chapter 2: Explore procedure
The explore procedure:
Explore is an SPSS procedure that requests comprehensive descriptive statistics for
a particular variable (including stem-and-leaf and box plots). The following steps
let you study the variable in detail (use Chap2Ex.sav).
Analyze -> Descriptive Statistics -> Explore; then, in the pop-up window, move the GRE
variable from the left box to the Dependent List box. Then click OK.
In syntax, the above steps correspond to the following.
EXAMINE
VARIABLES=GRE
/PLOT BOXPLOT STEMLEAF
/COMPARE GROUP
/STATISTICS DESCRIPTIVES
/CINTERVAL 95
/MISSING LISTWISE
/NOTOTAL.
Chapter 2: Explore procedure
Descriptives: GRE
                                                  Statistic    Std. Error
Mean                                              497.02       9.532
95% Confidence Interval for Mean   Lower Bound    478.11
                                   Upper Bound    515.93
5% Trimmed Mean                                   496.66
Median                                            489.50
Variance                                          9086.787
Std. Deviation                                    95.325
Minimum                                           252
Maximum                                           724
Range                                             472
Interquartile Range                               154
Skewness                                          .113         .241
Kurtosis                                          -.387        .478
Chapter 2: Explore procedure
GRE Stem-and-Leaf Plot
Frequency    Stem &  Leaf
  2.00         2 .   56
  1.00         3 .   4
 10.00         3 .   5666779999
 22.00         4 .   0000001111122223333344
 22.00         4 .   5555566677788889999999
 13.00         5 .   0011333334444
 14.00         5 .   55667777888999
 10.00         6 .   0012233334
  5.00         6 .   56689
  1.00         7 .   2
Stem width: 100
Each leaf: 1 case(s)
Chapter 2: Explore procedure
[Figure: Box and Whisker Plot of GRE; vertical axis from 200 to 800]
Chapter 2: Explore procedure
Earlier, we had the histogram for the GRE variable.
[Figure: Histogram for GRE scores. x-axis: GRE; y-axis: Frequency. Mean = 497.02, Std. Dev. = 95.325, N = 100]
Chapter 2: Parameters and Statistics
Recall that we use descriptive statistics as estimates of population parameters. The
table below shows the correspondence between several of the statistics and
parameters we will discuss this semester.
Note. Parameter: a fixed number that represents a certain characteristic of the
population in which we are interested. It is usually unknown. Statistic: a value
that we can compute from our data (i.e., sample) at hand. We use the sample
statistic to estimate the corresponding population parameter.
Statistic (Roman letters)    Parameter (Greek letters)
$\bar{X}$                    $\mu_X$
$s_X^2$                      $\sigma_X^2$
$s_X$                        $\sigma_X$
$r_{xy}$                     $\rho_{xy}$
$b_{x|y}$                    $\beta_{x|y}$
Statistic --- estimates ---> Parameter
Chapter 2: Parameters and Statistics
There are four properties that are useful when we use statistics, particularly the
mean and variance, as estimators of parameters:
1. Sufficiency: The statistic uses all of the information in the sample.
2. Unbiasedness: The expected value of the statistic (i.e., its average over a
large number of samples) equals the parameter. Note that N − 1 is used in the
denominator of the sample variance to make it an unbiased estimate of
the population variance. For example, we use
$S_X^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N - 1}$  to unbiasedly estimate  $\sigma_X^2 = \frac{\sum_{i=1}^{N_{pop.}} (X_i - \mu_X)^2}{N_{pop.}}$
where $N_{pop.}$ is the population size (i.e., number of cases in the entire
population).
3. Efficiency: The variability of the statistic over a large number of samples is
smaller than the variability of other, similar, descriptive statistics.
4. Resistance: Not heavily influenced by outliers.
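Unbiasedness of the N − 1 denominator can be illustrated by brute force on a tiny population. The sketch below is an assumption-laden toy example: the population of three values is hypothetical, and samples of size 2 are drawn with replacement so that every possible sample can be listed. Exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3]  # hypothetical tiny population
n_pop = len(population)
mu = Fraction(sum(population), n_pop)
pop_variance = sum((x - mu) ** 2 for x in population) / n_pop

def sample_variance(sample, divisor):
    """Sum of squared deviations from the sample mean, divided by divisor."""
    mean = Fraction(sum(sample), len(sample))
    return sum((x - mean) ** 2 for x in sample) / divisor

n = 2
samples = list(product(population, repeat=n))  # all 9 samples, with replacement
avg_n_minus_1 = sum(sample_variance(s, n - 1) for s in samples) / len(samples)
avg_n = sum(sample_variance(s, n) for s in samples) / len(samples)

print(pop_variance, avg_n_minus_1, avg_n)  # 2/3 2/3 1/3
```

Averaged over all possible samples, dividing by N − 1 recovers the population variance exactly, while dividing by N underestimates it.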
Chapter 2: Discussion Questions
I have the IQ scores of 1000 students. I ran SPSS (Frequencies & Explore) and
obtained descriptive statistics, a histogram, a stem-and-leaf display, and a box plot.
The output appears on the following pages. Based on the output, comment on the
following characteristics of the IQ scores. Be sure to cite the indices you
considered concerning each characteristic.
* Shape
* Location
* Dispersion
* Skewness
* Kurtosis
* Outliers
* Percentiles (Especially, Quartiles)
What would happen if you assumed that these IQ scores were normally
distributed? In other words, suppose you know only the mean and the standard
deviation of the IQ scores. If you assume that the scores are normally
distributed with the given mean and SD, what kinds of errors might you make?
Output from Frequencies

Statistics
IQ
N                          Valid        1000
                           Missing         0
Mean                                  119.17
Median                                117.00
Mode                                      99
Std. Deviation                        22.460
Skewness                                .060
Std. Error of Skewness                  .077
Kurtosis                              -1.287
Std. Error of Kurtosis                  .155
Percentiles                25          99.00
                           50         117.00
                           75         139.00
Output from Frequencies
[Figure: histogram of IQ. x-axis: IQ (ticks 60 to 180); y-axis: Frequency (ticks 0 to 100). Mean = 119.17, Std. Dev. = 22.46, N = 1,000]
Output from Explore

Descriptives
IQ                                          Statistic   Std. Error
Mean                                           119.17         .710
95% Confidence        Lower Bound              117.77
Interval for Mean     Upper Bound              120.56
5% Trimmed Mean                                119.09
Median                                         117.00
Variance                                      504.452
Std. Deviation                                 22.460
Minimum                                            68
Maximum                                           171
Range                                             103
Interquartile Range                                40
Skewness                                         .060         .077
Kurtosis                                       -1.287         .155
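Several entries in the Descriptives output are tied together by simple formulas. The following is a rough sketch of that arithmetic, not a reproduction of how SPSS computes the table. The variable names are illustrative, and the multiplier 1.96 is an assumption (a large-sample normal approximation), so the interval bounds agree with the output only up to rounding.

```python
import math

# Values read from the SPSS output for IQ.
n = 1000
mean_iq = 119.17
sd = 22.460
minimum, maximum = 68, 171

# Standard error of the mean: SD divided by the square root of N.
std_error = sd / math.sqrt(n)

# Variance is the squared standard deviation.
variance = sd ** 2

# Approximate 95% confidence interval for the mean.
# Assumption: 1.96 (normal approximation) as the critical value.
z = 1.96
ci_lower = mean_iq - z * std_error
ci_upper = mean_iq + z * std_error

# Range is the maximum minus the minimum.
value_range = maximum - minimum

print(f"Std. error of mean: {std_error:.3f}")
print(f"Variance:           {variance:.3f}")
print(f"Approx. 95% CI:     ({ci_lower:.2f}, {ci_upper:.2f})")
print(f"Range:              {value_range}")
```

The computed standard error, variance, and range line up with the reported .710, 504.452, and 103. The interval is close to the reported (117.77, 120.56) but may differ in the last digit because the inputs are rounded.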
IQ Stem-and-Leaf Plot

 Frequency    Stem &  Leaf

     1.00        6 .  &
      .00        7 .
     8.00        7 .  689&
    22.00        8 .  0123344
    50.00        8 .  5556667778888999
    76.00        9 .  000000111111222222333344444
    95.00        9 .  5555566666777778888889999999999
    99.00       10 .  000000011111112222222223333344444
    77.00       10 .  5555556666666777777888899
    54.00       11 .  000001112222333444
    33.00       11 .  556666778899
    30.00       12 .  001223344
    51.00       12 .  556677788888899999
    63.00       13 .  0000111112222333334444
    93.00       13 .  555555566666777778888888999999
    86.00       14 .  0000011111112222223333344444
    75.00       14 .  55556666677777778888889999
    50.00       15 .  0000111112222334
    25.00       15 .  55566799&
    10.00       16 .  0124&
     1.00       16 .  &
     1.00       17 .  &

 Stem width:  10
 Each leaf:   3 case(s)

 & denotes fractional leaves.
[Figure: box-and-whisker plot of IQ; vertical axis runs from 60 to 180]
Knowing that the mean = 119.17 and SD = 22.5, suppose you assumed that IQ is
normally distributed. Using the well-known fact that in a normal
distribution 68% of the scores lie within the range of mean ± 1 SD, you
would predict that 68% of the scores lie within the range (96.67, 141.67),
computed from 119.17 ± 22.5. But actually only 58.5% of the students
are in that range. So we overestimated the share of people in the middle
range by 9.5 percentage points.
Another way of saying this is that we expect 16% of the scores to be
below 97 and another 16% of the scores to be above 141.
The actual observations are:
97 or below --- 20.4%
Above 141 --- 21.1%
Thus, if you blindly assume that IQ scores are normally distributed, you
underestimate the percentages of students in the high and low ranges
by about 4 to 5 percentage points and overestimate the middle range by
about 10 percentage points.
Whenever we make statements based on a distribution, we cannot simply
assume normality. We need to base our inferences on the
actual distribution.
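The comparison above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses NormalDist from Python's standard statistics module, the mean and SD quoted in the text (119.17 and 22.5), and the observed percentages quoted in the text. The variable names are hypothetical.

```python
from statistics import NormalDist

# Mean and SD as quoted in the text.
mean_iq, sd_iq = 119.17, 22.5
normal = NormalDist(mu=mean_iq, sigma=sd_iq)

# Range of mean +/- 1 SD.
lower, upper = mean_iq - sd_iq, mean_iq + sd_iq

# Percentages expected if IQ really were normally distributed.
expected_middle = 100 * (normal.cdf(upper) - normal.cdf(lower))
expected_low = 100 * normal.cdf(97)
expected_high = 100 * (1 - normal.cdf(141))

# Observed percentages, copied from the text.
observed_low, observed_middle, observed_high = 20.4, 58.5, 21.1

rows = [
    ("low range", expected_low, observed_low),
    ("middle range", expected_middle, observed_middle),
    ("high range", expected_high, observed_high),
]
for label, expected, observed in rows:
    print(f"{label:<13} expected {expected:5.1f}%  observed {observed:5.1f}%  "
          f"difference {observed - expected:+5.1f}")
```

A positive difference means the normal assumption underestimates that range; a negative difference means it overestimates it.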
I calculated the above percentages from the frequency table below.
IQ
                                           Valid    Cumulative
          IQ     Frequency    Percent    Percent       Percent
Valid     68             1         .1         .1            .1
          76             3         .3         .3            .4
          77             1         .1         .1            .5
          78             2         .2         .2            .7
          79             2         .2         .2            .9
          80             4         .4         .4           1.3
          81             2         .2         .2           1.5
          82             4         .4         .4           1.9
          83             6         .6         .6           2.5
          84             6         .6         .6           3.1
          85             9         .9         .9           4.0
          86             9         .9         .9           4.9
          87            10        1.0        1.0           5.9
          88            12        1.2        1.2           7.1
          89            10        1.0        1.0           8.1
          90            17        1.7        1.7           9.8
          91            17        1.7        1.7          11.5
          92            17        1.7        1.7          13.2
          93            11        1.1        1.1          14.3
          94            14        1.4        1.4          15.7
          95            16        1.6        1.6          17.3
          96            15        1.5        1.5          18.8
          97            16        1.6        1.6          20.4
          98            18        1.8        1.8          22.2
          99            30        3.0        3.0          25.2
         100            21        2.1        2.1          27.3
         101            21        2.1        2.1          29.4
         102            26        2.6        2.6          32.0
         103            16        1.6        1.6          33.6
         104            15        1.5        1.5          35.1
         105            18        1.8        1.8          36.9
         106            21        2.1        2.1          39.0
         107            19        1.9        1.9          40.9
         108            12        1.2        1.2          42.1
         109             7         .7         .7          42.8
         110            14        1.4        1.4          44.2
         111             8         .8         .8          45.0
         112            13        1.3        1.3          46.3
         113            10        1.0        1.0          47.3
         114             9         .9         .9          48.2
         115             5         .5         .5          48.7
         116            12        1.2        1.2          49.9
         117             5         .5         .5          50.4
         118             6         .6         .6          51.0
         119             5         .5         .5          51.5
         120             7         .7         .7          52.2
         121             3         .3         .3          52.5
         122             7         .7         .7          53.2
         123             6         .6         .6          53.8
         124             7         .7         .7          54.5
         125             5         .5         .5          55.0
         126             7         .7         .7          55.7
         127             8         .8         .8          56.5
         128            17        1.7        1.7          58.2
         129            14        1.4        1.4          59.6
         130            12        1.2        1.2          60.8
         131            14        1.4        1.4          62.2
         132            12        1.2        1.2          63.4
         133            14        1.4        1.4          64.8
         134            11        1.1        1.1          65.9
         135            22        2.2        2.2          68.1
         136            16        1.6        1.6          69.7
         137            16        1.6        1.6          71.3
         138            22        2.2        2.2          73.5
         139            17        1.7        1.7          75.2
         140            16        1.6        1.6          76.8
         141            21        2.1        2.1          78.9
         142            18        1.8        1.8          80.7
         143            15        1.5        1.5          82.2
         144            16        1.6        1.6          83.8
         145            11        1.1        1.1          84.9
         146            15        1.5        1.5          86.4
         147            20        2.0        2.0          88.4
         148            18        1.8        1.8          90.2
         149            11        1.1        1.1          91.3
         150            11        1.1        1.1          92.4
         151            16        1.6        1.6          94.0
         152            12        1.2        1.2          95.2
         153             7         .7         .7          95.9
         154             4         .4         .4          96.3
         155             9         .9         .9          97.2
         156             5         .5         .5          97.7
         157             4         .4         .4          98.1
         158             1         .1         .1          98.2
         159             6         .6         .6          98.8
         160             2         .2         .2          99.0
         161             3         .3         .3          99.3
         162             2         .2         .2          99.5
         163             1         .1         .1          99.6
         164             2         .2         .2          99.8
         169             1         .1         .1          99.9
         171             1         .1         .1         100.0
       Total          1000      100.0      100.0
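As a small sketch of how percentages like those quoted earlier can be read off the Cumulative Percent column: the cumulative percent at a score is the share of cases at or below that score, so the share above it is 100 minus that value. The dictionary below copies only the two rows needed from the table, and the variable names are illustrative.

```python
# Cumulative percents copied from the frequency table (IQ value -> cumulative %).
cum_pct = {97: 20.4, 141: 78.9}

# Share of cases with IQ of 97 or below.
at_or_below_97 = cum_pct[97]

# Share of cases with IQ above 141.
above_141 = 100.0 - cum_pct[141]

# Everything in between (IQ from 98 through 141).
middle = 100.0 - at_or_below_97 - above_141

print(f"97 or below: {at_or_below_97:.1f}%")
print(f"Above 141:   {above_141:.1f}%")
print(f"In between:  {middle:.1f}%")
```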