Transcript Lecture 1

Lecture 1
Review
• Measurement Typologies
• Univariate Distributions
• Location and Dispersion
Measurement Typology
Stevens (1946) identified four levels of
measurement. They are:
• Nominal or categorical
• Ordinal
• Interval
• Ratio
Levels of Measurement

Nominal or categorical: A series of unordered
categories.
E.g., (male, female) or (Manitoba, Saskatchewan,
Alberta).

Ordinal: A series of categories with an underlying
order. We do not, however, know the “distance”
between categories in terms of the characteristic.
E.g., (short, medium, tall) or (high school, college
diploma, B.A., graduate degree).
Levels of Measurement

Interval scales: The distance between possible values is
constant, allowing us to compare precisely the difference in
outcomes.

Ratio scales are interval scales with a true zero point: a value
of zero means that none of the phenomenon being measured is
present. For example, for a living person, an age value of
zero is not really possible, but it is possible to have
zero years of formal schooling.

Practically, these two levels are often combined as
“interval/ratio” measurement.
Continuous vs. Discrete Measures
A continuous variable can take on an infinite
number of possible values. For example, age
could theoretically be measured in infinitely
precise units. On the other hand, the number
of children one has is fundamentally discrete,
as are other count data.
Continuous vs. Discrete Measures
It is very important to consider carefully the underlying
process that generates the measurement, and not only
the variables as provided in the data set.
For example, many survey datasets include the number
of years of schooling as a measure of educational
attainment. This is a discrete measure (only whole
years), but it can often be treated as
conceptually continuous. However, we must
consider whether it makes sense to think of the
variable in this way.
Univariate Distributions
Frequency distribution: an ordered list of the
possible values of a variable and the number or
proportion of observations at each value.
General Happiness

                          Frequency   Percent   Valid Percent   Cumulative Percent
Valid     Very Happy          467       30.8        31.1              31.1
          Pretty Happy        872       57.5        58.0              89.0
          Not Too Happy       165       10.9        11.0             100.0
          Total              1504       99.1       100.0
Missing   NA                   13        0.9
Total                        1517      100.0
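As a rough illustration, the percentage columns of the General Happiness table can be recomputed from the raw counts. The sketch below is in Python; the variable names are invented here, and only the counts come from the table.

```python
# Counts taken from the General Happiness table; names are illustrative.
valid_counts = {"Very Happy": 467, "Pretty Happy": 872, "Not Too Happy": 165}
missing = 13

n_valid = sum(valid_counts.values())   # valid (non-missing) cases
n_total = n_valid + missing            # all cases

rows = []
cumulative = 0.0
for label, count in valid_counts.items():
    percent = 100 * count / n_total        # share of all cases
    valid_percent = 100 * count / n_valid  # share of non-missing cases
    cumulative += valid_percent            # running total of valid percent
    rows.append((label, count, round(percent, 1),
                 round(valid_percent, 1), round(cumulative, 1)))

for row in rows:
    print(row)
```

"Percent" divides by all 1517 cases, while "Valid Percent" divides by the 1504 non-missing cases, which is why the two columns differ slightly.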
Univariate Distributions
Histograms and bar charts graphically display
the absolute or relative frequencies.
[Figure: histogram of Age of Respondent. Mean = 45.6, Std. Dev = 17.81, N = 1514]
Univariate Distributions
Stem-and-Leaf Plots (Tukey, 1972, 1977)
display more information than histograms, and are
good for small data sets. They group together all
the data with the same leading digits on the
“stem”, and show the final digits as the “leaves”.
R's Occupational Prestige Score (1980) Stem-and-Leaf Plot

 Frequency    Stem &  Leaf
     7.00        1 .  79
    85.00        2 .  01222222223333444
   145.00        2 .  555677788888888888899999999999
   190.00        3 .  00000000001111111112222222223333344444
   174.00        3 .  5555556666666666666666778899999999
   189.00        4 .  00000000000111122222222222333344444444
   216.00        4 .  5555566666666666666777777777777888889999999
   162.00        5 .  000001111111111111111111112223444
    42.00        5 .  556778899
   103.00        6 .  000000111234444444444
    73.00        6 .  5555666666689&
    21.00        7 .  1344&
     7.00        7 .  5
     4.00 Extremes    (>=86)

 Stem width:  10
 Each leaf:   5 case(s)
 & denotes fractional leaves.
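To make the construction concrete, here is a minimal Python sketch of a stem-and-leaf display. The function name `stem_and_leaf` and the scores are invented for illustration. Unlike the output above, it prints one row per stem and one leaf per case.

```python
from collections import defaultdict

def stem_and_leaf(values, stem_width=10):
    """Return the lines of a simple stem-and-leaf display for non-negative integers."""
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, stem_width)  # leading digits, final digit
        groups[stem].append(leaf)
    lines = []
    # Include empty stems so gaps in the data stay visible.
    for stem in range(min(groups), max(groups) + 1):
        leaves = "".join(str(leaf) for leaf in groups.get(stem, []))
        lines.append(f"{stem:>3} . {leaves}")
    return lines

# Hypothetical scores, invented for illustration (not the survey data).
scores = [17, 19, 20, 21, 22, 25, 28, 31, 34, 36, 36, 42, 47, 51]
for line in stem_and_leaf(scores):
    print(line)
```

Because every case appears as a leaf, the individual values can be read back from the display, which a histogram does not allow.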
Central Tendency and Dispersion



Common measures of central tendency
(Location) are the mean, median, and the
mode.
The mode is the most common value.
The median is the value above which half of
the subjects fall (the 50th percentile).
Central Tendency

The arithmetic mean, or average, is the sum
of the values, divided by the number of
subjects:

\[ \bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n} \]
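The three measures of central tendency can be compared on a small example. This is a sketch in Python using the standard `statistics` module; the data are hypothetical.

```python
import statistics

# Hypothetical years of schooling for nine respondents (invented data).
schooling = [8, 10, 12, 12, 12, 13, 14, 16, 20]

mode = statistics.mode(schooling)        # most common value
median = statistics.median(schooling)    # middle value of the sorted data
mean = sum(schooling) / len(schooling)   # sum of the values divided by n

print(mode, median, mean)
```

Here the mean (13.0) is pulled above the median (12) by the single respondent with 20 years of schooling.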
Dispersion


Distributions can have similar central
tendencies, but be dramatically different in
their spread, or dispersion.
One measure of dispersion is the range, or
the difference between the largest and
smallest observations. The range is a useful
measure, but is very sensitive to extreme,
or outlying, values.
Dispersion



Another is the interquartile range, which measures
the distance between the upper and lower quartiles.
The quartiles are the values below which 25, 50, and
75% of the cases fall; the minimum and maximum
correspond to 0 and 100%.
Other quantiles besides quartiles can also be used.
For example, deciles can be used to compare the
mean income of the lowest decile (bottom 10%)
with that of the highest decile (top 10%).
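A short Python sketch shows how the range and the interquartile range react to one extreme value. The data are hypothetical. Statistical packages use several slightly different quartile conventions; the `"inclusive"` method of `statistics.quantiles` is just one of them.

```python
import statistics

# Hypothetical data with one extreme value (50), invented for illustration.
data = [2, 4, 4, 5, 7, 8, 9, 11, 50]

data_range = max(data) - min(data)   # largest minus smallest observation

# Cut points at 25%, 50%, and 75% of the cases.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                        # distance between upper and lower quartiles

print(data_range, (q1, q2, q3), iqr)
```

Dropping the value 50 would shrink the range from 48 to 9, while the quartiles, and so the interquartile range, would change far less.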
Dispersion


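Box-and-whisker plots
show the dispersion
through the use of
quartiles.
The box runs from the
25th to the 75th
percentile, so it contains
the middle 50% of cases.
The line inside the box
indicates the median, and
the whiskers extend from
the box to the most extreme
values that are not flagged
as outliers.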
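[Figure: box-and-whisker plot, vertical axis 0 to 100, with outlying cases labelled 199, 424, 682, and 454; N = 1418; axis label truncated as "R's Occupational Pre"]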
Histogram and Boxplot compared
[Figure: histogram of R's Occupational Prestige Score (1980), Mean = 42.9, Std. Dev = 13.07, N = 1418, shown beside a box-and-whisker plot of the same variable, N = 1418]
Sums of Squares, Variance, and
Standard Deviation

The variance is the average of the squared
deviations of the elements in the sample or
the population around their mean. The raw
deviations always sum to zero, so we square
them first. The resulting sum of squares (SS)
is the basis for calculating the variance.
\[ SS = \sum (Y_i - \bar{Y})^2 \]
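A small numeric check shows why the deviations are squared. This is a Python sketch with invented values.

```python
# Hypothetical values, invented for illustration.
y = [2, 4, 6, 8, 10]
y_bar = sum(y) / len(y)                 # mean of the values

deviations = [yi - y_bar for yi in y]   # -4, -2, 0, 2, 4
sum_of_deviations = sum(deviations)     # positive and negative deviations cancel
ss = sum(d ** 2 for d in deviations)    # sum of squares

print(sum_of_deviations, ss)
```

The deviations cancel to zero, while their squares add up to a positive sum of squares.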
Variance

The formula for the variance in a population is:

\[ \sigma^2 = \frac{\sum (Y_i - \mu)^2}{N} \]

The sample variance is:

\[ s_y^2 = \frac{\sum (y_i - \bar{y})^2}{n - 1} \]

Where n is the sample size. (n-1) in the denominator is
used when we are using the sample variance to
estimate the population variance.
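The two denominators can be compared directly. This is a minimal Python sketch with invented data; the names are illustrative. The standard library functions `statistics.pvariance` and `statistics.variance` compute the same two quantities.

```python
import statistics

# Hypothetical values, invented for illustration.
y = [2, 4, 6, 8, 10]
n = len(y)
y_bar = sum(y) / n
ss = sum((yi - y_bar) ** 2 for yi in y)   # sum of squares around the mean

population_variance = ss / n        # divide by N: y is treated as the whole population
sample_variance = ss / (n - 1)      # divide by n - 1: y is a sample used for estimation

print(population_variance, statistics.pvariance(y))
print(sample_variance, statistics.variance(y))
```

The sample version is always somewhat larger, and the gap shrinks as n grows.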
Standard Deviation

The standard deviation is the square root
of the variance. It is used particularly
because the variance is expressed in squared
units, whereas the standard deviation is in
the same units as the variable itself.
Population standard deviation:

\[ \sigma = \sqrt{\frac{\sum (Y_i - \mu)^2}{N}} \]

Where N is the population size, Y_i is the value of Y for the ith
unit in the population, and μ is the population mean.
Standard Deviation
Sample standard deviation:

\[ s = \sqrt{\frac{\sum (y_i - \bar{y})^2}{n - 1}} \]

Where n is the sample size, y_i is the value of y for the ith unit
in the sample, and \bar{y} is the sample mean.
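The sketch below shows how the variance and the standard deviation each respond to a change of units. It is written in Python, and the ages are hypothetical.

```python
import math
import statistics

# Hypothetical ages in years, invented for illustration.
ages_years = [20, 30, 40, 50, 60]
ages_months = [12 * a for a in ages_years]

var_years = statistics.variance(ages_years)    # sample variance, in years squared
sd_years = math.sqrt(var_years)                # sample standard deviation, in years

var_months = statistics.variance(ages_months)  # in months squared
sd_months = math.sqrt(var_months)              # in months

# Converting years to months multiplies the variance by 12 squared,
# but the standard deviation only by 12.
print(var_months / var_years, sd_months / sd_years)
```

The standard deviation rescales in step with the data, which makes it easier to interpret alongside the mean.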
The Shape of a Distribution
It is common to describe
distributions in terms of
their shape when
graphed. For example,
some distributions can be
described as “bathtub”,
“inverted bathtub” or U-shaped, or as “bell-shaped”.
[Figures: mortality rate plotted against time; frequency plotted against age]
Modality
Distributions may have
only one mode, or have
several distinct modes.
For a bimodal distribution,
a single mode will provide
a poor description of the
central tendency.
Bimodal Distribution
Skewness
Asymmetrical curves may also
be described with regard to
their skewness. If the mean is
higher than the median, or the
right tail is considerably longer,
we say that the distribution is
skewed to the right. If the mean
is less than the median, or the
left tail is longer, it is skewed to
the left.
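[Figure: right-skewed curve shown against a symmetrical normal curve, with the median to the left of the mean]
[Figure: left-skewed curve shown against a symmetrical normal curve, with the mean to the left of the median]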
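Comparing the mean with the median gives a quick, informal check on the direction of skew. This is a Python sketch with invented incomes, not a formal skewness statistic.

```python
import statistics

# Hypothetical incomes in thousands of dollars, invented for illustration.
# One very large value stretches the right tail.
incomes = [20, 25, 30, 30, 35, 40, 45, 50, 265]

mean = statistics.mean(incomes)
median = statistics.median(incomes)

# Mean above the median suggests right skew; mean below suggests left skew.
if mean > median:
    direction = "right"
elif mean < median:
    direction = "left"
else:
    direction = "none apparent"

print(mean, median, direction)
```

The single large income pulls the mean (60) well above the median (35), which matches the description of a right-skewed distribution.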
Kurtosis
A kurtotic distribution is
one that is noticeably
more peaked than a
normal distribution.
Kurtotic Distribution
Outliers



Outliers are cases which score much higher or
lower than the bulk of the other cases in the sample
or the population.
They can be due to problems with the data, such as
the inclusion of a case which should have been
excluded from the sample (frame problems), or data-entry errors.
They may also be legitimate cases, which simply
have unusually high or low values on some variable.
For this reason, outliers should never simply be
discarded without investigation.
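One way to flag candidate outliers for investigation is sketched below in Python. The lecture does not name a cutoff. The 1.5 interquartile range rule used here is a common convention and is adopted as an assumption, and the function name `flag_outliers` and the ages are invented.

```python
import statistics

def flag_outliers(values, k=1.5):
    """Return values lying more than k interquartile ranges beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr   # fences (assumed 1.5 * IQR rule)
    return [v for v in values if v < low or v > high]

# Hypothetical ages; 420 looks like a data-entry error for 42.
ages = [23, 31, 35, 38, 42, 44, 47, 52, 420]
print(flag_outliers(ages))
```

The function only reports suspicious values. Whether a flagged case is an error or a legitimate extreme still has to be checked against the source records.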
Questions?




What is kurtosis?
Define an “outlier”.
What information do box-and-whisker plots
include?
What is the “sum of squares”?
Next Class:



Probability Distributions
Standard Normal probability distribution
Sampling distributions and estimation