Quantitative Data Analysis
[Slide image: Edouard Manet, In the Conservatory, 1879]
Quantification of Data
1. Introduction
• To conduct quantitative analysis, responses to
open-ended questions in survey research and
the raw data collected using qualitative
methods must be coded numerically.
Quantification of Data
1. Introduction (Continued)
• Most responses to survey research questions
already are recorded in numerical format.
• In mailed and face-to-face surveys,
responses are keypunched into a data file.
• In telephone and internet surveys,
responses are automatically recorded in
numerical format.
Quantification of Data
2. Developing Code Categories
• Coding qualitative data can use an existing
scheme or one developed by examining the
data.
• Coding qualitative data into numerical
categories sometimes can be a straightforward
process.
• Coding occupation, for example, can rely
upon numerical categories defined by the
Bureau of the Census.
Quantification of Data
2. Developing Code Categories (Continued)
• Coding most forms of qualitative data, however,
requires much effort.
• This coding typically requires using an iterative
procedure of trial and error.
• Consider, for example, coding responses to the
question, “What is the biggest problem in
attending college today?”
• The researcher must develop a set of codes
that are:
• exhaustive of the full range of responses.
• mutually exclusive of one another (for the most part).
Quantification of Data
2. Developing Code Categories (Continued)
• In coding responses to the question, “What is
the biggest problem in attending college today,”
the researcher might begin, for example, with a
list of 5 categories, then realize that 8 would be
better, then realize that it would be better to
combine categories 1 and 5 into a single
category and use a total of 7 categories.
• Each time the researcher makes a change in
the coding scheme, it is necessary to restart the
coding process to code all responses using the
same scheme.
Quantification of Data
2. Developing Code Categories (Continued)
• Suppose one wanted to code more complex
qualitative data (e.g., videotape of an
interaction between husband and wife) into
numerical categories.
• How does one code the many statements,
facial expressions, and body language inherent
in such an interaction?
• One can realize from this example that coding
schemes can become highly complex.
Quantification of Data
2. Developing Code Categories (Continued)
• Complex coding schemes can take many
attempts to develop.
• Once developed, they undergo continuing
evaluation.
• Major revisions, however, are unlikely.
• Rather, new coders are required to learn the
existing coding scheme and undergo continuing
evaluation for their ability to correctly apply the
scheme.
Quantification of Data
3. Codebook Construction
• The end product of developing a coding
scheme is the codebook.
• This document describes in detail the
procedures for transforming qualitative data into
numerical responses.
• The codebook should include notes that
describe the process used to create codes,
detailed descriptions of codes, and guidelines
to use when uncertainty exists about how to
code responses.
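• For researchers who keep their codebook in software as well as on paper, the sketch below shows one way to represent a codebook entry in Python. The variable name, codes, and labels are hypothetical illustrations, not a standard scheme.

    # A hypothetical codebook entry for one open-ended survey question.
    codebook = {
        "biggest_problem": {
            "question": "What is the biggest problem in attending college today?",
            "codes": {
                1: "Cost of tuition and fees",
                2: "Balancing work and study",
                3: "Academic pressure",
                7: "Other (describe in notes)",
            },
            "notes": "Guidelines for coding ambiguous responses go here.",
        }
    }

    # Look up the label for a coded response.
    print(codebook["biggest_problem"]["codes"][2])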
Quantification of Data
4. Data Entry
• Data recorded in numerical format can be
entered by keypunching or the use of
sophisticated optical scanners.
• Typically, responses to internet and telephone
surveys are entered directly into a numerical
database.
5. Cleaning Data
• Logical errors in responses must be reconciled.
• Errors of entry must be corrected.
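• As an illustration, the following Python sketch (using the pandas library) shows how one might flag logical errors and entry errors; the file name and column names are hypothetical.

    import pandas as pd

    df = pd.read_csv("survey.csv")  # hypothetical data file

    # Logical error: respondents reporting more years of schooling than age.
    illogical = df[df["education_years"] > df["age"]]
    print(illogical.index)  # flag these cases for reconciliation

    # Entry error: values outside the documented 1-5 response range
    # become missing and are set aside for correction.
    df["satisfaction"] = df["satisfaction"].where(df["satisfaction"].between(1, 5))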
Quantification of Data
6. Collapsing Response Categories
• Sometimes the researcher might want to
analyze a variable by using fewer response
categories than were used to measure it.
• In these instances, the researcher might want
to “collapse” one or more categories into a
single category.
• The researcher might want to collapse
categories to simplify the presentation of the
results or because few observations exist within
some categories.
Quantification of Data
6. Collapsing Response Categories: Example
Response                        Frequency
Strongly disagree                       2
Disagree                               22
Neither agree nor disagree             45
Agree                                  31
Strongly agree                          1
Quantification of Data
6. Collapsing Response Categories: Example
One might want to collapse the extreme
responses and work with just three categories:
Response                        Frequency
Disagree                               24
Neither agree nor disagree             45
Agree                                  32
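• For readers working in software, the following Python sketch (using pandas) performs this collapse; the data reproduce the frequencies in the tables above.

    import pandas as pd

    # Reproduce the original frequency distribution from the slide.
    responses = pd.Series(
        ["Strongly disagree"] * 2
        + ["Disagree"] * 22
        + ["Neither agree nor disagree"] * 45
        + ["Agree"] * 31
        + ["Strongly agree"] * 1
    )

    # Collapse the extreme categories into their adjacent categories.
    recode = {
        "Strongly disagree": "Disagree",
        "Disagree": "Disagree",
        "Neither agree nor disagree": "Neither agree nor disagree",
        "Agree": "Agree",
        "Strongly agree": "Agree",
    }
    print(responses.map(recode).value_counts().sort_index())
    # Agree 32, Disagree 24, Neither agree nor disagree 45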
Quantification of Data
7. Handling “Don’t Knows”
• When asking about knowledge of factual
information (“Does your teenager drink
alcohol?”) or opinions on a topic the subject
might not know much about (“Do school
officials do enough to discourage teenagers
from drinking alcohol?”), it is wise to include a
“don’t know” category as a possible response.
• Analyzing “don’t know” responses, however,
can be a difficult task.
Quantification of Data
7. Handling “Don’t Knows” (Continued)
• The research-on-research literature regarding
this issue is complex and without clear-cut
guidelines for decision-making.
• The decisions about whether to use “don’t
know” response categories and how to code
and analyze them tends to be idiosyncratic to
the research and the researcher.
Quantitative Data Analysis
• Descriptive statistics summarize and describe
the data at hand, for example, the distribution of
a variable or the strength of the relationship
between an independent variable and a
dependent variable.
• Inferential statistics attempt to generalize the
results of descriptive statistics to a larger
population of interest.
Quantitative Data Analysis
1. Data Reduction
• The first step in quantitative data analysis is to
calculate descriptive statistics about variables.
• The researcher calculates statistics such as the
mean, median, mode, range, and standard
deviation.
• Also, the researcher might choose to collapse
response categories for variables.
Quantitative Data Analysis
2. Measures of Association
• Next, the researcher calculates measures of
association: statistics that indicate the strength
of a relationship between two variables.
• Measures of association rely upon the basic
principle of proportionate reduction in error
(PRE).
Quantitative Data Analysis
2. Measures of Association (Continued)
• PRE represents how much better one would be
at guessing the outcome of a dependent
variable by knowing a value of an independent
variable.
• For example: How much better could I predict
someone’s income if I knew how many years of
formal education they have completed? If the
answer to this question is “37% better,” then
the PRE is 37%.
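• The arithmetic behind PRE can be written as (E1 - E2) / E1, where E1 is the number of prediction errors made without the independent variable and E2 is the number made with it. A small Python sketch, with hypothetical error counts chosen to match the 37% example:

    # E1: errors predicting income knowing nothing about education.
    # E2: errors predicting income once education is known.
    e1 = 100.0  # hypothetical
    e2 = 63.0   # hypothetical
    pre = (e1 - e2) / e1
    print(f"PRE = {pre:.0%}")  # PRE = 37%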
Quantitative Data Analysis
2. Measures of Association (Continued)
• Many statistics are designated by Greek letters (the correlation r, a Latin letter, is an exception).
• Different statistics are used to indicate the
strength of association between variables
measured at different levels of data.
• Strength of association for nominal-level
variables is indicated by λ (lambda).
• Strength of association for ordinal-level
variables is indicated by γ (gamma).
• Strength of association for interval-level
variables is indicated by correlation (r).
Quantitative Data Analysis
2. Measures of Association (Continued)
• Covariance is the extent to which two variables
“change with respect to one another.”
• As one variable increases, the other variable
either increases (positive covariance) or
decreases (negative covariance).
• Correlation is a standardized measure of
covariance.
• Correlation ranges from -1 to +1, with
values closer to -1 or +1 indicating a stronger
relationship.
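• A small Python sketch (using numpy) computes both quantities on hypothetical education and income data:

    import numpy as np

    education = np.array([10, 12, 12, 14, 16, 16, 18, 20])  # hypothetical years
    income = np.array([28, 31, 35, 40, 45, 52, 60, 70])     # hypothetical $1,000s

    covariance = np.cov(education, income)[0, 1]        # sign shows direction
    correlation = np.corrcoef(education, income)[0, 1]  # bounded by -1 and +1
    print(covariance, correlation)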
Quantitative Data Analysis
2. Measures of Association (Continued)
• Technically, covariance is the extent to which
two variables co-vary about their means.
• If a person’s years of formal education are
above the mean education for all persons
and his/her income is above the mean
income for all persons, then this data point
contributes to a positive covariance between
education and income.
Statistics
1. Introduction
• To make inferences from descriptive statistics,
one has to know the reliability of these
statistics.
• In the same sense that the distribution of one
variable has a standard deviation, a parameter
estimate has a standard error: the standard
deviation of the sampling distribution of the
estimate, which indicates how far the estimate
is likely to fall from its mean under the normal
curve.
Statistics
1. Introduction (Continued)
• To better understand the concepts standard
deviation and standard error, and why these
concepts are important to our course, please
review the presentation regarding standard
error.
• Presentation on Standard Error.
Statistics
2. Types of Analysis
• The presentation on inferential statistics will
cover univariate, bivariate and multivariate
analysis.
• Univariate Analysis:
• Mean.
• Median.
• Mode.
• Standard deviation.
Statistics
2. Types of Analysis (Continued)
• Bivariate Analysis
• Tests of statistical significance.
• Chi-square.
• Multivariate Analysis:
• Ordinary least squares (OLS) regression.
• Path analysis.
• Time-series analysis.
• Factor analysis.
• Analysis of variance (ANOVA).
Univariate Analysis
1. Distributions
• Data analysis begins by examining
distributions.
• One might begin, for example, by examining
the distribution of responses to a question
about formal education, where responses are
recorded within six categories.
• A frequency distribution will show the number
and percent of responses in each category of a
variable.
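• A small Python sketch (using pandas) of such a frequency distribution, with hypothetical responses coded 1-6:

    import pandas as pd

    # Hypothetical education responses recorded in six categories (codes 1-6).
    education = pd.Series([1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 6])

    counts = education.value_counts().sort_index()
    percents = (100 * counts / len(education)).round(1)
    print(pd.DataFrame({"n": counts, "percent": percents}))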
Univariate Analysis
2. Central Tendency
• A common measure of central tendency is the
average, or mean, of the responses.
• The median is the value of the “middle” case
when all responses are rank-ordered.
• The mode is the most common response.
• When data are highly skewed, meaning a few
extreme values pull the distribution toward one
end, the median or mode might better represent
the “most common” or “centered” response.
Univariate Analysis
2. Central Tendency (Continued)
• Consider this distribution of respondent ages:
• 18, 19, 19, 19, 20, 20, 21, 22, 85
• The mean equals 27. But this number does not
adequately represent the “common” respondent
because the one person who is 85 skews the
distribution toward the high end.
• The median equals 20.
• This measure of central tendency gives a more
accurate portrayal of the “middle of the
distribution.”
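• These figures can be verified with Python’s standard library, using the ages listed above:

    from statistics import mean, median, mode

    ages = [18, 19, 19, 19, 20, 20, 21, 22, 85]
    print(mean(ages))    # 27: pulled upward by the 85-year-old
    print(median(ages))  # 20: the middle of the rank-ordered values
    print(mode(ages))    # 19: the most common response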
Univariate Analysis
3. Dispersion
• Dispersion refers to the way the values are
distributed around some central value, typically
the mean.
• The range is the distance separating the lowest
and highest values (e.g., the range of the ages
listed previously equals 85 - 18 = 67).
• The standard deviation is an index of the
amount of variability in a set of data.
Univariate Analysis
3. Dispersion (Continued)
• The standard deviation represents dispersion
with respect to the normal (bell-shaped) curve.
• Assuming a set of numbers is normally
distributed, then each standard deviation
equals a certain distance from the mean.
• Each successive standard deviation (+1, +2, etc.)
lies the same distance from the last along the
bell-shaped curve, but contains a declining
percentage of responses because of the shape
of the curve (see Chapter 7).
Univariate Analysis
3. Dispersion (Continued)
• For example, the first standard deviation
accounts for 34.1% of the values on each side
of the mean.
• The figure 34.1% is derived from probability
theory and the shape of the curve.
• Thus, approximately 68% of all responses fall
within one standard deviation of the mean.
• The second standard deviation accounts for the
next 13.6% of the responses from the mean
(27.2% of all responses), and so on.
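• These percentages can be recovered from the normal curve itself, as the following Python sketch (using scipy) shows:

    from scipy.stats import norm

    # Share of a normal distribution within one SD of the mean (~68.3%).
    print(norm.cdf(1) - norm.cdf(-1))
    # Share between the first and second SDs on one side (~13.6%).
    print(norm.cdf(2) - norm.cdf(1))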
Univariate Analysis
3. Dispersion (Continued)
• If the responses are distributed approximately
normally and the range of responses is low,
meaning that most responses fall close to the
mean, then the standard deviation will be
small.
• The standard deviation of professional
golfers’ scores on a golf course will be low.
• The standard deviation of amateur golfers’
scores on a golf course will be high.
Univariate Analysis
4. Continuous and Discrete Variables
• Continuous variables have responses that form
a steady progression (e.g., age, income).
• Discrete (i.e., categorical) variables have
responses that are considered to be separate
from one another (e.g., sex of respondent,
religious affiliation).
Univariate Analysis
4. Continuous and Discrete Variables (Continued)
• Sometimes, it is a matter of debate within the
community of scholars about whether a
measured variable is continuous or discrete.
• This issue is important because the statistical
procedures appropriate for continuous-level
data are more powerful, easier to use, and
easier to interpret than those for discrete-level
data, especially as related to the measurement
of the dependent variable.
Univariate Analysis
4. Continuous and Discrete Variables (Continued)
• Example: Suppose one measures amount of
formal education within five categories (less
than high school, high school, 2-year
vocational/college, college, post-college).
• Is this measure continuous (i.e., coded 1-5) or
discrete?
• In practice, five categories seem to be the cutoff
point for treating a variable as continuous.
• Using a seven-point response scale will give
the researcher a greater chance of deeming a
variable to be continuous.
Bivariate Analysis
1. Introduction
• Bivariate analysis refers to an examination of
the relationship between two variables.
• We might ask these questions about the
relationship between two variables:
• Do they seem to vary in relation to one
another? That is, as one variable increases
in size does the other variable increase or
decrease in size?
• What is the strength of the relationship
between the variables?
Bivariate Analysis
1. Introduction (Continued)
• Divide the cases into groups according to the
attributes of the independent variable (e.g.,
men and women).
• Describe each subgroup in terms of attributes
of the dependent variable (e.g., what percent of
men approve of sexual equality and what
percent of women approve of sexual equality).
Bivariate Analysis
1. Introduction (Continued)
• Read the table by comparing the independent
variable subgroups with one another in terms of
a given attribute of the dependent variable
(e.g., compare the percentages of men and
women who approve of sexual equality).
• Bivariate analysis gives an indication of how the
dependent variable differs across levels or
categories of an independent variable.
• This relationship does not necessarily indicate
causality.
Bivariate Analysis
1. Introduction (Continued)
• Tables that compare responses to a dependent
variable across levels/categories of an
independent variable are called contingency
tables (or sometimes, “crosstabs”).
• When writing a research report, it is common
practice, even when conducting highly
sophisticated statistical analysis, to present
contingency tables also to give readers a sense
of the distributions and bivariate relationships
among variables.
Bivariate Analysis
2. Tests of Statistical Significance
• If one assumes a normal distribution, then one
can examine parameters and their standard
errors with respect to the normal curve to
evaluate whether an observed parameter
differs from zero by some set margin of error.
• Assume that the researcher sets the probability
of a Type-1 error (i.e., the probability of
concluding that an effect exists when in fact it
does not) at 5%.
• That is, we set our margin of error very low,
just 5%.
Bivariate Analysis
2. Tests of Statistical Significance (Continued)
• To evaluate statistical significance, the
researcher compares a parameter estimate to
a “zero point” on a normal curve (its center).
• The question becomes: Is this parameter
estimate sufficiently large, given its standard
error, that, within a 5% probability of error, we
can state that it is not equal to zero?
Bivariate Analysis
2. Tests of Statistical Significance (Continued)
• To achieve a probability of error of 5%, the
parameter estimate must be almost two (i.e.,
1.96) standard deviations from zero, given its
standard error.
• Sometimes in sociological research, scholars
say “two standard deviations” in referring to a
5% error rate. Most of the time, they are more
precise and state 1.96.
Bivariate Analysis
2. Tests of Statistical Significance (Continued)
• Consider this example:
• Suppose the unstandardized estimate of the
effect of self-esteem on marital satisfaction
equals 3.50 (i.e., each one-unit increase in
self-esteem on its scale corresponds to a
3.50-unit increase in marital satisfaction on
its scale).
• Suppose the standard error of this estimate
equals 1.20.
Bivariate Analysis
2. Tests of Statistical Significance (Continued)
• If we divide 3.50 by 1.20 we obtain the ratio of
2.92. This figure is called a t-ratio (or, t-value).
• The figure 2.92 means that the estimate 3.50 is
2.92 standard deviations from zero.
• Based upon our set margin of error of 5%
(which is equivalent to 1.96 standard
deviations), we can state that at prob. < .05,
the effect of self-esteem on marital satisfaction
is statistically significant.
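• The following Python sketch (using scipy, and treating the t-ratio as normally distributed, as is customary in large samples) reproduces this calculation:

    from scipy.stats import norm

    estimate = 3.50   # effect of self-esteem on marital satisfaction
    std_error = 1.20
    t_ratio = estimate / std_error         # 2.92 standard deviations from zero
    p_value = 2 * (1 - norm.cdf(t_ratio))  # two-tailed probability of error
    print(t_ratio, p_value)                # 2.92, about .004 (< .05)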
Bivariate Analysis
2. Tests of Statistical Significance (Continued)
• The t-ratio is the ratio of a parameter estimate
to its standard error.
• The t-ratio equals the number of standard
deviations that an estimate lies from the “zero
point” (i.e., center) of the normal curve.
Bivariate Analysis
2. Tests of Statistical Significance (Continued)
• Why do we state that we need to have 1.96
standard deviations from the zero point of the
normal curve?
• Recall the area beneath the normal curve:
• The first standard deviation covers 34.1% of
the observations on one side of the zero
point.
• The second standard deviation covers the
next 13.6% of the observations.
Bivariate Analysis
2. Tests of Statistical Significance (Continued)
• Let’s assume for a moment that our estimate is
greater than the “real” effect of self-esteem on
marital satisfaction.
• Then, at 1.96 standard deviations, we have
covered the 50% probability below the “real”
effect, and we have covered 34.1% + 13.4% of
the probability above this effect (13.4% rather
than the full 13.6% because 1.96 falls just short
of two standard deviations).
• In total, we have accounted for 97.5% of the
probability that our estimate does not equal
zero.
Bivariate Analysis
2. Tests of Statistical Significance (Continued)
• That leaves 2.5% of the probability above the
“real” estimate.
• But we have to recognize that our estimate
might have fallen below the “real” estimate.
• So, we have the probability of error on both
sides of “reality.”
• 2.5% + 2.5% equals 5%
• This is our set margin of error!
Bivariate Analysis
2. Tests of Statistical Significance (Continued)
• Thus, inferential statistics are calculated with
respect to the properties of the normal curve.
• There are other types of distributions besides
the normal curve, but the normal distribution is
the one most often used in sociological
analysis.
Bivariate Analysis
2. Tests of Statistical Significance (Continued)
• If we know the properties of the normal curve,
and we have calculated an estimate of a
parameter, and we know the standard error of
this estimate (i.e., how much the estimate would
vary across repeated samples), then we can
calculate statistical significance.
• Recall that statistical significance does not
necessarily equal substantive significance.
Bivariate Analysis
3. Chi-Square
• Chi-square is a test of independence between
two variables.
• Typically, one is interested in knowing whether
an independent variable (x) “has some effect”
on a dependent variable (y).
• Said another way, we want to know if y is
independent of x (i.e., whether y goes its own
way regardless of what happens to x).
• Thus, we might ask, “Is church attendance
independent of the sex of the respondent?”
Bivariate Analysis
3. Chi-Square (Continued)
• Scenario 1: Consider these data on sex of the
subject and church attendance:
                Church Attendance
Sex             Yes     No      Total
Male             28     12         40
Female           42     18         60
Total            70     30        100
Bivariate Analysis
3. Chi-Square (Continued)
• Note that:
• 70% of all persons attend church.
• 70% of men attend church.
• 70% of women attend church.
• Thus, we can say that church attendance is
independent of the sex of the respondent
because, if the total number of church goers
equals 70%, then, with independence, we
expect 70% of men and 70% of women to
attend church, and they do.
Bivariate Analysis
3. Chi-Square (Continued)
• Scenario 2: Now, suppose we observed this
pattern of church attendance:
                Church Attendance
Sex             Yes     No      Total
Male             20     20         40
Female           50     10         60
Total            70     30        100
Bivariate Analysis
3. Chi-Square (Continued)
• Note that:
• 70% of all persons attend church.
• Therefore, if church attendance is independent
of the sex of the respondent, then we expect
70% of the men and 70% of the women to
attend church.
• But they do not.
• Instead, 50% of the men attend church and
83.3% of the women attend church.
Bivariate Analysis
3. Chi-Square (Continued)
• So, for this second set of data, is church
attendance independent of the sex of the
respondent?
• Let’s begin by calculating how much error we
would make by assuming men and women
behave as expected.
• That is, for each cell of the table, we will
calculate the difference between the observed
and expected values.
Bivariate Analysis
3. Chi-Square (Continued)
• The difference between observed and expected
values (observed minus expected) in each cell:

                Church Attendance
Sex             Yes                No
Male            20 - 28 = -8       20 - 12 = 8
Female          50 - 42 = 8        10 - 18 = -8
Bivariate Analysis
3. Chi-Square (Continued)
• Note that in each cell, if we assume
independence, we make a mistake equal to “8”
(sometimes positive and sometimes negative).
• If we add all of our mistakes, we obtain a sum
of zero, which we know is not true.
• So, we will square each mistake to make every
number positive.
Bivariate Analysis
3. Chi-Square (Continued)
• How badly did we do in each cell?
• To know the magnitude of our mistake in each
cell, we will divide the squared mistake by
the expected value in the cell (a PRE
measure).
• The following table shows our proportionate
degree of error in each cell and our total
amount of proportionate error for the entire
table.
Bivariate Analysis
3. Chi-Square (Continued)
• Proportionate error is calculated for each cell:
                Church Attendance
Sex             Yes                   No
Male            (-8)² / 28 = 2.29     (8)² / 12 = 5.33
Female          (8)² / 42 = 1.52      (-8)² / 18 = 3.56
The total of all proportionate error = 12.70.
This is the chi-square value for this table.
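• The following Python sketch (using scipy) reproduces the expected values and the chi-square value, and also prints the 5% critical value for one degree of freedom that the next slides discuss:

    from scipy.stats import chi2, chi2_contingency

    observed = [[20, 20],   # male: attends, does not attend
                [50, 10]]   # female: attends, does not attend

    stat, p, dof, expected = chi2_contingency(observed, correction=False)
    print(stat, dof)             # 12.70 with 1 degree of freedom
    print(expected)              # [[28, 12], [42, 18]], as on the earlier slide
    print(chi2.ppf(0.95, df=1))  # 3.841, the 5% critical value used below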
Bivariate Analysis
3. Chi-Square (Continued)
• Our chi-square value of 12.70 summarizes our
proportionate amount of error for the whole
table.
• Is this number big enough to indicate a lack of
independence between church attendance and
sex of the respondent?
• To make this assessment, we compare our
observed chi-square with a standardized
distribution of PRE measures: the chi-square
distribution.
Bivariate Analysis
3. Chi-Square (Continued)
• The chi-square distribution looks like a lopsided
version of the normal curve.
• To compare our observed chi-square with this
distribution, we need some indication of where
we should be on the distribution, as we did with
standard errors on the normal curve.
• On the chi-square distribution, we are “allowed”
a certain amount of error depending upon our
degrees of freedom.
Bivariate Analysis
3. Chi-Square (Continued)
• To understand degrees of freedom, reconsider
our table on observed church attendance:
                Church Attendance
Sex             Yes     No      Total
Male             20     20         40
Female           50     10         60
Total            70     30        100

Given the marginal totals, once we fill in one cell with
the correct number, all the other cells are given.
Bivariate Analysis
3. Chi-Square (Continued)
• The degrees of freedom equal the number of
cells one must fill in correctly before all the
other cells are given.
• Our table has one degree of freedom.
• The more correct guesses one must make, the
greater the degrees of freedom and the more
proportionate amount of error one is “allowed”
within the chi-square distribution before
claiming a lack of independence.
Bivariate Analysis
3. Chi-Square (Continued)
• The amount of chi-square we are allowed, at a
probability of error set to 5%, for one degree of
freedom, equals 3.841.
• Our chi-square exceeds this amount. Thus, we
can claim a lack of independence between
church attendance and sex of the subject at a
probability of error equal to less than 5%.
Bivariate Analysis
3. Chi-Square (Continued)
• Are you wondering where the number 3.841
comes from? It is 1.96 squared.
• Remember 1.96? It is the number of standard
deviations within the normal curve that
indicates a 5% Type-I error rate.
• The t-ratios for the effects of individual
independent variables, like those discussed
earlier, likewise each involve one degree of
freedom.
• So, we are working with the same principles we
used for the normal curve, but with a different
distribution: the chi-square distribution.
Bivariate Analysis
4. Some Words of Caution
1. Recognize that statistical significance does not
necessarily mean that one has substantive
significance.
2. Statistical significance refers to mistakes made
from sampling error only.
3. Tests of statistical significance depend upon
assumptions about sampling and distributions
of data, which are not always met in practice.
Multivariate Analysis
1. Regression Analysis
• Regression analysis is a procedure for
estimating the outcome of a dependent
variable based upon the value of an
independent variable.
• Thus, for just two variables, regression analysis
conveys the same information as the covariance
or correlation between the variables.
Multivariate Analysis
1. Regression Analysis (Continued)
• Typically, regression analysis is used to
simultaneously examine the effects of more
than one independent variable on a dependent
variable.
• One might want to know, for example, the
ability to predict income by knowing the
education, age, race, and sex of the
respondent.
Multivariate Analysis
1. Regression Analysis (Continued)
• The statistic used to summarize the total PRE
of multiple variables is the correlation squared,
or R-square.
• R-square represents the proportion of variance
explained in the dependent variable.
• It represents “how well we did” in explaining the
topic we wanted to explain.
Multivariate Analysis
1. Regression Analysis (Continued)
• R-square ranges from 0 to +1, wherein the
larger the value of R-square, the greater the
predictive ability of the independent variables.
• The predictive ability of each variable is
indicated by the statistic β (beta).
Multivariate Analysis
1. Regression Analysis (Continued)
• Consider this equation:

  y = α + β1x1 + β2x2 + β3x3 + β4x4 + ε

• where:
  • y = the value of the dependent variable,
  • α = the intercept, or “starting point” of y,
  • βi = the strength of the effect of xi on y,
  • ε = the amount of error in the prediction of y.
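• As an illustration, the following Python sketch (using numpy and statsmodels) estimates an equation of this form on simulated data; the variable codings and coefficient values are hypothetical, not results from any real study:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 200
    education = rng.normal(13, 2, n)            # hypothetical years of schooling
    age = rng.normal(40, 10, n)
    race = rng.integers(0, 2, n).astype(float)  # 0/1 coding for illustration
    sex = rng.integers(0, 2, n).astype(float)
    income = 5 + 0.42 * education + 0.10 * age + rng.normal(0, 2, n)

    X = sm.add_constant(np.column_stack([education, age, race, sex]))
    result = sm.OLS(income, X).fit()
    print(result.params)    # alpha (const) followed by the four betas
    print(result.rsquared)  # proportion of variance explained
    print(result.tvalues)   # each estimate divided by its standard error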
Multivariate Analysis
1. Regression Analysis (Continued)
• β is called a parameter estimate. It represents
the amount of change in y for a one unit
change in x.
• For example, a beta of .42 would mean that for
each one unit change in x (e.g., education) we
would expect to observe a .42 unit change in y.
Multivariate Analysis
1. Regression Analysis (Continued)
• For the example we discussed earlier, we can
rewrite the equation as:
• Income = α + β1education + β2age + β3race +
β4sex + ε
• where each of the betas (β) ranges in size
from -∞ to +∞ to let us know the direction
and strength of the relationship between each
independent variable and income.
Multivariate Analysis
1. Regression Analysis (Continued)
• In standardized form, this equation is:
• Income = β*1education + β*2age + β*3race +
β*4sex + ε
• where each of the standardized betas (β*)
ranges in size from -1 to +1.
• Note that the intercept (α) is omitted because,
in standardized form, it equals zero.
Multivariate Analysis
1. Regression Analysis (Continued)
• Each of the beta terms in these equations
represents the partial effect of the variable on
the dependent variable, meaning the effect of
the independent variable on y after controlling
for the effects of all other variables on y.
• The partial effects of independent variables in
explaining the variance in a dependent variable
can be visualized by thinking about the
contributions of each player on a basketball
team to the overall team performance.
Multivariate Analysis
1. Regression Analysis (Continued)
• Suppose the team wins, 65-60. The player at
center is the leading scorer with 18 points.
• So, we might say that the center is the most
important contributor to the win. “Not so fast,”
says regression analysis.
• Regression analysis also wants to know the
contributions of the other players on the team
and how they helped the center.
Multivariate Analysis
1. Regression Analysis (Continued)
• Suppose that the point guard had 10 assists, 8
of which went to the center. Eight times the
point guard drove the lane and then passed the
ball to the center for an easy layup, accounting
for 16 of the 18 points scored by the center.
• To best understand the contributions of the
center, we would calculate the contributions of
the center while “controlling for” the
contributions of the point guard.
Multivariate Analysis
1. Regression Analysis (Continued)
• Similarly, regression analysis shows the
contribution to R-square for each variable,
while controlling for the contributions of the
other variables.
• The contribution of each variable in explaining
variance in the dependent variable is
summarized as a partial beta coefficient.
Multivariate Analysis
1. Regression Analysis (Continued)
• In summary, regression analysis provides two
indications of our ability to explain how
societies work:
• The R-Square shows how much variance is
explained in the dependent variable.
• The standardized betas (parameter estimates)
show the partial effects of the independent
variables in explaining the dependent variable.
Multivariate Analysis
1. Regression Analysis (Continued)
• The graphic shown on the next slide shows a
diagram of a regression of income (y) on
education (x).
• The regression equation (Y2) is shown as a
blue-colored line. The intercept (α) is located
where the regression line meets the y axis.
• The slope of the line is the beta coefficient (β),
which equals .42.
Multivariate Analysis
1. Regression Analysis (Continued)
[Graphic: plot of income (y) against education (x) with the fitted regression line in blue]
Multivariate Analysis
1. Regression Analysis (Continued)
• We would interpret the results of the regression
equation shown on the preceding slide in this
manner: “A one unit change in education will
result in a .42 unit change in income.”
• We can adjust this interpretation into actual
units of education and income as we measured
them in our study, to state, for example, “Each
additional year of education results in an
additional $4,200 in annual income.”
Multivariate Analysis
1. Regression Analysis (Continued)
• One should be cautious about interpreting the
results of regression analysis:
• A high R-square value does not necessarily
mean that the researcher can be confident
of knowing cause and effect.
• Predictions regarding the dependent
variable are valid only within the range of
the independent variables used in the
regression analysis.
Multivariate Analysis
1. Regression Analysis (Continued)
• The preceding discussion has focused upon
linear regression.
• Regression lines can be curvilinear or some
combination of straight and curved lines.
Multivariate Analysis
2. Path Analysis
• Path analysis is the simultaneous calculation of
regression coefficients within a complex model
of direct and indirect relationships.
• The elaboration model regarding the success of
women-owned businesses, discussed earlier, is
an example of path analysis.
• Path analysis is a very powerful tool for
examining cause and effect within a complex
theoretical model.
Multivariate Analysis
3. Time-Series Analysis
• Time-series analysis uses comparisons of
statistics and/or parameter estimates across
time to learn how changes in the independent
variable(s) affect changes in the dependent
variable(s).
• Time-series analysis, when the data are
available, can be a powerful tool for gaining a
stronger indication of cause and effect than
one learns from a cross-sectional analysis.
Multivariate Analysis
4. Factor Analysis
• Factor analysis indicates the extent to which a
set of variables measures the same underlying
concept.
• This procedure assesses the extent to which
variables are highly correlated with one another
compared with other sets of variables.
• Consider the table of correlations (i.e., a
“correlation matrix”) on the following slide:
Multivariate Analysis
4. Factor Analysis (Continued)
        X1      X2      X3      X4      X5      X6
X1    1.00     .52     .60     .21     .15     .09
X2     .52    1.00     .59     .12     .13     .11
X3     .60     .59    1.00     .08     .10     .10
X4     .21     .12     .08    1.00     .72     .70
X5     .15     .13     .10     .72    1.00     .73
X6     .09     .11     .10     .70     .73    1.00
Multivariate Analysis
4. Factor Analysis (Continued)
• Note that variables X1-X3 are moderately
correlated with one another, but have weak
correlations with variables X4-X6.
• Similarly, variables X4-X6 are moderately
correlated with one another, but have weak
correlations with variables X1-X3.
• The figures in this table indicate that variables
X1-X3 “go together” and variables X4-X6 “go
together.”
Multivariate Analysis
4. Factor Analysis (Continued)
• Factor analysis would separate variables X1-X3
into “Factor 1” and variables X4-X6 into
“Factor 2.”
• Suppose variables X1-X3 were designed by
the researcher to measure self-esteem and
variables X4-X6 were designed to measure
marital satisfaction.
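• As a rough illustration of how the grouping emerges, the following Python sketch (using numpy) examines the eigenstructure of the correlation matrix; this is a simplified stand-in for the extraction step of factor analysis, not a full estimation:

    import numpy as np

    # The correlation matrix from the earlier slide.
    R = np.array([
        [1.00, 0.52, 0.60, 0.21, 0.15, 0.09],
        [0.52, 1.00, 0.59, 0.12, 0.13, 0.11],
        [0.60, 0.59, 1.00, 0.08, 0.10, 0.10],
        [0.21, 0.12, 0.08, 1.00, 0.72, 0.70],
        [0.15, 0.13, 0.10, 0.72, 1.00, 0.73],
        [0.09, 0.11, 0.10, 0.70, 0.73, 1.00],
    ])

    values, vectors = np.linalg.eigh(R)  # eigenvalues in ascending order
    print(values[::-1])                  # two eigenvalues dominate: two factors
    print(vectors[:, -2:].round(2))      # X1-X3 load together, as do X4-X6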
Multivariate Analysis
4. Factor Analysis (Continued)
• The researcher could use the results of factor
analysis, including the statistics produced by it,
to evaluate the construct validity of using X1-X3
to measure self-esteem and using X4-X6 to
measure marital satisfaction.
• Thus, factor analysis can be a useful tool for
confirming the validity of measures of latent
variables.
Multivariate Analysis
4. Factor Analysis (Continued)
• Factor analysis can be used also for exploring
groupings of variables.
• Suppose a researcher has a list of 20
statements that measure different opinions
about same-sex marriage.
• The researcher might wonder whether the 20
opinions reflect a smaller number of “basic”
opinions.
Multivariate Analysis
4. Factor Analysis (Continued)
• Factor analysis of responses to these
statements might indicate, for example, that
they can be reduced into three latent variables,
related to religious beliefs, beliefs about civil
rights, and beliefs about sexuality.
• Then, the researcher can create scales of the
grouped variables to measure religious beliefs,
beliefs about civil rights, and beliefs about
sexuality to examine support for same-sex
marriage.
Multivariate Analysis
5. Analysis of Variance
• Analysis of variance (ANOVA) examines
whether the mean value of a variable for one
group differs from that of another group.
• Is the mean income for males, for example,
statistically different from the mean income for
females?
Multivariate Analysis
5. Analysis of Variance (Continued)
• For examining mean differences across just
one other variable, the researcher uses one-way
ANOVA, which, for two groups, is equivalent to
a t-test.
• For two or more other variables, the researcher
uses two-way (or higher-order) ANOVA. The
researcher might
be interested, for example, in knowing how
mean incomes differ based upon sex of subject
and level of education.
Multivariate Analysis
5. Analysis of Variance (Continued)
• The logic of a statistical test of a difference in
means is identical to that of testing whether an
estimate differs from zero, except that the
comparison point is the mean of the other
group rather than zero.
• Rather than using just the estimate and its
standard error for a single group, the procedure
is to use the estimates and standard errors of
two groups to assess statistical significance.
Multivariate Analysis
5. Analysis of Variance (Continued)
• Suppose we wanted to know if the mean height
of male ISU students differs significantly from
the mean height of female ISU students.
• Rather than comparing the mean height of
male ISU students to a hypothetical zero point,
we would compare it to the mean height of
female ISU students, where this comparison
takes place within the context of standard
errors and the shape of the normal curve.
Multivariate Analysis
5. Analysis of Variance (Continued)
• Suppose we find in our sample of 100 female
ISU students that their mean height equals 65
inches with a standard deviation of 1.5 inches.
These figures indicate that most females in the
sample (68.2%) are between 63.5 and 66.5
inches tall. The standard error of this mean
equals 1.5/√100 = 0.15 inches.
• Suppose that a sample of 100 male ISU
students shows a mean height for them of 70
inches with a standard deviation of 2.0 inches
(standard error = 0.20 inches).
Multivariate Analysis
5. Analysis of Variance (Continued)
• Let’s set our margin of error (probability of a
Type-1 error) at 5%, meaning that we are
looking at 1.96 standard deviations on the
normal curve to indicate statistical significance.
• Here is our question: If we allow the mean of
females to “grow” by 1.96 standard errors
and the mean of males to “shrink” by 1.96
standard errors, will they reach one
another?
Multivariate Analysis
5. Analysis of Variance (Continued)
• The answer is no, not even close. The t-ratio
(the difference in means divided by the
standard error of the difference,
√(0.15² + 0.20²) = 0.25) equals 5/0.25 = 20.
• We can state that the difference in mean
heights between ISU males and females is
statistically significant at prob. < .05 (actually,
considerably less than that; but that was our
test margin).
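• The following Python sketch reproduces this test from the summary figures above:

    import math

    mean_f, sd_f, n_f = 65.0, 1.5, 100   # female sample
    mean_m, sd_m, n_m = 70.0, 2.0, 100   # male sample

    # Standard errors of each mean, then of the difference in means.
    se_f = sd_f / math.sqrt(n_f)            # 0.15
    se_m = sd_m / math.sqrt(n_m)            # 0.20
    se_diff = math.sqrt(se_f**2 + se_m**2)  # 0.25
    t_ratio = (mean_m - mean_f) / se_diff
    print(t_ratio)  # 20.0, far beyond the 1.96 needed at prob. < .05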
Summary of Data Analysis
• Sociologists have at their disposal a wide range of
statistical techniques to help them understand
relationships among their variables of interest.
• These techniques, when used properly, can help
sociologists understand human societies for the
purpose of improving human well-being.
• Students who want to be professional sociologists
must learn statistics and the proper applications of
these statistics to data analysis.
• Enjoy!