Data Analysis

In most social research the data analysis
involves three major steps, done in
roughly this order:
• Cleaning and organizing the data for
analysis (Data Preparation)
• Describing the data (Descriptive
Statistics)
• Testing Hypotheses and Models
(Inferential Statistics)
Data Preparation
• Involves checking or logging the data in;
checking the data for accuracy; entering
the data into the computer; transforming
the data; and developing and documenting
a database structure that integrates the
various measures.
Descriptive Statistics
• Used to describe the basic features of the
data in a study. They provide simple
summaries about the sample and the
measures. Together with simple graphics
analysis, they form the basis of virtually
every quantitative analysis of data. With
descriptive statistics you are simply
describing what is, what the data shows.
Inferential Statistics
• Investigate questions, models, and hypotheses. In
many cases, the conclusions from inferential
statistics extend beyond the immediate data
alone.
• For instance, we use inferential statistics to try to
infer from the sample data what the population
thinks. Or, we use inferential statistics to make
judgments of the probability that an observed
difference between groups is a dependable one
or one that might have happened by chance in
this study.
Types of Statistical Analysis
• Univariate Statistical Analysis
– Tests of hypotheses involving only one
variable.
– Testing of statistical significance
• Bivariate Statistical Analysis
– Tests of hypotheses involving two variables.
• Multivariate Statistical Analysis
– Statistical analysis involving three or more
variables or sets of variables.
Statistical Analysis: Key Terms
• Hypothesis
– Unproven proposition: a supposition that
tentatively explains certain facts or
phenomena.
– An assumption about the nature of the world.
• Null Hypothesis
– A statement of no difference between the sample and the population.
• Alternative Hypothesis
– Statement that indicates the opposite of the
null hypothesis.
Choosing the Appropriate Statistical
Technique
• Choosing the correct statistical technique
requires considering:
– Type of question to be answered
– Number of variables involved
– Level of scale measurement
Univariate analysis
Univariate analysis involves the examination
across cases of one variable at a time.
There are three major characteristics of a
single variable that we tend to look at:
– the distribution
– the central tendency
– the dispersion
In most situations, we would describe all three of
these characteristics for each of the variables in our
study.
The Distribution
The distribution is a
summary of the
frequency of individual
values or ranges of
values for a variable.
The simplest
distribution would list
every value of a
variable and the
number of persons who
had each value.
Distributions may also be displayed using
percentages. For example, you could use
percentages to describe the:
• percentage of people in different income
levels
• percentage of people in different age ranges
• percentage of people in different ranges of
standardized test scores
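A frequency distribution and its percentage form can be sketched in a few lines. This is only an illustration: the data are assumed to be the eight scores from the Central Tendency example, and the variable names are hypothetical.

```python
from collections import Counter

# Illustrative data: the eight scores from the Central Tendency example.
scores = [15, 20, 21, 20, 36, 15, 25, 15]

# Frequency distribution: number of cases that have each value.
frequencies = Counter(scores)

# Percentage distribution: share of all cases at each value.
percentages = {
    value: 100 * count / len(scores)
    for value, count in sorted(frequencies.items())
}

for value, pct in percentages.items():
    print(f"value={value:>2}  count={frequencies[value]}  percent={pct:.1f}")
```

For grouped data such as income levels or age ranges, the same counting would be applied to range labels instead of raw values.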
Central Tendency
The central tendency of a distribution is
an estimate of the "center" of a
distribution of values. There are three
major types of estimates of central
tendency:
• Mean
• Median
• Mode
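To compute the mean, add up all the values and divide by the
number of values. For example, consider these 8 scores:
15, 20, 21, 20, 36, 15, 25, 15
The sum of these 8 values is 167, so
the mean is 167/8 = 20.875.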
If we order the 8 scores shown above, we would get:
15,15,15,20,20,21,25,36
There are 8 scores and score #4 and #5 represent
the halfway point. Since both of these scores are 20,
the median is 20. If the two middle scores had
different values, you would have to interpolate to
determine the median.
To determine the mode, you might again
order the scores as shown above, and
then count each one. The most frequently
occurring value is the mode. In our
example, the value 15 occurs three times
and is the mode.
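The three estimates can be checked with Python's standard `statistics` module. This is a minimal sketch using the eight scores from the example; the variable names are illustrative.

```python
import statistics

scores = [15, 20, 21, 20, 36, 15, 25, 15]

mean = statistics.mean(scores)      # sum of the values divided by their count: 167 / 8
median = statistics.median(scores)  # average of the two middle scores after sorting
mode = statistics.mode(scores)      # most frequently occurring value

print(f"mean={mean}, median={median}, mode={mode}")
```

With an even number of scores, `statistics.median` averages the two middle values, which is the simplest form of the interpolation mentioned above.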
Dispersion
Dispersion refers to the spread of the values
around the central tendency. There are
two common measures of dispersion, the
range and the standard deviation.
The range is simply the highest value minus the
lowest value. In our example distribution, the high
value is 36 and the low is 15, so the range is 36 - 15 =
21.
The Standard Deviation
is a more accurate and detailed estimate of
dispersion than the range, because an outlier can greatly
exaggerate the range (as was true in this
example, where the single outlier value of
36 stands apart from the rest of the values).
The Standard Deviation shows the relation
that a set of scores has to the mean of the
sample.
15 - 20.875 = -5.875
20 - 20.875 = -0.875
21 - 20.875 = +0.125
20 - 20.875 = -0.875
36 - 20.875 = +15.125
15 - 20.875 = -5.875
25 - 20.875 = +4.125
15 - 20.875 = -5.875
Statistic         Value
N                 8
Mean              20.8750
Median            20.0000
Mode              15.00
Std. Deviation    7.0799
Variance          50.1250
Range             21.00
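The path from the deviations to the variance and standard deviation can be sketched as follows. The sketch assumes the usual sample formulas (sum of squared deviations divided by N - 1, then the square root), which reproduce the variance of 50.125 and standard deviation of 7.0799 reported for this example.

```python
import math

scores = [15, 20, 21, 20, 36, 15, 25, 15]
n = len(scores)
mean = sum(scores) / n

# Range: highest value minus lowest value.
value_range = max(scores) - min(scores)

# Deviation of each score from the mean (the list of differences shown above).
deviations = [x - mean for x in scores]

# Square the deviations and sum them.
sum_of_squares = sum(d ** 2 for d in deviations)

# Sample variance divides by N - 1 (assumed here; it matches the reported value).
variance = sum_of_squares / (n - 1)

# The standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)

print(f"range={value_range}, variance={variance}, std_dev={std_dev:.4f}")
```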
Bivariate analysis
The correlation is one of the most common
and most useful statistics.
A correlation is a single number that describes the
degree of relationship between two variables.
Let's assume that we want to look at the
relationship between two variables, height
(in inches) and self esteem.
Height is measured in inches.
Self esteem is measured based
on the average of 10 1-to-5
rating items (where higher
scores mean higher self esteem).

Person   Height   Self Esteem
1        68       4.1
2        71       4.6
3        62       3.8
4        75       4.4
5        58       3.2
6        60       3.1
7        67       3.8
8        68       4.1
9        71       4.3
10       69       3.7
11       68       3.5
12       67       3.2
13       63       3.7
14       62       3.3
15       60       3.4
16       63       4.0
17       65       4.1
18       67       3.8
19       63       3.4
20       61       3.6
Variable      Mean    StDev    Variance   Sum    Minimum   Maximum   Range
Height        65.4    4.4057   19.4105    1308   58        75        17
Self Esteem   3.755   0.4261   0.18155    75.1   3.1       4.6       1.5
Calculating the Correlation
So, the correlation for our twenty cases is .73,
which is a fairly strong positive relationship.
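The calculation itself is not shown in the transcript. A minimal sketch, assuming the standard computational formula for the Pearson correlation and the twenty height and self esteem values listed earlier, gives the same result of about .73. The function name `pearson_r` is illustrative.

```python
import math

height = [68, 71, 62, 75, 58, 60, 67, 68, 71, 69,
          68, 67, 63, 62, 60, 63, 65, 67, 63, 61]
self_esteem = [4.1, 4.6, 3.8, 4.4, 3.2, 3.1, 3.8, 4.1, 4.3, 3.7,
               3.5, 3.2, 3.7, 3.3, 3.4, 4.0, 4.1, 3.8, 3.4, 3.6]

def pearson_r(x, y):
    """Pearson correlation via the raw-sums computational formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator

r = pearson_r(height, self_esteem)
print(f"r = {r:.2f}")
```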
Testing the Significance of a Correlation
Once you've computed a correlation, you can
determine the probability that the observed
correlation occurred by chance. That is, you can
conduct a significance test. Most often you are
interested in determining the probability that the
correlation is a real one and not a chance
occurrence. In this case, you are testing the
mutually exclusive hypotheses:
Null Hypothesis: r = 0
Alternative Hypothesis: r ≠ 0
To test these hypotheses, you first need to determine the significance level. Here, use the
common significance level of alpha = .05.
Next, compute the degrees of freedom (df). The df is simply equal to N - 2 or, in this example, 20 - 2 = 18.
Finally, decide whether you are doing a one-tailed or two-tailed test. In this
example, since there is no strong prior theory to suggest whether the
relationship between height and self esteem would be positive or negative, we
opt for the two-tailed test.
With these three pieces of information,
the significance level (alpha = .05), degrees of
freedom (df = 18), and type of test (two-tailed),
the critical value is .4438. This means that
if our correlation is greater than .4438 or less
than -.4438 (remember, this is a two-tailed test),
we can conclude that the odds are less than 5
out of 100 that this is a chance occurrence.
Since the correlation of .73 is higher than .4438, we conclude
that it is not a chance finding and that the
correlation is "statistically significant".
The null hypothesis is rejected and the
alternative is accepted.
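The decision rule can be sketched as a simple comparison. The critical value .4438 is taken from the text rather than computed. The conversion to a t statistic, t = r * sqrt(df) / sqrt(1 - r^2), is the standard equivalent form of the same test and is included here as an assumption about how such a table value is derived.

```python
import math

r = 0.73             # observed correlation from the example
n = 20               # number of cases
df = n - 2           # degrees of freedom
critical_r = 0.4438  # from the text: alpha = .05, df = 18, two-tailed

# Two-tailed test: reject the null hypothesis if |r| exceeds the critical value.
significant = abs(r) > critical_r

def r_to_t(r_value, dof):
    """Convert a correlation to the equivalent t statistic."""
    return r_value * math.sqrt(dof) / math.sqrt(1 - r_value ** 2)

t_stat = r_to_t(r, df)           # t for the observed correlation
t_crit = r_to_t(critical_r, df)  # t implied by the critical correlation

print(f"df={df}, significant={significant}, t={t_stat:.2f}, t_crit={t_crit:.2f}")
```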
Pearson Product-Moment Correlation Matrix for
Salesperson
Other Correlations
The specific type of correlation illustrated here is known as
the Pearson Product Moment Correlation.
It is appropriate when both variables are measured at an
interval level.
However, there are a wide variety of other types of
correlations for other circumstances. For instance,
if you have two ordinal variables, you could use the
Spearman Rank Order Correlation (rho) or the Kendall Rank
Order Correlation (tau).
When one measure is a continuous interval-level one and
the other is dichotomous (i.e., two-category), you can use
the Point-Biserial Correlation.
For other situations, consult the web-based statistics
selection program, Selecting Statistics, at
http://trochim.human.cornell.edu/selstat/ssstart.htm.
Regression Analysis
• Simple (Bivariate) Linear Regression
– A measure of linear association that investigates straight-line relationships between a continuous dependent variable
and an independent variable that is usually continuous, but
can be a categorical dummy variable.
• The Regression Equation (Y = α + βX )
– Y = the continuous dependent variable
– X = the independent variable
– α = the Y intercept (where the regression line intercepts the Y axis)
– β = the slope coefficient (rise over run)
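As an illustration of estimating α and β, the sketch below fits a least-squares line to the height and self esteem data from the correlation example. Treating self esteem as Y and height as X is an assumption made only for this illustration, and the function name `fit_line` is hypothetical.

```python
height = [68, 71, 62, 75, 58, 60, 67, 68, 71, 69,
          68, 67, 63, 62, 60, 63, 65, 67, 63, 61]
self_esteem = [4.1, 4.6, 3.8, 4.4, 3.2, 3.1, 3.8, 4.1, 4.3, 3.7,
               3.5, 3.2, 3.7, 3.3, 3.4, 4.0, 4.1, 3.8, 3.4, 3.6]

def fit_line(x, y):
    """Ordinary least squares estimates for Y = alpha + beta * X."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: co-variation of X and Y divided by the variation of X.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    beta = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    alpha = mean_y - beta * mean_x
    return alpha, beta

alpha, beta = fit_line(height, self_esteem)

# Predicted self esteem for a person 70 inches tall.
predicted = alpha + beta * 70

print(f"alpha={alpha:.3f}, beta={beta:.4f}, predicted={predicted:.2f}")
```

The negative intercept has no substantive meaning here, since a height of zero lies far outside the observed range.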
The Regression Equation
• Parameter Estimate Choices
– β is indicative of the strength and direction of the
relationship between the independent and
dependent variable.
– α (Y intercept) is a fixed point that is considered a
constant (how much Y can exist without X)
• Standardized Regression Coefficient (β)
– Estimated coefficient of the strength of relationship
between the independent and dependent variables.
– Expressed on a standardized scale where higher
absolute values indicate stronger relationships
(range is from -1 to 1).
Simple Regression Results Example
What is Multivariate Data
Analysis?
• Research that involves three or more variables, or that is
concerned with underlying dimensions among multiple
variables, will involve multivariate statistical analysis.
– Methods analyze multiple variables or even multiple
sets of variables simultaneously.
– Business or economic problems involve multivariate
data analysis:
• most employee motivation research
• customer psychographic profiles
• research that seeks to identify viable market segments
Which Multivariate Approach Is
Appropriate?
Classifying Multivariate
Techniques
• Dependence Techniques
– Explain or predict one or more dependent
variables.
– Needed when hypotheses involve distinction
between independent and dependent
variables.
– Types:
• Multiple regression analysis
• Multiple discriminant analysis
• Multivariate analysis of variance
Classifying Multivariate
Techniques (cont’d)
• Interdependence Techniques
– Give meaning to a set of variables or seek to
group things together.
– Used when researchers examine questions
that do not distinguish between independent
and dependent variables.
– Types:
• Factor analysis
• Cluster analysis
• Multidimensional scaling
Classifying Multivariate
Techniques (cont’d)
• Influence of Measurement Scales
– The nature of the measurement scales will
determine which multivariate technique is
appropriate for the data.
– Selection of a multivariate technique requires
consideration of the types of measures used
for both independent and dependent sets of
variables.
– Nominal and ordinal scales are nonmetric.
– Interval and ratio scales are metric.
Which Multivariate Dependence Technique Should I
Use?
Which Multivariate Interdependence Technique
Should I Use?
Interpreting Multiple Regression
• Multiple Regression Analysis
– An analysis of association in which the effects
of two or more independent variables on a
single, interval-scaled dependent variable are
investigated simultaneously.
$Y_i = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3 + \cdots + b_n X_n + e_i$

• Dummy variable
– The way a dichotomous (two-group) independent
variable is represented in regression analysis by
assigning a 0 to one group and a 1 to the other.
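A small sketch of dummy coding follows. The store data are hypothetical and chosen only to make the arithmetic easy to follow. With a single 0/1 predictor, the least-squares intercept equals the mean of the group coded 0 and the slope equals the difference between the two group means.

```python
# Hypothetical data: 0 = no sales office, 1 = sales office present.
has_office = [0, 0, 0, 1, 1, 1]
sales = [10.0, 12.0, 14.0, 18.0, 20.0, 22.0]

n = len(sales)
mean_x = sum(has_office) / n
mean_y = sum(sales) / n

# Least-squares slope and intercept with the dummy variable as X.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(has_office, sales))
sxx = sum((x - mean_x) ** 2 for x in has_office)
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x

# Group means for comparison.
group0 = [y for x, y in zip(has_office, sales) if x == 0]
group1 = [y for x, y in zip(has_office, sales) if x == 1]
mean0 = sum(group0) / len(group0)
mean1 = sum(group1) / len(group1)

print(f"b0={b0}, b1={b1}, mean0={mean0}, mean1 - mean0={mean1 - mean0}")
```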
Multiple Regression Analysis
• A Simple Example
– Assume that a toy manufacturer wishes to explain
store sales (dependent variable) using a sample of
stores from Canada and Europe.
– Several hypotheses are offered:
• H1: Competitor’s sales are related negatively to
sales.
• H2: Sales are higher in communities with a sales
office than when no sales office is present.
• H3: Grammar school enrollment in a community is
related positively to sales.
Multiple Regression Analysis
(cont’d)
• Regression Coefficients in Multiple Regression
– Partial correlation
• The correlation between two variables after taking into
account the fact that they are correlated with other variables
too.
• R² in Multiple Regression
– The coefficient of multiple determination in multiple
regression indicates the percentage of variation in Y
explained by all independent variables.
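The "percentage of variation explained" can be sketched as one minus the ratio of unexplained to total variation. The observed and predicted values below are hypothetical; in practice the predictions would come from the fitted multiple regression model.

```python
def r_squared(actual, predicted):
    """Share of the variation in Y accounted for by the predictions."""
    mean_y = sum(actual) / len(actual)
    # Total variation of Y around its mean.
    ss_total = sum((y - mean_y) ** 2 for y in actual)
    # Variation left unexplained by the model (squared residuals).
    ss_error = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return 1 - ss_error / ss_total

# Hypothetical observed values and model predictions.
actual = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]

r2 = r_squared(actual, predicted)
print(f"R squared = {r2:.2f}")
```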
Interpreting Multiple Regression Results
ANOVA (n-way) and MANOVA
• Multivariate Analysis of Variance
(MANOVA)
– A multivariate technique that predicts multiple
continuous dependent variables with multiple
categorical independent variables.
ANOVA (n-way) and MANOVA
(cont’d)
Interpreting N-way (Univariate) ANOVA
1. Examine overall model F-test result. If significant,
proceed.
2. Examine individual F-tests for individual variables.
3. For each significant categorical independent variable,
interpret the effect by examining the group means.
4. For each significant, continuous covariate, interpret
the parameter estimate (b).
5. For each significant interaction, interpret the means
for each combination.
Discriminant Analysis
• A statistical technique for predicting the probability
that an object will belong in one of two or more
mutually exclusive categories (dependent variable),
based on several independent variables.
– To calculate discriminant scores, the linear
function used is:
$Z_i = b_1 X_{1i} + b_2 X_{2i} + \cdots + b_n X_{ni}$
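The linear function above is a weighted sum of an object's values on the independent variables. In the sketch below, the weights, objects, and cutoff are all hypothetical. Assigning a group by comparing Z to a cutoff is an assumption added for illustration of the two-group case.

```python
def discriminant_score(weights, values):
    """Return Z = b1*X1 + b2*X2 + ... + bn*Xn for one object."""
    return sum(b * x for b, x in zip(weights, values))

# Hypothetical discriminant weights b1 and b2.
weights = [0.5, 0.25]

# Hypothetical objects, each measured on two independent variables.
objects = {"A": [4.0, 8.0], "B": [1.0, 2.0]}

# Hypothetical cutoff separating the two groups.
cutoff = 2.0

scores = {name: discriminant_score(weights, xs) for name, xs in objects.items()}
groups = {name: 1 if z > cutoff else 0 for name, z in scores.items()}

for name in objects:
    print(f"object {name}: Z={scores[name]}, predicted group={groups[name]}")
```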
Factor Analysis
• A type of analysis used to discern the
underlying dimensions or regularity in
phenomena. Its general purpose is to
summarize the information contained in a
large number of variables into a smaller
number of factors.
Multidimensional Scaling
• Multidimensional Scaling
– Measures objects in multidimensional space
on the basis of respondents’ judgments of the
similarity of objects.