Section 10 - Data Ana+

Mgt 540
Research Methods
Data Analysis
1
Additional “sources”
• Compilation of sources:
  http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm
  http://web.utk.edu/~dap/Random/Order/Start.htm
• Data Analysis Brief Book (glossary)
  http://rkb.home.cern.ch/rkb/titleA.html
• Exploratory Data Analysis
  http://www.itl.nist.gov/div898/handbook/eda/eda.htm
• Statistical Data Analysis
  http://obelia.jde.aca.mmu.ac.uk/resdesgn/arsham/opre330.htm
2
3
Copyright © 2003 John Wiley & Sons, Inc. Sekaran/RESEARCH 4E
FIGURE 12.1
Data Analysis

Get the “feel” for the data
• Get the mean, variance, and standard deviation of each variable.
• Check that, for all items, responses range over the whole scale and are not restricted to one end of the scale alone.
• Obtain Pearson correlations among the variables under study.
• Get a frequency distribution for each variable.
• Tabulate your data.
• Describe your sample's key characteristics (demographic details of sex composition, education, age, length of service, etc.).
• Look at histograms, frequency polygons, etc.
4
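A minimal sketch of these first-pass checks, using only Python's standard library. The variable name `satisfaction`, the 1-5 scale, and the responses are made-up illustration data, not taken from the course material.

```python
from collections import Counter
from statistics import mean, variance, stdev

# Hypothetical responses to one item on a 1-5 scale (illustration only).
satisfaction = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]

def feel_for_data(values, scale_min, scale_max):
    """Return basic descriptives plus a frequency distribution."""
    return {
        "mean": mean(values),
        "variance": variance(values),   # sample variance (divides by n - 1)
        "std_dev": stdev(values),       # sample standard deviation
        "frequencies": dict(sorted(Counter(values).items())),
        # True when responses reach both ends of the scale.
        "uses_full_scale": min(values) == scale_min and max(values) == scale_max,
    }

summary = feel_for_data(satisfaction, scale_min=1, scale_max=5)
for name, value in summary.items():
    print(name, value)
```

Running the same function over every variable gives a quick tabulation before any hypothesis testing is attempted.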
Quantitative Data
• Each type of data requires different analysis method(s):
  - Nominal
    · Labeling
    · No inherent “value” basis
    · Categorization purposes only
  - Ordinal
    · Ranking, sequence
  - Interval
    · Relationship basis (e.g. age)
5
Descriptive Statistics
Describing key features of data
• Central Tendency
  - Mean, median, mode
• Spread
  - Variance, standard deviation, range
• Distribution (Shape)
  - Skewness, kurtosis
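The shape measures can be sketched with moment-based formulas. This is one common convention (population moments, excess kurtosis); statistical packages often apply small-sample adjustments, so their output may differ slightly. The two data sets are hypothetical.

```python
from statistics import mean

def central_moment(values, k):
    """k-th central moment, dividing by n (population convention)."""
    m = mean(values)
    return sum((v - m) ** k for v in values) / len(values)

def skewness(values):
    # Third moment scaled by the cube of the standard deviation.
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    # Fourth moment scaled by the squared variance; 0 for a normal curve.
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

symmetric = [1, 2, 3, 4, 5]        # hypothetical, evenly spread
right_skewed = [1, 1, 1, 2, 10]    # hypothetical, long right tail

print(skewness(symmetric), skewness(right_skewed))
print(excess_kurtosis(symmetric))
```

A positive skewness indicates a long right tail; a negative excess kurtosis indicates a flatter-than-normal distribution.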
6
Descriptive Statistics
Describing key features of data
• Nominal
  - Identification / categorization only
• Ordinal (Example on pg. 139)
  - Non-parametric statistics
  - Do not assume equal intervals
  - Frequency counts
  - Averages (median and mode)
• Interval
  - Parametric
  - Mean, standard deviation, variance
7
Testing “Goodness of Fit”
• Reliability
  - Split half
  - Internal consistency
• Validity
  - Convergent
  - Discriminant
  - Factorial
• Involves correlations and factor analysis
8
Testing Hypotheses
• Use appropriate statistical analysis
  - t-test (one- or two-tailed): test the significance of the difference between the means of two groups
  - ANOVA: test the significance of differences among the means of more than two different groups, using the F test
  - Regression (simple or multiple): establish the variance explained in the DV by the variance in the IVs
9
Statistical Power
• Claiming a significant difference
  - Errors in methodology
    · Type 1 error: reject the null hypothesis when you should not. Called an “alpha” error
    · Type 2 error: fail to reject the null hypothesis when you should. Called a “beta” error
  - Statistical power refers to the ability to detect true differences, avoiding type 2 errors
10
Statistical Power
See discussion at http://my.execpc.com/4A/B7/helberg/pitfalls/
• Depends on 4 issues
  - Sample size
  - The effect size you want to detect
  - The alpha (type 1 error rate) you specify
  - The variability of the sample
• Too little power
  - Overlook effect
• Too much power
  - Any difference is significant
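How the four issues interact can be sketched for the simplest case, a two-tailed one-sample z test with a known standard deviation. That test choice is an assumption made for illustration, and the function name `z_test_power` and all input values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a two-tailed one-sample z test (sigma assumed known).

    effect: true difference to detect, sigma: variability,
    n: sample size, alpha: type 1 error rate.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = effect * sqrt(n) / sigma   # standardized true difference
    # Probability that the test statistic falls in either rejection region.
    return std_normal.cdf(-z_crit + shift) + std_normal.cdf(-z_crit - shift)

for n in (10, 50, 200):
    print(n, round(z_test_power(effect=0.5, sigma=2.0, n=n), 3))
```

Power rises with sample size, effect size, and alpha, and falls as variability grows. With a very large n, even a trivially small effect is detected almost every time, which is the "too much power" case.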
11
Parametric vs. nonparametric
• Parametric (characteristics referring to specific population parameters)
  - Parametric assumptions
    · Independent samples
    · Homogeneity of variance
    · Data normally distributed
    · Interval or better scale
• Nonparametric assumptions
  - Sometimes independence of samples
12
t-tests
(Look at t tables; p. 435)
• Used to compare two means or one observed mean against a guess about a hypothesized mean
• For large samples t and z can be considered equivalent
• Calculate $t = \dfrac{\bar{x} - \mu}{S_{\bar{x}}}$
  where $S_{\bar{x}}$ is the standard error of the mean, $S/\sqrt{n}$, and $df = n - 1$
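The calculation can be sketched as follows for the one-sample case. The `scores` data and the hypothesized mean of 50 are made up for illustration.

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(sample, mu):
    """t = (sample mean - mu) / (S / sqrt(n)), with df = n - 1."""
    n = len(sample)
    std_error = stdev(sample) / sqrt(n)   # standard error of the mean
    t = (mean(sample) - mu) / std_error
    return t, n - 1

scores = [52, 48, 55, 51, 49, 53, 50, 54]   # hypothetical observations
t_stat, df = one_sample_t(scores, mu=50)
print(t_stat, df)
```

The resulting t is then compared with the tabled critical value for the chosen alpha and `df`.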
13
t-tests
• Statistical programs will give you a choice between a matched pair and an independent t-test.
  - Your sample and research design determine which you will use.
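A sketch of why the choice matters: the same ten numbers give different t values depending on whether they are treated as five matched pairs or as two independent groups. The data are hypothetical, and the independent version below is the pooled-variance form, which assumes homogeneity of variance.

```python
from math import sqrt
from statistics import mean, stdev, variance

def paired_t(before, after):
    """Matched-pair t: a one-sample t test on the pairwise differences."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n)), n - 1

def independent_t(group1, group2):
    """Independent-samples t with a pooled variance estimate."""
    n1, n2 = len(group1), len(group2)
    pooled = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    t = (mean(group1) - mean(group2)) / sqrt(pooled * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

before = [10, 12, 9, 11, 13]   # hypothetical pre-test scores
after = [12, 13, 11, 12, 16]   # hypothetical post-test scores, same people

print(paired_t(before, after))
print(independent_t(after, before))
```

When the same respondents are measured twice, the matched-pair version removes person-to-person variability and is usually the more sensitive test.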
14
z-test for Proportions
(Look at t tables; p. 435)
• When data are nominal
  - Describe by counting occurrences of each value
  - From counts, calculate proportions
  - Compare proportion of occurrence in sample to proportion of occurrence in population
  - Hypothesis testing allows only one of two outcomes: success or failure
15
z-test for Proportions
(Look at t tables; p. 435)
Comparing sample proportion to the population proportion
• H0: π = k, where k is a value between 0 and 1
• H1: π ≠ k
$z = \dfrac{p - \pi}{\sigma_p} = \dfrac{p - \pi}{\sqrt{\pi(1 - \pi)/n}}$
• Equivalent to $\chi^2$ for df = 1
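A sketch of the proportion test and of its equivalence to a chi-square with one degree of freedom. The null value of 0.35 echoes the MBA example used in these slides; the sample of 84 successes out of 200 is invented for illustration.

```python
from math import sqrt
from statistics import NormalDist

def proportion_z_test(successes, n, pi0):
    """z = (p - pi0) / sqrt(pi0 * (1 - pi0) / n), two-tailed p-value."""
    p = successes / n
    std_error = sqrt(pi0 * (1 - pi0) / n)
    z = (p - pi0) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

def chi_square_two_cells(successes, n, pi0):
    """Chi-square over the two cells (success, failure); df = 1."""
    observed = [successes, n - successes]
    expected = [n * pi0, n * (1 - pi0)]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

z, p_value = proportion_z_test(successes=84, n=200, pi0=0.35)
chi2 = chi_square_two_cells(84, 200, 0.35)
print(z, p_value, chi2)   # z squared equals chi2
```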
16
Chi-Square Test (sampling distribution)
One Sample
• Measures sample variance
• Squared deviations from the mean, based on the normal distribution
• Nonparametric
• Compare expected with observed proportion
• H0: Observed proportion = expected proportion
• df = number of data points, categories, cells (k) minus 1
$\chi^2 = \sum \dfrac{(O - E)^2}{E}$
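A sketch of the one-sample (goodness-of-fit) calculation. The observed counts and expected proportions are hypothetical.

```python
def chi_square_gof(observed, expected_props):
    """Sum of (O - E)^2 / E over k categories, with df = k - 1."""
    n = sum(observed)
    expected = [n * p for p in expected_props]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, len(observed) - 1

observed = [30, 50, 20]               # hypothetical counts in 3 categories
expected_props = [0.25, 0.50, 0.25]   # hypothetical population proportions

chi2, df = chi_square_gof(observed, expected_props)
print(chi2, df)
```

The statistic is compared with the tabled chi-square value for `df` degrees of freedom (5.99 for df = 2 at alpha = .05).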
17
Univariate z Test
• Test a guess about a proportion against an observed sample; e.g., MBAs constitute 35% of the managerial population
• H0: π = .35
• H1: π ≠ .35 (two-tailed test suggested)
18
Univariate Tests
• Some univariate tests are different in that they are among statistical procedures where you, the researcher, set the null hypothesis.
• In many other statistical tests the null hypothesis is implied by the test itself.
19
Contingency Tables
Relationship between nominal variables
http://www.psychstat.smsu.edu/introbook/sbk28m.htm
• Relationship between subjects' scores on two qualitative or categorical variables (Early childhood intervention)
• If the columns are not contingent on the rows, then the row and column frequencies are independent. The test of whether the columns are contingent on the rows is called the chi-square test of independence. The null hypothesis is that there is no relationship between row and column frequencies.
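A sketch of the test of independence on a contingency table. Expected counts come from the row and column totals; the 2 x 2 counts are hypothetical.

```python
def chi_square_independence(table):
    """Chi-square test statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns are independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

# Hypothetical counts: rows = intervention yes/no, columns = outcome yes/no.
table = [[30, 20],
         [10, 40]]
print(chi_square_independence(table))
```

A table whose rows are exact multiples of each other, such as `[[10, 20], [20, 40]]`, gives a statistic of zero because observed and expected counts coincide.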
20
Correlations
• A statistical summary of the degree and direction of association between two variables
• Correlation itself does not distinguish between independent and dependent variables
• Most common: Pearson’s r
21
Correlations
• You believe that a linear relationship exists between two variables
• The range is from -1 to +1
• $R^2$, the coefficient of determination, is the % of variance explained in each variable by the other
22
Correlations
$r = \dfrac{S_{xy}}{S_x S_y}$, or the covariance between x and y divided by the product of their standard deviations
• Calculations needed
  - The means, x-bar and y-bar
  - Deviations from the means, (x - x-bar) and (y - y-bar), for each case
  - The squares of the deviations from the means for each case, to ensure positive distance measures when added: (x - x-bar)² and (y - y-bar)²
  - The cross product for each case, (x - x-bar) times (y - y-bar)
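The listed calculations can be sketched step by step. The variable names `hours` and `score` and their values are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r built from means, deviations, squares, cross products."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    dx = [x - x_bar for x in xs]                  # deviations from x-bar
    dy = [y - y_bar for y in ys]                  # deviations from y-bar
    ss_x = sum(d * d for d in dx)                 # squared deviations of x
    ss_y = sum(d * d for d in dy)                 # squared deviations of y
    sp_xy = sum(a * b for a, b in zip(dx, dy))    # cross products
    # The (n - 1) divisors of the covariance and the two standard
    # deviations cancel, leaving sums of squares and cross products.
    return sp_xy / sqrt(ss_x * ss_y)

hours = [1, 2, 3, 4, 5]   # hypothetical x values
score = [2, 4, 5, 4, 5]   # hypothetical y values

r = pearson_r(hours, score)
print(r, r ** 2)          # r and the coefficient of determination
```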
23
Correlations
• The null hypothesis for correlations is
  H0: ρ = 0
  and the alternative is usually
  H1: ρ ≠ 0
• However, if you can justify it prior to analyzing the data you might also use
  H1: ρ > 0 or H1: ρ < 0,
  a one-tailed test
24
Correlations
• Alternative measures
  - Spearman rank correlation, $r_{ranks}$; $r_{ranks}$ and r are nearly always equivalent measures for the same data (even when they are not, the differences are trivial)
  - Phi coefficient, $r_\Phi$, when both variables are dichotomous; again, it is equivalent to Pearson’s r
25
Correlations
• Alternative measures
  - Point-biserial, $r_{pb}$, when correlating a dichotomous with a continuous variable
• If a scatterplot shows a curvilinear relationship there are two options:
  - A data transformation, or
  - Use the correlation ratio, $\eta^2$ (eta-squared):
    $\eta^2 = 1 - \dfrac{SS_{within}}{SS_{total}}$
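A sketch of the correlation ratio, computed as one minus the within-group share of the total sum of squares. Cases are grouped by their x value; the inverted-U data are hypothetical.

```python
def eta_squared(groups):
    """Correlation ratio: 1 - SS_within / SS_total."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    ss_within = 0.0
    for g in groups:
        g_mean = sum(g) / len(g)
        ss_within += sum((v - g_mean) ** 2 for v in g)
    return 1 - ss_within / ss_total

# Hypothetical y values at low, medium, and high levels of x (inverted U).
groups = [[2, 3, 4], [7, 8, 9], [2, 3, 4]]
print(eta_squared(groups))
```

For data like these, Pearson's r between x and y is zero because the rise and the fall cancel, while eta-squared still registers the strong curvilinear association.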
26
ANOVA
• For two groups only, the t-test and ANOVA yield the same results
• You must do paired comparisons when working with three or more groups to know where the differences among the means lie
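The two-group equivalence can be checked numerically: the one-way ANOVA F equals the square of the pooled-variance t. A sketch with hypothetical groups `a` and `b`:

```python
from math import sqrt
from statistics import mean, variance

def anova_f(groups):
    """One-way ANOVA F: mean square between / mean square within."""
    k = len(groups)
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((len(g) - 1) * variance(g) for g in groups)
    df_between, df_within = k - 1, len(all_values) - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

def pooled_t(g1, g2):
    """Independent-samples t with pooled variance."""
    n1, n2 = len(g1), len(g2)
    pooled = ((n1 - 1) * variance(g1) + (n2 - 1) * variance(g2)) / (n1 + n2 - 2)
    return (mean(g1) - mean(g2)) / sqrt(pooled * (1 / n1 + 1 / n2))

a = [10, 12, 9, 11, 13]    # hypothetical group 1
b = [12, 13, 11, 12, 16]   # hypothetical group 2

f_stat, df1, df2 = anova_f([a, b])
t_stat = pooled_t(a, b)
print(f_stat, t_stat ** 2)   # the two values agree
```

With three or more groups, a significant F only says that some means differ, which is why follow-up comparisons are needed.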
27
Multivariate Techniques
• Dependent variable
  - Regression in its various forms
  - Discriminant analysis
  - MANOVA
• Classificatory or data reduction
  - Cluster analysis
  - Factor analysis
  - Multidimensional scaling
28
Linear Regression
• We would like to be able to predict y from x
• Simple linear regression with raw scores
  - y = dependent variable
  - x = independent variable
  - b = regression coefficient = $r_{xy}\,\dfrac{s_y}{s_x}$
  - c = a constant term
• The general model is
  y = bx + c (+e)
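A sketch of the raw-score calculation, with the slope computed as the correlation times the ratio of standard deviations. The intercept formula, c = y-bar minus b times x-bar, is the standard least-squares result and is not stated on the slide. The data are hypothetical.

```python
from statistics import mean, stdev

def simple_regression(xs, ys):
    """Return slope b and constant c for the model y = b*x + c."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)
    r_xy = cov_xy / (s_x * s_y)   # Pearson correlation
    b = r_xy * s_y / s_x          # regression coefficient
    c = y_bar - b * x_bar         # least-squares intercept (assumed form)
    return b, c

hours = [1, 2, 3, 4, 5]   # hypothetical independent variable
score = [2, 4, 5, 4, 5]   # hypothetical dependent variable

b, c = simple_regression(hours, score)
print(b, c)
print(b * 6 + c)          # predicted y for a new x of 6
```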
29
Linear Regression
• The statistic for assessing the overall fit of a regression model is $R^2$, the overall % of variance explained by the model

$R^2 = \dfrac{\text{predictable variance}}{\text{total variance}} = 1 - \dfrac{\text{unpredictable variance}}{\text{total variance}} = 1 - \dfrac{s_e^2}{s_y^2}$

where $s_e^2$ is the variance of the error or residual
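A sketch of the R-squared calculation as one minus the ratio of residual variance to the variance of y, for a simple regression fitted by least squares. The data are hypothetical, and both variances use the same n - 1 divisor so that the ratio equals the ratio of sums of squares.

```python
from statistics import mean, variance

def r_squared(xs, ys):
    """R^2 = 1 - s2_e / s2_y for a least-squares line y = b*x + c."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    c = y_bar - b * x_bar
    residuals = [y - (b * x + c) for x, y in zip(xs, ys)]
    # Residuals have mean zero, so their variance is SS_residual / (n - 1).
    s2_e = sum(e ** 2 for e in residuals) / (len(ys) - 1)
    s2_y = variance(ys)
    return 1 - s2_e / s2_y

hours = [1, 2, 3, 4, 5]   # hypothetical predictor
score = [2, 4, 5, 4, 5]   # hypothetical outcome
print(r_squared(hours, score))
```

With a single predictor, this value equals the square of Pearson's r between x and y.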
30
Linear Regression
• Multiple regression: more than one predictor
  - y = b1x1 + b2x2 + c
  - Each regression coefficient b is assessed independently for its statistical significance; H0: b = 0
• So, in a statistical program’s output, a statistically significant b rejects the notion that the variable associated with b contributes nothing to predicting y

31
Linear Regression
• Multiple regression
  - $R^2$ still tells us the amount of variation in y explained by all of the predictors (x) together
  - The F-statistic tells us whether the model as a whole is statistically significant
• Several other types of regression models are available for data that do not meet the assumptions needed for least-squares models (such as logistic regression for dichotomous dependent variables)
32
Regression by SPSS & other Programs
• Methods for developing the model
  - Stepwise: lets the computer try to fit all chosen variables, leaving out those not significant and re-examining variables in the model at each step
  - Enter: researcher specifies that all variables will be used in the model
  - Forward, backward: begins with none (forward) or all (backward) of the variables and automatically adds or removes variables without reconsideration of variables already in the model
33
Multicollinearity
• Best regression model has uncorrelated IVs
• Model stability low with excessively correlated IVs
• Collinearity diagnostics identify problems, suggesting variables to be dropped
• High tolerance, low variance inflation factor are desirable
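A sketch of the two diagnostics for the simplest case of two IVs. Tolerance is one minus the R-squared obtained when one IV is predicted from the others, and the variance inflation factor (VIF) is its reciprocal; with only two IVs that R-squared is just their squared correlation. The data are hypothetical, and perfectly collinear IVs would make the tolerance zero and the VIF undefined.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation from sums of squares and cross products."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sp = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    return sp / sqrt(ss_x * ss_y)

def tolerance_and_vif(x1, x2):
    """Tolerance and VIF for a model with exactly two IVs."""
    r = pearson_r(x1, x2)
    tolerance = 1 - r ** 2      # share of x1 not explained by x2
    return tolerance, 1 / tolerance

x1 = [1, 2, 3, 4, 5]   # hypothetical IV
x2 = [2, 4, 5, 4, 5]   # hypothetical IV correlated with x1
print(tolerance_and_vif(x1, x2))
```

With more than two IVs, each variable's tolerance would come from regressing it on all the others, which statistical programs report directly.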
34
Discriminant Analysis
• Regression requires the DV to be interval or ratio
• If the DV is categorical (nominal), discriminant analysis can be used
• IVs should be interval or ratio scaled
• Key result is the number of cases classified correctly
35
MANOVA
• Compare means on two or more DVs (ANOVA limited to one DV)
• Pure MANOVA via SPSS only from command syntax
• Can use the general linear model though
36
Factor Analysis
• A data reduction technique: a large set of variables can be reduced to a smaller set while retaining the information from the original data set
• Data must be on an interval or ratio scale
• E.g., a variable called socioeconomic status might be constructed from variables such as household income, educational attainment of the head of household, and average per capita income of the census block in which the person resides
37
Cluster Analysis
• Cluster analysis seeks to group cases rather than variables; it too is a data reduction technique
• Data must be on an interval or ratio scale
• E.g., a marketing group might want to classify people into psychographic profiles regarding their tendencies to try or adopt new products: pioneers or early adopters, early majority, late majority, laggards
38
Factor vs. Cluster Analysis
• Factor analysis focuses on creating linear composites of variables
  - Number of variables with which we must work is then reduced
  - Technique begins with a correlation matrix to seed the process
• Cluster analysis focuses on cases
39
Potential Biases
• Asking the inappropriate or wrong research questions.
• Insufficient literature survey and hence inadequate theoretical model.
• Measurement problems
• Samples not being representative.
• Problems with data collection:
  - researcher biases
  - respondent biases
  - instrument biases
• Data analysis biases:
  - coding errors
  - data punching & input errors
  - inappropriate statistical analysis
• Biases (subjectivity) in interpretation of results.
40
Questions to ask:
Adapted from Robert Niles
• Where did the data come from?
• How, and by whom, were the data reviewed, verified, or substantiated?
• How were the data collected?
• How are the data presented?
  - What is the context?
  - Cherry-picking?
• Be skeptical when dealing with comparisons
  - Spurious correlations
41
Copyright © 2003 John Wiley & Sons, Inc. Sekaran/RESEARCH 4E
FIGURE 11.2