Multivariate Statistics with Data Analysis for Academic


Multivariate Statistical Data
Analysis with Its Applications
Hua-Kai Chiou
Ph.D., Assistant Professor
Department of Statistics, NDMC
September, 2005
1
Agenda
1. Introduction
2. Examining Your Data
3. Sampling & Estimation
4. Hypothesis & Testing
5. Multiple Regression Analysis
6. Logistic Regression
7. Multivariate Analysis of Variance
8. Principal Components Analysis
9. Factor Analysis
10. Cluster Analysis
11. Discriminant Analysis
12. Multidimensional Scaling
13. Canonical Correlation Analysis
14. Conjoint Analysis
15. Structural Equation Modeling
3
Introduction
4
Some Basic Concepts of MVA
• What is Multivariate Analysis (MVA)?
• Impact of the Computer Revolution
• Multivariate Analysis Defined
• Measurement Scales
• Types of Multivariate Techniques
5
• Dependence technique – the objective is
prediction of the dependent variable(s) by the
independent variable(s), e.g., regression analysis.
• Dependent variable – presumed effect of, or
response to, a change in the independent
variable(s).
• Dummy variable – nonmetrically measured
variable transformed into a metric variable by
assigning 1 or 0 to a subject, depending on
whether it possesses a particular characteristic.
• Effect size – estimate of the degree to which the
phenomenon being studied (e.g., correlation or
difference in means) exists in the population.
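The dummy variable definition above can be illustrated with a minimal sketch. The data and the helper name `dummy_code` are hypothetical and chosen only for illustration; they are not part of the HATCO database.

```python
# Hypothetical nonmetric variable: industry type recorded as a label.
industry = ["A", "B", "A", "B", "B"]


def dummy_code(values, category):
    """Return 1 where the subject has the characteristic, else 0."""
    return [1 if v == category else 0 for v in values]


# Each subject gets 1 if it belongs to industry "A", otherwise 0.
is_industry_a = dummy_code(industry, "A")
print(is_industry_a)
```

The resulting 0/1 column can then be treated as a metric variable, for example as a predictor in a regression.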
6
• Indicator – single variable used in conjunction
with one or more other variables to form a
composite measure.
• Interdependence technique – classification of
statistical techniques in which the variables are
not divided into dependent and independent
sets (e.g., factor analysis).
• Metric data – also called quantitative data,
interval data, or ratio data, these measurements
identify or describe subjects (or objects) not only
on the possession of an attribute but also by the
amount or degree to which the subject may be
characterized by the attribute. For example, a
person’s age and weight are metric data.
7
• Multicollinearity – extent to which a variable
can be explained by the other variables in the
analysis. As multicollinearity increases, it
complicates the interpretation of the variate as it
is more difficult to ascertain the effect of any
single variable, owing to their interrelationships.
• Nonmetric data – also called qualitative data.
• Power – probability of correctly rejecting the
null hypothesis when it is false, that is, correctly
finding a hypothesized relationship when it
exists. Determined as a function of (1) the
statistical significance level (α) set by the
researcher for a Type I error, (2) the sample size
used in the analysis, and (3) the effect size being
examined.
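How power depends on these three quantities can be sketched for one simple case. The example below assumes a two-sided, one-sample z-test with a known standard deviation and a standardized effect size; the function name `z_test_power` is hypothetical, and other tests need different formulas.

```python
from statistics import NormalDist


def z_test_power(effect_size, n, alpha=0.05):
    """Approximate power of a two-sided one-sample z-test.

    effect_size is the standardized mean difference (assumption: known SD).
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # critical value for alpha
    shift = effect_size * n ** 0.5              # mean of the test statistic
    # Probability that the statistic falls in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Power rises with sample size, effect size, and alpha.
for n in (20, 50, 100):
    print(n, round(z_test_power(0.3, n), 3))
```

When the effect size is zero the function returns α itself, which is the Type I error rate.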
8
• Practical significance – means of assessing
multivariate analysis results based on their
substantive findings rather than their statistical
significance. Whereas statistical significance
determines whether the result is attributable to
chance, practical significance assesses whether
the result is useful.
• Reliability – extent to which a variable or set of
variables is consistent in what it is intended to
measure. Reliability relates to the consistency of
the measure(s).
• Validity – extent to which a measure or set of
measures correctly represents the concept of
study. Validity is concerned with how well the
concept is defined by the measure(s).
9
• Type I error – probability of incorrectly rejecting
the null hypothesis.
• Type II error – probability of incorrectly failing
to reject the null hypothesis, that is, the
chance of not finding a correlation or mean
difference when it does exist.
• Variate – linear combination of variables formed
in the multivariate technique by deriving
empirical weights applied to a set of variables
specified by the researcher.
10
• The Relationship between Multivariate
Dependence Methods
Analysis of Variance (ANOVA)
$Y_1 = X_1 + X_2 + X_3 + \dots + X_n$
Y: metric; X: nonmetric
Multivariate Analysis of Variance (MANOVA)
$Y_1 + Y_2 + Y_3 + \dots + Y_n = X_1 + X_2 + X_3 + \dots + X_n$
Y: metric; X: nonmetric
Canonical Correlation
$Y_1 + Y_2 + Y_3 + \dots + Y_n = X_1 + X_2 + X_3 + \dots + X_n$
Y: metric, nonmetric; X: metric, nonmetric
11
Discriminant Analysis
$Y_1 = X_1 + X_2 + X_3 + \dots + X_n$
Y: nonmetric; X: metric
Multiple Regression Analysis
$Y_1 = X_1 + X_2 + X_3 + \dots + X_n$
Y: metric; X: metric, nonmetric
Conjoint Analysis
$Y_1 = X_1 + X_2 + X_3 + \dots + X_n$
Y: metric, nonmetric; X: nonmetric
12
Structural Equation Modeling
$Y_1 = X_{11} + X_{12} + X_{13} + \dots + X_{1n}$
$Y_2 = X_{21} + X_{22} + X_{23} + \dots + X_{2n}$
$Y_m = X_{m1} + X_{m2} + X_{m3} + \dots + X_{mn}$
Y: metric; X: metric, nonmetric
13
What type of relationship is being examined?
• Dependence: How many variables are being predicted?
– Multiple relationships of dependent and independent variables: structural equation modeling
– Several dependent variables in a single relationship: What is the measurement scale of the dependent variable?
    Metric: What is the measurement scale of the predictor variable?
        Metric: canonical correlation analysis
        Nonmetric: multivariate analysis of variance (MANOVA)
    Nonmetric: canonical correlation analysis with dummy variables
– One dependent variable in a single relationship: What is the measurement scale of the dependent variable?
    Metric: multiple regression; conjoint analysis
    Nonmetric: multiple discriminant analysis; linear probability models
• Interdependence: Is the structure of relationships among:
– Variables: factor analysis
– Cases/respondents: cluster analysis
– Objects: How are the attributes measured?
    Metric: multidimensional scaling
    Nonmetric: multidimensional scaling; correspondence analysis
14
A Structured Approach to Multivariate Model
Building
Stage 1: Define the research problem, objectives,
and multivariate technique to be used
Stage 2: Develop the analysis plan
Stage 3: Evaluate the assumptions underlying the
multivariate technique
Stage 4: Estimate the multivariate model and
assess overall model fit
Stage 5: Interpret the variate(s)
Stage 6: Validate the multivariate model
15
Examining
Your Data
16
HATCO Case
• Primary Database
– This example investigates a business-to-business case
from existing customers of HATCO.
– The primary database consists of 100 observations on 14
separate variables.
• Three types of information were collected:
– The perceptions of HATCO, 7 attributes (X1 – X7);
– The actual purchase outcomes, 2 specific measures
(X9,X10);
– The characteristics of the purchasing companies, 5
characteristics (X8, X11-X14).
17
Table 2.1 Description of Database Variables (Hair et al., 1998)
Variable  Description               Variable Type  Rating Scale
Perceptions of HATCO
X1        Delivery Speed            Metric         0 – 10
X2        Price Level               Metric         0 – 10
X3        Price Flexibility         Metric         0 – 10
X4        Manufacturer’s Image      Metric         0 – 10
X5        Overall Service           Metric         0 – 10
X6        Salesforce Image          Metric         0 – 10
X7        Product Quality           Metric         0 – 10
Purchase Outcomes
X9        Usage Level               Metric         100-point percentage
X10       Satisfaction Level        Metric         0 – 10
Purchaser Characteristics
X8        Size of Firm              Nonmetric      {0,1}
X11       Specification Buying      Nonmetric      {0,1}
X12       Structure of Procurement  Nonmetric      {0,1}
X13       Type of Industry          Nonmetric      {0,1}
X14       Type of Buying Situation  Nonmetric      {1,2,3}
18
Fig 2.1 Scatter Plot Matrix of Metric Variables (Hair et al., 1998)
19
Fig 2.2 Examples of Multivariate Graphical Displays (Hair et al., 1998)
20
Missing Data
• A missing data process is any systematic event
external to the respondent (e.g. data entry errors
or data collection problems) or action on the part
of the respondent (such as refusal to answer)
that leads to missing values.
• The impact of missing data is detrimental not
only through its potential “hidden” biases of the
results but also in its practical impact on the
sample size available for analysis.
21
• Understanding the missing data
– Ignorable missing data
– Remediable missing data
• Examining the pattern of missing data
22
Table 2.2 Summary Statistics of Pretest Data (Hair et al., 1998)
23
Table 2.3 Assessing the Randomness of Missing Data through Group
Comparisons of Observations with Missing versus Valid Data (Hair et al., 1998)
24
Table 2.4 Assessing the Randomness of Missing Data through Dichotomized
Variable Correlations and the Multivariate Test for Missing Completely at
Random (MCAR) (Hair et al., 1998)
25
Table 2.5 Comparison of Correlations Obtained with All-Available (Pairwise),
Complete Case (Listwise), and Mean Substitution Approaches (Hair et al., 1998)
26
Table 2.6 Results of the Regression and EM Imputation Methods (Hair et al., 1998)
27
Outliers
• Four classes of outliers:
– Procedural errors
– Extraordinary events that can be explained
– Extraordinary observations that have no explanation
– Observations that fall within the ordinary range of values
on each of the variables but are unique in their
combination of values across the variables.
• Detecting outliers
– Univariate detection
– Bivariate detection
– Multivariate detection
28
Outlier detection
• Univariate detection threshold:
– For small samples, standardized variable values
beyond ±2.5
– For larger samples, standardized variable values
beyond ±3 or ±4
• Bivariate detection threshold:
– An ellipse representing between 50 and 90 percent
of the bivariate normal distribution.
• Multivariate detection:
– The Mahalanobis distance $D^2$
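The univariate and multivariate checks listed above can be sketched in a few lines. This is a minimal illustration, assuming sample (n − 1) variances and only two variables so that the covariance matrix can be inverted by hand; the function names and the data are hypothetical.

```python
from statistics import mean, stdev


def univariate_outliers(values, threshold=2.5):
    """Indices of cases whose standardized value exceeds the threshold."""
    m, s = mean(values), stdev(values)
    return [i for i, v in enumerate(values) if abs((v - m) / s) > threshold]


def mahalanobis_d2(xs, ys):
    """D^2 of each (x, y) case from the centroid, two-variable case only."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2  # determinant of the 2x2 covariance matrix
    return [
        (syy * (x - mx) ** 2
         - 2 * sxy * (x - mx) * (y - my)
         + sxx * (y - my) ** 2) / det
        for x, y in zip(xs, ys)
    ]


# Hypothetical ratings with one extreme case at the end.
ratings = [4.1, 3.8, 4.4, 4.0, 3.9, 4.2, 4.3, 3.7, 4.0, 4.1, 3.9, 4.2, 9.5]
print(univariate_outliers(ratings))

d2 = mahalanobis_d2([1, 2, 3, 4, 5, 6], [2, 1, 4, 3, 6, 20])
print([round(d, 2) for d in d2])
```

Larger $D^2$ values mark cases farther from the centroid once the correlation between the variables is taken into account. With more than two variables a matrix library would be the practical choice.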
29
Table 2.7 Identification of Univariate and Bivariate Outliers (Hair et al., 1998)
30
Fig 2.3 Graphical Identification of Bivariate Outliers (Hair et al., 1998)
31
Table 2.8 Identification of Multivariate Outliers (Hair et al., 1998)
32
Testing the Assumptions of Multivariate Analysis
• Graphical analyses of normality
– Kurtosis refers to the peakedness or flatness of the
distribution compared with the normal distribution.
– Skewness appears in a normal probability plot as an arc,
either above or below the diagonal.
• Statistical tests of normality
$z_{\text{skewness}} = \dfrac{\text{skewness}}{\sqrt{6/N}}$; $z_{\text{kurtosis}} = \dfrac{\text{kurtosis}}{\sqrt{24/N}}$
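The two test statistics can be computed as in the sketch below. It assumes simple moment estimators of skewness and excess kurtosis (statistical packages often apply bias corrections, so their values may differ slightly); the function name and the data are hypothetical.

```python
from math import sqrt


def normality_z_scores(values):
    """Return (z_skewness, z_kurtosis) using moment-based estimates."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2 - 3  # excess kurtosis: 0 for a normal distribution
    return skewness / sqrt(6 / n), kurtosis / sqrt(24 / n)


# Hypothetical right-skewed sample.
sample = [1, 1, 2, 2, 2, 3, 3, 4, 6, 12]
z_skew, z_kurt = normality_z_scores(sample)
print(round(z_skew, 2), round(z_kurt, 2))
```

Each z value is compared with a critical value from the standard normal distribution, commonly ±1.96 at the .05 level.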
33
Fig 2.4 Normal Probability Plots and Corresponding Univariate Distribution (Hair et al., 1998)
34
Homoscedasticity vs. Heteroscedasticity
• Homoscedasticity is an assumption related
primarily to dependence relationships between
variables.
• Although the dependent variables must be
metric, this concept of an equal spread of
variance across independent variables applies
whether the independent variables are metric or nonmetric.
35
Fig 2.5 Scatter Plots of Homoscedastic and Heteroscedastic Relationships (Hair et al., 1998)
36
Fig 2.6 Normal Probability Plots of Metric Variables (Hair et al., 1998)
37
Table 2.9 Distributional Characteristics, Testing for Normality, and Possible
Remedies (Hair et al., 1998)
38
Fig 2.7 Transformation of X2 (Price Level) to Achieve Normality (Hair et al.,
1998)
39
Table 2.10 Testing for Homoscedasticity (Hair et al., 1998)
40
Sampling
Distribution
41
Understanding sampling distributions
• A histogram is constructed from a frequency
table. The intervals are shown on the X-axis and
the number of scores in each interval is
represented by the height of a rectangle located
above the interval.
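The construction described above can be sketched as a frequency table followed by a text histogram. The scores, the interval width, and the function name `frequency_table` are hypothetical choices for illustration.

```python
def frequency_table(scores, start, width, n_bins):
    """Count how many scores fall in each interval [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for s in scores:
        idx = int((s - start) // width)
        if 0 <= idx < n_bins:
            counts[idx] += 1
    return counts


# Hypothetical test scores grouped into intervals of width 10.
scores = [52, 67, 71, 74, 78, 81, 83, 85, 88, 94]
counts = frequency_table(scores, start=50, width=10, n_bins=5)

# Each row is one interval; the bar length is the number of scores in it.
for i, c in enumerate(counts):
    lo = 50 + i * 10
    print(f"{lo}-{lo + 10}: {'#' * c}")
```

Here the bars are drawn sideways for convenience; in a plotted histogram the intervals run along the X-axis and the counts set the bar heights.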
42
• A bar graph is much like a histogram, differing
in that the columns are separated from each
other by a small distance. Bar graphs are
commonly used for qualitative variables.
43
What is a normal distribution?
• Normal distributions are a family of
distributions that have the same general shape.
They are symmetric with scores more
concentrated in the middle than in the tails.
Normal distributions are sometimes described as
bell shaped. The height of a normal distribution
can be specified mathematically in terms of two
parameters: the mean (μ) and the standard
deviation (σ).
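The height of the curve at a point x follows the standard normal density formula, f(x) = exp(−(x − μ)² / (2σ²)) / (σ√(2π)). A minimal sketch, with the hypothetical function name `normal_height`:

```python
from math import exp, pi, sqrt


def normal_height(x, mu, sigma):
    """Height of the normal curve at x for mean mu and standard deviation sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))


# The curve is highest at the mean and symmetric around it.
for x in (-2, -1, 0, 1, 2):
    print(x, round(normal_height(x, mu=0, sigma=1), 4))
```

Changing μ shifts the curve along the axis, while a larger σ makes it lower and wider.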
44
45