Introduction to Computer Science

Multivariate Statistical Analysis
Shyh-Kang Jeng
Department of Electrical Engineering/
Graduate Institute of Communication/
Graduate Institute of Networking and
Multimedia
1
What Is Multivariate Analysis?
Statistical methodology to analyze
data with measurements on many
variables
[Figure: a Process block with an input and an output, influenced by controllable factors and uncontrollable factors]
2
Why Learn Multivariate Analysis?
Explanation of a social or physical
phenomenon must be tested by
gathering and analyzing data
Complexities of most phenomena
require an investigator to collect
observations on many different
variables
3
Application Examples
Is one product better than the other?
Which factor is the most important to
determine the performance of a
system?
How to classify the results into
clusters?
What are the relationships between
variables?
4
Course Outline
Introduction
Matrix Algebra and Random Vectors
Sample Geometry and Random
Samples
Multivariate Normal Distribution
Inference about a Mean Vector
Comparison of Several Multivariate
Means
Multivariate Linear Regression
Models
5
Course Outline
Principal Components
Factor Analysis and Inference for
Structured Covariance Matrices
Canonical Correlation Analysis
Discrimination and Classification
Clustering, Distance Methods, and
Ordination
Multidimensional Scaling*
Structural Equation Modeling*
6
Textbook and Website
R. A. Johnson and D. W. Wichern,
Applied Multivariate Statistical
Analysis, 5th ed., Prentice Hall,
2002. (雙葉)
http://cc.ee.ntu.edu.tw/~skjeng/MultivariateAnalysis2006.htm
7
References
J. F. Hair, Jr., B. Black, B. Babin, R. E.
Anderson, and R. L. Tatham,
Multivariate Data Analysis, 6th ed.,
Prentice Hall, 2006. (華泰)
D. C. Montgomery, Design and
Analysis of Experiments, 6th ed.,
John Wiley, 2005. (歐亞)
D. Salsberg, 統計,改變了世界 (Statistics Changed the World),
translated by 葉偉文, 天下遠見, 2001.
8
References
張碧波, 推理統計學 (Inferential Statistics), 三民, 1976.
張輝煌 (ed. and trans.), 實驗設計與變異分析 (Experimental Design
and Analysis of Variance), 建興, 1986.
9
Time Management
[Figure: 2×2 time-management matrix with axes Importance and Emergency, dividing tasks into quadrants I, II, III, and IV]
10
Some Important Laws
First things first
80 – 20 Law
Fast prototyping and evolution
11
Major Uses of Multivariate Analysis
Data reduction or structural
simplification
Sorting and grouping
Investigation of the dependence
among variables
Prediction
Hypothesis construction and testing
12
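Array of Data
\[
\mathbf{X} =
\begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1k} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2k} & \cdots & x_{2p} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
x_{j1} & x_{j2} & \cdots & x_{jk} & \cdots & x_{jp} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nk} & \cdots & x_{np}
\end{bmatrix}
\]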
13
Descriptive Statistics
Summary numbers to assess the
information contained in data
Basic descriptive statistics
– Sample mean
– Sample variance
– Sample standard deviation
– Sample covariance
– Sample correlation coefficient
14
Sample Mean and
Sample Variance
\[
\bar{x}_k = \frac{1}{n} \sum_{j=1}^{n} x_{jk}
\]
\[
s_k^2 = s_{kk} = \frac{1}{n} \sum_{j=1}^{n} \left( x_{jk} - \bar{x}_k \right)^2
\]
\[
k = 1, 2, \ldots, p
\]
15
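Sample Covariance and
Sample Correlation Coefficient
\[
s_{ik} = \frac{1}{n} \sum_{j=1}^{n} \left( x_{ji} - \bar{x}_i \right) \left( x_{jk} - \bar{x}_k \right)
\]
\[
r_{ik} = \frac{s_{ik}}{\sqrt{s_{ii}} \sqrt{s_{kk}}}
= \frac{\sum_{j=1}^{n} \left( x_{ji} - \bar{x}_i \right) \left( x_{jk} - \bar{x}_k \right)}
{\sqrt{\sum_{j=1}^{n} \left( x_{ji} - \bar{x}_i \right)^2} \sqrt{\sum_{j=1}^{n} \left( x_{jk} - \bar{x}_k \right)^2}}
\]
\[
i = 1, 2, \ldots, p; \quad k = 1, 2, \ldots, p
\]
\[
s_{ik} = s_{ki}, \quad r_{ik} = r_{ki}
\]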
16
Standardized Values
(or Standardized Scores)
Centered at zero
Unit standard deviation
Sample correlation coefficient can be
regarded as a sample covariance of
two standardized variables
\[
\frac{x_{jk} - \bar{x}_k}{\sqrt{s_{kk}}}
\]
17
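As a small illustration of standardized values, the Python sketch below standardizes two variables and checks that the covariance of the standardized values matches the sample correlation coefficient. The data values and the helper names (mean, cov, standardize) are made up for illustration; the divisor n follows the definitions on the earlier slides.

```python
import math

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Sample covariance with divisor n."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

def standardize(v):
    """Return (x_j - mean) / sqrt(s_kk) for every value."""
    m, sd = mean(v), math.sqrt(cov(v, v))
    return [(a - m) / sd for a in v]

# Hypothetical data, for illustration only
x1 = [2.0, 4.0, 6.0, 8.0, 10.0]
x2 = [1.0, 3.0, 2.0, 5.0, 4.0]

z1, z2 = standardize(x1), standardize(x2)
r12 = cov(x1, x2) / math.sqrt(cov(x1, x1) * cov(x2, x2))

print("mean and variance of z1:", mean(z1), cov(z1, z1))
print("cov(z1, z2) and r12:", cov(z1, z2), r12)
```

Up to rounding error, z1 has mean zero and unit variance, and the last line prints the same number twice.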
Properties of Sample Correlation
Coefficient
Value is between -1 and 1
Magnitude measures the strength of the
linear association
Sign indicates the direction of the
association
Value remains unchanged if all x_{ji}'s and x_{jk}'s
are changed to y_{ji} = a x_{ji} + b and y_{jk} = c x_{jk}
+ d, respectively, provided that the
constants a and c have the same sign
18
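The invariance property can be checked numerically. The sketch below uses made-up data and constants (a, b, c, d are arbitrary illustrative choices) and also shows what happens when a and c have opposite signs: the magnitude is kept but the sign flips.

```python
import math

def corr(u, v):
    """Sample correlation coefficient of two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((p - mu) * (q - mv) for p, q in zip(u, v))
    suu = sum((p - mu) ** 2 for p in u)
    svv = sum((q - mv) ** 2 for q in v)
    return suv / math.sqrt(suu * svv)

# Hypothetical data, for illustration only
x_i = [1.0, 2.0, 3.0, 4.0, 5.0]
x_k = [2.0, 1.0, 4.0, 3.0, 6.0]

a, b, c, d = 2.0, 5.0, 0.5, -3.0          # a and c have the same sign
y_i = [a * x + b for x in x_i]
y_k = [c * x + d for x in x_k]

r_x = corr(x_i, x_k)
r_y = corr(y_i, y_k)                       # same value as r_x

y_k_flipped = [-c * x + d for x in x_k]    # opposite sign for c
r_flipped = corr(y_i, y_k_flipped)         # same magnitude, opposite sign

print(r_x, r_y, r_flipped)
```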
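Arrays of Basic
Descriptive Statistics
\[
\bar{\mathbf{x}} =
\begin{bmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_p \end{bmatrix},
\quad
\mathbf{S}_n =
\begin{bmatrix}
s_{11} & s_{12} & \cdots & s_{1p} \\
s_{21} & s_{22} & \cdots & s_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
s_{p1} & s_{p2} & \cdots & s_{pp}
\end{bmatrix}
\]
\[
\mathbf{R} =
\begin{bmatrix}
1      & r_{12} & \cdots & r_{1p} \\
r_{21} & 1      & \cdots & r_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
r_{p1} & r_{p2} & \cdots & 1
\end{bmatrix}
\]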
19
Example
Four receipts from
a university
bookstore
Variable 1: dollar
sales
Variable 2: number
of books
42
52
x
48

58
4

5
4

3
20
Arrays of Basic Descriptive
Statistics
50
 34  1.5
x   , S n  

4
 1.5 0.5 
 0.36
 1
R

1 
 0.36
21
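The arrays for the bookstore example can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the variable names (X, xbar, S, R) are illustrative, and the divisor n is used as in the definitions of the sample variance and covariance.

```python
import math

# Four receipts: (dollar sales, number of books)
X = [[42, 4], [52, 5], [48, 4], [58, 3]]
n, p = len(X), len(X[0])

# Sample mean vector
xbar = [sum(row[k] for row in X) / n for k in range(p)]

# Sample covariance matrix S_n (divisor n)
S = [[sum((row[i] - xbar[i]) * (row[k] - xbar[k]) for row in X) / n
      for k in range(p)] for i in range(p)]

# Sample correlation matrix
R = [[S[i][k] / math.sqrt(S[i][i] * S[k][k]) for k in range(p)]
     for i in range(p)]

print(xbar)                                       # [50.0, 4.0]
print(S)                                          # [[34.0, -1.5], [-1.5, 0.5]]
print([[round(r, 2) for r in row] for row in R])  # [[1.0, -0.36], [-0.36, 1.0]]
```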
Using SAS
Create New Project
– Name → Project
Insert Data
– New: Name → Data1
Change column name
– Right button, select Properties…
Enter data in the data grid
Select Data1 under Project
Analysis → Descriptive
– Summary statistics
– Correlations
22
Summary Statistics
Save
– Personal → Enterprise Guide Sample →
Data → Data1.sas7bdat
Columns → Variables to assign →
Analysis variables
Statistics → Mean
23
Report on Means
24
Correlations
Delete Summary statistics node
Save
– Personal → Enterprise Guide Sample →
Data → Data1.sas7bdat
Columns → Variables to assign →
Correlation variables
Correlations → Pearson →
Covariances → Show Pearson
correlations in results → Divisor for
variances (Number of rows)
25
Correlations
Results → Show results → (uncheck)
show statistics for each variable →
(uncheck) show significance
probabilities associated with
correlations
26
Report on Correlations
27
Scatter Plot and
Marginal Dot Diagrams
28
Scatter Plot and Marginal Dot
Diagrams for Rearranged Data
29
Effect of Unusual Observations
30
Effect of Unusual Observations
\[
r_{12} =
\begin{cases}
-0.39 & \text{for all 16 firms} \\
-0.56 & \text{for all firms but Dun \& Bradstreet} \\
-0.39 & \text{for all firms but Time Warner} \\
-0.50 & \text{for all firms but Dun \& Bradstreet and Time Warner}
\end{cases}
\]
31
Paper Quality Measurements
32
Lizard Size Data
*SVL: snout-vent length; HLS: hind limb span
33
3D Scatter Plots of Lizard Data
34
Female Bear Data and
Growth Curves
35
Utility Data as Stars
36
Chernoff Faces over Time
37
Euclidean Distance
Each coordinate contributes equally
to the distance
\[
P(x_1, x_2, \ldots, x_p), \quad Q(y_1, y_2, \ldots, y_p)
\]
\[
d(P, Q) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_p - y_p)^2}
\]
38
Statistical Distance
Weight coordinates subject to a great
deal of variability less heavily than
those that are not highly variable
39
Statistical Distance for
Uncorrelated Data
\[
P(x_1, x_2), \quad O(0, 0)
\]
\[
x_1^* = x_1 / \sqrt{s_{11}}, \quad x_2^* = x_2 / \sqrt{s_{22}}
\]
\[
d(O, P) = \sqrt{(x_1^*)^2 + (x_2^*)^2} = \sqrt{\frac{x_1^2}{s_{11}} + \frac{x_2^2}{s_{22}}}
\]
40
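To see how the weighting changes the picture, the sketch below compares Euclidean and statistical distance from the origin for two points. The variances s_11 = 4 and s_22 = 1 and the two points are hypothetical values chosen for illustration, and the function names are not from the slides.

```python
import math

def euclidean_from_origin(p):
    """Ordinary straight-line distance from the origin."""
    return math.sqrt(sum(x * x for x in p))

def statistical_from_origin(p, variances):
    """sqrt(x_1^2 / s_11 + x_2^2 / s_22 + ...) for uncorrelated coordinates."""
    return math.sqrt(sum(x * x / s for x, s in zip(p, variances)))

variances = [4.0, 1.0]            # hypothetical s_11 and s_22
p_a, p_b = (2.0, 0.0), (0.0, 2.0)

# Both points are 2 units from the origin in the Euclidean sense ...
print(euclidean_from_origin(p_a), euclidean_from_origin(p_b))
# ... but the point along the high-variance axis is statistically closer.
d_a = statistical_from_origin(p_a, variances)
d_b = statistical_from_origin(p_b, variances)
print(d_a, d_b)
```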
Ellipse of Constant Statistical
Distance for Uncorrelated Data
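[Figure: ellipse of constant statistical distance c centered at the origin, with axes x1 and x2; it crosses the x1 axis at ±c√s11 and the x2 axis at ±c√s22]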
41
Scatter Plot for
Correlated Measurements
42
Statistical Distance under Rotated
Coordinate System
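\[
O(0, 0), \quad P(\tilde{x}_1, \tilde{x}_2)
\]
\[
d(O, P) = \sqrt{\frac{\tilde{x}_1^2}{\tilde{s}_{11}} + \frac{\tilde{x}_2^2}{\tilde{s}_{22}}}
\]
\[
\tilde{x}_1 = x_1 \cos\theta + x_2 \sin\theta
\]
\[
\tilde{x}_2 = -x_1 \sin\theta + x_2 \cos\theta
\]
\[
d(O, P) = \sqrt{a_{11} x_1^2 + 2 a_{12} x_1 x_2 + a_{22} x_2^2}
\]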
43
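The coefficients a_11, a_12, and a_22 are not written out on the slide. Substituting the rotation into the distance expressed in the rotated coordinates and collecting terms gives the following worked expansion, where θ is the rotation angle and the tilde quantities are the variances along the rotated axes:

```latex
\begin{aligned}
d^2(O, P) &= \frac{(x_1 \cos\theta + x_2 \sin\theta)^2}{\tilde{s}_{11}}
           + \frac{(-x_1 \sin\theta + x_2 \cos\theta)^2}{\tilde{s}_{22}} \\
          &= a_{11} x_1^2 + 2 a_{12} x_1 x_2 + a_{22} x_2^2, \\
a_{11} &= \frac{\cos^2\theta}{\tilde{s}_{11}} + \frac{\sin^2\theta}{\tilde{s}_{22}}, \\
a_{12} &= \cos\theta \sin\theta \left( \frac{1}{\tilde{s}_{11}} - \frac{1}{\tilde{s}_{22}} \right), \\
a_{22} &= \frac{\sin^2\theta}{\tilde{s}_{11}} + \frac{\cos^2\theta}{\tilde{s}_{22}}.
\end{aligned}
```

When the two rotated variances are equal, a_12 vanishes and the cross-product term drops out.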
General Statistical Distance
\[
P(x_1, x_2, \ldots, x_p), \quad O(0, 0, \ldots, 0), \quad Q(y_1, y_2, \ldots, y_p)
\]
\[
d(O, P) = \sqrt{a_{11} x_1^2 + a_{22} x_2^2 + \cdots + a_{pp} x_p^2
+ 2 a_{12} x_1 x_2 + 2 a_{13} x_1 x_3 + \cdots + 2 a_{p-1,p} x_{p-1} x_p}
\]
\[
d(P, Q) = \sqrt{\begin{aligned}
& a_{11} (x_1 - y_1)^2 + a_{22} (x_2 - y_2)^2 + \cdots + a_{pp} (x_p - y_p)^2 \\
& + 2 a_{12} (x_1 - y_1)(x_2 - y_2) + 2 a_{13} (x_1 - y_1)(x_3 - y_3) \\
& + \cdots + 2 a_{p-1,p} (x_{p-1} - y_{p-1})(x_p - y_p)
\end{aligned}}
\]
44
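The general formula is a quadratic form in the coordinate differences. A minimal Python sketch, assuming the coefficients are stored as a symmetric matrix A (so each off-diagonal a_ik is counted twice, giving the 2 a_ik terms); the matrix and the points below are hypothetical:

```python
import math

def statistical_distance(p, q, A):
    """sqrt of sum_i sum_k a_ik (p_i - q_i)(p_k - q_k) for symmetric A."""
    diff = [a - b for a, b in zip(p, q)]
    m = len(diff)
    total = sum(A[i][k] * diff[i] * diff[k] for i in range(m) for k in range(m))
    if total < 0:
        raise ValueError("coefficients do not define a valid distance")
    return math.sqrt(total)

A = [[1.0, 0.5],
     [0.5, 2.0]]                  # hypothetical symmetric coefficients
identity = [[1.0, 0.0],
            [0.0, 1.0]]
P, Q = [1.0, 2.0], [3.0, 1.0]

d_pq = statistical_distance(P, Q, A)
d_qp = statistical_distance(Q, P, A)
d_euclid = statistical_distance(P, Q, identity)  # identity gives Euclidean distance

print(d_pq, d_qp, d_euclid)
```

With the identity matrix the function reduces to Euclidean distance, which makes the role of the coefficients easy to inspect.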
Necessity of Statistical Distance
45
Necessary Conditions for
Statistical Distance Definitions
\[
d(P, Q) = d(Q, P)
\]
\[
d(P, Q) > 0 \text{ if } P \neq Q
\]
\[
d(P, Q) = 0 \text{ if } P = Q
\]
\[
d(P, Q) \leq d(P, R) + d(R, Q) \quad \text{(triangle inequality)}
\]
46
Reading Assignments
Textbook
– pp. 50-60
– pp. 84-97
47
Outliers
Observations with a unique
combination of characteristics
identifiable as distinctly different
from the other observations
Impact
– Limiting the generalizability of any type
of analysis
– Each outlier must be judged by how
representative it is of the population
before it is retained or deleted
48
Sources of Outliers
Procedure error
Extraordinary event
– e.g., hurricane for daily rainfall analysis
Extraordinary observations
– Researcher has no explanation
Unique in their combinations of
values across the variables
– Falls within ordinary range of values of
variables
– Retain it unless proved invalid
49
Rules of Thumb for
Univariate Outlier Detection
Small samples (80 or fewer
observations)
– Cases with standard scores of 2.5 or
greater
Larger sample size
– Threshold increases up to 4
Standard score not used
– Cases falling outside the range of 2.5
versus 4 standard deviations, depending
on the sample size
50
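The univariate rule of thumb can be sketched as a standard-score filter. The data below are made up, the function names are illustrative, and the automatic threshold (2.5 for 80 or fewer cases, 4.0 otherwise) is a simplifying assumption; the slide only says that the threshold increases up to 4 for larger samples.

```python
import math

def standard_scores(values):
    """Standard scores using the divisor-n standard deviation."""
    n = len(values)
    m = sum(values) / n
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / n)
    return [(v - m) / sd for v in values]

def flag_univariate_outliers(values, threshold=None):
    """Return indices whose absolute standard score meets the threshold."""
    if threshold is None:
        # Assumed simplification: 2.5 for small samples, 4.0 otherwise.
        threshold = 2.5 if len(values) <= 80 else 4.0
    return [i for i, z in enumerate(standard_scores(values))
            if abs(z) >= threshold]

# Hypothetical measurements; the last value is far from the rest
values = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10, 11, 9, 30]
outliers = flag_univariate_outliers(values)
print(outliers)  # [12]
```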
Rules of Thumb for Bivariate and
Multivariate Outlier Detection
Bivariate
– Use scatterplots with confidence
intervals at a specified alpha level
Multivariate
– Threshold levels for the D²/df measure
should be conservative (0.005 or 0.001),
resulting in values of 2.5 (small samples)
versus 3 or 4 in larger samples
– D²: Mahalanobis measure
– df: degrees of freedom
51
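A minimal sketch of the D²/df screen for two variables follows. It assumes that df equals the number of variables and uses the divisor-n covariance matrix from the earlier slides; both choices, the function name, and the data are illustrative assumptions. The flagged case has ordinary values on each variable separately but an unusual combination.

```python
def mahalanobis_d2_two_vars(data):
    """Squared Mahalanobis distance of each (x1, x2) row from the mean."""
    n = len(data)
    m1 = sum(r[0] for r in data) / n
    m2 = sum(r[1] for r in data) / n
    s11 = sum((r[0] - m1) ** 2 for r in data) / n
    s22 = sum((r[1] - m2) ** 2 for r in data) / n
    s12 = sum((r[0] - m1) * (r[1] - m2) for r in data) / n
    det = s11 * s22 - s12 ** 2     # determinant of the 2x2 covariance matrix
    result = []
    for r in data:
        d1, d2 = r[0] - m1, r[1] - m2
        # Quadratic form with the inverse of [[s11, s12], [s12, s22]]
        result.append((s22 * d1 * d1 - 2 * s12 * d1 * d2 + s11 * d2 * d2) / det)
    return result

# Hypothetical data: nine cases on a line plus one off-pattern case (3, 8)
data = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5),
        (6, 6), (7, 7), (8, 8), (9, 9), (3, 8)]

df = 2                              # assumed: number of variables
d2 = mahalanobis_d2_two_vars(data)
flagged = [i for i, v in enumerate(d2) if v / df > 2.5]
print(flagged)  # [9]
```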
Outlier Description and Profiling
Generate profiles of each outlier
observation
Identify the variable(s) responsible
for its being an outlier
Discriminant analysis or multiple
regression can be applied to identify
the differences between outliers and
other observations
52
Examples of Outliers
53
Example of Bivariate Outliers
54
Missing Data
Valid values on one or more variables
are not available for analysis
Affects the generalizability of the
results
Remedy is applied
– to keep the distribution of values as
close as possible to the original
55
Missing Data Process
Systematic event that leads to
missing values
– Event external to the respondent (data
entry errors or data collection problem)
– Action on the part of the respondent
(such as refusal to answer)
Causes some patterns and
relationships underlying the missing
data
56
Impact of Missing Data
Practical impact
– Reduction of the sample size available
for analysis (adequate → inadequate)
Substantive perspective
– Statistical results based on data with a
non-random data process could be
biased
– e.g., individuals who did not provide
their income tended to be almost
exclusively in the high income bracket
57
Hypothetical Example of
Missing Data
58
Practical Considerations
Complete data required
– Only 5 cases are usable (too few)
A possible remedy: Eliminate V3
– 12 cases have complete data
– Eliminate cases 3, 13, 15
– Total number of missing data is reduced
to 7.4% for all values
59
Substantive Impact
5 cases still have missing data, all
in V4
These cases compared with those
with valid V4 data
– The 5 cases with missing V4 data have
the five lowest scores on V2
– Affects any analysis in which V2 and V4
are both included
– e.g., mean for V2 = 8.4 if cases with
missing V4 data are excluded, 7.8 if
included
60
Dealing with Missing Data
Determine the type of missing data
Determine the extent of missing data
Diagnose the randomness of the
missing data
Select the imputation method
61
Ignorable Missing Data
Expected and part of the research
design
The missing data process is
operating at random
Specific remedies are not needed
62
Examples of
Ignorable Missing Data
Taking a sample of the population
rather than gathering data from the
entire population
Missing data due to the design of the
data collection instrument
– e.g., respondents skip sections of
questions that are not applicable
63
Missing Data Process
Known to the Researchers
Can be identified due to procedural
factors
– Data entry errors
– Disclosure restrictions
– Failure to complete the entire
questionnaire
– Morbidity of the respondents
Little control over the process
Some remedies may be applicable
64
Missing Data Process
Unknown to the Researchers
Most often are related directly to the
respondent
Examples
– Refusal to respond to certain questions
– Respondents have no opinion or
insufficient knowledge to answer the
question
Should be anticipated and minimized in
the research design and data
collection stages
Some remedies may be applicable
65
Assessing the Extent and Patterns
of Missing Data
Tabulate
– Percentage of variables with missing data for
each case
– Number of cases with missing data for each
variable
Look for non-random patterns in the data
Determine the number of cases without
missing data on any variables
– (sample size available for analysis if remedies
are not applied)
66
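The tabulations listed on this slide are straightforward to compute. In the sketch below, missing values are represented by None and the small data set is hypothetical; rows are cases and columns are variables.

```python
# Hypothetical data set: 5 cases, 4 variables, None marks a missing value
data = [
    [1.2, 3.4, None, 5.0],
    [2.1, None, None, 4.2],
    [0.9, 2.8, 1.1, 3.9],
    [None, 3.1, 0.7, None],
    [1.5, 3.0, 0.9, 4.4],
]
n_cases, n_vars = len(data), len(data[0])

# Percentage of variables with missing data for each case
pct_missing_per_case = [100.0 * sum(v is None for v in row) / n_vars
                        for row in data]

# Number of cases with missing data for each variable
n_missing_per_var = [sum(row[k] is None for row in data)
                     for k in range(n_vars)]

# Number of cases without missing data on any variable
n_complete = sum(all(v is not None for v in row) for row in data)

print(pct_missing_per_case)  # [25.0, 50.0, 0.0, 50.0, 0.0]
print(n_missing_per_var)     # [1, 1, 2, 1]
print(n_complete)            # 2
```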
Rules of Thumb to Ignore
Missing Data
Missing data under 10% for an
individual case can generally be
ignored, except when the missing
data occurs in a specific non-random
fashion
The number of cases with no missing
data must be sufficient for the
selected analysis technique if
replacement values will not be
substituted (imputed) for the missing
data
67
Rules of Thumb for Deletions
Variables with as little as 15%
missing data are candidates for
deletion
– Higher levels of missing data
(20%~30%) can often be remedied
Be sure the overall decrease in
missing data is large enough to
justify deleting an individual variable
or case
68
Rules of Thumb for Deletions
Cases with missing data for
dependent variables typically are
deleted
When deleting a variable, ensure
that alternative variables, hopefully
highly correlated, are available to
represent the intent of the original
variable
69
Levels of Randomness
Missing at Random (MAR)
– e.g., missing data of gender are random
for both male and female, but those of
household income occur at a higher
frequency for males than females
Missing Completely at Random
(MCAR)
– e.g., missing data for household income
were randomly missing in equal
proportions for both male and female
70
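One informal way to look at the difference between the two levels is to compare missing-data rates across groups. The sketch below uses hypothetical (gender, income) records and an illustrative function name; it is only a descriptive check, and a formal test would be needed before drawing conclusions.

```python
def missing_rate_by_group(records):
    """Return {group: fraction of records whose value is missing}."""
    counts = {}
    for group, value in records:
        total, missing = counts.get(group, (0, 0))
        counts[group] = (total + 1, missing + (value is None))
    return {g: m / t for g, (t, m) in counts.items()}

# Hypothetical (gender, household income) records; None marks missing income
records = [
    ("M", 52000), ("M", None), ("M", None), ("M", 61000), ("M", None),
    ("F", 48000), ("F", 55000), ("F", None), ("F", 50000), ("F", 47000),
]

rates = missing_rate_by_group(records)
print(rates)  # {'M': 0.6, 'F': 0.2}
```

Clearly unequal rates, as in this made-up example, are what the MAR illustration on the slide describes; roughly equal rates would be consistent with the MCAR illustration.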
Modeling approach for
MAR Process
Involves maximum likelihood
estimation techniques
– e.g., EM approach
71
Imputation Using Only Valid Data
Complete Case Approach
– Include only those observations with
complete data
Using All-Available Data
– Use only valid data
– Imputes the distribution characteristics
(e.g., means or standard deviation) or
relationship (e.g., correlation) from
every valid value
72
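The two approaches can give different summary statistics on the same data. The following sketch computes variable means both ways on a hypothetical data set with None marking missing values; the names are illustrative.

```python
# Hypothetical data: 4 cases, 2 variables, None marks a missing value
data = [[8.0, 2.0], [6.0, None], [None, 4.0], [10.0, 6.0]]
n_vars = len(data[0])

# Complete case approach: keep only rows with no missing values
complete = [row for row in data if all(v is not None for v in row)]
cc_means = [sum(row[k] for row in complete) / len(complete)
            for k in range(n_vars)]

# All-available approach: each mean uses every valid value of that variable
def available_mean(k):
    valid = [row[k] for row in data if row[k] is not None]
    return sum(valid) / len(valid)

aa_means = [available_mean(k) for k in range(n_vars)]

print(cc_means)  # [9.0, 4.0]
print(aa_means)  # [8.0, 4.0]
```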
Imputation Using Known
Replacement Values
Hot deck imputation
– Use the value from another observation
in the sample that is deemed similar
Cold deck imputation
– Derive the replacement value from an
external source (e.g., prior studies,
other samples, etc.)
Case substitution
– Choose another nonsampled
observation
73
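Hot deck imputation needs some rule for deciding which observation is "similar". The sketch below assumes, purely for illustration, that similarity means the smallest absolute difference on one fully observed matching variable; the data, column roles, and function name are hypothetical.

```python
def hot_deck_impute(rows, target, match):
    """Fill missing rows[i][target] from the donor closest on rows[i][match].

    Assumes the matching variable is never missing.
    """
    donors = [r for r in rows if r[target] is not None]
    imputed = []
    for r in rows:
        r = list(r)                      # copy so the input is left unchanged
        if r[target] is None:
            donor = min(donors, key=lambda d: abs(d[match] - r[match]))
            r[target] = donor[target]
        imputed.append(r)
    return imputed

# Hypothetical rows: [age, income in thousands]; None marks missing income
rows = [[25, 30.0], [40, 55.0], [27, None], [60, 80.0]]
imputed = hot_deck_impute(rows, target=1, match=0)
print(imputed[2])  # [27, 30.0]
```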
Imputation by Calculating
Replacement Values
Mean Substitution
– Use the mean value of that variable
calculated from all valid responses
Regression imputation
– Predict the missing values of a variable
based on its relationship to other
variables in the data set
74
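Both calculated-replacement methods can be sketched in a few lines. The example assumes a single fully observed predictor x and simple least-squares regression of y on x; the data and function names are hypothetical. Note how the two methods can disagree when the case with the missing value has an unusual x.

```python
def mean_substitute(values):
    """Replace None with the mean of the valid values."""
    valid = [v for v in values if v is not None]
    m = sum(valid) / len(valid)
    return [m if v is None else v for v in values]

def regression_impute(x, y):
    """Replace None in y with predictions from a least-squares line on x."""
    pairs = [(a, b) for a, b in zip(x, y) if b is not None]
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    slope = (sum((a - mx) * (b - my) for a, b in pairs)
             / sum((a - mx) ** 2 for a, _ in pairs))
    intercept = my - slope * mx
    return [intercept + slope * a if b is None else b for a, b in zip(x, y)]

# Hypothetical data: y is missing for the case with the largest x
x = [1.0, 2.0, 5.0, 3.0, 4.0]
y = [2.0, 4.0, None, 6.0, 8.0]

y_mean = mean_substitute(y)
y_reg = regression_impute(x, y)
print(y_mean[2], y_reg[2])
```

Mean substitution fills in the overall mean of the valid responses, while regression imputation follows the fitted relationship with x.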
Summary of Imputation Using
Only Valid Data
75
Summary of Imputation Using
Known Replacement Values
76
Summary of Imputation by
Calculating Replacement Values
77
Summary of Model-Based Methods
78
Rules of Thumb for Imputation of
Missing Data
Under 10%
– Any imputation method
10% to 20%
– All-available, hot deck case substitution,
regression for MCAR
– Model-based for MAR
Over 20%
– Regression for MCAR
– Model-based for MAR
79
Summary Statistics of Missing Data
for Original Sample
80
Comparison of Four
Imputation Methods
81
Comparison of Four
Imputation Methods
82
