Transcript pptx
Multivariate data and
multivariate analysis
Multivariate data
Multivariate data - values recorded for several random
variables on a number of units.
Unit   Variable 1   …   Variable q
1      x11          …   x1q
⋮      ⋮                ⋮
n      xn1          …   xnq
n – number of units
q – number of variables recorded on each unit
xij - value of the jth variable for the ith unit
Multivariate analysis
Multivariate statistical analysis – simultaneous statistical
analysis of a collection of variables (e.g., handling apples,
oranges, and pears at the same time).
Simultaneous statistical analysis of a collection of variables
improves upon separate univariate analyses of each variable
by using information about the relationships between the
variables.
The general aim of most multivariate analysis is to uncover,
display, or extract any “signal” in the data in the presence of
noise and to discover what the data have to tell us.
Multivariate analysis
- Exploratory methods allow the detection of possible
unanticipated patterns in the data, opening up a wide
range of competing explanations. These methods are
characterized by an emphasis on the importance of
graphical display and visualization of the data and the
lack of any associated probabilistic model that would
allow for formal inferences (discussed in this course).
- Statistical inference methods are used when the
individuals in a multivariate data set have been
sampled from some population and the investigator
wishes to test a well-defined hypothesis about the
parameters of that population’s probability density
function (usually the multivariate normal).
History of the development
of multivariate analysis
- Late 19th century: Francis Galton and Karl Pearson
quantified the relationship between offspring and parental
characteristics and developed the correlation coefficient.
- Early 20th century: Charles Spearman introduced
factor analysis while investigating correlated intelligence
quotient (IQ) tests. Over the next two decades,
Spearman’s work was extended by Hotelling and by
Thurstone.
- In the 1920s, Fisher’s introduction of the analysis of
variance was followed by its multivariate generalization,
multivariate analysis of variance, based on work by
Bartlett and Roy.
History of the development
of multivariate analysis
- At first, the computational aids available to take on the
vast amounts of arithmetic involved in applying
multivariate methods were very limited.
- In the early years of the 21st century, the wide availability of
cheap and extremely powerful personal computers and
laptops, and flexible statistical software has meant that all
the methods of multivariate analysis can be applied
routinely even to very large data sets.
- Application of multivariate techniques to large data sets
was named “data mining” (the nontrivial extraction of implicit,
previously unknown and potentially useful information from
data).
Example of multivariate data
individual   sex      age   IQ    depression   health      weight
1            Male     21    120   Yes          Very good   150
2            Male     43    NA    No           Very good   160
3            Male     22    135   No           Average     135
4            Male     86    150   No           Very poor   140
5            Male     60    92    Yes          Good        110
6            Female   16    130   Yes          Good        110
7            Female   NA    150   Yes          Very good   120
8            Female   43    NA    Yes          Average     120
9            Female   22    84    No           Average     105
10           Female   80    70    No           Good        100
NA – Not Available
Number of units n=10
Number of variables q=7
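As a concrete illustration, the table above could be entered in R as a data frame along the following lines (a minimal sketch; the object name hypo is an arbitrary choice and not part of the original slides):

# Hypothetical example data entered as an R data frame
hypo <- data.frame(
  individual = 1:10,
  sex        = c(rep("Male", 5), rep("Female", 5)),
  age        = c(21, 43, 22, 86, 60, 16, NA, 43, 22, 80),
  IQ         = c(120, NA, 135, 150, 92, 130, 150, NA, 84, 70),
  depression = c("Yes", "No", "No", "No", "Yes", "Yes", "Yes", "Yes", "No", "No"),
  health     = c("Very good", "Very good", "Average", "Very poor", "Good",
                 "Good", "Very good", "Average", "Average", "Good"),
  weight     = c(150, 160, 135, 140, 110, 110, 120, 120, 105, 100))
dim(hypo)   # 10 units (n) and 7 variables (q)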
Example of multivariate data
Types of measurements
Nominal - Unordered categorical variables (e.g. treatment
allocation, the sex of the responder, hair color).
Ordinal - Where there is an ordering but no implication of
equal distance between the different points of the scale
(e.g. educational level: no schooling, primary, secondary, or
tertiary education).
Interval - Where there are equal differences between
successive points on the scale but the position of zero is
arbitrary (e.g. measurement of temperature using the
Celsius or Fahrenheit scales).
Ratio – The highest level of measurement, where both the
relative magnitudes of scores and the differences between
them are meaningful. The position of zero is fixed (e.g. absolute
measure of temperature (Kelvin), age, weight, and length).
Example of multivariate data
Missing values
Observations and measurements that should have been recorded but,
for one reason or another, were not (e.g. non-response in sample surveys,
dropouts in longitudinal data).
How to deal with missing values? Three common approaches (illustrated in the sketch below):
1. Complete-case analysis – omit any case with a missing value
on any of the variables (not recommended, because it can lead to
misleading conclusions and inferences).
2. Available-case analysis – exploit the incomplete information by using all
the cases available to estimate each quantity of interest (difficulties arise
when the missing data are not missing completely at random).
3. Multiple imputation – a Monte Carlo technique in which the missing
values are replaced by m > 1 simulated versions (typically 3 < m < 10); generally
the most appropriate approach.
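A minimal R sketch of approaches 1 and 2, assuming the hypothetical data frame hypo from the earlier sketch; multiple imputation would normally be done with a dedicated package such as mice, shown here only as a commented-out call:

# 1. Complete-case analysis: drop every unit with any missing value.
hypo_cc <- na.omit(hypo)
nrow(hypo_cc)   # only the fully observed units remain

# 2. Available-case analysis: use all pairwise-complete observations,
#    e.g. when estimating covariances of the numeric variables.
cov(hypo[, c("age", "IQ", "weight")], use = "pairwise.complete.obs")

# 3. Multiple imputation (requires the 'mice' package):
# imp <- mice::mice(hypo, m = 5)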
Example of multivariate data
Covariance
Covariance of two random variables is a measure of their
linear dependence.
$\mathrm{Cov}(X_i, X_j) = E[(X_i - \mu_i)(X_j - \mu_j)]$

where $\mu_i = E(X_i)$, $\mu_j = E(X_j)$, and $E$ denotes expectation.
If the two variables are independent of each other, their
covariance is equal to zero.
Larger absolute values of the covariance indicate a greater degree of
linear dependence between the two variables.
Example of multivariate data
Covariance
If $i = j$, the covariance of the variable with itself is simply
its variance. The variance of variable $X_i$ is

$\sigma_i^2 = E[(X_i - \mu_i)^2]$
In a multivariate data set with q observed variables, there are
q variances and q(q − 1)/2 covariances.
Covariance depends on the scales on which the two variables
are measured.
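As a quick numerical check of the definition, the sample analogue of the covariance can be computed directly and compared with R's cov(); the toy vectors below are arbitrary and used only for illustration:

# Toy vectors (arbitrary values)
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 2, 5, 4)
n <- length(x)

# Sample analogue of E[(X - mu_x)(Y - mu_y)], with the usual 1/(n - 1) divisor
sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Matches the built-in estimator
cov(x, y)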
Example of multivariate data
Covariance matrix
$\Sigma = \begin{pmatrix}
\sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1q} \\
\sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2q} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{q1} & \sigma_{q2} & \cdots & \sigma_q^2
\end{pmatrix}$

$\sigma_{ij}$ – covariance of $X_i$ and $X_j$
$\sigma_i^2$ – variance of variable $X_i$
$\sigma_{ij} = \sigma_{ji}$
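This structure can be checked numerically: the sample covariance matrix returned by cov() is symmetric, with the variances on the diagonal. A minimal sketch with an arbitrary small data matrix:

# Small numeric data matrix with q = 3 variables (arbitrary values)
X <- cbind(x1 = c(2, 4, 6, 8, 10),
           x2 = c(1, 3, 2, 5, 4),
           x3 = c(5, 2, 4, 3, 6))

S <- cov(X)        # q x q sample covariance matrix
isSymmetric(S)     # sigma_ij = sigma_ji
diag(S)            # diagonal entries are the variances sigma_i^2
apply(X, 2, var)   # the same values, computed variable by variable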
Example of multivariate data
Measure data
Measurements of chest, waist, and hips on a sample of 20 men and women.
R data “Measure”
chest   waist   hips   gender
34      30      32     male
37      32      37     male
38      30      36     male
36      33      39     male
38      29      33     male
43      32      38     male
40      33      42     male
38      30      40     male
40      30      37     male
41      32      39     male
36      24      35     female
36      25      37     female
34      24      37     female
33      22      34     female
36      26      37     female
37      26      37     female
34      25      38     female
36      28      37     female
38      23      40     female
35      23      35     female
Example of multivariate data
Covariance
Calculate the covariance matrix for the numerical
variables chest, waist, and hips in the “Measure” data. We
remove the categorical variable “gender” (column 4).
> cov(Measure[,1:3])
          chest     waist     hips
chest  6.631579  6.368421 3.052632
waist  6.368421 12.526316 3.684211
hips   3.052632  3.684211 5.894737
Example of multivariate data
Covariance
Covariance matrix for gender female in “Measure” data
>cov(subset(Measure,Measure$gender=="female")[,c(1:3)])
         chest    waist     hips
chest 2.277778 2.166667 1.500000
waist 2.166667 2.988889 2.633333
hips 1.500000 2.633333 2.900000
Covariance matrix for gender male in “Measure” data
>cov(subset(Measure, Measure$gender=="male")[,c(1:3)])
          chest     waist     hips
chest 6.7222222 0.9444444 3.944444
waist 0.9444444 2.1000000 3.077778
hips 3.9444444 3.0777778 9.344444
Example of multivariate data
Correlation
Correlation is independent of the scales of the two variables.
Correlation coefficient ( ij ) is the covariance divided by the
product of standard deviations of the two variables.
$\rho_{ij} = \dfrac{\sigma_{ij}}{\sigma_i \sigma_j}$

where $\sigma_i = \sqrt{\sigma_i^2}$ is the standard deviation of $X_i$.
The correlation coefficient lies between −1 and +1 and gives a
measure of the linear relationship between the variables Xi and Xj. It is
positive if high values of Xi are associated with high values of Xj
and negative if high values of Xi are associated with low values of
Xj.
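The definition can be verified numerically with the same kind of arbitrary toy vectors as before: dividing the covariance by the product of the standard deviations reproduces R's cor():

x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 2, 5, 4)

# Covariance divided by the product of the standard deviations
cov(x, y) / (sd(x) * sd(y))

# Matches the built-in estimator
cor(x, y)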
Example of multivariate data
Correlation
Correlation matrix for “Measure” data
> cor(Measure[,1:3])
          chest     waist      hips
chest 1.0000000 0.6987336 0.4882404
waist 0.6987336 1.0000000 0.4287465
hips 0.4882404 0.4287465 1.0000000
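The same matrix can also be obtained by rescaling the covariance matrix with the standard deviations, e.g. using base R's cov2cor() (shown here as a cross-check; it is not part of the original slides):

> cov2cor(cov(Measure[, 1:3]))   # identical to cor(Measure[, 1:3])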
Example of multivariate data
Distances
The distance between units in the data is often of considerable
importance in some multivariate techniques. The most commonly
used measure is the Euclidean distance

$d_{ij} = \sqrt{\sum_{k=1}^{q} (x_{ik} - x_{jk})^2}$

where $x_{ik}$ and $x_{jk}$, $k = 1, \ldots, q$, are the variable values for units i and j.
When the variables in a multivariate data set are on different
scales, we need to standardize them before calculating the
distances (e.g. divide each variable by its standard deviation).
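A small check of the Euclidean distance formula for the first two units of the “Measure” data (this sketch assumes Measure has been loaded as in the earlier slides):

# Euclidean distance between units 1 and 2, computed directly from the formula
m <- as.matrix(Measure[, 1:3])
sqrt(sum((m[1, ] - m[2, ])^2))

# The same value from the built-in distance function
dist(m[1:2, ])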
Example of multivariate data
Distances
The distance matrix for the first 12 observations in “Measure”
data after standardization.
>dist(scale(Measure[,1:3],center = FALSE))
      1    2    3    4    5    6    7    8    9   10   11
2 0.17
3 0.15 0.08
4 0.22 0.07 0.14
5 0.11 0.15 0.09 0.22
6 0.29 0.16 0.16 0.19 0.21
7 0.32 0.16 0.20 0.13 0.28 0.14
8 0.23 0.11 0.11 0.12 0.19 0.16 0.13
9 0.21 0.10 0.06 0.16 0.12 0.11 0.17 0.09
10 0.27 0.12 0.13 0.14 0.20 0.06 0.09 0.11 0.09
11 0.23 0.28 0.22 0.33 0.19 0.34 0.38 0.25 0.24 0.32
12 0.22 0.24 0.18 0.28 0.18 0.30 0.32 0.20 0.20 0.28 0.06
Example of multivariate data
Multivariate normal density function
Multivariate normal density function for two variables x1 and
x2:
$f(x_1, x_2; \mu_1, \mu_2, \sigma_1, \sigma_2, \rho) =
\frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}
\exp\left\{ -\frac{1}{2(1-\rho^2)}
\left[ \left(\frac{x_1-\mu_1}{\sigma_1}\right)^2
- 2\rho\left(\frac{x_1-\mu_1}{\sigma_1}\right)\left(\frac{x_2-\mu_2}{\sigma_2}\right)
+ \left(\frac{x_2-\mu_2}{\sigma_2}\right)^2 \right] \right\}$
µ1 and µ2 – population means of the two variables
σ1² and σ2² – population variances of the two variables
ρ – population correlation between the two variables X1 and X2
Linear combinations of the variables are themselves normally
distributed.
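A minimal sketch evaluating this density in R, using arbitrary parameter values; the cross-check against dmvnorm() assumes the mvtnorm package is available and is therefore left as a comment:

# Bivariate normal density coded directly from the formula above
dbvnorm <- function(x1, x2, mu1 = 0, mu2 = 0, s1 = 1, s2 = 1, rho = 0.5) {
  z1 <- (x1 - mu1) / s1
  z2 <- (x2 - mu2) / s2
  1 / (2 * pi * s1 * s2 * sqrt(1 - rho^2)) *
    exp(-(z1^2 - 2 * rho * z1 * z2 + z2^2) / (2 * (1 - rho^2)))
}
dbvnorm(0.3, -0.2)

# Cross-check (requires the 'mvtnorm' package):
# Sigma <- matrix(c(1, 0.5, 0.5, 1), 2, 2)
# mvtnorm::dmvnorm(c(0.3, -0.2), mean = c(0, 0), sigma = Sigma)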
Example of multivariate data
Multivariate normal density function
Methods to test the multivariate normal distribution:
- normal probability plots for each variable separately
- convert each multivariate observation into a single number before
plotting (i.e. each q-dimensional observation $x_i$ can be converted into a
generalized distance $d_i^2$, giving a measure of the distance of that
observation from the mean vector of the complete sample, $\bar{x}$):

$d_i^2 = (x_i - \bar{x})^T S^{-1} (x_i - \bar{x})$

where S is the sample covariance matrix.
If the observations are from a multivariate normal distribution, then
these distances have approximately a chi-squared distribution with q degrees
of freedom, denoted $\chi_q^2$.
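The generalized distances d_i^2 can also be obtained with base R's mahalanobis(), which computes exactly (x_i − x̄)ᵀ S⁻¹ (x_i − x̄); a minimal sketch on the “Measure” data, assuming it is loaded as in the earlier slides:

# Squared generalized (Mahalanobis) distances for the Measure data
X <- Measure[, 1:3]
d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))
head(d2)   # under multivariate normality, roughly chi-squared with q = 3 df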
Example of multivariate data
Air pollution data
Air pollution in 41 cities in the USA.
R data “USairpollution”
Variables:
SO2: SO2 content of air in micrograms per cubic meter
temp: average annual temperature in degrees Fahrenheit
manu: number of manufacturing enterprises employing 20 or more
workers
popul: population size (1970 census) in thousands
wind: average annual wind speed in miles per hour
precip: average annual precipitation in inches
predays: average number of days with precipitation per year
Multivariate normal density function
Read the “USairpollution” data with the first column as row names:
>USairpollution=read.csv("E:/Multivariate_analysis/Data/USairpollution.csv"
,header=T,row.names=1)
Normal probability plots for “manu” and “popul” variables in “USairpollution”
data.
> qqnorm(USairpollution$manu, main="manu")
> qqline(USairpollution$manu)
> qqnorm(USairpollution$popul, main="popul")
> qqline(USairpollution$popul)
Multivariate normal density function
[Figure: normal probability (Q-Q) plots of Sample Quantiles against Theoretical Quantiles for the manu and popul variables.]
Multivariate normal density function
Normal probability plots for each variable separately in “USairpollution”
data.
layout(matrix(1:8, nc = 2))
sapply(colnames(USairpollution), function(x) {
    qqnorm(USairpollution[[x]], main = x)
    qqline(USairpollution[[x]])
})
[Figure: normal probability (Q-Q) plots of Sample Quantiles against Theoretical Quantiles for each variable: SO2, temp, manu, popul, wind, precip, and predays.]
The plots for SO2 concentration
and precipitation both deviate
considerably from linearity, and
the plots for manufacturing and
population show evidence of a
number of outliers.
Multivariate normal density function
Chi-square plot:
> x = USairpollution
> cm = colMeans(x)
> S = cov(x)
> d = apply(x, 1, function(x) t(x - cm) %*% solve(S) %*% (x - cm))
> plot(qc <- qchisq((1:nrow(x) - 1/2) / nrow(x), df = 6),
       sd <- sort(d),
       xlab = expression(paste(chi[6]^2, " Quantile")),
       ylab = "Ordered distances", xlim = range(qc) * c(1, 1.1))
> oups = which(rank(abs(qc - sd), ties = "random") > nrow(x) - 3)
> text(qc[oups], sd[oups] - 1.5, names(oups))
> abline(a = 0, b = 1)
Multivariate normal density function
Plotting the ordered distances against the corresponding quantiles
of the appropriate chi-square distribution should lead to a straight
line through the origin. The chi-square plot is also useful for detecting outliers
(e.g. Chicago, Phoenix, Providence).
[Figure: chi-square plot of the ordered distances against the χ²₆ quantiles; Chicago, Phoenix, and Providence stand out from the line as outliers.]