Multivariate Methods


1
MACHINE LEARNING
6. Multivariate Methods
Motivating Example
2

Loan application
- Observation vector: information about the customer
  - Age
  - Marital status
  - Yearly income
  - Savings
- These are the inputs/attributes/features associated with a customer
- The variables are correlated (e.g., savings vs. age)
Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
Correlation
3

- Suppose we have two random variables X and Y. We want to estimate the degree of "correlation" between them.
- Positive correlation: if one happens to be large, the probability that the other one is also large is significant
- Negative correlation: if one happens to be large, the probability that the other one is small is significant
- Zero correlation: the value of one tells nothing about the value of the other
Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
Correlation
4

Some reasonable assumptions:
- The "correlation" between X and Y is the same as between X+a and Y+b, where a, b are constants
- The "correlation" between X and Y is the same as between aX and bY, where a, b are positive constants

Example:
- If there is a connection between the temperature inside the building and outside the building, it does not matter what scale is used
Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
Correlation
5

Let's do a "normalization":

$$X_1 = \frac{X - EX}{\sigma_X}, \qquad Y_1 = \frac{Y - EY}{\sigma_Y}$$

- Both of these variables have zero mean and unit variance
- This filters out the individual differences
- Let's check the mean (expected) square difference between them:

$$E(X_1 - Y_1)^2$$
Lecture Notes for E Alpaydın 2004 Introduction to Machine
Learning © The MIT Press (V1.1)
Correlation
6
$$E(X_1 - Y_1)^2$$

The result should be:
- Small when positively "correlated"
- Large when negatively "correlated"
- Medium when "uncorrelated"
Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
Correlation
7
$$E(X_1 - Y_1)^2 = E(X_1^2 + Y_1^2 - 2X_1Y_1) = EX_1^2 + EY_1^2 - 2EX_1Y_1 = 2 - 2EX_1Y_1$$

$$EX_1Y_1 = \frac{E[(X - EX)(Y - EY)]}{\sigma_1\sigma_2} = \frac{\mathrm{Cov}(X, Y)}{\sigma_1\sigma_2} = \rho$$

A larger covariance means a larger correlation coefficient, which means a smaller average square difference.
Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
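To make the identity above concrete, here is a minimal simulation sketch (assuming NumPy is available; the correlation value 0.7 is an arbitrary choice for illustration). It standardizes two correlated variables and checks that the mean square difference comes out close to 2 - 2ρ.

```python
# Minimal simulation sketch: for standardized X1, Y1, E[(X1 - Y1)^2] = 2 - 2*rho.
import numpy as np

rng = np.random.default_rng(0)
rho = 0.7                                  # illustrative correlation value
cov = np.array([[1.0, rho], [rho, 1.0]])
x, y = rng.multivariate_normal([0.0, 0.0], cov, size=200_000).T

x1 = (x - x.mean()) / x.std()              # zero mean, unit variance
y1 = (y - y.mean()) / y.std()

print(np.mean((x1 - y1) ** 2))             # close to 2 - 2*rho = 0.6
print(2 - 2 * np.corrcoef(x, y)[0, 1])     # should agree with the line above
```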
Correlation vs. Dependence
8

- Not the same thing
- Independent => zero correlation
- Zero correlation => not necessarily independent
- We look at the square difference between the two standardized variables:

$$E(X_1 - Y_1)^2$$

- Two variables might have a square difference that tells us nothing (zero correlation) but still be dependent
Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
Correlation vs. Independence
9

- Random variable X takes values in {-1, 0, 1}, each with probability 1/3
- Random variable Y = X^2
- Clearly dependent, but Cov(X, Y) = E[(X - 0)(Y - EY)] = EXY - EX·EY = EXY = EX^3 = 0
- Correlation only measures "linear" dependence
Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
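A quick numeric check of this example, as a sketch assuming NumPy: sample X uniformly from {-1, 0, 1}, set Y = X^2, and observe that the sample covariance is approximately zero even though Y is a deterministic function of X.

```python
# Zero correlation does not imply independence: Y = X^2 is fully determined by X,
# yet Cov(X, Y) = 0 when X is uniform on {-1, 0, 1}.
import numpy as np

rng = np.random.default_rng(1)
x = rng.choice([-1, 0, 1], size=100_000)   # each value with probability 1/3
y = x ** 2

print(np.cov(x, y)[0, 1])                  # approximately 0
print(np.corrcoef(x, y)[0, 1])             # approximately 0
```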
Multivariate Distribution
10

- Assume all members of a class come from a joint distribution
- We can learn the distribution P(x|C) from data
- Assign a new instance to the most probable class P(C|x) using Bayes' rule
- An instance is described by a vector of correlated parameters
- This is the realm of multivariate distributions
- Multivariate normal
Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
Multivariate Parameters
11
$$\text{Mean: } E[\mathbf{x}] = \boldsymbol{\mu} = [\mu_1, \ldots, \mu_d]^T$$

$$\text{Covariance: } \sigma_{ij} = \mathrm{Cov}(X_i, X_j)$$

$$\text{Correlation: } \mathrm{Corr}(X_i, X_j) = \rho_{ij} = \frac{\sigma_{ij}}{\sigma_i \sigma_j}$$

$$\boldsymbol{\Sigma} = \mathrm{Cov}(\mathbf{X}) = E\left[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T\right] =
\begin{bmatrix}
\sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1d} \\
\sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2d} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{d1} & \sigma_{d2} & \cdots & \sigma_d^2
\end{bmatrix}$$
Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
Multivariate Data
12

- Multiple measurements (sensors)
- d inputs/features/attributes: d-variate
- N instances/observations/examples

$$\mathbf{X} = \begin{bmatrix}
X_1^1 & X_2^1 & \cdots & X_d^1 \\
X_1^2 & X_2^2 & \cdots & X_d^2 \\
\vdots & \vdots & & \vdots \\
X_1^N & X_2^N & \cdots & X_d^N
\end{bmatrix}$$
Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
Parameter Estimation
13

$$\text{Sample mean } \mathbf{m}: \quad m_i = \frac{\sum_{t=1}^{N} x_i^t}{N}, \quad i = 1, \ldots, d$$

$$\text{Covariance matrix } \mathbf{S}: \quad s_{ij} = \frac{\sum_{t=1}^{N} \left(x_i^t - m_i\right)\left(x_j^t - m_j\right)}{N}$$

$$\text{Correlation matrix } \mathbf{R}: \quad r_{ij} = \frac{s_{ij}}{s_i s_j}$$

Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
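These estimators are straightforward to compute. Below is a minimal sketch assuming NumPy, with X an N x d data matrix (rows are instances) and illustrative variable names.

```python
# Sample mean, covariance (1/N convention, as on the slide) and correlation matrix.
import numpy as np

X = np.random.default_rng(2).normal(size=(100, 3))   # toy data: N=100, d=3

m = X.mean(axis=0)                          # sample mean m_i
S = (X - m).T @ (X - m) / X.shape[0]        # covariance matrix s_ij
s = np.sqrt(np.diag(S))                     # per-attribute standard deviations
R = S / np.outer(s, s)                      # correlation matrix r_ij = s_ij / (s_i s_j)

# Cross-check: np.cov(X, rowvar=False, bias=True) and np.corrcoef(X, rowvar=False)
# produce the same S and R.
```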
Estimation of Missing Values
14

- What to do if certain instances have missing attributes?
- Ignore those instances: not a good idea if the sample is small
- Use 'missing' as an attribute value: it may carry information
- Imputation: fill in the missing value
  - Mean imputation: use the most likely value (e.g., the mean)
  - Imputation by regression: predict the missing value from the other attributes
Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
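As a small illustration of mean imputation, here is a sketch assuming NumPy; the toy matrix and the NaN markers are made up for the example.

```python
# Mean imputation: replace each missing entry (NaN) with its column mean
# computed from the observed values of that attribute.
import numpy as np

X = np.array([[25.0, 40_000.0],
              [np.nan, 52_000.0],
              [31.0, np.nan]])

col_means = np.nanmean(X, axis=0)           # attribute means, ignoring NaN
missing = np.isnan(X)
X[missing] = np.take(col_means, np.where(missing)[1])
print(X)
```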
Multivariate Normal
15

- We have d attributes
- Often we can assume each one is distributed normally
- The attributes might be dependent/correlated
- Joint distribution of several correlated variables:
  - P(X_1 = x_1, X_2 = x_2, ..., X_d = x_d) = ?
  - Each X_i is normally distributed with mean μ_i and variance σ_i^2
Lecture Notes for E Alpaydın 2004 Introduction to Machine
Learning © The MIT Press (V1.1)
Multivariate Normal
16
$$\mathbf{x} \sim \mathcal{N}_d(\boldsymbol{\mu}, \boldsymbol{\Sigma})$$

$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right]$$

- Mahalanobis distance: (x - μ)^T Σ^{-1} (x - μ)
- When two variables are highly correlated (large covariance), dividing by the covariance through Σ^{-1} makes their deviations contribute less to the Mahalanobis distance, and hence more to the probability
Lecture Notes for E Alpaydın 2004 Introduction to Machine
Learning © The MIT Press (V1.1)
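A minimal sketch of this density, assuming NumPy and SciPy are available: compute the Mahalanobis distance and the density directly from the formula, then cross-check with scipy.stats.multivariate_normal (the numbers are illustrative).

```python
# Mahalanobis distance and the multivariate normal density for a 2-D example.
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])              # strongly correlated pair
x = np.array([1.0, 1.0])

diff = x - mu
maha = diff @ np.linalg.inv(Sigma) @ diff   # (x - mu)^T Sigma^{-1} (x - mu)
d = len(mu)
p = np.exp(-0.5 * maha) / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))

print(maha, p)
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))   # should match p
```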
Bivariate Normal
17
Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
Multivariate Normal Distribution
18

- Mahalanobis distance: (x - μ)^T Σ^{-1} (x - μ) measures the distance from x to μ in terms of Σ (it normalizes for differences in variances and for correlations)

- Bivariate case: d = 2

$$\boldsymbol{\Sigma} = \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix}$$

$$p(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - \rho^2}} \exp\left[-\frac{1}{2(1 - \rho^2)}\left(z_1^2 - 2\rho z_1 z_2 + z_2^2\right)\right]$$

$$z_i = \frac{x_i - \mu_i}{\sigma_i}$$
Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
Bivariate Normal
19
Lecture Notes for E Alpaydın 2004 Introduction to Machine
Learning © The MIT Press (V1.1)
Independent Inputs: Naive Bayes
20

- If the x_i are independent, the off-diagonals of Σ are 0, and the Mahalanobis distance reduces to a weighted (by 1/σ_i) Euclidean distance:

$$p(\mathbf{x}) = \prod_{i=1}^{d} p_i(x_i) = \frac{1}{(2\pi)^{d/2} \prod_{i=1}^{d} \sigma_i} \exp\left[-\frac{1}{2} \sum_{i=1}^{d} \left(\frac{x_i - \mu_i}{\sigma_i}\right)^2\right]$$

- If the variances are also equal, it reduces to the Euclidean distance
Based on Introduction to Machine Learning © The MIT Press (V1.1)
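A quick check of this factorization, as a sketch assuming NumPy and SciPy: with a diagonal covariance, the joint density equals the product of the univariate normal densities (the naive Bayes assumption). The numbers below are illustrative.

```python
# Diagonal covariance: joint density = product of univariate normal densities.
import numpy as np
from scipy.stats import multivariate_normal, norm

mu = np.array([1.0, -2.0, 0.5])
sigma = np.array([0.5, 2.0, 1.0])           # per-attribute standard deviations
x = np.array([1.2, -1.0, 0.0])

joint = multivariate_normal(mean=mu, cov=np.diag(sigma ** 2)).pdf(x)
product = np.prod(norm.pdf(x, loc=mu, scale=sigma))
print(joint, product)                       # identical up to floating point
```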
Projection Distribution
21

- Example: a vector of 3 features with a multivariate normal distribution
- Project it to a 2-dimensional space (e.g., the XY plane): vectors of 2 features
- The projections also follow a multivariate normal distribution
- In general, the projection of a d-dimensional normal onto a k-dimensional space is a k-dimensional normal (see the sketch below)
Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
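For axis-aligned projections this is just sub-block selection. A minimal sketch assuming NumPy, with made-up numbers: keeping X and Y out of (X, Y, Z) selects the corresponding entries of μ and the corresponding 2x2 block of Σ.

```python
# Projection (marginalization) of a multivariate normal onto the XY plane:
# take the sub-vector of the mean and the sub-block of the covariance matrix.
import numpy as np

mu = np.array([1.0, 2.0, 3.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])

keep = [0, 1]                               # indices of X and Y
mu_xy = mu[keep]                            # mean of the projected normal
Sigma_xy = Sigma[np.ix_(keep, keep)]        # covariance of the projected normal
print(mu_xy, Sigma_xy)
```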
1D projection
22
Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
Multivariate Classification
23

- Assume the members of a class come from a single multivariate distribution
- The multivariate normal is a good choice:
  - Easy to analyze
  - Models many natural phenomena
  - Models a class as a single prototype source (the mean) that is slightly randomly perturbed
Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
Example
24

- Matching cars to customers
- Each car defines a class of matching customers
- Customers are described by (age, income)
- There is a correlation between age and income
- Assume each class is multivariate normal
- Need to learn P(x|C) from data
- Use Bayes' rule to compute P(C|x)
Lecture Notes for E Alpaydın 2004 Introduction to Machine
Learning © The MIT Press (V1.1)
Parametric Classification
25

- If p(x | C_i) ~ N(μ_i, Σ_i):

$$p(\mathbf{x} \mid C_i) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}_i|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i)\right]$$

- The discriminant functions are

$$g_i(\mathbf{x}) = \log P(C_i \mid \mathbf{x}) = \log \frac{p(\mathbf{x} \mid C_i)\, P(C_i)}{P(\mathbf{x})} = \log p(\mathbf{x} \mid C_i) + \log P(C_i) - \log P(\mathbf{x})$$

$$= -\frac{d}{2}\log 2\pi - \frac{1}{2}\log|\boldsymbol{\Sigma}_i| - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) + \log P(C_i) - \log P(\mathbf{x})$$

- Need to know the covariance matrix and the mean to compute the discriminant functions
- Can ignore P(x) since it is the same for all classes

Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
Estimation of Parameters
26
$$\hat{P}(C_i) = \frac{\sum_t r_i^t}{N}$$

$$\mathbf{m}_i = \frac{\sum_t r_i^t \mathbf{x}^t}{\sum_t r_i^t}$$

$$\mathbf{S}_i = \frac{\sum_t r_i^t (\mathbf{x}^t - \mathbf{m}_i)(\mathbf{x}^t - \mathbf{m}_i)^T}{\sum_t r_i^t}$$

$$g_i(\mathbf{x}) = -\frac{1}{2}\log|\mathbf{S}_i| - \frac{1}{2}(\mathbf{x} - \mathbf{m}_i)^T \mathbf{S}_i^{-1} (\mathbf{x} - \mathbf{m}_i) + \log \hat{P}(C_i)$$
Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
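A compact sketch of these estimators and of the discriminant, assuming NumPy and a labelled training set (X of shape (N, d), integer labels y). Here the indicator r_i^t from the formulas is expressed by selecting the rows with label i, and all names are illustrative.

```python
# Per-class estimates (prior, mean, covariance) and the Gaussian discriminant g_i(x).
import numpy as np

def fit_gaussian_classes(X, y):
    classes = np.unique(y)
    priors, means, covs = [], [], []
    for c in classes:
        Xc = X[y == c]                        # rows with r_i^t = 1
        priors.append(len(Xc) / len(X))       # P_hat(C_i)
        m = Xc.mean(axis=0)                   # m_i
        diff = Xc - m
        covs.append(diff.T @ diff / len(Xc))  # S_i
        means.append(m)
    return classes, priors, means, covs

def discriminant(x, prior, m, S):
    diff = x - m
    return (-0.5 * np.linalg.slogdet(S)[1]            # -1/2 log|S_i|
            - 0.5 * diff @ np.linalg.solve(S, diff)   # -1/2 (x-m_i)^T S_i^{-1} (x-m_i)
            + np.log(prior))                          # + log P_hat(C_i)

# Prediction: evaluate discriminant() for every class and pick the argmax.
```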
Covariance Matrix per Class
27

- Quadratic discriminant:

$$g_i(\mathbf{x}) = -\frac{1}{2}\log|\mathbf{S}_i| - \frac{1}{2}\left(\mathbf{x}^T \mathbf{S}_i^{-1} \mathbf{x} - 2\mathbf{x}^T \mathbf{S}_i^{-1} \mathbf{m}_i + \mathbf{m}_i^T \mathbf{S}_i^{-1} \mathbf{m}_i\right) + \log \hat{P}(C_i)$$

$$= \mathbf{x}^T \mathbf{W}_i \mathbf{x} + \mathbf{w}_i^T \mathbf{x} + w_{i0}$$

where

$$\mathbf{W}_i = -\frac{1}{2}\mathbf{S}_i^{-1}, \qquad \mathbf{w}_i = \mathbf{S}_i^{-1}\mathbf{m}_i, \qquad w_{i0} = -\frac{1}{2}\mathbf{m}_i^T \mathbf{S}_i^{-1}\mathbf{m}_i - \frac{1}{2}\log|\mathbf{S}_i| + \log \hat{P}(C_i)$$

- Requires estimating K·d·(d+1)/2 parameters for the covariance matrices

Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
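The same discriminant rearranged into its quadratic form, as a short sketch; S_i, m_i, and the prior would come from the estimates on the previous slide, and the function names are illustrative.

```python
# Quadratic discriminant parameters W_i, w_i, w_i0 and the discriminant value.
import numpy as np

def quadratic_params(S_i, m_i, prior_i):
    S_inv = np.linalg.inv(S_i)
    W_i = -0.5 * S_inv
    w_i = S_inv @ m_i
    w_i0 = (-0.5 * m_i @ S_inv @ m_i
            - 0.5 * np.linalg.slogdet(S_i)[1]
            + np.log(prior_i))
    return W_i, w_i, w_i0

def g(x, W_i, w_i, w_i0):
    return x @ W_i @ x + w_i @ x + w_i0      # x^T W_i x + w_i^T x + w_i0
```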
28

[Figure: class likelihoods, the posterior for C1, and the discriminant where P(C1|x) = 0.5]
Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
Common Covariance Matrix S
29

- If there is not enough data, we can assume all classes share a common sample covariance matrix S:

$$\mathbf{S} = \sum_i \hat{P}(C_i)\, \mathbf{S}_i$$

- The discriminant reduces to a linear discriminant (x^T S^{-1} x is common to all discriminants and can be removed):

$$g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x} - \mathbf{m}_i)^T \mathbf{S}^{-1} (\mathbf{x} - \mathbf{m}_i) + \log \hat{P}(C_i)$$

$$g_i(\mathbf{x}) = \mathbf{w}_i^T \mathbf{x} + w_{i0}, \quad \text{where } \mathbf{w}_i = \mathbf{S}^{-1}\mathbf{m}_i, \quad w_{i0} = -\frac{1}{2}\mathbf{m}_i^T \mathbf{S}^{-1}\mathbf{m}_i + \log \hat{P}(C_i)$$
Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
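A brief sketch of the shared-covariance case, building on the per-class estimates from the earlier sketch (names are illustrative): pool the class covariances with the estimated priors, then form the linear discriminant weights.

```python
# Pooled covariance S and linear discriminant parameters w_i, w_i0.
import numpy as np

def linear_params(priors, means, covs):
    S = sum(p * S_i for p, S_i in zip(priors, covs))  # S = sum_i P_hat(C_i) S_i
    S_inv = np.linalg.inv(S)
    ws = [S_inv @ m for m in means]                   # w_i = S^{-1} m_i
    w0s = [-0.5 * m @ S_inv @ m + np.log(p)           # w_i0
           for m, p in zip(means, priors)]
    return ws, w0s   # g_i(x) = w_i^T x + w_i0; predict the class with the largest g_i
```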
Common Covariance Matrix S
30
Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
Diagonal S
31

- When the x_j, j = 1,...,d, are independent, Σ is diagonal
- p(x|C_i) = ∏_j p(x_j|C_i)  (the naive Bayes assumption)

$$g_i(\mathbf{x}) = -\frac{1}{2}\sum_{j=1}^{d}\left(\frac{x_j^t - m_{ij}}{s_j}\right)^2 + \log \hat{P}(C_i)$$

- Classify based on the weighted Euclidean distance (in s_j units) to the nearest mean
Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
Diagonal S
32
[Figure: variances may be different]
Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
Diagonal S, equal variances
33

- Nearest mean classifier: classify based on the Euclidean distance to the nearest mean

$$g_i(\mathbf{x}) = -\frac{\|\mathbf{x} - \mathbf{m}_i\|^2}{2s^2} + \log \hat{P}(C_i) = -\frac{1}{2s^2}\sum_{j=1}^{d}\left(x_j^t - m_{ij}\right)^2 + \log \hat{P}(C_i)$$

- Each mean can be considered a prototype or template, and this is template matching
Lecture Notes for E Alpaydın 2004 Introduction to Machine
Learning © The MIT Press (V1.1)
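A minimal nearest-mean sketch, assuming NumPy and, as in the formula above, equal priors and equal spherical variances (so the prior and 1/(2s^2) terms do not change the argmax):

```python
# Nearest mean (template matching): pick the class whose mean is closest in
# Euclidean distance, which maximizes g_i(x) under equal priors and variances.
import numpy as np

def nearest_mean_predict(x, means):
    dists = [np.sum((x - m) ** 2) for m in means]   # ||x - m_i||^2 per class
    return int(np.argmin(dists))
```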
Diagonal S, equal variances
34
Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
Model Selection
35

- A different covariance matrix for each class means many parameters to estimate
- Small bias, large variance
- Common covariance matrices, diagonal covariance, etc. reduce the number of parameters
- This increases bias but controls variance
- Are there in-between states?
Lecture Notes for E Alpaydın 2004 Introduction to Machine
Learning © The MIT Press (V1.1)
Regularized Discriminant Analysis (RDA)
36

- a = b = 0: quadratic classifier
- a = 0, b = 1: shared covariance, linear classifier
- a = 1, b = 0: diagonal covariance
- Choose the best a, b by cross-validation (a sketch of the regularized covariance follows below)
Lecture Notes for E Alpaydın 2004 Introduction to Machine
Learning © The MIT Press (V1.1)
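The slide lists the special cases without the regularized covariance itself. As a hedged sketch (the parameter names a, b follow the slide, and the weighted combination below is my reading of Alpaydın's formulation), each class covariance is shrunk toward the pooled covariance S and toward a spherical matrix:

```python
# Regularized class covariance: a weighted combination of a spherical matrix,
# the pooled covariance S, and the class covariance S_i (assumed formulation).
import numpy as np

def rda_covariance(S_i, S, sigma2, a, b):
    d = S_i.shape[0]
    return a * sigma2 * np.eye(d) + b * S + (1.0 - a - b) * S_i

# a = b = 0     -> S_i unchanged: quadratic classifier
# a = 0, b = 1  -> shared covariance S: linear classifier
# a = 1, b = 0  -> sigma2 * I: the most constrained case listed on the slide
```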
Model Selection: Example
37
Lecture Notes for E Alpaydın 2004 Introduction to Machine
Learning © The MIT Press (V1.1)
Model Selection
38
Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)