Multivariate Methods
MACHINE LEARNING
6. Multivariate Methods
Motivating Example
Loan Application
Observation Vector: Information About the Customer
Age
Marital Status
Yearly Income
Savings
Inputs/Attributes/Features associated with a customer
The variables are correlated (savings vs. age)
Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
Correlation
Suppose we have two random variables X and Y.
We want to estimate the degree of “correlation” between them
Positive correlation: if one happens to be large, the probability that the other one will be large is significant
Negative correlation: if one happens to be large, the probability that the other one will be small is significant
Zero correlation: the value of one tells nothing about the value of the other
Correlation
Some reasonable assumptions
The “correlation” between X and Y is the same as between X+a and Y+b, where a, b are constants
The “correlation” between X and Y is the same as between aX and bY, where a, b are positive constants
Example
If there is a connection between the temperature inside the building and outside the building, it does not matter what scale is used
Correlation
Let’s do a “normalization”:

$$X_1 = \frac{X - EX}{\sigma_X}, \qquad Y_1 = \frac{Y - EY}{\sigma_Y}$$

Both these variables have zero mean and unit variance
Filtered out the individual differences
Let’s check the mean (expected) square difference between them:

$$E[(X_1 - Y_1)^2]$$
Correlation
$$E[(X_1 - Y_1)^2]$$

The result should be:
Small when positively “correlated”
Large when negatively correlated
Medium when “uncorrelated”
Correlation
$$E[(X_1 - Y_1)^2] = E[X_1^2 + Y_1^2 - 2X_1Y_1] = EX_1^2 + EY_1^2 - 2E[X_1Y_1] = 2 - 2E[X_1Y_1]$$

$$E[X_1Y_1] = E\left[\frac{(X - EX)(Y - EY)}{\sigma_X \sigma_Y}\right] = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$$

This is the correlation coefficient. Larger covariance means a larger correlation coefficient, which means a smaller average square difference
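A minimal numpy sketch of this identity (the distributions and sample sizes here are illustrative): standardize two correlated samples and check that the mean square difference comes out near 2 − 2·Corr(X, Y).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative positively correlated pair: Y = X + noise.
x = rng.normal(loc=5.0, scale=2.0, size=100_000)
y = x + rng.normal(scale=1.0, size=100_000)

# "Normalization": zero mean, unit variance.
x1 = (x - x.mean()) / x.std()
y1 = (y - y.mean()) / y.std()

msd = np.mean((x1 - y1) ** 2)   # E[(X1 - Y1)^2]
rho = np.corrcoef(x, y)[0, 1]   # correlation coefficient
print(msd, 2 - 2 * rho)         # the two values should nearly agree
```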
Correlation vs. Dependence
Not the same thing
Independent ⇒ zero correlation
Zero correlation ⇒ may not be independent
We look at the square difference between two variables:

$$E[(X_1 - Y_1)^2]$$

Two variables might have “unpredictable” square differences but still be dependent
Correlation vs. Independence
Random variable X from {−1, 0, 1}, each value with p = 1/3
Random variable Y = X²
Clearly dependent, but

$$\mathrm{Cov}(X, Y) = E[(X - 0)(Y - EY)] = E[XY] - EX \cdot EY = E[XY] = E[X^3] = 0$$

Correlation only measures “linear” dependence
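The same example can be checked numerically; a small sketch (the sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

# X uniform on {-1, 0, 1}; Y = X^2 is completely determined by X.
x = rng.choice([-1, 0, 1], size=100_000)
y = x ** 2

# Sample covariance is ~0 despite the perfect (nonlinear) dependence.
print(np.cov(x, y)[0, 1])
```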
Multivariate Distribution
Assume all members of a class came from a joint distribution
Can learn the distributions from data: P(x|C)
Assign a new instance to the most probable class P(C|x) using Bayes’ rule
An instance is described by a vector of correlated parameters
Realm of multivariate distributions
Multivariate normal
Multivariate Parameters
$$\text{Mean: } E[\mathbf{x}] = \boldsymbol{\mu} = [\mu_1, \ldots, \mu_d]^T$$

$$\text{Covariance: } \sigma_{ij} \equiv \mathrm{Cov}(X_i, X_j)$$

$$\text{Correlation: } \mathrm{Corr}(X_i, X_j) \equiv \rho_{ij} = \frac{\sigma_{ij}}{\sigma_i \sigma_j}$$

$$\Sigma \equiv \mathrm{Cov}(\mathbf{X}) = E\left[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T\right] = \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1d} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2d} \\ \vdots & & \ddots & \vdots \\ \sigma_{d1} & \sigma_{d2} & \cdots & \sigma_d^2 \end{bmatrix}$$
Multivariate Data
Multiple measurements (sensors)
d inputs/features/attributes: d-variate
N instances/observations/examples
$$\mathbf{X} = \begin{bmatrix} X_1^1 & X_2^1 & \cdots & X_d^1 \\ X_1^2 & X_2^2 & \cdots & X_d^2 \\ \vdots & & & \vdots \\ X_1^N & X_2^N & \cdots & X_d^N \end{bmatrix}$$
Parameter Estimation
$$\text{Sample mean } \mathbf{m}: \quad m_i = \frac{\sum_{t=1}^{N} x_i^t}{N}, \quad i = 1, \ldots, d$$

$$\text{Covariance matrix } \mathbf{S}: \quad s_{ij} = \frac{\sum_{t=1}^{N} (x_i^t - m_i)(x_j^t - m_j)}{N}$$

$$\text{Correlation matrix } \mathbf{R}: \quad r_{ij} = \frac{s_{ij}}{s_i s_j}$$
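A small numpy sketch of these estimators (dividing by N as on the slide; note that np.cov would divide by N − 1 by default):

```python
import numpy as np

def estimate_parameters(X):
    """Sample mean m, covariance S (divisor N), correlation R."""
    N, d = X.shape
    m = X.mean(axis=0)            # m_i
    Xc = X - m                    # centered data
    S = (Xc.T @ Xc) / N           # s_ij
    s = np.sqrt(np.diag(S))       # per-feature standard deviations
    R = S / np.outer(s, s)        # r_ij = s_ij / (s_i * s_j)
    return m, S, R

# Illustrative data: N = 500 instances, d = 3 features.
rng = np.random.default_rng(2)
X = rng.multivariate_normal([0, 1, 2],
                            [[2, 1, 0], [1, 2, 0], [0, 0, 1]], size=500)
m, S, R = estimate_parameters(X)
```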
Estimation of Missing Values
What to do if certain instances have missing attributes?
Ignore those instances: not a good idea if the sample is small
Use ‘missing’ as an attribute: may give information
Imputation: fill in the missing value
Mean imputation: use the most likely value (e.g., the mean)
Imputation by regression: predict based on other attributes
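A minimal sketch of mean imputation, assuming missing entries are coded as NaN in a numpy array (the data values are hypothetical):

```python
import numpy as np

# Hypothetical (age, income) data with NaN marking missing attributes.
X = np.array([[25.0, 40_000.0],
              [32.0, np.nan],
              [np.nan, 55_000.0],
              [41.0, 62_000.0]])

# Mean imputation: replace each missing entry with its column mean.
col_means = np.nanmean(X, axis=0)
rows, cols = np.nonzero(np.isnan(X))
X[rows, cols] = col_means[cols]
```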
Multivariate Normal
Have d attributes
Often can assume each one is distributed normally
Attributes might be dependent/correlated
Joint distribution of several correlated variables:
P(X1 = x1, X2 = x2, …, Xd = xd) = ?
Each Xi is normally distributed with mean µi and variance σi²
Multivariate Normal
$$\mathbf{x} \sim N_d(\boldsymbol{\mu}, \Sigma)$$

$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right]$$

Mahalanobis distance: (x – μ)T Σ–1 (x – μ)
When two variables are correlated, their deviations are divided by the (large) covariance through Σ–1, so they contribute less to the Mahalanobis distance and therefore more to the probability
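A direct transcription of this density into numpy (a sketch; for production use, scipy.stats.multivariate_normal is the standard route):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """N_d(mu, Sigma) density; `maha` is the squared Mahalanobis distance."""
    d = len(mu)
    diff = x - mu
    maha = diff @ np.linalg.inv(Sigma) @ diff
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * maha) / norm

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])   # strongly correlated pair
print(mvn_pdf(np.array([1.0, 1.0]), mu, Sigma))
```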
Bivariate Normal
Multivariate Normal Distribution
Mahalanobis distance: (x – μ)T Σ–1 (x – μ)
measures the distance from x to μ in terms of Σ
(normalizes for differences in variances and correlations)
Bivariate: d = 2
$$p(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - \rho^2}} \exp\left[-\frac{1}{2(1 - \rho^2)}\left(z_1^2 - 2\rho z_1 z_2 + z_2^2\right)\right]$$

$$z_i = \frac{x_i - \mu_i}{\sigma_i}$$
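As a sanity check, this bivariate form can be coded up directly; for Σ = [[σ1², ρσ1σ2], [ρσ1σ2, σ2²]] it agrees with the general d-dimensional density sketched earlier.

```python
import numpy as np

def bivariate_pdf(x1, x2, mu1, mu2, s1, s2, rho):
    """Bivariate normal density written with z_i = (x_i - mu_i) / sigma_i."""
    z1, z2 = (x1 - mu1) / s1, (x2 - mu2) / s2
    q = (z1 ** 2 - 2 * rho * z1 * z2 + z2 ** 2) / (1 - rho ** 2)
    return np.exp(-0.5 * q) / (2 * np.pi * s1 * s2 * np.sqrt(1 - rho ** 2))

print(bivariate_pdf(1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.8))  # ~0.152
```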
Bivariate Normal
Independent Inputs: Naive Bayes
If xi are independent, the off-diagonals of Σ are 0, and the
Mahalanobis distance reduces to a weighted (by
1/σi) Euclidean distance:
$$p(\mathbf{x}) = \prod_{i=1}^{d} p_i(x_i) = \frac{1}{(2\pi)^{d/2} \prod_{i=1}^{d} \sigma_i} \exp\left[-\frac{1}{2} \sum_{i=1}^{d} \left(\frac{x_i - \mu_i}{\sigma_i}\right)^2\right]$$
If variances are also equal, it reduces to the Euclidean distance
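A sketch of the independent-inputs density as a product of univariate normals (equivalently, N_d with a diagonal Σ); all values are illustrative:

```python
import numpy as np

def diag_normal_pdf(x, mu, sigma):
    """Product of d univariate normal densities = N_d(mu, diag(sigma^2))."""
    z = (x - mu) / sigma
    norm = (2 * np.pi) ** (len(mu) / 2) * np.prod(sigma)
    return np.exp(-0.5 * np.sum(z ** 2)) / norm

x = np.array([1.0, -0.5, 2.0])
mu = np.array([0.0, 0.0, 1.0])
sigma = np.array([1.0, 2.0, 0.5])
print(diag_normal_pdf(x, mu, sigma))
```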
Projection Distribution
Example: a vector of 3 features
Multivariate normal distribution
Projection onto a 2-dimensional space (e.g., the XY plane)
Vectors of 2 features
Projections are also multivariate normal
The projection of a d-dimensional normal onto a k-dimensional space is a k-dimensional normal
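This is easy to verify empirically; a sketch with an illustrative 3-D normal projected onto the XY plane:

```python
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[2.0, 1.0, 0.5],
                  [1.0, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])

# Sample 3-D normal vectors and project onto the XY plane (drop Z).
X = rng.multivariate_normal(mu, Sigma, size=200_000)
XY = X[:, :2]

# The projection's mean/covariance match the corresponding sub-vector
# and sub-matrix of (mu, Sigma): the marginal is itself 2-D normal.
print(XY.mean(axis=0), mu[:2])
print(np.cov(XY.T), Sigma[:2, :2])
```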
1D projection
Multivariate Classification
Assume members of a class come from a single multivariate distribution
Multivariate normal is a good choice:
Easy to analyze
Models many natural phenomena
Models a class as having a single prototype source (the mean) slightly randomly changed
Example
Matching cars to customers
Each car defines a class of matching customers
Customers are described by (age, income)
There is a correlation between age and income
Assume each class is multivariate normal
Need to learn P(x|C) from data
Use Bayes’ rule to compute P(C|x)
Parametric Classification
If p(x|Ci) ~ N(μi, Σi):

$$p(\mathbf{x} \mid C_i) = \frac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)^T \Sigma_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i)\right]$$

Discriminant functions are

$$g_i(\mathbf{x}) = \log P(C_i \mid \mathbf{x}) = \log \frac{p(\mathbf{x} \mid C_i)\, P(C_i)}{P(\mathbf{x})} = \log p(\mathbf{x} \mid C_i) + \log P(C_i) - \log P(\mathbf{x})$$

$$= -\frac{d}{2}\log 2\pi - \frac{1}{2}\log |\Sigma_i| - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)^T \Sigma_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) + \log P(C_i) - \log P(\mathbf{x})$$

Need to know the covariance matrix and the mean to compute the discriminant functions.
Can ignore P(x) as it is the same for all classes
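A sketch of this discriminant in numpy (slogdet is used for numerical stability; the shared log P(x) term is dropped):

```python
import numpy as np

def discriminant(x, mu, Sigma, prior):
    """g_i(x) = log p(x|C_i) + log P(C_i), omitting the shared log P(x)."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    maha = diff @ np.linalg.inv(Sigma) @ diff
    return (-0.5 * d * np.log(2 * np.pi) - 0.5 * logdet
            - 0.5 * maha + np.log(prior))
```

Classification then picks the class with the largest g_i(x).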
Estimation of Parameters
$$\hat{P}(C_i) = \frac{\sum_t r_i^t}{N}$$

$$\mathbf{m}_i = \frac{\sum_t r_i^t \mathbf{x}^t}{\sum_t r_i^t}$$

$$\mathbf{S}_i = \frac{\sum_t r_i^t (\mathbf{x}^t - \mathbf{m}_i)(\mathbf{x}^t - \mathbf{m}_i)^T}{\sum_t r_i^t}$$

where $r_i^t = 1$ if $\mathbf{x}^t \in C_i$ and $0$ otherwise.

$$g_i(\mathbf{x}) = -\frac{1}{2}\log |\mathbf{S}_i| - \frac{1}{2}(\mathbf{x} - \mathbf{m}_i)^T \mathbf{S}_i^{-1} (\mathbf{x} - \mathbf{m}_i) + \log \hat{P}(C_i)$$
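A sketch of these estimators for one class, with r the 0/1 indicator vector r_i^t:

```python
import numpy as np

def fit_class(X, r):
    """Prior, mean, and covariance for one class from labeled data."""
    n_i = r.sum()
    prior = n_i / len(r)                    # P_hat(C_i)
    m = (r[:, None] * X).sum(axis=0) / n_i  # m_i
    Xc = X - m
    S = (r[:, None] * Xc).T @ Xc / n_i      # S_i
    return prior, m, S

# Usage with integer labels y: r = (y == i).astype(float) for class i.
```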
Covariance Matrix per Class
Quadratic discriminant
$$g_i(\mathbf{x}) = -\frac{1}{2}\log |\mathbf{S}_i| - \frac{1}{2}\left(\mathbf{x}^T \mathbf{S}_i^{-1} \mathbf{x} - 2\mathbf{x}^T \mathbf{S}_i^{-1} \mathbf{m}_i + \mathbf{m}_i^T \mathbf{S}_i^{-1} \mathbf{m}_i\right) + \log \hat{P}(C_i)$$

$$= \mathbf{x}^T \mathbf{W}_i \mathbf{x} + \mathbf{w}_i^T \mathbf{x} + w_{i0}$$

where

$$\mathbf{W}_i = -\frac{1}{2}\mathbf{S}_i^{-1}$$

$$\mathbf{w}_i = \mathbf{S}_i^{-1} \mathbf{m}_i$$

$$w_{i0} = -\frac{1}{2}\mathbf{m}_i^T \mathbf{S}_i^{-1} \mathbf{m}_i - \frac{1}{2}\log |\mathbf{S}_i| + \log \hat{P}(C_i)$$

Requires estimation of K·d·(d+1)/2 parameters for the covariance matrices
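A sketch that assembles the quadratic pieces W_i, w_i, w_i0 from the per-class estimates above:

```python
import numpy as np

def quadratic_pieces(m, S, prior):
    """W_i, w_i, w_i0 such that g_i(x) = x^T W_i x + w_i^T x + w_i0."""
    S_inv = np.linalg.inv(S)
    W = -0.5 * S_inv
    w = S_inv @ m
    w0 = (-0.5 * m @ S_inv @ m
          - 0.5 * np.linalg.slogdet(S)[1]
          + np.log(prior))
    return W, w, w0

def g_quadratic(x, W, w, w0):
    return x @ W @ x + w @ x + w0
```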
[Figure: likelihoods p(x|Ci), the posterior for C1, and the discriminant where P(C1|x) = 0.5]
Common Covariance Matrix S
If there is not enough data, we can assume all classes have the same common sample covariance matrix S:

$$\mathbf{S} = \sum_i \hat{P}(C_i)\, \mathbf{S}_i$$

The discriminant reduces to a linear discriminant (xT S–1 x is common to all discriminants and can be removed):

$$g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x} - \mathbf{m}_i)^T \mathbf{S}^{-1} (\mathbf{x} - \mathbf{m}_i) + \log \hat{P}(C_i)$$

$$g_i(\mathbf{x}) = \mathbf{w}_i^T \mathbf{x} + w_{i0}$$

$$\text{where } \mathbf{w}_i = \mathbf{S}^{-1}\mathbf{m}_i, \qquad w_{i0} = -\frac{1}{2}\mathbf{m}_i^T \mathbf{S}^{-1} \mathbf{m}_i + \log \hat{P}(C_i)$$
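The corresponding sketch for the shared-covariance case; S is pooled across classes and inverted once:

```python
import numpy as np

def shared_covariance(priors, covs):
    """S = sum_i P_hat(C_i) S_i (pooled within-class covariance)."""
    return sum(p * S for p, S in zip(priors, covs))

def linear_pieces(m, S_inv, prior):
    """w_i, w_i0 for g_i(x) = w_i^T x + w_i0 (x^T S^-1 x term dropped)."""
    w = S_inv @ m
    w0 = -0.5 * m @ S_inv @ m + np.log(prior)
    return w, w0
```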
Common Covariance Matrix S
Diagonal S
When xj, j = 1,...,d, are independent, Σ is diagonal
p(x|Ci) = ∏j p(xj|Ci) (Naive Bayes’ assumption)
$$g_i(\mathbf{x}) = -\frac{1}{2}\sum_{j=1}^{d}\left(\frac{x_j^t - m_{ij}}{s_j}\right)^2 + \log \hat{P}(C_i)$$
Classify based on weighted Euclidean distance (in sj
units) to the nearest mean
Diagonal S
[Figure: variances may be different]
Diagonal S, equal variances
Nearest mean classifier: Classify based on Euclidean
distance to the nearest mean
$$g_i(\mathbf{x}) = -\frac{\|\mathbf{x} - \mathbf{m}_i\|^2}{2s^2} + \log \hat{P}(C_i) = -\frac{1}{2s^2}\sum_{j=1}^{d}\left(x_j^t - m_{ij}\right)^2 + \log \hat{P}(C_i)$$
Each mean can be considered a prototype or
template and this is template matching
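A sketch of the nearest-mean rule; with equal priors the log P(Ci) term drops out and only the Euclidean distance to each mean matters:

```python
import numpy as np

def nearest_mean_class(x, means, priors, s2):
    """argmax_i of -||x - m_i||^2 / (2 s^2) + log P_hat(C_i)."""
    g = [-np.sum((x - m) ** 2) / (2 * s2) + np.log(p)
         for m, p in zip(means, priors)]
    return int(np.argmax(g))
```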
Diagonal S, equal variances
Model Selection
Different covariance matrix for each class:
Have to estimate many parameters
Small bias, large variance
Common covariance matrices, diagonal covariance, etc. reduce the number of parameters:
Increase bias but control variance
In-between states?
Regularized Discriminant Analysis (RDA)
The class covariance is shrunk toward a shared and a diagonal matrix; in Alpaydın’s formulation, S′i = a σ²I + b S + (1 − a − b) Si:
a=b=0: quadratic classifier
a=0, b=1: shared covariance, linear classifier
a=1, b=0: diagonal covariance
Choose the best a, b by cross-validation
Model Selection: Example
Model Selection