
Neural Computation
0368-4149-01
Prof. Nathan Intrator
Tuesday 16:00-19:00 Schreiber 7
Office hours: Wed 4-5
[email protected]
Outline
• Goals for neural learning - Unsupervised
• Goals for statistical/computational learning
  – PCA
  – ICA
  – Exploratory Projection Pursuit
  – Search for non-Gaussian distributions
• Practical implementations
Statistical Approach to Unsupervised
Learning
• Understanding the nature of data variability
• Modeling the data (sometimes very flexible model)
• Understanding the nature of the noise
• Applying prior knowledge
• Extracting features based on:
  – Prior knowledge
  – Class prediction
  – Unsupervised learning
Principal Component Analysis.
Włodzisław Duch
SCE, NTU, Singapore
http://www.ntu.edu.sg/home/aswduch
Neuronal Goal
We look for axes which minimise projection errors and maximise
the variance after projection
Transform n-dimensional vectors into m-dimensional vectors, m < n.
Ex: transform from 2 to 1 dimension.
Algorithm (cont’d)
Preserve as much of the variance as possible.
[Figure: rotate the axes toward the direction with more information (variance), then project away the direction with less information.]
Linear transformations – example
2D vectors X in a unit circle with mean (1,1); Y = A*X, A = 2x2 matrix:
$$\begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} = \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}\begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$$
The shape is elongated, rotated and the mean is shifted.
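A minimal numpy sketch (not part of the slides) illustrating this example: points on a unit circle with mean (1,1) are mapped by A, which elongates and rotates the shape and shifts the mean.

```python
import numpy as np

# Points on a unit circle centred at (1, 1) -- the example distribution.
theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
X = np.vstack([1 + np.cos(theta), 1 + np.sin(theta)])   # shape (2, n)

A = np.array([[2.0, 1.0],
              [1.0, 1.0]])
Y = A @ X                                               # linear transformation Y = A X

print("mean of X:", X.mean(axis=1))   # ~ (1, 1)
print("mean of Y:", Y.mean(axis=1))   # ~ A @ (1, 1) = (3, 2): the mean is shifted
```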
Invariant distances
Euclidean distance is not invariant to general linear transformations
$$\mathbf{Y} = \mathbf{A}\mathbf{X}$$
$$\left\|\mathbf{Y}^{(1)} - \mathbf{Y}^{(2)}\right\|^2 = \left(\mathbf{Y}^{(1)} - \mathbf{Y}^{(2)}\right)^T\left(\mathbf{Y}^{(1)} - \mathbf{Y}^{(2)}\right) = \left(\mathbf{X}^{(1)} - \mathbf{X}^{(2)}\right)^T \mathbf{A}^T\mathbf{A}\left(\mathbf{X}^{(1)} - \mathbf{X}^{(2)}\right)$$
This is invariant only for orthonormal matrices, A^T A = I,
that make rigid rotations, without stretching or shrinking distances.
Idea: standardize the data in some way to create invariant distances.
Data standardization
For each vector X^{(j)T} = (X_1^{(j)}, ..., X_d^{(j)}), j = 1..n,
calculate mean and std: n – number of vectors, d – their dimension.
$$\bar{X}_i = \frac{1}{n}\sum_{j=1}^{n} X_i^{(j)}; \qquad \bar{\mathbf{X}} = \frac{1}{n}\sum_{j=1}^{n} \mathbf{X}^{(j)}$$
Arranging the vectors X^{(1)}, X^{(2)}, ..., X^{(n)} as the columns of a d x n data matrix,
the vector of mean feature values (X̄_1, ..., X̄_d) is obtained by averaging over the rows.
Standard deviation
Calculate the standard deviation:
$$\bar{X}_i = \frac{1}{n}\sum_{j=1}^{n} X_i^{(j)} \qquad \text{(vector of mean feature values)}$$
$$\sigma_i^2 = \frac{1}{n-1}\sum_{j=1}^{n}\left(X_i^{(j)} - \bar{X}_i\right)^2$$
Variance = square of the standard deviation (std): the sum of squared
deviations from the mean value, divided by n-1.
Transform X => Z, standardized data vectors:
$$Z_i^{(j)} = \left(X_i^{(j)} - \bar{X}_i\right) / \sigma_i$$
Std data: zero mean and unit variance.
$$\bar{Z}_i = \frac{1}{n}\sum_{j=1}^{n} Z_i^{(j)} = \frac{1}{n}\sum_{j=1}^{n}\left(X_i^{(j)} - \bar{X}_i\right)/\sigma_i = 0$$
$$\sigma_{Z,i}^2 = \frac{1}{n-1}\sum_{j=1}^{n}\left(Z_i^{(j)} - \bar{Z}_i\right)^2 = \frac{1}{n-1}\sum_{j=1}^{n}\left(X_i^{(j)} - \bar{X}_i\right)^2/\sigma_i^2 = \sigma_i^2/\sigma_i^2 = 1$$
Standardize the data after making the data transformation.
Effect: the data is invariant to scaling only (a diagonal transformation).
Are distances invariant? Is the data distribution the same?
How to make data invariant to any linear transformations?
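A minimal numpy sketch of the standardization above (assuming the d x n data-matrix convention, features in rows; the helper name is illustrative):

```python
import numpy as np

def standardize(X):
    """Z-score a d x n data matrix: zero mean and unit variance per feature (row)."""
    mean = X.mean(axis=1, keepdims=True)          # vector of mean feature values
    std = X.std(axis=1, ddof=1, keepdims=True)    # per-feature std with 1/(n-1)
    return (X - mean) / std

# Quick check on random data: means ~ 0, variances ~ 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 500))
Z = standardize(X)
print(Z.mean(axis=1))              # ~ 0
print(Z.var(axis=1, ddof=1))       # ~ 1
```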
Terminology (Covariance)
• How two dimensions vary from the mean with respect to each other
$$\mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)}{n-1}$$
• cov(X,Y) > 0: dimensions increase together
• cov(X,Y) < 0: one increases, the other decreases
• cov(X,Y) = 0: dimensions are uncorrelated
Terminology (Covariance Matrix)
• Contains covariance values between all possible dimensions:
$$C_{n \times n} = \left(c_{ij} \mid c_{ij} = \mathrm{cov}(\mathrm{Dim}_i, \mathrm{Dim}_j)\right)$$
• Example for three dimensions (x, y, z) (always symmetric):
$$\mathbf{C} = \begin{pmatrix} \mathrm{cov}(x,x) & \mathrm{cov}(x,y) & \mathrm{cov}(x,z) \\ \mathrm{cov}(y,x) & \mathrm{cov}(y,y) & \mathrm{cov}(y,z) \\ \mathrm{cov}(z,x) & \mathrm{cov}(z,y) & \mathrm{cov}(z,z) \end{pmatrix}$$
cov(x,x) = variance of component x
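A small numpy sketch (not from the slides) of such a covariance matrix for three dimensions; note that np.cov expects variables in rows:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=(3, 100))          # three dimensions (x, y, z), 100 samples
C = np.cov(data)                          # 3 x 3 covariance matrix, 1/(n-1) normalization

print(np.allclose(C, C.T))                # True: always symmetric
print(C[0, 0], data[0].var(ddof=1))       # cov(x, x) equals the variance of x
```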
Properties of the Cov matrix
• Can be used for creating a distance that
is not sensitive to linear transformation
• Can be used to find directions which
maximize the variance
• Determines a Gaussian distribution
uniquely (up to a shift)
Data standardization example
For our example Y = AX, assuming X means = 1 and variances = 1:
$$\begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} = \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}\begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$$
Transformation of the vector of mean feature values:
$$\bar{\mathbf{X}} = \begin{pmatrix} 1 \\ 1 \end{pmatrix} \;\Rightarrow\; \bar{\mathbf{Y}} = \mathbf{A}\bar{\mathbf{X}} = \begin{pmatrix} 3 \\ 2 \end{pmatrix}$$
Transformation of the variances (check it!):
$$\boldsymbol{\sigma}_X = \begin{pmatrix} 1 \\ 1 \end{pmatrix} \;\Rightarrow\; \boldsymbol{\sigma}_Y = \sqrt{\mathrm{Diag}\left(\mathbf{A}\mathbf{A}^T\right)} = \begin{pmatrix} \sqrt{5} \\ \sqrt{2} \end{pmatrix}$$
Distances are still not invariant:
$$\left\|\mathbf{Y}^{(1)} - \mathbf{Y}^{(2)}\right\|^2 = \left(\mathbf{X}^{(1)} - \mathbf{X}^{(2)}\right)^T \mathbf{A}^T\mathbf{A}\left(\mathbf{X}^{(1)} - \mathbf{X}^{(2)}\right)$$
How to make this invariant?
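A quick numerical check of the "check it!" claim above; a sketch assuming X has uncorrelated unit-variance components with mean 1:

```python
import numpy as np

rng = np.random.default_rng(2)
A = np.array([[2.0, 1.0],
              [1.0, 1.0]])

X = 1 + rng.normal(size=(2, 100_000))     # means = 1, variances = 1, uncorrelated
Y = A @ X

print(Y.mean(axis=1))                     # ~ (3, 2) = A @ (1, 1)
print(Y.std(axis=1, ddof=1))              # ~ (sqrt(5), sqrt(2))
print(np.sqrt(np.diag(A @ A.T)))          # the predicted sigma_Y
```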
Covariance matrix
Variance (spread around mean value) + correlation between features.
$$C_{ij} = \frac{1}{n-1}\sum_{k=1}^{n}\left(X_i^{(k)} - \bar{X}_i\right)\left(X_j^{(k)} - \bar{X}_j\right); \qquad i, j = 1 \ldots d$$
$$\mathbf{C}_X = \frac{1}{n-1}\sum_{k=1}^{n}\left(\mathbf{X}^{(k)} - \bar{\mathbf{X}}\right)\left(\mathbf{X}^{(k)} - \bar{\mathbf{X}}\right)^T = \frac{1}{n-1}\mathbf{X}\mathbf{X}^T \qquad (\mathbf{C}_X \text{ is } d \times d)$$
where X is the d x n dimensional matrix of vectors shifted to their means.
Covariance matrix is symmetric, Cij = Cji, and positive definite.
Diagonal elements are variances (squares of std), σi² = Cii.
Pearson correlation coefficient:
$$r_{ij} = C_{ij} / (\sigma_i \sigma_j) \in [-1, +1]$$
Spherical distribution of data has CX = I (unit matrix).
Elongated ellipsoids: large off-diagonal elements, strong correlations
between features.
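A sketch computing CX directly from the formula above (d x n matrix of mean-shifted vectors) and comparing with np.cov; the variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 4, 1000
X = rng.normal(size=(d, n))

Xc = X - X.mean(axis=1, keepdims=True)    # shift each vector to the feature means
C = (Xc @ Xc.T) / (n - 1)                 # C_X = X X^T / (n-1), a d x d matrix

print(np.allclose(C, np.cov(X)))          # True: matches numpy's covariance
sigma = np.sqrt(np.diag(C))
R = C / np.outer(sigma, sigma)            # Pearson correlation coefficients r_ij
print(np.allclose(np.diag(R), 1.0))       # True: each feature correlates perfectly with itself
```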
Mahalanobis distance
Linear combinations of features lead to rotations and scaling of the data:
$$\mathbf{Y} = \mathbf{A}\mathbf{X}; \quad \bar{\mathbf{Y}} = \mathbf{A}\bar{\mathbf{X}}; \quad \mathbf{C}_Y = \mathbf{A}\,\mathbf{C}_X\mathbf{A}^T$$
Mahalanobis distance:
$$\|\mathbf{X}\|_{\mathbf{C}_X}^2 = \mathbf{X}^T\mathbf{C}_X^{-1}\mathbf{X}$$
is invariant to linear transformations:
$$\left\|\mathbf{Y}^{(1)} - \mathbf{Y}^{(2)}\right\|_{\mathbf{C}_Y}^2 = \left(\mathbf{Y}^{(1)} - \mathbf{Y}^{(2)}\right)^T\mathbf{C}_Y^{-1}\left(\mathbf{Y}^{(1)} - \mathbf{Y}^{(2)}\right)$$
$$= \left(\mathbf{X}^{(1)} - \mathbf{X}^{(2)}\right)^T\mathbf{A}^T\left(\mathbf{A}^T\right)^{-1}\mathbf{C}_X^{-1}\mathbf{A}^{-1}\mathbf{A}\left(\mathbf{X}^{(1)} - \mathbf{X}^{(2)}\right) = \left\|\mathbf{X}^{(1)} - \mathbf{X}^{(2)}\right\|_{\mathbf{C}_X}^2$$
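A minimal sketch (not from the slides) verifying numerically that the Mahalanobis distance is unchanged under an invertible linear map Y = AX:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(2, 500))             # 2-D data, columns are samples
A = np.array([[2.0, 1.0],
              [1.0, 1.0]])
Y = A @ X

def mahalanobis_sq(u, v, C):
    """Squared Mahalanobis distance between two vectors given covariance C."""
    diff = u - v
    return diff @ np.linalg.solve(C, diff)

CX, CY = np.cov(X), np.cov(Y)
dX = mahalanobis_sq(X[:, 0], X[:, 1], CX)
dY = mahalanobis_sq(Y[:, 0], Y[:, 1], CY)
print(np.isclose(dX, dY))                 # True: invariant to the linear transformation
```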
Principal components
How to avoid correlated features?
Correlations ⇒ covariance matrix is non-diagonal!
Solution: diagonalize it, then use the transformation that makes it
diagonal to de-correlate the features. The Z(i) are the eigenvectors of CX:
$$\mathbf{Y} = \mathbf{Z}^T\mathbf{X}; \quad \mathbf{C}_X\mathbf{Z}^{(i)} = \lambda_i\mathbf{Z}^{(i)}; \quad \mathbf{C}_X\mathbf{Z} = \mathbf{Z}\boldsymbol{\Lambda}$$
$$\mathbf{C}_Y = \mathbf{Z}^T\mathbf{C}_X\mathbf{Z} = \mathbf{Z}^T\mathbf{Z}\boldsymbol{\Lambda} = \boldsymbol{\Lambda}$$
In matrix form: X, Y are d x n; Z, CX, CY are d x d.
C – symmetric, positive definite matrix: X^T C X > 0 for ||X|| > 0;
its eigenvectors are orthonormal,
$$\mathbf{Z}^{(i)T}\mathbf{Z}^{(j)} = \delta_{ij},$$
and its eigenvalues are all non-negative.
Z – matrix of orthonormal eigenvectors (because CX is real and symmetric);
it transforms X into Y, with diagonal CY, i.e. decorrelated features.
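A sketch of this decorrelation with numpy; np.linalg.eigh solves the symmetric eigenproblem and returns orthonormal eigenvectors:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(3, 1000))
X = np.array([[2.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.5, 0.0, 1.0]]) @ X       # introduce correlations between features

Xc = X - X.mean(axis=1, keepdims=True)
CX = np.cov(Xc)
lam, Z = np.linalg.eigh(CX)               # eigenvalues Lambda and orthonormal eigenvectors Z

Y = Z.T @ Xc                              # Y = Z^T X
CY = np.cov(Y)
print(np.allclose(CY, np.diag(lam), atol=1e-10))   # True: C_Y is diagonal (decorrelated)
```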
Matrix form
Eigenproblem for C matrix in matrix form:
$$\begin{pmatrix} C_{11} & C_{12} & \cdots & C_{1d} \\ C_{21} & C_{22} & \cdots & C_{2d} \\ \vdots & & \ddots & \vdots \\ C_{d1} & C_{d2} & \cdots & C_{dd} \end{pmatrix}\begin{pmatrix} Z_{11} & Z_{12} & \cdots & Z_{1d} \\ Z_{21} & Z_{22} & \cdots & Z_{2d} \\ \vdots & & \ddots & \vdots \\ Z_{d1} & Z_{d2} & \cdots & Z_{dd} \end{pmatrix} = \begin{pmatrix} Z_{11} & Z_{12} & \cdots & Z_{1d} \\ Z_{21} & Z_{22} & \cdots & Z_{2d} \\ \vdots & & \ddots & \vdots \\ Z_{d1} & Z_{d2} & \cdots & Z_{dd} \end{pmatrix}\begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_d \end{pmatrix}$$
$$\mathbf{C}_X\mathbf{Z} = \mathbf{Z}\boldsymbol{\Lambda}$$
Principal components
PCA: old idea, C. Pearson (1901), H. Hotelling 1933
$$\mathbf{Y} = \mathbf{Z}^T\mathbf{X}; \qquad \mathbf{C}_Y = \mathbf{Z}^T\mathbf{C}_X\mathbf{Z} = \boldsymbol{\Lambda}$$
Y – principal components: the vectors X
transformed using the eigenvectors of CX.
Covariance matrix of the transformed
vectors is diagonal ⇒ ellipsoidal
distribution of data.
Result: PC are linear combinations of all features, providing new
uncorrelated features, with diagonal covariance matrix = eigenvalues.
Small λi ⇒ small variance ⇒ data change little in direction Yi.
PCA minimizes CX reconstruction errors:
the Zi vectors for large λi are sufficient to get
$$\mathbf{Z}\boldsymbol{\Lambda}\mathbf{Z}^T \approx \mathbf{C}_X$$
because vectors for small eigenvalues will have a very
small contribution to the covariance matrix.
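An illustrative sketch (synthetic covariance, not lecture data) of this low-rank reconstruction when only the eigenvectors with the largest λi are kept:

```python
import numpy as np

rng = np.random.default_rng(6)
B = rng.normal(size=(5, 2))
CX = B @ B.T + 1e-3 * np.eye(5)           # covariance with 2 dominant directions

lam, Z = np.linalg.eigh(CX)               # eigenvalues in ascending order
top = np.argsort(lam)[::-1][:2]           # keep the 2 largest eigenvalues
C_approx = Z[:, top] @ np.diag(lam[top]) @ Z[:, top].T

err = np.linalg.norm(CX - C_approx) / np.linalg.norm(CX)
print(err)                                # small: discarded lambda_i contribute little
```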
Two components for visualization
Diagonalization methods: see Numerical Recipes, www.nr.com
New coordinate system:
axis ordered according to variance
= size of the eigenvalue.
First k dimensions account for the fraction
$$V_k = \frac{\sum_{i=1}^{k}\lambda_i}{\sum_{i=1}^{d}\lambda_i}$$
of all the variance (note that the λi are variances);
frequently 80-90% is sufficient for a rough description.
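A sketch of computing Vk with numpy; the function name explained_variance_fraction is illustrative, not from the lecture:

```python
import numpy as np

def explained_variance_fraction(C):
    """Return V_k = (sum of k largest eigenvalues) / (sum of all eigenvalues), k = 1..d."""
    lam = np.linalg.eigvalsh(C)[::-1]     # eigenvalues (variances), largest first
    return np.cumsum(lam) / lam.sum()

rng = np.random.default_rng(7)
X = rng.normal(size=(6, 300)) * np.array([[5, 3, 2, 1, 0.5, 0.1]]).T
Vk = explained_variance_fraction(np.cov(X))
k = int(np.searchsorted(Vk, 0.9)) + 1     # smallest k capturing ~90% of the variance
print(Vk, k)
```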
Solving for Eigenvalues & Eigenvectors
• Vectors x having the same direction as Ax are called
eigenvectors of A (A is an n by n matrix).
• In the equation Ax = λx, λ is called an eigenvalue of A.
• Ax = λx ⇒ (A - λI)x = 0
• How to calculate x and λ:
– Calculate det(A - λI); this yields a polynomial of degree n
– Determine the roots of det(A - λI) = 0; the roots are the eigenvalues λ
– Solve (A - λI)x = 0 for each λ to obtain the eigenvectors x (see the sketch below)
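A small sketch of these steps for a 2 x 2 matrix, finding the eigenvalues as roots of the characteristic polynomial and then solving for the eigenvectors:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 1.0]])

# det(A - lambda*I) = lambda^2 - trace(A)*lambda + det(A) for a 2x2 matrix
coeffs = [1.0, -np.trace(A), np.linalg.det(A)]
eigvals = np.roots(coeffs)                 # roots of the characteristic polynomial

for lam in eigvals:
    # Solve (A - lambda*I) x = 0: take a vector orthogonal to the first row of M
    M = A - lam * np.eye(2)
    x = np.array([-M[0, 1], M[0, 0]])
    x /= np.linalg.norm(x)
    print(lam, np.allclose(A @ x, lam * x))   # True: A x = lambda x
```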
PCA properties
PC Analysis (PCA) may be achieved by:
• transformation making covariance matrix diagonal
• projecting the data onto a line for which the sum of squares of
distances from the original points to their projections is minimal.
• orthogonal transformation to new variables that have stationary
variances
True covariance matrices are usually not known, estimated from data.
This works well on single-cluster data; more complex structure may
require local PCA, separately for each cluster.
PC is useful for: finding new, more informative, uncorrelated features;
reducing dimensionality: reject low variance features,
reconstructing covariance matrices from low-dim data.
PCA Wisconsin example
Wisconsin Breast Cancer data:
• Collected at the University of Wisconsin Hospitals, USA.
• 699 cases, 458 (65.5%) benign (red), 241 malignant (green).
• 9 features: quantized 1, 2 .. 10, cell properties, ex:
Clump Thickness, Uniformity of Cell Size, Shape, Marginal
Adhesion, Single Epithelial Cell Size, Bare Nuclei,
Bland Chromatin, Normal
Nucleoli, Mitoses.
2D scatterograms do not show
any structure no matter which
subspaces are taken!
Example cont.
PC gives useful information
already in 2D.
Taking first PCA component of
the standardized data:
If (Y1 > 0.41) then benign,
else malignant:
18 errors / 699 cases = 97.4% accuracy.
Transformed vectors are not
standardized, std’s are below.
Eigenvalues converge
slowly, but classes are
separated well.
PCA disadvantages
Useful for dimensionality reduction but:
• Largest variance determines which components are used, but
does not guarantee an interesting viewpoint for clustering data.
• The meaning of features is lost when linear combinations are formed.
Analysis of coefficients in Z1 and other important eigenvectors may
show which original features are given much weight.
PCA may also be done efficiently by performing singular
value decomposition of the standardized data matrix (see the sketch below).
PCA is also called the Karhunen-Loève transformation.
Many variants of PCA are described in A. Webb, Statistical pattern
recognition, J. Wiley 2002.
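As noted above, PCA can be obtained from the singular value decomposition of the standardized (here just mean-shifted) data matrix; a minimal sketch using the d x n convention:

```python
import numpy as np

rng = np.random.default_rng(8)
d, n = 5, 200
X = rng.normal(size=(d, n))
Xc = X - X.mean(axis=1, keepdims=True)     # mean-shifted d x n data matrix

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
eigvals = s**2 / (n - 1)                   # eigenvalues of C_X = X X^T / (n-1)
Z = U                                      # principal directions (eigenvectors of C_X)
Y = Z.T @ Xc                               # principal component scores
print(Y.shape)                             # (5, 200): components as new features

print(np.allclose(np.sort(eigvals), np.sort(np.linalg.eigvalsh(np.cov(Xc)))))  # True
```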
Exercise (will be part of Ex. 1)
• How would you calculate efficiently the PCA
of data where the dimensionality d is much
larger than the number of vector observations
n?
2 skewed distributions
PCA transformation for 2D data:
First component will be chosen along
the largest variance line, both clusters
will strongly overlap, no interesting
structure will be visible.
In fact, projection onto the axis orthogonal to
the first PCA component has much
more discriminating power.
Discriminant coordinates should be
used to reveal class structure.