Transcript Document
(Semi-)Supervised Probabilistic
Principal Component Analysis
Shipeng Yu
University of Munich, Germany
Siemens Corporate Technology
http://www.dbs.ifi.lmu.de/~spyu
Joint work with Kai Yu, Volker Tresp,
Hans-Peter Kriegel, Mingrui Wu
Dimensionality Reduction
We are dealing with high-dimensional data
Texts: e.g. bag-of-words features
Images: color histograms, correlograms, etc.
Web pages: text, links, structure, etc.
Motivations
Noisy features: how to remove or down-weight them?
Learnability: “curse of dimensionality”
Inefficiency: high computational cost
Visualization
A pre-processing step for many data mining tasks
Unsupervised versus Supervised
Unsupervised Dimensionality Reduction
Only the input data are given
PCA (principal component analysis)
Supervised Dimensionality Reduction
Should be biased by the outputs
Classification: FDA (Fisher discriminant analysis)
Regression: PLS (partial least squares)
RVs: CCA (canonical correlation analysis)
More general solutions?
Semi-Supervised?
Our Settings and Notations
N data points, M input features, L output labels
We aim to derive a mapping $\mathcal{X} \mapsto \mathcal{V}$ such that $\mathcal{V} \subset \mathbb{R}^K$, $K < M$
[Figure: data matrices for the three settings. Unsupervised: only the N x M input matrix X (features x_1, ..., x_M) is given. Supervised: X together with the N x L output matrix Y (labels y_1, ..., y_L). Semi-supervised: only the first N_1 rows of X have outputs in Y; the remaining N - N_1 rows are unlabeled data.]
Outline
Principal Component Analysis
Probabilistic PCA
Supervised Probabilistic PCA
Related Work
Conclusion
PCA: Motivation
Find the K orthogonal projection directions that capture the most data variance
Applications
Visualization
De-noising
Latent semantic indexing
Eigenfaces
…
[Figure: 2-D scatter plot of sample data with the 1st PC and 2nd PC directions overlaid.]
PCA: Algorithm
Basic Algorithm
Centralize the data: $x_i \leftarrow x_i - \bar{x}$, where $\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$
Compute the sample covariance matrix: $S = \frac{1}{N} \sum_{i=1}^{N} x_i x_i^\top = \frac{1}{N} X^\top X$
Do eigen-decomposition (sort eigenvalues decreasingly): $S = U \Lambda U^\top$
PCA directions are given in $U_K$, the first K columns of $U$
The PCA projection of a test point $x^*$ is $v^* = U_K^\top x^*$
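As a concrete illustration of these steps, here is a minimal NumPy sketch of the basic algorithm (the function name and interface are my own, not from the slides):

```python
# Minimal PCA sketch following the slide's steps (illustrative, not the talk's code).
import numpy as np

def pca(X, K):
    """Project the rows of X (N x M) onto the top-K principal directions."""
    x_bar = X.mean(axis=0)                 # mean vector
    Xc = X - x_bar                         # centralize the data
    S = Xc.T @ Xc / X.shape[0]             # sample covariance matrix (M x M)
    eigvals, U = np.linalg.eigh(S)         # eigen-decomposition
    order = np.argsort(eigvals)[::-1]      # sort eigenvalues decreasingly
    U_K = U[:, order[:K]]                  # first K eigenvectors
    V = Xc @ U_K                           # projections v = U_K^T (x - x_bar)
    return V, U_K, x_bar
```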
Supervised PCA?
PCA is unsupervised
When output information is available:
Classification labels: 0/1
Regression responses: real values
Ranking orders: rank labels / preferences
Multi-outputs: output dimension > 1
Structured outputs, …
Can PCA be biased by outputs?
And how?
Outline
Principal Component Analysis
Probabilistic PCA
Supervised Probabilistic PCA
Related Work
Conclusion
Latent Variable Model for PCA
Another interpretation of PCA [Pearson 1901]
PCA minimizes the reconstruction error of X:
$\min_{A, V} \|X - VA\|_F^2$ s.t. $V^\top V = I$
$V \in \mathbb{R}^{N \times K}$ are latent variables: the PCA projections of X
$A \in \mathbb{R}^{K \times M}$ are factor loadings: the PCA mappings
Equivalent to PCA up to a scaling factor
Leads to the idea of PPCA
Probabilistic PCA [TipBis99]
[Graphical model: latent variable v generates the observation x through W_x with noise level σ_x; plate over the N data points.]
Latent variable model: $x = W_x v + \mu_x + \epsilon_x$, with mean vector $\mu_x$
Latent variables: $v \sim \mathcal{N}(0, I)$
Noise process: $\epsilon_x \sim \mathcal{N}(0, \sigma_x^2 I)$
Conditional distribution: $x \mid v \sim \mathcal{N}(W_x v + \mu_x, \sigma_x^2 I)$, i.e. the input dimensions are conditionally independent given v
If $\sigma_x^2 \to 0$, PPCA leads to the PCA solution (up to a rotation and scaling factor)
x is Gaussian distributed: $P(x) = \int P(x \mid v) P(v)\, dv$
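To make the generative story concrete, below is a small sketch that samples data from this model; the dimensions, seed, and variable names are illustrative assumptions, not values from the talk.

```python
# Sampling from the PPCA generative model x = W_x v + mu_x + eps_x (illustrative).
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 500, 10, 2                       # assumed sizes for the sketch
W_x = rng.normal(size=(M, K))              # factor loadings
mu_x = rng.normal(size=M)                  # mean vector
sigma_x = 0.1                              # noise standard deviation

V = rng.normal(size=(N, K))                # v ~ N(0, I)
eps = sigma_x * rng.normal(size=(N, M))    # eps_x ~ N(0, sigma_x^2 I)
X = V @ W_x.T + mu_x + eps                 # x | v ~ N(W_x v + mu_x, sigma_x^2 I)
```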
From Unsupervised to Supervised
Key insights of PPCA
All the M input dimensions are conditionally independent given the K latent variables
In PCA we are seeking the K latent variables that best explain the data covariance
When we have outputs $Y \in \mathbb{R}^{N \times L}$, we believe:
There are inter-covariances between X and Y
There are intra-covariances within Y if L > 1
Idea: let the latent variables explain all of them!
Outline
Principal Component Analysis
Probabilistic PCA
Supervised Probabilistic PCA
Related Work
Conclusion
Supervised Probabilistic PCA
[Graphical model: a shared latent variable v generates x (through W_x, noise σ_x) and y (through W_y, noise σ_y); plate over the N data points.]
Supervised latent variable model:
$x = W_x v + \mu_x + \epsilon_x$
$y = W_y v + \mu_y + \epsilon_y$
Noise processes: $\epsilon_x \sim \mathcal{N}(0, \sigma_x^2 I)$, $\epsilon_y \sim \mathcal{N}(0, \sigma_y^2 I)$
All input and output dimensions are conditionally independent given v:
$x \mid v \sim \mathcal{N}(W_x v + \mu_x, \sigma_x^2 I)$, $y \mid v \sim \mathcal{N}(W_y v + \mu_y, \sigma_y^2 I)$
$(x, y)$ are jointly Gaussian distributed: $P(x, y) = \int P(x \mid v) P(y \mid v) P(v)\, dv$
Semi-Supervised Probabilistic PCA
[Figure: the SPPCA graphical model (shared latent variables v_1, ..., v_K generating x_1, ..., x_M and y_1, ..., y_L) applied to a data matrix in which only the first N_1 points have observed outputs Y; the remaining N - N_1 points are unlabeled data.]
Idea: an SPPCA model with missing data!
Likelihood:
$P(X, Y) = \prod_{n=1}^{N_1} P(x_n, y_n) \prod_{n' = N_1 + 1}^{N} P(x_{n'})$
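For illustration, here is a sketch that evaluates this factorized likelihood using the Gaussian marginals implied by the model; the function name and argument layout are my own assumptions.

```python
# Sketch: log P(X, Y) = sum over labeled points of log P(x, y) + sum over unlabeled points of log P(x).
import numpy as np
from scipy.stats import multivariate_normal

def s2ppca_log_likelihood(X, Y, W_x, W_y, mu_x, mu_y, sigma_x, sigma_y, N1):
    """X: N x M inputs; Y: N1 x L outputs for the first N1 (labeled) points."""
    M, L = W_x.shape[0], W_y.shape[0]
    # Marginal of (x, y): N([mu_x; mu_y], W W^T + diag(sigma_x^2 I, sigma_y^2 I)), W = [W_x; W_y]
    W = np.vstack([W_x, W_y])
    C_xy = W @ W.T + np.diag(np.r_[np.full(M, sigma_x**2), np.full(L, sigma_y**2)])
    # Marginal of x alone: N(mu_x, W_x W_x^T + sigma_x^2 I)
    C_x = W_x @ W_x.T + sigma_x**2 * np.eye(M)

    ll = multivariate_normal.logpdf(np.hstack([X[:N1], Y]), np.r_[mu_x, mu_y], C_xy).sum()
    if N1 < X.shape[0]:
        ll += multivariate_normal.logpdf(X[N1:], mu_x, C_x).sum()
    return ll
```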
S2PPCA: EM Learning
Model parameters: $W_x, W_y, \mu_x, \mu_y, \sigma_x^2, \sigma_y^2$
EM Learning
E-step: estimate v for each data point (a projection / inference problem)
$P(v_n \mid x_n, y_n) \propto P(x_n \mid v_n) P(y_n \mid v_n) P(v_n)$ for labeled points
$P(v_{n'} \mid x_{n'}) \propto P(x_{n'} \mid v_{n'}) P(v_{n'})$ for unlabeled points
M-step: maximize the data log-likelihood w.r.t. the parameters
An extension of EM learning for the PPCA model
Can be kernelized!
By-product: an EM learning algorithm for kernel PCA
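Since the model is linear-Gaussian, the E-step posterior is available in closed form. The sketch below writes it out explicitly; it is my own transcription of the standard linear-Gaussian posterior, not code from the talk.

```python
# E-step sketch: posterior p(v | x, y) for a labeled point under the SPPCA model.
import numpy as np

def posterior_v(x, y, W_x, W_y, mu_x, mu_y, sigma_x, sigma_y):
    K = W_x.shape[1]
    # Posterior precision: I + W_x^T W_x / sigma_x^2 + W_y^T W_y / sigma_y^2
    precision = np.eye(K) + W_x.T @ W_x / sigma_x**2 + W_y.T @ W_y / sigma_y**2
    Sigma = np.linalg.inv(precision)
    mean = Sigma @ (W_x.T @ (x - mu_x) / sigma_x**2 + W_y.T @ (y - mu_y) / sigma_y**2)
    # For an unlabeled point, drop the W_y / y terms to obtain p(v | x).
    return mean, Sigma
```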
S2PPCA: Toy Problem - Linear
S2PPCA: Toy Problem - Nonlinear
S2PPCA: Properties
Semi-supervised projection
Applicable to large data sets
Take PCA and kernel PCA as special cases
Primal: O(t(M+L)NK) time, O((M+L)N) space
Dual: O(tN^2 K) time, O(N^2) space (cheaper than primal if M > N)
A latent variable solution [Yu et al, SIGIR'05]:
$\min_{A, B, W}\; (1 - \beta)\|X - VA\|_F^2 + \beta\|Y - VB\|_F^2$ s.t. $V^\top V = I$, $V = XW$
Cannot deal with semi-supervised setting!
Closed form solutions for SPPCA
No closed form solutions for S2PPCA
SPPCA: Primal Solution
Theorem 1 (Primal). Let S denote the normalized sample covariance matrix for the centered observations $\{(x_n, y_n)\}_{n=1}^N$,
$$S = \frac{1}{N} \sum_{n=1}^{N} \Phi^{-\frac{1}{2}} \begin{pmatrix} x_n \\ y_n \end{pmatrix} \begin{pmatrix} x_n \\ y_n \end{pmatrix}^{\top} \Phi^{-\frac{1}{2}} = \begin{pmatrix} \frac{1}{\sigma_x^2} S_x & \frac{1}{\sigma_x \sigma_y} S_{xy} \\ \frac{1}{\sigma_x \sigma_y} S_{yx} & \frac{1}{\sigma_y^2} S_y \end{pmatrix},$$
an (M+L) x (M+L) matrix with $\Phi = \mathrm{diag}(\sigma_x^2 I_M, \sigma_y^2 I_L)$, and let $\lambda_1 \geq \ldots \geq \lambda_{M+L}$ be its eigenvalues with eigenvectors $u_1, \ldots, u_{M+L}$. If the latent space is K-dimensional, the following results hold:
(i) In the SPPCA model, $W_x$ and $W_y$ are calculated as
$$W_x = \sigma_x U_x (\Lambda_K - I)^{\frac{1}{2}} R, \qquad W_y = \sigma_y U_y (\Lambda_K - I)^{\frac{1}{2}} R,$$
where $\Lambda_K = \mathrm{diag}(\lambda_1, \ldots, \lambda_K)$, $U_x$ ($U_y$) contains the first M (last L) rows of $[u_1, \ldots, u_K]$, and R is an arbitrary K x K orthogonal rotation matrix.
(ii) The projection $v^*$ for a centered new input $x^*$ is given as
$$v^* = \frac{1}{\sigma_x} R^\top (\Lambda_K - I)^{-\frac{1}{2}} \left[ U_x^\top U_x + (\Lambda_K - I)^{-1} \right]^{-1} U_x^\top x^*.$$
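A direct transcription of this theorem into NumPy might look like the sketch below, taking R = I; σ_x and σ_y are treated as given here, and the function names are my own.

```python
# Sketch of the SPPCA primal solution (Theorem 1, Primal), with R = I.
# Assumes the top-K eigenvalues exceed 1, as required for (Lam_K - I)^{1/2}.
import numpy as np

def sppca_primal(X, Y, K, sigma_x, sigma_y):
    """X: N x M centered inputs, Y: N x L centered outputs."""
    N, M = X.shape
    Z = np.hstack([X / sigma_x, Y / sigma_y])         # rows are Phi^{-1/2} (x_n; y_n)
    S = Z.T @ Z / N                                   # normalized sample covariance
    eigvals, U = np.linalg.eigh(S)
    top = np.argsort(eigvals)[::-1][:K]               # K largest eigenvalues
    Lam_K = np.diag(eigvals[top])
    U_K = U[:, top]
    U_x, U_y = U_K[:M, :], U_K[M:, :]                 # first M rows / last L rows
    W_x = sigma_x * U_x @ np.sqrt(Lam_K - np.eye(K))  # W_x = sigma_x U_x (Lam_K - I)^{1/2}
    W_y = sigma_y * U_y @ np.sqrt(Lam_K - np.eye(K))  # W_y = sigma_y U_y (Lam_K - I)^{1/2}
    return W_x, W_y, U_x, Lam_K

def sppca_project(x_star, U_x, Lam_K, sigma_x):
    """Projection v* of a centered new input x* (part ii of the theorem)."""
    K = Lam_K.shape[0]
    A = U_x.T @ U_x + np.linalg.inv(Lam_K - np.eye(K))
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(Lam_K - np.eye(K))))   # (Lam_K - I)^{-1/2}
    return (1.0 / sigma_x) * D_inv_sqrt @ np.linalg.solve(A, U_x.T @ x_star)
```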
SPPCA: Dual Solution
New kernel matrix!
Theorem 1 (Dual). Let $\hat{K} = \frac{1}{\sigma_x^2} K + \frac{1}{\sigma_y^2} Y Y^\top$, and let $\lambda_1 \geq \ldots \geq \lambda_N$ be its eigenvalues with eigenvectors $v_1, \ldots, v_N$. If the latent space in SPPCA is K-dimensional, then:
(i) The projections of the training data, which are encoded in the rows of matrix Z, are calculated as
$$Z = \sqrt{N}\, V_K (I - N \Lambda_K^{-1})^{\frac{1}{2}} R,$$
where $\Lambda_K = \mathrm{diag}(\lambda_1, \ldots, \lambda_K)$, $V_K = [v_1, \ldots, v_K]$, and R is an arbitrary K x K orthogonal matrix.
(ii) The projection $v^*$ for a new input $x^*$ is given as
$$v^* = \sqrt{N}\, R^\top D^{-\frac{1}{2}} (V_K^\top K V_K + D)^{-1} V_K^\top k(X, x^*),$$
where $D = I - N \Lambda_K^{-1}$.
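Transcribed into NumPy, the dual solution might look like the sketch below; this is my own reading of the theorem with R = I and a linear kernel (K = XX^T, k(X, x*) = X x*) assumed purely for illustration.

```python
# Sketch of the SPPCA dual solution (R = I, linear kernel for illustration).
# Assumes the top eigenvalues of K_hat exceed N, so D = I - N Lam_K^{-1} is positive.
import numpy as np

def sppca_dual(X, Y, n_components, sigma_x, sigma_y):
    """X: N x M centered inputs, Y: N x L centered outputs."""
    N = X.shape[0]
    K_in = X @ X.T                                        # input kernel matrix K
    K_hat = K_in / sigma_x**2 + (Y @ Y.T) / sigma_y**2    # the new kernel matrix
    eigvals, V = np.linalg.eigh(K_hat)
    top = np.argsort(eigvals)[::-1][:n_components]
    Lam_K = np.diag(eigvals[top])
    V_K = V[:, top]
    D = np.eye(n_components) - N * np.linalg.inv(Lam_K)   # D = I - N Lam_K^{-1}
    Z = np.sqrt(N) * V_K @ np.sqrt(D)                     # training projections (rows of Z)
    return Z, V_K, D, K_in

def sppca_dual_project(x_star, X, V_K, D, K_in):
    """Projection of a new input x* (part ii), with k(X, x*) = X x* for the linear kernel."""
    k_star = X @ x_star
    A = V_K.T @ K_in @ V_K + D
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
    return np.sqrt(X.shape[0]) * D_inv_sqrt @ np.linalg.solve(A, V_K.T @ k_star)
```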
Experiments
Train a nearest-neighbor classifier after projection
Evaluation metrics:
Multi-class classification: error rate
Multi-label classification: F1-measure, AUC
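As a rough illustration of this protocol (not the authors' actual experiment code), one could wire the earlier primal sketch to a 1-nearest-neighbor classifier as below; sppca_primal and sppca_project are the hypothetical helpers defined above, and the inputs are assumed to be centered.

```python
# Sketch of the evaluation protocol: project, train 1-NN, report the error rate.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def nn_error_after_projection(X_tr, Y_tr, labels_tr, X_te, labels_te, K,
                              sigma_x=1.0, sigma_y=1.0):
    # Fit the SPPCA mapping on (centered) training inputs and outputs.
    W_x, W_y, U_x, Lam_K = sppca_primal(X_tr, Y_tr, K, sigma_x, sigma_y)
    V_tr = np.array([sppca_project(x, U_x, Lam_K, sigma_x) for x in X_tr])
    V_te = np.array([sppca_project(x, U_x, Lam_K, sigma_x) for x in X_te])
    # Nearest-neighbor classification in the latent space.
    clf = KNeighborsClassifier(n_neighbors=1).fit(V_tr, labels_tr)
    error_rate = np.mean(clf.predict(V_te) != labels_te)
    return error_rate
```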
Multi-class Classification
S2PPCA is almost always better than SPPCA
LDA is very good for FACE data
S2PPCA is very good on TEXT data
S2PPCA has good scalability
Multi-label Classification
S2PPCA is always better than SPPCA
S2PPCA is better in most cases
S2PPCA has good scalability
Extensions
Put priors on factor loading matrices
Learn MAP estimates for them
Relax Gaussian noise model for outputs y = f(v)
Better way to incorporate supervised information
We need to do more approximations (using, e.g. EP)
Directly predict missing outputs (i.e. single step)
Mixture modeling in latent variable space
Achieve local PCA instead of global PCA
Robust supervised PCA mapping
Replace Gaussian with Student-t
Outlier detection in PCA
Related Work
Fisher discriminant analysis (FDA)
Goal: find directions that maximize the between-class distance while minimizing the within-class distance
Only deals with outputs of multi-class classification
Limited number of projection dimensions
Canonical correlation analysis (CCA)
Goal: maximize the correlation between inputs and outputs
Ignores the intra-covariance of both inputs and outputs
Partial least squares (PLS)
Goal: sequentially find orthogonal directions that maximize the covariance with respect to the outputs
A penalized CCA; poor generalization on new output dimensions
Conclusion
A general framework for (semi-)supervised
dimensionality reduction
We can solve the problem analytically (EIG) or
via iterative algorithms (EM)
Trade-off
EIG: optimization-based, easily extended to other loss functions
EM: semi-supervised projection, good scalability
Both algorithms can be kernelized
PCA and kernel PCA are special cases
More applications expected
Thank you!
Questions?
http://www.dbs.ifi.lmu.de/~spyu