
(Semi-)Supervised Probabilistic
Principal Component Analysis
Shipeng Yu
University of Munich, Germany
Siemens Corporate Technology
http://www.dbs.ifi.lmu.de/~spyu
Joint work with Kai Yu, Volker Tresp,
Hans-Peter Kriegel, Mingrui Wu
Dimensionality Reduction

We are dealing with high-dimensional data:
- Texts: e.g. bag-of-words features
- Images: color histogram, correlogram, etc.
- Web pages: texts, linkages, structures, etc.

Motivations:
- Noisy features: how to remove or down-weight them?
- Learnability: “curse of dimensionality”
- Inefficiency: high computational cost
- Visualization
- A pre-processing step for many data mining tasks
Unsupervised versus Supervised

Unsupervised Dimensionality Reduction
- Only the input data are given
- PCA (principal component analysis)

Supervised Dimensionality Reduction
- Should be biased by the outputs
- Classification: FDA (Fisher discriminant analysis)
- Regression: PLS (partial least squares)
- RVs: CCA (canonical correlation analysis)

More general solutions? Semi-supervised?
Our Settings and Notations
N data points, M input features, L output labels
We aim to derive a mapping Ψ : X → V such that V ⊂ R^K, K < M.

[Figure: data matrices for the three settings. Supervised: inputs X (N × M) paired with outputs Y (N × L). Unsupervised: X only. Semi-supervised: X (N × M) with outputs Y observed only for the first N1 data points; the remaining N − N1 points are unlabeled data.]
Outline
- Principal Component Analysis
- Probabilistic PCA
- Supervised Probabilistic PCA
- Related Work
- Conclusion
PCA: Motivation

Find the K orthogonal projection directions that capture the most data variance.

Applications:
- Visualization
- De-noising
- Latent semantic indexing
- Eigenfaces
- …

[Figure: 2D scatter plot of data with the 1st PC and 2nd PC directions drawn in.]
PCA: Algorithm

Basic Algorithm
- Centralize the data: x_i ← x_i − μ_x, where μ_x = (1/N) Σ_{i=1}^N x_i
- Compute the sample covariance matrix: S = (1/N) Σ_{i=1}^N x_i x_i^T = (1/N) X^T X
- Do the eigen-decomposition (sort eigenvalues decreasingly): S = U Λ U^T
- The PCA directions are given by U_K, the first K columns of U
- The PCA projection of a test point x* is v* = U_K^T x*
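As a concrete illustration of these steps, here is a minimal NumPy sketch (function and variable names are my own, not from the slides):

```python
import numpy as np

def pca_fit(X, K):
    """Basic PCA as on this slide: center, sample covariance, eigen-decomposition."""
    mu = X.mean(axis=0)                 # mean vector (1/N) sum_i x_i
    Xc = X - mu                         # centralize the data
    S = Xc.T @ Xc / X.shape[0]          # sample covariance (1/N) X^T X
    eigvals, U = np.linalg.eigh(S)      # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]   # sort decreasingly
    return mu, U[:, order[:K]]          # U_K: first K principal directions

def pca_project(x_star, mu, U_K):
    """PCA projection of a test point: v* = U_K^T (x* - mu)."""
    return U_K.T @ (x_star - mu)

# Tiny usage example on random data
X = np.random.randn(100, 10)
mu, U_K = pca_fit(X, K=2)
v_star = pca_project(X[0], mu, U_K)
```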
Supervised PCA?
PCA is unsupervised. When output information is available:
- Classification labels: 0/1
- Regression responses: real values
- Ranking orders: rank labels / preferences
- Multi-outputs: output dimension > 1
- Structured outputs, …

Can PCA be biased by the outputs? And how?
Outline
- Principal Component Analysis
- Probabilistic PCA
- Supervised Probabilistic PCA
- Related Work
- Conclusion
Latent Variable Model for PCA

Another interpretation of PCA [Pearson 1901]: PCA minimizes the reconstruction error of X,

    min_{A,V} ‖X − VA‖²_F   s.t.   V^T V = I

- V ∈ R^{N×K} are latent variables: the PCA projections of X
- A ∈ R^{K×M} are factor loadings: the PCA mappings
- Equivalent to PCA up to a scaling factor
- Leads to the idea of PPCA
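To make the reconstruction view concrete, here is a small NumPy check (my own illustration, not from the slides; it uses the direction-based parametrization, which the slide notes is equivalent up to scaling) that the top-K principal directions give a smaller error ‖X − VA‖²_F than a random orthonormal projection:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 200, 10, 3
X = rng.standard_normal((N, M)) @ rng.standard_normal((M, M))  # correlated data
X -= X.mean(axis=0)                                             # centered

def reconstruction_error(X, U_K):
    """Reconstruct X from its projection onto span(U_K): X ~ (X U_K) U_K^T."""
    V = X @ U_K                      # N x K latent coordinates
    A = U_K.T                        # K x M factor loadings
    return np.linalg.norm(X - V @ A, "fro") ** 2

# PCA directions: top-K eigenvectors of the sample covariance
eigvals, U = np.linalg.eigh(X.T @ X / N)
U_pca = U[:, np.argsort(eigvals)[::-1][:K]]

# Random orthonormal directions for comparison
U_rand, _ = np.linalg.qr(rng.standard_normal((M, K)))

print(reconstruction_error(X, U_pca), "<", reconstruction_error(X, U_rand))
```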
Probabilistic PCA [TipBis99]
[Graphical model: latent variable v generates the observed x through Wx, with mean vector μx and noise level σx; plate over the N data points.]

- Latent variable model: x = Wx v + μx + εx
- Latent variables: v ∼ N(0, I)
- Noise process: εx ∼ N(0, σx² I)
- Conditional independence: x | v ∼ N(Wx v + μx, σx² I)
- If σx² → 0, PPCA leads to the PCA solution (up to a rotation and scaling factor)
- x is Gaussian distributed: P(x) = ∫ P(x|v) P(v) dv
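For reference, the maximum-likelihood fit of this model has a closed form (Tipping & Bishop, 1999); a minimal NumPy sketch, with my own function names, is:

```python
import numpy as np

def ppca_ml(X, K):
    """Closed-form ML estimates for PPCA: W = U_K (Lam_K - sigma^2 I)^{1/2} (up to rotation)."""
    N, M = X.shape
    mu = X.mean(axis=0)
    S = (X - mu).T @ (X - mu) / N                   # sample covariance
    eigvals, U = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]
    eigvals, U = eigvals[order], U[:, order]
    sigma2 = eigvals[K:].mean()                     # average of the discarded eigenvalues
    W = U[:, :K] @ np.diag(np.sqrt(eigvals[:K] - sigma2))
    return mu, W, sigma2

def ppca_posterior_mean(x, mu, W, sigma2):
    """E[v | x] = (W^T W + sigma^2 I)^{-1} W^T (x - mu)."""
    K = W.shape[1]
    return np.linalg.solve(W.T @ W + sigma2 * np.eye(K), W.T @ (x - mu))
```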
From Unsupervised to Supervised

Key insights of PPCA:
- All the M input dimensions are conditionally independent given the K latent variables
- In PCA we are seeking the K latent variables that best explain the data covariance

When we have outputs Y ∈ R^{N×L}, we believe:
- There are inter-covariances between X and Y
- There are intra-covariances within Y if L > 1

Idea: Let the latent variables explain all of them!
Outline
- Principal Component Analysis
- Probabilistic PCA
- Supervised Probabilistic PCA
- Related Work
- Conclusion
Supervised Probabilistic PCA
[Graphical model: a shared latent variable v generates both x (through Wx, σx) and y (through Wy, σy); plate over the N data points.]

Supervised latent variable model:
- x = Wx v + μx + εx,   y = Wy v + μy + εy
- Noise processes: εx ∼ N(0, σx² I), εy ∼ N(0, σy² I)
- All input and output dimensions are conditionally independent given v:
      x | v ∼ N(Wx v + μx, σx² I),   y | v ∼ N(Wy v + μy, σy² I)
- (x, y) are jointly Gaussian distributed: P(x, y) = ∫ P(x|v) P(y|v) P(v) dv
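A short NumPy sketch of this generative process (dimensions and parameter values chosen purely for illustration) makes the shared latent variable explicit:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, L, K = 500, 10, 3, 2
Wx = rng.standard_normal((M, K))        # input loadings
Wy = rng.standard_normal((L, K))        # output loadings
mu_x, mu_y = rng.standard_normal(M), rng.standard_normal(L)
sigma_x, sigma_y = 0.1, 0.1

V = rng.standard_normal((N, K))                                  # v ~ N(0, I)
X = V @ Wx.T + mu_x + sigma_x * rng.standard_normal((N, M))      # x = Wx v + mu_x + eps_x
Y = V @ Wy.T + mu_y + sigma_y * rng.standard_normal((N, L))      # y = Wy v + mu_y + eps_y
```

Because x and y are generated from the same v, the latent directions must account for the covariance within X, within Y, and between X and Y.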
Semi-Supervised Probabilistic PCA
[Figure: the SPPCA graphical model alongside the data matrices; the outputs y1, …, yL are observed only for the first N1 data points and are missing for the remaining N − N1.]

Idea: an SPPCA model with missing data!

Likelihood:
    P(X, Y) = ∏_{n=1}^{N1} P(x_n, y_n) · ∏_{n'=N1+1}^{N} P(x_{n'})
S2PPCA: EM Learning
Model parameters: Wx, Wy, μx, μy, σx², σy²

EM Learning:
- E-step: estimate v for each data point (a projection / inference problem)
      P(v_n | x_n, y_n) ∝ P(x_n | v_n) P(y_n | v_n) P(v_n)     (labeled points)
      P(v_{n'} | x_{n'}) ∝ P(x_{n'} | v_{n'}) P(v_{n'})         (unlabeled points)
- M-step: maximize the data log-likelihood w.r.t. the parameters
- An extension of EM learning for the PPCA model (a rough sketch follows below)
- Can be kernelized!
- By-product: an EM learning algorithm for kernel PCA
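As a rough illustration of one such EM iteration, here is a minimal NumPy sketch for the plain PPCA case (Tipping & Bishop's standard updates; my own code, not the authors'). Extending it to S2PPCA would condition on (x_n, y_n) for labeled points and on x_n alone for unlabeled ones in the E-step, and update Wx, Wy, σx², σy² jointly in the M-step; that extension is not shown here.

```python
import numpy as np

def ppca_em(X, K, n_iter=100, seed=0):
    """EM for PPCA: x = W v + mu + eps, with v ~ N(0, I) and eps ~ N(0, sigma2 I)."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu
    W = rng.standard_normal((M, K))
    sigma2 = 1.0
    for _ in range(n_iter):
        # E-step: posterior moments of each v_n given x_n
        Minv = np.linalg.inv(W.T @ W + sigma2 * np.eye(K))
        Ev = Xc @ W @ Minv                        # rows are E[v_n]
        sumEvv = N * sigma2 * Minv + Ev.T @ Ev    # sum_n E[v_n v_n^T]
        # M-step: maximize the expected complete-data log-likelihood
        W_new = (Xc.T @ Ev) @ np.linalg.inv(sumEvv)
        sigma2 = (np.sum(Xc ** 2)
                  - 2 * np.sum((Xc @ W_new) * Ev)
                  + np.trace(sumEvv @ W_new.T @ W_new)) / (N * M)
        W = W_new
    return mu, W, sigma2
```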
S2PPCA: Toy Problem - Linear
[Figure: results on a linear toy problem; panels labeled X and Y.]
S2PPCA: Toy Problem - Nonlinear
[Figure: results on a nonlinear toy problem.]

S2PPCA: Toy Problem - Nonlinear

[Figure: further results on the nonlinear toy problem.]
S2PPCA: Properties

- Semi-supervised projection
  - Takes PCA and kernel PCA as special cases
- Applicable to large data sets
  - Primal: O(t(M+L)NK) time, O((M+L)N) space
  - Dual: O(tN²K) time, O(N²) space (cheaper than the primal if M > N)
- A latent variable solution [Yu et al., SIGIR'05]:
      min_{A,B,W} (1 − β)‖X − VA‖²_F + β‖Y − VB‖²_F   s.t.   V^T V = I, V = XW
  - Cannot deal with the semi-supervised setting!
- Closed-form solutions for SPPCA
  - No closed-form solutions for S2PPCA
SPPCA: Primal Solution
Theorem 1 (Primal). Let S denote the normalized sample covariance matrix for the centered observations {(x_n, y_n)}_{n=1}^N,

    S = Φ^{−1/2} ( (1/N) Σ_{n=1}^N [x_n; y_n] [x_n; y_n]^T ) Φ^{−1/2}
      = [ (1/σx²) S_x          (1/(σx σy)) S_xy ]
        [ (1/(σx σy)) S_yx     (1/σy²) S_y      ]   ∈ R^{(M+L)×(M+L)},

where Φ = diag(σx² I, σy² I) is the noise covariance, and let λ1 ≥ … ≥ λ_{M+L} be its eigenvalues with eigenvectors u1, …, u_{M+L}. Then if the latent space is K-dimensional, the following results hold:

(i) In the SPPCA model, Wx and Wy are calculated as

    Wx = σx Ux (Λ_K − I)^{1/2} R,     Wy = σy Uy (Λ_K − I)^{1/2} R,

where Λ_K = diag(λ1, …, λK), Ux (Uy) contains the first M (last L) rows of [u1, …, uK], and R is an arbitrary K × K orthogonal rotation matrix.

(ii) The projection v* for a centered new input x* is given as

    v* = (1/σx) R^T (Λ_K − I)^{−1/2} [Ux^T Ux + (Λ_K − I)^{−1}]^{−1} Ux^T x*.
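A minimal NumPy sketch of this primal recipe (my own code; it fixes the rotation R = I, treats σx and σy as given rather than estimated, and assumes the top-K eigenvalues exceed 1):

```python
import numpy as np

def sppca_primal_fit(X, Y, K, sigma_x, sigma_y):
    """Eigen-decompose the normalized sample covariance of the scaled, centered data."""
    N, M = X.shape
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    Z = np.hstack([Xc / sigma_x, Yc / sigma_y])       # rows are [x_n/sigma_x, y_n/sigma_y]
    S = Z.T @ Z / N                                   # the matrix S of the theorem
    eigvals, U = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1][:K]
    lam, U_K = eigvals[order], U[:, order]            # Lam_K and [u_1, ..., u_K]
    Ux, Uy = U_K[:M, :], U_K[M:, :]                   # first M rows / last L rows
    Wx = sigma_x * Ux @ np.diag(np.sqrt(lam - 1.0))   # Wx = sigma_x Ux (Lam_K - I)^{1/2}
    Wy = sigma_y * Uy @ np.diag(np.sqrt(lam - 1.0))   # Wy = sigma_y Uy (Lam_K - I)^{1/2}
    return Wx, Wy, Ux, lam

def sppca_primal_project(x_star_centered, Ux, lam, sigma_x):
    """v* = (1/sigma_x) (Lam_K - I)^{-1/2} [Ux^T Ux + (Lam_K - I)^{-1}]^{-1} Ux^T x*."""
    D = lam - 1.0
    A = Ux.T @ Ux + np.diag(1.0 / D)
    return np.diag(1.0 / np.sqrt(D)) @ np.linalg.solve(A, Ux.T @ x_star_centered) / sigma_x
```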
SPPCA: Dual Solution
New kernel matrix!

Theorem 1 (Dual). Let K̂ = (1/σx²) K + (1/σy²) Y Y^T, and let λ1 ≥ … ≥ λ_N be its eigenvalues with eigenvectors v1, …, v_N. Then if the latent space in SPPCA is K-dimensional,

(i) The projections of the training data, which are encoded in the rows of the matrix Z, are calculated as

    Z = √N V_K (I − N Λ_K^{−1})^{1/2} R,

where Λ_K = diag(λ1, …, λK), V_K = [v1, …, vK], and R is an arbitrary K × K orthogonal matrix.

(ii) The projection v* for a new input x* is given as

    v* = √N R^T D^{−1/2} (V_K^T K V_K + D)^{−1} V_K^T k(X, x*),

where D = I − N Λ_K^{−1}.
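And a corresponding NumPy sketch of the dual (kernel) recipe, again with R = I, σx and σy treated as given, and Y assumed centered; `K_mat` is the input kernel matrix and `k_star` stands for k(X, x*) (all names are mine):

```python
import numpy as np

def sppca_dual_fit(K_mat, Y, n_comp, sigma_x, sigma_y):
    """Eigen-decompose the new kernel matrix K_hat = K/sigma_x^2 + Y Y^T/sigma_y^2."""
    N = K_mat.shape[0]
    K_hat = K_mat / sigma_x**2 + (Y @ Y.T) / sigma_y**2
    eigvals, V = np.linalg.eigh(K_hat)
    order = np.argsort(eigvals)[::-1][:n_comp]
    lam, V_K = eigvals[order], V[:, order]
    d = 1.0 - N / lam                    # diagonal of D = I - N Lam_K^{-1} (assumed > 0)
    Z = np.sqrt(N) * V_K * np.sqrt(d)    # training projections, rows of Z (R = I)
    return Z, V_K, d

def sppca_dual_project(k_star, K_mat, V_K, d):
    """v* = sqrt(N) D^{-1/2} (V_K^T K V_K + D)^{-1} V_K^T k(X, x*)."""
    N = K_mat.shape[0]
    A = V_K.T @ K_mat @ V_K + np.diag(d)
    return np.sqrt(N) / np.sqrt(d) * np.linalg.solve(A, V_K.T @ k_star)
```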
Experiments


- Train Nearest Neighbor classifier after projection
- Evaluation metrics:
  - Multi-class classification: error rate
  - Multi-label classification: F1-measure, AUC
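For concreteness, the evaluation step could look like this minimal pure-NumPy sketch of 1-Nearest-Neighbor classification in the projected space (illustrative only, not the authors' experiment code):

```python
import numpy as np

def nn_error_rate(V_train, y_train, V_test, y_test):
    """1-NN in the projected space, scored by multi-class error rate."""
    # squared Euclidean distances between every test and training projection
    d2 = ((V_test[:, None, :] - V_train[None, :, :]) ** 2).sum(axis=-1)
    y_pred = y_train[d2.argmin(axis=1)]
    return float(np.mean(y_pred != y_test))
```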
Multi-class Classification




- S2PPCA is almost always better than SPPCA
- LDA is very good for FACE data
- S2PPCA is very good on TEXT data
- S2PPCA has good scalability
Multi-label Classification



- S2PPCA is always better than SPPCA
- S2PPCA is better in most cases
- S2PPCA has good scalability
Extensions

- Put priors on the factor loading matrices
  - Learn MAP estimates for them
- Relax the Gaussian noise model for the outputs, y = f(v)
  - A better way to incorporate supervised information
  - We need to do more approximations (using, e.g., EP)
  - Directly predict missing outputs (i.e., in a single step)
- Mixture modeling in the latent variable space
  - Achieve local PCA instead of global PCA
- Robust supervised PCA mapping
  - Replace the Gaussian with a Student-t
  - Outlier detection in PCA
Related Work

Fisher discriminant analysis (FDA)
- Goal: find directions that maximize the between-class distance while minimizing the within-class distance
- Only deals with outputs of multi-class classification
- Limited number of projection dimensions

Canonical correlation analysis (CCA)
- Goal: maximize the correlation between inputs and outputs
- Ignores the intra-covariance of both inputs and outputs

Partial least squares (PLS)
- Goal: sequentially find orthogonal directions that maximize the covariance with respect to the outputs
- A penalized CCA; poor generalization on new output dimensions
Conclusion



- A general framework for (semi-)supervised dimensionality reduction
- We can solve the problem analytically (EIG) or via iterative algorithms (EM)
- Trade-off:
  - EIG: optimization-based, easily extended to other loss functions
  - EM: semi-supervised projection, good scalability
- Both algorithms can be kernelized
- PCA and kernel PCA are special cases
- More applications expected
Thank you!
Questions?
http://www.dbs.ifi.lmu.de/~spyu