Transcript Slide 1

Locally Constraint Support Vector
Clustering
Dragomir Yankov, Eamonn Keogh, Kin Fai Kan
Computer Science & Eng. Dept.
University of California, Riverside
Time Series Data Mining Group
Outline
• On the need of improving the Support Vector Clustering
(SVC) algorithm. Motivation
• Problem formulation
• Locally constrained SVC
– An overview of SVC
– Applying factor analysis for local outlier detection
– Regularizing the decision function of SVC
• Experimental evaluation
Time Series Data Mining Group
Motivation for improving SVC
• SVC transforms the data in a high dimensional feature space,
where a decision function is computed
• The support-vectors define contours in the original space
representing higher density regions
original
data
detected
clusters
• The method is theoretically sound and useful for detecting
non-convex formations
Time Series Data Mining Group
Motivation for improving SVC (cont)
• Parametrizing SVC incorrectly may either disguise some objectively
present clusters, or produce multiple unintuitive clusters
• Correct parametrization is especially hard in the presence of noise
(frequently encountered when learning from embedded manifolds)
large kernel widths
merge the clusters
Time Series Data Mining Group
small kernel widths produce
multiple unintuitive clusters
Problem formulation
How can we make Support Vector Clustering:
1. Less susceptible to noise in the data
2. More resilient to imprecise parametrization
Time Series Data Mining Group
Locally constrained SVC –
one class classification
• Support Vector density
estimation
• Primal formulation
• Dual formulation
Time Series Data Mining Group
Locally constrained SVC –
labeling the closed contours
• Support Vector Clustering – decision function
• Labeling the individual classes
Build an affinity matrix
and find the connected
components
Time Series Data Mining Group
Locally constrained SVC –
detecting local outliers
• Factor analysis:
d
X  RD x


Z

R
z u
z  N (0, I ), u  N (0, )
• Mixture of factor analyzers
x   jzj u
z j  N ( j ,  )
• We can adapt MFA to pinpoint
local outliers
Time Series Data Mining Group
Points like P1and P2 that
deviate a lot from the FA are
among the true outliers
Locally constrained SVC –
regularizing the decision function
• To compute the local deviation of each point we use their Mahalanobis
distances with respect to the corresponding FA
• New primal formulation (weighting the slack variables)
• New dual formulation
Time Series Data Mining Group
Experimental evaluation –
synthetic data
• Gaussian with radial Gaussian distributions
LSVC
Good parameter values
for LSVC are detected
automatically. The right
clusters are detected
SVC
SVC is harder to
parametrize. The detected
clusters are incorrect
Time Series Data Mining Group
Experimental evaluation –
synthetic data
• Swiss roll data with added Gaussian noise
LSVC
Most of the noise is
identified as bounded SVs
by LSVC. The correct
clusters are detected
SVC
SVC tends to merge the two
large clusters. With
supervision the clusters are
eventually identified
Time Series Data Mining Group
Experimental evaluation –
face images
• Frey face dataset
LSVC
LSVC discriminates the
two objectively interesting
manifolds embedding the
data
SVC
Even with supervision we
could not find parameters
that separate the two major
manifolds with SVC
Time Series Data Mining Group
Experimental evaluation –
shape clustering
• Arrowheads dataset
LSVC
SVC
Some of the classes are similar.
There are multiple elements
bridging their shape manifolds
LSVC achieves 73%
accuracy vs 60% for SVC
Time Series Data Mining Group
Conclusion
• The LSVC method combines both a global and a local view of
the data
– It computes a decision function that defines a global measure of
density support
– MFA complements this with a local view based on the individual
analyzers
• The algorithm improves significantly on the stability of SVC in
the presence of noise
• LSVC allows for easier automatic parameterization of oneclass SVMs
Time Series Data Mining Group
All datasets and the code for LSVC can be obtained by writing
to the first author: [email protected]
THANK YOU!
Time Series Data Mining Group