Explaining Data in High-Dimensional Space


Transcript: Explaining Data in High-Dimensional Space

Explaining High-Dimensional Data
Hoa Nguyen, Rutgers University
Mentors: Ofer Melnik, Kobbi Nissim
High-Dimensional Data
A great deal of data from different domains (medicine, finance, science) is high-dimensional.
High-dimensional data is hard to visualize and understand.
Example: Given a set of images (represented by 8x8 pixel matrices), we could consider each image as a point in 64-dimensional space.
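With an array library this flattening is one line. A minimal sketch follows; the use of scikit-learn's bundled 8x8 digits dataset is an assumption made purely for illustration:

```python
# Sketch: treating 8x8 images as points in 64-dimensional space.
# The scikit-learn digits dataset is assumed here only for illustration;
# any collection of 8x8 grayscale images would work the same way.
from sklearn.datasets import load_digits

digits = load_digits()                                   # images: array of shape (n, 8, 8)
points = digits.images.reshape(len(digits.images), 64)   # each image -> a point in R^64
print(points.shape)                                      # e.g., (1797, 64)
```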
Analyzing Data
Instead of visualizing, we find properties of the data to describe it.
Statistics, Data Mining, Machine Learning:
1. Finding properties of the data directly
2. Finding models to capture the data
Classifier
Given a set of data points with assigned labels, build a model for the data, then use this model to label new, unclassified data points.
The Geometrical View of Classifiers:
Example: Given a set of data points (x_i, y_i) in 2-dimensional space, with a + or – label for each point.
[Figure: scatter of + and – labeled points in the plane]
We are interested in classifiers that take all the points of a class and enclose them in a convex shape.
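As a toy illustration of this idea (an assumed sketch, not the particular classifier studied in the talk), one can test whether a new point falls inside the convex hull of a class's points:

```python
# Sketch of a convex-shape classifier in 2-D: label a new point "+" if it lies
# inside the convex hull of the "+" training points. The data values and the
# Delaunay-based membership test are illustrative assumptions.
import numpy as np
from scipy.spatial import Delaunay

plus_points = np.array([[1.0, 1.0], [2.0, 0.5], [2.5, 2.0], [1.5, 2.5]])
hull = Delaunay(plus_points)          # triangulates the convex hull of the "+" class

def inside_plus_hull(x):
    # find_simplex returns -1 for points that fall outside the hull
    return hull.find_simplex(np.asarray(x, dtype=float)) >= 0

print(inside_plus_hull([1.8, 1.5]))   # True: enclosed by the "+" region
print(inside_plus_hull([5.0, 5.0]))   # False: outside the convex shape
```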
Goal
To understand the properties of the data enclosed by a convex shape.
Convex Shapes
The convex hull: The convex hull C of a set of points is the smallest convex set that includes all the points.
Problem: It is difficult to study the convex hull directly.
Our solution: Instead of looking at the convex hull, we use a simpler convex shape to approximate the hull: the ellipsoid.
MVE (Minimum Volume Ellipsoid)
Example: [Figure: an ellipsoid enclosing a set of + points]
Advantage: Use an ellipsoid to approximate the convex region, or to bound the geometry of the convex hull.
Using MVE to approximate the convex hull
[John 1948] has shown that if we shrink the minimum volume outer ellipsoid of a convex set C by a factor k about its center, we obtain an ellipsoid contained in C (k is the dimension of the space).
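In symbols (a restatement of [John 1948] under the slide's setting, with E the minimum volume enclosing ellipsoid and c its center):

```latex
% If E is the minimum volume ellipsoid enclosing the convex set C in k dimensions,
% and c is the center of E, then shrinking E about c by a factor of k stays inside C:
c + \tfrac{1}{k}\,(E - c) \subseteq C \subseteq E
```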
Example: [Figure: the MVE of a point set and the ellipsoid obtained by shrinking it by a factor of k]
Calculate the MVE
The MVE is described by the equation:
    v^T \Sigma^{-1} v = k
where V = \{v_1, v_2, \ldots, v_h\} \subset R^k and v is an exterior point (a point on the surface of the ellipsoid).
\Sigma is the scatter matrix:
    \Sigma = \sum_{i=1}^{h} w_i v_i v_i^T
The eigenvectors of \Sigma correspond to the directions of the ellipsoid axes.
The eigenvalues of \Sigma correspond to the half-lengths (radii) of the axes.
k: a constant equal to the dimension of the space.
w_i: the weight of point v_i.
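A minimal sketch of these quantities in code (the weights are taken as given here; the names and the use of a symmetric eigendecomposition are assumptions):

```python
# Sketch: form the scatter matrix from weighted points and read the ellipsoid
# axes off its eigendecomposition. The weights w are assumed to be given
# (e.g., produced by the weight-finding algorithm on the next slide).
import numpy as np

def ellipsoid_from_weights(V, w):
    """V: (h, k) array of points v_i (rows); w: (h,) array of weights w_i."""
    sigma = (w[:, None] * V).T @ V            # Σ = sum_i w_i v_i v_i^T
    eigvals, eigvecs = np.linalg.eigh(sigma)  # Σ is symmetric, so eigh applies
    # Columns of eigvecs give the directions of the ellipsoid axes;
    # per the slides, eigvals correspond to the half-lengths (radii) of the axes.
    return sigma, eigvals, eigvecs
```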
Calculate the MVE (cont.)
[Titterington 1978] gives an algorithm to calculate the weights of the MVE (a sketch of one such iterative scheme appears after this list). At the optimum:
1. A point has a positive weight only if it lies on the surface of the ellipsoid.
2. At least k+1 points have non-zero weights.
3. At most k(k+3)/2 + 1 points have non-zero weights.
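The slides do not spell out the update rule, so the following is only a sketch of one standard multiplicative scheme in the spirit of [Titterington 1978]; the stopping tolerance and iteration cap are assumptions:

```python
# Sketch of an iterative multiplicative weight update for the MVE, in the spirit
# of [Titterington 1978]. The update rule, tolerance, and iteration limit are
# assumptions; the points are assumed to span R^k so that Σ is invertible.
import numpy as np

def mve_weights(V, tol=1e-6, max_iter=10000):
    """V: (h, k) array of points; returns weights w_i with sum(w) = 1."""
    h, k = V.shape
    w = np.full(h, 1.0 / h)                    # start from uniform weights
    for _ in range(max_iter):
        sigma = (w[:, None] * V).T @ V         # Σ = sum_i w_i v_i v_i^T
        sigma_inv = np.linalg.inv(sigma)
        m = np.einsum('ij,jl,il->i', V, sigma_inv, V)   # m_i = v_i^T Σ^{-1} v_i
        if m.max() <= k * (1.0 + tol):         # all points satisfy v^T Σ^{-1} v <= k
            break
        w *= m / k                             # points with m_i > k gain weight, interior points lose it
        w /= w.sum()                           # keep the weights normalized
    return w
```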
Use MVE for data analysis:
1. Finding extreme points by looking at points on the ellipsoid surface.
2. Finding the subspace of the data by looking at the directions where the ellipsoid is thin.
Points on the surface of the ellipsoid
i.e., points with non-zero weights.
Example: In our hand-written digit file, there are 376 points which belong to class “0”. By using MVE, we find that there are about 178 points with non-zero weights, i.e., these points lie on the surface of the MVE.
[Figure: the mean “0” and some “0” points on the surface of the MVE]
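Continuing the sketch above, the surface points would be the ones whose weights do not vanish. The synthetic data and the 1e-8 cutoff below are purely illustrative; the counts on the slide come from the talk's own digit data:

```python
# Sketch: after computing MVE weights, points with (numerically) non-zero weight
# are the ones lying on the ellipsoid's surface. Synthetic data and the 1e-8
# cutoff are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
V = rng.normal(size=(200, 5))            # 200 points in R^5 (stand-in for the 64-d digits)
w = mve_weights(V)                       # weights from the sketch above
surface_points = V[w > 1e-8]             # heuristic cutoff for "non-zero" weight
print(len(surface_points), "of", len(V), "points lie on the MVE surface")
```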
Directions where the ellipsoid is thin
The directions and sizes of an ellipsoid’s axes correspond to the eigenvectors and eigenvalues of its scatter matrix.
Direction of thinness: a short axis defines a direction in which the data does not extend.
If V is a zero-valued eigenvector, then it defines a constraint for any data point x:
    V^T x = 0
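A minimal sketch of extracting these constraints from the scatter matrix (the zero-eigenvalue tolerance is an assumption):

```python
# Sketch: eigenvectors of the scatter matrix with (numerically) zero eigenvalue
# span the null space; each such vector V gives a linear constraint V^T x = 0
# on the data. The tolerance used to decide "zero" is an illustrative assumption.
import numpy as np

def null_space_basis(sigma, tol=1e-10):
    """Returns a (k, m) array whose columns are zero-eigenvalue eigenvectors."""
    eigvals, eigvecs = np.linalg.eigh(sigma)
    zero = eigvals < tol * max(eigvals.max(), 1.0)   # relative test for "zero"
    return eigvecs[:, zero]

# If X holds the data points as rows, the slide's constraint says X @ basis ≈ 0:
# np.allclose(X @ null_space_basis(sigma), 0)
```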
A Simple Null Space
Any basis for the null space of the scatter matrix is an equivalent set of constraints.
In order to understand the data, we would like to find constraints that are easy to interpret.
Goal: simplify the null space basis, e.g., find a basis with many zeros.
The Null Space Problem (NSP)
The Null Space Problem is defined as finding the basis with the maximal number of zeros.
It is an NP-hard problem [Pothen, Coleman 1986].
An approach: find a heuristic algorithm to simplify an existing basis of the null space of class “0”, e.g., using Gaussian elimination to get a null space basis with more zero components (a sketch of this heuristic follows below).
[Figure: the null space basis, i.e., the set of eigenvectors with zero eigenvalues]
[Figure: the null space basis after using Gaussian elimination]
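A minimal sketch of that heuristic (the pivoting strategy and tolerance are assumptions, and this is a heuristic rather than an optimal solution to the NP-hard problem):

```python
# Sketch of the Gaussian-elimination heuristic: row-reduce the null-space basis
# vectors (one per row) to get an equivalent basis with more zero entries.
# Partial pivoting and the tolerance are illustrative assumptions.
import numpy as np

def sparsify_basis(B, tol=1e-10):
    """B: (m, k) array whose rows are null-space basis vectors."""
    A = B.astype(float)
    row = 0
    for col in range(A.shape[1]):
        if row >= A.shape[0]:
            break
        pivot = row + np.argmax(np.abs(A[row:, col]))   # partial pivoting
        if abs(A[pivot, col]) < tol:
            continue                                    # no usable pivot in this column
        A[[row, pivot]] = A[[pivot, row]]               # swap pivot row into place
        A[row] /= A[row, col]                           # scale pivot entry to 1
        others = np.arange(A.shape[0]) != row
        A[others] -= np.outer(A[others, col], A[row])   # zero out the column elsewhere
        row += 1
    return A                                            # rows still span the same null space
```

For instance, sparsify_basis(null_space_basis(sigma).T) would return an equivalent set of constraints with more zero entries, in the spirit of the before/after figures above.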
Summary
Data analysis
Classifier with convex shape
MVE
Points on the surface of the MVE
Simple basis for the null space