Mathematical Programming in Data Mining

Download Report

Transcript Mathematical Programming in Data Mining

Mathematical
Programming in Data
Mining
Author: O. L. Mangasarian
Advisor: Dr. Hsu
Graduate: Yan-Cheng Lin
Abstract
 Describe
mathematical programming
to feature selection, clustering and
robust representation
Outline
 Motivation
 Objective
 Problems
 Feature
Selection
 Clustering
 Robust Representation
 Conclusion
Motivation
 Mathematical
programming has been
applied to a great variety of
theoretical
 Problems can be formulated and
effectively solved as mathematical
programs
Objective
 Describe
three mathematicalprogramming-based developments
relevant to data mining
Problems
 Feature
Selection
 Clustering
 Robust Representation
Problem - Feature Selection
 Discriminating
between two finite
point sets in n-dimensional feature
space and utilizes as few of the
feature as possible
 Formulated as mathematical
program with a parametric objective
function and linear constraints
Problem - Clustering
 Assigning
m points in the ndimensional real space Rn to k
clusters
 Formulated as determining k centers
in Rn, the sum of distances of each
point to the nearest center is
minimized
Problem - Robust Representation
 Modeling
a system of relations in a
manner that preserves the validity of
the representation when the data on
which the model is based changes
 Use a sufficiently small error ‫ זּ‬is
purposely tolerated
Feature Selection
 Use
the simplest model to describe
the essence of a phenomenon
 Binary classification problem:
– discriminating between two given point
sets A and B in the n-dimensional real
space Rn by using as few of the ndimensions of the space as possible
Binary classification
W
P
Feature Selection

the following are some defined:

A

B
Successive Linearization Algorithm

w vector is result
Experimentation
32-feature
Wisconsin
Prognostic Breast
Cancer(WPBC)
 N=32, m = 28, k =
118, r = 0.05, 4
features,
increasing tenfold
cross-validation
correctness by
35.4%

Clustering
 Determining
k cluster centers, the
sum of the 1-norm distances of each
point in a given database to nearest
cluster center is minimized
 Minimizing product of two linear
functions on a set defined by linear
inequalities
K-Median Algorithm

Need to solve
Experimentation
 used
as a KDD tool to mine WPBC to
discover medical knowledge
 key observation is curves are well
separated
Experimentation
Robust Representation
 model
remains valid under a class of
data perturbation
 Use ‫זּ‬-tolerance zone wherein errors
are disregarded
 Better generalization results than
conventional zero-tolerance
Robust Representation
A
is a m*n matrix, a is a m*1 vector
 x is a vector be “learned”
 find minimize of Ax - a
Robust Representation
=
A
x
a
=
A
x
‫זּ‬-tolerate
a
Conclusion
 Mathematical
programming codes
are reliable and robust codes
 Problems solved demonstrate
mathematical programming as
versatile and effective tool for solving
important problems in data mining
and knowledge discovery in
databases
Opinion
 Mathematical
describe can explain
about complex problems and
convince others, but …you must be
understand it first