Mathematical Programming in Data Mining
Download
Report
Transcript Mathematical Programming in Data Mining
Mathematical
Programming in Data
Mining
Author: O. L. Mangasarian
Advisor: Dr. Hsu
Graduate: Yan-Cheng Lin
Abstract
Describe
mathematical programming
to feature selection, clustering and
robust representation
Outline
Motivation
Objective
Problems
Feature
Selection
Clustering
Robust Representation
Conclusion
Motivation
Mathematical
programming has been
applied to a great variety of
theoretical
Problems can be formulated and
effectively solved as mathematical
programs
Objective
Describe
three mathematicalprogramming-based developments
relevant to data mining
Problems
Feature
Selection
Clustering
Robust Representation
Problem - Feature Selection
Discriminating
between two finite
point sets in n-dimensional feature
space and utilizes as few of the
feature as possible
Formulated as mathematical
program with a parametric objective
function and linear constraints
Problem - Clustering
Assigning
m points in the ndimensional real space Rn to k
clusters
Formulated as determining k centers
in Rn, the sum of distances of each
point to the nearest center is
minimized
Problem - Robust Representation
Modeling
a system of relations in a
manner that preserves the validity of
the representation when the data on
which the model is based changes
Use a sufficiently small error זּis
purposely tolerated
Feature Selection
Use
the simplest model to describe
the essence of a phenomenon
Binary classification problem:
– discriminating between two given point
sets A and B in the n-dimensional real
space Rn by using as few of the ndimensions of the space as possible
Binary classification
W
P
Feature Selection
the following are some defined:
A
B
Successive Linearization Algorithm
w vector is result
Experimentation
32-feature
Wisconsin
Prognostic Breast
Cancer(WPBC)
N=32, m = 28, k =
118, r = 0.05, 4
features,
increasing tenfold
cross-validation
correctness by
35.4%
Clustering
Determining
k cluster centers, the
sum of the 1-norm distances of each
point in a given database to nearest
cluster center is minimized
Minimizing product of two linear
functions on a set defined by linear
inequalities
K-Median Algorithm
Need to solve
Experimentation
used
as a KDD tool to mine WPBC to
discover medical knowledge
key observation is curves are well
separated
Experimentation
Robust Representation
model
remains valid under a class of
data perturbation
Use זּ-tolerance zone wherein errors
are disregarded
Better generalization results than
conventional zero-tolerance
Robust Representation
A
is a m*n matrix, a is a m*1 vector
x is a vector be “learned”
find minimize of Ax - a
Robust Representation
=
A
x
a
=
A
x
זּ-tolerate
a
Conclusion
Mathematical
programming codes
are reliable and robust codes
Problems solved demonstrate
mathematical programming as
versatile and effective tool for solving
important problems in data mining
and knowledge discovery in
databases
Opinion
Mathematical
describe can explain
about complex problems and
convince others, but …you must be
understand it first