Support Vector Machines for Data Fitting and Classification


Support Vector Machines
for Data Fitting and Classification
David R. Musicant
with Olvi L. Mangasarian
UW-Madison Data Mining Institute
Annual Review
June 2, 2000
Overview

Regression and its role in data mining
Robust support vector regression
– Our general formulation
Tolerant support vector regression
– Our contributions
– Massive support vector regression
– Integration with data mining tools
Active support vector machines
Other research and future directions
What is regression?

Regression forms a rule for predicting an unknown numerical feature from known ones.
Example: Predicting purchase habits.
Can we use...
– age, income, level of education
To predict...
– purchasing patterns?
And simultaneously...
– avoid the “pitfalls” that standard statistical regression falls into?
Regression example

Can we use:

  Age   Income         Years of Education   $ spent on software
  30    $56,000 / yr   16                   $800
  50    $60,000 / yr   12                   $0
  16    $2,000 / yr    11                   $200

To predict:

  Age   Income         Years of Education   $ spent on software
  40    $48,000 / yr   17                   ?
  29    $60,000 / yr   18                   ?
Role in data mining

Goal: Find new relationships in data
– e.g. customer behavior, scientific experimentation
Regression explores the importance of each known feature in predicting the unknown one.
– Feature selection
Regression is a form of supervised learning
– Use data where the predictive value is known for given instances to form a rule
Massive datasets
Regression is a fundamental task in data mining.
Part I:
Robust Regression
a.k.a. Huber Regression
[Plot: the Huber loss function, quadratic between −γ and γ and linear outside]
“Standard” Linear Regression
[Plot: data A versus response y, with the fitted line ŷ = Aw + be]

Find w, b such that:
Aw + be \approx y
Optimization problem

Find w, b such that:
Aw + be \approx y

Bound the error by s:
-s \le Aw + be - y \le s

Minimize the error:
\min \sum_i s_i^2 \quad \text{s.t.} \quad -s \le Aw + be - y \le s

Traditional approach: minimize squared error.
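The squared-error problem above is equivalent to ordinary least squares, so it can be solved directly; a minimal numerical sketch (NumPy and the made-up data are assumptions for illustration, not from the talk):

    import numpy as np

    # Made-up data standing in for the matrix A and response y on the slides.
    m, n = 100, 3
    A = np.random.rand(m, n)
    y = A @ np.array([2.0, -1.0, 0.5]) + 3.0 + 0.1 * np.random.randn(m)

    # Append a column of ones so the intercept b is fit together with w;
    # minimizing the sum of squared errors s_i is then ordinary least squares.
    A1 = np.hstack([A, np.ones((m, 1))])
    wb, *_ = np.linalg.lstsq(A1, y, rcond=None)
    w, b = wb[:-1], wb[-1]
    print("w =", w, "b =", b)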
Examining the loss function

Standard regression uses a squared error loss function.
– Points which are far from the predicted line (outliers) are overemphasized.

[Plot: squared error loss function]
Alternative loss function

Instead of squared error, try the absolute value of the error:

\min \sum_i |s_i| \quad \text{s.t.} \quad -s \le Aw + be - y \le s

[Plot: absolute value (1-norm) loss function]
This is called the 1-norm loss function.
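Because the objective and constraints are linear, the 1-norm problem is a linear program; a minimal sketch of setting it up (SciPy's linprog, the made-up data, and the variable stacking [w, b, s] are assumptions for illustration):

    import numpy as np
    from scipy.optimize import linprog

    # Made-up data standing in for A and y.
    m, n = 100, 3
    A = np.random.rand(m, n)
    y = A @ np.array([2.0, -1.0, 0.5]) + 3.0 + 0.1 * np.random.randn(m)

    # Variables x = [w (n), b (1), s (m)]; minimize the sum of the s_i.
    c = np.concatenate([np.zeros(n + 1), np.ones(m)])
    e, I = np.ones((m, 1)), np.eye(m)

    # -s <= Aw + be - y <= s, rewritten as two blocks of "<=" constraints.
    A_ub = np.block([[A, e, -I], [-A, -e, -I]])
    b_ub = np.concatenate([y, -y])

    bounds = [(None, None)] * (n + 1) + [(0, None)] * m   # w, b free; s >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    w, b = res.x[:n], res.x[n]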
1-Norm Problems And Solution
– Overemphasizes error on points close to the predicted line

Solution: Huber loss function, a hybrid approach

[Plot: Huber loss function, quadratic near zero and linear in the tails]
Many practitioners prefer the Huber loss function.
Mathematical Formulation

γ indicates the switchover from quadratic to linear:

\rho(t) = \begin{cases} t^2/2 & \text{if } |t| \le \gamma \\ \gamma|t| - \gamma^2/2 & \text{if } |t| > \gamma \end{cases}

[Plot: Huber loss function with switchover points at ±γ]

Larger γ means “more quadratic.”
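The piecewise definition translates directly into code; a small sketch (NumPy assumed), using the γ = 1.345 value that appears in the experiments later:

    import numpy as np

    def huber(t, gamma=1.345):
        """Huber loss: quadratic for |t| <= gamma, linear beyond the switchover."""
        t = np.asarray(t, dtype=float)
        quadratic = t ** 2 / 2
        linear = gamma * np.abs(t) - gamma ** 2 / 2
        return np.where(np.abs(t) <= gamma, quadratic, linear)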
Regression Approach Summary
Quadratic Loss Function
– Standard method in statistics
– Over-emphasizes outliers

Linear Loss Function (1-norm)
– Formulates well as a linear program
– Over-emphasizes small errors

Huber Loss Function (hybrid approach)
– Appropriate emphasis on large and small errors

[Plots: quadratic, 1-norm, and Huber loss functions]
Previous attempts complicated

Earlier efforts to solve Huber regression:
– Huber: Gauss-Seidel method
– Madsen/Nielsen: Newton Method
– Li: Conjugate Gradient Method
– Smola: Dual Quadratic Program

Our new approach: a convex quadratic program

\min_{w \in R^d,\, z \in R^l,\, t \in R^l} \sum_i z_i^2/2 + \gamma \sum_i |t_i| \quad \text{s.t.} \quad z - t \le Aw + be - y \le z + t
Our new approach is simpler and faster.
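A minimal sketch of the convex quadratic program above (the cvxpy modeling package and the made-up data are assumptions for illustration; any quadratic programming solver would do):

    import numpy as np
    import cvxpy as cp

    # Made-up data standing in for A and y.
    m, n = 200, 5
    A = np.random.rand(m, n)
    y = A @ np.random.randn(n) + 0.3 * np.random.randn(m)

    gamma = 1.345
    w, b = cp.Variable(n), cp.Variable()
    z, t = cp.Variable(m), cp.Variable(m)

    # min sum(z_i^2)/2 + gamma * sum|t_i|   s.t.   z - t <= Aw + be - y <= z + t
    r = A @ w + b * np.ones(m) - y
    prob = cp.Problem(cp.Minimize(cp.sum_squares(z) / 2 + gamma * cp.norm1(t)),
                      [z - t <= r, r <= z + t])
    prob.solve()
    print(w.value, b.value)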
Experimental Results: Census20k
20,000 points, 11 features, γ = 1.345

[Bar chart: time (CPU sec, 0 to 600) for the Li, Madsen/Nielsen, Huber, Smola, and MM methods, annotated “Faster!”]
Experimental Results: CPUSmall
8,192 points, 12 features, γ = 1.345

[Bar chart: time (CPU sec, 0 to 200) for the Li, Madsen/Nielsen, Huber, Smola, and MM methods, annotated “Faster!”]
Introduce nonlinear kernel!

Begin with the previous formulation:

\min_{w \in R^d,\, z \in R^l,\, t \in R^l} \sum_i z_i^2/2 + \gamma \sum_i |t_i| \quad \text{s.t.} \quad z - t \le Aw + be - y \le z + t

Substitute w = A'α and minimize over α instead:

z - t \le AA'\alpha + be - y \le z + t

Substitute K(A,A') for AA':

z - t \le K(A,A')\alpha + be - y \le z + t

A kernel is a nonlinear function.
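For concreteness, a sketch of a Gaussian (radial basis) kernel matrix that could replace AA' in the constraints; the helper name and the width parameter are illustrative assumptions:

    import numpy as np

    def gaussian_kernel(A, B, width=0.1):
        """K(A, B')_ij = exp(-width * ||A_i - B_j||^2); replaces AA' above."""
        sq_dist = (np.sum(A ** 2, axis=1)[:, None]
                   + np.sum(B ** 2, axis=1)[None, :]
                   - 2 * A @ B.T)
        return np.exp(-width * sq_dist)

    # With K = gaussian_kernel(A, A), the constraint becomes
    #   z - t <= K @ alpha + b*e - y <= z + t
    # and the m-vector alpha is optimized in place of w.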
Nonlinear results
  Dataset          Kernel     Training Accuracy   Testing Accuracy
  CPUSmall         Linear     94.50%              94.06%
  CPUSmall         Gaussian   97.26%              95.90%
  Boston Housing   Linear     85.60%              83.81%
  Boston Housing   Gaussian   92.36%              88.15%
Nonlinear kernels improve accuracy.
Part II:
Tolerant Regression
a.k.a. Tolerant Training
Regression Approach Summary
Quadratic Loss Function
– Standard method in statistics
– Over-emphasizes outliers

Linear Loss Function (1-norm)
– Formulates well as a linear program
– Over-emphasizes small errors

Huber Loss Function (hybrid approach)
– Appropriate emphasis on large and small errors

[Plots: quadratic, 1-norm, and Huber loss functions]
Optimization problem (1-norm)

Find w, b such that:
Aw + be \approx y

Bound the error by s:
-s \le Aw + be - y \le s

Minimize the error:
\min \sum_i |s_i| \quad \text{s.t.} \quad -s \le Aw + be - y \le s

Minimize the magnitude of the error.
The overfitting issue

Noisy training data can be fitted “too well”
– leads to poor generalization on future data

[Plot: data A versus y with the fitted line ŷ = Aw + be]

Prefer simpler regressions, i.e. where
– some w coefficients are zero
– the line is “flatter”
Reducing overfitting

To achieve both goals
– minimize the magnitude of the w vector

\min \sum_i |w_i| + C \sum_i s_i \quad \text{s.t.} \quad -s \le Aw + be - y \le s

C is a parameter to balance the two goals
– Chosen by experimentation

Reduces overfitting due to points far from the surface
Overfitting again: “close” points

“Close points” may be wrong due to noise only
– Line should be influenced by “real” data, not noise

[Plot: fitted line ŷ = Aw + be with a band of width ε around it]

Ignore errors from those points which are close!
Tolerant regression

Allow an interval of size ε with uniform error:

\min \sum_i |w_i| + C \sum_i s_i \quad \text{s.t.} \quad -s \le Aw + be - y \le s, \quad \varepsilon e \le s

How large should ε be?
– As large as possible, while preserving accuracy:

\min \sum_i |w_i| + C \sum_i s_i - C\mu\varepsilon \quad \text{s.t.} \quad -s \le Aw + be - y \le s, \quad \varepsilon e \le s
How about a nonlinear surface?
Introduce nonlinear kernel!

Begin with the previous formulation:

\min \sum_i |w_i| + C \sum_i s_i - C\mu\varepsilon \quad \text{s.t.} \quad -s \le Aw + be - y \le s, \quad \varepsilon e \le s

Substitute w = A'α and minimize over α instead:

-s \le AA'\alpha + be - y \le s

Substitute K(A,A') for AA':

-s \le K(A,A')\alpha + be - y \le s

A kernel is a nonlinear function.
Our improvements

This formulation and interpretation is new!
– Improves intuition from prior results
– Uses fewer variables
– Solves faster!

Computational tests run on DMI Locop2
– Dell PowerEdge 6300 server
– Four gigabytes of memory, 36 gigabytes of disk space
– Windows NT Server 4.0
– CPLEX 6.5 solver
Donated to UW by Microsoft Corporation
Comparison Results
  Dataset    µ:                  0        0.1      0.2 ... 0.7   Total time (sec)   Time improvement
  Census     ε                   0.00     0.02     ...
             Tuning set error    5.10%    4.74%    ...
             SSR time (sec)      980      935      ...           5086               Max 79.7%
             MM time (sec)       199      294      ...           3765               Avg 26.0%
  CompActiv  ε                   0.00     3.09     ...
             Tuning set error    6.60%    6.32%    ...
             SSR time (sec)      1364     1286     ...           7604               Max 65.7%
             MM time (sec)       468      660      ...           6533               Avg 14.1%
  Boston     ε                   0.00     0.42     ...
  Housing    Tuning set error    14.69%   14.62%   ...
             SSR time (sec)      36       34       ...           170                Max 52.0%
             MM time (sec)       17       23       ...           140                Avg 17.6%
Problem size concerns

How does the problem scale?
– m = number of points
– n = number of features

For a linear kernel the problem size is O(mn):

-s \le Aw + be - y \le s

For a nonlinear kernel the problem size is O(m^2):

-s \le K(A,A')\alpha + be - y \le s

Thousands of data points ==> massive problem!
Need an algorithm that will scale well.
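To make the scaling concern concrete, a back-of-the-envelope estimate (assuming dense 8-byte entries; the 32,000-row figure comes from the chunking experiment described below):

    # The chunking experiment below has an LP with 32,000 nonsparse rows and columns.
    m = 32_000
    entries = m * m                 # 1,024,000,000 nonzero values
    print(entries * 8 / 1e9, "GB")  # roughly 8 GB if stored as doubles, versus Locop2's 4 GB of memory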
Chunking approach

Idea: Use a chunking method
– Bring as much into memory as possible
– Solve this subset of the problem
– Retain solution and integrate into next subset

Explored in depth by Paul Bradley and O.L. Mangasarian for linear kernels

Solve in pieces, one chunk at a time.
Row-Column Chunking

Why column chunking also?
– If a non-linear kernel is used, chunks are very wide.
– A wide chunk must have a small number of rows to fit in memory.

Both these chunks use the same memory!
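A schematic sketch of the chunking idea, shown here as row chunking on a fixed-tolerance version of the regression (cvxpy, the made-up data, and the fixed ε are assumptions for illustration only; the authors' actual row-column procedure, which also chunks over kernel columns and comes with convergence guarantees, is described in their papers):

    import numpy as np
    import cvxpy as cp

    def tolerant_fit(A, y, C=1.0, eps=0.1):
        """1-norm SV regression with a fixed tolerance eps on one in-memory chunk."""
        m, n = A.shape
        w, b, s = cp.Variable(n), cp.Variable(), cp.Variable(m)
        r = A @ w + b * np.ones(m) - y
        cp.Problem(cp.Minimize(cp.norm1(w) + C * cp.sum(s)),
                   [-s <= r, r <= s, s >= eps]).solve()
        return w.value, b.value, np.abs(A @ w.value + b.value - y)

    # Bring one chunk into memory at a time; rows whose error reaches the eps tube
    # act like support vectors and are retained and integrated into the next chunk.
    m, n, chunk, eps = 2000, 8, 500, 0.1
    A = np.random.rand(m, n)
    y = A @ np.random.randn(n) + 0.05 * np.random.randn(m)
    keepA, keepy = np.empty((0, n)), np.empty(0)
    for start in range(0, m, chunk):
        blkA = np.vstack([keepA, A[start:start + chunk]])
        blky = np.concatenate([keepy, y[start:start + chunk]])
        w, b, resid = tolerant_fit(blkA, blky, eps=eps)
        sv = resid >= eps - 1e-6
        keepA, keepy = blkA[sv], blky[sv]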
Chunking Experimental Results
  Dataset:                   16,000 point subset of Census in R^11, plus noise
  Kernel:                    Gaussian Radial Basis Kernel
  LP size:                   32,000 nonsparse rows and columns
  Problem size:              1.024 billion nonzero values
  Time to termination:       18.8 days
  Number of SVs:             1621 support vectors
  Solution variables:        33 nonzero components
  Final tuning set error:    9.8%
  Tuning set error on first chunk (1000 points):  16.2%
Objective Value & Tuning Set Error
for Billion-Element Matrix
Tuning Set Error
25000
20%
20000
18%
Tuning Set Error
Objective Value
Objective Value
15000
10000
5000
16%
14%
12%
10%
8%
0
0
0
500
5
1000
9
1500
13
2000
18
Row-Column Chunk Iteration Number
Time in Days
0
0
500
5
1000
9
1500
13
2000
18
Row-Column Chunk Iteration Number
Time in Days
Given enough time, we find the right answer!
Integration into data mining tools


Method runs as a stand-alone application, with data resident on disk
With minimal effort, could sit on top of an RDBMS to manage data input/output
– Queries select a subset of data - easily SQLable
Database queries occur “infrequently”
– Data mining can be performed on a different machine from the one maintaining the DBMS
Licensing of a linear program solver is necessary

Algorithm can integrate with data mining tools.
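A sketch of the kind of infrequent query that would hand a chunk of training data to the regression code (sqlite3 and the table and column names are hypothetical; any DB-API connection to the RDBMS would do):

    import sqlite3
    import numpy as np

    # Pull one chunk of rows from the database; the mining code never needs
    # the whole table in memory at once.
    conn = sqlite3.connect("warehouse.db")
    rows = conn.execute(
        "SELECT age, income, years_education, software_spend "
        "FROM customers LIMIT 1000 OFFSET ?", (0,)).fetchall()
    data = np.asarray(rows, dtype=float)
    A, y = data[:, :-1], data[:, -1]     # features and the value to predict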
Part III:
Active Support Vector Machines
a.k.a. ASVM
The Classification Problem
[Plot: two point classes A+ and A−, bounding planes x'w = γ + 1 and x'w = γ − 1, and separating surface x'w = γ]
Find surface to best separate two classes.
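Once w and γ are found, classifying a new point is just checking which side of the separating surface it falls on; a small sketch (the numbers for w and γ are made up):

    import numpy as np

    w, gamma = np.array([0.8, -0.3]), 0.5    # made-up separating surface x'w = gamma

    def classify(X):
        """Return +1 for points on the A+ side of x'w = gamma, -1 otherwise."""
        return np.where(X @ w - gamma >= 0, 1, -1)

    print(classify(np.array([[1.0, 0.2], [-0.5, 1.5]])))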
Active Support Vector Machine

Features
– Solves classification problems
– No special software tools necessary! No LP or QP!
– FAST. Works on very large problems.
– Web page: www.cs.wisc.edu/~musicant/asvm
  • Available for download and can be integrated into data mining tools
  • MATLAB integration already provided

  # of points   Features   Iterations   Time (CPU min)
  4 million     32         5            38.04
  7 million     32         5            95.57
Summary and Future Work

Summary
– Robust regression can be modeled simply and efficiently as a quadratic program
– Tolerant regression can be used to solve massive regression problems
– ASVM can solve massive classification problems quickly

Future work
– Parallel approaches
– Distributed approaches
– ASVM for various types of regression
Questions?