Classification - SVM
CS 685: Special Topics in Data Mining
Jinze Liu
The UNIVERSITY of KENTUCKY
Linear Classifiers

[Figure: 2-D datapoints, where one marker denotes +1 and the other denotes -1; a classifier f maps an input x, with parameters (w, b), to an estimate y_est]

f(x, w, b) = sign(w · x - b)

How would you classify this data?

Copyright © 2001, 2003, Andrew W. Moore
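A minimal sketch of this decision rule in Python; the weight vector w, bias b, and the example points below are made-up values for illustration:

import numpy as np

# the slide's decision rule: f(x, w, b) = sign(w . x - b)
def f(x, w, b):
    return 1 if np.dot(w, x) - b >= 0 else -1   # treat the boundary itself as +1

w = np.array([1.0, 1.0])   # hypothetical weights: boundary is x1 + x2 = 1
b = 1.0
print(f(np.array([2.0, 2.0]), w, b))   # +1
print(f(np.array([0.0, 0.0]), w, b))   # -1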
Linear Classifiers

[Figure: several candidate linear boundaries, each separating the +1 points from the -1 points]

f(x, w, b) = sign(w · x - b)

Any of these would be fine... but which is best?
Classifier Margin

[Figure: a linear boundary with the margin band drawn around it, between the +1 and -1 points]

f(x, w, b) = sign(w · x - b)

Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
Maximum Margin

[Figure: the linear boundary with the widest possible margin for this data]

f(x, w, b) = sign(w · x - b)

The maximum margin linear classifier is the linear classifier with the, um, maximum margin.

This is the simplest kind of SVM (called an LSVM).

Linear SVM
Maximum Margin

[Figure: the maximum-margin boundary; the datapoints touching the margin are circled]

Support Vectors are those datapoints that the margin pushes up against.

f(x, w, b) = sign(w · x - b)

The maximum margin linear classifier is the linear classifier with the, um, maximum margin.

This is the simplest kind of SVM (called an LSVM).

Linear SVM
Why Maximum Margin?

[Figure: the maximum-margin boundary; Support Vectors are those datapoints that the margin pushes up against]

f(x, w, b) = sign(w · x - b)

1. Intuitively this feels safest.
2. If we've made a small error in the location of the boundary (it's been jolted in its perpendicular direction), this gives us the least chance of causing a misclassification.
3. LOOCV is easy since the model is immune to removal of any non-support-vector datapoints.
4. There's some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
5. Empirically it works very very well.
Specifying a line and margin

[Figure: the Classifier Boundary with a parallel Plus-Plane and Minus-Plane on either side]

• How do we represent this mathematically?
• ...in m input dimensions?
Specifying a line and margin

[Figure: the Classifier Boundary with the Plus-Plane and Minus-Plane]

• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }

Classify as..  +1                if w · x + b >= 1
               -1                if w · x + b <= -1
               Universe explodes if -1 < w · x + b < 1
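A small sketch of the three cases in the table; w and b are made-up values, and the "universe explodes" zone is simply the interior of the margin, where no training point is allowed to fall:

import numpy as np

def classify(x, w, b):
    s = np.dot(w, x) + b
    if s >= 1:
        return +1
    if s <= -1:
        return -1
    return None   # -1 < w.x + b < 1: inside the margin ("universe explodes")

w, b = np.array([1.0, 1.0]), -1.0              # hypothetical parameters
print(classify(np.array([2.0, 2.0]), w, b))    # +1
print(classify(np.array([-1.0, 0.0]), w, b))   # -1
print(classify(np.array([1.0, 0.5]), w, b))    # None (inside the margin)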
Computing the margin width

[Figure: the Plus-Plane and Minus-Plane, separated by the margin width M]

M = Margin Width. How do we compute M in terms of w and b?

• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }

Claim: The vector w is perpendicular to the Plus-Plane. Why?
Computing the margin width

[Figure: the same two planes, with two vectors u and v lying in the Plus-Plane]

M = Margin Width. How do we compute M in terms of w and b?

• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }

Claim: The vector w is perpendicular to the Plus-Plane. Why?
Let u and v be two vectors on the Plus-Plane. What is w · (u - v)?
And so of course the vector w is also perpendicular to the Minus-Plane.
Computing the margin width

[Figure: x+ on the Plus-Plane and x- on the Minus-Plane, separated by the margin width M]

M = Margin Width. How do we compute M in terms of w and b?

• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }
• The vector w is perpendicular to the Plus-Plane.
• Let x- be any point on the minus-plane.
• Let x+ be the closest plus-plane point to x-.
  (Any location in R^m: not necessarily a datapoint.)
Computing the margin width

[Figure: x+ on the Plus-Plane and x- on the Minus-Plane, separated by the margin width M]

M = Margin Width. How do we compute M in terms of w and b?

• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }
• The vector w is perpendicular to the Plus-Plane.
• Let x- be any point on the minus-plane.
• Let x+ be the closest plus-plane point to x-.
• Claim: x+ = x- + λw for some value of λ. Why?
Computing the margin width

[Figure: x+ and x- joined by a segment of length M, perpendicular to both planes]

M = Margin Width. How do we compute M in terms of w and b?

The line from x- to x+ is perpendicular to the planes, so to get from x- to x+ we travel some distance in the direction of w.

• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }
• The vector w is perpendicular to the Plus-Plane.
• Let x- be any point on the minus-plane.
• Let x+ be the closest plus-plane point to x-.
• Claim: x+ = x- + λw for some value of λ. Why?
Computing the margin width

[Figure: x+, x-, and the margin width M]

What we know:
• w · x+ + b = +1
• w · x- + b = -1
• x+ = x- + λw
• |x+ - x-| = M

It's now easy to get M in terms of w and b.
Computing the margin width

What we know:
• w · x+ + b = +1
• w · x- + b = -1
• x+ = x- + λw
• |x+ - x-| = M

Substituting x+ = x- + λw into w · x+ + b = 1:

    w · (x- + λw) + b = 1
=>  w · x- + b + λ w · w = 1
=>  -1 + λ w · w = 1
=>  λ = 2 / (w · w)
Computing the margin width

What we know:
• w · x+ + b = +1
• w · x- + b = -1
• x+ = x- + λw
• |x+ - x-| = M
• λ = 2 / (w · w)

M = |x+ - x-| = |λw| = λ|w| = λ sqrt(w · w) = 2 sqrt(w · w) / (w · w) = 2 / sqrt(w · w)

M = Margin Width = 2 / sqrt(w · w)
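A one-line numerical check of M = 2 / sqrt(w · w), using a made-up weight vector:

import numpy as np

w = np.array([3.0, 4.0])             # made-up weight vector, sqrt(w.w) = 5
M = 2.0 / np.sqrt(np.dot(w, w))      # margin width from the derivation above
print(M)                             # 0.4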
Learning the Maximum Margin Classifier

[Figure: x+ and x- on the two planes]

M = Margin Width = 2 / sqrt(w · w)

Given a guess of w and b we can
• Compute whether all data points are in the correct half-planes
• Compute the width of the margin

So now we just need to write a program to search the space of w's and b's to find the widest margin that matches all the datapoints.

How? Gradient descent? Simulated Annealing? Matrix Inversion? EM? Newton's Method?
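A sketch of the two computations named above for one candidate (w, b): check that every point sits in its correct half-plane (y_i (w · x_i + b) >= 1) and report the margin width; X, y, w, b are made-up illustration values:

import numpy as np

def evaluate_guess(w, b, X, y):
    feasible = np.all(y * (X @ w + b) >= 1)   # all points in the correct half-planes?
    margin = 2.0 / np.sqrt(w @ w)             # width of the margin for this w
    return feasible, margin

X = np.array([[2.0, 2.0], [0.0, 0.0]])        # made-up data
y = np.array([+1, -1])
print(evaluate_guess(np.array([1.0, 1.0]), -2.0, X, y))   # (True, ~1.41)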
Estimate the Margin

[Figure: +1 and -1 points with the line w · x + b = 0, and a point x at some distance from it]

• What is the distance expression for a point x to the line w · x + b = 0?

    d(x) = |x · w + b| / sqrt(Σ_{i=1..d} w_i²)
Estimate the Margin

[Figure: the line w · x + b = 0 with the margin marked]

• What is the expression for the margin?

    margin = min_{x ∈ D} d(x) = min_{x ∈ D} |x · w + b| / sqrt(Σ_{i=1..d} w_i²)
Maximize Margin

[Figure: the line w · x + b = 0 with the margin marked]

    argmax_{w,b} margin(w, b, D)
  = argmax_{w,b} min_{x_i ∈ D} d(x_i)
  = argmax_{w,b} min_{x_i ∈ D} |x_i · w + b| / sqrt(Σ_{i=1..d} w_i²)
Maximize Margin

[Figure: the line w · x + b = 0 with the margin marked]

    argmax_{w,b} min_{x_i ∈ D} |x_i · w + b| / sqrt(Σ_{i=1..d} w_i²)
    subject to ∀ x_i ∈ D : y_i (x_i · w + b) ≥ 0

• A min-max problem → a game problem
Maximize Margin

[Figure: the line w · x + b = 0 with the margin marked]

    argmax_{w,b} min_{x_i ∈ D} |x_i · w + b| / sqrt(Σ_{i=1..d} w_i²)
    subject to ∀ x_i ∈ D : y_i (x_i · w + b) ≥ 0

Strategy: fix the scale so that ∀ x_i ∈ D : |x_i · w + b| ≥ 1. Then

    argmin_{w,b} Σ_{i=1..d} w_i²
    subject to ∀ x_i ∈ D : y_i (x_i · w + b) ≥ 1
Learning via Quadratic Programming
• QP is a well-studied class of optimization
algorithms to maximize a quadratic function of
some real-valued variables subject to linear
constraints.
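In practice the QP is handed to a solver rather than searched by hand. A minimal sketch using scikit-learn's SVC with a linear kernel (assuming scikit-learn is installed; the toy data and the large C, which approximates the hard-margin LSVM, are illustrative choices):

import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [2.5, 1.5], [0.0, 0.0], [0.5, -0.5]])   # made-up points
y = np.array([+1, +1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]
print(w, b)
print(2.0 / np.sqrt(w @ w))        # margin width 2 / sqrt(w.w)
print(clf.support_vectors_)        # the datapoints the margin pushes up against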
Suppose we're in 1-dimension

[Figure: a handful of +1 and -1 points on the number line, with x = 0 marked]

What would SVMs do with this data?
Suppose we're in 1-dimension

[Figure: the 1-D maximum-margin separator, with the Positive "plane" and Negative "plane" marked on either side of x = 0]

Not a big surprise.
Harder 1-dimensional dataset

[Figure: 1-D data around x = 0 where one class sits between two groups of the other class, so no single threshold separates them]

That's wiped the smirk off SVM's face. What can be done about this?
Harder 1-dimensional dataset

Remember how permitting nonlinear basis functions made linear regression so much nicer? Let's permit them here too:

    z_k = (x_k, x_k²)

[Figure: the same 1-D points around x = 0, mapped into the (x, x²) plane]
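A sketch of that basis expansion on a made-up 1-D dataset: the two classes are not separable by any threshold on x, but after mapping each point to z_k = (x_k, x_k²) a horizontal line in the new plane separates them, so a linear SVM on z works:

import numpy as np

x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])   # made-up 1-D points
y = np.array([+1, +1, -1, -1, -1, +1, +1])              # outer points +1, inner points -1

Z = np.column_stack([x, x ** 2])   # z_k = (x_k, x_k^2)
# In the (x, x^2) plane the classes are split by the line x^2 = 2,
# a linearly separable problem that was impossible in the original 1-D input.
print(Z)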
Common SVM basis functions

z_k = ( polynomial terms of x_k of degree 1 to q )

z_k = ( radial basis functions of x_k )

    z_k[j] = φ_j(x_k) = KernelFn( |x_k - c_j| / KW )

z_k = ( sigmoid functions of x_k )

This is sensible. Is that the end of the story? No... there's one more trick!
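A sketch of a single radial basis feature of the form above, with a Gaussian as the KernelFn; the centre c_j and kernel width KW are illustrative values:

import numpy as np

def rbf_feature(x_k, c_j, KW):
    # z_k[j] = phi_j(x_k) = KernelFn(|x_k - c_j| / KW), with a Gaussian KernelFn
    r = np.linalg.norm(x_k - c_j) / KW
    return np.exp(-r ** 2)

print(rbf_feature(np.array([1.0, 2.0]), c_j=np.array([0.0, 0.0]), KW=2.0))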
SVM Performance

• Anecdotally they work very very well indeed.
• Example: They are currently the best-known classifier on a well-studied hand-written character recognition benchmark.
• Another example: Andrew knows several reliable people doing practical real-world work who claim that SVMs have saved them when their other favorite classifiers did poorly.
• There is a lot of excitement and religious fervor about SVMs as of 2001.
Doing multi-class classification

• SVMs can only handle two-class outputs (i.e. a categorical output variable with arity 2).
• What can be done?
• Answer: with output arity N, learn N SVMs
  – SVM 1 learns "Output == 1" vs "Output != 1"
  – SVM 2 learns "Output == 2" vs "Output != 2"
  – :
  – SVM N learns "Output == N" vs "Output != N"
• Then to predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region (a sketch follows below).
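A sketch of that one-vs-rest scheme: train one binary SVM per class and predict with the class whose decision value is furthest into the positive region. Assumes scikit-learn; the three-class toy data is made up:

import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.2, 4.9], [0.0, 5.0], [0.1, 5.2]])
y = np.array([1, 1, 2, 2, 3, 3])          # made-up 3-class labels

svms = {}
for k in np.unique(y):
    # SVM k learns "Output == k" vs "Output != k"
    svms[k] = SVC(kernel="linear").fit(X, (y == k).astype(int))

x_new = np.array([[4.8, 5.1]])
scores = {k: clf.decision_function(x_new)[0] for k, clf in svms.items()}
print(max(scores, key=scores.get))        # class pushed furthest into the positive region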
SVM Related Links
• http://svm.dcs.rhbnc.ac.uk/
• http://www.kernel-machines.org/
• C. J. C. Burges. A Tutorial on Support Vector Machines for
Pattern Recognition. Knowledge Discovery and Data Mining,
2(2), 1998.
• SVMlight – Software (in C) http://ais.gmd.de/~thorsten/svm_light
• BOOK: An Introduction to Support Vector Machines
N. Cristianini and J. Shawe-Taylor
Cambridge University Press
Classification - CBA
CS 685: Special Topics in Data Mining
Spring 2008
Jinze Liu
The UNIVERSITY of KENTUCKY
Association Rules

Transaction-id | Items bought
100            | f, a, c, d, g, i, m, p
200            | a, b, c, f, l, m, o
300            | b, f, h, j, o
400            | b, c, k, s, p
500            | a, f, c, e, l, p, m, n

[Figure: Venn diagram of "Customer buys beer", "Customer buys diaper", and "Customer buys both"]

Itemset X = {x1, ..., xk}

Find all the rules X → Y with minimum support and confidence
– support, s, is the probability that a transaction contains X ∪ Y
– confidence, c, is the conditional probability that a transaction having X also contains Y

Let supmin = 50%, confmin = 50%. Association rules:
    a → c (60%, 100%)
    c → a (60%, 75%)
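A small sketch that recomputes those two rules directly from the transaction table above (the helper functions are mine, written for illustration):

# transactions from the table above
D = {
    100: {"f", "a", "c", "d", "g", "i", "m", "p"},
    200: {"a", "b", "c", "f", "l", "m", "o"},
    300: {"b", "f", "h", "j", "o"},
    400: {"b", "c", "k", "s", "p"},
    500: {"a", "f", "c", "e", "l", "p", "m", "n"},
}

def support(itemset):
    return sum(itemset <= t for t in D.values()) / len(D)   # fraction of transactions containing the itemset

def confidence(X, Y):
    return support(X | Y) / support(X)                      # P(Y | X)

print(support({"a", "c"}), confidence({"a"}, {"c"}))   # 0.6, 1.0  -> a => c (60%, 100%)
print(support({"a", "c"}), confidence({"c"}, {"a"}))   # 0.6, 0.75 -> c => a (60%, 75%)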
Classification based on Association

• Classification rule mining versus Association rule mining
• Aim
  – A small set of rules as classifier
  – All rules according to minsup and minconf
• Syntax
  – X → y
  – X → Y
Why & How to Integrate
• Both classification rule mining and
association rule mining are indispensable to
practical applications.
• The integration is done by focusing on a
special subset of association rules whose
right-hand sides are restricted to the
classification class attribute.
– CARs: class association rules
CBA: Three Steps
• Discretize continuous attributes, if any
• Generate all class association rules (CARs)
• Build a classifier based on the generated CARs.
Our Objectives
• To generate the complete set of CARs that
satisfy the user-specified minimum
support (minsup) and minimum
confidence (minconf) constraints.
• To build a classifier from the CARs.
Rule Generator: Basic Concepts

• Ruleitem
  – <condset, y>: condset is a set of items, y is a class label
  – Each ruleitem represents a rule: condset → y
• condsupCount
  – The number of cases in D that contain condset
• rulesupCount
  – The number of cases in D that contain the condset and are labeled with class y
• Support = (rulesupCount / |D|) * 100%
• Confidence = (rulesupCount / condsupCount) * 100%
RG: Basic Concepts (Cont.)
• Frequent ruleitems
– A ruleitem is frequent if its support is above minsup
• Accurate rule
– A rule is accurate if its confidence is above minconf
• Possible rule
– For all ruleitems that have the same condset, the
ruleitem with the highest confidence is the possible
rule of this set of ruleitems.
• The set of class association rules (CARs) consists
of all the possible rules (PRs) that are both
frequent and accurate.
RG: An Example

• A ruleitem: <{(A,1), (B,1)}, (class,1)>
  – Assume that
    • the support count of the condset (condsupCount) is 3,
    • the support count of this ruleitem (rulesupCount) is 2, and
    • |D| = 10
  – Then, for the rule (A,1), (B,1) → (class,1):
    • support = 20%   ( = rulesupCount / |D| * 100% )
    • confidence = 66.7%   ( = rulesupCount / condsupCount * 100% )
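The same arithmetic as a two-line check, with the counts copied from the example:

condsupCount, rulesupCount, D_size = 3, 2, 10
print(rulesupCount / D_size * 100)           # support    = 20.0 %
print(rulesupCount / condsupCount * 100)     # confidence ≈ 66.7 %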
RG: The Algorithm

1  F1 = {large 1-ruleitems};
2  CAR1 = genRules(F1);
3  prCAR1 = pruneRules(CAR1);        // count the item and class occurrences to determine the frequent 1-ruleitems and prune them
4  for (k = 2; Fk-1 ≠ Ø; k++) do
5      Ck = candidateGen(Fk-1);      // generate the candidate ruleitems Ck using the frequent ruleitems Fk-1
6      for each data case d ∈ D do   // scan the database
7          Cd = ruleSubset(Ck, d);   // find all the ruleitems in Ck whose condsets are supported by d
8          for each candidate c ∈ Cd do
9              c.condsupCount++;
10             if d.class = c.class then c.rulesupCount++;   // update the support counts of the candidates in Ck
11         end
12     end
RG: The Algorithm (cont.)

13     Fk = {c ∈ Ck | c.rulesupCount ≥ minsup};   // select the new frequent ruleitems to form Fk
14     CARk = genRules(Fk);                       // select the ruleitems that are both accurate and frequent
15     prCARk = pruneRules(CARk);
16 end
17 CARs = ∪k CARk;
18 prCARs = ∪k prCARk;
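A minimal runnable sketch of the counting scan in lines 6-12, with ruleitems kept as (condset, class, condsupCount, rulesupCount) records; the toy dataset and candidate list are made up, and candidateGen, genRules, and pruneRules from the pseudocode are not implemented here:

# each data case: (set of items, class label); each candidate: [condset, y, condsupCount, rulesupCount]
D = [({"A1", "B1"}, "c1"), ({"A1", "B1"}, "c2"), ({"A1"}, "c1")]
candidates = [[{"A1", "B1"}, "c1", 0, 0], [{"A1"}, "c1", 0, 0]]

for items, cls in D:                      # line 6: scan the database
    for c in candidates:
        condset, y = c[0], c[1]
        if condset <= items:              # line 7: condset supported by d
            c[2] += 1                     # line 9: condsupCount++
            if cls == y:                  # line 10: same class label
                c[3] += 1                 #          rulesupCount++

for condset, y, condsup, rulesup in candidates:
    print(sorted(condset), "->", y,
          "support:", rulesup / len(D),
          "confidence:", rulesup / condsup)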