Classification - SVM
CS 685: Special Topics in Data Mining
Jinze Liu
The University of Kentucky
Linear Classifiers
(Slides adapted from Andrew W. Moore, copyright © 2001, 2003)
[Figure: 2-D datapoints, one marker denoting +1 and the other denoting -1; an input x is fed to a classifier f with parameters (w, b), producing an estimate y_est]
f(x, w, b) = sign(w · x - b)
How would you classify this data?
Linear Classifiers
[Figure: the same data with several different candidate linear boundaries drawn through it]
f(x, w, b) = sign(w · x - b)
Any of these would be fine... but which is best?
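As a concrete illustration of the decision rule f(x, w, b) = sign(w · x - b), here is a minimal NumPy sketch; the weight vector w and bias b are made-up values, not ones learned from any dataset.

import numpy as np

def linear_classify(X, w, b):
    """Classify each row of X as +1 or -1 using sign(w . x - b)."""
    scores = X @ w - b
    return np.where(scores >= 0, 1, -1)

# Hypothetical 2-D points and an arbitrary (w, b).
X = np.array([[2.0, 1.0], [-1.0, -2.0], [0.5, 3.0]])
w = np.array([1.0, 1.0])
b = 0.5
print(linear_classify(X, w, b))   # [ 1 -1  1 ]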
Classifier Margin
[Figure: one candidate boundary with the band around it widened until it touches a datapoint]
f(x, w, b) = sign(w · x - b)
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
Maximum Margin
[Figure: the boundary with the widest possible margin; the datapoints lying on the margin are highlighted]
f(x, w, b) = sign(w · x - b)
The maximum margin linear classifier is the linear classifier with the, um, maximum margin.
This is the simplest kind of SVM (called an LSVM): the Linear SVM.
Support Vectors are those datapoints that the margin pushes up against.
Why Maximum Margin?
1. Intuitively this feels safest.
2. If we've made a small error in the location of the boundary (it's been jolted in its perpendicular direction), this gives us the least chance of causing a misclassification.
3. LOOCV is easy, since the model is immune to removal of any non-support-vector datapoints.
4. There's some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
5. Empirically it works very very well.
Specifying a line and margin
[Figure: the classifier boundary with a parallel "Plus-Plane" on one side and a "Minus-Plane" on the other]
• How do we represent this mathematically?
• ...in m input dimensions?
Specifying a line and margin
• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }
Classify as:  +1                  if w · x + b >= 1
              -1                  if w · x + b <= -1
              Universe explodes   if -1 < w · x + b < 1
Computing the margin width
[Figure: the Plus-Plane and Minus-Plane separated by the margin width M]
How do we compute M in terms of w and b?
• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }
Claim: The vector w is perpendicular to the Plus-Plane. Why?
Let u and v be two vectors on the Plus-Plane. What is w · (u - v)?
And so of course the vector w is also perpendicular to the Minus-Plane.
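Filling in the answer the slide is asking for (a short LaTeX sketch, using only the plane definitions above):

\[
\mathbf{w}\cdot\mathbf{u} + b = 1, \quad \mathbf{w}\cdot\mathbf{v} + b = 1
\;\Longrightarrow\;
\mathbf{w}\cdot(\mathbf{u}-\mathbf{v}) = (1-b) - (1-b) = 0 .
\]

Since u - v can be any direction lying within the Plus-Plane, w is orthogonal to all such directions, i.e. perpendicular to the plane; the same argument with -1 in place of +1 handles the Minus-Plane.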
Computing the margin width
[Figure: a point x- on the Minus-Plane and the closest Plus-Plane point x+, with the margin width M between the planes]
• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = -1 }
• The vector w is perpendicular to the Plus-Plane.
• Let x- be any point on the minus plane.
• Let x+ be the closest plus-plane point to x-.
  (Any location in R^m: not necessarily a datapoint.)
Claim: x+ = x- + λ w for some value of λ. Why?
Because the line from x- to x+ is perpendicular to the planes, so to get from x- to x+ we travel some distance in direction w.
Computing the margin width
What we know:
• w · x+ + b = +1
• w · x- + b = -1
• x+ = x- + λ w
• |x+ - x-| = M
It's now easy to get M in terms of w and b:
\[
\mathbf{w}\cdot(\mathbf{x}^- + \lambda\mathbf{w}) + b = 1
\;\Rightarrow\;
\mathbf{w}\cdot\mathbf{x}^- + b + \lambda\,\mathbf{w}\cdot\mathbf{w} = 1
\;\Rightarrow\;
-1 + \lambda\,\mathbf{w}\cdot\mathbf{w} = 1
\;\Rightarrow\;
\lambda = \frac{2}{\mathbf{w}\cdot\mathbf{w}}
\]
Computing the margin width
\[
M = |\mathbf{x}^+ - \mathbf{x}^-| = |\lambda\mathbf{w}| = \lambda\,|\mathbf{w}| = \lambda\sqrt{\mathbf{w}\cdot\mathbf{w}}
= \frac{2\sqrt{\mathbf{w}\cdot\mathbf{w}}}{\mathbf{w}\cdot\mathbf{w}} = \frac{2}{\sqrt{\mathbf{w}\cdot\mathbf{w}}}
\]
So the margin width is M = 2 / sqrt(w · w).
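A quick numeric sanity check of M = 2 / sqrt(w · w), with an arbitrary made-up w and b (not values from the slides): pick a point on the minus-plane, step to the plus-plane along w, and measure the distance.

import numpy as np

# Arbitrary made-up parameters, chosen only so the arithmetic is easy to follow.
w = np.array([3.0, 4.0])            # w.w = 25
b = 2.0

x_minus = np.array([0.0, -0.75])    # w.x + b = -3 + 2 = -1, so it lies on the minus-plane
lam = 2.0 / (w @ w)                 # lambda = 2 / (w.w)
x_plus = x_minus + lam * w          # closest point on the plus-plane

print(w @ x_plus + b)                    # -> 1.0, so x_plus is on the plus-plane
print(np.linalg.norm(x_plus - x_minus))  # -> 0.4, which equals 2 / sqrt(25)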
Learning the Maximum Margin Classifier
M = Margin Width = 2 / sqrt(w · w)
Given a guess of w and b we can
• Compute whether all data points are in the correct half-planes
• Compute the width of the margin
So now we just need to write a program to search the space of w's and b's to find the widest margin that matches all the datapoints. How?
Gradient descent? Simulated Annealing? Matrix Inversion? EM? Newton's Method?
Learning via Quadratic Programming
• QP is a well-studied class of optimization
algorithms to maximize a quadratic function of
some real-valued variables subject to linear
constraints.
Quadratic Programming
Find
\[
\operatorname*{arg\,max}_{\mathbf{u}}\; c + \mathbf{d}^{T}\mathbf{u} + \frac{\mathbf{u}^{T}R\,\mathbf{u}}{2} \qquad \text{(quadratic criterion)}
\]
subject to n linear inequality constraints
\[
a_{11}u_1 + a_{12}u_2 + \cdots + a_{1m}u_m \le b_1, \;\;\ldots,\;\; a_{n1}u_1 + a_{n2}u_2 + \cdots + a_{nm}u_m \le b_n
\]
and subject to e additional linear equality constraints
\[
a_{(n+1)1}u_1 + \cdots + a_{(n+1)m}u_m = b_{(n+1)}, \;\;\ldots,\;\; a_{(n+e)1}u_1 + \cdots + a_{(n+e)m}u_m = b_{(n+e)}
\]
Learning the Maximum Margin Classifier
M = 2 / sqrt(w · w)
Given a guess of w, b we can
• Compute whether all data points are in the correct half-planes
• Compute the margin width
Assume R datapoints, each (xk, yk) where yk = +/- 1.
What should our quadratic optimization criterion be?
How many constraints will we have? What should they be?
Learning the Maximum Margin Classifier
Given a guess of w, b we can
• Compute whether all data points are in the correct half-planes
• Compute the margin width, M = 2 / sqrt(w · w)
Assume R datapoints, each (xk, yk) where yk = +/- 1.
What should our quadratic optimization criterion be?   Minimize w · w
How many constraints will we have?   R
What should they be?
   w · xk + b >= 1    if yk = +1
   w · xk + b <= -1   if yk = -1
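To make this concrete, here is a minimal sketch (not from the slides) that feeds this hard-margin QP to scipy's general-purpose SLSQP solver on a tiny made-up, linearly separable dataset; a real implementation would use a dedicated QP or SVM package.

import numpy as np
from scipy.optimize import minimize

# Tiny made-up, linearly separable dataset (an assumption for illustration only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(v):                    # v packs [w1, w2, b]
    w = v[:-1]
    return 0.5 * (w @ w)             # minimize w.w (the 1/2 only rescales)

def margin_constraints(v):           # yk (w . xk + b) - 1 >= 0, one entry per datapoint
    w, b = v[:-1], v[-1]
    return y * (X @ w + b) - 1.0

res = minimize(objective, np.zeros(X.shape[1] + 1), method="SLSQP",
               constraints=[{"type": "ineq", "fun": margin_constraints}])
w, b = res.x[:-1], res.x[-1]
print("w =", w, " b =", b, " margin =", 2 / np.sqrt(w @ w))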
Uh-oh!
[Figure: data that is not linearly separable: some +1 and -1 points lie on the wrong side of any candidate boundary]
This is going to be a problem! What should we do?
Uh-oh!
This is going to be a problem! What should we do?
Idea 1:
Find the minimum w · w, while minimizing the number of training set errors.
Problemette: Two things to minimize makes for an ill-defined optimization.
Uh-oh!
This is going to be a problem! What should we do?
Idea 1.1:
Minimize  w · w + C (#train errors),   where C is a tradeoff parameter.
There's a serious practical problem that's about to make us reject this approach. Can you guess what it is?
Uh-oh!
Idea 1.1: Minimize  w · w + C (#train errors)
The problem: this can't be expressed as a Quadratic Programming problem, so solving it may be too slow.
(Also, it doesn't distinguish between disastrous errors and near misses.)
Uh-oh!
This is going to be a problem! What should we do?
Idea 2.0:
Minimize  w · w + C (distance of error points to their correct place)
Learning Maximum Margin with Noise
M = 2 / sqrt(w · w)
Given a guess of w, b we can
• Compute the sum of distances of points to their correct zones
• Compute the margin width
Assume R datapoints, each (xk, yk) where yk = +/- 1.
What should our quadratic optimization criterion be?
How many constraints will we have? What should they be?
Learning Maximum Margin with Noise
[Figure: the margin band with a few points on the wrong side; their distances to the correct zone are labelled ε2, ε7, ε11]
Given a guess of w, b we can
• Compute the sum of distances of points to their correct zones
• Compute the margin width, M = 2 / sqrt(w · w)
Assume R datapoints, each (xk, yk) where yk = +/- 1.
What should our quadratic optimization criterion be?
   Minimize  \(\tfrac{1}{2}\,\mathbf{w}\cdot\mathbf{w} + C\sum_{k=1}^{R}\varepsilon_k\)
How many constraints will we have?   R
What should they be?
   w · xk + b >= 1 - εk    if yk = +1
   w · xk + b <= -1 + εk   if yk = -1
Learning Maximum Margin with Noise
(m = # input dimensions, R = # records)
Our original (noiseless data) QP had m + 1 variables: w1, w2, ..., wm, and b.
Our new (noisy data) QP has m + 1 + R variables: w1, w2, ..., wm, b, ε1, ..., εR.
What should our quadratic optimization criterion be?
   Minimize  \(\tfrac{1}{2}\,\mathbf{w}\cdot\mathbf{w} + C\sum_{k=1}^{R}\varepsilon_k\)
How many constraints will we have?   R
What should they be?
   w · xk + b >= 1 - εk    if yk = +1
   w · xk + b <= -1 + εk   if yk = -1
Learning Maximum Margin with Noise
Minimize  \(\tfrac{1}{2}\,\mathbf{w}\cdot\mathbf{w} + C\sum_{k=1}^{R}\varepsilon_k\)  subject to the same R constraints:
   w · xk + b >= 1 - εk    if yk = +1
   w · xk + b <= -1 + εk   if yk = -1
There's a bug in this QP. Can you spot it?
Learning Maximum Margin with Noise
The bug: nothing above stops the εk from going negative (letting points "earn credit" for being far on the correct side), so we must also constrain every εk to be non-negative.
Minimize  \(\tfrac{1}{2}\,\mathbf{w}\cdot\mathbf{w} + C\sum_{k=1}^{R}\varepsilon_k\)
How many constraints will we have?   2R
What should they be?
   w · xk + b >= 1 - εk    if yk = +1
   w · xk + b <= -1 + εk   if yk = -1
   εk >= 0                 for all k
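A minimal sketch of this soft-margin primal (again using scipy's SLSQP, on made-up data; the choice C = 1.0 is arbitrary). The variables are packed as [w1, w2, b, ε1, ..., εR].

import numpy as np
from scipy.optimize import minimize

# Made-up, slightly overlapping data (assumption for illustration only).
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.5], [-2.0, -1.0], [-0.5, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])   # the last point sits among the negatives
R, m = X.shape
C = 1.0

def objective(v):
    w, eps = v[:m], v[m + 1:]
    return 0.5 * (w @ w) + C * eps.sum()

constraints = [
    # yk (w . xk + b) >= 1 - eps_k   (one entry per datapoint)
    {"type": "ineq", "fun": lambda v: y * (X @ v[:m] + v[m]) - 1.0 + v[m + 1:]},
    # eps_k >= 0 for all k           (the extra R constraints from the bug fix)
    {"type": "ineq", "fun": lambda v: v[m + 1:]},
]

res = minimize(objective, np.zeros(m + 1 + R), method="SLSQP", constraints=constraints)
w, b, eps = res.x[:m], res.x[m], res.x[m + 1:]
print("w =", w, " b =", b, " slacks =", np.round(eps, 3))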
An Equivalent QP
Maximize
\[
\sum_{k=1}^{R}\alpha_k \;-\; \frac{1}{2}\sum_{k=1}^{R}\sum_{l=1}^{R}\alpha_k\alpha_l Q_{kl}
\qquad\text{where } Q_{kl} = y_k y_l\,(\mathbf{x}_k\cdot\mathbf{x}_l)
\]
Subject to these constraints:
\[
0 \le \alpha_k \le C \;\;\forall k,
\qquad
\sum_{k=1}^{R}\alpha_k y_k = 0
\]
Then define:
\[
\mathbf{w} = \sum_{k=1}^{R}\alpha_k y_k \mathbf{x}_k,
\qquad
b = y_K(1-\varepsilon_K) - \mathbf{x}_K\cdot\mathbf{w}
\quad\text{where } K = \operatorname*{arg\,max}_k \alpha_k
\]
Then classify with:  f(x, w, b) = sign(w · x - b)
An Equivalent QP
(Same maximization, constraints, and definitions of w and b as above.)
Datapoints with αk > 0 will be the support vectors...
...so the sum w = Σk αk yk xk only needs to be over the support vectors.
An Equivalent QP
Why did I tell you about this equivalent QP?
• It's a formulation that QP packages can optimize more quickly.
• Because of further jaw-dropping developments you're about to learn.
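A minimal sketch (an assumption, not the lecturer's code) that solves this dual with scipy's SLSQP on a tiny made-up dataset, then recovers w, b, and the support vectors from the resulting αk:

import numpy as np
from scipy.optimize import minimize

# Tiny made-up dataset (assumption for illustration only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
R, C = len(y), 10.0

Q = (y[:, None] * y[None, :]) * (X @ X.T)        # Q_kl = y_k y_l (x_k . x_l)

def neg_dual(a):                                 # minimize the negated dual objective
    return 0.5 * (a @ Q @ a) - a.sum()

res = minimize(neg_dual, np.zeros(R), method="SLSQP",
               bounds=[(0.0, C)] * R,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x
support = np.where(alpha > 1e-6)[0]              # datapoints with alpha_k > 0: the support vectors
w = (alpha * y) @ X                              # w = sum_k alpha_k y_k x_k
K = int(np.argmax(alpha))
b = y[K] - X[K] @ w                              # b = y_K (1 - eps_K) - x_K . w, with eps_K = 0 here
# Under the plus/minus-plane convention used earlier, classify new points with sign(w . x + b).
print("alpha =", np.round(alpha, 3), " support vectors:", support, " w =", w, " b =", b)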
Estimate the Margin
[Figure: the hyperplane w·x + b = 0 and a point x at some distance from it]
• What is the distance expression for a point x to the line w·x + b = 0?
\[
d(\mathbf{x}) = \frac{|\mathbf{x}\cdot\mathbf{w} + b|}{\|\mathbf{w}\|_2}
= \frac{|\mathbf{x}\cdot\mathbf{w} + b|}{\sqrt{\sum_{i=1}^{d} w_i^2}}
\]
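A minimal sketch of this distance formula (the w, b, and point below are arbitrary made-up values):

import numpy as np

def distance_to_hyperplane(x, w, b):
    """d(x) = |x.w + b| / sqrt(sum_i w_i^2)."""
    return abs(x @ w + b) / np.linalg.norm(w)

w = np.array([3.0, 4.0])
b = -5.0
x = np.array([2.0, 1.0])
print(distance_to_hyperplane(x, w, b))   # |6 + 4 - 5| / 5 = 1.0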
Estimate the Margin
[Figure: the hyperplane w·x + b = 0 with the margin marked as the distance to the nearest datapoint]
• What is the expression for the margin?
\[
\text{margin} = \min_{\mathbf{x}\in D} d(\mathbf{x})
= \min_{\mathbf{x}\in D}\frac{|\mathbf{x}\cdot\mathbf{w} + b|}{\sqrt{\sum_{i=1}^{d} w_i^2}}
\]
Maximize Margin
\[
\operatorname*{arg\,max}_{\mathbf{w},b}\ \text{margin}(\mathbf{w}, b, D)
= \operatorname*{arg\,max}_{\mathbf{w},b}\ \min_{\mathbf{x}_i\in D} d(\mathbf{x}_i)
= \operatorname*{arg\,max}_{\mathbf{w},b}\ \min_{\mathbf{x}_i\in D}
\frac{|b + \mathbf{x}_i\cdot\mathbf{w}|}{\sqrt{\sum_{i=1}^{d} w_i^2}}
\]
Maximize Margin
\[
\operatorname*{arg\,max}_{\mathbf{w},b}\ \min_{\mathbf{x}_i\in D}
\frac{|b + \mathbf{x}_i\cdot\mathbf{w}|}{\sqrt{\sum_{i=1}^{d} w_i^2}}
\qquad\text{subject to } \forall\,\mathbf{x}_i\in D:\; y_i(\mathbf{x}_i\cdot\mathbf{w} + b) \ge 0
\]
• Min-max problem → game problem
Maximize Margin
\[
\operatorname*{arg\,max}_{\mathbf{w},b}\ \min_{\mathbf{x}_i\in D}
\frac{|b + \mathbf{x}_i\cdot\mathbf{w}|}{\sqrt{\sum_{i=1}^{d} w_i^2}}
\qquad\text{subject to } \forall\,\mathbf{x}_i\in D:\; y_i(\mathbf{x}_i\cdot\mathbf{w} + b) \ge 0
\]
Strategy: fix the scale so that \(\forall\,\mathbf{x}_i\in D:\; |b + \mathbf{x}_i\cdot\mathbf{w}| \ge 1\); then solve
\[
\operatorname*{arg\,min}_{\mathbf{w},b}\ \sum_{i=1}^{d} w_i^2
\qquad\text{subject to } \forall\,\mathbf{x}_i\in D:\; y_i(\mathbf{x}_i\cdot\mathbf{w} + b) \ge 1
\]
Suppose we're in 1-dimension
[Figure: datapoints on a number line around x = 0, one class to each side]
What would SVMs do with this data?
Suppose we're in 1-dimension
Not a big surprise.
[Figure: the maximum-margin boundary sits between the two classes, with the positive "plane" and negative "plane" at the edges of the margin]
Harder 1-dimensional dataset
[Figure: a 1-D dataset that is not linearly separable: points of one class lie in between points of the other]
That's wiped the smirk off SVM's face. What can be done about this?
Harder 1-dimensional dataset
Remember how permitting non-linear basis functions made linear regression so much nicer?
Let's permit them here too:
   z_k = (x_k, x_k^2)
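A small sketch of this basis expansion on a made-up 1-D dataset where one class sits between the other (mirroring the slide's picture); after mapping each xk to zk = (xk, xk^2), the two classes become linearly separable, e.g. by thresholding the second coordinate.

import numpy as np

# Made-up, non-separable 1-D data: class -1 near the origin, class +1 farther out.
x = np.array([-4.0, -3.0, -1.0, -0.5, 0.5, 1.0, 3.0, 4.0])
y = np.array([  1,    1,   -1,   -1,  -1,  -1,   1,   1])

Z = np.column_stack([x, x ** 2])        # z_k = (x_k, x_k^2)

# In z-space the classes are separable by the horizontal line x^2 = 2.5,
# i.e. w = (0, 1), b = -2.5 under the w.z + b convention.
w, b = np.array([0.0, 1.0]), -2.5
print(np.sign(Z @ w + b).astype(int))   # [ 1  1 -1 -1 -1 -1  1  1], matching y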
Common SVM basis functions
zk = ( polynomial terms of xk of degree 1 to q )
zk = ( radial basis functions of xk ):
\[
z_k[j] = \varphi_j(\mathbf{x}_k) = \mathrm{KernelFn}\!\left(\frac{|\mathbf{x}_k - \mathbf{c}_j|}{KW}\right)
\]
zk = ( sigmoid functions of xk )
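A minimal sketch of the radial-basis expansion above, with a Gaussian assumed as the KernelFn and made-up centres c_j and kernel width KW:

import numpy as np

def rbf_features(X, centers, kw):
    """z_k[j] = KernelFn(|x_k - c_j| / KW), here with a Gaussian KernelFn (an assumed choice)."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)   # |x_k - c_j| for every pair
    return np.exp(-(d / kw) ** 2)

X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 0.0]])    # made-up datapoints
centers = np.array([[0.0, 0.0], [2.0, 0.0]])          # made-up centres c_j
print(rbf_features(X, centers, kw=1.0).round(3))      # one row per x_k, one column per c_j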
SVM Performance
• Anecdotally they work very very well indeed.
• Example: they are currently the best-known classifier on a well-studied hand-written character recognition benchmark.
• There has been a lot of excitement and religious fervor about SVMs.
• Despite this, some practitioners (including your lecturer) are a little skeptical.
Doing multi-class classification
• SVMs can only handle two-class outputs (i.e. a categorical output variable with arity 2).
• What can be done?
• Answer: with output arity N, learn N SVMs
   – SVM 1 learns "Output == 1" vs "Output != 1"
   – SVM 2 learns "Output == 2" vs "Output != 2"
   – :
   – SVM N learns "Output == N" vs "Output != N"
• Then to predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region (see the sketch below).
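A minimal one-vs-rest sketch with linear scorers standing in for N = 3 already-trained SVMs; the weights and bias values are arbitrary placeholders, not learned ones.

import numpy as np

# Hypothetical decision functions of 3 trained linear SVMs ("Output == i" vs the rest).
W = np.array([[ 1.0,  0.0],    # SVM 1
              [ 0.0,  1.0],    # SVM 2
              [-1.0, -1.0]])   # SVM 3
b = np.array([0.0, 0.0, 0.0])

def predict_one_vs_rest(x):
    scores = W @ x + b                 # each SVM's signed score for x
    return int(np.argmax(scores)) + 1  # class whose SVM pushes x furthest into the positive region

print(predict_one_vs_rest(np.array([2.0, -1.0])))   # -> 1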
SVM Related Links
• http://svm.dcs.rhbnc.ac.uk/
• http://www.kernel-machines.org/
• C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining, 2(2), 1998.
• SVMlight – software (in C): http://ais.gmd.de/~thorsten/svm_light
• Book: An Introduction to Support Vector Machines. N. Cristianini and J. Shawe-Taylor. Cambridge University Press.
Classification - CBA
CS 685: Special Topics in Data Mining
Spring 2008
Jinze Liu
The University of Kentucky
Association Rules
Transaction-id | Items bought
100            | f, a, c, d, g, i, m, p
200            | a, b, c, f, l, m, o
300            | b, f, h, j, o
400            | b, c, k, s, p
500            | a, f, c, e, l, p, m, n
[Figure: the classic Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]
Itemset X = {x1, ..., xk}
Find all the rules X → Y with minimum support and confidence:
• support, s: the probability that a transaction contains X ∪ Y
• confidence, c: the conditional probability that a transaction containing X also contains Y
Let sup_min = 50%, conf_min = 50%. Association rules (see the sketch below):
   a → c (60%, 100%)
   c → a (60%, 75%)
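A small sketch that recomputes the a → c and c → a numbers from the transaction table above:

transactions = {
    100: {"f", "a", "c", "d", "g", "i", "m", "p"},
    200: {"a", "b", "c", "f", "l", "m", "o"},
    300: {"b", "f", "h", "j", "o"},
    400: {"b", "c", "k", "s", "p"},
    500: {"a", "f", "c", "e", "l", "p", "m", "n"},
}

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions.values()) / len(transactions)

def confidence(X, Y):
    """P(transaction contains Y | it contains X) = support(X u Y) / support(X)."""
    return support(X | Y) / support(X)

print(support({"a", "c"}), confidence({"a"}, {"c"}))   # 0.6, 1.0  -> a => c (60%, 100%)
print(support({"c", "a"}), confidence({"c"}, {"a"}))   # 0.6, 0.75 -> c => a (60%, 75%)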
Classification based on Association
• Classification rule mining versus association rule mining
• Aim
   – Classification: a small set of rules used as a classifier
   – Association: all rules satisfying minsup and minconf
• Syntax
   – Classification: X → y   (y is a class label)
   – Association: X → Y   (Y is an itemset)
Why & How to Integrate
• Both classification rule mining and association rule mining are indispensable to practical applications.
• The integration is done by focusing on a special subset of association rules whose right-hand side is restricted to the classification class attribute.
   – CARs: class association rules
CBA: Three Steps
• Discretize continuous attributes, if any
• Generate all class association rules (CARs)
• Build a classifier based on the generated CARs.
Our Objectives
• To generate the complete set of CARs that
satisfy the user-specified minimum
support (minsup) and minimum
confidence (minconf) constraints.
• To build a classifier from the CARs.
Rule Generator: Basic Concepts
• Ruleitem
   <condset, y>: condset is a set of items, y is a class label.
   Each ruleitem represents a rule: condset → y
• condsupCount: the number of cases in D that contain condset
• rulesupCount: the number of cases in D that contain condset and are labeled with class y
• Support = (rulesupCount / |D|) * 100%
• Confidence = (rulesupCount / condsupCount) * 100%
RG: Basic Concepts (Cont.)
• Frequent ruleitems
– A ruleitem is frequent if its support is above minsup
• Accurate rule
– A rule is accurate if its confidence is above minconf
• Possible rule
– For all ruleitems that have the same condset, the
ruleitem with the highest confidence is the possible
rule of this set of ruleitems.
• The set of class association rules (CARs) consists
of all the possible rules (PRs) that are both
frequent and accurate.
RG: An Example
• A ruleitem: <{(A,1), (B,1)}, (class,1)>
   – Assume that
      • the support count of the condset (condsupCount) is 3,
      • the support count of this ruleitem (rulesupCount) is 2, and
      • |D| = 10
   – Then for (A,1), (B,1) → (class,1):
      • support = 20%       ( = rulesupCount / |D| * 100% )
      • confidence = 66.7%  ( = rulesupCount / condsupCount * 100% )
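The same arithmetic as a tiny sketch:

condsupCount = 3   # cases containing the condset {(A,1), (B,1)}
rulesupCount = 2   # cases containing the condset AND labeled (class,1)
D_size = 10

support = rulesupCount / D_size * 100            # 20.0 %
confidence = rulesupCount / condsupCount * 100   # 66.66... %
print(round(support, 1), round(confidence, 1))   # 20.0 66.7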
RG: The Algorithm
1   F1 = {large 1-ruleitems};
2   CAR1 = genRules(F1);
3   prCAR1 = pruneRules(CAR1);     // count the item and class occurrences to determine
                                   // the frequent 1-ruleitems and prune
4   for (k = 2; Fk-1 ≠ ∅; k++) do
5       Ck = candidateGen(Fk-1);   // generate the candidate ruleitems Ck
                                   // using the frequent ruleitems Fk-1
6       for each data case d ∈ D do           // scan the database
7           Cd = ruleSubset(Ck, d);            // find all the ruleitems in Ck whose
                                               // condsets are supported by d
8           for each candidate c ∈ Cd do
9               c.condsupCount++;
10              if d.class = c.class then
11                  c.rulesupCount++;          // update the various support counts of the candidates in Ck
12          end
        end
RG: The Algorithm (cont.)
13  Fk = {c ∈ Ck | c.rulesupCount ≥ minsup};   // select the new frequent ruleitems to form Fk
14  CARk = genRules(Fk);                       // select the ruleitems that are both accurate and frequent
15  prCARk = pruneRules(CARk);
16  end
17  CARs = ∪k CARk;
18  prCARs = ∪k prCARk;
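A much-simplified sketch of the counting step (lines 6-11) for 1-ruleitems only, on a made-up dataset D; candidateGen, pruneRules, and the k > 1 levels are omitted, and the attribute/class values are hypothetical.

from collections import defaultdict

# Made-up training cases: (frozenset of (attribute, value) items, class label).
D = [
    (frozenset({("A", 1), ("B", 1)}), 1),
    (frozenset({("A", 1), ("B", 2)}), 1),
    (frozenset({("A", 2), ("B", 1)}), 0),
    (frozenset({("A", 1), ("B", 1)}), 0),
]
minsup, minconf = 0.25, 0.60

condsup = defaultdict(int)   # condsupCount, keyed by condset
rulesup = defaultdict(int)   # rulesupCount, keyed by ruleitem <condset, y>

for case, label in D:                        # "for each data case d in D do"
    for item in case:                        # 1-ruleitems: condsets with a single item
        condset = frozenset({item})
        condsup[condset] += 1                # c.condsupCount++
        rulesup[(condset, label)] += 1       # if d.class = c.class then c.rulesupCount++

# Keep the ruleitems that are both frequent and accurate -> class association rules (CARs)
for (condset, y), rcount in rulesup.items():
    sup = rcount / len(D)
    conf = rcount / condsup[condset]
    if sup >= minsup and conf >= minconf:
        print(set(condset), "->", y, f"(sup={sup:.0%}, conf={conf:.0%})")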