Classification
CS 685: Special Topics in Data Mining
Spring 2009
Jinze Liu
Bayesian Classification: Why?
• Probabilistic learning: Calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
• Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct.
Prior knowledge can be combined with observed data.
• Probabilistic prediction: Predict multiple hypotheses, weighted
by their probabilities
• Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision
making against which other methods can be measured
Bayesian Theorem: Basics
• Let X be a data sample whose class label is unknown
• Let H be a hypothesis that X belongs to class C
• For classification problems, determine P(H|X): the probability that the hypothesis holds given the observed data sample X
• P(H): prior probability of hypothesis H (i.e. the initial
probability before we observe any data, reflects the
background knowledge)
• P(X): probability that sample data is observed
• P(X|H) : probability of observing the sample X, given that
the hypothesis holds
Bayesian Theorem
• Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:
P(H | X) = P(X | H) P(H) / P(X)
• Informally, this can be written as
posterior = likelihood x prior / evidence
• MAP (maximum a posteriori) hypothesis:
h_MAP = argmax_{h ∈ H} P(h | D) = argmax_{h ∈ H} P(D | h) P(h)
• Practical difficulty: requires initial knowledge of many probabilities and has significant computational cost
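
As a quick illustration of Bayes' theorem and the MAP rule above, here is a minimal Python sketch for two competing hypotheses; the priors and likelihoods are made-up numbers, not taken from the slides.

def posterior(prior, likelihood, evidence):
    # P(H | X) = P(X | H) * P(H) / P(X)
    return likelihood * prior / evidence

priors = {"H1": 0.7, "H2": 0.3}        # P(H), assumed values
likelihoods = {"H1": 0.2, "H2": 0.9}   # P(X | H), assumed values

# P(X) = sum over hypotheses of P(X | H) * P(H)
evidence = sum(likelihoods[h] * priors[h] for h in priors)

posteriors = {h: posterior(priors[h], likelihoods[h], evidence) for h in priors}
h_map = max(posteriors, key=posteriors.get)   # the MAP hypothesis
print(posteriors, h_map)                      # H2 wins: posterior ~0.66 vs ~0.34
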
Naïve Bayes Classifier
• A simplified assumption: attributes are conditionally independent given the class:
P(X | Ci) = ∏_{k=1..n} P(xk | Ci)
• The probability of observing, say, two attribute values x1 and x2 together, given the class Ci, is the product of the probabilities of each value taken separately given the same class: P([x1, x2] | Ci) = P(x1 | Ci) * P(x2 | Ci)
• No dependence relation between attributes is modeled
• Greatly reduces the computation cost: only the class distributions need to be counted
• Once the probability P(X|Ci) is known, assign X to the class
with maximum P(X|Ci)*P(Ci)
Training dataset

Class:
  C1: buys_computer = ‘yes’
  C2: buys_computer = ‘no’

Data sample:
  X = (age<=30, income=medium, student=yes, credit_rating=fair)

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes
>40      medium   no        excellent       no
Naïve Bayesian Classifier:
Example
• Compute the class priors P(Ci) and the conditional probabilities P(X|Ci) for each class
P(buys_computer=“yes”) = 9/14 = 0.643
P(buys_computer=“no”) = 5/14 = 0.357
P(age=“<30” | buys_computer=“yes”) = 2/9=0.222
P(age=“<30” | buys_computer=“no”) = 3/5 =0.6
P(income=“medium” | buys_computer=“yes”)= 4/9 =0.444
P(income=“medium” | buys_computer=“no”) = 2/5 = 0.4
P(student=“yes” | buys_computer=“yes”)= 6/9 =0.667
P(student=“yes” | buys_computer=“no”)= 1/5=0.2
P(credit_rating=“fair” | buys_computer=“yes”)=6/9=0.667
P(credit_rating=“fair” | buys_computer=“no”)=2/5=0.4
X = (age<=30, income=medium, student=yes, credit_rating=fair)
P(X|Ci) : P(X|buys_computer=“yes”)= 0.222 x 0.444 x 0.667 x 0.667 =0.044
P(X|buys_computer=“no”)= 0.6 x 0.4 x 0.2 x 0.4 =0.019
P(X|Ci)*P(Ci ) : P(X|buys_computer=“yes”) * P(buys_computer=“yes”)=0.028
P(X|buys_computer=“no”) * P(buys_computer=“no”)=0.007
X belongs to class “buys_computer=yes”
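
The worked example above can be reproduced with a short Python sketch; this is a bare-bones, count-based Naïve Bayes with no smoothing, using the 14 training tuples from the table.

data = [
    ("<=30",  "high",   "no",  "fair",      "no"),
    ("<=30",  "high",   "no",  "excellent", "no"),
    ("31…40", "high",   "no",  "fair",      "yes"),
    (">40",   "medium", "no",  "fair",      "yes"),
    (">40",   "low",    "yes", "fair",      "yes"),
    (">40",   "low",    "yes", "excellent", "no"),
    ("31…40", "low",    "yes", "excellent", "yes"),
    ("<=30",  "medium", "no",  "fair",      "no"),
    ("<=30",  "low",    "yes", "fair",      "yes"),
    (">40",   "medium", "yes", "fair",      "yes"),
    ("<=30",  "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no",  "excellent", "yes"),
    ("31…40", "high",   "yes", "fair",      "yes"),
    (">40",   "medium", "no",  "excellent", "no"),
]

x = ("<=30", "medium", "yes", "fair")   # the unseen sample X

def naive_bayes_scores(x, data):
    scores = {}
    for c in {row[-1] for row in data}:
        rows_c = [row for row in data if row[-1] == c]
        prior = len(rows_c) / len(data)              # P(Ci)
        likelihood = 1.0
        for k, value in enumerate(x):                # product of P(xk | Ci)
            likelihood *= sum(1 for row in rows_c if row[k] == value) / len(rows_c)
        scores[c] = likelihood * prior               # P(X | Ci) * P(Ci)
    return scores

print(naive_bayes_scores(x, data))   # {'yes': ~0.028, 'no': ~0.007} -> predict 'yes'
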
Naïve Bayesian Classifier:
Comments
• Advantages :
– Easy to implement
– Good results obtained in most of the cases
• Disadvantages
– Assumption: class-conditional independence, which can cause a loss of accuracy
– In practice, dependencies exist among variables
– E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
– Dependencies among these cannot be modeled by a Naïve Bayesian classifier
• How to deal with these dependencies?
– Bayesian Belief Networks
Bayesian Networks
• A Bayesian belief network allows a subset of the variables to be conditionally independent
• A graphical model of causal relationships
– Represents dependency among the variables
– Gives a specification of joint probability distribution
(Example graph: nodes X, Y, Z, P with directed links X → Z, Y → Z, Y → P)
Nodes: random variables
Links: dependency
X and Y are the parents of Z, and Y is the parent of P
There is no dependency between Z and P
The graph has no loops or cycles
Bayesian Belief Network: An Example

(Example network over six variables: FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, Dyspnea. FamilyHistory and Smoker are the parents of LungCancer.)

The conditional probability table (CPT) for the variable LungCancer shows the conditional probability for each possible combination of its parents:

        (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
LC        0.8       0.5        0.7        0.1
~LC       0.2       0.5        0.3        0.9

The joint probability of a tuple (z1, …, zn) over the variables Z1, …, Zn factorizes as:

P(z1, …, zn) = ∏_{i=1..n} P(zi | Parents(Zi))
Learning Bayesian Networks
• Several cases
– Given both the network structure and all variables
observable: learn only the CPTs
– Network structure known, some hidden variables:
method of gradient descent, analogous to neural
network learning
– Network structure unknown, all variables observable:
search through the model space to reconstruct graph
topology
– Unknown structure, all hidden variables: no good
algorithms known for this purpose
• D. Heckerman, Bayesian networks for data
mining
Association Rules

Transaction-id   Items bought
100              f, a, c, d, g, i, m, p
200              a, b, c, f, l, m, o
300              b, f, h, j, o
400              b, c, k, s, p
500              a, f, c, e, l, p, m, n

(Venn-diagram illustration: customers who buy beer, customers who buy diapers, and customers who buy both.)

Itemset X = {x1, …, xk}
Find all the rules X → Y with minimum support and confidence
 – support, s, is the probability that a transaction contains X ∪ Y
 – confidence, c, is the conditional probability that a transaction having X also contains Y
Let supmin = 50%, confmin = 50%
Association rules:
 – A → C (60%, 100%)
 – C → A (60%, 75%)
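
A small Python sketch makes the support and confidence computations for A → C and C → A explicit over the five transactions above (items written in lower case).

transactions = [
    {"f", "a", "c", "d", "g", "i", "m", "p"},
    {"a", "b", "c", "f", "l", "m", "o"},
    {"b", "f", "h", "j", "o"},
    {"b", "c", "k", "s", "p"},
    {"a", "f", "c", "e", "l", "p", "m", "n"},
]

def support(itemset):
    # fraction of transactions containing every item of `itemset`
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs):
    # support(lhs ∪ rhs) / support(lhs)
    return support(lhs | rhs) / support(lhs)

print(support({"a", "c"}))           # 0.6  -> a → c has 60% support
print(confidence({"a"}, {"c"}))      # 1.0  -> a → c has 100% confidence
print(confidence({"c"}, {"a"}))      # 0.75 -> c → a has 75% confidence
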
Classification based on
Association
• Classification rule mining versus association rule mining
• Aim
– Classification rule mining: a small set of rules to use as a classifier
– Association rule mining: all rules satisfying minsup and minconf
• Syntax
– Classification rule: X → y, where y is a class label
– Association rule: X → Y
Why & How to Integrate
• Both classification rule mining and
association rule mining are indispensable to
practical applications.
• The integration is done by focusing on a special subset of association rules whose right-hand side is restricted to the classification class attribute.
– CARs: class association rules
CBA: Three Steps
• Discretize continuous attributes, if any
• Generate all class association rules (CARs)
• Build a classifier based on the generated CARs.
Our Objectives
• To generate the complete set of CARs that
satisfy the user-specified minimum
support (minsup) and minimum
confidence (minconf) constraints.
• To build a classifier from the CARs.
Rule Generator: Basic Concepts
• Ruleitem
– <condset, y>: condset is a set of items, y is a class label
– Each ruleitem represents a rule: condset → y
• condsupCount
– The number of cases in D that contain condset
• rulesupCount
– The number of cases in D that contain the condset and are labeled with class y
• support = (rulesupCount / |D|) * 100%
• confidence = (rulesupCount / condsupCount) * 100%
RG: Basic Concepts (Cont.)
• Frequent ruleitems
– A ruleitem is frequent if its support is above minsup
• Accurate rule
– A rule is accurate if its confidence is above minconf
• Possible rule
– For all ruleitems that have the same condset, the
ruleitem with the highest confidence is the possible
rule of this set of ruleitems.
• The set of class association rules (CARs) consists
of all the possible rules (PRs) that are both
frequent and accurate.
RG: An Example
• A ruleitem:<{(A,1),(B,1)},(class,1)>
– assume that
• the support count of the condset (condsupCount) is 3,
• the support count of this ruleitem (rulesupCount) is 2, and
• |D| = 10
– then the rule (A,1), (B,1) → (class,1) has
• support = (rulesupCount / |D|) * 100% = 20%
• confidence = (rulesupCount / condsupCount) * 100% = 66.7%
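
The same arithmetic in Python, with the counts assumed on this slide:

cond_sup_count = 3     # cases in D that contain the condset {(A,1), (B,1)}
rule_sup_count = 2     # cases that contain the condset and are labeled (class,1)
d_size = 10            # |D|

support = rule_sup_count / d_size * 100              # 20.0 %
confidence = rule_sup_count / cond_sup_count * 100   # 66.7 %
print(support, round(confidence, 1))
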
RG: The Algorithm
1  F1 = {large 1-ruleitems};
2  CAR1 = genRules(F1);
3  prCAR1 = pruneRules(CAR1);          // count the item and class occurrences to determine the frequent 1-ruleitems and prune them
4  for (k = 2; Fk-1 ≠ Ø; k++) do
5      Ck = candidateGen(Fk-1);        // generate the candidate ruleitems Ck using the frequent ruleitems Fk-1
6      for each data case d ∈ D do     // scan the database
7          Cd = ruleSubset(Ck, d);     // find all the ruleitems in Ck whose condsets are supported by d
8          for each candidate c ∈ Cd do
9              c.condsupCount++;
10             if d.class = c.class then c.rulesupCount++;   // update the various support counts of the candidates in Ck
11         end
12     end
RG: The Algorithm (cont.)
13     Fk = {c ∈ Ck | c.rulesupCount ≥ minsup};   // select the new frequent ruleitems to form Fk
14     CARk = genRules(Fk);                       // select the ruleitems that are both accurate and frequent
15     prCARk = pruneRules(CARk);
16 end
17 CARs = ∪k CARk;
18 prCARs = ∪k prCARk;
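
For illustration, here is a simplified, in-memory Python sketch of the rule generator above. It follows the Apriori-style structure of lines 1-18 but omits pruneRules and the possible-rule selection inside genRules, and uses a naive candidateGen join, so it is a sketch of the counting logic rather than a faithful reimplementation of CBA.

def rg(data, minsup, minconf):
    # data: list of (dict of attribute -> value, class_label) pairs
    n = len(data)
    classes = {y for _, y in data}
    cars = []  # collected rules: (condset, class, support, confidence)

    def counts(condset, y):
        cond_c = sum(1 for case, _ in data
                     if all(case.get(a) == v for a, v in condset))
        rule_c = sum(1 for case, label in data if label == y
                     and all(case.get(a) == v for a, v in condset))
        return cond_c, rule_c

    # k = 1: every single (attribute, value) pair is a candidate condset
    candidates = {frozenset([(a, v)]) for case, _ in data for a, v in case.items()}
    k = 1
    while candidates:
        frequent = set()
        for condset in candidates:
            for y in classes:
                cond_c, rule_c = counts(condset, y)
                if rule_c / n >= minsup:              # frequent ruleitem
                    frequent.add(condset)
                    if rule_c / cond_c >= minconf:    # accurate ruleitem -> CAR
                        cars.append((condset, y, rule_c / n, rule_c / cond_c))
        k += 1
        # naive candidateGen: join the frequent (k-1)-condsets into k-condsets
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    return cars

For example, rg(cases, 0.2, 0.5) returns all ruleitems whose support is at least 20% and whose confidence is at least 50%.
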
SVM – Support Vector Machines
(Figure: two linear separators for the same data, one with a small margin and one with a large margin; the support vectors are the training points that lie on the margin.)
SVM – Cont.
• Linear Support Vector Machine
Given a set of points xi ∈ R^n with labels yi ∈ {-1, 1},
the SVM finds a hyperplane defined by the pair (w, b), where w is the normal to the plane and b is the offset from the origin, s.t.
yi (xi · w + b) ≥ 1,   i = 1, …, n
x – feature vector, b – bias, y – class label, 2/||w|| – margin
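
As a concrete illustration of fitting such a hyperplane, here is a short sketch using scikit-learn (an assumption: the slides do not prescribe any particular library); the toy points are made up and the labels are in {-1, +1} as in the formulation above.

import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [2.5, 2.0],           # class +1
              [-1.0, -1.0], [-2.0, -1.5], [-2.0, -2.5]])    # class -1
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e3)    # a large C approximates the hard-margin SVM
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("margin 2/||w|| =", 2 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)
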
SVM – Cont.
• What if the data is not linearly separable?
• Project the data into a high-dimensional space where it is linearly separable, and then use a linear SVM there (via kernels)
(Figure: a one-dimensional data set with points at -1, 0, +1 whose classes cannot be separated linearly, shown alongside its projection into two dimensions, with points such as (0,0), (0,1), and (1,0), where a linear separator exists.)
Non-Linear SVM
Classification using an SVM (w, b): the class of xi is decided by testing whether
xi · w + b ≥ 0
In the non-linear case this becomes a test on
K(xi, w) + b ≥ 0
Kernel – can be thought of as computing a dot product in some high-dimensional space
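
Continuing the scikit-learn sketch (again an assumption, not part of the slides), an RBF-kernel SVM handles the XOR pattern, which no linear separator can classify; the data and parameters are made up.

import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])                  # XOR-style labels

clf = SVC(kernel="rbf", gamma=2.0, C=10.0)    # the kernel does the implicit high-dimensional mapping
clf.fit(X, y)
print(clf.predict(X))                         # should recover [-1, 1, 1, -1]
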
SVM Related Links
• http://svm.dcs.rhbnc.ac.uk/
• http://www.kernel-machines.org/
• C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2), 1998.
• SVMlight – software (in C): http://ais.gmd.de/~thorsten/svm_light
• Book: An Introduction to Support Vector Machines. N. Cristianini and J. Shawe-Taylor, Cambridge University Press.