Conventional Data Mining Techniques I


Conventional Data Mining Techniques I
A B M Shawkat Ali
PowerPoint permissions
Cengage Learning Australia hereby permits the usage and posting of our copyright controlled PowerPoint slide content for all
courses wherein the associated text has been adopted. PowerPoint slides may be placed on course management systems that operate
under a controlled environment (accessed restricted to enrolled students, instructors and content administrators). Cengage Learning
Australia does not require a copyright clearance form for the usage of PowerPoint slides as outlined above.
Copyright © 2007 Cengage Learning Australia Pty Limited
My Request
“A good listener is not only popular everywhere,
but after a while he gets to know something”
- Wilson Mizner
Objectives
On completion of this lecture you should know:
• What a classification task is.
• How the Decision Tree algorithm works.
• Experimental testing with Decision Tree.
• How the Naive Bayes algorithm works.
• Naive Bayes demonstration with Weka.
Characteristics of classification
• Classifying attributes: the group of variables that defines the class of an instance or record (the class labels).
• Predicting attributes: the group of variables used to determine (predict) the class of an instance.
• Supervised learning: the classifying attributes are the dependent variables and are necessarily categorical, whereas the predicting attributes are the independent variables.
N.B. The term 'group' also includes a singleton.
Three approaches
• Divide the domain into distinct regions for each
class. An instance contains enough information
to belong to a class.
• Find the probability that an instance falls into a
given class.
• Find the probability that a class contains a given
instance.
Supervised learning
• One or more dependent variables and one or
more independent variables. Classes are predefined.
• To develop a learner model, part of the known data is usually used for training and the rest for testing. Often 70% of the data is used as training data.
Decision tree
• A tree structure where non-terminal nodes
represent tests on one or more attributes and
terminal nodes reflect decision outcomes.
A decision tree model for Weather dataset
Categorising continuous attributes
• Split should be made where the error is minimum:
Temperature:   15    18    22    28    34    40
Play Golf:     No    No    Yes   Yes   Yes   No
• There are two possible splits between a Yes and a No – one at 20 [(18+22)/2] and another at 37 [(34+40)/2]. Splitting at 20 misclassifies one instance, while splitting at 37 misclassifies two, so the split is made at 20.
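The split-point selection described above can be checked with a few lines of code. The following is a minimal Python sketch (my own, not from the text) that evaluates each candidate midpoint between class changes and counts the misclassifications it would cause:

temps  = [15, 18, 22, 28, 34, 40]
labels = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No']

def majority_errors(group):
    """Number of instances not belonging to the majority class of the group."""
    majority = max(set(group), key=group.count)
    return sum(1 for g in group if g != majority)

def split_errors(values, classes):
    """Return (threshold, errors) for each midpoint where the class label changes."""
    results = []
    for i in range(len(values) - 1):
        if classes[i] != classes[i + 1]:              # a Yes/No boundary
            threshold = (values[i] + values[i + 1]) / 2
            errors = majority_errors(classes[:i + 1]) + majority_errors(classes[i + 1:])
            results.append((threshold, errors))
    return results

print(split_errors(temps, labels))   # [(20.0, 1), (37.0, 2)] -> split at 20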
Entropy measure
Suppose a message looks like this:
DADAABCDABCCADBB….
You wish to transmit it as a binary code:
A=00, B=01, C=10, D=11
You need 2 bits per letter. Suppose you learn that
A and B are more frequent and the probabilities
are:
P(A)=1/2, P(B)=1/4, P(C)=1/8, and P(D)=1/8
If you adopt the code:
A=0, B=10, C=110, and D=111
You can transmit the message by using 1.75 bits
per letter on average.
This fact is contained in the entropy measure:
H(X) = -p_1 \log_2 p_1 - p_2 \log_2 p_2 - \cdots

For the first scenario:

H(X) = -\frac{1}{4}\log_2\frac{1}{4} - \frac{1}{4}\log_2\frac{1}{4} - \frac{1}{4}\log_2\frac{1}{4} - \frac{1}{4}\log_2\frac{1}{4} = 2
For the second scenario:

H(X) = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{4}\log_2\frac{1}{4} - \frac{1}{8}\log_2\frac{1}{8} - \frac{1}{8}\log_2\frac{1}{8} = 1.75
Generally, if there are n possibilities with p_i the probability of event i, the entropy H(X) is given by:

H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i

The Greek letter sigma (Σ) simply indicates the summation of all the values. The above equation can be extended to conditional probability. For instance, what is the entropy of attribute Y when we already know the outcome of attribute X? Involving two attributes X and Y, we have:

H(Y \mid X) = -\sum_{X} p_X \sum_{Y} p_{Y|X} \log_2 p_{Y|X}
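As a small illustration (not from the text), the two entropy values above can be checked with a short Python function implementing H(X) = -Σ p_i log2 p_i:

import math

def entropy(probs):
    """H(X) = -sum p_i log2 p_i, skipping zero probabilities (0 log 0 = 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1/4, 1/4, 1/4, 1/4]))   # 2.0  bits per letter (first scenario)
print(entropy([1/2, 1/4, 1/8, 1/8]))   # 1.75 bits per letter (second scenario)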
Body image data

Name      Weight    Height    Food Habit       Fitness Program   Concern
Nancy     average   short     vegetarian       yes               satisfied (negative)
Karen     average   short     no restriction   no                dissatisfied (positive)
Wendy     heavy     average   no red meat      no                dissatisfied
Melissa   heavy     tall      vegetarian       no                satisfied
Doiya     heavy     average   vegetarian       no                satisfied
Kate      light     short     no restriction   yes               satisfied
Judy      light     average   no restriction   no                dissatisfied
Ruma      average   tall      no restriction   yes               satisfied
For attribute Food Habit the three branches contain: b1 (no restriction) 2 positive, 2 negative; b2 (no red meat) 1 positive; b3 (vegetarian) 3 negative.

Sample average entropy calculation for the attribute 'Food Habit':

\sum_b \frac{n_b}{n_t}\left(-\sum_c \frac{n_{bc}}{n_b}\log_2\frac{n_{bc}}{n_b}\right)

= \frac{4}{8}\left(-\frac{2}{4}\log_2\frac{2}{4} - \frac{2}{4}\log_2\frac{2}{4}\right) + \frac{1}{8}\left(-\frac{1}{1}\log_2\frac{1}{1} - \frac{0}{1}\log_2\frac{0}{1}\right) + \frac{3}{8}\left(-\frac{0}{3}\log_2\frac{0}{3} - \frac{3}{3}\log_2\frac{3}{3}\right)

= \frac{4}{8}(0.5 + 0.5) + 0 + 0 \qquad \text{(taking } 0\log_2 0 = 0\text{)}

= 0.50

Average Entropy for Food Habit = 0.50
For attribute Height the three branches contain: b1 (short) 1 positive, 2 negative; b2 (average) 2 positive, 1 negative; b3 (tall) 2 negative.

Sample average entropy calculation for the attribute 'Height':

= \frac{3}{8}\left(-\frac{1}{3}\log_2\frac{1}{3} - \frac{2}{3}\log_2\frac{2}{3}\right) + \frac{3}{8}\left(-\frac{2}{3}\log_2\frac{2}{3} - \frac{1}{3}\log_2\frac{1}{3}\right) + \frac{2}{8}\left(-\frac{0}{2}\log_2\frac{0}{2} - \frac{2}{2}\log_2\frac{2}{2}\right)

= \frac{3}{8}(0.528 + 0.390) + \frac{3}{8}(0.390 + 0.528) + 0

= 0.69

Average Entropy for Height = 0.69
For attribute Weight the three branches contain: b1 (light) 1 positive, 1 negative; b2 (average) 1 positive, 2 negative; b3 (heavy) 1 positive, 2 negative.

Sample average entropy calculation for the attribute 'Weight':

= \frac{2}{8}\left(-\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2}\right) + \frac{3}{8}\left(-\frac{1}{3}\log_2\frac{1}{3} - \frac{2}{3}\log_2\frac{2}{3}\right) + \frac{3}{8}\left(-\frac{1}{3}\log_2\frac{1}{3} - \frac{2}{3}\log_2\frac{2}{3}\right)

= \frac{2}{8}(0.5 + 0.5) + \frac{3}{8}(0.528 + 0.390) + \frac{3}{8}(0.528 + 0.390)

= 0.94

Average Entropy for Weight = 0.94
For attribute Fitness Program the two branches contain: b1 (no) 3 positive, 2 negative; b2 (yes) 3 negative.

Sample average entropy calculation for the attribute 'Fitness Program':

= \frac{5}{8}\left(-\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5}\right) + \frac{3}{8}\left(-\frac{0}{3}\log_2\frac{0}{3} - \frac{3}{3}\log_2\frac{3}{3}\right)

= \frac{5}{8}(0.442 + 0.529) + 0

= 0.61

Average Entropy for Fitness Program = 0.61
Attribute           Average Entropy
Food habit          0.50
Height              0.69
Weight              0.94
Fitness program     0.61

Food Habit has the lowest average entropy, so it is selected as the root node of the decision tree.
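For illustration, here is a minimal Python sketch (my own, not from the text) that recomputes the average entropy of each attribute from the body image data above; it reproduces the four values in the table:

import math
from collections import Counter, defaultdict

rows = [  # (Weight, Height, Food Habit, Fitness Program, Concern)
    ('average', 'short',   'vegetarian',     'yes', 'satisfied'),
    ('average', 'short',   'no restriction', 'no',  'dissatisfied'),
    ('heavy',   'average', 'no red meat',    'no',  'dissatisfied'),
    ('heavy',   'tall',    'vegetarian',     'no',  'satisfied'),
    ('heavy',   'average', 'vegetarian',     'no',  'satisfied'),
    ('light',   'short',   'no restriction', 'yes', 'satisfied'),
    ('light',   'average', 'no restriction', 'no',  'dissatisfied'),
    ('average', 'tall',    'no restriction', 'yes', 'satisfied'),
]
attributes = ['Weight', 'Height', 'Food Habit', 'Fitness Program']

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def average_entropy(col):
    branches = defaultdict(list)
    for r in rows:
        branches[r[col]].append(r[-1])            # class labels per branch
    n = len(rows)
    return sum(len(labels) / n * entropy(Counter(labels).values())
               for labels in branches.values())

for i, name in enumerate(attributes):
    print(f'{name}: {average_entropy(i):.2f}')
# Weight: 0.94, Height: 0.69, Food Habit: 0.50, Fitness Program: 0.61
# -> Food Habit (lowest average entropy) becomes the root of the tree.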
Decision tree for the body image dataset

[Figure: a decision tree for the body image dataset, with Food Habit at the root (branches: vegetarian, no red meat, no restriction), Fitness Program tests beneath, and leaves labelled Satisfied or Dissatisfied.]
Avoiding over-fitted trees
• Stop growing the tree if perfect classification requires too much branching.
• Allow the tree to over-fit and then post-prune.
– Reduced error pruning
– Rule post-pruning
Decision tree software
• Quinlan’s ID3 (1986) captures the main idea of
building decision trees. Thereafter: C4.5 (Weka’s
version is J48), C5.0 (Windows version See5).
• CART (always performs binary splits – two
branches.)
• CHAID (Found in statistical packages like SAS
or SPSS, but works only with categorical data.)
Decision tree by WEKA
Figure 4.8 C4.5 (Weka name is J48) classifier selection from Weka
environment
J48 (C4.5) classifier for CRX data
Figure 4.10 J48 classifier performance for credit scoring data
Visualise decision tree
Figure 4.11 Visualise Decision Tree selection from Weka
A decision tree model for
weather data
Figure 4.12 A decision tree model for Weather data
Tree representation in Weka
attribute9 = t
| attribute10 = t: 1 (228.0/21.0)
| attribute10 = f
• Each line represents a node in the tree. The attribute without any vertical bar '|' to its left is the root node. The second and third lines, which start with a '|', are child nodes of the first line. In general, a node with one or more '|' characters before the rule is a child of the node immediately preceding it that has one less '|' character.
• A leaf node is identified by a colon (:). The class is identified by the number, in our case 1 or 2, immediately following the colon.
• Nodes that generate a classification, such as | attribute10 = t: 1 (228.0/21.0), are followed by two numbers (sometimes one) in parentheses. The first number tells how many instances in the training set reach this node, in this case 228.0. The second number, 21.0, is the number of those instances that the node classifies incorrectly.
Cross-validation
Figure 4.9 10 fold cross validation procedure.
Applications of decision tree
• Traffic Risk Analysis
• Remote Sensing
• Medical Application
• Speech Recognition
• Character Recognition
• And many more
Strengths of decision trees
• Easy to understand.
• Map nicely to a set of production rules.
• Applied to real-world problems.
• Make no prior assumptions about the data.
• Able to process both numerical and categorical data.
• Comparatively faster than neural networks.
Weaknesses of decision trees
• Output attribute must be categorical.
• Limited to one output attribute.
• Decision tree algorithms are unstable.
• Trees created from numeric datasets can be complex.
Naive Bayes classifier
P(C \mid A) = \frac{P(A \mid C)\,P(C)}{P(A)}
where C is the hypothesis to be tested
A is the evidence associated with C
[In simple words, if we know how much of A can
happen when C happens, the formula tells us how
much of C can happen when A happens.]
(cont.)
• P(C) is the prior probability or marginal
probability of C. It is "prior" in the sense that it
has not yet accounted for the information
available in A.
• P(C|A) is the conditional probability of C,
given A. It is also called the posterior
probability because it has already incorporated
the outcome of event A.
• P(A|C) is the conditional probability of A given C.
• P(A) is the prior or marginal probability of A, which acts as the evidence (a normalising constant).
Bayes classifier: An example
[Called naive because all attributes are given
equal weights and assumed independent]
Data for Bayes classifier

Finance/Investment   Travel/Tour   Reading/Magazine   Health/Diet   Sex
promotion            promotion     promotion          promotion
Yes                  No            Yes                No            Male
Yes                  Yes           No                 No            Male
No                   Yes           Yes                Yes           Female
No                   Yes           No                 Yes           Male
Yes                  Yes           Yes                Yes           Female
No                   No            Yes                No            Female
Yes                  No            No                 No            Male
Yes                  Yes           No                 No            Male
No                   No            No                 Yes           Female
Yes                  No            No                 No            Male
The instance to be classified
Finance/Investment promotion = No
Travel/Tour promotion = Yes
Reading/Magazine promotion = Yes
Health/Diet promotion = No
Sex = ?
Counts and probabilities for attribute sex

                   Finance/Investment   Travel/Tour      Reading/Magazine   Health/Diet
                   promotion            promotion        promotion          promotion
                   Male     Female      Male    Female   Male     Female    Male    Female
Yes                5        1           3       2        1        3         1       3
No                 1        3           3       2        5        1         5       1
Ratio: yes/total   5/6      1/4         3/6     2/4      1/6      3/4       1/6     3/4
Ratio: no/total    1/6      3/4         3/6     2/4      5/6      1/4       5/6     1/4
Computing the probability for sex = male

P(sex = male \mid A) = \frac{P(A \mid sex = male)\,P(sex = male)}{P(A)}
[The denominator is actually the sum of all
possible values of the numerator. Thus in this
case, P(A) = P(A|sex=male)x P(sex=male) +
P(A|sex=female)xP(sex=female).]
Conditional probabilities for
sex = male
P(Finance/Investment = No | Sex = male) = 1/6
P(Travel/Tour = Yes | Sex = male) = 3/6
P(Reading/Magazine = Yes | Sex = male) = 1/6
P(Health/Diet = No | Sex = male) = 5/6
The probability for sex = male given evidence A:
P(sex = male | A) ≈ 0.006944 / P(A)

The probability for sex = female given evidence A:
P(sex = female | A) ≈ 0.028125 / P(A)

Since 0.028125 > 0.006944, the instance is classified as Sex = Female.
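A short Python sketch (not from the text) of the same calculation, using the conditional probabilities read from the counts table; dividing by the evidence P(A) gives the normalised posteriors:

p_male, p_female = 6/10, 4/10

# P(attribute value | sex) for the instance:
# Finance = No, Travel = Yes, Reading = Yes, Health = No
likelihood_male   = (1/6) * (3/6) * (1/6) * (5/6)
likelihood_female = (3/4) * (2/4) * (3/4) * (1/4)

num_male   = likelihood_male   * p_male      # ~0.006944
num_female = likelihood_female * p_female    # ~0.028125
evidence   = num_male + num_female           # P(A)

print(num_male / evidence, num_female / evidence)   # ~0.198 vs ~0.802 -> female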
Zero-valued attribute counts
Since probabilities get multiplied we cannot have
a zero value because that will eliminate
everything. When 0 is a possibility, each ratio of
the form n/d should be replaced by:
\frac{n + kp}{d + k}

where k is a value between 0 and 1 (usually 1) and p is an equal fractional part of the total number of possible values for the attribute.
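A minimal sketch of this correction in Python (assuming a binary attribute, so p = 1/2):

def smoothed_ratio(n, d, k=1.0, p=0.5):
    """Replace n/d by (n + k*p)/(d + k) so a zero count never zeroes
    the whole product; p = 1 / (number of possible attribute values)."""
    return (n + k * p) / (d + k)

print(smoothed_ratio(0, 6))   # ~0.071 instead of 0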
Missing data
With Bayes classifier, missing data
items are ignored.
Numeric data
f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x - \mu)^2 / (2\sigma^2)}

where
e = the exponential function
\mu = the class mean for the given numerical attribute
\sigma = the class standard deviation for the attribute
x = the attribute value
[The formula above gives the probability density
function for a continuous attribute such as weight
or age. The probability for say age 45 is
estimated as: P(age = 45) = Prob (age ≤ 45.5) –
Prob (age ≤ 44.5). These probabilities can be
found in a normal distribution table.]
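A small Python sketch of the density formula (my own). The class mean 40 and standard deviation 10 below are hypothetical values, and the unit-width rectangle is a rough stand-in for Prob(age ≤ 45.5) - Prob(age ≤ 44.5):

import math

def gaussian_density(x, mu, sigma):
    """f(x) = 1/(sqrt(2*pi)*sigma) * exp(-(x - mu)**2 / (2*sigma**2))"""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Approximate P(age = 45) for a hypothetical class with mean 40 and sd 10
# by the density times a unit interval:
print(gaussian_density(45, 40, 10) * 1.0)   # ~0.0352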
Naive Bayes’ algorithm
Probability-based solver
Bayes’ theorem
Relates the conditional and marginal probabilities
of stochastic events A and C
P(C \mid A) = \frac{P(A \mid C)\,P(C)}{P(A)} \propto L(A \mid C)\,P(C)
where P stands for the probability of the variable
within parenthesis, and L(A|C) is referred to as
likelihood of A given fixed C.
Bayes' theorem with many attributes

P(C \mid A_1, \ldots, A_n) = \frac{P(A_1, \ldots, A_n \mid C)\,P(C)}{P(A_1, \ldots, A_n)}

where n is the number of attributes in P(C \mid A_1, \ldots, A_n).

Assuming that the attributes are independent (hence the name 'naive') and ignoring the denominator, we get:

P(C \mid A_1, \ldots, A_n) \propto P(A_1 \mid C) \times P(A_2 \mid C) \times \ldots \times P(A_n \mid C) \times P(C)

or

P(C \mid A_1, \ldots, A_n) \propto \prod_{i=1}^{n} P(A_i \mid C)\,P(C)
Example of Naive Bayes'

Figure 4.13 A simple example for Naive Bayes classification
Figure 4.14 New instance identification using Naive Bayes methodology
Probability measure

P(New Case | Crescent shape) = No. of Crescent shapes in the vicinity / Total no. of Crescent shapes = 1/40

P(New Case | Oval shape) = No. of Oval shapes in the vicinity / Total no. of Oval shapes = 3/20

Posterior probability of the new case being Crescent-shaped = Prior prob. of Crescent shape × Likelihood of the new case being Crescent = (40/60) × (1/40) = 1/60

Posterior probability of the new case being Oval-shaped = Prior prob. of Oval shape × Likelihood of the new case being Oval = (20/60) × (3/20) = 3/60 = 1/20

Since 1/20 > 1/60, the new case is classified as Oval-shaped.
Naive Bayes’ through WEKA
Figure 4.15 Naive Bayes classifier selection from Weka
Figure 4.16 Naïve Bayes classifier performance with US voting data
Applications of Naive Bayes
• Parameter Estimation
• Document Classification
• Systems Performance Management
• Medical Diagnosis
• And many more
Naive Bayes’ algorithm
• Strengths: simplicity, efficiency, convenience, etc.
• Weaknesses: the restrictive independence assumption; accuracy can suffer when attributes are strongly correlated.
Recap
• Classification task in Data Mining
• Decision Tree working principle
• Naive Bayes working principle
• Applications of Decision Tree and Naive Bayes
• Data splitting method in Data Mining
Artificial Neural Networks
&
Support Vector Machines
Objectives
On completion of this lecture you should know:
• What an artificial neural network is
• Applications and limitations of neural networks
• What support vector machines are
• Applications and limitations of support vector machines
Artificial neural networks
An information processing conceptual model.
Three broad types of models:
• Analytical (Physics-based, derived from
fundamental principles, also called white box
model.)
• Conceptual (Introduces some artificial elements
to mimic the working of the system, also known
as grey-box model.)
• Black box (Only relates input with output
without any consideration of the internal working
of the system.)
An ANN example

Figure 5.1 A basic Neural Network process for optical character recognition: an image of the word 'mango' is divided into sub-images ('m', 'a', 'n', 'g', 'o'), each sub-image is converted to its ASCII code, and the codes are fed to a neural network trained to recognise characters, which outputs 'mango'.
Biological neuron

Figure 5.2 Sketch of a biological neuron (nucleus, cell body (soma), dendrites, axon and presynaptic terminals)
Usefulness
• Where we cannot formulate an algorithmic
solution.
• Where we can get lots of examples of the
behaviour we require but too few perfect ones.
• Where we need to extract patterns or trends
from data which is too complex or imprecise to
be noticed by humans or other computer
algorithms.
Format
• Mimics human brain consisting of interconnected
neurons.
• One input layer, one or two hidden layer(s), and
one output layer. Two hidden layers are usually
enough to explain all the patterns. Each layer can
have one or more neurons.
• Each neuron is connected to all the neurons of the preceding layer and the following layer for a fully-connected network, but not with other neurons in the same layer.
• In feed-forward neural networks, information
moves only in one direction (from input towards
output i.e. there is no loop to carry information
backwards).
Processing
• Each neuron in the input layer reads the value of
an attribute within the range [0,1] and passes to
the next layer unchanged.
• Each neuron in the next layer receives
information from all the neurons of the previous
layer with each value multiplied by a weight and
then summed. Weights can be negative and do
not have to add to 1. Weights can initially be
selected arbitrarily and adjusted gradually during
model training process.
• The weighted sum may be passed on to the next layer of neurons without modification, or it can be modified with a simple threshold function (i.e. pass on 0 if the sum is less than a certain value, or 1 if it is greater). The most popular modification is the sigmoid function (described later).
• During training of the model, the model output is
compared with the observed values and the
weights adjusted till a satisfactory matching
between the two is obtained.
Feed-forward neural networks

Figure 5.3 A fully connected feed-forward neural network: Inputs 1-3 enter Nodes 1-3 (input layer), which connect to hidden Nodes 4 and 5 through weights W14, W24, W34, W15, W25 and W35; Nodes 4 and 5 connect to output Node 6 through weights W46 and W56.
Genetic learning
Often used to get a good starting set of initial
weights for a neural network model:
• Genetic algorithms are based on
a biological metaphor
• Suitable for both supervised and
unsupervised data mining.
Genetic learning operators
• Crossover: First Step – forms new elements
from population. Parent population create
offspring.
• Mutation: Second step – some offspring
undergoes random changes in their
characteristics.
• Selection: Third step – all candidates are evaluated against a fitness function. Only the fittest candidates, up to the initial population size, are kept and the rest are discarded. The final selection is based on 'survival of the fittest.'
An initial population of weights

Table 5.1 Population of connection weights for genetic learning

Population Element   W1    W2    W3    W4    W5    W6    W7    W8    W9
Parent 1             0.1   0.1   0.1   0.1   0.1   0.1   0.1   0.1   0.1
Parent 2             0.9   0.9   0.9   0.9   0.9   0.9   0.9   0.9   0.9
Child 1              0.1   0.1   0.9   0.9   0.1   0.1   0.9   0.9   0.1
Child 2              0.1   0.9   0.9   0.9   0.1   0.1   0.1   0.1   0.9
Child 3              0.2   0.9   0.9   0.9   0.1   0.1   0.1   0.1   0.9
The Sigmoid function
f(x) = \frac{1}{1 + e^{-kx}} \qquad \text{(Eq 5.2)}
where k = 1 is popular, and the shape is given on
the next page, and e is the base of the natural
logarithms approximated by 2.718281…
The diagram below is for k =1. Try drawing it for k =
0.5 and 10.
[Plot of f(x) for k = 1: x runs from -6 to 6 and f(x) rises from 0 to 1 in an S-shaped curve.]
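A short Python sketch (my own) for exploring the curve with different k values, as the slide suggests; smaller k flattens the curve and larger k sharpens it:

import math

def sigmoid(x, k=1.0):
    """f(x) = 1 / (1 + e^(-k*x))  (Equation 5.2)."""
    return 1.0 / (1.0 + math.exp(-k * x))

for k in (0.5, 1, 10):
    print(k, [round(sigmoid(x, k), 3) for x in range(-6, 7, 2)])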
XOR problem solved by NN
Table 5.2 Boolean XOR problem
XOR
Input 1   Input 2   Output
0         0         0
0         1         1
1         0         1
1         1         0
Figure 5.5 Training a neural network for the XOR problem: inputs Node 1 and Node 2 connect directly to output Node 3 with weights W1 = 1 and W2 = 1.
The learning rule for the output is: output = W1(Input 1) + W2(Input 2)
Figure 5.6 Complete neural network for the XOR problem: Input 1 (Node 1) feeds hidden Nodes 3 and 4, and Input 2 (Node 2) feeds hidden Nodes 4 and 5, each connection with weight 1; hidden Nodes 3, 4 and 5 connect to output Node 6 with weights -1, 2 and -1 respectively.
The activation of nodes is by threshold function
• We can calculate the output by following a
two-step process:
• Step 1: The input values in the hidden layer
nodes are obtained from the following
relationship (We note that the network is not fully
connected).
Node 3 = (1)(Input 1)
Node 4 = (1)(Input 1) + (1)(Input 2)
Node 5 = (1)(Input 2)
• The output from each hidden layer node will
be either 0 or 1.
(cont.)
• Step 2: The neural network output is obtained by summing the weighted hidden-layer values:
Output = (-1)(Node 3) + (2)(Node 4) + (-1)(Node 5)
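The two-step calculation can be checked with a small Python sketch (my own). The threshold value of 1 for the hidden nodes is an assumption consistent with the 0/1 hidden outputs described above, and the hidden-to-output weights are those of Figure 5.6:

def step(x, threshold=1):
    """Threshold activation: fire (1) if the weighted sum reaches the threshold."""
    return 1 if x >= threshold else 0

def xor_net(in1, in2):
    node3 = step(1 * in1)                 # responds to Input 1
    node4 = step(1 * in1 + 1 * in2)       # responds to either input (OR-like)
    node5 = step(1 * in2)                 # responds to Input 2
    return -1 * node3 + 2 * node4 - 1 * node5   # hidden-to-output weights

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor_net(a, b))   # reproduces the XOR truth table: 0, 1, 1, 0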
Backpropagation networks
Please recall Figure 5.3
(cont.)
Therefore, input to Node 4 = 1(1.2) + 0.5(0.7) +
0.1(0.9) = 1.64
Output from Node 4 = 0.8375 [using Equation (5.2)]
Input to Node 5 = 1(-0.4) + 0.5(0.6) + 0.1(1.1) = 0.01
Output from Node 5 = 0.5025
Input to Node 6 = 0.8375(0.8) + 0.5025(0.5) = 0.9213
Output from Node 6 = 0.7153
Suppose that the actual observed output is 0.6,
therefore the error is (0.7153 – 0.6) = 0.1153.
(cont.)
The error at output Node 6 is computed as Output(1 - Output)(Target - Output) = 0.7153(1 - 0.7153)(0.6 - 0.7153) ≈ -0.023, so an error magnitude of 0.023 is propagated back to the hidden layer.
Which really means that the computed errors
are summed across all output nodes. Since we
have only one output node:
Error (W14) = Error (W24) = Error (W34)
= 0.023(0.8)0.8375(1-0.8375) = 0.00250
Error(W15) = Error(W25) = Error(W35)
=0.023(0.5)0.5025(1-0.5025) = 0.00287
(cont.)
The final step is the weight adjustments made
using delta rule developed by Widrow and Hoff
(Widrow and Lehr, 1995).
The rule is:
Wjk (new) = Wjk (old) + ΔWjk
ΔWjk = (r)[Error(Wjk)](Ok)
where r is learning rate which lies between 0 and 1
– higher value
means faster convergence but can lead to
instability.
With 0.3 as the learning rate, we get:
ΔW46 = ΔW56 = 0.3(0.023)(0.7153) = 0.00494
The updated value of W46 = 0.8 + 0.00494 = 0.80494
The updated value of W56 = 0.5 + 0.00494 = 0.50494
ΔW14 = ΔW24 = ΔW34 = 0.3(0.00250)(0.8375) = 0.000628
The updated value of W14 = 1.2 + 0.000628 = 1.200628
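The worked numbers above can be reproduced with a short Python sketch (my own). The inputs 1, 0.5 and 0.1 and the starting weights are taken from the example, and errors are treated as magnitudes as in the text, so the results match up to rounding:

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

inputs = [1.0, 0.5, 0.1]
w4, w5 = [1.2, 0.7, 0.9], [-0.4, 0.6, 1.1]     # input -> hidden weights
w46, w56 = 0.8, 0.5                            # hidden -> output weights
target, r = 0.6, 0.3                           # observed output, learning rate

out4 = sigmoid(sum(i * w for i, w in zip(inputs, w4)))   # 0.8375
out5 = sigmoid(sum(i * w for i, w in zip(inputs, w5)))   # 0.5025
out6 = sigmoid(out4 * w46 + out5 * w56)                  # 0.7153

err6 = out6 * (1 - out6) * (target - out6)               # magnitude ~0.023
err4 = out4 * (1 - out4) * err6 * w46                    # magnitude ~0.0025
err5 = out5 * (1 - out5) * err6 * w56                    # magnitude ~0.0029

# Delta rule: W(new) = W(old) + r * Error(W) * O, with error magnitudes
print(w46 + r * abs(err6) * out6)    # ~0.8050 (text: 0.80494, rounded error 0.023)
print(w4[0] + r * abs(err4) * out4)  # ~1.20064 (text: 1.200628)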
General considerations
• What input attributes will be used to build the network?
• How will the network output be represented?
• How many hidden layers should the network contain?
• How many nodes should there be in each hidden layer?
• What condition will terminate network training?
Neural network in Weka
Figure 5.7. Iris data file opened for Neural Network model construction
Figure 5.8. Neural Network algorithm selection from Weka
Figure 5.9 Weka environment after Neural Network algorithm selection
Figure 5.10 Output screen after execution of Weka
Parameter value selection for
neural networks
= = = Run information = = =
Scheme: weka.classifiers.functions.MultilayerPerceptron -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a
Relation: iris
Instances: 150
Attributes: 5
sepallength
sepalwidth
petallength
petalwidth
class
Test mode: 10-fold cross-validation
(cont.)
= = = Classifier model (full training set) = = =
Sigmoid Node 0
Inputs Weights
Threshold -3.5015971588434005
Node 3 -1.0058110853859954
Node 4 9.07503844669134
Node 5 -4.107780453339232
-----------------
Time taken to build model: 0.66 seconds
= = = Stratified cross-validation = = =
= = = Summary = = =
Correctly Classified Instances      146      97.3333 %
Incorrectly Classified Instances      4       2.6667 %
-----------------
Figure 5.12 Training process of a neural network. A = under-trained model, B = properly trained model, and C = over-trained model
Figure 5.13 Parameter details in Weka for neural network algorithm
Neural network of the Iris data

[Figure: the Iris network has four input nodes (Nodes 6-9, for sepal length, sepal width, petal length and petal width), three hidden nodes (Nodes 3-5) and three output nodes (Node 0 = Iris setosa, Node 1 = Iris versicolor, Node 2 = Iris virginica).]
Neural network strengths
• Works well with noisy data.
• Can process both numeric and categorical data.
• Appropriate for applications requiring a time
element.
• Have performed well in several domains.
• Appropriate for supervised learning and
unsupervised clustering.
Weaknesses
• Lack explanation capabilities.
• May not provide optimal solutions to problems.
• Overtraining can be a problem.
Support vector machines (SVM)
Class partitioning lines

Figure 5.17 The training patterns, Crescent-shaped and Oval-shaped objects, are represented by two different symbols based on their class labels. These patterns are classified by non-OH lines in (a) and (b), but the solid line indicates the OH in (c) (Ali, 2005)
Optimal hyperplane and the support vectors

Figure 5.18 (a) margins are narrow (b) margins are maximum
Figure 5.19 (a) The input space, (b) the linear OH construction with errors, and (c) the nonlinear OH construction without error by using kernel mapping to a 256-dimensional space
SVM architecture
Figure 5.20 An SVM architecture
XOR solved by SVM
Table 5.3 Boolean XOR problem

Input data x    Output class y
(-1, -1)        -1
(-1, +1)        +1
(+1, -1)        +1
(+1, +1)        -1
• First, we transform the dataset by the polynomial kernel:

K(\mathbf{x}_i, \mathbf{x}_j) = \left(1 + \mathbf{x}_i^{T}\mathbf{x}_j\right)^2

Here the four input vectors are the rows of

\begin{bmatrix} -1 & -1 \\ -1 & +1 \\ +1 & -1 \\ +1 & +1 \end{bmatrix}

Therefore the kernel matrix is:

K(\mathbf{x}_i, \mathbf{x}_j) = \begin{bmatrix} 9 & 1 & 1 & 1 \\ 1 & 9 & 1 & 1 \\ 1 & 1 & 9 & 1 \\ 1 & 1 & 1 & 9 \end{bmatrix}
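A quick Python check of the kernel matrix (my own sketch):

X = [(-1, -1), (-1, 1), (1, -1), (1, 1)]

def poly_kernel(xi, xj):
    """K(xi, xj) = (1 + xi . xj)^2"""
    return (1 + sum(a * b for a, b in zip(xi, xj))) ** 2

K = [[poly_kernel(xi, xj) for xj in X] for xi in X]
for row in K:
    print(row)
# [[9, 1, 1, 1], [1, 9, 1, 1], [1, 1, 9, 1], [1, 1, 1, 9]]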
We can write the maximisation term, following the SVM implementation given in Figure 5.20, as:

\max_{\alpha}\ \sum_{i=1}^{4}\alpha_i - \frac{1}{2}\sum_{i=1}^{4}\sum_{j=1}^{4}\alpha_i\alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)

= \alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 - \frac{1}{2}\left(9\alpha_1^2 - 2\alpha_1\alpha_2 - 2\alpha_1\alpha_3 + 2\alpha_1\alpha_4 + 9\alpha_2^2 + 2\alpha_2\alpha_3 - 2\alpha_2\alpha_4 + 9\alpha_3^2 - 2\alpha_3\alpha_4 + 9\alpha_4^2\right)

subject to:

\sum_{i=1}^{4} y_i\alpha_i = -\alpha_1 + \alpha_2 + \alpha_3 - \alpha_4 = 0, \qquad 0 \le \alpha_1,\ 0 \le \alpha_2,\ 0 \le \alpha_3,\ 0 \le \alpha_4

Setting the derivatives of the objective with respect to each \alpha_i to zero gives:

9\alpha_1 - \alpha_2 - \alpha_3 + \alpha_4 = 1
-\alpha_1 + 9\alpha_2 + \alpha_3 - \alpha_4 = 1
-\alpha_1 + \alpha_2 + 9\alpha_3 - \alpha_4 = 1
\alpha_1 - \alpha_2 - \alpha_3 + 9\alpha_4 = 1

By solving these equations we can write the solution to this optimisation problem as:

\alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = \frac{1}{8}
Therefore, the decision function in the inner-product representation is:

\hat{f}(\mathbf{x}) = \sum_{i=1}^{4}\alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) = 0.125\sum_{i=1}^{4} y_i\left(\mathbf{x}_i^{T}\mathbf{x} + 1\right)^2

The 2nd degree polynomial kernel function:

K(\mathbf{x}_i, \mathbf{x}_j) = \left(\mathbf{x}_i^{T}\mathbf{x}_j + 1\right)^2
= (x_{i1}x_{j1} + x_{i2}x_{j2})^2 + 2(x_{i1}x_{j1} + x_{i2}x_{j2}) + 1
= 1 + (x_{i1}x_{j1})^2 + 2(x_{i1}x_{j1})(x_{i2}x_{j2}) + (x_{i2}x_{j2})^2 + 2(x_{i1}x_{j1}) + 2(x_{i2}x_{j2})
= \varphi(\mathbf{x}_i)^{T}\varphi(\mathbf{x}_j)

Now we can write the 2nd degree polynomial transformation function as:

\varphi(\mathbf{x}_i) = \left[1,\ x_{i1}^2,\ \sqrt{2}\,x_{i1}x_{i2},\ x_{i2}^2,\ \sqrt{2}\,x_{i1},\ \sqrt{2}\,x_{i2}\right]^{T}
\mathbf{w}_o = \sum_{i=1}^{4}\alpha_i y_i \varphi(\mathbf{x}_i) = \frac{1}{8}\left[-\varphi(\mathbf{x}_1) + \varphi(\mathbf{x}_2) + \varphi(\mathbf{x}_3) - \varphi(\mathbf{x}_4)\right]

= \frac{1}{8}\left(-\begin{bmatrix}1\\1\\\sqrt{2}\\1\\-\sqrt{2}\\-\sqrt{2}\end{bmatrix} + \begin{bmatrix}1\\1\\-\sqrt{2}\\1\\-\sqrt{2}\\\sqrt{2}\end{bmatrix} + \begin{bmatrix}1\\1\\-\sqrt{2}\\1\\\sqrt{2}\\-\sqrt{2}\end{bmatrix} - \begin{bmatrix}1\\1\\\sqrt{2}\\1\\\sqrt{2}\\\sqrt{2}\end{bmatrix}\right) = \begin{bmatrix}0\\0\\-1/\sqrt{2}\\0\\0\\0\end{bmatrix}

so that

\hat{f}(\mathbf{x}) = \mathbf{w}_o^{T}\varphi(\mathbf{x}) = -\frac{1}{\sqrt{2}}\left(\sqrt{2}\,x_1 x_2\right)

Therefore the optimal hyperplane function for this XOR problem is:

\hat{f}(\mathbf{x}) = -x_1 x_2
Figure 5.21 XOR problem classification by SVM using second
degree polynomial kernel
Kernel trick
(a) linear kernel
(b) polynomial kernel
(c) rbf kernel
(d) spline kernel
(e) sigmoidal kernel
(f) multiquadratic kernel
SVM (SMO) in Weka
Figure 5.23 SVM classifier (named as SMO in Weka list) selection
Figure 5.24 SVM classification performance for Colon Tumor data
RBF kernel selection
Figure 5.25 RBF kernel selection for SVM classifier
Figure 5.26 SVM classification performance with RBF kernel
Advantages of SVM
• Can handle high dimensional dataset well.
• Can map almost any nonlinear space with a
suitable kernel.
• Recent advances have added to its versatility.
Disadvantages of SVM
• Choice of a proper kernel function is not easy.
• Once a kernel is chosen, the algorithm is rigid.
• Slower than other algorithms.
• Not efficient with discrete data.
Recap
• Neural networks working principle.
• Support vector machines working principle.
• Applications and limitations of neural networks and support vector machines.
Model Evaluation Techniques
Objectives
On completion of this lecture you should know:
• Quantity-based model evaluation techniques
• Appropriateness of evaluation techniques
• Bagging and boosting
• Model ranking
What should be evaluated?
The performance criteria should be:
• Reliable (work almost every time)
• Valid (relevant to objectives)
Accuracy/error rate estimation
Figure 6.1 Colon tumor problem training performance with J48
Weka decision tree classifier
Error/accuracy
• The opposite of accuracy is Incorrect
Classification of Instances, or alternatively, Error.
An empirical percent error can be defined as the
ratio of the number of errors to the number of
total instances examined in the experiment, which
in equation form is:
\text{Percent error} = \frac{\text{Number of incorrectly classified instances}}{\text{Total number of instances}} \times 100
Figure 6.2 Colon tumor problem 10-fold cross-validation
performance with J48 Weka classifier
Confusion matrix

                              Predicted
                              Healthy (Negative)   Sick (Positive)
Actual   Healthy (Negative)           a                   b
         Sick (Positive)              c                   d

Diagonal terms give the correct values.
(cont.)

• Accuracy: ACC = \frac{a + d}{a + b + c + d}

• True positive rate: TPR = \frac{d}{c + d}

• False positive rate: FPR = \frac{b}{a + b}

• True negative rate: TNR = \frac{a}{a + b}

• False negative rate: FNR = \frac{c}{c + d}

• Precision: P = \frac{d}{b + d}

• F-measure: F = \frac{(\beta^2 + 1)\,P \cdot TPR}{\beta^2 P + TPR}

• Kappa statistic: K = \frac{P(A) - P(E)}{1 - P(E)}, where P(E) = \frac{(a + b)(a + c) + (b + d)(c + d)}{(a + b + c + d)^2}
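A compact Python sketch (not from the text) that computes all of these measures from the four cells a, b, c and d; the counts in the example call are hypothetical:

def metrics(a, b, c, d, beta=1.0):
    total = a + b + c + d
    acc = (a + d) / total
    tpr, fpr = d / (c + d), b / (a + b)
    tnr, fnr = a / (a + b), c / (c + d)
    precision = d / (b + d)
    f = ((beta ** 2 + 1) * precision * tpr) / (beta ** 2 * precision + tpr)
    p_e = ((a + b) * (a + c) + (b + d) * (c + d)) / total ** 2
    kappa = (acc - p_e) / (1 - p_e)
    return dict(ACC=acc, TPR=tpr, FPR=fpr, TNR=tnr, FNR=fnr,
                P=precision, F=f, Kappa=kappa)

print(metrics(a=50, b=10, c=5, d=35))   # hypothetical confusion-matrix counts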
Example 6.1

                      Computed decision
                  Class 1   Class 2   Class 3
Actual  Class 1      14         2         5
        Class 2    ? (x)       40         2
        Class 3       1      ? (y)       18

Answer:
For class 2: 0.2 = (x + 2)/(x + 40 + 2)
or 0.2(x + 42) = x + 2
or 8.4 - 2 = 0.8x
therefore x = 8.
From the model accuracy: 0.72 = 72/(y + 90), which gives y = 10.
Final confusion matrix

                      Computed decision
                  Class 1   Class 2   Class 3
Actual  Class 1      14         2         5
        Class 2       8        40         2
        Class 3       1        10        18
Error rate estimation

• Mean Absolute Error: MAE = \frac{|P_1 - A_1| + |P_2 - A_2| + \ldots + |P_n - A_n|}{n}

• Root Mean-Squared Error: RMS = \sqrt{\frac{(P_1 - A_1)^2 + (P_2 - A_2)^2 + \ldots + (P_n - A_n)^2}{n}}

• Relative Absolute Error: RAE = \frac{|P_1 - A_1| + |P_2 - A_2| + \ldots + |P_n - A_n|}{|A_1 - \bar{A}| + |A_2 - \bar{A}| + \ldots + |A_n - \bar{A}|}

where P_i are the predicted values, A_i the actual values, and \bar{A} the mean of the actual values.
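A small Python sketch of the three error measures (my own); the predicted/actual vectors in the example call are hypothetical:

import math

def mae(P, A):
    return sum(abs(p - a) for p, a in zip(P, A)) / len(A)

def rms(P, A):
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(P, A)) / len(A))

def rae(P, A):
    mean_a = sum(A) / len(A)
    return sum(abs(p - a) for p, a in zip(P, A)) / sum(abs(a - mean_a) for a in A)

P, A = [0.9, 0.4, 0.7], [1.0, 0.0, 1.0]   # hypothetical predicted / actual values
print(mae(P, A), rms(P, A), rae(P, A))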
Receiver operating curve (ROC)

[Plot: True Positive Rate (0-1) against False Positive Rate (0-1).]
Figure 6.3 Receiver operating curves. C0 is for a random model, C1 and C2 are for two different parametric models, and P1 is for a non-parametric model
Euclidean distance comparison

P_d = 1 - \sqrt{w(1 - TPR)^2 + (1 - w)\,FPR^2}

where w is the weight factor, with a range 0 to 1, that is used to assign the relative importance of false positives to false negatives.
P_d has a range from 1 for the perfect classifier to 0 for an absolutely incorrect classifier.
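A one-function Python sketch of this distance measure, checked at the two extreme cases mentioned above (w = 0.5 is an assumed default):

import math

def pd(tpr, fpr, w=0.5):
    """Pd = 1 - sqrt(w*(1 - TPR)^2 + (1 - w)*FPR^2); 1 = perfect, 0 = worst."""
    return 1 - math.sqrt(w * (1 - tpr) ** 2 + (1 - w) * fpr ** 2)

print(pd(1.0, 0.0))   # 1.0 for the perfect classifier
print(pd(0.0, 1.0))   # 0.0 for the absolutely incorrect classifier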
Lift chart

Lift = \frac{P(R_s \mid S)}{P(R_t \mid G)}

Model A              Computed accept   Computed reject
Actual accept                300             1,700
Actual reject              9,700            88,300
Total                     10,000            90,000

Lift = (300/10,000) / (2,000/100,000) = 1.5

Model B              Computed accept   Computed reject
Actual accept              1,200               800
Actual reject             38,800            59,200
Total                     40,000            60,000

Lift = (1,200/40,000) / (2,000/100,000) = 1.5

Table 6.2 Two models with equal lifts
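A tiny Python sketch reproducing the two lift values of Table 6.2:

def lift(accepts_in_sample, sample_size, total_accepts, population_size):
    """Lift = P(respond | sample) / P(respond | whole population)."""
    return (accepts_in_sample / sample_size) / (total_accepts / population_size)

print(lift(300, 10_000, 2_000, 100_000))     # Model A -> 1.5
print(lift(1_200, 40_000, 2_000, 100_000))   # Model B -> 1.5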
Lift chart

[Chart: Number Responding (0-2,000) against Percent Sampled (0-100), with curves L1 and L0.]
Figure 6.4 Lift chart for targeted versus mass mailing
Cost and utility

Cost = \sum_{i}\sum_{j} E_{ij} C_{ij}

Utility = \sum_{i}\sum_{j} E_{ij} U_{ij}
Cost matrix calculation

Table 6.3 Cost matrix to penalise errors

                      Predicted
                  Negative   Positive
Actual  Negative      0          1
        Positive      3          0
Table 6.4 Confusion matrix of a hypothetical dataset

                     Predicted Class
True Class     - 1 -   - 2 -   - 3 -   - 4 -
   1             43       3       4       1
   2              2      34       1       0
   3              5       2      39       2
   4              3       1       4      56
Table 6.5 Cost matrix for the hypothetical dataset of Table 6.4

                     Predicted Class
True Class     - 1 -   - 2 -   - 3 -   - 4 -
   1              0       1       2       1
   2              1       0       1       3
   3              2       2       0       1
   4              0       3       1       0
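A short Python sketch (my own) that applies the cost formula Cost = Σ E_ij C_ij to Tables 6.4 and 6.5:

E = [[43,  3,  4,  1],    # confusion matrix (Table 6.4)
     [ 2, 34,  1,  0],
     [ 5,  2, 39,  2],
     [ 3,  1,  4, 56]]
C = [[ 0,  1,  2,  1],    # cost matrix (Table 6.5)
     [ 1,  0,  1,  3],
     [ 2,  2,  0,  1],
     [ 0,  3,  1,  0]]

cost = sum(E[i][j] * C[i][j] for i in range(4) for j in range(4))
print(cost)   # 38, the total misclassification cost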
Parsimony
(Occam’s razor principle)
• The simpler of two models with similar
performances is a better choice.
Bagging and boosting
• Bagging is the process of developing a model for each random sample of the data, and classifying an unknown instance based on what the majority of the models predict.
• Boosting is similar but more complex than
bagging. In boosting, increased focus is given to
misclassified instances to capture their
behaviour well.
Bagging and boosting performance
evaluation through Weka
Figure 6.5 AdaBoostM1 algorithm selection from the Weka environment
Figure 6.6 Base classifier selection for AdaBoostM1 algorithms
Figure 6.7 AdaBoostM1 algorithm selection in Weka
environment with J48 as a base classifier
Figure 6.8 AdaBoostM1 performance with base classifier J48
Figure 6.9 Bagging performance with base classifier J48
Figure 6.10 J48 classifier performance for liver-disorders data
Model ranking
• Accuracy
• Computation time
• Memory space required
• Ease of interpretation
Model ranking

R_{ij} = 1 - \frac{e_{ij} - \max(\mathbf{e}_i)}{\min(\mathbf{e}_i) - \max(\mathbf{e}_i)}

where e_{ij} is the performance measure of the jth algorithm on dataset i, and \mathbf{e}_i is the vector of performance measures for dataset i.

\phi_i = \frac{1}{2}\cdot\frac{s_i - f_i}{n} + \frac{1}{2}

where s_i is the total number of success (best) cases for the ith classifier, f_i is the total number of failure (worst) cases for the same classifier, and n is the total number of datasets.

Z = \alpha a_i + \beta t_i

where \alpha and \beta are the weight parameters for ranking average accuracy against computational complexity. The average accuracy and complexity are denoted by a_i and t_i. We consider the range for \alpha and \beta between 0 and 2.
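A minimal Python sketch of the ranking measures (my own; the accuracies are the 'Average accuracy' column of Table 6.6, and the sign convention for beta is an assumption - a negative beta penalises slower classifiers):

acc = {'IBK': 0.551, 'C4.5': 0.609, 'PART': 0.593, 'KD': 0.588,
       'NB': 0.45, 'OneR': 0.36, 'SVM': 0.585, 'NN': 0.621}

def rank(e, e_vector):
    """R = 1 - (e - max) / (min - max): best performance -> 1, worst -> 0."""
    return 1 - (e - max(e_vector)) / (min(e_vector) - max(e_vector))

def z_score(a_i, t_i, alpha=1.0, beta=1.0):
    """Z = alpha*a_i + beta*t_i, trading accuracy against computation time."""
    return alpha * a_i + beta * t_i

for name, a in acc.items():
    print(name, round(rank(a, list(acc.values())), 3))   # NN -> 1.0, OneR -> 0.0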
Example 6.2

Table 6.6 Various measures of accuracy of the models tested (Ali, 2005)

Classifier   TPR     TNR     % of correct     F-measure   Average
                             classification               accuracy
IBK          0.595   0.54    0.505            0.565       0.551
C4.5         0.595   0.6     0.615            0.625       0.609
PART         0.595   0.6     0.55             0.625       0.593
KD           0.61    0.6     0.565            0.575       0.588
NB           0.495   0.44    0.385            0.48        0.45
OneR         0.365   0.385   0.325            0.365       0.36
SVM          0.565   0.595   0.62             0.56        0.585
NN           0.645   0.605   0.615            0.62        0.621
Table 6.7 Ranking of algorithms in terms of computation time (Ali, 2005)

Classifier       IBK     C4.5    PART    KD      NB      OneR    SVM     NN
Execution Time   0.535   0.535   0.52    0.51    0.705   0.995   0.5     0.015
[Bar chart: average accuracy and average computational time (average performance, 0-1) for the classifiers IBK, C4.5, PART, KD, NB, OneR, SVM and NN.]
Figure 6.11 Combined performances for the classifiers (Ali, 2005)
[Plot: Z values against beta values for the classifiers IBK, NB, C4.5, OneR, PART, SVM, KD and NN.]
Figure 6.12 Relative performance for the classifiers, when alpha (= 1) is fixed (Ali, 2005)
Model ranking through Weka
Figure 6.13 Weka GUI chooser
Step 2
Figure 6.14 Weka experimenter setup
Step 3
Figure 6.15 Several datasets and classifier selection in Weka environment
Figure 6.16 Weka run performance with several classifiers for multiple problems
Step 4
Figure 6.17 Weka performance analysis window
Step 5
Figure 6.18 Classifiers list visualisation in the WEKA environment.
Step 6
Figure 6.19 Combined classification performance of several classifiers
Step 7
Figure 6.20 Training time performance for different classifiers
No free lunch (NFL) theorem
• If algorithm A outperforms algorithm B on some
cost functions, then loosely speaking there must
exist exactly as many other functions where B
outperforms A.
Recap
• Model Evaluation Techniques.
• Appropriateness of Evaluation Techniques.
• Bagging and Boosting Methods.
• Combined Model Ranking.