Transcript ppt - DIT

Dr. Brian Mac Namee (www.comp.dit.ie/bmacnamee)
Business
Systems Intelligence:
5. Classification 2
2
of
25
49
Acknowledgments
These notes are based (heavily) on
those provided by the authors to
accompany “Data Mining: Concepts
& Techniques” by Jiawei Han and
Micheline Kamber
Some slides are also based on trainer’s kits
provided by
More information about the book is available at:
www-sal.cs.uiuc.edu/~hanj/bk2/
And information on SAS is available at:
www.sas.com
3
of
25
49
Classification & Prediction
Today we will look at:
– What are classification & prediction?
– Issues regarding classification and prediction
– Classification techniques:
•
•
•
•
•
•
•
Case based reasoning (k-nearest neighbour algorithm)
Decision tree induction
Bayesian classification
Neural networks
Support vector machines (SVM)
Classification based on from association rule mining concepts
Other classification methods
– Prediction
– Classification accuracy
4
of
25
49
Classification
Classification:
– Predicts categorical class labels
Typical Applications
– {CreditHistory, Salary} -> CreditApproval (Yes/No)
– {Temp, Humidity} --> Rain (Yes/No)
x  X  {0,1} , y  Y  {0,1}
h: X Y
y  h( x )
n
Mathematically
5
of
25
49
Linear Classification
Binary Classification problem
The data above the red line belongs to class ‘x’
The data below red line
x
belongs to class ‘o’
x
x
x
x
Examples
–
SVM,
x
x
o
x x
Perceptron, Probabilistic
o
Classifiers
x
o o o
oo
o
o
o o
o
o
6
of
25
49
Discriminative Classifiers
Advantages
– Prediction accuracy is generally high
– Robust, works when training examples contain
errors
– Fast evaluation of the learned target function
Criticism
– Long training time
– Difficult to understand the learned function
(weights)
– Not easy to incorporate domain knowledge
7
of
25
49
Artificial Neural Networks
A biologically inspired
classification technique
Formed from interconnected
layers of simple artificial neurons
ANN history:
– 1943: McCulloch & Pitts
– 1959: Rosenblatt (Perceptron)
– 1959: Widrow & Hoff (ADALINE and
MADALINE)
– 1969: Marvin Minsky and Seymour Papert's
– 1974: Werbos (Backprop)
– 1982: John Hopfield
8
of
25
49
An Artifical Neuron
The n-dimensional input vector x is mapped
into variable y by means of the scalar product
and a nonlinear function mapping
x0
x1
w0
w1
f (x)
n
xn
wn
f ( x)  thresh(bias   wi * xi )
i 0
bias
9
of
25
49
ANN: Multi-Layer Perceptrons (MLPs)
Multi Layer Perceptrons (MLPs) are one of the
best known ANN types
Composed of layers of fully interconnected
artificial neurons
Training involves repeatedly presenting a
series of training cases to the network and
adjusting neurons’ weights and biases to
minimise classification error
Typically the backpropogation of error
algorithm is used for training
10
of
25
49
MLP Example
Remember our surfing example
An MLP can be built and trained to perform
classification for this problem
Input
Layer
Wind Speed
Wind Direction
Temperature
Wave Size
Wave Period
Hidden
Layer
Output
Layer
Good Surf
11
of
25
49
Network Training
The ultimate objective of training
– Obtain a set of weights that makes almost all of the
tuples in the training data classified correctly
Steps
– Initialize weights with random values
– Feed the input tuples into the network one by one
– For each unit
• Compute the net input to the unit as a linear combination of all
the inputs to the unit
• Compute the output value using the activation function
• Compute the error
• Update the weights and the bias
12
of
25
49
Summary of ANN Classification
Strengths
– Fast classification
– Very good generalization capacity
Weaknesses
– No explanation capability – black box
– Training can be slow – eager learning
– Retraining is difficult
Lots of other network types, but MLP is
probably the most common
13
of
25
49
Support Vector Machines (SVM)
In classification problems we try to create
decision boundaries between classes
A choice must be made between possible
boundaries
Class 2
Class 1
14
of
25
49
SVMs (cont…)
The decision boundary should be as far away
from the data of both classes as possible
Class 2
Class 1
m
15
of
25
49
Margins
Small Margin
Large Margin
Support Vectors
16
of
25
49
Linear Support Vector Machine
Given a set of points xi   with label yi  {  1,1 }
The SVM finds a hyperplane defined by the
pair (w, b), where w is the normal to the plane
and b is the distance from the origin
n
yi ( xi  w  b)  1 i  1,..., N
Where:
• x - feature vector
• b - bias, y- class label
• ||w|| - margin
17
of
25
49
SVMs: The Clever Bit!
What about when classes are not linearly
separable?
f(.)
Input space
f( )
f( )
f( )
f( ) f( ) f( )
f( )
f( )
f( )
f( ) f( )
f( ) f( )
f( )
f( ) f( )
f( )
f( )
Feature space
Kernel functions and the kernel trick are used
to transform data into a different linearly
separable feature space
18
of
25
49
SVMs: The Clever Bit! (cont...)
What if the data is not linearly separable?
Project the data to high dimensional space
where it is linearly separable and then we can
use linear SVM – (Using Kernels)
(0,1) +
+
-
+
-1
0
+1
(0,0)
+
(1,0)
19
of
25
49
SVM Example
20
of
25
49
ResultsSVM
Example (cont…)
21
of
25
49
Summary of SVM Classification
Strengths
– Over-fitting is not common
– Works well with high dimensional data
– Fast classification
– Good generalization capacity
Weaknesses
– Retraining is difficult
– No explanation capability
– Slow training
At the cutting edge of machine learning
22
of
25
49
SVM vs. ANN
SVM
– Relatively new concept
– Nice generalization
properties
– Hard to learn – learned
in batch mode using
quadratic programming
techniques
– Using kernels can learn
very complex functions
ANN
– Quite old
– Generalizes well but
doesn’t have strong
mathematical foundation
– Can easily be learned in
incremental fashion
– To learn complex
functions – use
multilayer perceptron
(not that trivial)
23
of
25
49
SVM Related Links
http://svm.dcs.rhbnc.ac.uk/
http://www.kernel-machines.org/
C. J. C. Burges. A Tutorial on Support Vector
Machines for Pattern Recognition. Knowledge
Discovery and Data Mining, 2(2), 1998.
SVMlight – Software (in C)
http://ais.gmd.de/~thorsten/svm_light
BOOK: An Introduction to Support Vector
Machines N. Cristianini and J. Shawe-Taylor
Cambridge University Press
24
of
25
49
Association-Based Classification
Several methods for association-based
classification
– ARCS: Quantitative association mining and
clustering of association rules (Lent et al’97)
• It beats C4.5 in (mainly) scalability and also accuracy
– Associative classification: (Liu et al’98)
• It mines high support and high confidence rules in the
form of “cond_set => y”, where y is a class label
– CAEP (Classification by aggregating emerging
patterns) (Dong et al’99)
• Emerging patterns (EPs): the itemsets whose support
increases significantly from one class to another
• Mine Eps based on minimum support and growth rate
25
of
25
49
What Is Prediction?
Prediction is similar to classification
– First, construct a model
– Second, use model to predict unknown value
• Major method for prediction is regression
– Linear and multiple regression
– Non-linear regression
Prediction is different from classification
– Classification refers to predict categorical class
label
– Prediction models continuous-valued functions
26
of
25
49
Regress Analysis and Log-Linear Models in
Prediction
Linear regression: Y =  +  X
– Two parameters,  and , specify the line and
are to be estimated by using the data at hand
– Using the least squares criterion to the known
values of Y1, Y2,…, X1, X2,….
Multiple regression: Y = b0 + b1X1 + b2X2
– Many nonlinear functions can be transformed
into the above
Log-linear models:
– The multi-way table of joint probabilities is
approximated by a product of lower-order tables
– Probability: p(a, b, c, d) = ab acad bcd
27
of
25
49
Prediction: Numerical Data
28
of
25
49
Prediction: Categorical Data
29
of
25
49
Concerns Over Classification Techniques
When choosing a technique for a specific
classification problem we must consider the
following issues:
– Classification accuracy
– Training speed
– Classification speed
– Danger of over-fitting
– Generalisation capacity
– Implications for retraining
– Explanation capability
30
of
25
49
Evaluating Classification Accuracy
During development, and in testing before
deploying a classifier in the wild, we need to be
able to quantify the performance of the
classifier
– How accurate is the classifier?
– When the classifier is wrong, how is it wrong?
Useful to decide on which classifier (which
parameters) to use and to estimate what the
performance of the system will be
31
of
25
49
Evaluating Classifiers (cont…)
How we do this depends on how much data is
available
If there is unlimited data available then there is
no problem
Usually we have less data than we would like
so we have to compromise
– Use hold-out testing sets
– Cross validation
• K-fold cross validation
• Leave-one-out validation
– Parallel live test
32
of
25
49
Hold-Out Testing Sets
Split the available data into a training set and a
test set
Total number of available examples
Training Set
Test Set
Train the classifier in the training set and
evaluate based on the test set
A couple of drawbacks
– We may not have enough data
– We may happen upon an unfortunate split
33
of
25
49
K-Fold Cross Validation
Divide the entire data set into k folds
For each of k experiments, use kth fold for
testing and everything else for training
Total number of available examples
K=0
K=1
K=2
K=3
Test Set
Test Set
Test Set
Test Set
34
of
25
49
K-Fold Cross Validation (cont…)
The accuracy of the system is calculated as the
average error across the k folds
The main advantages of k-fold cross validation
are that every example is used in testing at
some stage and the problem of an unfortunate
split is avoided
Any value can be used for k
– 10 is most common
– Depends on the data set
35
of
25
49
Leave-One-Out Cross Validation
Extreme case of k-fold cross validation
With N data examples perform N experiments
with N-1 training cases and 1 test case
Total number of available examples
K=0
K=1
K=2
K=N
36
of
25
49
Classifier Accuracy
The accuracy of a classifier on a given test set
is the percentage of test set tuples that are
correctly classified by the classifier
– Often also referred to as recognition rate
– Error rate (or misclassification rate) is the
opposite of accuracy
37
of
25
49
False Positives Vs False Negatives
While it is useful to generate the simple
accuracy of a classifier, sometimes we need
more
When is the classifier wrong?
– False positives vs false negatives
– Related to type I and type II errors in statistics
Often there is a different cost associated with
false positives and false negatives
– Think about diagnosing diseases
38
of
25
49
Confusion Matrix
Device used to illustrate how a classifier is
performing in terms of false positives and false
negatives
Classifier Result
Gives us more
information than a
Class A Class B
(yes)
(no)
single accuracy
Class A
figure
fn

(yes)
Expected
Allows us to think
Class B Result
about the cost of
fp
 (no)
mistakes
Can be extended to any number of classes
39
of
25
49
Other Accuracy Measures
Sometimes a simple accuracy measure is not
enough
t _ pos
sensitivit y 
pos
t _ neg
specificit y 
neg
t _ pos
precision 
t _ pos  f _ pos 
40
of
25
49
ROC Curves
Receiver Operating Characteristic (ROC)
curves were originally used to make sense of
noisy radio signals
Can be used to help us talk about classifier
performance and determine the best operating
point for a classifier
41
of
25
49
ROC Curves (cont…)
Consider how the
relationship between
true positives and false
positives can change
We need to choose the
best operating point
True Positives
1.0
0
False Positives
1.0
For some great ROC curve examples have a look here
42
of
25
49
ROC Curves (cont…)
ROC curves can be
used to compare
classifiers
The greater the area
under the curve the
more accurate the
classifier
True Positives
1.0
0
False Positives
1.0
43
of
25
49
Over-Fitting
When we train a classifier we are trying to a
learn a function approximated by the training
data we happen to use
– What if the training data doesn’t
cover the whole problem space?
We can learn the training data too
closely which hampers the ability to generalise
This problem is known as overfitting
Depending on the type of classifier used there
are different approaches to avoiding this
44
of
25
49
Ensembles
In order to improve classification accuracy we
can aggregate the results of an ensemble of
classifiers
Classifier0
Classifier1
Classifiern
Aggregation
45
of
25
49
Bagging
Given a set S of s samples
Generate a bootstrap sample T from S
– Cases in S may not appear in T or may appear
more than once
Repeat this sampling procedure, getting a
sequence of k independent training sets
A corresponding sequence of classifiers
C1,C2,…,Ck is constructed for each of these
training sets, by using the same classification
algorithm
46
of
25
49
Bagging (cont…)
To classify an unknown sample X,let each
classifier predict or vote
The Bagged Classifier C* counts the votes and
assigns X to the class with the “most” votes
47
of
25
49
Boosting Technique — Algorithm
Assign every example an equal weight 1/N
For t = 1, 2, …, T Do
– Obtain a hypothesis (classifier) h(t) under w(t)
– Calculate the error of h(t) and re-weight the examples
based on the error . Each classifier is dependent on the
previous ones. Samples that are incorrectly predicted
are weighted more heavily
– Normalize w(t+1) to sum to 1 (weights assigned to
different classifiers sum to 1)
Output a weighted sum of all the hypothesis, with
each hypothesis weighted according to its accuracy
on the training set
48
of
25
49
Summary
Classification is an extensively studied problem
– Mainly in statistics and machine learning
Classification is probably one of the most
widely used data mining techniques
Scalability is still an important issue for
database applications
Research directions: classification of nonrelational data, e.g., text, spatial, multimedia,
etc..
49
of
25
49
Questions?