
Classification and Prediction
(based on the slides of the book Data Mining: Concepts and Techniques)
Classification and Prediction

What is classification? What is prediction?

Issues regarding classification and prediction

Classification by decision tree induction

Bayesian Classification (TI)

Classification by Back Propagation (Neural Networks) (TI)

Support Vector Machines (SVM)

Associative Classification: Classification by association rule analysis

Other Classification Methods: K-Nearest Neighbor, Case-based reasoning, etc.

Prediction: Regression (TI)
Classification vs. Prediction

Classification

predicts categorical class labels (discrete or nominal)

classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data

Prediction

models continuous-valued functions, i.e., predicts unknown or missing values

Typical applications

Credit approval
Target marketing
Medical diagnosis
Fraud detection
Classification—A Two-Step Process (1)

Model construction: describing a set of predetermined classes

Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute

The set of tuples used for model construction is the training set

The model is represented as classification rules, decision trees, or mathematical formulae
Classification—A Two-Step Process (2)

Model usage: for classifying future or unknown objects

Estimate the accuracy of the model

The known label of each test sample is compared with the classified result from the model

The accuracy rate is the percentage of test set samples that are correctly classified by the model

The test set is independent of the training set, otherwise overfitting will occur

If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
Classification Process (1): Model Construction

Training Data → Classification Algorithms → Classifier (Model)

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Classifier (Model):
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Classification Process (2): Use the Model in Prediction

Testing Data → Classifier

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data: (Jeff, Professor, 4) → Tenured?
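To make the two steps concrete, here is a minimal sketch in plain Python, using the faculty data and the learned rule from the slides above; the classify function stands in for the model produced in step 1.

```python
# Step 1 output: the constructed model (here, the classification rule
# "IF rank = 'professor' OR years > 6 THEN tenured = 'yes'").
def classify(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

# Step 2: estimate accuracy on the independent test set.
test_set = [
    ("Tom",     "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George",  "Professor",      5, "yes"),
    ("Joseph",  "Assistant Prof", 7, "yes"),
]

correct = sum(classify(rank, years) == label
              for _, rank, years, label in test_set)
print(f"accuracy = {correct}/{len(test_set)}")  # 3/4: Merlisa is misclassified

# If the accuracy is acceptable, classify unseen data:
print(classify("Professor", 4))                 # Jeff -> 'yes'
```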
Supervised vs. Unsupervised Learning

Supervised learning (classification)

The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations

New data is classified based on the training set

Unsupervised learning (clustering)

The class labels of the training data are unknown

Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Classification and Prediction

What is classification? What is prediction?

Issues regarding classification and prediction

Classification by decision tree induction

Bayesian Classification

Classification by Back Propagation

Support Vector Machines

Associative Classification: Classification by association rule analysis
Data Preparation

Data cleaning

Preprocess data in order to reduce noise and handle missing values

Relevance analysis (feature selection)

Remove the irrelevant or redundant attributes

Data transformation

Generalize and/or normalize data
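A minimal sketch of two of these steps, assuming a simple list-of-dicts dataset with illustrative field names: missing values are replaced with the most common value, and a continuous attribute is min-max normalized.

```python
from collections import Counter

records = [
    {"age": 25, "income": "high"},
    {"age": 42, "income": None},      # missing value
    {"age": 31, "income": "high"},
    {"age": 58, "income": "low"},
]

# Data cleaning: replace missing 'income' with the most common value.
most_common = Counter(r["income"] for r in records
                      if r["income"] is not None).most_common(1)[0][0]
for r in records:
    if r["income"] is None:
        r["income"] = most_common

# Data transformation: min-max normalize 'age' to [0, 1].
ages = [r["age"] for r in records]
lo, hi = min(ages), max(ages)
for r in records:
    r["age_norm"] = (r["age"] - lo) / (hi - lo)

print(records)
```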
Evaluating Classification Methods

Accuracy: classifier accuracy and predictor accuracy

Speed and scalability

time to construct the model (training time)
time to use the model (classification/prediction time)

Robustness

handling noise and missing values

Scalability

efficiency in disk-resident databases

Interpretability

understanding and insight provided by the model

Other measures

e.g., goodness of rules, such as decision tree size or compactness of classification rules
Classification Accuracy: Estimating Error Rates

Partition: training-and-testing

use two independent data sets, e.g., training set (2/3), test set (1/3)

used for data sets with a large number of samples

Cross-validation

divide the data set into k sub-samples

use k-1 sub-samples as training data and one sub-sample as test data—k-fold cross-validation

for data sets of moderate size

Bootstrapping (leave-one-out)

for data sets of small size
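A minimal sketch of k-fold cross-validation, where train and evaluate are hypothetical stand-ins for any classification method and accuracy measure:

```python
import random

def k_fold_cross_validation(data, k, train, evaluate):
    data = data[:]                     # shuffle a copy, not the caller's list
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]
    accuracies = []
    for i in range(k):
        test = folds[i]                # fold i is the test set ...
        training = [x for j, f in enumerate(folds) if j != i for x in f]
        model = train(training)        # ... the other k-1 folds train the model
        accuracies.append(evaluate(model, test))
    return sum(accuracies) / k         # average accuracy over the k runs
```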
Increasing Classifier Accuracy: Bagging

General idea: averaging the prediction over a collection of classifiers

Training data → Classification method (CM) → Classifier C
Altered training data → CM → Classifier C1
Altered training data → CM → Classifier C2
…
Aggregation → Classifier C*
Bagging: The Algorithm

Given a set S of s samples

Generate a bootstrap sample T from S. Cases in S may not appear in T or may appear more than once.

Repeat this sampling procedure, getting a sequence of k independent training sets

A corresponding sequence of classifiers C1, C2, …, Ck is constructed for each of these training sets, by using the same classification algorithm

To classify an unknown sample X, let each classifier predict or vote

The bagged classifier C* counts the votes and assigns X to the class with the "most" votes
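A minimal sketch of this algorithm; train is a hypothetical stand-in for the fixed base classification method, and each bootstrap sample is drawn from S with replacement, so some cases repeat and others are left out:

```python
import random
from collections import Counter

def bagging(S, k, train):
    classifiers = []
    for _ in range(k):
        T = random.choices(S, k=len(S))       # bootstrap sample of size s
        classifiers.append(train(T))          # same algorithm, altered data
    def bagged_classifier(x):                 # the bagged classifier C*
        votes = Counter(C(x) for C in classifiers)
        return votes.most_common(1)[0][0]     # class with the most votes
    return bagged_classifier
```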
Classification and Prediction

What is classification? What is prediction?

Issues regarding classification and prediction

Classification by decision tree induction

Bayesian Classification (TI)

Classification by Back Propagation (Neural Networks) (TI)

Support Vector Machines (SVM)

Associative Classification: Classification by association rule analysis

Other Classification Methods: K-Nearest Neighbor, Case-based reasoning, etc.

Prediction: Regression (TI)
Decision Tree Induction: Training Dataset

This example follows Quinlan's ID3 (Playing Tennis):

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no
Output: A Decision Tree for "buys_computer"

age?
  <=30  → student?
            no  → no
            yes → yes
  31…40 → yes
  >40   → credit_rating?
            excellent → no
            fair      → yes
Algorithm for Decision Tree Induction (ID3)

Basic algorithm (a greedy algorithm)

The tree is constructed in a top-down recursive divide-and-conquer manner

At start, all the training examples are at the root

Attributes are categorical (if continuous-valued, they are discretized in advance)

Examples are partitioned recursively based on selected attributes

Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

Conditions for stopping partitioning:

1) All samples for a given node belong to the same class
2) There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf
3) There are no samples left
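A minimal sketch of this recursive loop, assuming tuples are dicts of categorical attributes plus a class label; select_attribute is any heuristic measure, e.g., the information gain defined on the next slide:

```python
from collections import Counter

def majority_class(samples, label):
    return Counter(s[label] for s in samples).most_common(1)[0][0]

def build_tree(samples, attributes, label, select_attribute):
    classes = {s[label] for s in samples}
    if len(classes) == 1:                 # stop 1: all samples in one class
        return classes.pop()
    if not attributes:                    # stop 2: no attributes left
        return majority_class(samples, label)
    best = select_attribute(samples, attributes, label)
    tree = {best: {}}
    for value in {s[best] for s in samples}:
        # partition on the selected attribute and recurse; stop 3 (no samples
        # left) cannot trigger here because we branch only on observed values
        subset = [s for s in samples if s[best] == value]
        rest = [a for a in attributes if a != best]
        tree[best][value] = build_tree(subset, rest, label, select_attribute)
    return tree
```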
Attribute Selection Measure: Information Gain (ID3/C4.5)

Select the attribute with the highest information gain

S contains s_i tuples of class C_i, for i = 1, …, m (s denotes the total number of tuples in S)

The expected information required to classify any arbitrary tuple:

I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{s} \log_2 \frac{s_i}{s}

The entropy of attribute A with values {a_1, a_2, …, a_v}:

E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s} \, I(s_{1j}, \ldots, s_{mj})

The information gained by branching on attribute A:

Gain(A) = I(s_1, s_2, \ldots, s_m) - E(A)
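The three formulas transcribed directly into Python (a sketch, assuming the same dict-based tuples as the ID3 sketch above); on the buys_computer data of the next slide they reproduce I(9, 5) = 0.940, E(age) = 0.694 and Gain(age) = 0.246:

```python
from collections import Counter
from math import log2

def I(*counts):                        # expected information I(s1, ..., sm)
    s = sum(counts)
    return -sum(si / s * log2(si / s) for si in counts if si > 0)

def E(samples, A, label):              # entropy of attribute A
    s, total = len(samples), 0.0
    for value in {x[A] for x in samples}:
        subset = [x for x in samples if x[A] == value]
        total += len(subset) / s * I(*Counter(x[label] for x in subset).values())
    return total

def gain(samples, A, label):           # Gain(A) = I(s1, ..., sm) - E(A)
    return I(*Counter(x[label] for x in samples).values()) - E(samples, A, label)

# Usage with the ID3 sketch above:
# select_attribute = lambda s, attrs, lbl: max(attrs, key=lambda a: gain(s, a, lbl))
```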
Attribute Selection by Information Gain Computation

(training dataset as on the previous slide)

Class P: buys_computer = "yes"
Class N: buys_computer = "no"

I(p, n) = I(9, 5) = 0.940

Compute the entropy for age:

age    p_i  n_i  I(p_i, n_i)
<=30   2    3    0.971
31…40  4    0    0
>40    3    2    0.971

Here \frac{5}{14} I(2,3) means "age <= 30" has 5 out of 14 samples, with 2 yes's and 3 no's. Hence:

E(\text{age}) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694

Gain(\text{age}) = I(p, n) - E(\text{age}) = 0.246

Similarly:

Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
Extracting Classification Rules from Trees

Represent the knowledge in the form of IF-THEN rules

One rule is created for each path from the root to a leaf

Each attribute-value pair along a path forms a conjunction

The leaf node holds the class prediction

Rules are easier for humans to understand

Example:
IF age = "<=30" AND student = "no" THEN buys_computer = "no"
IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
IF age = "31…40" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"
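A minimal sketch of this extraction, assuming the nested-dict trees produced by the ID3 sketch earlier (internal nodes are single-key dicts, leaves are class labels):

```python
def extract_rules(tree, conditions=()):
    if not isinstance(tree, dict):                 # leaf: emit one rule
        tests = " AND ".join(f'{a} = "{v}"' for a, v in conditions)
        return [f'IF {tests} THEN class = "{tree}"']
    (attribute, branches), = tree.items()          # the attribute tested here
    rules = []
    for value, subtree in branches.items():        # one path per branch value
        rules += extract_rules(subtree, conditions + ((attribute, value),))
    return rules

# e.g. extract_rules({"age": {"31…40": "yes"}}) yields
# ['IF age = "31…40" THEN class = "yes"']
```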
Avoid Overfitting in Classification

Overfitting: an induced tree may overfit the training data

Too many branches, some of which may reflect anomalies due to noise or outliers

Poor accuracy for unseen samples

Two approaches to avoid overfitting

Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold (it is difficult to choose an appropriate threshold)

Postpruning: remove branches from a "fully grown" tree, getting a sequence of progressively pruned trees; use a set of data different from the training data to decide which is the "best pruned tree"
Approaches to Determine the Final Tree Size

Separate training (2/3) and testing (1/3) sets

Use cross-validation, e.g., 10-fold cross-validation

Use all the data for training

but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution

Use the minimum description length (MDL) principle

halting growth of the tree when the encoding is minimized
Enhancements to Basic Decision Tree Induction

Allow for continuous-valued attributes

Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals

Handle missing attribute values

Assign the most common value of the attribute

Assign a probability to each of the possible values

Attribute construction

Create new attributes based on existing ones that are sparsely represented

This reduces fragmentation, repetition, and replication
Computing Information Gain for Continuous-Valued Attributes

Let attribute A be a continuous-valued attribute

Must determine the best split point for A

Sort the values of A in increasing order

Typically, the midpoint between each pair of adjacent values is considered as a possible split point

(a_i + a_{i+1})/2 is the midpoint between the values of a_i and a_{i+1}

The point with the minimum expected information requirement for A is selected as the split point for A

Split: D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point
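A minimal sketch of this split-point search, reusing the expected-information measure I from the information-gain slide:

```python
from collections import Counter
from math import log2

def I(*counts):
    s = sum(counts)
    return -sum(si / s * log2(si / s) for si in counts if si > 0)

def best_split_point(samples, A, label):
    values = sorted({x[A] for x in samples})
    best, best_info = None, float("inf")
    for a_i, a_next in zip(values, values[1:]):
        split = (a_i + a_next) / 2                   # midpoint candidate
        D1 = [x for x in samples if x[A] <= split]
        D2 = [x for x in samples if x[A] > split]
        # expected information requirement of splitting D into D1 and D2
        info = sum(len(D) / len(samples) *
                   I(*Counter(x[label] for x in D).values())
                   for D in (D1, D2))
        if info < best_info:
            best, best_info = split, info
    return best
```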
Classification in Large Databases

Classification: a classical problem extensively studied by statisticians and machine learning researchers

Scalability: classify data sets with millions of examples and hundreds of attributes at reasonable speed

Why decision tree induction in data mining?

relatively fast learning speed (compared with other classification methods)

convertible to simple and easy-to-understand classification rules

can use SQL queries for accessing databases

classification accuracy comparable with other methods
Scalable Decision Tree Induction Methods

SLIQ (EDBT'96 — Mehta et al.): builds an index for each attribute; only the class list and the current attribute list reside in memory

SPRINT (VLDB'96 — J. Shafer et al.): constructs an attribute list data structure

PUBLIC (VLDB'98 — Rastogi & Shim): integrates tree splitting and tree pruning: stops growing the tree earlier

RainForest (VLDB'98 — Gehrke, Ramakrishnan & Ganti): separates the scalability aspects from the criteria that determine the quality of the tree; builds an AVC-list (attribute, value, class label)
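RainForest's AVC idea can be illustrated with a small sketch: for one attribute, a count table keyed by (attribute value, class label), built in a single scan; the function name and dict representation are illustrative, not the paper's actual data structure. Per-node tables like this are all a gain-style splitting criterion needs, so the full training set never has to fit in memory.

```python
from collections import defaultdict

def avc_set(samples, attribute, label):
    counts = defaultdict(int)
    for s in samples:                      # one sequential scan
        counts[(s[attribute], s[label])] += 1
    return counts

# e.g. on the buys_computer data:
# avc_set(data, "age", "buys_computer")[("<=30", "no")] == 3
```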
Presentation of Classification Results

[figure: screenshot of classification results]
Visualization of a Decision Tree in SGI/MineSet 3.0

[figure: MineSet decision tree visualization]
Bibliography

(Book) Data Mining: Concepts and Techniques, J. Han & M. Kamber, Morgan Kaufmann, 2001 (Sections 7.1 to 7.3 in the 2001 book; Sections 5.1 to 5.3 in the draft)

(Book) Machine Learning, T. Mitchell, McGraw Hill, 1997
Data Cube-Based Decision-Tree Induction

Integration of generalization with decision-tree induction (Kamber et al.'97)

Classification at primitive concept levels

E.g., precise temperature, humidity, outlook, etc.

Low-level concepts, scattered classes, bushy classification trees

Semantic interpretation problems

Cube-based multi-level classification

Relevance analysis at multiple levels

Information-gain analysis with dimension + level