
DATA MINING: CLASSIFICATION
Classification: Definition
• Classification is a supervised learning task.
• It uses a training set that contains the correct answers (class label attributes).
• A model is created by running the algorithm on the training data.
• Test the model. If accuracy is low, regenerate the model after changing features or reconsidering samples.
• Identify a class label for incoming new data.
Applications:
• Classifying credit card transactions as legitimate or fraudulent.
• Classifying secondary structures of proteins as alpha-helix, beta-sheet, or random coil.
• Categorizing news stories as finance, weather, entertainment, sports, etc.
Classification: A Two-Step Process
• Model construction: describing a set of predetermined classes.
  » Each sample is assumed to belong to a predefined class, as determined by the class label attribute.
  » The set of samples used for model construction is the training set.
  » The model is represented as classification rules, decision trees, or mathematical formulae.
• Model usage: classifying future or unknown objects.
  » Estimate the accuracy of the model: the known label of each test sample is compared with the result produced by the model.
  » The accuracy rate is the percentage of test-set samples that are correctly classified by the model.
  » The test set is independent of the training set.
  » If the accuracy is acceptable, use the model to classify data samples whose class labels are not known.
Model Construction:
Training data is fed to a classification algorithm, which produces a classifier (model).

Training Data:
NAME | RANK           | YEARS | TENURED
Mike | Assistant Prof | 3     | no
Mary | Assistant Prof | 7     | yes
Bill | Professor      | 2     | yes
Jim  | Associate Prof | 7     | yes
Dave | Assistant Prof | 6     | no
Anne | Associate Prof | 3     | no

Classifier (Model):
IF rank = 'professor' OR years > 6
THEN tenured = 'yes'
Classification Process (2):
Use the Model in Prediction
The classifier is first run on testing data to estimate its accuracy, then applied to unseen data.

Testing Data:
NAME    | RANK           | YEARS | TENURED
Tom     | Assistant Prof | 2     | no
Merlisa | Associate Prof | 7     | yes
George  | Professor      | 5     | yes
Joseph  | Assistant Prof | 7     | yes

Unseen Data: (Jeff, Professor, 4)
Tenured? The model predicts 'yes', since rank = 'professor'.
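A minimal runnable sketch of the two-step process on this example, in plain Python. The IF/THEN rule is the classifier from the slide; everything else (function names, the accuracy print) is illustrative scaffolding, not part of the original deck.

    # Step 1 -- model construction: the rule induced from the training data.
    def model(rank, years):
        # Classifier from the slide: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
        return "yes" if rank == "Professor" or years > 6 else "no"

    # Step 2 -- model usage: estimate accuracy on the independent test set.
    testing = [
        ("Tom",     "Assistant Prof", 2, "no"),
        ("Merlisa", "Associate Prof", 7, "yes"),
        ("George",  "Professor",      5, "yes"),
        ("Joseph",  "Assistant Prof", 7, "yes"),
    ]
    correct = sum(model(rank, years) == tenured
                  for _, rank, years, tenured in testing)
    print(f"accuracy: {correct}/{len(testing)}")   # 4/4 on this test set

    # Accuracy is acceptable, so classify the unseen sample (Jeff, Professor, 4).
    print(model("Professor", 4))                   # -> 'yes'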
Classification Techniques:
• Decision Tree based Methods
• Rule-based Methods
• Neural Networks
• Bayesian Classification
• Support Vector Machines

Algorithm for Decision Tree Induction:
Basic algorithm:
• The tree is constructed in a top-down, recursive, divide-and-conquer manner.
• At the start, all the training examples are at the root.
• Attributes are categorical (if continuous-valued, they are discretized in advance).
• Examples are partitioned recursively based on selected attributes.
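A sketch of this basic algorithm in Python, assuming categorical attributes; the slide does not name an attribute-selection measure, so using information gain (ID3-style) is an assumption here.

    import math
    from collections import Counter

    def entropy(labels):
        # Shannon entropy of a list of class labels.
        total = len(labels)
        return -sum((n / total) * math.log2(n / total)
                    for n in Counter(labels).values())

    def best_attribute(rows, labels, attributes):
        # Choose the attribute with the highest information gain.
        base = entropy(labels)
        def gain(attr):
            remainder = 0.0
            for value in set(row[attr] for row in rows):
                subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
                remainder += len(subset) / len(labels) * entropy(subset)
            return base - remainder
        return max(attributes, key=gain)

    def build_tree(rows, labels, attributes):
        # Top-down recursive divide-and-conquer induction.
        if len(set(labels)) == 1:          # all examples in one class: leaf
            return labels[0]
        if not attributes:                 # no attributes left: majority-class leaf
            return Counter(labels).most_common(1)[0][0]
        attr = best_attribute(rows, labels, attributes)
        rest = [a for a in attributes if a != attr]
        tree = {attr: {}}
        for value in set(row[attr] for row in rows):
            idx = [i for i, row in enumerate(rows) if row[attr] == value]
            tree[attr][value] = build_tree([rows[i] for i in idx],
                                           [labels[i] for i in idx], rest)
        return tree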
Example of Decision Tree: Training Dataset ("buys_computer")

age   | income | student | credit_rating | buys_computer
<=30  | high   | no      | fair          | no
<=30  | high   | no      | excellent     | no
31…40 | high   | no      | fair          | yes
>40   | medium | no      | fair          | yes
>40   | low    | yes     | fair          | yes
>40   | low    | yes     | excellent     | no
31…40 | low    | yes     | excellent     | yes
<=30  | medium | no      | fair          | no
<=30  | low    | yes     | fair          | yes
>40   | medium | yes     | fair          | yes
<=30  | medium | yes     | excellent     | yes
31…40 | medium | no      | excellent     | yes
31…40 | high   | yes     | fair          | yes
>40   | medium | no      | excellent     | no
Output: A Decision Tree for "buys_computer"

age?
├─ <=30  → student?
│           ├─ no  → no
│           └─ yes → yes
├─ 31…40 → yes
└─ >40   → credit_rating?
            ├─ excellent → no
            └─ fair      → yes
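Running the induction sketch shown after the basic algorithm on this dataset should reproduce the tree on this slide, since age has the highest information gain at the root (the encoding below is illustrative):

    columns = ["age", "income", "student", "credit_rating"]
    raw = [
        ("<=30",  "high",   "no",  "fair",      "no"),
        ("<=30",  "high",   "no",  "excellent", "no"),
        ("31…40", "high",   "no",  "fair",      "yes"),
        (">40",   "medium", "no",  "fair",      "yes"),
        (">40",   "low",    "yes", "fair",      "yes"),
        (">40",   "low",    "yes", "excellent", "no"),
        ("31…40", "low",    "yes", "excellent", "yes"),
        ("<=30",  "medium", "no",  "fair",      "no"),
        ("<=30",  "low",    "yes", "fair",      "yes"),
        (">40",   "medium", "yes", "fair",      "yes"),
        ("<=30",  "medium", "yes", "excellent", "yes"),
        ("31…40", "medium", "no",  "excellent", "yes"),
        ("31…40", "high",   "yes", "fair",      "yes"),
        (">40",   "medium", "no",  "excellent", "no"),
    ]
    rows = [dict(zip(columns, r[:4])) for r in raw]
    labels = [r[4] for r in raw]
    print(build_tree(rows, labels, columns))
    # Expected structure (branch print order may vary):
    # {'age': {'<=30': {'student': {'no': 'no', 'yes': 'yes'}},
    #          '31…40': 'yes',
    #          '>40': {'credit_rating': {'fair': 'yes', 'excellent': 'no'}}}}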
Advantages of Decision Tree Based Classification:
• Inexpensive to construct.
• Extremely fast at classifying unknown records.
• Easy to interpret for small-sized trees.
• Accuracy is comparable to other classification techniques for many simple data sets.
Enhancements to Basic Decision Tree Induction:
• Allow for continuous-valued attributes (see the sketch after this list)
  » Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals.
• Handle missing attribute values
  » Assign the most common value of the attribute.
  » Assign a probability to each of the possible values.
• Attribute construction
  » Create new attributes based on existing ones that are sparsely represented.
  » This reduces fragmentation, repetition, and replication.
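A minimal Python sketch of the first two enhancements; the interval cut points come from the deck's own age bins, while the sample values are hypothetical.

    from collections import Counter

    # 1. Discretize a continuous attribute into intervals (bins match the slides).
    def discretize_age(age):
        if age <= 30:
            return "<=30"
        if age <= 40:
            return "31…40"
        return ">40"

    # 2. Handle missing values: assign the most common value of the attribute.
    ages = ["<=30", ">40", "<=30", None, "31…40"]   # None marks a missing value
    most_common = Counter(a for a in ages if a is not None).most_common(1)[0][0]
    filled = [a if a is not None else most_common for a in ages]   # None -> "<=30"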
Potential Problem:
• Overfitting: this is when the generated model does not generalize to new incoming data, because of:
  » too little training data, not covering many cases, or
  » wrong assumptions.
• Overfitting results in decision trees that are more complex than necessary.
• Training error no longer provides a good estimate of how well the tree will perform on previously unseen records.
• New ways of estimating errors are needed.
How to Avoid Overfitting:
Two ways to avoid overfitting are:
• Pre-pruning
• Post-pruning

Pre-pruning:
• Stop the algorithm before the tree is fully grown.
• Stop if all instances belong to the same class.
• Stop if the number of instances is less than some user-specified threshold.

Post-pruning:
• Grow the decision tree in its entirety.
• Trim the nodes of the decision tree in a bottom-up fashion.
• If the generalization error improves after trimming, replace the sub-tree with a leaf node.
• The class label of the leaf node is determined from the majority class of instances in the sub-tree.
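For illustration, both ideas map onto scikit-learn's decision-tree parameters (assuming scikit-learn is available; the threshold values are hypothetical, and ccp_alpha performs minimal cost-complexity pruning, a different post-pruning criterion than the generalization-error test described above):

    from sklearn.tree import DecisionTreeClassifier

    # Pre-pruning: stop growing early via user-specified thresholds.
    pre_pruned = DecisionTreeClassifier(
        max_depth=4,           # never grow beyond depth 4
        min_samples_split=10,  # stop if a node holds fewer than 10 instances
    )

    # Post-pruning: grow the tree fully, then trim it back
    # (cost-complexity pruning; larger alpha trims more).
    post_pruned = DecisionTreeClassifier(ccp_alpha=0.01)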
Bayesian Classification Algorithm:
• Let X be a data sample whose class label is unknown.
• Let H be the hypothesis that X belongs to class C.
• For classification problems, determine P(H|X): the probability that the hypothesis holds given the observed data sample X.
• P(H): the prior probability of hypothesis H (i.e., the initial probability before we observe any data; it reflects background knowledge).
• P(X): the probability that the sample data is observed.
• P(X|H): the probability of observing the sample X, given that the hypothesis holds.
• By Bayes' theorem, these combine as P(H|X) = P(X|H) P(H) / P(X).
Training Dataset for Bayesian Classification:

Classes:
C1: buys_computer = 'yes'
C2: buys_computer = 'no'

The training data is the same 14-sample "buys_computer" table shown earlier
(attributes: age, income, student, credit_rating).

Data sample to classify:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
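A minimal sketch of classifying X with naïve Bayes in plain Python, reusing the rows and labels lists built from the table earlier; the class-conditional independence assumption is what makes the per-attribute product valid.

    X = {"age": "<=30", "income": "medium",
         "student": "yes", "credit_rating": "fair"}

    def score(cls):
        # P(C) * product of P(x_i | C), using counts from the training table.
        class_rows = [row for row, lab in zip(rows, labels) if lab == cls]
        prior = len(class_rows) / len(rows)
        likelihood = 1.0
        for attr, value in X.items():
            likelihood *= sum(r[attr] == value for r in class_rows) / len(class_rows)
        return prior * likelihood

    print(score("yes"))   # 9/14 * 2/9 * 4/9 * 6/9 * 6/9 ≈ 0.028
    print(score("no"))    # 5/14 * 3/5 * 2/5 * 1/5 * 2/5 ≈ 0.007
    # 0.028 > 0.007, so the classifier predicts buys_computer = 'yes' for X.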
Advantages & Disadvantages of Bayesian Classification:
Advantages:
• Easy to implement.
• Good results are obtained in most cases.
Disadvantages:
• The class-conditional independence assumption causes a loss of accuracy.
• In practice, dependencies exist among variables. E.g., in hospitals, a patient's profile (age, family history, etc.), symptoms (fever, cough, etc.), and diseases (lung cancer, diabetes, etc.) are interrelated.
• Dependencies among these cannot be modeled by a naïve Bayesian classifier.
Conclusion:
• Training data is an important factor in building a model with supervised algorithms.
• The classification results generated by the different algorithms (Naïve Bayes, Decision Tree, Neural Networks, …) are not considerably different from each other.
• Different classification algorithms can take different amounts of time to train and build models.
• Automated (mechanical) classification is faster than manual classification.
Thank you !!!