Classification Techniques - The Institute of Finance


Data Mining
Instructor: Bajuna Salehe
Email: [email protected]
Web: http://www.ifm.ac.tz/staff/bajuna/courses
Classification and Prediction

Classification and prediction are two forms
of data analysis that can be used to
extract models describing important data
classes or to predict future data trends.
Such analysis can help provide us with a
better understanding of the data at large.
An example application




An emergency room in a hospital measures 17
variables (e.g., blood pressure, age) of
newly admitted patients.
A decision is needed: whether to put a new
patient in an intensive-care unit.
Due to the high cost of ICU, those patients who
may survive less than a month are given higher
priority.
Problem: to predict high-risk patients and
discriminate them from low-risk patients.
Another application

A credit card company receives thousands of
applications for new cards. Each application
contains information about an applicant:
– age
– marital status
– annual salary
– outstanding debts
– credit rating
– etc.

Problem: to decide whether an application
should be approved, or to classify applications into
two categories, approved and not approved.
Machine learning and our focus





Like human learning from past experiences.
A computer does not have “experiences”.
A computer system learns from data, which
represent some “past experiences” of an
application domain.
Our focus: learn a target function that can be
used to predict the values of a discrete class
attribute, e.g., approved or not approved, and
high-risk or low-risk.
The task is commonly called supervised
learning, classification, or inductive learning.
Classification and Prediction

Whereas classification predicts
categorical (discrete, unordered) labels,
prediction models continuous-valued
functions.

For example, we can build a classification
model to categorize bank loan applications
as either safe or risky, or a prediction
model to predict the expenditures in
dollars of potential customers on computer
equipment given their income and
occupation.
Classification
Classification is the process of finding a
model (or function) that describes and
distinguishes data classes or concepts, for
the purpose of being able to use the model
to predict the class of objects whose class
label is unknown.
The derived model is based on the
analysis of a set of training data (i.e., data
objects whose class label is known).

What is Classification

Classification is the task of assigning
objects to their respective categories.
Examples include classifying email messages
as spam or non-spam based upon the
message header and content, and classifying
galaxies based upon their respective shapes.


Classification can provide valuable support
for informed decision making in an
organisation.
For example, suppose a mobile phone
company would like to promote a new cellphone product to the public. Instead of mass
mailing the promotional catalog to everyone,
the company may be able to reduce the
campaign cost by targeting only a small
segment of the population.

The company may classify each person as a
potential buyer or non-buyer based on their
personal information such as income,
occupation, lifestyle, and credit ratings.
Discrete Data

Discrete Data – A set of data is said to be
discrete if the values / observations
belonging to it are distinct and separate,
i.e. they can be counted (1, 2, 3, ...).
Examples might include the number of
kittens in a litter; the number of patients in
a doctor's surgery; the number of flaws in
one metre of cloth; gender (male, female);
blood group (O, A, B, AB).

Any data measurements that are not
quantified on an infinitely divisible numeric
scale. Includes items like counts,
proportions, ratios, or percentages of a
characteristic (i.e. sex, loan forms,
department attendance, etc.) that have
measurements like pass or fail, leak or no
leak, small, medium, or large, go or no go
tests. (SixSigma.com Dictionary)
Continuous Data

Continuous/Variable Data – A set of data is
said to be continuous if the values /
observations belonging to it may take on
any value within a finite or infinite interval.
You can count, order and measure
continuous data. For example height,
weight, temperature, the amount of sugar
in an orange, the time required to run a
mile.

Variable data types have real numbers in
the measurement, like 2.34, 2.55, etc. (i.e.
data that can be measured on a
continuous scale).
Categorical Data

Categorical Data – A set of data is said to
be categorical if the values or observations
belonging to it can be sorted according to
category. Each value is chosen from a set
of non-overlapping categories. For
example, shoes in a cupboard can be
sorted according to colour: the
characteristic 'colour' can have non-overlapping categories 'black', 'brown',
'red' and 'other'. People have the
characteristic of 'gender' with categories
'male' and 'female'.
Nominal Data

Nominal Data – A set of data is said to be
nominal if the values / observations
belonging to it can be assigned a code in
the form of a number where the numbers
are simply labels. You can count but not
order or measure nominal data. For
example, in a data set males could be
coded as 0, females as 1; marital status of
an individual could be coded as Y if
married, N if single.
Ordinal Data

Ordinal Data - A set of data is said to be
ordinal if the values / observations
belonging to it can be ranked (put in order)
or have a rating scale attached. You can
count and order, but not measure, ordinal
data.

The categories for an ordinal set of data
have a natural order. For example,
suppose a group of people were asked to
taste varieties of biscuit and classify each
biscuit on a rating scale of 1 to 5,
representing strongly dislike, dislike,
neutral, like, strongly like. A rating of 5
indicates more enjoyment than a rating of
4, so such data are ordinal.
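As a rough illustration of the difference between nominal and ordinal data, the Python sketch below counts nominal codes and orders ordinal ratings. The sample values and variable names are made up for the example and do not come from any real data set.

```python
from collections import Counter
from statistics import median

# Nominal data: the numbers are only labels (0 = male, 1 = female).
# Hypothetical sample; counting is meaningful, ordering is not.
gender_codes = [0, 1, 1, 0, 1]
gender_counts = Counter(gender_codes)

# Ordinal data: ratings from 1 (strongly dislike) to 5 (strongly like).
# Hypothetical sample; counting and ordering are both meaningful.
biscuit_ratings = [4, 2, 5, 3, 4]
rating_counts = Counter(biscuit_ratings)
ranked = sorted(biscuit_ratings)
middle_rating = median(biscuit_ratings)

print(gender_counts[1])  # how many records are coded as female
print(ranked)            # ratings from lowest to highest
print(middle_rating)     # the median rating
```

Sorting or taking a median of the gender codes would run without error, but the result would carry no meaning, which is the practical sense in which nominal data can be counted but not ordered.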
Preliminaries
The input data for a classification task is
given in the form of a collection of records.
Each record, also known as an instance or
example, is characterised by a tuple (x, y),
where x is the attribute set and y is the
class label.
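One possible way to hold such records in Python is sketched below. The `Record` type, the attribute names and the two sample records are assumptions made for illustration; they are not copied from the course data set.

```python
from typing import NamedTuple


class Record(NamedTuple):
    x: dict  # attribute set: attribute name -> value
    y: str   # class label


# Hypothetical records for illustration only.
records = [
    Record(x={"body_temperature": "warm-blooded", "can_fly": True}, y="bird"),
    Record(x={"body_temperature": "cold-blooded", "can_fly": False}, y="reptile"),
]

# Each record unpacks into its attribute set and its class label.
for x, y in records:
    print(x, "->", y)

labels = [record.y for record in records]
```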

Table 1. Vertebrate Data Set

In the above slide, the table shows a
sample data set used for classifying
vertebrates into one of the following
categories: mammal, bird, fish, reptile, or
amphibian.

The attribute set includes properties of a
vertebrate such as its body temperature,
skin cover, method of reproduction, ability
to fly and ability to live in water.

The attribute set may contain discrete and
continuous features; however, in the table
above the attribute set contains mostly discrete
values.

The class label on the other hand, must be a
discrete attribute.

This is a key characteristic that distinguishes
classification from another predictive
modeling task known as regression, where y
is a continuous attribute.
What is Classification

Classification can be described as a task
of assigning objects to one of several
predefined categories.
[Figure: Input attribute set (x) -> Classification model -> Output class label (y)]
The diagram shows classification as the task of mapping an input
attribute set x into its class label y.
Simple Definition

Classification is the task of learning a
target function f that maps each attribute
set x into one of the pre-defined class
labels y.

The target function is also known
informally as a classification model.
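To make the idea of a target function concrete, here is a minimal sketch in Python. The function `f` below is hand-written rather than learned, and its rule, the attribute names and the two class labels are assumptions chosen only to show the shape of the mapping from x to y.

```python
# Pre-defined class labels (assumed for this example).
CLASS_LABELS = ("mammal", "non-mammal")


def f(x: dict) -> str:
    """Hand-written stand-in for a learned target function.

    The rule is a rough illustration, not a rule taken from the slides.
    """
    if x["body_temperature"] == "warm-blooded" and x["gives_birth"]:
        return "mammal"
    return "non-mammal"


# Attribute set of a hypothetical unknown record.
unknown = {"body_temperature": "cold-blooded", "gives_birth": False}
y = f(unknown)
print(y)
```

In practice the body of `f` would be produced by a learning algorithm from training records instead of being written by hand.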
Usefulness of Classification Model

A classification model is useful for the
following purposes:
– It may serve as an explanatory tool to
distinguish between objects of different
classes (Descriptive Modeling).
– It may also be used to predict the class label
of unknown records (Predictive Modeling).
Consider the table below:
A classification model can be treated as a
black box that automatically assigns a
class label when presented with the
attribute set of an unknown record.
For example, you may be given the
characteristics of a creature known as a gila
monster.


By building a classification model from the
data set shown in Table 1, you may use
the model to determine the class to which
the creature belongs.

Classification models are most suited for
predicting or describing data sets with
binary or nominal target attributes.
Classification & Prediction

Classification:
– Predicts categorical class labels
– Classifies data (constructs a model)
based on the training set and the values
(class labels) in a classifying attribute, and
uses the model to classify new data

Prediction:
– Models continuous-valued functions, i.e.,
predicts unknown or missing values

Typical Applications
– Credit approval
– Target marketing
– Medical diagnosis
– Treatment effectiveness analysis
Classification Techniques
Classification Technique
A classification technique is a systematic
approach for building classification models
from an input data set.
Examples of classification techniques
include:

– Decision Tree Classifiers
– Rule-Based Classifiers
– Neural Networks
– Support Vector Machines
– Naïve Bayes Classifiers
– Nearest-Neighbor Classifiers

Each technique employs a learning
algorithm to identify a model that best fits
the relationship between the attribute set
and class label of the input data (produces
outputs consistent with the class labels of
the input data).
A good classification model must predict
correctly the class labels of records it has
never seen before.
Building models with good generalization
capability, i.e., models that accurately
predict the class labels of previously
unseen records, is therefore a key
objective of the learning algorithm.
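As one small example of a learning algorithm, the sketch below implements a 1-nearest-neighbor classifier in Python and applies it to a record it has not seen before. The function name, the two numeric attributes and the training records are all hypothetical, and the slides do not prescribe this particular implementation.

```python
import math


def nearest_neighbor_predict(training_set, x_new):
    """Return the label of the training record closest to x_new (1-nearest neighbor)."""
    closest_x, closest_y = min(
        training_set,
        key=lambda record: math.dist(record[0], x_new),  # Euclidean distance
    )
    return closest_y


# Hypothetical training records: (annual income in thousands, age) -> class label.
training_set = [
    ((20.0, 25.0), "non-buyer"),
    ((22.0, 30.0), "non-buyer"),
    ((80.0, 45.0), "buyer"),
    ((95.0, 50.0), "buyer"),
]

# A record that does not appear in the training set.
unseen = (85.0, 40.0)
prediction = nearest_neighbor_predict(training_set, unseen)
print(prediction)
```

Whether such a prediction is correct can only be judged against records with known labels that were kept out of training, which is what generalization capability refers to.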

General Approach to Solve a Classification Problem

A general strategy for solving a classification
problem is as follows:
– First, the input data is divided into two disjoint
sets, known as the training set and the test set.
– The training set is used to build a
classification model.
– The induced model is then applied to the test
set to predict the class label of each test
record.
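The first step of this strategy can be sketched in Python as below. The helper name `split_train_test`, the 30% test fraction, the fixed seed and the toy records are assumptions for the example; the slides do not specify how the split should be made.

```python
import random


def split_train_test(records, test_fraction=0.3, seed=0):
    """Shuffle the records and divide them into disjoint training and test sets."""
    shuffled = list(records)               # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical records of the form (attribute set, class label).
records = [((i,), i % 2) for i in range(10)]

training_set, test_set = split_train_test(records)
print(len(training_set), len(test_set))
```

The model would then be built from `training_set` only, and `test_set` would be used only to check its predictions.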
Why are we dividing the data into two sets?

This strategy of dividing the data into
independent training and test sets allows
us to obtain an unbiased estimate of the
performance of a model on previously
unseen records.

The figure on the next slide depicts the
general approach to solving a
classification problem.
Performance Measurement of Model

Evaluation of the performance of a
classification model is based upon the
number of test records predicted correctly
and wrongly by the model.

The counts are tabulated in a table known
as a confusion matrix.

Table 2 depicts the confusion matrix for a
binary classification problem.
Each entry fij in this table denotes the
number of records from class i predicted to
be of class j.
For instance, f01 is the number of records
from class 0 wrongly predicted as class 1.
Based on the entries in the confusion
matrix, the total number of correct
predictions made by the model is (f11 +
f00) and the total number of wrong
predictions is (f10 + f01).
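The counting described above can be sketched in Python as follows, for a binary problem with classes 0 and 1. The function name and the two label lists are hypothetical and serve only to show how the entries are filled in.

```python
def confusion_matrix(actual, predicted):
    """Return f, where f[i][j] counts records of class i predicted as class j."""
    f = [[0, 0], [0, 0]]
    for i, j in zip(actual, predicted):
        f[i][j] += 1
    return f


# Hypothetical test-set labels and model predictions.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

f = confusion_matrix(actual, predicted)
correct = f[1][1] + f[0][0]  # f11 + f00
wrong = f[1][0] + f[0][1]    # f10 + f01
print(f, correct, wrong)
```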


Although a confusion matrix provides the
information needed to determine how
good a classification model is, it is useful
to summarize this information into a single
number.

This would make it more convenient to
compare the performance of different
classification models.
There are several performance metrics
available for doing this. One of the most
popular metrics is model accuracy, which
is defined as:
\[
\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} = \frac{f_{11} + f_{00}}{f_{11} + f_{10} + f_{01} + f_{00}}
\]
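A minimal Python sketch of this formula is given below. The function name and the four counts passed to it are made up for illustration.

```python
def accuracy(f11, f10, f01, f00):
    """Fraction of predictions that are correct, from the four confusion-matrix counts."""
    total = f11 + f10 + f01 + f00
    if total == 0:
        raise ValueError("the confusion matrix contains no predictions")
    return (f11 + f00) / total


# Hypothetical counts for a test set of 100 records.
acc = accuracy(f11=40, f10=10, f01=5, f00=45)
print(acc)
```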

Equivalently, the performance of a model
can be expressed in terms of its error rate
given by the following equation:
\[
\text{Error rate} = \frac{\text{Number of wrong predictions}}{\text{Total number of predictions}} = \frac{f_{10} + f_{01}}{f_{11} + f_{10} + f_{01} + f_{00}}
\]
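Because every prediction is either correct or wrong, accuracy and error rate always sum to one. The short derivation below makes this explicit; the counts in the worked line are hypothetical and chosen only to give round numbers.

```latex
\[
\text{Accuracy} + \text{Error rate}
  = \frac{(f_{11} + f_{00}) + (f_{10} + f_{01})}{f_{11} + f_{10} + f_{01} + f_{00}}
  = 1
\]
% Worked example with assumed counts f_{11} = 40, f_{10} = 10, f_{01} = 5, f_{00} = 45:
\[
\text{Error rate} = \frac{10 + 5}{40 + 10 + 5 + 45} = 0.15 = 1 - 0.85
\]
```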

Decision Trees