Week 10 – Part 1 (transcript)


Classification and Prediction
Monash University
Semester 1, March 2006
Patterns
• Data mining aims to reveal knowledge about the data
under consideration
• Knowledge is viewed as Patterns within the data
• Patterns are also referred to as structures, models
and relationships
• The learning approach is inherently linked to the patterns
• Which approach for which data? – the eternal question!
Taxonomy of Learning Approaches
• Verification Driven
– Query and reporting, Statistical Analysis
• Discovery Driven
– Supervised Learning
> Predictive
> Typical Learning Techniques: Classification and Regression
– Unsupervised Learning
> Informative
> Typical Learning Techniques: Clustering, Association
Analysis, Deviation/Outlier Detection
Problems and Uses of Verification Driven Techniques
• Statistics
– Assumption – you have a well-defined model
> In reality – you often don’t
– Focus on model estimation rather than model selection
– Don’t cope well with large amounts of data
> Time and Effort required is very high
– Highlight linear relationships
– Non-linear statistics requires knowledge of the non-linearity
> The ways in which variables interact
> In real multi-dimensional data -> not often known or
specifiable
– Important lessons from Statistics
> Estimation of uncertainty and checking of assumptions are
absolutely essential
Discovery Driven Data Mining
• Predictive/Supervised Learning
– Build patterns by making a prediction of some
unknown attribute given the values of other
known attributes
– Principal Techniques: Classification and
Regression
– Supervision: Training data (a set of
observations/measurements) is accompanied by
labels indicating the class of the observations
– New data is classified based on the training data.
Discovery Driven Data Mining
• Informative/Unsupervised Learning
– Do not present a solution to a known problem
– They present “interesting” patterns for
consideration by a domain expert
– Principal Techniques: Clustering and
Association Analysis
– Labels of training data unknown
– Given a set of observations/measurements
(data), need to establish existence of classes or
clusters in the data.
Classification
• Difficult to define
• Usually based on a set of classes that are
pre-defined and a set of training samples that
are used for building class models
• Each new object is then assigned to one of
the classes already defined in the class
model
• Typical Apps: Credit Approval, Target
Marketing, Medical Diagnosis…
Classification vs. Regression
• Classification and Regression are two major
types of prediction techniques
• Classification
– Predict nominal or discrete values
• Regression
– Predict continuous or ordered values
Classification – A two stage process
• Model Construction
– Each tuple/sample is assumed to belong to a predefined class, as determined by the class label
attribute
– Set of tuples used for model construction:
Training Set
– The model can be represented as a set of IF-THEN
rules, decision trees or mathematical formulae
Classification Process (1): Model Construction
Training Data:

NAME      RANK             YEARS   TENURED
Mike      Assistant Prof   3       no
Mary      Assistant Prof   7       yes
Bill      Professor        2       yes
Jim       Associate Prof   7       yes
Dave      Assistant Prof   6       no
Anne      Associate Prof   3       no

The training data is fed to the Classification Algorithms, which produce the Classifier (Model):

IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
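As a minimal sketch (illustrative, not from the original slides), the learned model can be written as a plain Python function and checked against the six training tuples:

```python
# The training table from the slide and the rule the classifier learned.
training_data = [
    # (name, rank, years, tenured)
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor",      2, "yes"),
    ("Jim",  "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]

def predict_tenured(rank, years):
    """The model: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# The rule reproduces every training label.
assert all(predict_tenured(rank, years) == tenured
           for _, rank, years, tenured in training_data)
```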
Classification – A two stage process
• Model Usage
– For classifying future or unknown objects
– Need to estimate accuracy of the model
– The known label of the test sample is compared
with the classified result from the model
– Accuracy rate is the percentage of test set
samples that are correctly classified by the model
– Test set is independent of the training set
Classification Process (2): Use the Model in Prediction
Testing Data:

NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen Data: (Jeff, Professor, 4) – Tenured?
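Continuing the same illustrative sketch, applying the learned rule to the test tuples measures the accuracy rate described on the previous slide, and then classifies the unseen tuple:

```python
# Test set from the slide; the rule is repeated so the sketch is self-contained.
test_data = [
    ("Tom",     "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George",  "Professor",      5, "yes"),
    ("Joseph",  "Assistant Prof", 7, "yes"),
]

def predict_tenured(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

correct = sum(predict_tenured(rank, years) == tenured
              for _, rank, years, tenured in test_data)
print(f"Accuracy: {correct / len(test_data):.0%}")  # 75% – Merlisa is misclassified

# Classifying the unseen tuple (Jeff, Professor, 4):
print(predict_tenured("Professor", 4))              # yes
```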
Evaluating Classification Methods
• Predictive accuracy
• Speed and scalability
– time to construct the model
– time to use the model
• Robustness
– handling noise and missing values
• Scalability
– efficiency in disk-resident databases
• Interpretability
– understanding and insight provided by the model
• Goodness of rules
– decision tree size
– compactness of classification rules
Decision Trees
• Assumption:
– Data consists of records that have a number of
independent attributes and a dependent attribute
– A flow-chart-like tree structure
– A structure to support a game of 20 questions:
> Each internal node represents a test on an
attribute
> Each branch represents the outcome of the test
> Each leaf represents a class
Classification by Decision Tree Induction
Decision tree generation consists of two phases
– Tree construction
> At start, all the training examples are at the root
> Partition examples recursively based on selected
attributes
– Tree pruning
> Identify and remove branches that reflect noise or
outliers
• Use of decision tree: Classifying an unknown sample
– Test the attribute values of the sample against the decision
tree
Training Dataset

This follows an example from Quinlan’s ID3:

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes
>40      medium   no        excellent       no
Output: A Decision Tree for “buys_computer”

age?
  <=30  -> student?
            no  -> no
            yes -> yes
  31…40 -> yes
  >40   -> credit_rating?
            excellent -> no
            fair      -> yes
Decision Trees
• Classify objects based on the values of their
independent attributes
• Classification is based on a tree structure where
each node of the tree is a rule that involves a
multi-way decision
• To classify an object, its attributes are compared
with the tests at each node, starting from the root. The
path followed leads to a leaf node, which gives the class
to which the object belongs.
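A minimal sketch of this tree walk, with the earlier buys_computer tree encoded as nested dicts (the encoding is an illustrative choice, not from the slides):

```python
# Each internal node maps one attribute to its branches; a string is a leaf.
tree = {"age": {
    "<=30":    {"student": {"no": "no", "yes": "yes"}},
    "31...40": "yes",
    ">40":     {"credit_rating": {"excellent": "no", "fair": "yes"}},
}}

def classify(node, record):
    """Follow the path of attribute tests from the root down to a leaf."""
    while isinstance(node, dict):
        attribute = next(iter(node))               # the test at this node
        node = node[attribute][record[attribute]]  # take the matching branch
    return node                                    # a leaf is the class label

record = {"age": "<=30", "student": "yes", "credit_rating": "fair"}
print(classify(tree, record))  # yes
```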
Decision Trees
• Attractive due to ease of understanding
• Rules can be expressed in natural language
• Aim: Build a DT consisting of a root node, a
number of internal nodes and a set of pre-defined
leaf nodes.
• Splitting continues from the root node until the
process is complete.
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-conquer manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are discretized in
advance)
– Examples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning – majority
voting is employed for classifying the leaf
– There are no samples left
Decision Trees
• The quality of the tree depends on the
quality of the training data
• A tree may be 100% accurate on the training data,
but the training data is only a sample… it cannot
be accurate for all data.
Finding the Split
• A variable has to be chosen to split the data
• This variable has to be such that it does the best job
of discriminating among target classes.
• Finding which variable is most influential is non-trivial.
• One approach: measure the diversity of the data and
choose the variable that minimises the diversity
among the child nodes.
• Several techniques have been proposed –
Information Gain/Entropy, Gini, Chi-Squared, Rough
Sets, etc.; two of these are sketched below.
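For illustration (the class counts are invented examples), two widely used diversity measures, entropy and the Gini index, each zero for a pure node and largest for an evenly mixed one:

```python
from math import log2

def entropy(class_counts):
    """Diversity as entropy: 0 for a pure node, highest when evenly mixed."""
    total = sum(class_counts)
    return sum(-c / total * log2(c / total) for c in class_counts if c > 0)

def gini(class_counts):
    """Diversity as the Gini index: 1 minus the sum of squared class proportions."""
    total = sum(class_counts)
    return 1 - sum((c / total) ** 2 for c in class_counts)

print(entropy([4, 0]), gini([4, 0]))  # 0.0 0.0  (a pure node has no diversity)
print(entropy([2, 2]), gini([2, 2]))  # 1.0 0.5  (an evenly mixed node has the most)
```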
Attribute Selection Measure
• Information gain (ID3/C4.5)
– All attributes are assumed to be categorical
– Can be modified for continuous-valued attributes
• Gini index (IBM IntelligentMiner)
– All attributes are assumed continuous-valued
– Assume there exist several possible split values for each
attribute
– May need other tools, such as clustering, to get the possible
split values
– Can be modified for categorical attributes
Information Theory – 101
• Suppose there is a variable s that can take either value
a or value b.
• If s is always going to be a, then there is no uncertainty
and no information
• The most common measure of the amount of
information (also known as entropy) is:
– $I = -\sum_i p_i \log(p_i)$
• Let p(a) = 0.9 and p(b) = 0.1
• Let p(a) = 0.5 and p(b) = 0.5
• When is I maximised?
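A quick check of the two cases above, assuming the base-2 logarithm used elsewhere in these slides:

```python
from math import log2

def info(probs):
    """I = -sum(p_i * log2(p_i)): the entropy, in bits."""
    return sum(-p * log2(p) for p in probs if p > 0)

print(info([0.9, 0.1]))  # ~0.469 bits: mostly predictable, little information
print(info([0.5, 0.5]))  # 1.0 bit: I is maximised when a and b are equally likely
```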
Information Gain (ID3/C4.5)
• Select the attribute with the highest information gain
• Assume there are two classes, P and N
– Let the set of examples S contain p elements of class P and
n elements of class N
– The amount of information needed to decide if an arbitrary
example in S belongs to P or N is defined as

$$I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$
Information Gain in Decision Tree Induction
• Assume that using attribute A a set S will be
partitioned into sets {S1, S2 , …, Sv}
– If Si contains pi examples of P and ni examples of N, the
entropy, or the expected information needed to classify
objects in all subtrees Si is
$$E(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n}\, I(p_i, n_i)$$
• The encoding information that would be gained by
branching on A
$$Gain(A) = I(p, n) - E(A)$$
Attribute Selection by Information Gain Computation

• Class P: buys_computer = “yes”
• Class N: buys_computer = “no”
• I(p, n) = I(9, 5) = 0.940
• Compute the entropy for age:

age      p_i   n_i   I(p_i, n_i)
<=30     2     3     0.971
31…40    4     0     0
>40      3     2     0.971

$$E(age) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.69$$

Hence

$$Gain(age) = I(p, n) - E(age)$$

Similarly,

Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
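A short Python sketch (illustrative; the function names are just labels for the formulas above) that reproduces the worked numbers for age:

```python
from math import log2

def I(p, n):
    """I(p, n): information needed to decide P vs N for a set with p P's and n N's."""
    total = p + n
    return sum(-c / total * log2(c / total) for c in (p, n) if c > 0)

def E(partitions, p, n):
    """E(A): expected information after splitting on A into subsets (p_i, n_i)."""
    return sum((pi + ni) / (p + n) * I(pi, ni) for pi, ni in partitions)

p, n = 9, 5                        # buys_computer: 9 "yes", 5 "no"
age = [(2, 3), (4, 0), (3, 2)]     # subsets for <=30, 31...40, >40
print(I(p, n))                     # ~0.940
print(E(age, p, n))                # ~0.694 (the slide rounds this to 0.69)
print(I(p, n) - E(age, p, n))      # Gain(age) ~0.247
```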
Extracting Classification Rules from Trees
• Represent the knowledge in the form of IF-THEN
rules
• One rule is created for each path from the root
to a leaf
• Each attribute-value pair along a path forms a
conjunction
• The leaf node holds the class prediction
• Rules are easier for humans to understand
Extracting Classification Rules from Trees
• Example
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN
buys_computer = “no”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer =
“yes”
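As an illustrative sketch, the same rule extraction can be automated by walking every root-to-leaf path of the tree (reusing the nested-dict encoding assumed earlier):

```python
# The buys_computer tree from the earlier slide.
tree = {"age": {
    "<=30":    {"student": {"no": "no", "yes": "yes"}},
    "31...40": "yes",
    ">40":     {"credit_rating": {"excellent": "no", "fair": "yes"}},
}}

def extract_rules(node, conditions=()):
    if not isinstance(node, dict):                 # leaf: emit the finished rule
        lhs = " AND ".join(f'{a} = "{v}"' for a, v in conditions)
        print(f'IF {lhs} THEN buys_computer = "{node}"')
        return
    attribute = next(iter(node))                   # internal node: one attribute test
    for value, child in node[attribute].items():   # one conjunct per branch taken
        extract_rules(child, conditions + ((attribute, value),))

extract_rules(tree)  # prints the five rules listed above
```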
Avoid Overfitting in Classification
• The generated tree may overfit the training
data
– Too many branches, some may reflect
anomalies due to noise or outliers
– The result is poor accuracy for unseen samples
Avoid Overfitting in Classification
• Two approaches to avoid overfitting
– Prepruning: Halt tree construction early—do not
split a node if this would result in the goodness
measure falling below a threshold
> Difficult to choose an appropriate threshold
– Postpruning: Remove branches from a “fully
grown” tree—get a sequence of progressively
pruned trees
> Use a set of data different from the training
data to decide which is the “best pruned
tree”
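One way to sketch both ideas with scikit-learn, assuming that library is available (the dataset and threshold values are invented for the example): min_impurity_decrease halts splits early (prepruning), while cost-complexity pruning produces a sequence of progressively pruned trees from which a held-out set picks the “best pruned tree” (postpruning).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# Prepruning: refuse any split whose goodness (impurity reduction)
# falls below a threshold.
pre = DecisionTreeClassifier(min_impurity_decrease=0.01).fit(X_train, y_train)

# Postpruning: grow the full tree, then choose among the progressively
# pruned trees using data different from the training data.
full = DecisionTreeClassifier(random_state=0)
alphas = full.cost_complexity_pruning_path(X_train, y_train).ccp_alphas
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in alphas),
    key=lambda t: t.score(X_valid, y_valid),
)
print(pre.score(X_valid, y_valid), best.score(X_valid, y_valid))
```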
Approaches to Determine the Final Tree Size
• Separate training (2/3) and testing (1/3) sets
• Use cross validation, e.g., 10-fold cross validation
• Use all the data for training
– but apply a statistical test (e.g., chi-square) to
estimate whether expanding or pruning a node may
improve the entire distribution
• Use minimum description length (MDL) principle:
– halting growth of the tree when the encoding is
minimized
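The first two approaches might be sketched as follows with scikit-learn (again assumed available; the data is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Separate training (2/3) and testing (1/3) sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("holdout accuracy:", tree.score(X_test, y_test))

# 10-fold cross validation over all the data.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("10-fold CV accuracy:", scores.mean())
```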
Decision Trees: Strengths and Weaknesses
• Strengths
– Ability to generate understandable rules
– Classify with minimal computational overhead
• Issues
– Need to find the right split variable
– Visualising Trees is tedious – particularly as data
dimensionality increases
– Handling both continuous and categorical data
Enhancements to Basic Decision Tree Induction
• Allow for continuous-valued attributes
– Dynamically define new discrete-valued attributes that
partition the continuous attribute value into a discrete set
of intervals
• Handle missing attribute values
– Assign the most common value of the attribute
– Assign probability to each of the possible values
• Attribute construction
– Create new attributes based on existing ones that are
sparsely represented
– This reduces fragmentation, repetition, and replication
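A small sketch of the first two enhancements (the ages and helper names are invented for illustration): discretising a continuous age into the intervals used earlier, and assigning the most common value where a value is missing.

```python
from collections import Counter

ages = [25, 38, 45, 29, 52, None, 33, 28]  # None marks a missing value

def discretise(age):
    """Map a continuous age onto the discrete intervals used earlier."""
    if age <= 30:
        return "<=30"
    return "31...40" if age <= 40 else ">40"

# Missing value -> assign the most common (discretised) value of the attribute.
known = [discretise(a) for a in ages if a is not None]
mode = Counter(known).most_common(1)[0][0]
print([discretise(a) if a is not None else mode for a in ages])
```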
Classification in Large Databases
• Classification—a classical problem extensively studied by
statisticians and machine learning researchers
• Scalability: Classifying data sets with millions of examples and
hundreds of attributes with reasonable speed
• Why decision tree induction in data mining?
– relatively faster learning speed (than other classification
methods)
– convertible to simple and easy to understand classification
rules
– can use SQL queries for accessing databases
– comparable classification accuracy with other methods
Other Classification Methods
• Bayesian approaches
• Neural Networks
• k-nearest neighbor classifier
• Case-based reasoning
• Genetic algorithm
• Rough set approach
• Fuzzy set approaches