Data mining - Department of Computer Science and Engineering


Tree-Based Data Mining Techniques
Haiqin Yang
1
Why Data Mining?
- Necessity is the Mother of Invention
- The amount of data keeps increasing
- Need to convert such data into useful knowledge
- The gap between data and analysts is increasing
- Hidden information is not always evident
Applications
- Business management, production control, bioinformatics,
  market analysis, engineering design, science exploration...
2
What is Data Mining?
- Extraction of interesting (non-trivial, implicit, previously
  unknown and potentially useful) patterns or knowledge from huge
  amounts of data
Alternative names
- Knowledge discovery (mining) in databases (KDD), knowledge
  extraction, data/pattern analysis, data archeology, data dredging,
  information harvesting, business intelligence, etc.
3
KDD Process
(Process diagram) Database -> Selection -> Transformation/Preparation
-> Training/Mining -> Model, Patterns -> Evaluation/Verification
4
Main Tasks in KDD
- Classification
- Clustering
- Regression
- Association rule learning
5
Top 10 Algorithms in KDD
- Identified at ICDM'06
- Based on a series of criteria
- Selected from 18 nominations and votes from various researchers
- Covering 10 topics:
  association analysis, classification, clustering, statistical
  learning, bagging and boosting, sequential patterns, integrated
  mining, rough sets, link mining, and graph mining
6
Top 10 Algorithms in KDD
- C4.5
- k-means
- Support vector machines
- The Apriori algorithm (association rule analysis)
- The EM algorithm
- PageRank
- AdaBoost
- kNN: k-nearest neighbor classification
- Naive Bayes
- CART: Classification and Regression Tree
7
Good Resources
- Weka - Data Mining Software in Java
  http://www.cs.waikato.ac.nz/ml/weka/
- Spider - Machine Learning in Matlab
  http://people.kyb.tuebingen.mpg.de/spider/
- Scikits.learn - Machine Learning in Python
  http://scikit-learn.sourceforge.net/
- PyML - Machine Learning in Python
  http://pyml.sourceforge.net/
- KDnuggets
  http://www.kdnuggets.com/
8
Tree-Based Data Mining Techniques
- Decision Tree (C4.5)
- CART
- AdaBoost (ensemble method)
- Random Forest (ensemble method)
- ...
9
An Example
- Data: loan application data
- Task: predict whether a loan should be approved or not
- Performance measure: accuracy
10
An Example
- Data: loan application data
- Task: predict whether a loan should be approved or not
- Performance measure: accuracy
- No learning: classify all future applications (test data) to the
  majority class (i.e., Yes): Accuracy = 9/15 = 60%
- Can we do better than 60% with learning?
11
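
A quick check of this "no learning" baseline, as a minimal Python sketch; the label list below (9 Yes, 6 No) simply mirrors the counts quoted on the slide.

```python
from collections import Counter

# Hypothetical labels matching the slide: 9 Yes and 6 No out of 15 applications.
labels = ["Yes"] * 9 + ["No"] * 6

# "No learning" baseline: always predict the majority class.
majority_class, majority_count = Counter(labels).most_common(1)[0]
baseline_accuracy = majority_count / len(labels)

print(majority_class)              # Yes
print(f"{baseline_accuracy:.0%}")  # 60%
```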
Decision tree
- Decision tree learning is one of the most widely used techniques
  for classification
  - Its classification accuracy is competitive with other methods,
    and it is very efficient
- The classification model is a tree, called a decision tree
- C4.5 by Ross Quinlan is perhaps the best-known system; it is open
  source
12
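
To make this concrete, here is a minimal sketch (not from the lecture) of fitting a decision tree with scikit-learn, one of the toolkits listed under Good Resources; the loan features used below are hypothetical stand-ins for the loan application data.

```python
# Fit a small decision tree with scikit-learn and print its structure.
# The features (age, has_job, own_house, credit) are hypothetical,
# integer-encoded stand-ins for the loan application attributes.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [
    [0, 0, 0, 0], [0, 0, 0, 1], [0, 1, 0, 1],
    [1, 0, 0, 0], [1, 0, 1, 2], [2, 1, 1, 2],
]
y = ["No", "No", "Yes", "No", "Yes", "Yes"]  # loan approved or not

clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(X, y)

print(export_text(clf, feature_names=["age", "has_job", "own_house", "credit"]))
print(clf.predict([[1, 1, 0, 1]]))  # classify a new application
```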
A decision tree from the loan data
- Decision nodes and leaf nodes (classes)
13
Prediction
(Figure: tracing a test application down the tree; predicted class: No)
14
Is the decision tree unique?
- No. Here is a simpler tree.
- Objective: a smaller and accurate tree
  - Easy to understand and performs better
- Finding the best tree is NP-hard
- All current tree-building algorithms are heuristic
15
Algorithm for decision tree learning
- Basic algorithm (a greedy divide-and-conquer algorithm); a code
  sketch follows this slide
  - Assume attributes are categorical (continuous attributes can be
    handled too)
  - The tree is constructed in a top-down recursive manner
  - At the start, all the training examples are at the root
  - Examples are partitioned recursively based on selected attributes
  - Attributes are selected on the basis of an impurity function
    (e.g., information gain)
- Conditions for stopping partitioning
  - All examples for a given node belong to the same class
  - There are no remaining attributes for further partitioning -
    the majority class becomes the leaf
  - There are no examples left
16
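
The greedy divide-and-conquer procedure above can be sketched as follows; this is illustrative Python (not the lecture's code), using information gain as the impurity function. All names and the toy data are made up.

```python
# Greedy top-down tree building: categorical attributes, information gain
# as the impurity measure, and the stopping conditions from the slide.
# (The third stop, "no examples left", only arises if attribute values not
# present in the current subset are enumerated.)
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(rows, labels, attr):
    total_h = entropy(labels)
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return total_h - remainder

def build_tree(rows, labels, attrs):
    if len(set(labels)) == 1:                      # all examples in one class
        return labels[0]
    if not attrs:                                  # no attributes left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    node = {"attr": best, "children": {}}
    for value in set(row[best] for row in rows):   # partition recursively
        sub_rows = [r for r in rows if r[best] == value]
        sub_labels = [lab for r, lab in zip(rows, labels) if r[best] == value]
        node["children"][value] = build_tree(sub_rows, sub_labels,
                                             [a for a in attrs if a != best])
    return node

# Tiny hypothetical loan-style data set.
rows = [
    {"has_job": "yes", "own_house": "no"},
    {"has_job": "no", "own_house": "no"},
    {"has_job": "no", "own_house": "yes"},
    {"has_job": "yes", "own_house": "yes"},
]
labels = ["Yes", "No", "Yes", "Yes"]
print(build_tree(rows, labels, ["has_job", "own_house"]))
```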
Choose an attribute to partition data
- The key to building a decision tree: which attribute to choose in
  order to branch
- The objective is to reduce impurity or uncertainty in the data as
  much as possible
  - A subset of data is pure if all instances belong to the same class
- The heuristic in C4.5 is to choose the attribute with the maximum
  Information Gain or Gain Ratio, based on information theory
17
CART - Classification and Regression Tree
- By L. Breiman, J. Friedman, R. Olshen, and C. Stone (1984)
- A non-parametric technique using tree building
- Binary splits
- Splits based on only one variable
18
CART - Loan Example (again)
19
Three Steps in CART
1. Tree building
2. Pruning
3. Optimal tree selection

- If the target (response) attribute is categorical, a classification
  tree is used
- If it is continuous, a regression tree is used
20
Steps of Tree Building
1. For each non-terminal node:
   1. For each variable:
      1. At all of its split points, split the samples into two
         binary nodes
      2. Select the best split of the variable in terms of the
         reduction in impurity (Gini index); see the sketch after
         this slide
   2. Rank all of the best splits and select the variable that
      achieves the highest purity at the root
   3. Assign classes to the nodes according to a rule that minimizes
      misclassification costs
   4. Grow a very large tree Tmax until all terminal nodes are either
      small or pure or contain identical measurement vectors
2. Prune and choose the final tree using cross validation
21
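
The per-variable split search in step 1 can be sketched as follows; this is illustrative Python (not the original CART implementation) for a single numeric variable, using the reduction in Gini impurity to score each candidate binary split. All names and data are made up.

```python
# Try every split point of one numeric variable and keep the binary split
# with the largest reduction in Gini impurity.
from collections import Counter

def gini(labels):
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    parent = gini(labels)
    n = len(labels)
    best = None  # (impurity_reduction, threshold)
    for threshold in sorted(set(values))[:-1]:
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
        reduction = parent - weighted
        if best is None or reduction > best[0]:
            best = (reduction, threshold)
    return best

# Hypothetical data: a single variable (e.g., applicant income) and labels.
income = [20, 35, 38, 50, 62, 80]
approved = ["No", "No", "Yes", "Yes", "Yes", "Yes"]
print(best_split(income, approved))  # -> (reduction, threshold); best split at 35
```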
Advantages of CART
- Can cope with any data structure or type
- Classification has a simple form
- Uses conditional information effectively
- Invariant under transformations of the variables
- Robust with respect to outliers
- Gives an estimate of the misclassification rate
22
Disadvantages of CART
- CART does not use combinations of variables
- The tree can be deceptive - a variable may not appear in the tree
  because it is "masked" by another
- Tree structures may be unstable - a change in the sample may give
  a different tree
- The tree is optimal at each split - it may not be globally optimal
23
AdaBoost
- Definition of boosting:
  Boosting refers to the general problem of producing a very accurate
  prediction rule by combining rough and moderately inaccurate
  rules-of-thumb.
- History
  - 1990 - Boost by majority (Freund)
  - 1995 - AdaBoost (Freund & Schapire)
  - 1997 - Generalized AdaBoost (Schapire & Singer)
  - 2001 - AdaBoost in face detection (Viola & Jones)
- Properties
  - A linear classifier with all its desirable properties
  - Output converges to the logarithm of the likelihood ratio
  - Good generalization properties
  - A feature selector that minimizes an upper bound on the
    empirical error
  - Close to sequential decision making (simple to complex
    classifiers)
24
Boosting Algorithm
- The intuitive idea
  - Alter the distribution over the domain in a way that increases
    the probability of the "harder" parts of the space, thus forcing
    the weak learner to generate new classifiers that make fewer
    mistakes on these parts.
- AdaBoost algorithm
  - Input variables / formal setting
    - D: the distribution over all the training samples
    - Weak Learner: a weak learning algorithm to be boosted
    - T: the specified number of iterations
    - h: a weak classifier h : X -> {+1, -1}, H = {h(x)}
25
Two-class AdaBoost
- Initialize the weights with a uniform distribution
- As the weighted error εt increases, the classifier weight αt
  decreases
- Increase (decrease) the distribution D of wrongly (correctly)
  classified samples
- The new distribution is used to train the next classifier
- The process is iterated
26
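
A minimal sketch of the two-class AdaBoost loop summarized above, assuming decision stumps as the weak learner; the classifier weight alpha_t = 0.5*ln((1-eps_t)/eps_t) and the reweighting rule follow the standard algorithm, and all names and the toy data are illustrative rather than the lecture's code.

```python
# Two-class AdaBoost with decision stumps: uniform initial weights,
# alpha_t = 0.5 * ln((1 - eps_t) / eps_t), and a reweighting step that
# increases the weight of misclassified samples.
import math

def stump_predict(stump, x):
    feature, threshold, sign = stump
    return sign if x[feature] <= threshold else -sign

def best_stump(X, y, w):
    # Exhaustively pick the stump with the smallest weighted error.
    best, best_err = None, float("inf")
    for feature in range(len(X[0])):
        for threshold in sorted(set(row[feature] for row in X)):
            for sign in (+1, -1):
                stump = (feature, threshold, sign)
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump_predict(stump, xi) != yi)
                if err < best_err:
                    best, best_err = stump, err
    return best, best_err

def adaboost(X, y, T=10):
    n = len(X)
    w = [1.0 / n] * n                 # uniform initial distribution D
    ensemble = []                     # list of (alpha_t, stump_t)
    for _ in range(T):
        stump, eps = best_stump(X, y, w)
        eps = max(eps, 1e-10)         # avoid division by zero
        if eps >= 0.5:
            break                     # weak learner no better than chance
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, stump))
        # Reweight: increase weights of wrongly classified samples.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for xi, yi, wi in zip(X, y, w)]
        z = sum(w)
        w = [wi / z for wi in w]      # renormalize to a distribution
    return ensemble

def predict(ensemble, x):
    # Final decision: sign of the weighted vote f(x) = sum_t alpha_t * h_t(x).
    f = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if f >= 0 else -1

# Tiny hypothetical 1-D data set with labels in {+1, -1}.
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
y = [+1, +1, -1, -1, +1, -1]
model = adaboost(X, y, T=5)
print([predict(model, xi) for xi in X])
```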
AdaBoost Ensemble Decision
- The final classification is given by f(x) = Σt αt ht(x)
  - In the two-class case, the final output is sign[f(x)]
- Advantages of AdaBoost
  - AdaBoost adjusts the errors of the weak classifiers adaptively
    via the Weak Learner
  - The update rule reduces the weight assigned to those examples on
    which the classifier makes good predictions, and increases the
    weight of the examples on which the prediction is poor
27
An illustration of AdaBoost
(Figure: a sequence of 2-D plots)
- Weak learner: a threshold applied to one or the other axis
- Add a weak learner: the current base learner is the dashed black
  line; the ensemble decision boundary is green
- The size of each circle indicates the weight assigned to that sample
- Three red circles on the left side and two blue circles on the
  right side are misclassified
- Weights of misclassified samples are increased and weights of
  correctly classified samples are decreased
- Find another weak classifier based on the current distribution,
  focusing on the difficult training samples
28
Conclusion
- Basic concepts of data mining
- Three tree-based data mining techniques
  - Decision tree
  - CART
  - AdaBoost
29
Information Gain
- Entropy
  - High: uniform, less predictable
  - Low: peaks and valleys, more predictable
- Conditional Entropy
30
Entropy
- Suppose X takes n values, V1, V2, ..., Vn, with
  P(X=V1)=p1, P(X=V2)=p2, ..., P(X=Vn)=pn
- What is the smallest number of bits, on average, per symbol,
  needed to transmit the symbols drawn from the distribution of X?

  H(X) = -p1 log2 p1 - p2 log2 p2 - ... - pn log2 pn
       = -Σ(i=1..n) pi log2 pi

- H(X) = the entropy of X
31
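
A small illustrative check of this formula in Python (not from the slides); the distributions below are arbitrary examples.

```python
# Check the entropy formula H(X) = -sum_i p_i * log2(p_i).
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin needs 1 bit per symbol; a biased coin needs fewer.
print(entropy([0.5, 0.5]))   # 1.0
print(entropy([0.9, 0.1]))   # ~0.469
print(entropy([0.25] * 4))   # 2.0
```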
Specific Conditional Entropy, H(Y|X=v)
- X = College Major
- Y = Likes "Gladiator"
- Given input X, to predict Y

  X        Y
  Math     Yes
  History  No
  CS       Yes
  Math     No
  Math     No
  CS       Yes
  History  No
  Math     Yes

- From the data we can estimate probabilities:
  P(LikeG = Yes) = 0.5
  P(Major = Math & LikeG = No) = 0.25
  P(Major = Math) = 0.5
  P(Major = History & LikeG = Yes) = 0
- Note
  H(X) = 1.5
  H(Y) = 1
33
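
A small illustrative check (not from the slides) that recomputes these probabilities and entropies from the eight rows of the table above.

```python
# Recompute the estimated probabilities and the entropies H(X), H(Y)
# from the College Major / Likes "Gladiator" table.
import math
from collections import Counter

rows = [("Math", "Yes"), ("History", "No"), ("CS", "Yes"), ("Math", "No"),
        ("Math", "No"), ("CS", "Yes"), ("History", "No"), ("Math", "Yes")]
n = len(rows)

def entropy(labels):
    return -sum(c / len(labels) * math.log2(c / len(labels))
                for c in Counter(labels).values())

majors = [x for x, _ in rows]
likes = [y for _, y in rows]

print(sum(1 for _, y in rows if y == "Yes") / n)                     # P(LikeG=Yes) = 0.5
print(sum(1 for x, y in rows if x == "Math" and y == "No") / n)      # 0.25
print(sum(1 for x, _ in rows if x == "Math") / n)                    # P(Major=Math) = 0.5
print(sum(1 for x, y in rows if x == "History" and y == "Yes") / n)  # 0.0
print(entropy(majors))   # H(X) = 1.5
print(entropy(likes))    # H(Y) = 1.0
```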
Specific Conditional Entropy, H(Y|X=v)
- X = College Major
- Y = Likes "Gladiator"

  X        Y
  Math     Yes
  History  No
  CS       Yes
  Math     No
  Math     No
  CS       Yes
  History  No
  Math     Yes

- Definition
  H(Y|X=v) = entropy of Y among only those records in which X has
  value v
- Example:
  H(Y|X=Math) = 1
  H(Y|X=History) = 0
  H(Y|X=CS) = 0
34
Conditional Entropy, H(Y|X)
- X = College Major
- Y = Likes "Gladiator"

  X        Y
  Math     Yes
  History  No
  CS       Yes
  Math     No
  Math     No
  CS       Yes
  History  No
  Math     Yes

- Definition
  H(Y|X) = the average conditional entropy of Y
         = Σi P(X=vi) H(Y|X=vi)
- Example:

  vi        P(X=vi)   H(Y|X=vi)
  Math      0.5       1
  History   0.25      0
  CS        0.25      0

  H(Y|X) = 0.5*1 + 0.25*0 + 0.25*0 = 0.5
35
Information Gain
- X = College Major
- Y = Likes "Gladiator"

  X        Y
  Math     Yes
  History  No
  CS       Yes
  Math     No
  Math     No
  CS       Yes
  History  No
  Math     Yes

- Definition
  IG(Y|X) = I must transmit Y. How many bits on average would it
  save me if both ends of the line knew X?
  IG(Y|X) = H(Y) - H(Y|X)
- Example:
  H(Y) = 1
  H(Y|X) = 0.5
  Thus: IG(Y|X) = 1 - 0.5 = 0.5
36
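
A small illustrative check (not from the slides) that recomputes H(Y|X=v), H(Y|X), and IG(Y|X) from the same table.

```python
# Recompute conditional entropy and information gain from the
# College Major / Likes "Gladiator" table.
import math
from collections import Counter

rows = [("Math", "Yes"), ("History", "No"), ("CS", "Yes"), ("Math", "No"),
        ("Math", "No"), ("CS", "Yes"), ("History", "No"), ("Math", "Yes")]

def entropy(labels):
    return -sum(c / len(labels) * math.log2(c / len(labels))
                for c in Counter(labels).values())

ys = [y for _, y in rows]
H_Y = entropy(ys)                        # H(Y) = 1.0

# H(Y|X) = sum_v P(X=v) * H(Y|X=v)
H_Y_given_X = 0.0
for v in set(x for x, _ in rows):
    subset = [y for x, y in rows if x == v]
    p_v = len(subset) / len(rows)
    print(v, p_v, entropy(subset))       # per-major P(X=v) and H(Y|X=v)
    H_Y_given_X += p_v * entropy(subset)

print(H_Y_given_X)                       # H(Y|X) = 0.5
print(H_Y - H_Y_given_X)                 # IG(Y|X) = 0.5
```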
Decision tree learning algorithm
37
Acknowledgement
- Part of the slides on Decision Tree are borrowed from Bing Liu
- Part of the slides on Information Gain are borrowed from Andrew Moore
- The example of AdaBoost is extracted from Hongbo's slides
38