Data mining - Department of Computer Science and Engineering
Tree-Based Data Mining Techniques
Haiqin Yang
1
Why Data Mining?
- Necessity is the Mother of Invention
The amount of data increases
Need to convert such data -> useful knowledge
The gap between data and analysts is increasing
Hidden information is not always evident
Applications
Business management, Production control,
Bioinformatics, Market analysis, Engineering design,
Science exploration...
2
What is Data Mining?
Extraction of interesting (non-trivial,
implicit, previously unknown and
potentially useful) patterns or knowledge
from huge amounts of data
Alternative names
Knowledge discovery (mining) in databases
(KDD), knowledge extraction, data/pattern
analysis, data archeology, data dredging,
information harvesting, business intelligence,
etc.
3
KDD Process
Database -> Selection -> Transformation/Preparation -> Training/Mining -> Model, Patterns -> Evaluation/Verification
4
Main Tasks in KDD
Classification
Clustering
Regression
Association rule learning
5
Top 10 Algorithms in KDD
Identified at ICDM'06 based on a series of criteria:
selected from 18 nominations by votes from various researchers
In 10 topics
association analysis, classification, clustering,
statistical learning, bagging and boosting,
sequential patterns, integrated mining, rough
sets, link mining, and graph mining
6
Top 10 Algorithms in KDD
C4.5
k-means
Support vector machines
The Apriori algorithm (association rule analysis)
The EM algorithm
PageRank
AdaBoost
kNN: k-nearest neighbor classification
Naive Bayes
CART: Classification and Regression Tree
7
Good Resources
Weka - Data Mining Software in Java
http://www.cs.waikato.ac.nz/ml/weka/
Spider - Machine Learning in Matlab
http://people.kyb.tuebingen.mpg.de/spider/
Scikits.learn - Machine Learning in Python
http://scikit-learn.sourceforge.net/
PyML - Machine Learning in Python
http://pyml.sourceforge.net/
KDnuggets
http://www.kdnuggets.com/
8
Tree-Based Data Mining Techniques
Decision Tree (C4.5)
CART
AdaBoost (Ensemble Method)
Random Forest (Ensemble Method)
…
9
An Example
Data: Loan application data
Task: Predict whether a loan should be
approved or not
Performance measure: accuracy
No learning: classify all future applications
(test data) to the majority class (i.e., Yes):
Accuracy = 9/15 = 60%
Can we do better than 60% with learning?
11
Decision tree
Decision tree learning is one of the most
widely used techniques for classification
Its classification accuracy is competitive with
other methods, and
it is very efficient
The classification model is a tree, called a decision tree.
C4.5 by Ross Quinlan is perhaps the best-known system; it is open source.
12
A decision tree from the loan data
Decision nodes and leaf nodes (classes)
13
Prediction
(Figure: an example application is traced down the tree; the predicted class is No.)
14
Is the decision tree unique?
No. Here is a simpler tree.
Objective: a smaller yet accurate tree; smaller trees are easier to understand and tend to perform better.
Finding the best tree is NP-hard.
All current tree-building algorithms are heuristic.
15
Algorithm for decision tree learning
Basic algorithm (a greedy divide-and-conquer algorithm)
Assume attributes are categorical (continuous attributes can be handled too)
The tree is constructed in a top-down recursive manner
At the start, all the training examples are at the root
Examples are partitioned recursively based on selected attributes
Attributes are selected on the basis of an impurity function (e.g., information gain); see the sketch after this slide
Conditions for stopping partitioning
All examples for a given node belong to the same class
There are no remaining attributes for further partitioning (the majority class becomes the leaf)
There are no examples left
16
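A minimal sketch of this top-down, greedy procedure, assuming categorical attributes stored as dictionaries and information gain as the impurity criterion; the function names and data layout are illustrative, not taken from C4.5 itself:

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) over a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Pick the attribute with the largest information gain."""
    base = entropy(labels)
    def gain(a):
        remainder = 0.0
        for v in set(r[a] for r in rows):
            sub = [y for r, y in zip(rows, labels) if r[a] == v]
            remainder += len(sub) / len(labels) * entropy(sub)
        return base - remainder
    return max(attributes, key=gain)

def build_tree(rows, labels, attributes):
    # Stop: all examples in one class, or no attributes left -> majority-class leaf.
    if len(set(labels)) == 1:
        return labels[0]
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(rows, labels, attributes)
    node = {a: {}}
    for v in set(r[a] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[a] == v]
        node[a][v] = build_tree([rows[i] for i in idx],
                                [labels[i] for i in idx],
                                [x for x in attributes if x != a])
    return node
```

For example, build_tree([{'Age': 'young'}, {'Age': 'old'}], ['No', 'Yes'], ['Age']) returns {'Age': {'young': 'No', 'old': 'Yes'}}.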
Choose an attribute to partition data
The key to building a decision tree - which
attribute to choose in order to branch.
The objective is to reduce impurity or
uncertainty in data as much as possible.
A subset of data is pure if all instances belong
to the same class.
The heuristic in C4.5 is to choose the
attribute with the maximum Information
Gain or Gain Ratio based on information
theory.
17
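The slides do not spell out the formulas; for reference, the standard C4.5 criteria are (with S the data at a node, A an attribute, and S_v the subset of S with A = v):

$$\mathrm{Gain}(S,A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} H(S_v), \qquad \mathrm{GainRatio}(S,A) = \frac{\mathrm{Gain}(S,A)}{-\sum_{v} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}}$$

The denominator of the gain ratio is the split information; it penalizes attributes with many distinct values.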
CART-Classification and Regression Tree
By L. Breiman, J. Friedman, R. Olshen,
and C. Stone (1984)
A non-parametric technique using tree
building
Binary splits
Splits based only on one variable
18
CART - Loan Example (again)
19
Three Steps in CART
1. Tree building
2. Pruning
3. Optimal tree selection
If the attribute is categorical, then a classification tree is used; if it is continuous, a regression tree is used.
20
Steps of Tree Building
1. For each non-terminal node:
   1. For each variable:
      1. At all its split points, split the samples into two binary nodes
      2. Select the best split for the variable in terms of the reduction in impurity (Gini index); see the sketch after this slide
   2. Rank all of the best splits and select the variable that achieves the highest purity at the root
   3. Assign classes to the nodes according to a rule that minimizes misclassification costs
   4. Grow a very large tree Tmax until all terminal nodes are either small, or pure, or contain identical measurement vectors
2. Prune and choose the final tree using cross-validation
21
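A minimal sketch of the split search inside one node, assuming a numeric variable and the Gini index as the impurity measure (names are illustrative):

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a set of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_binary_split(x, y):
    """Try every split point of one variable x; return (threshold, impurity reduction)."""
    parent = gini(y)
    best = (None, 0.0)
    for t in sorted(set(x))[:-1]:                      # candidate thresholds
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        child = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if parent - child > best[1]:
            best = (t, parent - child)
    return best
```

For example, best_binary_split([1, 2, 3, 4], ['a', 'a', 'b', 'b']) returns (2, 0.5). Running this for every variable and keeping the variable/threshold pair with the largest reduction gives the node's split; repeating it recursively grows Tmax, which is then pruned by cross-validation as in step 2.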
Advantages of CART
Can cope with any data structure or type
Classification has a simple form
Uses conditional information effectively
Invariant under transformations of the
variables
Is robust with respect to outliers
Gives an estimate of the misclassification
rate
22
Disadvantages of CART
CART does not use combinations of variables
The tree can be deceptive: if a variable is not included, it may be because it was "masked" by another
Tree structures may be unstable: a change in the sample may give a different tree
The tree is optimal at each split, but it may not be globally optimal
23
AdaBoost
Definition of Boosting:
Boosting refers to the general problem of producing a very
accurate prediction rule by combining rough and moderately
inaccurate rules-of-thumb.
History
1990-Boost by majority (Freund)
1995-AdaBoost (Freund & Schapire)
1997-Generalized AdaBoost (Schapire & Singer)
2001-AdaBoost in face detection (Viola & Jones)
Properties
A linear classifier with all its desirable properties
Output converges to the logarithm of likelihood ratio
Good generalization properties
A feature selector that minimizes an upper bound on the empirical error
Close to sequential decision making (simple to complex
classifiers)
24
Boosting Algorithm
The intuitive idea
Alter the distribution over the domain in a way that increases the probability of the "harder" parts of the space, thus forcing the weak learner to generate new classifiers that make fewer mistakes on these parts.
AdaBoost algorithm
Input variables / Formal setting
D: The distribution over all the training samples
Weak Learner: A weak learning algorithm to be
boosted
T: The specified number of iterations
h: a weak classifier, h : X → {+1, −1}; H = {h(x)} is the set of weak classifiers
25
Two-class AdaBoost
Initialize the weights with a uniform distribution.
As the weighted error ε_t of the t-th weak classifier increases, its weight α_t decreases.
Increase (decrease) the distribution weights D of wrongly (correctly) classified samples.
The new distribution is used to train the next classifier. The process is iterated.
26
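The update equations themselves are in the slide's figure and are not reproduced in this transcript; for reference, the standard two-class AdaBoost updates at round t are:

$$\varepsilon_t = \sum_i D_t(i)\,\mathbb{1}[h_t(x_i) \ne y_i], \qquad \alpha_t = \frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t}, \qquad D_{t+1}(i) = \frac{D_t(i)\,\exp(-\alpha_t\, y_i\, h_t(x_i))}{Z_t}$$

where Z_t normalizes D_{t+1} to a distribution. These match the bullets above: as ε_t increases, α_t decreases, and misclassified samples (those with y_i h_t(x_i) = −1) receive larger weight.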
AdaBoost Ensemble Decision
Final classification is given by f(x) = Σ_t α_t h_t(x)
In the two-class case, the final output is sign[f(x)]
Advantages of AdaBoost
AdaBoost adjusts adaptively to the errors of the weak classifiers returned by the Weak Learner
The update rule reduces the weight assigned to those examples on which the classifier makes good predictions, and increases the weight of the examples on which the prediction is poor
27
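A minimal sketch of the whole two-class procedure, from the reweighting loop to the ensemble decision sign[f(x)], using one-dimensional threshold stumps as the weak learner; this is an illustrative toy, not the implementation behind the slides:

```python
import math

def stump_learner(x, y, d):
    """Weak learner: the threshold/sign on a 1-D feature with the lowest weighted error."""
    best = None
    for t in sorted(set(x)):
        for s in (+1, -1):
            pred = [s if xi >= t else -s for xi in x]
            err = sum(w for w, p, yi in zip(d, pred, y) if p != yi)
            if best is None or err < best[0]:
                best = (err, t, s)
    err, t, s = best
    return (lambda xi: s if xi >= t else -s), err

def adaboost(x, y, rounds=10):
    n = len(y)
    d = [1.0 / n] * n                      # uniform initial distribution
    ensemble = []                          # list of (alpha_t, h_t)
    for _ in range(rounds):
        h, err = stump_learner(x, y, d)
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, h))
        # Reweight: increase wrongly classified, decrease correctly classified, normalize.
        d = [wi * math.exp(-alpha * yi * h(xi)) for wi, xi, yi in zip(d, x, y)]
        z = sum(d)
        d = [wi / z for wi in d]
    return ensemble

def predict(ensemble, xi):
    f = sum(alpha * h(xi) for alpha, h in ensemble)
    return 1 if f >= 0 else -1             # sign[f(x)]
```

For a separable toy set such as adaboost([1, 2, 3, 4], [-1, -1, 1, 1]), the first stump already classifies everything correctly and dominates the ensemble.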
An illustration of AdaBoost
Weak learner: a threshold applied to one or the other axis. The base learner just added is the dashed black line; the ensemble decision boundary is green. The size of each circle indicates the weight assigned to that sample.
Three red circles on the left side and two blue circles on the right side are misclassified. Weights of the misclassified points are increased and weights of the correctly classified samples are decreased, and another weak classifier is found based on the current distribution.
The effect is to focus on the difficult training samples.
28
Conclusion
Basic concepts on data mining
Three tree-based data mining techniques
Decision tree
CART
AdaBoost
29
Information Gain
Entropy
High: uniform, less predictable
Low: peaks and valleys, more predictable
Conditional Entropy
30
Entropy
Suppose X takes n values, V1, V2,… Vn,
P(X=V1)=p1, P(X=V2)=p2, … P(X=Vn)=pn
What is the smallest number of bits, on average,
per symbol, needed to transmit the symbols
drawn from distribution of X?
$H(X) = -p_1 \log_2 p_1 - p_2 \log_2 p_2 - \cdots - p_n \log_2 p_n = -\sum_{i=1}^{n} p_i \log_2 p_i$
H(X) = the entropy of X
31
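A quick numeric illustration of the formula (not in the slides): for a fair coin, $H(X) = -0.5\log_2 0.5 - 0.5\log_2 0.5 = 1$ bit, the least predictable case; for a biased coin with $p = (0.9, 0.1)$, $H(X) = -0.9\log_2 0.9 - 0.1\log_2 0.1 \approx 0.47$ bits, reflecting the peaked, more predictable distribution.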
Specific Conditional Entropy, H(Y|X=v)
X = College Major
Y = Likes "Gladiator"
Given input X, to predict Y

Data (8 records):
X        Y
Math     Yes
History  No
CS       Yes
Math     No
Math     No
CS       Yes
History  No
Math     Yes

From the data we can estimate probabilities:
P(LikeG = Yes) = 0.5
P(Major = Math & LikeG = No) = 0.25
P(Major = Math) = 0.5
P(Major = History & LikeG = Yes) = 0
Note: H(X) = 1.5, H(Y) = 1
33
Specific Conditional Entropy, H(Y|X=v)
X = College Major
Y = Likes "Gladiator" (same data as above)
Definition: H(Y|X=v) = the entropy of Y among only those records in which X has value v
Example:
H(Y|X=Math) = 1
H(Y|X=History) = 0
H(Y|X=CS) = 0
34
Conditional Entropy, H(Y|X)
X = College Major
Y = Likes "Gladiator" (same data as above)
Definition: H(Y|X) = the average conditional entropy of Y = Σ_i P(X=v_i) H(Y|X=v_i)
Example:
v_i       P(X=v_i)   H(Y|X=v_i)
Math      0.5        1
History   0.25       0
CS        0.25       0
H(Y|X) = 0.5*1 + 0.25*0 + 0.25*0 = 0.5
35
Information Gain
X = College Major
Y = Likes "Gladiator" (same data as above)
Definition: IG(Y|X) = I must transmit Y. How many bits on average would it save me if both ends of the line knew X?
IG(Y|X) = H(Y) - H(Y|X)
Example:
H(Y) = 1
H(Y|X) = 0.5
Thus: IG(Y|X) = 1 - 0.5 = 0.5
36
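A short script that reproduces these numbers from the eight-record table (assuming the table as reconstructed above; variable names are illustrative):

```python
import math
from collections import Counter

majors = ["Math", "History", "CS", "Math", "Math", "CS", "History", "Math"]
likes  = ["Yes",  "No",      "Yes", "No",  "No",   "Yes", "No",      "Yes"]

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def conditional_entropy(xs, ys):
    # H(Y|X) = sum over v of P(X=v) * H(Y | X=v)
    n = len(ys)
    return sum(xs.count(v) / n * entropy([y for x, y in zip(xs, ys) if x == v])
               for v in set(xs))

print(entropy(majors))                                       # H(X)    = 1.5
print(entropy(likes))                                        # H(Y)    = 1.0
print(conditional_entropy(majors, likes))                    # H(Y|X)  = 0.5
print(entropy(likes) - conditional_entropy(majors, likes))   # IG(Y|X) = 0.5
```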
Decision tree learning algorithm (figure not reproduced; see the sketch after slide 16)
37
Acknowledgement
Part of the slides on Decision Tree
borrowed from Bing Liu
Part of the slides on Information Gain
borrowed from Andrew Moore
The AdaBoost example is extracted from Hongbo's slides
38