Data Mining Algorithms
Prof. S. Sudarshan
CSE Dept, IIT Bombay
Most Slides Courtesy
Prof. Sunita Sarawagi
School of IT, IIT Bombay
Overview
Decision Tree classification algorithms
Clustering algorithms
Challenges
Resources
Decision Tree Classifiers
Decision tree classifiers
Widely used learning method
Easy to interpret: can be re-represented as if-then-else rules
Approximates the function by piecewise-constant regions
Does not require prior knowledge of the data distribution; works well on noisy data
Has been applied to classify:
medical patients by disease,
equipment malfunctions by cause,
loan applicants by likelihood of payment
Setting
Given old data about customers and
payments, predict new applicant’s loan
eligibility.
[Figure: previous customers' records (Age, Salary, Profession, Location, Customer type) are fed to a classifier, which learns decision rules such as Salary > 5 L and Prof. = Exec; the rules are then applied to a new applicant's data to label it good or bad.]
Decision trees
Tree where internal nodes are simple
decision rules on one or more attributes
and leaf nodes are predicted class labels.
[Figure: example decision tree with internal-node tests Salary < 1 M, Prof = teaching, and Age < 30, and leaves labeled Good or Bad.]
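As a quick illustration of the if-then-else view, the example tree above can be read as nested rules; in this Python sketch the branch arrangement is assumed purely for illustration, only the tests (Salary < 1 M, Prof = teaching, Age < 30) come from the slide.

def classify(salary, prof, age):
    # hypothetical arrangement of the slide's tests into a tree
    if salary < 1_000_000:          # Salary < 1 M
        if prof == "teaching":      # Prof = teaching
            return "Bad"
        return "Good"
    if age < 30:                    # Age < 30
        return "Bad"
    return "Good"

print(classify(salary=400_000, prof="teaching", age=40))   # -> Bad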
Topics to be covered
Tree construction:
Basic tree learning algorithm
Measures of predictive ability
High performance decision tree construction: Sprint
Tree pruning:
Why prune
Methods of pruning
Other issues:
Handling missing data
Continuous class labels
Effect of training size
Tree learning algorithms
ID3 (Quinlan 1986)
Successor C4.5 (Quinlan 1993)
SLIQ (Mehta et al)
SPRINT (Shafer et al)
Basic algorithm for tree
building
Greedy top-down construction.
Gen_Tree(node, data):
  if the selection criteria say the node should be a leaf, make it a leaf and stop;
  find the best attribute and the best split on that attribute;
  partition the data on the split condition;
  for each child j of the node: Gen_Tree(node_j, data_j)
Split criteria
Select the attribute that is best for classification.
Intuitively, pick the one that best separates instances of different classes.
Quantifying the intuition means measuring separability:
first define the impurity of an arbitrary set S consisting of k classes.
Impurity Measures
Information entropy:
  Entropy(S) = -\sum_{i=1}^{k} p_i \log p_i
Zero when S consists of a single class; maximal (1 for two equally frequent classes, with base-2 logs) when all classes occur in equal numbers.
Other measures of impurity: Gini:
  Gini(S) = 1 - \sum_{i=1}^{k} p_i^2
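A small sketch (not from the slides) of these two impurity measures computed from class counts; entropy here uses base-2 logarithms, so a two-class set in equal proportion has entropy 1.

from math import log2

def entropy(counts):
    # Entropy(S) = -sum_i p_i log2 p_i, over classes with p_i > 0
    n = sum(counts)
    return sum(-(c / n) * log2(c / n) for c in counts if c > 0)

def gini(counts):
    # Gini(S) = 1 - sum_i p_i^2
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(entropy([50, 50]), gini([50, 50]))   # 1.0 0.5  (two equal classes)
print(entropy([100, 0]), gini([100, 0]))   # 0.0 0.0  (a pure set has zero impurity)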
Split criteria
k classes; a set S of instances is partitioned into r subsets S_1, ..., S_r. In subset S_j, a fraction p_{ij} of the instances belongs to class i.
Information entropy of the partition:
  Entropy_{split} = \sum_{j=1}^{r} \frac{|S_j|}{|S|} \left( -\sum_{i=1}^{k} p_{ij} \log p_{ij} \right)
Gini index of the partition:
  Gini_{split} = \sum_{j=1}^{r} \frac{|S_j|}{|S|} \left( 1 - \sum_{i=1}^{k} p_{ij}^2 \right)
[Plot: impurity (entropy and Gini) as a function of the class fraction for r = 1, k = 2; zero for a pure set, maximal at equal class fractions.]
Information gain
Information gain on partitioning S into r subsets:
Impurity(S) minus the sum of the weighted impurity of each subset:
  Gain(S, S_1 .. S_r) = Entropy(S) - \sum_{j=1}^{r} \frac{|S_j|}{|S|} Entropy(S_j)
Information gain: example
K = 2, |S| = 100, p1 = 0.6, p2 = 0.4
E(S) = -0.6 log 0.6 - 0.4 log 0.4 = 0.29
S is split into S1 and S2:
|S1| = 70, p1 = 0.8, p2 = 0.2;  E(S1) = -0.8 log 0.8 - 0.2 log 0.2 = 0.21
|S2| = 30, p1 = 0.13, p2 = 0.87;  E(S2) = -0.13 log 0.13 - 0.87 log 0.87 = 0.16
Information gain: E(S) - (0.7 E(S1) + 0.3 E(S2)) = 0.1
(These numbers correspond to base-10 logarithms.)
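This worked example can be reproduced in a few lines of Python:

from math import log10

def entropy10(ps):
    return sum(-p * log10(p) for p in ps if p > 0)

E_S  = entropy10([0.6, 0.4])
E_S1 = entropy10([0.8, 0.2])
E_S2 = entropy10([0.13, 0.87])
gain = E_S - (0.7 * E_S1 + 0.3 * E_S2)
print(round(E_S, 2), round(E_S1, 2), round(E_S2, 2), round(gain, 2))
# 0.29 0.22 0.17 0.09 -- matching the slide's rounded 0.29, 0.21, 0.16 and a gain of about 0.1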
Meta learning methods
No single classifier is good in all cases
Difficult to evaluate in advance which conditions favor which classifier
Meta learning: combine the effects of the
classifiers
Voting: sum up votes of component classifiers
Combiners: learn a new classifier on the outcomes of
previous ones:
Boosting: staged classifiers
Disadvantage: interpretation hard
Knowledge probing: learn single classifier to mimic
meta classifier
SPRINT
(Serial PaRallelizable INduction
of decision Trees)
Decision-tree classifier for data mining
Design goals:
Able to handle large disk-resident training sets
No restrictions on training-set size
Easily parallelizable
Example
Example Data
Age  Car Type  Risk
42   family    Low
18   truck     High
57   sports    High
21   sports    High
28   family    Low
72   truck     Low

Decision tree for the data:
  Age < 25 ?
    yes: Risk = High
    no:  CarType in {sports} ?
           yes: Risk = High
           no:  Risk = Low
Building tree
GrowTree(TrainingData D)
  Partition(D);

Partition(Data D)
  if (all points in D belong to the same class) then
    return;
  for each attribute A do
    evaluate splits on attribute A;
  use the best split found to partition D into D1 and D2;
  Partition(D1);
  Partition(D2);
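A minimal in-memory sketch of this recursion, using the Gini index to pick binary splits on numeric attributes; the names and data layout are illustrative, not SPRINT's actual implementation.

def gini(rows):
    # rows: list of (features, label) pairs
    n = len(rows)
    counts = {}
    for _, label in rows:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(rows, num_attrs):
    # try every "attribute < value" cut and keep the lowest Gini_split
    n, best = len(rows), None
    for a in range(num_attrs):
        for x, _ in rows:
            d1 = [r for r in rows if r[0][a] < x[a]]
            d2 = [r for r in rows if r[0][a] >= x[a]]
            if not d1 or not d2:
                continue
            g = len(d1) / n * gini(d1) + len(d2) / n * gini(d2)
            if best is None or g < best[0]:
                best = (g, a, x[a])
    return best

def partition(rows, num_attrs):
    labels = {label for _, label in rows}
    if len(labels) == 1:                 # all points in the same class
        return labels.pop()
    split = best_split(rows, num_attrs)
    if split is None:                    # no useful split: arbitrary label
        return max(labels)
    _, a, v = split
    d1 = [r for r in rows if r[0][a] < v]
    d2 = [r for r in rows if r[0][a] >= v]
    return (a, v, partition(d1, num_attrs), partition(d2, num_attrs))

# toy data: a single numeric attribute (age) and a High/Low class label
data = [((23,), "High"), ((17,), "High"), ((43,), "High"),
        ((68,), "Low"), ((32,), "Low"), ((20,), "High")]
print(partition(data, num_attrs=1))      # nested (attr, cut, left, right) tuples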
Data Setup: Attribute Lists
One list for each attribute
Entries in an Attribute List consist of:
attribute value
class value
record id
Example list (Age attribute):
Age  Risk  RID
17   High  1
20   High  5
23   High  0
32   Low   4
43   High  2
68   Low   3
Lists for continuous attributes are in sorted order
Lists may be disk-resident
Each leaf-node has its own set of attribute lists
representing the training examples belonging to
that leaf
Attribute Lists: Example
Training data (Age, Car Type, Risk):
RID  Age  Car Type  Risk
0    23   family    High
1    17   sports    High
2    43   sports    High
3    68   family    Low
4    32   truck     Low
5    20   family    High

Initial Attribute Lists for the root node:

Age list (sorted on Age):
Age  Risk  RID
17   High  1
20   High  5
23   High  0
32   Low   4
43   High  2
68   Low   3

Car Type list:
Car Type  Risk  RID
family    High  0
sports    High  1
sports    High  2
family    Low   3
truck     Low   4
family    High  5
Evaluating Split Points
Gini Index
If data set D contains examples from c classes:
  Gini(D) = 1 - \sum_{j=1}^{c} p_j^2
where p_j is the relative frequency of class j in D.
If D is split into D1 and D2 with n1 and n2 tuples respectively (n = n1 + n2):
  Gini_{split}(D) = \frac{n_1}{n} Gini(D_1) + \frac{n_2}{n} Gini(D_2)
Note: only class frequencies are needed to compute the index
Finding Split Points
For each attribute A do
evaluate splits on attribute A using attribute
list
Keep split with lowest GINI index
Finding Split Points:
Continuous Attrib.
Consider splits of form: value(A) < x
Example: Age < 17
Evaluate this split-form for every value in an attribute list
To evaluate splits on attribute A for a given tree-node:
Initialize class-histogram of left child to zeroes;
Initialize class-histogram of right child to same as its
parent;
for each record in the attribute list do
evaluate splitting index for value(A) < record.value;
using class label of the record, update class
histograms;
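A sketch of this scan (helper names are mine, not SPRINT's): walk the sorted attribute list once, maintain the left and right class histograms, and evaluate Gini_split at each cut.

def gini(hist):
    n = sum(hist.values())
    if n == 0:
        return None                                  # undefined for an empty child
    return 1.0 - sum((c / n) ** 2 for c in hist.values())

def scan_numeric(attr_list, classes=("High", "Low")):
    # attr_list: (value, class, rid) triples, sorted on value
    left = {c: 0 for c in classes}                   # left child starts empty
    right = {c: 0 for c in classes}                  # right child starts as the parent
    for _, cls, _ in attr_list:
        right[cls] += 1
    n, best = len(attr_list), None
    for value, cls, _ in attr_list:
        gl, gr = gini(left), gini(right)
        if gl is not None and gr is not None:        # skip cuts with an empty child
            g = sum(left.values()) / n * gl + sum(right.values()) / n * gr
            if best is None or g < best[0]:
                best = (g, value)                    # split test: attribute < value
        left[cls] += 1                               # move the record to the left side
        right[cls] -= 1
    return best

age_list = [(17, "High", 1), (20, "High", 5), (23, "High", 0),
            (32, "Low", 4), (43, "High", 2), (68, "Low", 3)]
print(scan_numeric(age_list))                        # about (0.222, 32): best cut is Age < 32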
Finding Split Points:
Continuous Attrib.
Attribute List (sorted on Age):
Age  Risk  RID
17   High  1
20   High  5
23   High  0
32   Low   4
43   High  2
68   Low   3

State of the class histograms (High, Low counts for the left and right child) at selected cursor positions in the scan, and the resulting GINI index:

Position 0 (Age < 17):   Left: High 0, Low 0   Right: High 4, Low 2   GINI = undef
Position 1 (Age < 20):   Left: High 1, Low 0   Right: High 3, Low 2   GINI = 0.4
Position 3 (Age < 32):   Left: High 3, Low 0   Right: High 1, Low 2   GINI = 0.222
Position 6 (end of list): Left: High 4, Low 2   Right: High 0, Low 0   GINI = undef
Finding Split Points:
Categorical Attrib.
Consider splits of the form: value(A) ∈ {x1, x2, ..., xn}
Example: CarType ∈ {family, sports}
Evaluate this split-form for subsets of domain(A)
To evaluate splits on attribute A for a given tree node:
initialize class/value matrix of node to zeroes;
for each record in the attribute list do
increment appropriate count in matrix;
evaluate splitting index for various subsets using the
constructed matrix;
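A sketch of the categorical case (illustrative names; subsets are enumerated by brute force here, whereas SPRINT falls back to a greedy search for large domains, as noted in the summary slide).

from itertools import combinations

def gini_counts(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def eval_categorical(attr_list, classes=("High", "Low")):
    # one pass over the attribute list builds the class/value matrix
    matrix = {}
    for value, cls, _ in attr_list:
        matrix.setdefault(value, {c: 0 for c in classes})[cls] += 1
    values, n, best = sorted(matrix), len(attr_list), None
    for r in range(1, len(values)):                  # proper, non-empty subsets
        for subset in combinations(values, r):
            left = [sum(matrix[v][c] for v in subset) for c in classes]
            right = [sum(matrix[v][c] for v in values if v not in subset)
                     for c in classes]
            g = (sum(left) / n * gini_counts(left)
                 + sum(right) / n * gini_counts(right))
            if best is None or g < best[0]:
                best = (g, set(subset))
    return best

car_list = [("family", "High", 0), ("sports", "High", 1), ("sports", "High", 2),
            ("family", "Low", 3), ("truck", "Low", 4), ("family", "High", 5)]
print(eval_categorical(car_list))                    # about (0.267, {'truck'}), as in the example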
Finding Split Points:
Categorical Attrib.
Attribute List:
Car Type  Risk  RID
family    High  0
sports    High  1
sports    High  2
family    Low   3
truck     Low   4
family    High  5

class/value matrix:
          High  Low
family    2     1
sports    2     0
truck     0     1

Candidate splits (High, Low counts for the left and right child) and GINI index:

CarType in {family}:  Left: High 2, Low 1   Right: High 2, Low 1   GINI = 0.444
CarType in {sports}:  Left: High 2, Low 0   Right: High 2, Low 2   GINI = 0.333
CarType in {truck}:   Left: High 0, Low 1   Right: High 4, Low 1   GINI = 0.267
Performing the Splits
The attribute lists of every node must be
divided among the two children
To split the attribute lists of a given node:
for the list of the attribute used to split this node do
use the split test to divide the records;
collect the record ids;
build a hashtable from the collected ids;
for the remaining attribute lists do
use the hashtable to divide each list;
build class-histograms for each new leaf;
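A sketch of this list-splitting step (illustrative names): split the list of the splitting attribute first, record each rid's child in a hash table, then use the table to divide every other attribute list; order is preserved, so sorted numeric lists stay sorted.

def split_lists(attr_lists, split_attr, goes_left):
    # attr_lists: {attribute name: [(value, class, rid), ...]}
    # goes_left(value) -> True if the record belongs to the left child
    left = {a: [] for a in attr_lists}
    right = {a: [] for a in attr_lists}
    table = {}                                   # rid -> "L" / "R"
    for value, cls, rid in attr_lists[split_attr]:
        side = "L" if goes_left(value) else "R"
        table[rid] = side
        (left if side == "L" else right)[split_attr].append((value, cls, rid))
    for attr, entries in attr_lists.items():     # divide the remaining lists
        if attr == split_attr:
            continue
        for value, cls, rid in entries:
            (left if table[rid] == "L" else right)[attr].append((value, cls, rid))
    return left, right

lists = {
    "Age": [(17, "High", 1), (20, "High", 5), (23, "High", 0),
            (32, "Low", 4), (43, "High", 2), (68, "Low", 3)],
    "CarType": [("family", "High", 0), ("sports", "High", 1), ("sports", "High", 2),
                ("family", "Low", 3), ("truck", "Low", 4), ("family", "High", 5)],
}
left, right = split_lists(lists, "Age", lambda age: age < 32)
print(left["CarType"])   # [('family', 'High', 0), ('sports', 'High', 1), ('family', 'High', 5)]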
Performing the Splits:
Example
Attribute lists at the node:

Age  Risk  RID
17   High  1
20   High  5
23   High  0
32   Low   4
43   High  2
68   Low   3

Car Type  Risk  RID
family    High  0
sports    High  1
sports    High  2
family    Low   3
truck     Low   4
family    High  5

Split: Age < 32

Hash Table (RID -> child):
0 Left, 1 Left, 2 Right, 3 Right, 4 Right, 5 Left

Left child attribute lists:
Age  Risk  RID
17   High  1
20   High  5
23   High  0

Car Type  Risk  RID
family    High  0
sports    High  1
family    High  5

Right child attribute lists:
Age  Risk  RID
32   Low   4
43   High  2
68   Low   3

Car Type  Risk  RID
sports    High  2
family    Low   3
truck     Low   4
Sprint: summary
Each node of the decision tree classifier requires examining possible splits on each value of each attribute.
After choosing a split attribute, all data must be partitioned into the resulting subsets.
Need to make this search efficient.
Evaluating splits on numeric attributes:
Sort on attribute value, incrementally evaluate gini
Splits on categorical attributes
For each subset, find gini and choose the best
For large sets, use greedy method
Approaches to prevent
overfitting
Stop growing the tree beyond a certain point
First over-fit, then post prune. (More widely
used)
Tree building divided into phases:
Growth phase
Prune phase
Hard to decide when to stop growing the tree, so the second approach is more widely used.
Criteria for finding correct final
tree size:
Cross validation with separate test data
Use all data for training but apply statistical test to
decide right size.
Use some criterion function to choose the best size
Example: Minimum description length (MDL) criterion
Cross validation approach:
Partition the dataset into two disjoint parts:
1. Training set used for building the tree.
2. Validation set used for pruning the tree
Build the tree using the training-set.
Evaluate the tree on the validation set and at each leaf and
internal node keep count of correctly labeled data.
Starting bottom-up, prune nodes whose error is less than that of their children.
Cross validation..
Need a large validation set to smooth out the over-fitting of the training data. Rule of thumb: one-third.
What if the training data set size is limited?
Generate many different partitions of the data.
n-fold cross validation: partition the training data into n parts D1, D2, ..., Dn.
Train n classifiers, with D - Di as the training set and Di as the test set.
Average the results.
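A minimal sketch of n-fold cross validation as described above; train and error are placeholders for any learner and error measure, not part of a particular library.

def n_fold_cv(data, n, train, error):
    folds = [data[i::n] for i in range(n)]               # partition into D1..Dn
    errs = []
    for i in range(n):
        test = folds[i]
        training = [x for j, f in enumerate(folds) if j != i for x in f]
        model = train(training)                          # train on D - Di
        errs.append(error(model, test))                  # evaluate on Di
    return sum(errs) / n                                 # average over the n folds

# toy usage with a majority-class "model"
data = [(x, "High" if x < 30 else "Low") for x in range(60)]
train = lambda d: max(("High", "Low"), key=lambda c: sum(1 for _, y in d if y == c))
error = lambda model, d: sum(1 for _, y in d if y != model) / len(d)
print(n_fold_cv(data, 3, train, error))                  # 0.5 for this toy learner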
Rule-based pruning
Tree-based pruning limits the kind of pruning possible: if a node is pruned, all subtrees under it have to be pruned.
Rule-based: for each leaf of the tree, extract a rule as the conjunction of all tests on the path up to the root.
On the validation set, independently prune tests from each rule to get the highest accuracy for that rule.
Sort rules by decreasing accuracy.
MDL-based pruning
Idea: a branch of the tree is over-fitted if
the training examples that fit under it can
be explicitly enumerated (with classes) in
less space than occupied by tree
Prune branch if over-fitted
philosophy: use tree that minimizes
description length of training data
Regression trees
Decision trees with continuous class labels:
regression trees approximate the function with piecewise-constant regions.
Split criteria for regression trees:
Predicted value for a set S = average of all values
in S
Error: sum of the square of error of each member
of S from the predicted average.
Pick the split with the smallest error.
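A small sketch of this criterion (illustrative names): the prediction for a set is its mean, the cost of a set is the sum of squared errors from that mean, and a candidate split is scored by the total cost of its two parts.

def sse(values):
    # sum of squared errors from the mean (the predicted value for the set)
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def split_error(points, cut):
    # points: (x, y) pairs; cost of the split x < cut
    left = [y for x, y in points if x < cut]
    right = [y for x, y in points if x >= cut]
    return sse(left) + sse(right)

pts = [(1, 1.0), (2, 1.2), (3, 0.9), (7, 5.0), (8, 5.3), (9, 4.8)]
print(sse([y for _, y in pts]), split_error(pts, 5))   # roughly 24.2 before vs 0.17 after splitting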
Issues
Multiple splits on continuous attributes [Fayyad
93, Multi-interval discretization of continuous
attributes]
Multi attribute tests on nodes to handle
correlated attributes
multivariate linear splits [Oblique trees, Murthy 94]
Methods of handling missing values
assume majority value
take most probable path
Allowing varying costs for different attributes
Pros and Cons of decision
trees
•Pros
+ Reasonable training time
+ Fast application
+ Easy to interpret
+ Easy to implement
+ Can handle large number of
features
•Cons
- Cannot handle complicated relationships between features
- simple decision boundaries
- problems with lots of missing data
More information: http://www.recursive-partitioning.com/
Clustering or Unsupervised
learning
Distance functions
Numeric data: Euclidean, Manhattan distances
Minkowski metric: [sum |xi - yi|^m]^(1/m)
Larger m gives higher weight to larger distances
Categorical data: 0/1 to indicate presence/absence
Euclidean distance: gives equal weight to 1-1 and 0-0 matches
Hamming distance: number of positions where the vectors differ
Jaccard coefficient: number of 1-1 matches / number of positions where at least one is 1 (0-0 matches are not counted)
Combined numeric and categorical data: weighted normalized distance
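Sketches of these distance measures for numeric and 0/1 vectors (the Jaccard coefficient is a similarity; one minus it gives a distance).

def minkowski(x, y, m):
    # m = 2 gives Euclidean distance, m = 1 gives Manhattan distance
    return sum(abs(a - b) ** m for a, b in zip(x, y)) ** (1.0 / m)

def hamming(x, y):
    # number of positions where the 0/1 vectors disagree
    return sum(1 for a, b in zip(x, y) if a != b)

def jaccard(x, y):
    # 1-1 matches divided by positions where at least one vector has a 1
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    return both / either if either else 1.0

print(minkowski((1, 2), (4, 6), 2))          # 5.0 (Euclidean)
print(minkowski((1, 2), (4, 6), 1))          # 7.0 (Manhattan)
print(hamming((1, 0, 1, 1), (1, 1, 0, 1)))   # 2
print(jaccard((1, 0, 1, 1), (1, 1, 0, 1)))   # 0.5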
Distance functions on high
dimensional data
Examples: time series, text, images
Euclidean measures make all points appear equally far apart
Reduce number of dimensions:
choose subset of original features using random projections,
feature selection techniques
transform original features using statistical methods like
Principal Component Analysis
Define domain specific similarity measures: e.g. for
images define features like number of objects, color
histogram; for time series define shape based measures.
Define non-distance based (model-based) clustering
methods:
Clustering methods
Hierarchical clustering
agglomerative Vs divisive
single link Vs complete link
Partitional clustering
distance-based: K-means
model-based: EM
density-based:
Partitional methods: K-means
Criterion: minimize the sum of squared distances,
either between each point and the centroid of its cluster,
or between each pair of points in the cluster
Algorithm:
Select initial partition with K clusters:
random, first K, K separated points
Repeat until stabilization:
Assign each point to closest cluster center
Generate new cluster centers
Adjust clusters by merging/splitting
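A minimal K-means sketch following the loop above (random initial centers, assign, recompute); it omits the merge/split adjustment.

import random

def kmeans(points, k, iters=100, seed=0):
    random.seed(seed)
    centers = random.sample(points, k)                    # initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                                  # assign to the closest center
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        new_centers = [tuple(sum(cs) / len(cl) for cs in zip(*cl)) if cl else centers[j]
                       for j, cl in enumerate(clusters)]  # recompute centers
        if new_centers == centers:                        # stabilized
            break
        centers = new_centers
    return centers, clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(pts, k=2)
print(sorted(centers))   # roughly [(0.33, 0.33), (10.33, 10.33)]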
Properties
May not reach global optima
Converges fast in practice: guaranteed for
certain forms of optimization function
Complexity: O(KndI):
I number of iterations, n number of points, d
number of dimensions, K number of clusters.
Database research on scalable algorithms:
BIRCH: one or two passes over the data, keeping an R-tree-like index in memory [SIGMOD 96]
Model based clustering
Assume the data is generated from K probability distributions, typically Gaussian.
A soft or probabilistic version of K-means clustering.
Need to find the distribution parameters.
EM Algorithm
EM Algorithm
Initialize K cluster centers
Iterate between two steps
Expectation step: assign points to clusters
  P(d_i \in c_k) = \Pr(c_k \mid d_i) = \Pr(c_k) \Pr(d_i \mid c_k) / \Pr(d_i),
  where \Pr(d_i \mid c_k) = N(\mu_k, \sigma_k)(d_i)
Maximization step: estimate model parameters
  \mu_k = \frac{\sum_{i=1}^{m} d_i P(d_i \in c_k)}{\sum_{i=1}^{m} P(d_i \in c_k)}
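A compact sketch of the two steps for a one-dimensional mixture of K Gaussians with equal priors kept fixed; purely illustrative, not a production implementation.

from math import exp, pi, sqrt

def normal(x, mu, sigma):
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

def em(data, k, iters=50):
    mus = list(data[:k])                      # initialize K cluster centers
    sigmas = [1.0] * k
    for _ in range(iters):
        # E-step: P(d_i in c_k) proportional to Pr(d_i | c_k) (equal priors cancel)
        resp = []
        for x in data:
            w = [normal(x, mus[j], sigmas[j]) for j in range(k)]
            s = sum(w) or 1e-12
            resp.append([wj / s for wj in w])
        # M-step: re-estimate mu_k (and sigma_k) from the soft assignments
        for j in range(k):
            wsum = sum(r[j] for r in resp) or 1e-12
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / wsum
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / wsum
            sigmas[j] = max(sqrt(var), 1e-3)
    return mus, sigmas

data = [1.0, 1.2, 0.8, 0.9, 5.0, 5.2, 4.8, 5.1]
print(em(data, k=2))                          # means end up near 1 and 5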
Properties
May not reach global optima
Converges fast in practice: guaranteed for
certain forms of optimization function
Complexity: O(KndI):
I number of iterations, n number of points, d
number of dimensions, K number of clusters.
Scalable clustering
algorithms
BIRCH: one or two passes over the data, keeping an R-tree-like index in memory [SIGMOD 96]
Fayyad and Bradley: sample repetitively and update a summary of clusters stored in memory (K-means and EM) [KDD 98]
Dasgupta 99: recent theoretical breakthrough; finds Gaussian clusters with guaranteed performance using random projections
To Learn More
Books
Ian H. Witten and Eibe Frank,
Data mining : practical machine learning tools
and techniques with Java implementations,
Morgan Kaufmann, 1999
Usama Fayyad et al. (eds), Advances in
Knowledge Discovery and Data Mining,
AAAI/MIT Press, 1996
Tom Mitchell, Machine Learning, McGraw-Hill
Software
Public domain
Weka 3: data mining algos in Java
(http://www.cs.waikato.ac.nz/~ml/weka)
classification, regression
MLC++: data mining tools in C++
mainly classification
Free for universities
try convincing IBM to give it free!
Datasets: follow links from
www.kdnuggets.com to UC Irvine site
Resources
http://www.kdnuggets.com
Great site with links to software, datasets etc. Be sure
to visit it.
http://www.cs.bham.ac.uk/~anp/TheDataMine.html
OLAP: http://altaplana.com/olap/
SIGKDD: http://www.acm.org/sigkdd
Data mining and knowledge discovery journal:
http://www.research.microsoft.com/research/datamine/
Communications of ACM Special Issue on Data Mining,
Nov 1996
Resources at IITB
http://www.cse.iitb.ernet.in/~dbms
IITB DB group home page
http://www.it.iitb.ernet.in/~sunita/it642
Data Warehousing and Data Mining course
offered by Prof. Sunita Sarawagi at IITB