Data Mining Algorithms
Prof. S. Sudarshan
CSE Dept, IIT Bombay
Most Slides Courtesy
Prof. Sunita Sarawagi
School of IT, IIT Bombay
Overview
Decision Tree classification algorithms
Clustering algorithms
Challenges
Resources
Decision Tree Classifiers
Decision tree classifiers
  Widely used learning method
  Easy to interpret: can be re-represented as if-then-else rules
  Approximates the function by piece-wise constant regions
  Does not require any prior knowledge of the data distribution; works well on noisy data
  Has been applied to classifying:
    medical patients by disease,
    equipment malfunctions by cause,
    loan applicants by likelihood of payment.
Setting
Given old data about customers and payments, predict a new applicant's loan eligibility.
[Figure: previous customer records (Age, Salary, Profession, Location, Customer type) are used to train a classifier; the resulting decision rules (e.g. Salary > 5 L, Prof. = Exec) label a new applicant's data as Good or Bad.]
Decision trees
Tree where internal nodes are simple decision rules on one or more attributes and leaf nodes are predicted class labels.
[Figure: example tree with internal nodes "Salary < 1 M", "Prof = teaching", "Age < 30" and leaf labels Good/Bad.]
Topics to be covered
Tree construction:
Basic tree learning algorithm
Measures of predictive ability
High performance decision tree construction: Sprint
Tree pruning:
Why prune
Methods of pruning
Other issues:
Handling missing data
Continuous class labels
Effect of training size
Tree learning algorithms
ID3 (Quinlan 1986)
Successor C4.5 (Quinlan 1993)
SLIQ (Mehta et al)
SPRINT (Shafer et al)
Basic algorithm for tree building
Greedy top-down construction.

Gen_Tree(node, data)
  if the selection criteria say the node should be a leaf: make it a leaf and stop
  find the best attribute and the best split on that attribute
  partition the data on the split condition
  for each child j of the node: Gen_Tree(node_j, data_j)
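To make the recursion concrete, here is a minimal Python sketch of greedy top-down construction, assuming numeric attributes, binary splits of the form attribute <= threshold, and entropy as the selection criterion; names such as gen_tree and best_split are illustrative only, not from any particular system.

```python
import math
from collections import Counter

def entropy(labels):
    # Impurity of a list of class labels.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(rows, labels):
    # Try every (attribute, threshold) pair; keep the one with the lowest weighted entropy.
    best = None
    for a in range(len(rows[0])):
        for t in sorted({r[a] for r in rows}):
            left = [i for i, r in enumerate(rows) if r[a] <= t]
            right = [i for i, r in enumerate(rows) if r[a] > t]
            if not left or not right:
                continue
            w = (len(left) * entropy([labels[i] for i in left]) +
                 len(right) * entropy([labels[i] for i in right])) / len(rows)
            if best is None or w < best[0]:
                best = (w, a, t, left, right)
    return best

def gen_tree(rows, labels, depth=0, max_depth=5):
    # Make a leaf if the node is pure, the tree is deep enough, or no useful split exists.
    if len(set(labels)) == 1 or depth == max_depth:
        return Counter(labels).most_common(1)[0][0]
    split = best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    _, a, t, left, right = split
    return {"attr": a, "thr": t,
            "left": gen_tree([rows[i] for i in left], [labels[i] for i in left],
                             depth + 1, max_depth),
            "right": gen_tree([rows[i] for i in right], [labels[i] for i in right],
                              depth + 1, max_depth)}

rows = [(23,), (17,), (43,), (68,), (32,), (20,)]
labels = ["High", "High", "High", "Low", "Low", "High"]
print(gen_tree(rows, labels))
```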
Split criteria
  Select the attribute that is best for classification.
  Intuitively, pick the one that best separates instances of different classes.
  Quantifying the intuition means measuring separability: first define the impurity of an arbitrary set S consisting of K classes.
Impurity Measures
Information entropy:
  Entropy(S) = -\sum_{i=1}^{k} p_i \log p_i
Zero when S consists of only one class; maximal (one, for two classes) when all classes occur in equal numbers.
Another measure of impurity is the Gini index:
  Gini(S) = 1 - \sum_{i=1}^{k} p_i^2
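As a quick sanity check of these definitions, a small Python sketch (function names are illustrative):

```python
import math

def entropy(p):
    # Entropy of a class-probability vector p; 0*log 0 is treated as 0.
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def gini(p):
    # Gini impurity of a class-probability vector p.
    return 1 - sum(pi * pi for pi in p)

print(entropy([1.0, 0.0]), entropy([0.5, 0.5]))  # -> 0 and 1 (pure vs. evenly mixed)
print(gini([1.0, 0.0]), gini([0.5, 0.5]))        # -> 0 and 0.5
```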
Split criteria
K classes; a set S of instances is partitioned into r subsets S_1, ..., S_r. Subset S_j has fraction p_{ij} of its instances in class i.
Information entropy of the split:
  \sum_{j=1}^{r} \frac{|S_j|}{|S|} \left( -\sum_{i=1}^{k} p_{ij} \log p_{ij} \right)
Gini index of the split:
  \sum_{j=1}^{r} \frac{|S_j|}{|S|} \left( 1 - \sum_{i=1}^{k} p_{ij}^2 \right)
[Figure: entropy and Gini impurity as functions of the class probability for r = 1, k = 2.]
Information gain
Information gain on partitioning S into r subsets
= Impurity(S) minus the sum of the weighted impurities of the subsets:
  Gain(S, S_1, ..., S_r) = Entropy(S) - \sum_{j=1}^{r} \frac{|S_j|}{|S|} Entropy(S_j)
Information gain: example
K = 2, |S| = 100, p_1 = 0.6, p_2 = 0.4
  E(S) = -0.6 log 0.6 - 0.4 log 0.4 = 0.29  (base-10 logarithms)
S is split into S_1 and S_2:
  |S_1| = 70, p_1 = 0.8, p_2 = 0.2
  E(S_1) = -0.8 log 0.8 - 0.2 log 0.2 = 0.21
  |S_2| = 30, p_1 = 0.13, p_2 = 0.87
  E(S_2) = -0.13 log 0.13 - 0.87 log 0.87 = 0.16
Information gain: E(S) - (0.7 E(S_1) + 0.3 E(S_2)) = 0.1
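The same computation as a short Python check, using base-10 logarithms so the numbers match the slide up to rounding:

```python
import math

def entropy10(ps):
    return -sum(p * math.log10(p) for p in ps if p > 0)

E_S  = entropy10([0.6, 0.4])    # ~0.29
E_S1 = entropy10([0.8, 0.2])    # ~0.22
E_S2 = entropy10([0.13, 0.87])  # ~0.17
gain = E_S - (0.7 * E_S1 + 0.3 * E_S2)
print(round(gain, 2))           # ~0.09, i.e. about 0.1
```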
Meta learning methods
No single classifier is good in all cases
Difficult to evaluate in advance which conditions hold
Meta learning: combine the predictions of the component classifiers
Voting: sum up votes of component classifiers
Combiners: learn a new classifier on the outcomes of
previous ones:
Boosting: staged classifiers
Disadvantage: interpretation hard
Knowledge probing: learn single classifier to mimic
meta classifier
SPRINT (Serial PaRallelizable INduction of decision Trees)
Decision-tree classifier for data mining
Design goals:
Able to handle large disk-resident training sets
No restrictions on training-set size
Easily parallelizable
Example
Example data:
  Age | Car Type | Risk
  42  | family   | Low
  18  | truck    | High
  57  | sports   | High
  21  | sports   | High
  28  | family   | Low
  72  | truck    | Low

Decision tree:
  Age < 25?
    yes: Risk = High
    no:  CarType in {sports}?
           yes: Risk = High
           no:  Risk = Low
Building tree
GrowTree(TrainingData D)
  Partition(D);

Partition(Data D)
  if (all points in D belong to the same class) then
    return;
  for each attribute A do
    evaluate splits on attribute A;
  use best split found to partition D into D1 and D2;
  Partition(D1);
  Partition(D2);
Data Setup: Attribute Lists
One list for each attribute
Entries in an attribute list consist of:
  attribute value
  class value
  record id (rid)
Example list (Age):
  Age | Risk | RID
  17  | High | 1
  20  | High | 5
  23  | High | 0
  32  | Low  | 4
  43  | High | 2
  68  | Low  | 3
Lists for continuous attributes are kept in sorted order
Lists may be disk-resident
Each leaf node has its own set of attribute lists representing the training examples belonging to that leaf
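A rough sketch of how such attribute lists could be built in Python, using the car-insurance example from these slides; the dict-based representation and function name are illustrative, not SPRINT's actual data structures.

```python
def build_attribute_lists(records, attributes, class_attr="Risk"):
    # Each attribute-list entry is (attribute value, class label, record id).
    lists = {}
    for a in attributes:
        entries = [(rec[a], rec[class_attr], rid) for rid, rec in enumerate(records)]
        # Lists for continuous attributes are kept in sorted order.
        if all(isinstance(v, (int, float)) for v, _, _ in entries):
            entries.sort(key=lambda e: e[0])
        lists[a] = entries
    return lists

data = [
    {"Age": 23, "CarType": "family", "Risk": "High"},
    {"Age": 17, "CarType": "sports", "Risk": "High"},
    {"Age": 43, "CarType": "sports", "Risk": "High"},
    {"Age": 68, "CarType": "family", "Risk": "Low"},
    {"Age": 32, "CarType": "truck",  "Risk": "Low"},
    {"Age": 20, "CarType": "family", "Risk": "High"},
]
lists = build_attribute_lists(data, ["Age", "CarType"])
print(lists["Age"][0])  # (17, 'High', 1)
```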
Attribute Lists: Example
Training data:
  RID | Age | Car Type | Risk
  0   | 23  | family   | High
  1   | 17  | sports   | High
  2   | 43  | sports   | High
  3   | 68  | family   | Low
  4   | 32  | truck    | Low
  5   | 20  | family   | High

Initial attribute lists for the root node:
  Age | Risk | RID        Car Type | Risk | RID
  17  | High | 1          family   | High | 0
  20  | High | 5          sports   | High | 1
  23  | High | 0          sports   | High | 2
  32  | Low  | 4          family   | Low  | 3
  43  | High | 2          truck    | Low  | 4
  68  | Low  | 3          family   | High | 5
Evaluating Split Points
Gini index:
  If data D contains examples from c classes,
    Gini(D) = 1 - \sum_{j=1}^{c} p_j^2
  where p_j is the relative frequency of class j in D.
  If D is split into D1 and D2 with n1 and n2 tuples respectively,
    Gini_split(D) = (n1/n) Gini(D1) + (n2/n) Gini(D2)
Note: only class frequencies are needed to compute the index.
Finding Split Points
For each attribute A do
  evaluate splits on attribute A using its attribute list
Keep the split with the lowest GINI index
Finding Split Points: Continuous Attrib.
Consider splits of the form: value(A) < x
  Example: Age < 17
Evaluate this split form for every value in the attribute list.
To evaluate splits on attribute A for a given tree node:
  initialize the class histogram of the left child to zeroes;
  initialize the class histogram of the right child to the same as its parent;
  for each record in the attribute list do
    evaluate the splitting index for value(A) < record.value;
    using the class label of the record, update the class histograms;
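A hedged Python sketch of this scan over a sorted attribute list, using the Gini_split form defined earlier; the helper names are made up for illustration, and duplicate attribute values are ignored for simplicity.

```python
from collections import Counter

def gini(hist):
    n = sum(hist.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in hist.values())

def best_continuous_split(attr_list):
    """attr_list: sorted list of (value, class_label, rid) tuples."""
    left = Counter()                                 # left histogram starts empty
    right = Counter(lbl for _, lbl, _ in attr_list)  # right starts as the parent histogram
    n = len(attr_list)
    best = None
    for value, label, _ in attr_list:
        n_left, n_right = sum(left.values()), sum(right.values())
        if n_left > 0:  # the split "attr < value" puts all earlier records on the left
            g = (n_left / n) * gini(left) + (n_right / n) * gini(right)
            if best is None or g < best[0]:
                best = (g, value)
        left[label] += 1
        right[label] -= 1
    return best  # (gini_split, threshold) for the best "attr < threshold"

age_list = [(17, "High", 1), (20, "High", 5), (23, "High", 0),
            (32, "Low", 4), (43, "High", 2), (68, "Low", 3)]
print(best_continuous_split(age_list))  # (0.222..., 32), matching the example below
```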
Finding Split Points: Continuous Attrib.
Attribute list (Age), scanned in sorted order:
  Age | Risk | RID
  17  | High | 1
  20  | High | 5
  23  | High | 0
  32  | Low  | 4
  43  | High | 2
  68  | Low  | 3

State of the class histograms at selected cursor positions in the scan:
  Position 0 (Age < 17):  Left: High 0, Low 0   Right: High 4, Low 2   GINI = undef
  Position 1 (Age < 20):  Left: High 1, Low 0   Right: High 3, Low 2   GINI = 0.4
  Position 3 (Age < 32):  Left: High 3, Low 0   Right: High 1, Low 2   GINI = 0.222
  Position 6 (end):       Left: High 4, Low 2   Right: High 0, Low 0   GINI = undef
Finding Split Points: Categorical Attrib.
Consider splits of the form: value(A) ∈ {x1, x2, ..., xn}
  Example: CarType ∈ {family, sports}
Evaluate this split form for subsets of domain(A).
To evaluate splits on attribute A for a given tree node:
  initialize the class/value matrix of the node to zeroes;
  for each record in the attribute list do
    increment the appropriate count in the matrix;
  evaluate the splitting index for various subsets using the constructed matrix;
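A Python sketch of the class/value-matrix approach, restricted to singleton subsets as on the next slide; the names are illustrative.

```python
from collections import defaultdict

def gini_from_counts(counts):
    n = sum(counts.values())
    return 0.0 if n == 0 else 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_categorical_split(attr_list):
    """attr_list: list of (value, class_label, rid); returns the best (gini_split, subset)."""
    # Build the class/value matrix in one pass over the attribute list.
    matrix = defaultdict(lambda: defaultdict(int))
    for value, label, _ in attr_list:
        matrix[value][label] += 1
    n = len(attr_list)
    total = defaultdict(int)
    for value in matrix:
        for label, c in matrix[value].items():
            total[label] += c
    best = None
    for value in matrix:  # here: only singleton subsets {value}
        left = matrix[value]
        right = {lbl: total[lbl] - left.get(lbl, 0) for lbl in total}
        n_left = sum(left.values())
        g = ((n_left / n) * gini_from_counts(left) +
             ((n - n_left) / n) * gini_from_counts(right))
        if best is None or g < best[0]:
            best = (g, {value})
    return best

car_list = [("family", "High", 0), ("sports", "High", 1), ("sports", "High", 2),
            ("family", "Low", 3), ("truck", "Low", 4), ("family", "High", 5)]
print(best_categorical_split(car_list))  # (0.266..., {'truck'}), matching the example below
```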
Finding Split Points: Categorical Attrib.
Attribute list (Car Type):
  Car Type | Risk | RID
  family   | High | 0
  sports   | High | 1
  sports   | High | 2
  family   | Low  | 3
  truck    | Low  | 4
  family   | High | 5

Class/value matrix:
           High | Low
  family |  2   |  1
  sports |  2   |  0
  truck  |  0   |  1

Candidate splits and the resulting class histograms:
  CarType in {family}: Left: High 2, Low 1   Right: High 2, Low 1   GINI = 0.444
  CarType in {sports}: Left: High 2, Low 0   Right: High 2, Low 2   GINI = 0.333
  CarType in {truck}:  Left: High 0, Low 1   Right: High 4, Low 1   GINI = 0.267
Performing the Splits
The attribute lists of every node must be divided among the two children.
To split the attribute lists of a given node:
  for the list of the attribute used to split this node do
    use the split test to divide the records;
    collect the record ids;
  build a hash table from the collected ids;
  for the remaining attribute lists do
    use the hash table to divide each list;
    build class histograms for each new leaf;
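A rough Python sketch of this step, reusing the (value, label, rid) attribute-list representation from the earlier sketches; function and variable names are illustrative.

```python
def split_node(attr_lists, split_attr, test):
    """attr_lists: dict attr -> list of (value, label, rid); test(value) is True if the record goes left."""
    # 1. Split the list of the splitting attribute and remember where each rid went.
    side_of = {}                                   # the "hash table" of record ids
    left_lists = {a: [] for a in attr_lists}
    right_lists = {a: [] for a in attr_lists}
    for value, label, rid in attr_lists[split_attr]:
        side = left_lists if test(value) else right_lists
        side[split_attr].append((value, label, rid))
        side_of[rid] = (side is left_lists)
    # 2. Split every other list by probing the hash table, preserving the original order.
    for a, lst in attr_lists.items():
        if a == split_attr:
            continue
        for value, label, rid in lst:
            (left_lists if side_of[rid] else right_lists)[a].append((value, label, rid))
    return left_lists, right_lists

lists = {
    "Age": [(17, "High", 1), (20, "High", 5), (23, "High", 0),
            (32, "Low", 4), (43, "High", 2), (68, "Low", 3)],
    "CarType": [("family", "High", 0), ("sports", "High", 1), ("sports", "High", 2),
                ("family", "Low", 3), ("truck", "Low", 4), ("family", "High", 5)],
}
left, right = split_node(lists, "Age", lambda v: v < 32)
print([rid for _, _, rid in left["CarType"]])  # [0, 1, 5]: these records go left
```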
Performing the Splits: Example
Attribute lists at the node before the split:
  Age | Risk | RID        Car Type | Risk | RID
  17  | High | 1          family   | High | 0
  20  | High | 5          sports   | High | 1
  23  | High | 0          sports   | High | 2
  32  | Low  | 4          family   | Low  | 3
  43  | High | 2          truck    | Low  | 4
  68  | Low  | 3          family   | High | 5

Split test: Age < 32

Hash table (record id -> child):
  0 Left, 1 Left, 2 Right, 3 Right, 4 Right, 5 Left

Left child attribute lists:
  Age | Risk | RID        Car Type | Risk | RID
  17  | High | 1          family   | High | 0
  20  | High | 5          sports   | High | 1
  23  | High | 0          family   | High | 5

Right child attribute lists:
  Age | Risk | RID        Car Type | Risk | RID
  32  | Low  | 4          sports   | High | 2
  43  | High | 2          family   | Low  | 3
  68  | Low  | 3          truck    | Low  | 4
Sprint: summary
  Each node of the decision tree classifier requires examining possible splits on each value of each attribute.
  After choosing a split attribute, all the data must be partitioned into the resulting subsets.
  This search needs to be made efficient.
  Evaluating splits on numeric attributes:
    Sort on attribute value, incrementally evaluate gini
  Splits on categorical attributes:
    For each subset, find gini and choose the best
    For large sets, use a greedy method
Approaches to prevent overfitting
Stop growing the tree beyond a certain point
First over-fit, then post-prune (more widely used)
  Tree building divided into phases:
    Growth phase
    Prune phase
Hard to decide when to stop growing the tree, so the second approach is more widely used.
Criteria for finding the correct final tree size:
  Cross validation with separate test data
  Use all data for training, but apply a statistical test to decide the right size
  Use some criterion function to choose the best size
    Example: Minimum description length (MDL) criterion
Cross validation approach:
  Partition the dataset into two disjoint parts:
    1. Training set, used for building the tree.
    2. Validation set, used for pruning the tree.
  Build the tree using the training set.
  Evaluate the tree on the validation set, and at each leaf and internal node keep a count of the correctly labeled data.
  Starting bottom-up, prune nodes whose error is less than that of their children.
Cross validation (contd.)
Need a large validation set to smooth out the over-fitting of the training data. Rule of thumb: one-third.
What if the training data set size is limited?
  Generate many different partitions of the data.
  n-fold cross validation: partition the training data into n parts D1, D2, ..., Dn.
  Train n classifiers, with D - Di as the training set and Di as the test set.
  Average the results.
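A compact sketch of n-fold cross validation in Python; the train and evaluate callables are placeholders for whatever classifier is being evaluated.

```python
def n_fold_cv(data, labels, n, train, evaluate):
    """Average score over n folds; train(X, y) -> model, evaluate(model, X, y) -> score."""
    scores = []
    for i in range(n):
        test_idx = set(range(i, len(data), n))  # every n-th record forms fold i
        train_X = [x for j, x in enumerate(data) if j not in test_idx]
        train_y = [y for j, y in enumerate(labels) if j not in test_idx]
        test_X = [x for j, x in enumerate(data) if j in test_idx]
        test_y = [y for j, y in enumerate(labels) if j in test_idx]
        model = train(train_X, train_y)
        scores.append(evaluate(model, test_X, test_y))
    return sum(scores) / n
```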
Rule-based pruning
Tree-based pruning limits the kind of pruning possible: if a node is pruned, all subtrees under it have to be pruned.
Rule-based: for each leaf of the tree, extract a rule using a conjunction of all tests up to the root.
On the validation set, independently prune tests from each rule to get the highest accuracy for that rule.
Sort rules by decreasing accuracy.
MDL-based pruning
Idea: a branch of the tree is over-fitted if the training examples that fall under it can be explicitly enumerated (with their classes) in less space than the branch occupies.
Prune a branch if it is over-fitted.
Philosophy: use the tree that minimizes the description length of the training data.
Regression trees
Decision trees with continuous class labels:
  Regression trees approximate the function with piece-wise constant regions.
Split criteria for regression trees:
  Predicted value for a set S = average of all values in S
  Error: sum of the squared errors of each member of S from the predicted average
  Pick the split with the smallest error.
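A tiny Python sketch of this error computation for one candidate split (names are illustrative):

```python
def sq_error(values):
    # Sum of squared deviations from the mean, i.e. the error of predicting the average.
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def split_error(left_values, right_values):
    # Error of a candidate split = total squared error of the two pieces.
    return sq_error(left_values) + sq_error(right_values)

print(split_error([1.0, 1.2, 0.9], [5.0, 5.5, 4.8]))  # low error: a good split
print(split_error([1.0, 5.0, 0.9], [1.2, 5.5, 4.8]))  # much higher error: a poor split
```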
Issues
Multiple splits on continuous attributes [Fayyad
93, Multi-interval discretization of continuous
attributes]
Multi attribute tests on nodes to handle
correlated attributes
multivariate linear splits [Oblique trees, Murthy 94]
Methods of handling missing values
assume majority value
take most probable path
Allowing varying costs for different attributes
Pros and Cons of decision
trees
•Pros
+ Reasonable training time
+ Fast application
+ Easy to interpret
+ Easy to implement
+ Can handle large number of
features
•Cons
- Cannot handle complicated relationships between features
- Simple decision boundaries
- Problems with lots of missing data
More information: http://www.recursive-partitioning.com/
Clustering or Unsupervised learning

Distance functions
Numeric data: Euclidean, Manhattan distances
  Minkowski metric: [sum_i |xi - yi|^m]^(1/m)
  Larger m gives higher weight to larger distances
Categorical data: 0/1 to indicate presence/absence
  Euclidean distance: gives equal weight to 1-1 and 0-0 matches
  Hamming distance (number of mismatches)
  Jaccard coefficient: number of 1-1 matches / number of positions with at least one 1 (0-0 matches are not important)
Combined numeric and categorical data: weighted normalized distance
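A few of these distance functions sketched in Python; binary vectors are represented as lists of 0/1, and the names are illustrative.

```python
def minkowski(x, y, m=2):
    # Minkowski distance; m=2 gives Euclidean, m=1 gives Manhattan.
    return sum(abs(a - b) ** m for a, b in zip(x, y)) ** (1.0 / m)

def hamming(x, y):
    # Number of positions where the binary vectors differ.
    return sum(1 for a, b in zip(x, y) if a != b)

def jaccard(x, y):
    # Similarity in 1s divided by the number of positions with at least one 1.
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    return both / either if either else 1.0

print(minkowski([1, 2], [4, 6]))            # 5.0 (Euclidean)
print(hamming([1, 0, 1, 1], [1, 1, 0, 1]))  # 2
print(jaccard([1, 0, 1, 1], [1, 1, 0, 1]))  # 0.5
```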
Distance functions on high dimensional data
  Examples: time series, text, images
  Euclidean measures make all points appear equally far apart
  Reduce the number of dimensions:
    choose a subset of the original features using random projections or feature selection techniques
    transform the original features using statistical methods like Principal Component Analysis
  Define domain-specific similarity measures: e.g. for images, define features like the number of objects or a color histogram; for time series, define shape-based measures.
  Define non-distance-based (model-based) clustering methods
Clustering methods
Hierarchical clustering
  agglomerative vs. divisive
  single link vs. complete link
Partitional clustering
  distance-based: K-means
  model-based: EM
  density-based
Partitional methods: K-means
Criterion: minimize the sum of squared distances
  between each point and the centroid of its cluster, or
  between each pair of points in the cluster
Algorithm:
  Select an initial partition with K clusters:
    random, first K, or K well-separated points
  Repeat until stabilization:
    Assign each point to the closest cluster center
    Generate new cluster centers
    Adjust clusters by merging/splitting
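A minimal K-means sketch in Python for numeric vectors, with random initialization and without the merge/split adjustment step; all names are illustrative.

```python
import random

def kmeans(points, k, iters=100):
    """points: list of tuples; returns (centers, assignment list)."""
    centers = random.sample(points, k)
    assign = None
    for _ in range(iters):
        # Assignment step: each point goes to its closest center.
        new_assign = [min(range(k),
                          key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centers[c])))
                      for pt in points]
        if new_assign == assign:  # converged: assignments stopped changing
            break
        assign = new_assign
        # Update step: each center moves to the mean of its assigned points.
        for c in range(k):
            members = [pt for pt, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centers, assign

pts = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.8)]
print(kmeans(pts, 2))
```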
Properties
May not reach the global optimum
Converges fast in practice: guaranteed for certain forms of the optimization function
Complexity: O(KndI), where
  I = number of iterations, n = number of points, d = number of dimensions, K = number of clusters
Database research on scalable algorithms:
  Birch: one/two passes over the data, keeping an R-tree-like index in memory [Sigmod 96]
Model based clustering
Assume the data is generated from K probability distributions
  Typically Gaussian distributions
A soft or probabilistic version of K-means clustering
Need to find the distribution parameters
EM Algorithm
EM Algorithm
Initialize K cluster centers
Iterate between two steps
  Expectation step: assign points to clusters
    P(d_i \in c_k) = Pr(c_k | d_i) = Pr(c_k) Pr(d_i | c_k) / Pr(d_i)
    Pr(d_i | c_k) = N(\mu_k, \sigma_k)(d_i)
  Maximization step: estimate the model parameters
    \mu_k = \frac{1}{m} \sum_{i=1}^{m} \frac{d_i \, P(d_i \in c_k)}{\sum_j P(d_i \in c_j)}
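A hedged sketch of these two steps for a one-dimensional Gaussian mixture in Python, using the standard responsibility-weighted updates (which also re-estimate priors and variances) rather than only the mean formula on the slide; all names are illustrative.

```python
import math
import random

def gaussian(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def em_1d(data, k, iters=50):
    # Initialize means from random points, unit variances, uniform priors.
    mus = random.sample(data, k)
    sigmas = [1.0] * k
    priors = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of cluster c for point x, i.e. Pr(c | x).
        resp = []
        for x in data:
            w = [priors[c] * gaussian(x, mus[c], sigmas[c]) for c in range(k)]
            total = sum(w)
            resp.append([wi / total for wi in w])
        # M-step: re-estimate priors, means, and variances from the responsibilities.
        for c in range(k):
            rc = [r[c] for r in resp]
            nc = sum(rc)
            priors[c] = nc / len(data)
            mus[c] = sum(r * x for r, x in zip(rc, data)) / nc
            sigmas[c] = math.sqrt(sum(r * (x - mus[c]) ** 2 for r, x in zip(rc, data)) / nc) or 1e-6
    return mus, sigmas, priors

data = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
print(em_1d(data, 2))
```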
Properties
May not reach the global optimum
Converges fast in practice: guaranteed for certain forms of the optimization function
Complexity: O(KndI), where
  I = number of iterations, n = number of points, d = number of dimensions, K = number of clusters
Scalable clustering algorithms
Birch: one/two passes over the data, keeping an R-tree-like index in memory [Sigmod 96]
Fayyad and Bradley: sample repetitively and update a summary of the clusters stored in memory (K-means and EM) [KDD 98]
Dasgupta 99: recent theoretical breakthrough, finds Gaussian clusters with guaranteed performance using random projections
To Learn More
Books
  Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 1999
  Usama Fayyad et al. (eds), Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996
  Tom Mitchell, Machine Learning, McGraw-Hill
Software
  Public domain
    Weka 3: data mining algorithms in Java (http://www.cs.waikato.ac.nz/~ml/weka)
      classification, regression
    MLC++: data mining tools in C++
      mainly classification
  Free for universities
    try convincing IBM to give it free!
Datasets: follow links from www.kdnuggets.com to the UC Irvine site
Resources
http://www.kdnuggets.com
  Great site with links to software, datasets, etc. Be sure to visit it.
http://www.cs.bham.ac.uk/~anp/TheDataMine.html
OLAP: http://altaplana.com/olap/
SIGKDD: http://www.acm.org/sigkdd
Data Mining and Knowledge Discovery journal: http://www.research.microsoft.com/research/datamine/
Communications of the ACM Special Issue on Data Mining, Nov 1996
Resources at IITB
http://www.cse.iitb.ernet.in/~dbms
IITB DB group home page
http://www.it.iitb.ernet.in/~sunita/it642
Data Warehousing and Data Mining course
offered by Prof. Sunita Sarawagi at IITB