#### Transcript Decision Trees

Data Mining and Knowledge Acquisition — Chapter 5 — BIS 541 — 2012/2013 Spring

**Chapter 7. Classification and Prediction**
- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- Classification by neural networks
- Classification by support vector machines (SVM)
- Classification based on concepts from association rule mining
- Other classification methods
- Prediction
- Classification accuracy
- Summary

**Supervised vs. Unsupervised Learning**
- Supervised learning (classification): the training data (observations, measurements, etc.) are accompanied by labels indicating the class of each observation; new data is classified based on the training set.
- Unsupervised learning (clustering): the class labels of the training data are unknown; given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.

**Prediction Problems: Classification vs. Numeric Prediction**
- Classification: predicts categorical class labels (discrete or nominal); constructs a model from the training set and the values (class labels) of a classifying attribute, and uses it to classify new data.
- Numeric prediction: models continuous-valued functions, i.e., predicts unknown or missing values.
- Typical applications: credit/loan approval; medical diagnosis (is a tumor cancerous or benign?); fraud detection (is a transaction fraudulent?); web page categorization (which category does a page belong to?).

**Classification: A Two-Step Process**
- Model construction: describing a set of predetermined classes. Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute. The set of tuples used for model construction is the training set. The model is represented as classification rules, decision trees, or mathematical formulae.
- Model usage: classifying future or unknown objects. First estimate the accuracy of the model: the known label of each test sample is compared with the model's prediction, and the accuracy rate is the percentage of test set samples correctly classified. The test set must be independent of the training set, otherwise over-fitting will occur. If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.

**Classification Process (1): Model Construction**

Training data:

| NAME | RANK | YEARS | TENURED |
|------|------|-------|---------|
| Mike | Assistant Prof | 3 | no |
| Mary | Assistant Prof | 7 | yes |
| Bill | Professor | 2 | yes |
| Jim | Associate Prof | 7 | yes |
| Dave | Assistant Prof | 6 | no |
| Anne | Associate Prof | 3 | no |

A classification algorithm produces the classifier (model):

IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

**Classification Process (2): Use the Model in Prediction**

Testing data:

| NAME | RANK | YEARS | TENURED |
|------|------|-------|---------|
| Tom | Assistant Prof | 2 | no |
| Merlisa | Associate Prof | 7 | no |
| George | Professor | 5 | yes |
| Joseph | Assistant Prof | 7 | yes |

Unseen data: (Jeff, Professor, 4). Tenured?
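To make the two-step process concrete, here is a minimal sketch in Python, assuming scikit-learn is available. The toy data and the (Jeff, Professor, 4) query come from the slides; the ordinal `rank_code` encoding is an illustrative choice, not part of the original.

```python
# A minimal sketch of the two-step process (construction, then usage),
# assuming scikit-learn; the data is the slides' toy tenure example.
from sklearn.tree import DecisionTreeClassifier

# Training set: (rank, years) -> tenured; rank encoded as an ordinal code.
rank_code = {"Assistant Prof": 0, "Associate Prof": 1, "Professor": 2}
X_train = [[rank_code[r], y] for r, y in [
    ("Assistant Prof", 3), ("Assistant Prof", 7), ("Professor", 2),
    ("Associate Prof", 7), ("Assistant Prof", 6), ("Associate Prof", 3)]]
y_train = ["no", "yes", "yes", "yes", "no", "no"]

# Step 1: model construction.
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: model usage on unseen data, here (Jeff, Professor, 4).
print(model.predict([[rank_code["Professor"], 4]]))  # 'yes', per the slide's rule
```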
**Issues Regarding Classification and Prediction (1): Data Preparation**
- Data cleaning: preprocess data in order to reduce noise and handle missing values.
- Relevance analysis (feature selection): remove the irrelevant or redundant attributes.
- Data transformation: generalize and/or normalize data.

**Issues Regarding Classification and Prediction (2): Evaluating Classification Methods**
- Predictive accuracy.
- Speed and scalability: time to construct the model; time to use the model.
- Robustness: handling noise and missing values.
- Scalability: efficiency in disk-resident databases.
- Interpretability: understanding and insight provided by the model.
- Goodness of rules: decision tree size; compactness of classification rules.

**Classification by Decision Tree Induction**
- Decision tree: a flow-chart-like tree structure. An internal node denotes a test on an attribute, a branch represents an outcome of the test, and leaf nodes represent class labels or class distributions.
- Decision tree generation consists of two phases: tree construction (at the start, all the training examples are at the root; examples are partitioned recursively based on selected attributes) and tree pruning (identify and remove branches that reflect noise or outliers).
- Once the tree is built, it is used to classify unknown samples.

**Training Dataset**

This follows an example from Quinlan's ID3 (Han, Table 7.1). Original data from DMPML Ch 4 Sec 4.3 pp 8-9, 89-94.

| age | income | student | credit_rating | buys_computer |
|-----|--------|---------|---------------|---------------|
| <=30 | high | no | fair | no |
| <=30 | high | no | excellent | no |
| 31…40 | high | no | fair | yes |
| >40 | medium | no | fair | yes |
| >40 | low | yes | fair | yes |
| >40 | low | yes | excellent | no |
| 31…40 | low | yes | excellent | yes |
| <=30 | medium | no | fair | no |
| <=30 | low | yes | fair | yes |
| >40 | medium | yes | fair | yes |
| <=30 | medium | yes | excellent | yes |
| 31…40 | medium | no | excellent | yes |
| 31…40 | high | yes | fair | yes |
| >40 | medium | no | excellent | no |

**Output: A Decision Tree for "buys_computer"**

14 cases at the root:

```
age?
├── <=30  (5 cases) → student?
│       ├── no        → no
│       └── yes       → yes
├── 31…40 (4 cases) → yes
└── >40   (5 cases) → credit_rating?
        ├── excellent → no
        └── fair      → yes
```
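This tree can be reproduced in code. A sketch, assuming pandas and scikit-learn: sklearn grows binary trees over one-hot inputs, so the printed shape differs from ID3's multiway splits, but the `entropy` criterion applies the same information-gain idea to the same 14 tuples.

```python
# Fitting the "buys_computer" data with scikit-learn's entropy criterion
# (ID3-style gain); assumes pandas and scikit-learn are installed.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

rows = [  # the 14 training tuples from Han Table 7.1
    ("<=30","high","no","fair","no"), ("<=30","high","no","excellent","no"),
    ("31..40","high","no","fair","yes"), (">40","medium","no","fair","yes"),
    (">40","low","yes","fair","yes"), (">40","low","yes","excellent","no"),
    ("31..40","low","yes","excellent","yes"), ("<=30","medium","no","fair","no"),
    ("<=30","low","yes","fair","yes"), (">40","medium","yes","fair","yes"),
    ("<=30","medium","yes","excellent","yes"), ("31..40","medium","no","excellent","yes"),
    ("31..40","high","yes","fair","yes"), (">40","medium","no","excellent","no"),
]
df = pd.DataFrame(rows, columns=["age", "income", "student", "credit", "buys"])
X = pd.get_dummies(df[["age", "income", "student", "credit"]])  # one-hot inputs
clf = DecisionTreeClassifier(criterion="entropy").fit(X, df["buys"])
print(export_text(clf, feature_names=list(X.columns)))
```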
**In Practice**

(The original slide shows a timeline: for the i-th record, the input information I_i is known at some past time, and the output O_i is assigned, yes or no, at a later time. For all objects in the training data set, i = 1..14, both input and output are known.)

**In Practice (scoring new data)**

For a new customer n, the input information is known at the current time, but the output O_n is not yet known; the customer has yet to be classified as yes or no before her actual buying behavior is realized. The value of a data mining study lies in predicting buying behavior beforehand.

**ID3 Algorithm for Decision Tree Induction**
- ID3 algorithm: Quinlan (1986).
- The tree is constructed in a top-down, recursive, divide-and-conquer manner.
- At the start, all the training examples are at the root.
- Attributes are categorical (continuous-valued attributes are discretized in advance).
- Examples are partitioned recursively based on selected attributes.
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
- Conditions for stopping partitioning: all samples for a given node belong to the same class; there are no remaining attributes for further partitioning (majority voting is employed to label the leaf); there are no samples left.

**Entropy of a Simple Event**
- The expected amount of information needed to predict an event s is I(s) = -log2(p_s) = log2(1/p_s), where p_s is the probability of event s.
- This quantity is high for improbable events and low for probable ones:
  - As p_s goes to zero, the event becomes rarer and rarer, and its information content approaches infinity.
  - As p_s goes to one, the event becomes more and more certain, and its information content approaches zero.
- Hence entropy is a measure of the randomness, disorder, or rareness of an event.

**Computational Formulas**

(The original slide plots -log2(p) against the probability p: the curve falls from infinity at p = 0 through 2 at p = 0.25 and 1 at p = 0.5 to 0 at p = 1.)

- log2(1) = 0, log2(2) = 1, log2(0.5) = log2(2^-1) = -1, log2(0.25) = -2.
- Change of base: log_a(x) = log_b(x) / log_b(a) for any bases a and b. Choosing b = e makes it convenient to compute logarithms of any base: log2(x) = ln(x) / ln(2).

**Entropy of a Simple Event (bits)**

Entropy can be interpreted as the number of bits needed to encode that event:
- If p = 1/2, then -log2(1/2) = 1 bit is required (0 and 1 with equal probability).
- If p = 1/4, then -log2(1/4) = 2 bits are required (00, 01, 10, 11, each equally likely; only one represents the specific event).
- If p = 1/8, then -log2(1/8) = 3 bits are required (000 through 111, each equally likely; only one represents the specific event).
- If p = 1/3, then -log2(1/3) = ln(3)/ln(2) ≈ 1.58 bits are required.

**Composite Events**

Consider the four events A, B, C, D:
- A: tossing a fair coin, where P(H) = P(T) = 1/2.
- B: tossing an unfair coin, where P(H) = 1/4, P(T) = 3/4.
- C: tossing an unfair coin, where P(H) = 0.001, P(T) = 0.999.
- D: tossing an unfair coin, where P(H) = 0.0, P(T) = 1.0.

Which of these events is most certain? How much information is needed to guess the result of each toss in A through D? What is the expected information to predict the outcome of events A, B, C, D respectively?
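Before moving on to composite events, the simple-event numbers above are easy to verify; a minimal sketch using only the Python standard library:

```python
# Information content I(s) = -log2(p) of a simple event: a quick check of
# the figures quoted above (1 bit for p=1/2, 2 bits for p=1/4, ...).
from math import log2

for p in (1/2, 1/4, 1/8, 1/3):
    print(f"p = {p:.3f}  ->  -log2(p) = {-log2(p):.2f} bits")
```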
**Entropy of Composite Events**
- The average (expected) information is highest when each outcome is equally likely (event A).
- The expected information required to guess the outcome falls as the probability of head goes to 0 or the probability of tail goes to 1 (moving toward case D): the composite event of tossing the coin becomes more and more certain.
- For case D, no information is needed, as the answer is already known to be tail.
- In general, when the probability of head is p, the expected information to predict the outcome is the weighted average of the simple events' information, weighted by their probabilities:
  - ent(S) = p·(-log2 p) + (1-p)·(-log2(1-p)) = -p·log2(p) - (1-p)·log2(1-p)
- The lower the entropy, the more predictable the outcome of the event.

**Examples**
- When the event is certain (p_H = 1, p_T = 0, or p_H = 0, p_T = 1): ent(S) = -1·log2(1) - 0·log2(0) = 0. Note that lim_{x→0+} x·log2(x) = 0.
- For a fair coin (p_H = p_T = 0.5): ent(S) = -(1/2)·log2(1/2) - (1/2)·log2(1/2) = 1.
- If head and tail probabilities are unequal, the entropy is between 0 and 1.

(The original slide plots ent(S) = -p·log2(p) - (1-p)·log2(1-p) against the probability of head: the curve rises from 0 at p = 0 to its maximum of 1 at p = 0.5 and falls back to 0 at p = 1.)

**In General**

Entropy is a measure of the (im)purity of a sample variable S:

Ent(S) = Σ_{s∈S} p_s·(-log2 p_s) = -Σ_{s∈S} p_s·log2(p_s)

where s is a value of S (an element of the sample space) and p_s is its estimated or subjective probability. Note that Σ_s p_s = 1.

**Information Needed to Classify an Object**
- Class entropy is computed in a similar manner. Entropy is 0 if all members of S belong to the same class: no information is needed to classify an object.
- The entropy of a partition is the weighted average of the entropies of all classes. With S objects in total and two classes C1 and C2 with cardinalities S1 and S2:
  - I(S1, S2) = -(S1/S)·log2(S1/S) - (S2/S)·log2(S2/S)
- Or in general, with m classes C1, C2, …, Cm:
  - I(S1, S2, …, Sm) = -(S1/S)·log2(S1/S) - (S2/S)·log2(S2/S) - … - (Sm/S)·log2(Sm/S)
- The probability of an object belonging to class i is P_i = S_i/S.

**Example (cont.): At the Root of the Tree**
- There are 14 cases, S = 14: 9 yes (Y), S1 = 9, p_Y = 9/14; 5 no (N), S2 = 5, p_N = 5/14. Ys and Ns are almost equally likely.
- How much information, on average, is needed to classify a person as buyer or not, without knowing any characteristics such as age or income?
  - I(S1, S2) = (S1/S)·(-log2(S1/S)) + (S2/S)·(-log2(S2/S))
  - I(9, 5) = (9/14)·(-log2(9/14)) + (5/14)·(-log2(5/14)) = (9/14)·(0.637) + (5/14)·(1.485) = 0.940 bits of information, close to 1.

**Expected Information to Classify with an Attribute (1)**
- An attribute A with n distinct values a1, a2, …, an partitions the dataset into n distinct parts.
- S_{i,j} is the number of objects in partition i (i = 1..n) belonging to class C_j (j = 1..m).
- The expected information to classify an object knowing the value of attribute A is the weighted average of the entropies of all partitions, weighted by the frequency of each partition.

**Expected Information to Classify with an Attribute (2)**

E(A) = I(A) = Σ_{i=1}^{n} (|a_i|/S)·I(S_{i,1}, …, S_{i,m})
            = (|a_1|/S)·I(S_{1,1}, …, S_{1,m}) + (|a_2|/S)·I(S_{2,1}, …, S_{2,m}) + … + (|a_n|/S)·I(S_{n,1}, …, S_{n,m})

where the entropy of any partition i is

I(S_{i,1}, …, S_{i,m}) = Σ_{j=1}^{m} (S_{ij}/|a_i|)·(-log2(S_{ij}/|a_i|))

and |a_i| = Σ_{j=1}^{m} S_{ij} is the number of objects in partition i. Here S_{ij}/|a_i| is the probability of class j in partition i.

**Information Gain**

The gain in information from using the distinct values of attribute A is the reduction in entropy, i.e., in the information needed to classify an object:

Gain(A) = I(S1, …, Sm) − E(A)

that is, the average information needed without knowing A minus the average information needed knowing A.
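These formulas are compact enough to implement directly. A sketch using only the standard library, reused in the worked examples below; the names `info` and `expected_info` are illustrative, not from the source:

```python
# Helper functions for the formulas above: entropy of a class distribution
# and expected information of an attribute's partition.
from math import log2

def info(*counts):
    """I(S1,...,Sm): entropy of a class distribution, given class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def expected_info(partitions):
    """E(A): weighted average entropy over an attribute's partitions."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * info(*p) for p in partitions)

print(f"{info(9, 5):.3f}")                               # 0.940 at the root
print(f"{expected_info([(2, 3), (4, 0), (3, 2)]):.3f}")  # 0.694 for age
```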
E.g., knowing characteristics such as the age interval or income interval: how much do they help in classifying a new object? Can information gain be negative, or is it always greater than or equal to zero?

**Another Example**

Pick a student at random at BU. What is the chance that she is staying in a dorm? Initially we have no specific information about her. If I ask her initials, does that help in predicting the probability of her staying in a dorm? No. If I ask her address and record the city, does that help in predicting the chance of her staying in a dorm? Yes.

**Attribute Selection at the Root**

There are four attributes: age, income, student, and credit_rating. Which of these provides the highest information in classifying a new customer, or equivalently, which of these yields the highest information gain?

**Testing age at the Root Node**

Splitting the 14 cases (9Y, 5N) on age:
- <=30: 5 cases, 2Y 3N, I(2,3) = 0.971, decision N.
- 31…40: 4 cases, 4Y 0N, I(4,0) = 0, decision Y.
- >40: 5 cases, 3Y 2N, I(3,2) = 0.971, decision Y.

Accuracy: 10/14.

E(age) = (5/14)·I(2,3) + (4/14)·I(4,0) + (5/14)·I(3,2)
       = (5/14)·(-(3/5)log2(3/5) - (2/5)log2(2/5)) + (4/14)·(-(4/4)log2(4/4) - (0/4)log2(0/4)) + (5/14)·(-(3/5)log2(3/5) - (2/5)log2(2/5)) = 0.694

Gain(age) = 0.940 − 0.694 = 0.246

**Expected Information for age <=30**
- If age <=30, the information needed to classify a new customer is I(2,3) = 0.971 bits, since the training data tells us that, knowing age <=30, a customer buys with probability 0.4 but does not buy with probability 0.6:
  - I(2,3) = -(3/5)·log2(3/5) - (2/5)·log2(2/5) = 0.6·(0.737) + 0.4·(1.322) = 0.971
- But what is the weight of the age range <=30? 5 out of the 14 samples fall in that range, so (5/14)·I(2,3) is the weighted information needed to classify a customer as buyer or not.

**Information Gain by age**

gain(age) = I(Y,N) − E(age) = 0.940 − 0.694 = 0.246 is the information gain, i.e., the reduction of entropy in classifying a new object. Knowing the age of the customer increases our ability to classify her as buyer or not: it helps us predict her buying behavior.

**Attribute Selection by Information Gain Computation**

Class Y: buys_computer = "yes"; class N: buys_computer = "no"; I(Y, N) = I(9, 5) = 0.940. Computing the entropy for age:

| age | Y_i | N_i | I(Y_i, N_i) |
|-----|-----|-----|-------------|
| <=30 | 2 | 3 | 0.971 |
| 31…40 | 4 | 0 | 0 |
| >40 | 3 | 2 | 0.971 |

E(age) = (5/14)·I(2,3) + (4/14)·I(4,0) + (5/14)·I(3,2) = 0.694, where the term (5/14)·I(2,3) reflects that "age <=30" covers 5 of the 14 samples, with 2 yes'es and 3 no's. Hence:

Gain(age) = I(Y, N) − E(age) = 0.246

Similarly: Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048.

**Testing income at the Root Node**

Splitting the 14 cases (9Y, 5N) on income:
- low: 4 cases, 3Y 1N, I(3,1) = 0.811, decision Y.
- medium: 6 cases, 4Y 2N, I(4,2) = 0.918, decision Y.
- high: 4 cases, 2Y 2N, I(2,2) = 1, decision undetermined.

Accuracy: 9/14.

E(income) = (4/14)·(-(3/4)log2(3/4) - (1/4)log2(1/4)) + (6/14)·(-(4/6)log2(4/6) - (2/6)log2(2/6)) + (4/14)·(-(2/4)log2(2/4) - (2/4)log2(2/4)) = 0.911

Gain(income) = 0.940 − 0.911 = 0.029

**Testing student at the Root Node**

Splitting the 14 cases (9Y, 5N) on student:
- no: 7 cases, 3Y 4N, I(3,4) = 0.985, decision N.
- yes: 7 cases, 6Y 1N, I(6,1) = 0.592, decision Y.

Accuracy: 10/14.

E(student) = (7/14)·(-(3/7)log2(3/7) - (4/7)log2(4/7)) + (7/14)·(-(6/7)log2(6/7) - (1/7)log2(1/7)) = 0.789

Gain(student) = 0.940 − 0.789 = 0.151
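The four root-node gains can be reproduced from the class counts read off the training table; a self-contained sketch (the helper functions repeat the ones above so the block runs alone):

```python
# Reproducing the root-node information gains from (yes, no) counts.
from math import log2

def info(*counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def expected_info(partitions):
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * info(*p) for p in partitions)

splits = {
    "age":           [(2, 3), (4, 0), (3, 2)],  # <=30, 31..40, >40
    "income":        [(3, 1), (4, 2), (2, 2)],  # low, medium, high
    "student":       [(3, 4), (6, 1)],          # no, yes
    "credit_rating": [(6, 2), (3, 3)],          # fair, excellent
}
for attr, parts in splits.items():
    print(attr, f"{info(9, 5) - expected_info(parts):.3f}")
# -> age 0.247, income 0.029, student 0.152, credit_rating 0.048
# (the slides' 0.246 and 0.151 come from rounding intermediate entropies)
```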
**Testing credit_rating at the Root Node**

Splitting the 14 cases (9Y, 5N) on credit_rating:
- fair: 8 cases, 6Y 2N, I(6,2) = 0.811, decision Y.
- excellent: 6 cases, 3Y 3N, I(3,3) = 1, decision undetermined.

Accuracy: 9/14.

E(credit_rating) = (8/14)·(-(6/8)log2(6/8) - (2/8)log2(2/8)) + (6/14)·(-(3/6)log2(3/6) - (3/6)log2(3/6)) = 0.892

Gain(credit_rating) = 0.940 − 0.892 = 0.048

**Comparing Gains**

Information gains for the attributes at the root node:
- Gain(age) = 0.246
- Gain(income) = 0.029
- Gain(student) = 0.151
- Gain(credit_rating) = 0.048

Age provides the highest gain in information, so age is chosen as the attribute at the root node, and we branch according to the distinct values of age.

**After Selecting age: First Level of the Tree**

```
age? (14 cases)
├── <=30  (5 cases) → continue
├── 31…40 (4 cases) → yes
└── >40   (5 cases) → continue
```

**Attribute Selection at the age <=30 Node**

5 cases, 2Y 3N. Candidate splits:
- income: low 1Y 0N; medium 1Y 1N; high 0Y 2N.
- student: no 0Y 3N; yes 2Y 0N.
- credit_rating: fair 1Y 2N; excellent 1Y 1N.

The information gain for student is the highest, as knowing whether the customer is a student or not provides perfect information to classify her buying behavior at this node.

**Attribute Selection at the age >40 Node**

5 cases, 3Y 2N. Candidate splits:
- income: low 1Y 1N; medium 2Y 1N; high 0Y 0N.
- student: no 1Y 1N; yes 2Y 1N.
- credit_rating: fair 3Y 0N; excellent 0Y 2N.

The information gain for credit_rating is the highest, as knowing the customer's credit rating provides perfect information to classify her buying behavior at this node.

**Exercise**

Calculate all the information gains in the second level of the tree, that is, after branching by the distinct values of age.

**Output: A Decision Tree for "buys_computer"**

```
age? (14 cases)
├── <=30  → student?
│       ├── no  (3 cases) → no
│       └── yes (2 cases) → yes
├── 31…40 (4 cases) → yes
└── >40   → credit_rating?
        ├── excellent (2 cases) → no
        └── fair      (3 cases) → yes
```

Accuracy: 14/14 on the training set.

**Advantages and Disadvantages of Decision Trees**

Advantages:
- Easy to understand; the tree maps nicely to production rules.
- Suitable for categorical as well as numerical inputs.
- No statistical assumptions about the distribution of attributes.
- Generation, and application to classify unknown samples, is very fast.

Disadvantages:
- Output attributes must be categorical.
- Unstable: slight variations in the training data may result in different attribute selections and hence different trees.
- Numerical input attributes lead to complex trees, as attribute splits are usually binary.
- Not suitable for non-rectangular regions, such as regions separated by linear or nonlinear combinations of attributes: by lines (in 2 dimensions), planes (in 3 dimensions), or in general hyperplanes (in n dimensions).

**A Classification Problem for Which Decision Trees Are Not Suitable**

(The original slide plots two classes, X and O, in the income-age plane, separated by a diagonal line AA. Because decision trees split on one attribute at a time, producing axis-parallel boundaries, they are not suitable for this problem.)

**Other Attribute Selection Measures**
- Gain ratio.
- Gini index (CART, IBM IntelligentMiner): all attributes are assumed continuous-valued; assumes there exist several possible split values for each attribute; may need other tools, such as clustering, to get the possible split values; can be modified for categorical attributes.

**Gain Ratio**
- Add another attribute, the transaction ID: TID is different for each observation.
- E(TID) = (1/14)·I(1,0) + (1/14)·I(1,0) + … + (1/14)·I(1,0) = 0, so gain(TID) = 0.940 − 0 = 0.940, the highest gain, making TID the test attribute, which makes no sense.
- Therefore use the gain ratio rather than the gain.
- Split information: a measure of the information value of the split itself, without considering class information, using only the number and sizes of the child nodes. It is a kind of normalization for information gain.

**Split Information and Gain Ratio**

Split information = -Σ_i (S_i/S)·log2(S_i/S), the information needed to assign an instance to one of the branches.

Gain ratio = gain(S) / split information(S).

In the previous example:
- split info(TID) = 14·[-(1/14)·log2(1/14)] = log2(14) = 3.807; gain ratio(TID) = (0.940 − 0.0)/3.807 = 0.247.
- split info(age) = I(5,4,5) = -(5/14)·log2(5/14) - (4/14)·log2(4/14) - (5/14)·log2(5/14) = 1.577; gain ratio(age) = gain(age)/split info(age) = 0.246/1.577 = 0.156.
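Split information is likewise a one-liner; a self-contained sketch (`split_info` is an illustrative name):

```python
# Split information and gain ratio, reproducing the TID and age figures.
from math import log2

def split_info(sizes):
    total = sum(sizes)
    return -sum(s / total * log2(s / total) for s in sizes)

print(f"{split_info([5, 4, 5]):.3f}")          # 1.577 for age's 5/4/5 split
print(f"{0.246 / split_info([5, 4, 5]):.3f}")  # gain ratio(age)  ~ 0.156
print(f"{split_info([1] * 14):.3f}")           # 3.807 for TID (14 singletons)
print(f"{0.940 / split_info([1] * 14):.3f}")   # gain ratio(TID)  ~ 0.247
```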
**Gini Index (IBM IntelligentMiner)**

If a data set T contains examples from n classes, the gini index gini(T) is defined as

gini(T) = 1 − Σ_{j=1}^{n} p_j²

where p_j is the relative frequency of class j in T. If T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as

gini_split(T) = (N1/N)·gini(T1) + (N2/N)·gini(T2)

The attribute providing the smallest gini_split(T) is chosen to split the node (this requires enumerating all possible splitting points for each attribute).

**Gini Index (CART) Example**

D has 9 tuples with buys_computer = "yes" and 5 with "no":

gini(D) = 1 − (9/14)² − (5/14)² = 0.459

Suppose the attribute income partitions D into D1: {low, medium} with 10 tuples and D2: {high} with 4:

gini_income∈{low,medium}(D) = (10/14)·gini(D1) + (4/14)·gini(D2) = 0.443

The other binary splits give gini index 0.450 for D1: {medium, high}, D2: {low}, and 0.458 for D1: {low, high}, D2: {medium}. The lowest gini index is for D1: {low, medium}, D2: {high}, so it is the best split.
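A sketch of the gini computation for the income example, with the (yes, no) counts per income value taken from the training table:

```python
# Gini index of a binary split; counts are (yes, no) per income value.
def gini(yes, no):
    total = yes + no
    return 1 - (yes / total) ** 2 - (no / total) ** 2

def gini_split(d1, d2):
    n1, n2 = sum(d1), sum(d2)
    return n1 / (n1 + n2) * gini(*d1) + n2 / (n1 + n2) * gini(*d2)

counts = {"low": (3, 1), "medium": (4, 2), "high": (2, 2)}
lm = tuple(map(sum, zip(counts["low"], counts["medium"])))  # D1 = {low, medium}
print(f"{gini_split(lm, counts['high']):.3f}")  # 0.443, the best income split
```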
**Extracting Classification Rules from Trees**
- Represent the knowledge in the form of IF-THEN rules: one rule is created for each path from the root to a leaf.
- Each attribute-value pair along a path forms a conjunction; the leaf node holds the class prediction.
- Rules are easier for humans to understand.
- Example (one rule per leaf of the tree above):
  - IF age = "<=30" AND student = "no" THEN buys_computer = "no"
  - IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
  - IF age = "31…40" THEN buys_computer = "yes"
  - IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
  - IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"

**Approaches to Determine the Final Tree Size**
- Separate training (2/3) and testing (1/3) sets.
- Use cross validation, e.g., 10-fold cross validation.
- Use all the data for training, but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution.
- Use the minimum description length (MDL) principle: halt growth of the tree when the encoding is minimized.

**Enhancements to Basic Decision Tree Induction**
- Allow for continuous-valued attributes: dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals.
- Handle missing attribute values: assign the most common value of the attribute, or assign a probability to each of the possible values.
- Attribute construction: create new attributes based on existing ones that are sparsely represented; this reduces fragmentation, repetition, and replication.

**Avoid Overfitting in Classification**
- Overfitting: an induced tree may overfit the training data. Too many branches, some of which may reflect anomalies due to noise or outliers, give poor accuracy for unseen samples.
- Two approaches to avoid overfitting:
  - Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold. It is difficult to choose an appropriate threshold.
  - Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees; use a set of data different from the training data to decide which is the "best pruned tree".
- If the error rate of a subtree is higher than the error obtained by replacing the subtree with its most frequent leaf or branch, prune the subtree.
- How to estimate the prediction error: do not use the training samples, since pruning always increases the error on the training sample; estimate the error on a test set (cost-complexity or reduced-error pruning).

**Example: Labour Negotiation Data**
- Dependent variable (the class to be predicted): acceptability of contract, good or bad.
- Independent variables: duration; wage increase in 1st year (<2.5%, >2.5%); working hours per week (<36, >36); health plan contribution (none, half, full).
- Figure 6.2 of WF shows a branch of the decision tree:

```
wage increase 1st year?
├── <=2.5 → working hours per week?
│       ├── <=36 → 1 bad, 1 good
│       └── >36  → health plan contribution?
│               ├── none → 4 bad, 2 good  (a)
│               ├── half → 1 bad, 1 good  (b)
│               └── full → 4 bad, 2 good  (c)
└── >2.5 → …
```

**AnswerTree**
- Variables: measurement levels, case weights, frequency variables.
- Growing methods.
- Stopping rules.
- Tree parameters: costs, prior probabilities, scores, and profits.
- Gain summary.
- Accuracy of tree.
- Cost-complexity pruning.

**Variables**
- Categorical variables: nominal or ordinal. Continuous variables.
- All growing methods accept all types of variables; QUEST requires that the target variable be nominal.
- Target and predictor variables: the target (dependent) variable and the predictor (independent) variables.
- Case weight and frequency variables.

**Case Weight and Frequency Variables**
- Case weight variables give unequal treatment to the cases. Example: direct marketing, keeping all responders but 1% of the nonresponders (10,000 of them), where 10,000 households respond and 1,000,000 do not; use case weight 1 for responders and case weight 100 for nonresponders.
- Frequency variables: the count of a record representing more than one individual.

**Tree-Growing Methods**
- CHAID: Chi-squared Automatic Interaction Detector, Kass (1980).
- Exhaustive CHAID: Biggs, de Ville, Suen (1991).
- C&RT: Classification and Regression Trees, Breiman, Friedman, Olshen, and Stone (1984).
- QUEST: Quick, Unbiased, Efficient Statistical Tree, Loh, Shih (1997).

**CHAID**
- Evaluates all values of a potential predictor and merges values that are judged to be statistically homogeneous.
- Uses an F test for a continuous target variable and a chi-square test for a categorical one.
- Not binary: it can produce more than two categories at any particular level, hence wider trees than the binary methods.
- Works for all types of variables, accepts case weights and frequency variables, and treats missing values as a single valid category.

**Exhaustive CHAID**
- CHAID may not find the optimal split for a variable, as it stops merging categories once all remaining categories are statistically different.
- Exhaustive CHAID continues merging until two super-categories are left, examines the series of merges for the predictor, finds the set of categories giving the strongest association with the target, and computes an adjusted p-value for that association.
- This gives the best split for each predictor; the predictor is then chosen based on the p-values. Otherwise it is identical to CHAID: longer to compute, but safer to use.

**CART**
- Binary tree-growing algorithm: partitions the data into two subsets, so some partitions may not be represented efficiently.
- The same predictor variable may be used several times at different levels.
- Supports misclassification costs and prior probability distributions.
- Computation can take a long time with large data sets.
- Surrogate splitting for missing values.

**Steps in the CART Analysis**
- At the root node t = 1, search for the split s* giving the largest decrease in impurity: Δi(s*, 1) = max_{s∈S} Δi(s, 1).
- Split node 1 into t = 2 and t = 3 using s*.
- Repeat the split-searching process in t = 2 and t = 3, and continue until one of the stopping rules is met.

**Stopping Rules**
- All cases have identical values for all predictors.
- The node becomes pure: all cases have the same value of the target variable.
- The depth of the tree has reached its prespecified maximum value.
- The number of cases in a node is less than a prespecified minimum parent node size.
- The split at a node would produce a child node with fewer cases than a prespecified minimum child node size.
- For CART only: the maximum decrease in impurity is less than a prespecified value.
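Cost-complexity pruning, mentioned above for CART and AnswerTree, is available in scikit-learn. A hedged sketch on a stand-in dataset (iris, purely for illustration, not from the slides): grow the full tree, then score progressively pruned trees on a held-out test set.

```python
# Postpruning via minimal cost-complexity pruning in scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)

# The pruning path yields the effective alphas of the pruned subtrees.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
    print(f"alpha={alpha:.4f}  test accuracy={pruned.score(X_te, y_te):.3f}")
```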
**QUEST**
- Performs variable selection and split-point selection separately.
- More computationally efficient than CART.

**Classification in Large Databases**
- Classification is a classical problem, extensively studied by statisticians and machine learning researchers.
- Scalability: classifying data sets with millions of examples and hundreds of attributes at reasonable speed.
- Why decision tree induction in data mining? Relatively fast learning speed (compared to other classification methods); convertible to simple and easy-to-understand classification rules; can use SQL queries for accessing databases; comparable classification accuracy with other methods.

**Scalable Decision Tree Induction Methods in Data Mining Studies**
- SLIQ (EDBT'96 — Mehta et al.): builds an index for each attribute; only the class list and the current attribute list reside in memory.
- SPRINT (VLDB'96 — J. Shafer et al.): constructs an attribute-list data structure.
- PUBLIC (VLDB'98 — Rastogi & Shim): integrates tree splitting and tree pruning, stopping the growth of the tree earlier.
- RainForest (VLDB'98 — Gehrke, Ramakrishnan & Ganti): separates the scalability aspects from the criteria that determine the quality of the tree; builds an AVC-list (attribute, value, class label).