chapter6 - ODU Computer Science

Download Report

Transcript chapter6 - ODU Computer Science

Chapter 6: Implementations
Why are simple methods not good enough?
• Robustness: Numeric attributes, missing
values, and noisy data
Decision Trees
• Divide and conquer method
• Earlier discussion of this method worked
for nominal values. How to deal with
numeric attributes?
• How to calculate information?
Decision Trees and Numeric Attributes
• Generally, when a numeric attribute is used to split data, it is a
binary split. So the same numeric attribute may be tested several
times.
• Selection of the attribute to select at first once again is based on the
information gain: sort the attribute values and determine a
breakpoint where the information gain is maximized.
• For example, when the values of a numeric attribute are sorted as
follows, with the corresponding classification, the information gain at
the specified breakpoint (between 62 and 65) is computed as:
<45 (C1), 53 (C2), 55 (C2), 62 (C2), 65 (C1), 71 (C1), 75 (C1), 80 (C2)>
info([1,3],[3,1]) = 4/8*info([1,3])+4/8*info([3,1])
= info([1,3])
= -1/4*log(1/4) - 3/4*log(3/4)
= 0.811
Decision Trees and Missing Values
• Treat is as a separate category if it has some
significance
• Alternately, choose the most popular branch at a
split point when that attribute is missing for a
given instance
• More sophisticated approach: Notionally split the
instance into weights based on the portion of
instances that go along that path; whenever it is
further split at an intermediate node, the weights
are further split; finally, all branches lead to the
leaf nodes; the decision is also weighted based
on the weights and summed up.
Decision Trees: Pruning
•
•
•
•
Postpruning (or backward pruning) and prepruning (forward pruning)
Most decision tree builders employ postpruning.
Postpruning involves subtree replacement and subtree raising
Subtree replacement: Select some subtrees and replace them with a
single leaf node---see Figure 1.3a
• Subtree raising: More complex and not always worthwhile. C4.5
scheme uses it. See Fig. 6.1---generally restricted to the most
popular branch.
Decision Trees: Estimating Error
• In making the decision of subtree pruning or subtree raising, we
need to know the resulting estimated error.
• Keep in mind that training set is just a small subset of an entire
universe of data---so the tree should not be fitting just the training
data---the error estimation also should take this into account
• Method 1: Reduced error pruning: Hold back some of the training
data and use it to estimate the error due to pruning---not very good
as it reduces the training data
• Method 2: Error estimate based on the entire training data
Classification Rules
• Simple separate-and-conquer technique
• Problem: Tend to overfit the training data and do not generalize well
to independent sets, particularly on noisy data
• Criteria for choosing tests (in a rule):
– Maximize the correctness: p/t where t is total instances covered by the
rule out of which p are positive.
– Based on information gain: p[log(p/t)-log(P/T)] where P is total +ve
instances and T total instances before the rule was applied.
– Test 1 places more importance on correctness rather than coverage;
Test 2 is also concerned about coverage
• Missing values: Best to treat them as if the values on which the
missing values are being tested do not match; this way they may
match on other attributes in other rules
• Numeric attributes: Sort the attribute values and use break points to
make rules
Classification Rules: Generating Good Rules
• Objective: Instead of deriving rules that overfit to the training data, it
is best to generate sensible rules that stand a better chance of
performing well on new test instances.
• Coverage versus accuracy: Should we choose a rule that is true
over 15/20 instances or the one that is 2/2 (that is 100% correct)?
• Split the training data set into: growing set and pruning set
– Use the growing set to form rules
– Then, remove part of a rule and see its effect on the pruning set; if
satisfied, remove that part of the test
• Algorithm for forming rules by incremental reduced-error pruning
• Worth of a rule based on the pruning set: If it gets p instances right
out of the t instances it covers, and P is the total right instances out
of T. If N= T-p and n= t-p, then (N-n) are the total negative ones it
does not cover and p it covers p positive ones. So [p+(N-n)]/T is
taken as a metric.
Classification Rules: Global Optimization
• First generate rules based on incremental reduced-error pruning
techniques
• Then a global optimization is performed to increase the accuracy of
the rules---by revising or replacing individual rules
• Postinduction optimization is shown to improve both the size and
performance of the rule set
• But this process in often complex
• RIPPER is a build and optimize algorithm
Classification Rules: Using Partial Decision Trees
•
•
•
•
Alternative approach to rule induction that avoids global optimization
Combines divide-and-conquer of decision tree learning (p. 62) and
separate-and-conquer for rule learning (p. 112)
– Separate-and-conquer: It builds a rule, removes the instances it covers,
and continues creating rules recursively for the remaining instances until
none are left.
• To make a single rule, a pruned decision tree is built for the
currents set of instances, the leaf with the largest coverage is made
into a rule, and the tree discarded
• A partial decision tree is an ordinary decision tree that contains
branches to undefined subtrees.
• Entropy(p1,p2,p3,…,pn)=-p1logp1-p2logp2-…-pnlogpn
• Info([a,b,c])=entropy(a/(a+b+c), b/(a+b+c), c/(a+b+c)])
See Fig. 6.6 for an illustration
Figure 6.5 is the algorithm
• Once a partial tree has been built, a single rule is extracted from it.
– Each leaf corresponds to a possible rule, and we seek the best leaf of those
subtrees that have been expanded into leaves---choose the leaf that covers the
greatest number of instances
– If there is a missing attribute value, it is assigned to each of the branches with a
weight proportional to the number of training instances going down that branch
Extending Linear Models
• Basic techniques:
– Linear regression (for numeric prediction)
– Logistic regression (for linear classification) --- linear model based on
transformed target variables
– Perceptron (for linear classification)
– Winnow (for linear classification) –for data sets with binary attributes
• Basic problem: The boundaries between classes are not necessarily
linear
• Support vector machines use linear models to implement nonlinear
class boundaries
– Transform the input using a nonlinear mapping; map given instance to a new
instance
– Use linear models in the new space
– Transform the boundaries back to original space---they are now nonlinear
– Ex: x = w1a13 + w2a12a2 + w3a1a22 + w4a23
•
•
•
•
•
Here, a1and a2 are the attributes; w1, w2, w3, and w4 are to be learned; x
is the outcome
Train one linear system for each class
Assign an unknown instance to the class that gives the greatest output x--like multiresponse linear regression
Problems: Higher computational complexity; danger of overfitting
SVM solves both problems: Use maximum margin hyperplane---hyperplane
that gives the greatest separation between the classes---it comes no closer
to either than it has to.
– The maximum hyperplane is the perpendicular bisector of the shortest
line connecting the hulls of the two classes (say yes and no)
– The instances that are closest to the maximum margin hyperplane are
called support vectors; there is always at least one (if not more) support
vector for each class
– Given the support vectors for the two classes (say yes and no), the
maximum margin hyperplane can be easily constructed
Support Vector Regression
• Basic regression---find a function that approximates the training data
well by minimizing the prediction error (e.g., MSE)
• What is special about SVR: all deviations up to a user specified
parameter ε are simply discarded.
– Also, what is minimized is absolute error rather than MSE
• The value of ε controls how closely the function fits the training data:
too small an ε leads to overfitting; too large an ε leads to
meaningless predictions.
• See Fig. 6.9
Instance-based Learning
• Basic scheme: Use nearest neighbor technique
– Tends to be slow for large training sets
– Performs badly with noisy data---the class of a single instance is based
on its single nearest neighbor rather than on averaging
– No weights associated to different attributes---generally some may have
larger effect than others
– Does not perform explicit generalization
• Reducing the number of exemplars
– Already seen instances that are used for classification are referred to as
exemplars (or examples)
– Classify each example with the examples already seen and save only
the ones that don’t fit the current ones---expand examples only when
necessary
– Problem: Noisy examples are likely to be classified as new examples
•
Pruning noisy exemplars:
– For a given k, choose the k nearest neighbors and assign the majority class to
the unknown instance
– Alternately, monitor the performance of the stored exemplars---keep the ones
that do well (match well) and discard the rest
– IB3---Instance-based learner version 3 --- uses 5% confidence level for
acceptance and 1.25% for rejection. Criterion for acceptance is more stringent
than for rejection making it more difficult for an instance to be accepted
•
Weighting attributes: Use w1, w2, …, wn as weights in computing the
Euclidean distance metric for the n attributes. (see page 238)
– All attribute weights are updated after each training instance is classified and the
most similar exemplar is used as the basis for updating.
– Suppose x is the training instance and y the most similar exemplar---then for
each attribute I, |xi-yi| is a measure of the contribution of that attribute to the
decision. If the difference is small, contribution is more.
– See page 238 for details of changing the attribute weights
• Generalizing exemplars:
– These are rectangular regions of exemplars---called hyperrectangles
– Now, when classifying new instances, it is necessary to calculate the
distance based on its distance to the hyperrectangle
– When a new exemplar is classified correctly, it is generalized simply by
merging it with the nearest exemplar of the same class
• If the nearest exemplar is a single instance, a new hyperrectangle is created
that covers both exemplars.
• Otherwise, the hyperrectangle is modified to cover the new one
– If the prediction is incorrect, the hyperrectangle’s boundaries are shrunk
so it is separated from the instance that was misclassified
Distance Functions for Generalized Exemplars
• Generalized exemplars are no longer points; instead they are
hyperrectangles. So the distance of an instance from an exemplar is
computed as follows.
– If the point lies within the hyperrectangle, the distance is zero.
– Measure the distance from the outside point instance to the nearest
point on the hyperrectangle boundary or measure the distance between
the outside point and the nearest instance that is within the
hyperrectangle.
– In case hyperrectangles overlap, choose a hyperrectangle that is most
specific or the one that covers the smallest instnace space.
Numeric Prediction
•
•
•
Model tree: Used for numeric prediction. Predicts the class value of
instances that reach a leaf
Regression tree: Here, the leaf nodes represents average value of all
instances that reach that node---a special case of model trees
Model trees:
– Each leaf will contain a linear model based on some of the attribute values, and
is used to yield a raw predicted value for a test instance
– The raw value can be smoothed out by producing linear models for each internal
node as well as for the leaves, at the time the tree is built. Once a raw value has
been obtained at the leaf node, it is filtered along the path back to the root
– Smoothing occurs at each internal node by combining it with the value predicted
by the linear model for that node. Function: p’ = (np+kq)/(n+k) where p’ is the
new prediction, p the prediction from the node below, n is the number of training
instances that reach the node below and k is a smoothing constant.
– Alternately, the leaf node’s model can be modified to reflect the smoothing that
takes place at the internal nodes.
• Building the tree: Here, similar to information gain used with nominal
attributes, expected error reduction is chosen as a metric to choose
an attribute on which to split a tree.
– SDR = Std. Dev. Reduction = sd(T) - ∑ |Ti|/|T|*sd(Ti)
where T is the tree prior to splitting and Tis are the subtrees formed after
splitting choosing a particular attribute to split. In other words, the idea
is to choose an attribute that reduces the variance after a split.
– Splitting process terminates when the class value std. deviation is small
fraction of the std. dev of the original instance.
• Pruning the tree:
– First, a linear model is calculated for each node of the unpruned
tree.
– Only the attributes tested subtree below this node are used in
the regression---assume that all attributes are numeric
– Once a linear model is in place for each interior node, the tree is
pruned back from the leaves as long as the expected estimated
error decreases
• In case there are nominal attributes, they are converted
to binary variables that are treated as numeric. If a
nominal value has k possible values, it is replaced by k-1
synthetic binary attributes
Clustering
• Basic scheme: k-means clustering
• Choosing k?
– MDL: Minimum description length principle
• Occam’s razor: Other things being equal, simple things are better than
complex ones.
• General theory + exceptions = what we learn
• MDL principle: Best theory for a body of data is one that minimizes the size
of the theory plus the amount of information necessary to specify the
exceptions relative to the theory
– The one that minimizes the #of bits required to communicate the generalization,
along with the examples from which it is made (i.e., training set)
Bayesian Networks
• Naïve Bayes classifier: For each class value, estimate the
probability that a given instance belongs to that class.
• More advanced: Bayesian networks---a network of nodes, one for
each attribute, connected by directed edges; a directed acyclic
graph
• Fig 6.20 and Fig. 6.21