Transcript Document

Genetic Algorithms (in 1 Slide)







GA: based on an analogy to biological evolution
Each rule is represented by a string of bits
An initial population is created consisting of randomly generated rules
Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules and their offspring
The fitness of a rule is represented by its classification accuracy on a set of training examples
Offspring are generated by crossover and mutation
GAs are a general search/optimization method, not just a classification method. This can be contrasted with other methods. A minimal sketch of the loop appears below.
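As a concrete illustration, here is a minimal sketch of the GA loop described above. The encoding length, population size, and fitness function are assumptions for illustration; a real fitness function would decode each bit string into a rule and measure its classification accuracy on training examples.

```python
import random

RULE_LEN = 16        # bits per rule (assumed encoding length)
POP_SIZE = 20
GENERATIONS = 50

def random_rule():
    return [random.randint(0, 1) for _ in range(RULE_LEN)]

def fitness(rule):
    # Placeholder: in practice, decode the rule and measure its
    # classification accuracy on a set of training examples.
    return sum(rule) / RULE_LEN

def crossover(a, b):
    point = random.randrange(1, RULE_LEN)          # single-point crossover
    return a[:point] + b[point:]

def mutate(rule, rate=0.01):
    return [bit ^ 1 if random.random() < rate else bit for bit in rule]

population = [random_rule() for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    # Survival of the fittest: keep the top half as parents
    population.sort(key=fitness, reverse=True)
    parents = population[:POP_SIZE // 2]
    # Generate offspring by crossover and mutation to refill the population
    offspring = [mutate(crossover(random.choice(parents), random.choice(parents)))
                 for _ in range(POP_SIZE - len(parents))]
    population = parents + offspring

best = max(population, key=fitness)
```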
Ensemble Methods

Construct a set of classifiers from the training data

Predict the class label of previously unseen records by aggregating the predictions made by multiple classifiers

In Olympic ice skating there are multiple judges. Why?
General Idea
[Figure: the original training data D is used to create multiple data sets D1, D2, ..., Dt-1, Dt (Step 1: Create Multiple Data Sets); a classifier C1, C2, ..., Ct-1, Ct is built on each data set (Step 2: Build Multiple Classifiers); the classifiers are then combined into a single ensemble classifier C* (Step 3: Combine Classifiers).]
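A minimal sketch of Step 3, combining the base classifiers' predictions by majority vote (the function names here are illustrative, not from the slides):

```python
from collections import Counter

# Combine the predictions of t base classifiers by majority vote.
# Each "classifier" is any callable mapping a record to a class label.
def ensemble_predict(classifiers, record):
    votes = [clf(record) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]   # most frequent label wins
```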
Why does it work?

Suppose there are 25 base classifiers
– Each classifier has error rate ε = 0.35
– Assume classifiers are independent
– Probability that the ensemble classifier makes a wrong prediction (at least 13 of the 25 classifiers must be wrong):

Σ_{i=13}^{25} C(25, i) ε^i (1 − ε)^{25−i} ≈ 0.06
– Practice has shown that results are good even when the independence assumption does not hold
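The 0.06 figure is easy to verify directly; a small computation (Python's math.comb supplies the binomial coefficient):

```python
from math import comb

# Probability that at least 13 of 25 independent base classifiers,
# each with error rate eps = 0.35, are wrong at the same time.
eps = 0.35
p_wrong = sum(comb(25, i) * eps**i * (1 - eps)**(25 - i)
              for i in range(13, 26))
print(round(p_wrong, 3))   # ≈ 0.06
```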
Methods for Generating Multiple Classifiers

Manipulate the training data
– Sample the data differently each time
– Examples: Bagging and Boosting
Manipulate the input features
– Sample the features differently each time
Makes especially good sense if there is redundancy (see the random-subspace sketch below)
– Example: Random Forest
Manipulate the learning algorithm
– Vary some parameter of the learning algorithm
E.g., amount of pruning, ANN network topology, etc.
Use different learning algorithms
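As a concrete illustration of manipulating the input features, a minimal random-subspace sketch; train_fn stands in for any learning algorithm and is an assumed name, not from the slides:

```python
import random

# Random subspace method: build each base classifier on a random
# subset of the input features.
def random_subspace_ensemble(train_fn, X, y, n_classifiers=10, subset_frac=0.5):
    n_features = len(X[0])
    k = max(1, int(subset_frac * n_features))
    ensemble = []
    for _ in range(n_classifiers):
        feats = random.sample(range(n_features), k)    # sampled feature subset
        X_sub = [[row[j] for j in feats] for row in X]
        ensemble.append((feats, train_fn(X_sub, y)))   # remember the subset
    return ensemble
```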
Background

Classifier performance can be impacted by:
– Bias: assumptions made to help with generalization
"Simpler is better" is a bias
– Variance: a learning method will give different results based on small changes (e.g., in the training data)
When I run experiments and use random sampling with repeated runs, I get different results each time
– Noise: measurements may have errors or the class may be inherently probabilistic
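The variance point can be seen directly with repeated random splits; a sketch (train_and_score is a hypothetical stand-in for training any classifier and returning its test accuracy):

```python
import random

# Repeated random train/test splits of the same data give different
# accuracy estimates for the same learner; their spread reflects variance.
def variance_demo(train_and_score, data, runs=10, train_frac=0.8):
    scores = []
    for _ in range(runs):
        shuffled = random.sample(data, len(data))   # fresh random split
        cut = int(train_frac * len(shuffled))
        scores.append(train_and_score(shuffled[:cut], shuffled[cut:]))
    return scores
```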
How Ensembles Help

Ensemble methods can assist with both bias and variance
– Averaging the results over multiple runs will reduce the variance
I observe this when I use 10 runs with random sampling and see that my learning curves are much smoother
– Ensemble methods are especially helpful for unstable classifier algorithms
Decision trees are unstable since small changes in the training data can greatly impact the structure of the learned decision tree
– If you combine different classifier methods into an ensemble, then you are using methods with different biases
You are more likely to use a classifier with a bias that is a good match for the problem
You may even be able to identify the best methods and weight them more heavily

Examples of Ensemble Methods



How to generate an ensemble of classifiers?
– Bagging
– Boosting
These methods have been shown to be quite effective
A technique ignored by the textbook is to combine classifiers built separately
– By simple voting
– By voting and factoring in the reliability of each classifier (a weighted-voting sketch follows below)
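A minimal sketch of reliability-weighted voting; the weights might come from, e.g., each classifier's accuracy on a validation set (an assumption, not specified in the slides):

```python
from collections import defaultdict

# Each classifier's vote counts in proportion to its reliability weight.
def weighted_vote(classifiers, weights, record):
    totals = defaultdict(float)
    for clf, w in zip(classifiers, weights):
        totals[clf(record)] += w            # add this classifier's weight
    return max(totals, key=totals.get)      # label with the largest total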
Bagging
Sampling with replacement
 Build a classifier on each bootstrap sample
 Each record has probability 1 – (1 – 1/n)^n of being selected for a given bootstrap sample (about 63% for large n)
– Some records will be picked more than once
 Combine the resulting classifiers, such as by majority voting
 Greatly reduces the variance when compared to a single base classifier
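A minimal bagging sketch (train_fn is a hypothetical stand-in for any learning algorithm):

```python
import random

# Draw bootstrap samples (with replacement) and build one classifier
# per sample; combine the models by majority voting at prediction time.
# Each record lands in a given sample with probability
# 1 - (1 - 1/n)**n, about 0.632 for large n.
def bagging(train_fn, data, n_classifiers=10):
    models = []
    for _ in range(n_classifiers):
        sample = [random.choice(data) for _ in range(len(data))]  # bootstrap
        models.append(train_fn(sample))
    return models
```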

Boosting

An iterative procedure to adaptively change the distribution of the training data by focusing more on previously misclassified records
– Initially, all N records are assigned equal weights
– Unlike bagging, weights may change at the end of each boosting round
Boosting
 Records that are wrongly classified will have their weights increased
 Records that are classified correctly will have their weights decreased
Original Data         1   2   3   4   5   6   7   8   9  10
Boosting (Round 1)    7   3   2   8   7   9   4  10   6   3
Boosting (Round 2)    5   4   9   4   2   5   1   7   4   2
Boosting (Round 3)    4   4   8  10   4   5   4   6   3   4
• Example 4 is hard to classify
• Its weight is increased, therefore it is more likely to be chosen again in subsequent rounds
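The slides describe the weight update qualitatively; one standard instantiation is the AdaBoost update, sketched here (this specific formula is AdaBoost's, not something the slides commit to):

```python
import math

# One round of AdaBoost-style reweighting. `correct` is a list of
# booleans saying whether each record was classified correctly, and
# `error` is this round's weighted error rate (0 < error < 0.5).
def update_weights(weights, correct, error):
    alpha = 0.5 * math.log((1 - error) / error)
    new_w = [w * math.exp(-alpha if ok else alpha)   # shrink correct records,
             for w, ok in zip(weights, correct)]     # grow misclassified ones
    total = sum(new_w)
    return [w / total for w in new_w]                # renormalize to sum to 1
```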
Class Imbalance


Class imbalance occurs when the classes in the distribution are very unevenly distributed
– Examples would include fraud prediction and identification of rare diseases
If there is class imbalance, accuracy may be high even if the rare class is never predicted
For example, if only 1% of records are fraudulent, always predicting "no fraud" yields 99% accuracy while catching no fraud at all
– This could be okay, but only if both classes are equally important
This is usually not the case. Typically the rare class is more important
The cost of a false negative is usually much higher than the cost of a false positive
– The rare class is designated the positive class
Confusion Matrix

The following abbreviations are standard:
– FP: False Positive
– TP: True Positive
– FN: False Negative
– TN: True Negative

                      Predicted Class
                        +       -
Actual Class     +     TP      FN
                 -     FP      TN
Classifier Evaluation Metrics
Accuracy = (TP + TN)/(TP + TN + FN + FP)
 Precision and Recall
– Recall = TP/(TP + FN)
Recall measures the fraction of positive class examples that are correctly identified
– Precision = TP/(TP + FP)
Precision is essentially the accuracy of the examples that are classified as positive
– We would like to maximize both precision and recall
They are usually competing goals
These measures are appropriate for class imbalance since recall explicitly addresses the rare class
Classifier Evaluation Metrics Cont.
One issue with precision and recall is that they are two numbers, not one
– That makes simple comparisons more difficult and it is not clear how to determine the best classifier
– Solution: combine the two
 The F-measure combines precision and recall
– The F1 measure is defined as:
(2 x recall x precision)/(recall + precision)
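A short computation of all four metrics from confusion-matrix counts (the counts below are made up for illustration):

```python
# Hypothetical confusion-matrix counts
TP, FN, FP, TN = 40, 10, 20, 930

accuracy  = (TP + TN) / (TP + TN + FN + FP)   # 0.97, inflated by the majority class
recall    = TP / (TP + FN)                    # 0.80: fraction of positives found
precision = TP / (TP + FP)                    # ~0.67: accuracy of predicted positives
f1 = 2 * recall * precision / (recall + precision)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} F1={f1:.3f}")
```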
Cost-Sensitive Learning


Cost-sensitive learning will factor in cost information
– For example, you may be given the relative cost of a FN vs. a FP
– You may be given the costs or utilities associated with each quadrant of the confusion matrix
Cost-sensitive learning can be implemented by sampling the data to reflect the costs
– If the cost of a FN is twice that of a FP, then you can increase the ratio of positive examples by a factor of 2 when constructing the training set (a sketch follows below)
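A minimal sketch of this sampling scheme (list-based for illustration; positives and negatives are the training records of each class):

```python
# Oversample the positive class in proportion to the FN/FP cost ratio,
# e.g., cost_ratio=2 when a false negative costs twice a false positive.
def cost_weighted_training_set(positives, negatives, cost_ratio=2):
    return positives * cost_ratio + negatives   # replicate positive records
```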
More on autonomous vehicles

DARPA Grand Challenges. $1,000,000 prize.
– Motivation: by 2015, 1/3 of ground military forces should be autonomous
– 2004: 150-mile race through the Mojave Desert. No one finished. CMU's car made it the farthest, at 7.3 miles
– 2005: Same race. 22 of the 23 vehicles surpassed the best distance from 2004. Five vehicles completed the course. Stanford first, CMU second. Sebastian Thrun led the Stanford team.
– 2005 Grand Challenge Video
More DARPA Grand Challenge

2007 Urban Challenge
– 60-mile urban-area course to be completed in 6 hours
Must obey all traffic laws and avoid other robot cars
– Urban Challenge Video
Google Autonomous Vehicle
 Google “commercial” video
 Second Google driverless car video
 Alternative future autonomous “vehicles”
– video