PPT4 - Avoiding Overtraining


Eco 6380
Predictive Analytics For Economists
Spring 2016
Professor Tom Fomby
Department of Economics
SMU
Presentation 4
The Dangers of Overtraining a
Model and What to do to Avoid It
Chapter 2 in SPB
Choosing and Maintaining a Best
Supervised Learning Model
• In the case of a Supervised Learning Problem (either
prediction or classification) the standard practice is to
partition the data into training, validation, and test
subsets (more about the rationale of this partitioning
below)
• Use these data partitions to determine a best individual
model or Ensemble Model to use for the task at hand
(more about how to determine the best model later). In
some sense we are “letting the data speak” as to the best
model(s) to use for prediction or classification purposes.
Only with the advent of the large data sets that arise in
data mining tasks has the data partitioning approach
become viable.
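
As a concrete illustration of the partitioning step (the course itself uses
SPSS Modeler and XLMiner for this), here is a minimal Python sketch that
randomly splits a hypothetical data set into 60% training, 20% validation,
and 20% test rows; the column names and split proportions are illustrative
assumptions, not anything prescribed in the slides.

import numpy as np
import pandas as pd

# Hypothetical data: three predictors and a target column named "y"
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 4)), columns=["x1", "x2", "x3", "y"])

# Shuffle the row indices once, then cut them into 60% / 20% / 20% pieces
idx = rng.permutation(len(df))
n_train = int(0.6 * len(df))
n_valid = int(0.2 * len(df))

train = df.iloc[idx[:n_train]]                    # used to fit model parameters
valid = df.iloc[idx[n_train:n_train + n_valid]]   # used to compare candidate models
test  = df.iloc[idx[n_train + n_valid:]]          # held back for a final, one-time assessment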
Two Major Problems faced by both the
Statistical and Algorithmic (AI) Modeling
Approaches
• Over-training – the process of fitting an overly complex
model to a set of data. The complex model, when applied to
an independent test data set, performs poorly in terms of
predictive accuracy. It is often said that "the model has not
only fit the signal in the data but the noise as well."
• Multiplicity of Good Models – several models provide
equally good fits of the data and in independent data sets
perform equally well in terms of predictive accuracy.
[Fomby Diagram: goodness-of-fit (lower the better) plotted against the
complexity of models, with one curve for performance on training data and
one for performance on test data, illustrating the "Over-Training" Problem.]
“Over-Training” Problem
An overly complex model fit to training data will
invariably not perform as well (in terms of predictive
accuracy) on independent test data. The overly complex
model, unfortunately, tends to fit not only the signal in
the training data but also the noise in the training data,
and thus gives a false impression of predictive accuracy.
[Fomby Diagram: predictive accuracy (lower the better) for Models 1, 2, 3,
…, J, K, L, M, illustrating the "Multiplicity of Good Models" Problem: many
different models achieve roughly the same predictive accuracy.]
“Multiplicity of Good Models” Problem
When there is a multiplicity of good models, one can
often improve on the predictive performance of any of
the individual "good" models by building an ensemble
of the best of the good models.
A Corollary:
"Complexity of Ensemble Model" Problem
[Fomby Diagram: predictive accuracy (lower the better) on training data and
on test data plotted against the number of models making up the ensemble.]
A Good Approach to Building a Good
Ensemble Model:
Build a “Trimmed” Ensemble
• Examine several model “groups” having distinct architectures
(designs) like MLR, KNN, ANN, CART, SVM, etc.
• Determine a Best (“super”) Model for each Model Group and then,
among these “super” models, choose the best 3 or best 4 of them,
form an ensemble model (usually an equally weighted one), and
then apply it to an independent data set. It will, with high
probability, outperform the individual “super” models that make up
the ensemble.
• In other words, don't just throw together a bunch of individual
models without careful pre-selection ("trimming"). A small sketch of
this trimming step follows below.
• In essence, this is the approach of the Auto Numeric and Auto
Classifier Nodes in SPSS Modeler. See
“property_values_numericpredictor.str” and
“pm_binaryclassifier.str” in the Demo Streams directory of SPSS
Modeler.
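
A minimal sketch of the trimmed-ensemble idea, written in Python with
scikit-learn rather than SPSS Modeler; the candidate model families, sample
sizes, and hyperparameters below are illustrative assumptions. One
representative ("super") model per family is fit on the training partition,
the best three by validation error are kept, and their predictions are
averaged with equal weights on the test partition.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Synthetic data, split 60/20/20 into training, validation, and test partitions
X, y = make_regression(n_samples=1500, n_features=8, noise=10.0, random_state=0)
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_va, X_te, y_va, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# One "super" model per family, fit on the training partition
families = {
    "MLR": LinearRegression(),
    "KNN": KNeighborsRegressor(n_neighbors=10),
    "CART": DecisionTreeRegressor(max_depth=6, random_state=0),
    "SVM": SVR(C=10.0),
    "Forest": RandomForestRegressor(n_estimators=200, random_state=0),
}
for m in families.values():
    m.fit(X_tr, y_tr)

# "Trim": rank the families on the validation partition and keep only the best 3
val_mse = {name: mean_squared_error(y_va, m.predict(X_va)) for name, m in families.items()}
best3 = sorted(val_mse, key=val_mse.get)[:3]

# Equally weighted ensemble of the trimmed set, judged on the independent test partition
ens = np.mean([families[name].predict(X_te) for name in best3], axis=0)
print("ensemble test MSE:", mean_squared_error(y_te, ens))
for name in best3:
    print(name, "test MSE:", mean_squared_error(y_te, families[name].predict(X_te)))

On any single draw of data the ensemble is not guaranteed to win, but
averaging a few strong, structurally different models typically reduces
variance relative to each member, which is the sense in which it
outperforms "with high probability."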
Using Cross-Validation and Data
Partitioning to Avoid the Over-training
(Over-fitting) of Models
• As we shall see, data mining is vulnerable to the danger
of “over-fitting” where a model is fit so closely to the
available sample of data (by making it overly complex)
that it describes not merely structural characteristics of
the data, but random peculiarities as well. In
engineering terms, the model is fitting the noise, not
just the signal.
• Over-fitting can also occur in the sense that with so
many models and their architectures to choose from,
“one is bound to find a model that works sooner or
later.”
Using Cross-Validation and Data Partitioning to
Avoid the Over-training (Over-fitting) of Models
continued
• To overcome the tendency to over-fit, data mining techniques
require that a model be “tuned” on one sample and
“validated” on a separate sample. Models that are badly
over-fit to one sample due to over-complexity will generally
perform poorly in a separate, independent data set.
Therefore, validation of a data mining model on a data set
separate from the one used to formulate the model
(cross-validation) tends to reduce the incidence of
over-fitting and results in an overall better-performing model.
Also, if one of many models just "happens" to fit a data set
well, its innate inadequacy is likely to be revealed when it is
applied to an independent data set.
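
The over-training pattern in the Fomby diagram is easy to reproduce. In the
Python sketch below (the sinusoidal signal, noise level, and polynomial
degrees are illustrative assumptions), training error keeps shrinking as the
polynomial degree grows, while error on a held-out validation sample stops
improving and typically turns back up for the largest degrees.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(120, 1))
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=120)   # signal plus noise

x_tr, y_tr = x[:80], y[:80]      # training sample
x_va, y_va = x[80:], y[80:]      # independent validation sample

for degree in (1, 3, 5, 9, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_tr, y_tr)
    print(degree,
          mean_squared_error(y_tr, model.predict(x_tr)),   # shrinks as complexity grows
          mean_squared_error(y_va, model.predict(x_va)))   # stops improving, typically worsens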
Using Cross-Validation and Data Partitioning to Avoid
the Over-training (Over-fitting) of Models continued
The definitions and purposes of the data partitions are as follows:
• Training Partition: That portion of the data used to fit the "parameters" of a
statistical model or determine the "architecture" of a data-algorithmic (machine
learning) model.
• Validation Partition: That portion of the data used to assess whether a proposed
model was over-fit to the training data and thus how good a proposed model
really is. For example, is a single hidden-layer artificial neural network to be
preferred over a double hidden-layer artificial neural network? Is a best subset
multiple regression determined with a critical p-value of 0.10 to be preferred to
one determined with a critical p-value of 0.01? The Validation data set helps us
answer such questions. Also, as we shall subsequently see, the Validation data set
helps us build Ensemble models that are constructed as "combinations" of the
best individual performing models.
• Test Partition: That portion of the data that is used to assess the "unconditional"
(generalized) performances of "best" models that may have been determined
from the Validation data set. Also, the Test data set can be used to determine the
relative efficiencies of competing Ensemble models for the purpose of choosing an
optimal Ensemble to use in practice.
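
To make the three roles concrete, here is a small Python/scikit-learn sketch
of the neural-network question raised above (one hidden layer versus two).
The synthetic data, layer sizes, and other settings are illustrative
assumptions; the point is only which partition answers which question.

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=2000, n_features=6, noise=5.0, random_state=0)
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_va, X_te, y_va, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Training partition: fit both candidate architectures
single = MLPRegressor(hidden_layer_sizes=(10,), max_iter=3000, random_state=0).fit(X_tr, y_tr)
double = MLPRegressor(hidden_layer_sizes=(10, 10), max_iter=3000, random_state=0).fit(X_tr, y_tr)

# Validation partition: settles the single- vs double-hidden-layer question
winner = min((single, double),
             key=lambda m: mean_squared_error(y_va, m.predict(X_va)))

# Test partition: used only once, for the chosen model's "unconditional" performance
print("test MSE of chosen network:", mean_squared_error(y_te, winner.predict(X_te)))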
Binning Continuous Variables
and Stratified Sampling
• Sometimes when continuous input variables are very multi-collinear, it helps
to create categorical variables out of the continuous variables by a process
called "binning". That is, the continuous variable is partitioned into class
intervals and an indicator (0/1) variable is assigned to each interval. For
example, an income variable might be divided into income_30, income_60,
income_90, and income_above, with these variables representing the income
groups (0, 30K), (30K, 60K), (60K, 90K), and above 90K. Binning crucial,
highly multi-collinear continuous variables can break down the
multicollinearity that might otherwise prevent the construction of a strong
prediction or classification model. The XLMINER program supports the
binning operation. (See the sketch after this list.)
• It is sometimes the case that models are best built by strata. That is, models
built on separate strata, for example the regions north, south, east, and west,
can often give rise to more accurate predictions and classifications, when
assessed over all of the data (strata) as a whole, than a single model built on
all of the data at once. The XLMINER program supports stratified sampling.
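
A minimal Python sketch of both operations (the slides use XLMiner; the
income cut points, the region variable, and the toy target below are
illustrative assumptions): pandas does the binning into 0/1 indicator
columns, and a separate regression is fit within each regional stratum.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "income": rng.uniform(0, 150_000, size=400),
    "region": rng.choice(["north", "south", "east", "west"], size=400),
    "x":      rng.normal(size=400),
})
df["y"] = 0.5 * df["x"] + rng.normal(size=400)          # toy target

# Binning: cut income at 30K / 60K / 90K and create one 0/1 indicator per interval
df["income_bin"] = pd.cut(df["income"],
                          bins=[0, 30_000, 60_000, 90_000, np.inf],
                          labels=["income_30", "income_60", "income_90", "income_above"],
                          include_lowest=True)
df = df.join(pd.get_dummies(df["income_bin"]))          # adds the four indicator columns

# Stratified modelling: fit one regression per region instead of one pooled model
models = {region: LinearRegression().fit(part[["x"]], part["y"])
          for region, part in df.groupby("region")}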
Classroom Exercise:
Exercise 2