Transcript MBG404_LS_08

Computational Biology
Classification
(some parts taken from
Introduction to Data Mining
by
Tan, Steinbach, Kumar)
Lecture Slides Week 10
MBG404 Overview
Data Generation – Storage – Processing (Pipelining) – Mining
Classification Workflow
Support Vector Machines
• The line that maximizes the minimum
margin is a good bet.
– The model class of “hyper-planes
with a margin of m” has a low VC
dimension if m is big.
• This maximum-margin separator is
determined by a subset of the
datapoints.
– Datapoints in this subset are
called “support vectors”.
– It will be useful computationally if
only a small fraction of the
datapoints are support vectors,
because we use the support
vectors to decide which side of the
separator a test case is on.
(Figure: the support vectors are indicated by the circles around them.)
Training a linear SVM
• To find the maximum margin separator, we have to solve
the following optimization problem:
w · x_c + b ≥ +1 for positive cases
w · x_c + b ≤ −1 for negative cases
and ||w||² is as small as possible
• This is tricky, but it's a convex problem. There is only one
optimum, and we can find it without fiddling with learning
rates, weight decay, or early stopping.
– Don't worry about the optimization problem. It has been
solved. It's called quadratic programming.
– It takes time proportional to N², which is really bad for
very big datasets,
• so for big datasets we end up doing approximate optimization!
Testing a linear SVM
• The separator is defined as the set of points for which:
w · x + b = 0
so if w · x_c + b > 0, say it's a positive case,
and if w · x_c + b < 0, say it's a negative case.
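As a minimal illustration of the last two slides (not part of the original deck), the sketch below fits a linear SVM with scikit-learn and then classifies a test point by the sign of w · x + b. The toy data, random seed, and C value are assumptions made for the example.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy 2-D data: two roughly separable clusters (illustrative only).
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[2, 2], scale=0.5, size=(20, 2))
X_neg = rng.normal(loc=[-2, -2], scale=0.5, size=(20, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 20 + [-1] * 20)

# Training: solve the (soft-margin) quadratic program for w and b.
svm = LinearSVC(C=1.0)
svm.fit(X, y)
w, b = svm.coef_[0], svm.intercept_[0]

# Testing: a case is positive if w·x + b > 0, negative if w·x + b < 0.
x_test = np.array([1.5, 1.0])
print("positive" if w @ x_test + b > 0 else "negative")
```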
What to do if there is no separating plane
• Use a much bigger set of features.
– This looks as if it would make the computation hopelessly slow,
but in the next part of the lecture we will see how to use the
“kernel” trick to make the computation fast even with huge
numbers of features.
• Extend the definition of maximum margin to allow non-separating planes.
– This can be done by using "slack" variables.
A picture of the best plane with a slack variable
The story so far
• If we use a large set of non-adaptive features, we can
often make the two classes linearly separable.
– But if we just fit any old separating plane, it will not
generalize well to new cases.
• If we fit the separating plane that maximizes the margin
(the minimum distance to any of the data points), we will
get much better generalization.
– Intuitively, by maximizing the margin we are
squeezing out all the surplus capacity that came from
using a high-dimensional feature space.
• This can be justified by a whole lot of clever mathematics
which shows that
– large margin separators have lower VC dimension.
– models with lower VC dimension have a smaller gap
between the training and test error rates.
Why do large margin separators have lower VC
dimension?
• Consider a random set of N points that
all fit inside a unit hypercube.
• If the number of dimensions is bigger
than N-2, it is easy to find a separating
plane for any labeling of the points.
– So the fact that there is a separating
plane doesn't tell us much. It's like
putting a straight line through 2 data
points.
• But there is unlikely to be a separating
plane with a big margin.
– If we find such a plane, it's unlikely to
be a coincidence. So it will probably
apply to the test data too.
How to make a plane curved
• Fitting hyperplanes as
separators is mathematically
easy.
– The mathematics is linear.
• By replacing the raw input
variables with a much larger
set of features we get a nice
property:
– A planar separator in the
high-dimensional space of
feature vectors is a curved
separator in the low
dimensional space of the
raw input variables.
(Figure: a planar separator in a 20-D feature space projected back to the original 2-D space.)
A potential problem and a magic solution
• If we map the input vectors into a very high-dimensional
feature space, surely the task of finding the maximum-margin
separator becomes computationally intractable?
– The mathematics is all linear, which is good, but the
vectors have a huge number of components.
– So taking the scalar product of two vectors is very
expensive.
• The way to keep things tractable is to use
“the kernel trick”
• The kernel trick makes your brain hurt when you first
learn about it, but it's actually very simple.
What the kernel trick achieves
• All of the computations that we need to do to find the
maximum-margin separator can be expressed in terms of
scalar products between pairs of datapoints (in the high-dimensional feature space).
• These scalar products are the only part of the computation
that depends on the dimensionality of the high-dimensional
space.
– So if we had a fast way to do the scalar products we
would not have to pay a price for solving the learning
problem in the high-D space.
• The kernel trick is just a magic way of doing scalar products
a whole lot faster than is usually possible.
– It relies on choosing a way of mapping to the high-dimensional
feature space that allows fast scalar products.
The kernel trick
• For many mappings from
a low-D space to a high-D
space, there is a simple
operation on two vectors
in the low-D space that
can be used to compute
the scalar product of their
two images in the high-D
space.
K(x_a, x_b) = φ(x_a) · φ(x_b)
(Figure: letting the kernel do the work — points x_a and x_b in the low-D space map to φ(x_a) and φ(x_b) in the high-D space, where the scalar product would otherwise be done the obvious, expensive way.)
Some commonly used kernels
Polynomial: K(x, y) = (x · y + 1)^p
Gaussian radial basis function: K(x, y) = exp(−||x − y||² / (2σ²))
Neural net: K(x, y) = tanh(k x · y − δ)
(p, σ, k, and δ are parameters that the user must choose.)
For the neural network kernel, there is one “hidden unit”
per support vector, so the process of fitting the maximum
margin hyperplane decides how many hidden units to use.
Also, it may violate Mercer’s condition.
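As a side illustration (not in the original slides), the short sketch below checks the kernel trick numerically for the polynomial kernel with p = 2 in two dimensions. The explicit feature map φ used here is one standard choice and is an assumption made for this example.

```python
import numpy as np

def phi(x):
    """Explicit feature map whose scalar product equals (x·y + 1)^2 in 2-D."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

def poly_kernel(x, y, p=2):
    """Polynomial kernel K(x, y) = (x·y + 1)^p, computed in the low-D space."""
    return (np.dot(x, y) + 1) ** p

x = np.array([0.3, -1.2])
y = np.array([2.0, 0.7])

# The cheap low-D kernel value matches the expensive high-D scalar product.
print(poly_kernel(x, y))        # (x·y + 1)^2
print(np.dot(phi(x), phi(y)))   # φ(x)·φ(y) — same number
```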
Performance
• Support Vector Machines work very well in practice.
– The user must choose the kernel function and its
parameters, but the rest is automatic.
– The test performance is very good.
• They can be expensive in time and space for big datasets
– The computation of the maximum-margin hyper-plane
depends on the square of the number of training cases.
– We need to store all the support vectors.
• SVMs are very good if you have no idea about what
structure to impose on the task.
• The kernel trick can also be used to do PCA in a much
higher-dimensional space, thus giving a non-linear version
of PCA in the original space.
End Theory I
• 5 min Mindmapping
• 10 min Break
Practice I
Learning
• Supervised (Classification)
– Classification
• Decision tree
• SVM
Classification
• Use the iris.txt file for classification
• Follow along as we classify
Playtime
• Try to optimize the classification by playing around with
the parameters
• Who can achieve the best accuracy in 15 min?
End Practice I
• 15 min break
Theory II
Holdout validation
• One way to validate your model is to fit your
model on half your dataset (your “training set”)
and test it on the remaining half of your dataset
(your “test set”).
• If over-fitting is present, the model will perform
well in your training dataset but poorly in your
test dataset.
• Of course, you “waste” half your data this way,
and often you don’t have enough data to
spare…
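A minimal holdout split in scikit-learn (an illustrative sketch, not part of the slides; the iris data and the 50/50 split simply mirror the description above):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Fit on half the data ("training set"), test on the other half ("test set").
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))
print("test accuracy: ", model.score(X_test, y_test))  # over-fitting shows up here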
Alternative strategies:
• Leave-one-out validation (leave one observation out at a
time; fit the model on the remaining training data; test on
the held out data point).
• K-fold cross-validation—what we will discuss today.
When is cross-validation used?
• Very important in microarray experiments (“p is larger
than N”).
• Anytime you want to prove that your model is not over-fit,
that it will have good prediction in new datasets.
10-fold cross-validation (one example of K-fold
cross-validation)
• 1. Randomly divide your data into 10 pieces (folds), 1
through 10.
• 2. Treat the 1st tenth of the data as the test
dataset. Fit the model to the other nine-tenths
of the data (which are now the training data).
• 3. Apply the model to the test data (e.g., for
logistic regression, calculate predicted
probabilities of the test observations).
• 4. Repeat this procedure for all 10 tenths of the
data.
• 5. Calculate statistics of model accuracy and fit (a minimal sketch follows below).
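A hedged sketch of the five steps above with scikit-learn; the logistic-regression model and the breast-cancer dataset are stand-ins chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_breast_cancer(return_X_y=True)
kf = KFold(n_splits=10, shuffle=True, random_state=0)   # step 1

accuracies = []
for train_idx, test_idx in kf.split(X):                 # steps 2-4
    model = LogisticRegression(max_iter=5000)
    model.fit(X[train_idx], y[train_idx])               # fit on nine tenths
    accuracies.append(model.score(X[test_idx], y[test_idx]))  # apply to the held-out tenth

print("mean CV accuracy:", np.mean(accuracies))         # step 5
```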
Example: 10-fold cross validation
• Gould MK, Ananth L, Barnett PG; Veterans Affairs SNAP
Cooperative Study Group. A clinical model to estimate
the pretest probability of lung cancer in patients with
solitary pulmonary nodules. Chest. 2007
Feb;131(2):383-8.
• Aim: to estimate the probability that a patient who
presents with a solitary pulmonary nodule (SPN) in their
lungs has a malignant lung tumor, to help guide clinical
decision making for people with this condition.
• Study design: n=375 veterans with SPNs; 54% have a
malignant tumor and 46% do not (as confirmed by a gold
standard test). The authors used multiple logistic
regression to select the best predictors of malignancy.
Results from multiple logistic regression:
Table 2. Predictors of Malignant SPNs

Predictor                                          OR     95% CI
Smoking history*                                   7.9    2.6–23.6
Age, per 10-yr increment                           2.2    1.7–2.8
Nodule diameter, per 1-mm increment                1.1    1.1–1.2
Time since quitting smoking, per 10-yr increment   0.6    0.4–0.7

* Ever vs. never.
Gould MK, et al. Chest. 2007 Feb;131(2):383-8.
Prediction model:
Predicted probability of a malignant SPN = e^X / (1 + e^X),
where X = −8.404 + (2.061 × smoke) + (0.779 × age [per 10 yr]) +
(0.112 × diameter [mm]) − (0.567 × years quit [per 10 yr])
Gould MK, et al. Chest. 2007 Feb;131(2):383-8.
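To make the formula concrete, here is a small sketch that plugs values into the published equation. The example patient and the exact unit scaling (age and years-since-quitting in 10-year units, diameter in mm, smoke coded 0/1) are assumptions inferred from Table 2, not stated in the slides.

```python
import math

def spn_probability(smoke, age_decades, diameter_mm, years_quit_decades):
    """Gould et al. prediction model: P = e^X / (1 + e^X)."""
    x = (-8.404
         + 2.061 * smoke                 # 1 = ever smoker, 0 = never (assumed coding)
         + 0.779 * age_decades           # age in 10-year units
         + 0.112 * diameter_mm           # nodule diameter in mm
         - 0.567 * years_quit_decades)   # years since quitting, in 10-year units
    return math.exp(x) / (1 + math.exp(x))

# Hypothetical patient: ever-smoker, 65 years old, 15-mm nodule, quit 10 years ago.
print(round(spn_probability(smoke=1, age_decades=6.5,
                            diameter_mm=15, years_quit_decades=1.0), 2))
```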
Results…
• To evaluate the accuracy of their model, the
authors calculated the area under the ROC
curve.
• Review: What is an ROC curve?
– Calculate the predicted probability (pi) for every
person in the dataset.
– Order the pi’s from 1 to n (here 375).
– Classify every person with pi > p1 as having the
disease. Calculate sensitivity and specificity of this
rule for the 375 people in the dataset. (sensitivity will
be 100%; specificity should be 0%).
– Classify every person with pi > p2 as having the
disease. Calculate sensitivity and specificity of this
cutoff.
ROC curves continued…
– Repeat until you get to p375. Now specificity will be 100% and
sensitivity will be 0%.
– Plot sensitivity against 1 minus the specificity.
(Figure: the AREA UNDER THE CURVE is a measure of the accuracy of your model.)
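A sketch of the threshold-by-threshold ROC construction described above, using scikit-learn's roc_curve; the labels and predicted probabilities here are made up purely for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Illustrative labels (1 = disease) and predicted probabilities p_i.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
p_hat  = np.array([0.1, 0.3, 0.35, 0.4, 0.45, 0.6, 0.65, 0.7, 0.8, 0.9])

# roc_curve sweeps the cutoff over the distinct p_i, as in the review above,
# returning sensitivity (TPR) and 1 - specificity (FPR) at each cutoff.
fpr, tpr, thresholds = roc_curve(y_true, p_hat)
print("AUC:", roc_auc_score(y_true, p_hat))
```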
Results
• The authors found an AUC of 0.79 (95% CI: 0.74 to
0.84), which can be interpreted as follows:
– If the model has no predictive power, you have a 50-50 chance
of correctly classifying a person with SPN.
– Instead, here, the model has a 79% chance of correct
classification (quite an improvement over 50%).
A role for 10-fold cross-validation
• If we were to apply this logistic regression model to a
new dataset, the AUC will be smaller, and may be
considerably smaller (because of over-fitting).
• Since we don’t have extra data lying around, we can use
10-fold cross-validation to get a better estimate of the
AUC…
10-fold cross validation
• 1. Divide the 375 people randomly into sets of
37 and 38.
• 2. Fit the logistic regression model to the other 337 or 338 people (nine-tenths of the data).
• 3. Using the resulting model, calculate predicted
probabilities for the test data set (n=38). Save
these predicted probabilities.
• 4. Repeat steps 2 and 3, holding out a different
tenth of the data each time.
• 5. Build the ROC curve and calculate AUC using
the predicted probabilities generated in (3), as sketched below.
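A compact version of steps 1–5 with scikit-learn; the breast-cancer dataset again stands in for the SPN data, which are not available here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

X, y = load_breast_cancer(return_X_y=True)

# Each person's probability is predicted by the model fitted on the
# nine-tenths of the data that did not include that person.
p_cv = cross_val_predict(LogisticRegression(max_iter=5000), X, y,
                         cv=10, method="predict_proba")[:, 1]

print("cross-validated AUC:", roc_auc_score(y, p_cv))
```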
Results…
• After cross-validation, the AUC was 0.78 (95% CI: 0.73
to 0.83).
• This shows that the model is robust.
k-fold Cross Validation
• In k-fold cross-validation the data is first
partitioned into k equally (or nearly equally)
sized segments or folds.
• Subsequently k iterations of training and
validation are performed such that within each
iteration a different fold of the data is held-out for
validation while the remaining k - 1 folds are
used for learning.
• Fig. 1 demonstrates an example with k = 3. The
darker sections of the data are used for training
while the lighter sections are used for validation.
• In data mining and machine learning 10-fold
cross-validation (k = 10) is the most common.
k-fold Cross Validation
Leave-One-Out Cross-Validation
• Leave-one-out cross-validation (LOOCV) is a
special case of k-fold cross-validation where k
equals the number of instances in the data.
• In other words in each iteration nearly all the
data except for a single observation are used for
training and the model is tested on that single
observation.
• An accuracy estimate obtained using LOOCV is
known to be almost unbiased but it has high
variance, leading to unreliable estimates [3].
• It is still widely used when the available data are
very scarce, especially in bioinformatics where only
dozens of data samples are available.
Repeated K-Fold Cross-Validation
• To obtain a reliable performance estimate or
comparison, a large number of estimates is
always preferred.
• In k-fold cross-validation, only k estimates are
obtained.
• A commonly used method to increase the
number of estimates is to run k-fold cross-validation
multiple times.
• The data are reshuffled and re-stratified before
each round. (A short sketch of both variants follows below.)
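A brief sketch of both variants (LOOCV and repeated stratified k-fold) in scikit-learn; the model and dataset are illustrative stand-ins.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, RepeatedStratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# Leave-one-out: as many folds as there are instances.
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())

# Repeated 10-fold CV: reshuffle and re-stratify before each of 5 rounds,
# giving 50 estimates instead of 10.
rkf = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
rkf_scores = cross_val_score(model, X, y, cv=rkf)

print(loo_scores.mean(), rkf_scores.mean())
```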
Using Out-of-Sample Data
Holdout Data
Lift Curves & Gains Charts
Validation Data
Cross-Validation
Out-of-Sample Data
• Simplest idea: Divide the data into 2 pieces.
– Training Data: data used to fit the model
– Test Data: "fresh" data used to evaluate the model
• Test data contains:
– the actual target value Y
– the model prediction Y*
• We can find clever ways of displaying the relation
between Y and Y*:
– Lift curves, gains charts, ROC curves, …
Lift Curves
• Sort data by Y* (score).
• Break the test data into 10 equal pieces ("deciles").
• Lift measures:
– Segmentation power
– ROI of the modeling project
• Best "decile": lowest score → lowest LR
• Worst "decile": highest score → highest LR
• Difference: "Lift"
(Figure: bar chart of LR relativity by decile, 1–10, for the test data.)
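A rough sketch of how such a decile lift chart could be computed with pandas; the column names and the outcome definition are assumptions made for illustration.

```python
import pandas as pd

def decile_lift(test: pd.DataFrame) -> pd.Series:
    """Average outcome by model-score decile, relative to the overall average.

    Assumes `test` has a model score column 'y_star' and an actual
    outcome column 'y' (e.g. a loss ratio); both names are illustrative.
    """
    test = test.copy()
    # Sort by score and cut into 10 equal pieces (deciles).
    test["decile"] = pd.qcut(test["y_star"].rank(method="first"), 10,
                             labels=list(range(1, 11)))
    overall = test["y"].mean()
    # Relativity of each decile's average outcome to the overall average.
    return test.groupby("decile", observed=True)["y"].mean() / overall - 1.0

# Lift = relativity of the worst decile minus relativity of the best decile.
```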
Lift Curves: Practical Benefits
• What do we really care about when we build a model?
– High R², etc.?
– …or increased profitability?
• Paraphrase of Michael Berry:
"Success is measured in dollars… R², misclassification
rate… don't matter."
Lift Curves: Practical Benefit
• Lift curves can be used to estimate the LR benefit of
implementing the model.
– E.g., how would non-renewing the worst 5%
impact the combined ratio?
• The same cannot be said for R², deviance,
penalized likelihood, …
(Figure: the same bar chart of LR relativity by decile, 1–10, for the test data.)
Lift Curves: Other Benefits
• Allows one to easily compare multiple models on out-of-sample data.
– Which is the best technique?
• GLM, decision tree, neural net, MARS, …?
– Other modeling options:
• Optimal predictive variables, target variables, …
• Lends itself to an iterative model-building process,
"controlled experiments".
– Need for final model validation.
Lift Curves: Other Benefits
• Sometimes traditional statistical measures don't really
give a feel for how successful the model is.
– Personal-lines regression model fit on many millions of records:
• R² ≈ .0002
• But an excellent lift curve
– Many traditional statisticians would say we're wasting our time.
– Are we?
Gains Charts: Binary Target
• Y is {0,1}-valued:
– Fraud
– Defection
– Cross-sell
• Sort data by Y* (score).
• For each data point, calculate the % of "1's" vs. the % of the
population considered so far.
• Gain: get 90% of the fraudsters by focusing on 40% of the population.
(Figure: "Fraud Detection – Gains Chart": Perc.Fraud vs. Perc.Total, decision tree compared with the perfect model.)
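A hedged sketch of a gains-chart computation; the score and label arrays below are made up for illustration.

```python
import numpy as np

def gains_curve(y_true, y_score):
    """Cumulative % of 1's captured vs. % of the population, scored best-first."""
    order = np.argsort(-y_score)                # sort by score, descending
    y_sorted = np.asarray(y_true)[order]
    perc_total = np.arange(1, len(y_sorted) + 1) / len(y_sorted)
    perc_ones = np.cumsum(y_sorted) / y_sorted.sum()
    return perc_total, perc_ones

# Example: with these toy scores, the top 40% of cases capture most of the 1's.
y = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])
score = np.array([0.9, 0.8, 0.75, 0.7, 0.6, 0.5, 0.45, 0.3, 0.2, 0.1])
print(gains_curve(y, score))
```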
Gains Charts: Benefits
• Same as the lift-curve benefits.
• Business: "gain" measures the real-life benefit of using the model.
• Statistical: can easily compare the power of multiple techniques.
– Example: an actual analysis of "spam" data.
(Figure: "Spam Email Detection – Gains Charts": Perc.Spam vs. Perc.Total.Pop, comparing MARS, a neural net, pruned tree #1, a GLM, and regression against the perfect model.)
Model Selection vs. Validation
• Suppose we've gone through an iterative model-building process:
– Fit several models on the training data.
– Tested/compared them on the test data.
– Selected the "best" model.
• The test lift curve of the best model might still be overly optimistic.
– Why: we used the test data to select the best model.
– Implicitly, it was used for modeling.
Validation Data
• It is therefore preferable to divide the data into three pieces:
– Training Data: data used to fit the model.
– Test Data: "fresh" data used to select the model.
– Validation Data: data used to evaluate the final, selected model.
• Train/Test data are used iteratively for model building and
model selection (see the three-way split sketched below).
– During this time, the Validation data are set aside and not touched.
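A minimal three-way split with scikit-learn (illustrative; the 60/20/20 proportions are an assumption, not from the slides):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First carve off the validation set, then split the rest into train/test.
X_rest, X_val, y_rest, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest,
                                                    test_size=0.25, random_state=0)

# Build and select models with (X_train, y_train) and (X_test, y_test);
# touch (X_val, y_val) only once, to evaluate the final selected model.
```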
Validation Data
• The model lift on the train data is overly optimistic.
• The lift on the test data might be somewhat optimistic as well.
• The validation lift curve is a more realistic estimate of future performance.
(Figure: LR relativity to average by decile, 1–10, for train, test, and validation data.)
Validation Data
• This method is the best of all worlds.
• Train/Test is a good way to select an optimal model.
• Validation lift is a realistic estimate of future performance.
• Assuming you have enough data!
(Figure: the same LR-relativity-by-decile chart for train, test, and validation data.)
Cross-Validation
• What if we don't have enough data to set aside a test dataset?
• Cross-Validation:
– Each data point is used both as train and test data.
• Basic idea:
– Fit the model on 90% of the data; test on the other 10%.
– Now do this on a different 90/10 split.
– Cycle through all 10 cases.
– 10 "folds" is a common rule of thumb.
Ten Easy Pieces
• Divide the data into 10 equal pieces P1…P10.
• Fit 10 models, each on 90% of the data.
• Each data point is treated as an out-of-sample data point by
exactly one of the models.
(Table: models 1–10 vs. pieces P1–P10; each model trains on nine pieces and tests on the one remaining piece, so the "test" cells form a diagonal.)
Ten Easy Pieces
• Collect the scores from the red diagonal…
• …You have an out-of-sample lift curve based on the entire dataset.
• Even though the entire dataset was also used to fit the models.
(Table: the same 10 × 10 train/test grid, with the diagonal "test" cells highlighted.)
Uses of Cross-Validation
• Model Evaluation
– Collect the scores from the "red boxes" and generate a lift curve or
gains chart.
– Simulates the effect of using the train/test method.
– An end run around the "small dataset" problem.
• Model Selection
– Index your models by some parameter α:
» # variables in a regression
» # neural net nodes
» # leaves in a tree
– Choose the α value resulting in the lowest CV error rate.
Model Selection Example
• Use CV to select an optimal decision tree.
• Built into the Classification & Regression Tree (CART)
decision tree algorithm.
• Basic idea: "grow the tree" out as far as you can, then "prune back".
• CV tells you when to stop pruning.
How Trees Grow
• Goal: partition the dataset
so that each partition
("node") is as pure as
possible.
– How: find the yes/no split
(Xi < θ) that results in the
greatest increase in purity.
• A split is a variable/value
combination.
– Now do the same thing to
the two resulting nodes.
– Keep going until you’ve
exhausted the data.
How Trees Grow
• Suppose we are
predicting fraudsters.
• Ideally: each “leaf”
would contain either
100% fraudsters or
100% non-fraudsters.
– The more you split, the
purer the nodes become.
– (Low bias)
– But how do we know
we’re not over-fitting?
– (High variance)
Finding the Right Tree
• “Inside every big tree is
a small, perfect tree
waiting to come out.”
--Dan Steinberg
2004 CAS P.M. Seminar
• The optimal tradeoff of
bias and variance.
• But how to find it??
Growing & Pruning
• One approach: stop
growing the tree early.
• But how do you know
when to stop?
• CART: just grow the
tree all the way out;
then prune back.
• Sequentially collapse
nodes that result in the
smallest change in
purity.
• “weakest link” pruning.
Cost-Complexity Pruning
• Definition: Cost-Complexity Criterion
Rα= MC + αL
• MC = misclassification rate
– Relative to # misclassifications in root node.
• L = # leaves (terminal nodes)
• You get a credit for lower MC.
• But you also get a penalty for more leaves.
• Let T0 be the biggest tree.
• Find the sub-tree Tα of T0 that minimizes Rα.
• Optimal trade-off of accuracy and complexity.
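scikit-learn exposes this criterion directly; here is a brief sketch (the dataset is an illustrative stand-in, and sklearn's CART-style trees are a close but not identical implementation of the algorithm described in these slides).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Grow the biggest tree T0, then ask for the cost-complexity path:
# the sequence of alphas and the corresponding nested pruned sub-trees.
big_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
path = big_tree.cost_complexity_pruning_path(X, y)
print(path.ccp_alphas[:5])   # increasing alphas; alpha = 0 keeps the full tree

# Refit with a chosen alpha: larger alpha -> heavier penalty per leaf -> smaller tree.
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=path.ccp_alphas[3]).fit(X, y)
print(pruned.get_n_leaves())
```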
Weakest-Link Pruning
• Let’s sequentially collapse nodes that result in the
smallest change in purity.
• This gives us a nested sequence of trees that are all
sub-trees of T0.
T0 ⊃ T1 ⊃ T2 ⊃ T3 ⊃ … ⊃ Tk ⊃ …
• Theorem: the sub-tree Tα of T0 that minimizes Rα is in
this sequence!
• Gives us a simple strategy for finding best tree.
• Find the tree in the above sequence that minimizes CV
misclassification rate.
What is the Optimal Size?
• Note that α is a free parameter in:
Rα= MC + αL
• 1:1 correspondence between α and the size of the tree.
• What value of α should we choose?
• α = 0 → the maximum tree T0 is best.
• α large → you never get past the root node.
• The truth lies in the middle.
• Use cross-validation to select optimal α (size)
Finding α
• Fit 10 trees on the "blue" (training) data.
• Test them on the "red" (test) data.
• Keep track of misclassification rates for different values of α.
• Now go back to the full dataset and choose the best α-tree.
(Table: the same 10 × 10 train/test grid as before; each of the 10 trees is fit on nine pieces and tested on the held-out piece.)
How to Cross-Validate
• Grow the tree on all the data: T0.
• Now break the data into 10 equal-size pieces.
• 10 times: grow a tree on 90% of the data.
• Drop the remaining 10% (test data) down the nested trees
corresponding to each value of α.
• For each α add up errors in all 10 of the test data sets.
• Keep track of the α corresponding to lowest test
error.
• This corresponds to one of the nested trees Tk ⊂ T0 (see the sketch below).
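A hedged sketch of this procedure using cross-validation over sklearn's ccp_alpha; a grid search stands in for the exact CART protocol described above, and the dataset is again an illustrative stand-in.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate alphas from the full tree's cost-complexity path.
alphas = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y).ccp_alphas

# 10-fold CV over alpha: for each alpha, grow/prune on 90% and score on the
# held-out 10%, then keep the alpha with the lowest test error.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      {"ccp_alpha": alphas}, cv=10)
search.fit(X, y)   # refits the winning tree on the full dataset
print("best alpha:", search.best_params_["ccp_alpha"])
print("tree size:", search.best_estimator_.get_n_leaves())
```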
Just Right
• In summary: grow the tree all the way out.
• Then weakest-link prune back to the 15-node tree.
• Relative error: the proportion of CV-test cases misclassified.
• According to CV, the 15-node tree is nearly optimal.
(Figure: X-val relative error vs. size of tree and complexity parameter cp; the error curve is near its minimum at the 15-node tree.)
End Theory II
• 5 min Mindmapping
• 10 min Break
Practice II
Classification
• Use the orangeexample file for classification
• We are interested in whether we can distinguish between miRNAs
and random sequences with the selected features
• Try yourself
• Competition: who achieves the best accuracy in the next
30 min?