Slides - The Fenyo Lab

Introduction to Machine Learning
Sackler
Yin Aphinyanaphongs, MD/PhD
12/11/2014
Who Am I

Yin Aphinyanaphongs (yinformatics.com)

MD, PhD from Vanderbilt University in Nashville, TN.

Assistant Professor in the Center for Health Informatics and Bioinformatics.

Primary Expertise

• Machine Learning
  - Predictive Modeling
  - Text Classification
• Data Mining
  - Social Media
  - Large Medical Datasets

Secondary Expertise

• Search Engine Design / Information Retrieval
• Natural Language Processing
What I Teach

Introduction to Biomedical Informatics.

Introduction to Medicine for Computer Scientists.

Data Analytics in R for physicians.
Machine Learning Examples

• Given an email, classify it as spam or not spam.
• Given a handwritten digit, assign it the right number.
• Given descriptions of passengers on the Titanic, predict who will or will not survive.
• Given a gene expression microarray of a cancer, predict whether the cancer will or will not metastasize.
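As a concrete illustration of the first example above, here is a minimal sketch of a spam/not-spam text classifier in Python with scikit-learn; the tiny email corpus, its labels, and the bag-of-words-plus-logistic-regression setup are illustrative assumptions, not the lecture's actual method.

# Minimal sketch: spam vs. not-spam text classification with scikit-learn.
# The tiny email corpus and labels below are invented purely for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

emails = [
    "Win a free prize now, click here",       # spam
    "Meeting moved to 3pm, agenda attached",  # not spam
    "Cheap meds, limited time offer",         # spam
    "Can you review the attached draft?",     # not spam
]
labels = ["spam", "ham", "spam", "ham"]

# Encode each email as a bag of word counts, then fit a linear classifier.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(emails, labels)

print(model.predict(["Free offer, click now", "Draft agenda for the meeting"]))
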
Email Spam Text Classification
http://blog.cyren.com/uploads/blog/google-docs-spamsample.jpg
Digit Classification
http://nonbiritereka.hatenablog.com/entry/2014/09/18/100439
Predicting Titanic Survival

• Passenger class
• Name
• Sex
• Age
• Number of siblings/spouses aboard
• Number of parents/children aboard
• Ticket number
• Passenger fare
• Cabin
• Port of Embarkation

https://www.kaggle.com/c/titanic-gettingStarted
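Below is a hedged sketch of how these passenger features might be encoded and used to build a survival classifier; the file name train.csv and the column names follow the Kaggle Titanic dataset, and the random forest is an illustrative choice rather than anything prescribed in the slides.

# Sketch: predicting Titanic survival from the passenger features listed above.
# Assumes the Kaggle Titanic training file "train.csv" is available locally.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("train.csv")

features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
X = pd.get_dummies(df[features], columns=["Sex", "Embarked"])  # one-hot encode categoricals
X["Age"] = X["Age"].fillna(X["Age"].median())                  # simple imputation for missing ages
y = df["Survived"]

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())  # rough cross-validated accuracy
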
Molecular Signatures

A molecular signature is a computational or mathematical model that links high-dimensional molecular information to a phenotype or other response variable of interest.

Golub et al. (1999) heatmap
Machine Learning

Goal

• Construct algorithms that learn from data, such that a model built from training data will generalize to unseen data.
General Framework

Obtain Seq → Sample Seq (Optional) → Label Seq → Clean Seq → Encode Seq → Build a Model → Performance Evaluation (Internal) → Model Application and Validation (External)
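A hedged sketch of what this framework can look like in code for a generic labeled dataset; load_data() is a hypothetical placeholder, and the imputation, scaling, and SVM steps are illustrative choices rather than the pipeline used in the lecture.

# Sketch of the general framework: label -> clean -> encode -> build a model ->
# internal performance evaluation. load_data() is a hypothetical placeholder.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

def load_data():
    # Stands in for "Obtain / Sample / Label": random data with a simple signal.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 20))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    return X, y

X, y = load_data()

# Clean (impute missing values; a no-op here) -> Encode (scale) -> Build a model (SVM).
pipeline = make_pipeline(SimpleImputer(), MinMaxScaler(), SVC())

# Internal performance evaluation by cross-validation; external validation would
# apply the fitted pipeline to an entirely independent dataset.
print(cross_val_score(pipeline, X, y, cv=5).mean())
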
Basic Framework

[Figure: labeled examples (ALL / AML) are fed to a classification algorithm (e.g., Random Forests, Regularized Logistic Regression, Support Vector Machines, etc.), which produces a model that labels unseen examples as ALL or AML]
Key Concept – Supervised Learning
From the book “A Gentle Introduction to Support Vector Machines in Biomedicine”, Statnikov, Aliferis, Hardin, Guyon

Principles and geometric representation for supervised learning (1/7)
• We want to classify objects as boats or houses.
Principles and geometric representation for supervised learning (2/7)
• All objects before the coast line are boats and all objects after the coast line are houses.
• The coast line serves as a decision surface that separates the two classes.
Principles and geometric representation for supervised learning (3/7)
[Figure: an imperfect decision surface; these boats will be misclassified as houses, and this house will be misclassified as a boat]
Principles and geometric representation for supervised learning (4/7)
[Figure: boats and houses plotted by longitude and latitude]
• The methods that build classification models (i.e., “classification algorithms”) operate very similarly to the previous example.
• First all objects are represented geometrically.
Principles and geometric representation for supervised learning (5/7)
[Figure: boats and houses plotted by longitude and latitude]
• Then the algorithm seeks to find a decision surface that separates classes of objects.
Principles and geometric representation for supervised learning (6/7)
[Figure: unseen objects, marked “?”, on either side of the decision surface; those above it are classified as houses and those below it as boats]
• Unseen (new) objects are classified as “boats” if they fall below the decision surface and as “houses” if they fall above it.
Principles and geometric representation for supervised learning (7/7)
[Figure: three unseen objects (Object #1, Object #2, Object #3) plotted by longitude and latitude, to be classified by the decision surface]
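To make the geometric picture concrete, here is a minimal sketch that represents objects as (longitude, latitude) points and learns a linear decision surface; the coordinates, labels, and choice of logistic regression are invented for illustration.

# Sketch: objects represented geometrically as (longitude, latitude) points,
# with a linear classifier learning a decision surface between "boat" and "house".
# The coordinates and labels are invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.1, 0.2], [0.4, 0.1], [0.8, 0.3],   # boats (low latitude)
              [0.2, 0.8], [0.5, 0.9], [0.9, 0.7]])  # houses (high latitude)
y = np.array(["boat", "boat", "boat", "house", "house", "house"])

clf = LogisticRegression().fit(X, y)

# Unseen objects are classified by which side of the decision surface they fall on.
print(clf.predict([[0.3, 0.15], [0.6, 0.85]]))  # expected: boat, house
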
Key Concept – Overfitting, Underfitting
From the book “A Gentle Introduction to Support Vector Machines in Biomedicine”, Statnikov, Aliferis, Hardin, Guyon

Two problems: over-fitting & under-fitting
• Over-fitting (a model to your data) = building a model that is good on the original data but fails to generalize well to new/unseen data.
• Under-fitting (a model to your data) = building a model that is poor on both the original data and new/unseen data.
Over/under-fitting are related to complexity of the decision surface and how well the training data is fit.
[Figure: outcome of interest Y plotted against predictor X]
Scenario 1
[Figure: training data and future data plotted as outcome of interest Y vs. predictor X; a simple curve that captures the trend is labeled “This line is good!”, while a more complex curve fit tightly to the training points is labeled “This line overfits!”]
Scenario 2
[Figure: a second set of training data and future data plotted as outcome of interest Y vs. predictor X]
Over/under-fitting are related to complexity of the decision surface and how well the training data is fit.
[Figure: the Scenario 2 data with two candidate fits; a curve that captures the trend is labeled “This line is good!”, while an overly simple line is labeled “This line underfits!”]
Very important concept…

• Successful data analysis methods balance training data fit with complexity.
• Too complex a signature (to fit the training data well) → overfitting (i.e., the signature does not generalize).
• Too simplistic a signature (to avoid overfitting) → underfitting (it will generalize, but the fit to both the training and future data will be low and predictive performance poor).
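A minimal sketch of this fit-versus-complexity balance, using polynomial degree as a stand-in for decision-surface complexity; the noisy sine data and the specific degrees are assumptions made for illustration.

# Sketch: under- vs. over-fitting as model complexity (polynomial degree) varies.
# The noisy sine data are invented for illustration.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(40, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.2, size=40)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):  # too simple, roughly right, too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree,
          round(model.score(X_train, y_train), 2),  # fit to training data
          round(model.score(X_test, y_test), 2))    # generalization to held-out "future" data
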
Key Concept – Performance Estimation
From the book “A Gentle Introduction to Support Vector Machines in Biomedicine”, Statnikov, Aliferis, Hardin, Guyon
On estimation of classifier accuracy

• Large sample case: use hold-out validation (split the data once into a train set and a test set).
• Small sample case: use N-fold cross-validation.
[Figure: the data split into a single train/test partition for hold-out validation, and into rotating train/test folds for N-fold cross-validation]
Other versions of this general notion…

• Leave-one-out cross-validation
• Leave-pair-out cross-validation
• Bootstrap
• Single holdout
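A hedged sketch of these estimation schemes using scikit-learn utilities; the synthetic dataset and the SVM classifier are placeholders chosen only to make the example self-contained.

# Sketch: single hold-out, N-fold cross-validation, and leave-one-out estimation
# of classifier accuracy on a synthetic placeholder dataset.
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import (train_test_split, cross_val_score,
                                     KFold, LeaveOneOut)

X, y = make_classification(n_samples=60, n_features=10, random_state=0)
clf = SVC()

# Single hold-out (the large-sample strategy, shown here only for mechanics).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
print("hold-out:", clf.fit(X_tr, y_tr).score(X_te, y_te))

# N-fold cross-validation (the small-sample strategy).
cv5 = KFold(n_splits=5, shuffle=True, random_state=0)
print("5-fold CV:", cross_val_score(clf, X, y, cv=cv5).mean())

# Leave-one-out cross-validation.
print("LOOCV:", cross_val_score(clf, X, y, cv=LeaveOneOut()).mean())
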
Key Concept – The Support Vector Machine
From the book “A Gentle Introduction to Support Vector Machines in Biomedicine”, Statnikov, Aliferis, Hardin, Guyon
The Support Vector Machine (SVM) approach for building molecular signatures

• The support vector machine (SVM) is a binary classification algorithm.
• SVMs are important because of (a) theoretical reasons:
  - Robust to a very large number of variables and small samples
  - Can learn both simple and highly complex classification models
  - Employ sophisticated mathematical principles to avoid overfitting
  and (b) superior empirical results.
Main ideas of SVMs (1/3)
[Figure: normal patients and cancer patients plotted by gene X and gene Y]
• Consider an example dataset described by 2 genes, gene X and gene Y.
• Represent patients geometrically (by “vectors”).
Main ideas of SVMs (2/3)
[Figure: normal and cancer patients separated by a maximum-margin hyperplane]
• Find a linear decision surface (“hyperplane”) that can separate patient classes and has the largest distance (i.e., largest “gap” or “margin”) between border-line patients (i.e., “support vectors”).
Main ideas of SVMs (3/3)
[Figure: non-separable normal/cancer data in the gene X / gene Y space mapped by a kernel into a feature space where a separating decision surface exists]
• If such a linear decision surface does not exist, the data is mapped into a much higher dimensional space (“feature space”) where the separating decision surface is found.
• The feature space is constructed via a very clever mathematical projection (the “kernel trick”).
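A minimal sketch of these two ideas, a linear maximum-margin surface versus the kernel trick when no linear surface exists; the synthetic two-feature data stand in for "gene X" and "gene Y" and are not the Golub microarray data.

# Sketch: linear SVM vs. kernel (RBF) SVM on data that are not linearly separable.
# Two synthetic features stand in for "gene X" and "gene Y".
from sklearn.datasets import make_circles
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)

for kernel in ("linear", "rbf"):
    svm = SVC(kernel=kernel, C=1.0)
    print(kernel, round(cross_val_score(svm, X, y, cv=5).mean(), 2))
# The RBF kernel implicitly maps the data into a feature space where a separating
# surface exists, so it should clearly outperform the linear SVM here.
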
Key Concept – Curse of Dimensionality
Thanks to Dr. Gutierrez-Osuna: http://courses.cs.tamu.edu/rgutier/cs790_w02/l5.pdf
Curse of Dimensionality (1/3)
Curse of Dimensionality (2/3)
Curse of Dimensionality (3/3)
The range of feature counts in high-dimensional data includes:
• 10,000-50,000 (gene expression microarrays, aCGH, and early SNP arrays)
• >500,000 (exon arrays/tiled microarrays/SNP arrays)
• 10,000-300,000 (MS proteomics)
• >10,000,000 (LC-MS proteomics)
• >100,000,000 (next-generation sequencing)
High Dimensionality in Small Samples Causes

• Some methods do not run at all (classical regression)
• Some methods give bad results (KNN, Decision trees)
• Very slow analysis
• Very expensive/cumbersome clinical application
• Tends to “overfit”
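A small simulation sketch of the last point: with far more random features than samples, a classifier can fit the training labels perfectly yet perform at chance on held-out data. The pure-noise data below are an assumption for illustration only.

# Sketch: apparent (training) vs. cross-validated accuracy when dimensionality
# is high and the sample is small. Features and labels are pure noise.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 10000))   # 40 samples, 10,000 random "genes"
y = rng.integers(0, 2, size=40)    # random class labels

clf = SVC(kernel="linear")
print("training accuracy:", clf.fit(X, y).score(X, y))                       # typically 1.0
print("cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean())  # around 0.5
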
Cancer Classification Case Study
From Golub et al. (1999)
Case Study

• Classify the values of a gene microarray according to leukemia type:
  - AML
  - ALL
Task meta-data

• 72 samples
  - 47 ALL
  - 25 AML
• 5,327 genes
Labeled Microarrays

Treatment | Samples
AML       | 25
ALL       | 47
Encode Microarray

• Within each train fold, normalize the values of each column between 0 and 1.
• Notice that we don’t normalize the entire dataset and then run our classification algorithms (that would leak information from the test folds and result in overfitting).
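A hedged sketch of this encoding step: placing the 0-to-1 normalization inside a pipeline means it is re-fit on the training portion of each fold during cross-validation, never on the whole dataset. The random matrix below merely stands in for the 72 x 5,327 ALL/AML expression matrix, so the printed score itself is meaningless.

# Sketch: normalize each column to [0, 1] *within each training fold* by putting
# the scaler inside a Pipeline, so the test fold never influences the scaling.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(72, 5327))    # placeholder for the expression matrix
y = np.array([0] * 47 + [1] * 25)  # 47 ALL, 25 AML

model = make_pipeline(MinMaxScaler(), SVC(kernel="linear"))

# The scaler is fit on each training fold only, avoiding the leakage warned
# about above (normalizing the entire dataset before cross-validation).
print(cross_val_score(model, X, y, cv=5).mean())
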
Build a Model – Support Vector Machine

[Figure: a scatter of data points in two dimensions separated by an SVM decision boundary]
This example illustrates a 2-dimensional space. The x and y axes represent one word each. A full text categorization example could contain upwards of 50,000 words and thus 50,000 dimensions.
Build a Model – K nearest neighbors
http://mines.humanoriented.com/classes/2010/fall/csci568/portfolio_exports/lguo/knn.html
Build a Model – Neural Network
http://en.wikipedia.org/wiki/Artificial_neural_network
Estimate Performance
• Small sample case: use N-fold cross-validation.
[Figure: the data split into rotating train/test folds for cross-validation]
Results

Classifier                  | Proportion of Correct Classifications
Baseline (All in one class) | 65.0%
Support Vector Machine      | 91.7%
K Nearest Neighbors         | 87.9%
Neural Network              | 84.7%

http://bib.oxfordjournals.org/content/7/1/86.full.pdf+html
Conclusions

• Machine Learning Examples
• Key Concepts
  - Supervised Learning
  - Overfitting/Underfitting
  - Support Vector Machines
  - Cross Validation
  - Curse of Dimensionality
• Case Study – Cancer Classification

Thanks.

• Dr. Gutierrez-Osuna
• Dr. Alexander Statnikov
Molecular Signatures
Slides from Dr. Alexander Statnikov, PhD.
Definition of a molecular signature

A molecular signature is a computational or mathematical model that links high-dimensional molecular information to a phenotype or other response variable of interest.
Example of a molecular signature
[Figure: a patient with lung cancer undergoes a biopsy; the gene expression profile is fed to a molecular signature, which classifies the tumor as primary lung cancer or metastatic lung cancer]
Main uses of molecular signatures

1. Direct benefits: Models of disease phenotype/clinical outcome
   • Diagnosis
   • Prognosis, long-term disease management
   • Personalized treatment (drug selection, titration)
2. Ancillary benefits 1: Biomarkers for diagnosis or outcome prediction
   • Make the above tasks resource efficient and easy to use in clinical practice
   • Helps next-generation molecular imaging
   • Leads for potential new drug candidates
3. Ancillary benefits 2: Discovery of structure & mechanisms (regulatory/interaction networks, pathways, sub-types)
   • Leads for potential new drug candidates
Recent molecular signatures available for patient care

• Agendia
• Clarient
• Prediction Sciences
• LabCorp
• OvaSure
• University Genomics
• BioTheranostics
• Applied Genomics
• Genomic Health
• Veridex
• Power3
• Correlogic Systems
Prostate cancer signatures in the market
MammaPrint
• Developed by Agendia (www.agendia.com)
• 70-gene signature to stratify women with breast cancer that hasn’t spread into “low risk” and “high risk” for recurrence of the disease
• Independently validated in >1,000 patients
• So far performed >10,000 tests
• Cost of the test is ~$3,000
• In February 2007 the FDA cleared the MammaPrint test for marketing in the U.S. for node-negative women under 61 years of age with tumors of less than 5 cm.
• TIME Magazine’s 2007 “medical invention of the year”.
Oncotype DX Breast Cancer Assay (Launched in 2004)

• Developed by Genomic Health (www.genomichealth.com)
• 21-gene signature to predict whether a woman with localized, ER+ breast cancer is at risk of relapse
• Independently validated in thousands of patients
• So far performed >200,000 tests
• Price of the test is $4,175
• Not FDA approved but covered by most insurances, including Medicare
• Its sales in 2012 reached $199M.
Economic validity

In a 2005 economic analysis of the Recurrence Score result in LN-, ER+ patients receiving tamoxifen, Hornberger et al. performed a cost-utility analysis using a decision analytic model. Using this model, the Recurrence Score result was predicted on average to increase quality-adjusted survival by 16.3 years and reduce overall costs by $155,128.

Instead of using the model, economic benefits can now be assessed from the published clinical utility of the test and actual health plan costs for adjuvant chemotherapy. For example, in a 2 million member plan, approximately 773 women are eligible for the test. If half receive the test, given the high and increasing cost of adjuvant chemotherapy, supportive care, and management of adverse events, the use of the Oncotype DX assay is estimated to save approximately $1,930 per woman tested (given an aggregate 34% reduction in chemotherapy use).

References about health benefits and cost-effectiveness:
• “Economic Analysis of Targeting Chemotherapy Using a 21-Gene RT-PCR Assay in Lymph Node-Negative, Estrogen Receptor-Positive, Early-Stage Breast Cancer.” Am J Manag Care. 2005; 11(5):313-324.
• “Impact of a 21-Gene RT-PCR Assay on Treatment Decisions in Early-Stage Breast Cancer: An Economic Analysis Based on Prognostic and Predictive Validation Studies.” Cancer. 2007; 109(6):1011-1018.
Oncotype DX Colon Cancer Assay (Launched in 2010)

• Developed by Genomic Health (www.genomichealth.com)
• Multigene signature to predict risk of recurrence in patients with stage II colon cancer
• Independently validated in thousands of patients
• Price of the test is $3,280
• Not FDA approved but covered by most insurances, including Medicare
Oncotype DX Prostate Cancer Assay (Launched in 2013)

• Developed by Genomic Health (www.genomichealth.com)
• Multigene signature to distinguish aggressive prostate cancer from a less threatening one
• Independently validated
• Price of the test is $3,820
• Not FDA approved but covered by most insurances, including Medicare
Oncotype DX Business Metrics
Data from http://investor.genomichealth.com/
Conclusions

• Machine Learning Examples
• Key Concepts
  - Supervised Learning
  - Overfitting/Underfitting
  - Support Vector Machines
  - Cross Validation
  - Curse of Dimensionality
• Case Study – Cancer Classification
• Case Study – Molecular Signatures
Thanks.

Dr. Gutierrez-Osuna

Dr. Alexander Statnikov