Supervised Learning I
BME 230
Supervised (hypothesis-driven) learning
In clustering, we only had expression data available.
In supervised learning, a “response” or “label” or “class” is also known (e.g. group of related genes, tissue of origin, drug treatment, concentration, temperature, ...).
Supervised Learning in gene expression analysis
• Use information about known biology
• Usually predict arrays based on their expression vectors (columns of the matrix)
  – Samples of different cell types
  – Healthy patients versus sick
  – Drug treatments
• Find genes that are informative
• Use expression of informative genes to predict new samples.
Example – Predicting Cancer Prognosis
• Somatic chromosomal alterations frequently occur in cancer
• Some may be neutral while others contribute to pathogenesis
• Identifying aberrations that recur in tumors
Array Comparative Genomic Hybridization
Chin et al. (2007)
171 Primary Breast Tumors
41 Breast Cancer Cell Lines
Cancer Classification Challenge
• Given genomic features like CNV (x)
• Predict clinical outcomes (y)
• Classification if y is discrete
  – Cancer vs. normal
  – HER2 status
• Regression if y is continuous
  – Genome instability index (GII)
Cluster the CNV Data
[Figure: clustered heatmap of the CNV data across patients. Predictors, x: CNV. Discrete responses, y: Grade. Continuous responses, y: GII.]
Classifying leukemia (Golub et al. 1999)
• Target cancer w/ the right drug
• Usually classify by morphology
• Acute leukemia
  – known to have variable clinical outcome
  – nuclear morphologies differ
  – 1960s: enzyme assays (myeloperoxidase +)
  – 1970s: antibodies to lymphoid vs. myeloid
  – 1995: specific chromosomal translocations
Golub et al. (1999)
Classifying leukemia (Golub et al. 1999)
• 38 bone marrow samples at time of diagnosis
  – 27 patients w/ acute lymphoblastic leukemia (ALL)
  – 11 patients w/ acute myeloblastic leukemia (AML)
• Class discovery – can clustering “find” the distinction between the leukemias?
Golub et al. (1999)
Leukemia: Neighborhood analysis
Golub et al. (1999)
Classifying leukemia (Golub et al. 1999)
Golub et al. (1999)
Supervised Learning
Train a model with examples x1, ..., xn of labelled data, with labels y1, ..., yn.
Find a function h(x) → y, so we can predict the class y of new observations.
Copy Number Variation Data
Patient genomic samples (x: the gene-by-sample data matrix; y: the class label of each sample):

              sample1  sample2  sample3  sample4  sample5  ...
  y:          Normal   Normal   Normal   Cancer   Cancer
  gene 1       0.46     0.30     0.80     1.51     0.90    ...
  gene 2      -0.10     0.49     0.24     0.06     0.46    ...
  gene 3       0.15     0.74     0.04     0.10     0.20    ...
  gene 4      -0.45    -1.03    -0.79    -0.56    -0.32    ...
  gene 5      -0.06     1.06     1.35     1.09    -1.09    ...

A negative value means fewer copies of gene i in sample j.
Tumor Classification with Genomic Data
Three main types of statistical problems are associated with high-throughput genomic data:
• Identification of “marker” genes that characterize the different tumor classes (feature or variable selection).
• Identification of new/unknown tumor classes using gene expression profiles (unsupervised learning – clustering).
• Classification of samples into known classes (supervised learning – classification).
Classification

              sample1  sample2  sample3  sample4  sample5  ...  New sample
  Y:          Normal   Normal   Normal   Cancer   Cancer        unknown = Y_new
  X:  gene 1   0.46     0.30     0.80     1.51     0.90    ...   0.34
      gene 2  -0.10     0.49     0.24     0.06     0.46    ...   0.43
      gene 3   0.15     0.74     0.04     0.10     0.20    ...  -0.23
      gene 4  -0.45    -1.03    -0.79    -0.56    -0.32    ...  -0.91
      gene 5  -0.06     1.06     1.35     1.09    -1.09    ...   1.23   (= X_new)
• Each object (e.g. an array or column) is associated with a class label (or response) Y ∈ {1, 2, …, K} and a feature vector (vector of predictor variables) of G measurements: X = (X1, …, XG).
• Aim: predict Y_new from X_new.
Supervised Learning
• Neighbor-based methods
– k-nearest neighbors (KNN)
– Parzen windows & kernel density estimation
• Discriminating hyperplanes
– Linear discriminant (LDA/QDA/PDA)
– Support Vector Machines (SVM)
– Neural nets and perceptrons (ANNs)
• Decision trees (CART)
• Aggregating Classifiers
Neighbor-based methods (guilt by association)
The function of a gene should be similar to the functions of its neighbors.
Neighbors are found in the predictor space X.
The neighbors vote for the function of the gene.
Nearest Neighbor Classification
• Based on a measure of distance between observations (e.g. Euclidean distance or one minus correlation).
• The k-nearest neighbor rule (Fix and Hodges, 1951) classifies an observation X as follows:
  – find the k closest observations in the training data,
  – predict the class by majority vote, i.e. choose the class that is most common among those k neighbors.
  – k is a parameter whose value will be determined later by minimizing the cross-validation error.
  – E. Fix and J. Hodges. Discriminatory analysis. Nonparametric discrimination: Consistency properties. Tech. Report 4, USAF School of Aviation Medicine, Randolph Field, Texas, 1951.
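As a concrete illustration, here is a minimal Python/NumPy sketch of the k-NN rule; the Euclidean metric, the toy data, and the function name are illustrative choices, not taken from the lecture:

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training observations."""
    # Euclidean distance from x_new to every training observation
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # indices of the k closest observations
    nearest = np.argsort(dists)[:k]
    # majority vote among their labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# toy example (values and labels made up for illustration)
X_train = np.array([[0.4, -0.1], [0.3, 0.5], [1.5, 0.1], [0.9, 0.5]])
y_train = np.array(["Normal", "Normal", "Cancer", "Cancer"])
print(knn_predict(X_train, y_train, np.array([1.2, 0.3]), k=3))  # -> "Cancer"
```

In practice k would be chosen by minimizing the cross-validation error, as noted above.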
Neighbor-based methods
[Figure: a set of genes, each with an expression vector and a binary “degradation” label (0/1); a “new” unclassified gene, marked “?”, is shown. Does it encode a protein involved in degradation?]
Neighbor-based methods
1. Find its closest neighbors (genes ranked by similarity of expression, most to least similar), each with its degradation label:

  similarity   degradation
    0.80           1        (most similar)
    0.70           1
    0.65           1
    0.60           0
    0.60           1
    0.60           1
    0.54           1
    0.30           0
    0.10           0
    0.05           0
    0.03           0
    0.01           0        (least similar)
Neighbor-based methods
2. Let the closest neighbors vote on function: count the # of 1's vs. the # of 0's among their degradation labels (same similarity-ranked table as above).
k-nearest neighbors
The k closest neighbors get to vote (no matter how far away they are).
[Figure: with k=3, the three nearest neighbors vote to decide the function of the gene.]
k-nearest neighbors
k-nearest neighbors with k=5: the five most similar genes (similarities 0.80, 0.70, 0.65, 0.60, 0.60) have degradation labels 1, 1, 1, 0, 1, so 4/5 say degradation.
Parzen windows
Neighbors within distance d get to vote (no matter how many there are).
[Figure: all genes within distance d of the query vote on the function of the gene.]
Parzen windows
Parzen window with similarity > 0.1: nine genes fall within the window, with degradation labels 1, 1, 1, 0, 1, 1, 1, 0, 0, so 6/9 say degradation.
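A minimal Python sketch of the Parzen-window (fixed-radius) vote, using the toy similarities and labels from the figure; the inclusive threshold is chosen so the count matches the slide's 6/9, and the function name is illustrative:

```python
import numpy as np

def parzen_vote(similarities, labels, threshold=0.1):
    """All neighbors whose similarity meets the threshold vote with equal weight."""
    similarities = np.asarray(similarities)
    labels = np.asarray(labels)
    in_window = similarities >= threshold   # fixed window: a similarity cutoff here
    votes = labels[in_window]
    return int(votes.mean() >= 0.5), votes.sum(), votes.size

# toy values from the figure
sims   = [0.80, 0.70, 0.65, 0.60, 0.60, 0.60, 0.54, 0.30, 0.10, 0.05, 0.03, 0.01]
labels = [1,    1,    1,    0,    1,    1,    1,    0,    0,    0,    0,    0]
pred, yes, n = parzen_vote(sims, labels, threshold=0.1)
print(f"{yes}/{n} say degradation -> predict {pred}")   # 6/9 say degradation -> predict 1
```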
KNN for missing data imputation
Microarrays have *tons* of missing data, and some methods won't work with NAs. What can you do?
Troyanskaya et al. 2002
Use k-nearest neighbors to impute the missing values.
Troyanskaya et al. 2002
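A simplified Python/NumPy sketch of KNN imputation: distances are computed over the columns observed in the target row, and missing entries are filled with the unweighted mean of the k most similar complete rows. The published KNNimpute method differs in details (e.g. how neighbors are weighted), so treat this as an illustration:

```python
import numpy as np

def knn_impute(X, k=5):
    """Fill NaNs in each row using the k most similar rows that have no missing values."""
    X = X.astype(float).copy()
    complete = X[~np.isnan(X).any(axis=1)]              # rows with no missing entries
    for i in range(X.shape[0]):
        missing = np.isnan(X[i])
        if not missing.any():
            continue
        observed = ~missing
        # distance to each complete row, using only the columns observed in row i
        dists = np.linalg.norm(complete[:, observed] - X[i, observed], axis=1)
        neighbors = complete[np.argsort(dists)[:k]]
        X[i, missing] = neighbors[:, missing].mean(axis=0)
    return X

# toy matrix with one missing value
X = np.array([[1.0, 2.0, np.nan],
              [1.1, 1.9, 3.0],
              [0.9, 2.1, 2.8],
              [5.0, 5.0, 5.0]])
print(knn_impute(X, k=2))   # the NaN is replaced by the mean of its 2 nearest rows
```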
Hyperplane discriminants
Search for a single partition that places all positives on one side and all negatives on the other side.
Hyperplane discriminants
[Figure: Y is discrete (e.g. cancer vs. non-cancer) plotted against X (e.g. PCNA expression).]
Hyperplane discriminants
[Figure: Y is discrete; the “decision boundary” (discriminant line / separating hyperplane) is drawn, and training points on the wrong side are marked as misclassified. Axes: Y (e.g. cancer vs. non-cancer), X (e.g. PCNA expression).]
Hyperplane discriminants
[Figure: test data plotted against the same decision boundary; test points on the wrong side are marked as misclassified. Axes: Y (e.g. cancer vs. non-cancer), X (e.g. PCNA expression).]
Hyperplanes
[Figure: Y is continuous (e.g. survival) plotted against X (e.g. PCNA expression), with the fitted line (hyperplane) shown.]
Hyperplanes
X – a set of predictors
[Figure: Y (e.g. survival) as a function of two predictors, X1 (e.g. PCNA) and X2 (e.g. a cell-surface marker), with the fitted hyperplane.]
Classify new cases with selected features
f(X_i) → y
f(X_i) = vote of the selected features
f(X_i) = ∑_j w_j X_ij = X_i β
We want a β that minimizes error. On the training data the least-squares error is:
      ∑_i (f(X_i) - Y_i)² = (Xβ - y)ᵀ(Xβ - y)
      β* = argmin_β (Xβ - y)ᵀ(Xβ - y)
      β* = (XᵀX)⁻¹Xᵀy
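A minimal NumPy sketch of fitting β by least squares and thresholding X_new β to classify; the toy data, the added intercept column, and the 0.5 threshold are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy training data: 20 samples x 3 features, binary labels coded 0/1
X = rng.normal(size=(20, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=20) > 0).astype(float)

# add an intercept column
Xb = np.hstack([X, np.ones((20, 1))])

# beta* = argmin_beta (X beta - y)^T (X beta - y); lstsq computes the
# closed-form solution (X^T X)^{-1} X^T y in a numerically stable way
beta, *_ = np.linalg.lstsq(Xb, y, rcond=None)

# classify a new case by thresholding f(x) = x beta at 0.5
x_new = np.array([0.8, -0.2, 0.1, 1.0])   # last entry is the intercept term
print("predicted class:", int(x_new @ beta > 0.5))
```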
Fisher Linear Discriminant Analysis
• In a two-class classification problem, we are given n samples in a d-dimensional feature space: n1 in class 1 and n2 in class 2.
• Goal: find a vector w and project the n samples onto the axis y = w'x, so that the projected samples are well separated.
[Figure: w1 gives poor separation; w2 gives good separation.]
Fisher Linear Discriminant Analysis
• The sample mean vector for the ith class is m_i and the sample covariance matrix for the ith class is S_i.
• The between-class scatter matrix is:
      S_B = (m_1 - m_2)(m_1 - m_2)'
• The within-class scatter matrix is:
      S_W = S_1 + S_2
• The sample mean of the projected points in the ith class is:
      m̃_i = (1/n_i) ∑_{x ∈ ith class} w'x = w'm_i
• The variance of the projected points in the ith class is:
      s̃_i² = ∑_{x ∈ ith class} (w'x - w'm_i)² = w'S_i w
Fisher Linear Discriminant Analysis
Fisher linear discriminant analysis chooses the w that maximizes
      J(w) = |m̃_1 - m̃_2|² / (s̃_1² + s̃_2²) = (w'S_B w) / (w'S_W w),
i.e. the between-class distance should be as large as possible while the within-class scatter should be as small as possible.
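Maximizing this ratio gives w proportional to S_W⁻¹(m_1 - m_2), which takes only a few lines of NumPy; the toy data and the midpoint decision threshold below are illustrative assumptions:

```python
import numpy as np

def fisher_lda_direction(X1, X2):
    """Return the Fisher discriminant direction w ~ S_W^{-1} (m1 - m2)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # within-class scatter: sum of the two class scatter matrices
    S1 = (X1 - m1).T @ (X1 - m1)
    S2 = (X2 - m2).T @ (X2 - m2)
    return np.linalg.solve(S1 + S2, m1 - m2)

rng = np.random.default_rng(1)
X1 = rng.normal(loc=[0, 0], size=(30, 2))   # class 1
X2 = rng.normal(loc=[2, 1], size=(30, 2))   # class 2
w = fisher_lda_direction(X1, X2)

# project a new point and compare with the midpoint of the projected class means
x_new = np.array([1.8, 0.9])
midpoint = 0.5 * (w @ X1.mean(axis=0) + w @ X2.mean(axis=0))
print("class 1" if w @ x_new > midpoint else "class 2")   # -> "class 2"
```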
Maximum Likelihood Discriminant Rule
• A maximum likelihood (ML) classifier chooses the class that makes the observed data most probable.
• The ML discriminant rule predicts the class of an observation X as the one that gives the largest likelihood to X, i.e.
      h(x) = argmax_k P(x | y = k)
Gaussian ML Discriminant Rules
• Assume the conditional density for each class is a multivariate Gaussian (normal), P(X | Y = k) ~ N(μ_k, Σ_k).
• Then the ML discriminant rule is
      h(X) = argmin_k { (X - μ_k) Σ_k⁻¹ (X - μ_k)' + log|Σ_k| }
• In general, this is a quadratic rule (quadratic discriminant analysis, or QDA in R).
• In practice, the population mean vectors μ_k and covariance matrices Σ_k are estimated from the training set:
      μ̂_k = x̄_k   and   Σ̂_k = S_k
      h(X) = argmin_k { (X - μ̂_k) Σ̂_k⁻¹ (X - μ̂_k)' + log|Σ̂_k| }
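A minimal NumPy sketch of this plug-in Gaussian ML (QDA) rule; the toy data are made up, and with few samples per class the covariance estimates can be singular, so real genomic applications would need regularization or the diagonal variants below:

```python
import numpy as np

def qda_fit(X, y):
    """Plug-in estimates: a mean vector and covariance matrix per class."""
    return {k: (X[y == k].mean(axis=0), np.cov(X[y == k], rowvar=False))
            for k in np.unique(y)}

def qda_predict(params, x):
    """h(x) = argmin_k (x - mu_k)' Sigma_k^{-1} (x - mu_k) + log|Sigma_k|."""
    scores = {}
    for k, (mu, Sigma) in params.items():
        d = x - mu
        scores[k] = d @ np.linalg.solve(Sigma, d) + np.log(np.linalg.det(Sigma))
    return min(scores, key=scores.get)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0], 1.0, (40, 2)), rng.normal([3, 1], 0.5, (40, 2))])
y = np.array(["normal"] * 40 + ["cancer"] * 40)
print(qda_predict(qda_fit(X, y), np.array([2.8, 1.1])))   # -> "cancer"
```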
Gaussian ML Discriminant Rules
• When all class densities have the same covariance matrix, Σ_k = Σ, the discriminant rule is linear (linear discriminant analysis, or LDA; FLDA for K = 2):
      h(x) = argmin_k (x - μ_k) Σ⁻¹ (x - μ_k)'
• In practice, the population mean vectors μ_k and the constant covariance matrix Σ are estimated from the learning set L:
      μ̂_k = x̄_k   and   Σ̂ = ∑_k (n_k - 1) S_k / (n - K)
      h(x) = argmin_k (x - μ̂_k) Σ̂⁻¹ (x - μ̂_k)'
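Under the equal-covariance assumption the same sketch becomes linear; a possible NumPy version using the pooled covariance estimate above (toy-data caveats as before):

```python
import numpy as np

def lda_fit(X, y):
    """Class means plus the pooled within-class covariance sum_k (n_k - 1) S_k / (n - K)."""
    classes = np.unique(y)
    n, K = len(y), len(classes)
    means = {k: X[y == k].mean(axis=0) for k in classes}
    Sigma = sum((np.sum(y == k) - 1) * np.cov(X[y == k], rowvar=False)
                for k in classes) / (n - K)
    return means, Sigma

def lda_predict(means, Sigma, x):
    """h(x) = argmin_k (x - mu_k)' Sigma^{-1} (x - mu_k)."""
    scores = {k: (x - mu) @ np.linalg.solve(Sigma, x - mu) for k, mu in means.items()}
    return min(scores, key=scores.get)

rng = np.random.default_rng(3)
X = np.vstack([rng.normal([0, 0], 1.0, (40, 2)), rng.normal([3, 1], 1.0, (40, 2))])
y = np.array(["normal"] * 40 + ["cancer"] * 40)
means, Sigma = lda_fit(X, y)
print(lda_predict(means, Sigma, np.array([0.2, -0.3])))   # -> "normal"
```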
Gaussian ML Discriminant Rules
• When the class densities have diagonal covariance matrices, Σ_k = diag(σ_k1², …, σ_kG²), the discriminant rule is given by additive quadratic contributions from each variable (diagonal quadratic discriminant analysis, or DQDA):
      h(x) = argmin_k ∑_{i=1}^{G} [ (x_i - μ_ki)² / σ_ki² + log σ_ki² ]
• When all class densities have the same diagonal covariance matrix Σ = diag(σ_1², …, σ_G²), the discriminant rule is again linear (diagonal linear discriminant analysis, or DLDA in R):
      h(x) = argmin_k ∑_{i=1}^{G} (x_i - μ_ki)² / σ_i²
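The diagonal (DLDA) variant needs only per-gene means and variances, which is why it scales to thousands of genes; a compact sketch, with a simple average of the class variances standing in for the pooled estimate:

```python
import numpy as np

def dlda_fit(X, y):
    """Per-class means plus one shared per-feature variance vector."""
    classes = np.unique(y)
    means = {k: X[y == k].mean(axis=0) for k in classes}
    var = np.mean([X[y == k].var(axis=0, ddof=1) for k in classes], axis=0)
    return means, var

def dlda_predict(means, var, x):
    """h(x) = argmin_k sum_i (x_i - mu_ki)^2 / sigma_i^2."""
    scores = {k: np.sum((x - mu) ** 2 / var) for k, mu in means.items()}
    return min(scores, key=scores.get)

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (30, 5)), rng.normal(1, 1, (30, 5))])
y = np.array(["ALL"] * 30 + ["AML"] * 30)
means, var = dlda_fit(X, y)
print(dlda_predict(means, var, np.full(5, 0.9)))   # -> "AML"
```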
Application of the ML Discriminant Rule
• Weighted gene voting method (Golub et al. 1999)
  – One of the first applications of an ML discriminant rule to gene expression data.
  – This method turns out to be a minor variant of the sample Diagonal Linear Discriminant rule.
  – Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439):531-537.
Support Vector Machines
Find the separating hyperplane with maximum margin.
In practice classes can overlap, so an error term measures how far onto the wrong side a point may be.
Hastie, Tibshirani, Friedman (2001)
Support Vector Machines
To find the hyperplane, maximize the (dual) Lagrangian
      L_D = ∑_i α_i - ½ ∑_i ∑_j α_i α_j y_i y_j x_i'x_j,
subject to ∑_i α_i y_i = 0 and α_i ≥ 0 (with α_i ≤ C when the classes overlap),
which gives the solution
      β = ∑_i α_i y_i x_i.
Any point with α_i > 0 is part of the support (a support vector).
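A minimal sketch of fitting a maximum-margin classifier with scikit-learn's SVC (the library choice, toy data, and C value are assumptions, not from the lecture); the points with nonzero α reappear as the fitted model's support vectors:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
# two overlapping Gaussian classes in 2-D (toy data)
X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)), rng.normal([2, 2], 1.0, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

# linear SVM; C controls how much margin violation (overlap) is tolerated
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print("support vectors per class:", clf.n_support_)
print("predicted label for a new point:", clf.predict([[1.8, 2.1]])[0])
```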
Kernels in SVMs
The dot product x_i'x_j measures how close observations i and j are.
It can be replaced with any appropriate distance (similarity) measure.
That distance measure is called a kernel.
Changing the distance measure effectively changes the space in which we search for the hyperplane.
Hastie et al. (2001)
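Continuing the sketch above, swapping the kernel is a one-argument change (again assuming scikit-learn; the degree-4 polynomial mirrors the next slide, and the RBF gamma is an arbitrary illustrative value):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)), rng.normal([2, 2], 1.0, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

# same data, different kernels: the hyperplane is now sought in a transformed space
poly_clf = SVC(kernel="poly", degree=4, C=1.0).fit(X, y)   # 4th-degree polynomial kernel
rbf_clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)    # Gaussian (RBF) kernel
for name, clf in [("polynomial", poly_clf), ("rbf", rbf_clf)]:
    print(name, "support vectors per class:", clf.n_support_)
```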
SVMs
SVMs – 4th-degree polynomial kernel
Hastie et al. (2001)
SVMs in microarray analysis
The first application tried to predict gene categories from the Eisen yeast compendium.
Train to distinguish between:
  ribosome vs. not
  TCA vs. not
  respiration vs. not
  proteasome vs. not
  histones vs. not
  helix-turn-helix vs. not
Brown et al. (2000)
Predicting TCA w/ SVM
Brown et al. (2000)
Predicting Ribosomes w/ SVM
Brown et al. (2000)
Predicting HTH w/ SVM
Brown et al. (2000)
“False” predictions
FP: genes newly predicted to be in a functional group, though thought to belong to another, may be coregulated with the new group.
FN: genes that were thought to belong to a functional group may not be coregulated with that group.
Inspecting “errors” often leads to the most interesting findings!
Looking closely at “false” predictions w/ SVM
RPN1: regulatory particle subunit; interacts with the DNA-damage protein Rad23 (Elsasser S et al. 2002); shouldn't have been in the list.
Brown et al. (2000)
E.g. of a “mis-annotated” gene
EGD1's expression profile closely resembles that of other ribosomal subunits.
Brown et al. (2000)
New functional annotations
Brown et al. (2000)