Transcript Slide 1

Canadian Bioinformatics
Workshops
www.bioinformatics.ca
Module 6
Classification
Exploratory Data Analysis and Essential Statistics using R†
Boris Steipe
Toronto, September 8–9 2011
DEPARTMENT OF
BIOCHEMISTRY
DEPARTMENT OF
MOLECULAR GENETICS
† Includes material originally developed by Sohrab Shah
[Image: The Judgement of Paris, Archaic (560–550 BCE)]
Classification
• What is classification?
• Supervised learning
• also known as discriminant analysis
• Work from a set of objects with predefined classes
• e.g., basal vs. luminal, or good responder vs. poor responder
• Task: learn from the features of the objects what the basis for discrimination is
• Statistically and mathematically heavy
Module 6: Classification
bioinformatics.ca
Classification
[Figure: labelled training samples (three "poor response", three "good response")
are used to learn a classifier; the classifier then predicts the most likely
response for a new patient]
Classification
• Input: a set of measurements, variables, or features
• Output: a discrete label for the class that the set of features
most closely resembles
• Classification can be probabilistic or deterministic
• How is classification different from clustering?
• We know the groups or classes a priori
• Classification is a prediction on ‘what’ an object is, not
what other objects it most closely resembles
• Clustering is finding patterns in data
• Classification is using known patterns to predict an object type
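The idea of learning known classes and then predicting the class of a new object can be sketched with a toy nearest-centroid classifier. This is an illustrative sketch only (in Python rather than the workshop's R); the samples, feature values, and class labels below are all invented:

```python
# Toy supervised classification: learn per-class feature centroids from
# labelled training objects, then assign a new object to the class whose
# centroid is nearest in feature space.

def train_centroids(samples):
    # samples: list of (features, label); returns {label: centroid}
    sums, counts = {}, {}
    for x, y in samples:
        counts[y] = counts.get(y, 0) + 1
        s = sums.setdefault(y, [0.0] * len(x))
        for j, v in enumerate(x):
            s[j] += v
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def classify(centroids, x):
    # predict the label whose centroid has the smallest squared distance to x
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: dist2(centroids[y]))

# Invented two-feature training data with classes known a priori
training = [([1.0, 0.2], "good response"), ([0.9, 0.1], "good response"),
            ([0.2, 1.1], "poor response"), ([0.1, 0.9], "poor response")]
model = train_centroids(training)
print(classify(model, [0.95, 0.15]))  # a new patient
```

Unlike clustering, the class labels are supplied up front; the only thing learned is how the features separate them.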
Example: DLBCL subtypes
Wright et al,
PNAS (2003)
DLBCL subtypes
Wright et al, PNAS (2003)
Classification approaches
• Wright et al PNAS (2003)
• Weighted features combined in a linear predictor score (LPS):
• LPS(X) = Σ_j a_j X_j
• a_j: weight of gene j, determined by its t-test statistic
• X_j: expression value of gene j
• Assume there are 2 distinct distributions of LPS:
one for ABC, one for GCB
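The linear predictor score is just a weighted sum over the selected genes. A minimal sketch (gene names, t-statistic weights, and expression values below are invented for illustration):

```python
# Linear predictor score: LPS(X) = sum_j a_j * X_j,
# with per-gene weights a_j taken from t statistics.
weights = {"BCL6": 2.1, "FOXP1": -1.7, "CCND2": -0.9}   # hypothetical t statistics
expression = {"BCL6": 1.4, "FOXP1": 0.3, "CCND2": 0.8}  # hypothetical expression values

lps = sum(weights[g] * expression[g] for g in weights)
print(round(lps, 2))
```

A sample's LPS collapses its whole expression profile to a single number, so the two subtypes can be modelled as two distributions on one axis.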
Wright et al, DLBCL, cont’d
• Use Bayes' rule to determine the probability that a sample
comes from group 1:
• P(group 1 | LPS) = φ(LPS; μ1, σ1) / [φ(LPS; μ1, σ1) + φ(LPS; μ2, σ2)]
• φ(·; μ1, σ1): probability density function that represents group 1
Learning the classifier, Wright et al
• Choosing the genes (feature selection):
• use cross validation
• Leave-one-out cross validation:
• Pick a set of samples
• Use all but one of the samples as training, leaving one out for testing
• Fit the model using the training data
• Can the classifier correctly pick the class of the remaining case?
• Repeat exhaustively, leaving out each sample in turn
• Repeat using different sets and numbers of genes based on the t-statistic
• Pick the set of genes that gives the highest accuracy
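The leave-one-out loop itself is short. A sketch with an invented one-dimensional data set and a deliberately simple per-class-mean classifier standing in for the real model:

```python
# Leave-one-out cross validation: each labelled sample in turn is held out,
# the classifier is fit on the rest, and accuracy is the fraction of
# held-out samples that are classified correctly.
def loocv_accuracy(samples, fit, predict):
    correct = 0
    for i, (x, y) in enumerate(samples):
        training = samples[:i] + samples[i + 1:]  # leave sample i out
        model = fit(training)
        if predict(model, x) == y:
            correct += 1
    return correct / len(samples)

# Hypothetical stand-in model: record the mean feature value per class,
# predict the class whose mean is nearest.
def fit(train):
    sums, n = {}, {}
    for x, y in train:
        sums[y] = sums.get(y, 0.0) + x
        n[y] = n.get(y, 0) + 1
    return {y: sums[y] / n[y] for y in sums}

def predict(model, x):
    return min(model, key=lambda y: abs(x - model[y]))

data = [(0.1, "GCB"), (0.3, "GCB"), (0.2, "GCB"),
        (1.9, "ABC"), (2.2, "ABC"), (2.0, "ABC")]
print(loocv_accuracy(data, fit, predict))
```

In the gene-selection setting, this whole loop would be rerun for each candidate gene set, and the set with the best cross-validated accuracy kept.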
Overfitting
• In many cases in biology, the number of features is much larger
than the number of samples
• Important features may not be represented in the training data
• This can result in overfitting:
• a classifier that discriminates well on its training data, but does not
generalise to independently derived data sets
• Validation in at least one external cohort is required before the
results can be believed
• example: the expression subtypes for breast cancer have been
repeatedly validated in numerous data sets
Overfitting
• To reduce the problem of overfitting, one can use
Bayesian priors to ‘regularize’ the parameter estimates
of the model
• Some methods now integrate feature selection and
classification in a unified analytical framework
• see Law et al IEEE (2005): Sparse Multinomial Logistic
Regression (SMLR): http://www.cs.duke.edu/~amink/software/smlr/
• Cross validation should always be used in training a
classifier
Evaluating a classifier
• The receiver operating characteristic (ROC) curve
• plots the true positive rate (TPR) vs. the false positive rate (FPR)
• Given ground truth and a probabilistic classifier:
– for some number of probability thresholds
– compute the TPR
– the proportion of true positives that were predicted positive
– compute the FPR
– the proportion of true negatives that were predicted positive
Evaluating a classifier
• Important terms:
• Prediction: classifier says the object is a ‘hit’
• Rejection: classifier says the object is a ‘miss’
• True Positive (TP): prediction that is a true hit
• True Negative (TN): rejection that is a true miss
• False Positive (FP): prediction that is a true miss
• False Negative (FN): rejection that is a true hit
• False Positive Rate: FPR = FP/(FP+TN)
• specificity: 1-FPR
• True Positive Rate: TPR=TP/(TP+FN)
• sensitivity: TPR
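The counts and rates above can be tallied directly from paired labels. A sketch with invented ground truth and predictions for a deterministic classifier:

```python
# Confusion-matrix counts and the rates defined above.
truth = [1, 1, 1, 0, 0, 0, 0, 1]   # invented ground truth (1 = hit)
pred  = [1, 1, 0, 0, 1, 0, 0, 1]   # invented classifier output

tp = sum(t == 1 and p == 1 for t, p in zip(truth, pred))
tn = sum(t == 0 and p == 0 for t, p in zip(truth, pred))
fp = sum(t == 0 and p == 1 for t, p in zip(truth, pred))
fn = sum(t == 1 and p == 0 for t, p in zip(truth, pred))

tpr = tp / (tp + fn)   # sensitivity
fpr = fp / (fp + tn)   # 1 - specificity
print(tp, tn, fp, fn, tpr, fpr)
```

Note that the two rates have different denominators: TPR is normalised by the true hits, FPR by the true misses.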
Evaluating a classifier
• Use Area under the ROC curve as a
single measure
• encodes the trade-off between FPR and
TPR
• only possible for probabilistic outputs
• requires ground truth and probabilities as
inputs
• at a fixed number of ordered probability
thresholds, calculate FPR and TPR and
plot
• for deterministic methods, it is possible to
calculate a point in the FPR:TPR space
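The threshold sweep and area computation can be sketched as follows; the probability scores and labels are invented, and the area is integrated with the trapezoid rule:

```python
# ROC construction: sweep a threshold over the predicted probabilities,
# collect (FPR, TPR) points, then integrate the curve for the AUC.
truth  = [1, 1, 1, 0, 1, 0, 0, 0]                  # invented ground truth
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]  # invented probabilities

def roc_points(truth, scores):
    pos = sum(truth)
    neg = len(truth) - pos
    points = []
    # ordered thresholds, highest first, so FPR increases along the curve
    for thr in sorted(set(scores), reverse=True):
        tp = sum(s >= thr and t == 1 for t, s in zip(truth, scores))
        fp = sum(s >= thr and t == 0 for t, s in zip(truth, scores))
        points.append((fp / neg, tp / pos))
    return [(0.0, 0.0)] + points

pts = roc_points(truth, scores)
# trapezoid rule over consecutive (FPR, TPR) points
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
print(round(auc, 3))
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation; a deterministic classifier contributes only the single (FPR, TPR) point mentioned above.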
Evaluating a classifier
• ROC curves are useful for comparing classifiers
[Figure: ROC curves using depth thresholds of 0–7 and 10; breast cancer data
using Affymetrix SNP 6.0 genotypes as truth]
All you need to know about ROC analysis
http://www.hpl.hp.com/techreports/2003/HPL-2003-4.pdf
Tutorial by Tom Fawcett at HP (2003)
Practical example: detecting single
nucleotide variants from next-generation
sequencing data
Aligning billions of short reads to the genome
• MAQ, BWA, SOAP, SHRiMP, Mosaik, Bowtie, RMAP, ELAND
• Chopping, hashing and indexing the genome + string matching with
mismatch tolerance
aattcaggaccca-----------------------------
aattcaggacccacacga------------------------
aattcaggacccacacgacgggaagacaa-------------
-attcaggacaaacacgaagggaagacaagttcatgtacttt
----caggacccacacgacgggtagacaagttcatgtacttt
--------acccacacgacgggtagacaagttcatgtacttt
--------acccacacgacgggtagacaagttcatgtacttt
----------------gacgggaagacaagttcatgtacttt
---------------------------------atgtacttt

aattcaggaccaacacgacgggaagacaagttcatgtacttt   (Reference sequence)
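The "chopping, hashing and indexing plus string matching with mismatch tolerance" idea can be sketched in miniature. This is not how MAQ, BWA, or the other aligners are actually implemented; it is a toy seed-and-verify scheme over a short invented reference:

```python
# Toy read alignment: index every k-mer of the reference in a hash table,
# look up a read's first k-mer as an exact-match seed, then verify the
# full read against the reference allowing a few mismatches.
def index_kmers(reference, k):
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    return index

def align(read, reference, index, k, max_mismatches):
    hits = []
    for start in index.get(read[:k], []):      # seed with the first k-mer
        window = reference[start:start + len(read)]
        if len(window) < len(read):
            continue                           # read runs off the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits

ref = "aattcaggaccaacacgacgggaagacaagttcatgtacttt"
idx = index_kmers(ref, 8)
# read with one mismatch relative to the reference at position 4
print(align("caggaccaacaa", ref, idx, 8, max_mismatches=1))
```

Real aligners add many refinements (multiple seeds per read, quality-aware scoring, compressed indexes), but the hash-then-verify structure is the same.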
SNVMix1: Modeling allelic counts
• a_i ∈ {0, 1, ..., N_i}: the number of reference-allele reads at position i
• N_i ∈ {0, 1, ...}: the total number of reads at position i
• G_i = k, k ∈ {aa, ab, bb}: the genotype at position i
• μ_k ~ Beta(α_k, β_k): the parameter of the genotype-specific
Binomial distribution
• π ~ Dir(δ): the prior over genotypes
[Graphical model: plates over genotypes k and positions i;
δ → π → G_i → a_i; α_k, β_k → μ_k → a_i; N_i → a_i]
Querying SNVMix1
• Given the model parameters, what is the probability that
genotype k gave rise to the observed data at each
position?
p(G_i = k | a_i, N_i, μ_{1:K}, π) =
π_k Binom(a_i | μ_k, N_i) / Σ_{j=1}^{K} π_j Binom(a_i | μ_j, N_i)

SNVMix1 is a mixture of Binomial distributions with component weights π_k,
one component per genotype (aa, ab, bb)
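The genotype posterior is a standard mixture responsibility. A sketch with invented parameter values (μ_k is the genotype-specific probability that a read shows the reference allele; the prior π is also made up):

```python
# SNVMix1-style genotype posterior: a mixture of Binomials over aa/ab/bb.
from math import comb

def binom_pmf(a, n, mu):
    # probability of a reference reads out of n, given parameter mu
    return comb(n, a) * mu ** a * (1 - mu) ** (n - a)

pi = {"aa": 0.85, "ab": 0.10, "bb": 0.05}   # hypothetical genotype prior
mu = {"aa": 0.99, "ab": 0.50, "bb": 0.01}   # hypothetical Binomial parameters

def genotype_posterior(a_i, n_i):
    lik = {k: pi[k] * binom_pmf(a_i, n_i, mu[k]) for k in pi}
    total = sum(lik.values())
    return {k: v / total for k, v in lik.items()}

# 12 of 25 reads show the reference allele: about half, suggesting a het
post = genotype_posterior(a_i=12, n_i=25)
print(max(post, key=post.get))
```

Because the posterior weighs all three genotypes at once, a position with roughly half reference reads is called heterozygous even at modest depth, without any hard read-count cutoff.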
SNVMix1 obviates the need for depth-based thresholding
[Figure, left: output of SNVMix1 on simulated data with increasing depth.
Right: ROC curves using depth thresholds of 0–7 and 10, breast cancer data
using Affymetrix SNP 6.0 genotypes as truth]
Learning parameters by model fitting: cancer genomes
• Li et al. (2008), MAQ, and Li et al. (2009), SOAP
• Use parameters of the Binomial assuming normal diploid
genomes
• Cancer genomes:
• often not diploid
• mixed in with normal cells
• exhibit intra-tumoral heterogeneity
• Need to fit the model to data in order to learn more
representative parameters
Fitting the model to data using EM
• Recall: (μ, π) are unobserved
• Use Expectation Maximization to fit the model to data
• E-step: compute the genotype responsibilities
p(G_i = k | a_i, N_i, μ_{1:K}, π) =
π_k Binom(a_i | μ_k, N_i) / Σ_{j=1}^{K} π_j Binom(a_i | μ_j, N_i)
• M-step: update the parameters
π_k^new = [Σ_{i=1}^{T} I(G_i = k) + δ(k)] / [Σ_{i=1}^{T} Σ_j I(G_i = j) + δ(j)]
μ_k^new = [Σ_{i=1}^{T} a_i I(G_i = k) + α_k − 1] /
[Σ_{i=1}^{T} N_i I(G_i = k) + α_k + β_k − 2]
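The EM loop can be sketched as below. This sketch uses soft responsibilities in place of the hard indicators I(G_i = k), and the allelic counts and Beta/Dirichlet pseudocounts are invented for illustration:

```python
# EM for a Binomial mixture over genotypes, loosely following the
# E-step/M-step updates above (soft responsibilities, MAP updates).
from math import comb

def binom_pmf(a, n, mu):
    return comb(n, a) * mu ** a * (1 - mu) ** (n - a)

def em_fit(counts, mu, pi, alpha, beta, delta, iters=50):
    # counts: list of (a_i, N_i); mu, pi: per-genotype parameter lists
    K = len(mu)
    for _ in range(K and iters):
        # E-step: responsibility of each genotype for each position
        resp = []
        for a, n in counts:
            w = [pi[k] * binom_pmf(a, n, mu[k]) for k in range(K)]
            s = sum(w)
            resp.append([v / s for v in w])
        # M-step: update pi with Dirichlet pseudocounts delta
        nk = [sum(r[k] for r in resp) for k in range(K)]
        tot = sum(nk[k] + delta[k] for k in range(K))
        pi = [(nk[k] + delta[k]) / tot for k in range(K)]
        # M-step: update mu with Beta pseudocounts alpha, beta
        mu = [(sum(r[k] * a for r, (a, n) in zip(resp, counts)) + alpha[k] - 1)
              / (sum(r[k] * n for r, (a, n) in zip(resp, counts)) + alpha[k] + beta[k] - 2)
              for k in range(K)]
    return mu, pi

# Invented allelic counts: three homozygous-reference-like positions,
# three heterozygous-like positions, no homozygous-variant positions.
data = [(20, 20), (19, 20), (20, 20), (10, 20), (11, 20), (9, 20)]
mu, pi = em_fit(data, mu=[0.9, 0.5, 0.1], pi=[1 / 3, 1 / 3, 1 / 3],
                alpha=[10, 5, 1], beta=[1, 5, 10], delta=[1, 1, 1])
print([round(m, 2) for m in mu])
```

Fitting pulls μ for the aa and ab components toward the observed allelic ratios, which is exactly the adaptation needed for non-diploid, impure tumour samples.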
Fitting the model confers increased accuracy
• 16 ovarian transcriptomes
• 144,271 coding positions with matched
Affy SNP 6.0 data
• 10 repeats of 4-fold cross-validation to estimate parameters:
• Run EM on ¾ of the data to estimate parameters
• Compute p(G_i = k | a_i, N_i, μ_{1:K}, π) on the remaining positions
[Figure: accuracy comparison, p < 0.00001]
Other methods for classification
• Support vector machines
• Linear discriminant analysis
• Logistic regression
• Random forests
• See:
• Ma and Huang Briefings in Bioinformatics (2008)
• Saeys et al Bioinformatics (2007)
We are on a
Coffee Break &
Networking Session