Transcript Module 5 - Bioinformatics.ca

Canadian Bioinformatics Workshops
www.bioinformatics.ca
Module 5
Classification
Aline Tabet





 k k
Gi

Ni
k
ai

Genotype k



Position i
Classification
• What is classification?
– Supervised learning
– Also called discriminant analysis
• Work from a set of objects with predefined classes
– e.g., basal vs. luminal, or good responder vs. poor responder
• Task: learn from the features of the objects what the basis for discrimination is
• Statistically and mathematically heavy
Classification
[Figure: training samples labelled "poor response" and "good response" are used to learn a classifier, which is then applied to a new patient: what is the most likely response?]
Classification
• Input: a set of measures, variables, or features
• Output: a discrete label for what the set of features most closely resembles
– Classification can be probabilistic or deterministic
• How is classification different from clustering?
– We know the groups or classes a priori
– Classification is a prediction on ‘what’ an object is, not what
other objects it most closely resembles
• Clustering is finding patterns in data
• Classification is using known patterns to predict an object type
Example: DLBCL subtypes
Wright et al,
PNAS (2003)
DLBCL subtypes
Wright et al, PNAS (2003)
Classification approaches
• Wright et al PNAS (2003)
• Weighted features in a linear predictor score: LPS(X) = Σ_j a_j X_j
• a_j: weight of gene j, determined by a t-test statistic
• X_j: expression value of gene j
• Assume there are 2 distinct distributions of LPS: 1 for ABC, 1 for GCB
Wright et al, DLBCL, cont’d
• Use Bayes' rule to determine the probability that a sample comes from group 1:
p(group 1 | LPS(X)) = φ_1(LPS(X)) / [ φ_1(LPS(X)) + φ_2(LPS(X)) ]
• φ_1: the probability density function of LPS that represents group 1 (φ_2 represents group 2)
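As a concrete illustration, here is a minimal Python sketch of an LPS-style classifier in the spirit of Wright et al: gene weights come from t-statistics, the LPS is a weighted sum of expression values, and Bayes' rule with one density per group turns an LPS into a probability. The toy data, the choice of normal densities, and all parameter values are illustrative assumptions, not the published model.

# Minimal sketch of a linear predictor score (LPS) classifier.
# Toy data and normal-density assumption are illustrative, not Wright et al.'s fit.
import numpy as np
from scipy.stats import norm, ttest_ind

def lps(X, weights):
    """Linear predictor score: LPS(x) = sum_j a_j * x_j for each sample."""
    return X @ weights

# toy data (hypothetical): 10 ABC samples, 10 GCB samples, 50 genes
rng = np.random.default_rng(0)
X_abc = rng.normal(0.5, 1.0, size=(10, 50))
X_gcb = rng.normal(-0.5, 1.0, size=(10, 50))

# gene weights a_j from a two-sample t-test on the training data
weights = ttest_ind(X_abc, X_gcb, axis=0).statistic

# fit one distribution of LPS per group (here: a normal density)
lps_abc, lps_gcb = lps(X_abc, weights), lps(X_gcb, weights)
mu1, sd1 = lps_abc.mean(), lps_abc.std(ddof=1)
mu2, sd2 = lps_gcb.mean(), lps_gcb.std(ddof=1)

def p_group1(x_new):
    """Bayes' rule with equal priors: phi1 / (phi1 + phi2) evaluated at LPS(x_new)."""
    s = lps(x_new, weights)
    phi1, phi2 = norm.pdf(s, mu1, sd1), norm.pdf(s, mu2, sd2)
    return phi1 / (phi1 + phi2)

print(p_group1(rng.normal(0.5, 1.0, size=50)))  # probability the new sample is group 1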
Learning the classifier, Wright et al
• Choosing the genes (feature selection):
– use cross validation
– Leave one out cross validation
• Pick a set of samples
• Use all but one of the samples as training, leaving one out for testing
• Fit the model using the training data
• Can the classifier correctly pick the class of the remaining case?
• Repeat exhaustively, leaving out each sample in turn
– Repeat using different sets and numbers of genes based on t-statistic
– Pick the set of genes that give the highest accuracy
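To make the procedure concrete, below is a minimal, self-contained Python sketch of leave-one-out cross-validation over different numbers of genes. The toy data, the nearest-group-mean classification rule, and the candidate gene-set sizes are illustrative assumptions, not Wright et al.'s exact scheme.

# Leave-one-out cross-validation for choosing the number of genes (illustrative).
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 200))          # 20 samples x 200 genes (toy data)
y = np.array([1] * 10 + [0] * 10)       # 1 = class A, 0 = class B (hypothetical labels)
X[y == 1, :10] += 1.0                   # make the first 10 genes informative

def loocv_accuracy(X, y, n_genes):
    correct = 0
    for test in range(len(y)):
        train = np.arange(len(y)) != test
        Xtr, ytr = X[train], y[train]
        # select genes and weights on the training data only (avoids information leak)
        t = ttest_ind(Xtr[ytr == 1], Xtr[ytr == 0], axis=0).statistic
        top = np.argsort(-np.abs(t))[:n_genes]
        w = t[top]
        s_tr, s_te = Xtr[:, top] @ w, X[test, top] @ w
        # classify the held-out sample by the nearer group mean of the LPS
        pred = int(abs(s_te - s_tr[ytr == 1].mean()) < abs(s_te - s_tr[ytr == 0].mean()))
        correct += pred == y[test]
    return correct / len(y)

# repeat for different numbers of genes and keep the best-performing set size
for n in (5, 10, 25, 50):
    print(n, loocv_accuracy(X, y, n))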
Overfitting
• In many cases in biology, the number of features is much larger than
the number of samples
• Important features may not be represented in the training data
• This can result in overfitting
– when a classifier discriminates well on its training data, but does not
generalise to orthogonally derived data sets
• Validation is required in at least one external cohort to believe the
results
• Example: the expression subtypes for breast cancer have been repeatedly validated in numerous data sets
Overfitting
• To reduce the problem of overfitting, one can use
Bayesian priors to ‘regularize’ the parameter estimates of
the model
• Some methods now integrate feature selection and
classification in a unified analytical framework
– see Law et al IEEE (2005): Sparse Multinomial Logistic Regression
(SMLR): http://www.cs.duke.edu/~amink/software/smlr/
• Cross validation should always be used in training a
classifier
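As an illustration of sparsity-inducing regularization (analogous in spirit to, though not the same method as, SMLR), here is a sketch using L1-penalized logistic regression from scikit-learn on hypothetical data; the penalty strength C and the simulated labels are assumptions.

# Sparsity-inducing regularization when features greatly outnumber samples (illustrative).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 500))          # 60 samples, 500 genes: features >> samples
y = (X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=60) > 0).astype(int)

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)

# cross-validation should always be used when training a classifier
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())

clf.fit(X, y)
print("genes kept:", int(np.sum(clf.coef_ != 0)))  # the L1 penalty zeroes out most weights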
Evaluating a classifier
• The receiver operating characteristic (ROC) curve
– plots the true positive rate (TPR) vs the false positive rate (FPR)
• Given ground truth and a probabilistic classifier:
– for some number of probability thresholds
– compute the TPR: the proportion of positives that were predicted as positive
– compute the FPR: the proportion of negatives that were falsely predicted as positive
Evaluating a classifier
• Important terms:
– Prediction: classifier says the object is a ‘hit’
– Rejection: classifier says the object is a ‘miss’
• True Positive (TP): a prediction that is a true hit
• True Negative (TN): a rejection that is a true miss
• False Positive (FP): a prediction that is a true miss
• False Negative (FN): a rejection that is a true hit
• False Positive Rate: FPR = FP / (FP + TN)
– specificity = 1 − FPR
• True Positive Rate: TPR = TP / (TP + FN)
– sensitivity = TPR
Evaluating a classifier
• Use Area under the ROC curve as a single
measure
– encodes the trade-off between FPR and TPR
– only possible for probabilistic outputs
– requires ground truth and probabilities as
inputs
– at a fixed number of ordered probability
thresholds, calculate FPR and TPR and plot
– for deterministic methods, it is possible to
calculate a point in the FPR:TPR space
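The procedure above can be written down directly: the sketch below sweeps probability thresholds, computes TPR and FPR at each, and integrates the resulting curve by the trapezoid rule. The simulated labels and probabilities are illustrative.

# ROC construction and AUC from thresholded probabilities (illustrative data).
import numpy as np

def roc_curve_points(y_true, p_hat, thresholds):
    tpr, fpr = [], []
    for t in thresholds:
        pred = p_hat >= t
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        tn = np.sum(~pred & (y_true == 0))
        tpr.append(tp / (tp + fn))   # sensitivity
        fpr.append(fp / (fp + tn))   # 1 - specificity
    return np.array(fpr), np.array(tpr)

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=200)
# a probabilistic classifier that is informative but noisy
p_hat = np.clip(0.5 * y_true + rng.normal(0.3, 0.25, size=200), 0, 1)

# sweep thresholds from 1 down to 0 so FPR and TPR increase along the curve
fpr, tpr = roc_curve_points(y_true, p_hat, thresholds=np.linspace(1, 0, 101))
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)   # trapezoid rule
print(f"AUC = {auc:.3f}")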
Evaluating a classifier
• ROC curves are useful for comparing classifiers
[Figure: ROC curves using depth thresholds of 0-7, 10; breast cancer data using Affy SNP 6.0 genotypes as truth.]
All you need to know about ROC analysis
http://www.hpl.hp.com/techreports/2003/HPL-2003-4.pdf
Tutorial by Tom Fawcett at HP (2003)
Practical example: detecting single nucleotide variants from next-gen sequencing data
Aligning billions of short reads to the genome
• MAQ, BWA, SOAP, SHRiMP, Mosaik, BowTie, Rmap, ELAND
• Chopping, hashing, and indexing the genome + string matching with mismatch tolerance
[Figure: a pileup of short reads aligned to the reference sequence aattcaggaccaacacgacgggaagacaagttcatgtacttt; mismatched bases in the aligned reads (e.g., gggtagacaag vs. the reference gggaagacaag) indicate a candidate single nucleotide variant.]
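As a toy illustration of the "chop, hash, and index + mismatch-tolerant matching" idea (and not how MAQ, BWA, or the other aligners are actually implemented), the sketch below builds a k-mer index of the reference sequence from the figure and verifies seed hits while allowing mismatches.

# Toy seed-and-verify read alignment with a k-mer hash index of the reference.
from collections import defaultdict

REF = "aattcaggaccaacacgacgggaagacaagttcatgtacttt"   # reference sequence from the slide
K = 8                                                 # seed (k-mer) length

# index every k-mer of the reference by its start position
index = defaultdict(list)
for pos in range(len(REF) - K + 1):
    index[REF[pos:pos + K]].append(pos)

def align(read, max_mismatches=2):
    """Seed with an exact k-mer hit, then verify the full read allowing mismatches."""
    hits = []
    for offset in range(len(read) - K + 1):
        for pos in index.get(read[offset:offset + K], []):
            start = pos - offset
            if start < 0 or start + len(read) > len(REF):
                continue
            mismatches = sum(r != g for r, g in zip(read, REF[start:start + len(read)]))
            if mismatches <= max_mismatches:
                hits.append((start, mismatches))
    return sorted(set(hits))

# a read carrying one mismatch relative to the reference (a candidate SNV)
print(align("caggaccaacacgacgggtagacaag"))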
SNVMix1: Modeling allelic counts
[Figure: graphical model of SNVMix1, with genotype G_i, reference-allele count a_i, and depth N_i at each position i, genotype-specific Binomial parameter μ_k, and genotype prior π.]
• a_i ∈ {0, 1, ..., N_i}: the number of reference reads at position i
• N_i ∈ {0, 1, ...}: the total number of reads at position i
• G_i = k, k ∈ {aa, ab, bb}: the genotype at position i
• μ_k: the parameter of the genotype-specific Binomial distribution
• π: the prior over genotypes
• Priors: μ_k ~ Beta(α_k, β_k), π ~ Dir(δ)
Querying SNVMix1
• Given the model parameters, what is the probability
that genotype k gave rise to the observed data at each
position?
p(G_i = k | a_i, N_i, μ_1:K, π) = π_k Binom(a_i | μ_k, N_i) / Σ_{j=1}^{K} π_j Binom(a_i | μ_j, N_i)
SNVMix1 is a mixture of Binomial distributions with component weights π.
[Figure: the three genotype components aa, ab, and bb.]
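A minimal sketch of the genotype posterior above, treating the model as a mixture of Binomials over {aa, ab, bb}; the values of μ and π below are illustrative assumptions rather than fitted SNVMix1 parameters.

# Genotype posterior under a Binomial mixture (illustrative parameter values).
import numpy as np
from scipy.stats import binom

mu = np.array([0.99, 0.5, 0.01])   # P(reference read) for genotypes aa, ab, bb
pi = np.array([0.85, 0.10, 0.05])  # prior over genotypes

def genotype_posterior(a_i, N_i):
    """p(G_i = k | a_i, N_i, mu, pi) for k in {aa, ab, bb}."""
    like = pi * binom.pmf(a_i, N_i, mu)
    return like / like.sum()

# e.g. 12 reference reads out of 25 at a position: strongly favours 'ab'
print(genotype_posterior(12, 25))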
SNVMix1 obviates the need for depth-based thresholding
[Figure: output of SNVMix1 on simulated data with increasing depth.]
[Figure: ROC curves using depth thresholds of 0-7, 10; breast cancer data using Affy SNP 6.0 genotypes as truth.]
Learning parameters by model fitting is important for cancer genomes
• Li et al (2008) – MAQ and Li et al (2009) – SOAP
– Use parameters of the Binomial assuming normal diploid genomes
• Cancer genomes:
– often not diploid
– mixed in with normal cells
– exhibit intra-tumoral heterogeneity
• Need to fit the model to data in order to learn more
representative parameters
Fitting the model to data using EM
• Recall: (μ, π) are unobserved
• Use Expectation Maximization (EM) to fit the model to data
• E-step:
p(G_i = k | a_i, N_i, μ_1:K, π) = π_k Binom(a_i | μ_k, N_i) / Σ_{j=1}^{K} π_j Binom(a_i | μ_j, N_i)
• M-step:
π_k^new = [ Σ_{i=1}^{T} I(G_i = k) + δ(k) ] / Σ_j [ Σ_{i=1}^{T} I(G_i = j) + δ(j) ]
μ_k^new = [ Σ_{i=1}^{T} a_i I(G_i = k) + α_k − 1 ] / [ Σ_{i=1}^{T} N_i I(G_i = k) + α_k + β_k − 2 ]
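Putting the E- and M-steps together, here is a minimal EM sketch for a Binomial mixture in the spirit of SNVMix1, with MAP-style updates under Beta priors on μ_k and a Dirichlet prior on π; the simulated pileup and all hyperparameter values are illustrative assumptions, not the published SNVMix1 settings.

# EM for a Binomial mixture over genotypes {aa, ab, bb} (illustrative sketch).
import numpy as np
from scipy.stats import binom

def em_binomial_mixture(a, N, alpha, beta, delta, n_iter=50):
    K = len(alpha)
    mu = np.linspace(0.99, 0.01, K)      # initial reference-allele rates (aa, ab, bb)
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: responsibilities gamma[i, k] = p(G_i = k | a_i, N_i, mu, pi)
        like = np.stack([pi[k] * binom.pmf(a, N, mu[k]) for k in range(K)], axis=1)
        gamma = like / like.sum(axis=1, keepdims=True)
        # M-step: updates with Beta(alpha_k, beta_k) and Dirichlet(delta) priors
        Nk = gamma.sum(axis=0)
        pi = (Nk + delta) / (Nk + delta).sum()
        mu = (gamma.T @ a + alpha - 1) / (gamma.T @ N + alpha + beta - 2)
    return mu, pi, gamma

# toy pileup: reference read counts a_i out of total depth N_i at 500 positions
rng = np.random.default_rng(2)
N = rng.integers(5, 60, size=500)
true_mu = np.array([0.99, 0.5, 0.01])
G = rng.choice(3, size=500, p=[0.8, 0.15, 0.05])
a = rng.binomial(N, true_mu[G])

mu, pi, gamma = em_binomial_mixture(a, N,
                                    alpha=np.array([1000.0, 500.0, 1.0]),
                                    beta=np.array([1.0, 500.0, 1000.0]),
                                    delta=np.array([10.0, 1.0, 1.0]))
print(mu, pi)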
Fitting the model confers increased accuracy
• 16 ovarian transcriptomes
• 144,271 coding positions with matched
Affy SNP 6.0 data
• 10 repeats of 4-fold cross-validation to estimate parameters
– Run EM on ¾ of the data to estimate parameters
– Compute p(G_i = k | a_i, N_i, μ_1:K, π) on the remaining positions
[Figure: accuracy comparison; p < 0.00001.]
Other methods for classification
• Support vector machines
• Linear discriminant analysis
• Logistic regression
• Random forests
• See:
– Ma and Huang, Briefings in Bioinformatics (2008)
– Saeys et al, Bioinformatics (2007)
We are on a Coffee Break & Networking Session
Lab 5: Classification