psb_2005_new_layout_v1 - BC Bioinformatics


A Novel Approach To Diploid Base Calling
Aaron R. Quinlan and Gabor T. Marth,
Department of Biology, Boston College, Chestnut Hill, MA 02467
http://bioinformatics.bc.edu/marthlab/
We are integrating diploid base calling (heterozygote detection) into our SNP detection method, which detects SNPs across clonal reads based on base composition and quality.
Objective: To enhance our SNP detection method with an accurate diploid base calling algorithm.

[Figure: Base call/quality. A SNP (A/G) is found across multiple clonal reads. Inputs: base composition, depth of alignment, polymorphism rate.]

Probability of polymorphism:

$$
P(\mathrm{SNP}) = \sum_{\text{all variable } S}
\frac{\dfrac{P(S_1 \mid R_1)}{P_{Prior}(S_1)} \cdots \dfrac{P(S_N \mid R_N)}{P_{Prior}(S_N)} \; P_{Prior}(S_1, \ldots, S_N)}
{\displaystyle \sum_{S_{i_1} \in [A,C,G,T]} \cdots \sum_{S_{i_N} \in [A,C,G,T]} \dfrac{P(S_{i_1} \mid R_1)}{P_{Prior}(S_{i_1})} \cdots \dfrac{P(S_{i_N} \mid R_N)}{P_{Prior}(S_{i_N})} \; P_{Prior}(S_{i_1}, \ldots, S_{i_N})}
$$

[Figure: PCR-based sequences of diploid individuals at a C/T site. Each possible diploid base call (AA, CC, GG, TT, AC, AG, AT, CG, CT, GT) receives a probability. Example values: P(CC|R) = .9995, P(CT|R) = .96, P(TT|R) = .9991, P(CT|R) = .003. Inputs: depth of alignment.]

Observed diploid variations/probabilities: Calls = (CC, CT, TT).

Prior probability of each diploid genotype: P(CC) = .045, P(CT) = .9, P(TT) = .045, P(Others) = .01.

Probability of each genotype from a diploid base call:

$$
\Pr(S_{11}, S_{12}, \ldots, S_{n1}, S_{n2} \mid R_1, \ldots, R_n) =
\frac{\dfrac{\Pr(S_{11}, S_{12} \mid R_1)}{\Pr_{prior}(S_{11}, S_{12})} \cdots \dfrac{\Pr(S_{n1}, S_{n2} \mid R_n)}{\Pr_{prior}(S_{n1}, S_{n2})} \; \Pr_{prior}(S_{11}, S_{12}, \ldots, S_{n1}, S_{n2})}
{\displaystyle \sum_{\text{every } S_{i_1 1}, S_{i_1 2}, \ldots, S_{i_n 1}, S_{i_n 2}} \dfrac{\Pr(S_{i_1 1}, S_{i_1 2} \mid R_1)}{\Pr_{prior}(S_{i_1 1}, S_{i_1 2})} \cdots \dfrac{\Pr(S_{i_n 1}, S_{i_n 2} \mid R_n)}{\Pr_{prior}(S_{i_n 1}, S_{i_n 2})} \; \Pr_{prior}(S_{i_1 1}, S_{i_1 2}, \ldots, S_{i_n 1}, S_{i_n 2})}
$$
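The genotype probability formula on this panel can be sketched in a few lines of Python. This is a minimal illustration, not the poster's implementation: the helper names (`spread`, `independent_prior`, `joint_genotype_posterior`) are hypothetical, the even split of P(Others) across the seven remaining genotypes is an assumption, and the joint prior is assumed to be a simple product of per-read priors, which the poster does not specify.

```python
from itertools import product

GENOTYPES = ["AA", "CC", "GG", "TT", "AC", "AG", "AT", "CG", "CT", "GT"]


def spread(top, p_top):
    """Hypothetical helper: give `top` probability p_top and split the
    remaining mass evenly over the other nine genotypes."""
    rest = (1.0 - p_top) / (len(GENOTYPES) - 1)
    return {g: (p_top if g == top else rest) for g in GENOTYPES}


def independent_prior(single_prior):
    """Assumed joint prior: the product of the per-read genotype priors."""
    def joint(combo):
        p = 1.0
        for g in combo:
            p *= single_prior[g]
        return p
    return joint


def joint_genotype_posterior(read_posteriors, single_prior, joint_prior):
    """Weight every genotype combination by
    prod_k Pr(S_k | R_k) / Pr_prior(S_k) * Pr_prior(S_1, ..., S_n),
    then normalise over all combinations (the denominator of the formula)."""
    weights = {}
    for combo in product(GENOTYPES, repeat=len(read_posteriors)):
        w = joint_prior(combo)
        for post, g in zip(read_posteriors, combo):
            w *= post[g] / single_prior[g]
        weights[combo] = w
    total = sum(weights.values())
    return {combo: w / total for combo, w in weights.items()}


# Priors as shown on the poster; the even split of P(Others) is an assumption.
single_prior = {g: 0.01 / 7 for g in GENOTYPES}
single_prior.update({"CC": 0.045, "CT": 0.9, "TT": 0.045})

# Three reads whose top calls use example values from the poster.
reads = [spread("CC", 0.9995), spread("CT", 0.96), spread("TT", 0.9991)]
posterior = joint_genotype_posterior(
    reads, single_prior, independent_prior(single_prior)
)
best = max(posterior, key=posterior.get)
print(best, posterior[best])
```

With an independent joint prior the posterior simply factorises into the per-read calls. Passing a different `joint_prior` callable, for example one that encodes a polymorphism rate, would couple the reads, which is where the formula does real work.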
Method for Diploid Base Calling (Support Vector Machine - based)
1. Collect heterozygous and homozygous training examples.
2. Calculate indicative features that separate heterozygotes from homozygotes.
3. SVMs learn a function to distinguish between positive and negative examples based on the statistics of the features in the training examples.
4. The trained SVM can separate unseen homozygotes and heterozygotes in “unseen” alignments. The sign of the SVM score gives the class: + is Het, - is Hom.
5. Convert the SVM score to P(Het).
6. Make diploid base calls on unseen alignments.

P(S1S2|R) = probability of the allelic combination given the read.
Example calls for SNP 1, SNP 2, …, SNP N: P(CT|R) = .34, P(CT|R) = .01, P(AC|R) = .999, P(AT|R) = .001.

[Figure: P(Het), from 0 to 1, plotted against SVM score.]
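The poster does not say how the SVM score is converted to P(Het). One common choice is a logistic (Platt-style) mapping, sketched below. The function names, the `slope` and `offset` parameters, and the 0.5 calling threshold are all assumptions for illustration; in practice the two parameters would be fitted on held-out training examples.

```python
import math


def svm_score_to_p_het(score, slope=1.0, offset=0.0):
    """Map a signed SVM score to a probability of heterozygosity with a
    logistic curve. slope and offset are hypothetical fitted parameters."""
    z = slope * score + offset
    # Numerically stable logistic: avoid exp() overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def call_zygosity(p_het, threshold=0.5):
    """Assumed decision rule: call Het at or above the threshold, else Hom."""
    return "Het" if p_het >= threshold else "Hom"


for score in (-2.0, 0.0, 2.0):
    p = svm_score_to_p_het(score)
    print(score, p, call_zygosity(p))
```

A positive score maps above 0.5 and a negative score below it, matching the "+ is Het, - is Hom" convention.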
Utilizing multiple reads per individual, we can make an individual genotype call.
Per-read calls (forward read and reverse read): P(GT | Read) = .98 and P(GT | Read) = .87
Prior(GT Frequency) = .34
Individual genotype call: P(GT) = .993
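A minimal sketch of how per-read calls might be combined into one individual genotype call, assuming the reads are conditionally independent given the genotype. The function name `combine_reads` and the two-class simplification ("GT" versus a lumped "other" class) are assumptions. Because the poster does not give the full per-read distributions, this sketch is not expected to reproduce the poster's value of .993 exactly.

```python
def combine_reads(read_posteriors, prior):
    """Combine per-read genotype posteriors for one individual.

    Each genotype is weighted by prior * prod_k (P(g | read_k) / prior),
    then the weights are normalised to sum to one."""
    weights = {}
    for genotype, p_prior in prior.items():
        w = p_prior
        for post in read_posteriors:
            w *= post[genotype] / p_prior
        weights[genotype] = w
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


# Values for GT are from the poster; lumping everything else into a single
# "other" class is an assumption made to keep the example small.
prior = {"GT": 0.34, "other": 0.66}
reads = [
    {"GT": 0.98, "other": 0.02},
    {"GT": 0.87, "other": 0.13},
]
call = combine_reads(reads, prior)
print(call)
```

Two reads that agree produce a combined call more confident than either read alone, which is the behaviour the poster describes.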
Assessing the Genotyping Accuracy of the Initial Prototype
Data:
Number of Alignments Analyzed:  993
Total Number of Read Positions: 231874
Total Number of Heterozygotes:  31411
Total Number of Homozygotes:    143370

Note: Polyphred was tested on alignments created by PolyBayes. This allowed Polyphred to analyze a larger fraction of reads, as compared to Phrap alignments.
Polyphred 5 was tested with the following settings: quality = 21, score = 99, source, ref_comp.

[Figure: Accuracy by P(Het) Score. True positive rate (0 to 1) for each P(Het) range.]
[Figure: Number of Calls within Posterior Probability Range. Number of calls made for each P(Het) range.]
[Figure: Sensitivity. Fraction of callable hets found (0 to 1) for each P(Het) limit.]

Summary:
1. We built a diploid base calling prototype from the ground up. The initial prototype’s performance is similar to Polyphred 5.
2. We are currently compiling a larger example set to improve accuracy.
Rationale: The accuracy of the
consensus diploid base call for an
individual increases with the
number of reads available
for that individual.
3. Our method incorporates information from multiple reads for a given individual in a statistically rigorous fashion.
4. This prototype represents the first major expansion of our SNP detection method.
5. We are currently working to expand the prototype to a production-ready application.
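The accuracy and call-count charts in the assessment section group calls by P(Het) range. A sketch of how such a per-bin tally could be produced is below. The bin edges follow the ranges printed on the charts (0-.1 up to .99-1). The function name, the input format, and the toy calls are hypothetical, and "fraction of true heterozygotes per bin" is an assumed reading of the accuracy measure.

```python
from bisect import bisect_right

# Bin edges matching the P(Het) ranges 0-.1, .1-.2, ..., .9-.95, .95-.99, .99-1.
BIN_EDGES = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99, 1.0]


def tally_by_p_het(calls, edges=BIN_EDGES):
    """calls: iterable of (p_het, is_true_het) pairs.

    Returns one (low, high, n_calls, fraction_true_het) tuple per bin;
    the fraction is None for bins that received no calls."""
    n_bins = len(edges) - 1
    counts = [0] * n_bins
    hets = [0] * n_bins
    for p_het, is_true_het in calls:
        # Place p_het in [low, high); p_het == 1.0 goes in the last bin.
        idx = min(bisect_right(edges, p_het) - 1, n_bins - 1)
        counts[idx] += 1
        hets[idx] += 1 if is_true_het else 0
    return [
        (edges[i], edges[i + 1], counts[i],
         hets[i] / counts[i] if counts[i] else None)
        for i in range(n_bins)
    ]


# Toy, made-up calls purely to exercise the function.
toy_calls = [(0.05, False), (0.07, False), (0.55, True),
             (0.55, False), (0.97, True), (0.995, True)]
for row in tally_by_p_het(toy_calls):
    print(row)
```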