Transcript speaking
Charalampos (Babis) E. Tsourakakis
Joint work with Gary Miller, Richard Peng, Russell Schwartz,
Stanley Shackney, Dave Tolliver, Maria A. Tsiarli
Speaking Skills
Machine Learning Journal Club
23 Feb. 2010
Algorithms for Denoising aCGH Data
1
Motivation & Problem Definition
Related Work
Our Problem Formulation and a
O(n2) solution
Experimental Results
~
Theoretical Ramifications: a O(n1.5)
algorithm within additive ε error)
Conclusion & Future Work
Algorithms for Denoising aCGH Data
2
For each probe we
obtain a noisy
measurement of
log(T/R) where
T: true DNA
copy number
R=2 for humans
(diploid organisms)
Test DNA: Patient
Reference DNA: Healthy subject
Algorithms for Denoising aCGH Data
3
Ideal Scenario
Copy Number
log(T/R)
0
-Inf
1
-1
2
0 (Healthy probe)
3
0.58
4
1
T: true DNA
copy number
R=2 for humans
(diploid organisms)
In practice, for a variety of reasons (e.g.,
sample impurity, measurement noise) we
obtain a noisy measurement log(T/R) per
probe.
Algorithms for Denoising aCGH Data
4
Input
A vector (t~1,t~2,…,t~n), where ~ti is the
measurement at the i-th proble
Output
A vector (t1,t2,…,tn) with discrete values
corresponding to the true DNA copy number
Algorithms for Denoising aCGH Data
5
log(T/R)
Probe id
Blue x : noisy measurements (input)
Red □ : true value (output)
Algorithms for Denoising aCGH Data
6
log(T/R)
Nearby probes tend to have
the same DNA copy number!
Probe id
1)Treat Data as 1D time series
2) Fit Piecewise Constant
Segments
Algorithms for Denoising aCGH Data
7
Motivation & Problem Definition
Related Work
Our Problem Formulation and a
O(n2) solution
Experimental Results
~
Theoretical Ramifications: a O(n1.5)
algorithm within additive ε error)
Conclusion & Future Work
Algorithms for Denoising aCGH Data
8
Not
Exhaustive
Lasso (Tibshirani et al., Huang et al.)
Kalman Filters (Xing et al.)
Hidden Markov Models (Fridlyand et al.)
Bayesian Hidden Markov Models (Guha et al.)
Wavelets (Hsu et al.)
Hierarchical clustering (Tibshirani et al.)
Circular Binary Segmentation (Olshen et al.)
Statistical likelihood tests
Loweless, i.e., Locally weighted regression
Genetic Local search (Jong et al.)
Algorithms for Denoising aCGH Data
9
Not
Exhaustive
Gaussian Mixtures fitting using EM (Hodgson
et al.)
Variable-bandwidth kernel methods (Muller
et al.)
Variable-knot splines (Stone et al.)
Fused quantile regression (Wang et al.)
Non parametric regression
Thresholding
Algorithms for Denoising aCGH Data
10
CBS: Matlab Toolbox, modification of Binary
Segmentation.
CGHSEG: Gaussianity, AIC&BIC, DP
Both methods perform consistently better
than others on real data (Lai et al.,
Willenbrock et al.)
Algorithms for Denoising aCGH Data
11
Motivation & Problem Definition
Related Work
Our Problem Formulation and a
O(n2) solution
Experimental Results
~
Theoretical Ramifications: a O(n1.5)
algorithm within additive ε error)
Conclusion & Future Work
Algorithms for Denoising aCGH Data
12
For the vector (p1,…,pn) we define the following recurrence equation:
Breakpoint
Squared
error
Penalty
per segment
Tradeoff(λ)
Algorithms for Denoising aCGH Data
13
Keep the first and
second order moments
in an “online” way.
Run time O(n2)
Algorithms for Denoising aCGH Data
14
λ=0.2
Train on synthetic
data generated by
a realistic simulator
Willenbrock et al
λ=0.2 results in
Precision=0.98
Recall=0.91
Algorithms for Denoising aCGH Data
15
Motivation & Problem Definition
Related Work
Our Problem Formulation and a
O(n2) solution
Experimental Results
~
Theoretical Ramifications: a O(n1.5)
algorithm within additive ε error)
Conclusion & Future Work
Algorithms for Denoising aCGH Data
16
A)CSB
B)CGHTrimmer
Lai et al. (2005)
Dataset Available from
http://compbio.med.harvard.edu
C)CGHSEG
Algorithms for Denoising aCGH Data
17
Snijders et al., 15 Cell lines
Golden Standard Dataset with two main characteristics
a) Knowledge of ground truth after tedious biological tests
b) “Easy” dataset (low noise levels)
#Cell
Description
Color
Lines
CGHTrimmer performs worse that
at least one competitor
0
CGHTrimmer performs equally well
with both competitors
9
CGHTrimmer performs better than
one competitor
5
CGHTrimmer performs better than
both competitors
1
Algorithms for Denoising aCGH Data
18
A)CBS
B)CGHTrimmer
Breast Cancer Cell
Line BT474
Chromosome 1
C)CGHSEG
Algorithms for Denoising aCGH Data
19
A)CBS
B)CGHTrimmer
Breast Cancer Cell
Line BT474
Chromosome 17
Results supported
by oncology literature
C)CGHSEG
Algorithms for Denoising aCGH Data
20
A)CBS
B)CGHTrimmer
Breast Cancer Cell
Line T47D
Chromosome 1
C)CGHSEG
Algorithms for Denoising aCGH Data
21
CGHTrimmer
CGHSEG
CBS
Coriell
5.78sec
8.15min
47.7min
Breast Cancer
22.76
23.3min
4.95hours
REMARKS
A) 1 to 3 orders of magnitude faster.
B) Reason for speedup: different
approach compared to competitors (lack of statistical
assumptions, tests, likelihood functions but an intuitive
formulation and a simple dynamic programming algorithm)
Algorithms for Denoising aCGH Data
22
Motivation & Problem Definition
Related Work
Our Problem Formulation and a
O(n2) solution
Experimental Results
~
Theoretical Ramifications: a O(n1.5)
algorithm within additive ε error)
Conclusion & Future Work
Algorithms for Denoising aCGH Data
23
Is O(n2) tight? Probably not…
Lemma
If |pi-pj| > 2 2C then in the optimal solution
points i,j belong to different segments.
Question: Can we use one of the existing
“tricks” to speed up our dynamic program?
Algorithms for Denoising aCGH Data
24
Unfortunately, existing “tricks” for dynamic
programming do not work for us (e.g., Monge
property)
But we can find an good approximation
algorithm!
Algorithms for Denoising aCGH Data
25
Define a shifted i by2 a constant objective function,
pm
i.e., DPi=OPTi-
m 1
Claim: DPi satisfies the following optimization formula
where Si=p1+…+pi
Algorithms for Denoising aCGH Data
26
Find the maximum and the minimum value that the
shifted objected DPi can take.
Claim: DPi takes values in [0,U2n] where
Algorithms for Denoising aCGH Data
27
Perform binary search on
~
by guessing for each index i a DPi
(
0
)
~
DP
i
U2n
-
DPi
Algorithms for Denoising aCGH Data
28
constant
Dot product of two 4D points
~ j,2Sj,Sj2+jDPj)
~
(x,i,Si,-1) and (j,DP
Reporting points in Halfspaces, Matousek FOCS 1992
Algorithms for Denoising aCGH Data
29
ε/n by performing log( U2n/(ε/n) ) iterations
Remember:
We want DPi not DPi
( )
U2n
0
~
DPi
-
DPi
DPi
Set i=n, our algorithm is an
ε-additive approximation
algorithm.
Run Time:
Algorithms for Denoising aCGH Data
30
Motivation & Problem Definition
Related Work
Our Problem Formulation and a
O(n2) solution
Experimental Results
~
Theoretical Ramifications: a O(n1.5)
algorithm within additive ε error)
Conclusion & Future Work
Algorithms for Denoising aCGH Data
31
CHGtrimmer method: Simple, intuitive
dynamic programming algorithm
outperforms state-of-the-art competitors:
Important biological biomarkers for aCGH data.
Good precision-recall results.
Significantly faster.
New paradigm for Dynamic Programming by
reducing the problem to a computational
geometry halfspace query problem.
Algorithms for Denoising aCGH Data
32
Structure: O(n2) unlikely to be tight.
Extend our approximation to 2D recurrences
(benefit many applications)
Preprocess breast cancer data using Trimmer
to get a discretized version, essential to
perform standard tumor phylogenetics
Make the DS practical.
Algorithms for Denoising aCGH Data
33
CGHTrimmer: Discretizing Noisy Array CGH
Data
C.E.T, D. Tolliver, M. A. Tsiarli, S. Shackney, R.
Schwartz
Algorithms for Denoising aCGH Data
G.L. Miller, R. Peng. R. Schwartz, C.E.T
Code will be made available at
http://www.cs.cmu.edu/~ctsourak/
Algorithms for Denoising aCGH Data
34