ICSA, 6/2007
Spatial Smoothing and Hot Spot
Detection for CGH data
using the Fused Lasso
Pei Wang
Cancer Prevention Research Program, PHS, FHCRC
Joint work with Robert Tibshirani,
Stanford University, CA
Pei Wang, [email protected]
Outline
1. DNA copy number alterations and array CGH experiments.
2. Detecting copy number alterations using fused lasso regression.
3. Simulation and real data examples.
4. Jointly modeling copy number alterations and disease outcomes using fused lasso regression.
DNA Copy Number
• In normal human cells: DNA copy number = 2.
• Genome instability => copy number alterations.
Albertson and Pinkel, Hum. Mol. Gen., 2003
DNA Copy Number
In cancer research, knowledge of copy number
aberrations helps to
• Identify important cancer genes.
• Reveal different tumor subtypes with different mechanisms of initiation and/or progression.
• Predict tumor prognosis and improve clinical diagnosis.
Array CGH
• Array Comparative Genomic Hybridization (array CGH).
The scanner reports the [...] on the chips, which correspond to: [...] for each spot
Array CGH
• Array CGH has been implemented using a wide variety of techniques.
  - BAC array: produced from bacterial artificial chromosomes;
  - cDNA microarray: made from cDNAs;
  - oligo array: made from oligonucleotides (Affy, Agilent, Illumina).
• Output from an array CGH experiment:
$\log_2\left(\frac{\text{copy number in the test sample}}{\text{copy number in the reference sample}}\right)$
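As a small worked illustration of the log2 ratio, the sketch below assumes a diploid reference sample (copy number 2) and a pure test sample; the function name `log2_ratio` is hypothetical and is not part of any array CGH software.

```python
import math

def log2_ratio(test_copies: float, reference_copies: float = 2.0) -> float:
    """log2 of the test copy number over the reference copy number.

    Assumption: the reference sample is diploid (copy number 2) by default.
    """
    return math.log2(test_copies / reference_copies)

# Single-copy loss, normal, single-copy gain, two-copy gain.
for copies in (1, 2, 3, 4):
    print(copies, round(log2_ratio(copies), 3))
```

Under these assumptions a normal region sits at 0, a single-copy loss at -1, and a single-copy gain at about 0.585; measured values are noisier and are usually attenuated when the sample contains normal cells.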
Goal
• Identify genome regions with DNA copy number
alterations
Figure: an example segment of CGH data from a GBM primary tumor (Bredel et al. 2005).
Figure panels: raw CGH data; estimated copy number from fused lasso regression, showing the copy number alteration regions.
Method
• Denote the log2 ratio measurements of a chromosome (or chromosome arm) as $y_1, \dots, y_n$.
• Assume:
$y_i = \log_2(\text{true copy number}_i / 2) + e_i = \beta_i + e_i$.
We are interested in recovering $\beta_i$.
• Properties of $\beta_i$:
(1) $\beta_i = 0$ for genome regions without alterations;
$\beta_i > 0$ or $\beta_i < 0$ for regions of gain/loss.
(2) The profile $\{\beta_i\}$ has strong spatial correlation along the index $i$.
Method
• We are interested in finding coefficients $\beta_i$ satisfying
(1) Lasso constraint: detects alteration regions;
(2) Fused constraint: accounts for the spatial correlation.
lasso & fused lasso
• lasso Regression (Tibshirani 1996)
• fused lasso Regression (Tibshirani et al. 2004)
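The formulas for this slide did not survive in the transcript. For orientation, the block below gives the standard constrained forms of the two methods as they appear in the cited papers; the notation ($y_i$ for responses, $x_{ij}$ for predictors, $\beta_j$ for coefficients, $s_1$ and $s_2$ for the bounds) is an assumption and may differ from what was shown in the talk.

```latex
% Lasso (Tibshirani 1996)
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Bigl( y_i - \sum_{j=1}^{p} x_{ij}\beta_j \Bigr)^2
\quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le s_1 .

% Fused lasso: the lasso constraint plus a bound on successive differences
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Bigl( y_i - \sum_{j=1}^{p} x_{ij}\beta_j \Bigr)^2
\quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le s_1 ,
\quad \sum_{j=2}^{p} |\beta_j - \beta_{j-1}| \le s_2 .

% Special case for a CGH profile (one coefficient per probe, identity design)
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} ( y_i - \beta_i )^2
\quad \text{subject to} \quad \sum_{i=1}^{n} |\beta_i| \le s_1 ,
\quad \sum_{i=2}^{n} |\beta_i - \beta_{i-1}| \le s_2 .
```

The first bound encourages many coefficients to be exactly zero, and the second encourages neighboring coefficients to be equal, which yields a piecewise-constant estimate.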
Method
• Apply fused lasso on aCGH data:
(1) Solve the optimization.
(2) Choose the tuning parameters.
(3) Control the False Discovery Rate (FDR).
1. Solve the optimization
2. Choose the tuning parameter
For the general fused lasso regression:
- Use SQOPT by Gill et al. to solve the quadratic programming problem with sparse linear constraints (Tibshirani et al., 2004)
1. Solve the optimization
2. Choose the tuning parameter
For the special application on CGH arrays:
- Pathwise coordinate optimization (Jerome Friedman et al., Tech Report)
• A modification of the original coordinate-wise descent algorithm (shooting procedure) (Fu 1998, Daubechies et al. 2004).
• The running time is only 1/100 of that of the quadratic programming approach.
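To make the coordinate-wise descent idea concrete, here is a minimal sketch of the plain shooting procedure for the lasso penalty only, written in Lagrangian form. It is not the pathwise algorithm from the tech report, which also handles the fusion penalty; the function names and the penalty parameter `lam` are illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma, setting small values exactly to zero."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=200, tol=1e-10):
    """Minimize 0.5 * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)          # squared norm of each column
    residual = y - X @ beta
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Inner product of column j with the partial residual,
            # i.e. the residual with coordinate j's contribution added back.
            rho = X[:, j] @ residual + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                residual -= X[:, j] * delta   # keep the residual in sync
                beta[j] = new_bj
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta

# With an identity design (one coefficient per probe), the lasso-only
# solution is just soft-thresholding of the data.
y = np.array([0.9, -0.1, 0.3, -1.2])
print(lasso_coordinate_descent(np.eye(4), y, lam=0.25))
```

Each coordinate update has a closed form, which is why the approach is cheap per sweep. With the fusion penalty added, a single-coordinate update can get stuck, which is the kind of difficulty the modified algorithm is designed to address.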
1. Solve the optimization
2. Choose the tuning parameter
Estimate s1 and s2 from pre-smoothed versions of the data:
• s1 controls the overall amount of copy number alteration on the target chromosome; it is estimated using heavily smoothed Y.
• s2 controls the frequency of copy number alterations on the target chromosome; it is estimated using moderately smoothed Y.
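One way to read the two bullets above is sketched below. It assumes that s1 bounds the sum of absolute coefficients and s2 bounds the sum of absolute successive differences, and it plugs in smoothed data for the unknown coefficients. The moving-average smoother, the window sizes, and the function names are all hypothetical stand-ins; the transcript does not say which smoother or bandwidths were used.

```python
import numpy as np

def moving_average(y, window):
    """Centered moving average with edge padding (a stand-in smoother)."""
    y = np.asarray(y, dtype=float)
    half = window // 2
    padded = np.pad(y, half, mode="edge")
    kernel = np.ones(2 * half + 1) / (2 * half + 1)
    return np.convolve(padded, kernel, mode="valid")   # same length as y

def estimate_bounds(y, heavy_window=21, moderate_window=5):
    """Plug-in estimates of the two bounds (assumed definitions, see above)."""
    heavy = moving_average(y, heavy_window)        # heavily smoothed Y
    moderate = moving_average(y, moderate_window)  # moderately smoothed Y
    s1 = np.abs(heavy).sum()                # overall amount of alteration
    s2 = np.abs(np.diff(moderate)).sum()    # how often the level changes
    return s1, s2

# A flat profile at 0.5 over 40 probes: large s1, no jumps so s2 is 0.
print(estimate_bounds(np.full(40, 0.5)))
```

Heavy smoothing suits s1 because it averages out noise in the overall level, while lighter smoothing keeps the jumps that s2 is meant to count.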
Other Methods
Lai et al. (2005) provide a thorough review of statistical methods for aCGH analysis.
- Simple smoothing with lowess
- Hidden Markov model (Fridlyand et al. 2004)
- Top-down: Circular Binary Segmentation (Olshen et al. 2004, Venkatraman et al. 2007)
- Bottom-up: clustering along chromosomes (Wang et al. 2005)
- Dynamic programming: CGHseg (Picard et al. 2005)
- Denoising using wavelets (Hsu et al. 2005)
- And many others.
• General smoothing methods are not typically useful for analyzing CGH data, because their results can be difficult to interpret.
• Fused lasso regression can also be viewed as a smoothing approach, but it captures the structure of the CGH data very well.
Figure: comparison of fused lasso with three segmentation methods: CGHseg (Picard et al. 2005), CLAC (Wang et al. 2005), and CBS (Olshen et al. 2004).
Simulation Example
Further comparison of the fused lasso results with the three segmentation methods on simulated data sets from Lai et al. 2005.
• Total length of chromosome segment: 100.
• Four different aberration widths: 5, 10, 20, 40.
• Signal-to-noise ratio is equal to 1.
Normal region: x ~ N(0, 0.25);
Alteration region: x ~ N(0.25, 0.25).
• For each width, simulate 100 independent chromosomes.
Evaluation process:
1. Estimate copy number using the different methods.
2. Apply different thresholds to the estimated copy numbers, and calculate
TPR = # of correct calls / # of aberration probes.
FPR = # of false calls / # of normal probes.
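The simulation and evaluation steps above can be sketched as follows. Several details are assumptions rather than facts from the slides: the second argument of N(., 0.25) is read as a standard deviation (so that the signal-to-noise ratio 0.25 / 0.25 equals 1), the aberration is placed in the middle of the segment, and a probe is called altered when its estimate exceeds the threshold. The function names are hypothetical.

```python
import numpy as np

def simulate_chromosome(width, length=100, shift=0.25, sd=0.25, rng=None):
    """One simulated segment with a single centered gain of the given width."""
    rng = np.random.default_rng() if rng is None else rng
    truth = np.zeros(length, dtype=bool)
    start = (length - width) // 2            # assumed position: centered
    truth[start:start + width] = True
    x = rng.normal(0.0, sd, size=length) + shift * truth
    return x, truth

def tpr_fpr(estimate, truth, threshold):
    """True and false positive rates when calling probes above a threshold."""
    calls = np.asarray(estimate) > threshold
    tpr = (calls & truth).sum() / truth.sum()
    fpr = (calls & ~truth).sum() / (~truth).sum()
    return tpr, fpr

# Trace a TPR-FPR curve for one chromosome, using the raw data as the
# "estimate"; a real comparison would plug in each method's fitted values.
x, truth = simulate_chromosome(width=20, rng=np.random.default_rng(0))
curve = [tpr_fpr(x, truth, t) for t in np.linspace(-1.0, 1.0, 21)]
```

Averaging such curves over the 100 simulated chromosomes for each width gives one TPR-FPR curve per method and width.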
Figure: the TPR-FPR curves for the four methods under different window sizes.
Real Data Example
Breast cancer cell line MDA157 (Pollack 2002)
Computation Time
Comparison of the speed of the four methods.
Data simulation:
1. Pre-specify chromosome length p = 100, 500, 1000, 2000.
2. Randomly sample 50 genome segments of length p from 17 breast cancer CGH arrays.
3. Apply each method to the 50 segments and record the CPU time.
CPU time in seconds, mean (sd):

Method                 p=100           p=500           p=1000          p=2000
CBS (DNAcopy 1.10.0)   0.151 (0.113)   1.243 (0.804)   3.669 (1.135)   8.455 (2.854)
CGHseg                 0.063 (0.008)   0.445 (0.016)   1.223 (0.041)   4.205 (0.104)
CLAC                   0.049 (0.003)   0.086 (0.013)   0.157 (0.037)   0.368 (0.073)
cghFLasso              0.025 (0.013)   0.140 (0.017)   0.334 (0.036)   0.840 (0.056)
Applying Fused Lasso on CGH:
• provides an appropriate model for aCGH data.
• has favorable performance compared to other methods.
• is computationally efficient.
• provides a flexible framework for aCGH analysis in more complicated settings.
Joint Model
Study copy number alterations and disease outcomes.
• Model:
Interested in finding disease associated genes.
• Naïve method (two steps):
1. call gains and losses for each individual array;
2. use the estimated copy numbers to look for disease-associated genes.
Drawbacks:
1. Information is lost after the first round of data processing.
2. “Smoothing adds to already existing among neighboring
values, thus causing the within-class covariance to be even
more jagged… increase the computational cost with zero
benefit in classification performance” (Hastie et al. 1995 Ann. of Stat.)
Joint Model
• Joint modeling:
Compare different approaches on a simulated data set.
• Simulate a genome segment with p=50 genes for n=30 samples:
- true copy numbers
- noisy CGH measurements
• Generate a pseudo phenotype for each sample using two pre-selected nonadjacent genes.
• Look for disease-associated genes with different methods. Vary the tuning parameter t to produce an ROC curve for each method.
• Repeat 200 times and plot the mean ROC curve.
Summary
• Fused Lasso Regression can be used to characterize the
spatial structure of array CGH data.
- Tibshirani & Wang, Biostatistics (In press)
- google-> tibshirani -> click on cghFlasso under software
• The flexible framework of the regression model can be
easily extended to solve other problems involving CGH
data.
Acknowledgment
• Stanford University, Department of Statistics
Robert Tibshirani, Jerry Friedman, Trevor Hastie.
• Stanford University, Department of Pathology
Jonathan Pollack.