Array CGH Analysis

Download Report

Transcript Array CGH Analysis

Algorithms for
Smoothing Array CGH data
Kees Jong (VU, CS and Mathematics)
Elena Marchiori (VU, Computer Science)
Aad van der Vaart (VU, Mathematics)
Gerrit Meijer (VUMC)
Bauke Ylstra (VUMC)
Marjan Weiss (VUMC)
1
Tumor Cell
Chromosomes of tumor cell:
2
CGH Data

C
o
p
y
#
Clones/Chromosomes 
3
Naïve Smoothing
4
“Discrete” Smoothing
Copy numbers are integers
5
Why Smoothing ?
• Noise reduction
• Detection of Loss, Normal, Gain,
Amplification
• Breakpoint analysis
Recurrent (over tumors) aberrations may indicate:
–an oncogene or
–a tumor suppressor gene
6
Is Smoothing Easy?
Measurements are relative to a reference
sample
Printing, labeling and hybridization may be
uneven
Tumor sample is inhomogeneous
•vertical scale is relative
•do expect only few levels
7
Smoothing: example
8
Problem Formalization
A smoothing can be described by
• a number of breakpoints
• corresponding levels
A fitness function scores each smoothing according
to fitness to the data
An algorithm finds the smoothing with the highest
fitness score.
9
Smoothing
breakpoints
variance
levels
10
Fitness Function
We assume that data are a realization of a
Gaussian noise process and use the
maximum likelihood criterion adjusted with
a penalization term for taking into account
model complexity
We could use better models given insight
in tumor pathogenesis
11
Fitness Function (2)
CGH values: x1 , ... , xn
breakpoints: 0 < y1< … < yN < xN
levels: m1, . . ., mN
error variances: s12, . . ., sN2
likelihood:
12
Fitness Function (3)
Maximum likelihood estimators of μ and s 2
can be found explicitly
Need to add a penalty to log likelihood to
control number N of breakpoints
penalty
13
Algorithms
Maximizing Fitness is computationally hard
Use genetic algorithm + local search to find
approximation to the optimum
14
Algorithms: Local Search
choose N breakpoints at random
while (improvement)
- randomly select a breakpoint
- move the breakpoint one position to left
or to the right
15
Genetic Algorithm
Given a “population” of candidate smoothings
create a new smoothing by
- select two “parents” at random from population
- generate “offspring” by combining parents
(e.g. “uniform crossover” or “union”)
- apply mutation to each offspring
- apply local search to each offspring
- replace the two worst individuals with the
offspring
16
Experiments
• Comparison of
–
–
–
–
GLS
GLSo
Multi Start Local Search (mLS)
Multi Start Simulated Annealing (mSA)
• GLS is significantly better than the other
algorithms.
17
Comparison to Expert
algorithm
expert
18
Relating to Gene Expression
19
Relating to Gene Expression
20
Algorithms for
Smoothing Array CGH data
Kees Jong (VU, CS and Mathematics)
Elena Marchiori (VU, CS)
Aad van der Vaart (VU, Mathematics)
Gerrit Meijer (VUMC)
Bauke Ylstra (VUMC)
Marjan Weiss (VUMC)
21
22
Conclusion
• Breakpoint identification as model fitting to
search for most-likely-fit model given the
data
• Genetic algorithms + local search perform
well
• Results comparable to those produced by
hand by the local expert
• Future work:
– Analyse the relationship between Chromosomal aberrations and
Gene Expression
23
Example of a-CGH Tumor

V
a
l
u
e
Clones/Chromosomes 
24
a-CGH vs. Expression
a-CGH
• DNA
– In Nucleus
– Same for every cell
• DNA on slide
• Measure Copy
Number Variation
Expression
• RNA
– In Cytoplasm
– Different per cell
• cDNA on slide
• Measure Gene
Expression
25
Breakpoint Detection
• Identify possibly damaged genes:
– These genes will not be expressed anymore
• Identify recurrent breakpoint locations:
– Indicates fragile pieces of the chromosome
• Accuracy is important:
– Important genes may be located in a region
with (recurrent) breakpoints
26
Experiments
• Both GAs are Robust:
– Over different randomly initialized runs
breakpoints are (mostly) placed on the same
location
• Both GAs Converge:
– The “individuals” in the pool are very similar
• Final result looks very much like (mean
error = 0.0513) smoothing conducted by the
local expert
27
Genetic Algorithm 1 (GLS)
initialize population of candidate solutions randomly
while (termination criterion not satisfied)
- select two parents using roulette wheel
- generate offspring using uniform crossover
- apply mutation to each offspring
- apply local search to each offspring
- replace the two worst individuals with the
offspring
28
Genetic Algorithm 2 (GLSo)
initialize population of candidate solutions randomly
while (termination criterion not satisfied)
- select 2 parents using roulette wheel
- generate offspring using OR crossover
- apply local search to offspring
- apply “join” to offspring
- replace worst individual with offspring
29
Fitness function (2)
CGH values: x1 , ... , xn
breakpoints: 0 < y1< … < yN < xN
levels: m1, . . ., mN
error variances: s12, . . ., sN2
likelihood:
30