Combinatorial Optimization Problems in Computational Biology

Download Report

Transcript Combinatorial Optimization Problems in Computational Biology

Algorithms for
Universal DNA Tag Array
Design and Optimization
Ion Mandoiu
Computer Science & Engineering Department
University of Connecticut
High-Throughput Assay Design
• Fast growing number of applications, with more stringent
constraints on sensitivity, reproducibility, cost, etc.
• Source of challenging combinatorial problems
–
–
–
–
–
–
Multiplex PCR primer set selection
Probe selection
Fidelity probes
Mask design
…
This talk: design and optimization of universal tag arrays
Overview
Background on DNA Microarrays
 Universal Tag Arrays
 Tag Set Design Problem
 Tag Assignment Problem
 Conclusions

Watson-Crick Complementarity
• Four nucleotide types: A,C,T,G
• A’s paired with T’s (2 hydrogen bonds)
• C’s paired with G’s (3 hydrogen bonds)
DNA Microarrays
Labeled DNA/RNA mixture flushed
over array of probes
Laser activation of fluorescent labels
Optical scanning used to
identify probes with
complements in the mixture
Images courtesy of Affymetrix.
Applications
• Gene expression (transcription analysis)
• Genomic-based microorganism identification
• Single Nucleotide Polymorphism (SNP)
genotyping
Gene Expression
• Cells express different subsets of genes
under different environments
Translation
Transcription
mRNA
Gene
Protein
Two-Color Technique
• Sample labeled RED
• Control labeled GREEN
• YELLOW probes hybridize to both
sample and control
•BLACK probes hybridize to neither
cell type 1
RNA 1
target 1
Cy3 Cy3Cy3
target 2
Cy5Cy5Cy5
RNA 2
cell type 2
Microarray Technologies
• Arrays of cDNAs
– Obtained by reverse transcription from
Expressed Sequence Tags (ESTs)
• Oligonucleotide arrays
– Short (20-60bp) synthetic DNA strands
Robotic cDNA Arrayers
Pin Technology
Quill Pen Technology
Ink jet Technology
Pin Ring Technology
VLSIPS Array Synthesis
Images courtesy of Affymetrix.
Oligonucleotide Arrays
• Pros
– Highly versatile
– Very high integration (VLSIPS ~106 probes/cm2)
• Cons
– Higher cost
– Application specific, non-reusable designs
Overview
Background on DNA Microarrays
 Universal Tag Arrays
 Tag Set Design Problem
 Tag Assignment Problem
 Conclusions

Universal Arrays
• Brenner 97, Morris et al. 98
• Key idea
– Array consisting of application independent tags
– Two-part “reporter” probes: aplication specific
primers ligated to antitags
– Detection carried by a sequence of reactions
separately involving the primer and the antitag part
of reporter probes
Universal Array Experiment
+
Universal Array Advantages
• Cost effectiveness
– Array does not change  economies of scale
• Customization
– Fast, only need to synthesize new set of reporter
probes
• Reliability
– Solution phase hybridization better understood than
hybridization on solid support
Overview
Background on DNA Microarrays
 Universal Tag Arrays
 Tag Set Design Problem
 Tag Assignment Problem
 Conclusions

Tag Set Requirements
• Hybridization constraints
(H1) Antitags hybridize strongly to complementary tags
(H2) No antitag hybridezes to a non-complementary tag
(H3) Antitags do not cross-hybridize to each other
• Two conflicting goals
– Maximize number of tags
– Ensure hybridization constraints
Hybridization Model
• Melting temperature Tm: temperature at
which 50% of duplexes are in hybridized state
• 2-4 rule
Tm = 2 #(As and Ts) + 4 #(Cs and Gs)
• More accurate models exist, e.g., the nearneighbor model, however, 2-4 rule reasonably
accurate for short oligos
Tag Set Design Problem
• [Ben-Dor et al. 00] Conservative formalization of
(H1)+(H2) based on nucleation complex theory
and 2-4 rule
(C1) Every tag must have total weight  h
(C2) No 2 tags share a common substring of weight  c
Where
– w(A)=w(T)=1, w(C)=w(G)=2
– c, h are given constant (Affymetrix uses l+h in C1)
c-h codes
• c-token: left-minimal DNA string of weight  c, i.e.,
– w(x)  c
– w(x’) < c for every proper suffix of x
• [Ben-Dor et al. 00] A set of tags is called c-h code if
(C1) Every tag has weight  h
(C2*) Every c-token is used at most once
c-h code problem: given c and h, find largest c-h code
c-h code problem
• [Ben-Dor et al. 00] give a constructive upperbound on largest c-h code size and a nearoptimal algorithm based on DeBruijn sequences
• [MT05] c-h code problem can be formulated as a
maximum integer flow problem with capacity
constraints on (disjoint) sets of vertices
– Solvable in practical time for small values of c
Token Content of a Tag
c=4
CCAGATT
CC
CCA
CAG
AGA
GAT
GATT
Layered c-token graph, length-l tags
c/2
(c/2)+1 …
l-1
l
c1
s
t
cN
token  s-t path in layer graph
c=4, l=7
tag = CCAGATT
Layer 2
3
4
5
6
7
sCC CCACAGAGAGATGATTt
Integer Program Formulation
O(hN) constraints and variables, where N = #c-tokens
Number of c-tokens
Let Gn = #strings of weight n
 G1 = 2; G2 = 6; Gn = 2Gn-2 + 2Gn-1
W=A or T, S=C or G
Token type
Num tokens
<c-2>S
2 Gc-2
S<c-3>S
4 Gc-3
<c-1>W
2 Gc-1
S<c-2>W
4 Gc-2
ILP Results
Periodic Tags
• If multiple c-token copies are not allowed in a tag
(uniqueness property), every tag uses roughly one ctoken per letter (except for first up to c-1 letters)
• A tag t is periodic if it is the prefix of () for
some string 
– t periodic with period   t uses at most || c-tokens
• Lemma: there is always an optimum solution
consisting exclusively of periodic tags and tags
with uniqueness property
Portion of successor c-token graph, c=4
CC
AAG
AAC
AAAA
AAAT
Tag Set Design Algorithm
1. Construct “successor” graph G over c-tokens
2. T{}
3. For all cycles C defining periodic tags, in increasing
order of cycle length, do
–
If C has no c-tokens in common with T, then
•
•
Add a tag defined by C to T
Remove C from G
4. Perform a pruned alphabetic tree search and add to T
tags with no c-tokens in common with T
5. Return T
Vertex-disjoint Cycle Packing Problem
• Given directed graph G, find maximum number of
vertex disjoint directed cycles in G
• [MT 05] APX-hard even for regular directed graphs
with in-degree and out-degree 2
– h-c/2+1 approximation factor for tag set design problem
• [Salavatipour and Verstraete 05]
– Quasi-NP-hard to approximate within (log1- n)
– O(n1/2) approximation algorithm
Experimental Results
Antitag-to-Antitag Hybridization
• Additional practical constraint: antitags do not
cross-hybridize (including self)
– Ignored by Ben-Dor et al
• Formalization in c-token hybridization model:
(C3) No two (anti)tags contain complementary
substrings of weight  c
• Cycle packing and tree search extend easily
Results w/ Extended Constraints
Overview
Background on DNA Microarrays
 Universal Tag Arrays
 Tag Set Design Problem
 Tag Assignment Problem
 Conclusions

More Possible Mis-Hybridizations
Degrees of freedom: partition of primers on multiple arrays, tag
assignment within an array (may be forced to leave tags unassigned)
Here we focus on avoiding case (a), primer-to-tag hybridization
Assignable Primers
• Set P of primers is assignable to set T of tags if, for
every tags t,t’ and primers p,p’ with t hybridizing to
primer p, at most one of the assignments (p,t), (p,t’),
(p’,t) can be made
p’
t
p
t’
Characterization of Assignable Sets
• [Ben-Dor 04] Set P is assignable to T iff X+Y  |P|,
where, in the conflict graph induced by P+T
– X = number of primers incident to a degree 1 tag
– Y = number of degree 0 tags
X=1
Y=2
MAPS Problem
• Maximum Assignable Primer Set (MAPS) Problem:
given primer set P and tag set T, find maximum size
assignable subset of P
• [Ben-Dor 04] Greedy deletion heuristic: repeatedly
delete primer of maximum weight from P until it
becomes assignable, where
– Potential of tag t is 2-|P(t)|
– Potential of primer p is sum of potentials of conflicting tags
Universal Array Multiplexing Problem
• Multiplexing Problem: given primer set P and tag set
T, find partition of P into minimum number of
assignable sets
• [Ben-Dor 04] Repeatedly find approximate MAP using
greedy deletion
Integration with Probe Selection
• In practice, several primer candidates with equivalent
functionality
– In SNP genotyping, can pick primer from either forward and
reverse strand
– In gene expression/identification applications, many primers
have desired length, Tm, etc.
Pooled Array Multiplexing Problem
•
[MPT 05] Given set of primer pools P and tag set T,
find a primer from each pool and a partition of
selected primers into minimum number of assignable
sets
X+Y Characterization no Longer Holds
Pooled Multiplexing Algorithms
1. Primer-Del = greedy deletion for pools similar to
[Ben-Dor et al 04]
2. Primer-Del+ = same but never delete last primer from
pool unless no other choice
3. Min-Pot = select primer with min potential from each
pool, then run Primer-Del
4. Min-Deg = select primer with min conflict degree,
then run Primer-Del
Results: GenFlex Tags, c=7
Results: GenFlex Tags, c=8
GenFlex vs. PerTags, c=8 (|T|=213)
Overview
Background on DNA Microarrays
 Universal Tag Arrays
 Tag Set Design Problem
 Tag Assignment Problem
 Conclusions

Conclusions
• New techniques for tag set design and tag assignment
lead to significantly improved multiplexing
– Enforcing antitag-to-antitag hybridization constraints make
universal arrays more reliable
• Other applications of universal tags:
– Lab-on-chip, DNA-driven assembly (e.g., carbon nanotubes), DNA computing [Brenneman&Condon 02]
• Other applications of assignment techniques:
– Genotyping by mass-spectroscopy [Aumann et al 05]
– Genotyping using l-mer arrays
Ongoing Work
•
Special type of universal arrays: l-mer arrays
– Initially introduced for sequencing by hybridization, but
proved impractical
– Currently investigated for use in resequencing by hybridization
– More promising application: SNP genotyping
– Isothermal prefix sets instead of l-mer arrays?
•
•
Deeper integration between tag assignment and probe
selection, e.g., in string barcoding
Extend algorithms to more accurate near-neighbor
hybridization model
– monotonic Tm  c-tokens  successor graph
Open Problems
•
Settle approximation complexity of (vertex) disjoint cycle
packing ([Salavatipour and Verstraete 05] show
approximation preserving reductions between edge and
vertex disjoint versions)
•
Establish better approximation bound for special instances
arising in tag set design
•
Improved approximation algorithms for maximum
assignable tag set and the tag assignment problems
Acknowledgments
• Claudia Prajescu and Dragos Trinca
• UCONN Research Foundation