Transcript Document
Optimizing pooling strategies for the
massive next-generation sequencing of
viral samples
Pavel Skums1
Joint work with
Olga Glebova2, Alex Zelikovsky2, Ion Mandoiu3 and
Yury Khudyakov1
1Centers
for Disease Control and Prevention, Atlanta, GA
2Georgia State University, Atlanta, GA
3University of Connecticut, Storrs, CT
Outline
1. Massive NGS of viral samples
2. Optimal pooling design problem
3. Algorithm and results
NGS in epidemiology
Epidemiological
parameters
Transmission
networks
Molecular
surveillance
Vaccination
strategies
Prediction of the
epidemics
progress
NGS in epidemiology
A large-scale molecular surveillance requires sequencing of
unprecedentedly large sets of viral samples.
NGS of tens of thousands of samples is highly cost- and laborintensive.
Example: sequencing 100K samples using 454 senior system
with 50 MIDs doing one sequencing run per day
Cost: 5000*(100 000)/50 = 10 000 000$
Time: (100 000)/50 = 2000 days 5.5 years
Optimal Pooling Design Problem
Goal: a framework for identification of viral sequences from
large number of samples using the smallest possible number
of NGS runs.
Idea: for n samples generate m pools (i.e. mixtures of
samples) with m << n in such a way that every sample is
uniquely identified by the pools to which it belongs.
E
Optimal Pooling Design Problem
Example.
8 samples: 1,2,3,4,5,6,7,8
Sequencing each sample separately: 8 runs
4 pools:
M1 = {1,2,3,4}
M2 = {5,6,7,8}
M3 = {1,2,5,6}
M4 = {1,3,5,7}
Sequencing pools: 4 runs
Optimal Pooling Design Problem
4 pools:
M1 = {1,2,3,4}
M2 = {5,6,7,8}
M3 = {1,2,5,6}
M4 = {1,3,5,7}
{1} = M1M3M4
{2} = (M1M3) \ M4
……………………………………
{8} = (M2 \ M3) \ M4
Optimal Pooling Design Problem
Problem 1 (Optimal Pooling Design Problem).
Given: a set of samples S = {S1,...,Sn}
Find: a set of pools P = {P1,…,Pm} , Pk S for k=1,…,m such that
1)P1…Pm = S
2) for every Si,SjS there exists Pk P such that
(Pk separates Si and Sj)
|Pk{Si,Sj}| = 1
3) m is minimal
Theorem1. There exists a solution of Problem 1 with m = log(n) + 1
Optimal Pooling Design Problem
Additional conditions for the problem
Condition
Reasons
Each pool contains at most
k samples
• number of reads which could be obtained by
each NGS technology is bounded
• if large number of samples are mixed in one
pool, some of them may be lost due to a PCR
bias
|Pj| ≤ k for j = 1,…,m
Optimal Pooling Design Problem
Additional conditions for the problem
Condition
Reasons
Each pool contains at most
k samples
• number of reads which could be obtained by
each NGS technology is bounded
• if large number of samples are mixed in one
pool, some of them may be lost due to a PCR
bias
|Pj| ≤ k for j = 1,…,m
Each sample belongs to at
least l pools
|{j : Si Pj}| ≥ l for i = 1,…,n
• to ensure sufficient coverage for sequences of
each sample
Optimal Pooling Design Problem
Additional conditions for the problem
Condition
Reasons
Each pool contains at most
k samples
• number of reads which could be obtained by
each NGS technology is bounded
• if large number of samples are mixed in one
pool, some of them may be lost due to a PCR
bias
|Pj| ≤ k for j = 1,…,m
Each sample belongs to at
least l pools
• to ensure sufficient coverage for sequences of
each sample
|{j : Si Pj}| ≥ l for i = 1,…,n
Some pairs of samples
should not be put into a
same pool
• Some samples may intersect (if they belong to
the same transmission cluster)
Optimal Pooling Design Problem
A graph G(S) = (V,E), where
V=S
SiSjE if and only if there is a confidence that the samples Si
and Sj do not intersect.
Condition
Reasons
Each pool Pj should be a
clique of the graph G(S)
• Some samples may intersect (if they belong to
the same transmission cluster)
Optimal Pooling Design Problem
Problem 2 (Minimum Clique Test Set Problem).
Given: a graph G=G(S), natural numbers k,l
Find: a set of cliques P = {P1,…,Pm} of G such that
1) |Pi| ≤ k for every i=1,…,m
2) every vertex v V(G) belongs to at least l cliques from P
3) for every u,vV(G) there exists Pi P such that
|Pi{u,v}| = 1
4) m is minimal
Optimal Pooling Design Problem
Minimum Test Set Problem (Garey, Johnson)
Given: collection Q={Q1,…,Qn} of subsets of a finite set S
Find: a subcollection P = {P1,…,Pm}Q such that
1) for every si,sjS there exists Pr P such that
|Pr{si,sj}| = 1
2) m is minimal
Problem reformulations
Minimum Clique Test Set Problem
Given: a graph G=G(S), natural numbers k,l
Find: a set of cliques P = {P1,…,Pm} of G such that
1) |Pi| ≤ k for every i=1,…,m
2) every vertex v V(G) belongs to at least l cliques from P
3) for every u,vV(G) there exists Pi P such that
|Pi{u,v}| = 1
4) m is minimal
Only some pairs of vertices should be separated
A graph H with V(H)=V(G), uvE(H) if and only if u and v should
be separated
Problem reformulations
Minimum Clique Test Set Problem
Given: a graphs G and H on the same set of vertices V, natural
numbers k,l
Find: a set of cliques P = {P1,…,Pm} of G such that
1) |Pi| ≤ k for every i=1,…,m
2) every vertex v V(G) belongs to at least l cliques from P
3) for every uvE(H) there exists Pi P such that
|Pi{u,v}| = 1
4) m is minimal
Only some pairs of vertices should be separated
A graph H with V(H)=V(G), uvE(H) if and only if u and v should
be separated
Problem reformulations
Minimum Clique Test Set Problem
Given: a graphs G and H on the same set of vertices V, natural
numbers k,l
Find: a set of cliques P = {P1,…,Pm} of G such that
1) |Pi| ≤ k for every i=1,…,m
2) every vertex v V belongs to at least l cliques from P
3) for every uvE(H) there exists Pi P such that
|Pi{u,v}| = 1
4) m is minimal
Replace each vertex vV(G) with l pairwise non-adjacent copies
Problem reformulations
Minimum Clique Test Set Problem
Given: a graphs G and H on the same set of vertices V, natural
number k
Find: a set of cliques P = {P1,…,Pm} of G such that
1) |Pi| ≤ k for every i=1,…,m
2) P1…Pm = V(G)
3) for every uvE(H) there exists Pi P such that
|Pi{u,v}| = 1
4) m is minimal
Replace each vertex vV(G) with l pairwise non-adjacent copies
Problem reformulations
Minimum Clique Test Set Problem
Given: a graphs G and H on the same set of vertices V, natural
number k
Find: a set of cliques P = {P1,…,Pm} of G such that
1) |Pi| ≤ k for every i=1,…,m
2) P1…Pm = V
3) for every uvE(H) there exists Pi P such that
|Pi{u,v}| = 1
4) m is minimal
For every uV add a vertex xu
and an edge uxuE(H)
2
1
3
Problem reformulations
Minimum Clique Test Set Problem
Given: a graphs G and H on the same set of vertices V, natural
number k
Find: a set of cliques P = {P1,…,Pm} of G such that
1) |Pi| ≤ k for every i=1,…,m
2) P1…Pm = V
3) for every uvE(H) there exists Pi P such that
|Pi{u,v}| = 1
4) m is minimal
For every uV add a vertex xu
and an edge uxuE(H)
x1
2
1
x3
3
x2
Problem reformulations
Minimum Clique Test Set Problem
Given: a graphs G and H on the same set of vertices, natural
number k
Find: a set of cliques P = {P1,…,Pm} of G such that
1) |Pi| ≤ k for every i=1,…,m
3) for every uvE(H) there exists Pi P such that
|Pi{u,v}| = 1
4) m is minimal
For every uV add a vertex xu
and an edge uxuE(H)
x1
2
1
x3
3
x2
Heuristic algorithm
Input: a graphs G and H on the same set of vertices V, natural number k
P=
WHILE CP C V OR E(H)
find maximum cut (A,B) in H (using local search);
for every aA put w(a) = # of neighbors of a from B in H;
for every bB put w(b) = # of neighbors of b from A in H;
find maximum clique C1 with |C1|≤k in a subgraph G[A] with weights w;
find maximum clique C2 with |C2|≤k in a subgraph G[B] with weights w;
C := argmax{w(C1),w(C2)} ;
P:= P {C};
E(H) := E(H) \ {uv : uC, vV\C}
ENDWHILE
Algorithm results
1. All samples are unrelated (i.e. G is complete graph)
# samples
4
8
16
32
64
128
256
512
1024
2048
# generated pools
3
4
5
6
7
8
9
10
11
12
Algorithm results
2. Some samples are related (G is a random graph with the
edge probability p)
0.8
0.7
#pools/# samples
0.6
0.5
0.4
p=0.5
p=0.75
0.3
p=1
0.2
0.1
0
# of samples
Thank you!