Reference-Based Indexing of Sequence Databases
Reference-Based Indexing
of Sequence Databases
(VLDB ’06)
Jayendra Venkateswaran
Deepak Lachwani
Tamer Kahveci
Christopher Jermaine
Presented by Angela Siu
Content
Introduction
Related work
Reference-Based Methodology
Selection of References
Mapping of References
Search Algorithm
Experimental evaluation
Conclusion
Introduction
Many and/or very long sequences
Similarity search
Genomics, proteomics, dictionary search
Edit distance metric
Dynamic programming (expensive; see the sketch after this list)
Reference-based indexing
Reduce number of comparisons
Choice of references
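The edit distance used throughout the slides is the standard dynamic-programming computation. Below is a minimal Python sketch of that textbook recurrence (not the authors' code), using unit costs for insertions, deletions, and substitutions.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance with unit costs."""
    prev = list(range(len(b) + 1))              # distances for the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]                              # cost of deleting all of a[:i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                  # delete ca
                            curr[j - 1] + 1,              # insert cb
                            prev[j - 1] + (ca != cb)))    # substitute or match
        prev = curr
    return prev[-1]

# e.g. edit_distance("ACGT", "AGT") == 1
```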
Related work
Index structures
k-gram indexing
Suffix tree
Manage mismatches inefficiently
Excessive memory usage: 10 – 37 bytes per letter
Vector space indexing
Find exact matches of k-grams
Extend to find longer alignments with errors
E.g. FASTA, BLAST
Performance deteriorates quickly as the size of the database
increases
SST, frequency vector, …
Store the number of occurrences of each letter in a sequence
Lower bound of actual distance
Performs poorly as the query range increases
Reference-based indexing
A variation of vector space indexing
VP-tree, MVP-Tree, iDistance, Omni, M-Tree, Slim-Tree, DBM-Tree,
DF-Tree
Reference-Based Indexing
A sequence database S
A set of reference sequences V
Pre-compute edit distances {ED(si, vj) | si ∈ S, vj ∈ V}
Similarity search with distance threshold ε
Triangle inequality
Prune sequences that are too close or too far away from a
reference
LB = max vj∈V |ED(q, vj) − ED(vj, s)|
UB = min vj∈V (ED(q, vj) + ED(vj, s))
If ε < LB, add si to the pruned set
If ε > UB, add si to the result set
If LB ≤ ε ≤ UB, add si to the candidate set
si in candidate set are compared with queries using
dynamic programming
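The pruning rule above can be sketched directly in Python. This is an illustrative sketch, not the paper's implementation: it assumes the distances ED(s, vj) to a sequence's assigned references were pre-computed at index time and are passed in as lists, and the function name classify is made up.

```python
def classify(query_ref_dists, seq_ref_dists, eps):
    """Decide prune / result / candidate for one database sequence.

    query_ref_dists[j] = ED(q, vj) and seq_ref_dists[j] = ED(s, vj) for the
    references assigned to this sequence (pre-computed at index time).
    """
    lb = max(abs(dq - ds) for dq, ds in zip(query_ref_dists, seq_ref_dists))
    ub = min(dq + ds for dq, ds in zip(query_ref_dists, seq_ref_dists))
    if lb > eps:
        return "pruned"       # cannot possibly be within range eps
    if ub < eps:
        return "result"       # guaranteed within range, no DP needed
    return "candidate"        # verify with dynamic-programming edit distance
```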
Cost Analysis
Memory
Main memory: B bytes
Number of sequences: N
Number of references assigned: k
Average size of a sequence: z bytes
Sequence-reference mapping of sequence s and reference vi: [i, ED(s,
vi)]
B = <storage of reference> + <storage of pre-computed edit
distances>
B = 8kN + zk
Time
Query Set: Q
Time taken for one sequence comparison: t
Average size of candidate set: Cavg
Total query time
= <Query-Reference comparison> + <Candidate-Query comparison>
= tk|Q| + tCavg|Q|
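As a purely hypothetical illustration (the numbers below are made up, not from the paper), the two cost formulas can be evaluated directly:

```python
# Hypothetical values, only to illustrate the formulas above.
N, k, z = 20_000, 10, 500          # sequences, references per sequence, bytes per sequence
B = 8 * k * N + z * k              # memory: reference mappings + the references themselves
print(B)                           # 1,605,000 bytes, roughly 1.6 MB

t, Q, c_avg = 0.01, 100, 400       # seconds per comparison, #queries, avg candidate set size (Cavg)
total_query_time = t * k * Q + t * c_avg * Q
print(total_query_time)            # 10 + 400 = 410 seconds
```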
Selection of references
Omni method
Existing approach
References near the convex
hull of the database
Sequences near the hull
pruned by multiple,
redundant references
Sequences far away from
the hull cannot be pruned
Poor pruning rates
Proposed methods
Goal: choose references that represent all
parts of the database
Two novel strategies
Maximum Variance (MV)
Maximize the spread of database around the
references
Maximum Pruning (MP)
Optimizes pruning based on a set of sample queries
Maximum Variance
If q is close to reference v:
  Prune sequences far away from v
  Accept sequences close to v
If q is far away from v:
  Prune sequences close to v
Select references with high variance of
distances
Assume queries follow the same
distribution as the database
sequences
New reference prunes some part of
the database not pruned by existing
set of references
Maximum Variance
Measure closeness of sequences
L: the length of the longest sequence in S
μi: mean of distances of si
σi: variance of distances of si
w: a cut-off distance
w = L · perc, where 0 < perc < 1
sj is close to si if ED(si, sj) < (μi − w)
sj is far away from si if ED(si, sj) > (μi + w)
Choose perc = 0.15, derived from experiment
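A small sketch of the closeness test above, assuming the distances from si to the other sequences are already available; the function name close_and_far is illustrative.

```python
def close_and_far(dists_from_si, longest_len, perc=0.15):
    """Classify the other sequences as close to / far from si using the cut-off w.

    dists_from_si: dict mapping each other sequence sj to ED(si, sj).
    longest_len:   L, the length of the longest sequence in S.
    """
    mu = sum(dists_from_si.values()) / len(dists_from_si)
    w = longest_len * perc                       # w = L * perc, perc = 0.15 in the slides
    close = [sj for sj, d in dists_from_si.items() if d < mu - w]
    far = [sj for sj, d in dists_from_si.items() if d > mu + w]
    return close, far
```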
Maximum Variance
[Figure: MV selection pipeline – from the sequence database, take a random subset; calculate the distance variance σi of each sequence; sort the sequences by variance to form the candidate reference set; then remove sequences close to or far away from each chosen reference]
Maximum Variance
Algorithm
Time complexity
Step 2: O(N|S'|L²)
Step 4: O(N log N)
Step 5: O(mN)
Overall time: O(N|S'|L² + N log N + mN)
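Putting the pieces together, a rough sketch of the Maximum Variance selection loop might look as follows. It reuses the edit_distance sketch from the introduction and is an interpretation of the slides (random subset S', sort by variance, drop sequences close to or far from each chosen reference), not the authors' exact pseudocode.

```python
import random

def select_mv_references(db, m, subset_size, longest_len, perc=0.15):
    """Pick m references with high distance variance (Maximum Variance sketch)."""
    subset = random.sample(db, subset_size)          # S', a random subset of S
    w = longest_len * perc

    def stats(s):
        dists = [edit_distance(s, x) for x in subset]
        mu = sum(dists) / len(dists)
        var = sum((d - mu) ** 2 for d in dists) / len(dists)
        return mu, var

    # Rank candidate references by decreasing variance of their distances.
    ranked = sorted(db, key=lambda s: stats(s)[1], reverse=True)

    refs, remaining = [], list(ranked)
    while remaining and len(refs) < m:
        v = remaining.pop(0)
        refs.append(v)
        mu_v, _ = stats(v)
        # Drop candidates already close to or far away from the new reference,
        # so later references cover parts of the database v cannot prune.
        remaining = [s for s in remaining
                     if mu_v - w <= edit_distance(v, s) <= mu_v + w]
    return refs
```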
Maximum Pruning
Combinatorially tries to compute the best
reference sets for a given query distribution
Greedy approach (see the sketch after this list)
Start with an initial reference set
Consider each sequence in the database as a candidate
Iteratively, replace an existing reference with a new one
if pruning is improved
Gain – the amount of improvement in pruning
Stop if no further improvement
Sampling-based optimization
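A simplified sketch of the greedy replacement loop referenced above. It reuses the edit_distance and classify sketches from earlier, recomputes distances naively instead of using a pre-computed index, and omits the sampling-based optimizations; the helper prune_count is an assumption for illustration.

```python
def prune_count(refs, queries, db, eps):
    """Number of (query, sequence) pairs resolved without dynamic programming."""
    count = 0
    for q in queries:
        q_dists = [edit_distance(q, v) for v in refs]
        for s in db:
            s_dists = [edit_distance(s, v) for v in refs]
            if classify(q_dists, s_dists, eps) != "candidate":
                count += 1
    return count

def select_mp_references(db, queries, initial_refs, eps):
    """Greedy Maximum Pruning: swap in references while pruning improves."""
    refs = list(initial_refs)
    best = prune_count(refs, queries, db, eps)
    improved = True
    while improved:
        improved = False
        for cand in db:                          # every database sequence is a candidate
            for i in range(len(refs)):           # try replacing each current reference
                trial = refs[:i] + [cand] + refs[i + 1:]
                gain = prune_count(trial, queries, db, eps) - best
                if gain > 0:                     # keep the swap only if pruning improves
                    refs, best, improved = trial, best + gain, True
    return refs
```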
Maximum Pruning
[Figure: MP greedy replacement – each database sequence S1–S6 is tried as a candidate reference against the sample query set Q1–Q3; the gain G1–G6 of each candidate is computed, and the candidate with the maximum gain replaces a reference in the current reference set when it improves pruning]
Maximum Pruning
Algorithm
Time complexity
Sequence Distances: O(N²)
PRUNE(): O(N|Q|)
Step 2:
Number of sequences: O(N²)
Compute gain: O(m|Q|)
Time: O(N²m|Q|)
Overall worst case
N iterations
O(N³m|Q|)
Maximum Pruning
Sampling-Based Optimization
Estimation of gain
Reduce the number of sequences, use subset of database
Determine accuracy of gain estimate based on Central Limit Theorem
Iteratively randomly select a sequence to calculate the gain of a
candidate until desired accuracy is reached
Time complexity: O(N²fm|Q|), f is the sample size
Estimation of largest gain
Reduce the number of candidate references
Ensure the largest gain found is at least τG[e] with probability ψ, where 0 ≤ τ, ψ ≤ 1 and G[e] is the largest gain
Use Extreme Value Distribution to estimate G[e]
From the sample set of candidates, find mean and standard deviation
Best reference sequence has the expected gain of
where
Sample size:
Time complexity: O(Nfhm|Q|)
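One way to realize the CLT-based accuracy check is to sample database sequences until the normal-approximation confidence interval of the estimated gain is narrow enough. The stopping rule, tolerance, and parameter names below are assumptions for illustration; the paper's exact sample-size formula is not reproduced here.

```python
import math
import random

def estimate_gain(gain_on_sequence, db, z=1.96, tol=0.05, min_samples=30):
    """Monte-Carlo estimate of a candidate's average per-sequence gain.

    gain_on_sequence(s) -> gain contributed by database sequence s.
    Stops when the z * stderr half-width drops below tol (CLT-based stop).
    """
    pool = list(db)
    random.shuffle(pool)
    samples = []
    for s in pool:
        samples.append(gain_on_sequence(s))
        n = len(samples)
        if n >= min_samples:
            mean = sum(samples) / n
            var = sum((x - mean) ** 2 for x in samples) / (n - 1)
            if z * math.sqrt(var / n) < tol:
                break
    return sum(samples) / len(samples), len(samples)
```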
Mapping of references
Each sequence has its own set of best
references
Based on a sample query set Q
Assign references that prune the sequence for
most queries in Q
Avoid redundant references
Keep a reference only if it can prune a
total of more than |Q| sequences
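A sketch of the mapping step under the description above: for each sequence, count how many sample queries each reference prunes, keep the k best, then drop references whose total pruning over the whole database is too small. It reuses the edit_distance sketch from earlier; the helper names and the exact threshold interpretation (total prune count > |Q|) are assumptions.

```python
def map_references(db, refs, queries, eps, k):
    """Assign to each sequence the k references that prune it for most sample queries."""
    def prunes(v, s, q):
        d_q, d_s = edit_distance(q, v), edit_distance(s, v)
        return abs(d_q - d_s) > eps or d_q + d_s < eps   # resolved without DP

    mapping, totals = {}, {i: 0 for i in range(len(refs))}
    for s in db:
        counts = [(sum(prunes(v, s, q) for q in queries), i) for i, v in enumerate(refs)]
        best = sorted(counts, reverse=True)[:k]          # k references with highest prune counts
        mapping[s] = [i for _, i in best]
        for c, i in best:
            totals[i] += c

    # Keep a reference only if its total pruning over the database exceeds |Q|.
    kept = {i for i, total in totals.items() if total > len(queries)}
    return {s: [i for i in idxs if i in kept] for s, idxs in mapping.items()}
```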
Mapping of references
[Figure: assigning references to sequence S1 – for each reference V1–V4, count how many queries in the sample query set Q1–Q4 it prunes for S1 (counts C1–C4); the references with the highest prune counts, e.g. V2, form the reference set for S1]
Mapping of references
Algorithm
Time complexity
Distance computation:
O(tm|Q|), sequence
comparison takes t time
Pruning amount calculation:
O(m|Q|)
Overall time: O(Nmk|Q|)
Search Algorithm
Calculate edit distances between queries and every reference
Compute lower bound LB and upper bound UB by the triangle inequality
ε: query range
If LB > ε, prune the sequence
If UB < ε, accept the sequence
Otherwise, perform the actual sequence comparison
Memory complexity
[i, ED(s, vi)]: Sequence-Reference mapping
z: average sequence size
N: number of database sequences
m: number of references
k: number of references per database sequence
Overall memory: (8Nk + mz) bytes
Time complexity
Q: query set
L: average sequence length
Cm: average candidate set size for Q using m references
Overall time: O((m + Cm)|Q|L² + Nk|Q|)
Experimental evaluation
Size of reference set: 200
Datasets
Text: alphabet size of 36 and 8000 sequences of length 100
DNA: alphabet size of 4 and 20000 sequences
Protein: alphabet size of 20 and 4000 sequences of length 500
Comparisons of the selection strategies
MV-S, MV-D: Maximum Variance with same and different reference sets
MP-S, MP-D: Maximum Pruning with same and different reference sets
Comparisons with existing methods
Omni, FV, M-Tree, DBM-Tree, Slim-Tree, DF-Tree
Comparison of selection strategies
Impact of query range
Impact of the number of references per sequence
Comparisons with existing methods
Impact of query range
Number of sequence comparisons
IC: index construction time
ss: second
ms: minute
QR: query range
MP-D uses the sampling-based optimization
Comparison with existing methods
Impact of input queries
Number of sequence comparisons
Sample query set in reference selection: E. coli
Actual query set
HM: a butterfly species
MM: a mouse species
DR: a zebrafish species
QR: query range
Comparison with existing methods
Scalability with database size and sequence length
Conclusion
Similarity search over a large database
Edit distance as the similarity measure
Selection of references
Maximum variance
Maximize the spread of the database around the references
Maximum pruning
Optimize pruning based on a set of sample queries
Sampling-based optimization
Mapping of references
Each sequence has a different set of references
Experimental evaluation
Outperform existing strategies including Omni and frequency
vectors
MP-D, Maximum pruning with dynamic assignment of
reference sequences, performs the best