Folie 1 - id.ethz.ch

Download Report

Transcript Folie 1 - id.ethz.ch

Piers: An Efficient Model for
Similarity Search in DNA
Sequence Databases
Seminar „Algorithms for Database Systems“
FS 2008
Silvan Tschopp, [email protected]
May 24, 2008
Outline
1.
2.
3.
4.
5.
Background
Definitions and Problem Statement
The Hash-Based Pier Model
Performance Analysis
Conclusion
© Silvan Tschopp, ETH Zurich
2
May 24, 2008
Outline
1. Background
– Similarity Search
– Related Work
2.
3.
4.
5.
Definitions and Problem Statement
The Hash-Based Pier Model
Performance Analysis
Conclusion
© Silvan Tschopp, ETH Zurich
3
May 24, 2008
Similarity Search
∙ Search sequences in a DNA database which are
similar to a given query sequence.
∙ Goals:
–
–
–
–
Approximate or exact sequence matching
Efficient computation
Low memory consumption
High sensitive results
© Silvan Tschopp, ETH Zurich
4
May 24, 2008
Related Work
Several well known approaches:
∙ Dynamic programming
– Smith-Waterman
∙ Suffix trees
– OASIS
∙ Filtering and refinement
– BLAST
=> New method: hash table based model
© Silvan Tschopp, ETH Zurich
5
May 24, 2008
Outline
1. Background
2. Definitions and Problem Statement
– Piers
– Edit Distance
– Problem Statement
3. The Hash-Based Pier Model
4. Performance Analysis
5. Conclusion
© Silvan Tschopp, ETH Zurich
6
May 24, 2008
Piers
∙ In this model, a pier p is defined as a segment
of length lp extracted from a DNA sequence
∙ Example:
© Silvan Tschopp, ETH Zurich
7
May 24, 2008
Edit Distance
∙ Measurement to evaluate similarity of two
sequences
∙ The edit distance of two strings is the minimum
number of operations needed to transform one
string into another.
∙ Operations:
– Insertion
– Deletion
– Substitution
… of any single character
© Silvan Tschopp, ETH Zurich
8
May 24, 2008
Problem Statement
∙ What is given?
– Sequence database D = {S1,…,Sd}
– Query sequence Q
∙ What do we want?
– find all sequences Si D…
– …so that the edit distance between a subsequence
of Si and Q is below a chosen threshold θ.
– Approximate sequence matches
© Silvan Tschopp, ETH Zurich
9
May 24, 2008
Outline
1. Background
2. Definitions and Problem Statement
3. The Hash-Based Pier Model
– Pre-Processing Phase
• Pier Selection
• Hash Table Construction
– Query Processing
4. Complexity and Performance Analysis
5. Conclusion
© Silvan Tschopp, ETH Zurich
10
May 24, 2008
The Hash-Based Pier Model
∙ Two main phases:
– Pre-processing
– Query processing
∙ Goal:
– Extract a set of piers from the database
– For queries, only operate on this set
– Obtain highly sensitive results
© Silvan Tschopp, ETH Zurich
11
May 24, 2008
Pier Selection
∙ Random extraction of the piers from the
sequence database
– Scan the database once and pick certain piers
∙ Problem: Risk of loosing high similarity regions
– Define a minimum length lmin
(i.e. minimal length of a high similarity region)
– Ensure that each segment of length lmin contains
“enough” piers
© Silvan Tschopp, ETH Zurich
12
May 24, 2008
Pier Selection – Example
∙ Select 5 piers from sequence S
∙ Pier length lp = 5
© Silvan Tschopp, ETH Zurich
13
May 24, 2008
Hash Table
∙ Store piers into a hash table
– Put similar piers into same bucket
– Ensure efficient storage and access
– Small enough to keep in memory
∙ Hash function:
encode the prefix of length λ (λ ≤ lp) of each pier
∙ Simple encoding:
© Silvan Tschopp, ETH Zurich
14
May 24, 2008
Hash Table – Example I
∙ Encode previously selected piers
– set λ = 2
© Silvan Tschopp, ETH Zurich
15
May 24, 2008
Hash Table – Example II
∙ Insert the piers into corresponding buckets of
the hash table
© Silvan Tschopp, ETH Zurich
16
May 24, 2008
Outline
1. Background
2. Definitions and Problem Statement
3. The Hash-Based Pier Model
– Pre-Processing Phase
– Query Processing
• Neighbor Enumeration
• Candidate Search
4. Performance Analysis
5. Conclusion
© Silvan Tschopp, ETH Zurich
17
May 24, 2008
Query Processing
∙ Algorithm consists of three steps:
1. Split the query sequence Q into query patterns qi
of length lp
2. Search for candidate piers among the hashed
piers
3. Further processing of the candidates to build the
final result
=> Focus on second step
© Silvan Tschopp, ETH Zurich
18
May 24, 2008
Neighbor Enumeration
∙ Encode λ-prefix of each query pattern qi
=> hash key hi
∙ Expand search to other buckets
∙ Enumerate neighbors ngbrj of hi:
– Allow edit distance ζ
– Find ngbrj, such that edit(ngbrj, hi) ≤ ζ
© Silvan Tschopp, ETH Zurich
19
May 24, 2008
Neighbor Enumeration – Example
∙ Given query sequence, split into q1, q2
∙ Hash keys:
∙ Neighbor enumeration (ζ = 1)
© Silvan Tschopp, ETH Zurich
20
May 24, 2008
Candidate Search
∙ Access each bucket hash_table[ngbrj] and
retrieve all the piers stored in it
∙ Align every retrieved pier pi against the
queries qi
– Allow edit distance ε
– If edit(pi, qi) ≤ ε then
add pi to the candidate set
∙ Further process the candidates and build final
result
© Silvan Tschopp, ETH Zurich
21
May 24, 2008
Candidate Search – Example
• Candidate search with ε = 2
• Query pattern
∙ Neighbors of h1
Ø
Ø
p
1
© Silvan Tschopp, ETH Zurich
Ø
Ø
p
q1: TCG
||!
|||
TCA
p13: TCG
Ø
TCG
!!!
p5: CTA
ε=2
3
22
May 24, 2008
Final Alignment – Example
∙ Align the query sequence to each candidate
– Edit distance threshold θ = 3
∙ Candidate
© Silvan Tschopp, ETH Zurich
23
May 24, 2008
Outline
1.
2.
3.
4.
Background
Definitions and Problem Statement
The Hash-Based Pier Model
Performance Analysis
– Experimental Performance Results
(in comparison to BLAST)
5. Conclusion
© Silvan Tschopp, ETH Zurich
24
May 24, 2008
Experiments I
∙ Performance measurements:
– Hash-based pier model vs. BLAST (latest version)
∙ Different experiment settings:
– Database sizes from 4.86MB – 3.1GB
– Query lengths from 100 – 2000 characters
∙ Hardware (2004):
– Sunfire 4800 with 12 Ultrasparc3 CPU (750MHz)
– 16GB free memory, 70GB free hard disk
© Silvan Tschopp, ETH Zurich
25
May 24, 2008
Experiments II
∙ Measure the pre-processing time with varying
database sizes
© Silvan Tschopp, ETH Zurich
26
May 24, 2008
Experiments III
∙ Fixed database (Patnt: 702.1MB), varying query
lengths
© Silvan Tschopp, ETH Zurich
27
May 24, 2008
Experiments IV
∙ Fixed query length (|Q| = 1500), varying
database sizes
© Silvan Tschopp, ETH Zurich
28
May 24, 2008
Outline
1.
2.
3.
4.
5.
Background
Definitions and Problem Statement
The Hash-Based Pier Model
Performance Analysis
Conclusion
© Silvan Tschopp, ETH Zurich
29
May 24, 2008
Conclusion
∙ The hash-based pier model…
– provides an efficient and effective way of
searching large DNA sequence databases.
– is significantly faster and consumes less memory
than other well known methods.
– scales nicely on database size and query length.
© Silvan Tschopp, ETH Zurich
30
May 24, 2008
Questions
© Silvan Tschopp, ETH Zurich
31
May 24, 2008