Indexing DNA Sequences Using q

Download Report

Transcript Indexing DNA Sequences Using q

Indexing DNA
Sequences Using
q-Grams
Adriano Galati & Bram Raats
Indexing DNA Sequences Using
q-Grams



Method for indexing the DNA sequences efficiently
based on q-grams to facilitate similarity search in a DNA
database
To sidestep the linear scan of the entire database
Proposed:


Hash table
C-trees
based on the q-grams
 These data structures allow quick detection of
sequences
Introduction



Two sequences share a certain number of
q-grams if ed is  a certain threshold
Since there are 4 letters  4q
combinations
Two level index to prune data sequences
Introduction(2)
Two level index

Two level index to prune data sequences:
1.
First level


2.
Clusters of similar q-grams in DNA are generated
A typical Hash table is built in the segments with respect to
the qClusters
Second level


The segments are transformed into the c-signature based
on their q-grams
A new index called the c-signature trees is proposed to
organize the c-signatures of all segments of a DNA
sequence for search efficiency
Edit distance


To process approximate matching, one common
and simple approximation metric is called edit
distance
Definition:
 The
edit distance between two sequences is defined
as the minimum number of edit operations (i.e.
insertions, deletions and substitutions) of single
characters needed to transform the first string into the
second
Preliminaries

Intuition:
 Two
sequences would have a large number of
q-grams in common when the ed between
them is within a certain number
Given a sequence S, its q-grams are
obtained by sliding a window of length q
over the characters of S
 |S| - q + 1 q-grams for a sequence S

Question (Bogdan)

1. I have noticed that the segments of the database text
that are considered in this method are disjoint (see page
4, Introduction). I understand that for each segment all
the consecutive, non-disjoint, q-grams are taken into
consideration when computing the q-cluster and the csignature of the segment. However, I am a bit puzzled
that at the border between two adjacent segments
nothing is done, which means that (q-1) q-grams are
disregarded at each border. Since each segment
contains w-q+1 q-grams, it means that overall a ratio of
approximately (q-1)/(w-q+1) of all q-grams are
disregarded (if we ignore the difference of 1 between the
nr. of segments and the nr. of borders between adjacent
segments). For common values of q=3 and w=30, this
means about 7% of the q-grams. Do you see a solution
for overcoming this problem?
Answer (Bogdan)
Effort to improve the efficiency discarding
the regions (filtering) with low sequence
similarity
 Approximate sequence matching is
preferred to exact matching in genomic
database due to evolutionary mutation in
the genomic sequences and the presence
of noise data in a real sequence database

q-gram Signature


  { A, C , G , T }
q
  4 q kinds of q-grams

All the possible q-grams are denoted as   {r0 , r1 ,..., r4q1 }

The q-gram signature is a bitmap with 4q bits where i-th
bit corresponds to the presence or absence of ri .
For a sequence S, the i-th bit is set as ‘ 1’ if ri   occurs
at least once in sequence S, else ‘ 0’

c-signature

sig 1 ( S )  (a0 , a1 ,..., an1 ) q-gram signature where n  4 q
c
sig
( S )  (u0 , u1 ,..., uk 1 ) where k  n / c  and

ui   j ic
( i 1) c 1

aj  0
aj
when n  j  ck
Example c-signature

P=“ACGGTACT”

q-gram signature is (01 00 00 11 00 11 10 00) with
42 dimensions when q=2

sig
c2
( P)  (10020210)
Hash table

Any DNA segment s can be encoded into a λ-bit
(bitmap e  (e1 , e2 ,..., e ) ) by the coding function:
coding (s)  i 1 2i 1 ei


Hash table with size 2λ respect to qClusters
qClusters  {qCluster1 , qCluster2 ,..., qCluster }

if ( qgram  s  qgram  qClusteri )
then ei  1
else ei  0
Question (Jacob)

I can't get my hands on the c-Trees
(mentioned first on page 9). Could you
please explain how such a tree is built up,
because I can't figure it out.
c-Trees




Group of rooted dynamic trees built for indexing
c-signature
Height l set by user
c
sig
( S )  (u0 , u1 ,..., uk 1 )    k / l  trees T0 ,..., T 1
Given
Each path from the root to a leaf in Ti
corresponds to the c-signature string
sig ic (s)  (vil , vil 1 ,..., v(i 1)l 1 )

 internal node  T there are  c  1 children
Example c-Trees

Consider the five DNA segments:
s0 " ACGGT" s1 " CTTAG" s2 " ACGTT"
s3 "TAAGC" s4 " GACGT"

q  2  c  2  sig 2 ( s0 )  (1001 0200)
sig 2 ( s1 )  (0101 0011)
sig 2 ( s2 )  (1001 0101)
sig 2 ( s3 )  (1100 1010)
sig 2 ( s4 )  (1001 1100)

If l  4
4q
we get
 2 trees
cl
Example c-Trees(2)
Query Processing



HT and c-T are built on the DNA segments
Query sequence Q is also partitioned in Q  w  1
sliding query patterns q1 ,..., q Q w1
Two level filtering
 FLF:
Hash Table Based Similarity Search
 SLF: c-Trees Based Similarity Search
Hash Table Based Similarity
Search




Query pattern qi encoded to a hash key hi (λ bit)
ngbr of hi are enumerated
ngbr are encoded in λ bit from the segments
which are within a ed from qi
Once engbr  qi is enumerated, the segments in
the bucket HT [engbr ] will be retrieved as
candidates and stored into Cht
c-Trees Based Similarity Search




Candidates Cht will be further verified by c-trees
c-signature sig c (q) of query q is divided into 
c
sig
0i 
c-signature strings
i ( q ),
The algorithm retrieves the segment s which
satisfies the range constraint ed (q, s)  
During query processing, for each leaf in the tree
T1 are computed
Space and Time complexity






O
(
2
) for the table head
Space complexity HT is
( D / w) for the bucket of the table
Thus the total space complexity for the Hash

O
(
2
 D / w)
structure is
Time complexity for query O( q )
q
Space complexity of each tree is O(4 /(c))
Question (Bogdan)

I have trouble understanding the graphic in Fig.
2(a). My intuition would tell me that the more
common q-grams exist in the 2 sequences, the
higher the probability of finding a high score
alignment between them. However, the figure
seems to show the opposite: as the nr. of qgrams increases, the probability decreases. I've
obviously got something mixed up here, but I
can't figure out what it is. Could you please
explain?
The Sensitivity vs The Number of
Common q-grams
Answer (Bogdan)
Sensitivity can be measured by the
probability that a high score alignment is
found by the algorithm
 The graph starts with probability almost 1
when we have only 1 common q-gram and
if we increase the number of q-grams, the
probability (sensitivity) of matching the
alignment will surely decrease
