Transcript rec09

Burrows-Wheeler
Transform
the not-so-gory details
Sources
Ferragina and Manzini, Opportunistic data
structures with applications, FOCS 00
Langmead et al. Ultrafast and memory-efficient
alignment of short DNA sequences to the human
genome, Genome Biology 09
CG
1 © Ron Shamir ‘11
Burrows-Wheeler
Transform:
Cyclic shifts sorted
.the_next_text_that_i_index
_i_index.the_next_text_that
_index.the_next_text_that_i
_next_text_that_i_index.the
_text_that_i_index.the_next
_that_i_index.the_next_text
at_i_index.the_next_text_th
dex.the_next_text_that_i_in
e_next_text_that_i_index.th
ex.the_next_text_that_i_ind
ext_text_that_i_index.the_n
ext_that_i_index.the_next_t
hat_i_index.the_next_text_t
he_next_text_that_i_index.t
i_index.the_next_text_that_
index.the_next_text_that_i_
ndex.the_next_text_that_i_i
next_text_that_i_index.the_
t_i_index.the_next_text_tha
t_text_that_i_index.the_nex
t_that_i_index.the_next_tex
text_that_i_index.the_next_
that_i_index.the_next_text_
the_next_text_that_i_index.
x.the_next_text_that_i_inde
xt_text_that_i_index.the_ne
xt_that_i_index.the_next_te
Suffix Arrays
• T[1…u] text
• Suffix array A of T: sequence
lexicographically ordered suffixes of
T represented by pointers to their
starting points
• T=ababc  A=[1,3,2,4,5]
• Requires 4u bytes in practice
3
Burrows-Wheeler Transform
Forward BWT: TL
a.T  T# (# unique smallest char)
b. Form conceptual matrix M: all cyclic
shifts of T#, sorted lexicographically
c.Transformed text L: the last column
of T. L=BWT(T)
Notation: F : first column in M
Close connection to suffix arrays!
4
Key lemmas
• In the i-th row of M, L(i) precedes F(i)
in the original text: T=….L(i)F(i)….
• The i-th occurrence of char X in L
corresponds to the same text
character as the i-th occurrence of X
in F
“last-first (LF) mapping”
5
LF mapping
.
_
_
_
_
_
a
d
e
e
e
e
h
h
i
i
n
n
t
t
t
t
t
t
x
x
x
x
t
i
e
t
t
h
n
h
d
n
t
t
t
_
_
i
_
a
x
x
_
_
.
e
e
e
Inverting BWT
Backward BWT:
a. Compute the array C[1,…|∑|]:
C(c) is the no. of chars {#,1,.. c-1} in T

C(c)+1 is the position of 1st occurrence of c in F (if any)
b. Build LF mapping LF{1,..,u+1}:
LF[i]=C(L[i])+ri where ri = no of occurrences
of char L[i] in the prefix L[1,i]
c. Reconstruct T backward:
• s=1, T(u)=L[1]
• For i=u-1,…1 do s=LF[s] and T[i]=L[s]
8
Backward BWT (1)
Compute the array C[1,…|∑|]:
C(c) is the no. of chars {#,1,.. c-1} in T
Occ(c,r) = no. of occurrences of c in BWT up
to but not including the element at index
r
Alg Stepleft(r):
1. Return C[BWT[r]]+1+Occ[BWT[r],r]
Backward BWT (2)
a. r1; T” “
b. While BWT[r]$ do
c.
T prepend BWT[r] to r
d.
rstepleft[r]
e. Return T
Searching the Index
BWT(acaacg$)
Search(aac, gc$aaac)
• Search last nucleotide and expand
backwards
• Maintain interval of possible matches
11
Alg Exactmatch ( P[1,p] )
1. cP[p]; spC[c]+1; epC[c+1]+1;
ip-1
2. While sp<ep and i1 do
3.
c  P[i]
4.
sp  C[c] + Occ(c,sp) + 1
5.
ep  C[c] + Occ(c,ep) + 1
6.
i  i-1
7. Return sp, ep
Computing Occ(c,r)
1.
Precompute Occ(c,r) naively for each c,r
Time O(||u), but space O(||ulogu), for text
of length u
2. Precompute Occ(c,r) only for r=j*p
When Occ(c,r) is needed – if r not
available go back to the previous multiple
of p and add
Time O(||u) to preprocess, O(p) to compute
Space O((||ulogu)/p)
Computing the Text location of
an exact match
• Problem: Exactmatch P gives row(s) in
M that begin with the query – need to
find their offset - location in the text
• Solution:
– mark some rows with pre-calculated
offsets.
– In search, if row is not marked, do
Stepleft k times until finding a marked
row with offset o; report o+k
• Time/space tradeoff
Burrows & Wheeler
• David Wheeler (1927-2004)
–
–
–
–
Educated in Math, Cambridge
First PhD in the world in CS (51)
Invented the subroutine
Cambridge prof. active in cryptography
• Michael Burrows (~1963- )
– PhD Cambridge
– Worked in DEC, Microsoft, now Google
– Co-developed the AltaVista search engine
• Burrows M, Wheeler D (1994), A block
sorting lossless data compression
algorithm, Technical Report 124, Digital
Equipment Corporation
Motif Discovery in Protein
Sequences using Messy de
Bruijn Graph
Rupali Patwardhan
Advisors:
Dr. Mehmet Dalkilic
Dr. Haixu Tang
April 27, 2005
Rupali Patwardhan, Capstone
Presentation
16
What is a motif ?
• A repeating pattern
• VSKLIPKNRLMISTEWRSLGQQSPGWMHYMP
• VMLPKDIAKLVPKTHLMSTEWRNRLGVQQSQG
• SGVPRLLTASREWRNLGEPFIDQIHYSPRYAD
• YRHVMLPKAMSTEWRSLGLKNPETGTLRILQE
• GLGITQSLGWSREWRHTLGEPHILLFKREKDYQ
17
April 27, 2005
Rupali Patwardhan,
Capstone Presentation
Why are motifs interesting
?
• They represent regions that have been
conserved through evolution
• So those regions are likely to be important
for the function of the protein (e.g. an
active site)
• Motifs can be used to classify proteins into
families based on their functions, or
predict the function of a new protein
18
April 27, 2005
Rupali Patwardhan,
Capstone Presentation
What is a de Bruijn Graph?
A graph whose nodes are
subsequences of same length (ltuples) and whose edges indicate the
subsequences of the two connected
nodes overlap
E.g. An edge ACAT  CATS
represents the sequence “ACATS”
19
April 27, 2005
Rupali Patwardhan,
Capstone Presentation
ABCDEFG
ABCD BCDE CDEF DEFG
20
April 27, 2005
Rupali Patwardhan,
Capstone Presentation
Applying this to Identify
Repeating Subsequences
• If we have a set of sequences, we can go on adding
corresponding nodes and edges to our de Bruijn
graph.
• If any sub-sequence is repeated, the corresponding
edge will already be present in that graph.
• So we just increment the weight of that edge.
• Eventually the edges corresponding to highly
repeated sequences will have higher weights.
• Now we can find the motif by simply following the
graph along these edges with weights above a
specified threshold .
21
April 27, 2005
Rupali Patwardhan,
Capstone Presentation
1. PAKARCDEKD
PAKA1AKAR1 KARC1ARCD
1
DEKD1CDEK 1RCDE
22
April 27, 2005
Rupali Patwardhan,
Capstone Presentation
1. PAKARCDEKD
2. NARCDEKHKH
NARC
1
PAKA1AKAR1 KARC1ARCD
2
DEKD1CDEK 2RCDE
1
DEKH1EKHK1 KHKH
23
April 27, 2005
Rupali Patwardhan,
Capstone Presentation
1. PAKARCDEKD
2. NARCDEKHKH
NARC
1
PAKA1AKAR1 KARC1ARCD
2
DEKD1CDEK 2RCDE
1
DEKH1EKHK1 KHKH
24
April 27, 2005
Rupali Patwardhan,
Capstone Presentation
Making them Messy
• In the context of protein sequences, some amino
acid residues can be substituted by some others
without affecting the function of the protein.
• So a sequence could be considered 'similar' to an
edge even though its not identical.
• Similarity between amino acid residues is
determined using standard scoring matrices, such as
BLOSUM62.
• In that case, we increment weights of all edges that
represent sequences that are ‘similar’ to the one in
question.
25
April 27, 2005
Rupali Patwardhan,
Capstone Presentation
Example
• Consider the same 2 sequences as before,
but with K replaced by R in one of them.
• PAKARCDERD
• NARCDEKHKH
• As per BLOSUM62, K  R substitution has
a positive substitution score.
26
April 27, 2005
Rupali Patwardhan,
Capstone Presentation
 PAKARCDERD
 NARCDEKHKH
NARC
1
PAKA1AKAR1 KARC1ARCD
2
DERD1CDER1 RCDE
1
CDEK1DEKH1 EKHK1 KHKH
27
April 27, 2005
Rupali Patwardhan,
Capstone Presentation
 PAKARCDERD
 NARCDEKHKH
NARC
1
PAKA1AKAR1 KARC1ARCD
2
1.4RCDE
DERD1CDER
1.4
CDEK1DEKH1 EKHK1 KHKH
28
April 27, 2005
Rupali Patwardhan,
Capstone Presentation
Adjusting the weights to
account for messiness
• Suppose edge A is under consideration,
and edges B and C originating from the
same node as A are similar to A.
WA’  WA + WB*s(A,B) + WC*s(A,C)
29
April 27, 2005
Rupali Patwardhan,
Capstone Presentation
Limitation of this Approach
• The motif should have at least a few
continuous amino acid residues
• So the method may fail if the motif
consists of alternate residues
• E.g. AxAxCxDxAxGxC (x could be any
residue) or AxCDxGxRGxC, since these
motifs would not lead to high-weight edges
in the de Bruijn graph
• The problem is due to the need for
overlaps, which is inherent nature of de
Bruijn Graphs
30
April 27, 2005
Rupali Patwardhan,
Capstone Presentation
Gapped Version
• For each node, we also create nodes
obtained by applying a gap mask (or
“Dont care” mask) on that node
• We currently restrict the maximal
number of “Dont cares” in a node to 2
• There are 10 such masks
31
April 27, 2005
Rupali Patwardhan,
Capstone Presentation
Gapped Version
• Let ‘1’ represent a conserved amino
acid and ‘0’ represent a gap or “Don’t
care”
• Then the 10 masks can be represented
as:
1111, 0111, 1110, 1011, 1101, 1100, 0011,
1001, 0110, 1010, 0101
32
April 27, 2005
Rupali Patwardhan,
Capstone Presentation
Masking Example
• If ANCD is the node that we are
applying the mask to
• ANCD * 1001 = AxxD
• ANCD * 1101 = ANxD
• ANCD * 1011 = AxCD
33
April 27, 2005
Rupali Patwardhan,
Capstone Presentation
1. ….ARCDM…
2. ….ANCDE…
3. ….ASCDT…
ARCD1RCDM
ANCD1NCDE
ASCD1SCDT
34
April 27, 2005
Rupali Patwardhan,
Capstone Presentation
1. ….ARCDM…
2. ….ANCDE…
3. ….ASCDT…
AxCD1 xCDx
AxCD1 xCDx
AxCD1 xCDx
35
April 27, 2005
Rupali Patwardhan,
Capstone Presentation
1. ….ARCDM…
2. ….ANCDE…
3. ….ASCDT…
AxCD xCDx
AxCD3xCDx
AxCD xCDx
36
April 27, 2005
Rupali Patwardhan,
Capstone Presentation
ANCD NCDE
AxCD NxDE
ANxD NCxE
ANxx NCxx
xxCD xxDE
AxxD NxxE
xNCx xCDx
AxCx NxDx
xNxD xCxE
.
.
.
.
.
.
AxCDxxGH
CDEF
CxEF
CDxF
CDxx
xxEF
CxxF
xDEx
CxEx
xDxF
.
.
.
DEFG
DxFG
DExG
DExx
xxFG
DxxG
xEFx
DxFx
xExG
.
.
.
EFGH
ExGH
EFxH
EFxx
xxGH
ExxH
xFGx
ExGx
xFxH
.
.
.
Protein binding microarrays
• Measure TF binding to many sequences.
38
(berger et al. 06)
Sequence design using de
Bruijn graphs
• Complete de Bruijn graphs represent
all k-mers (by edges).
• The graph is strongly connected, and
each vertex is degree-balanced.
• An Euler tour traverses each edge
exactly once.
• The resulting tour corresponds to a de
Bruijn sequence.
39