Combinational Pattern Matching

Download Report

Transcript Combinational Pattern Matching

Combinational Pattern Matching
Shashank Kadaveru
Introduction
In Motif Finding problem, no particular
pattern is given to search for. We infer it
from the sample.
 In Combinational Pattern Matching, we
look for exact or appropriate
occurrences of given patterns in a long
text.

Pattern Matching
Might seem simpler than Motif finding
because we know what we are looking
for.
 But due to the large size of genomes, in
practice it is difficult.
 Here, we develop number of ways to
make pattern matching in a long string.

Repeat Finding
Genetic diseases are associated with
deletions, duplications and
rearrangements of long chromosomal
regions.
 For example, DiGeorge syndrome which
results in impaired immune system and
heart defects, is associated with a 3MB
deletion on human chromosome 22.
 The deletion removes important genes
which leads to a disease.

Repeat Finding
Repeats in DNA hold many evolutionary
secrets.
 A striking and still unexplained
phenomenon is the large number of
repeats in many genomes.
 For example, repeats account for about
50% of the human genome.

Duplicate Removal Problem
Find all unique entries in a list of integers.
 For example, the list (8, 1, 5, 1, 0, 4, 5, 10,
1) has the elements 1 and 5 repeated
multiple times, so the resulting list would
be (8, 1, 5, 0, 4, 10)

Duplicate Removal Algorithm










DUPLICATEREMOVAL(a, n)
1 m <- largest element of a
2 for i <- 0 to m
3 bi <- 0
4 for i <- 0 to n
5 bai <- 1
6 for i <- 0 to m
7 if bi = 1
8 output i
9 return
Duplicate Removal Algorithm
This approach requires O(n log n) time to
sort the list of size n.
 Difficult to use when the entries are
arbitrarily large.
 Creating a list with so many values is not
efficient and may not even be possible.

Duplicate Removal Algorithm
If we can map any number of entries to
some number between 1 and 1000 then
we can use binning strategy efficiently.
 The function used for mapping is called
Hash Function.
 The second list is called Hash Table.

Hash Function
Should be easy to calculate and integer
valued.
 A simple hash function takes h(x) to
integer remainder of x/|b|.
 We want different integers in the array to
map to different integers in b.
 But this isn’t possible when the size of
array a is larger than b.

Hash Function

Example:
◦ If h(x) = x/1000
◦ 7, 1007, 2007 collide and fall into the same
bin.
This is called a collision.
 To deal with collisions, we use Chaining.

Chaining
Elements collapsing to
the same bin are
organized into a linked
list.
 Also, elements falling
into the same bins
reveal duplicate
elements and hence
solves the duplicate
removal problem.

Chaining
This concept can apply to many other
problems.
 Ex: to find exact repeats of l-mer in a
genome.
 If we have to find exact repeats for large
values like l = 40, then we need not
construct 440 entries.
 We can design a suitable hashing function
and use a table of much smaller size.

Exact Pattern Matching
If we have a pattern p = p1 p2 . . pn and a
string t = t1 t2 . . tm
 Pattern matching problem is used to find
any or all occurrences of pattern p in the
text t.
 Brute force algorithm is used for solving
Pattern Matching problem.

Exact Pattern Matching
PATTERNMATCHING(p, t)
1 n <- length of pattern p
2 m <- length of text t
3 for i <- 1 to m− n + 1
4
if ti = p
5
output i
Exact Pattern Matching
Worst case running time is O (mn)
 Such a worst-case scenario can be seen
by searching for p = AAAAT in t =AAAA. .
.AAAAA.
 Peter Weiner invented a data structure
called Suffix Tree that solves Pattern
Matching problem in linear time O(m) for
any text or pattern.
 To understand this, we need to learn
about Keyword Trees.

Keyword Trees

Multiple Pattern Matching Problem:
◦ Given a set of patterns and a text, find all
occurrences of any of the patterns in the text.
◦ For example, if t = ATGGTCGGT and p1 =
GGT, p2 = GGG, p3 = ATG, and p4 = CG, then
the result of an algorithm that solves the
Multiple Pattern Matching problem would be
positions 1, 3, 6, and 7.
Keyword Trees
Multiple Pattern Matching problem with k
patterns can be reduced to k individual
Pattern Matching problems and solved in
O(knm) time.
 However, there exists an even faster way
to solve this problem in O(n + m) time
where n is the total length of patterns p1,
p 2, . . . , p k

Keyword Trees

The key- word
tree for the set of
patterns apple,
apropos, banana,
bandana, and
orange is shown in
figure.
Keyword Trees
The keyword tree has at most N vertices
where N is the total length of all patterns,
but may have fewer.
 One can construct the keyword tree in
O(N) time by progressively extending the
keyword tree.

Suffix Trees
Suffix trees allow one to preprocess a
text in such a way that for any pattern of
length n, one can answer whether or not
it occurs in the text, using only O(n) time.
 It doesn’t depend on the length of the
text.

Suffix Trees

Patterns: proud, perfect, muggle, ugly, rivet

t = “mr and mrs dursley of number 4 privet
drive were proud to say that they were
perfectly normal thank you very much”
Suffix Trees
Suffix Trees
Suffix Trees
A suffix tree for a text of length m can be
constructed in O(m) time, which leads
immediately to a linear O(n + m)
algorithm for the Pattern Matching
problem
 Construct the suffix tree for t, in O(m)
time, and then check whether p occurs in
the tree, which requires O(n) time.

Suffix Trees

The suffix tree for a text t = t1 ...tm is a
rooted labeled tree with m leaves
(numbered from 1 to m) satisfying the
following conditions
◦ Each edge is labeled with a substring of the text
◦ Each internal vertex (except possibly the root)
has atleast 2 children,
◦ Any two edges out of the same vertex start with
a different letter,
◦ Every suffix of text t is spelled out on a path from
the root to some leaf.
Suffix Trees
Suffix Trees

SUFFIXTREEPATTERNMATCHING(p, t)
◦ Build the suffix tree for text t
◦ Thread pattern p through the suffix tree.
◦ if threading is complete
 output positions of every p-matching leaf in the
tree
◦ else
 output “pattern does not appear anywhere in the
text”
Heuristic Similarity Search
Algorithms
The suffix tree algorithm above, while fast,
can only find exact, rather than
approximate, occurrences of a gene in a
database.
 When we are trying to find an
approximate match to a gene of length
103 in a database of size 1010, the
programming algorithms (like the SmithWaterman local alignment algorithm) may
be too slow.

Approximate Pattern Matching
Finds all approximate occurrences of a
pattern in a text.
 The naive brute force algorithm for
approximate pattern matching runs in
O(nm) time.
 The following algorithm will output all
locations in t where p occurs with no
more than k mismatches.

Approximate Pattern Matching




APPROXIMATEPATTERNMATCHING(p, t,
k)
n ← length of pattern p
m ← length of text t
for i←1tom−n+1
◦ dist ← 0
◦ for j←1to n
 if ti+j−1 != pj
dist ← dist + 1

if dist ≤ k
◦ output i
Approximate Pattern Matching
In 1985 Gadi Landau and Udi Vishkin
found an algorithm for approximate string
matching with O(km) worst-case running
time.
 Although this algorithm yields the best
known worst-case performance, it is not
necessarily the best in practice.

Heuristic Similarity Search
Algorithms
Many heuristics for fast database search in
molecular biology use the same filtration
idea.
 In 1973 Donald Knuth suggested a
method for pattern matching with one
mismatch based on the observation that
strings differing by a single mismatch must
match exactly in either the first or second
half.

Heuristic Similarity Search
Algorithms
In 1985 the idea of filtration in
computational molecular biology was
used by David Lipman and Bill Pearson, in
their FASTA algorithm.
 It was further developed in BLAST, now
the dominant database search tool in
molecular biology.

Heuristic Similarity Search
Algorithms
Biologists frequently depict similarities
between two sequences in the form of
dot matrices.
 A dot matrix is simply a matrix with each
entry either 0 or 1, where a 1 at position
(i, j) indicates some similarity between the
ith position of the first sequence and the
jth position of the second sequence.

Heuristic Similarity Search
Algorithms
BLAST: Comparing a Sequence
against a Database
Using shared l-mers for finding similarities, as
FASTA does, has some disadvantages.
 For example, two proteins can have different
amino acid sequences but still be biologically
similar.
 A common construct in many heuristic
similarity search algorithms, including BLAST,
is that of a scoring matrix similar to the
scoring matrices introduced in chapter 6.
 These scoring matrices reveal similarities
between proteins even if they do not share a
single l-mer.

BLAST: Comparing a Sequence
against a Database
BLAST, uses scoring matrices to improve
the efficiency of filtration and to
introduce more accurate rules for
locating potential matches.
 Another powerful feature of BLAST is the
use of Altschul-Dembo-Karlin statistics
for estimating the statistical significance of
found matches.

BLAST: Comparing a Sequence
against a Database



A segment pair is just a pair of l-mers, one
from each sequence.
The maximal segment pair is a segment pair
with the best score over all segment pairs in
the two sequences.
A segment pair is locally maximal if its score
cannot be improved either by extending or
by shortening both segments.
BLAST: Comparing a Sequence
against a Database
BLAST attempts to find all locally maximal
segment pairs in the query and database
sequences with scores above some set
threshold.
 A fast algorithm for finding such l-mers is
the key ingredient of BLAST.

BLAST: Comparing a Sequence
against a Database



An important observation is that if the
threshold is high enough, then the set of all lmers that have scores above a threshold is
not too large.
In this case the database can be searched for
exact occurrences of the strings from this
set using Multiple Pattern Matching
Algorithm.
After the potential matches are located,
BLAST attempts to extend them to see
whether the resulting score is above the
threshold.
BLAST: Comparing a Sequence
against a Database
In recent years BLAST has been further
improved by allowing insertions and
deletions and combining matches on the
same and close diagonals.
 BLAST reports matches to sequences
that either have one segment score above
the threshold or that have several closely
located segment pairs that are statistically
significant when combined.

BLAST: Comparing a Sequence
against a Database

According to the Altschul-Dembo-Karlin
statistics, the num- ber of matches with scores
above θ is approximately Poisson-distributed,
with mean

where m and n are the lengths of compared
sequences, and K is a constant. Parameter λ is a
positive root of the equation
BLAST: Comparing a Sequence
against a Database
where px and py are frequencies of amino
acids x and y from the twenty- letter
alphabet A and δ is the scoring matrix.
 The probability that there is a match of
score greater that θ between two
“random” sequences of length n and m is
computed as 1 − eE(θ).

Any Questions?
Thank You