Exhaustive search - University of Illinois at Urbana
Download
Report
Transcript Exhaustive search - University of Illinois at Urbana
Exhaustive search
CS 466
Saurabh Sinha
A fruitfly has a bacterial infection
• When attacked by bacteria, the fruitfly’s immune
system kicks in
• Many genes that were lying “dormant” now producing
their proteins, to fight the infection. (Some otherwise
active genes may now become inactive.)
• Which genes are these ?
Looking for differentially expressed genes
• Measure the activity level of all genes in normal fly
and in infected fly
• Find genes whose activity levels are significantly
different between the two conditions
• How to measure gene activity level ?
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
DNA Arrays--Technical Foundations
• In a single experiment, the expression levels of hundreds or
thousands genes within a cell are measured
• An array works by exploiting the ability of a given mRNA
molecule to hybridize to the DNA template.
• An “array” containing many short DNA fragments called
“probes”. Each probe’s sequence is part of some gene, and
therefore can bind to mRNA of that gene
• The amount of mRNA bound to each probe on the array reflects
the amount of mRNA of that gene in the cell
May, 11, 2004
http://www.ncbi.nih.gov/About/primer/microarrays.html
4
An Introduction to Bioinformatics Algorithms
www.bioalgorithms.info
DNA Microarray
Millions of DNA strands
build up on each location.
May, 11, 2004
Tagged probes become hybridized to the DNA chip’s
microarray.
http://www.affymetrix.com/corporate/media/image_library/image_library_1.affx
5
An experiment on a microarray
In this schematic:
GREEN represents Control DNA
RED represents Sample DNA
YELLOW represents a combination of Control and Sample DNA
BLACK represents areas where neither the Control nor Sample DNA
Each color in an array represents either healthy (control) or diseased (sample) tissue.
The location and intensity of a color tell us whether the gene is present in
the control and/or sample DNA.
May 11,2004
http://www.ncbi.nih.gov/About/primer/microarrays.html
10
Differentially expressed genes
• Find a set of genes differentially expressed in the infected fly
• These are perhaps the ones orchestrating the immune response
• Look at promoters of these genes
• Find that the substring TCGGGGATTTCC occurs often (modulo
minor spelling mistakes) in these promoters
Regulatory motif
• Someone tells you that TCGGGGATTTCC is the preferred
binding site of the “NFkB” transcription factor
• Infer that NFkB is turning on the immunity !
• What if you did not know that NFkB binds TCGGGGATTTCC ?
• Could you have just gazed at the promoter sequences, and
discovered this binding site ?
Chapter 4.5-4.9
Finding motifs ab initio
• Enumerate all possible strings of some fixed (small) length
• For each such string (“motif”) count its occurrences in the
promoters
• Report the most frequently occurring or the most
statistically significant motif
• Does the true motif pop out ?
Simple statistics
• Consider 10 promoters, each 100 bp long
• Suppose a secret motif ATGCAACT has been “planted”
in each promoter
• Our enumerative method counts every possible “8-mer”
• Expected number of occurrences of an 8-mer is 10 x
100 x (1/4)8 ≈ 0.015
• Most likely, an arbitrary 8-mer will occur at most once,
may be twice
• 10 occurrences of ATGCAACT will stand out
Variation in binding sites
• Motif occurrences will not always be exact copies of
the consensus string
• The transcription factor can usually tolerate some
variability in its binding sites
• It’s possible that none of the 10 occurrences of our
motif ATGCAACT is actually this precise string
A new motif model
• To define a motif, let’s say we know where the motif
occurrence starts in the sequence
• The motif start positions in their sequences can be
represented as s = (s1,s2,s3,…,st)
Motifs: Profiles and Consensus
a G g t a c T t
C c A
Alignment
a c g
a c g
C c g
t
t
t
t
a
T
C
a
c
A
c
c
g
g
A
g
t
t
t
G
_________________
Profile
A
C
G
T
3
2
0
0
0
4
1
0
1
0
4
0
0
0
0
5
3
1
0
1
1
4
0
0
1
0
3
1
0
0
1
4
_________________
Consensus
A C G T A C G T
• Line up the patterns by
their start indexes
s = (s1, s2, …, st)
• Construct matrix profile
with frequencies of each
nucleotide in columns
• Consensus nucleotide in
each position has the
highest score in column
Profile matrices
• Suppose there were t sequences to begin with
• Consider a column of a profile matrix
• The column may be (t, 0, 0, 0)
– A perfectly conserved column
• The column may be (t/4, t/4, t/4, t/4)
– A completely uniform column
• “Good” profile matrices should have more
conserved columns
Scoring Motifs
l
• Given s = (s1, … st) and DNA
a G g t a c T t
C c A t a c g t
a c g t T A g t
a c g t C c A t
C c g t a c g G
_________________
l
Score(s,DNA) =
max
count(k , i)
i 1 k{ A,T ,C ,G}
A
C
G
T
3 0 1 0 3 1 1 0
2 4 0 0 1 4 0 0
0 1 4 0 0 0 3 1
0 0 0 5 1 0 1 4
_________________
Consensus
a c g t a c g t
Score
3+4+4+5+3+4+3+4=30
t
Goal
• Goal is to find the starting positions
s=(s1,…st) to maximize the Score(s,DNA)
of the resulting profile matrix
• This is one formulation of the “motif
finding problem”
Another formulation
• Hamming distance between two strings v and w is
dH(v,w) = number of mismatches between v and w
• Given an array of starting positions s=(s1,…st), we
define dH(v, s) = ∑idH(v,si)
• Define: TotalDist(v, DNA) = mins dH(v,s)
• Computing TotalDist is easy
– find closest string to v in each input sequence
The median string problem
• Find v that minimizes TotalDist(v)
• A double minimization (mins, minv)
• Equivalent to motif finding problem
– Show this
Naïve time complexity
• Motif finding problem: Consider every (s1,…st): O((n-l+1)t)
• Median string problem: Consider every l-mer: O(4l).
Relatively fast !
• Common form of both expressions: Find a vector of L
variables, each variable can take k values: O(kL)
Enumeration
• Want to generate all strings in {1,2,3,4}L
11…11
11…12
11…13
11…14
.
.
44…44
“Search Tree” for enumeration
Here, L = 2
-4
1
2
11
2
11
12
3
3
2-
3-
4-
4
13
14
21
22
23
24
31
32
33
34
41
42
43
44
Analyzing Search Trees
• Characteristics of the search trees:
– The sequences are contained in its leaves
– The parent of a node is the prefix of its sequence
• How can we move through the tree?
“Search Tree” for enumeration
--
Order of steps
4
1
2
11
2
11
12
1
3
2-
3-
4-
4
3
13
14
2
3
21
4
22
5
23
6
24
7
31
8
32
9
10
33
34
11
41
12
13
42
43
14
44
15
An algorithm to enumerate !
NEXTLEAF(a, L, k)
for i := L to 1
if ai < k
ai := ai+1
return a
ai := 1
return a
ALLLEAVES(L, k)
a := (1,..,1)
while true
output a
a := NEXTLEAF(a,L,k)
if a = (1,..,1)
return
Increment the least significant digit; and “carry over” to next
position if necessary
A new task:
visiting all the vertices in tree
•
Not just the leaves, but the internal nodes also
–
–
•
•
Why? Why not only the leaves of the tree?
We’ll see later.
PreOrder traversal of a tree
PreOrder(node):
1. Visit node
2. PreOrder(left child of node)
3. PreOrder(right child of node)
•
How about a non- “recursive” version?
Visit the Next Vertex
1. NextVertex(a,i,L,k)
2.
if i < L
3.
a i+1 1
4.
return ( a,i+1)
5. else
6.
for j L to 1
7.
if aj < k
8.
aj aj +1
9.
return( a,j
)
10. return(a,0)
At an “internal node”
At a “leaf node”,
may need to “climb up”
//
//
//
//
a : the array of digits
i : prefix length
L: max length
k : max digit value
In words
• If at an internal node, just go one level deeper.
• If at a leaf node,
– go to next leaf
– if moved to a non-sibling in this process, jump up
Example
• Moving to the next vertex:
Current Location
1-
11
12
13
--
2-
14
21
22
23
3-
24
31
32
33
4-
34
41
42
43
44
Example
• Moving to the next vertices:
Location after 5
next vertex moves
--
1-
11
12
13
2-
14
21
22
23
3-
24
31
32
33
4-
34
41
42
43
44
Bypassing
• What if we wish to skip an entire
subtree (at some internal node) during
the tree traversal ?
Bypass Move: Example
• Bypassing the descendants of “2-”:
Current Location
1-
11
12
13
--
2-
14
21
22
23
3-
24
31
32
33
4-
34
41
42
43
44
Example
• Bypassing the descendants of “2-”:
Next Location
--
Somehow, know
that 2* is useless
1-
11
12
13
2-
14
21
22
23
3-
24
31
32
33
4-
34
41
42
43
44
Bypassing
BYPASS(a, i, L, k)
for j := i to 1
if aj < k
aj := aj+1
return (a,j)
return (a,0)
Brute Force Solution for the
Motif finding problem
1.
2.
3.
4.
5.
6.
7.
8.
9.
BruteForceMotifSearchAgain(DNA, t, n, l)
s (1,1,…, 1)
bestScore Score(s,DNA)
while forever
s NextLeaf (s, t, n-l+1)
if (Score(s,DNA) > bestScore)
bestScore Score(s, DNA)
bestMotif (s1,s2 , . . . , st)
return bestMotif
O(l(n-l+1)t)
Can We Do Better?
• Sets of s=(s1, s2, …,st) may have a weak profile for the first
i positions (s1, s2, …,si)
• Every row of alignment may add at most l to Score
• Optimism: if all subsequent (t-i) positions (si+1, …st) add
(t – i ) * l to Score(s,i,DNA)
• If Score(s,i,DNA) + (t – i ) * l < BestScore, it makes no
sense to search in vertices of the current subtree
– Use ByPass()
• “Branch and bound” strategy
– This saves us from looking at (n – l + 1)t-i leaves
Pseudocode for Branch and Bound Motif Search
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
17.
BranchAndBoundMotifSearch(DNA,t,n,l)
s (1,…,1)
bestScore 0
i1
while i > 0
if i < t
optimisticScore Score(s, i, DNA) +(t – i ) * l
if optimisticScore < bestScore
(s, i) Bypass(s,i, n-l +1)
else
(s, i) NextVertex(s, i, n-l +1)
else
if Score(s,DNA) > bestScore
bestScore Score(s)
bestMotif (s1, s2, s3, …, st)
(s,i) NextVertex(s,i,t,n-l + 1)
return bestMotif
The median string problem
• Enumerate 4l strings v
• For each v, compute TotalDist(v, DNA)
– This requires linear scan of DNA, i.e., O(nt)
• Overall: O(nt4l)
• Improvement by branch and bound ?
• During enumeration of l-mers, suppose we are at
some prefix v’, and find that TotalDist(v’,DNA) >
BestDistanceSoFar.
• Why enumerate further ?
Bounded Median String Search
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
17.
18.
19.
BranchAndBoundMedianStringSearch(DNA,t,n,l )
s (1,…,1)
bestDistance ∞
i1
while i > 0
if i < l
prefix string corresponding to the first i nucleotides of s
optimisticDistance TotalDistance(prefix,DNA)
if optimisticDistance > bestDistance
(s, i ) Bypass(s,i, l, 4)
else
(s, i ) NextVertex(s, i, l, 4)
else
word nucleotide string corresponding to s
if TotalDistance(s,DNA) < bestDistance
bestDistance TotalDistance(word, DNA)
bestWord word
(s,i ) NextVertex(s,i,l, 4)
return bestWord
Today’s summary
•
•
•
•
•
DNA arrays, measuring gene expression
The motif finding problem
The median string problem
Enumeration with search trees
Branch and bound