BC Bioinformatics

Download Report

Transcript BC Bioinformatics

Asymptotics of RNA shapes
Yann Ponty
Andy Lorenz
Peter Clote
Biology Department
Boston College
Talk outline
1.Biological motivation
2.General approach used
3.Results for RNA shapes
Biological motivation
Primary structure
Sequence of
A’s, C’s, G’s and U’s
Secondary structure
Tertiary structure
• By definition it is (half-)planar in nature Ultimate goal, but
• Commonly computationally tractable
difficult to predict.
Biological motivation – RNA secondary structure
Terminal loop
Picture representation (planar
graph representation)
Terminal loop
Balanced parenthesis
sequence representation
n=53
…(((…((((((((…....)))))…))).(((….)))..)))…
Non-negative excursion
(path that starts and
ends at 0 but is never
negative)
Biological motivation – RNA secondary structure
No crossing is allowed in secondary structure. Such crosses are called
pseudoknots.
pseudo-knot
This is a pseudo-knot, not allowed
in secondary structure
Biological motivation – RNA secondary structure
In many algorithms, the set of all possible secondary structures
is the search space. The size of this search space can affect
how large an RNA a given algorithm can be applied to?
Let S(n) denote the number of secondary structures on a
sequence of length n. As described, S(n) are known as the
Motzkin numbers, and the asymptotics of S(n) for large n are
S(n) »
p
3p 3 n
3=2
3
=n
2 ¼
» 1:46581¢3n =n3=2
Similarly, if we force terminal loops to be of length 1 or more, we
get a different asymptotic growth (Stein, Waterman 1978).
q
³ p ´n
p
7 5 ¡ 3=2 3+ 5
n =n3=2
S(n) » 15+8¼
n
»
1:104366¢2:618034
2
In any case, these numbers grow fast. Can we find equivalence
classes of these shapes that do not grow so fast?
Biological motivation – RNA shapes – π shapes
Unpaired base
Terminal loop
bulge
helix
Paired base
Terminal loops
Multi-loop
Multi-loop
Internal loop
Helix regions
π-shapes try to capture basic shape of secondary structure. Bulges and
internal loops are ignored. The unpaired bases in multi-loops and terminal
loops are ignored. Helix regions are collapsed into 1 bracket. The π-shapes
for both of the above secondary structures is the same, and is
[[][]]
Biological motivation – RNA shapes – π shapes
]
[
[
[
[
]
[
[ ]
]
]
…(((…((((((((…….)))))…))).(((.(((…)))..))).)))…
[
[
]
[[][]]
[
]
]
]
Biological motivation – RNA shapes – π’ shapes
Unpaired base
Paired base
bulge
Multi-loop
Internal loop
Helix regions
π’-shapes try to capture more aspects of secondary structure. Bulges
and internal loops are not ignored. Any group of unpaired bases is
reduced to one dot. Helix regions are collapsed into 1 bracket. The
π’-shapes for the above secondary structures are
.[.[[.].].[.[.].].].
and
.[.[.][.].].
Biological motivation – RNA shapes – π’ shapes
…(((…((((((((…….)))))…))).(((.(((…)))..))).)))…
.
.
.
. . . . . .
.
[
[ [
]
] [ [
] ] ]
.[.[[.].].[.[.].].].
Talk outline
1.Biological motivation
2.General approach used
3.Results for RNA shapes
General approach – overview
Standard method
Structure
Better method
Structure
(comparison 1)
Recursion relations
Functional relation for
generating function
Grammar
Generating Function
(comparison 2)
Asymptotics
Asymptotics
General approach
Example to be worked through
For our example, we will use secondary structures with
minimum terminal loop length of 1.
…(((…((((((((…....)))))…))).(((.)))..)))…
Terminal loop of length 7
…length 1
..(((..().((..((..))))
length 0
OK
not OK
length 2
We know the asymptotics of this to be
q
S(n) »
³ p ´n
p
15+ 7 5 ¡ 3=2 3+ 5
8¼ n
2
» 1:104366¢2:618034n =n3=2
Standard method – recursion relations
P1
n
S
z
Let S(z) =
where Sn is num of secondary structures on
n
n= 1
sequence of length n.
not base paired
2 cases
k
Sk-1
n
Sn-1
Sn-k-1
From this we see that
Sn = Sn¡ 1 +
P n¡
2
k= 1 Sk¡ 1
¢Sn¡
k¡ 1
Now, being careful with initial conditions, and with a bit of algebraic
manipulation we eventually can get to the relation
S = z + Sz + Sz2 + S2z2
Standard method – recursion relations
First note
Ã
S2 =
!
X1
2
sn zn
=
n= 1
à n¡
X
X1
n= 1
!
1
sk sn¡
k
zn :
(1)
k= 1
P 2
By induction we ¯rst get Sn P= Sn¡ 1 + n¡
k= 1 Sk¡ 1 ¢Sn ¡ k¡ 1 . Replacing n by
n
n+ 2, we have Sn+ 2 =PSn+ 1 + k= 1 Sk¡ 1 ¢Sn¡ (k¡ 1) . Substituting
P n r for k ¡ 1, we
n
have Sn + 2 = Sn+ 1 + r = 0 Sr ¢Sn¡ r . Since S0 = 1, we have r = 0 Sr ¢Sn¡ r =
P 1
Sn + n¡
r = 0 Sr ¢Sn¡ r , so
Sn+ 2 ¡ Sn+ 1 ¡ Sn =
n¡
X1
Sr ¢Sn¡ r :
r=0
Now
Ã
S2 =
X1
!
2
Sn zn
=
n= 1
Ãn¡
X
X1
n= 1
!
1
Sk Sn¡
k
zn
k= 1
so
2
S =
X1
n
(Sn + 2 ¡ Sn+ 1 ¡ Sn )z =
n= 1
X1
n= 1
n
Sn+ 2z ¡
X1
n= 1
n
Sn+ 1z ¡
X1
n= 1
Sn zn :
Standard method – recursion relations
Note that
1
S ¡ S1z ¡ S2z2 X
=
Sn+ 2 zn
2
z
n= 1
and
1
S ¡ S1z X
=
Sn + 1 zn :
z
n= 1
Thus
S ¡ z ¡ z2 S ¡ z
S =
¡
¡ S
z2
z
2
Multiply by z2 to get
z2 S2 = S ¡ z ¡ z2 ¡ zS + z2 ¡ Sz2
so
z2 S2 ¡ S(1 ¡ z ¡ z2 ) + z = 0
and thus
S = z2 S2 + Sz2 + Sz + z
(1)
Better method – grammar
Grammars are a way of generating a language. We will restrict to the
simple case of context-free grammars. The context-free grammar for
secondary structures (with minimal terminal loop length 1) is given by
S!
² jS² j ( S) jS( S)
Here our terminal symbols (letters in the language this grammar
describes) are (, ) and ●. The language generated is exactly
secondary structures (with minimum terminal loop length 1).
To generate a word, we make a substitution on the right for a symbol
on the left. So, for an example for the above language, we could
generate the word (always substituting the left-most non-terminal)
S ! S( S) ! S² ( S) ! ² ² ( S) ! ² ² ( S² ) !
! ² ² ( ² ( S) ² ) ! ² ² ( ² ( ² ) ² )
² ² ( S( S) ² )
In addition, because this grammar is unambiguous, there is no
different path to this same word.
Better method – grammar
S!
² jS² j ( S) jS( S)
Here is the parse tree
for the same word.
S
(
S
S
●
S
●
)
●
S
S
●
(
S
)
●
Reading the leaves of the tree gives the same word, ² ² ( ² ( ² )
This is only tree giving rise to this word, because the language is
unambiguous (for this grammar can be shown by induction).
² ).
Better method – grammar
Type of nonterminal
S! T j U
S! TU
S! t
S! "
Equation for the l.g.f.
S(z) = T(z) + U(z)
S(z) = T(z)U(z)
S(z) = z
S(z) = 1
Given an unambiguous context-free grammar, we can find the
corresponding generating function relations with the above
properties. This is very fast!
S ! ² jS² j ( S) jS( S)
S = z + zS + z2S + z2S2
Significantly faster than other method at getting here!
This method for getting the relations on the generating equation is
sometimes called the DSV method.
General approach – overview
Standard method
Structure
Better method
Structure
(comparison 1)
Recursion relations
Functional relation for
generating function
Grammar
Generating Function
(comparison 2)
Asymptotics
Asymptotics
Standard method – getting asymptotics
The following theorem due to Meir-Moon and modified by Odlyzko can
be used to get the asymptotics.
P
Suppose that f (z) = 1n= 1 f n zn is analytic
f n ¸ 0 for all n,
P at z = 0, that
and that f (z) = G(z; f (z)), where G(z; w) = m ;n ¸ 0 gm ;n zm wn . Suppose that
there exist real numbers ±; r; s > 0 such that
² G(z; w) is analytic in jzj < r + ± and jwj < s + ±.
² G(r; s) = s, Gw (r; s) = 1,
² Gz (r; s) 6
= 0 and Gw;w (r; s) 6
= 0.
Suppose that gm ;n is real and non-negative for all m; n, that g0;0 = 0, g0;1 6
=
1,and gm ;n > 0 for some m and some n ¸ 2. Assume further that there exist
h > j > i ¸ 1 such that f h f i f j 6
= 0 while the greatest common divisor of j ¡ i
and h ¡ i is 1. Then f (z) converges at z = r , f (r ) = s, and
s
r Gz (r; s) ¡ n ¡ 3=2
f n = [zn ]f (z) »
r n
:
2¼Gw;w (r; s)
For us, we identify S with w and get
G(z; w) = z + zw + z2w + z2w2
Standard method – getting asymptotics
What is good and bad about the Bender-Meir-Moon theorem?
The good
•If all of the conditions are satisfied, one merely has to plug in the
answer.
•This approach, unlike the one we are going to describe, does not
need an explicit formula for the generating function.
The bad
•If all of the conditions are not satisfied, (as happened to us), you
cannot use the theorem. There are many conditions that are very
constraining.
• This approach will only tackle one (common) kind of singularity in
the generating function. The approach we show next is much more
general in this respect.
The ugly…
Better method – getting asymptotics
We now describe a more general approached as described by Flajolet
and Odlyzko (1990).
Start with the relation for the generation function given by the grammar.
S = z + zS + z2S + z2S2
Solve for generating function.
S§ =
2
1¡ z¡ z §
p
1¡ 2z¡ z 2 ¡ 2z 3 + z 4
2z 2
Choose the solution that is analytic at 0 (a necessary condition for
generating functions).
S=
2
1¡ z¡ z ¡
p
1¡ 2z¡ z 2 ¡ 2z 3 + z 4
2z 2
This function will be analytic except possibly where the
denominator is zero (in this case it is analytic at z=0), and where
the square root is zero (since z1=2 is not analytic at 0)
Better method – getting asymptotics
S=
2
1¡ z¡ z ¡
p
1¡ 2z¡ z 2 ¡ 2z 3 + z 4
2z 2
The above equation is non-analytic at the roots of the polynomial
1¡ 2z ¡ z2 ¡ 2z3 + z4
These are
p found to bep roots of unity (modulus 1) and
z = 3+2 5 ; z = 3¡ 2 5
The dominant singularity, ρ, is the one with with smallest modulus.
Thus
p
½=
3¡
5
2
P1
n converges for jzj < ½
This means the series S =
n= 0 Sn z
and diverges for jzj > ½. We immediately get that the terms Sn
grow exponentially atpa rate of
1=½= 3¡ 2p 5 = 3+2 5 p
Thus we know Sn ¼ ( 3+2 5 )n
Better method – finer asymptotics
Now we use the theorem by Flajolet and Odlyzko.
Assume that f(z) is analytic in 4 n1, and that as z ! 1 in 4 ,
f (z) » K (1 ¡ z) ®
Then, as n ! 1 , if ® 2= 0; 1; 2; :::,
fn »
The region, 4
i
ε
1
External singularities
φ
K
n¡
¡ (¡ ®)
®¡ 1
:
If the singularities are isolated, and if
there is a unique dominant singularity,
the rescaled function will always be
analytic in the required region, 4 \1.
The theorem basically states that the
growth of terms is determined
completely by the singularity. The
Dominant singularity
rescaling of the singularity to 1 merely
gets rid of the exponential portion in the
answer, which we already know.
Better method – finer asymptotics
S=
=
2
1¡ z¡ z ¡
1¡ z¡ z 2
2z 2
¡
p
1¡ 2z¡ z 2 ¡ 2z 3 + z 4
2
2z
p
1¡ 2z¡ z 2 ¡ 2z 3 + z 4
2z 2
Slower growing as it does
not contain singularity
S0 =
¡
S0 = ¡
=¡
Only part that matters
for asymptotics
p
1¡ 2z¡ z2 ¡ 2z3 + z4
2z2
(P2 (z)(1¡ z=½)) 1=2 where P2(z) is determined by dividing out
the dominant singularity.
2z 2
P2 (z) 1=2
1=2
(1¡
z=½
)
2z 2
In theorem, α=1/2 from here
This is portion in front of
dominant singularity
P2 (½) 1=2
Evaluate that portion at dominant singularity ρ.
2
2½
K =¡
Better method – finer asymptotics
K =¡
1=2
P2 (½)
2½2
p
1=½=
3+ 5 and
2
® = 1=2
So from theorem we get
Sn »
K
¡ ®¡ 1( 1 )n
n
¡ (¡ ®)
½
(from rescaling)
Plugging in gives
q
S(n) »
as desired.
³ p ´n
p
15+ 7 5 ¡ 3=2 3+ 5
8¼ n
2
» 1:104366¢2:618034n =n3=2
Method
Combinatorial class
Codes
Bijections
Language
?
Grammar
Generating function
DSV
Singularity
analysis
Asymptotics
-Shapes
-Shapes » Trees
Well-parenthesized language
(Dyck words)
[ [ [ [ ] [ ] [ ] ] [ ] [ ] ] [ ] ]
-Shapes
-Shapes » Trees
Well-parenthesized language
(Dyck words)
[ [ [ [ ] [ ] [ ] ] [ ] [ ] ] [ ] ]
Grammar:
S[S]S|
But nested motifs [ [ ] ] must be prohibited !
Otherwise many different shapes may
correspond to a single structure !!!
[]
[[]]
[[[]]]
…
-Shapes
-Shapes , Dyck words w/o [ [ ] ] motifs
T
T
T
S
T
T
S
T
S
T
S
T
T
S
T
S
[ [ [ [ ] [ ] [ ] ] [ ] [ ] ] [ ] ]
Grammar:
S[T]S|[T]
T[T]S|
Generating
function:
Asymptotics:
-Shapes
Goal : Number of shapes (S) compatible with a given RNA sequence S.
First model: Every base can pair with every other base
) (S) is the number of -Shapes having length  n
Problem: We only know the number of -Shapes having length = n
) Build language whose words are -shapes, prefixed by a sequence of
a new dummy symbol ■
By virtually removing the dummy characters in words of size n generated
from R, we get all -Shapes of size  n.
R■R|S
S[T]S|[T]
T[T]S|
-Shapes
Goal : Number of shapes (S) compatible with a given RNA sequence S.
Second model: Terminal loops requires at least  bases (sterical constraint)
Equivalent to counting the number of Pi-shapes  of size  n-t,
where t is the number of terminal loops in 
Terminal loops
-Shapes
Goal : Number of shapes (S) compatible with a given RNA sequence S.
Second model: Terminal loops requires at least  bases (sterical constraint)
Equivalent to counting the number of Pi-shapes  of size  n-t,
where t is the number of terminal loops in 
) Introduce a new dummy symbol  and get rid of it later !
T
T
T
S
T
T
S
R■R|S
T
S
T
S
T
T
S
T
[ [ [ [ ] [ ] [ ] ] [ ] [ ] ] [ ] ]
S[T]S|[T]
T[T]S|
Derivation T   actually creates a terminal loop, so make it cost  bases in
the matching sequence.
-Shapes
Second model: Terminal loops requires at least  bases (sterical constraint)
Grammar:
R■R|S
S[T]S|[T]
T  [ T ] S | 
Generating
function:
Asymptotics:
’-Shapes
Model: Everything can basepair + Terminal loops at least 3 base-long
Introduce unpaired bases  between helices while avoiding  and [ [ ] ] patterns
Grammar:
R■R|S
SU[T]S|U
T  U [ T ] U [ T ] S |  [ T ] | [ T ]  |  [ T ]  | 3
U|
Generating function:
Asymptotics:
A bijection between -Shapes and
Motzkin words
Def Motzkin words:
Words in {(, ), }* whose restriction to {(, )} are well-parenthesized
Example:
(()(())()((())())())
Naturally encodes:
• Positive paths from (0,0) to (n,0) using steps (+1,+1), (+1,-1) and (+1,0)
• Unary/binary trees
• Non-intersecting drawings of any number of chords among n ordered
points on a circle
• RNA secondary structures (( ) pattern forbidden)
A bijection between -Shapes and
Motzkin words
Generating function for Motzkin words:
Generating function for -Shapes:
There exists a bijection between Motzkin words of size n
and -Shapes of size 2n+2 !
-Shapes , Motzkin words
Illustration on trees:

)
[[[][]][][]][]
-shape
length 2n+2
)
Dyck’s
bijection
Tree
pruning
Combinatorial
binary tree
2n+2 edges
(
( ) (  ) 
(

Binary tree
n edges
Motzkin words
n edges
-Shapes , Motzkin words
Motzkin words , -Shapes
+ Terminal loops appear in both structures
Is there a way to transpose the  constraint for -Shapes into
a ’ constraint over Motzkin words ???
(If so,  -Shapes , ’ RNA secondary structures)
Short answer: No ! Why ? Because !
-Shapes , Motzkin words
(Not so) Long answer : Because there is a bigger (though comparable)
number of terminal loops in -Shapes of size 2n+2 than in Motzkin
words of size n.
) A constraint acting on terminal loops has more impact on shapes than on Motzkin
) Proof requires multivariate generating functions analysis.
Multivariate analysis
Idea: Introduce a new character t, whose length is 0, which marks each
occurrence of a terminal loop.
S(T)S|S|
T(T)S|S|t
) Transpose into a system of functional equations:
Let
And
be the number of Motzkin words of size n having k terminal loops
the number terminal loops in a Motzkin word
Multivariate analysis
From
, the average number of terminal loops is just a derivative away:
(Also holds for the average number of occurrences of any given character !!!)
Take the ratio of these quantities and do the same analysis for -shapes
yields the claimed results.
Conclusion and perspectives





DSV method + Singularity analysis
) Easy asymptotics for discrete structures
) Perfect for computational biology
Derived asymptotical growth for the numbers of -shapes
under different models
Yet another bijection between Motzkin words and a
subclass of Dyck words
How to add probabilities for base-pairing to these models ?
Why do the equations simplifies and yield beautiful results
when  is even ?

References
[E.A. Bender]
Asymptotic methods in enumeration
SIAM Rev., 16(4):485-515, 1974.

[P. Flajolet]
Singular Combinatorics
In Proceedings of the International Congress of Mathematicians, Vol 3,
World Scientific, pp. 561-57, 2002

[P. Flajolet and A. M. Odlyzko]
Singularity analysis of generating functions
SIAM Journal of Discrete Mathematics, 3:216-240, 1990.

[R. Giegerich, B. Voss, and M. Rehmsmeier]
Abstract shapes of RNA
Nucleic Acids Res., 32(16):4843-4851, 2004.

[A.M. Odlyzko]
Asymptotic enumeration methods
In Handbook of Combinatorics, pages 1063-1230. Elsevier Science B. V.
and MIT Press, Amsterdam and Cambridge, Volume II. 1995.

[A. Meir and J.W. Moon]
On an asymptotic method in enumeration
Journal of Combinatorial Theory, 51:77-89. Series A , 1989

[P.R. Stein and M.S. Waterman]
On some new sequences generalizing the Catalan and Motzkin numbers
Discrete Math., 26 261-272 , 1978