RNA Secondary Structure & Combinatorics

Download Report

Transcript RNA Secondary Structure & Combinatorics

Predicting RNA Secondary Structures:
A Lattice Walk Approach to Modeling
Sequences Within the HIV-1 RNA
Structure
Facing the Challenge of Infectious Diseases in
Africa: The Role of Mathematical Modeling
University of Witswatersrand
Johannesburg, South Africa
September 25-27, 2006
Asamoah Nkwanta, Ph.D.
Morgan State University
[email protected]
TOPICS
RNA Prediction & Molecular Biology
 RNA Combinatorics
 Certain Class of Random Walks
 Matrix Theory
 Connection Between Walks & RNA
 Modeling HIV-1 RNA Sequences

RNA Secondary Structure Prediction
“The Human Genome Project and related efforts
have generated enormous amounts of raw
biological
sequence
data.
However,
understanding how biological sequences encode
structural information remains a fundamental
scientific challenge. For instance, understanding
the base pairing, or secondary structure, of
single-stranded RNA sequences is crucial to
advancing knowledge of their novel biochemical
functions.”
C. E. Heithsch, Combinatorics on Plane Trees, Motivated by RNA
Secondary Structure Configuration (preprint, 2005)
What is RNA Secondary Sequence
Prediction ?
RNA Secondary Structure Prediction

Given a primary sequence, we want to find
the biological function of the related
secondary structure. To achieve this goal
we predict (model) its’ secondary
structure.

Most methods predict secondary structure
rather than tertiary structure. The three
dimensional shape is important for
biological function, and it is harder to
predict.
Molecular Biology (Cont.)
3-D structure of
Haloarcula marismortui
5S ribosomal RNA in
large ribosomal subunit
Molecular Biology

Central Dogma

DNA  RNA  Protein

Transcription / Translation
Molecular Biology (Cont.)
Molecular Biology (Cont.)
However, the "Central Dogma" has had to be revised a bit. It turns out that you CAN go back
from RNA to DNA, and that RNA can also make copies of itself. It is still NOT possible to go
from Proteins back to RNA or DNA, and no known mechanism has yet been demonstrated for
proteins making copies of themselves.
Molecular Biology (cont.)

HIV is one of a group of atypical
viruses called retroviruses that
maintain their genetic information in
the form of RNA. Retroviruses are
capable of producing DNA from RNA.
Molecular Biology (Cont.)
Molecular Biology (cont.)

Ribonucleic acid (RNA) molecule: Three
main categories
mRNA (messenger) – carries genetic
information from genes to other cells
 tRNA (transfer) – carries amino acids to a
ribosome (cells for making proteins)
 rRNA (ribosomal) – part of the structure of
a ribosome

Molecular Biology (cont.)

Other types (RNA) molecules:
snRNA (small nuclear RNA) – carries
genetic information from genes to other
cells
 miRNA (micro RNA) – carries amino acids
to a ribosome (cells for making proteins)
 iRNA (immune RNA) – part of the structure
of a ribosome (Important for HIV studies)

RNA Secondary Structure
“RNA secondary structures are important in
many biological processes and efficient structure
prediction can give vital directions for
experimental investigation. “
B. Knudsen and J. Hein, Pfold: RNA secondary structure prediction
using stochastic context-free grammars (Nucleic Acids Research,
2003)
There are published examples involving tRNA,
rRNA, and other types of RNA
RNA Secondary Structure (cont.)

A ribonucleic acid (RNA) molecule
consists
of
a
sequence
of
ribonucleotides
(typically
single
stranded)

Each ribonucleotide contains one of
four bases: adenine (A), cytosine (C),
guanine (G), and uracil (U)
Secondary Structure (cont.)

Note U is replaced by thymine (T) in
DNA

As the molecule forms, chemical
bonds join A-U and C-G pairs,
(Unstable G-U). These are called the
Watson-Crick pairs.
Secondary Structure (cont.)

Primary Structure – The linear sequence
of bases in an RNA molecule

Secondary Structure – The folding or
coiling of the sequence due to bonded
nucleotide pairs: A-U, G-C

Tertiary Structure – The three dimensional
configuration of an RNA molecule
Primary RNA Sequence


CAGCAUCACAUCCGCGGGGUAAACGCU
Nucleotide Length, 27 bases
Geometric Representation
Secondary structure is a graph defined
on a set of n labeled points
 (M.S. Waterman, 1978)

Biological
 Combinatorial/Graph Theoretic
 Random Walk

RNA COMBINATORICS

RNA Numbers
1,1,1,2,4,8,17,37,82,185,423,978,…

These numbers count various
combinatorial objects including RNA
secondary structures of length n.
RNA COMBINATORICS (cont.)

The number of RNA secondary structures
for the sequence [1,n] is counted by the
coefficients of s(z):

3
z2 
2

z1  z1  1  ) z ( s
6
z 71 
5
z8 
4
Coefficients of the formal power series:

(1,1,1,2,4,8,17,37,82,185,423,978,…)
z4
RNA COMBINATORICS (cont.)

The number of lattice paths with unit steps
R (right), U (up) & D (down) that go from
(0,0), remain in the first quadrant of the
coordinate plane, and return to the x-axis
under the restriction that there are never
consecutive UD steps is the nth RNA
number:

(1,1,1,2,4,8,17,37,82,185,423,978,…)
RNA COMBINATORICS (cont.)

n
The number of RNA sequences of length n
that can be formed over the words
[A,U,G,C] such that the letters A & U are
not adjacent is equal to:
n
 71  3  5  71
 71  3  5  71
 



2
2
7
1
2



 71 2

What a remarkable formula for an integer,
when n = 1 we get 4, and n = 2 we get 14.
Counting Sequence Database

The On-line Encyclopedia of Integer
Sequences:
http:/www.research.att.com/njas/sequence
s/index.html

N.J.A. Sloane & S. Plouffe, The
Encyclopedia of Integer Sequences,
Academic Press, 1995.
RNA EQUATIONS

Recurrence Relations:
s0  s1  s2  1
n 1
sn  1  sn    s j  1sn  j , n  2
j 1
RNA EQUATIONS (cont.)

4
Generating Function:


z  z2  z  z2  1  z  z  1



z
s
2
z2
3

2
2
1,1,1,2,4,8,17,37,82,185,423,978,…
RNA EQUATIONS (cont.)

Exact Formula:
1  n  k  n  k 



sn   
k 1 n  k  k  1  k

RNA EQUATIONS (cont.)

s(n,k) is the number of structures of
length n with exactly k base pairs:
For n,k > 0,
1  n  k  n  k  1


sn, k   
k  k  1  k  1 
RNA EQUATIONS (cont.)

Asymptotic Estimate: As n grows
without bound
15  7 5   3  5 

 n 
sn  ~




8   2 
3

2
n
Random Walk

A random walk is a lattice path from
one point to another such that steps
are allowed in a discrete number of
directions and are of a certain length
RNA Walk – Type I

NSE* Walks – Unit step walks starting at
the origin (0,0) with steps up, down, and
right

No walks pass below the x-axis and there
are no consecutive NS steps
RNA Walk – Type I (cont.)

N = (0,1) up

S = (0,-1) down

E = (1,0) right
Type I Walk Array (n x k)
 1

 1
 1

 2
 4

 8

17
 

0
0
0
0
0
0
1
2
0
1
0
0
0
0
0
0
0
0
3
6
3
6
1
4
0
1
0
0
0
0
13
28

13
30

10
24

5
15

1
6

0
1














RNA Walk – Type II

NSE** Walks
–
Unit-step walks
starting at the origin (0,0) with steps
up, down, and right such that no
walks pass below the x-axis and
there are no consecutive SN steps
Type II Walk Array (n x k)
 1

 1
 2

 4
 8

 17

 37
 

0
0
0
0
0
0
1
2
0
1
0
0
0
0
0
0
0
0
4
9
3
7
1
4
0
1
0
0
0
0
20
45

17
41

11
29

5
16

1
6

0
1














Examples

Type I: ENNESNESSE

Type II: NEEENSEEES
RNA Walk Bijection

Theorem: There is a bijection between the
set of NSE* walks of length n+1 ending at
height k = 0 and the set of NSE** walks of
length n ending at height k = 0.

Source:
Lattice
paths,
generating
functions, and the Riordan group, Ph.D.
Thesis, Howard University, Washington,
DC, 1997
Matrices Count Lattice Walks

Type I Walks
1 0
1 1
1 2
2 3
4 6
8 13
17 28
- -
0
0
1
3
6
13
30
-
0
0
0
1
4
10
24
-
0
0
0
0
1
5
15
-
0
0
0
0
0
1
6
0
0
0
0
0
0
1
- -
-

Type II Walks
1
1
2
4
8
17
37
-
0
1
2
4
9
20
41
-
0
0
1
3
7
17
41
-
0
0
0
1
4
11
29
-
0
0
0
0
1
5
16
-
0
0
0
0
0
1
6
-
0
0
0
0
0
0
1
-
-
The ith-jth entry corresponds to the number of random walks
of length i and ending height j.
Type I Formation Rule (Recurrence)
m0,0  1
mn  1,0   mn  j , j 
j 0
mn  1, k   mn, k  1   mn  j , k  j 
mn  1, k   0, k  n  1
j 0
The Connection Between RNA
and the Walks

Theorem: There is a bijection between the
set of RNA secondary structures of length
n and the set of NSE* walks ending at
height k=0.

Source: Lattice paths and RNA secondary
structures, DIMAC Series in Discrete Math.
& Theoretical Computer Science 34 (1997)
137-147. (CAARMS2 Proceedings)
Primary Molecule
Molecules Derived From Lattice Walks
Secondary
Molecules
RNA Sequence
GGU
GUUAA
UUG
Known Conserved
Region
AGG
GUG
GAG
Regulatory
Region
Walk 1
Lattice Walk Application
To Predict RNA Sequence
Walk 2
HIV-1 RNA Sequence Prediction

We want to construct a lattice walk
method to predict secondary RNA
sequences that code for regions of
the SL2 and SL3 domains within the
HIV-1 5’ UTR RNA molecule.

These domains are important for HIV
genomic packaging
HIV-1 RNA Structural Components
Components of Secondary Structure

Base pairs
 Bulges
 Interior Loops
 End loops
 Hairpin
 Multibranch loops – junctions where
more than one hairpin or more complex
secondary structures are appended.
HIV-1 Sequence (SL2 & SL3)

The following sequence was obtained from the NCBI website.
The first 363 nucleotides were extracted from the entire HIV-1
RNA genomic sequence:

GGUCUCUCUGGUUAGACCAGAUCUGAGCCUGGGAGCUCUCU
GGCUAACUAGGGAACCCACUGCUUAAGCCUCAAUAAAGCUU
GCCUUGAGUGCUUCAAGUAGUGUGUGCCCGUCUGUUGUGU
GACUCUGGUAACUAGAGAUCCCUCAGACCCUUUUAGUCAGU
GUGGAAAAUCUCUAGCAGUGGCGCCCGAACAGGGACCUGA
AAGCGAAAGGGAAACCAGAGGAGCUCUCUCGACGCAGGAC
UCGGCUUGCUGAAGCGCGCACGGCAAGAGGCGAGGGGCGG
CGACUGGUGAGUACGCCAAAAAUUUUGACUAGCGGAGGCUA
GAAGGAGAGAGAUGGGUGCGAGAGCGUCAGUAUUAAGCG

Color key:
SL2 – yellow
SL3 - red
Known Sequence of the SL2 Domain
GGCGACUGGUGAGUACGCC
mfe -7.1 Original Structure
Type I Walk
Type II Walk
Lattice Walk Model






Start with an RNA primary sequence
Perform RNA combinatorial analysis on the
given sequence
Connect lattice walks to the given sequence
using Type I and II walks
Calculate identified sequences to find the
minimum free energy
Predict secondary sequence
Conduct
laboratory
experiments
for
biological functionality
Acknowledgments

National Science Foundation, DIMACS,
AIMS, Burroughs Wellcome, SACEMA,
WITS

MATH. Modeling 561, Graduate Students

Collaborators: Dwayne Hill, Biology Dept.,
MSU, and Alvin Kennedy, Chemistry Dept.,
MSU