Transcript lecture9_15

Lesson 9
Motif Search
and
RNA Structure Prediction
Finding short motifs in
biological data
(DNA, RNA and Protein sequences )
Scenario 1 : Binding motif is known (easier case)
Scenario 2 : Binding motif is unknown (hard case)
Scenario 2 :
Binding motif is unknown
“Ab initio motif finding”
Why is it hard?
Are common motifs the right thing to search for?
Solutions:
- Search for motifs that are enriched
in one set but not in a random set
- Use experimental information to rank
the sequences according to their binding
affinity, and search for motifs enriched at
the top of the list
ChIP-Seq
Sequencing the regions in the genome to
which a protein (e.g. a transcription factor)
binds.
Finding the p53 binding motif in a set of p53 target sequences
which are ranked according to binding affinity
[Figure: ChIP-Seq target sequences ranked from best binders (top) to weak binders (bottom)]
DRIMust: a word-search approach to finding
enriched motifs in a ranked list
[Figure: ranked sequences list, with occurrences of the motif CTGTGA marked along the sequences]
Candidate k-mers: CTGTGC, CTGTGA, CTATGC, CTACGC, ACTTGA, CTGTAC, ACGTGA, ATGTGC, ACGTGC, ATGTGA
DRIMust (http://drimust.technion.ac.il/)
uses the minimal hypergeometric
(mHG) statistic to find enriched motifs
The mHG statistic is computed from four counts:
- The total number of input sequences
- The number of sequences containing the motif
- The number of sequences at the top of the list
- The number of sequences containing the motif among the top sequences
[Figure: ranked sequences list with occurrences of CTGTGA marked]
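The four counts above can be plugged into a hypergeometric tail, and the cutoff for "the top of the list" can be chosen to minimise it. The following is a minimal sketch of that idea, not DRIMust's actual implementation: the function names and the toy sequences are made up for illustration, and the minimum over cutoffs is a score that would still need correction for testing many cutoffs.

```python
from math import comb

def hypergeom_tail(N, B, n, b):
    """P(X >= b) when n of N sequences are drawn and B of the N contain the motif."""
    upper = min(n, B)
    favourable = sum(comb(B, i) * comb(N - B, n - i) for i in range(b, upper + 1))
    return favourable / comb(N, n)

def mhg_score(occurrence):
    """Smallest hypergeometric tail over all cutoffs n of a ranked 0/1 vector.

    Returns (score, cutoff). The score is NOT a corrected p-value.
    """
    N = len(occurrence)          # total number of input sequences
    B = sum(occurrence)          # sequences containing the motif
    best_p, best_n = 1.0, 0
    b = 0                        # motif-containing sequences among the top n
    for n in range(1, N + 1):
        b += occurrence[n - 1]
        p = hypergeom_tail(N, B, n, b)
        if p < best_p:
            best_p, best_n = p, n
    return best_p, best_n

def contains(kmer, sequences):
    """1 if the k-mer occurs in the sequence, else 0, in ranked order."""
    return [1 if kmer in s else 0 for s in sequences]

# Toy ranked list (made-up sequences, best binders first).
ranked = ["AACTGTGATT", "CTGTGACCAA", "GGCTGTGAGG", "ACGTACGTAC",
          "TTTTGGGGCC", "GATTACAGAT", "CCCCAAAATT", "AGAGAGAGAG"]
occurrence = contains("CTGTGA", ranked)
score, cutoff = mhg_score(occurrence)
print(occurrence, score, cutoff)
```

In this toy list the motif occurs only in the three top-ranked sequences, so the best cutoff is the top three.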
The enriched motifs are combined to get a
PSSM which represents the binding motif
[Figure: detected enriched motifs]
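One simple way to combine enriched k-mers into a PSSM is sketched below. It assumes the k-mers have equal length and are already aligned, and it uses a pseudocount and a uniform background; these choices and the function names are assumptions for illustration, since the slides do not say how the motifs are combined.

```python
from math import log2

def count_matrix(motifs, alphabet="ACGT"):
    """Count each base at each position of equal-length, pre-aligned motifs."""
    length = len(motifs[0])
    counts = {base: [0] * length for base in alphabet}
    for motif in motifs:
        for pos, base in enumerate(motif):
            counts[base][pos] += 1
    return counts

def build_pssm(motifs, pseudocount=1.0, background=0.25, alphabet="ACGT"):
    """Log-odds matrix: log2(frequency / background) per base and position."""
    counts = count_matrix(motifs, alphabet)
    total = len(motifs) + pseudocount * len(alphabet)
    return {base: [log2((c + pseudocount) / total / background) for c in row]
            for base, row in counts.items()}

def score_site(matrix, site):
    """Sum of the log-odds of the bases observed in a candidate site."""
    return sum(matrix[base][pos] for pos, base in enumerate(site))

# Four of the candidate k-mers listed earlier, used here as example input.
enriched = ["CTGTGA", "CTGTGC", "ATGTGA", "ACGTGA"]
counts = count_matrix(enriched)
matrix = build_pssm(enriched)
print(score_site(matrix, "CTGTGA"), score_site(matrix, "AAAAAA"))
```

A site that resembles the enriched k-mers gets a positive score; an unrelated site scores below zero.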
Protein Motifs
Protein motifs are usually 6-20 amino acids long and
can be represented as a consensus/profile:
P[ED]XK[RW][RK]X[ED]
or as a PWM
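A consensus such as P[ED]XK[RW][RK]X[ED] maps almost directly onto a regular expression. The sketch below assumes that X stands for any of the 20 standard amino acids; the helper names and the test protein are made up for illustration.

```python
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def consensus_to_regex(pattern):
    """Turn a consensus like P[ED]XK[RW][RK]X[ED] into a compiled regex.

    Assumes X means "any amino acid" and never appears inside brackets.
    """
    return re.compile(pattern.replace("X", "[" + AMINO_ACIDS + "]"))

def find_motif(pattern, protein):
    """Return (start, matched text) for each non-overlapping match."""
    regex = consensus_to_regex(pattern)
    return [(m.start(), m.group()) for m in regex.finditer(protein)]

# Made-up protein fragment containing one instance of the motif.
hits = find_motif("P[ED]XK[RW][RK]X[ED]", "MKPEAKRRLEGG")
print(hits)
```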
From Sequence to Structure
Predicting RNA structure
DNA → RNA → protein
According to the central dogma of molecular biology the main role of
RNA is to transfer genetic information from DNA to protein
RNA has many other biological
functions
• Protein synthesis (ribosome)
• Control of mRNA stability (UTR)
• Control of splicing (snRNP)
• Control of translation (microRNA)
• Control of transcription (long non-coding RNA)
The function of the RNA molecule
depends on its folded structure
Ribosome
Nobel prize 2009
RNA Structural levels
Tertiary Structure
Secondary Structure
tRNA
RNA Secondary Structure
• RNA bases are G, C, A, U
• The RNA molecule folds on itself.
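• Base pairs are held together by hydrogen bonds. The base pairing is as follows:
G-C
A-U
G-U
[Figure: the sequence 5’ GAUCUUGAUC 3’ folded into a hairpin, with a stem of four base pairs (G-C, A-U, U-A, C-G) closed by a UU loop]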
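The pairing rules can be checked on the example sequence GAUCUUGAUC. The sketch below is a toy, not a folding algorithm: it only pairs the two ends of the sequence inward until it meets bases that cannot pair, and the function names are made up for illustration.

```python
# Pairs allowed by the rules above: G-C, A-U and G-U, in either orientation.
CAN_PAIR = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}

def hairpin_stem(seq):
    """Pair the ends of seq inward; return (list of paired indices, loop)."""
    pairs = []
    i, j = 0, len(seq) - 1
    while i < j and (seq[i], seq[j]) in CAN_PAIR:
        pairs.append((i, j))
        i += 1
        j -= 1
    return pairs, seq[i:j + 1]

def to_dot_bracket(length, pairs):
    """Write the pairs in dot-bracket notation: '(' and ')' paired, '.' unpaired."""
    symbols = ["."] * length
    for i, j in pairs:
        symbols[i], symbols[j] = "(", ")"
    return "".join(symbols)

seq = "GAUCUUGAUC"
pairs, loop = hairpin_stem(seq)
structure = to_dot_bracket(len(seq), pairs)
print(pairs, loop, structure)
```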
Predicting RNA secondary Structure
Most common approach:
Search for an RNA structure with
Minimal Free Energy (MFE)
[Figure: two alternative folds of the sequence GAUCUUGAUC, one labeled low energy and one labeled high energy]
Free energy model
The free energy of a structure is the sum of all
interaction energies
Free Energy (E) = E(CG) + E(CG) + ...
The aim: to find the structure with the minimal free energy (MFE)
Why is MFE secondary
structure prediction hard?
• The MFE structure can be found by calculating the free
energy of all possible structures
• BUT the number of potential structures grows
exponentially with the number of bases
Solution: Dynamic programming (Zuker and Stiegler)
Simplifying assumptions for RNA
Structure Prediction
• RNA folds into one minimum free-energy
structure.
• The energy of a particular base pair can be
calculated independently
– Neighbors do not influence the energy.
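Under these simplifying assumptions the minimum energy can be found with a short dynamic program. The sketch below is a Nussinov-style recursion, not the Zuker and Stiegler algorithm itself: each base pair contributes a fixed energy regardless of its neighbors. The energy values and the minimum loop length are illustrative assumptions, not measured parameters.

```python
# Illustrative per-pair energies (NOT measured values): stronger pairs are lower.
PAIR_ENERGY = {
    frozenset("GC"): -3.0,
    frozenset("AU"): -2.0,
    frozenset("GU"): -1.0,
}

def min_energy(seq, min_loop=3):
    """Minimum total pair energy over all nested structures of seq.

    E[i][j] holds the best energy for the subsequence seq[i..j].
    min_loop is the fewest unpaired bases allowed inside a hairpin loop.
    """
    n = len(seq)
    if n == 0:
        return 0.0
    E = [[0.0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = E[i + 1][j]                      # case 1: base i stays unpaired
            for k in range(i + min_loop + 1, j + 1):
                e = PAIR_ENERGY.get(frozenset((seq[i], seq[k])))
                if e is None:
                    continue                        # bases i and k cannot pair
                inside = E[i + 1][k - 1]            # structure enclosed by the pair
                outside = E[k + 1][j] if k < j else 0.0
                best = min(best, e + inside + outside)  # case 2: i pairs with k
            E[i][j] = best
    return E[0][n - 1]

print(min_energy("GGGAAACCC"))               # three G-C pairs around an AAA loop
print(min_energy("GAUCUUGAUC", min_loop=2))  # the hairpin example, UU loop allowed
```

The table has one entry per pair of positions and each entry scans the positions in between, so the work grows polynomially with sequence length instead of exponentially.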
Sequence dependent free-energy
Nearest Neighbor Model
[Figure: two hairpins with a UU loop whose stems differ in one base pair]
The free energy of a base pair is influenced by
the previous base pair
(not by the base pairs further down the stem).
Sequence dependent free-energy
values of the base pairs
(nearest neighbor model)
These energies are estimated experimentally from small synthetic RNAs.
Example values (kcal/mol):
Stacked pairs   GC/AU   GC/GC   GC/CG   GC/UA
Energy          -2.3    -2.9    -3.4    -2.1
Improvements to the MFE approach
• Positive energy - added for destabilizing
regions such as bulges, loops, etc.
• More than one structure can be predicted
Free energy computation
[Figure: a hairpin with a 4 nt loop and a 1 nt bulge, annotated with the energy contribution of each element]
Contribution              Energy (kcal/mol)
4 nt loop                 +5.9
mismatch of hairpin       -1.1
stacking                  -2.9
1 nt bulge                +3.3
stacking                  -2.9
stacking                  -1.8
stacking                  -0.9
stacking                  -1.8
stacking                  -2.1
5’ dangling               -0.3
Total: ∆G = -4.6 kcal/mol
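The total is simply the sum of the annotated contributions. The short check below assumes that the -0.3 value belongs once to the 5’ dangling end, which is the reading that reproduces the stated total of -4.6 kcal/mol.

```python
# Energy contributions from the annotated structure, in kcal/mol.
# Assumption: -0.3 is counted once (the 5' dangling end).
contributions = [
    ("4 nt loop", 5.9),
    ("mismatch of hairpin", -1.1),
    ("stacking", -2.9),
    ("1 nt bulge", 3.3),
    ("stacking", -2.9),
    ("stacking", -1.8),
    ("stacking", -0.9),
    ("stacking", -1.8),
    ("stacking", -2.1),
    ("5' dangling", -0.3),
]

stabilizing = sum(v for _, v in contributions if v < 0)
destabilizing = sum(v for _, v in contributions if v > 0)
total = round(stabilizing + destabilizing, 1)  # rounded to hide float noise
print(total)
```

The loop and the bulge are the only positive (destabilizing) terms; the stacking terms outweigh them.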
Improvements to the MFE approach
• Looking for an ensemble of structures with
low energy and generating a consensus
structure
Why?
RNA is dynamic and does not always fold into
the lowest-energy structure
RNA fold prediction based on
Multiple Alignment
Information from multiple sequence alignment
(MSA) can help predict the probability that
positions i and j are base-paired.
G C C U U C G G G C
G A C U U C G G U C
G G C U U C G G C C
Compensatory Substitutions
Mutations that maintain the secondary structure
can help predict the fold
[Figure: hairpin with a UU loop, illustrating a stem base pair that changes while the stem is maintained]
RNA secondary structure can be revealed by
identification of compensatory mutations
G C C U U C G G G C
G A C U U C G G U C
G G C U U C G G C C
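[Figure: hairpin inferred from the alignment, with a UC loop and stem pairs U-G, C-G, N-N’ and G-C, where N-N’ is the pair that varies between sequences]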
Insight from Multiple Alignment
Information from multiple sequence alignment
(MSA) can help predict the
probability that positions i and j are base-paired.
• Conservation – no additional information
• Consistent mutations (e.g. GC → GU) – support the stem
• Inconsistent mutations – do not support the stem
• Compensatory mutations – support the stem
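The four cases can be told apart by comparing two alignment columns. The sketch below uses the three aligned sequences from the slides; the classification rule (a change in one column is consistent, a change in both is compensatory) is a simplified reading of the bullets, and the function name is made up for illustration.

```python
CAN_PAIR = {"GC", "CG", "AU", "UA", "GU", "UG"}

def classify_columns(alignment, i, j):
    """Classify alignment columns i and j (0-based) as a candidate base pair."""
    pairs = [seq[i] + seq[j] for seq in alignment]
    if any(p not in CAN_PAIR for p in pairs):
        return "inconsistent"      # some sequence cannot form the pair
    varies_i = len({p[0] for p in pairs}) > 1
    varies_j = len({p[1] for p in pairs}) > 1
    if varies_i and varies_j:
        return "compensatory"      # both sides changed, pairing kept
    if varies_i or varies_j:
        return "consistent"        # one side changed, pairing kept (e.g. GC -> GU)
    return "conserved"             # no change, so no additional information

# The alignment shown in the slides.
alignment = ["GCCUUCGGGC",
             "GACUUCGGUC",
             "GGCUUCGGCC"]

for i, j in [(0, 9), (1, 8), (2, 7), (3, 6), (4, 5)]:
    print(i, j, classify_columns(alignment, i, j))
```

Columns 1 and 8 (0-based) carry the pairs C-G, A-U and G-C, which is the compensatory signal that supports the stem.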
From RNA structure to
Function
Many families of non-coding RNAs with
unique functions are characterized by
a combination of conserved sequence
and structure
MicroRNAs
[Figure: a miRNA gene gives rise to a mature miRNA, which acts on a target gene]
MicroRNA in Cancer
Sun et al, 2012
The challenge for Bioinformatics:
- Identifying new microRNA genes
- Identifying the targets of specific microRNA
How to find microRNA genes?
Searching for sequences that fold into a hairpin of ~70 nt
- RNAfold
- Other efficient algorithms for identifying stem-loops
Concentrating on intergenic regions and introns
- Filtering out coding regions
Filtering out non-conserved candidates
- Mature miRNAs and pre-miRNAs are usually evolutionarily conserved
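The structure filter can be sketched as a check on a predicted structure in dot-bracket notation, the format that RNAfold prints. The sketch below does not call RNAfold; it only inspects a structure string that is assumed to come from such a tool. The thresholds are hypothetical defaults, not published criteria.

```python
def hairpin_stats(structure):
    """Return (number of pairs, loop length) for a single stem-loop, else None.

    A single stem-loop has every '(' before every ')'.
    """
    opens = structure.count("(")
    closes = structure.count(")")
    if opens == 0 or opens != closes:
        return None
    last_open = structure.rfind("(")
    first_close = structure.find(")")
    if last_open > first_close:
        return None                      # several stem-loops or a branched fold
    return opens, first_close - last_open - 1

def looks_like_pre_mirna(structure, min_pairs=20, max_loop=20):
    """Hypothetical filter: one hairpin with enough pairs and a modest loop."""
    stats = hairpin_stats(structure)
    if stats is None:
        return False
    n_pairs, loop_length = stats
    return n_pairs >= min_pairs and loop_length <= max_loop

# A 70 nt candidate: 30 paired positions on each arm around a 10 nt loop.
candidate = "(" * 30 + "." * 10 + ")" * 30
print(hairpin_stats(candidate), looks_like_pre_mirna(candidate))
```

Candidates that pass would then go through the coding-region and conservation filters.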
How to find microRNA genes?
A. Structure prediction
B. Evolutionary Conservation
Predicting microRNA targets
MicroRNA targets are located in 3’ UTRs and
are complementary to mature microRNAs
• Why is it hard to find them?
– Base pairing is required only in the seed sequence
(7-8 nt)
– Many known miRNAs have similar seed sequences
Such short matches have a very high probability of occurring by chance
[Figure: mature miRNA paired with the 3’ UTR of the target gene]
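A seed scan, and the reason it is noisy, can be sketched as follows. The sketch assumes the seed is nucleotides 2-8 of the mature miRNA (a common convention, not stated in the slides) and that the UTR is random with equal base frequencies. The sequences and function names are made up for illustration.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")

def seed_site(mirna, start=1, length=7):
    """Target-site sequence (5'->3') that pairs with the miRNA seed.

    Assumes the seed is nucleotides 2-8 of the miRNA (0-based slice 1:8).
    """
    seed = mirna[start:start + length]
    return seed.translate(COMPLEMENT)[::-1]   # reverse complement

def find_seed_matches(mirna, utr):
    """0-based start positions of perfect seed matches in the 3' UTR."""
    site = seed_site(mirna)
    return [i for i in range(len(utr) - len(site) + 1)
            if utr[i:i + len(site)] == site]

def expected_chance_matches(utr_length, seed_length=7):
    """Expected perfect matches in a random UTR with equal base frequencies."""
    return (utr_length - seed_length + 1) / 4 ** seed_length

# Made-up miRNA and 3' UTR, for illustration only.
mirna = "UACGGAUCCGAUUGCAAGCUAA"
utr = "AAGAUCCGUAACCGAUCCGUUU"
print(seed_site(mirna), find_seed_matches(mirna, utr))
print(expected_chance_matches(16390))
```

With a 7 nt seed, one perfect match is expected by chance in roughly every 16,000 nt of random sequence. This is why seed matching alone yields many false positives across thousands of UTRs and many miRNAs.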
Predicting microRNA target genes
• General methods
– Find motifs that complement the seed
sequence (allowing mismatches)
– Look for conserved target sites
– Consider the MFE of the RNA-RNA pairing:
∆G(miRNA+target)
– Consider the delta MFE of the RNA-RNA pairing
versus the folding of the target:
∆G(miRNA+target) - ∆G(target)