MOTIFS MOTIFSMARTIFAMORIFSMOOTIFSMICIFC

A sequence motif is a nucleotide or
amino-acid sequence pattern that is
widespread (repeated) and has or is
conjectured to have a biological
significance. Sequence motifs may be
identical to each other or they may vary
to a greater or lesser extent.
Domains, Patterns,
Motifs, Repeats?
For proteins, a sequence motif is distinguished
from a structural motif, i.e., a motif formed by the
three dimensional arrangement of amino acids,
which may not be adjacent.
Example: N-glycosylation site motif
Asn, followed by anything but Pro, followed by
either Ser or Thr, followed by anything but Pro.
When a sequence motif appears in protein-coding regions, it
may specify a "structural motif" of a protein. Short coding
motifs in proteins include sites that label proteins for delivery
to particular parts of a cell, or mark them for phosphorylation.
Noncoding sequences contain functional (i.e., regulatory)
sequence motifs and motifs that are just "junk," such as
satellite DNA.
Functional motifs in DNA play different roles, such as binding
sites for proteins.
The discipline of bioinformatics concerns itself with the finding
and the sequence characterization of motifs through
computer-based techniques of sequence analysis.
Motif notation
Consider the N-glycosylation site motif:
Asn, followed by anything but Pro, followed by either
Ser or Thr, followed by anything but Pro.
This pattern may be written as:
N{P}[ST]{P}
where N = Asn, P = Pro, S = Ser, T = Thr; {X} means
any amino acid except X; and [XY] means either X or
Y. The notation [XY] does not give any indication of
the probability of X or Y occurring in the pattern.
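As a rough sketch, the pattern N{P}[ST]{P} can be hand-translated into a regular expression, with {P} becoming [^P]. The helper name find_n_glyc_sites and the test peptide are illustrative assumptions and do not come from the slides.

```python
import re

# N{P}[ST]{P}: {P} ("anything but Pro") becomes [^P]; [ST] carries over unchanged.
# The pattern sits inside a lookahead so that overlapping sites are all reported.
N_GLYC = re.compile(r"(?=(N[^P][ST][^P]))")

def find_n_glyc_sites(protein):
    """Return (0-based start, matched 4-mer) for every candidate site."""
    return [(m.start(), m.group(1)) for m in N_GLYC.finditer(protein.upper())]

# Hypothetical peptide, made up for illustration only.
peptide = "MKNGSAANPTLNKTQ"
sites = find_n_glyc_sites(peptide)
print(sites)
```

The Asn at index 7 is skipped because it is followed by Pro. A match only marks a candidate site; the pattern says nothing about whether the site is actually glycosylated.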
Identifying motifs: The challenge
• A microarray experiment showed that
when gene X is knocked out, 20 other
genes are not expressed
– How can one gene have such
drastic effects?
Identifying motifs: The challenge
• Gene X encodes a regulatory protein, such as a
transcription factor (TF)
• The 20 unexpressed genes rely on the gene product
(TF) to induce transcription
• A single TF may regulate multiple genes
Identifying motifs: The challenge
• Every gene contains a regulatory region (RR)
typically stretching 100-1000 bp upstream of the
transcriptional start site
• Located within the RR are the Transcription
Factor Binding Sites (TFBS), also known as
motifs, specific for a given transcription factor
• TFs influence gene expression by binding to the
TFBS in the regulatory region of the gene.
Identifying motifs: The challenge
• A motif can be located anywhere within
the Regulatory Region.
• Motifs may vary across different
regulatory regions.
Motifs and Transcriptional Start Sites
[Figure: five genes, each shown with one motif instance: ATCCCG, TTCCGG, ATCCCG, ATGCCG, ATGCCC]
Why is finding motifs difficult?
Step 1: Start with random sequence
atgaccgggatactgataccgtatttggcctaggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatactgggcataaggtaca
tgagtatccctgggatgacttttgggaacactatagtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgaccttgtaagtgttttccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatggcccacttagtccacttatag
gtcaatcatgttcttgtgaatggatttttaactgagggcatagaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtactgatggaaactttcaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttggtttcgaaaatgctctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatttcaacgtatgccgaaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttctgggtactgatagca
Why is finding motifs difficult?
Step 2: Implant motif AAAAAAAAGGGGGGG
atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGGa
tgagtatccctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttatag
gtcaatcatgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttAAAAAAAAGGGGGGGa
Where is the implanted motif?
atgaccgggatactgataaaaaaaagggggggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataaaaaaaaaggggggga
tgagtatccctgggatgacttaaaaaaaagggggggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgaaaaaaaagggggggtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaataaaaaaaagggggggcttatag
gtcaatcatgttcttgtgaatggatttaaaaaaaaggggggggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtaaaaaaaagggggggcaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttaaaaaaaagggggggctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcataaaaaaaagggggggaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttaaaaaaaaggggggga
Implanting Motif AAAAAAAAGGGGGGG
with Four Mutations
atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa
tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag
gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa
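The construction shown on these slides (random background, then one mutated copy of the motif per sequence) might be sketched as follows. The function names, the sequence length of 82, and the fixed seed are assumptions made for this illustration only.

```python
import random

BASES = "acgt"

def mutate(motif, d, rng):
    """Return a lower-case copy of motif with exactly d positions changed."""
    chars = list(motif.lower())
    for pos in rng.sample(range(len(chars)), d):
        chars[pos] = rng.choice([b for b in BASES if b != chars[pos]])
    return "".join(chars)

def implant(motif, d, n_seqs, seq_len, rng):
    """Build n_seqs random sequences, each carrying one mutated copy of motif.

    Returns (sequence, start) pairs. Unmutated motif positions are shown in
    upper case and mutated ones in lower case, mirroring the slides.
    """
    result = []
    for _ in range(n_seqs):
        background = [rng.choice(BASES) for _ in range(seq_len)]
        start = rng.randrange(seq_len - len(motif) + 1)
        variant = mutate(motif, d, rng)
        shown = [v.upper() if v == m.lower() else v for v, m in zip(variant, motif)]
        background[start:start + len(motif)] = shown
        result.append(("".join(background), start))
    return result

rng = random.Random(0)  # fixed seed so the run is repeatable
MOTIF = "AAAAAAAAGGGGGGG"
planted = implant(MOTIF, d=4, n_seqs=10, seq_len=82, rng=rng)
for seq, start in planted:
    print(seq)
```

With d=0 the implanted copies are identical and trivially found by exact search; with d=4 no two copies need to look alike, which is the difficulty the next slide illustrates.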
Why Is Finding a (15,4) Motif
Difficult?
atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa
tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag
gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa
AgAAgAAAGGttGGG
..|..|||.|..|||
cAAtAAAAcGGcGGG
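The comparison above can be reproduced with a small Hamming-distance sketch. The function names are illustrative; the three sequences are the motif and the first two implanted instances from the slides.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ (case-insensitive)."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a.upper(), b.upper()))

def match_line(a, b):
    """Draw '|' under matching positions and '.' under mismatches."""
    return "".join("|" if x == y else "." for x, y in zip(a.upper(), b.upper()))

motif = "AAAAAAAAGGGGGGG"
inst1 = "AgAAgAAAGGttGGG"
inst2 = "cAAtAAAAcGGcGGG"

print(hamming(motif, inst1), hamming(motif, inst2))
print(match_line(inst1, inst2), hamming(inst1, inst2))
```

Each instance is only 4 mutations away from the motif, yet the two instances differ from each other at 7 of 15 positions, so comparing instances pairwise gives a much weaker signal than comparing each one to the (unknown) motif.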
Discovery of Motifs
1. consensus sequences
The notation [XYZ] means X or Y or Z, but
does not indicate the likelihood of any
particular match. For this reason, two or more
patterns are often associated with a single
motif. It is sometimes advisable to look at
consensus sequences and refine the definition
of a motif.
Discovery of Motifs
1. consensus sequences
Rigorously, the IQ motif is:
[FILV]Qxxx[RK]Gxxx[RK]xx[FILVWY]
where x = any amino acid, and the square
brackets indicate alternatives.
Usually, the first amino acid is I, the two [RK]
choices are R, and xx[FILVWY] is so undefined
that it can be ignored. Thus, the consensus is:
IQxxxRGxxxR
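To make the trade-off between the rigorous pattern and the consensus concrete, here is a minimal sketch that turns both into regular expressions. The helper to_regex and the two peptides are hypothetical, invented only for this illustration.

```python
import re

def to_regex(pattern):
    """Translate the slide notation into a regex: 'x' (any residue) becomes '.'."""
    return re.compile(pattern.replace("x", "."))

STRICT = to_regex("[FILV]Qxxx[RK]Gxxx[RK]xx[FILVWY]")
CONSENSUS = to_regex("IQxxxRGxxxR")

# Hypothetical peptides, invented for illustration.
typical = "AAIQKAFRGHLARKKLAA"    # uses the usual I and R choices
atypical = "AALQKAFKGHLAKKKWAA"   # uses the rarer L and K alternatives

for name, pep in [("typical", typical), ("atypical", atypical)]:
    print(name, bool(STRICT.search(pep)), bool(CONSENSUS.search(pep)))
```

The consensus is shorter and easier to read, but it misses the second peptide, which the rigorous pattern still accepts.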
Discovery of Motifs
2. Discovery through evolutionary conservation
Motifs may be discovered by comparing homologous genes
from different species. For example, by aligning the amino acid
sequences specified by the GCM (glial cells missing) gene in
man, mouse and D. melanogaster, a pattern was discovered
(the GCM motif) that spans about 150 amino acids, and begins
as follows:
WDIND*.*P..*...D.F.*W***.**.IYS**...A.*H*S*WAMRNTNNHN
Here each . signifies a single amino acid or a gap, and each *
indicates one member of a closely-related amino-acid family.
Subsequently, it was shown that the motif has DNA binding
activity.
Motif Logo
• Motifs can mutate at
unimportant bases
• The five motifs in five
different genes have
mutations in positions 3
and 5
• Representations called
motif logos illustrate the
conserved and variable
regions of a motif
TGGGGGA
TGAGAGA
TGGGGGA
TGAGAGA
TGAGGGA
Motif Logos: an Example
(http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)
Measure of Conservation
• Relative heights of letters reflect their abundance in
the alignment.
• Total height = entropy-based measurement of
conservation.
• Entropy(i) = -SUM over all bases { f(base, i) * ln[f(base, i)] }
• Entropy measures variability/disorder.
– Highly conserved = low entropy = tall stack
– Highly variable = high entropy = low stack
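As a small worked example, the entropy formula can be applied column by column to the five motif instances from the motif logo slide (TGGGGGA, TGAGAGA, ...). The function names are illustrative assumptions; the natural logarithm is used because the slide writes ln.

```python
import math
from collections import Counter

def column_entropy(column):
    """Entropy(i) = -sum over bases of f * ln(f), with f the observed frequency."""
    counts = Counter(column)
    total = len(column)
    return sum(-(c / total) * math.log(c / total) for c in counts.values())

def entropy_profile(instances):
    """Entropy of each column of equal-length, aligned motif instances."""
    return [column_entropy(col) for col in zip(*instances)]

instances = ["TGGGGGA", "TGAGAGA", "TGGGGGA", "TGAGAGA", "TGAGGGA"]
profile = entropy_profile(instances)
for i, h in enumerate(profile, start=1):
    print(i, round(h, 3))
```

Positions 3 and 5 mix A and G and have entropy of about 0.673; every other position is perfectly conserved and has entropy 0, so it would appear as a tall stack in the logo.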
Identifying Motifs: Complications
• We do not know the motif sequence
• We do not know where it is located relative to some
genomic landmark (say, gene start)
• Motifs can differ from one another
• The pattern may not be an exact sequence or an
approximate sequence but something like “4-8
hydrophobic amino acids, followed by 2-3 leucines
or isoleucines, followed by 2 phenylalanines and an
aspartic acid or 1 aspartic acid and two glycines.”
Discovery of Motifs
3. De novo computational discovery of
motifs
A Motif Finding Analogy
• The Motif Finding Problem is similar to
the problem posed by Edgar Allan Poe
(1809–1849) in The Gold Bug
"The Gold-Bug" is a story of a man named William Legrand
who seemingly goes mad after being bitten by a bug thought to
be made of pure gold. He notifies his closest friend, the
narrator, telling him to immediately come visit him at his home
on Sullivan's Island in South Carolina. The two embark upon a
search for lost treasure along with a servant named Jupiter.
The narrator doubts Legrand’s sanity. However, after following
several clues, they find a treasure buried by the infamous pirate
"Captain Kidd," that is estimated to be worth about fourteen
million dollars.
Among the clues, there is a secret
message.
The Gold Bug Problem
• Given a secret message:
53++!305))6*;4826)4+.)4+);806*;48!8`60))85;]8*:+*8!83(88)5*!;
46(;88*96*?;8)*+(;485);5*!2:*+(;4956*2(5*-4)8`8*;4069285);)6
!8)4++;1(+9;48081;8:8+1;48!85;4)485!528806*81(+9;48;(88;4(+?3
4;48)4+;161;:188;+?;
• Decipher the message encrypted in
the fragment
Hints for The Gold Bug
Problem
• Additional hints:
– The encrypted message is in English
– Each symbol corresponds to one letter
in the English alphabet
– No punctuation marks are encoded
The Gold Bug Problem: Symbol Counts
• Naive approach to solving the problem:
– Count the frequency of each symbol in the
encrypted message
– Find the frequency of each letter in the
alphabet in the English language
– Compare the frequencies of the previous
steps, try to find a correlation and map the
symbols to a letter in the alphabet
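The naive approach above could be sketched like this. The function names are assumptions for illustration, the English letter ordering is the one these slides use, and the fragment is just the beginning of the encrypted message.

```python
from collections import Counter

ENGLISH_BY_FREQUENCY = "etaoinsrhldcumfpgwybvkxjqz"  # ordering used in the slides

def symbol_counts(ciphertext):
    """Count each symbol, ignoring whitespace, most frequent first."""
    return Counter(ch for ch in ciphertext if not ch.isspace()).most_common()

def naive_key(ciphertext):
    """Pair the k-th most frequent symbol with the k-th most frequent letter."""
    ranked = [sym for sym, _ in symbol_counts(ciphertext)]
    return dict(zip(ranked, ENGLISH_BY_FREQUENCY))

def decode(ciphertext, key):
    """Replace symbols found in key; leave everything else untouched."""
    return "".join(key.get(ch, ch) for ch in ciphertext)

fragment = "53++!305))6*;4826)4+.)4+);806*;48!8`60))85"
key = naive_key(fragment)
print(symbol_counts(fragment)[:5])
print(decode(fragment, key))
```

Ties between equally frequent symbols are broken by order of first appearance here, which is one reason a pure rank-for-rank mapping is fragile on a short message.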
Symbol Frequencies in the Gold Bug Message
• Gold Bug Message:
Symbol     8  ;  4  )  +  *  5  6  (  !  1  0  2  9  3  :  ?  `  -  ]  .
Frequency 34 25 19 16 15 14 12 11  9  8  7  6  5  5  4  4  3  2  1  1  1
• English Language:
etaoinsrhldcumfpgwybvkxjqz
Most frequent
Least frequent
The Gold Bug Message Decoding: First
Attempt
• By simply mapping the most frequent
symbols to the most frequent letters of
the alphabet:
sfiilfcsoorntaeuroaikoaiotecrntaeleyrcooestvenp
inelefheeosnltarhteenmrnwteonihtaesotsnlupnihta
msrnuhsnbaoeyentacrmuesotorleoaiitdhimtaecedtep
eidtaelestaoaeslsueecrnedhimtaetheetahiwfataeoa
itdrdtpdeetiwt
• The result does not make sense
The Gold Bug Problem: l-tuple
count
• A better approach:
– Examine frequencies of l-tuples,
combinations of 2 symbols, 3 symbols,
etc.
– “The” is the most frequent 3-tuple in
English and “;48” is the most frequent 3-tuple in the encrypted text
– Make inferences of unknown symbols
by examining other frequent l-tuples
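Counting l-tuples is a sliding-window tally. A minimal sketch, assuming the message exactly as transcribed on the earlier slide (whitespace is stripped before counting); the function name tuple_counts is illustrative.

```python
from collections import Counter

def tuple_counts(text, l):
    """Count every overlapping run of l consecutive symbols (whitespace removed)."""
    symbols = "".join(text.split())
    return Counter(symbols[i:i + l] for i in range(len(symbols) - l + 1))

message = (
    "53++!305))6*;4826)4+.)4+);806*;48!8`60))85;]8*:+*8!83(88)5*!;"
    "46(;88*96*?;8)*+(;485);5*!2:*+(;4956*2(5*-4)8`8*;4069285);)6"
    "!8)4++;1(+9;48081;8:8+1;48!85;4)485!528806*81(+9;48;(88;4(+?3"
    "4;48)4+;161;:188;+?;"
)
top3 = tuple_counts(message, 3).most_common(3)
print(top3)
```

The same function with l=2 gives pair counts, which can be compared against common English digraphs in the same way.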
The Gold Bug Problem: the ;48 clue
• Mapping “the” to “;48” and substituting all
occurrences of the symbols:
53++!305))6*the26)h+.)h+)te06*the!e`60))e5t]e*:+*e!e3(ee)5*
!th6(tee*96*?te)*+(the5)t5*!2:*+(th956*2(5*-h)e`e*th0692e5)t
)6!e)h++t1(+9the0e1te:e+1the!e5th)he5!52ee06*e1(+9thet(eeth
(+?3hthe)h+t161t:1eet+?t
The Gold Bug Message Decoding: Second
Attempt
• Make inferences:
53++!305))6*the26)h+.)h+)te06*the!e`60))e5t]e*:+*e!e3(e
e)5*!th6(tee*96*?te)*+(the5)t5*!2:*+(th956*2(5*-h)e`e*th
0692e5)t)6!e)h++t1(+9the0e1te:e+1the!e5th)he5!52ee06*e1
(+9thet(eeth(+?3hthe)h+t161t:1eet+?t
• “thet(ee” most likely means “the tree”
– Infer “(” = “r”
• “th(+?3h” becomes “thr+?3h”
– Can you guess “+”, “?”, and “3”?
Answer: “o”, “u”, “g” (“through”)
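The decoding proceeds by growing a partial key one inference at a time. The sketch below assumes a simple dictionary-based substitution; the name apply_partial_key is made up for this illustration.

```python
def apply_partial_key(ciphertext, key):
    """Substitute only the symbols decoded so far; others stay as they are."""
    return "".join(key.get(ch, ch) for ch in ciphertext)

key = {";": "t", "4": "h", "8": "e"}  # from the ";48" = "the" clue
first_line = "53++!305))6*;4826)4+.)4+);806*;48!8`60))85;]8*:+*8!83(88)5*"
print(apply_partial_key(first_line, key))

# Extend the key as new inferences are made.
snippet = ";48;(88;4(+?34"
key["("] = "r"                               # "thet(ee" read as "the tree"
print(apply_partial_key(snippet, key))
key.update({"+": "o", "?": "u", "3": "g"})   # "thr+?3h" read as "through"
print(apply_partial_key(snippet, key))
```

Each new letter exposes more partial words, which in turn suggest the next inference.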
The Gold Bug Problem: The
Solution
• The final message is:
AGOODGLASSINTHEBISHOPSHOSTELINTHEDEVILSSEATTWENTYONE
DEGREESANDTHIRTEENMINUTESNORTHEASTANDBYNORTHMAINBR
ANCHSEVENTHLIMBEASTSIDESHOOTFROMTHELEFTEYEOFTHEDEA
THSHEADABEELINEFROMTHETREETHROUGHTHESHOTFIFTYFEETO
UT
The Solution (cont’d)
• Punctuation (akin to annotation) is
important:
A GOOD GLASS IN THE BISHOP’S HOSTEL IN THE DEVIL’S
SEAT, TWENTY ONE DEGREES AND THIRTEEN MINUTES NORTHEAST
AND BY NORTH, MAIN BRANCH SEVENTH LIMB, EAST SIDE,
SHOOT FROM THE LEFT EYE OF THE DEATH’S HEAD A BEE LINE
FROM THE TREE THROUGH THE SHOT, FIFTY FEET OUT.
Solving the Gold Bug Problem
• Prerequisites to solve the problem:
– Need to know the relative frequencies
of single letters, and combinations of
two and three letters in English.
– Knowledge of all the words in the
English dictionary is highly desirable.
Motif Finding and The Gold Bug Problem:
Similarities
– Nucleotides in motifs encode a message in the
“genetic” language. Symbols in “The Gold Bug”
encode a message in English.
– In order to solve the problem, we analyze the
frequencies of patterns in DNA/Gold Bug
message.
– Knowledge of established regulatory motifs makes
the Motif Finding problem simpler. Knowledge of
the words in the English dictionary helps to solve
The Gold Bug problem.
Similarities (cont’d)
• Motif Finding:
– In order to solve the problem, we analyze the
frequencies of patterns in the nucleotide sequences
• The Gold Bug Problem:
– In order to solve the problem, we analyze the
frequencies of patterns in the text written in English
Similarities (cont’d)
• Motif Finding:
– Knowledge of established motifs reduces
the complexity of the problem
• The Gold Bug Problem:
– Knowledge of the words in the dictionary is
highly desirable
Motif Finding and The Gold Bug Problem:
Differences
Motif Finding is harder than the Gold Bug
problem:
– We don’t have the complete dictionary of
motifs
– The “genetic” language does not have a
standard “grammar”
– Only a small fraction of nucleotide
sequences encode motifs; the size of
data is enormous
So, what do we do?
We use whatever knowledge
we have, and teach the
computer program to look for
elements that abide by these
rules.
• Similarity to something known
• Strand specificity (cis to the gene)
• Knowledge of length distribution
• May have known folds
• Taxonomic distribution
• Position specificity
PROSITE
• Founded by Amos Bairoch
• 1988 First release in the PC/Gene software
• 1990 Synchronisation with Swiss-Prot
• 1994 Integration of « profiles »
• 1999 PROSITE joins InterPro
• Release 20.57, of 23-Nov-2009
• Contains biological annotation in addition to
sequences.
– catalytic, metal binding, S-S bridge, cofactor
binding, prosthetic group, PTM
PROSITE Format (Pattern)
Regular Expression Language (REGEXP)
• Pattern: <A-x-[ST](2)-x(0,1)-{V}
• Regexp: ^A.[ST]{2}.?[^V]
• Text: The sequence must start with an
alanine, followed by any amino acid, followed
by a serine or a threonine, two times,
followed by any amino acid or nothing,
followed by any amino acid except a valine.
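The translation from pattern to regular expression is mechanical for the elements shown on this slide. The converter below is a hedged sketch, not an official PROSITE tool: prosite_to_regex is a hypothetical name, and only the listed elements are handled.

```python
import re

def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern to a Python regex string.

    Handles only the elements used on the slide: '-' separators, 'x',
    [..] alternatives, {..} exclusions, (n) / (n,m) repeats, and the
    '<' / '>' sequence anchors. Other PROSITE features are not covered.
    """
    element_re = r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?"
    out = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):
            prefix, element = "^", element[1:]
        if element.endswith(">"):
            suffix, element = "$", element[:-1]
        m = re.fullmatch(element_re, element)
        if m is None:
            raise ValueError(f"unsupported element: {element!r}")
        core, lo, hi = m.groups()
        if core == "x":
            core = "."                          # any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # anything except the listed residues
        repeat = "" if lo is None else "{" + lo + ("," + hi if hi else "") + "}"
        out.append(prefix + core + repeat + suffix)
    return "".join(out)

regex = prosite_to_regex("<A-x-[ST](2)-x(0,1)-{V}")
print(regex)
print(bool(re.search(regex, "AGSTL")), bool(re.search(regex, "AGSTV")))
```

The converter writes x(0,1) as .{0,1}, which is equivalent to the .? shown on the slide.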
