Motif finding

Download Report

Transcript Motif finding

Motif finding: Lecture 1
CS 498 CXZ
From DNA to Protein: In words
1. DNA = nucleotide sequence
•
Alphabet size = 4 (A,C,G,T)
2. DNA  mRNA (single stranded)
•
AATACGAAGTAA
Alphabet size = 4 (A,C,G,U)
AAUACGAAGUAA
3. mRNA  amino acid sequence
•
Alphabet size = 20
4. Amino acid sequence “folds” into 3dimensional molecule called protein
Asn Thr Lys Stop
Gene expression
• Process of making a protein from a
gene as template
• Transcription, then translation
• Can be regulated
Transcription
• Process of making a single stranded
mRNA using double stranded DNA as
template
• Only genes are transcribed, not all DNA
Step 1: From DNA to mRNA
Transcription
SOURCE: http://academy.d20.co.edu/kadets/lundberg/DNA_animations/rna.dcr
Transcriptional regulation
TRANSCRIPTION
FACTOR
GENE
ACAGTGA
PROTEIN
Transcriptional regulation
TRANSCRIPTION
FACTOR
GENE
ACAGTGA
PROTEIN
The importance of gene
regulation
QuickTime™ and a
TIFF (Uncompressed) decompressor
are needed to see this picture.
Genetic regulatory network controlling the development of the body plan of the sea urchin embryo
Davidson et al., Science, 295(5560):1669-1678.
• That was the “circuit” responsible for
development of the sea urchin embryo
• Nodes = genes
• Switches = gene regulation
• Change the switches and the circuit
changes
• Gene regulation significance:
– Development of an organism
– Functioning of the organism
– Evolution of organisms
Binding sites and motifs
Binding sites
• Binding sites of transcription factor
“Bicoid”, collected experimentally
http://webdisk.berkeley.edu/~dap5/data_04/motifs/bicoid.gif
QuickTime™ and a
TIFF (Uncompressed) decompressor
are needed to see this picture.
TAAT C C C
http://webdisk.berkeley.edu/~dap5/data_04/motifs/bicoid.gif
QuickTime™ and a
TIFF (Uncompressed) decompressor
are needed to see this picture.
Motif (“Consensus
String”)
WAAT C C N
QuickTime™ and a
TIFF (Uncompressed) decompressor
are needed to see this picture.
http://webdisk.berkeley.edu/~dap5/data_04/motifs/bicoid.gif
W = T or A
N = A,C,G,T
Motif
Motif
• Common sequence “pattern” in the
binding sites of a transcription factor
• A succinct way of capturing variability
among the binding sites
Alternative way to represent motif
1 1 9 9 0 0 0 1 A
6 0 0 0 0 9 8 7 C
1 0 0 0 QuickTime™
1 0 0 1 G
and a
TIFF (Uncompressed) decompressor
are needed to see this picture.
1 8 0 0 8 0 1 0 T
Position weight matrix (PWM)
Or simply, “weight matrix”
Motif representation
• Consensus string
– May allow “degenerate” symbols in string,
e.g., N = A/C/G/T; W = A/T; S = C/G; R =
A/G; Y = T/C etc.
• Position weight matrix
– More powerful representation
– Probabilistic treatment
The motif finding problem
• Suppose a transcription factor (TF)
controls five different genes
• Each of the five genes should have
binding sites for TF in their promoter
region
Gene 1
Gene 2
Gene 3
Gene 4
Gene 5
Binding sites for TF
The motif finding problem
• Now suppose we are given the promoter
regions of the five genes G1, G2, … G5
• Can we find the binding sites of TF, without
knowing about them a priori ?
– Binding sites are similar to each other, but not
necessarily identical
• This is the motif finding problem
• To find a motif that represents binding sites of
an unknown TF
A variant of motif finding
• Given a motif (e.g., consensus string, or
weight matrix), find the binding sites in
an input sequence
• For consensus string, problem is trivial
– For each position l in input sequence,
check if substring starting at position l
matches the motif.
• For weight matrix, not so trivial
Binding sites from a weight
matrix motif
• Given a string
1 1 9 s9 of
0 0length
0 1 A l = 7
• s = s1s2…sl
6 0 0 0 0 9 8 7 C
W
• Pr(s | W) =1 0 0 0 W
1 s0 k0 1 G

k
.11 .11 1
1
0
0
0 .11
.67 0
0
0
0
1 .89 .78
.11 0
0
0 .11 0
0 .11
A
C
G
k
• Example: 1 8 0 0 8 0 1 0 T
Pr(CTAATCCG) =
0.67 x 0.89Counts
x 1 xof 1each
x 0.89
base
each column
x 1 x0.89 xIn 0.11
.11 .89 0
0 .89 0 .11 0
T
Probability of each base
In each column
Wk = probability of base  in column
k
Binding sites from a weight
matrix motif
• Given sequence S (e.g., 1000 base-pairs long)
• For each substring s of S,
– Compute Pr(s|W)
– If Pr(s|W) > some threshold, call that a binding site
• Look at S, as well as its “reverse complement”
– Rev.Compl. of AGTTACACCA is TGGTGTAACT
– (That’s what is on the other strand of DNA)
Ab initio motif finding
• The original motif finding problem
• To find a motif that represents binding
sites of an unknown TF
Ab initio motif finding
• Define a motif score, find the motif with
maximum score over all possible motifs
in search space (motif model)
• Consensus string model => exhaustive
search algorithm, guarantee on finding
the optimal motif
• PWM model => local search, not
guaranteed to find optimal motif.
Ab initio motif finding consensus string motifs
• A precise motif model defines the
search space (I.e., a list of all candidate
motifs).
• The motif model also prescribes exactly
how to determine if a substring is a
match to a particular motif.
• Define motif model precisely
Ab initio motif finding consensus string motifs
• E.g., string over alphabet {A,C,G,T} of fixed length l. If
l = 4, all 256 strings AAAA, AAAT, AAAC, …, TTTT, are
“candidate motifs”.
• E.g., string over alphabet {A,C,G,T} of fixed length l,
and allowing up to d mismatches. If AAAA is a motif,
and d=1, then AAAT, AATA etc. are also counted as
matches to motif.
• E.g., string over extended alphabet {A,C,G,T,N} of
fixed length l. Here “N” stands for any character
(A,C,G,or T.)
– If AANAA is the motif, then AACAA, AAGAA, AATAA or
AAAAA are all counted as matches to this motif.
Ab initio motif finding consensus string motifs
• Define a motif score, i.e., a real number
associated with each candidate motif, in
relation to the input sequences.
• E.g., count Ns of a motif s in input
sequences(s).
• E.g., some function of the motif count Ns.
– E.g., Zs = (Ns - Es)/s
– Es is the expected count of motif s in random
sequences; and
– s is the variance of the count in random sequences
Ab initio motif finding consensus string motifs
• For each motif s in the search space,
– Compute the score of s
• Output the highest scoring motifs.
• This is the “enumerative” algorithm.
• Guaranteed to produce the optimal motif, since every
possible motif is considered.
• Guarantee possible due to small search space. (E.g., 4l
where l is the motif length).
• Cant handle large values of l (e.g., > 10) : exponential
growth of running time.
Ab initio motif finding PWM motifs
•
•
•
•
Local search techniques, e.g.,
Gibbs sampling
Expectation Maximization
Greedy
Gibbs sampling: The search space
• Input: a set of sequences {S1,S2,…,Sn}
• Input: motif length l
• Candidate motif: A set of substrings
{s1,s2,…,sn}, each of length l, one from
each Si.
• Search space: all possible candidate
motifs
– O(Ln) where L is length of each Si.
Gibbs sampling: algorithm
• Consider any candidate motif
{s1,s2,…,sn},where each si is of length l
• Let Wk be the frequency of base  in
the kth position of the candidate motif
– Pr(s|W) =
W
sk k
k
• Let “background” (genome-wide)
frequency of nucleotide  be q

Gibbs sampling: algorithm
• Let current motif be Wt = {s1,s2,…,sn}
• Pick one si to replace
• For each substring s’ in Si, replace si
with s’ and compute
W sk k

Pr(s'|W )
Pr(s') 

Pr(s | background)  qsk
Gibbs sampling: algorithm
• Pick s’ with probability proportional to
Pr(s’) as computed
• Replace si with s’ to obtain new current
motif Mt+1
• Keep updating motif
• Report the motif with maximum score