GST II: ---Title--- - University of Missouri

Download Report

Transcript GST II: ---Title--- - University of Missouri

DNA Regulatory
Binding Motif Search
Dong Xu
Computer Science Department
109 Engineering Building West
E-mail: [email protected]
573-882-7064
http://digbio.missouri.edu
Lecture Outline
 Gene
regulation
 Definition
of regulatory motif search
 CONSENSUS
 Gibbs
(“Greedy” Algorithm)
Sampler
Gene Regulation
promoter
operator
DNA sequence
Start of transcription
Key steps in transcription
DNA
 Initiation
RNA
 Elongation
 Termination
+
Initiation
RNA Polymerases
RNA Pol I
Ribosomal RNAs
RNA Pol II
All protein genes, snRNAs U1,U2 etc
RNA Pol III
Transfer RNAs, ribosomal RNAs
TATA
TATA A/T A A/T
One of the first sequences to be described was the TATA
box consensus
Transcription Initiation Complex
A macromolecular assembly of approximately 50
proteins
Many conserved from yeast to humans
TATA-binding protein (TBP) binds to TATA box
RNA pol II
TAF
TAF
TBP
TATA
Upstream Regulatory Elements
In addition to the TATA box the comparison of
many eukaryotic upstream sequences identified
addition conserved motifs that were involved with
the regulation of gene transcription
Some UREs were common to many genes others
were found only in genes expressed in specific
cells or as a result of specific stimuli
TATA
URE
Promoters are sequences in the DNA just upstream of
transcripts (coding sequences) that define the sites of initiation
Transcription faction
Transcription factors are the proteins that
modulate the rate of gene transcription by specific
interactions with DNA and/or other proteins
motif
TATA
Regulatory Network (1)
• Regulatory elements in eukaryotes are frequently arranged in
“modules”.
• Frequently TFs act as synergistic (cooperative) or antagonistic
(competitive) pairs.
Endo 16
Regulatory Network (2)
Lecture Outline
 Gene
regulation
 Definition
of regulatory motif search
 CONSENSUS
 Gibbs
(“Greedy” Algorithm)
Sampler
Motif Identification
AGCCA
AGCCA
AGCCA
AGCCA
Regulatory
regions
AGCCA
Motif –
AGCCA
Binding site???
What constitutes a motif?
 In
S.cerevisiae typically 6-10
conserved bases – The motif
 Spacers
varying in length (1-11bp)
Usually located in the middle
ACCNNNNNNGTT
Subproblem #1
 Having
a collection of known binding
sites
 Can
we develop a representation to
search for new binding sites?
Subproblem # 2
 Given
a set of sequences containing
binding sites for a common factor
 Can
we discover their location in
each sequence?
Computational Approach
Identify
a set of genes believed to be
controlled by the same regulatory
mechanism (co-regulated genes).
Extract regulatory regions of the genes
(usually upstream sequences) to form a
sample of sequences.
Find some way to identify conserved
elements (ungapped pattern) in these
sequences, resulting in a list of potential
regulatory sites.
Motif Finding Problem
Given
a sample of sequences and an
unknown pattern (motif) that appears at
different unknown positions in each
sequence, can we find the unknown
pattern?
Input: a set of sequences, each one with
an unknown pattern at an unknown position.
Output: the pattern and a set of starting
positions of the pattern in each sequence.
Why Not Use Multiple
Alignment
The
motif is short and may appear in different
location in different sequences. Most other areas
are random.
The
problem is made more complicated since not
every sequence contains a motif, due to:
 The upstream region used may not be long enough to include a
regulatory site in every sequence.
 Usually, potential co-regulated genes are used to construct the
sample, which means that we don’t know for sure whether all
these genes are really co-regulated.
Frequency matrix
N = 25.00
p(b)= 0.25
pos 0 pos 1 pos 2 pos 3 pos 4 pos 5
A
13
3
6
7
6
6
C
8
9
4
0
6
6
G
3
9
10
0
0
8
T
1
4
5
18 13
5
(
A
C
G
T
pos 0
1.76
1.57
1.23
0.95
Log f(b,i) + N / 4
p(b)
)
pos 1
1.23
1.61
1.61
1.32
pos 5
1.46
1.46
1.57
1.40
pos 2
1.46
1.32
1.65
1.40
pos 3
1.52
0.70
0.70
1.89
pos 4
1.46
1.46
0.70
1.76
Information Content Values
The functional constraints on each specific position of the pattern are
variable from some sites absolutely conserved (Shannon’s information
content Ci ranging between 0 and 1).
Sequence Logo
Example Data Set
Experimentally determined CRP binding sites for 18 genes
CRP Dimer
Homo dimeric structure indicates symmetric model
CPR Product Multinomial
Model Logo
Palidromic Product Multinomial model of sites
Essentially a Multiple
Local Alignment
.
.
.
Find “best” multiple local alignment
Difficulties
 Multiple
factors for a single gene
 Variability in binding sites
 The nature of variability is NOT well understood
 Insertions and deletions are uncommon
 Location,
location, location…
 Confidence assessment
Lecture Outline
 Gene
regulation
 Definition
of regulatory motif search
 CONSENSUS
 Gibbs
(“Greedy” Algorithm)
Sampler
Early Statistical Approaches
 CONSENSUS – Use a greedy algorithm to iteratively
build up motifs by adding more and more pattern
instances.
 Gibbs sampler – Start from a random initial solution,
use the Gibbs sampling approach to make a series
of local moves, trying to get to the solution with the
best score.
 MEME – Use the expectaion maxmization (EM)
algorithm.
CONSENSUS Algorithm
CONSENSUS
uses an iterative procedure
to add more and more patterns to form
potential motifs:
 Initialize each l-mer in sequence 1 as a singlepattern motif.
 Add each l-mer in sequence 2 to each single-pattern
motif, forming motifs consisting of 2 patterns. Keep
only the top n motifs.
 Repeat the process by adding each l-mer in
sequence 3 to the top n motifs from the last round,
forming motifs consisting of 3 patterns, and so on
until the last sequence. Only the top n motifs are
kept each time.
More Details of CONSENSUS
 CONSENSUS use the information content
score for scoring a motif as a set of
ungapped patterns.
 Instead of following the sequence order as
given in the input sequence set, a
randomized ordering is used to avoid
dependence on the input set.
CONSENSUS Procedure (1)
Cycle 1:
For each word W1 in S1
For each word W2 in S2
Create alignment (gap free) of W1, W2
Keep the n best alignments A1,1, …, An,1 :
ACGGTTG ,
CGAACTT ,
GGGCTCT …
ACGCCTG ,
AGAACTA ,
GGGGTGT …
CONSENSUS Procedure (2)
Cycle t:
For each alignment Aj, t-1 from cycle t-1
For each word Wt+1 in St+1
Create alignment (gap free) of Wt+1, Aj, t-1
Keep the n best alignments A1,t, …, An,t
ACGGTTG ,
ACGCCTG ,
…
ACGGCTC ,
CGAACTT ,
AGAACTA ,
…
AGATCTT ,
GGGCTCT …
GGGGTGT …
…
GGCGTCT …
Weight matrix

Probabilistic model:
How likely is each letter at each motif
position?
A
C
G
T
1
2
3
4
5
6
7
8
9
.89
.02
.38
.34
.22
.27
.02
.03
.02
.04
.91
.20
.17
.28
.31
.30
.04
.02
.04
.05
.41
.18
.29
.16
.07
.92
.18
.03
.02
.01
.31
.21
.26
.61
.01
.78
A. K. A.
Weight matrices are also known as

Position-specific scoring matrices

Position-specific probability matrices

Position-specific weight matrices
Related concepts

Information content

Relative entropy
Scoring a motif model

A motif is interesting if it is very different
from the background distribution
A
C
G
T
1
2
3
4
5
6
7
8
9
.89
.02
.38
.34
.22
.27
.02
.03
.02
.04
.91
.20
.17
.28
.31
.30
.04
.02
.04
.05
.41
.18
.29
.16
.07
.92
.18
.03
.02
.01
.31
.21
.26
.61
.01
.78
less interesting
more interesting
Relative entropy

A motif is interesting if it is very different
from the background distribution

Use relative entropy as objective function:
pi , 

  pi , log


b 
position i  letter 
pi, = probability of  in matrix position i
b = background frequency (in non-motif sequence)
Computational Complexity
n is user-defined heuristic constants
Running time:
O(N2) + O(k N n)
Where
N: length of sequence;
n: top n selections
k: number of sequences
Lecture Outline
 Gene
regulation
 Definition
of regulatory motif search
 CONSENSUS
 Gibbs
(“Greedy” Algorithm)
Sampler
Gibbs Sampling (1)
 Goal:
find the best ak to maximize
the difference between motif and
background base distribution.
a1
a2
a3
a4
ak
Gibbs Sampling (2)
Step 1: Pick random start position,
compute current motif matrix
 Step 2: Iterative update

 Take one sequence out, update motif matrix
 Calcuate fitness score of each position of out
sequence
 Pick start position in out sequence based on weight
Ax
 Take out another sequence, …, until converge

Step 3: Reset starting position
Liu, X
Gibbs Sampling (3)
Take out one sequence, calculate the fitness score
of every subsequence relative to the current motif
a1' ?????????????????
a2'
a3'
a4'
ak'
Fitness Score

Ax = Qx / Px
Current Motif
 Qx: probability of
generating subsequence x
from current motif
 Px: probability of
generating subsequence x
from background
A
1
0.1
2
0.3
3
0.7
T
G
C
0.1
0.7
0.1
0.2
0.4
0.1
0.1
0.1
0.1
X = GGA:
Background:
Q? P?
P(A) = P(T) = 0.4
P(G) = P(C) = 0.1
An example
ACAGTGT
TAGGCGT
ACACCGT
???????
CAGGTTT
ACAGTGT
TAGGCGT
ACACCGT
ACGCCGT
CAGGTTT
A
C
G
T
.45
.45
.45
.05
.05
.05
.05
.25
.45
.05
.25
.45
.05
.05
.05
.05
.45
.65
.05
.65
.05
.25
.05
.05
.05
.45
.25
.85
sequence 4
Gibbs pseudocode
select sites at random
compute the relative entropy
for (iter = 0; iter < maxiter; iter++) {
shuffle(sequences)
foreach sequence in (sequences) {
assign score to each site in sequence
choose one site probabilistically
compute the fitness score
if (fitness score is best so far) {
store a copy of the current sites
}
}
}
print the best scoring set of sites
Computational Complexity

One iteration running time: O(NK)
 Usually need < N iterations for convergence, and
< N starting points.
 Overall complexity: unclear – typically O(N2K) O(N3K)

EM is a local optimization method

Initial parameters matter
Biological Considerations
In
practice, motif finding algorithms have to
take into account characteristics of real
input samples. These include:
 Motifs with unknown length.
 Samples with biased nucleotide composition.
 Corrupted samples (not every sequence contains a
motif).
 Regulatory sites can lie on either DNA strand.
Reading Assignments

Suggested reading:
 Chapter 10 in “Current Topics in Computational
Molecular Biology, edited by Tao Jiang, Ying Xu,
and Michael Zhang. MIT Press. 2002.”

Optional reading:
1. Victor Olman, Dong Xu, and Ying Xu. CUBIC:
Identifications of Regulatory Binding Sites through
Data Clustering. Journal of Bioinformatics and
Computational Biology. 1:21-40. 2003.
Project Assignment (1)
Develop a program that implement the
“greedy” algorithm (CONSENSUS)
for motif identification
1.
Use an objective function of total
mismatches between words.
2.
Test the program using the DNA
sequence in the next page.
3.
Output the motif and location in each
sequence.
Project Assignment (2)
Test DNA sequence (each line a sequence):
atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGGa
tgagtatccctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttatag
gtcaatcatgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttAAAAAAAAGGGGGGGa