Gibbs sampling for motif finding in biological sequences

Download Report

Transcript Gibbs sampling for motif finding in biological sequences

Gibbs sampling for motif
finding in biological sequences
Christopher Sheldahl
Biological Sequences
• Proteins : “strings” of amino acids.
20 common types
|alphabet| = 20
• Nucleic acids : “strings” of nucleotides
4 common types for RNA; 4 for DNA
|alphabet| = 4
Motif Finding
• Multiple sequences : How to detect related
segments (or “motifs”)?
1: …ATACGTTGAGA…
2: …CGACGTTGCAA…
3: …CCACGCTGGAC…
Sequence Databases are Large
• Genbank : NA : 108 million sequences
• SwissProt : proteins : ~500,000
sequences
Growth of SwissProt
http://www.expasy.org/sprot/relnotes/relstat.html
The basic problem
• We have N biological sequences.
• Assume that there is one evolutionarily
related segment (substring) of known
width W in each sequence.
• How do we find this segment in each
sequence? Remember that there might
be mutations.
Protein case : We’d like to
know the following:
• Motif description (matrix): Probability of each
of the 20 aa at each position in the motif of
width w:
qi,R : Probability of R type aa at position i
of motif. Matrix is W x 20.
• Background probability (vector) of each of the
20 aa in sites not part of motif: p1….p20
• Ak : alignment vector of starting indices
for the motif in each of the N
sequences.
Scoring a given string
• If we knew the motif description matrix and
the background vector we could score any
string of width W :
W
W
P   pRi
i1
i1
Q   qi,Ri
Q
L
P

L : score of segment wrt the motif.
Ri : the aa type at position i of the segment.

Scoring an alignment of N
sequences
We can score an entire alignment :
N
AlignScore   Lx
x1

Qx
Lx 
Px
What if we just had one new
sequence?
• What if we knew the motif description
matrix and the background vector for a
large number of sequences, and we
want to quickly align a new (presumably
related) sequence?
What if we just had one new
sequence?
• Score possible locations of the motif in the
new sequence and pick one with a high (or
max) value of L:
Q
L
P
• Note that the better our motif description and
background model are, the better the new
alignment will be.

Motif Finding Dilemma
• If we knew the alignment, we could calculate
the motif description matrix and the
background vector.
• If we knew the motif description and the
background, we could calculate the alignment
(by choosing high L positions).
• We don’t know either of these - we’ve seen
one way to approach this problem before...
Expectation Maximization or
The Road Not Taken in this Lecture
Start with Random motif description and background.
Repeat until motif description converges:
1) E - step : Calculate alignment points (Ak) from Motif Description
and Background.
2) M - step : Calculate Motif Description and background from
alignments.
Lawrence et. al. 1990. “An Expectation Maximization Algorithm for the Identification and Characterization
of Common Sites in Unaligned Biolpolymer Sequences”. Proteins.
Local minima
• After initialization, simple EM algorithm
is a deterministic process.
• If the initialization is bad, can get
trapped in local minima.
• Heuristic : Try many different starting
points.
Random Sampling
• Note the heuristic of trying many
random initial positions. Maybe we can
overcome local minima if we employ
randomness more thoroughly.
• Can we take random sample alignments
from the distribution :
p(Align | S, W, Motif Matrix, Background)
where Align means A1, A2…AN.
Markov Chain Monte Carlo
• Markov Chain Monte Carlo methods
generate a Markov chain of points that
converges to a distribution of interest.
• “Monte Carlo” : The methods employ
randomness.
Metropolis-Hastings
• Metropolis-Hastings is an MCMC model that can sample from
any distribution P, using a proposal distribution
Q(x’; x).
• Initialize with random x.
• Generate new x’ =
Proposal position according to
Q(x’; x)
• Compute α = min( (P(x’) / P(x) ), 1)
and accept change with probability α.
Figure : Wikipedia
Gibbs Sampling
• Gibbs sampling is a variety of MetropolisHastings sampling where the sampling step is
always accepted.
• For multivariate distributions, in Gibbs
sampling only one parameter is changed at a
time.
• This makes Gibbs sampling particularly
useful for multivariate distributions.
Motif Finding with Gibbs :
Site Sampler
• Site sampler : Sample starting points for motif
in each sequence.
• Start with random alignments, then use
random sampling for one sequence at a time
to gradually improve the alignments.
Lawrence et al., “Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for
Multiple Alignment”, Science. 1993. 262(5131), 208-214
Gibbs sampling: Changing
One Thing at a Time
• N=3
• We want to sample alignment points A1 for
sequence 1, given the pattern description M,
background model B, the sequences S and
width W, and alignments A2 (for sequence 2)
and A3 (for sequence 3).
p( A1 | S, M, B,W, A2, A3)
Site Sampler Algorithm
Initialize with random alignment points.
While not converged:
Do steps 1 and 2 for each of N sequences :
1) Predictive update step :
Calculate motif matrix and background using all
sequences but the currently selected one.
2) Sampling step. Calculate Lx = Qx / Px for all starting
points x in this sequence. Choose one with a
probability proportional to Lx.
• The inner loop is an iteration of sequences.
• In step 2 you sample a new value for Ak i.e. the alignment position for sequence k
• Convergence : In theory you sample until alignment no longer changes
Example: Random Alignments
Rouchka, 1997,
“A Brief Overview
of Gibbs Sampling”
DNA:
Alphabet =
{A,C,G,T}
Example:
Initial Counts
New Counts
Remove 1st sequence (ATTTAT)
TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT
Rouchka, 1997.
Sequence 1:
TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT
Rouchka, 1997.
Final Alignment
Assumptions for Site Sampler
• Assumption of one motif.
• Assumption of one copy of motif in each
sequence.
• Assumption that we know the motif
width.
Motif Sampler
• Sample different motifs for each part of the
sequence. Store motif descriptions and
background vector for each motif.
• Allows multiple motifs, and varying number of
each motif to be present.
• Still subject to assumption of known motif
width.
Neuwald et al., “Gibbs Motif Sampling: Detection of bacterial membrane protein
repeats”. 1995. Protein Science. 4:1618-1632.
Null model
• Null model has pattern description
probabilities identical to the background
probabilities for its constituent amino
acids.
Widths and Copy Numbers
• The motif sampler requires widths and expected
numbers of each motif per sequence as input
parameters. The widths are fixed during the
algorithm, but the number of copies can change.
• Construction of subfamilies (more specific models)
favored by larger widths and smaller expected
number of copies.
• Construction of superfamilies (less specific models)
favored by smaller widths and larger expected
number of copies.
Motif Sampler Algorithm
Initialize with Ei non-overlapping random segments for each motif i in each of the
sequences.
While not converged:
Do steps 1 and 2 for successive segments in the biological sequences :
1) Predictive update step :
If this segment is in an alignment remove it and recalculate
motif matrixand background data excluding this segment.
2) Sampling step : Sample one of the motifs with a probability
proportional to the score that the motif description assigns this
segment.
• In the inner loop the motif sampler iterates through segments (substrings)
• Motif sampler samples among different motif models for each segment.
Modifications to the Samplers
• There are methods to allow for gapped
motifs.
• High scoring alignments can be stored
for later investigation.
Neuwald et al., 1995
Porin structures
Porin alignment
Neuwald et al., 1995
Motif descriptions
Motivation for Gibbs sampling
• We have a joint probability density
f(x, y1, y2,…yp)
• We want a marginal density, f(x).
• The integrations are difficult.
Casella and George, “Explaining the Gibbs Sampler”
Near-optimum Sampler
• A particular motif may not be present in
the best alignment found, even though
its found in many other near-optimal
alignments.
• Sample among near-optimal alignments
to capture all sites present.
Column sampler
• The column sampler allows for a motif
to shift:
GCACCTG --> GCACCTG
• It also allows for the development of
motifs with gaps:
GCACCTG --> GCACCTG