Lecture 10 - University of New England


Bioinformatics
Lecture 10
• Finding signals and motifs in DNA and proteins
• Expectation Maximization Algorithm
• MEME
• The Gibbs sampler
Finding signals and motifs in DNA and proteins
• An alignment of sequences is intrinsically connected with another essential task: finding signals and motifs (highly conserved ungapped blocks) shared by some of the sequences.
• A motif is a sequence pattern that occurs repeatedly in a group of related protein
or DNA sequences. Motifs are represented as position-dependent scoring matrices
that describe the score of each possible letter at each position in the pattern.
• Another related task is searching biological databases for sequences that contain one or more known motifs.
• These objectives are critical in the analysis of genes and proteins, as any gene or protein contains a set of different motifs and signals. Complete knowledge of the locations and structure of such motifs and signals leads to a comprehensive description of a gene or protein and points to its potential function.
The eMOTIF method of motif analysis
• eMOTIF is a very useful method for identifying motifs in proteins.
• An MSA of a particular set of proteins is submitted to eMOTIF, which essentially searches for consensus sequence(s) and identifies the conserved motifs.
• The probability of a motif is estimated from the frequencies of the individual amino acids in the Swiss-Prot DB, as a product of the probabilities of each position in the consensus (see the sketch after this list).
• The result could be as follows: This motif matches 25 out of the 30 sequences supplied. It will match 1 in 10^19 random sequences, or less than 1 sequence in the current SWISS-PROT database.
• Then a motif can be searched in the Swiss-Prot DB
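The estimate above can be reproduced with a few lines of code. The sketch below is a minimal illustration of the idea, not the eMOTIF implementation: the chance that a random sequence matches the motif is the product, over motif positions, of the summed background frequencies of the residues allowed at that position. The pattern and frequency values are hypothetical, not taken from Swiss-Prot.

```python
# Hypothetical background amino-acid frequencies (only a few residues shown).
background = {"A": 0.08, "C": 0.02, "G": 0.07, "K": 0.06,
              "L": 0.10, "S": 0.07, "T": 0.05, "V": 0.07, "Y": 0.03}

# A motif written as the set of residues allowed at each consensus position.
motif = [{"A", "G"}, {"C"}, {"S", "T"}, {"L", "V"}, {"Y"}]

# Probability that a random sequence window matches the motif.
p_random_match = 1.0
for allowed in motif:
    p_random_match *= sum(background[aa] for aa in allowed)

print(f"P(match at a random position) = {p_random_match:.2e}")
# Multiplying by the size of the database gives the expected number of chance
# matches (the "1 in 10^19 random sequences" type of statement above).
```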
[Figure: eMOTIF output, showing true positives]
[Figure: eMOTIF search of the DB for sequences containing a given motif]
Expectation Maximization (EM) Algorithm
• This algorithm is used to identify conserved areas in unaligned DNA and proteins.
• Assume that a set of sequences is expected to have a common sequence pattern.
• An initial guess is made as to location and size of the site of interest in each of the
sequences and these parts are loosely aligned.
• This alignment provides an estimate of base or aa composition of each column in
the site.
• The EM algorithm consists of two steps, which are repeated alternately until convergence (a compact sketch of the loop follows this list).
• In step 1, the expectation step, the column-by-column composition of the site is used to estimate the probability of finding the site at each position in each of the sequences. These probabilities in turn provide new estimates of the expected base or aa distribution for each column in the site.
• In step 2, the maximization step, the new counts of bases or aa for each position in the site found in step 1 replace the previous set.
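The loop below is a compact, self-contained sketch of this two-step procedure for DNA sequences (it is not MEME itself). The background frequencies are pooled from all residues for simplicity, and pseudocounts are used to avoid zero frequencies; both choices, along with the example sequences, are assumptions made for the illustration.

```python
import random

BASES = "GCAT"

def em_motif_search(sequences, site_len, n_iter=50, seed=0):
    """Minimal EM sketch: alternate the expectation and maximization steps."""
    rng = random.Random(seed)

    # Background frequencies pooled from all residues (a simplification; the
    # slides estimate them from the columns outside the site).
    pooled = {b: 1 for b in BASES}
    for seq in sequences:
        for b in seq:
            pooled[b] += 1
    background = {b: c / sum(pooled.values()) for b, c in pooled.items()}

    def normalise(cols):
        return [{b: col[b] / sum(col.values()) for b in BASES} for col in cols]

    # Initial guess: random site locations give the first column frequencies.
    starts = [rng.randrange(len(seq) - site_len + 1) for seq in sequences]
    cols = [{b: 1.0 for b in BASES} for _ in range(site_len)]   # pseudocounts
    for seq, s in zip(sequences, starts):
        for j in range(site_len):
            cols[j][seq[s + j]] += 1
    freqs = normalise(cols)

    for _ in range(n_iter):
        # Expectation step: probability of the site starting at each position
        # of each sequence, given the current column frequencies.
        all_probs = []
        for seq in sequences:
            raw = []
            for s in range(len(seq) - site_len + 1):
                p = 1.0
                for i, b in enumerate(seq):
                    p *= freqs[i - s][b] if s <= i < s + site_len else background[b]
                raw.append(p)
            z = sum(raw)
            all_probs.append([p / z for p in raw])

        # Maximization step: expected (weighted) base counts for every site
        # column replace the previous frequency table.
        cols = [{b: 1.0 for b in BASES} for _ in range(site_len)]
        for seq, probs in zip(sequences, all_probs):
            for s, w in enumerate(probs):
                for j in range(site_len):
                    cols[j][seq[s + j]] += w
        freqs = normalise(cols)

    return freqs

# Example call (hypothetical sequences):
# freqs = em_motif_search(["ACGTACGTGGC", "TTACGTAAGTC", "GGGACGTTTTT"], site_len=4)
```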
Expectation Maximization (EM) Algorithm
[Figure: a set of sequences, each shown as background residues (O) with a short motif block (X) embedded in it. Columns defined by a preliminary alignment of the sequences provide initial estimates of the frequencies of each aa or base in each motif column; columns not in the motif provide the background frequencies.]
Bases    Background    Site column 1    Site column 2    ……
G        0.27          0.4              0.1              ……
C        0.25          0.4              0.1              ……
A        0.25          0.2              0.1              ……
T        0.23          0.2              0.7              ……
Total    1.00          1.00             1.00             ……
Expectation Maximization (EM) Algorithm
[Figure: the motif window (XXXX) placed at trial positions A, B, … in sequence 1. Use the previous estimates of aa or nucleotide frequencies for each column in the motif to calculate the probability of the motif at this position, and multiply by the background frequencies at the remaining positions. The resulting score gives the likelihood that the motif matches position A, B or any other position in seq 1. Repeat for all other positions and find the most likely location; then repeat for the remaining sequences.]
EM Algorithm, 1st (expectation) step: calculations
• Assume that seq1 is 100 bases long and that the site is 20 bases long.
• Suppose that the site starts in column 1 and that the first two positions are A and T.
• The site then ends at position 20, and positions 21 and 22 do not belong to the site. Assume that these two positions are also A and T.
• The probability of this location of the site in seq1 is given by
Psite1,seq1 = 0.2 (A in site column 1) × 0.7 (T in site column 2) × Ps (product of the site-column probabilities for the next 18 positions in the site) × 0.25 (A in the first flanking position) × 0.23 (T in the second flanking position) × Pb (product of the background probabilities for the remaining 78 flanking positions).
• The same procedure is applied to calculate Psite2,seq1, Psite3,seq1, … for every other possible start position of the site in seq1, thus providing a comparative set of probabilities for the site location.
• The probability of the site being located at position k in seq1 is the site probability at k divided by the sum of the site probabilities over all possible locations.
• The procedure is then repeated for all the other sequences.
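The calculation can be made concrete with a scaled-down example: below the site is only two columns wide (rather than 20) and the sequence is hypothetical, but the frequency values are those of the table above, so the first factors reproduce the 0.2 × 0.7 × 0.25 × 0.23 product described in this slide.

```python
background = {"G": 0.27, "C": 0.25, "A": 0.25, "T": 0.23}
site_cols = [{"G": 0.4, "C": 0.4, "A": 0.2, "T": 0.2},   # site column 1
             {"G": 0.1, "C": 0.1, "A": 0.1, "T": 0.7}]   # site column 2

seq1 = "ATATGC"   # hypothetical sequence

def site_probability(seq, start):
    """Product of the site-column frequencies inside the site and the
    background frequencies at every flanking position."""
    p = 1.0
    for i, base in enumerate(seq):
        in_site = start <= i < start + len(site_cols)
        p *= site_cols[i - start][base] if in_site else background[base]
    return p

raw = [site_probability(seq1, s) for s in range(len(seq1) - len(site_cols) + 1)]
total = sum(raw)
normalised = [p / total for p in raw]   # P(site at position k in seq1)
print(normalised)
```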
EM Algorithm, 2nd (maximization) step: calculations
• The site probabilities for each sequence calculated in the 1st step are used to create a new table of expected base counts for each of the site positions, using the site probabilities as weights (see the sketch after the table below).
• Suppose that P(site 1 in seq1) = Psite1,seq1 / (Psite1,seq1 + Psite2,seq1 + …) = 0.01 and P(site 2 in seq1) = 0.02.
• These values are then added to the previous table, as shown in the table below.
• This procedure is repeated for every other possible start position in seq1, and then the process continues for all the other sequences, resulting in a new version of the table.
• The expectation and maximization steps are repeated until the estimates of base frequencies no longer change.
Bases    Background    Site column 1    Site column 2    ……
G        0.27 + …      0.4 + …          0.1 + …          ……
C        0.25 + …      0.4 + …          0.1 + …          ……
A        0.25 + …      0.2 + 0.01       0.1 + …          ……
T        0.23 + …      0.2 + …          0.7 + 0.02       ……
Total (weighted)    1.00    1.00    1.00    ……
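A small sketch of the weighted re-counting, using the example weights P(site 1 in seq1) = 0.01 and P(site 2 in seq1) = 0.02 from this slide; the site is shortened to two columns and the fragment of seq1 is hypothetical. In the full calculation every possible start position of every sequence contributes in the same way.

```python
site_len = 2                                   # the slide's site is 20 bases
counts = [{"G": 0.0, "C": 0.0, "A": 0.0, "T": 0.0} for _ in range(site_len)]

seq1 = "ATTG"                                  # hypothetical fragment of seq1
site_probs_seq1 = {1: 0.01, 2: 0.02}           # start position (1-based) -> weight

for start, weight in site_probs_seq1.items():
    for col in range(site_len):
        base = seq1[start - 1 + col]           # convert to 0-based indexing
        counts[col][base] += weight            # weighted count for this column

print(counts)
# Each column is then normalised; the resulting frequencies replace the
# previous table before the next expectation step.
```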
Multiple EM for Motif Elicitation - MEME
MEME: Summary Line
• This line gives the width (‘width’), number of occurrences in the training set (‘sites’), log
likelihood ratio (‘llr’) and E-value of the motif. Each motif describes a pattern of a fixed
width and no gaps are allowed in MEME motifs. MEME numbers the motifs consecutively
from one as it finds them. MEME usually finds the most statistically significant (low E-value)
motifs first.
• The statistical significance of a motif is based on its log likelihood ratio, its width and number of occurrences, the background letter frequencies (given in the command line summary), and the size of the training set.
• The E-value is an estimate of the expected number of motifs with the given log likelihood ratio (or higher), and with the same width and number of occurrences, that one would find in a similarly sized set of random sequences. (In random sequences each position is independent, with letters chosen according to the background letter frequencies.)
• The log likelihood ratio is the logarithm of the ratio of the probability of the occurrences of the motif given the motif model (likelihood given the motif) versus their probability given the background model (likelihood given the null model). (Normally the background model is a 0-order Markov model using the background letter frequencies, but higher-order Markov models may be specified via the -bfile option to MEME.) A small numerical illustration follows this list.
• Clicking on the buttons to the left of the motif summary line takes you to the previous motif (P) or next motif (N).
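The log likelihood ratio can be illustrated numerically. The sketch below uses a hypothetical width-4 motif, five made-up occurrences and made-up background frequencies; MEME's own computation involves more bookkeeping, so treat this only as an illustration of the quantity being described.

```python
import math

background = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}

# Position-specific probabilities for a hypothetical width-4 motif.
motif = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]

occurrences = ["AGCT", "AGCT", "TGCT", "AGCA", "AGGT"]   # sites = 5

# log( P(occurrences | motif model) / P(occurrences | background model) )
llr = 0.0
for occ in occurrences:
    for pos, base in enumerate(occ):
        llr += math.log(motif[pos][base] / background[base])

print(f"width = {len(motif)}  sites = {len(occurrences)}  llr = {llr:.1f}")
```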
MEME: Summary Line (example)
MOTIF 1   width = 26   sites = 5   llr = 244   E-value = 5.0e-006
The Gibbs Sampler
• The Gibbs sampler algorithm is slightly different from the EM approach. The method also searches for the statistically most probable motifs and can find the optimal width and the number of motifs in each sequence.
• The method iterates through two steps. In the first step, a random start position for the motif is chosen in every sequence but one. These sequences are then aligned and used to obtain an initial guess of the motif.
• The objective of the next step is to find the most probable location of the pattern in the left-out sequence (and, on subsequent iterations, in each of the other sequences) by sliding the motif window back and forth until the ratio of the motif probability to the background probability is a maximum.
• Then the next sequence is left out and the process is repeated, until the residue frequencies in each motif no longer change. The number of iterations may range from several hundred to several thousand.
• Several additional statistical procedures are used to improve the performance of the algorithm. The Gibbs sampler has been used to align sequences with very little sequence similarity.
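A tiny illustration of the score used in the sliding step is given below: the ratio of a window's probability under the motif column frequencies to its probability under the background frequencies. The motif, frequencies and sequence are hypothetical; the full sampling loop is sketched after the step-by-step outline below.

```python
background = {"G": 0.25, "C": 0.25, "A": 0.25, "T": 0.25}
motif_cols = [{"G": 0.1, "C": 0.1, "A": 0.7, "T": 0.1},    # width-2 motif,
              {"G": 0.1, "C": 0.1, "A": 0.1, "T": 0.7}]    # for brevity

seq = "GCATTA"   # hypothetical left-out sequence

def motif_to_background_ratio(window):
    ratio = 1.0
    for col, base in zip(motif_cols, window):
        ratio *= col[base] / background[base]
    return ratio

scores = [motif_to_background_ratio(seq[i:i + len(motif_cols)])
          for i in range(len(seq) - len(motif_cols) + 1)]
best = max(range(len(scores)), key=scores.__getitem__)   # most probable window
print(scores, best)
```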
Steps of the Gibbs sampler algorithm
A. Estimate the aa or base frequencies in the motif columns of all but one (the left-out) sequence; also obtain the background frequencies from the remaining, non-motif positions.
[Figure: all sequences except the left-out one, each shown as a row of sequence positions (x) with a randomly chosen motif location (M). The random start positions chosen for the motif in each sequence provide the first estimate of the motif composition.]
B. Use the estimate from A to calculate the ratio of the motif probability to the background probability at each position in the left-out sequence. This ratio, for each possible location in the sequence, is the weight of that position.
[Figure: the motif window (M) slid along the left-out (outlier) sequence, producing a weight for every position.]
C. Choose a new location for the motif in the left-out sequence by a random selection that uses the weights to bias the choice.
[Figure: the newly estimated location of the motif (M) in the left-out sequence.]
D. Repeat steps A to C many times (from several hundred to several thousand iterations), until the residue frequencies in each motif no longer change.
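The steps above can be tied together in a short, self-contained sketch. This is a simplified illustration for DNA, not the published Gibbs sampler implementation: the background frequencies are pooled from all residues, pseudocounts avoid zero frequencies, and Python's random.choices performs the biased selection of step C; the example sequences are hypothetical.

```python
import random

BASES = "GCAT"

def gibbs_sampler(sequences, motif_len, n_iter=500, seed=0):
    rng = random.Random(seed)

    # Background frequencies pooled from all residues (a simplification).
    pooled = {b: 1 for b in BASES}
    for seq in sequences:
        for b in seq:
            pooled[b] += 1
    background = {b: c / sum(pooled.values()) for b, c in pooled.items()}

    # Random motif start positions chosen for every sequence (initialisation).
    starts = [rng.randrange(len(seq) - motif_len + 1) for seq in sequences]

    for _ in range(n_iter):                                # step D: repeat many times
        for left_out in range(len(sequences)):             # leave each sequence out in turn
            # Step A: motif column frequencies from all but the left-out sequence.
            cols = [{b: 1.0 for b in BASES} for _ in range(motif_len)]   # pseudocounts
            for i, (seq, st) in enumerate(zip(sequences, starts)):
                if i == left_out:
                    continue
                for j in range(motif_len):
                    cols[j][seq[st + j]] += 1
            cols = [{b: col[b] / sum(col.values()) for b in BASES} for col in cols]

            # Step B: motif-to-background ratio (weight) at every position of
            # the left-out sequence.
            seq = sequences[left_out]
            weights = []
            for st in range(len(seq) - motif_len + 1):
                w = 1.0
                for j in range(motif_len):
                    w *= cols[j][seq[st + j]] / background[seq[st + j]]
                weights.append(w)

            # Step C: choose a new motif location at random, biased by the weights.
            starts[left_out] = rng.choices(range(len(weights)), weights=weights)[0]

    return starts   # estimated motif start position in each sequence

# Example call (hypothetical sequences):
# starts = gibbs_sampler(["ACGTACGTGGC", "TTACGTAAGTC", "GGGACGTTTTT"], motif_len=4)
```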