Transcript Slide 1
Motif discovery
EM algorithm
Gibbs Sampler
Enumeration
Regression methods
Phylogenetic trees
Purpose
Construction
Finding significance
Not directly related to today’s topic: EM algorithm tutorial
http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.52.5949
Finding regulatory motifs - Background
From a known family of sequences that are related only by very small motifs
(that contribute to the function), find the real motif from vast amounts of junk.
Sequence length ~2000; motif size ~10
Science , 262, 208-214.
Finding regulatory motifs - Background
The aligned segment can still be quite diverse.
Probability ratios by position:
Science , 262, 208-214.
Finding regulatory motifs - Background
Transcription
aparatus
Transcription
factor
Gene
Regulatory motifs
Transcription
aparatus
Transcription
factor
Gene
Regulatory motifs
ChIP-chip and ChIP-Seq
Ren et al. 1999; Iyer et al. 2000.
Thanks to Dr. Steve Qin for the slide.
ChIP-chip and ChIP-Seq
Brief Bioinform (2013) 14 (2): 225-237.
Finding regulatory motifs - Background
The key biological question:
Knowing a number of TF binding sites from ChIP-seq, or
genes regulated by a certain TF by functional studies, how
to find the binding motif?
Finding regulatory motifs - Background
Difficulties:
(1) Motif locations vary with regard to the gene
(2) Motifs are small (~10 characters to be found in strings
of ~2000 characters)
(3) Motif sequences vary
(4) Motif may be multi-block (not considered here)
Vavouri et al. Genome Biology 2007 8:R15
Finding regulatory motifs - Background
Summarizing a motif from an aligned sequence block:
Brief Bioinform (2013) 14 (2): 225-237.
Finding regulatory motifs - Conservation
One common measure of conservation of a motif: entropy
for position i:
Entropy(i )
P( X , i ) log P( X , i )
X A,T ,C ,G
2
Conservation(i ) 2 Entropy(i )
2 4 0.25log2 0.25
Or, relative entropy compared to background frequency:
DKL ( Pmotif || Pbackground )
X { A,T ,C ,G}
Pmotif ( X ) log2
Pmotif ( X )
Pbackground ( X )
Finding regulatory motifs - quality
How to judge the quality of an identified motif?
The null distribution is not as simple as we hope. As we
discussed before, the genome is not random. No simple
profile model summarizes the null --- the non-motif
background.
What to consider:
> Position bias: cluster close to the “start codon” ATG
> Orientation bias: tend to be up-stream of a gene
> Functional specificity: occurring mostly in the sequences
under study, and/or mostly near a functional subcategory
of genes.
Finding regulatory motifs - formulation
The formulation of the computational question:
Align tens/hundreds of sequences; only a tiny fraction of
the aligned sequences actually support the alignment (the
motif)
What’s needed:
(1) A scoring system (likelihood function, loss function, etc)
(2) A very efficient algorithm to search for the optimum in a
space of astronomical size
Example: align 100 sequences, each 2000 characters
long, there’s roughly 4000^99 possibilities
Finding regulatory motifs – scoring system
The motif is generated from a position weight matrix;
All other parts of the sequences are generated randomly
from a background model.
Example: Third order Markov background model:
CTTATGTA
Finding regulatory motifs - MEME
We are simultaneously seeking two things:
(1) A motif profile (θ)
(2) The location of the motif in each sequence (π)
Sounds familiar? If we treat π as missing data, the problem
falls into the realm of the EM algorithm.
MEME -- Multiple EM for Motif Elicitation
Original idea: Timothy L. Bailey and Charles Elkan, "Fitting a mixture
model by expectation maximization to discover motifs in
biopolymers", Proceedings of the Second International Conference
on Intelligent Systems for Molecular Biology, pp. 28-36, AAAI
Press, Menlo Park, California, 1994.
Finding regulatory motifs - MEME
D'haeseleer P. Nat Biotechnol. 2006 Aug;24(8):959-61.
Finding regulatory motifs - MEME
The EM algorithm may be trapped at a local optimum.
To avoid local optima, MEME tries multiple start points:
(1) every n-character word present in the sequences is
tried.
(2) The best is selected;
(3) mask the best, find the second best … …
It allows repeated occurrence of a motif in every
sequence.
Finding regulatory motifs – Gibbs Sampler
Can be considered a stochastic version of the EM algorithm.
Probablistic updates avoids local optima to some extent.
Lawrence et al. Science 262, 208-214.
Liu, JS et al. J. Am. Stat. Assoc. 90, 1156–1170
S: observed sequences
θ: motif parameters
θ0: background parameters (non-interest, but important)
π : motif locations
Goal: P(π, θ | θ0, S)
Sample P(π | θ, θ0, S)
Sample P(θ | π, θ0, S)
Finding regulatory motifs – Gibbs Sampler
Randomly initiate θ (a probability matrix)
Randomly select one sequence i, score all positions using θ
Update π(i) based on pattern probability
Iterate until θ converges
Weight for each segment: Ax=Qx/Px,
Qx: likelihood based on the position weight matrix
Px: likelihood based on the background model
Finding regulatory motifs – Gibbs Sampler
Quote:
Science 262, 208-214
Finding regulatory motifs – Enumeration
Enumeration ??!!!
Yes. It works very well compared to EM, Gibbs sampler,
and other methods. Tompa, M. et al. Assessing computational tools for the discovery of
transcription factor binding sites. Nat. Biotechnol. 23, 137–144 (2005)
The basic idea:
Simply count all possible m-mers in all the sequences of
interest. Select those with exceptionally high frequencies
(by testing). Merge similar ones into a position weight
matrix.
•Using efficient programming (hash), one scan through the
sequences gives us all frequencies.
•The real methods are much more complicated than above.
Finding regulatory motifs – MDscan
To address the problem: some input sequences may not
contain the motif.
Liu XS et al. Nature Biotechnology 20, 835 - 839 (2002)
(1) Find all non-redundant w-mers (“seed”) from the most
reliable subset of t sequences;
(2) Find all matches (≥m identity) in the t sequences;
(3) Establish the position weight matrix using these
matches;
(4) Evaluate the motif matrix (approximate maximum a
posteriori (MAP) score):
Finding regulatory motifs – MDscan
From the top 10~50 seeds from the previous step, iterate
Forward step:
Using the non-top sequences, search for the w-mer seed;
If the addition of a “match” increases the score of the wmer, the sequence is added;
Backward step:
Re-examine all matches of each seed;
If the removal of a match increases the score, remove the
match;
Report the highest scoring motifs.
Finding regulatory motifs – Motif Regressor
Goal: link gene expression with motif discovery.
Conlon et al. PNAS 6:3339. (2003)
Finding regulatory motifs – Motif Regressor
How well a sequence agrees with a motif:
all w-mers in this sequence
motif model
background HMM model
For every motif:
Stepwise model selection using all “good” motifs from last
step:
Modeling inter-dependent positions
• Zhou and Liu
Bioinformatics 2005
• Barash et al.
RECOMB 2003
Thanks to Dr. Steve Qin for the slide
Detect intra-dependent position pairs
Build intra-dependency into
the motif model.
26
Hu et al. Nucleic Acids Research, 2009, 1–14
Phylogenetic tree
http://en.wikipedia.org/wiki/Image:Phylogenetic_tree.svg
Phylogenetic tree
It can be seen as a model beyond multiple sequence
alignment
Sequence similarity Evolutionary relationship
The biological relavence of phylogenetic trees
Finding evolutionary relationships among
organisms
Help predict gene functions
Finding important mutations in rapidly changing
organisms, e.g. HIV
In terms of computation, sifting out real mutations
from all possible pairs at a given location of an alignment
Building phylogenetic trees
Binary tree structure is assumed:
At every split, two descendent edges.
Biologically: at any time point, only one new species can
be generated.
edge
node
The tree can be built for species, or just a certain protein
The edge length reflects evolutionary distance
Different organisms have different evolution speed
Different molecule have different evolution speed
Rooted and unrooted trees
The real (hidden) evolutionary path is a rooted tree
Phylogenetic trees are estimates to the real tree based on
some distance information
Some methods estimates where the root is, while some do not
a
c
b
d
AAAGGGGTTTT
AAAGGGGCCCC
AAATTTTAAAA
AAACCCCAAAA
a b c d
a
b
c
d
a
…………
b
C
d
Reconstruction of trees
Maximum parsimony
Find the tree that requires the least number of mutations to
derive the observed sequences.
Distance-based reconstruction
Similar to clustering methods.
Likelihood-based reconstruction
Needs a model for sequence evolution. Find the tree that gives
the highest likelihood of the observed data. (Not discussed here.)
The two methods above can be seen as likelihood-based under
very simple sequence evolution models and other assumptions.
Maximum parsimony
Four sequences:
AAG
AAA
GGA
AGA
Three possible trees:
#changes: 3
4
4
Maximum parsimony
Find the tree that can explain the observed data with minimal
number of substitutions.
This is analogous to model selection: if two models explain the
data equally well, the simpler is preferred.
Assumption: Equal evolutionary speed; independence between
positions
Can apply weights to account for different probabilities of
mutations.
Maximum parsimony
Tasks:
(1) With a given topology and leaf assignments, find the
minimal #substitutions
Note: Non-leaf nodes can vary. Needs minimization.
Using the independence assumption, can compute
one position at a time.
(2) Search through all possible topologies with leaf
assignments.
Maximum parsimony
Task (1) example: the simplest: Fitch’s algorithm
R j Rk if R j Rk
Ri
R
R
otherwise
k
j
Bottom to top:
A
AGT
AG
A
G
Min(# substitutions)
=#union operations
=2
T
T
T
A
Maximum parsimony
Task (2) is a very hard problem.
There are ~34 million possible trees when n=10.
A number of heuristic methods were proposed to deal with the
problem. A few names here:
Nearest Neighbour Interchanges
Branch and Bound
Tree Bisection and Reconnection
……
Some of such methods use a good topology found by distancebased methods (below) as a starting point.
Distancebased
methods
UPGMA
Same
assumptions
as parsimony
Likelihood-based methods
Data (aligned sequences): D
Evolutionary model: M
Tree structure: T
Maximize Prob(D | M, T)
M:
pa | b, t
The probability of changing from character b to
character a in edge length (time) t.
With independence assumption, between two sequences:
P( x | y, t ) pxi | yi , t
i
Likelihood-based methods
The probability of a tree is the product of the sequence
mutation rates in each branch of the tree, which is in turn
the product of the rate of substitution in each branch times
the branch length.
It allows mutation rate variations across branches.
Computationally costly.
Assessing significance
A simple non-parametric bootstrap strategy:
Resample columns (positions) from the aligned sequences.
The confidence of a branching event is assessed by the
fraction of its occurrence in the trees from the resampled
sequences.
Assessing significance
Derdeyn et al.
Science.
303(5666):201922