Transcript Document
C
E
N
T
R
E
F
O
R
I
N
T
E
G
R
A
T
I
V
E
Computational Genomics and
Proteomics
B
I
O
I
N
F
O
R
M
A
T
I
C
S
V
U
Lecture 8
Motif Discovery
Outline
Gene Regulation
DNA
Transcription factors
Motifs
What are they?
Binding Sites
Combinatoric Approaches
Exhaustive searches
Consensus
Comparative Genomics
Example
Probabilistic Approaches
Statistics
EM algorithm
Gibbs Sampling
www.accessexcellence.org
www.accessexcellence.org
www.accessexcellence.org
Four DNA nucleotide building blocks
G-C is more strongly hydrogen-bonded than A-T
Degenerate code
Four bases: A, C, G, T
Two-fold degenerate IUB codes:
R=[AG] -- Purines
Y=[CT] -- Pyrimidines
K=[GT]
M=[AC]
S=[GC]
W=[AT]
Four-fold degenerate: N=[AGCT]
Transcription Factors
•Required but not a part of the RNA polymerase complex
•Many different roles in gene regulation
Binding
Interaction
Initiation
Enhancing
Repressing
•Various structural classes (eg. zinc finger domains)
•Consist of both a DNA-binding domain and an interactive
domain
Motifs
Short sequences of DNA or RNA (or amino acids)
Often consist of 5- 16 nucleotides
May contain gaps
Examples include:
Splice sites
Start/stop codons
Transmembrane domains
Centromeres
Phosphorylation sites
Coiled-coil domains
Transcription factor binding sites (TFBS – regulatory
motifs)
TFBSs
Difficult to identify
Each transcription factor may have more than one binding
site
Degenerate
Most occur upstream of translation start site (TSS) but are
known to also occur in:
introns
exons
3’ UTRs
Usually occur in clusters, i.e. collections of sites within a
region (modules)
Often repeated
Sites can be experimentally verified
Why are TFBSs important?
Aid in identification of gene
networks/pathways
Determine correct network structure
Gene A
Gene B
Drug discovery
Switch production of gene product on/off
Consensus sequences
Matches all of the example sequences closely but not
exactly
A single site
TACGAT
A set of sites:
TACGAT
TATAAT
TATAAT
GATACT
TATGAT
TATGTT
Consensus sequence:
TATAAT or
TATRNT
Trade-off: number of mismatches allowed, ambiguity in
consensus sequence and the sensitivity and precision of
Information Content and
Entropy
Sequence Logos
Frequency Matrices
Given a collection of motifs,
TACGAT
TATAAT
TATAAT
GATACT
TATGAT
TATGTT
Create the matrix:
T
A
C
G
Position weight matrices
Finding Motifs
Two problems:
Given a collection of known motifs, develop a
representation of the motifs such that additional
occurrences can reliably be identified in new
promoter regions
Given a collection of genes, thought to be related
somehow, find the location of the motif common to
all and a representation for it.
Two approaches:
Combinatorial
Probabilistic
Combinatorial Approach
Exhaustive Search
Exhaustive Search
Sample-driven here refers to trying all the words as
they occur in the sequences, instead of trying all
possible (4W) words exhaustively
Greedy Motif Clustering
Greedy Motif Clustering
Greedy Motif Clustering
Comparative Genomics
Main Idea: Conserved non coding regions are important
Align the promoters of orthologous co-expressed
genes from two (or more) species e.g. human and
mouse
Search for TFBS only in conserved regions
Problems:
Not all regulatory regions are conserved
Which genomes to use?
Phylogenetic Footprinting
Phylogenetic Footprinting refers to the task of finding conserved
motifs across different species. Common ancestry and selection on
these motifs has resulted in these “footprints”.
Phylogenetic Footprinting
An Example
Xie et al. 2005
Genome-wide alignments for four species (human,
mouse, rat, dog)
Promoter regions and 3’UTRs then extracted for 17,700
well-annotated genes
Promoter region taken to be (-2000, 2000)
This set of sequences then searched exhaustively for
motifs
Nature 434, 338-345, 2005
The Search
Xie et al. 2005
Expected Rate
Probabilistic Approach
Gibbs Sampling (applied to
Motif Finding)
Gibbs Sampling Algorithm
Gibbs Sampling – Motif Positions
AlignACE - Gibbs Sampling
Remainder of the lecture:
Maximum likelihood and the EM algorithm
The remaining slides are for your information only and will
not be part of the exam
Basic Statistics
Maximum Likelihood Estimates
EM Algorithm
Basic idea (MEME)
http://meme.nbcr.net/meme/meme-intro.html
Basic idea (MEME)
MEME is a tool for discovering motifs in a group of related DNA
or protein sequences. A motif is a sequence pattern that occurs
repeatedly in a group of related protein or DNA sequences.
MEME represents motifs as position-dependent letterprobability matrices which describe the probability of each
possible letter at each position in the pattern. Individual MEME
motifs do not contain gaps. Patterns with variable-length gaps
are split by MEME into two or more separate motifs.
MEME takes as input a group of DNA or protein sequences (the
training set) and outputs as many motifs as requested. MEME
uses statistical modeling techniques to automatically choose the
best width, number of occurrences, and description for each
motif.
http://meme.nbcr.net/meme/meme-intro.html
Basic MEME Model
MEME Background frequencies
MEME – Hidden Variable
MEME – Conditional Likelihood
EM algorithm
Example
E-step of EM algorithm
Example
M-step of EM Algorithm
Example
Characteristics of EM
Gibbs Sampling (versus EM)