CSCE590/822 Data Mining Principles and Applications
Download
Report
Transcript CSCE590/822 Data Mining Principles and Applications
CSCE555 Bioinformatics
Lecture 10 Motif Discovery
HAPPY
CHINESE
NEW YEAR
Meeting: MW
4:00PM-5:15PM
SWGN2A21
Instructor: Dr. Jianjun Hu
Course page: http://www.scigen.org/csce555
University of South Carolina
Department of Computer Science and Engineering
2008 www.cse.sc.edu.
Outline
Introduction to DNA Motif
Motif Representations (Recap)
Motif database search
Algorithms for motif discovery
7/16/2015
2
What is a DNA motif?
Motif A Recurring pattern
A short conserved sequence pattern
associated with distinct functions of a
protein or DNA
DNA motifs asTranscription Factor biding
sites
Transcription: binding sites (DNA)
and factors (proteins)
Colored lines are binding sites: DNA sequence patterns.
Blobs are factors (proteins) that recognize binding sites.
Example:
Transcription Factor Binding Sites
Estrogen
Receptor
Transcription start
DNA
ERE
(estrogen response element)
Gene
ERE Sequence
Efp
… a g g g t c a t g g t g a c c c t …
TERT
… t t g g t c a g g c t g a t c t c …
Oxytocin
… g c g g t g a c c t t g a c c c c …
Lactoferrin
… c a g g t c a a g g c g a t c t t …
Angiotensin
… t a g g g c a t c g t g a c c c g …
VEGF
… a t a a t c a g a c t g a c t g g …
Why are sequence patterns useful-revisited
In the context of transcriptional regulation,
sequence patterns can be used to help answer
several questions.
What transcription factors are involved in
regulating my gene?
Does my gene contain a DNA binding domain?
What novel transcription factor binding sites
does my set of co-regulated genes contain?
How do we represent sequence
patterns?
The three most common
pattern representation languages:
regular expressions (e.g.,leucine zipper)
profiles (PWMs, PSSMs etc.)
hidden Markov models (HMMs)
1) Regular expressions define sets of
sequences that they match
Sp1 binds to DNA via 3
zinc-finger binding
domains:
C-X(2,4)-C-X(3)[LIVMFYWC]-X(8)-HX(3,5)-H
These particular domains
recognize Sp1 binding
sites:
GRGGCRGGW
Transcription factor Sp1 binding
to DNA
2) Profiles are built from multiple
alignments of instances of a pattern
Example: nuclear hormone
receptor transcription
factor binding site profile
derived from
experimentally determined
sites.
Observed counts can be
converted to frequencies by
dividing by the number of
observed instances.
So profiles are probabilistic
models of sequence patterns.
Counts of number of
times each letter is
observed at each
position in pattern.
3) Making a Markov Model
A
T
A
A
A
C
C
C
G
C
A
A
A
A
C
A
C
G
C
-
T
-
A
A
A
A
A
T
T
G
T
T
G
C
C
C
C
[AT][CG][AC][ACGT-](3)A[TG][GC]
~3600 possible valid sequences
Lecture 3.2
10
Making a Markov Model of Motif
0.4
C:0.4
G:0.2
T:0.2
A:0.2
0.6
0.6
A:0.8
T:0.2
1.0
C:0.8
G:0.2
1.0
A:0.8
C:0.2
0.4
A:1.0
1.0
T:0.8
G:0.2
1.0
C:0.8
G:0.2
P(ACAC--ATC)=0.8x1.0x0.8x1.0x0.8x1.0x0.6x0.4
x0.6x1.0x1.0x0.8x1.0x0.8 = 0.0047
Lecture 3.2
11
How to score the match of a sequence
against three motif models?
Regular express: exact match or fuzzy
match
Profile: sum of log-odds
HMM: probability score P(s|H)
Outline
Introduction to DNA Motif
Motif Representations (Recap)
Motif database search
Algorithms for motif discovery
7/16/2015
13
How do we search for occurrences
of known patterns?
Tools exist that allow us to search for one or more
known sequence patterns in one or more sequences in
different ways.
The patterns can come from a database of known
patterns or be novel patterns we have discovered using
pattern discovery software or other means.
Some tools treat each pattern independently; others
look for groups of matches to patterns.
All tools compare each pattern to each position and
compute a score which can be the number of mutations
(regular expression patterns) or a probability or logodds (profiles and HMMs).
Many useful databases of patterns
have been compiled
TRANSFAC – transcription factor binding
sites (profiles)
PROSITE – protein sites and domains
(regular expressions and profiles)
EPD – eukaryotic promoters (profiles)
PFAM – protein families and domains
(HMMs)
BLOCKS – protein families (profiles)
Searching for known patterns in a
given sequence
MOTIF – search protein sequence against
Prosite, PFAM etc.; search DNA sequence
against TRANSFAC
PROFILESCAN – search protein
sequence against Prosite database of
profiles or regular expressions
MAST – search for occurrences of one or
more patterns in a DNA sequence (or
database of sequences)
Outline
Introduction to DNA Motif
Motif Representations (Recap)
Motif database search
Algorithms for motif discovery
7/16/2015
17
The Motif Discovery Problem
We are given a set of sequences, each
containing an instance of an unknown motif.
Find the motif.
Multiple, local sequence alignment.
A clean, computer-sciencey problem. A bit too
clean, we should be suspicious…
In Real Life
A microarray experiment indicates that 50
genes share similar expression patterns.
Do they share a common type of transcription
factor binding site?
◦ Almost certainly some of the genes were included
erroneously: experimental noise.
◦ Perhaps they share a common mRNA degradation
signal.
Is the TFBS near the transcription start site?
◦ Yeast: probably. Human: who knows?
Approaches to Motif Discovery
Matrix-based:
◦ Gibbs Sampling - most popular.
◦ Expectation maximization.
◦ Stormo’s greedy algorithm.
Consensus sequence-based:
◦ Several algorithms by Pevzner.
◦ Box-finder of Kielbasa et al.
Three Ingredients of Almost any
Bioinformatics Method
1.
2.
3.
Search space (haystack) Mathematically precise
formulation of the problem
Scoring scheme
Search algorithm (= optimization technique)
Strictly speaking, Gibbs sampling and expectationmaximization are search algorithms. They are not
specific to motif discovery; indeed they were first
used in other contexts.
Gibbs Sampling: Simplifying
Assumptions
The width of the motif is known in advance.
No indels (gaps).
Each sequence contains precisely one instance
of the motif.
The sequences are single-stranded (e.g. mRNA).
Search Space
Motif width = W
N
Length = L
Size of search space = (L – W + 1)N
L=100, W=15, N=10 size 1019
Scoring Scheme
Assign a numeric score to any proposed
answer.
caga
ctga
cacc
cgca
What score should this get?
Some Definitions
caga
ctga
cacc
cgca
count matrix:
• pki = cki / N
cki =
k
1
2
3
4
a
0
2
0
3
c
4
0
2
1
g
0
1
2
0
t
0
1
0
0
• pi = background abundance of ith residue type
i
Two Scoring Schemes
1. Based on frequentist statistics / information
theory:
score
W
k 1
c
ln pki ki
i a , c , g ,t
p
cki
i
i a , c , g ,t
2. Based on Bayesian statistics:
score
W
k 1
6
ln
cki!
N 3! i a ,c , g ,t
p
cki
i
i a , c , g ,t
Worked Example
score
W
k 1
cki =
6
ln
cki!
N 3! i a ,c , g ,t
p
cki
i
i a , c , g ,t
1
2
3
4
a
0
2
0
3
c
4
0
2
1
g
0
1
2
0
t
0
1
0
0
Score = 1.99 - 0.50 + 0.20 + 0.60 = 2.29
N = 4
pi = ¼
pi
cki
1
N
1 256
4
i
6
N 3!
p
cki
i
i
32
105
Search Algorithm
We want the global maximum score!
(Or as close as we can get.)
Exact algorithms (e.g. dynamic programming)
would be too slow (e.g. lifetime of universe).
Therefore we resort to a heuristic algorithm:
Gibbs sampling, which is a type of Monte Carlo
Markov chain method.
Gibbs Sampling Search
1
Suppose the search space is a 2D
rectangle. (Typically, more than 2
dimensions!)
Start at a random point X.
Randomly pick a dimension.
2
X
Look at all points along this dimension.
Move to one of them randomly, proportional
to its score π.
Repeat.
Gibbs Sampling for Motif Search
Choose a random starting state.
Randomly pick a sequence.
Look at all motif positions in this
sequence.
Pick one randomly proportional
to exp(score).
Repeat.
Does it Work in Practice?
Only successful cases get published!
Seems more successful in microbes (bacteria & yeast)
than in animals.
The search algorithm seems to work quite well, the
problem is the scoring scheme: real motifs often don’t
have higher scores than you would find in random
sequences by chance. I.e. the needle looks like hay.
Attempts to deal with this:
◦ Assume the motif is an inverted palindrome (they often are).
◦ Only analyze sequence regions that are conserved in another
species (e.g. human vs. mouse).
As usual, repetitive sequences cause problems.
More powerful algorithm: MEME
1.
Go to our MEME server:
http://molgen.biol.rug.nl/meme/website/meme.ht
ml
1.
Fill in your emailadres, description of the sequences
2.
Open the fasta formatted file you just saved with
Genome2d (click “Browse”)
3.
Select the number of motifs, number of sites and the
optimum width of the motif
4.
Click “Search given strand only”
5.
Click “Start search”
Something like this will appear in your
email. The results are quite self
explanatory.
Summary
Motif discovery and Motif search problem
Motif representation
Gibbs sampling algorithm for motif discovery
Using MEME (Expectation Maximization algorithm) for
motif discovery
Acknowledgement
Zhiping Weng (Boston Uni.)
Timothy L. Bailey