Special Topics in Genomics Motif Discovery


Special Topics in Genomics
Cis-regulatory Modules and Phylogenetic Footprinting
Cis-regulatory Modules and Module Discovery
The slides for module discovery are provided by Prof. Qing Zhou @ UCLA
Motif Discovery
Mixture modeling
[Figure: Background model θ0 = (θ0A, θ0C, θ0G, θ0T). Motif (weight matrix) Θ = [θ1, θ2, ..., θw], where column θj = (θjA, θjC, θjG, θjT) holds the base probabilities at motif position j.]
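To make the two components of the mixture concrete, the sketch below scores a candidate site by the log ratio of its probability under a weight matrix Θ to its probability under the background θ0. The numbers in `theta` and `theta0` and the helper names `site_log_ratio` and `best_site` are hypothetical, chosen only for illustration; they are not taken from the slides.

```python
import math

BASES = "ACGT"

def site_log_ratio(site, theta, theta0):
    """log [ P(site | Theta) / P(site | theta0) ] for a site of width len(theta)."""
    if len(site) != len(theta):
        raise ValueError("site length must equal the motif width")
    return sum(math.log(col[b] / theta0[b]) for b, col in zip(site, theta))

def best_site(seq, theta, theta0):
    """Return (start, score) of the highest-scoring window in seq (0-based start)."""
    w = len(theta)
    scores = [site_log_ratio(seq[i:i + w], theta, theta0)
              for i in range(len(seq) - w + 1)]
    start = max(range(len(scores)), key=scores.__getitem__)
    return start, scores[start]

# Hypothetical parameters, for illustration only.
theta0 = {b: 0.25 for b in BASES}            # uniform background
theta = [                                     # width-3 motif favoring "TAT"
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]

start, score = best_site("GGGTATGGG", theta, theta0)
```

A positive score means the window is better explained by the motif than by the background; a full mixture model would also weigh this ratio by the prior probability of a site.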
Difficulties in motif discovery in higher organisms
• Upstream sequences are longer.
• Motifs are less conserved and shorter.
• Background sequence structures are more complicated.
• To solve the problem, utilize more biological knowledge in
our model.
1) module structure
2) multiple species conservation
Cis-regulatory module
• Combinatorial control of genes: cis-regulatory modules
CisModule: modeling module structure
(Zhou and Wong, PNAS 2004)
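• Module structure: consider co-localization of motif sites.
Hierarchical mixture modeling
[Figure: hierarchical mixture model. Along a sequence S, a position starts a module (M) with probability r or is background (B) with probability 1 − r. Within a module, a position starts a site of motif k (weight matrix Θk, k = 1, ..., K) with probability qk, or is generated from the background model θ0 with probability q0. The figure shows three motifs (Motif 1, Motif 2, Motif 3) and a background θ0 with entries 0.25, 0.25, 0.25, 0.25.]
K: # of motifs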
Parameters and missing data
• Missing data problem.

Given
  K    # of motifs
  l    Module length
Observed data
  S    Set of sequences
Missing data
  M    Indicators for a module start
  A    Indicators for a motif site start
Parameters Ψ
  θ0   Background model
  Θ    Weight matrices for motifs
  W    Motif widths
  r    Probability of a module start
  q    Probability of starting a motif site



Bayesian inference by posterior sampling
Alternate between two steps:
Parameter update
Given M and A,
1) Infer Θ from aligned sites.
2) Update r, q and W.
Module-motif detection
Given Θ, r, q, and W,
1) Sample modules.
2) Within each module, sample motif sites.
[Figure: a sequence with regions labeled M=0, M=1, M=0, and five aligned sites (TTTGC, TATCC, CTTGC, TTTAC, GTTGC) with their count matrix:]
      θ1   θ2   ...   θw
A      0    1   ...    0
C      1    0   ...    5
G      1    0   ...    0
T      3    4   ...    0
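The step "infer Θ from aligned sites" can be illustrated with the five aligned sites shown on this slide (TTTGC, TATCC, CTTGC, TTTAC, GTTGC). The sketch below counts bases per column and converts the counts into a weight matrix using the posterior mean under a symmetric Dirichlet prior. The prior and its `pseudocount` value are assumptions made here; a sampler would typically draw Θ from the Dirichlet posterior instead of using its mean.

```python
from collections import Counter

BASES = "ACGT"

def count_matrix(sites):
    """Per-column base counts for equal-length aligned sites."""
    w = len(sites[0])
    if any(len(s) != w for s in sites):
        raise ValueError("all aligned sites must have the same width")
    return [Counter(s[j] for s in sites) for j in range(w)]

def estimate_theta(sites, pseudocount=1.0):
    """Posterior-mean weight matrix; pseudocount is an assumed Dirichlet prior."""
    n = len(sites)
    return [
        {b: (col[b] + pseudocount) / (n + len(BASES) * pseudocount) for b in BASES}
        for col in count_matrix(sites)
    ]

sites = ["TTTGC", "TATCC", "CTTGC", "TTTAC", "GTTGC"]
counts = count_matrix(sites)
theta = estimate_theta(sites)
```

The first, second, and last columns of `counts` reproduce the three columns of the count matrix shown on the slide.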
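Module sampling
• Want to sample from $P(M \mid S, \Psi)$; need to calculate
$$P(S \mid \Psi) = \sum_{M} P(S, M \mid \Psi).$$
• Denote $S = [x_1 x_2 \cdots x_L] = x_{[1,L]}$.
• Forward summation: $f_n(\Psi) = P(x_{[1,n]} \mid \Psi)$.
$$f_n(\Psi) = r \cdot h(n-l+1, n)\, f_{n-l}(\Psi) + (1-r) \cdot P(x_n \mid \theta_0)\, f_{n-1}(\Psi) \equiv A_n(\Psi) + B_n(\Psi).$$
[Figure: sequence positions 1, ..., n−l, ..., n−1, n, ..., L. Background: position n is generated with probability P(x_n | θ0). Module: positions n−l+1, ..., n are generated with probability h(n−l+1, n).]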
Module sampling
• Backward sampling:
$$P(M_{n-l+1} = 1 \mid M_{[n+1,L]}) = \frac{A_n(\Psi)}{f_n(\Psi)}.$$

How to calculate $h(n-l+1, n)$:
$$h(i, m) = \sum_{k=1}^{K} q_k\, P(x_{[m-w_k+1,\,m]} \mid \Theta_k)\, h(i, m-w_k) + q_0\, P(x_m \mid \theta_0)\, h(i, m-1).$$
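The forward summation, the recursion for h, and the backward sampling step can be put together in a short sketch. It works with plain probabilities on a single short sequence and assumes 0 < r < 1 for sampling; a practical implementation would need scaling or log-space arithmetic to avoid underflow. The function names, the representation of weight matrices as lists of dicts, and the convention that h of an empty window equals 1 are assumptions made for this illustration, not details given in the slides.

```python
import random

def site_prob(seq, start, pwm):
    """P(seq[start:start+w] | Theta_k), with pwm a list of per-position dicts."""
    p = 1.0
    for j, col in enumerate(pwm):
        p *= col[seq[start + j]]
    return p

def module_prob(seq, i, m, theta0, pwms, q):
    """h(i, m) for the 0-based inclusive window seq[i..m]; q[0] is the background weight."""
    h = [1.0] + [0.0] * (m - i + 1)          # h[t]: first t positions of the window
    for t in range(1, m - i + 2):
        pos = i + t - 1                      # last position covered by h[t]
        h[t] = q[0] * theta0[seq[pos]] * h[t - 1]
        for k, pwm in enumerate(pwms, start=1):
            w = len(pwm)
            if t >= w:                       # a site of motif k ends at pos
                h[t] += q[k] * site_prob(seq, pos - w + 1, pwm) * h[t - w]
    return h[-1]

def forward(seq, l, r, theta0, pwms, q):
    """Return lists f and A indexed by 1-based position n, with f[0] = 1."""
    L = len(seq)
    f = [1.0] + [0.0] * L
    A = [0.0] * (L + 1)
    for n in range(1, L + 1):
        if n >= l:                           # a module of length l can end at n
            A[n] = r * module_prob(seq, n - l, n - 1, theta0, pwms, q) * f[n - l]
        f[n] = A[n] + (1 - r) * theta0[seq[n - 1]] * f[n - 1]
    return f, A

def sample_module_starts(f, A, l, rng=random):
    """Backward sampling: walk from n = L down, returning 1-based module starts."""
    starts, n = [], len(f) - 1
    while n >= 1:
        if rng.random() < A[n] / f[n]:       # a module ends at position n
            starts.append(n - l + 1)
            n -= l
        else:                                # position n is background
            n -= 1
    return starts[::-1]

# Hypothetical toy inputs, for illustration only.
theta0 = {b: 0.25 for b in "ACGT"}
pwms = [[{"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
         {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
         {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7}]]
q = [0.9, 0.1]
seq = "ACGTTATACG"
f, A = forward(seq, 5, 0.3, theta0, pwms, q)
starts = sample_module_starts(f, A, 5, random.Random(0))
```

Because each accepted module moves the pointer back by l positions, the sampled modules never overlap, which matches the structure of the forward recursion.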
Posterior inference
• Motif sites: marginal posterior probability of being a
motif start position > 0.5.
• Modules: marginal posterior probability of being within
a module > 0.5.
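The thresholding rule above can be sketched as follows, assuming the sampler has stored one 0/1 indicator vector per iteration (1 where a position is a motif start, or lies within a module). The storage format and the function names are assumptions made for this example.

```python
def marginal_posterior(samples):
    """Fraction of posterior samples in which each position has indicator 1."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]

def call_positions(samples, threshold=0.5):
    """0-based positions whose marginal posterior probability exceeds the threshold."""
    return [i for i, p in enumerate(marginal_posterior(samples)) if p > threshold]

# Four hypothetical posterior samples over four positions.
samples = [
    [0, 1, 0, 1],
    [0, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
]
called = call_positions(samples)
```

With a strict inequality, a position whose estimated probability is exactly 0.5 is not reported.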
Simulation study
• Generate 30 data sets independently, each containing:
1) 20 sequences, each of length 1000;
2) 25 modules, with length 150;
3) each module contains 1 E2F site, 1 YY1 site,
and 1 cMyc site.
              CisModule               Do not consider module
Motifs     Fail    TP     FP       Fail    TP     FP
E2F        0.03    17.9   7.5      0.37    17.1   11.6
YY1        0.07    16.0   8.7      0.20    17.1   11.0
cMyc       0       15.7   9.9      0.63    13.6   12.4
Example: Discovery of tissue-specific modules in Ciona
• The Sidow lab collected 21 genes that are
co-expressed during the development
of muscle tissue in Ciona.
• Want to find motifs and modules in the
upstream sequences (average length =
1330 bps) of these genes.
• Found 3 motifs in 28 modules (4860
bps).
Are they real motifs that determine the gene
expression?
Experimental validation
• Positive element: the shortest sufficient and non-overlapping
sequence that drives strong expression in muscle: average
length of 289 bps.
Experimental validation
• 70% of our predicted motif sites are located in the positive
elements!
Other tools
• Gibbs Module Sampler (Thompson et al. Genome
Res. 2004)
• EMCMODULE (Gupta and Liu, PNAS, 2005)
Phylogenetic Footprinting
Functional elements tend to be conserved across species
For example, exons are conserved due to selection pressure. Introns and
intergenic regions are less likely to be conserved.
Phylogenetic footprinting
Miller et al. Annu. Rev. Genomics Hum. Genet. 2004
Incorporating cross-species conservation into motif discovery
• A threshold method (Wasserman et al. Nature
Genetics, 2000)
STEP1: construct cross-species alignment
STEP2: compute conservation measure from the alignment
STEP3: filter out non-conserved regions
STEP4: apply a Gibbs motif sampler to conserved regions of
the target genome
Phylogenetic footprinting & motif discovery
• CompareProspector (Liu Y. et al. Genome Res. 2004)
STEP1: construct cross-species alignment
STEP2: compute conservation measure (window percent
identity, WPID) from the alignment
STEP3: multiply the likelihood ratio at a position by the
corresponding WPID, so that the likelihood landscape
favors conserved sites
STEP4: apply a Gibbs motif sampler based algorithm
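STEP2 and STEP3 above can be sketched under some simplifying assumptions: a pairwise gapped alignment given as two equal-length strings, a window of `2 * half_width + 1` alignment columns truncated at the ends, and likelihood ratios indexed by alignment column. The exact WPID definition and window size used by CompareProspector are not given in these slides, so the function names and the default `half_width` are hypothetical.

```python
def window_percent_identity(target, ortholog, half_width=5):
    """Fraction of identical non-gap columns in a window centered on each column."""
    if len(target) != len(ortholog):
        raise ValueError("aligned sequences must have equal length")
    match = [int(a == b and a != "-") for a, b in zip(target, ortholog)]
    wpid = []
    for i in range(len(match)):
        lo = max(0, i - half_width)
        hi = min(len(match), i + half_width + 1)
        wpid.append(sum(match[lo:hi]) / (hi - lo))
    return wpid

def reweight(likelihood_ratios, wpid):
    """Multiply each position's likelihood ratio by its conservation score."""
    return [lr * c for lr, c in zip(likelihood_ratios, wpid)]

# Hypothetical aligned pair: conserved on the left, diverged on the right.
wpid = window_percent_identity("ACGTACGT", "ACGTTTTT", half_width=1)
weighted = reweight([2.0] * 8, wpid)
```

Positions in poorly conserved windows have their ratios shrunk toward zero, so a sampler using the weighted ratios is less likely to place sites there.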
Phylogenetic footprinting & motif discovery
• Evolutionary model based approach
EMnEM (Moses et al. 2004)
PhyME (Sinha et al. 2004)
PhyloGibbs (Siddharthan et al. 2005)
Tree Sampler (Li and Wong, 2005)
Incorporating cross-species conservation into motif discovery
• PhyloCon (Wang and Stormo, Bioinformatics, 2003)
STEP 1: construct alignment among orthologous sequences;
STEP 2: convert conserved regions into profiles;
STEP 3: use profiles in the first sequence as seeds;
STEP 4: find matches of each seed in the second sequence;
STEP 5: update seeds;
STEP 6: repeat steps 2 and 3 for all sequences.
Phylogenetic footprinting & module discovery
• MultiModule (Zhou and Wong, The Annals of Applied Statistics, 2007)
MultiModule
• Module structure of each sequence is
modeled by an HMM.
• Couple HMMs via multiple
alignment: Aligned states are coupled
and collapsed into one common state.
• Uncoupled states: similar to single
species model.
• Coupled states: evolutionary model.
Comparing with other methods
• Three data sets with experimental validation reported
previously, which contain 9 known motifs with 152
validated sites.
• CompareProspector (Liu et al. 2004): conservation score
• PhyloCon (Wang and Stormo 2003): progressive alignment
of profiles
• EMnEM (Moses et al. 2004): Phylogenetic motif discovery
• CisModule (Zhou and Wong 2004): Single-species module
discovery.
Comparing with other methods
                                      For correctly identified motifs by each method
Method              # known motifs    # predicted   # overlaps   Sensitivity   Specificity
                    identified        sites                      (%)           (%)
CompareProspector   7                 75            36           24            48
PhyloCon            3                 50            26           17            52
EMnEM               6                 130           44           29            34
CisModule           5                 110           35           23            32
MultiModule         8                 157           79           52            50
# of known sites = 152
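The percentages in the table appear consistent with sensitivity = # overlaps / # known sites and specificity = # overlaps / # predicted sites. This reading is inferred from the numbers rather than stated in the slides (under it, the "specificity" column is what is often called positive predictive value). The sketch below recomputes both columns from the reported counts.

```python
KNOWN_SITES = 152

# (# predicted sites, # overlaps) as reported in the table.
counts = {
    "CompareProspector": (75, 36),
    "PhyloCon": (50, 26),
    "EMnEM": (130, 44),
    "CisModule": (110, 35),
    "MultiModule": (157, 79),
}

def sensitivity_specificity(predicted, overlaps, known=KNOWN_SITES):
    """Return (sensitivity %, specificity %) rounded to whole percentages."""
    sensitivity = round(100 * overlaps / known)
    specificity = round(100 * overlaps / predicted)
    return sensitivity, specificity

table = {name: sensitivity_specificity(p, o) for name, (p, o) in counts.items()}
for name, (sens, spec) in table.items():
    print(f"{name:18s} sensitivity={sens}%  specificity={spec}%")
```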