Evaluation of existing motif detection tools on their

DETECTION OF REGULATORY MOTIFS
BASED ON COEXPRESSION AND
PHYLOGENETIC FOOTPRINTING
PhD presentation Valerie Storms
March 29th, 2011
Promoters
Prof. Dr. Ir. Kathleen Marchal
Prof. Dr. Ir Bart De Moor
Overview
1. Introduction on transcriptional regulation
2. The effect of orthology and coregulation on detecting
regulatory motifs
3. PhyloMotifWeb: workflow for motif discovery in
eukaryotes
4. De novo motif discovery in vitamin D3 regulated genes
Genetic information
All living organisms consist of one or more cells
• E.g. humans:
– Built of multiple cells like nerve cells, muscle cells, skin cells
– Every cell: contains identical genetic information
Genetic information
• Stored as DNA (deoxyribonucleic acid)
• Double helix with sugar-phosphate backbone
• 4 building blocks = “base”
– A: adenine
– C: cytosine
– G: guanine
– T: thymine / U: uracil
• Complementary base pairing -> hydrogen bonds
• Presentation: ACCTGCTAG….ATTGACGGAC
[Figure: DNA double helix with sugar-phosphate backbone and base pairs A-T and G-C]
Genetic dogma
GENE EXPRESSION
DNA contains genes = specific sequences of bases that encode instructions on how
to make proteins = work units of a cell
[Figure: DNA sequence excerpt containing a gene; DNA -> TRANSCRIPTION -> mRNA -> TRANSLATION -> protein]
DIFFERENT LEVELS OF REGULATION
TRANSCRIPTIONAL REGULATION
Main players in Transcriptional regulation
1. Recruitment of the RNA POLYMERASE COMPLEX to the promoter
region of the target gene
[Figure labels: co-activator, TF, RNA polymerase complex, TSS, target gene, DNA, promoter region]
This process can be activated or repressed by:
• Transcription Factors (TFs) – activators and repressors
-> Bind DNA directly by recognizing specific regions
• Co-activators and co-repressors
-> Recruited by protein-protein interactions
Main players in Transcriptional regulation
2. Chromatin structure
Eukaryotic cells
• Nucleus
• Linear DNA molecules organized into chromosomes
• Chromatin = complex of DNA and proteins
-> Influences transcriptional regulation
[Figure labels: histones, linear DNA molecule, TF, heterochromatin, euchromatin]
Main players in Transcriptional regulation
[Figure labels: co-activator, TF, RNA polymerase complex, chromatin remodeling complex, TSS, target gene, DNA; TF-DNA INTERACTION at a REGULATORY MOTIF site, e.g. ATTGCCAT]
- Modify chromatin structure:
  - DNA methylation
  - Histone modifications like methylation, acetylation
• TFs bind specific non-coding sequences in the DNA to control the expression of
their target genes -> TF binding sites
• All genes regulated by the same TF contain a similar TF binding site in their
promoter region
• REGULATORY MOTIF models the TF-DNA binding specificity and captures the
variability of TF binding sites
Regulatory motif
Alignment of TF binding sites (one site per row, one motif position per column):
G T G A C G
G T G A C C
G A G A C G
G T G T C G
G T C A G G
Construction of frequency matrix (positions p1 … pn, here n = 6):
     p1    p2    p3    p4    p5    p6
A    0.01  0.01  0.01  0.97  0.01  0.01
C    0.01  0.01  0.01  0.01  0.97  0.29
G    0.97  0.01  0.97  0.01  0.01  0.69
T    0.01  0.97  0.01  0.01  0.01  0.01
Motif logo
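The step from aligned binding sites to a frequency matrix can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used for the slide: the five example sites are read from the alignment above under the assumption that each extracted group of five letters is one motif position, and the pseudocount value is an arbitrary choice, so the resulting numbers will not match the 0.97/0.01 entries shown.

```python
from collections import Counter

ALPHABET = "ACGT"

def frequency_matrix(sites, pseudocount=0.05):
    """Return {base: [frequency at p1, ..., pn]} for equal-length aligned sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    # The pseudocount (an assumed value) keeps unseen bases at a non-zero frequency.
    total = len(sites) + pseudocount * len(ALPHABET)
    matrix = {base: [] for base in ALPHABET}
    for pos in range(width):
        counts = Counter(site[pos] for site in sites)
        for base in ALPHABET:
            matrix[base].append((counts[base] + pseudocount) / total)
    return matrix

# Example sites as read from the slide's alignment (assumed row order).
sites = ["GTGACG", "GTGACC", "GAGACG", "GTGTCG", "GTCAGG"]
pfm = frequency_matrix(sites)
consensus = "".join(
    max(ALPHABET, key=lambda base: pfm[base][i]) for i in range(len(sites[0]))
)
print(consensus)
```

Each column of the matrix sums to 1, and the most frequent base per column gives the consensus of the motif.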
Computational motif discovery
[Figure: motif scanning versus de novo motif discovery]
1. Motif scanning: known motif model
-> Different algorithms to predict TF binding sites
2. De novo motif discovery: search for novel, uncharacterized motifs
-> Two different computational approaches!
Algorithms classified based on the information sources they use:
- Coregulation information
- Orthology information
- Co-localization of different TF binding sites
- Chromatin structure
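Motif scanning with a known motif model can be illustrated with a simple log-odds scan. The sketch below uses the frequency matrix from the regulatory motif slide; the uniform background, the score threshold and the example sequence are assumptions made for illustration, and only the forward strand is scanned. It is not the algorithm of any specific scanning tool.

```python
import math

def log_odds_score(window, pfm, background=None):
    """Sum of log2(motif frequency / background frequency) over the window."""
    background = background or {base: 0.25 for base in "ACGT"}  # assumed uniform
    return sum(
        math.log2(pfm[base][i] / background[base]) for i, base in enumerate(window)
    )

def scan(sequence, pfm, threshold):
    """Return (start, site, score) for every window scoring at least threshold."""
    width = len(pfm["A"])
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = log_odds_score(window, pfm)
        if score >= threshold:
            hits.append((start, window, round(score, 2)))
    return hits

# Frequency matrix taken from the regulatory motif slide (positions p1..p6).
pfm = {
    "A": [0.01, 0.01, 0.01, 0.97, 0.01, 0.01],
    "C": [0.01, 0.01, 0.01, 0.01, 0.97, 0.29],
    "G": [0.97, 0.01, 0.97, 0.01, 0.01, 0.69],
    "T": [0.01, 0.97, 0.01, 0.01, 0.01, 0.01],
}
hits = scan("AAGTGACGTT", pfm, threshold=5.0)  # hypothetical sequence and threshold
print(hits)
```

De novo discovery differs in that the matrix itself is unknown and has to be inferred together with the site positions.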
Overview
1. Introduction on transcriptional regulation
2. The effect of orthology and coregulation on detecting
regulatory motifs
3. PhyloMotifWeb: workflow for motif discovery in
eukaryotes
4. De novo motif discovery in vitamin D3 regulated genes
Different information spaces
1. Coregulation space
2. Orthologous space
3. Combined coregulation-orthology space
Next generation of motif discovery tools integrates orthology with
coregulation information
Study
Research goal:
– Extent of information in coregulation or orthologous space
– Conditions under which complementing both spaces improves motif
detection
Method:
– Synthetic and real benchmark datasets
– Select motif detection tools -> flexible enough to perform in each of the
three spaces
- Phylogibbs (Siddharthan et al., 2005)
- Phylogenetic sampler (Newberg et al., 2007)
- MEME (Bailey and Elkan, 1994)
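Theoretical comparison
Overview
Phylogibbs:
- Algorithm: simulated annealing + tracking => global optimum (= MAP solution)
- Running time: short
- Phylogenetic relatedness between the orthologous sequences: tree-based evolutionary model; alignment of the orthologous sequences needed
Phylogenetic sampler:
- Algorithm: a Gibbs sampler => local optimum => ensemble centroid solution
- Running time: long (multiple re-initializations)
- Phylogenetic relatedness between the orthologous sequences: tree-based evolutionary model; alignment of the orthologous sequences needed
MEME:
- Algorithm: Expectation Maximization => local optimum
- Running time: short
- Phylogenetic relatedness between the orthologous sequences: no evolutionary model; unaligned sequences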
Theoretical comparison
Assignment and scoring of motif sites
- Unaligned sequences: single independent motif sites
- Prealigned sequences: multiple orthologous motif sites, tree-based evolutionary model (F81)
  - Phylogibbs: window principle -> more flexible in case of a bad prealignment
  - Phylogenetic sampler: block principle -> very sensitive to bad prealignments -> leave out phylogenetically distant orthologs
Performance assessment
Construction of Synthetic datasets
[Figure: five-step construction scheme; recoverable labels listed below]
- Motif WMs with a different IC (example sites: TC…T, TT…T, …, TC…C)
- Background sequences
- Ancestor species: Seq 1, Seq 2, …, Seq 10
- Use a phylogenetic tree and an evolutionary model to create the orthologs for different species (REF SPECIES, SPECIES 1 to SPECIES 4)
- Resulting datasets: coregulation, orthologous, combined
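Two ingredients of such a synthetic dataset, the information content (IC) of a motif matrix and the planting of sampled sites in random background sequences, can be sketched as follows. This is a minimal sketch under stated assumptions: a uniform background, one planted site per sequence, and arbitrary sequence length and seed; the function names are hypothetical and the evolution of orthologs along a tree is left out.

```python
import math
import random

def information_content(pfm, background=0.25):
    """Total IC in bits: sum over positions and bases of f * log2(f / background)."""
    width = len(pfm["A"])
    return sum(
        pfm[base][i] * math.log2(pfm[base][i] / background)
        for i in range(width)
        for base in "ACGT"
        if pfm[base][i] > 0
    )

def sample_site(pfm, rng):
    """Draw one site, position by position, from the motif matrix."""
    width = len(pfm["A"])
    return "".join(
        rng.choices("ACGT", weights=[pfm[base][i] for base in "ACGT"])[0]
        for i in range(width)
    )

def plant_motif(pfm, n_seqs=10, length=200, seed=0):
    """Return (sequence, start, site) tuples with one planted site per sequence."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(n_seqs):
        background = "".join(rng.choice("ACGT") for _ in range(length))
        site = sample_site(pfm, rng)
        start = rng.randrange(length - len(site) + 1)
        sequence = background[:start] + site + background[start + len(site):]
        dataset.append((sequence, start, site))
    return dataset

high_ic = {
    "A": [0.01, 0.01, 0.01, 0.97, 0.01, 0.01],
    "C": [0.01, 0.01, 0.01, 0.01, 0.97, 0.29],
    "G": [0.97, 0.01, 0.97, 0.01, 0.01, 0.69],
    "T": [0.01, 0.97, 0.01, 0.01, 0.01, 0.01],
}
uniform = {base: [0.25] * 6 for base in "ACGT"}
dataset = plant_motif(high_ic)
print(information_content(high_ic), information_content(uniform))
```

A more degenerate matrix has a lower IC, which is what makes the planted sites harder to recover from the background.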
Performance assessment
Construction of Real datasets
Biological datasets:
1. Prokaryotic data -> Gamma-proteobacteria
LexA
TyrR
2. Eukaryotic data -> yeast species
Urs1H
Rap1
Performance assessment
Results (1)
COREGULATION SPACE
-> Depends on the degeneracy of the embedded motif
-> Does adding orthologs improve the performance for the LOW IC motif?
Performance assessment
Results (2)
COMBINED SPACE
1. Evolutionary distance between the added orthologs
Performance assessment
Results (3)
2. Phylogenetic tree
=> Tree based on neutral evolution rate
3. The number of added orthologs and the topology of the tree
=> low impact
4. Noise
=> Orthologous direction: performance drop depends on the species distance
and the algorithm characteristics
Performance assessment
Results (4)
ORTHOLOGOUS SPACE
-> Room for improvement!
- Number of added orthologs: larger effect than in combined space
- PS: almost no output when orthologs are prealigned (no centroid solution)
Conclusions
Phylogibbs and Phylogenetic sampler:
- Quality of predicted motifs depends on correctness of prealignments -> Challenge: accounting for phylogenetic relatedness, independent of a prealignment
Phylogenetic sampler:
- Ensemble centroid strategy
  -> Useful with low signal/noise
  -> Computationally limiting
Phylogenetic tools versus MEME:
- Phylogenetic tools may perform better than the more basic MEME tool
- BUT -> More parameters to tune
  -> Performance strongly depends on the prealignment quality, the phylogenetic tree, the relationship between the orthologs etc.
Overview
1. Introduction on transcriptional regulation
2. The effect of orthology and coregulation on detecting
regulatory motifs
3. PhyloMotifWeb: workflow for motif discovery in
eukaryotes
4. De novo motif discovery in vitamin D3 regulated genes
PhyloMotifWeb
- Motif finders with different algorithmic background: performance diversity
- Ensemble strategy: combine results of multiple algorithms
- Progress of experimental technologies:
  - Growing number of sequenced genomes: orthology information
  - Epigenetic information: chromatin structure information
- Ensemble phylogenetic motif finders
- Create orthologs, alignments, phylogenetic tree
- Automatic parameter sweep
- Easy reduction of search space
PhyloMotifWeb – Ensemble strategy
• Three motif finders: Phylogibbs, Phylogenetic sampler and MEME
• Run each motif finder across multiple parameter settings (e.g. different
motif numbers, motif widths etc.)
-> Large collection of output matrices
• FuzzyClustering algorithm
– summarizes all these output matrices into a set of non-redundant
ensemble motifs
– Works on the TF binding site level <-> matrix level
PhyloMotifWeb
- Epigenetic information, chromatin structure information: important for motif discovery in eukaryotes!
PhyloMotifWeb - Eukaryotes
Restrict search space to regions with higher regulatory potential
based on epigenetic information like chromatin structure
BUT: Tissue and condition dependent!
Annotation of regulatory regions -> Regulatory build pipeline of Ensembl
• Multi-cell type:
– DNase hypersensitivity -> open chromatin
– CTCF binding sites -> enhancer/insulator marker
– Binding sites of other TFs
• Cell-type specific:
– Histone modifications
PhyloMotifWeb – Webserver
Results page
- Motif logo
- Individual binding sites of the ensemble solution
- p-value for the overrepresentation of the ensemble motif in the sequence set versus random sequence sets
- Comparison with database motifs
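A p-value of this kind can be estimated empirically by comparing the observed statistic with the same statistic on random sequence sets. The sketch below shows one common estimator; it is an assumption about how such a comparison could be made and not a description of how the webserver computes its p-value. The hit counts in the example are hypothetical.

```python
def empirical_p_value(observed, null_values):
    """Fraction of random sets scoring at least as high as the observed set.

    The +1 in numerator and denominator avoids reporting a p-value of zero
    when no random set reaches the observed value.
    """
    exceed = sum(1 for value in null_values if value >= observed)
    return (exceed + 1) / (len(null_values) + 1)

# Hypothetical numbers: motif hits in the gene set versus in random sequence sets.
observed_hits = 10
random_set_hits = [1, 2, 3]
p_value = empirical_p_value(observed_hits, random_set_hits)
print(p_value)
```

With only three random sets the smallest reachable p-value is 0.25, which shows why many random sets are needed before a small p-value can be reported.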
Overview
1. Introduction on transcriptional regulation
2. The effect of orthology and coregulation on detecting
regulatory motifs
3. PhyloMotifWeb: workflow for motif discovery in
eukaryotes
4. De novo motif discovery in vitamin D3 regulated genes
Vitamin D3 - metabolism
• Source: diet, and produced in the skin when exposed to sunlight
• Role in regulating many physiological and cellular processes:
- Bone health
- Prevention of autoimmune diseases
- Anti-proliferative effect on different cell types like cancer cells
Vitamin D3 - mode of action
1. Vitamin D3 (VitD3) enters the cell and binds to the vitamin D receptor (VDR), which dimerizes with RXR
2. Ligand-activated VDR/RXR binds the DNA at vitamin D response elements (VDRE)
3. Recruitment of co-activators and chromatin remodelers -> open chromatin structure
4. Transcription of the VDR target gene
[Figure labels: VitD3, VDR, RXR, VDRE, chromatin remodeling complex, co-activator complex, DRIP, transcription machinery, target gene]
Vitamin D3 - dataset
[Figure: mouse bone cells VERSUS human breast cancer cells, VitD3 versus control (Ctr); VitD3-bound RXR/VDR at the VDRE of a target gene -> ANTIPROLIFERATIVE PHENOTYPE]
GOAL: gain insight into the molecular mechanism underlying the anti-proliferative effect of vitD3
- Human and mouse cell lines treated with vitD3 versus no vitD3 (Control)
- Measured the expression of all genes in the human and mouse cells using microarrays for both conditions over different time points
- Select differentially expressed genes (vitD3 versus Control) -> phenotype
- Group per species all genes with similar behavior in coexpression clusters
-> Focus on genes with a conserved coexpression behavior across human and mouse, as these are interesting for the common anti-proliferative phenotype
Vitamin D3 - Dataset
Conserved coexpression cluster:
- 10 genes
- Upregulated after vitD3
Assume: conserved transcriptional regulation
Conserved regulatory motifs responsible for expression behavior
-> De novo strategy
-> Screening: co-localization of TF binding sites
Vitamin D3 - de novo motifs
METHOD: PhyloMotifWeb
RESULTS:
1. Very common motifs
• Low specificity for coexpressed cluster
• Match with TFs involved in cell cycle regulation
– e.g. SP1, ZF5, NRF1
– Well conserved TF binding sites, present in many genes!
• TF involved in B-cell differentiation
– EBF
Vitamin D3 - de novo motifs
2. Motifs specific for the conserved coexpression cluster
-> higher overrepresentation in the cluster compared to the genome
-> match with following TFs:
ZEB1
- Transcriptional activator of VDR protein
- Role in cancer metastasis
VDR
- Putative direct regulation by VDR
- VDRE hard to discover de novo: only one conserved half-site!
• Two conserved half sites (C1, C2) with variable spacer
• Diverse configurations [DR, IR, ER]
• Located far up-/downstream of the TSS
NHR-scan: specific for nuclear hormone receptor binding sites
Vitamin D3 – Cis-regulatory modules
[Figure: repeated pairs of TF1 and TF2 binding sites]
Higher eukaryotes:
-> TFs act in cooperation to modulate gene expression
-> Find co-localized binding sites for de novo predicted motifs => CRMs
Vitamin D3 – Cis-regulatory modules
METHOD: CPModule
INPUT:
• De novo predicted motif models
• Constraint: module size ranging between 150 bp and 400 bp
RESULTS:
• 3 CRMs highly specific for the coexpressed genes (p-value < 0.001):
  SP1-EBF: 7 genes
  NRF1-EBF: 7 genes
  VDR-ZEB1-EBF: 10 genes
• Each CRM contains the EBF motif -> degenerate -> many hits -> using a motif-specific score threshold
• Most interesting is the ZEB1-VDR module
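The idea of a cis-regulatory module as co-localized binding sites within a size limit can be illustrated with a small check. This is not the CPModule algorithm: it is a simplified sketch that only applies the upper size bound, ignores site widths and scores, and uses hypothetical gene names and site positions.

```python
def has_module(sites_by_motif, motifs, max_size=400):
    """True if one site per motif can be chosen within a span of max_size bp."""
    positions = [sorted(sites_by_motif.get(motif, [])) for motif in motifs]
    if any(not motif_positions for motif_positions in positions):
        return False
    # Try every site as the leftmost site of a candidate module.
    for anchor in (pos for motif_positions in positions for pos in motif_positions):
        if all(
            any(anchor <= pos <= anchor + max_size for pos in motif_positions)
            for motif_positions in positions
        ):
            return True
    return False

def genes_with_module(gene_sites, motifs, max_size=400):
    """Names of genes whose predicted sites contain the module."""
    return [
        gene for gene, sites in gene_sites.items()
        if has_module(sites, motifs, max_size)
    ]

# Hypothetical site start positions (bp) per motif and gene.
gene_sites = {
    "geneA": {"VDR": [120], "ZEB1": [300], "EBF": [450]},
    "geneB": {"VDR": [100], "ZEB1": [900], "EBF": [150]},
    "geneC": {"VDR": [10], "EBF": [20]},
}
module_genes = genes_with_module(gene_sites, ["VDR", "ZEB1", "EBF"])
print(module_genes)
```

Counting such genes inside the cluster and in a background set is one way a module's specificity could then be assessed.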
Vitamin D3 - perspectives
• Motifs predicted for the conserved coexpression cluster -> investigate their presence for larger species-specific clusters or maybe for the full genome
• The availability of cell-type specific epigenetic information can help to retrieve the functional binding sites
• Besides a transcriptome analysis -> integrate extra omics data like ChIP-seq and protein profiling to reconstruct the regulatory network of vitD3
Acknowledgements
CMPG-Bioi
ESAT-Bioi
• Prof. Dr. Kathleen Marchal
• Prof. Dr. Bart De Moor
• Dr. Pieter Monsieurs
• Prof. Dr. Yves Moreau
• Marleen Claeys
• Wouter Van Delm
• Carolina Fierro
• Aminael Sanchez
LEGENDO
• Hong Sun
• Dr. Lieve Verlinden
• Prof. Dr. Mieke Verstuyf
CMPG
• Dr. Guy Eelen
• Els Vanoirbeek
• Prof. Dr. Jan Michiels
Extra slides
Theoretical comparison
Phylogibbs Algorithm (1)
Procedure:
1. start with a random configuration C,
based on prior information on the number of motif sites/TFs
2. construct the set of all possible configurations C’ that differ
in one single move from C (designed moveset)
3. calculate for each C’ the posterior probability score
4. sample a new configuration from this score distribution
-> This procedure is repeated for two phases:
1. Simulated annealing: iterating to configuration C* with the highest posterior probability (= MAP) (temperature parameter β)
2. Tracking: posterior probabilities are assigned to the windows in C*
-> One initialization is sufficient
-> Very short running time (minutes/hours)
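The sampling step and the role of a temperature parameter can be illustrated generically. The sketch below is not Phylogibbs' moveset or scoring function: the candidate scores are hypothetical, and raising each score to the power β before sampling is an assumption about how the temperature enters.

```python
import random

def sample_configuration(scores, beta, rng):
    """Pick a candidate index with probability proportional to score ** beta."""
    weights = [score ** beta for score in scores]
    return rng.choices(range(len(scores)), weights=weights, k=1)[0]

rng = random.Random(0)
scores = [0.1, 0.7, 0.2]  # hypothetical posterior scores of three candidate moves

# beta = 1: ordinary sampling from the score distribution.
flat_picks = [sample_configuration(scores, 1.0, rng) for _ in range(1000)]
# Large beta: the distribution sharpens around the best-scoring candidate.
sharp_picks = [sample_configuration(scores, 50.0, rng) for _ in range(1000)]
print(flat_picks.count(1), sharp_picks.count(1))
```

Increasing β over the iterations moves the sampler from exploring many configurations towards settling on the highest-scoring one.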
Theoretical comparison
Phylogibbs Algorithm (2)
3. Calculate the posterior probability score: P(C|S)
Bayes’ Theorem:
-> P(C|S) ~ P(S|C) = probability that the motif sites of C are drawn from the motif WM and that the
background sequence is drawn from the background model -> EVOLUTIONARY MODEL
-> The motif WM is unknown -> integral over all possible WMs,
with prior P(WM) modeled by a Dirichlet prior distribution Dir(γ)
The approximation to solve this integral requires that the tree topologies are reduced to
collections of star topologies
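The formula itself is not preserved in this text. A generic form of the marginalization over the unknown weight matrix, consistent with the description above but not necessarily the exact expression on the slide, would be:

```latex
P(S \mid C) \;=\; \int P(S \mid C, \mathrm{WM}) \, P(\mathrm{WM}) \, d\mathrm{WM},
\qquad \mathrm{WM} \sim \mathrm{Dir}(\gamma)
```

Here S denotes the sequences, C the configuration of motif sites, and the integral runs over all possible weight matrices.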
Theoretical comparison
Phylogenetic sampler Algorithm (1)
Procedure:
1. start with a random positioning of blocks (based on prior information
on the expected number of motif sites/TFs and max number of motif sites per sequence)
2. update the motif model based on the current blocks (<-> PG)
3. scoring: leave out the blocks for one sequence (<-> PG)
and calculate for each possible block the conditional probability score
4. first sample the number of motif sites for the sequence, then sample
this number of blocks from the score distribution (3)
-> This iteration procedure is repeated for:
1. Burn-in phase: to converge to local optimum
2. Sampling phase: keep track of all sampled blocks to construct the centroid afterwards
-> multiple initializations (seeds) recommended to avoid getting trapped in a local maximum
-> long running time (hours/days)
Theoretical comparison
Phylogenetic sampler Algorithm (2)
2. Update the motif model
-> Sample a new motif model from a Dirichlet distribution Dir(β+c) adjusted with phylogenetically weighted counts (based
on phylogenetic tree)
-> Accept the new motif with a probability proportional to the Metropolis-Hastings ratio
3. Calculate the conditional probability score
The conditional probability
=> proportional to the probability that the block is drawn from the motif model (inferred)
divided by the probability that the block is drawn from the background model
-> EVOLUTION MODEL
-> The Felsenstein tree-likelihood algorithm is used to handle all tree topologies (<->PG)
Theoretical comparison
Solution
Phylogibbs -> Maximum a posteriori (MAP) solution
-> set of motif sites (configuration) with the highest posterior probability
Phylogenetic sampler -> Centroid solution
-> report all those motif sites that appear in at least half the sampling iterations
-> keeps track of all motif sites sampled during sampling iterations to calculate posterior probabilities
-> does not take into account joint occurrences of the motif sites
Figure from Newberg et al., 2007
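The centroid rule described above (report sites that appear in at least half of the sampling iterations) can be sketched directly. This is a minimal illustration, not the Phylogenetic sampler implementation; representing a site as a (sequence id, position) tuple and the example samples are assumptions.

```python
from collections import Counter

def centroid_sites(samples):
    """Return the sites present in at least half of the sampled configurations."""
    counts = Counter(site for sample in samples for site in set(sample))
    n_samples = len(samples)
    return sorted(site for site, count in counts.items() if count / n_samples >= 0.5)

# Hypothetical sampling iterations; each is a set of (sequence id, position) sites.
samples = [
    {("s1", 10), ("s2", 40)},
    {("s1", 10)},
    {("s1", 10), ("s2", 41)},
    {("s2", 40), ("s1", 10)},
]
centroid = centroid_sites(samples)
print(centroid)
```

Each site is judged on its own frequency, which is why joint occurrences of sites play no role in this solution.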
Theoretical comparison
Evolutionary model
Adapted Felsenstein (F81) model
-> Describes the substitution process at the nucleotide level
-> Assumes that all positions evolve independently and at equal rates (u)
-> Probability that a is mutated to b is dependent on the time (t)
-> Fixation of b is dependent on its frequency in the motif WM
Phylogibbs -> proximity = q = exp(-ut) = probability that no substitution took place per site
Phylogenetic sampler -> branch length = b = ut AND a different normalization for their branch
lengths (k)
Convert proximities to branch lengths: b = -(3/4) ln(q)
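The quantities on this slide can be written out as a short sketch. The proximity and the conversion to branch lengths follow the formulas above; the substitution probability is the usual F81-style form (no substitution with probability q, otherwise a new base drawn from the target frequencies), which is an assumption about the adapted model rather than the exact expression used by either tool. The rate, time and frequencies are hypothetical.

```python
import math

def proximity(rate, time):
    """q = exp(-u t): probability that no substitution took place at a site."""
    return math.exp(-rate * time)

def branch_length_from_proximity(q):
    """Conversion given on the slide: b = -(3/4) ln(q)."""
    return -0.75 * math.log(q)

def substitution_prob(a, b, q, freqs):
    """Assumed F81-style probability of observing b in a descendant of a."""
    return (q if a == b else 0.0) + (1.0 - q) * freqs[b]

q = proximity(rate=1.0, time=0.5)  # hypothetical rate and time
freqs = {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01}  # e.g. one motif WM column
row = {b: substitution_prob("G", b, q, freqs) for b in "ACGT"}
print(q, branch_length_from_proximity(q), row)
```

With motif column frequencies as the target distribution, a conserved base such as G is very likely to stay G, while the same model with background frequencies lets the site drift.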
Introduction
Main players in Transcriptional regulation
Prokaryotic cells (bacteria):
• No nucleus, circular ‘naked’ DNA molecule
Eukaryotic cells:
• Linear DNA molecules organized into chromosomes
• Chromatin = complex of DNA and proteins (histones)
Chromatin function:
– Storage of long DNA molecules in the nucleus
– Role in transcriptional regulation: euchromatin and heterochromatin
[Figure labels: nucleus, chromosome, DNA, nucleosome, chromatin, histone proteins]
Main players in Transcriptional regulation
2. Chromatin structure (eukaryotes)
[Figure labels: co-activator, TF, DNA, RNA polymerase complex, chromatin remodeling complex, TSS, target gene, promoter region]
Theoretical comparison
Input format
COREGULATION space: non-coding regions for a set of coregulated genes from one species
- Phylogibbs and Phylogenetic sampler: unaligned
ORTHOLOGOUS space: non-coding regions for a set of orthologous genes from multiple species
- Phylogibbs and Phylogenetic sampler: prealigned orthologs (PG => Dialign, PS => ClustalW) and a phylogenetic tree
COMBINED space: combination of both
MEME: unaligned
Performance assessment
Results (3)
2. Phylogenetic tree
=> Tree based on neutral evolution rate
3. The number of added orthologs and the topology of the tree
4. Noise
=> Orthologous direction: performance drop depends on the species distance
and the algorithm characteristics
[Figure labels: Spec 3, Spec M]
Phylogibbs ↓, Phylogenetic sampler ↓
- Weighting scheme
- Block principle
PhyloMotifWeb - webserver
PHYLO-MOTIF-WEB
STEP 1: Select the non-coding regions
STEP 2: Additional information sources
- Mask repeats
- Multi-species alignments
- DNA features like chromatin structure
STEP 3: Motif discovery by using an ensemble strategy
STEP 4: Post-processing of the predicted ensemble motif matrices
External databases: ENSEMBL CORE, ENSEMBL COMPARA AND REGULATORY BUILD, UCSC GENOME BROWSER, TRANSFAC and JASPAR
External software: MEME, Phylogibbs, Phylogenetic sampler, Clover, MotifComparison
Vitamin D3 - de novo motifs
RESULTS:
1. Very common motifs
-> low overrepresentation in the cluster compared to the genome
-> match with following TFs:
SP1 [MEME]
- Involved in vitD3 response -> regulation of genes without VDRE binding site
- Regulator of TFs involved in cell cycle regulation
ZF5 [MEME, PG]
- TF particularly abundant in differentiated tissues with low proliferation
- Growth suppressive activity
NRF1 [MEME, PG]
- Involved in cell proliferation
EBF [PS]
- B-cell differentiation
-> SP1, ZF5 and NRF1 are cell cycle regulators -> well conserved binding sites, present in many genes!
PhyloMotifWeb – Ensemble strategy
• Three motif finders: Phylogibbs, Phylogenetic sampler and MEME
• Run each motif finder across multiple parameter settings (e.g. different
motif numbers, motif widths etc.) -> Large collection of output matrices
• FuzzyClustering algorithm -> summarizes all these output matrices into
a set of non-redundant ensemble motifs
- Works on TF binding site level -> fine tuning sensitivity/specificity
- Integration of TF binding site scores assigned by the original motif
finder
- Trace back the different motif finders that contributed to the final
solution
Vitamin D3 - de novo motifs
METHOD: PhyloMotifWeb
- 4000 bp centered around TSS
-> Restrict to regions with regulatory potential
- Use evolutionary conservation information
-> human-mouse pairwise alignment
-> six species alignment
- Use Phylogibbs, Phylogenetic sampler and MEME => Ensemble solution
- Predicted ensemble motifs were compared to database motifs from
TRANSFAC and JASPAR to retrieve TFs potentially involved in the
coexpression behavior
Vitamin D3 - dataset
[Figure: mouse bone cells VERSUS human breast cancer cells, VitD3 versus control (Ctr); VitD3-bound RXR/VDR at the VDRE of a target gene -> ANTIPROLIFERATIVE PHENOTYPE]
GOAL: gain insight into the molecular mechanism underlying the anti-proliferative effect of vitD3
- Human and mouse cell lines treated with vitD3 versus no vitD3 (Control)
- Measured the expression of all genes in the human and mouse cells using microarrays for both conditions over different time points
- Select differentially expressed genes (vitD3 versus Control) -> phenotype
- Group per species all genes with similar behavior in coexpression clusters
-> Focus on similarity between human and mouse cells as interesting for COMMON antiproliferative phenotype
General perspectives
Integration of multiple information sources to improve de novo
motif discovery
• Orthology information
– Ortholog alignments, evolutionary models
– Evolution in how algorithms exploit this information source
• New information sources like epigenetic information
become available
– How to exploit this new information?
– More knowledge on which chromatin modifications co-locate
with transcriptionally active regions like promoters, enhancers
or TF binding sites will improve usability
