Genome evolution: a sequence
Genome Evolution. Amos Tanay 2012
Genome evolution, Lecture 12: Evolution of regulatory sequences
Beyond Protein Coding Sequences
Non coding fraction of the genome:
• E. coli : 12%
• Yeast : 27%
• Fly : 76%
• Human : 97.6%
How can the biological functions of non-coding sequence be defined?
Sequence specific transcription factors
• Sequence-specific transcription factors (TFs) are a critical part of any gene activation or gene repression machinery.
• TFs include a DNA-binding domain that specifically recognizes “regulatory elements” in the genome.
• The TF-DNA complex is then used to target larger transcriptional structures to the genomic locus.
Lactose Repressor
Sequence specificity is represented using consensus sequences or weight matrices
• The specificity of TF binding is central to understanding the regulatory relations it can form.
• We are therefore interested in defining the DNA motifs that can be recognized by each TF.
• A simple representation of the binding motif is the consensus site, usually derived by studying a set of confirmed TF targets and identifying a (partial) consensus. Degeneracy can be introduced into the consensus by using N letters (matching any nucleotide) or IUPAC characters (representing sets of nucleotides, for example W=[A|T], S=[C|G]).
• A more flexible representation uses weight matrices (PWM/PSSM):
ACGCGT
ACGCGA
ACGCAT
TCGCGA
TAGCGT
     1      2      3      4      5      6
A   60%    20%     0      0     20%    40%
C    0     80%     0    100%     0      0
G    0      0    100%     0     80%     0
T   40%     0      0      0      0     60%
• PWMs are frequently plotted as motif logos, in which the height of each character corresponds to its probability, scaled by the information content of the position (derived from the position entropy).
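The counting step behind the table above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function names `build_pwm` and `information_content` are made up here, and the 2-bit maximum assumes a uniform four-letter background.

```python
import math
from collections import Counter

# The five aligned example sites from the slide.
SITES = ["ACGCGT", "ACGCGA", "ACGCAT", "TCGCGA", "TAGCGT"]
ALPHABET = "ACGT"

def build_pwm(sites):
    """Return one dict per position mapping nucleotide -> observed frequency."""
    length = len(sites[0])
    assert all(len(s) == length for s in sites), "sites must be aligned"
    pwm = []
    for i in range(length):
        counts = Counter(s[i] for s in sites)
        pwm.append({nt: counts[nt] / len(sites) for nt in ALPHABET})
    return pwm

def information_content(column):
    """Bits of information at a position: 2 minus the Shannon entropy."""
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return 2.0 - entropy

pwm = build_pwm(SITES)
for position, column in enumerate(pwm, start=1):
    print(position, column, round(information_content(column), 2))
```

In a logo, each letter at a position would be drawn with height equal to its frequency times that position's information content.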
In vitro TF binding energy is approximated by weight matrices
We can interpret weight matrices as energy functions:
$$E(s) = \sum_i w_i[s_i], \qquad w_i[s_i] = \log(p_i[s_i])$$
This linear approximation is reasonable for most TFs.
Yeast Leu3 data
(Liu and Clarke, JMB 2002)
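As a rough sketch of the energy function on this slide, the snippet below sums log-probabilities over the positions of a candidate site, using the frequencies of the example matrix. The pseudo-frequency `PSEUDO` is an assumption added here so that zero entries do not produce log(0); the helper names are hypothetical. With this sign convention, a higher E(s) means a stronger predicted site.

```python
import math

# Position frequencies of the six-position example matrix (one dict per position).
PWM = [
    {"A": 0.6, "C": 0.0, "G": 0.0, "T": 0.4},
    {"A": 0.2, "C": 0.8, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
    {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0},
    {"A": 0.2, "C": 0.0, "G": 0.8, "T": 0.0},
    {"A": 0.4, "C": 0.0, "G": 0.0, "T": 0.6},
]
PSEUDO = 0.01  # assumption: small pseudo-frequency to avoid log(0)

def energy(site, pwm=PWM, pseudo=PSEUDO):
    """E(s) = sum_i w_i[s_i] with w_i[s_i] = log(p_i[s_i]) after smoothing."""
    total = 0.0
    for column, nt in zip(pwm, site):
        p = (column[nt] + pseudo) / (1.0 + 4 * pseudo)
        total += math.log(p)
    return total

def best_window(sequence, pwm=PWM):
    """Scan a longer sequence and return (energy, offset) of the best window."""
    k = len(pwm)
    return max((energy(sequence[i:i + k], pwm), i)
               for i in range(len(sequence) - k + 1))

print(energy("ACGCGT"), energy("TAGCGA"))
print(best_window("TTTTACGCGTTTTT"))
```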
In-vivo TF binding affinity is approximated by weight matrices
Chromatin ImmunoPrecipitation (ChIP): cross-link and shear, then immunoprecipitate
[Figure: Ume6. Average PWM energy (axis marks 5.5 and 11.5; higher is a stronger prediction) across ChIP ranges (toward stronger binding)]
Tanay. Genome Res 2006
TF binding affinity is kinetically important, with possible
functional implications
Kalir et al. Science 2001
TFs are present at only a fraction of their optimal sequence
targets. Binding is regulated by co-factors, nucleosomes
and histone modifications
(Heintzman et al., Nature Genetics 2007)
Specific proteins identify enhancers
Here are studies of p300 binding in the developing mouse brain
(Visel et al., Nature 2009)
TFBSs are clustered in promoters or in “sequence modules”
• The distribution of binding sites in the genome is non-uniform.
• In small genomes, most sites are in promoters, and there is a bias toward the nucleosome-free region near the TSS.
• In larger genomes (fly) we observe CRMs (cis-regulatory modules), which are frequently far from the TSS. These represent enhancers.
• A single binding site, without the context of other co-sites, is unlikely to represent a functional locus.
Discriminative scores for motifs
• So far we used a generative probabilistic model to learn PWMs.
• The model was designed to generate the data from parameters.
• We assumed that TFBSs are distributed differently than some fixed background model.
• If our background model is wrong, we will get the wrong motifs.
• A different scoring approach tries to maximize the discriminative power of the motif model.
• We will not go into the details of discriminative vs. generative models here, but we shall exemplify the discriminative approach for PWMs.
Lousy discriminator
High specificity discriminator
High sensitivity discriminator
Hypergeometric scores and thresholding PWMs
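Hypergeometric probability (the sum over j >= k is the hypergeometric p-value):
$$P(|A \cap B| = k) = \frac{\binom{|A|}{k}\binom{n-|A|}{|B|-k}}{\binom{n}{|B|}}$$
[Figure: n is the number of sequences, A is the positive set, B is the set of sequences passing the PWM score threshold, and the true positives are the intersection of A and B]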
For a discriminative score, we need to decide on both the PWM model and the
threshold.
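To make the joint choice of score threshold concrete, here is a small sketch that computes the hypergeometric p-value for a given overlap and then tries every observed PWM score as a threshold. The data layout (a dict of per-sequence scores and a set of positive sequence ids) and the function names are assumptions made for illustration only.

```python
from math import comb

def hypergeom_pmf(k, n, size_a, size_b):
    """P(|A & B| = k) when B is a random subset of size_b out of n sequences."""
    return comb(size_a, k) * comb(n - size_a, size_b - k) / comb(n, size_b)

def hypergeom_pvalue(k, n, size_a, size_b):
    """P(|A & B| >= k): sum of the pmf over j = k .. min(|A|, |B|)."""
    upper = min(size_a, size_b)
    return sum(hypergeom_pmf(j, n, size_a, size_b) for j in range(k, upper + 1))

def best_threshold(scores, positives):
    """Try each observed score as a threshold; return (p_value, threshold).

    scores: dict mapping sequence id -> PWM score (hypothetical input format)
    positives: set of sequence ids in the positive set A
    """
    n = len(scores)
    best = None
    for t in sorted(set(scores.values())):
        hits = {seq for seq, value in scores.items() if value >= t}  # set B
        k = len(hits & positives)                                    # true positives
        p = hypergeom_pvalue(k, n, len(positives), len(hits))
        if best is None or p < best[0]:
            best = (p, t)
    return best

print(best_threshold({"a": 5, "b": 4, "c": 1, "d": 0}, {"a", "b"}))
```

Scanning many thresholds (and many candidate motifs) inflates the best p-value, so in practice it would need a multiple-testing correction.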
Constructing a weight matrix from aligned TFBSs is trivial
• This is done by counting (or “voting”)
• Several databases (e.g., TRANSFAC, JASPAR) contain matrices that were constructed from sets of curated and validated binding sites
• Validated site: usually using “promoter bashing”, i.e. testing reporter constructs with and without the putative site
TRANSFAC 7.0/11.3 have 400/830 different PWMs, based on more than 11,000 papers
However, there are not really 830 distinct matrices out there; the real binding repertoire in nature is still somewhat unclear
High density arrays quantify TF binding preferences and identify binding sites in high throughput
• Using microarrays (high resolution tiling arrays) we can now map binding sites in a genome-wide fashion for any genome
• The problem is shifting from identifying binding sites to understanding their function and determining how sequences define them
Harbison et al., Nature 2004
Direct measurements of the in-vitro binding affinity of 8-mers and DNA binding domains
(here just a library of homeodomains, from Berger et al. 2008)
Profiling binding affinity to the entire k-mer spectrum provides direct quantification of in-vitro affinity (Badis et al., 2009)
[Figure: 104 TFs by 8-mers. Heatmap of 2D hierarchical agglomerative clustering analysis of 4740 ungapped 8-mers over 104 nonredundant TFs, with both 8-mers and proteins clustered using the averaged E-score from the two different array designs.]
What kind of biological function is naturally selected?
Discrete and deterministic “binding sites” in yeast, as identified by Young, Fraenkel and colleagues
In fact, binding is rarely deterministic and discrete, and simple wiring is something you
should treat with extreme caution.
The Halpern-Bruno model for selection on affinity
We derive the substitution rate at each position of the binding site, given its observed stationary frequency. We assume that the fitness of the site is defined by multiplying the fitness values of all loci. This means fitness is generally linear in the binding energy!
According to Kimura’s theory, an allele with fitness 1+s (s ≪ 1) arising in a homogeneous population of size N would fixate with probability:
$$\frac{1 - e^{-2s}}{1 - e^{-2Ns}} \approx \frac{2s}{1 - e^{-2Ns}}$$
Assuming a slow mutation rate (which allows us to assume a homogeneous population) and motifs a and b with relative fitness s, the fixation probabilities (chance of fixation given that a mutation occurred!) are:
$$f_{ab} = \frac{1 - e^{-2s}}{1 - e^{-2Ns}} \approx \frac{2s}{1 - e^{-2Ns}}, \qquad f_{ba} = \frac{1 - e^{2s}}{1 - e^{2Ns}} \approx \frac{-2s}{1 - e^{2Ns}}$$
$$\frac{f_{ab}}{f_{ba}} = \frac{2s}{1 - e^{-2Ns}} \cdot \frac{e^{2Ns} - 1}{2s} = e^{2Ns}$$
If p represents the mutation probability and π the stationary distribution, and if we assume the process as a whole is reversible, then:
$$\pi_a p_{ab} f_{ab} = \pi_b p_{ba} f_{ba} \quad\Rightarrow\quad \frac{\pi_b p_{ba}}{\pi_a p_{ab}} = \frac{f_{ab}}{f_{ba}} = e^{2Ns}$$
$$f_{ab} \propto \frac{\ln\frac{\pi_b p_{ba}}{\pi_a p_{ab}}}{1 - \frac{\pi_a p_{ab}}{\pi_b p_{ba}}}, \qquad r_{ab} = c\, p_{ab}\, \frac{\ln\frac{\pi_b p_{ba}}{\pi_a p_{ab}}}{1 - \frac{\pi_a p_{ab}}{\pi_b p_{ba}}}$$
(Halpern and Bruno, MBE 1998)
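The final rate formula can be evaluated directly from stationary frequencies and mutation probabilities. The sketch below is an illustration under the slide's assumptions (reversibility, weak selection); the function names and the handling of the neutral limit are choices made here, not part of the original model description.

```python
import math

def fixation_probability(s, n):
    """Kimura-style fixation probability (1 - e^{-2s}) / (1 - e^{-2Ns}).

    Tends to the neutral value 1/N as s approaches 0.
    """
    if abs(s) < 1e-12:
        return 1.0 / n
    return (1.0 - math.exp(-2.0 * s)) / (1.0 - math.exp(-2.0 * n * s))

def hb_rate(pi_a, pi_b, p_ab, p_ba, c=1.0):
    """Substitution rate r_ab = c * p_ab * ln(x) / (1 - 1/x),
    where x = (pi_b * p_ba) / (pi_a * p_ab)."""
    x = (pi_b * p_ba) / (pi_a * p_ab)
    if abs(x - 1.0) < 1e-12:
        return c * p_ab  # neutral limit: ln(x) / (1 - 1/x) -> 1
    return c * p_ab * math.log(x) / (1.0 - 1.0 / x)

# A substitution toward the more frequent motif is accelerated,
# the reverse substitution is slowed down.
print(hb_rate(0.1, 0.4, 0.1, 0.1), hb_rate(0.4, 0.1, 0.1, 0.1))
```

A useful sanity check is detailed balance: pi_a * r_ab should equal pi_b * r_ba for any pair of motifs.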
The Halpern-Bruno model for selection on affinity
The HB model is limited for the study of general sequences.
When restricting the analysis to relatively specific sites, HB is not completely off
Moses et al., 2003
Testing the general binding energy – fitness correspondence
• While E(S) is approximated by a PWM, F(E) is unlikely to be linear
• Assume that the background probability of a motif a is P_0(a). In detailed balance, and assuming the fitness of a at functional sites is F(a), the stationary distribution at sites can be shown to be:
$$Q(a) = P_0(a)\, e^{2NF(a)}$$
• If we collapse all sites with binding energy E (and hence the same F(a) = F(E(a))):
$$Q(E) = P_0(E)\, e^{2NF(E)}$$
• The entire genome should behave like a mixture of background sequence and functional loci:
$$W(E) = (1 - \lambda) P_0(E) + \lambda Q(E)$$
• So we can try to recover Q(E), and therefore F(E), from the maximum likelihood parameters fitting an empirical W(E)
[Figure: expected and observed energy distribution in E. coli CRP sites (left) and background (right); the inferred F(E) is shown in orange]
[Figure: comparison of CRP energies in E. coli and S. typhimurium]
(Hwa and Gerland, 2000-)
Mustonen and Lassig, PNAS 2005
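The relation between W(E), the background P_0(E) and the fitness F(E) can be inverted algebraically once the mixture weight is fixed. The sketch below does only that inversion on binned distributions; it is not the maximum likelihood fit used in the cited work, and the names `mixture`, `scaled_fitness` and `lam` (the fraction of functional loci) are assumptions made here.

```python
import math

def mixture(p0, q, lam):
    """W(E) = (1 - lam) * P0(E) + lam * Q(E), computed per energy bin."""
    return [(1.0 - lam) * a + lam * b for a, b in zip(p0, q)]

def scaled_fitness(w, p0, lam):
    """Recover 2N * F(E) = ln(Q(E) / P0(E)) per bin, with
    Q(E) = (W(E) - (1 - lam) * P0(E)) / lam.

    Bins where the implied Q(E) is not positive are returned as None.
    """
    result = []
    for w_e, p_e in zip(w, p0):
        q_e = (w_e - (1.0 - lam) * p_e) / lam
        if q_e > 0 and p_e > 0:
            result.append(math.log(q_e / p_e))
        else:
            result.append(None)
    return result

# Round trip on made-up numbers: build W from a known 2NF, then recover it.
p0 = [0.7, 0.2, 0.1]
true_2nf = [0.0, 1.0, 3.0]
q = [p * math.exp(f) for p, f in zip(p0, true_2nf)]
w = mixture(p0, q, 0.1)
print(scaled_fitness(w, p0, 0.1))
```

Only the product 2NF(E) is recovered; separating N from F would need additional information.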
TFBS evolution: purifying selection and conservation
Outcome              TFs        Sequences              Rate / selection
Similar function     TF1, TF1   CACGCGTT / CACGCGTA    Neutral evolution
Disrupted function   TF1        CACGCGTT / CACGAGTT    Low rate, purifying selection
Altered function     TF2, TF1   CACGCGTT / CACACGTT    Low rate, purifying selection
Altered affinity                CACGCGTT / CACACGTT    Rate? Selection?
Binding sites conservation
Kellis et al., 2003
Binding sites conservation: heuristic motif identification
Kellis et al., 2003
Analyzing k-mer evolutionary dynamics
• Instead of trying to identify conserved motifs, try to infer the evolutionary rate of substitution between pairs of k-mers
• Start from a multiple alignment and reconstruct ancestral sequences (assuming site independence, or even max parsimony)
• Now estimate the number of substitutions between pairs of 8-mers, and compare this number to the number expected by the background model
• Do it for a lot of sequence, so that statistics on the difference between observed and expected substitutions can be derived
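The counting step in the procedure above can be sketched as follows, assuming the ancestral reconstruction has already been done and is given as gap-free (ancestral, derived) sequence pairs. The background model here is deliberately crude: `mu` is a hypothetical, uniform probability of one specific nucleotide change per site, standing in for whatever neutral model is actually fitted.

```python
from collections import Counter

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def count_kmer_substitutions(pairs, k=8):
    """Count ancestral k-mer occurrences and single-nucleotide k-mer changes.

    pairs: iterable of (ancestral, derived) aligned sequences without gaps.
    Returns (occurrences, substitutions), where substitutions is keyed by (a, b).
    """
    occurrences, substitutions = Counter(), Counter()
    for ancestral, derived in pairs:
        assert len(ancestral) == len(derived), "sequences must be aligned"
        for i in range(len(ancestral) - k + 1):
            a, b = ancestral[i:i + k], derived[i:i + k]
            occurrences[a] += 1
            if hamming(a, b) == 1:
                substitutions[(a, b)] += 1
    return occurrences, substitutions

def observed_vs_expected(occurrences, substitutions, mu):
    """Return {(a, b): (observed, expected, ratio)} under the crude background."""
    result = {}
    for (a, b), observed in substitutions.items():
        expected = occurrences[a] * mu
        result[(a, b)] = (observed, expected, observed / expected)
    return result

occ, subs = count_kmer_substitutions([("ACGT", "ACGA")], k=3)
print(observed_vs_expected(occ, subs, mu=0.5))
```

Ratios well below 1 over a large amount of sequence would point to substitutions that are selected against, which is the signal the selection networks on the following slides are built from.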
Saccharomyces TFBS Selection Network
Inter-island organization in the Reb1 cluster: selection hints toward multi-modality of Reb1
[Figure legend. Nodes: octamers; node conservation: conserved @ 2SD, conserved @ 3SD, otherwise. Arcs: 1nt substitution; arc rate and selection: normal (neutral), low (negative), not enough stat]
Tanay et al., 2004
Leu3 selection network
[Figure: substitution rate (0 to 0.3) versus log delta affinity (-5 to 3). Motif classes: high affinity (Kd < 60) and medium affinity (400 > Kd > 60). Annotations: substitutions changing high affinity to high affinity motifs; high rate subs.; substitutions changing high affinity to low affinity motifs]
A simple transcriptional code and its evolutionary implications
TF1: CACGTG, CACTTG
TF2: TGACTG, TGAGTG, TGACTT
TF3: GATGAG, GATGCG, GATGAT
TF4: ACGCGT, TCGCGT, ACGCGT
TF5: AAATTT, AATTTT, AAAATT
The Halpern-Bruno model for selection on affinity
The basic notion here is of the relations between sequence, binding and function/fitness:
Sequence -> [E(S)] -> Binding energy -> [F(E)] -> Function
We argued that E(S) can be approximated by a PWM
F(E) is a completely different story, for example:
Is there any function at all to low affinity binding sites?
Is there a difference between very high affinity and plain strong binding sites?
Are all appearances of the site subject to the same fitness landscape?
More tests for possible conservation of low binding energy sites
[Figure: ΔE distributions in a simulation (neutral, context aware) and between S. mikatae and S. cerevisiae, compared using KS statistics (axis 0 to 1) for high-affinity and low-affinity sites]
More tests for possible conservation of low binding energy sites
[Figure: conservation score versus binding energy percentile (0 to 100) for Reb1, Ume6, Cbf1, Gcn4 and Mbp1, comparing binding site conservation with conservation of total energy]
Tanay, GR 2006
Evolutionary dynamics of transcription factor
binding (mammals)
Shared binding loci: 4%
Schmidt et al. Science 2010
Evolutionary dynamics of CTCF binding
(mammals)
Shared binding loci: 24%
Schmidt et al. Cell 2012
Evolutionary dynamics of transcription factor
binding (flies) – correlates with the sequence
Bradley et al. PLoS biology 2010