Review: RECOMB Satellite Workshop on Regulatory Genomics

(Held March 26-27, 2004)
Workshop Themes/Trends
• More comprehensive evaluations of
motif-detection algorithms
• Making more effective use of
comparative mapping/evolution data
• Models that explain rather than just
describe
• Moving from binding motifs to entire
regulatory modules
• Methods are simple not sophisticated
Outline
• Jim Kadonaga, University of California, San Diego
The MTE, a New Core Promoter Element for
Transcription by RNA Polymerase II
• Rotem Sorek, Compugen and Tel Aviv University
The "promoters" of splicing: Intronic sequences that
regulate alternative splicing
• Yitzhak Pilpel, Weizmann Institute
Revealing the architecture of genetic backup circuits
through inspection of transcription regulatory networks
• Ron Shamir, Tel Aviv University
Revealing selection patterns in the evolution of yeast
transcription regulation
• Michael Eisen, Lawrence Berkeley National Lab
Evolutionary Signatures of Regulatory Sequences
A New Core Promoter Element for
Transcription by RNA Polymerase II
(Jim Kadonaga)
The majority of transcription activity is
regulated by sequence-specific DNA-binding
factors, which are thus the focus of the bulk
of current research on regulation, however...
The ultimate target of all of this action is the
core promoter, which also plays a part in
regulation
Core promoter
• Encompasses TSS
• Directs RNA polymerase II
• Most well-known component is
the TATA box
Only about 30-40% of promoters contain a TATA box!
What’s going on the rest of the time?
Finding Novel Promoter Elements
• Experimentally investigated binding in
those promoters with no TATA-box
– found novel promoter element DPE
• Large scale motif detection of 2000 core
promoters in Drosophila (Ohler et al, 2002)
– Plotted distance of top 10 motifs to TSS
• four motifs had clear peak: TATA, Inr, DPE and ...
• a novel promoter element MTE
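A minimal sketch of the "distance to TSS" analysis described above, assuming exact string matches and promoters that are all aligned so the TSS sits at the same index. The function names, the toy sequences, and the `tss_index` parameter are illustrative assumptions, not taken from Ohler et al.; the toy offsets are not biologically meaningful.

```python
from collections import Counter

def motif_offsets(promoters, motif, tss_index):
    """Return the offset (relative to the TSS) of every exact motif match."""
    offsets = []
    for seq in promoters:
        start = seq.find(motif)
        while start != -1:
            offsets.append(start - tss_index)
            start = seq.find(motif, start + 1)
    return offsets

def peak_fraction(offsets):
    """Fraction of matches at the most common offset, plus that offset.

    A value near 1.0 means the motif has a sharp positional peak.
    """
    if not offsets:
        return 0.0, None
    offset, count = Counter(offsets).most_common(1)[0]
    return count / len(offsets), offset

# Toy promoters (hypothetical sequences); the TSS is at index 10 in each.
promoters = [
    "GGTATAAAGCATCAGTCGGA",
    "CCTATAAACGTTCAGTACGT",
    "AATATAAAGGCTCAGTTTGC",
]
frac, offset = peak_fraction(motif_offsets(promoters, "TATAAA", tss_index=10))
print(frac, offset)
```

A motif with no positional preference would spread its matches over many offsets and give a small peak fraction.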
The Core Promoter gets a new look
MTE: Motif Ten Promoter Element
(Kadonaga, PowerPoint slides)
DPE and MTE
Two Newly Identified Promoter Elements
• Conserved from Drosophila to human
(unknown whether they occur in yeast)
• Very sensitive to spacing relative to the Inr motif
– TSS determined experimentally (TSS positions reported in papers are not reliable)
– a single insertion/deletion between motifs
causes a 7-fold reduction in transcription
• Inr and DPE (or MTE) bound cooperatively
by TFIID
– first step in transcription initiation
TATA gets top billing but...
• In Drosophila (out of 205 core promoters)
– TATA and DPE: 14%
– TATA only: 29%
– DPE only: 26%
– Neither: 31%
• TATA, DPE, and MTE can all
– independently support transcription
– compensate for mutation in one other
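A quick arithmetic check on the breakdown above, using only the four percentages from the slide. It sums the classes to get the share of promoters with any TATA box versus any DPE; the variable names are illustrative.

```python
# Percentages reported on the slide for 205 Drosophila core promoters.
classes = {"TATA and DPE": 14, "TATA only": 29, "DPE only": 26, "Neither": 31}

# The four classes should partition the promoter set.
total = sum(classes.values())

any_tata = classes["TATA and DPE"] + classes["TATA only"]
any_dpe = classes["TATA and DPE"] + classes["DPE only"]
print(f"total: {total}%  any TATA: {any_tata}%  any DPE: {any_dpe}%")
```

On these numbers, DPE-containing promoters are nearly as common as TATA-containing ones, which is the point of the slide title.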
And finally... regulation.
• NC2 previously known to repress TATA-dependent transcription; unexpectedly
found to activate DPE-dependent
transcription
• Studied 18 enhancers and estimate that
about 25% exhibit some specificity for
DPE or TATA
• Similar work in progress for MTE
The “Promoters” of Splicing
(Rotem Sorek)
In general it is not known how alternative
splicing (AS) is regulated
• A few known splicing regulatory proteins
– like TFs they are sequence-specific, but they bind to
RNA not DNA
– binding motif (usually 4-10 nt) can be located in exon
or intron
– can act as enhancers or silencers
• Evidence for combinatorial regulation
The typical
“motif in a haystack”
• Most work on finding splicing
factor motifs focuses on exons
– short enough that mutation studies feasible
• Introns too long, require a computational
approach
• Compiled training dataset
– 250 AS exons, alternatively spliced in both mouse and human
– large set of constitutively spliced (CS) exons,
conserved across human/mouse
Sorek and Ast, Genome Research 2003
Their Primary Finding: there tends to be
significantly more conservation in introns
surrounding AS exons than CS exons
On average about 100 bases on either side of
each AS exon are conserved, compared to around 7
bases for constitutively spliced exons
What’s the explanation?
– multiple binding motifs?
– helping to determine secondary structure in
RNA, which helps lead to correct splicing?
Predicting Alternative Splicing
• Additional predictive features
– Higher conservation around exon
– Higher conservation of exon itself (motifs?)
– Shorter exons
– Exons that are a multiple of 3
• Method: somehow chose one threshold for each feature?
• Performance: scanned human genome, predicted 1000
AS exons (incl training data?)
– 70% had EST evidence of AS vs 6-7% baseline
– Lab test showed that 7/15 (randomly?) selected from remaining
30% are AS in at least one of 15 tissues
• Significance: estimate “splicing promoters” cover 3x10^6
bp
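Since the talk did not make the method clear, here is only a guess at what "one threshold per feature" could look like: an exon is called AS if it passes every cutoff. The `Exon` fields and all three cutoff values are hypothetical, chosen just to make the sketch run; they are not the authors' values.

```python
from dataclasses import dataclass

@dataclass
class Exon:
    length: int              # exon length in nt
    exon_identity: float     # human/mouse identity of the exon itself (0-1)
    flank_conserved_bp: int  # conserved intronic bases flanking the exon

# Hypothetical cutoffs, not taken from the talk or the paper.
MAX_LENGTH = 150
MIN_EXON_IDENTITY = 0.95
MIN_FLANK_CONSERVED_BP = 50

def predict_alternative(exon: Exon) -> bool:
    """Call an exon AS only if it passes every per-feature threshold."""
    return (
        exon.length % 3 == 0
        and exon.length <= MAX_LENGTH
        and exon.exon_identity >= MIN_EXON_IDENTITY
        and exon.flank_conserved_bp >= MIN_FLANK_CONSERVED_BP
    )

candidate = Exon(length=90, exon_identity=0.98, flank_conserved_bp=100)
typical_cs = Exon(length=200, exon_identity=0.88, flank_conserved_bp=7)
print(predict_alternative(candidate), predict_alternative(typical_cs))
```

Requiring all features at once favors precision over recall, which would fit a genome scan that reports only about 1000 exons.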
Genetic Backup Circuits
(Kafri and Pilpel)
• Fact: single gene knockouts often have
little or no phenotypic effect
– 10% lethal in worm
– 27% lethal in yeast
• Question: Can we better understand the
mechanisms of genetic backup?
• Task: Predict whether a knockout will be
lethal or not
Duplicates Suggest Redundancy
• Genes with duplicates are less likely to be essential
• But clearly this doesn’t tell the whole story
– lethal genes can have duplicates
– nonessential genes often have no duplicate
(Gu, Z. et al Nature 2003)
Function of Duplicate Matters
• Compute dispensability of yeast genes
– growth rate after knockout compared to mean
growth rate, averaged over many conditions
• Compared GO functional annotations of
highly similar genes. Found higher
dispensability when
– higher functional similarity (Resnik info content)
– little functional similarity but high sequence
similarity (Blast E-values)
Similarity of Expression
• 40 time series, 500 timepoints
• In each condition calculated correlation of
expression profiles of each pair of paralogous genes
• Average correlation suggests
– backup is best provided by genes which do not
share expression patterns
[Plot: Dispensability (y-axis) vs. Mean Expression Correlation (x-axis)]
How can we explain this unexpected result?
Classify pairs into:
• negative correlation:
– never similarly expressed
• positive correlation:
– always similarly expressed
• no correlation:
– never similarly expressed
or
– similarly expressed in certain conditions
[Plot: Dispensability (y-axis) vs. Mean Expression Correlation (x-axis)]
Variability of Expression
• Use stdDev to quantify consistency of
correlation across conditions
[Plot: Dispensability (y-axis) vs. StdDev Expression Correlation (x-axis)]
Goldilocks and the three little paralogs
[Diagram: Expression Stdev (y-axis, 0 to 1) vs. Mean (x-axis: Strongly Negative, Little Correlation, Strongly Positive)]
• Never Same Expression: Too Diverged
• Always Same Expression: Too Similar
• Correlated in only a subset of conditions: Just Right
Optimal backup requires the “ability to switch
between similar and dissimilar expression in a
condition dependent manner”
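The two summary statistics used in this part of the talk (mean and standard deviation of the per-condition expression correlation for a paralog pair) can be sketched as follows. This assumes Pearson correlation and a population standard deviation; the talk did not specify either, and the toy profiles are invented.

```python
from statistics import mean, pstdev

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def correlation_summary(profiles_a, profiles_b):
    """Mean and standard deviation of the per-condition correlations.

    profiles_a[i] and profiles_b[i] are the two paralogs' profiles
    in the i-th time series.
    """
    r = [pearson(a, b) for a, b in zip(profiles_a, profiles_b)]
    return mean(r), pstdev(r)

# Toy data (hypothetical): two time series for one paralog pair.
gene_a = [[1, 2, 3, 4], [1, 2, 3, 4]]
gene_b = [[2, 4, 6, 8], [8, 6, 4, 2]]  # co-expressed in one series, opposite in the other
m, s = correlation_summary(gene_a, gene_b)
print(m, s)
```

This toy pair has a mean correlation near zero but a high standard deviation, the combination the diagram labels "Just Right".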
Predictions about the Past...
Hypothesized Duplication Mechanism
1. duplication occurs
2. leads to nonstable redundancy
3. quickly followed by either
– mutation and loss of one of the duplicates
– subfunctionalization leading to stable redundancy
Hypothesize two distinct types of subfunctionalization
1. mutation of coding region leading to functional
divergence
2. mutation of control region leading to divergence of
expression
Need for Regulatory Flexibility
• This second type of subfunctionalization would
entail a quite significant regulatory challenge if
the paralogs are to provide backup for one
another
– Upon mutation of B, A must be turned on in the
conditions that would normally require B
• Postulate that
– this regulatory challenge is met when a gene has a
significant amount of regulatory diversity (i.e. different
TF motifs)
– backup asymmetry arises when one of the genes has
few motifs (Kellis suggests otherwise?)
Experiments, but no hard numbers
• Claim the capacity of genes to respond at the
transcriptional level when their counterpart is
deleted is central to their ability to provide backup
– Most paralogs downregulated when other gene is
knocked out (cross-hybridization?)
• lower stdev -> down regulation
• Claim that asymmetry of backup capability can be
predicted based on number of transcription factor
binding sites.
– Gene that has the larger number of motifs is the one
that is capable of providing a backup to the other
– Genes with few motifs are “parasites” – can’t back up
• Claim an improved ability to predict effect of
double knockouts
A Question
• They claim that only when the genes
diverge in function will they be maintained
in evolution.
• But if the duplicated pair can compensate
for each other’s function then won’t there
be little selection pressure to maintain both
copies?
From General Conservation to
Specific Motifs
• Searched conserved intronic regions for
overrepresented hexamers
– literature search for most significant hexamer shows
that hexamer mentioned as an AS motif in six papers
• Next steps:
– identify the consensus sequences of additional motifs
– learn tissue/developmental specificity for each motif
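One simple way to look for overrepresented hexamers is to compare k-mer frequencies in the conserved regions against a background set. The sketch below does that with a frequency ratio and a pseudocount; the statistic, the pseudocount, and the toy sequences are assumptions, and the talk did not say how significance was actually assessed.

```python
from collections import Counter

def kmer_counts(seqs, k=6):
    """Count every overlapping k-mer in a list of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

def enrichment(foreground, background, k=6, pseudocount=1.0):
    """Foreground/background frequency ratio for each foreground k-mer."""
    fg, bg = kmer_counts(foreground, k), kmer_counts(background, k)
    fg_total, bg_total = sum(fg.values()), sum(bg.values())
    return {
        kmer: ((fg[kmer] + pseudocount) / fg_total)
        / ((bg[kmer] + pseudocount) / bg_total)
        for kmer in fg
    }

# Toy sequences (hypothetical); ACGTAC is an arbitrary planted hexamer.
conserved_introns = ["AAACGTACTT", "CCACGTACGG"]
background = ["AACCGGTTAA", "CCGGAATTCC"]
scores = enrichment(conserved_introns, background)
best = max(scores, key=scores.get)
print(best, scores[best])
```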
Revealing Selection Patterns in the
Evolution of Yeast Transcription Regulation
(Amos Tanay, Irit Gat-Viks and Ron Shamir)
• Identifying TF binding sites is hard
• Even harder to predict more complex
interactions
– rarely a binary switch
– not a linear relation between affinity and activation
– different binding affinities can lead to different results
(e.g. P53 can lead to apoptosis or rescue)
Conservation indicates functionality
Evolution dynamics disclose details of functionality
An Analogy:
Imagine we didn’t know the genetic code, but just
the length of the codons
We know that synonymous substitutions
are more common in coding regions than
nonsynonymous substitutions
1. build a network where each 3-letter nt string is
represented by one node
2. put an edge between nodes where the thickness of
the edge represents the frequency of mutations in
aligned coding regions of related organisms
3. see strongly connected components comprised of
nodes which all code for the same amino acid
A “Simple” Approach
• Chose to use the four recent genomes of
“simple” yeasts (promoter regions are relatively
short)
• Identified 4000 promoters and aligned them
using ClustalW
• Use simple window scanning method to identify
all “motifs” of size 8
• Simple parsimony method to infer ancestral
sequences at each node in the phylogeny
A Simple Approach (2)
• Calculate background substitution rate
– 16 parameter background model for each
branch in phylogeny
• For each motif, compute 8 tables of site-specific substitution rates
$s[m, i, a, b] = \log \frac{\mathrm{count}(m, i, a \to b)}{E(\mathrm{count})}$
– simply count observed substitutions at each
site, summed over all branches of the tree and
all instances of the motif
– normalized substitution rate: log of ratio of
observed substitutions over expected
substitutions
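A minimal sketch of the normalized rate, assuming the observed and expected counts are already available per (motif, position, from-base, to-base). The pseudocount is my addition to avoid taking the log of zero; the motif and the counts are invented.

```python
import math

def normalized_rate(observed, expected, pseudocount=0.5):
    """log(observed / expected); the pseudocount is an assumption."""
    return math.log((observed + pseudocount) / (expected + pseudocount))

# Keys are (motif, position, from_base, to_base); values are toy counts.
observed = {
    ("TGACTCAA", 3, "C", "T"): 2,
    ("TGACTCAA", 7, "A", "G"): 30,
}
expected = {
    ("TGACTCAA", 3, "C", "T"): 12.0,
    ("TGACTCAA", 7, "A", "G"): 11.0,
}
s = {key: normalized_rate(observed[key], expected[key]) for key in observed}
for key, value in s.items():
    print(key, round(value, 2))
```

A negative value means the substitution is seen less often than the background model predicts (consistent with selection against it); a positive value means it is seen more often.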
Building a “Selection Network”
• Each node represents an 8mer “motif”
• Connect all motifs that are 1 substitution apart
– if substitution rate is positive, dark edge
– if substitution rate is negative, light edge
– if not enough data, very thin edge
images taken from: http://www.cs.tau.ac.il/~amos/promoter_evo/
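The network construction can be sketched as below, assuming a dictionary of normalized rates keyed by alphabetically ordered motif pairs, with a missing entry meaning "not enough data". The function names and the toy motifs are illustrative; how a rate of exactly zero is drawn is not stated in the talk, so it is treated as "light" here.

```python
def one_substitution_neighbors(motif, alphabet="ACGT"):
    """Yield every sequence that differs from motif at exactly one position."""
    for i, original in enumerate(motif):
        for base in alphabet:
            if base != original:
                yield motif[:i] + base + motif[i + 1:]

def edge_style(rate):
    """Map a normalized substitution rate to the edge styles on the slide."""
    if rate is None:
        return "thin"  # not enough data
    return "dark" if rate > 0 else "light"

def selection_network(motifs, rates):
    """Connect all motifs that are one substitution apart."""
    motif_set = set(motifs)
    edges = {}
    for m in motifs:
        for n in one_substitution_neighbors(m):
            if n in motif_set and m < n:  # record each undirected edge once
                edges[(m, n)] = edge_style(rates.get((m, n)))
    return edges

# Toy 8-mers and rates (hypothetical).
motifs = ["TGACTCAA", "TGACTCAT", "TGACTCAG", "AAAAAAAA"]
rates = {("TGACTCAA", "TGACTCAT"): 0.8, ("TGACTCAG", "TGACTCAT"): -1.2}
network = selection_network(motifs, rates)
for edge, style in sorted(network.items()):
    print(edge, style)
```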
• Did some larger scale evaluations based on ChIP and gene expression data
• Also some anecdotal results
Matrix of Substitutions from the
Motif Consensus
Evolutionary Signatures of
Regulatory Sequences (Michael Eisen)
• Examples of “Evolutionary Signatures”
– coding sequence: conserved conserved variable
– structural RNA, nt that basepair are coevolving
What are the evolutionary constraints
imposed on sequences by TF binding?
• Aligned 4 yeast species
– for each base in genome, estimate evolutionary rate
(very noisy estimates)
Analyze the pattern of rate variation
across the entire binding site
Moses et al., BMC Evol Biol 2003
Position-specific Rate Variation
• The pattern of rate variation across the
entire binding site for a particular TF
– within one genome
– across genomes
Highly Correlated
• Clearly due to structural constraints
– protein contacts
– even when we know there’s no contact,
there’s DNA bending issues....
These “signatures” are missing from
current motif-prediction programs
• Although this isn’t a particularly surprising
result, many predicted motifs (e.g. from
MEME etc.) do not display this TFBS
“signature”
– could use as a filter, or incorporate it more
directly (they’re working on this currently?)
• Different families of TF have different
“signatures”
– Eisen thinks the community is still
underutilizing this information
Make better use of comparative
data by using an explicit
evolutionary model
• Is there likely to have been a TFBS in the
ancestor?
– build a PSSM representing the chemical
contribution of each base to the binding
specificity
– use Halpern and Bruno model to predict how
the TFBS will evolve given proposal +
selection model
Moses et al., BMC Evol Biol 2003
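For reference, a sketch of the Halpern and Bruno substitution rate as I recall it from their 1998 paper: the rate from base a to base b at a site is the background (proposal) rate scaled by a fixation term that depends on the site's equilibrium base frequencies, which here would come from a PSSM column. The function name, argument names, and numeric values are illustrative assumptions; this is not necessarily how the speakers implemented it.

```python
import math

def halpern_bruno_rate(q_ab, q_ba, f_a, f_b):
    """Substitution rate a -> b at a site under selection.

    q_ab, q_ba: background (mutation/proposal) rates between the two bases.
    f_a, f_b:   equilibrium frequencies of the two bases at this site,
                e.g. taken from one PSSM column.
    """
    x = (f_b * q_ba) / (f_a * q_ab)
    if math.isclose(x, 1.0):
        return q_ab  # neutral limit: selection does not change the rate
    return q_ab * math.log(x) / (1.0 - 1.0 / x)

# Toy PSSM column where the site strongly prefers one base (hypothetical values).
toward_preferred = halpern_bruno_rate(1.0, 1.0, f_a=0.1, f_b=0.7)
away_from_preferred = halpern_bruno_rate(1.0, 1.0, f_a=0.7, f_b=0.1)
print(round(toward_preferred, 3), round(away_from_preferred, 3))
```

Substitutions toward the preferred base are accelerated relative to background and substitutions away from it are slowed, which is what produces position-specific rate variation across a binding site.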
Larger Cis-Regulatory
Sequences
• Known binding patterns in Drosophila have low
information content
– find a sequence match for each TFBS before almost
every gene in the genome
• Build a statistical model to identify significant
clusters of binding sites in windows of arbitrary
size
– improved detection of cis-regulatory modules
– experimental results still show many false positives
• Use comparative data to discriminate real
clusters from false ones
How to use comparative data
• Conservation in Drosophila pseudoobscura isn’t
a good indicator of functionality
– all real and fake clusters have very high overall
sequence conservation, including their flanking
regions (a surprise)
• However...
– the actual binding sites are often not conserved
– even one or two mutations can destroy a binding site
conservation of binding site density
is a useful indicator of function
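A minimal sketch of "conservation of binding site density", assuming exact motif matches and orthologous regions from two species. The density unit (sites per kb), the `min_ratio` cutoff, and the toy motif and sequences are all assumptions for illustration.

```python
def count_sites(sequence, motifs):
    """Count exact (possibly overlapping) matches of any motif."""
    total = 0
    for motif in motifs:
        start = sequence.find(motif)
        while start != -1:
            total += 1
            start = sequence.find(motif, start + 1)
    return total

def density_per_kb(sequence, motifs):
    return 1000.0 * count_sites(sequence, motifs) / len(sequence)

def density_conserved(seq_1, seq_2, motifs, min_ratio=0.5):
    """True if both regions have sites and their densities are similar."""
    d1, d2 = density_per_kb(seq_1, motifs), density_per_kb(seq_2, motifs)
    if max(d1, d2) == 0:
        return False
    return min(d1, d2) / max(d1, d2) >= min_ratio

motifs = ["ACGCGT"]                  # arbitrary toy motif
species_1 = "ACGCGTTTTTACGCGTAAAA"   # sites at positions 0 and 10
species_2 = "TTACGCGTAAAAAACGCGTT"   # same number of sites, different positions
species_3 = "ACGTACGTACGTACGTACGT"   # no sites
print(density_conserved(species_1, species_2, motifs))
print(density_conserved(species_1, species_3, motifs))
```

The first comparison passes even though no individual site is at a conserved position, which is the distinction the slide draws between site conservation and density conservation.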
An Impassioned Speech on the
Evolution of the Scientific Journal
• “If you publish [your work] in a journal like Science which
fewer and fewer people in the world have access to you
run a really big risk of being the next Mendel and that
your work will languish in obscurity”
• Don’t publish in a journal that “takes your writing, your
ideas, thoughts and paper and claims ownership of them
and then only doles them out to a relatively narrow
bunch of people who have enough money to pay for
them... solely to promote the financial health of the
journal...”
• Don’t be “like Microsoft”... publish in Public Library of
Science or another freely available journal
For More Information
• Most of the talks I picked were invited talks
• For the workshop there is often only
an abstract
• Video feed is available online:
http://www.calit2.net/multimedia/recomb2004videos.html
• Many have papers that have just come out
or are about to come out with additional
details... check the authors’ webpages
Evolution and Larger Cis-Regulatory Sequences
• what are enhancers? whole regions of binding sites?
• how are Drosophila enhancers organized?
• only 5 binding sites whose specificities are well
characterized from experimental studies
– low information content
– find them all over the genome
• Clusters of binding sites -> Surrogate for regulatory
function
• Shown previously that if look for clusters of these sites
– all identified regions overlap known enhancers
– don’t find anything else
– then I don’t understand next study with 39 clusters
• Found 39 clusters
– 9 overlap known enhancers
– 28 tested experimentally
• 6 clearly regulating nearby gene
• 3 perhaps show some regulatory role
• remainder don’t appear to be real (but could have
wrong promoter? look back at Kadonaga talk)
• What’s difference between real and fake?
– use comparative mapping
• Used two flies (which ones)
– distant enough based on coding region
conservation that expect to see conservation
only of functionally conserved regions
– not the case
– all real and fake clusters have very high
overall sequence conservation, including their
flanking regions (why?)
• However,
– binding sites not conserved
– one or two mutations enough to destroy a binding site
– measure conservation of binding site density
– show graph (37:18)
– summary (39:21)
• In more distantly related species
– alignment more of an issue
– binding sites will move around more
– huge binding site turnover has been shown: there can be 2
separate ways to make the same enhancer
– no sequence identity, but in experimental studies they can replace
each other?