Transcript Lecture25
Combined analysis of ChIPchip data and sequence data
Harbison et al.
CS 466
Saurabh Sinha
Outline
• Transcription factors interpret the regulatory
information encoded in DNA to induce or
repress gene expression
• Comparative genomics has been used to find
the regulatory sites in yeast genome
• Looking at sequence alone does not reveal if
a putative site is actually functioning as a
binding site
• ChIP-chip data (also called “location data”)
provides such information
• Harbison et al combine these two types of
data
Chip-on-chip
Source: http://www.chiponchip.org/
Data
• Genome-wide “location analysis” using ChIPon-chip
• Each experiment done with one TF
• 203 TFs experimented with, in “rich media
conditions”
• 84 of these TFs also experimented with in at
least one other condition
• Why?
– Binding is not just a function of the presence of the
site. It is also a function of the presence of the TF
– TF may not be present in every condition
Data
• How were the 84 TFs (to be tested in
additional conditions) chosen?
• If there was prior evidence that they
play a role in that additional condition
ChIP-on-chip results
• 11,000 unique interactions between TFs and
promoter regions identified
• A matrix of (m x n), where m is the number of
TFs (203), n is the number of yeast genes
(~6000)
• 11,000 of the entries were “1”, meaning the
binding was significant
– Need post-processing of binding affinities to
assess if it is statistically significant
The next step: bring in the
sequence
• Genome-wide “location data” or “binding data”
combined with sequence data
• For each TF, collect all sequences bound by it
– These are promoter length sequences, not exact
binding sites
• Apply motif finding programs to estimate what
the binding motif is (where the binding sites are)
Motif finding
• Only consider TFs that bound >= 10
sequences
– 147 such TFs
• Run 6 different motif-finders on the bound
sequences
• 68000 motifs discovered !
• A large number of these motifs are “variants”
of the same motif, i.e., similar to each other
Motif finding
• Using clustering of motifs, and stringent
statistical tests, identify high confidence
motifs from among these 68000 motifs
• High confidence motifs found for 116 of the
147 TFs whose bound sequences were
analyzed
• Now require that the motif also be conserved
across other related yeast species
• 65 TFs with single, high-confidence,
phylogenetically conserved motifs were found
Motif finding
• The 65 motifs were a mix of “known” and
novel motifs.
– That is, some of the motifs were similar to already
known motifs
– 21 TFs’ motifs were new
• Took these 65 motifs, as well as other known
motifs from the literature to form a
compendium of 102 motifs for further analysis
Source: Harbison et al. Nature 431, 99-104(2 September 2004)
Next step
• We now have motifs for 102 TFs
• Next step is to locate binding sites of each TF
in the whole genome
• Equivalent to finding matches to each motif in
the whole genome
• Finding matches:
– Require a high sequence similarity
– Require phylogenetic conservation
– Require high binding to that region by TF
Mapping sites in the genome
• “Map” gave 3353 sites (“interactions”) within
1296 promoters
• This is different from simply locating matches
to motif
• Because TF binding information is also
incorporated
• Under different conditions, only a subset of
the binding sites in the map are actually
occupied
Source: Harbison et al. Nature 431, 99-104(2 September 2004)
Does the map make sense?
• The map is telling us which TFs bind
which actual sites in the genome, and
hence which genes are being regulated
• In many cases, the known functions of
the genes predicted to be targeted by a
TF are consistent with the known
function of the TF
More insights from the map
• Binding sites are not
uniformly distributed over
the promoter regions
• Sharply peaked
distribution
• Very few sites in 100 bp
immediately upstream of
the genes
• Most sites (74%) are
between 100 and 500 bp
of gene
Source: Harbison et al. Nature 431, 99-104(2 September 2004)
Arrangements of sites
• Specific arrangements of binding sites in a
promoter
• Simple arrangement: one binding site for one
TF
• Another arrangement: Repeats of a particular
binding site
– Allows for “graded response”
– Some TFs show a significant preference for
repeated sites
Source: Harbison et al. Nature 431, 99-104(2 September 2004)
Arrangements of sites
• Another arrangement: Binding sites for
multiple TFs
– “Combinatorial regulation”: In different conditions,
different combinations of binding sites (and TFs)
direct different gene expression
– Genes whose promoters have such arrangement
of sites are required for multiple pathways, and
regulated in environment-specific fashion
Source: Harbison et al. Nature 431, 99-104(2 September 2004)
Arrangements of sites
• Another arrangement: Binding sites for
specific pairs of TFs occur more
frequently in same promoter than
expected by chance
– The two TFs perhaps interact physically in
doing their job
Source: Harbison et al. Nature 431, 99-104(2 September 2004)