Cis-regulation (cont.), GREAT (Gill subs)
http://cs273a.stanford.edu
[Bejerano Fall10/11]
Lecture 13
Cis-Regulation cont’d
GREAT
Gene Regulation
• gene (how to)
• control region (when & where)
Protein coding
RNA gene
DNA
DNA binding
proteins
Pol II Transcription
Key components:
• Proteins
• DNA sequence
• DNA epigenetics
Protein components:
• General Transcription factors
• Activators
• Co-activators
Enhancers
Vertebrate Gene Regulation
distal: in 10^6 letters
gene (how to)
control region
(when & where)
DNA
DNA binding
proteins
proximal: in 10^3 letters
Gene Expression Domains: Independent
Distal Transcription Regulatory Elements
Repressors / Silencers
What are Enhancers?
What do enhancers encode?
Surely a cluster of TF binding sites.
[but TFBS prediction is hard, fraught with false positives]
What else? DNA Structure related properties?
So how do we recognize enhancers?
Sequence conservation across multiple species
[weak but generic]
Verifying repressors is trickier [loss vs. gain of function].
How do you predict an enhancer from a repressor? Duh...
Insulators
Cis-Regulatory Components
Low level (“atoms”):
• Promoter motifs (TATA box, etc)
• Transcription factor binding sites (TFBS)
Mid Level:
• Promoter
• Enhancers
• Repressors/silencers
• Insulators/boundary elements
• Cis-regulatory modules (CRM)
• Locus control regions (LCR)
High Level:
• Epigenetic domains / signatures
• Gene expression domains
• Gene regulatory networks (GRN)
Disease Implications: Genes
gene
genome
protein
Limb Malformation
Over 300 genes already
implicated in limb malformations.
Disease Implications: Cis-Reg
gene
genome
NO protein made
Limb Malformation
Growing number of cases (limb, deafness, etc).
Transcription Regulation & Human Disease
[Wang et al, 2000]
Critical regulatory sequences
[Lettice et al., HMG 12: 1725-35, 2003]
Single base changes
Knock out
Other Positional Effects
[de Kok et al, 1996]
Genomewide Association Studies point to non-coding DNA
WGA Disease
9p21 Cis effects
Follow up study:
Cis-Regulatory Evolution: E.g., Mobile Elements
[Figure labels: four genes, each marked "Gene"]
What settings
make these
“co-option” events
happen?
[Yass is a small town in
New South Wales, Australia.]
Britten & Davidson Hypothesis: Repeat to Rewire!
[Davidson & Erwin, 2006]
[Britten & Davidson, 1971]
Modular: Most Likely to Evolve?
Chimp
Human
Human Accelerated Regions
• Human-specific substitutions in conserved sequences
Human
Chimp
[Pollard, K. et al., Nature, 2006]
[Beniaminov, A. et al., RNA, 2008]
[Prabhakar, S. et al., Science, 2008]
http://GREAT.stanford.edu:
Generating Functional Hypotheses
from Genome-Wide Measurements
of Mammalian Cis-Regulation
Gill Bejerano
Dept. of Developmental Biology &
Dept. of Computer Science
Stanford University
http://bejerano.stanford.edu
Human Gene Regulation
10^13 different cells in an adult human.
Hundreds of different cell types.
All these cells have the same Genome.
20,000 Genes encode how to make proteins.
1,000,000 Genomic “switches” determine which proteins to make, and how much of each.
Most Non-Coding Elements likely work in cis…
“IRX1 is a member of the Iroquois homeobox gene family.
Members of this family appear to play multiple roles
during pattern formation of vertebrate embryos.”
gene deserts
regulatory jungles
9Mb
Every orange tick mark is roughly 100-1,000bp long,
each evolves under purifying selection, and does not code for protein.
Many non-coding elements tested are cis-regulatory
Combinatorial Regulatory Code
2,000 different proteins can bind specific DNA sequences.
Proteins
DNA
Protein binding site
Gene
DNA
A regulatory region encodes 3-10 such protein binding sites.
When all are bound by proteins the regulatory region turns “on”,
and the nearby gene is activated to produce protein.
ChIP-Seq: first glimpses of the
regulatory genome in action
Peak Calling
Cis-regulatory peak
What is the transcription factor I just assayed doing?
• Collect known literature of the form
  • Function A: Gene1, Gene2, Gene3, ...
  • Function B: Gene1, Gene2, Gene3, ...
  • Function C: ...
• Ask whether the binding sites you discovered preferentially bind near (regulate) the genes of any one or more of the functions listed above.
• Form a hypothesis and perform further experiments.
Cis-regulatory peak
Gene transcription start site
Example: inferring functions of Serum Response
Factor (SRF) from its ChIP-seq binding profile
Gene transcription start site
SRF binding ChIP-seq peak
• ChIP-seq identified 2,429 SRF binding peaks in human Jurkat cells [1]
• SRF is known as a “master regulator of the actin cytoskeleton”
• In the ChIP-Seq peaks, we expect to find binding sites regulating (genes involved in) actin cytoskeleton formation.
[1] Valouev A. et al., Nat. Methods, 2008
Example: inferring functions of Serum Response
Factor (SRF) from its ChIP-seq binding profile
Gene transcription start site
SRF binding ChIP-seq peak
Ontology term (e.g. ‘actin cytoskeleton’)
Existing, gene-based method to analyze enrichment:
• Ignore distal binding events.
• Count affected genes.
• Rank by enrichment hypergeometric p-value.
N = 8 genes in genome
K = 3 genes annotated with the term
n = 2 genes selected by proximal peaks
k = 1 selected gene annotated with the term
P = Pr(k ≥1 | n=2, K =3, N=8)
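Worked example of the gene-based test, using the toy numbers on this slide (N = 8, K = 3, n = 2, k = 1). This is a minimal sketch, not the code of any particular enrichment tool; the function name `hypergeom_tail` is made up for illustration.

```python
from math import comb


def hypergeom_tail(k, n, K, N):
    """P(X >= k): n genes drawn without replacement from N genes, K of them annotated."""
    favorable = sum(comb(K, i) * comb(N - K, n - i)
                    for i in range(k, min(n, K) + 1))
    return favorable / comb(N, n)


# Slide example: 8 genes, 3 annotated, 2 selected by proximal peaks, 1 of those annotated.
p = hypergeom_tail(k=1, n=2, K=3, N=8)
print(round(p, 4))  # 0.6429, i.e. 18/28
```

With so few genes the result is far from significant, as expected. Note that only genes are counted: the number of peaks behind each selected gene never enters the formula.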
We have (reduced ChIP-Seq into) a gene list!
What is the gene list enriched for?
Pro: A lot of tools out there for the analysis of gene lists.
Con: These tools are built for microarray analysis.
Does it matter?
[Figure labels: microarray data, deep sequencing data, microarray tool]
SRF Gene-based enrichment results
• Original authors can only state: “basic cellular processes, particularly those related to gene expression” are enriched [1]
SRF acts on genes both in nucleus and cytoplasm, that are involved in transcription and various types of binding
Where’s the signal? Top “actin” term is ranked #28 in the list.
[1] Valouev A. et al., Nat. Methods, 2008
Associating only proximal peaks loses a lot of information
[Figure: Relationship of binding peaks to nearest genes for eight human (H) and mouse (M) ChIP-seq datasets. X-axis: distance to nearest transcription start site (kb), binned 0-2, 2-5, 5-50, 50-500, > 500. Y-axis: fraction of all elements, 0 to 0.7. Datasets: SRF (H: Jurkat), NRSF (H: Jurkat), GABP (H: Jurkat), Stat3 (M: ESC), p300 (M: ESC), p300 (M: limb), p300 (M: forebrain), p300 (M: midbrain).]
Restricting to proximal peaks often leads to complete loss of key enrichments
Bad Solution: Associating distal peaks
brings in many false enrichments
Why bad? 14% of human genes tagged ‘multicellular organismal development’.
But 33% of base pairs have such a gene nearest upstream/downstream.
SRF ChIP-seq set has 2,000+ binding events.
Throw a random set of 2,000 regions at the genome.
What do you get from a gene list analysis?
Term                                    Bonferroni corrected p-value
nervous system development              5x10^-9
system development                      8x10^-9
anatomical structure development        7x10^-8
multicellular organismal development    1x10^-7
developmental process                   2x10^-6
Regulatory jungles are often
next to key developmental genes
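The false enrichment can be reproduced with a toy simulation. The sketch below uses the slide's numbers (20,000 genes, 14% of genes tagged, 33% of base pairs nearest to a tagged gene, 2,000 random regions). The genome layout is an assumption made for illustration: tagged genes evenly share the first 33% of a unit-length genome and all other genes share the rest. All names are hypothetical.

```python
import math
import random

N_GENES = 20_000        # genes in the genome (figure from the lecture)
TERM_GENE_FRAC = 0.14   # fraction of genes tagged with the term
TERM_BP_FRAC = 0.33     # fraction of base pairs whose nearest gene is tagged
N_REGIONS = 2_000       # random regions thrown at the genome
K_TERM = round(N_GENES * TERM_GENE_FRAC)  # 2,800 tagged genes


def nearest_gene(x):
    """Map a position x in [0, 1) to a gene index under the toy layout.

    Assumption: genes 0..K_TERM-1 are tagged and evenly split the first 33%
    of the genome; the remaining genes evenly split the other 67%.
    """
    if x < TERM_BP_FRAC:
        return min(int(x / TERM_BP_FRAC * K_TERM), K_TERM - 1)
    rest = N_GENES - K_TERM
    return K_TERM + min(int((x - TERM_BP_FRAC) / (1 - TERM_BP_FRAC) * rest), rest - 1)


def hypergeom_tail(k, n, K, N):
    """P(X >= k) for n genes drawn without replacement, K of N tagged."""
    favorable = sum(math.comb(K, i) * math.comb(N - K, n - i)
                    for i in range(k, min(n, K) + 1))
    return favorable / math.comb(N, n)


def binom_tail(k, n, p):
    """P(X >= k) for n independent regions, each hitting with probability p."""
    def log_pmf(i):  # log space avoids overflow in the binomial coefficient
        return (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                + i * math.log(p) + (n - i) * math.log(1 - p))
    return sum(math.exp(log_pmf(i)) for i in range(k, n + 1))


rng = random.Random(0)
hits = [nearest_gene(rng.random()) for _ in range(N_REGIONS)]

# Gene-list view: collapse the regions to the set of genes they select.
selected = set(hits)
k_genes = sum(1 for g in selected if g < K_TERM)
p_gene_list = hypergeom_tail(k_genes, len(selected), K_TERM, N_GENES)

# Region view: count regions that land in the tagged share of the genome.
k_regions = sum(1 for g in hits if g < K_TERM)
p_regions = binom_tail(k_regions, N_REGIONS, TERM_BP_FRAC)

print(f"gene list : {k_genes}/{len(selected)} selected genes tagged, p = {p_gene_list:.1e}")
print(f"region set: {k_regions}/{N_REGIONS} regions hit the term, p = {p_regions:.2f}")
```

About a third of the random regions select a tagged gene, while the gene-list test expects only 14%, so its p-value comes out vanishingly small even though the input is pure noise. The region-based binomial p-value for the same data should be unremarkable.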
Real Solution: Do not convert to gene list.
Analyze the set of genomic regions
Gene transcription start site
Ontology term ( ‘actin cytoskeleton’)
Gene regulatory domain
Genomic region (ChIP-seq peak)
p = 0.33 of genome annotated with the term
n = 6 genomic regions
k = 5 genomic regions hit annotation
P = Pr_binom(k ≥ 5 | n = 6, p = 0.33)
GREAT = Genomic Regions Enrichment of Annotations Tool
Since 33% of base pairs are near a ‘multicellular organismal development’
gene, we now expect 33% of genomic regions to hit this term by chance.
=> Toss 2,000 random regions at genome, get NO (false) enrichments.
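Worked example of the region-based binomial test with the numbers on this slide (n = 6 regions, k = 5 hits, p = 0.33). A minimal sketch only; `binom_tail` is an illustrative name, not GREAT's own code.

```python
from math import comb


def binom_tail(k, n, p):
    """P(X >= k): n independent regions, each hitting the annotation with probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


# Slide example: 6 regions, 5 fall in the 33% of the genome annotated with the term.
p_value = binom_tail(k=5, n=6, p=0.33)
print(f"{p_value:.4f}")  # 0.0170

# By chance we would expect n * p = 1.98 of the 6 regions to hit the term.
expected_hits = 6 * 0.33
```

Here the null expectation is set by the fraction of the genome covered by the term's regulatory domains, not by the fraction of genes carrying the term.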
How does GREAT know how to assign
distal binding peaks to genes?
Future: High-throughput assays based on chromosome conformation capture (3C) methods will elucidate complex regulation mechanisms
Currently: Flexible computational definitions allow assignment of peaks to nearest gene, nearest two genes, etc.
• Default: each gene has a “basal regulatory domain” of 5 kb upstream and 1 kb downstream of its transcription start site; the domain then extends in both directions to the basal domain of the nearest gene, but by no more than 1 Mb
• Though some associations may be missed or incorrect, in
general signal richness and robustness is greatly improved by
associating distal peaks
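The default association rule can be sketched in a few lines. This is an illustration under stated assumptions, not GREAT's implementation: a single chromosome, the 1 Mb cap measured from the TSS, and hypothetical names (`Gene`, `regulatory_domains`, `genes_for_peak`).

```python
from dataclasses import dataclass

BASAL_UP = 5_000           # bp upstream of the TSS (default quoted on the slide)
BASAL_DOWN = 1_000         # bp downstream of the TSS (default quoted on the slide)
MAX_EXTENSION = 1_000_000  # 1 Mb cap; measured from the TSS here (an assumption)


@dataclass(frozen=True)
class Gene:
    name: str
    tss: int     # transcription start site coordinate
    strand: str  # "+" or "-"


def basal_domain(gene):
    """Half-open [start, end) interval around the TSS; 'upstream' depends on strand."""
    if gene.strand == "+":
        return max(0, gene.tss - BASAL_UP), gene.tss + BASAL_DOWN
    return max(0, gene.tss - BASAL_DOWN), gene.tss + BASAL_UP


def regulatory_domains(genes, chrom_len):
    """Extend each basal domain to the neighbours' basal domains, capped at 1 Mb."""
    genes = sorted(genes, key=lambda g: g.tss)
    basal = [basal_domain(g) for g in genes]
    domains = {}
    for i, gene in enumerate(genes):
        lo, hi = basal[i]
        # Closest basal-domain edges of genes on either side (quadratic, fine for a sketch).
        left_limit = max((b[1] for b in basal[:i]), default=0)
        right_limit = min((b[0] for b in basal[i + 1:]), default=chrom_len)
        start = min(lo, max(left_limit, gene.tss - MAX_EXTENSION))
        end = max(hi, min(right_limit, gene.tss + MAX_EXTENSION))
        domains[gene.name] = (start, end)
    return domains


def genes_for_peak(domains, position):
    """All genes whose regulatory domain contains the peak position."""
    return sorted(name for name, (s, e) in domains.items() if s <= position < e)


genes = [Gene("A", 100_000, "+"), Gene("B", 200_000, "-"), Gene("C", 2_000_000, "+")]
domains = regulatory_domains(genes, chrom_len=3_000_000)
print(domains["B"])                       # (101000, 1200000)
print(genes_for_peak(domains, 150_000))   # ['A', 'B']
print(genes_for_peak(domains, 2_500_000)) # ['C']
```

A peak between two genes is assigned to both, because each gene's domain reaches the other's basal domain. A peak more than 1 Mb from every TSS is assigned to no gene.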
GREAT infers many specific functions
of SRF from its binding profile
Top GREAT enrichments of SRF
Ontology           Term                      # Genes   Binomial P-value
Gene Ontology      actin cytoskeleton        30        7x10^-9
Gene Ontology      actin binding             31        5x10^-5
Pathway Commons    TRAIL signaling           32        5x10^-7
Pathway Commons    Class I PI3K signaling    26        2x10^-6
TreeFam            FOS gene family           5         1x10^-8
TF Targets         Targets of SRF            84        5x10^-76
TF Targets         Targets of GABP           28        4x10^-9
TF Targets         Targets of YY1            44        1x10^-6
TF Targets         Targets of EGR1           23        2x10^-4
Experimental support* entries on the slide: Miano et al. 2007 (twice); Bertolotto et al. 2000; Poser et al. 2000; Chai & Tarnawski 2002; Natesan & Gilman 1995; ChIP-Seq support; positive control
Top gene-based enrichments of SRF: top actin-related term is 28th in list
* Known from literature – as in function is known, SOME of the genes are known, and the binding sites highlighted are NOT.
Similar results for GABP, NRSF, Stat3, p300 ChIP-Seq
[McLean et al., Nat Biotechnol., 2010]
GREAT data integrated
• Twenty ontologies spanning broad categories of biology
• 44,832 total ontology terms tested in each GREAT run
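Testing this many terms in one run calls for a multiple hypothesis correction. A minimal sketch, assuming a Bonferroni correction across all 44,832 terms; the 0.05 level is a conventional choice and not taken from the slide.

```python
N_TERMS = 44_832  # ontology terms tested per GREAT run (from the slide)
ALPHA = 0.05      # family-wise error rate; conventional value, an assumption


def bonferroni(p_raw, n_tests=N_TERMS):
    """Bonferroni-corrected p-value: multiply by the number of tests, cap at 1."""
    return min(1.0, p_raw * n_tests)


# A raw p-value must fall below this to survive correction at ALPHA.
threshold = ALPHA / N_TERMS
print(f"{threshold:.2e}")  # 1.12e-06

corrected = bonferroni(1e-10)  # about 4.5e-06, still significant at 0.05
```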
[Figure: the twenty ontologies, labeled with their term counts: 2,800; 5,215; 834; 6,700; 3,079; 911; 5,781; 427; 456; 150; 1,253; 288; 706; 615; 19; 222; 9; 6,857; 8,272; 238]
Michael Hiller
GREAT implementation
• Can handle datasets of hundreds of thousands of genomic regions
• Testing a single ontology term takes ~1 ms
• Enables real-time calculation of enrichment results for all ontologies
Cory McLean
GREAT web app: input page
http://great.stanford.edu
Pick a genome assembly
Input BED regions of interest
Dave Bristor
GREAT web app: output summary
Additional ontologies, term statistics, multiple hypothesis corrections, etc.
Ontology-specific enrichments
GREAT web app: term details page
Genes annotated as “actin binding” with associated genomic regions
Genomic regions annotated with “actin binding”
Drill down to explore how a particular peak regulates Plectin and its role in actin binding
Frame holding http://www.geneontology.org definition of “actin binding”
You can also submit any track
straight from the UCSC Table Browser
A simple, well documented
programmatic interface allows
any tool to submit directly to GREAT.
See our Help. Inquiries welcome!
GREAT web app: export data
HTML output displays all user selected rows and columns
Tab-separated values also available for additional postprocessing
External Web Stats: Catching On
last 500 entries only
Summary
• Current technologies identify cis-regulatory sequences
• GREAT accurately assesses functional enrichments of cis-regulatory sequences using a genomic region-based approach [McLean et al., Nat Biotechnol., 2010]
• Online tool available (version 1.5 coming soon, in QA)
http://great.stanford.edu
• GREAT is immediately applicable to all sets with a
significant cis-regulatory content:
• Regulatory Chromatin Markers (e.g., H3K4me1)
• Genome Wide Association Studies (GWAS)
• Comparative Genomics sets
(e.g., ultraconserved elements)
Acknowledgments
GREAT developers
Cory McLean
Dave Bristor
Michael Hiller
Shoa Clarke
Craig Lowe
Aaron Wenger
Gill Bejerano
Other help
Fah Sathira
Marina Sirota
Bruce Schaar
Terry Capellini
Christopher Meyer
Jennifer Hardee
http://great.stanford.edu