Transcript lecture15x

CS173
Lecture 15: TF Motifs (Harendra)
MW 11:00-12:15 in Beckman B302
Prof: Gill Bejerano
TAs: Jim Notwell & Harendra Guturu
http://cs173.stanford.edu [BejeranoWinter12/13]
Announcements
• Project milestones due Today
Review: Transcriptional regulation of genes
Enhancer (CRM)
Transcription Start Site (TSS)
Thousands of transcription factor-CRM interactions that control gene expression in each cell type
Last Time: ChIP-Seq - a first glimpse of the regulatory genome in action
Peak Calling
Cis-regulatory peak
Last Time: Infer functions of ChIP-seq binding profile using GREAT
Gene transcription start site
SRF binding ChIP-seq peak
Ontology term (e.g. ‘actin cytoskeleton’)
GREAT = Genomic Regions Enrichment of Annotations Tool
p = 0.33 of genome annotated with the ontology term
n = 6 genomic regions
k = 5 genomic regions hit annotation
P = Pr_binom(k ≥ 5 | n = 6, p = 0.33)
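The binomial tail on this slide can be checked by hand. The sketch below sums the binomial probabilities directly; the function name `binomial_tail` is made up for illustration and is not part of GREAT.

```python
from math import comb

def binomial_tail(k: int, n: int, p: float) -> float:
    """Pr(X >= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Slide example: 5 of 6 regions hit a term that covers 0.33 of the genome.
p_value = binomial_tail(k=5, n=6, p=0.33)
print(f"P = {p_value:.4f}")  # P = 0.0170
```

A small P here says that hitting the term with 5 of 6 regions is unlikely if regions landed uniformly at random in the genome.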
GREAT gives you a table of functions
Top GREAT enrichments of SRF
Ontology | Term | # Genes | Binomial P-value
Gene Ontology | actin cytoskeleton | 30 | 7×10^-9
Gene Ontology | actin binding | 31 | 5×10^-5
Pathway Commons | TRAIL signaling | 32 | 5×10^-7
Pathway Commons | Class I PI3K signaling | 26 | 2×10^-6
TreeFam | FOS gene family | 5 | 1×10^-8
TF Targets | Targets of SRF | 84 | 5×10^-76
TF Targets | Targets of GABP | 28 | 4×10^-9
TF Targets | Targets of YY1 | 44 | 1×10^-6
TF Targets | Targets of EGR1 | 23 | 2×10^-4
Experimental support*, by ontology:
Gene Ontology: Miano et al. 2007 (both terms)
Pathway Commons: Bertolotto et al. 2000; Poser et al. 2000
TreeFam: Chai & Tarnawski 2002
TF Targets: Positive control; ChIP-Seq support; Natesan & Gilman 1995
* Known from literature – as in function is known, SOME of the genes are known, and the binding sites highlighted are NOT.
Last Time: Infer functions of ChIP-seq binding profile using GREAT
Gene transcription start site
SRF binding ChIP-seq peak
Ontology term π (e.g. ‘actin binding’)
GREAT = Genomic Regions Enrichment of Annotations Tool
p_π = 0.5 of genome annotated with π
n = 6 genomic regions
k_π = 4 genomic regions hit annotation
P = Pr_binom(k_π ≥ 4 | n = 6, p_π = 0.5)
But doing the experiment is the hard part!
• Hard or impossible to get the required cells
• Some cells don’t occur in enough quantity to ChIP
• Others are hard to dissect
• Certain human tissues are hard to obtain
• Hard to get a good antibody
• Ex: We have ChIP results for a factor in brain
• We have not been able to repeat it since we can’t find the same antibody
• Lots of time and money to do one experiment
• Only information for one context – cell type or time
Can we computationally predict the binding sites for many contexts and factors?
Recall: TFBS Position Weight Matrix (PWM)
Experimentally determined
sites
A T G G C A T G
A G G G T G C G
A T C G C A T G
T T G C C A C G
A T G G T A T T
A T T C G A C G
A G G G C G T T
A T G A C A T G
A T G G C A T G
A C T G G A T G
Alignment (count) Matrix
A     9   0   0   1   0   8   0   0
C     0   1   1   1   7   0   3   0
G     0   2   7   8   1   2   0   8
T     1   7   2   0   2   0   7   2
Frequency Weight Matrix
A     0.9 0.0 0.0 0.1 0.0 0.8 0.0 0.0
C     0.0 0.1 0.1 0.1 0.7 0.0 0.3 0.0
G     0.0 0.2 0.7 0.8 0.1 0.2 0.0 0.8
T     0.1 0.7 0.2 0.0 0.2 0.0 0.7 0.2
Cons  A   T   G   G   C   A   T   G
Can we use a PWM to predict
where the TF will bind in the genome
(without doing ChIP-seq)?
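One simple way to turn the frequency matrix above into predictions is to slide it along a sequence and keep every window whose log-odds score clears a cutoff. The sketch below assumes a uniform background, a small pseudocount, and an arbitrary cutoff of 8.0, none of which come from the slides. It scans one strand only and is not the algorithm of the Match tool named on the next slide.

```python
import math

# Frequency matrix from the slide (rows A, C, G, T; 8 motif positions).
FREQ = {
    "A": [0.9, 0.0, 0.0, 0.1, 0.0, 0.8, 0.0, 0.0],
    "C": [0.0, 0.1, 0.1, 0.1, 0.7, 0.0, 0.3, 0.0],
    "G": [0.0, 0.2, 0.7, 0.8, 0.1, 0.2, 0.0, 0.8],
    "T": [0.1, 0.7, 0.2, 0.0, 0.2, 0.0, 0.7, 0.2],
}
BACKGROUND = 0.25  # assumption: uniform base composition
PSEUDO = 0.01      # assumption: keeps zero entries finite

def log_odds(freq):
    """Convert frequencies to log2 odds against the background."""
    width = len(freq["A"])
    return {
        base: [
            math.log2((freq[base][i] + PSEUDO) / (1 + 4 * PSEUDO) / BACKGROUND)
            for i in range(width)
        ]
        for base in "ACGT"
    }

def scan(sequence, pwm, threshold):
    """Return (start, score) for every window scoring at least threshold."""
    width = len(pwm["A"])
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in pwm for base in window):
            continue  # skip windows with N or other symbols
        score = sum(pwm[base][i] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits

PWM = log_odds(FREQ)
hits = scan("CCATGGCATGTT", PWM, threshold=8.0)
print(hits)  # one hit, at the consensus ATGGCATG starting at index 2
```

Lowering the cutoff admits more windows, which is where the false positives come from.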
Binding Site Prediction using Match
Problem: High number of false positives.
Recall: TFBS Position Weight Matrix (PWM)
Information content of each column, for the same sites and matrices as above:
Cons  A   T   G   G   C   A   T   G
IC    1.2 0.7 0.7 0.7 0.6 1.0 0.8 1.0
Information content of a motif
= sum of all columns
= 1.2 + 0.7 + 0.7 + 0.7 + 0.6 + 1.0 + 0.8 + 1.0 = 6.7
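A common textbook definition of per-column information content is 2 bits minus the Shannon entropy of the column, assuming a uniform background. The sketch below applies that definition to the frequency matrix from the PWM slide. The slides do not state which formula (or any small-sample correction) produced their per-column values, so the numbers from this sketch need not match them.

```python
import math

# Frequency matrix from the PWM slide (rows A, C, G, T; 8 motif positions).
FREQ = {
    "A": [0.9, 0.0, 0.0, 0.1, 0.0, 0.8, 0.0, 0.0],
    "C": [0.0, 0.1, 0.1, 0.1, 0.7, 0.0, 0.3, 0.0],
    "G": [0.0, 0.2, 0.7, 0.8, 0.1, 0.2, 0.0, 0.8],
    "T": [0.1, 0.7, 0.2, 0.0, 0.2, 0.0, 0.7, 0.2],
}

def column_information(column_freqs):
    """2 bits minus the Shannon entropy of one column (zero entries contribute 0)."""
    entropy = -sum(f * math.log2(f) for f in column_freqs if f > 0)
    return 2.0 - entropy

def motif_information(freq):
    """Per-column information content, left to right."""
    width = len(freq["A"])
    return [column_information([freq[b][i] for b in "ACGT"]) for i in range(width)]

per_column = motif_information(FREQ)
total = sum(per_column)
print([round(x, 2) for x in per_column], round(total, 2))
```

A column that always shows the same base scores the full 2 bits, and a column with all four bases equally likely scores 0.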
Information content is a measure of motif specificity
SRF (IC ~ 12)
SPIB (IC ~ 5)
REST (IC ~ 25)
How do these compare to a library of many PWMs?
PWMs have a range of information content
SRF
SPIB
REST
Information content determines how accurately we can predict the binding site
• Measure of motif specificity
2 million matches to the SRF motif, but ChIP-seq and other estimates suggest ≈ 10,000 actual binding sites
Can we do better?
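A rough rule of thumb links information content to the number of chance matches: a motif with IC bits matches a random position with probability about 2^-IC. The sketch below applies it to the IC values quoted earlier for SPIB, SRF and REST. The genome size of 3 billion bases and the scan of both strands are assumptions, not figures from the slides.

```python
def expected_random_matches(ic_bits: float, genome_bp: float = 3e9, strands: int = 2) -> float:
    """Rough expected chance matches: positions scanned times 2**-IC."""
    return genome_bp * strands * 2.0 ** (-ic_bits)

for name, ic in [("SPIB", 5), ("SRF", 12), ("REST", 25)]:
    print(f"{name}: ~{expected_random_matches(ic):,.0f} chance matches")
```

For SRF this gives on the order of a million chance matches, the same order of magnitude as the 2 million on the slide and far above the ≈ 10,000 sites estimated to be real.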
Use excess conservation to improve prediction accuracy
Aaron
Shoa
Wenger et al., PRISM offers a comprehensive genomic approach to transcription factor function prediction. 2013
Use shuffled motifs to calculate confidence of excess conservation binding site prediction
Transcription factor motif → Genome-wide binding site predictions
10 shuffled transcription factor motifs → Genome-wide binding site predictions
[Plot: fraction conserved vs. branch length (subst / site) for the real and shuffled motifs; at the marked point, total = 0.32 and excess = 0.12]
Confidence is the fraction conserved in excess.
confidence = excess / total
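With the numbers on this slide the confidence works out to 0.12 / 0.32 = 0.375. The sketch below computes it from the real motif's conserved fraction and the fractions for the shuffled motifs. The function name, the clamp at zero, and the ten identical shuffled values are illustrative assumptions.

```python
from statistics import mean

def excess_conservation_confidence(real_fraction, shuffled_fractions):
    """confidence = excess / total, where excess = real - mean(shuffled)."""
    total = real_fraction
    if total <= 0:
        return 0.0
    excess = total - mean(shuffled_fractions)
    # Assumption: a motif conserved less often than its shuffles gets confidence 0.
    return max(excess, 0.0) / total

# Slide numbers: total = 0.32 and excess = 0.12, so the shuffled average is 0.20.
shuffled = [0.20] * 10  # hypothetical: ten shuffles that all give 0.20
confidence = excess_conservation_confidence(0.32, shuffled)
print(round(confidence, 3))  # 0.375
```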
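Probabilistic interpretation
• Confidence is the probability that a motif instance is functional given its observed conservation.
R: real motif
S: average shuffled motif
F: functional
\[
\begin{aligned}
\Pr_R(F \mid C \ge c) &= 1 - \Pr_R(\text{not } F \mid C \ge c) \\
&= 1 - \frac{\Pr_R(C \ge c \mid \text{not } F)\,\Pr_R(\text{not } F)}{\Pr_R(C \ge c)} \\
&= 1 - \frac{\Pr_S(C \ge c)\,\Pr_R(\text{not } F)}{\Pr_R(C \ge c)} \\
&= \frac{\Pr_R(C \ge c) - \Pr_S(C \ge c)\,\Pr_R(\text{not } F)}{\Pr_R(C \ge c)} \\
&\approx \frac{\Pr_R(C \ge c) - \Pr_S(C \ge c)}{\Pr_R(C \ge c)} = \frac{\text{excess}}{\text{total}}
\end{aligned}
\]
The third line uses the shuffled motifs as the model for non-functional instances; the last step takes Pr_R(not F) ≈ 1.
[Plot: Pr_R(C ≥ c) and Pr_S(C ≥ c) vs. branch length (subst / site); example point Pr_R(C ≥ 1.5) = 0.2]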
Excess conservation score defined by genomic background
Excess conservation score also defined by motif
Perform genome-wide binding site predictions…
ARE THE PREDICTIONS ANY GOOD?
Use ChIP-seq overlap as a measure of sensitivity
Genome-wide binding site predictions for one factor (Ex: E2F4)
ChIP-seq for same factor (Ex: E2F4)
Sensitivity = Overlapping ChIP-peaks / Total ChIP-peaks
But how do you assess if your overlap is good?
Compare to the best tool out there
(or all the tools, if there is no “best”)
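The sensitivity formula on this slide can be sketched as a simple interval-overlap count. The tuple layout (chromosome, start, end) with half-open coordinates and the toy coordinates are assumptions for illustration, not real E2F4 data.

```python
def overlaps(a, b):
    """True if half-open intervals (chrom, start, end) share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def sensitivity(chip_peaks, predictions):
    """Fraction of ChIP-seq peaks overlapped by at least one predicted site."""
    if not chip_peaks:
        raise ValueError("need at least one ChIP-seq peak")
    hit = sum(any(overlaps(peak, site) for site in predictions) for peak in chip_peaks)
    return hit / len(chip_peaks)

# Hypothetical toy coordinates.
chip_peaks = [("chr1", 100, 300), ("chr1", 1000, 1200), ("chr2", 50, 250), ("chr2", 900, 1100)]
predictions = [("chr1", 150, 160), ("chr2", 240, 250), ("chr2", 2000, 2010)]
print(sensitivity(chip_peaks, predictions))  # 0.5
```

The nested loop is quadratic, which is fine for a toy example; genome-scale comparisons would normally sort the intervals or use an interval index.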
Excess conservation binding site prediction is more accurate than existing methods (prior state of the art)
Excess conservation captures binding site profile similar to ChIP-seq
[Plot axis: conservation (% identity)]
Submit predictions to GREAT
• Now we have good genome-wide binding site predictions for many factors
Let's submit them to GREAT and find out what they are doing…
Comparing binding site prediction to ChIP-seq
Transcription factor | Ontology | Top-ranked biological context | GREAT rank for ChIP-seq | Experimental support
GABPA | GO Biological Process | translation | 2 | (Genuario and Perry, 1996)
GABPA | GO Cellular Component | membrane coat | 14 | Novel
GABPA | GO Molecular Function | translation initiation factor activity | 4 | (Genuario and Perry, 1996)
GABPA | Mouse Phenotypes | increased single-positive T cell number | None | (Yu et al., 2010)
GABPA | PANTHER Pathway | general transcription by RNA polymerase I | 1 | (Hauck et al., 2002)
GABPA | Pathway Commons | transcription | 3 | (Hauck et al., 2002)
REST (NRSF) | GO Biological Process | neurotransmitter transport | 1 | (Schoenherr et al., 1996)
REST (NRSF) | GO Cellular Component | neuronal cell body | None | (Schoenherr et al., 1996)
REST (NRSF) | GO Molecular Function | cation channel activity | 1 | (Schoenherr et al., 1996)
REST (NRSF) | Mouse Phenotypes | abnormal synaptic transmission | 1 | (Schoenherr et al., 1996)
REST (NRSF) | PANTHER Pathway | synaptic vesicle trafficking | 2 | (Schoenherr et al., 1996)
REST (NRSF) | Pathway Commons | transmission across chemical synapses | 3 | (Schoenherr et al., 1996)
SRF | GO Biological Process | muscle structure development | None | (Miano et al., 2007)
SRF | GO Cellular Component | actin cytoskeleton | 1 | (Miano et al., 2007)
SRF | GO Molecular Function | structural constituent of muscle | None | (Miano et al., 2007)
SRF | Mouse Phenotypes | dilated heart ventricles | None | (Parlakian et al., 2004)
SRF | PANTHER Pathway | cytoskeletal regulation by Rho GTPase | None | (Hill et al., 1995)
SRF | Pathway Commons | regulation of insulin secretion by acetylcholine | None | Novel
STAT3 | GO Biological Process | negative regulation of signal transduction | None | (Naka et al., 1997)
STAT3 | GO Molecular Function | transforming growth factor beta binding | None | (Kinjyo et al., 2006)
STAT3 | Mouse Phenotypes | abnormal spleen B cell follicle morphology | None | (Schmidlin et al., 2009)
STAT3 | Pathway Commons | Signaling events mediated by TCPTP | None | (Yamamoto et al., 2002)
ChIP-seq context labels on the slide: In Jurkat (listed after GABPA, REST (NRSF), SRF); In mESC (listed after STAT3)
PRISM re-discovers known functions
TF | function | p-value | target genes
SRF | muscle structure development | 7.43×10^-41 | 157
GLI2 | skeletal system development | 7.07×10^-48 | 192
CRX | retinal photoreceptor degeneration | 1.30×10^-10 | 34
AR | abnormal spermiogenesis | 1.19×10^-6 | 26
Is the number of re-discovered known functions impressive?
Evaluate re-discovery of known function using “closed loops”
How can we assess if the functional associations predicted by PRISM for a particular TF are reasonable without reading a lot of papers?
One way is to check if the TFs are annotated with the function (form a closed loop)
SRF → Genes involved in “muscle structure development”
Is SRF itself annotated with the term “muscle structure development”?
YES – a “closed loop”
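The closed-loop check amounts to a set lookup: does the TF gene itself carry the term predicted for its targets? A minimal sketch follows. The annotation dictionary is hypothetical toy data rather than a real ontology dump, though the two SRF entries follow what the slides say about SRF.

```python
def is_closed_loop(tf, term, gene_annotations):
    """True if the TF gene itself carries the predicted term."""
    return term in gene_annotations.get(tf, set())

def closed_loop_fraction(predictions, gene_annotations):
    """Share of (TF, term) predictions that form a closed loop."""
    if not predictions:
        return 0.0
    closed = sum(is_closed_loop(tf, term, gene_annotations) for tf, term in predictions)
    return closed / len(predictions)

# Hypothetical annotations for illustration only.
gene_annotations = {
    "SRF": {"muscle structure development"},
    "GLI2": {"skeletal system development"},
}
predictions = [
    ("SRF", "muscle structure development"),
    ("SRF", "actin cytoskeleton"),
    ("GLI2", "skeletal system development"),
    ("CRX", "retinal photoreceptor degeneration"),
]
print(closed_loop_fraction(predictions, gene_annotations))  # 0.5
```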
PRISM predictions are consistent with known transcription factor biology
Null Model: How many closed loops using 50,000 random shuffled PWM libraries?
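One standard way to read such a null model is as an empirical p-value: the share of shuffled libraries that produce at least as many closed loops as the real library. The sketch below uses made-up counts, since the slide does not give the observed or null numbers, and the add-one correction is a common convention rather than something stated in the lecture.

```python
import random

def empirical_p_value(observed, null_counts):
    """Share of null libraries with at least as many closed loops (add-one corrected)."""
    at_least = sum(count >= observed for count in null_counts)
    return (at_least + 1) / (len(null_counts) + 1)

# Hypothetical numbers for illustration only.
rng = random.Random(0)
null_counts = [rng.randint(0, 30) for _ in range(50_000)]
print(empirical_p_value(observed=60, null_counts=null_counts))
```

With 50,000 shuffled libraries the smallest p-value this estimate can report is 1/50,001.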
Many non-closed loops are still true
TF | function | p-value | target genes
GATA6 | abnormal pancreas development | 5.69×10^-13 | 23
SRF | actin cytoskeleton | 4.84×10^-58 | 142
1. Incomplete annotation
Nature Genetics, December 2011.
2. “Regulation of” annotation
SRF acts in the nucleus, where it regulates actin cytoskeleton genes.
Raw GREAT results need cleaning for conserved TFBS
• Now we have good genome-wide binding site predictions for many factors
• AND we have functional predictions without ChIP-seq
Was it as easy as creating binding sites and submitting the results to GREAT?
…not quite…
Shuffled motifs also give GREAT enrichments
Transcription factor motif → Genome-wide binding site predictions → Run GREAT and observe biological function
10 shuffled transcription factor motifs → Genome-wide binding site predictions → Run GREAT and observe biological function
Diagram labels: Filter, PRISM, Examine closely
Shuffled motifs are used to create an “E-value” metric to blacklist enrichments that show up for shuffles
Stage | # TF-term associations | TF-term FDR (from shuffles) | closed loop %
Stage 1: GREAT on binding site prediction (Obtained = GREAT) | 31,946 | 50.5% | 3.3%
Stage 2: Top significant GREAT terms (Kept) | 7,529 | 49.5% | 5.3%
Stage 3: PRISM terms (via blacklisting) (Kept = PRISM) | 1,658 | 16.4% | 10.9%
PRISM vs. GREAT on b.s. prediction: 5.2% of GREAT predictions kept, 308% FDR improvement, 329% fraction loops improvement
What are all the terms we are throwing away?
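The blacklisting idea can be sketched as a filter: drop any term from the real motif's enrichments that also shows up for shuffled motifs. The slides do not define the E-value itself, so the count of shuffled runs used below is a simplified stand-in, and every name and number in the sketch is hypothetical.

```python
def blacklist_by_shuffles(real_terms, shuffled_term_sets, max_shuffle_hits=0):
    """Keep real-motif terms enriched in at most max_shuffle_hits shuffled runs."""
    kept = {}
    for term, p_value in real_terms.items():
        shuffle_hits = sum(term in terms for terms in shuffled_term_sets)
        if shuffle_hits <= max_shuffle_hits:
            kept[term] = p_value
    return kept

# Hypothetical enrichments for one real motif and three shuffled versions of it.
real_terms = {"actin cytoskeleton": 1e-20, "transcription": 1e-6}
shuffled_term_sets = [{"transcription"}, {"transcription", "translation"}, set()]
print(blacklist_by_shuffles(real_terms, shuffled_term_sets))
```

Raising `max_shuffle_hits` makes the filter more permissive, trading fewer discarded real enrichments for more false ones.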
GREAT enrichments from shuffles are due to conservation bias
• Create 10,000 random sets of random conserved non-coding regions
• Run GREAT
• How do the enrichments compare to those from shuffled motifs?
[Venn diagram: Shuffles (2488) and CNEs (2279) share 1733; 755 are unique to Shuffles and 546 are unique to CNEs]
Pro: E-value helps us get more accurate predictions by removing false predictions
Con: The conservation bias filter causes us to lose potentially real enrichments in systems that are more often conserved
So far…
• “Excess Conservation”
• advanced the state of the art for binding site prediction
• “PRISM pipeline”
• combined accurate binding site prediction with GREAT
• Publicly offered as a web application
• bejerano.stanford.edu/prism
The rest of the talk includes pre-publication work