Memphis presentation_final

Modeling Splice Site and
Transcription Factor Binding Site
Variation by Information Theory
Peter K. Rogan, Ph.D.
St. Jude Children’s Research Hospital
Memphis, TN
May 15, 2003
Background
• Information theory provides general
solutions to the problem of how to
recognize members of a group of
related nucleic acid (or protein)
sequences.
• The average information of a related set of sequences, Rsequence, represents the total sequence conservation:
$$R_{sequence} = \sum_{l} \left( 2 - \left[ -\sum_{b} f(b,l) \log_2 f(b,l) + e(n(l)) \right] \right)$$
where f(b,l) is the frequency of each base b at position l, and e(n(l)) is a correction for the small sample size n at position l.
Schneider et al. J. Mol. Biol. 1984
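The Rsequence definition can be illustrated with a minimal Python sketch. It assumes the common approximation e(n) = 3 / (2 ln 2 · n) for the small-sample correction and sums the per-position values over the site; the function names and the example sites are hypothetical and are not taken from the slides.

```python
import math
from collections import Counter

def small_sample_correction(n, symbols=4):
    """Assumed approximation: e(n) = (symbols - 1) / (2 * ln 2 * n), in bits."""
    return (symbols - 1) / (2 * math.log(2) * n)

def r_sequence(sites):
    """Sum over positions l of 2 - [H(l) + e(n(l))], in bits."""
    total = 0.0
    for column in zip(*sites):              # one column per position l
        n = len(column)
        counts = Counter(column)
        # H(l) = -sum_b f(b,l) log2 f(b,l); bases absent from the column contribute 0
        h = -sum((c / n) * math.log2(c / n) for c in counts.values())
        total += 2 - (h + small_sample_correction(n))
    return total

# Hypothetical aligned sites, for illustration only
sites = ["TACGAT", "TATAAT", "TATGAT", "GATACT", "TATAAT", "TACAAT"]
print(round(r_sequence(sites), 2))
```

A fully conserved position contributes close to 2 bits, and a position with equal base frequencies contributes close to 0 bits (slightly negative after the correction).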
Sequence Logo
Conservation and diversity
among related binding sites
can be visualized using a
sequence logo.
The area under the logo is Rsequence, the average
information of the binding site.
Definition of Individual Information
• The individual information, Ri, of a single member of a sequence family is the dot product of that sequence vector and a weight matrix, Riw(b,l), based on the base frequencies at each position of the sequence:
$$R_i(j) = \sum_{l} \sum_{b=a}^{t} s(b,l,j) \, R_{iw}(b,l) \quad \text{(bits per site } j\text{)}$$
Distribution of Individual Information
for related binding sites
The average of the set of Ri values for a family of sequences is
Rsequence.
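The relationship between the weight matrix, Ri, and Rsequence can be sketched in Python. This is an illustration under stated assumptions, not the Delila implementation: the weights are taken as Riw(b,l) = 2 + log2 f(b,l) - e(n), e(n) uses the approximation 3 / (2 ln 2 · n), and a base never observed at a position is given the frequency 1/(n+2). The example sites are hypothetical.

```python
import math

BASES = "acgt"

def weight_matrix(sites):
    """Riw(b,l) = 2 - (-log2 f(b,l) + e(n)) for each position l."""
    n = len(sites)
    e_n = 3 / (2 * math.log(2) * n)         # assumed small-sample correction
    matrix = []
    for column in zip(*sites):
        weights = {}
        for b in BASES:
            # Assumption: an unobserved base gets frequency 1/(n+2) instead of 0
            f = column.count(b) / n or 1 / (n + 2)
            weights[b] = 2 + math.log2(f) - e_n
        matrix.append(weights)
    return matrix

def individual_information(site, matrix):
    """Ri(j): the dot product reduces to summing the weight of each base present."""
    return sum(weights[b] for b, weights in zip(site, matrix))

# Hypothetical aligned sites, for illustration only
sites = ["tataat", "tatgat", "tacaat", "gataat", "tataat"]
matrix = weight_matrix(sites)
ri = [individual_information(s, matrix) for s in sites]
print([round(x, 2) for x in ri])
print(round(sum(ri) / len(ri), 2))          # mean Ri of the set
```

With these definitions the mean of the Ri values over the sites used to build the matrix equals Rsequence for that set.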
Second law of thermodynamics
$$-k_B T \ln 2 \ge q / R$$
q: heat dissipated; T: temperature; R: information
q > 0 => R < 0: HLH protein with a DNA mutation or an unrelated sequence
q < 0 => R > 0: HLH protein bound to WT DNA
Sequence Walker Definition
Among related sequences having a common function,
functional sites can be distinguished from non-sites
with the sequence walker.
[Figure: sequence walker for the E. coli Fis protein; vertical scale in bits, from -4 to 2]
Sequence Walker Application I
The matrix can be scanned along a “test sequence” until...
[Figure: sequence walker on the test sequence; vertical scale in bits, from -4 to 2]
Ri = -6.7 bits at position 179 of the sequence. The Z score is -5.4.
Sequence Walker Application II
… a green bar indicates a potential binding site
[Figure: sequence walker on the test sequence; vertical scale in bits, from -4 to 2]
Ri = 9.2 bits at position 180 of the sequence. The Z score is 0.3.
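Scanning a weight matrix along a test sequence can be sketched as follows. The 3-position matrix, the test sequence, and the Rsequence and standard deviation values are hypothetical, and the Z score is assumed here to be (Ri - Rsequence) / SD of the model's Ri distribution. Positions are 0-based.

```python
def scan(sequence, matrix, r_sequence, sd):
    """Slide the matrix along the sequence; return (position, Ri, Z) for every window."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        ri = sum(weights[b] for b, weights in zip(window, matrix))
        z = (ri - r_sequence) / sd          # assumed definition of the Z score
        hits.append((start, ri, z))
    return hits

# Hypothetical 3-position weight matrix (bits); not taken from any real model
toy_matrix = [
    {"a": 1.5, "c": -2.0, "g": -2.0, "t": -1.0},
    {"a": -1.0, "c": 1.8, "g": -3.0, "t": -2.0},
    {"a": -2.0, "c": -2.0, "g": 1.7, "t": -1.0},
]

hits = scan("ttacgga", toy_matrix, r_sequence=4.0, sd=1.0)
for pos, ri, z in hits:
    if ri > 0:                              # candidate site (Ri > 0)
        print(pos, round(ri, 2), round(z, 2))
```

Only windows with Ri > 0 are reported as potential sites; windows with negative Ri correspond to non-sites.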
mRNA splicing
[Figure: a gene (DNA, 5’ to 3’) with exons 1, 2, and 3 separated by IVS 1 and IVS 2, with donor and acceptor sites marked. Transcription yields hnRNA; splicing yields either mature mRNA (exons 1, 2, 3) or alternative mRNA (exons 1, 3).]
Splice Site Model Building
• We extracted coordinates of unique donor and acceptor splice sites of known genes from the given strand of the 10/7/00 Human Genome Working Draft.
• Valid splice junctions were evaluated by information theory (Ri > 0) and the Ri(b,l) matrix was computed.
• This process was iterated (~10 cycles) until all sites evaluated with the matrix had Ri > 0.
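The iteration described above can be sketched in Python: build a matrix, drop sites with Ri <= 0, and rebuild until every remaining site has Ri > 0. The weight definition, the small-sample correction, the 1/(n+2) frequency for unobserved bases, and the toy sites are assumptions for illustration; the actual refinement was run on tens of thousands of genomic sites.

```python
import math

def build_matrix(sites):
    """Assumed weights: 2 + log2 f(b,l) - e(n), with e(n) = 3 / (2 ln 2 * n)."""
    n = len(sites)
    e_n = 3 / (2 * math.log(2) * n)
    return [{b: 2 + math.log2(column.count(b) / n or 1 / (n + 2)) - e_n
             for b in "acgt"}
            for column in zip(*sites)]

def ri_of(site, matrix):
    return sum(weights[b] for b, weights in zip(site, matrix))

def refine(sites, max_cycles=10):
    """Rebuild the matrix and drop sites with Ri <= 0 until no site is dropped."""
    current = list(sites)
    for cycle in range(1, max_cycles + 1):
        matrix = build_matrix(current)
        kept = [s for s in current if ri_of(s, matrix) > 0]
        if not kept:
            raise ValueError("no sites with Ri > 0 remain")
        if len(kept) == len(current):       # converged: all sites have Ri > 0
            return current, matrix, cycle
        current = kept
    return current, build_matrix(current), max_cycles

# Hypothetical starting set: 20 consistent sites plus one misaligned sequence
sites = ["aggt"] * 20 + ["cccc"]
refined, matrix, cycles = refine(sites)
print(len(refined), cycles)
```

In this toy run the misaligned sequence is removed in the first cycle and the second cycle confirms that all remaining sites have Ri > 0.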
Semi-automated Splice site Model Refinement
Parameters                   Acc (+ strand)   Acc (- strand)   Acc_total   Acc (1992)
Starting set (n)             86,068           84,076           170,144     1,744
Refined model (n)            53,985           54,101           108,079     1,744
Site coordinates             [-25, 2]         [-25, 2]         [-25, 2]    [-25, 2]
Rsequence                    7.45             7.41             7.42        8.87
Standard deviation           3.47             3.47             3.47        4.58
Ri of consensus sequence     22.93            22.78            22.88       21.68

Parameters                   Don (+ strand)   Don (- strand)   Don_total   Don (1992)
Starting set (n)             86,221           84,229           170,450     1,799
Refined model (n)            56,286           55,491           111,772     1,799
Site coordinates             [-3,6]           [-3,6]           [-3,6]      [-10,10]
Rsequence                    6.73             6.74             6.74        8.01
Standard deviation           2.36             2.33             2.34        3.29
Ri of consensus sequence     11.80            11.80            11.79       15.18
• About one third of exon-intron junctions are misaligned in the draft,
owing to the rapid alignment procedures used (i.e., BLAT).
Splice junction logos: (+) strand
Ri analysis of sequence
variation at binding sites
• Effects of mutations
• Effects of polymorphisms
• Detection of cryptic sites
• Relationship between information content and phenotype
Comparison of the binding energies of normal and variant splice junctions:
$$G_{wt} / G_{v} = 2^{\Delta R_i}$$
where ΔRi = the difference between the respective Ri values,
Gwt = free energy of the natural binding site,
Gv = free energy of the variant binding site.
The fold difference in binding the normal vs. the variant site is Gwt/Gv.
mRNA splicing mutations (*, ^)
[Figure: gene DNA and hnRNA (5’ to 3’) with exons 1, 2, and 3, IVS 1 and IVS 2, and donor and acceptor sites, with mutation positions marked * or ^. Mutant forms: exon skipping (*) and cryptic splicing (^), shown with the mRNA products 1 2 3 and 1 3; the result is leaky or no wild type mRNA.]
The minimum information required for donor site
recognition
A temperature-sensitive mutation in COL3A1 results in 50% exon skipping and Ehlers-Danlos syndrome, Type VII. Splicing is impaired at 39 °C and restored at 30 °C, which is consistent with weak binding by the U1 spliceosome.
Cryptic splicing mutations
A C->T mutation in exon 3 of the iduronate 2-sulfatase (IDS) gene
activates a cryptic donor site upstream
of the natural donor site.
Mechanism of exon recognition
[Figure: an exon in the mRNA (5’ to 3’) with its binding sites: the acceptor, bound by U2 spliceosome + U2AF, and the donor, bound by U1 spliceosome.]
Mechanism of exon recognition:
cryptic splicing mutation
[Figure (2a): an exon in the mRNA (5’ to 3’) with its binding sites. U2 spliceosome + U2AF bind the natural acceptor. The activated cryptic donor is recognized by U1 spliceosome; the natural donor is either not recognized or recognized to a lesser degree.]
Mild (or leaky) splicing mutation
CFTR Polymorphism (5T, 7T, 9T)
Pop. freq.: 60%, 35%, 5%
Splicing among 3 common alleles that differ in length in the polymorphic polythymidine tract of the IVS 8 acceptor of the CFTR gene. The shortest allele (top walker) shows 90% skipping of exon 9 and is associated with congenital absence of the vas deferens. Individuals with the two longer alleles have a normal phenotype, although the 7T allele produces less mRNA than the 9T allele.
Prediction of clinical phenotypes
•Hereditary non-polyposis colon cancer
•Hemophilia A and B
•Atherosclerosis
Predicting Phenotype of HNPCC Splicing Mutations by
Information Analysis
The Lynch I form of HNPCC is confined to the colon, but the more severe Lynch II
type shows multi-organ involvement. The HNPCC phenotype is hypothesized to be
related to the amount of normal and abnormal MLH1 and MSH2 mRNA present
predicted from the individual information in mutant splice sites.
Lynch II mutations
Lynch I mutations
Mutant splice sites (n=31) in these genes contained significantly less information than the
cognate natural sites. Each of the Lynch I mutations had Ri values >2.4 bits, which is
consistent with reduction (not abolition) of mRNA. Lynch I and II phenotypes were
distinguishable by their Ri values for all but 3 Lynch II mutations (with 2.4 to 4.8 bits).
Statistical analysis: HNPCC
Hypothesis: Ri values will be highest for normal splice sites, intermediate for Lynch I, and lowest for Lynch II syndrome.
The medians for these three groups are different and in the expected order, although there are some outliers in the two Lynch mutation groups.
The three groups have significantly different Ri values (Kruskal-Wallis χ² (df=2) = 17.9833, P = 0.0001).
Each group differs from the others based on pairwise comparisons with the Wilcoxon rank-sum test:

Group comparison                 Corrected rank-sum normal (Z) statistic   P
Lynch I vs. Normal variants      2.68                                      0.0072
Lynch II vs. Normal variants     3.73                                      0.0002
Lynch I vs. Lynch II             2.17                                      0.03
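As a rough consistency check on the pairwise comparisons, the P values can be approximated from the Z statistics. This sketch assumes a two-sided test under the standard normal approximation; small differences from the reported values are expected because the Z statistics are rounded.

```python
import math

def two_sided_p(z):
    """Two-sided P value for a standard normal statistic z."""
    return math.erfc(abs(z) / math.sqrt(2))

comparisons = [("Lynch I vs. Normal variants", 2.68),
               ("Lynch II vs. Normal variants", 3.73),
               ("Lynch I vs. Lynch II", 2.17)]
for label, z in comparisons:
    print(label, round(two_sided_p(z), 4))
```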
Results are consistent with MSH2 -/- and MSH2 +/- transgenic mouse phenotypes. Increased proliferation induces widespread DNA replication errors, which are repaired normally until DNA repair systems are saturated (Cancer Res. 62:2092, 2002). Mismatch repair machinery is activated by DNA damaging agents (Nature 399:806, 1999; PNAS 96:10704, 1999).
Relating Information Content of F8C and F9 Splicing
Mutations and Bleeding Phenotypes
Ri value        Reduction in protein level   Reduction in protein activity   Bleeding symptoms
cutoff (bits)   Mild        Severe           Mild          Severe            Mild        Severe
< 2.4           0/13        13/13 (100)      5/37 (14)     32/37 (86)        0/9         9/9 (100)
> 2.4           5/7 (71)    2/7 (29)         23/36 (64)    13/36 (36)        8/21 (38)   13/21 (62)
To predict severity of hemophilia, mutations in the factor VIII (F8C) or factor IX (F9) genes were analyzed for changes in Ri:
• The receiver operating curve discriminated mildly or moderately from severely reduced protein activity for values  2.4 bits or Ri < 7 bits (P = 0.001).
• Using these thresholds:
- 91% of mutations with severely reduced protein expression were correctly identified (n = 45; P < 0.001).
- 86% of mutations associated with severe bleeding and all mutations with moderate bleeding symptoms were correctly identified (n = 22; P < 0.0009).
Information Content of Splicing Mutations
in Lipid Metabolizing Genes vs. Phenotype
                Phenotype*
Ri value        Dyslipidemia                 Reduction in protein level or activity
cutoff (bits)   Mild    Average   Severe     Mild    Average   Severe
< 2.4           0/15    10/15     5/15       1/9     7/9       1/9
> 2.4           2/5     3/5       0/5        2/3     1/3       0/3
*Fraction is the number of mutations in category / total number above or below 2.4 bits. Mutant genes included APOAII, APOB, APOCII, APOE, CBS, CETP, LCAT, LIPA, LDLR, and LPL.
Generating information models of eukaryotic
transcription factor cis-regulatory binding sites
Unique challenges:
• Variant sequences are not obvious
• Requires experimental determination and validation
• Effect of ascertainment bias
– in published sites
– in SELEX-generated sites
• Protein binding does not necessarily signify activation (or repression) of transcription
Greek Hereditary Persistence of Fetal Hemoglobin (HBGA, -119G>A)
[Figure: sequence walkers for the site, 6.8 bits and 7.3 bits]
(A) Mutation in the CCAAT box of the A-gamma globin gene results in 1.4-fold increased expression of fetal globin mRNA into adulthood. The CCAAT box protein binding site is strengthened by 0.5 bits (or 1.41 fold) over wild type. (B) The binding site logo and distribution of Ri values of 171 binding sites in the Transfac Database (www.biobase.de) are indicated. Models of NF-E2, GATA1, and GATA2 protein binding sites were also constructed, but sites were not found in this interval (not shown).
The Transcription Factor
Binding Site Problem:
Bias in Models Derived from TRANSFAC data
towards Consensus Sequences*
*Consensus sequences have the strongest binding, but are often
not representative of the majority of sites.
Model development strategy
Refinement of the Pregnane X Receptor
(PXR/RXRα) binding site model
Initial PXR/RXR Model. Published PXR/RXR binding sites (n=15; and
flanking sequences) were multiply aligned by minimization of
uncertainty. The -2 to +20 interval contained most of the information,
was consistent with published binding studies, and was therefore used
to define the site.
Competition Curves for Novel PXREs Identified by Model 1
To quantify the relative affinity of PXR/RXR, band density was plotted versus pmol
competitor to determine the concentration of competitor required to deplete
PXR/RXRα binding to the CYP3A4 proximal PXRE by 50%. Relative binding was
normalized to the band intensity of the reactions with no added competitor as 100%.
Comparison of predicted and measured binding
affinities for novel PXR/RXRα sites
identified with the initial model
Gene      Position             Ri (bits)            Minimum theoretical      Observed change
          (relative to ATG)                         change in affinity       in affinity (EMSA)
                               Model 1   Model 2    Model 1    Model 2
CYP3A4    -270                 17.3      18.0
CYP2B6    -8572                15.0      17.9       4.92       1.07          4.4
UGT1A3    -6930                10.9      17.2       84.4       1.74          4.4
UGT1A3    -8040                10.7      16.5       97.0       2.83          3.7
UGT1A6    -9216                9.9       14.3       168.9      13.0          29.6
[The PXRE column of the original table shows the Model 2 derived walker for each site.]
Predicted fold differences in binding were closer to densitometrically determined differences when these weaker sites were added in Model 2.
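One way to quantify "closer" is to compare predicted and observed fold changes on a log2 scale. The sketch below uses the values from the table; the choice of metric (mean absolute log2 ratio, in bits) is an assumption made for illustration and is not stated in the slides.

```python
import math

# Fold changes relative to the CYP3A4 proximal PXRE, copied from the table
observed = {"CYP2B6 -8572": 4.4, "UGT1A3 -6930": 4.4,
            "UGT1A3 -8040": 3.7, "UGT1A6 -9216": 29.6}
predicted = {
    "Model 1": {"CYP2B6 -8572": 4.92, "UGT1A3 -6930": 84.4,
                "UGT1A3 -8040": 97.0, "UGT1A6 -9216": 168.9},
    "Model 2": {"CYP2B6 -8572": 1.07, "UGT1A3 -6930": 1.74,
                "UGT1A3 -8040": 2.83, "UGT1A6 -9216": 13.0},
}

def mean_abs_log2_error(pred, obs):
    """Average |log2(predicted / observed)| over sites, in bits."""
    return sum(abs(math.log2(pred[site] / obs[site])) for site in obs) / len(obs)

errors = {name: mean_abs_log2_error(pred, observed)
          for name, pred in predicted.items()}
for name, err in errors.items():
    print(name, round(err, 2))
```

By this measure the Model 2 predictions deviate less from the EMSA measurements than the Model 1 predictions.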
Model 2 Characteristics
(A) Alignment of published + validated PXREs
(B) Histogram
(C) Sequence logo
Scans of CYP3A4 and CYP2B6 promoters
Each promoter was scanned with PXR/RXR model 2. Ri
values are plotted versus the position of the PXRE in the
CYP3A4 gene or the CYP2B6 gene. Ri values of sites on
the antisense strand are shown upside down. Previously
characterized PXR binding sites identified by the model are
indicated in color.
Activation of the CYP2B6 Distal PXRE
Transient transfections with CYP2B6 and control CYP3A4 PXRE fusion
constructs. Rifampin induced luciferase activity 4- to 5-fold in cells
cotransfected with an expression plasmid for human PXR and
CYP2B6-dPXRE(2X)-luc, and 2- to 3- fold in cells cotransfected with
CYP3A4-pPXRE(2X)-luc. Rifampin had no effect on luciferase activity
in cells transfected with the enhancerless-reporter.
Average luciferase activity ± SD of three replicates from 3 independent transfections is shown.
PXR/RXR Model 3
Weaker binding sites from well
established PXR/RXRα target gene
promoters (Ri < Rsequence) were validated
and introduced into Model 3.
Novel validated binding sites in Model 4
These 14 binding sites are not present in the Nov 02 human genome draft!
Site name               Site name - Ri(b,l) matrix           Ri
CYP3A4-pPXRE(0/10G)     NG_000004.a148729g.a148739g          15.1
CYP3A-dNR1(0/10G)       NG_000004.t141178c.t141168c          16.8
CYP3A7-dNR2(0/10G)      NG_000004.a190205g.a190215g          17.6
CYP2B6-dPXRE(10G)       CYP2B6.a1446g                        16.2
UGT1A3b(0/10G)          AF297093.t137695c.t137685c           18.3
UGT1A3a(0/10G)          AF297093.a138805g.a138815g           14.9
GSTM1(0/10G)            AC000031.6.a1959g.a1969g             12.0
UGT1A1gtNR1(0/10G)      AF297093.1.t171676c.t171666c         7.1
UGT1A1b(0/10G)          AF297093.1.t165761c.t165751c         14.0
FMO4b(10G)              AL031274.1.a57947g                   11.0
catalase(0/10G)         AL035079.14.t43503g.a43513g          14.6
NOS2A(1A)               chr17_27002541-27012540.c8336t       12.9
NOS2A(11A)              chr17_27002541-27012540.c8326t       10.5
MAOBd(0/10G)            Z95125.t36576c.t36566c               11.1
Possible significance of novel sites
• Not present in reference sequence, but they are
polymorphisms or mild mutations
– Advantage is that binding is not abrogated, but
reduced, i.e., the gene is less PXR/RXR responsive.
– Possible “wobble” code for regulatory elements
• Ancestral binding sequence present in primate
lineage
– PXR/RXR mutation rate is slower than cis-regulatory
element; protein retains ability to recognize
sequences that are no longer present
– This could explain why heterologous cross-species
transfections are faithfully regulated.
Development of a Xenobiotic biosensor based
on the information theory-derived optimal site
HepG2 cells were transiently transfected with 100 ng luciferase reporter, 5 ng pRL-CMV and 25 ng pSG5-hPXRΔATG with Lipofectamine Plus. After treatment for 24 hours with 10 µM rifampin or 0.1% DMSO (solvent), cells were harvested and dual-luciferase assays were performed. Results are the average of three separate wells transfected and treated in parallel.
[Bar chart: Firefly RLU/Renilla RLU (scale 0 to 14) for PXREv2-OPT(2X)-luc and CYP3A4-pPXRE(2X)-luc, each treated with DMSO or 10 µM rifampin.]
Architecture of the Delila-Genome System
Performance metrics
Histogram of binding site strengths for sites
in genome scan >10 bits
Delila-Genome Visualization Tools
Visualization of successive genome scans of
PXR/RXRα binding site models
Monitoring PXR/RXR refinement through complete genome promoter scans
Table 2: Differences in total binding site counts based on genome scans of promoters with successive PXR/RXR information weight matrices

Number of sites in each category:

PXR/RXR    Unique sites        Sites in both A and B
models                         Z scores                       Ri                               Confidence intervals+
A    B     A-B*     B-A^       Threshold (Z)  S~      I@      Threshold (bits)  S~      I@      Threshold (±S.D.)  S~      I@
1    2     11758    45219      1.0            589     71658   3                 2293    69954   3                  23625   48622
2    3     17065    157922     1.0            48657   51744   3                 11044   89357   3                  37822   62579
3    4     61906    148894     1.0            5044    191373  3                 11069   185348  3                  68846   127571

+Standard error computation for individual Ri values is based on derivation given in reference 18; *sites found with model A but not with model B; ^sites found with model B, but not with model A; ~number of sites with differences in Ri values exceeding threshold Z scores; @number of sites with differences in Ri values less than the threshold.
Development and Experimental Refinement
of NFkB p65/p50 Binding Site Model
Panel 1. Logos for NFkB p50/p65 binding sites. (A) Model 2, based on 55 published and 8 experimentally determined binding sites. (B) Model 3, based on 55 published and 20 experimentally determined binding sites. Insets are histogram distributions of Ri values of sites comprising each model.
CYP2D6 Promoter Mutation Analysis of NFkB p65/p50 Binding Site
CYP2D6: “C allele” 3.3 bits; “G allele” -0.8 bits
The -1496C allele contains a weak p50/p65 site (–1495 to –1508; Ri = 3.3 bits) that is abolished (Ri < 0) in the G variant. These alleles each also contain p50 homodimer binding sites on opposite strands; however, the C allele is predicted to bind with greater affinity (3.5 vs. 2.7 bits; 1.6 fold difference). The higher CYP2D6 activity observed for the –1496G allele may be due to reduced binding and repression of CYP2D6 expression by NF-kB p50 homodimers.
Future efforts
• Automate binding site validation
• Genomic signature of PXR/RXRα – target
genes
• (Hypothesis-based microarray studies of
ligand-induced gene expression)
Automated binding site validation:
microtiter plate immunoassay
• Covalently link reference oligo to plate
• Bind synthetic PXR/RXRα ± competitor oligo*
• Bind 1° RXRα (or PXR) antibody
• Detect with 2° antibody/HRP
• (Automated with Biomek 2000 workstation)
*Competitor oligos are detected in PXR/RXRα target genes and exhibit Ri values that are within ±2 bits of the reference oligo.
Genomic analysis to identify genes
regulated by transcription factors:
• Requires robust binding site model
• Genomic signature should delineate differences between regulated and constitutively expressed genes:
• Define promoter interval
• Binding site strength
• Densities of sites
• Organization of sites
[Figure: “NF-kB binding sites” in gene promoters, for genes regulated by NF-kB and unregulated genes. Ri (0 to 16) is plotted against position (-10000 to 0); legend: Ri-reg (n=8), Ri-unreg (n=3); the -400 bp interval is marked.]
[Figure: NF-kB binding sites in upregulated genes; promoters of upregulated genes scanned by model 3. Ri (0 to 16) is plotted against position (-400 to 0); legend: INF-beta, LCAM, E-Selectin, Lymphotoxin, TNF-alpha, IL-2, GM-CSF, Urokinase; annotation: Ri = 4.0.]
[Figure: “NF-kB binding sites” in genes not known to be regulated by NF-kB. Ri (0 to 16) is plotted against position (-400 to 0); legend: GAPDS, GAPD, VEGF; annotation: Ri = 1.3.]
Criteria for scanning chromosomes 21/22
with NF-kB Model 3:
• Average information threshold of >4 bits. Of 548 promoter intervals (400 bp each), 138 promoters on the transcribed strand and 137 on the antisense strand had sites whose mean Ri values exceeded the threshold. By this criterion, 37% of the genes on chromosome 21 would be NF-kB targets! Also, multiple weak binding sites with low Ri values can falsely exclude genes containing strong binding sites. This genomic signature has very low specificity.
• Eliminate promoters with only weak binding sites (Ri < Rsequence). This signature identifies a smaller set of genes: 11 and 19, respectively, on chromosomes 21 and 22. Several expected cytokine genes are not identified with this genomic signature. These criteria introduce bias towards the consensus sequence (or an incomplete model). This approach appears to lack adequate sensitivity.
Genomic signature determination for PXR/RXR
with machine learning approach
[Flowchart: Refinement of genomic signature. Boxes: promoter region input; genome scan; frequency distribution of binding strengths; clusters of sites (Markov Cluster Algorithm); distances from TSS; hybrid neural network, with training/validation on true positives and true negatives and prediction on unknowns; positive/negative prediction; experimental confirmation, after which positive and negative results are added to the true positive and true negative sets.]
Predictions of Binding Strength Network
• Network Input: Frequency distributions of
binding sites based on 5 bit-wide bins
• Trained with 15 PXR/RXR responsive
and 15 non-responsive promoter regions
• Results of testing 9 positive and 22
negative promoter regions:
– <TP,FP,TN,FN> = <7,4,18,2>
– Sensitivity = 77.8%
– Specificity = 81.8%
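The reported sensitivity and specificity follow directly from the confusion-matrix counts, as this short Python check shows. The function names are illustrative.

```python
def sensitivity(tp, fn):
    """Fraction of true positives among all actual positives."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of true negatives among all actual negatives."""
    return tn / (tn + fp)

tp, fp, tn, fn = 7, 4, 18, 2                 # <TP,FP,TN,FN> from the test set
print(f"{sensitivity(tp, fn):.1%}")          # 77.8%
print(f"{specificity(tn, fp):.1%}")          # 81.8%
```

The denominators match the test set: 9 positive promoter regions (7 + 2) and 22 negative promoter regions (18 + 4).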
In conclusion...
• Genetic variation in binding sites can be comprehensively modeled by information theory.
• Information is related to binding energy and can be used to rank-order binding strengths.
• Beware of experimental bias towards strong binding sites. Information theory can be used to develop and refine binding site models that are representative of the range of binding strengths found in the genome.
• Robust binding site models are a prerequisite for accurate mutation/polymorphism analysis and for comprehensive identification of binding sites in the genome.
Contributors
Children’s Mercy Hospital and Clinics:
•Sashidar Gadiraju, Stan Svojanovsky
•J. Steven Leeder, Carrie Vyhlidal, Ivy Hurwitz
SICE, University of Missouri-Kansas City:
•Deendayal Dinakarpandian, Saumil Mehta
St. Jude Children’s Research Hospital:
•Erin Schuetz
University of Hamburg:
•Yskert von Kodolitsch
NCI:
•Tom Schneider
Support
Merck Genome Research Foundation
PHS ES10855-02