Protein synthesis - Swedish Institute of Computer Science
Detecting binding sites for transcription factors
by correlating sequence data with expression.
Erik Aurell
Adam Ameur
Jakub Orzechowski Westholm
in collaboration with AstraZeneca
Outline of the talk
• Introduction
• Data description
• The REDUCE method
• Results
• Applications and Conclusions
Ameur,
Orzechowski
11/3 2003
Introduction - the REDUCE method
• The aim is to find binding sites for transcription factors, motifs, in the human genome by
using a method developed at Rockefeller University (Bussemaker, Li & Siggia 2001).
• This method is called REDUCE and has previously only been applied to yeast data. We
applied it to human data.
• The idea is to find motifs by correlating sequence and expression data.
• Input consists of: Expression data, sequence data and a set of putative motifs.
• Output is a list of significant motifs:
consensus             id      description  Dc2           F  probes  hits
NNNRRCCAATSRGNNN      M00287  NF-Y         0.0044   0.0661    1041  1279
NNNCGGCCATCTTGNCTSNW  M00069  YY1          0.0014   0.0363     300   314
NNRACAGGTGYAN         M00060  Sn           0.0013  -0.0345     368   374
NNNRGGNCAAAGKTCANNN   M00134  HNF-4        0.0008   0.0290     263   272
TWTTTAATTGGTT         M00424  NKX6-1       0.0007  -0.0234     428   457
KNNKNNTYGCGTGCMS      M00235  AhR/Arnt     0.0006  -0.0254     155   161
NANCACGTGNNW          M00123  c-Myc/Max    0.0006  -0.0243      50    50
NNBTNTNCTATTTNTT      M00092  BR-CZ2       0.0005   0.0233      92    94
NNGAATATKCANNNN       M00136  Oct-1        0.0005  -0.0230     213   244
Expression data
• Expression data is provided by AstraZeneca. It consists of 81 samples of human cerebral
cortex stem cells undergoing various treatments. Expression levels are measured on an
Affymetrix U133 chip.
• We visualize expression data in a heatmap.
          sample 1   ...   sample 81
gene 1    e(1,1)     ...   e(1,81)
gene 2    e(2,1)     ...   e(2,81)
  .          .                .
  .          .                .
gene n    e(n,1)     ...   e(n,81)
• It is possible to identify regions of correlated genes in the heatmap.
Sequence data
• In the REDUCE model, expression levels are explained by the number of times the
motifs occur in the upstream sequences of human genes.
• For this, sequences around the transcription starts are extracted. We take sequences in
the range [1000 bp upstream, 100 bp downstream].
• Transcription starts and genome data are provided by AstraZeneca.
• The upstream sequences are masked for repeats (with RepeatMasker).
• Putative motifs are matched to the resulting sequences.
[Figure: the extracted region around the transcription start, from -1000 bp to +100 bp, with an example sequence.]
GGAGTTCAAGACCAACCTAAGCAACAAAGTGAAACCACATCACTATAAATATATTCTTAAACGTGAAATGTTCACTCAGGCTTTTTAATATTTTATTTCATTT
• The motif TKAAA and its reverse complement TTTMA are matched in the example.
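A minimal sketch of how a consensus motif and its reverse complement could be counted in an upstream sequence. It assumes the standard IUPAC ambiguity codes (for example K = G or T, M = A or C); the function names are hypothetical, and repeat-masked bases (written as N in the sequence) are assumed never to match.

```python
# IUPAC ambiguity codes: each motif letter stands for a set of bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Complement of each IUPAC letter (K <-> M, R <-> Y, and so on).
COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A",
    "R": "Y", "Y": "R", "S": "S", "W": "W",
    "K": "M", "M": "K", "B": "V", "V": "B",
    "D": "H", "H": "D", "N": "N",
}


def reverse_complement(motif):
    """Return the reverse complement of an IUPAC consensus string."""
    return "".join(COMPLEMENT[code] for code in reversed(motif))


def find_matches(motif, sequence):
    """Return the start positions where the motif matches the sequence."""
    hits = []
    for start in range(len(sequence) - len(motif) + 1):
        window = sequence[start:start + len(motif)]
        # A masked base ("N" in the sequence) is in no base set, so it never matches.
        if all(base in IUPAC[code] for code, base in zip(motif, window)):
            hits.append(start)
    return hits


def count_occurrences(motif, sequence):
    """Count matches of the motif on both strands."""
    forward = find_matches(motif, sequence)
    rc = reverse_complement(motif)
    # A palindromic motif would otherwise be counted twice.
    reverse = [] if rc == motif else find_matches(rc, sequence)
    return len(forward) + len(reverse)
```

Calling count_occurrences("TKAAA", sequence) on an upstream sequence gives the kind of per-gene motif count that the REDUCE model takes as input.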
Motifs
• Motifs are represented as weight matrices:
W(M) =
           A        C        G        T
pos 1    w(1,A)   w(1,C)   w(1,G)   w(1,T)
pos 2    w(2,A)   w(2,C)   w(2,G)   w(2,T)
  .        .        .        .        .
  .        .        .        .        .
pos n    w(n,A)   w(n,C)   w(n,G)   w(n,T)
w(i,B) is the probability that position i of the motif M is the nucleotide B.
• We generate the set of putative motifs as weight matrices. This can be done in several ways:
• One possibility is to use the matrices (about 300) in the TransFac database.
• Another possibility is to generate matrices of our own, for example for all sequences of a
certain length. Since the number of possible sequences grows exponentially with the length,
this is only feasible for sequences up to length 7 or 8.
• We have implemented a method based on Gibbs sampling to match weight matrices to
upstream regions.
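A small sketch of the "all sequences of a certain length" option: it enumerates every sequence of length k and turns each one into a weight matrix with probability 1 for the given base at each position. The function names and the one-hot representation are assumptions for illustration. There are 4^k such sequences, which is why the set grows quickly with k.

```python
from itertools import product

BASES = "ACGT"


def all_kmers(k):
    """Return all 4**k nucleotide sequences of length k."""
    return ["".join(letters) for letters in product(BASES, repeat=k)]


def kmer_to_weight_matrix(kmer):
    """Represent a fixed sequence as a weight matrix (one dict per position)."""
    return [{b: 1.0 if b == base else 0.0 for b in BASES} for base in kmer]


# Number of putative motifs for a few lengths.
for k in (5, 7, 8):
    print(k, len(all_kmers(k)))
```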
Matching motifs to the upstream sequences
• A weight matrix W is matched to a sequence S = s1 s2 … sn in the following way:
• The entropy of position i in a weight matrix W is defined as:
E_i(W) = -\sum_{B \in \{A,C,G,T\}} w(i,B) \log w(i,B)
• If the sequence S is added to the weight matrix W, a new weight matrix W_S is obtained.
• We then define a score based on the changes in the entropies when a sequence S is added
to a weight matrix W:
\mathrm{score}(W, S) = \sum_{i=1}^{l} \left[ E_i(W) - E_i(W_S) \right]
• If the score is non-negative, that is, if the entropy decreases, a match is reported.
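A hedged sketch of the entropy-based score. The slides do not say how a sequence is "added" to a weight matrix, so this example assumes the matrix is stored as base counts per position (built from previously aligned sites) and that adding a sequence increments one count per position. The Shannon entropy form and all names are assumptions.

```python
import math

BASES = "ACGT"


def entropy(column_counts):
    """Shannon entropy of one matrix position, computed from base counts."""
    total = sum(column_counts.values())
    h = 0.0
    for base in BASES:
        p = column_counts.get(base, 0) / total
        if p > 0:
            h -= p * math.log(p)
    return h


def add_sequence(counts, seq):
    """Return a new count matrix with the sequence added (assumed update rule)."""
    updated = [dict(column) for column in counts]
    for column, base in zip(updated, seq):
        column[base] = column.get(base, 0) + 1
    return updated


def score(counts, seq):
    """Sum over positions of the entropy before minus the entropy after adding seq."""
    updated = add_sequence(counts, seq)
    return sum(entropy(old) - entropy(new) for old, new in zip(counts, updated))


def is_match(counts, seq):
    """Report a match when the score is non-negative."""
    return len(seq) == len(counts) and score(counts, seq) >= 0


# Hypothetical two-position motif built from ten aligned sites.
motif_counts = [{"A": 9, "C": 1}, {"G": 10}]
print(is_match(motif_counts, "AG"), is_match(motif_counts, "AT"))
```

Under these assumptions a sequence that agrees with the dominant base at each position lowers or preserves the entropy and is reported, while a sequence that introduces a rare base raises it and is rejected.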
Pre-processing and REDUCE
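[Flow diagram of the pre-processing steps. Boxes: Human genome; Mapping from probes to transcription starts; Upstream sequences; Masked upstream sequences; Putative motifs; Motif occurrences in upstream regions; Expression data for cortex stem cells; REDUCE indata.]
The REDUCE model:
A_g = C + \sum_{\mu \in M} F_\mu n_{\mu g}
where A_g is the expression level of gene g, n_{\mu g} is the number of occurrences of motif \mu in the upstream region of gene g, and F_\mu is the effect of motif \mu.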
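A minimal sketch of how motifs could be scored against expression data, assuming the linear model of Bussemaker, Li & Siggia (2001), in which the expression of a gene is a constant C plus an effect F times the motif count. The function names, the use of the squared correlation as the score, and the greedy selection loop are illustrative assumptions, not the implementation used for these slides.

```python
def fit_single_motif(expression, counts):
    """Least-squares fit of expression = C + F * count for one motif.

    Returns (F, C, r2), where r2 is the fraction of variance explained.
    """
    n_genes = len(expression)
    mean_a = sum(expression) / n_genes
    mean_n = sum(counts) / n_genes
    cov = sum((a - mean_a) * (n - mean_n) for a, n in zip(expression, counts))
    var_n = sum((n - mean_n) ** 2 for n in counts)
    var_a = sum((a - mean_a) ** 2 for a in expression)
    if var_n == 0 or var_a == 0:
        # The motif count or the expression is constant: nothing to explain.
        return 0.0, mean_a, 0.0
    effect = cov / var_n
    constant = mean_a - effect * mean_n
    return effect, constant, cov * cov / (var_n * var_a)


def select_motifs(expression, motif_counts, n_iter=2):
    """Greedily pick the motif that explains most of the remaining signal."""
    residual = list(expression)
    chosen = []
    for _ in range(n_iter):
        best = None
        for name, counts in motif_counts.items():
            effect, constant, r2 = fit_single_motif(residual, counts)
            if best is None or r2 > best[3]:
                best = (name, effect, constant, r2)
        if best is None or best[3] == 0.0:
            break
        name, effect, constant, r2 = best
        chosen.append((name, effect, r2))
        # Subtract the fitted contribution before the next iteration.
        residual = [a - constant - effect * n
                    for a, n in zip(residual, motif_counts[name])]
    return chosen


# Hypothetical data: four genes, two candidate motifs.
expression = [1.0, 3.0, 5.0, 7.0]
motif_counts = {"m1": [0, 1, 2, 3], "m2": [1, 0, 1, 0]}
print(select_motifs(expression, motif_counts))
```

A positive fitted effect corresponds to an activating motif and a negative one to a repressing motif, in the sense of the F column of the REDUCE output.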
REDUCE output
consensus             id      description  Dc2           F  probes  hits
NNNRRCCAATSRGNNN      M00287  NF-Y         0.0044   0.0661    1041  1279
NNNCGGCCATCTTGNCTSNW  M00069  YY1          0.0014   0.0363     300   314
NNRACAGGTGYAN         M00060  Sn           0.0013  -0.0345     368   374
NNNRGGNCAAAGKTCANNN   M00134  HNF-4        0.0008   0.0290     263   272
TWTTTAATTGGTT         M00424  NKX6-1       0.0007  -0.0234     428   457
KNNKNNTYGCGTGCMS      M00235  AhR/Arnt     0.0006  -0.0254     155   161
NANCACGTGNNW          M00123  c-Myc/Max    0.0006  -0.0243      50    50
NNBTNTNCTATTTNTT      M00092  BR-CZ2       0.0005   0.0233      92    94
NNGAATATKCANNNN       M00136  Oct-1        0.0005  -0.0230     213   244
consensus - A consensus sequence for the motif.
id - A unique id for each motif.
description - The transcription factor name.
Dc2 (Δχ²) - The significance of the motif.
F - The effect. A positive value indicates activation and a negative value repression.
probes - Number of probes with occurrences of the motif in their upstream regions.
hits - Total number of motif occurrences.
Visualizing REDUCE outdata
• REDUCE outdata can be visualized in a heatmap.
             m1         ...   mn
sample 1     F1(m1)     ...   F1(mn)
sample 2     F2(m1)     ...   F2(mn)
   .            .                .
   .            .                .
sample 81    F81(m1)    ...   F81(mn)
• The motifs in this heatmap are taken from TransFac.
• Green dots indicate repressing and red dots indicate activating motifs.
• The heatmap gives a clustering of samples on motifs.
Validation of results
• A bootstrap test was carried out to validate the results of REDUCE.
• 10 sets of randomized data. Each set consists of the same upstream sequences and
expression levels, but combined randomly.
[Figure: results for the actual data compared with the worst, best and mean of the randomized data sets; axis ticks from 0 to 35.]
• For most samples the results for the actual data are significantly better than for the
randomized data.
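The randomization check can be sketched as follows: keep the motif counts and the expression values, but pair them at random, and compare the fit on the real pairing with the fits on the shuffled ones. The squared correlation used as the score, the function names and the fixed seed are assumptions for illustration only.

```python
import random


def r_squared(expression, counts):
    """Fraction of expression variance explained by a linear fit on motif counts."""
    n_genes = len(expression)
    mean_a = sum(expression) / n_genes
    mean_n = sum(counts) / n_genes
    cov = sum((a - mean_a) * (n - mean_n) for a, n in zip(expression, counts))
    var_n = sum((n - mean_n) ** 2 for n in counts)
    var_a = sum((a - mean_a) ** 2 for a in expression)
    if var_n == 0 or var_a == 0:
        return 0.0
    return cov * cov / (var_n * var_a)


def randomization_test(expression, counts, n_sets=10, seed=0):
    """Score the real pairing and n_sets random pairings of the same values."""
    rng = random.Random(seed)
    actual = r_squared(expression, counts)
    randomized = []
    for _ in range(n_sets):
        shuffled = list(expression)
        rng.shuffle(shuffled)  # break the link between gene and expression
        randomized.append(r_squared(shuffled, counts))
    return actual, randomized


# Hypothetical data for six genes.
actual, randomized = randomization_test(
    [1.0, 3.0, 5.0, 7.0, 9.0, 11.0], [0, 1, 2, 3, 4, 5])
print(actual, min(randomized), max(randomized), sum(randomized) / len(randomized))
```

The worst, best and mean of the randomized scores correspond to the three randomized curves named in the figure legend.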
Analyzing REDUCE outdata
• More validation: The pictures below show the samples clustered on expression and on motifs.
• Analysis of significant motifs:
By analyzing the motifs found by REDUCE we hope to find motifs
that explain clusters of correlated genes.
For example, REDUCE found a TransFac motif in the samples
associated with the red area in the picture. It matches 18% of the
109 genes in the picture, and 4% of the other genes.
• Finding new motifs:
CCGGA
GCGGA
TCGCG
GCGAC
GCGCG
CCGCG
GCGGC
CGGCG
CCGCC
AGGCG
GCGCC
GCGGG
GGGCG
One iteration of REDUCE was run on all sequences
of length 5.
position    1     2     3     4     5     6     7
A           0.17  0     0     0     0     0.14  0.29
C           0.33  0.5   0.15  1     0     0.29  0.57
G           0.33  0.5   0.85  0     1     0.57  0.14
T           0.17  0     0     0     0     0     0
consensus   N     S     G     C     G     S     M
Applications
• Identify coregulated genes with potentially different expression profiles, using the
motifs found by REDUCE.
• Predict previously unknown motifs, or new properties of known ones.
Conclusions
As the bootstrap test shows, our results are significant in most cases. This suggests that the
motifs we find are biologically meaningful, and that the method is applicable to human
data.
Determining the role of each transcription factor requires more in-depth examination of
subgroups of genes. This is very interesting, but beyond the scope of this project.
Our results on human data had somewhat lower significance than previous results on yeast
presented in (Bussemaker, Li & Siggia, 2001). There are several possible causes for this:
• Data quality: Expression data, upstream regions.
• Gene regulation is probably more complicated in humans.
Even so, our results suggest that the REDUCE method might give useful information
about transcription factor binding sites in humans. Probably, this requires prior knowledge
about motifs and other methods such as clustering.