Protein synthesis - Swedish Institute of Computer Science

Detecting binding sites for transcription factors
by correlating sequence data with expression.
Erik Aurell
Adam Ameur
Jakub Orzechowski Westholm
in collaboration with AstraZeneca
Outline of the talk
• Introduction
• Data description
• The REDUCE method
• Results
• Applications and Conclusions
Ameur,
Orzechowski
11/3 2003
Introduction - the REDUCE method
• The aim is to find binding sites for transcription factors (motifs) in the human genome,
using a method developed at Rockefeller University (Bussemaker, Li & Siggia 2001).
• This method is called REDUCE and has previously only been applied to yeast data. We
will apply it to human data.
• The idea is to find motifs by correlating sequence and expression data.
• Input consists of expression data, sequence data and a set of putative motifs.
• Output is a list of significant motifs:

consensus             id      description  Dc2     F        probes  hits
NNNRRCCAATSRGNNN      M00287  NF-Y         0.0044   0.0661  1041    1279
NNNCGGCCATCTTGNCTSNW  M00069  YY1          0.0014   0.0363   300     314
NNRACAGGTGYAN         M00060  Sn           0.0013  -0.0345   368     374
NNNRGGNCAAAGKTCANNN   M00134  HNF-4        0.0008   0.0290   263     272
TWTTTAATTGGTT         M00424  NKX6-1       0.0007  -0.0234   428     457
KNNKNNTYGCGTGCMS      M00235  AhR/Arnt     0.0006  -0.0254   155     161
NANCACGTGNNW          M00123  c-Myc/Max    0.0006  -0.0243    50      50
NNBTNTNCTATTTNTT      M00092  BR-CZ2       0.0005   0.0233    92      94
NNGAATATKCANNNN       M00136  Oct-1        0.0005  -0.0230   213     244
Expression data
• Expression data is provided by AstraZeneca. It consists of 81 samples of human cerebral
cortex stem cells undergoing various treatments. Expression levels are measured on an
Affymetrix U133 chip.
• We visualize expression data in a heatmap.
          sample 1   ...   sample 81
gene 1    e(1,1)     ...   e(1,81)
gene 2    e(2,1)     ...   e(2,81)
  .         .                .
  .         .                .
gene n    e(n,1)     ...   e(n,81)
• It is possible to identify regions of correlated genes in the heatmap.
Sequence data
• In the REDUCE model, expression levels are explained by the number of times the
motifs occur in the upstream sequences of human genes.
• For this, sequences around the transcription starts are extracted. We take sequences in
the range [1000 bp upstream, 100 bp downstream].
• Transcription starts and genome data are provided by AstraZeneca.
• The upstream sequences are masked for repeats (with the program RepeatMasker).
• Putative motifs are matched to the resulting sequences.
[Figure: sequence window from -1000 bp to +100 bp relative to the transcription start; excerpt of an example sequence:]
GGAGTTCAAGACCAACCTAAGCAACAAAGTGAAACCACATCACTATAAATATATTCTTAAACGTGAAATGTTCACTCAGGCTTTTTAATATTTTATTTCATTT
• The motif TKAAA and its reverse complement TTTMA are matched in the example.
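The matching of a degenerate motif and its reverse complement can be sketched as follows. This is a minimal illustration that assumes plain IUPAC pattern matching with overlapping hits allowed; the function names are hypothetical and the slides do not say how occurrences were actually counted.

```python
import re

# IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}
# Complement of each IUPAC code (for example K = G/T pairs with M = A/C).
COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A",
    "R": "Y", "Y": "R", "S": "S", "W": "W",
    "K": "M", "M": "K", "B": "V", "V": "B",
    "D": "H", "H": "D", "N": "N",
}

def reverse_complement(motif):
    """Reverse complement of a motif written in IUPAC codes."""
    return "".join(COMPLEMENT[c] for c in reversed(motif))

def find_motif(motif, sequence):
    """0-based start positions of all matches, overlapping ones included."""
    pattern = "".join("[" + IUPAC[c] + "]" for c in motif)
    # The lookahead consumes no characters, so overlapping hits are reported.
    return [m.start() for m in re.finditer("(?=" + pattern + ")", sequence)]

# Example sequence excerpt from the slide.
seq = ("GGAGTTCAAGACCAACCTAAGCAACAAAGTGAAACCACATCACTATAAATATATTC"
       "TTAAACGTGAAATGTTCACTCAGGCTTTTTAATATTTTATTTCATTT")

forward_hits = find_motif("TKAAA", seq)
reverse_hits = find_motif(reverse_complement("TKAAA"), seq)
print("TKAAA at:", forward_hits)
print(reverse_complement("TKAAA"), "at:", reverse_hits)
```

The number of occurrences per upstream region is then `len(forward_hits) + len(reverse_hits)`, if hits on both strands are counted together (an assumption).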
Motifs
• Motifs are represented as weight matrices:

W(M) =
          A        C        G        T
pos 1     w(1,A)   w(1,C)   w(1,G)   w(1,T)
pos 2     w(2,A)   w(2,C)   w(2,G)   w(2,T)
  .         .        .        .        .
  .         .        .        .        .
pos n     w(n,A)   w(n,C)   w(n,G)   w(n,T)
w(i,B) is the probability that position i of the motif M is the nucleotide B.
• We generate the set of putative motifs as weight matrices. This can be done in several ways:
• One possibility is to use the matrices (about 300) in the TransFac database.
• Another possibility is to generate matrices of our own, for example for all sequences of a
certain length. Since the number of possible sequences grows exponentially with the length,
this is only feasible for sequences up to length 7 or 8.
• We have implemented a method based on Gibbs sampling to match weight matrices to
upstream regions.
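The exponential growth in the number of candidate sequences is easy to check: there are 4^k DNA sequences of length k. The sketch below enumerates them; the helper name `all_kmers` is hypothetical and not part of the authors' code.

```python
from itertools import product

def all_kmers(k):
    """Every DNA sequence of length k, in lexicographic order."""
    return ["".join(p) for p in product("ACGT", repeat=k)]

# Number of candidate sequences for each length.
for k in range(5, 9):
    print(k, 4 ** k)
# 5 1024
# 6 4096
# 7 16384
# 8 65536

kmers = all_kmers(5)
print(len(kmers), kmers[:3])
```

Each candidate has to be matched against every upstream region, so the cost multiplies by four for every extra base.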
Matching motifs to the upstream sequences
• A weight matrix W is matched to a sequence s1 s2 … sn in the following way:
• For each of the bases s1 s2 … sn we extract the corresponding weight matrix entry w(i,si)
and compute the sum
$$\text{Score} = \sum_{i=1}^{n} \log\left( \frac{w(i, s_i)}{b_{s_i}} \right)$$
Here $b_{s_i}$ is the background frequency of base $s_i$.
• An example: Assume we have the sequence AATCG and the matrix

          A      C      G      T
pos 1     0.5    0.1    0.3    0.1
pos 2     0.7    0.1    0.1    0.1
pos 3     0.4    0.2    0.2    0.2
pos 4     0.05   0.9    0.1    0.05
pos 5     0.25   0.25   0.25   0.25

If all background frequencies are 0.25, this gives the score (using natural logarithms)
$$\text{Score} = \log(0.5/0.25) + \log(0.7/0.25) + \log(0.2/0.25) + \log(0.9/0.25) + \log(0.25/0.25) \approx 2.78$$
• The score is then compared to a threshold value: Threshold  Maxscore 1   
l
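The score for the AATCG example can be reproduced with a few lines of code. This is a sketch that assumes natural logarithms and takes Maxscore to mean the highest score any sequence can reach against the matrix; the function names are hypothetical.

```python
import math

# Example matrix from the slide: one list per base, one entry per motif position.
# The pos 4 entries are copied as given; only the entries selected by AATCG are used.
MATRIX = {
    "A": [0.5, 0.7, 0.4, 0.05, 0.25],
    "C": [0.1, 0.1, 0.2, 0.9, 0.25],
    "G": [0.3, 0.1, 0.2, 0.1, 0.25],
    "T": [0.1, 0.1, 0.2, 0.05, 0.25],
}
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def log_odds_score(matrix, sequence, background):
    """Sum over positions of log(w(i, s_i) / b_{s_i})."""
    # Entries equal to 0 would need pseudocounts before taking the log.
    return sum(math.log(matrix[base][i] / background[base])
               for i, base in enumerate(sequence))

def max_score(matrix, background):
    """Best attainable score: take the best-scoring base at every position."""
    length = len(matrix["A"])
    return sum(max(math.log(matrix[b][i] / background[b]) for b in "ACGT")
               for i in range(length))

score = log_odds_score(MATRIX, "AATCG", BACKGROUND)
print(round(score, 2))                          # 2.78
print(round(max_score(MATRIX, BACKGROUND), 2))  # 3.47
```

A sequence window would count as a motif occurrence when its score exceeds the threshold derived from the maximum score.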
Pre-processing and REDUCE
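[Flowchart: pre-processing steps leading to the REDUCE input data]
• Human genome + mapping from probes to transcription starts → upstream sequences
• Upstream sequences → masked upstream sequences
• Masked upstream sequences + putative motifs → motif occurrences in upstream regions
• Motif occurrences + expression data for cortex stem cells → REDUCE input data
• REDUCE fits the model
$$A_g = C + \sum_{\mu \in M} F_\mu \, n_{\mu g}$$
where $A_g$ is the expression level of gene g, $n_{\mu g}$ is the number of occurrences of motif μ in the
upstream region of gene g, $F_\mu$ is the effect of motif μ, C is a constant, and M is the set of motifs.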
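The linear model on this slide, read here as expression = C + F × (motif count), can be tried out for a single motif with an ordinary least-squares fit. This is only a sketch under stated assumptions: the data are made-up toy numbers, the function name is hypothetical, and the significance is taken to be the reduction in the residual sum of squares, which may differ from how REDUCE normalizes it.

```python
def fit_single_motif(expression, counts):
    """Least-squares fit of expression[g] ~ C + F * counts[g] for one motif.

    Returns (C, F, reduction), where reduction is the drop in the residual
    sum of squares compared with a constant-only model.
    """
    n = len(expression)
    mean_a = sum(expression) / n
    mean_n = sum(counts) / n
    cov = sum((a - mean_a) * (c - mean_n) for a, c in zip(expression, counts))
    var = sum((c - mean_n) ** 2 for c in counts)
    F = cov / var                       # slope: effect of the motif
    C = mean_a - F * mean_n             # intercept
    total = sum((a - mean_a) ** 2 for a in expression)
    residual = sum((a - (C + F * c)) ** 2 for a, c in zip(expression, counts))
    return C, F, total - residual

# Toy data (not from the study): motif counts and log-expression for six genes.
counts = [0, 1, 2, 0, 3, 1]
expression = [0.10, -0.15, -0.42, 0.05, -0.61, -0.08]

C, F, reduction = fit_single_motif(expression, counts)
print(f"C = {C:.3f}, F = {F:.3f}, reduction = {reduction:.3f}")
```

A negative F in this toy fit would be read as repression, a positive F as activation. With several motifs, one option is to pick the motif with the largest reduction, subtract its contribution and repeat on the residuals.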
REDUCE output
consensus             id      description  Dc2     F        probes  hits
NNNRRCCAATSRGNNN      M00287  NF-Y         0.0044   0.0661  1041    1279
NNNCGGCCATCTTGNCTSNW  M00069  YY1          0.0014   0.0363   300     314
NNRACAGGTGYAN         M00060  Sn           0.0013  -0.0345   368     374
NNNRGGNCAAAGKTCANNN   M00134  HNF-4        0.0008   0.0290   263     272
TWTTTAATTGGTT         M00424  NKX6-1       0.0007  -0.0234   428     457
KNNKNNTYGCGTGCMS      M00235  AhR/Arnt     0.0006  -0.0254   155     161
NANCACGTGNNW          M00123  c-Myc/Max    0.0006  -0.0243    50      50
NNBTNTNCTATTTNTT      M00092  BR-CZ2       0.0005   0.0233    92      94
NNGAATATKCANNNN       M00136  Oct-1        0.0005  -0.0230   213     244

consensus - A consensus sequence for the motif.
id - A unique id for each motif.
description - The transcription factor name.
Dc2 - The significance of the motif.
F - The effect. A positive value indicates activation, a negative value repression.
probes - Number of probes with occurrences of the motif in their upstream regions.
hits - Total number of motif occurrences.
Visualizing REDUCE output
• REDUCE output can be visualized in a heatmap.
             m1        ...   mn
sample 1     F1(m1)    ...   F1(mn)
sample 2     F2(m1)    ...   F2(mn)
  .            .               .
  .            .               .
sample 81    F81(m1)   ...   F81(mn)
• The motifs in this heatmap are taken from TransFac.
• Green dots indicate repressing and red dots indicate activating motifs.
• The heatmap gives a clustering of samples on motifs.
Analyzing REDUCE output
• Validation: The pictures below show the samples clustered on expression and on motifs.
• Analysis of significant motifs:
By analyzing the motifs found by REDUCE we hope to find motifs
that explain clusters of correlated genes.
For example, REDUCE found a TransFac motif in the samples
associated with the red area in the picture. It matches 18% of the
109 genes in the picture, and 4% of the other genes.
• Finding new motifs:
CCGGA  GCGGA  TCGCG  GCGAC  GCGCG  CCGCG  GCGGC
CGGCG  CCGCC  AGGCG  GCGCC  GCGGG  GGGCG

One iteration of REDUCE was run on all sequences of length 5.

          A      C      G      T      consensus
pos 1     0.17   0.33   0.33   0.17   N
pos 2     0      0.5    0.5    0      S
pos 3     0      0.15   0.85   0      G
pos 4     0      1      0      0      C
pos 5     0      0      1      0      G
pos 6     0.14   0.29   0.57   0      S
pos 7     0.29   0.57   0.14   0      M
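The consensus letters for this weight matrix can be derived mechanically. The sketch below assumes a commonly used convention, which the slides do not state: a single base if its probability exceeds 0.5 and is at least twice that of the runner-up, otherwise a two-base IUPAC code if the top two bases together exceed 0.75, otherwise N. The function names are hypothetical.

```python
# Two-base IUPAC codes, keyed by the alphabetically sorted pair of bases.
TWO_BASE = {"AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M"}

def consensus_letter(probs):
    """IUPAC letter for one matrix position given {base: probability}."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    (b1, p1), (b2, p2) = ranked[0], ranked[1]
    if p1 > 0.5 and p1 >= 2 * p2:
        return b1                                   # one clearly dominant base
    if p1 + p2 > 0.75:
        return TWO_BASE["".join(sorted(b1 + b2))]   # two bases dominate
    return "N"                                      # no clear preference

def consensus(matrix):
    """Consensus string for a matrix given as {base: [p_pos1, p_pos2, ...]}."""
    length = len(matrix["A"])
    return "".join(consensus_letter({b: matrix[b][i] for b in "ACGT"})
                   for i in range(length))

# Weight matrix from the slide, one list per base.
MATRIX = {
    "A": [0.17, 0, 0, 0, 0, 0.14, 0.29],
    "C": [0.33, 0.5, 0.15, 1, 0, 0.29, 0.57],
    "G": [0.33, 0.5, 0.85, 0, 1, 0.57, 0.14],
    "T": [0.17, 0, 0, 0, 0, 0, 0],
}

print(consensus(MATRIX))  # NSGCGSM
```

Under this convention the result agrees with the consensus shown on the slide.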
Applications
• Identify coregulated genes with potentially different expression profiles, using the
motifs found by REDUCE.
• Predict previously unknown motifs, or new properties of known ones.
Conclusions
Our results on human data had somewhat lower significance than the previous results on yeast
presented by Bussemaker, Li & Siggia (2001). There are several possible causes for this:
• Data quality: Expression data, upstream regions.
• Hard to validate findings.
• Gene regulation is probably more complicated in humans.
Even so, our results suggest that the REDUCE method might give useful information
about transcription factor binding sites in humans. This probably requires prior knowledge
about motifs, combined with other methods such as clustering.