Detecting binding sites for transcription factors
by correlating sequence data with expression.
Erik Aurell
Adam Ameur
Jakub Orzechowski Westholm
in collaboration with AstraZeneca
Outline of the talk
• Biological background:
Protein synthesis
Gene expression
Gene regulation
• The REDUCE method:
Introduction
Data description
Methods
Results
Applications and Conclusions
Ameur,
Orzechowski
18/3 2003
Protein synthesis
[Figure: a gene on double-stranded DNA, with start and stop marked, is transcribed into mRNA, which is then translated into protein.]
DNA (5' to 3'):  TCGGTACCGATG TTCGGGTAAATATGCGTGTGAAATCGTCCGCGCTCCCTAATGTGTAAGTATAG CGTGTAGTTGCCCGCAAA
mRNA (5' to 3'): UCGGUACCGAUG UUCGGGUAAAUAUGCGUGUGAAAUCGUCCGCGCUCCCUAAUGUGUAAGUAUAG CGUGUAGUUGCCCGCAAA
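The relation between the DNA and mRNA sequences on this slide can be sketched in a few lines of Python. This is only an illustration, assuming the DNA is given as the coding strand written 5' to 3', so that the mRNA is the same sequence with T replaced by U; the function name `transcribe` is made up for the example.

```python
def transcribe(coding_strand: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence written 5'->3'."""
    # The mRNA has the same sequence as the coding strand, with U in place of T.
    return coding_strand.upper().replace("T", "U")


dna = "TCGGTACCGATG"  # first block of the DNA sequence on the slide
mrna = transcribe(dna)
print(mrna)
```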
Gene expression
• All cells in an organism contain the same DNA, but they still have different properties:
blood cells, skin cells, liver cells, etc. The properties are determined by the protein
concentrations in the cell.
• The amount of protein produced from a gene is called the expression level.
• Thus, the expression level of a gene differs between cells. Expression can also differ in the
same cell over time.
[Figure: a gene with its promoter region, start and stop, and a plot of its expression over time.]
• Expression levels are controlled by a gene regulation mechanism.
Gene regulation
• Control of transcription is the most important form of gene regulation.
• When a gene is transcribed a transcriptase binds to the promoter region.
[Figure: a gene with its promoter region, start and stop.]
• Other proteins, transcription factors, can also bind to the genome close to the promoter
region. The transcription factors can either attract or repel the transcriptase.
[Figure: a gene with a TF binding site next to the promoter region, start and stop.]
• Combinatorial control: TFs can work together to control the level of transcription.
[Figure: a gene with two TF binding sites next to the promoter region, start and stop.]
Regulatory networks
• Since TFs are proteins, they are products of genes. This implies that genes regulate
the expression of other genes, forming a regulatory network.
[Figure: a regulatory network in which genes activate (+) or repress (-) each other.]
• Understanding how the regulatory networks function is a big challenge in biology.
Introduction - the REDUCE method
• The aim is to find binding sites for transcription factors, motifs, in the human genome by
using a method developed at Rockefeller University (Bussemaker, Li & Siggia 2001).
• This method is called REDUCE and has previously only been applied to yeast data. We
will apply it to human data.
• The idea is to find motifs by correlating sequence and expression data.
• The input consists of expression data, sequence data and a set of putative motifs.
• Output is a list of significant motifs:
consensus             id      description  Δχ²      F        probes  hits
NNNRRCCAATSRGNNN      M00287  NF-Y         0.0044    0.0661  1041    1279
NNNCGGCCATCTTGNCTSNW  M00069  YY1          0.0014    0.0363   300     314
NNRACAGGTGYAN         M00060  Sn           0.0013   -0.0345   368     374
NNNRGGNCAAAGKTCANNN   M00134  HNF-4        0.0008    0.0290   263     272
TWTTTAATTGGTT         M00424  NKX6-1       0.0007   -0.0234   428     457
KNNKNNTYGCGTGCMS      M00235  AhR/Arnt     0.0006   -0.0254   155     161
NANCACGTGNNW          M00123  c-Myc/Max    0.0006   -0.0243    50      50
NNBTNTNCTATTTNTT      M00092  BR-CZ2       0.0005    0.0233    92      94
NNGAATATKCANNNN       M00136  Oct-1        0.0005   -0.0230   213     244
Expression data
• Expression data is provided by AstraZeneca. It consists of 81 samples of human cerebral
cortex stem cells undergoing various treatments. Expression levels are measured on an Affymetrix
U133 chip.
• We visualize expression data in a heatmap.
[Figure: heatmap of the expression matrix, with one row per gene (gene 1 ... gene n) and one column per sample (sample 1 ... sample 81). Entry e(i,j) is the expression of gene i in sample j.]
• It is possible to identify regions of correlated genes in the heatmap.
Sequence data
• In the REDUCE model, expression levels are explained by the number of times the
motifs occur in the upstream sequences of human genes.
• For this, sequences around the transcription starts are extracted. We take sequences in
the range [1000 bp upstream, 100 bp downstream].
• Transcription starts and genome data are provided by AstraZeneca.
• The upstream sequences are masked for repeats (with the program RepeatMasker).
• Putative motifs are matched to the resulting sequences.
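[Figure: the extracted region around a transcription start, from -1000 bp to +100 bp, with motif matches marked in the example sequence below.]
GGAGTTCAAGACCAACCTAAGCAACAAAGTGAAACCACATCACTATAAATATATTCTTAAACGTGAAATGTTCACTCAGGCTTTTTAATATTTTATTTCATTT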
• The motif TKAAA and its reverse complement TTTMA are matched in the example.
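Matching a consensus motif and its reverse complement can be sketched as follows. This is a minimal illustration, not the matching code used in the project: it assumes the motif is written in standard IUPAC nucleotide codes (K = G or T, M = A or C, and so on), and the helper names `reverse_complement` and `find_motif` are made up for the example.

```python
import re

# IUPAC nucleotide codes expressed as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# Complement of each IUPAC code (for example K = G/T pairs with M = A/C).
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(motif: str) -> str:
    """Complement every symbol, then reverse the motif."""
    return motif.translate(COMPLEMENT)[::-1]


def find_motif(motif: str, sequence: str):
    """Return (position, matched text) for every occurrence, overlaps included."""
    body = "".join(IUPAC[symbol] for symbol in motif)
    pattern = re.compile("(?=(" + body + "))")  # lookahead keeps overlapping hits
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence)]


sequence = ("GGAGTTCAAGACCAACCTAAGCAACAAAGTGAAACCACATCACTATAAATATATTCTTAAACG"
            "TGAAATGTTCACTCAGGCTTTTTAATATTTTATTTCATTT")
forward_hits = find_motif("TKAAA", sequence)
reverse_hits = find_motif(reverse_complement("TKAAA"), sequence)
print(forward_hits)
print(reverse_hits)
```

The number of occurrences per upstream region is then simply the combined length of the two hit lists.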
Motifs
• Motifs are represented as weight matrices:
W(M) =
         A        C        G        T
pos 1    w(1,A)   w(1,C)   w(1,G)   w(1,T)
pos 2    w(2,A)   w(2,C)   w(2,G)   w(2,T)
  .        .        .        .        .
pos n    w(n,A)   w(n,C)   w(n,G)   w(n,T)
w(i,B) is the probability that base i is the nucleotide B in the motif M.
• We generate the set of putative motifs as weight matrices. This can be done in several ways:
• One possibility is to use the matrices (about 300) in the TransFac database.
• Another possibility is to generate matrices of our own, for example for all sequences of a
certain length. Since the number of possible sequences grows exponentially with the length,
this is only feasible for sequences up to length 7 or 8.
• We have implemented a method based on Gibbs sampling to match weight matrices to
upstream regions.
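The exponential growth mentioned above is easy to see by enumeration: there are 4^k sequences of length k. The sketch below lists them and turns each into a trivial weight matrix with probability 1 for the base at each position. The names `all_kmers` and `kmer_to_weight_matrix` are made up for the example, and the 0/1 matrix is an assumption about how an exact sequence would be encoded.

```python
from itertools import product

BASES = "ACGT"


def all_kmers(k: int):
    """Every DNA sequence of length k; there are 4**k of them."""
    return ["".join(letters) for letters in product(BASES, repeat=k)]


def kmer_to_weight_matrix(kmer: str):
    """One row per position, with probability 1 for the base of the k-mer."""
    return [{b: 1.0 if b == base else 0.0 for b in BASES} for base in kmer]


# Number of candidate motifs for a few lengths.
for k in (5, 7, 8):
    print(k, 4 ** k)

five_mers = all_kmers(5)
example_matrix = kmer_to_weight_matrix(five_mers[0])
```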
Matching motifs to the upstream sequences
• A weight matrix W is matched to a sequence s1 s2 … sn in the following way:
• For each of the bases s1 s2 … sn we extract the corresponding weight matrix entry w(i,si)
and compute the following product
$$\mathrm{Score} = \prod_{i=1}^{n} \frac{w(i, s_i)}{b_{s_i}}$$
Here $b_{s_i}$ is the background frequency of base si.
• An example: Assume we have the sequence AATCG and the matrix
         A      C      G      T
pos 1    0.5    0.1    0.3    0.1
pos 2    0.7    0.1    0.1    0.1
pos 3    0.4    0.2    0.2    0.2
pos 4    0.05   0.9    0.1    0.05
pos 5    0.25   0.25   0.25   0.25
If all background frequencies are 0.25, this would give the score
$$\mathrm{Score} = \frac{0.5}{0.25} \cdot \frac{0.7}{0.25} \cdot \frac{0.2}{0.25} \cdot \frac{0.9}{0.25} \cdot \frac{0.25}{0.25} = 16.128$$
• The score is then compared to a threshold value: Threshold  Maxscore 1   l
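The worked example on this slide can be reproduced with a short script. This is a sketch under the slide's own assumptions (the example matrix, the sequence AATCG, and uniform background frequencies of 0.25); the function names are made up. It also computes the highest score the matrix can give, but how the threshold is derived from that maximum is not part of the sketch.

```python
BASES = "ACGT"

# Example weight matrix from the slide: one dict per position.
MATRIX = [
    {"A": 0.5,  "C": 0.1,  "G": 0.3,  "T": 0.1},
    {"A": 0.7,  "C": 0.1,  "G": 0.1,  "T": 0.1},
    {"A": 0.4,  "C": 0.2,  "G": 0.2,  "T": 0.2},
    {"A": 0.05, "C": 0.9,  "G": 0.1,  "T": 0.05},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
BACKGROUND = {base: 0.25 for base in BASES}


def score(matrix, sequence, background):
    """Product over positions of w(i, s_i) / b_{s_i}."""
    result = 1.0
    for row, base in zip(matrix, sequence):
        result *= row[base] / background[base]
    return result


def max_score(matrix, background):
    """Score of the best-matching sequence: take the best base at every position."""
    result = 1.0
    for row in matrix:
        result *= max(row[base] / background[base] for base in BASES)
    return result


example_score = score(MATRIX, "AATCG", BACKGROUND)
best_possible = max_score(MATRIX, BACKGROUND)
print(example_score, best_possible)
```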
Pre-processing and REDUCE
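[Figure: data-flow diagram. The human genome and the mapping from probes to transcription starts give the upstream sequences, which are masked for repeats. The putative motifs are matched to the masked upstream sequences, giving the motif occurrences in upstream regions. These, together with the expression data for cortex stem cells, are the REDUCE input data.]
REDUCE model:
$$A_g = C + \sum_{\mu \in M} F_\mu n_{\mu g}$$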
REDUCE iterations
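Let K be the set of putative motifs, M the set of motifs in the model, and G the set of all genes.
$A_g$ is the expression level of gene g, $n_{\mu g}$ the number of occurrences of motif $\mu$ upstream of gene g, $F_\mu$ the effect of motif $\mu$, and C a constant.
The error of the model is defined as
$$\chi^2_M = \sum_{g \in G} \Big( A_g - \big( C + \sum_{\mu \in M} F_\mu n_{\mu g} \big) \Big)^2$$
Let M' be the model $M \cup \{\mu'\}$.
Then we can define a score for each $\mu'$, $\Delta\chi^2_{\mu'} = \chi^2_M - \chi^2_{M'}$, that tells how well the
motif $\mu'$ fits the model M.
• The REDUCE method:
    M = empty
    K = {all putative motifs}
    while K is not empty
        Compute $\Delta\chi^2$ for every $\mu$ in K
        Let $\mu'$ be the motif that gives the best fit (highest $\Delta\chi^2$ value)
        M = M ∪ {$\mu'$}, K = K - {$\mu'$}
        Remove all motifs in K with low $\Delta\chi^2$ values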
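The loop above can be sketched in Python with NumPy. This is an illustration under stated assumptions, not the implementation used in the project: the constant and all effects are refitted jointly by ordinary least squares each time a motif is tried, "low" is interpreted as an improvement below a hypothetical cut-off `min_gain`, and the names `chi2` and `reduce_select` are made up. The demo data is synthetic.

```python
from itertools import product

import numpy as np


def chi2(A, N, motifs):
    """Squared error of the least-squares fit A_g ~ C + sum of F_mu * n_mu,g."""
    X = np.column_stack([np.ones(len(A))] + [N[:, m] for m in motifs])
    coef, *_ = np.linalg.lstsq(X, A, rcond=None)
    residual = A - X @ coef
    return float(residual @ residual)


def reduce_select(A, N, min_gain):
    """Greedy motif selection. A: expression per gene; N: genes x motifs counts."""
    M = []                        # motifs in the model (column indices of N)
    K = set(range(N.shape[1]))    # remaining putative motifs
    while K:
        base = chi2(A, N, M)
        # Improvement in model error for every candidate motif.
        gain = {m: base - chi2(A, N, M + [m]) for m in K}
        best = max(gain, key=gain.get)
        M.append(best)
        # Drop the chosen motif and every candidate with a low improvement.
        K = {m for m in K if m != best and gain[m] >= min_gain}
    return M


# Synthetic demo: 8 genes, 3 motifs; motif 0 activates, motif 2 represses,
# and motif 1 has no effect on expression.
N_demo = np.array(list(product([0, 1], repeat=3)), dtype=float)
A_demo = 1.0 + 2.0 * N_demo[:, 0] - 1.0 * N_demo[:, 2]
selected = reduce_select(A_demo, N_demo, min_gain=0.5)
print(selected)
```

On the demo data the motif with the strongest effect is picked first, the motif without any effect is pruned after the first iteration, and the repressing motif is added in the second.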
REDUCE output
consensus             id      description  Δχ²      F        probes  hits
NNNRRCCAATSRGNNN      M00287  NF-Y         0.0044    0.0661  1041    1279
NNNCGGCCATCTTGNCTSNW  M00069  YY1          0.0014    0.0363   300     314
NNRACAGGTGYAN         M00060  Sn           0.0013   -0.0345   368     374
NNNRGGNCAAAGKTCANNN   M00134  HNF-4        0.0008    0.0290   263     272
TWTTTAATTGGTT         M00424  NKX6-1       0.0007   -0.0234   428     457
KNNKNNTYGCGTGCMS      M00235  AhR/Arnt     0.0006   -0.0254   155     161
NANCACGTGNNW          M00123  c-Myc/Max    0.0006   -0.0243    50      50
NNBTNTNCTATTTNTT      M00092  BR-CZ2       0.0005    0.0233    92      94
NNGAATATKCANNNN       M00136  Oct-1        0.0005   -0.0230   213     244
consensus - A consensus sequence for the motif.
id - A unique id for each motif.
description - The transcription factor name.
Δχ² - The significance of the motif (the reduction in model error when the motif is added).
F - The effect. A positive value indicates activation and a negative value repression.
probes - Number of probes with occurrences of the motif in their upstream regions.
hits - Total number of motif occurrences.
Visualizing REDUCE output
• REDUCE output can be visualized in a heatmap.
[Figure: heatmap of the effects F, with one row per sample (sample 1 ... sample 81) and one column per motif (μ1 ... μn). The entry for sample s and motif μ is the effect F of μ in s.]
• The motifs in this heatmap are taken from TransFac.
• Green dots indicate repressing and red dots indicate activating motifs.
• The heatmap gives a clustering of samples on motifs.
Analyzing REDUCE output
• Validation: The pictures below show the samples clustered on expression and on motifs.
• Analysis of significant motifs:
By analyzing the motifs found by REDUCE we hope to find motifs
that explain clusters of correlated genes.
For example, REDUCE found a TransFac motif in the samples
associated with the red area in the picture. It matches 18% of the
109 genes in the picture, and 4% of the other genes.
• Finding new motifs:
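One iteration of REDUCE was run on all sequences of length 5.
Sequences of length 5 shown on the slide:
CCGGA, GCGGA, TCGCG, GCGAC, GCGCG, CCGCG, GCGGC, CGGCG, CCGCC, AGGCG, GCGCC, GCGGG, GGGCG
Weight matrix and consensus shown on the slide:
         A      C      G      T      consensus
pos 1    0.17   0.33   0.33   0.17   N
pos 2    0      0.5    0.5    0      S
pos 3    0      0.15   0.85   0      G
pos 4    0      1      0      0      C
pos 5    0      0      1      0      G
pos 6    0.14   0.29   0.57   0      S
pos 7    0.29   0.57   0.14   0      M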
Applications
• Identify coregulated genes with potentially different expression profiles, using the
motifs found by REDUCE.
• Predict previously unknown motifs, or new properties of known ones.
Conclusions
Our results on human data had somewhat lower significance than previous results on yeast
presented in (Bussemaker, Li & Siggia, 2001). There are several possible causes for this:
• Data quality: Expression data, upstream regions.
• Hard to validate findings.
• Gene regulation is probably more complicated in humans.
Even so, our results suggest that the REDUCE method might give useful information
about transcription factor binding sites in humans. This probably requires prior knowledge
about motifs, combined with other methods such as clustering.