Transcript file

Discovery of transcription
networks
Lecture 3, Nov 2012
Regulatory Genomics
Weizmann Institute
Prof. Yitzhak Pilpel
Hierarchical clustering
Promoter motifs and expression profiles
CGGCCCCGCGGA
CTCCTCCCCCCCTTC
TGGCCAATCA
ATGTACGGGTG
AlignACE Example
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA
300-600 bp of upstream sequence per gene are searched in Saccharomyces cerevisiae.
http://statgen.ncsu.edu/~dahlia/journalclub/S01/jmb1205.pdf
Genes (in the order of the sequences above): HIS7, ARO4, ILV6, THR4, ARO1, HOM2, PRO3
A cluster of genes may contain a common motif in their promoters
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3
Find a needle in a haystack
AAAAGAGTCA
AAATGACTCA
AAGTGAGTCA
AAAAGAGTCA
GGATGAGTCA
AAATGAGTCA
GAATGAGTCA
AAAAGAGTCA
**********
Computational Identification of
Cis-regulatory Elements
Associated with Groups of
Functionally Related Genes in
Saccharomyces cerevisiae
J.D. Hughes, P.W. Estep, S. Tavazoie, G.M. Church
Journal of Molecular Biology (2000)
Example
GAL4 is one of the yeast genes
required for growth on galactose.
http://www.cifn.unam.mx/Computational_Genomics/old_research/BIOL2.html
Motif Representation
Aligned sites:
G1  A G A A G A
G2  A A A T G A
G3  G A A T G A
G4  A G A A G A
G5  A G A A G A

Frequency matrix (positions 1-6):
     1    2    3    4    5    6
A   0.8  0.4  1    0.6  0    1
C   0    0    0    0    0    0
G   0.2  0.6  0    0    1    0
T   0    0    0    0.4  0    0
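The frequency matrix follows directly from the five aligned sites G1-G5: each entry is the fraction of sites carrying a given base at a given position. A minimal Python sketch (the function name `frequency_matrix` is illustrative, not part of any tool mentioned in the lecture):

```python
from collections import Counter

# The five aligned sites from the slide (G1..G5).
sites = ["AGAAGA", "AAATGA", "GAATGA", "AGAAGA", "AGAAGA"]

def frequency_matrix(aligned_sites, alphabet="ACGT"):
    """Return {base: [frequency at each position]} for equal-length sites."""
    width = len(aligned_sites[0])
    matrix = {base: [] for base in alphabet}
    for j in range(width):
        # Count the bases found in column j across all sites.
        counts = Counter(site[j] for site in aligned_sites)
        for base in alphabet:
            matrix[base].append(counts[base] / len(aligned_sites))
    return matrix

pfm = frequency_matrix(sites)
for base in "ACGT":
    print(base, pfm[base])
```

The printed rows correspond to the A, C, G and T rows of the matrix.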
Finding New Motifs
• By lab work
• By comparison to known motifs in other species
• By searching upstream regions of a set of potentially co-regulated genes
The genes bound by the TF Abf1 can be clustered into several groups, some of which contain a motif
NCGTNNNNARTGAT
CGATGAGMTK
NCGTNNNNARTGAT & CGATGAGMTK
(sporulation experiment)
Search Space
• Size of search space: $(L - W + 1)^N \approx L^N$
• L = 600, W = 15, N = 10: size ≈ 10^27
• Exact search methods are not feasible
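To make the size concrete, the count can be evaluated directly. This sketch assumes one site of width W per sequence, in each of N sequences of length L, which is the reading of the formula used here:

```python
import math

def alignment_search_space(L, W, N):
    """Number of ways to choose one start position of a width-W site in each of N length-L sequences."""
    return (L - W + 1) ** N

size = alignment_search_space(L=600, W=15, N=10)
print(f"{size:.2e}")
print("order of magnitude:", math.floor(math.log10(size)))
```

Python integers are arbitrary precision, so the exact count is available. Enumerating that many alignments is what makes exact search impractical.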
AlignACE Example
Input Data Set
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3
300-600 bp of upstream sequence
per gene are searched in
Saccharomyces cerevisiae.
Based on slides from G. Church Computational Biology course at Harvard
K-means
• Start with random positions of centroids.
• Assign data points to centroids.
• Move centroids to center of assigned points.
• Iterate until the cost is minimal.
Iteration = 3
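The four K-means steps can be sketched in plain Python. This is a minimal illustration, not the implementation used for the lecture's data: the seed, the stopping rule (stop when assignments no longer change) and the example points are assumptions.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's k-means on a list of equal-length tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # random initial centroids
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centroids[c])))
            for pt in points
        ]
        if new_labels == labels:               # assignments stable: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:                        # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels

# Hypothetical data: two well-separated groups of three points each.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

For expression data, each point would be a gene's normalized profile across the time points.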
AlignACE Example
Initial Seeding
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3
TGAAAAATTC
GACATCGAAA
GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
TGTGAAGCAC
MAP score = -10.0
AlignACE Example
Sampling
Add?
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3
TGAAAAATTC
GACATCGAAA
GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
TGTGAAGCAC
How much better is the
alignment with this site
as opposed to without?
TCTCTCTCCA
TGAAAAATTC
GACATCGAAA
GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
TGTGAAGCAC
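One way to put a number on "how much better with this site" is a log-odds score of the candidate against the current alignment. The sketch below is a simplified stand-in, not AlignACE's actual sampling criterion (which is based on the MAP score); the uniform background and the pseudocount of 1 are assumptions.

```python
import math

def log_odds(candidate, aligned_sites, background=None, pseudocount=1.0):
    """Sum over positions of log2( P(base | motif column) / P(base | background) )."""
    background = background or {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    n = len(aligned_sites)
    score = 0.0
    for j, base in enumerate(candidate):
        count = sum(1 for site in aligned_sites if site[j] == base)
        # Pseudocounts (spread according to the background) keep unseen bases from scoring -infinity.
        p_motif = (count + pseudocount * background[base]) / (n + pseudocount)
        score += math.log2(p_motif / background[base])
    return score

# The seven seeded sites and the two candidate sites shown on the Sampling slides.
seed_sites = ["TGAAAAATTC", "GACATCGAAA", "GCACTTCGGC", "GAGTCATTAC",
              "GTAAATTGTC", "CCACAGTCCG", "TGTGAAGCAC"]
for candidate in ("TCTCTCTCCA", "TGAAAAAATG"):
    print(candidate, round(log_odds(candidate, seed_sites), 2))
```

A positive score means the candidate resembles the current alignment more than the background; a sampler would add such sites with higher probability.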
AlignACE Example
Sampling
Add?
Remove.
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3
TGAAAAATTC
GACATCGAAA
GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
TGTGAAGCAC
How much better is the
alignment with this site
as opposed to without?
TGAAAAAATG
TGAAAAATTC
GACATCGAAA
GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
TGTGAAGCAC
AlignACE Example
Column Sampling
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3
GACATCGAAA
GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
TGTGAAGCAC
How much better is the
alignment with this new
column structure?
GACATCGAAAC
GCACTTCGGCG
GAGTCATTACA
GTAAATTGTCA
CCACAGTCCGC
TGTGAAGCACA
AlignACE Example
The Best Motif
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3
AAAAGAGTCA
AAATGACTCA
AAGTGAGTCA
AAAAGAGTCA
GGATGAGTCA
AAATGAGTCA
GAATGAGTCA
AAAAGAGTCA
MAP score = 20.37
The MAP Score
• MAP: Maximal a priori log likelihood score
• This is what the algorithm tries to optimize.
• Measures the degree of over-representation of the motif in the input sequence relative to expectation in a random sequence.
The MAP Score
B, Γ = standard Beta and Gamma functions
N = number of aligned sites; T = number of total possible sites
F_jb = number of occurrences of base b at position j (F = sum)
G_b = background genomic frequency for base b
β_b = n × G_b for n pseudocounts (β = sum)
W = width of motif; C = number of columns in motif (W ≥ C)
The MAP Score
$\mathrm{MAP} \approx N \cdot \log_{10}\left(\frac{N}{\mathrm{exp}}\right)$
N = number of aligned sites
exp = expected number of sites in the input sequence under a random model
Example for a 7-base motif: $P = \left(\frac{1}{4}\right)^{7} \approx$ 1 site every 16,000 bases
For a sequence of 64,000 bases: exp = 4
Some examples
Motif (width)   Genes (1,000 bp promoter each)   Times found   Expected times   MAP score
AGGGTAA (7)     16                               10            ~1               10
GTAGATG (7)     16                               2             ~1               0.60206
CCGTGAG (7)     160                              10            ~10              0
GATGTA (6)      16                               2             ~4               -0.60206
AGGGTA (6)      16                               10            4                4.089354
A (1)           16                               2504          ~2500            1.73
AAAAAAA (7)     16                               5             ~1.5             2.614394
GGGGGGG (7)     16                               5             ~0.5             5
$\mathrm{MAP} \approx N \cdot \log_{10}\left(\frac{N}{\mathrm{exp}}\right)$
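The simplified score and several table rows can be checked numerically. This sketch assumes equiprobable, independent bases for the expectation and a base-10 logarithm (which reproduces the table values); the `strands` parameter is an assumption used to express searching both DNA strands.

```python
import math

def expected_sites(motif_width, total_bases, strands=1):
    """Expected occurrences of one specific word, assuming equiprobable, independent bases."""
    return strands * total_bases * 0.25 ** motif_width

def simplified_map(n_found, n_expected):
    """Simplified score from the slide: N * log10(N / exp)."""
    return n_found * math.log10(n_found / n_expected)

# A 7-mer in 16 promoters of 1,000 bp each: about one expected occurrence.
print(expected_sites(7, 16 * 1000))
# Rows from the table, using its rounded expectations.
print(simplified_map(10, 1))      # AGGGTAA
print(simplified_map(2, 1))       # GTAGATG
print(simplified_map(2, 4))       # GATGTA
# Whole genome (~12 Mb), both strands: roughly 1,500 expected 7-mer sites.
print(expected_sites(7, 12e6, strands=2))
```

Finding fewer sites than expected gives a negative score, and finding exactly the expected number gives zero.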
The MAP Score Properties
$\mathrm{MAP} \approx N \cdot \log_{10}\left(\frac{N}{\mathrm{exp}}\right)$
a) Motif should be “strong”
b) Input sequence can’t be too long
$P = \left(\frac{1}{4}\right)^{7} \approx$ 1 site every 16,000 bases
Genome length ~12 Mb: $\mathrm{exp} = \frac{1}{16000} \cdot 2 \cdot 12 \cdot 10^{6} = 1500$
Motif needs more than 1500 sites to get a positive MAP score:
$\mathrm{MAP} = N \cdot \log_{10}\left(\frac{N}{\mathrm{exp}}\right) = 1500 \cdot \log_{10}\left(\frac{1500}{1500}\right) = 0$
Problem: most transcription factor binding sites will only occur in dozens to hundreds of genes
Solution:
Cluster genes before searching for motifs
[Figure: genes plotted by expression at Time-point 1 vs. Time-point 3]
Group Specificity Score:
How well does a motif target the genes used to find it, compared with the whole genome?
[Venn diagram: all genome (N); motif ORFs group (S1); ORFs with best sites (S2); intersection (X)]
What is the probability of such a large intersection?
$\frac{\binom{S_1}{x}\binom{N - S_1}{S_2 - x}}{\binom{N}{S_2}}$
N = total number of ORFs in the genome (6226)
S1 = number of ORFs used to align the motif
S2 = number of targets in the genome (~100 ORFs with best ScanACE scores)
X = size of the intersection of S1 and S2
Group Specificity Score:
How well does a motif target the genes used to find it, compared with the whole genome?
[Venn diagram: all genome (N); motif ORFs group (S1); ORFs with best sites (S2); intersection (X)]
What is the probability of such a large intersection?
$S = \sum_{i=x}^{\min(S_1, S_2)} \frac{\binom{S_1}{i}\binom{N - S_1}{S_2 - i}}{\binom{N}{S_2}}$
N = total number of ORFs in the genome (6226)
S1 = number of ORFs used to align the motif
S2 = number of targets in the genome (~100 ORFs with best ScanACE scores)
X = size of the intersection of S1 and S2
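The group specificity score is a hypergeometric tail probability and can be computed exactly with integer binomials. A minimal sketch; the example numbers (30 ORFs in the group, 20 shared) are hypothetical:

```python
from math import comb

def group_specificity(N, S1, S2, x):
    """P(intersection >= x) when S2 ORFs are drawn at random from N, of which S1 are in the group."""
    total = comb(N, S2)
    tail = sum(comb(S1, i) * comb(N - S1, S2 - i) for i in range(x, min(S1, S2) + 1))
    return tail / total

# Hypothetical example: 30 ORFs in the cluster, 100 best ScanACE targets, 20 shared.
print(group_specificity(N=6226, S1=30, S2=100, x=20))
```

The sum is done in exact integer arithmetic and converted to a float only at the end, which avoids rounding problems for very small probabilities.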
Positional Bias Score:
Measures the degree of preference for positioning in a particular range upstream of the translational start.
[Histogram: # ORFs vs. position, from -600 bp to the translation start, in 50 bp windows]
Positional Bias Score:
[Histogram: # ORFs vs. position, from -600 bp to the translation start, in 50 bp windows]
• Find best 200 sites in the genome
• Restrict sites to segment of length [s = 600 bp] from translation start
• t = # sites in the segment
• Choose window size [w = 50 bp]
• m = # sites in the most enriched window
What is the probability to have m or more sites in a window of size w?
$\binom{t}{m}\left(\frac{w}{s}\right)^{m}\left(1 - \frac{w}{s}\right)^{t - m}$
Positional Bias Score:
[Histogram: # ORFs vs. position, from -600 bp to the translation start, in 50 bp windows]
• Find best 200 sites in the genome
• Restrict sites to segment of length [s = 600 bp] from translation start
• t = # sites in the segment
• Choose window size [w = 50 bp]
• m = # sites in the most enriched window
What is the probability to have m or more sites in a window of size w?
$P = \sum_{i=m}^{t} \binom{t}{i}\left(\frac{w}{s}\right)^{i}\left(1 - \frac{w}{s}\right)^{t - i}$
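The positional bias score is a binomial tail with success probability w/s. A minimal sketch; the example counts are hypothetical, and, like the formula, it does not correct for having picked the most enriched window:

```python
from math import comb

def positional_bias(t, m, w=50, s=600):
    """P(at least m of t sites fall in one given window of width w within a segment of length s)."""
    p = w / s
    return sum(comb(t, i) * p**i * (1 - p) ** (t - i) for i in range(m, t + 1))

# Hypothetical example: 40 of the best sites lie within 600 bp, 15 of them in one 50 bp window.
print(positional_bias(t=40, m=15))
```

A small value indicates that the sites are concentrated at a particular distance from the translation start.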
Lecture Topics
• Introduction to DNA regulatory motifs
• AlignACE - A motif finding algorithm
• Assessment of motifs
• AlignACE results on yeast genome
• Summary & Conclusions
Comparisons of motifs
• The CompareACE program finds the best alignment between two motifs and calculates the correlation between the two position-specific scoring matrices
• Similar motifs: CompareACE score > 0.7
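The correlation step can be illustrated with a Pearson correlation over the matrix entries. This is a simplified sketch, not the CompareACE program itself: it assumes the two motifs are already aligned at a fixed offset and have equal width, and it skips the search over alignments. The two matrices are the six-position examples used in this lecture.

```python
import math

def matrix_correlation(m1, m2, alphabet="ACGT"):
    """Pearson correlation between two equal-width matrices given as {base: [values per position]}."""
    x = [v for base in alphabet for v in m1[base]]
    y = [v for base in alphabet for v in m2[base]]
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

motif_1 = {"A": [0.8, 0.4, 1, 0.6, 0, 1], "C": [0, 0, 0, 0, 0, 0],
           "G": [0.2, 0.6, 0, 0, 1, 0],   "T": [0, 0, 0, 0.4, 0, 0]}
motif_2 = {"A": [0.4, 0.4, 1, 0.6, 0, 0], "C": [0, 0, 0, 0, 0, 1],
           "G": [0.6, 0.6, 0, 0, 1, 0],   "T": [0, 0, 0, 0.4, 0, 0]}
print(round(matrix_correlation(motif_1, motif_2), 2))
```

A full comparison would repeat this over the possible offsets (and strand orientations) and keep the best score.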
Clustering motifs by similarity
Motifs: A, B, C, D

Example position-specific matrices:
     1    2    3    4    5    6
A   0.8  0.4  1    0.6  0    1
C   0    0    0    0    0    0
G   0.2  0.6  0    0    1    0
T   0    0    0    0.4  0    0

     1    2    3    4    5    6
A   0.4  0.4  1    0.6  0    0
C   0    0    0    0    0    1
G   0.6  0.6  0    0    1    0
T   0    0    0    0.4  0    0

Pairwise CompareACE scores:
     A    B    C    D
A   1.0  0.9  0.1  0.0
B   0.9  1.0  0.2  0.1
C   0.1  0.2  1.0  0.8
D   0.0  0.1  0.8  1.0

Hierarchical clustering:
cluster 1: A, B
cluster 2: C, D
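The grouping into two clusters can be reproduced from the pairwise scores. The sketch below uses single-linkage grouping cut at the 0.7 similarity threshold as a stand-in; the linkage rule actually used for the hierarchical clustering is not stated in the slides.

```python
# Pairwise CompareACE scores from the table (symmetric, diagonal omitted).
scores = {("A", "B"): 0.9, ("A", "C"): 0.1, ("A", "D"): 0.0,
          ("B", "C"): 0.2, ("B", "D"): 0.1, ("C", "D"): 0.8}

def cluster_by_threshold(names, pair_scores, threshold=0.7):
    """Single-linkage grouping: motifs are joined whenever a pair scores above the threshold."""
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links up to the representative of the group.
        while parent[name] != name:
            name = parent[name]
        return name

    for (a, b), score in pair_scores.items():
        if score > threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values())

print(cluster_by_threshold("ABCD", scores))
```

With these scores, only the A-B and C-D pairs exceed the threshold, so the two groups match the clusters listed above.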
Most Group Specific Motifs
Most Positional Biased Motifs
Negative Controls
• 250 AlignACE runs on randomly created groups of ORFs, of size 20, 40, 60, 80, and 100 ORFs.
[Figure: MAP score distributions for random vs. real groups]
Negative Controls
10
MAP cut off of 10, Group Specificity cutoff of 10 :
False Positives = 10-20%
Positive Controls
• 29 listed TFs with five or more known binding sites were chosen.
• AlignACE was run on the upstream regions of the corresponding
regulated genes.
• An appropriate motif was found in 21/29 cases.
• False negative rate = ~ 10-30 %
The data
• Organism: Saccharomyces cerevisiae
• Microarray experiment: Affymetrix microarrays of 6,220 mRNAs
• Data: gathered by Cho et al.
• 15 time points, spanning about 4 hours across two cell cycles
• Genome sequence
Typical clusters of genes in the data
Variance normalization and clustering of expression time series
• The 3,000 most variable ORFs were chosen, based on the normalized dispersion in expression level of each gene across the time points (s.d./mean).
• The 15 time points were used to construct a 3,000 by 15 data matrix.
• The variance of each gene was normalized across the 15 conditions by subtracting the mean across the time points from the expression level of each gene and dividing by the standard deviation across the time points.
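The two preprocessing steps, selecting the most variable genes and normalizing each profile, can be sketched as follows. The expression values are hypothetical, and the use of the population standard deviation (rather than the sample one) is an assumption.

```python
import statistics

def select_most_variable(expression, top_n):
    """Keep the top_n genes ranked by normalized dispersion (s.d. / mean) across time points."""
    def dispersion(profile):
        return statistics.pstdev(profile) / statistics.mean(profile)
    ranked = sorted(expression, key=lambda gene: dispersion(expression[gene]), reverse=True)
    return ranked[:top_n]

def normalize(profile):
    """Subtract the mean and divide by the standard deviation across time points."""
    mean = statistics.mean(profile)
    sd = statistics.pstdev(profile)
    return [(value - mean) / sd for value in profile]

# Hypothetical expression levels for three genes at five time points.
expression = {
    "gene1": [10.0, 20.0, 30.0, 20.0, 10.0],
    "gene2": [100.0, 101.0, 99.0, 100.0, 100.0],
    "gene3": [1.0, 2.0, 3.0, 2.0, 1.0],
}
chosen = select_most_variable(expression, top_n=2)
normalized = {gene: normalize(expression[gene]) for gene in chosen}
print(chosen)
```

After normalization, gene1 and gene3 have the same profile even though their absolute levels differ tenfold, which is what lets clustering group genes by the shape of their response.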
Before and after mean-variance normalization
[Figure: expression of Gene1, Gene2 and Gene3 over time, before normalization (gene expression) and after normalization (normalized expression)]
Representation of expression data
[Figure: normalized expression data from microarrays; axes: Gene 1 and Gene 2 at Time-point 1]
K-means
• Start with random positions of centroids.
[Figure legend: position of data point Xi; position of centroid C; Iteration = 0]
Choosing K
Since we don’t know the number of clusters in advance, we need a way to estimate it. In order to choose the number of clusters K, the Sum of Squares of Errors is calculated for different K values. A clear break point indicates the “natural” number of clusters in the data.
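Given the sum of squared errors for each K, the break point can be located automatically. The rule below (largest drop-off in improvement) is one possible heuristic and an assumption here, as are the SSE values; the slides only describe reading the break point off the plot.

```python
def elbow(sse_by_k):
    """Return the K with the sharpest bend: the largest drop-off in SSE improvement."""
    ks = sorted(sse_by_k)
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        gain_before = sse_by_k[prev_k] - sse_by_k[k]     # improvement from reaching K
        gain_after = sse_by_k[k] - sse_by_k[next_k]      # improvement from going past K
        bend = gain_before - gain_after
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k

# Hypothetical SSE values: large gains up to K = 4, little afterwards.
sse = {1: 1000.0, 2: 700.0, 3: 450.0, 4: 150.0, 5: 140.0, 6: 132.0, 7: 126.0}
print(elbow(sse))
```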
[Figure: sum of squared errors vs. K]
Significant enrichment of functional categories within clusters
• Each gene was mapped into one of 199 functional categories (according to the MIPS database).
• For each cluster, P-values were calculated for observing the frequencies of genes from particular functional categories.
• There was significant grouping of genes within the same cluster.
The hyper-geometric score
P-values were calculated for finding at least (k) ORFs from a particular functional category within a cluster of size (n):
$P = 1 - \sum_{i=0}^{k-1} \frac{\binom{f}{i}\binom{g - f}{n - i}}{\binom{g}{n}}$
where (f) is the total number of genes within a functional category and (g) is the total number of genes within the genome (6,220).
P-values greater than 3 × 10^-4 are not reported, as their total expectation within the cluster would be higher than 0.05, as we tested 199 MIPS functional categories (ref. 15).
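The enrichment P-value, written as one minus the probability of seeing fewer than k category members, and the multiple-testing expectation can be computed as follows. The cluster and category sizes in the example are hypothetical.

```python
from math import comb

def enrichment_p_value(k, n, f, g=6220):
    """P(at least k of the n genes in a cluster belong to a category of f genes, out of g in total)."""
    below_k = sum(comb(f, i) * comb(g - f, n - i) for i in range(k))
    return 1 - below_k / comb(g, n)

# Hypothetical cluster: 100 genes, 12 of them from a category with 80 genes genome-wide.
p = enrichment_p_value(k=12, n=100, f=80)
print(p, p <= 3e-4)
# Expected number of categories passing the cutoff by chance when 199 are tested.
print(199 * 3e-4)
```

Because the result is formed as 1 minus a sum, very small P-values lose precision; summing the upper tail directly is the safer choice when values far below the cutoff matter.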
Challenge: generalize hypergeometric for more than two sets
[Figure: three overlapping sets: Chr V, functional group, expression cluster]
Sequence: MCB element consensus
[Figure axis: nucleotides]
This motif was later matched against the literature and confirmed to be the well-known MCB element, which controls the periodicity of the genes that peak at G1-S.
Occurrence of the motif in the ORFs of each cluster
[Figure: MCB element occurrence across clusters]
Location of the motif: MCB element
[Figure: site positions; x-axis: distance from ATG (bp)]
SCB element
This motif (later found to be the SCB element) was the second-highest-scoring motif within this cluster. The SCB element is also a well-known cis-regulatory element that contributes to the periodicity of the genes within the G1-S regulon.
ribonucleotide reductase
Determining the cell-cycle periodicity of
clusters
Fourier analysis allows ranking the genes according to their cell-cycle periodicity.
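One way to turn Fourier analysis into a ranking is to score each profile by the share of its signal power at the cell-cycle frequency. The score definition below is an illustrative assumption, not the lecture's exact measure, and the profiles use 16 evenly spaced hypothetical time points (rather than the 15 in the data) so that two cycles fall on an exact frequency bin.

```python
import cmath
import math

def periodicity_score(profile, cycles):
    """Fraction of the mean-removed signal power at the frequency with `cycles` periods in the series."""
    n = len(profile)
    mean = sum(profile) / n
    centered = [v - mean for v in profile]

    def power(freq):
        # Squared magnitude of one discrete Fourier coefficient.
        coeff = sum(v * cmath.exp(-2j * math.pi * freq * t / n) for t, v in enumerate(centered))
        return abs(coeff) ** 2

    total = sum(power(f) for f in range(1, n // 2 + 1))
    return power(cycles) / total if total else 0.0

# Hypothetical profiles over 16 time points spanning two cell cycles.
periodic = [math.sin(2 * math.pi * 2 * t / 16) for t in range(16)]
steady_trend = [0.1 * t for t in range(16)]
print(periodicity_score(periodic, cycles=2))
print(periodicity_score(steady_trend, cycles=2))
```

Sorting genes (or cluster means) by this score puts the cell-cycle periodic profiles first.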
[Figure: expression matrix ordered by cell-cycle periodicity, from high to low; example profiles plotted as expression vs. time]
Explain FFT… (including ORs
variability)
Periodic clusters
Non periodic clusters
And this was just the beginning…
In the case of two motifs derived from a cluster: how do they interact?
• Collaboration: co-occurrence (AND)
• Redundancy (OR)
http://longitude.weizmann.ac.il/publications/PilpelNatGent01.pdf
Logic of interaction of motifs
[Figure: expression level of genes with only M1, only M2, and M1 AND M2]
Synergistic motifs
A combination of two motifs is called ‘synergistic’ if the expression coherence score of the genes that have the two motifs is significantly higher than the scores of the genes that have either of the motifs
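The comparison behind this definition can be sketched by scoring each gene set separately. The slides do not define the expression coherence score, so the sketch assumes it is the fraction of gene pairs whose normalized profiles lie within a distance threshold; the threshold and the profiles are hypothetical, and no significance test is included.

```python
import math
from itertools import combinations

def expression_coherence(profiles, threshold):
    """Fraction of gene pairs whose normalized profiles lie within `threshold` (Euclidean distance)."""
    pairs = list(combinations(profiles, 2))
    if not pairs:
        return 0.0
    close = sum(1 for a, b in pairs if math.dist(a, b) <= threshold)
    return close / len(pairs)

# Hypothetical normalized profiles (3 time points) for genes grouped by motif content.
only_m1 = [(1.0, 0.0, -1.0), (-1.0, 1.0, 0.0), (0.0, -1.0, 1.0)]
only_m2 = [(1.0, -1.0, 0.0), (0.0, 1.0, -1.0), (-1.0, 0.0, 1.0)]
both = [(1.0, 0.0, -1.0), (0.9, 0.1, -1.0), (1.0, 0.1, -1.1)]

D = 0.5  # hypothetical distance threshold
print(expression_coherence(both, D), expression_coherence(only_m1, D), expression_coherence(only_m2, D))
```

In this toy example the genes carrying both motifs are tightly co-expressed while the single-motif sets are not, which is the pattern the definition describes.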
Mcm1
SFF
A global map of combinatorial expression control
Pilpel et al. Nature Genetics 2001
Heat-shock
Cell cycle
Sporulation
Diauxic shift
MAPK signaling
DNA damage
STRE
*High connectivity
*Hubs
*Alternative partners
in various conditions
PHO4
CCA
ALPHA1
mRPE8
mRPE57
AFT1
PDR
SWI5
MIG1
mRPE69
RAP1
mRPE72
GCN4
CSRE
SFF '
mRPE34
MCB
mRPE58
MCM1
mRPE6
RPN4
ECB
BAS1
SCB
LYS14
ABF1
SFF
STE12
ALPHA2
MCM1'
ALPHA1'
HAP234
mRRPE
PAC
mRRSE3
The human cell cycle
G1-Phase
S-Phase
G2-Phase
M-Phase
The proliferation cluster genes are cell cycle periodic
[Figure: gene expression of the proliferation cluster across samples; annotations: G1/S, G2/M, CHR]
Distribution of cell cycle periodicity
[Figure: proportion of genes vs. CCP score, for all genes and for proliferation genes]
The cell cycle motifs are enriched among the
proliferation cluster genes
[Figure: positions of the CHR, ELK1, CDE, E2F and NFY motifs relative to the TSS (50-200 bp scale); marked: genes not in the cluster, mutated in cancer]
Regulation of the proliferation cluster:
significant motifs
Motif   P-value
NFY     3.74 × 10^-11
CDE     5.31 × 10^-10
E2F     2.37 × 10^-9
ELK1    3.10 × 10^-6
CHR     1.42 × 10^-5
1,000 bp upstream
Sequence logo
326 MatInspector motifs
Potential regulatory motifs in 3’ UTRs
Finding 3’ UTR elements associated with high/low transcript stability (in yeast)
Entire genome
AAGCTTCC CCTACAAC