Modeling Promoter and Untranslated Regions in Yeast
YUJING LIANG, Computer Science and Engineering, University of California, San Diego
BORIS BABENKO, Computer Science and Engineering, University of California, San Diego
AARON STONESTROM, Division of Biology, University of California, San Diego
JAMAL BENHAMIDA, Computer Science and Engineering, University of California, San Diego
ELEAZAR ESKIN, Computer Science and Engineering, University of California, San Diego
Abstract
Transcriptional regulation is the primary form of gene regulation in eukaryotes. Approaches to identifying functional regions based on comparative genomics and microarray expression data have recently been applied in promoter and 3'-untranslated region (UTR) sequences in the yeast genome. Here we combine these approaches to construct a robust set of motifs active in the yeast genome. With this set we consider the combinatorial actions of these motifs and apply a linear model to explain observed expression. A deeper understanding of gene regulation in yeast is the first step toward understanding gene regulation and complex disease in higher organisms.
Panels: Comparative Genomics, Expression Analysis, Positional Analysis, Transcription Factor Binding Sites
The YKL182W gene promoter, with highlighted Transcription
Factor Binding Sites:
AAGTTATAGGGGAAAACTAAAAATATAAGAAAAAAAAAGGTATTGATTGATAAGGAAAAAGAACCAAGGGAAAAAT
ATAAAAAAGTACATTGGGCCTTTTCATACTTGTTATCACTTACATTACAAAGAAGAACAAACAACTTTTTTAAACG
AATTTTCTTTCTTCCTTTTTCAATTTATTAATTCTTTTTTTCCATACAATTCAAGGTCAAATATATTCTTATATGC
TCTTTGAATATTTCTGAAAAATATATAAAGAAAAGAAACTACAAGAACAT
The comparative genomics method uses aligned sequences from
several closely related species to find patterns that are conserved
across multiple genomes. A high rate of conservation suggests that
the pattern is functional and important.
Species1: TAATATCAAAATCAATCTCAAAATTACCACCGGTTAGAACTTGG
Species2: TAATGTCAAAATCAATCTCAAAGTTACCACCGGTTAGAACTTGG
Species3: TAATATCAAAATCAATCTCAAAATTACCACCAGTTAGAACCTGA
Species4: TAATATCGAAATCAATCTCAAAATTACCACCGGTTAGAACTTGG
Species5: TGATGTCAAAATCGATCTCGAAATTACCACCAGTCAGGACTTGG
Species6: TAATCTCAAAATCAATTTCAAAATTACCACCCGTCATAACTTGA
Species7: TAATTTCAAAGTCAATTTCAAAGTTACCACCGGTCAAGACTTGA
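The idea of scanning an alignment for conserved patterns can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a column counts as conserved when all species share the same nucleotide, and the function name `conserved_blocks` and the `min_len` cutoff are hypothetical. The original analysis may have used a different conservation criterion.

```python
# The seven aligned sequences shown above (equal length, no gaps).
ALIGNMENT = [
    "TAATATCAAAATCAATCTCAAAATTACCACCGGTTAGAACTTGG",
    "TAATGTCAAAATCAATCTCAAAGTTACCACCGGTTAGAACTTGG",
    "TAATATCAAAATCAATCTCAAAATTACCACCAGTTAGAACCTGA",
    "TAATATCGAAATCAATCTCAAAATTACCACCGGTTAGAACTTGG",
    "TGATGTCAAAATCGATCTCGAAATTACCACCAGTCAGGACTTGG",
    "TAATCTCAAAATCAATTTCAAAATTACCACCCGTCATAACTTGA",
    "TAATTTCAAAGTCAATTTCAAAGTTACCACCGGTCAAGACTTGA",
]


def conserved_blocks(aligned, min_len=6):
    """Return (start, block) for each run of fully conserved columns.

    Assumption: "conserved" means identical in every aligned sequence.
    """
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must have equal length")
    reference = aligned[0]
    blocks, start = [], None
    # Iterate one position past the end so a trailing run is closed too.
    for i in range(len(reference) + 1):
        conserved = i < len(reference) and all(
            s[i] == reference[i] for s in aligned
        )
        if conserved and start is None:
            start = i
        elif not conserved and start is not None:
            if i - start >= min_len:
                blocks.append((start, reference[start:i]))
            start = None
    return blocks


print(conserved_blocks(ALIGNMENT))
```

With these settings the only block of six or more fully conserved columns in the alignment above is TTACCACC, starting at 0-based column 23.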
Purpose
Our goal is to understand how
combinations of various Transcription
Factor Binding Sites (TFBS) on a gene
affect its expression under different
experimental conditions.
Linear Model
Goal: predict the contribution of each motif to a gene's expression level.
Each gene contains zero or more motifs.
Each motif (assumed to be a TFBS) has an "expression factor" score (positive or negative) for each experiment.
The expression of a gene is the sum of the scores of the motifs it contains.
Transcription factor binding sites are not distributed
uniformly in promoter regions.
The motif CGATGAG most frequently occurs between
60 and 100 nucleotides from the transcription
start site (the position where transcription of the gene begins).
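A positional profile like the one described above could be computed as follows. This is a minimal sketch, not the original pipeline. It assumes each promoter is given 5' to 3' and ends at the reference start position, and it searches the forward strand only. The function names and the 20-nucleotide bin size are hypothetical.

```python
from collections import Counter


def motif_distances(promoter, motif):
    """Distance from the first base of each motif occurrence to the
    3' end of the promoter (assumed to be the reference start site)."""
    distances = []
    pos = promoter.find(motif)
    while pos != -1:
        distances.append(len(promoter) - pos)
        pos = promoter.find(motif, pos + 1)  # allow overlapping hits
    return distances


def distance_histogram(promoters, motif, bin_size=20):
    """Count motif occurrences per distance bin across all promoters.

    Keys are the lower edge of each bin, e.g. 60 covers distances 60..79.
    """
    counts = Counter()
    for seq in promoters:
        for d in motif_distances(seq, motif):
            counts[d // bin_size * bin_size] += 1
    return dict(sorted(counts.items()))


# Hypothetical 80-nucleotide promoter with one occurrence of the motif.
example = "A" * 10 + "CGATGAG" + "T" * 63
print(distance_histogram([example], "CGATGAG"))
```

A peak in one or two bins of this histogram, rather than a flat profile, is what a non-uniform positional distribution would look like.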
Data Set
7 Yeast strains:
Saccharomyces cerevisiae
Saccharomyces bayanus
Saccharomyces castellii
Saccharomyces kudriavzevii
Saccharomyces mikatae
Saccharomyces kluyveri
Saccharomyces paradoxus
5769 promoters analyzed
1,730,700 DNA nucleotides analyzed per strain
Expression data come from a heat-shock microarray experiment
(Stanford Microarray Database, http://smd.stanford.edu/)
Significant Motifs Found
Pattern CGGTGGCAA appeared 15 times, and was conserved 15 times, MCS: 100.0
Pattern is conserved on: [YJL001W, YNL155W, YOR052C, YOR259C, YOR260W, YBL022C, YCL043C,
YCL042W, YCR092C, YCR093W, YDL148C, YDL147W, YDL070W, YDR427W, YER012W]
Pattern AGCTCATCGC appeared 29 times, and was conserved 27 times, MCS: 93.10344827586206
Pattern is conserved on: [YJL109C, YKL191W, YKR024C, YKR025W, YKR081C, YKR082W, YLR014C,
YLR015W, YLR106C, YLR107W, YLR336C, YMR049C, YNL248C, YNL247W, YOL125W, YPL094C,
YPL093W, YCR057C, YCR072C, YCR087C-A, YDR449C, YHR052W, YHR147C, YHR148W,
YHR170W, YIL127C, YIL126W]
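The MCS values listed above are consistent with the percentage of a pattern's occurrences that were conserved (15/15 gives 100, 27/29 gives about 93.1). The short Python sketch below reproduces that arithmetic. The function name `motif_conservation_score` is hypothetical, and the formula is inferred from the two reported lines rather than stated in the source.

```python
def motif_conservation_score(appeared, conserved):
    """Percentage of a pattern's occurrences that are conserved.

    Assumption: MCS = 100 * conserved / appeared, inferred from the
    reported output and not confirmed by the authors.
    """
    if appeared <= 0:
        raise ValueError("pattern must appear at least once")
    if not 0 <= conserved <= appeared:
        raise ValueError("conserved count must lie between 0 and appeared")
    return 100.0 * conserved / appeared


# Counts taken from the two patterns reported above.
for pattern, appeared, conserved in [
    ("CGGTGGCAA", 15, 15),
    ("AGCTCATCGC", 29, 27),
]:
    score = motif_conservation_score(appeared, conserved)
    print(f"Pattern {pattern}: MCS {score:.2f}")
```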
Grouping Motifs
Some of the discovered motifs are minor
variants or exact reverse complements of
each other. Thus, the motifs were grouped,
and each group was assigned a unique ID:
M0 : CGGTGGCAA, GGTGGCAAG, CGTGGC
M1 : AGCTCATCGC, AGCTCATAGC
M2 : GCTCATCG, CGATGAGC
M3 : AGCTCATCG
…
Annotating the Genes
We can now annotate the genes
with the Motif Groups that were
discovered:
Gene Name : Motif Groups
YPR111W : M248, M319, M74
YPR148C : M12, M153, M25
YPR194C : M127, M202, M41
YAL044W-A : M255, M27, M270, M49
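The grouping and annotation steps above might look like the following in Python. This sketch covers only the exact reverse-complement case. How "minor variants" were merged is not specified in the source, so that part is left out. The function names, the gene name `geneA`, and its sequence are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(_COMPLEMENT)[::-1]


def group_reverse_complements(motifs):
    """Put each motif in the same group as its exact reverse complement.

    The lexicographically smaller of the two strands is the group key.
    Group IDs (M0, M1, ...) follow first appearance in the input.
    """
    groups = {}
    for motif in motifs:
        key = min(motif, reverse_complement(motif))
        groups.setdefault(key, []).append(motif)
    return {f"M{i}": members for i, members in enumerate(groups.values())}


def annotate_genes(promoters, groups):
    """List, for each gene, the groups with a member in its promoter."""
    return {
        gene: [gid for gid, members in groups.items()
               if any(m in seq for m in members)]
        for gene, seq in promoters.items()
    }


groups = group_reverse_complements(["GCTCATCG", "CGATGAGC", "CAGGGG"])
# Hypothetical promoter sequence, for illustration only.
annotation = annotate_genes({"geneA": "TTTCGATGAGCTTT"}, groups)
print(groups)
print(annotation)
```

Because each group stores both strands, a plain substring search per member is enough to annotate a promoter without scanning its reverse strand separately.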
Assumption
We assume that motifs act independently of one another, and that
every occurrence of a motif is bound by the same transcription factor and has
the same effect on expression.
Limitations
The method finds only transcription factors that are activated or
deactivated in an experimental condition relative to
the control.
Calculating the Expression Factor
Gene     Expression Level     Motifs
Y01   =  0.456             =  M1 + M2 + M3
Y02   =  0.745             =  M2 + M4 + M16
Y03   =  0.834             =  M1 + M3 + M10
…
Together these equations form a linear system. The values of the
unknowns (M1, M2, …) can be estimated with any linear regression technique,
such as least squares.
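To make the least-squares step concrete, here is a small pure-Python sketch that builds a 0/1 gene-by-motif matrix and solves the normal equations. It is an illustration, not the authors' code. The gene names and expression values are hypothetical toy data, and the solver assumes the motif scores are uniquely determined. In practice a library routine such as `numpy.linalg.lstsq` would be the more usual choice.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("motif scores are not uniquely determined")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fit_motif_scores(gene_motifs, expression):
    """Least-squares motif scores via the normal equations A^T A x = A^T y."""
    motifs = sorted({m for ms in gene_motifs.values() for m in ms})
    genes = sorted(gene_motifs)
    # Design matrix: entry is 1 when the gene contains the motif, else 0.
    design = [[1.0 if m in gene_motifs[g] else 0.0 for m in motifs]
              for g in genes]
    y = [expression[g] for g in genes]
    k = len(motifs)
    ata = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    aty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    return dict(zip(motifs, solve_linear_system(ata, aty)))


# Hypothetical toy data: four genes, three motifs, made-up expression values.
gene_motifs = {"G1": {"M1", "M2"}, "G2": {"M2", "M3"},
               "G3": {"M1", "M3"}, "G4": {"M1", "M2", "M3"}}
expression = {"G1": 0.3, "G2": 0.1, "G3": 0.8, "G4": 0.6}
scores = fit_motif_scores(gene_motifs, expression)
print(scores)
```

The toy values were chosen so that the sums are exactly consistent, so the fit recovers one score per motif. With real microarray data the system is noisy and usually has far more genes than motifs, and the fitted scores minimize the squared error instead.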
Results
331 motifs were found.
Linear regression on the heat-shock expression data
identified 22 significantly active motifs.
Some motifs and their scores:
M66 : CCCCTT(AAGGGG), 1.2460824979780836
M218 : CAGGGG, 1.209783124842816
M259 : CCCTTAA(TTAAGGG), 1.1325379612649848
M264 : TAGGGG(CCCCTA), 0.8571825629506061
…