Promoter sequence analysis

Introduction to the Eukaryotic Promoter Database (EPD) and Signal Search Analysis (SSA)
Workshop on Regulatory Sequence Motif Discovery,
November 10th 2006.
The Linnaeus Centre for Bioinformatics, SLU-UU, Sweden.
Giovanna Ambrosini
Christoph Schmid
Components of transcriptional regulation
[Figure labels: distal transcription-factor binding sites (enhancer); cis-regulatory modules]
Wasserman and Sandelin, Nat Rev Genet 5, 276-287 (2004)
EPD
The Eukaryotic Promoter Database
Current Release 88 (SEPT-2006)
• founded in 1986 (Bucher and Trifonov; Nucleic Acids Res, 14, 10009-10026)
• originally based exclusively on the literature, carefully maintained and regularly updated
• in recent years has begun to incorporate mass sequencing data
• aims at high-precision mapping of transcription start sites (+/- 5 bp)
• promoter sequences from 139 different species, still with relatively low coverage (e.g. 1871 human entries)
• format of annotation of TSS:
DR EMBL; ZZ999999.1; HS28BP; [-19, 9].
(position ruler above the sequence: -15, -10, -5, 0, 5)
a c c c g c c t g c a c c c g a t t c A T G T G A G A A
• one or several alternative transcription start sites per gene
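The TSS annotation format can be illustrated with a short parsing sketch. It assumes that the bracketed pair gives the 5' and 3' extent of the sequence relative to the TSS, and that the 19 lowercase bases are the upstream part and the 9 uppercase bases the part from the TSS onward (19 + 9 = 28 bases, consistent with [-19, 9]). The function names `parse_tss_reference` and `split_at_tss` are hypothetical and not part of any EPD tool.

```python
import re

# Assumed layout: DR <database>; <accession>; [<entry name>;] [<first>, <last>].
DR_PATTERN = re.compile(
    r"DR\s+([\w-]+);\s*([^;]+);\s*(?:([^;\[]+);\s*)?"
    r"\[\s*(-?\d+)\s*,\s*(-?\d+)\s*\]\."
)

def parse_tss_reference(line):
    """Extract database, accession, optional entry name and the position range."""
    m = DR_PATTERN.match(line)
    if m is None:
        raise ValueError("not a DR line with a position range: " + line)
    db, accession, entry, first, last = m.groups()
    return {
        "db": db,
        "accession": accession.strip(),
        "entry": entry.strip() if entry else None,
        "first": int(first),
        "last": int(last),
    }

def split_at_tss(sequence, first, last):
    """Split a spaced sequence into its upstream part and the part from the TSS on."""
    bases = sequence.replace(" ", "")
    upstream_len = -first
    if len(bases) != upstream_len + last:
        raise ValueError("sequence length does not match the annotated range")
    return bases[:upstream_len], bases[upstream_len:]

ref = parse_tss_reference("DR EMBL; ZZ999999.1; HS28BP; [-19, 9].")
upstream, downstream = split_at_tss(
    "a c c c g c c t g c a c c c g a t t c A T G T G A G A A",
    ref["first"], ref["last"],
)
print(ref["first"], ref["last"], upstream, downstream)
```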
EPD format
ID   HS_RPS3    standard; multiple; VRT.
XX
AC   EP74176;
XX
DT   10-JAN-2003 (Rel. 73, created)
DT   13-SEP-2004 (Rel. 80, Last annotation update).
XX
DE   Ribosomal protein S3.
OS   Homo sapiens (human).
XX
HG   none.
AP   none.
NP   none.
XX
DR   GENOME; NT_033927.7; NT_033927; [-5333322, 12577805]. [ ENSEMBL; UCSC HapMap ]
DR   CLEANEX; HS_RPS3.
DR   EMBL; AP000744.4; [-90138, 35862]. [ EMBL; GenBank; DDBJ ]
DR   SWISS-PROT; P23396; RS3_HUMAN.
DR   RefSeq; NM_001005 [ DBTSS ].
DR   MIM; 600454.
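An entry in this flat-file layout, where each line starts with a two-letter line code followed by its value, can be read with a few lines of code. The following is a minimal sketch under that assumption; `parse_epd_entry` and the shortened `SAMPLE_ENTRY` are illustrative names, not part of the EPD distribution.

```python
from collections import defaultdict

# Shortened copy of the HS_RPS3 entry, used only for illustration.
SAMPLE_ENTRY = """\
ID   HS_RPS3    standard; multiple; VRT.
XX
AC   EP74176;
XX
DE   Ribosomal protein S3.
OS   Homo sapiens (human).
XX
DR   CLEANEX; HS_RPS3.
DR   MIM; 600454.
"""

def parse_epd_entry(text):
    """Group the lines of one flat-file entry by their two-letter line code."""
    fields = defaultdict(list)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("XX"):
            continue  # XX lines are spacers and carry no data
        code, _, value = line.partition(" ")
        # Collapse runs of spaces so values compare cleanly.
        fields[code].append(" ".join(value.split()))
    return dict(fields)

entry = parse_epd_entry(SAMPLE_ENTRY)
print(entry["AC"], entry["DR"])
```

Keeping a list per line code preserves repeated lines such as DT and DR in their original order.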
[Figure: frequency of full-length transcripts versus genomic position, with TSS determined by modelling Gaussian distributions (MADAP). Labelled intervals: 84046905-84046987 and 84047148-84047231; scale labels: 10 bp and 45 bp.]
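The idea of locating a TSS by modelling a Gaussian distribution can be sketched for a single cluster of transcript 5' ends. This is not MADAP itself, which is not reproduced here; it is only the maximum-likelihood fit of one Gaussian to position counts, and the counts below are invented for illustration.

```python
import math

def fit_gaussian(position_counts):
    """Return (mean, standard deviation) of positions weighted by their counts."""
    total = sum(position_counts.values())
    mean = sum(pos * n for pos, n in position_counts.items()) / total
    variance = sum(n * (pos - mean) ** 2 for pos, n in position_counts.items()) / total
    return mean, math.sqrt(variance)

# Hypothetical counts of full-length transcript 5' ends per genomic position.
counts = {100: 1, 101: 2, 102: 1}
tss_estimate, spread = fit_gaussian(counts)
print(round(tss_estimate), spread)
```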
The Eukaryotic Promoter Database EPD: the impact of in silico primer extension.
Schmid, C.D., Praz, V., Delorenzi, M., Perier, R. and Bucher, P. (2004) Nucleic Acids Res, 32,
D82-85.
Superior precision of in silico primer extension (ISPE)
                            [-10;10]   [-400;400]
EPD 70              36      0.83       1
RefSeq mRNA         933     0.32       0.95
Genome annot.       890     0.31       0.95
DBTSSv1 (human)     933     0.13       0.68
Eponine             494     0.12       0.46
virtual counts: (2 ** log ratio) - 1
New data sources for EPD
ChIP-chip: Kim et al. (2005) Nature, 436, 876-880
GEO: GSE2672 (remapped!)
ENSEMBL, chr12: 6.8 – 6.94 Mb
ChIP-chip data with insufficient resolution
FP Hs USP5 :+R EU:NC_000012.10 1+ 6831557; 74339.
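A functional position line like the one above can be split into its fields as follows. This is a sketch only: the field names are hypothetical, and reading "1+" as a number followed by the strand is an assumption. The reference sequence and the genomic position are the only fields whose meaning is evident from the line itself.

```python
import re

FP_PATTERN = re.compile(
    r"FP\s+(.+?)\s+:(\S+)\s+(\w+):(\S+)\s+(\d)([+-])\s+(\d+);\s*(\d+)\."
)

def parse_fp_line(line):
    """Split one FP line into named fields (names are illustrative)."""
    m = FP_PATTERN.match(line)
    if m is None:
        raise ValueError("unrecognised FP line: " + line)
    description, flags, db, sequence, number, strand, position, entry_id = m.groups()
    return {
        "description": description,
        "flags": flags,            # meaning not given in the slides
        "db": db,
        "sequence": sequence,
        "number": int(number),     # meaning not given in the slides
        "strand": strand,          # assumed to be the strand
        "position": int(position),
        "entry_id": entry_id,
    }

rec = parse_fp_line("FP Hs USP5 :+R EU:NC_000012.10 1+ 6831557; 74339.")
print(rec["sequence"], rec["strand"], rec["position"])
```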
[Figure: distribution of TSS; frequency (0.0 to 3.0) versus genomic position (6831200 to 6832000)]
EPD webserver:
http://www.epd.isb-sib.ch/
• find EPD entries using gene symbols, ...
– extraction of promoter sequences in user-defined ranges
– direct transfer to Signal Search Analysis (SSA)
• download of complete (reference!) promoter sets
http://www.epd.isb-sib.ch/seq_download.html
SSA
Signal Search Analysis
Giovanna Ambrosini
ISREC Swiss Institute for Experimental Cancer Research
• History: Signal Search Analysis is a method developed by P. Bucher in the early eighties (Bucher, P. and Bryan, B.; Nucleic Acids Res, 12(1 Pt 1), 287-305)
• Purpose: to discover and characterize sequence motifs that occur at constrained distances from physiologically defined sites in nucleic acid sequences
• Signal search analysis programs:
1. CPR: generates a “constraint profile” for the neighborhood of a functional site
2. SList: generates lists of over- and under-represented motifs in particular regions relative to a functional site
3. OProf: generates a “signal occurrence profile” for a particular motif
4. PatOP: optimizes a weight matrix description of a locally over-represented sequence motif
• Recent events: adaptation of the software to a new environment, SSA web server, application to promoters and translational start sites
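What a signal occurrence profile measures can be sketched in a few lines. This is not the OProf program; it is a simplified illustration that assumes uppercase sequences of equal length aligned on the functional site, a consensus motif written with a small subset of IUPAC codes, and windows given as indices into the aligned segments. All names are hypothetical.

```python
# Small subset of IUPAC nucleotide codes, enough for the example motif.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "W": "AT", "S": "CG", "R": "AG", "Y": "CT", "N": "ACGT"}

def matches_at(seq, motif, i):
    """True if the consensus motif matches seq starting at index i."""
    return all(seq[i + k] in IUPAC[code] for k, code in enumerate(motif))

def occurrence_profile(sequences, motif, window, step):
    """Fraction of sequences with a motif match starting in each window."""
    length = len(sequences[0])
    profile = []
    for start in range(0, length - window + 1, step):
        # A match must fit entirely inside the sequence.
        last_start = min(start + window, length - len(motif) + 1)
        hits = sum(
            any(matches_at(s, motif, i) for i in range(start, last_start))
            for s in sequences
        )
        profile.append((start, hits / len(sequences)))
    return profile

# Invented toy segments, not real promoters.
toy = ["GGTATAAAGG", "CCTATATACC", "GGGGGGGGGG"]
profile = occurrence_profile(toy, "TATAWA", window=5, step=5)
print(profile)
```

A peak in such a profile at a fixed distance from the aligned site is what marks a motif as locally over-represented.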
Locally Over-represented Sequence Motifs
Definition of a Locally Over-represented Sequence Motif
• Concept
A motif which preferentially occurs at a characteristic distance (range) from a certain type of functional position
Example: the TATA-box is a locally over-represented sequence motif of the -30 region of eukaryotic POL II transcription initiation sites
• Components of the formal motif description
1. A weight matrix or consensus sequence defining the motif
2. A cut-off value determining which subsequence constitutes a motif match
3. A preferred region of occurrence defined by 5’ and 3’ borders relative to a functional site, e.g. a transcription initiation site
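The three components of the formal description can be combined in a short sketch: a weight matrix scores each subsequence, the cut-off decides what counts as a match, and the region restricts where a match may start relative to the functional site. The matrix values, the cut-off and the convention that positions refer to the first base of the match are assumptions made for illustration only.

```python
def score(matrix, segment):
    """Sum the per-position weights of the bases in segment."""
    return sum(matrix[i][base] for i, base in enumerate(segment))

def has_match(sequence, site_index, matrix, cutoff, region):
    """True if a subsequence scoring >= cutoff starts inside the region.

    region is a (5' border, 3' border) pair relative to site_index.
    """
    width = len(matrix)
    five, three = region
    for rel in range(five, three + 1):
        start = site_index + rel
        if start < 0 or start + width > len(sequence):
            continue  # window falls outside the sequence
        if score(matrix, sequence[start:start + width]) >= cutoff:
            return True
    return False

# Toy 4-column matrix rewarding T-A-T-A; the weights are invented.
toy_matrix = [
    {"A": 0, "C": 0, "G": 0, "T": 1},
    {"A": 1, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 1},
    {"A": 1, "C": 0, "G": 0, "T": 0},
]
seq = "GGGTATAGGGGG"
print(has_match(seq, 8, toy_matrix, 4, (-6, -4)))
print(has_match(seq, 8, toy_matrix, 4, (-3, -1)))
```

The same sequence counts as a match or not depending on the region, which is the point of making the preferred region part of the motif description.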
Locally Over-represented Sequence Motifs
• Input Data Structure
Primary experimental data (Functional Position Set)
– annotated functional positions in DNA sequences stored in a database
• Work data
A DNA sequence matrix
– a set of fixed-length sequence segments with an experimentally defined site at a fixed internal position
The Motif Search Problem
• Statement
For a given DNA sequence matrix, find a locally optimal combination of
– quantitative motif description
– cut-off value
– region of preferential occurrence
using a given quality criterion
[Figure: TATA-box Signal Occurrence Profile for EPD and ENSEMBL Drosophila Promoters]
[Figure: CCAAT-box Signal Occurrence Profile for Vertebrate and ENSEMBL Drosophila Promoters]
SSA webserver: http://www.isrec.isb-sib.ch/ssa
• Provides access to precompiled functional position sets
– collections of transcription initiation sites (promoters) from eukaryotic species
– collections of translation initiation sites from a large variety of prokaryotic genomes
• Provides access to the four signal search analysis programs
Application to a bacterial translational control signal: the Shine-Dalgarno ribosome binding-site motif
• Compare the strength and location of the Shine-Dalgarno mRNA-rRNA interaction motif in E. coli and B. subtilis in a qualitative manner.
• Result: the Shine-Dalgarno interaction motif is stronger in B. subtilis than in E. coli and centered about two bases further upstream in the former species. More than a hundred bacterial genomes are now available to perform this type of analysis.
Studying transcription regulatory processes with specialized bioinformatics resources – an example
• Biological question:
Do genes that are generally up-regulated in cancer cells have different types of promoters?
• Procedure:
– Define cancer up- and down-regulated gene sets using CleanEx
– Extract corresponding promoter regions from EPD
– Analyse the signal content of the two promoter sequence sets using SSA
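The last step of the procedure, comparing signal content between two promoter sets, can be sketched as follows. The gene-set definition with CleanEx and the sequence extraction from EPD are not shown; the sequences are invented, the motif is a plain string rather than a weight matrix, and `motif_frequency` is a hypothetical helper, not an SSA program.

```python
def motif_frequency(promoters, motif, tss_index, region):
    """Fraction of promoters with a motif occurrence starting inside region.

    region is a (5' border, 3' border) pair relative to tss_index.
    """
    five, three = region
    lo = max(0, tss_index + five)
    # Extend the slice by the motif length so that a match starting
    # at the 3' border is still fully contained.
    hi = tss_index + three + len(motif)
    hits = sum(1 for seq in promoters if motif in seq[lo:hi])
    return hits / len(promoters)

# Invented toy sets standing in for cancer up- and down-regulated promoters.
cancer_up = ["GGGTATAAAGGCGC", "CCCCTATAAACCGG", "GGGGGGGGGGGGGG"]
cancer_down = ["GGGGGGGGGGGGGG", "CCCCCCCCCCCCCC", "GGGTATAAAGGCGC"]

up_freq = motif_frequency(cancer_up, "TATAAA", tss_index=10, region=(-8, -4))
down_freq = motif_frequency(cancer_down, "TATAAA", tss_index=10, region=(-8, -4))
print(up_freq, down_freq)
```

With real promoter sets, a difference between the two frequencies would still need a statistical test before it could be interpreted.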
Comparative analysis of cancer up- and down-regulated promoters
• Signals considered:
Signal       preferred position   approx. frequency
Initiator    0                    25% - 50%
TATA-box     -30 to -25           ~30%
GC-box       -200 to 0            ~50%
CCAAT-box    -200 to -50          ~20%
[Figure: Positional distribution of Initiator motif in cancer up- and down-regulated promoters]
[Figure: Positional distribution of TATA-boxes in cancer up- and down-regulated promoters]
[Figure: Positional distribution of GC-boxes in cancer up- and down-regulated promoters]
[Figure: Positional distribution of CCAAT-boxes in cancer up- and down-regulated promoters]
Comparative analysis of cancer up- and down-regulated promoters: Summary of results
• Signal content
Signal       Frequency in cancer-up genes   Frequency in cancer-down genes
Initiator    no change                      no change
TATA-box     up                             down
GC-box       no change                      no change
CCAAT-box    up                             down
• Next questions:
Are TATA-box and CCAAT-box binding factors dysregulated in cancer cells?
Or do cancer-specific transcription factors (binding to adjacent sites) preferentially interact with TATA-box and CCAAT-box binding factors?
Concluding remarks
• Signal search analysis has played an instrumental role in the characterization of eukaryotic promoter elements
• The method was originally developed for the analysis of eukaryotic promoters but has a much broader application potential (e.g. Shine-Dalgarno signal analysis)
• The rapidly growing collection of complete genomes and high-throughput methods for genomic analysis increase the statistical power to discover new motifs or to better characterize already known control signals
• Aligning sequence sets with respect to a well-characterized motif might allow the detection of binding sites of cooperating transcription factors positionally correlated with the known motif
• Confirm or challenge commonly accepted hypotheses originally derived from small sets