Searching for Transcription Start Sites

Searching for Transcription Start Sites in Drosophila
Wilson Leung
01/2017
Outline
Transcription start sites (TSS) annotation goals
Promoter architecture in D. melanogaster
New D. melanogaster TSS datasets
Find the initial transcribed exon
Annotate putative transcription start sites
Search for core promoter motifs
Muller F element, heterochromatic, and euchromatic genes show similar expression levels
Riddle NC, et al. Enrichment of HP1a on Drosophila chromosome 4 genes creates an alternate chromatin structure critical for regulation in this heterochromatic domain. PLoS Genet. 2012 Sep;8(9):e1002954.
Goals for the transcription start sites (TSS) annotations
Research goal:
Identify motifs that enable Muller F element genes to function within a heterochromatic environment
Annotation goals:
Define search regions enriched in regulatory motifs
Define precise location of TSS if possible
Define search regions where TSS could be found
Document the evidence used to support the TSS annotations
Estimated evolutionary distances with respect to D. melanogaster
[Figure: phylogenetic tree of 20 Drosophila species: D. melanogaster, D. simulans, D. sechellia, D. yakuba, D. erecta, D. ficusphila, D. eugracilis, D. biarmipes, D. takahashii, D. elegans, D. rhopaloa, D. kikkawai, D. bipectinata, D. ananassae, D. pseudoobscura, D. persimilis, D. willistoni, D. mojavensis, D. virilis, D. grimshawi]
New species sequenced by modENCODE
Species           Substitutions per neutral site
D. ficusphila     0.80
D. eugracilis     0.76
D. biarmipes      0.70
D. takahashii     0.65
D. elegans        0.72
D. rhopaloa       0.66
D. kikkawai       0.89
D. bipectinata    0.99
Data from table 1 of the modENCODE comparative genomics whitepaper
GEP annotation projects
Phylogenetic tree based on the analysis of 13 Type IIB restriction endonucleases
[Figure: phylogenetic tree of 21 Drosophila species: D. simulans, D. sechellia, D. melanogaster, D. yakuba, D. santomea, D. erecta, D. eugracilis, D. biarmipes, D. takahashii, D. elegans, D. rhopaloa, D. ficusphila, D. kikkawai, D. ananassae, D. bipectinata, D. persimilis, D. pseudoobscura, D. willistoni, D. virilis, D. mojavensis, D. grimshawi]
Simulate restriction digests of 21 genomes
DNA fragments range from 21-33 bp in size
Calculate distance between two genomes based on number of shared fragments
Seetharam AS and Stuart GW. Whole genome phylogeny for 21 Drosophila species using predicted 2b-RAD fragments. PeerJ. 2013 Dec 23;1:e226.
D. biarmipes Muller F element TSS annotation projects
Reconcile coding region annotations submitted by GEP students
Reconciled annotations might be incorrect
Based on older FlyBase release
Possible misannotations
See the “Revised gene models report form” section of the Transcription Start Sites Project Report
Challenges with TSS annotations
Fewer constraints on untranslated regions (UTRs)
UTRs evolve more quickly than coding regions
Open reading frames and compatible phases of donor and acceptor sites do not apply to UTRs
Low percent identity (~50-70%) between D. biarmipes contigs and D. melanogaster UTRs
Most gene finders do not predict UTRs
Lack of experimental data
Cannot use RNA-Seq data to precisely define the TSS
TSS annotation workflow
1. Identify the ortholog
2. Note the gene structure in D. melanogaster
3. Annotate the coding exons
4. Classify the type of core promoter in D. melanogaster
5. Annotate the initial transcribed exon
6. Identify any core promoter motifs
7. Define TSS positions or TSS search regions
RNA Polymerase II core promoter
Initiator motif (Inr) contains the TSS
TFIID binds to the TATA box and Inr to initiate the assembly of the preinitiation complex (PIC)
Juven-Gershon T and Kadonaga JT. Regulation of gene expression via the core promoter and the basal transcriptional machinery. Dev Biol. 2010 Mar 15;339(2):225-9.
Peaked versus broad promoters
Peaked promoter (Single strong TSS)
Broad promoter (Multiple weak TSS)
50-300 bp
Kadonaga JT. Perspectives on the RNA polymerase II core promoter. Wiley Interdiscip Rev Dev Biol. 2012 Jan-Feb;1(1):40-51.
RNA-Seq biases introduced by library construction
cDNA fragmentation
Strong bias at the 3’ end
RNA fragmentation
More uniform coverage
Miss the 5’ and 3’ ends of the transcript
[Figure: RNA-Seq read count plotted along the gene span, from 5’ to 3’]
Wang Z, et al. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009 Jan;10(1):57-63.
Techniques for finding TSS
Identify the 5’ cap at the beginning of the mRNA
Cap Analysis of Gene Expression (CAGE)
RNA Ligase Mediated Rapid Amplification of cDNA Ends (RLM-RACE)
Cap-trapped Expressed Sequence Tags (5’ ESTs)
More information on these techniques:
Takahashi H, et al. CAGE (cap analysis of gene expression): a protocol for the detection of promoter and transcriptional networks. Methods Mol Biol. 2012;786:181-200.
Sandelin A, et al. Mammalian RNA polymerase II core promoters: insights from genome-wide studies. Nat Rev Genet. 2007 Jun;8(6):424-36.
Promoter architecture in Drosophila
Classify core promoter based on the Shape Index (SI)
Determined by the distribution of CAGE and 5’ RLM-RACE reads
Shape index is a continuum
Most promoters in D. melanogaster contain multiple TSS
Median width = 162 bp
~70% of vertebrate genes have broad promoters
Hoskins RA, et al. Genome-wide analysis of promoter architecture in Drosophila melanogaster. Genome Res. 2011 Feb;21(2):182-92.
Genes with peaked promoters show stronger spatial and tissue specificity
46% of genes with broad promoters are expressed in all stages of embryonic development
19% of genes with peaked promoters are expressed in all stages
Hoskins RA, et al. Genome-wide analysis of promoter architecture in Drosophila melanogaster. Genome Res. 2011 Feb;21(2):182-92.
Peaked and broad promoters are enriched in different core promoter motifs
Rach EA, et al. Motif composition, conservation and condition-specificity of single and alternative transcription start sites in the Drosophila genome. Genome Biol. 2009;10(7):R73.
Using modENCODE data to classify the type of core promoter in D. melanogaster
Only a subset of the modENCODE data are available through FlyBase
D. melanogaster GEP UCSC Genome Browser [Aug. 2014 (BDGP Release 6) assembly]
FlyBase gene annotations (release 6.13)
modENCODE TSS (Celniker) annotations
DNase I hypersensitive sites (DHS)
CAGE and RAMPAGE TSS datasets
9-state and 16-state chromatin models
Transcription factor binding site (TFBS) HOT spots
9-state chromatin model
Kharchenko PV, et al. Comprehensive analysis of the chromatin landscape in Drosophila melanogaster. Nature. 2011 Mar 24;471(7339):480-5.
DNaseI Hypersensitive Sites (DHS) correspond to accessible regions
Aasland R and Stewart AF. Analysis of DNaseI hypersensitive sites in chromatin by cleavage in permeabilized cells. Methods Mol Biol. 1999;119:355-62.
Ho JW, et al. Comparative analysis of metazoan chromatin organization. Nature. 2014 Aug 28;512(7515):449-52.
modENCODE TSS annotations
Two sets of modENCODE TSS predictions
TSS (Celniker)
Most recent dataset produced by modENCODE
Available on the GEP UCSC Genome Browser
TSS (Embryonic)
Older dataset available from FlyBase GBrowse
Use TSS (Celniker) dataset as the primary evidence
Hoskins RA, et al. Genome-wide analysis of promoter architecture in Drosophila melanogaster. Genome Res. 2011 Feb;21(2):182-92.
Classify the D. melanogaster core promoter for each unique TSS
TSS classification       # Annotated TSS    # DHS positions
Peaked                   1                  0
Peaked                   0                  1
Peaked                   1                  1
Intermediate             ≤1                 >1
Intermediate             >1                 ≤1
Broad                    >1                 >1
Insufficient evidence    0                  0
Consider DHS positions within a 300 bp window surrounding the start of the D. melanogaster transcript
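The classification rules in this table can be written as a small lookup. This is a minimal sketch, not part of the GEP workflow: the function name is hypothetical, and it assumes the three Peaked rows are (1 TSS, 0 DHS), (0 TSS, 1 DHS), and (1 TSS, 1 DHS), with both counts already restricted to the 300 bp window.

```python
def classify_core_promoter(num_annotated_tss: int, num_dhs_positions: int) -> str:
    """Classify a D. melanogaster core promoter from two evidence counts.

    Both counts are assumed to come from the 300 bp window surrounding
    the start of the transcript. The function name is hypothetical.
    """
    if num_annotated_tss < 0 or num_dhs_positions < 0:
        raise ValueError("counts must be non-negative")
    # No annotated TSS and no DHS position: nothing to classify.
    if num_annotated_tss == 0 and num_dhs_positions == 0:
        return "Insufficient evidence"
    # More than one of each kind of evidence.
    if num_annotated_tss > 1 and num_dhs_positions > 1:
        return "Broad"
    # At most one of each (and at least one in total).
    if num_annotated_tss <= 1 and num_dhs_positions <= 1:
        return "Peaked"
    # Remaining cases: one count is >1 and the other is <=1.
    return "Intermediate"


for counts in [(1, 0), (1, 3), (2, 2), (0, 0)]:
    print(counts, classify_core_promoter(*counts))
```

Ordering the checks from the most specific case to the catch-all keeps each rule a single comparison.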
DEMO: Classify the core promoter of D. melanogaster Rad23
Additional DHS data from different stages of embryonic development
DHS data produced by the BDTNP project
Evidence tracks:
Detected DHS Positions (Embryos)
DHS Read Density (Embryos)
Thomas S, et al. Dynamic reprogramming of chromatin accessibility during Drosophila embryo development. Genome Biol. 2011;12(5):R43.
New TSS evidence tracks available in FlyBase release 6.11
Batut P, Dobin A, Plessy C, Carninci P, Gingeras TR. High-fidelity promoter profiling reveals widespread alternative promoter usage and transposon-driven developmental gene expression. Genome Res. 2013 Jan;23(1):169-80.
Benefits of RAMPAGE
RAMPAGE = RNA Annotation and Mapping of Promoters for Analysis of Gene Expression
CAGE only allows sequencing of short sequence tags (~27 bp) near the 5’ cap
Ambiguous read mapping to large parts of the genome
RAMPAGE produces long paired-end reads instead of short sequence tags
Developed novel algorithm to identify TSS clusters
Used paired-end information during peak calling
Used Cufflinks to produce partial transcript models
Batut P, Gingeras TR. RAMPAGE: promoter activity profiling by paired-end sequencing of 5'-complete cDNAs. Curr Protoc Mol Biol. 2013 Nov 11;104:Unit 25B.11.
Signals in the FlyBase RAMPAGE and MachiBase TSS tracks are off by one base
“FlyBase: GBrowse Tracks” page on the FlyBase Wiki
http://flybase.org/wiki/FlyBase:GBrowse_Tracks#Aligned_Evidence
RAMPAGE results on the GEP UCSC Genome Browser
Lifted RAMPAGE results from release 5 to release 6
Results from 36 developmental stages
Combined TSS peak call from all samples
Available under the “Expression and Regulation” section
Analysis of MachiBase and modENCODE CAGE data using CAGEr
Bioconductor package developed by RIKEN
Standardize analysis of CAGE data compared to the custom protocols used by modENCODE
Map datasets against release 6 assembly
37 modENCODE CAGE samples; 7 MachiBase samples
Define TSS and promoters for each sample
Define consensus promoters across all samples
Haberle V, et al. CAGEr: precise TSS data retrieval and high-resolution promoterome mining for integrative analyses. Nucleic Acids Res. 2015 Apr 30;43(8):e51.
TSS classifications based on CAGEr
[Figure: genome browser views of a peaked, an intermediate, and a broad promoter, each showing the FlyBase Genes, modENCODE CAGE Peaks, and modENCODE CAGE (Plus or Minus) tracks]
Changes in the dominant TSS of Rad23 across different developmental stages
[Figure: CAGE tag clusters for Rad23 by stage of development; labeled sample: adult females]
Evidence for TSS annotations (in general order of importance)
1. Experimental data
RNA-Seq
RNA Pol II ChIP-Seq (D. biarmipes only)
2. Conservation
Type of TSS (peaked/intermediate/broad) in D. melanogaster
Sequence similarity to initial exon in D. melanogaster
Sequence similarity to other Drosophila species (Multiz)
3. Core promoter motifs
Inr, TATA box, etc.
Determine the gene structure in D. melanogaster
FlyBase: GBrowse
Gene Record Finder: Transcript Details
UTR
CDS
Identify the initial transcribed exon using NCBI blastn
Retrieve the sequences of the initial exons from the Transcript Details tab of the Gene Record Finder
Use placement of the flanking exons to reduce the size of the search region if possible
Increase sensitivity of nucleotide searches
Change Program Selection to blastn
Change Word size to 7
Change Match/Mismatch Scores to +1, -1
Change Gap Costs to Existence: 2, Extension: 1
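The settings above are for the NCBI web form. As a rough sketch, the same settings map onto command-line options of a local BLAST+ install. The file names here are hypothetical, and whether a given BLAST+ version accepts this exact gap cost combination is an assumption, so the code only builds and prints the command rather than running it.

```python
import shlex


def sensitive_blastn_command(query_fasta: str, subject_fasta: str) -> list:
    """Build a BLAST+ blastn command line with the more sensitive settings.

    The two file names are supplied by the caller; nothing is executed here.
    """
    return [
        "blastn",
        "-task", "blastn",        # Program Selection: blastn (not megablast)
        "-query", query_fasta,    # initial exon from D. melanogaster
        "-subject", subject_fasta,  # contig or search region being annotated
        "-word_size", "7",        # Word size: 7
        "-reward", "1",           # Match score: +1
        "-penalty", "-1",         # Mismatch score: -1
        "-gapopen", "2",          # Gap existence cost: 2
        "-gapextend", "1",        # Gap extension cost: 1
    ]


# Hypothetical file names, for illustration only.
cmd = sensitive_blastn_command("Rad23_initial_exon.fasta", "contig19.fasta")
print(shlex.join(cmd))
```

Keeping the command as a list makes it straightforward to pass to `subprocess.run` once the settings have been confirmed against the installed version.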
Optimize alignment parameters based on expected levels of conservation
Derive alignment scores using information theory
Relative entropy of target and background frequencies
Match +2, Mismatch -3 optimized for 90% identity
Match +1, Mismatch -1 optimized for 75% identity
States DJ, et al. Improved Sensitivity of Nucleic Acid Database Searches Using Application-Specific Scoring Matrices. Methods 3:66-70.
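The link between a match/mismatch pair and its target identity can be checked numerically. This is a sketch under stated assumptions: uniform background base frequencies (0.25 each), ungapped alignments, and the standard log-odds relation in which the scale λ satisfies 0.25·exp(λ·match) + 0.75·exp(λ·mismatch) = 1. The function name is hypothetical.

```python
import math


def target_identity(match: int, mismatch: int) -> float:
    """Estimate the percent identity a match/mismatch pair is tuned for.

    Assumes uniform base frequencies, so two random bases match with
    probability 0.25 and mismatch with probability 0.75.
    """
    p_match, p_mismatch = 0.25, 0.75
    if p_match * match + p_mismatch * mismatch >= 0:
        raise ValueError("expected score per position must be negative")

    def excess(lam: float) -> float:
        # Zero when the implied target frequencies sum to 1.
        return (p_match * math.exp(lam * match)
                + p_mismatch * math.exp(lam * mismatch) - 1.0)

    # excess() is negative just above 0 and positive for large lambda,
    # so bracket the positive root and bisect.
    lo, hi = 1e-6, 1.0
    while excess(hi) <= 0:
        hi *= 2
    for _ in range(200):
        mid = (lo + hi) / 2
        if excess(mid) > 0:
            hi = mid
        else:
            lo = mid
    lam = (lo + hi) / 2
    # Target frequency of matched positions = expected fraction identical.
    return p_match * math.exp(lam * match)


print(round(target_identity(1, -1), 3))
print(round(target_identity(2, -3), 3))
```

Under these assumptions, +1/-1 gives exactly 75% and +2/-3 gives a value just under 90%, in line with the figures on the slide.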
RNA PolII ChIP-Seq tracks (D. biarmipes only)
Show regions that are enriched in RNA Polymerase II compared to input DNA
[Figure: genome browser view with the Gene Models, RNA-Seq, RNA PolII Peaks, and RNA PolII Enrichment tracks]
Use core promoter motifs to support TSS annotations
Some sequence motifs are enriched in the region (~300 bp) surrounding the TSS
Some motifs (e.g., Inr, TATA) are well-characterized
Other motifs are identified based on computational analysis
Presence of core promoter motifs can be used to support the TSS annotations
Inr motif (TCAKTY) overlaps with the TSS (-2 to +4)
Absence of core promoter motifs is a negative result
Most D. melanogaster TSS do not contain the Inr motif
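A search for the Inr consensus can be sketched with a regular expression. This is an illustration only, not the method behind the GEP motif tracks: the function name and the test sequence are made up, only the plus strand is scanned, and the +1 base is taken to be the third base of TCAKTY because the motif spans -2 to +4 with no position 0.

```python
import re

# IUPAC nucleotide codes translated to regular expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "K": "[GT]", "Y": "[CT]", "R": "[AG]", "M": "[AC]",
    "S": "[CG]", "W": "[AT]", "N": "[ACGT]",
}


def inr_tss_candidates(sequence: str, motif: str = "TCAKTY",
                       plus_one_index: int = 2) -> list:
    """Return 1-based plus-strand coordinates of the +1 base of each match.

    plus_one_index is the 0-based offset of the +1 base inside the motif
    (the A of TCAKTY).
    """
    regex = "".join(IUPAC[base] for base in motif.upper())
    # A lookahead reports overlapping matches as well.
    pattern = re.compile("(?=" + regex + ")")
    return [m.start() + plus_one_index + 1
            for m in pattern.finditer(sequence.upper())]


# Made-up sequence: TCAGTC starts at base 3, so the candidate TSS is base 5.
print(inr_tss_candidates("GGTCAGTCAA"))  # [5]
```

A minus-strand search would run the same function on the reverse complement and convert the coordinates back. Because most D. melanogaster TSS lack an Inr, an empty result does not argue against a TSS annotation.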
Use UCSC Genome Browser Short Match to find Drosophila core promoter motifs
TATA box
Initiator (Inr)
Ohler U, et al. Computational analysis of core promoters in the Drosophila genome. Genome Biol. 2002;3(12):RESEARCH0087.
Available under “Projects” → “Annotation Resources” → “Core Promoter Motifs” on the GEP web site: http://gander.wustl.edu/~wilson/core_promoter_motifs.html
Core Promoter Motifs tracks
Show core promoter motif matches for each contig
Separated by strand
Visualize matches to different core promoter motifs
Use UCSC Table Browser (or other means) to export the list of motif matches within the search region
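Once the matches are exported, restricting them to a search region is a simple interval filter. The sketch below is one of the "other means": the record layout, the function name, and the example matches are all hypothetical, and it assumes 1-based inclusive coordinates (a BED export from the Table Browser uses 0-based starts and would need converting first).

```python
from typing import List, NamedTuple, Optional


class MotifMatch(NamedTuple):
    motif: str    # motif name, e.g. "Inr"
    start: int    # 1-based, inclusive
    end: int      # 1-based, inclusive
    strand: str   # "+" or "-"


def matches_in_region(matches: List[MotifMatch], region_start: int,
                      region_end: int,
                      strand: Optional[str] = None) -> List[MotifMatch]:
    """Keep matches that lie entirely inside the search region.

    If strand is given, also require the match to be on that strand.
    """
    return [
        m for m in matches
        if m.start >= region_start and m.end <= region_end
        and (strand is None or m.strand == strand)
    ]


# Made-up matches and region, for illustration only.
example = [
    MotifMatch("Inr", 95, 100, "+"),     # starts before the region
    MotifMatch("TATA", 150, 157, "+"),
    MotifMatch("Inr", 180, 185, "-"),
    MotifMatch("Inr", 316, 321, "+"),    # runs past the region end
]
kept = matches_in_region(example, 100, 320, strand="+")
print(kept)
```

Requiring the whole match to fall inside the region is a design choice; a match that overlaps the boundary could instead be kept and flagged for manual review.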
DEMO: Use the Inr motif to determine the TSS position of Rad23
Using RNA-Seq and RNA PolII ChIP-Seq data to define the TSS search region
[Figure: genome browser view with the D. mel Transcripts, RNA-Seq, RNA PolII Peaks, and RNA PolII Enrichment tracks; the TSS search region is marked]
TSS annotation for Rad23
TSS position: 28,936
Conservation with D. melanogaster
blastn search of initial exon
“D. mel Transcripts” track
Location of the Inr motif
TSS search region: 28,716-28,936
Enrichment of RNA PolII upstream of the TSS position
RNA-Seq read coverage upstream of the TSS position
Search region defined by the extent of the RNA PolII peak
TSS annotation resources
Walkthroughs:
Annotation of Transcription Start Sites in Drosophila
Sample TSS report for onecut
Reference:
TSS Annotation Workflow
GEP Annotation Report:
Classify the type of core promoter
Evidence that supports or refutes the TSS annotation
Distribution of core promoter motifs
Additional TSS annotation resources
The D. melanogaster gene annotations are the primary source of evidence
Resources that could be useful if the D. melanogaster evidence is ambiguous
Whole genome alignments of 14 Drosophila species
PhastCons and PhyloP conservation scores
Genome browsers for nine Drosophila species
RNA Pol II ChIP-Seq (D. biarmipes only)
RNA-Seq coverage, TopHat junctions, assembled transcripts
Augustus and N-SCAN gene predictions
TSS annotation summary
Most of the D. melanogaster core promoters have multiple TSS
Classify the type of promoter (peaked/intermediate/broad) based on the transcriptome evidence from D. melanogaster
Define search regions that might contain TSS
Use multiple lines of evidence to infer the TSS region
Identify initial exon
RNA-Seq coverage
blastn (change search parameters)
Distribution of core promoter motifs (e.g., Inr)
D. biarmipes RNA PolII ChIP-Seq peaks
Maintain conservation compared to D. melanogaster
Questions?
Structure of a typical mRNA
Pesole G, et al. Untranslated regions of mRNAs. Genome Biol. 2002;3(3):reviews0004.1-reviews0004.10.
RAMPAGE protocol
DEMO
blastn search of the initial transcribed exon of Rad23 against D. biarmipes contig19
Use RNA PolII tracks on the D. biarmipes genome browser to identify putative TSS
April 2013 (BCM-HGSC/Dbia_2.0) assembly
Search for orthologous regions in D. elegans
Gnomon predictions for eight Drosophila species
Based on RNA-Seq data from either the same or closely-related species
D. simulans, D. yakuba, D. erecta, D. ananassae, D. pseudoobscura, D. willistoni, D. virilis, and D. mojavensis
Predictions include untranslated regions and multiple isoforms
Records not yet available through the NCBI RefSeq database
Conservation tracks on the D. melanogaster GEP UCSC Genome Browser
Whole genome alignments of 14 Drosophila species
Drosophila Chain/Net composite track
Generate multiple sequence (Multiz) alignments from these pairwise alignments
Identify conserved regions from Multiz alignments
PhastCons: identify conserved elements
PhyloP: measure level of selection at each nucleotide
Multiz alignment of 27 insect species available on the official UCSC Genome Browser
Aug. 2014 (BDGP Release 6 + ISO1 MT/dm6) assembly
Use the conservation tracks to identify regions under selection
PhyloP scores:
Positive scores: under negative selection
Negative scores: fast-evolving
Examine the Multiz alignments to identify the orthologous TSS regions
Use RNA-Seq data to predict untranslated regions and putative TSS
TSS predictions available for 9 Drosophila species
N-SCAN+PASA-EST, Augustus, TransDecoder
[Figure: genome browser view with the D. mel Proteins, N-SCAN, Augustus, TransDecoder, and RNA-Seq tracks]
DEMO
Genome browsers for other Drosophila species