Transcript Document

RNA surveillance and degradation: the Yin Yang
of RNA
[Figure: RNA production versus RNA destruction; labels: Pol II, polyadenylated RNA (AAA), ribosome]
MODEL:
[Figure: polyadenylation of hypomodified tRNAiMet by Trf4p (Mtr4 also shown), followed by degradation of hypomodified tRNAiMet by the exosome. Exosome subunits shown*: Rrp4p, Rrp6p, Rrp40p, Rrp41p, Rrp42p, Rrp43p, Rrp44p, Rrp45p, Rrp46p, Csl4p, Mtr3p]
* Hypothetical diagram of the exosome
Workflow
[Workflow diagram; steps shown: Knockdown mMtr4; Connect & Compare; Collect; Library Construction; PolyA-Seq; Mapping; Normalize; Aggregate; Remove Internal A; Visualize; Next Gen sequencing (PolyA-Seq)]
TRAMP Complex
[Figure: TRAMP complex components Papd5, ZCCHC7 and Mtr4 with polyadenylated RNA (AAAA); steps shown: siRNA knockdown, library creation for NGS]
Map paired end reads to genome
• The BWA (Burrows-Wheeler Aligner) algorithm is used to map each pair of reads to the genome
• Each pair of reads is reported as a single nucleotide position in the genome where polyadenylation was detected in the RNA sample
• Average insert size 300
– Read size ~45
[Figure: read pair orientation relative to the 3’ poly(A) tail (AAAA-3’ / TTTp-5’)]
Raw reads vs Mapped reads

Data type        KD type   Raw reads     Mapped reads   Positions
Replicate data   Mtr4      15,135,078    10,853,534       651,551
Replicate data   Ctrl      16,348,780    11,708,310       652,128
Replicate data   Rrp6      15,971,926    12,388,266       705,173
Original data    Mtr4      ND            34,204,534     1,124,968
Original data    Ctrl      ND             7,195,942       582,256
Original data    Rrp6      ND             8,241,505       597,672
Normalization of data: reads per million (rpm)
Analysis
• Starting with the RefSeq database
– Raw read counts converted to reads per million
• (Reads at position / total reads in sample) x 1,000,000
– Remove all non-coding RNAs
– From each sample, collect normalized reads mapping within +/- 50 bases of the 3’ end of each protein-coding RefSeq entry
– Dot plot of normalized reads on a log scale, X axis = control and Y axis = mMtr4KD
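The normalization and 3’ end collection steps above can be sketched as follows. This is a minimal illustration, not the pipeline used for these slides: the function names, the dictionary layout and the toy numbers are all assumptions made for the example.

```python
def reads_per_million(position_counts, total_reads):
    """Convert raw read counts per pA position to reads per million (rpm)."""
    scale = 1_000_000 / total_reads
    return {pos: count * scale for pos, count in position_counts.items()}


def collect_at_3prime_ends(rpm_by_position, gene_ends, flank=50):
    """Sum normalized reads within +/- flank bases of each gene's 3' end.

    rpm_by_position: {(chrom, strand, position): rpm}
    gene_ends:       {gene_name: (chrom, strand, three_prime_end)}
    """
    totals = {gene: 0.0 for gene in gene_ends}
    for gene, (chrom, strand, end) in gene_ends.items():
        for (c, s, pos), rpm in rpm_by_position.items():
            if c == chrom and s == strand and abs(pos - end) <= flank:
                totals[gene] += rpm
    return totals


# Hypothetical toy data: three pA positions in a sample of 10 million mapped reads.
raw_counts = {
    ("chr12", "+", 1000): 30,
    ("chr12", "+", 1040): 10,
    ("chr12", "+", 5000): 60,   # too far from the annotated 3' end, ignored
}
rpm = reads_per_million(raw_counts, total_reads=10_000_000)
per_gene = collect_at_3prime_ends(rpm, {"GeneA": ("chr12", "+", 1010)})
```

Running the same two functions on the control and mMtr4KD samples gives one value per gene per sample, which is the pair of coordinates needed for the dot plot. The nested loop is fine for a toy example; a real annotation would need an interval index.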
mRNA polyadenylation does not change
between Mtr4 and control KD
[Scatter plot, log-log axes from 0.1 to 10000. X axis: normalized pASeq reads in mControlKD; Y axis: normalized pASeq reads in mMtr4KD; R2=0.95141]
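For reference, an agreement statistic of this kind can be computed as the squared Pearson correlation of the paired values. The slide does not say whether R2 was calculated on raw or log-transformed values; the sketch below assumes log10 values, to match the log-scale axes, and the function name is hypothetical.

```python
import math


def log_r_squared(x_values, y_values):
    """Squared Pearson correlation of log10-transformed paired values.

    Pairs where either value is zero or negative are skipped, because
    they cannot be placed on a log scale.
    """
    pairs = [(math.log10(x), math.log10(y))
             for x, y in zip(x_values, y_values) if x > 0 and y > 0]
    n = len(pairs)
    mean_x = sum(px for px, _ in pairs) / n
    mean_y = sum(py for _, py in pairs) / n
    sxy = sum((px - mean_x) * (py - mean_y) for px, py in pairs)
    sxx = sum((px - mean_x) ** 2 for px, _ in pairs)
    syy = sum((py - mean_y) ** 2 for _, py in pairs)
    return (sxy * sxy) / (sxx * syy)


# Hypothetical toy values: y is exactly twice x, so the log-log fit is perfect.
perfect = log_r_squared([1, 10, 100], [2, 20, 200])
```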
Problems encountered
• Sequencing read depth differs greatly between samples in the original data
– ~34 million mapped reads in one sample, ~8 million in another
• Lack of 3 replicates for robust statistical analysis
of data
• Removal of internal A
– Seq reads that map to an oligoadenylate tract in the genome
– The algorithm developed misses many
– Manual removal takes too much time
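One common way to flag internal A reads is to inspect the genomic sequence immediately downstream of each candidate site. The sketch below is an assumed illustration of that idea, not the algorithm referred to above; the function name, the 10-base window and the 70% cutoff are arbitrary choices for the example.

```python
def is_internal_priming(genome_seq, site, strand, window=10, min_fraction=0.7):
    """Flag a candidate pA site whose downstream genomic sequence is A-rich.

    genome_seq: plus-strand sequence of the chromosome (string)
    site:       0-based index of the last transcribed base before the tail
    strand:     "+" or "-"
    """
    if strand == "+":
        downstream = genome_seq[site + 1: site + 1 + window]
        base = "A"
    else:
        # Downstream on the minus strand is to the left on the plus strand,
        # and an A run on the minus strand reads as T on the plus strand.
        downstream = genome_seq[max(0, site - window): site]
        base = "T"
    if len(downstream) < window:
        return False  # too close to the sequence end to judge
    return downstream.upper().count(base) / window >= min_fraction


# Hypothetical toy sequences.
a_tract = "GCGCGTACGT" + "AAAAAAAAAA" + "GCGC"
no_tract = "GCGCGTACGTGCGTACGTACGGCC"
flagged = is_internal_priming(a_tract, site=9, strand="+")
kept = not is_internal_priming(no_tract, site=9, strand="+")
```

A filter like this trades sensitivity for specificity: a stricter cutoff misses short genomic A tracts, while a looser one discards genuine pA sites that sit next to A-rich sequence.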
Remove Internal A
[Figure: oligo(A) and oligo(T) tracts (AAAAAAAA / TTTTTTTTT)]
How to mine the data based on a
hypothesis
• Hypothesis: PolyA+ RNAs of unknown identity will accumulate upon depletion of mMtr4 relative to the control.
– How can the transcriptome be queried?
– How detailed should a query be?
• Every pA position, or only those exhibiting greater than x
number of raw/normalized reads?
• How do we find significant differences with one sample, or
possibly two?
• How can repetitive elements be accounted for in the data?
Custom annotation to remove bias
from existing annotations
• Data mapped with Bowtie to mouse genome
mm10 build
• Mapped data from KD and control were compared using Cufflinks to explore gene expression differences against a custom annotation
• Custom annotation
– 1000 base pair genes with 500 base pair overlap
with next gene
• This did not work well
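The tiling scheme described above (1000 base pair "genes", each overlapping the next by 500 base pairs) can be sketched as below. The function name, the tuple layout and the tile naming are assumptions for illustration; the slides do not show how the annotation file was actually generated.

```python
def tile_chromosome(chrom, length, tile_size=1000, step=500):
    """Return overlapping tiles as (chrom, start, end, name), 0-based half-open."""
    tiles = []
    start = 0
    while start < length:
        end = min(start + tile_size, length)
        tiles.append((chrom, start, end, f"{chrom}_tile_{start}"))
        if end == length:
            break  # last tile reaches the chromosome end
        start += step
    return tiles


# Hypothetical toy chromosome of 2,500 bp.
tiles = tile_chromosome("chrToy", 2500)
for chrom, start, end, name in tiles:
    print(chrom, start, end, name)
```

The number of tiles grows as roughly the chromosome length divided by the step, which is why a whole-genome annotation of this kind becomes very large.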
Problems with using custom
annotation
• The first real problem was that no available computer could handle more than 5000 genes of the custom annotation at a time
– One chromosome had 147K genes
• There was a problem with assignment when the reads
overlapped
– Cuffdiff would randomly assign the reads to only one of the
genes.
• Overlaps were split into two fasta files, but we could not capture differences in the data that we knew existed.
– Cuffdiff collects data from the entire 1000 bp gene and compares between 2 samples
– This method leads to false negatives for pA data, where the focus is on one or a few positions as a pA event.
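A small worked example shows why averaging over a whole 1000 bp window can hide a change at a single pA position. All numbers below are hypothetical and chosen only to illustrate the dilution effect; the pseudocount is also an assumption.

```python
def fold_change(treated, control, pseudocount=1.0):
    """Fold change with a pseudocount to avoid division by zero."""
    return (treated + pseudocount) / (control + pseudocount)


# Hypothetical toy numbers for one 1000 bp window.
background = 200    # reads elsewhere in the window, identical in both samples
site_control = 5    # reads at the pA position in the control
site_kd = 40        # reads at the same position in the knockdown

site_fc = fold_change(site_kd, site_control)
window_fc = fold_change(background + site_kd, background + site_control)

print(f"fold change at the pA site:    {site_fc:.2f}")
print(f"fold change over whole window: {window_fc:.2f}")
```

The several-fold change at the site shrinks to a change of less than 20% once the unchanged background reads in the window are included, which a gene-level test is unlikely to call significant.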
What next?
Mapping
• Map raw reads against mm10 assembly with Bowtie2/TopHat
Strand and 3’ end selection
• Select alignments on positive and negative strand
• Select 3’ read of paired reads to define site of polyadenylation
Custom annotation preparation and count
• Run F-Seq to identify the mode of all peaks
• Normalize data then collect reads at mode (+/-5-10 nucleotides)
Statistical test with DESeq, an R package
• Negative binomial model
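The counting and normalization steps of this plan can be sketched as follows. The size factor calculation follows the median-of-ratios idea described for DESeq, but this is a plain Python illustration under assumed data structures, not a call into the R package; the function names and toy counts are hypothetical.

```python
import math
from statistics import median


def count_at_mode(position_counts, mode, half_width=5):
    """Sum reads within +/- half_width nucleotides of a peak mode."""
    return sum(count for pos, count in position_counts.items()
               if abs(pos - mode) <= half_width)


def size_factors(count_table):
    """Median-of-ratios size factors.

    count_table: {sample_name: [count per site]}, same site order in every
    sample. Sites with a zero count in any sample are skipped.
    """
    samples = list(count_table)
    n_sites = len(count_table[samples[0]])
    log_geo_means = []
    for i in range(n_sites):
        values = [count_table[s][i] for s in samples]
        if all(v > 0 for v in values):
            log_mean = sum(math.log(v) for v in values) / len(values)
            log_geo_means.append((i, log_mean))
    factors = {}
    for s in samples:
        log_ratios = [math.log(count_table[s][i]) - lg for i, lg in log_geo_means]
        factors[s] = math.exp(median(log_ratios))
    return factors


# Hypothetical toy data.
reads_near_peak = count_at_mode({100: 3, 104: 2, 111: 7}, mode=100, half_width=5)
factors = size_factors({"ctrl": [10, 20, 30], "kd": [20, 40, 60]})
```

Dividing each sample's counts by its size factor puts the samples on a common scale before any per-site test; the negative binomial test itself is left to DESeq.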
F-Seq
• Tags identify specific sequence features for different library preparations (ChIP-seq, DNase-seq and pA-seq).
• Summarizes and displays individual sequence data as an accurate and interpretable signal by generating a continuous tag sequence density estimation.
Generating Peaks with FSeq
• 1. Estimate kernel density to estimate the pdf
• 2. Compute threshold
– n_w = nw/L
– x_c
– Repeat step 2 k times
– s SDs above the mean
• 2.1 Threshold output module is modifiable
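The two steps above can be illustrated with a simplified sketch: a Gaussian kernel density over tag positions, and a threshold set s SDs above the mean density seen at a fixed point x_c when n_w tags are placed uniformly at random in a window. This is an assumed toy version written for this example; it does not reproduce F-Seq's implementation, bandwidth selection or defaults, and every name and parameter value below is hypothetical.

```python
import math
import random
from statistics import mean, stdev


def kernel_density(x, positions, bandwidth, n_total):
    """Gaussian kernel density at x, normalized by the total tag count."""
    norm = 1.0 / (n_total * bandwidth * math.sqrt(2 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2)
                      for p in positions)


def simulated_threshold(n, window, length, bandwidth, k=200, s=4.0, seed=0):
    """Threshold = mean + s * SD of densities from k random simulations."""
    rng = random.Random(seed)
    n_w = max(1, round(n * window / length))  # expected tags per window
    x_c = window / 2.0                        # fixed point in the window
    densities = []
    for _ in range(k):
        random_tags = [rng.uniform(0, window) for _ in range(n_w)]
        densities.append(kernel_density(x_c, random_tags, bandwidth, n))
    return mean(densities) + s * stdev(densities)


# Hypothetical toy data: 50 clustered tags near position 5000 plus sparse background.
tags = [5000 + i for i in range(-25, 25)] + [i * 2000 for i in range(1, 50)]
threshold = simulated_threshold(n=len(tags), window=300, length=100_000, bandwidth=50)
peak_density = kernel_density(5000, tags, 50, len(tags))
empty_density = kernel_density(3000, tags, 50, len(tags))
```

Positions whose density exceeds the threshold would be reported as peaks; in this toy data the cluster at 5000 passes and the empty region at 3000 does not.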
Magnitude of data: one sample both
strands
51 million bases of Chromosome 12
12 thousand bases of Chromosome 12
Chromosome 12 is 121 million base pairs long
rRNA workflow
Mapping
• Map raw reads against 13kb rDNA with Bowtie2
Strand and 3’ end selection
• Select alignments on positive strand
• Select 3’ read of a pair and 3’ end of a read
Density estimation and visualization
• Density estimation with F-Seq, a peak calling tool
[Figure: pA reads intersecting 45S pre-rRNA. X axis: position along the rDNA (1 to 13,546), with the 18S, 5.8S and 28S regions marked; Y axis: 0 to 20,000; series: Ctrl, Mtr4]
[Figure: pA reads intersecting 45S pre-rRNA, shown as percentages. X axis: position along the rDNA (1 to 13,553), with the 18S, 5.8S and 28S regions marked; Y axis: 0% to 100%; series: Mtr4, Ctrl]
Accumulation of micro RNA processed 5’
leader upon depletion of Mtr4
• Comparison of Mtr4 vs. control KD
• Abundant polyA found near the 5’ end of annotated Mir322
• Confirmed using a molecular technique