Introductory_RNA-seq

Introductory RNA-Seq Transcriptome Profiling
Tutorial on the web:
• http://www.iplantcollaborative.org/learning-center/discoveryenvironment/de-002-characterizing-differential-expression-rna-seqtuxedo
• https://pods.iplantcollaborative.org/wiki/display/eot/RNA-Seq_tutorial
Align sequence reads to the reference genome
The most time-consuming part of the analysis is aligning the
reads (in Sanger FASTQ format) for all replicates against the
reference genome.
Where is the Sample Data?
Step 1: Align Reads to the Genome
Align the four FASTQ files to the Arabidopsis genome using TopHat.
RNA-seq in the Discovery Environment
Overview: This training module provides hands-on experience
in using RNA-Seq for transcriptome profiling.
Questions:
How well is the annotated transcriptome represented in
RNA-seq data in Arabidopsis WT and hy5 genetic
backgrounds?
How can we compare gene expression levels in the two
samples?
Scientific Objective
LONG HYPOCOTYL 5 (HY5) is a basic leucine zipper
transcription factor (TF).
Mutations in the HY5 gene cause aberrant phenotypes in
Arabidopsis morphology, pigmentation and hormonal
response.
We will use RNA-seq to compare the transcriptomes of
seedlings from WT and hy5 genetic backgrounds to identify
HY5-regulated genes.
Samples
• Experimental data downloaded from the NCBI
Short Read Archive (GEO:GSM613465 and
GEO:GSM613466)
• Two replicate RNA-seq runs each for wild-type (WT) and hy5 mutant seedlings.
Specific Objectives
By the end of this module, you should:
1) Be more familiar with the DE user interface
2) Understand the starting data for RNA-seq analysis
3) Be able to align short sequence reads with a reference genome in the DE
4) Be able to analyze differential gene expression in the DE
RNA-Seq Conceptual Overview
Image source: http://www.bgisequence.com
RNA-Seq Data
@SRR070570.4 HWUSI-EAS455:3:1:1:1096 length=41
CAAGGCCCGGGAACGAATTCACCGCCGTATGGCTGACCGGC
+
BA?39AAA933BA05>A@A=?4,9#################
@SRR070570.12 HWUSI-EAS455:3:1:2:1592 length=41
GAGGCGTTGACGGGAAAAGGGATATTAGCTCAGCTGAATCT
+
@=:9>5+.5=?@<6>A?@6+2?:</7>,%1/=0/7/>48##
@SRR070570.13 HWUSI-EAS455:3:1:2:869 length=41
TGCCAGTAGTCATATGCTTGTCTCAAAGATTAAGCCATGCA
+
A;BAA6=A3=ABBBA84B<&78A@BA=(@B>AB2@>B@/9?
@SRR070570.32 HWUSI-EAS455:3:1:4:1075 length=41
CAGTAGTTGAGCTCCATGCGAAATAGACTAGTTGGTACCAC
+
BB9?A@>AABBBB@BCA?A8BBBAB4B@BC71=?9;B:3B?
@SRR070570.40 HWUSI-EAS455:3:1:5:238 length=41
AAAAGGGTAAAAGCTCGTTTGATTCTTATTTTCAGTACGAA
+
BBB?06-8BB@B17>9)=A91?>>8>*@<A<>>@1:B>(B@
@SRR070570.44 HWUSI-EAS455:3:1:5:1871 length=41
GTCATATGCTTGTCTCAAAGATTAAGCCATGCATGTGTAAG
+
BBBCBCCBBBBBA@BBCCB+ABBCB@B@BB@:BAA@B@BB>
@SRR070570.46 HWUSI-EAS455:3:1:5:1981 length=41
GAACAACAAAACCTATCCTTAACGGGATGGTACTCACTTTC
+
?A>-?B;BCBBB@BC@/>A<BB:?<?B?=75?:9@@@3=>:
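Each read above is a four-line FASTQ record: an `@` header, the base calls, a `+` separator, and one quality character per base. The following is a minimal Python sketch, not part of the original tutorial, of how such records could be read and how the quality characters map to Phred scores. It assumes Sanger encoding (ASCII offset 33), as stated for these files, and strict four-line records; the names `FastqRecord`, `parse_fastq`, and `phred_scores` are hypothetical helpers introduced here for illustration.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str    # first token of the header line, without the leading "@"
    sequence: str   # base calls
    quality: str    # one ASCII quality character per base


def parse_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming exactly four lines per read."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # "+" separator line, ignored
        quality = handle.readline().rstrip("\n")
        yield FastqRecord(header[1:].split()[0], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert quality characters to Phred scores (Sanger offset 33 assumed)."""
    return [ord(char) - offset for char in quality]


# First record from the slide above.
example = io.StringIO(
    "@SRR070570.4 HWUSI-EAS455:3:1:1:1096 length=41\n"
    "CAAGGCCCGGGAACGAATTCACCGCCGTATGGCTGACCGGC\n"
    "+\n"
    "BA?39AAA933BA05>A@A=?4,9#################\n"
)
record = next(parse_fastq(example))
scores = phred_scores(record.quality)
print(record.read_id, scores[:5], "...", scores[-1])
```

Under this encoding the run of `#` characters at the end of the first read corresponds to a Phred score of 2, that is, very low-confidence base calls at the 3' end.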
…Now What?
Bioinformagician
Your RNA-Seq Data
$ tophat -p 8 -G genes.gtf -o C1_R1_thout genome C1_R1_1.fq C1_R1_2.fq
$ tophat -p 8 -G genes.gtf -o C1_R2_thout genome C1_R2_1.fq C1_R2_2.fq
$ tophat -p 8 -G genes.gtf -o C1_R3_thout genome C1_R3_1.fq C1_R3_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R1_thout genome C2_R1_1.fq C2_R1_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R2_thout genome C2_R2_1.fq C2_R2_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R3_thout genome C2_R3_1.fq C2_R3_2.fq
$ cufflinks -p 8 -o C1_R1_clout C1_R1_thout/accepted_hits.bam
$ cufflinks -p 8 -o C1_R2_clout C1_R2_thout/accepted_hits.bam
$ cufflinks -p 8 -o C1_R3_clout C1_R3_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R1_clout C2_R1_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R2_clout C2_R2_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R3_clout C2_R3_thout/accepted_hits.bam
$ cuffmerge -g genes.gtf -s genome.fa -p 8 assemblies.txt
$ cuffdiff -o diff_out -b genome.fa -p 8 -L C1,C2 -u merged_asm/merged.gtf \
./C1_R1_thout/accepted_hits.bam,./C1_R2_thout/accepted_hits.bam,./C1_R3_thout/accepted_hits.bam \
./C2_R1_thout/accepted_hits.bam,./C2_R3_thout/accepted_hits.bam,./C2_R2_thout/accepted_hits.bam
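The tophat and cufflinks invocations on this slide repeat one pattern per sample (condition C1 or C2, replicate R1 to R3). As a rough Python sketch, not part of the original slides, the command lines could be generated rather than typed by hand. It only builds and prints the strings; it does not run the tools. The flags and file names are the ones shown on the slide, while the function names `tophat_command` and `cufflinks_command` are hypothetical.

```python
CONDITIONS = ("C1", "C2")
REPLICATES = ("R1", "R2", "R3")
THREADS = 8  # value passed to -p on the slide


def tophat_command(sample: str) -> str:
    """Build the TopHat command line for one paired-end sample."""
    return (
        f"tophat -p {THREADS} -G genes.gtf -o {sample}_thout "
        f"genome {sample}_1.fq {sample}_2.fq"
    )


def cufflinks_command(sample: str) -> str:
    """Build the Cufflinks command line that assembles one sample's alignments."""
    return (
        f"cufflinks -p {THREADS} -o {sample}_clout "
        f"{sample}_thout/accepted_hits.bam"
    )


samples = [f"{cond}_{rep}" for cond in CONDITIONS for rep in REPLICATES]

# Print all alignments first, then all assemblies, as on the slide.
for sample in samples:
    print("$", tophat_command(sample))
for sample in samples:
    print("$", cufflinks_command(sample))
```

Generating the lines from one template avoids copy-and-paste slips such as pairing a C2 sample with a C1 read file.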
Your transformed RNA-Seq Data
RNA-Seq Analysis Workflow
• Your Data: FASTQ files in the iPlant Data Store
• Discovery Environment: TopHat (Bowtie), Cufflinks, Cuffmerge, Cuffdiff
• Atmosphere: CummeRbund
Quick Summary
Pre-Configured: Getting the RNA-seq Data
• Import SRA data from NCBI SRA
• Extract FASTQ files from the downloaded SRA archives
Examining Data Quality with FastQC
RNA-Seq Workflow Overview
TopHat
• TopHat is one of many applications for aligning short sequence reads to a reference genome.
• It uses the Bowtie aligner internally.
• Alternatives include GSNAP, BWA, MAQ, Stampy, and Novoalign.
RNA-seq Sample Read Statistics
• Genome alignments from TopHat were saved as BAM
files, the binary version of SAM
(samtools.sourceforge.net/).
• Reads retained by TopHat are shown below:

Sequence run    Reads         Seq. (Mbase)
WT-1            10,866,702    445.5
WT-2            10,276,268    421.3
hy5-1           13,410,011    549.8
hy5-2           12,471,462    511.3
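The Mbase figures are consistent with multiplying each read count by the 41-base read length shown in the FASTQ headers (length=41). A small Python sketch of that arithmetic follows; it assumes every retained read is 41 bases long and that read counts pair with sequence runs in the order listed, and the name `megabases` is introduced here only for illustration.

```python
READ_LENGTH_BP = 41  # from the "length=41" field in the FASTQ headers

# Reads retained by TopHat, paired with runs in listing order (an assumption).
retained_reads = {
    "WT-1": 10_866_702,
    "WT-2": 10_276_268,
    "hy5-1": 13_410_011,
    "hy5-2": 12_471_462,
}


def megabases(n_reads: int, read_length: int = READ_LENGTH_BP) -> float:
    """Total sequence in megabases for n_reads reads of fixed length."""
    return n_reads * read_length / 1e6


for run, n_reads in retained_reads.items():
    print(f"{run}: {n_reads:,} reads, {megabases(n_reads):.1f} Mbase")
```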
ATG44120 (12S seed storage protein) is significantly down-regulated in the hy5 mutant
background (> 9-fold, p = 0). Compare with the gene on the right, which lacks differential expression.
Cuffdiff
• Cufflinks is a program that assembles aligned RNA-Seq
reads into transcripts, estimates their abundances, and
tests for differential expression and regulation
transcriptome-wide.
• Cuffdiff is a program in the Cufflinks package that compares
transcript abundance between samples.
Examining Differential Gene Expression
Examining the Gene Expression Data
Differentially expressed genes
Filter Cuffdiff results for up- or down-regulated
gene expression in hy5 seedlings.
Example filtered Cuffdiff results, generated with the Filter_CuffDiff_Results app to:
1) Select genes with minimum two-fold expression difference
2) Select genes with significant differential expression (q <= 0.05)
3) Add gene descriptions
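The first two filtering steps can be sketched in Python. This is an illustration only, not the Filter_CuffDiff_Results app itself. It assumes the tab-delimited column layout of Cuffdiff's gene_exp.diff output (including `status`, `log2(fold_change)`, and `q_value`), treats a two-fold difference as |log2 fold change| >= 1, and runs on invented example rows whose gene IDs and values are hypothetical. Step 3, adding gene descriptions, needs an annotation table and is not shown.

```python
import csv
import io

HEADER = [
    "test_id", "gene_id", "gene", "locus", "sample_1", "sample_2", "status",
    "value_1", "value_2", "log2(fold_change)", "test_stat", "p_value",
    "q_value", "significant",
]

# Invented rows for illustration; not real results from this experiment.
ROWS = [
    ["GENE_A", "GENE_A", "-", "1:100-900", "WT", "hy5", "OK", "100", "10", "-3.32193", "-5.1", "0.0001", "0.001", "yes"],
    ["GENE_B", "GENE_B", "-", "1:2000-2900", "WT", "hy5", "OK", "50", "55", "0.137504", "0.3", "0.7", "0.8", "no"],
    ["GENE_C", "GENE_C", "-", "2:100-900", "WT", "hy5", "OK", "5", "25", "2.32193", "4.2", "0.001", "0.01", "yes"],
    ["GENE_D", "GENE_D", "-", "2:2000-2900", "WT", "hy5", "OK", "20", "80", "2", "1.5", "0.1", "0.2", "no"],
    ["GENE_E", "GENE_E", "-", "3:100-900", "WT", "hy5", "NOTEST", "0", "0", "0", "0", "1", "1", "no"],
]


def filter_cuffdiff(handle, min_abs_log2_fc=1.0, max_q=0.05):
    """Keep tested genes with at least a two-fold change and q <= max_q."""
    hits = []
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["status"] != "OK":
            continue  # skip genes Cuffdiff could not test
        log2_fc = float(row["log2(fold_change)"])  # also parses "inf" / "-inf"
        q_value = float(row["q_value"])
        if abs(log2_fc) >= min_abs_log2_fc and q_value <= max_q:
            direction = "up" if log2_fc > 0 else "down"
            hits.append({"gene_id": row["gene_id"], "log2_fc": log2_fc,
                         "q_value": q_value, "direction": direction})
    return hits


table = "\n".join("\t".join(fields) for fields in [HEADER] + ROWS) + "\n"
hits = filter_cuffdiff(io.StringIO(table))
for hit in hits:
    print(hit["gene_id"], hit["direction"], hit["log2_fc"], hit["q_value"])
```

Here "up" and "down" are relative to sample_1, so with WT as sample_1 a negative log2 fold change means lower expression in hy5.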