1. FPKM tracking files - iPlant Pods

CyVerse Workshop
Intro RNA-Seq
Experiment Overview
Goals
Determine the differential expression (difference in transcript
abundance) between a WT and a mutant organism
RNA-Seq Overview
Basic concept
Image source: http://www.bgisequence.com
Experiment Overview
Example experiment
• LONG HYPOCOTYL 5 (HY5) is a basic leucine zipper
transcription factor (TF).
• Mutations cause aberrant phenotypes in Arabidopsis
morphology, pigmentation and hormonal response.
• We will use RNA-Seq to compare WT and hy5 to identify HY5-regulated genes.
Source: http://www.gla.ac.uk/media/media_73736_en.jpg
Experiment Overview
Read statistics
• Genome alignments from TopHat were saved as BAM files,
the binary version of SAM (samtools.sourceforge.net/).
• Reads retained by TopHat are shown below
Now what?
@SRR070570.4 HWUSI-EAS455:3:1:1:1096 length=41
CAAGGCCCGGGAACGAATTCACCGCCGTATGGCTGACCGGC
+
BA?39AAA933BA05>A@A=?4,9#################
@SRR070570.12 HWUSI-EAS455:3:1:2:1592 length=41
GAGGCGTTGACGGGAAAAGGGATATTAGCTCAGCTGAATCT
+
@=:9>5+.5=?@<6>A?@6+2?:</7>,%1/=0/7/>48##
@SRR070570.13 HWUSI-EAS455:3:1:2:869 length=41
TGCCAGTAGTCATATGCTTGTCTCAAAGATTAAGCCATGCA
+
A;BAA6=A3=ABBBA84B<&78A@BA=(@B>AB2@>B@/9?
@SRR070570.32 HWUSI-EAS455:3:1:4:1075 length=41
CAGTAGTTGAGCTCCATGCGAAATAGACTAGTTGGTACCAC
+
BB9?A@>AABBBB@BCA?A8BBBAB4B@BC71=?9;B:3B?
@SRR070570.40 HWUSI-EAS455:3:1:5:238 length=41
AAAAGGGTAAAAGCTCGTTTGATTCTTATTTTCAGTACGAA
+
BBB?06-8BB@B17>9)=A91?>>8>*@<A<>>@1:B>(B@
@SRR070570.44 HWUSI-EAS455:3:1:5:1871 length=41
GTCATATGCTTGTCTCAAAGATTAAGCCATGCATGTGTAAG
+
BBBCBCCBBBBBA@BBCCB+ABBCB@B@BB@:BAA@B@BB>
@SRR070570.46 HWUSI-EAS455:3:1:5:1981 length=41
GAACAACAAAACCTATCCTTAACGGGATGGTACTCACTTTC
+
?A>-?B;BCBBB@BC@/>A<BB:?<?B?=75?:9@@@3=>:
Getting a feel for the data
FASTQ format
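Each FASTQ record spans four lines: a header starting with @, the sequence, a + separator, and a quality string as long as the sequence. A minimal sketch for a first sanity check, assuming an uncompressed file with unwrapped records (the function name fastq_summary is made up for illustration):

```shell
# Hypothetical helper: count records in an unwrapped, uncompressed FASTQ
# file and count lines that violate the 4-line record layout.
fastq_summary() {
  awk '
    NR % 4 == 1 { if (substr($0, 1, 1) != "@") bad++ }    # header line
    NR % 4 == 2 { seqlen = length($0) }                   # sequence line
    NR % 4 == 3 { if (substr($0, 1, 1) != "+") bad++ }    # separator line
    NR % 4 == 0 { if (length($0) != seqlen) bad++; n++ }  # quality line
    END { printf "reads=%d malformed=%d\n", n, bad }
  ' "$1"
}
# Usage: fastq_summary C1_R1_1.fq
```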
Papers and Background
Read these first!
Tuxedo Workflow
Differential expression
*TopHat and Cufflinks require a sequenced genome
No reference genome?
Resources
ENCODE Standards
Suggestions before you sequence
http://encodeproject.org/ENCODE/protocols/dataStandards/ENCODE_RNAseq_Standards_V1.0.pdf
$ tophat -p 8 -G genes.gtf -o C1_R1_thout genome C1_R1_1.fq C1_R1_2.fq
$ tophat -p 8 -G genes.gtf -o C1_R2_thout genome C1_R2_1.fq C1_R2_2.fq
$ tophat -p 8 -G genes.gtf -o C1_R3_thout genome C1_R3_1.fq C1_R3_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R1_thout genome C2_R1_1.fq C2_R1_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R2_thout genome C2_R2_1.fq C2_R2_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R3_thout genome C2_R3_1.fq C2_R3_2.fq
$ cufflinks -p 8 -o C1_R1_clout C1_R1_thout/accepted_hits.bam
$ cufflinks -p 8 -o C1_R2_clout C1_R2_thout/accepted_hits.bam
$ cufflinks -p 8 -o C1_R3_clout C1_R3_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R1_clout C2_R1_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R2_clout C2_R2_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R3_clout C2_R3_thout/accepted_hits.bam
$ cuffmerge -g genes.gtf -s genome.fa -p 8 assemblies.txt
$ cuffdiff -o diff_out -b genome.fa -p 8 -L C1,C2 -u merged_asm/merged.gtf \
./C1_R1_thout/accepted_hits.bam,./C1_R2_thout/accepted_hits.bam,\
./C1_R3_thout/accepted_hits.bam \
./C2_R1_thout/accepted_hits.bam,./C2_R3_thout/accepted_hits.bam,\
./C2_R2_thout/accepted_hits.bam
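The cuffmerge command reads assemblies.txt, a plain-text list of the transcripts.gtf files written by each Cufflinks run, one path per line. A sketch for creating it, assuming the six *_clout directories sit in the current directory (the function name is illustrative):

```shell
# Write one transcripts.gtf path per line, in replicate order.
# Directory names follow the -o values passed to cufflinks.
make_assemblies() {
  for sample in C1_R1 C1_R2 C1_R3 C2_R1 C2_R2 C2_R3; do
    echo "./${sample}_clout/transcripts.gtf"
  done > assemblies.txt
}
# Usage: make_assemblies && cat assemblies.txt
```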
Your RNA-Seq Data
Command line version
Your transformed RNA-Seq Data
Using a GUI
[Workflow figure: Your Data (FASTQ) → TopHat (Bowtie) → Cufflinks → Cuffmerge → Cuffdiff → CummeRbund; platforms shown: iPlant Data Store, Discovery Environment, Atmosphere]
Moving your data in
Complete documentation
www.iplantc.org/ds1
Cyberduck
Easy to use!
Discovery Environment
Easy to use!
Decompress your data
Know what files you have
Remove barcodes?
Demultiplexing and adapter trimming
Image from: http://www.westburg.eu/lp/rna-seq-library-preparation
Pre-process sequences if needed (e.g., Sabre for de-multiplexing
reads, and Scythe for removing primer/adapter sequences)
Quality Control
FastQC
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Quality Control
Per base sequence quality
• The central red line is the median value
• The yellow box represents the inter-quartile range (25-75%)
• The upper and lower whiskers represent the 10% and 90% points
• The blue line represents the mean quality
Quality Control
Per sequence quality
BAD
GOOD
Fail: most frequently observed mean quality is below 20 (1% error rate)
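The 1% figure follows from the Phred definition Q = -10 log10(P), so P = 10^(-Q/10). A small worked sketch (the function name phred_to_error is illustrative):

```shell
# Convert a Phred quality score to the probability of a wrong base call.
phred_to_error() {
  awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}
phred_to_error 20   # prints 0.0100 (1 error in 100 bases)
phred_to_error 30   # prints 0.0010 (1 error in 1000 bases)
```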
Quality Control
Sequence length distribution
GOOD
Fail: raised if any sequence has zero length.
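The same check can be approximated outside FastQC by tabulating read lengths. A rough sketch, assuming an uncompressed FASTQ file with unwrapped records; a row with length 0 would correspond to the failure case:

```shell
# Print "length<TAB>number of reads", sorted by length.
fastq_length_hist() {
  awk 'NR % 4 == 2 { count[length($0)]++ }
       END { for (len in count) print len "\t" count[len] }' "$1" | sort -n
}
# Usage: fastq_length_hist C1_R1_1.fq
```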
Quality Control
Overrepresented sequences
BAD
Fail: module will issue an error if any sequence is found to represent more
than 1% of the total
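To see which sequences trip this threshold, identical reads can be counted directly. This is a simplified sketch, not FastQC's own implementation (which differs in detail); it assumes unwrapped FASTQ records and compares full-length reads only:

```shell
# Print sequences that make up more than a given fraction of all reads
# (default 0.01, i.e. 1%), most frequent first.
overrepresented() {
  awk -v cutoff="${2:-0.01}" '
    NR % 4 == 2 { count[$0]++; total++ }
    END {
      for (s in count)
        if (count[s] / total > cutoff)
          printf "%s\t%d\t%.2f%%\n", s, count[s], 100 * count[s] / total
    }
  ' "$1" | sort -k2,2nr
}
# Usage: overrepresented C1_R1_1.fq 0.01
```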
TopHat
Maps reads to reference genome
• TopHat is one of many applications for aligning short
sequence reads to a reference genome.
• It uses the Bowtie aligner internally.
• Alternatives include BWA, MAQ, STAR, OLego,
Stampy, and Novoalign.
TopHat
Maps reads to reference genome
• TopHat has a number of parameters and options, and their default
values are tuned for processing mammalian RNA-Seq reads.
• If you would like to use TopHat for another class of organism, we
recommend setting some of the parameters with more strict,
conservative values than their defaults.
• Usually, setting the maximum intron size to 4 or 5 Kb is sufficient
to discover most junctions while keeping the number of false
positives low.
- TopHat User Manual
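In TopHat the maximum intron size is set with -I (--max-intron-length, in bp). A sketch that prints the adjusted command for one sample so it can be inspected before running; the helper name tophat_cmd and the 5000 bp default are choices made here for illustration, and the other arguments mirror the workshop commands:

```shell
# Build a TopHat command line with a reduced maximum intron length.
tophat_cmd() {
  local sample=$1 max_intron=${2:-5000}
  echo tophat -p 8 -G genes.gtf -I "$max_intron" -o "${sample}_thout" \
    genome "${sample}_1.fq" "${sample}_2.fq"
}
tophat_cmd C1_R1              # print the command for inspection
# eval "$(tophat_cmd C1_R1)"  # uncomment to run it (requires TopHat)
```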
IGV
Visualize mapped reads
Cufflinks
Assemble transcripts
Hint: Provide a mask file (gtf/gff)
• Tells Cufflinks to ignore all reads that could
have come from transcripts in this GTF file.
• Include annotated rRNA, mitochondrial transcripts,
and other abundant transcripts you wish to
ignore.
- Cufflinks User Manual
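One way to build such a mask is to pull the organelle records out of the annotation and pass the result to Cufflinks with -M (--mask-file). A sketch under a loud assumption: the organelle chromosomes are named Mt and Pt in column 1 of genes.gtf, which depends on the annotation source. rRNA records would need a separate filter on the attribute column.

```shell
# Assumption: organelle chromosomes are named Mt and Pt in column 1 of
# the GTF; adjust the names to match your annotation.
make_mask() {
  awk -F '\t' '$1 == "Mt" || $1 == "Pt"' "$1"
}
# make_mask genes.gtf > mask.gtf
# cufflinks -p 8 -M mask.gtf -o C1_R1_clout C1_R1_thout/accepted_hits.bam
```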
Cufflinks
Assemble transcripts
1) transcripts.gtf
This GTF file contains Cufflinks' assembled isoforms. The first 7 columns are standard
GTF, and the last column contains attributes, some of which are also standardized
("gene_id" and "transcript_id"). There is one GTF record per row, and each record
represents either a transcript or an exon within a transcript.
2) isoforms.fpkm_tracking
This file contains the estimated isoform-level expression values (FPKM).
3) genes.fpkm_tracking
This file contains the estimated gene-level expression values (FPKM).
- Cufflinks User Manual
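The tracking files are tab-delimited with a header row, so they can be filtered with standard tools. A sketch that lists entries above an FPKM cutoff, locating the FPKM column by its header name rather than by position (the function name and the default cutoff of 1 are illustrative choices):

```shell
# Print tracking_id and FPKM for rows with FPKM >= min (default 1).
fpkm_filter() {
  awk -F '\t' -v min="${2:-1}" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "FPKM") col = i; next }
    col && $col + 0 >= min { print $1 "\t" $col }
  ' "$1"
}
# Usage: fpkm_filter C1_R1_clout/genes.fpkm_tracking 10
```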
Cufflinks
Assemble transcripts
Cuffmerge
Assemble transcriptome from RABT and Cufflinks
Cuffmerge is a meta-assembler: it merges the Cufflinks transcript assemblies,
optionally together with a reference annotation (reference annotation based transcript assembly, RABT)
Cuffdiff
Determine sample differences
• Cuffdiff estimates the variation in read counts for each gene
across the replicates; this estimate is used to calculate the
significance of expression changes.
• Cuffdiff can identify genes that are differentially spliced or
differentially regulated via promoter switching. Isoforms of
a gene that have the same TSS are grouped.
• The detection rate of differentially expressed genes/transcripts
is strongly dependent on sequencing depth.
Cuffdiff
Determine sample differences
Changes in fragment counts ≠ changes in expression
True expression is estimated by the sum of the length-normalized isoform read
counts, so the entire transcript must be taken into account.
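A worked sketch of the length normalization, using the standard definition FPKM = 10^9 × fragments / (transcript length in bp × total mapped fragments). The numbers are made up, and this is simplified: Cufflinks assigns fragments to isoforms with a statistical model rather than by direct counting.

```shell
# fpkm FRAGMENTS LENGTH_BP TOTAL_MAPPED_FRAGMENTS
fpkm() {
  awk -v c="$1" -v len="$2" -v n="$3" \
    'BEGIN { printf "%.1f\n", 1e9 * c / (len * n) }'
}
fpkm 100 1000 1000000   # short isoform: prints 100.0
fpkm 100 4000 1000000   # long isoform, same fragment count: prints 25.0
```

Both isoforms receive 100 fragments, yet the shorter one is expressed four times as highly; the gene-level value is the sum, 125.0.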
Cuffdiff
Determine sample differences
1) FPKM tracking files
Cuffdiff calculates the FPKM of each transcript, primary transcript, and gene in each sample. Primary transcript and gene
FPKMs are computed by summing the FPKMs of transcripts in each primary transcript group or gene group.
(tss_groups.fpkm_tracking tracks summed FPKM of transcripts sharing tss_ids)
2) Count tracking files
Estimate of the number of fragments that originated from each transcript, primary transcript, and gene in each sample.
3) Read group tracking files
Expression and fragment count for each transcript, primary transcript, and gene in each replicate.
4) Differential expression tests
Tab-delimited files list the results of differential expression testing between samples for spliced transcripts, primary
transcripts, genes, and coding sequences.
Plus several other outputs (diff splicing, CDS, promoter, etc.)
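The differential expression test files can be filtered from the command line. A sketch that keeps the header plus the rows Cuffdiff marked as significant, assuming the file has a header column named significant with yes/no values, as in gene_exp.diff (the function name sig_genes is illustrative):

```shell
# Keep the header row and every row whose "significant" column is "yes".
sig_genes() {
  awk -F '\t' '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "significant") col = i; print; next }
    col && $col == "yes"
  ' "$1"
}
# Usage: sig_genes diff_out/gene_exp.diff > sig_genes.tsv
```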
Cuffdiff
Determine sample differences
Example filtered Cuffdiff results generated in the Discovery Environment.
CummeRbund
Using R in Atmosphere (tomorrow)
Cuffdiff
Density plot
Cuffdiff
Scatter plot
Cuffdiff
Volcano plot
• Detailed instructions with videos, manuals, documentation in
• Keep asking: ask.iplantcollaborative.org