lecture - RNA-Seq for the Next Generation
Download
Report
Transcript lecture - RNA-Seq for the Next Generation
+
RNAseq for differential
gene expression
analysis
Molly Hammell, PhD
2016
1
+
2
RNAseq for differential gene expression:
the tuxedo suite
Why?
How?
Examples. . . .
M. Hammell
+
3
A good reference protocol
M. Hammell
+
Analysis Outline
What do you do first?
How do you know when you’re done?
What do you do with the results?
You’ve got your data, what do you do with it?
FastQC
to check the quality of the reads
Tophat
Map to your favorite genome
Cufflinks/Cuffdiff
Assign reads to transcripts
Differential expression
Calculate your stats
RNAseq QC
Check that they mapped well, with little
contamination, & replicates correlate
Post-analysis
CummeRbund (plots)
heatmaps in R
GSEA (pathways analysis)
4
M. Hammell
+ Mapping with Tophat
5
… Overview of the process.
Making RNA-seq Libraries
AAAAAAAA
AAAAAAAA
Isolate RNA
Fragment RNA
Convert RNA to cDNA fragments
Ligate Adapters to each fragment
PCR!
M. Hammell
+ Mapping with Tophat
6
… Overview of the process.
Mapping RNA-seq Libraries
Making RNA-seq Libraries
AAAAAAA
A
AAAAAAA
A
M. Hammell
Trapnell C et al. Bioinformatics 2009;25:1105-1111
+ Mapping with Tophat
7
… more parameters than you’d care to care about.
What do you absolutely need to specify for each run?
The genome index file (just the prefix)
The reads files, one for each paired end
The transcript files (gtf format)
The insert size (-r). This is the size of the fragment minus reads length
The library type (stranded or not, illumina-like or not)
In the Green Line, these parameters are set for you as defaults in a Basic Run.
You can use the default settings and probably be ok, but we’ll talk about those.
M. Hammell
+ Mapping with Tophat
8
… more parameters than you’d care to care about.
You need to give Tophat a reference genome for mapping your reads.
In the Green Line, you choose the genome you are working with
from the list of provided genomes.
We actually give it an “index file” which is a compressed map of the genome.
There are 6 genome index files.
If you built these with bowtie1-build, they look like this
Just specify the prefix, e.g.: hg19
The genome index file (just the prefix)
The reads files, one for each paired end
The transcript files (gtf format)
The insert size (-r). This is the size of the fragment minus reads length
The library type (stranded or not, illumina-like or not)
M. Hammell
+ Mapping with Tophat
9
… more parameters than you’d care to care about.
Next, we give Tophat your actual sequenced reads.
These are FastQ files, which contain both sequence information as well as quality
scores for each read.
For paired end reads, we tell Tophat about both of them, (pair the reads) so it can
try to map them together.
In the Green Line, the output files from FastX toolkit are the input files to TopHat
Example: First WT replicate (rep1); Read 1 paired with Read 2
wt-rep1_R1.fq.gz
wt-rep1_R2.fq.gz
The genome index file (just the prefix)
The reads files, one for each paired end
The transcript files (gtf format)
The insert size (-r). This is the size of the fragment minus reads length
The library type (stranded or not, illumina-like or not)
M. Hammell
+ Mapping with Tophat
10
… more parameters than you’d care to care about.
Tophat works much better if you give it the exon/intron structure of known genes,
i.e., a gtf file.
You can still choose to find novel transcripts later, but providing this file makes it
easier to find reads that span known introns, which makes the search faster.
Example: the human file is hg19_refseq_genes.gtf
**Be careful – tophat is very picky about the format of this file. Either download
one from their website, or get someone to give you one that plays nicely with
tophat. In the Green Line, the gtf file is provided for you
The genome index file (just the prefix)
The reads files, one for each paired end
The transcript files (gtf format)
The insert size (-r). This is the size of the fragment minus reads length
The library type (stranded or not, illumina-like or not)
M. Hammell
+ Mapping with Tophat
11
… more parameters than you’d care to care about.
100
r=300
100
Mate Inner Distance
Let’s say your RNAseq library prep kit preferentially fragments the RNA into
500 bp fragments. Then, you did a PE100bp run.
Your “insert size” is 500 – 100 – 100 = 300.
You did a PE300 run? Okay, your insert size is 0.
Why does Tophat care? Tophat judges how well the two ends of a paired-end segment
mapped by whether the inferred distance between them is consistent with the expected insert
size. This is part of “mapping quality.”
The genome index file (just the prefix)
The reads files, one for each paired end
The transcript files (gtf format)
The mate inner distance (-r). This “insert size” is the size of the fragment minus
reads length
The library type (stranded or not, illumina-like or not)
M. Hammell
+ Mapping with Tophat
12
… more parameters than you’d care to care about.
In general, an Illumina Tru-seq Stranded RNA-seq library prep kit should use
fr-firststrand. That means the library is stranded and the complementary strand
is the first one sequenced. In other words, all your reads are the reversecomplement of the transcriptome!
The other options:
fr-unstranded
fr-secondstrand
common for non-stranded libraries
used for ABI Solid libraries – not common
The genome index file (just the prefix)
The reads files, one for each paired end
The transcript files (gtf format)
The insert size (-r). This is the size of the fragment minus reads length
The library type (stranded or not, illumina-like or not)
M. Hammell
+ Mapping with Tophat
13
What other options might you care to change?
Preset default options (In the Green Line in a Basic run)
-N 2
number of mismatches per read (default is 2)
-g 2
maximum number of alignments per read to report (default is 2)
--suppress-hits
set this if you want to suppress reads that map more than max (2)
--no-mixed
don’t report an alignment if you can’t map both ends of the fragment
--no-novel-juncs
don’t look for novel splice junctions – just use the ones in the GTF file
(In the Green Line, if you will be skipping CuffLinks, you must run TopHat Advanced and
choose this option to not look for novel junction)
M. Hammell
+ Mapping with Tophat
14
What should your output look like?
A typical tophat output directory should have these files:
(The output files will be in your Project folder in the Data Store)
accepted_hits.bam
insertions.bed
deletions.bed
left_kept_reads.info
junctions.bed
logs
right_kept_reads.info
The bam file is your main output – the aligned reads.
It’s in binary sam format.
Here is an example of what one line of a bam file looks like:
read name
map flags
MENDEL_0001_FC61FR7AAXX7:69:18748:7104#0 81
CGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTG
read sequence
map position map quality
chr/start
other PE read is mapped to same chr (=)
at pos 709587, which is 699060 bp away
chr1 10563 255
36M
=
709587
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
read qualities
699060
NM:i:0
NH:i:1
tophat flags
NM:i:0 = 0 mismatches
NH:i:1 = aligns uniquely
M. Hammell
+
15
Going from bam files to
differential expression
After running tophat on each of your sample replicates, you have a set of bam
files.
Now, put all your bam files together and feed them to cuffdiff
But, wait? The package was called cufflinks. Don’t we need to run cufflinks first?
No.
M. Hammell
+ There’s stuff I’m skipping, why?
16
Cufflinks was designed to find novel transcripts:
maybe your genome doesn’t have good transcript annotations
maybe you want to find novel splice junctions
maybe you want to find novel transcripts
The thing is, you need really deep data (100 million reads) to do a good job at
this.
So, if you don’t care about calling novel stuff, and/or if you don’t have deep
enough sequencing data, you should skip cufflinks and go straight to cuffdiff.
(In the Green line, you have the option of running CuffDiff directly after TopHat.)
Cuffdiff always calculates the transcript abundances, even if you already did
cufflinks. So, cufflinks saves you no time.
M. Hammell
+ Cuffdiff for DE
17
differential expression, with stats
Find reads mapping
uniquely to one isoform
Cuffdiff calculates:
isoform abundances
then fold changes
then q-values.
Iterate to find the most
likely distribution of the
rest of the reads
Output a list of isoform
abundances
Now do some math to
calculate
log2FC, Q-values
M. Hammell
Trapnell C et al. Nat Biotechnol 2010
+ Cuffdiff for DE
18
differential expression, with stats
What do you absolutely need to specify for each CuffDiff run?
A BAM file of input aligned reads for each condition (genotype, treatment, etc)
and for biological replicates
A GTF file that specifies all the transcripts you want to consider (Refseq genes)
In the Green Line, the GTF file corresponding to your organism is
provided by default
The normalization strategy you want to use (FPKM is common)
In the Green Line, FPKM is used by default
M. Hammell
+ Cuffdiff for DE
19
differential expression, with stats
What do you absolutely need to specify for each run?
A BAM file of input reads for each condition (genotype, treatment, etc).
A GTF file that specifies all the transcripts you want to consider (Refseq genes)
The normalization strategy you want to use (FPKM is common)
accepted_hits.bam
Your Tophat output files, in BAM format, are the input to Cuffdiff.
You should have at least two of these for each condition (biological replicates),
the more the better.
M. Hammell
+ Cuffdiff for DE
20
differential expression, with stats
What do you absolutely need to specify for each run?
A BAM file of input reads for each condition (genotype, treatment, etc).
A GTF file that specifies all the transcripts you want to consider
(Refseq genes)
In the Green Line, the GTF file corresponding to your organism is
provided by default
The normalization strategy you want to use (FPKM is common)
hg19_refseq_genes.gtf
**Be careful –Cuffdiff is very picky about the format of this file. Either download
one from their website, or get someone to give you one that plays nicely with
Cuffdiff.
M. Hammell
+ Cuffdiff for DE
21
differential expression, with stats
What do you absolutely need to specify for each run?
The normalization strategy you want to use:
Fragments Per Kilobase of Exons Per Million Mapped Reads (FPKM)
Transcript A
Why per kilobase?
Genes 2x as long will have 2x as many reads
from the same number of starting transcripts.
200 bases
2 fragments/transcript
Transcript B
400 bases
4 fragments/transcript
M. Hammell
+ Cuffdiff for DE
22
differential expression, with stats
What do you absolutely need to specify for each run?
The normalization strategy you want to use
Fragments Per Kilobase of Exons Per Million Mapped Reads (FPKM)
Why per million mapped reads?
Sequencing deeper gives you more reads.
So, if you got 4x the # of reads from your control
library (40M) and only 10M from your mutant
library, before normalization it will seem as if
all transcripts are expressed 4x more in the
control than in the mutant
Mutant Sample
total library
10 million reads
Control Sample
total library
40 million reads
Transcript A
Transcript A
M. Hammell
+ Cuffdiff for DE
23
differential expression, with stats
What other options might you care to change?
Other options:
-c 10
You can specify the minimum reads per gene, below which no statistics are
calculated
-M
You can specify a mask file of transcripts -- anything that should be removed
from the analysis (like an rRNA file)
-u
You can tell it to take any multi-mapper reads, and try to assign the true locus
In the Green Line, these options are pre-set defaults
--dispersion-method
(pooled, per condition, blind, poisson)
M. Hammell
+
24
Cuffdiff output
A typical cuffdiff output directory should have these files:
cds_exp.diff
gene_exp.diff
isoform_exp.diff
tss_group_exp.diff
cds.fpkm_tracking
genes.fpkm_tracking
isoforms.fpkm_tracking
tss_groups.fpkm_tracking
That’s a lot! What’s important?
gene_exp.diff
isoform_exp.diff
abundances, fold changes & q-values of genes
abundances, fold changes & q-values of isoforms
Maybe important to you?
tss
cds
tracking files
alternate transcription start sites
differential splicing that would produce distinct proteins
lots of information about reads per replicate, stats
M. Hammell
+
25
A note on P-values
The problem of multiple hypothesis testing
FDR = False Discovery Rate
Statisticians have been harping on something called P-value fishing for a
long time. If you keep running tests over and over, you’ll eventually get
something to come up as significant by random chance.
How do you fix this?
Adjust your P-values to q-values
Benjamini & Hochberg (BH):
rank genes by P-value (large to small)
each P-value is adjusted by the
number of genes with a
smaller P-value than itself
q-value = p-value * n/(n-k)
n = number of genes
k = rank in gene list
M. Hammell
+
26
So, what do I do now?
A little QC & a lot of downstream analysis
QC
How well do the replicates correlate with each other?
Does a PCA plot show that my samples group by genotype?
What fraction of transcripts are expressed > 1 RPKM?
Downstream analysis
make lots of Excel tables of comparisons
make some heatmaps (in R or using gplots, CummeRbund)
figure out if any pathways are enriched among genes that change
(DAVID, GSEA, PathDE, etc)
M. Hammell
+ RNAseq post-analysis QC
27
RNAseq Replicate & Sample Correlations
Replicates, Condition 1
Replicates, Condition 2
Condition 1 Vs. Condition 2
All axes show RPM
(reads per million
mapped)
M. Hammell
+ RNAseq post-analysis & QC
28
PCA plots & corresponding heatmap
Condition 1
+ drug
Condition 2
+ drug
Condition 1
Condition 2
PCA plot separates:
(1) Condition 1 vs. Condition 2
(2) Lines grown in drug or normal
media
Condition
1
Condition
2
Condition
2 + drug
M. Hammell
+ Standard Pathways Analysis
29
DAVID, Ingenuity’s IPA, etc.
M. Hammell
+
30
Gene Set Enrichment Analysis
Are many of the altered genes changing in a single
direction?
M. Hammell
Subramanian A et al. PNAS 2005;102:15545-15550
Molly Hammell Lab at CSHL
2016
Ray
Ami
Ying
Nik
Regina
Wen-Wei
Christos
Funding:
CSHL, NIH, NSF, Rita Allen Foundation, Ride for Life
David
31
Yuan
+ Statistical Distributions
gaussian, poisson, negative binomial
32
-- what does all this mean?
Gaussian (normal) distributions
-nice and easy to work with
-describe smooth distributions
-work well for microarrays
-underlie the t-test (among others)
The dispersion in this case is equal to
the standard deviation
You completely specify this distribution
by the mean ( ) and the standard
deviation ().
M. Hammell
+ Statistical Distributions
gaussian, poisson, negative binomial
33
-- what does all this mean?
Poisson distributions
- like a gaussian for nonsmooth distributions
- describes things like stars in
a small area on the sky
- for very large numbers, this
looks like a gaussian
distribution
The dispersion in this case is equal to
the mean ( ). You completely specify
this distribution by the mean.
e.g., For a set of samples where your gene has an
average of =100 reads overall, how likely is it that
your gene randomly has k=10 reads in the
M. mutant
Hammell
+ Statistical Distributions
gaussian, poisson, negative binomial
34
-- what does all this mean?
Negative Binomial distributions
- like a poisson but allows the
variance to be different from
the mean
- often called “over-dispersed”
poisson distribution
- for very large numbers, this
looks like a gaussian
distribution
The dispersion in this case is
measured empirically from the data
M. Hammell
+ Statistical Distributions
gaussian, poisson, negative binomial
35
-- what does all this mean?
RNA-seq data fits a Negative Binomial (NB) distribution.
But really, that’s just saying that RNAseq looks like “counts” data with
more variation than just statistical fluctuations– it also has biological
variation in it.
How do we know? Because, when you measure variance (per gene,
between replicates), it’s not equal to the mean, and it’s not even a good
linear fit
poisson
negative binomial fit
* For real data, even the NB
fit isn’t always great
M. Hammell
Anders & Huber, Genome Biology 2010
+
36
Okay, it’s an NB (ish), so what?
how do I use this to help me set the cuffdiff parameters?
--dispersion-method
(pooled, per condition, blind, poisson)
Pooled is the default. This means you average the within-type variances to
fit an NB distribution to your data.
Per condition means that you think the variance is different for each sample
type and you want to build a separate NB distribution for each. You need lots of
replicates for this.
Blind means you ignore the sample type and act like all samples are
replicates of each other. Only do this if you have no replicates.
Poisson means you’re fitting to a poisson distribution and not measuring
variance. Why would you do this? Idk.
M. Hammell