Transcript L12_RNAseq

Lecture 12
RNA-seq analysis
Some background
• RNA-seq (RNA sequencing), also called whole transcriptome
shotgun sequencing (WTSS), is a technology that uses the
capabilities of next-generation sequencing to reveal a snapshot of
the presence and quantity of RNA at a given moment in time.
• RNA sequencing is a high-throughput tool for investigating gene
expression, made possible with rapid advances in the speed and
efficiency of sequencing technologies.
• Unlike microarrays, RNA-seq benefits from a wide dynamic range
of signal detection, identifying both rare and common transcripts
with no a priori knowledge of the organism’s genome or
transcriptome.
• The additional information captured in RNA-seq libraries has
revolutionized our understanding of cancer, stem cell
differentiation, and plant genetics.
Next Generation Sequencing
• Next-generation sequencing (NGS), also known as high-throughput
sequencing, is the catch-all term used to describe a number of different
modern sequencing technologies, including Illumina (Solexa) sequencing,
Roche 454 sequencing, Ion Torrent (Proton / PGM) sequencing, and SOLiD
sequencing.
• NGS REALLY revolutionized genome sequencing, as many, many more
sequences can be read in far less time.
More Background
• An RNA-seq run reads and quantifies the transcriptome
(complete set of mRNA) in a single sequencing run.
• RNA is extracted from tissue, cleaved into fragments a few
hundred nucleotides long, and then converted to a
complementary DNA (cDNA) library (Wilhelm & Landry, 2009).
• Sequencing adaptors are ligated to both ends of each
fragment, and the products are sequenced using any high-throughput
method such as 454, SOLiD, or Ion Torrent.
Comparison with Microarrays: advantages
• New sequences can be discovered.
• Microarrays can only detect sequences for which probes were
designed in advance; RNA-seq, on the other hand, determines all
sequences empirically.
• This has proved invaluable in non-model species with large
genomes.
• False positives from cross-hybridization are not an issue in
RNA-seq.
• Quantification is possible even at extremely low and high
expression levels.
• Whereas microarrays have a dynamic range of one to a few
hundred fold, RNA-seq boasts a dynamic range of >8,000 fold
(Wang, Gerstein, & Snyder, 2009).
Comparison with Microarrays: disadvantages
• Considerably more processing power is required to handle
millions of RNA-seq reads, and chemical manipulation of RNA
and cDNA can introduce artifacts.
• Slower than microarrays when the genome is known.
But as sequencing costs have plummeted and computing power
has increased, RNA-seq is now the transcriptomics method of
choice for most applications.
Data structure
• Here you have a set of sequences, and for each sequence you have
the number of READS.
• The data are the COUNTS of the sequences read.
• Counts are discrete, NOT continuous like microarray expression data.
• So the normal and other related continuous distributions are not
directly appropriate.
• General modeling is done using the Poisson distribution.
Poisson Distribution
• Generally used to model count data
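• The mass function is given by
$$P(Y = y) = f(y) = \frac{e^{-\mu}\,\mu^{y}}{y!}, \qquad y = 0, 1, 2, \ldots$$
Properties:
• Y ranges over the non-negative integers, from 0 to positive infinity.
• Mean: $E(Y) = \mu$
• Variance: $\mathrm{Var}(Y) = \mu$
• Hence, the mean and variance are the same.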
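The equality of mean and variance can be checked numerically. The following is a minimal sketch, not part of the lecture: the mean value 4 and the truncation of the support at 100 are illustrative assumptions.

```python
import math

def poisson_pmf(y, mu):
    """P(Y = y) for a Poisson variable with mean mu."""
    return math.exp(-mu) * mu**y / math.factorial(y)

mu = 4.0              # illustrative mean, not taken from the lecture
support = range(100)  # truncated support; the tail beyond 100 is negligible for mu = 4
probs = [poisson_pmf(y, mu) for y in support]

total = sum(probs)
mean = sum(y * p for y, p in zip(support, probs))
variance = sum((y - mean) ** 2 * p for y, p in zip(support, probs))

print(f"total probability = {total:.6f}")
print(f"mean = {mean:.6f}, variance = {variance:.6f}")
```

Both printed moments should agree with the chosen mean up to floating-point error.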
Issues with Poisson
• The property that the mean and variance must be equal
is problematic for RNA-seq data, where the variance is often
much larger than the mean.
• This is called the over-dispersion problem.
• It is also common in litter studies, where over-dispersion is induced by
auto-correlation.
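A quick way to see over-dispersion in practice is to compare the sample variance with the sample mean for each gene across replicates. The sketch below uses made-up counts (the gene names and numbers are hypothetical, chosen only for illustration); a ratio well above 1 suggests that a Poisson model is too restrictive.

```python
from statistics import mean, variance

# Hypothetical read counts for three genes across six replicates
# (made-up numbers for illustration only).
counts = {
    "geneA": [12, 15, 9, 30, 7, 22],
    "geneB": [250, 410, 180, 520, 300, 95],
    "geneC": [3, 4, 2, 5, 3, 4],
}

def dispersion_ratio(values):
    """Sample variance divided by sample mean; about 1 under a Poisson model."""
    return variance(values) / mean(values)

ratios = {gene: dispersion_ratio(v) for gene, v in counts.items()}
for gene, ratio in ratios.items():
    label = "over-dispersed" if ratio > 1 else "not over-dispersed"
    print(f"{gene}: variance/mean = {ratio:.2f} ({label})")
```

With only a handful of replicates the ratio is itself noisy, so this is a diagnostic rather than a formal test.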
Solutions: The NB Distribution
• To address this problem, one distribution that is used is
the Negative Binomial (NB) distribution.
• In the form used here, it models the number of failures before the rth
success and is related to the geometric distribution (the special case r = 1).
Model:
$$P(Y = y) = f(y) = \binom{r + y - 1}{y}\, p^{r} (1 - p)^{y}, \qquad y = 0, 1, 2, \ldots$$
Properties of the NB distribution
$$\text{Mean} = \frac{r(1 - p)}{p}, \qquad \text{Var} = \frac{r(1 - p)}{p^{2}}$$
So, the mean and variance are related by a proportionality
constant: Var = Mean / p. Because 0 < p < 1, the variance always exceeds the mean.
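The NB mean and variance formulas can be verified directly from the mass function. This is a minimal sketch, assuming the "failures before the rth success" parameterization with integer r; the values r = 3 and p = 0.4 and the truncation point are illustrative choices, not from the lecture.

```python
import math

def nb_pmf(y, r, p):
    """P(Y = y): probability of y failures before the r-th success (integer r)."""
    return math.comb(r + y - 1, y) * p**r * (1 - p) ** y

r, p = 3, 0.4         # illustrative parameter values, not from the lecture
support = range(400)  # truncated support; the remaining tail mass is negligible here
probs = [nb_pmf(y, r, p) for y in support]

mean = sum(y * q for y, q in zip(support, probs))
variance = sum((y - mean) ** 2 * q for y, q in zip(support, probs))

print(f"mean = {mean:.6f}  (formula: {r * (1 - p) / p:.6f})")
print(f"variance = {variance:.6f}  (formula: {r * (1 - p) / p**2:.6f})")
print(f"variance / mean = {variance / mean:.6f}  (1/p = {1 / p:.6f})")
```

The last line illustrates the proportionality: the ratio of variance to mean equals 1/p, which is greater than 1.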
Theoretical Background
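• To model over-dispersion in Poisson regression, one generally adds a
random effect $\theta_i$ to represent the unobserved heterogeneity.
• The conditional distribution of $Y_i$ given $\theta_i$ is then Poisson with mean
and variance $\mu_i \theta_i$.
• The idea is: if we could observe $\theta_i$, the data would be Poisson. But we
cannot, so we assume that $\theta_i$ has a gamma distribution
with both parameters $\alpha = \beta = 1/\sigma^2$, so that $\theta_i$ has mean 1 and
variance $\sigma^2$, which represents the variance of the unobserved heterogeneity.
• Then the unconditional distribution is given by:
$$P[Y = y] = \frac{(\alpha + y - 1)!}{y!\,(\alpha - 1)!} \cdot \frac{\beta^{\alpha}\,\mu^{y}}{(\mu + \beta)^{\alpha + y}}$$
(for non-integer $\alpha$ the factorials are read as gamma functions, $(\alpha - 1)! = \Gamma(\alpha)$).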
Theory:
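• The form is a NB distribution with $r = \alpha$ and $p = \beta/(\mu + \beta)$.
• With $\alpha = \beta = 1/\sigma^2$, the mean is $E(Y) = \mu$ and the variance is
$\mathrm{Var}(Y) = \mu(1 + \sigma^2 \mu)$.
• The mean and variance are thus related: the variance equals the mean
multiplied by the factor $1 + \sigma^2 \mu$.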
• This is the form used in the Anders and Huber paper laying
the basic theory for DESeq.
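The correspondence between the Poisson-gamma mixture and the NB distribution can be checked numerically. The sketch below assumes the gamma parameters are both equal to 1/σ² and uses r equal to the gamma shape and p = β/(μ + β); the values μ = 10 and σ² = 0.5 are illustrative, and the function names are hypothetical.

```python
import math

def mixture_pmf(y, mu, alpha, beta):
    """Closed-form Poisson-gamma mixture probability, computed on the log scale."""
    log_p = (math.lgamma(alpha + y) - math.lgamma(y + 1) - math.lgamma(alpha)
             + alpha * math.log(beta) + y * math.log(mu)
             - (alpha + y) * math.log(mu + beta))
    return math.exp(log_p)

def nb_pmf(y, r, p):
    """Negative binomial probability of y failures, allowing non-integer r."""
    log_p = (math.lgamma(r + y) - math.lgamma(y + 1) - math.lgamma(r)
             + r * math.log(p) + y * math.log(1 - p))
    return math.exp(log_p)

mu, sigma2 = 10.0, 0.5        # illustrative values, not from the lecture
alpha = beta = 1.0 / sigma2   # gamma parameters, so the random effect has mean 1
p = beta / (mu + beta)

support = range(2000)         # truncated support; the tail is negligible here
mix = [mixture_pmf(y, mu, alpha, beta) for y in support]
nb = [nb_pmf(y, alpha, p) for y in support]

max_gap = max(abs(a - b) for a, b in zip(mix, nb))
mean = sum(y * q for y, q in zip(support, mix))
variance = sum((y - mean) ** 2 * q for y, q in zip(support, mix))

print(f"largest difference between the two forms: {max_gap:.2e}")
print(f"mean = {mean:.4f}  (mu = {mu})")
print(f"variance = {variance:.4f}  (mu * (1 + sigma2 * mu) = {mu * (1 + sigma2 * mu)})")
```

Working on the log scale with `math.lgamma` avoids overflow in the factorial terms and handles non-integer shape parameters.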
DESeq Theory
• The DESeq2 library uses empirical Bayes ideas for differential
expression analysis, i.e. for detecting differences in gene expression across conditions.
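• The idea: let $K_{ij}$ be the count associated with the ith gene and the jth
sample.
• The assumption is: $K_{ij} \sim \mathrm{NB}(\mu_{ij}, \alpha_i)$, with mean $\mu_{ij}$ and gene-specific dispersion $\alpha_i$.
• Where $\mu_{ij} = s_j q_{ij}$, with $s_j$ the size factor of sample j.
• And $\log_2(q_{ij}) = x_j \beta_i$.
• Here $x_j$ is the sample-specific design (row j of the design matrix) and $\beta_i$ holds our gene-specific
parameters.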
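To make the model concrete, the expected count for one gene can be computed from the design, the size factors, and the gene's coefficients. The sketch below is plain Python, not the DESeq2 API; all numbers are hypothetical. The variance line assumes the usual DESeq2 parameterization, variance = mean + dispersion × mean², which the slide does not spell out.

```python
# Hypothetical two-condition design: an intercept plus a treatment indicator.
design = [
    [1, 0],  # sample 1, control
    [1, 0],  # sample 2, control
    [1, 1],  # sample 3, treated
    [1, 1],  # sample 4, treated
]
size_factors = [0.9, 1.1, 1.0, 1.2]  # s_j, made-up values
beta_i = [5.0, 1.0]                  # gene i: log2 baseline and log2 fold change
alpha_i = 0.1                        # gene-specific dispersion, made-up value

def expected_counts(design, size_factors, beta):
    """Return mu_ij = s_j * 2 ** (x_j . beta_i) for every sample j."""
    mus = []
    for x_j, s_j in zip(design, size_factors):
        log2_q = sum(x * b for x, b in zip(x_j, beta))
        mus.append(s_j * 2 ** log2_q)
    return mus

mu = expected_counts(design, size_factors, beta_i)
# Assumed NB variance under the DESeq2 parameterization: mu + alpha * mu^2.
var = [m + alpha_i * m**2 for m in mu]

for j, (m, v) in enumerate(zip(mu, var), start=1):
    print(f"sample {j}: expected count = {m:.1f}, variance = {v:.1f}")
```

A log2 fold change of 1 doubles the normalized expression in the treated samples; the size factors then rescale each sample for sequencing depth.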
DESeq2 package: contrasts
• Contrasts can be calculated for a DESeqDataSet object for which the GLM
coefficients have already been fit using the Wald test steps (DESeq with
test="Wald" or using nbinomWaldTest).
• The vector of coefficients is left multiplied by the contrast vector c to form
the numerator of the test statistic.
• The denominator is formed by multiplying the covariance matrix for the
coefficients on either side by the contrast vector c.
• The square root of this product is an estimate of the standard error for the
contrast.
• The contrast statistic is then compared to a normal distribution as are the
Wald statistics for the DESeq2 package.
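The contrast computation described above can be sketched in a few lines. This is an illustration of the arithmetic only, not DESeq2 code: the coefficient vector, covariance matrix, and contrast are hypothetical, the function name is invented, and a two-sided p-value is assumed.

```python
import math

def wald_contrast(beta, cov, c):
    """Contrast estimate c'beta, its standard error sqrt(c' Sigma c), z, and p-value."""
    n = len(c)
    estimate = sum(ci * bi for ci, bi in zip(c, beta))
    cov_c = [sum(cov[i][j] * c[j] for j in range(n)) for i in range(n)]
    variance = sum(c[i] * cov_c[i] for i in range(n))
    se = math.sqrt(variance)
    z = estimate / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return estimate, se, z, p_value

# Hypothetical fitted values for one gene (log2 scale):
# intercept, condition B vs A, condition C vs A.
beta = [5.0, 1.2, 0.4]
cov = [
    [0.04, -0.04, -0.04],
    [-0.04, 0.09, 0.04],
    [-0.04, 0.04, 0.08],
]
c = [0, 1, -1]  # contrast comparing condition B with condition C

estimate, se, z, p_value = wald_contrast(beta, cov, c)
print(f"estimate = {estimate:.3f}, SE = {se:.3f}, z = {z:.3f}, p = {p_value:.5f}")
```

The off-diagonal covariance terms matter: ignoring them would overstate the standard error of the B versus C comparison in this example.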