Lecture 6 Gene expression: microarray and deep sequencing


Gene expression
Introduction to gene expression arrays
Microarray data pre-processing
Introduction to RNA-seq
Deep sequencing applications
RNA-seq data pre-processing
Basics of microarrays
An “old” technology - some predict microarrays will be replaced
by deep sequencing
Currently – much cheaper/faster than sequencing; widely used
http://www.microarraystation.com/dna-microarray-timeline/
Timeline of DNA Microarray Developments
1991: Photolithographic printing (Affymetrix)
1994: First cDNA collections are developed at Stanford
1995: Quantitative monitoring of gene expression patterns with a
complementary DNA microarray.
1996: Commercialization of arrays (Affymetrix)
1997: Genome-wide expression monitoring in S. cerevisiae (yeast)
2000: Portraits/signatures of cancer.
2003: Introduction into clinical practices
2004: Whole human genome on one microarray
2005: First next-generation sequencing machine
2006: All exons measured on one microarray
Basics of microarrays
Microarrays exploit the complementary base pairing between the four nucleotides:
A pairs with T, and C pairs with G. The double-stranded DNA structure is formed through this
pairing:
http://content.answers.com/main/content/wp/en/f/f0/DNA_Overview.png
Basics of microarrays
http://mbotc.tumblr.com/
Basics of microarrays
TTAAGTCGTACCCGTGTACGGGCGC
AATTCAGCATGGGCACATGCCCGCG
Basics of microarrays
Amplified DNA segments
→ fluorescence labeling
→ hybridization on the array
→ reading by photo scanner
→ digitize into fluorescence values
→ quantify amount of each target sequence
Two strategies:
(1) One sample on each array
The amount is calculated from spot intensity.
(2) Two samples, differentially labeled, on each array
The relative amount, $C_{sample} / C_{reference}$, is given by the ratio between the two fluorescence intensities.
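The two-sample strategy can be illustrated with a small calculation. This is a minimal sketch: the function name and the intensity values are hypothetical, and taking the log2 of the ratio is a common convention assumed here, not something stated on the slide.

```python
import math


def log2_ratio(sample_intensity, reference_intensity):
    """Relative amount of a target on a two-sample array, on the log2 scale.

    Both arguments are background-free fluorescence intensities (assumed > 0).
    A value of 0 means equal amounts; +1 means twice as much in the sample.
    """
    return math.log2(sample_intensity / reference_intensity)


# Hypothetical spot intensities for one gene on one array
print(log2_ratio(200.0, 100.0))
print(log2_ratio(100.0, 400.0))
```

The log scale makes a doubling and a halving symmetric around zero, which is why it is often preferred to the raw ratio.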
Gene expression arrays
[Figure: a gene with its start codon, exons, introns and poly-A tail, and the flow from DNA to mRNA to protein]
DNA (2 copies)
mRNA (multiple copies): the amount is easy to measure, and it is positively correlated with the protein amount.
Protein (multiple copies): the amount is what matters, but it is hard to measure.
Gene expression array --- Affymetrix
The Affymetrix platform is one of the most widely used.
http://www.affymetrix.com/
Gene expression arrays -- Affy
Here we use the U133 system for illustration.
Some 20 probes per gene;
Selected from the 3’ end of the gene sequence;
Not necessarily evenly spaced --- sequence property matters;
The probes are located at random locations on the chip;
Target sequence: TTAAGTCGTACCCGTGTACGGGCGC
Perfect match (PM) probe: AATTCAGCATGGGCACATGCCCGCG
Mismatch (MM) probe: AATTCAGCATGGACACATGCCCGCG
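The relationship between the three sequences above can be checked with a short script. This is only an illustrative sketch: the helper names are hypothetical, and the probes are compared base by base in the same left-to-right order used on the slide (no strand reversal).

```python
# Watson-Crick pairing: A-T and C-G
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(seq):
    """Return the base-by-base complement of seq, in the same order."""
    return "".join(PAIR[base] for base in seq)


def mismatch_positions(a, b):
    """1-based positions at which two equal-length sequences differ."""
    return [i + 1 for i, (x, y) in enumerate(zip(a, b)) if x != y]


target = "TTAAGTCGTACCCGTGTACGGGCGC"
pm = "AATTCAGCATGGGCACATGCCCGCG"
mm = "AATTCAGCATGGACACATGCCCGCG"

print(complement(target) == pm)    # PM pairs with every base of the target
print(mismatch_positions(pm, mm))  # where the MM probe differs from the PM probe
```

In this example the MM probe differs from the PM probe at a single base, the middle position of the 25-base probe.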
Gene expression array - affy
The hope was that mismatch probes would not bind the target
sequence, so that the MM signal would reflect only non-specific binding.
http://www.affymetrix.com/
Gene expression array --- Affy
http://www.affymetrix.com/
Microarray data
We are going to focus on pre-processing for now.
Downstream analyses are more in the realm of traditional
statistics: multiple testing, clustering, classification, etc.
These analyses are common across different high-throughput techniques.
Microarray data
Issues:
Background level variation caused by variations in overall
RNA concentration in the sample, image reader, etc.
Within every probeset, each probe has different
sensitivity/specificity, caused by cross-hybridization, different
chemical properties etc.
Across chips, the fluorescence intensity-concentration
response curve can be different, caused by variations in
sample processing, image reader etc.
Affy data --- general strategy
Background correction (within chip)
→ Normalization (across chips)
→ Probe-set level expression value (within chip) and presence/absence call (within chip)
→ Probeset-level statistical analysis (combining chips)
Affy data --- general strategy
There are many processing methods. The most popular
include:
MAS 5.0 (Affymetrix)
Flawed. But it comes with the Affymetrix software.
Thus widely used by non-experts.
dChip (Cheng Li & Wing Wong)
Good performance and versatile. Stand-alone
Windows application. Can handle arrays other than
expression array.
RMA (Rafael Irizarry et al.)
Good performance. Easily used in R/Bioconductor.
Affy data --- RMA background correction
For each array, assumes:
$PM = S + B$
Signal: $S \sim \mathrm{Exp}(\lambda)$
Background: $B \sim N(\mu, \sigma^2)$, left-truncated at zero
[Figure: example PM distributions for $\lambda=1, \mu=1, \sigma=1$ and for $\lambda=5, \mu=1, \sigma=1$]
Affy data --- RMA background correction
For each array, estimate the parameters from the PM signal distribution:
Find the overall mode by kernel density estimation;
Estimate $\mu$ and $\sigma$ from the PM values lower than the overall mode (sample mean and sd);
Estimate $\lambda$ from the PM values higher than the overall mode (1/(sample mean minus the overall mode)).
Then adjust each PM reading by replacing it with the expected signal given the observed value, $E(S \mid PM = s)$ (s is the PM signal; $\lambda$ is written as $\alpha$ in the derivation).
See the derivation here:
http://www.biochem.ucl.ac.uk/~harry/MAD/rma_bg.pdf
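The adjustment step can be sketched in a few lines. The expression used below is an assumption: it is the closed form usually quoted for this exponential-plus-truncated-normal model, $E(S \mid PM = s) = a + b\,\frac{\phi(a/b) - \phi((s-a)/b)}{\Phi(a/b) + \Phi((s-a)/b) - 1}$ with $a = s - \mu - \sigma^2\alpha$ and $b = \sigma$, where $\phi$ and $\Phi$ are the standard normal density and distribution function. The function names and the parameter values are hypothetical, and the parameters are taken as already estimated.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def rma_background_adjust(s, mu, sigma, alpha):
    """Expected signal E(S | PM = s) under PM = S + B.

    S ~ Exp(alpha), B ~ N(mu, sigma^2) truncated at zero (assumed model).
    """
    a = s - mu - sigma ** 2 * alpha
    b = sigma
    numerator = normal_pdf(a / b) - normal_pdf((s - a) / b)
    denominator = normal_cdf(a / b) + normal_cdf((s - a) / b) - 1.0
    return a + b * numerator / denominator


# Hypothetical parameter estimates for one array
mu, sigma, alpha = 100.0, 20.0, 0.01
for pm_value in (50.0, 150.0, 500.0, 2000.0):
    print(pm_value, rma_background_adjust(pm_value, mu, sigma, alpha))
```

Under this model the adjusted value is always positive and smaller than the raw reading, which avoids the negative values that a plain background subtraction can produce.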
Affy data --- normalization
*** This is also relevant to other array platforms!
Goal: reduce chip effects, including non-linear effects.
Difficulty: the sample is different for each chip, so we cannot match a gene
on chip A to the same gene on chip B and expect the same intensity.
Instead, assumptions are made about the overall distribution of the signals on each chip.
For example:
Some house-keeping genes don’t change;
The overall distribution of concentrations doesn’t change;
etc.
Affy data --- normalization
Quantile normalization --- match the quantiles between two
chips.
Assumes that the distribution of gene abundances is the
same between samples.
$x_{norm} = F_2^{-1}(F_1(x))$, where
x: value in the chip to be normalized
$F_1$: distribution function in the chip to be normalized
$F_2$: distribution function in the reference chip
Nature Protocols 2, 2958 - 2974 (2007)
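The mapping $x_{norm} = F_2^{-1}(F_1(x))$ can be sketched with empirical distributions. This is a minimal illustration, assuming both chips have the same number of probes and ignoring ties; the function name and the intensity values are hypothetical.

```python
def quantile_normalize_to_reference(x, ref):
    """Map each value in x to the reference value with the same rank.

    With empirical distribution functions and equal-length chips, this is
    x_norm = F2^{-1}(F1(x)): the k-th smallest value of x is replaced by the
    k-th smallest value of the reference chip.
    """
    if len(x) != len(ref):
        raise ValueError("this sketch assumes chips of equal length")
    order = sorted(range(len(x)), key=lambda i: x[i])  # indices of x by rank
    ref_sorted = sorted(ref)
    normalized = [0.0] * len(x)
    for rank, i in enumerate(order):
        normalized[i] = ref_sorted[rank]
    return normalized


chip = [5.0, 2.0, 9.0, 1.0]          # chip to be normalized
reference = [10.0, 40.0, 20.0, 30.0]  # reference chip
print(quantile_normalize_to_reference(chip, reference))
```

After normalization the chip has exactly the same set of values as the reference, while the ordering of its own probes is preserved.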
Affy data --- RMA summary
Model-fitting: Median Polish (robust against outliers)
Alternately remove the row and column medians
until convergence.
The remainder is the residual;
After subtracting the residual, the row and column medians are the estimates of the effects.
Affy data --- RMA summary
[Worked example, shown over three slides: remove the row medians, then the column medians, and repeat; once the table stops changing, the procedure has converged and what remains is the residual.]
Affy data --- RMA summary
* This reflects the assumption that probe effects have median
zero.
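The median polish procedure can be sketched as follows. This is an illustrative implementation of Tukey's median polish, not the Bioconductor code; the function name and the small table are hypothetical, and the layout (rows as probes, columns as chips, values on the log scale) is an assumption made for the example.

```python
from statistics import median


def median_polish(table, max_iter=20, tol=1e-9):
    """Decompose table[i][j] into overall + row_eff[i] + col_eff[j] + resid[i][j]."""
    n_rows, n_cols = len(table), len(table[0])
    resid = [list(row) for row in table]
    overall, row_eff, col_eff = 0.0, [0.0] * n_rows, [0.0] * n_cols
    for _ in range(max_iter):
        # Remove row medians and accumulate them as row effects
        row_med = [median(row) for row in resid]
        for i in range(n_rows):
            resid[i] = [v - row_med[i] for v in resid[i]]
            row_eff[i] += row_med[i]
        shift = median(col_eff)
        col_eff = [c - shift for c in col_eff]
        overall += shift
        # Remove column medians and accumulate them as column effects
        col_med = [median(resid[i][j] for i in range(n_rows)) for j in range(n_cols)]
        for i in range(n_rows):
            resid[i] = [resid[i][j] - col_med[j] for j in range(n_cols)]
        col_eff = [c + m for c, m in zip(col_eff, col_med)]
        shift = median(row_eff)
        row_eff = [r - shift for r in row_eff]
        overall += shift
        if max(map(abs, row_med)) < tol and max(map(abs, col_med)) < tol:
            break  # converged: nothing left to remove
    return overall, row_eff, col_eff, resid


# Hypothetical log-scale PM values: 3 probes (rows) x 3 chips (columns)
table = [[2.0, 4.0, 6.0], [3.0, 5.0, 7.0], [4.0, 6.0, 8.0]]
overall, probe_eff, chip_eff, resid = median_polish(table)
print(overall, probe_eff, chip_eff)
```

With this layout, the summarized expression of the probeset on chip j would be `overall + chip_eff[j]`, and the fitted probe effects have median zero.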
Deep Sequencing
“Method of the year” 2007 by Nature Methods.
The name:
“Next generation sequencing”
“Deep sequencing”
“High-throughput sequencing”
“Second-generation sequencing”
The key characteristics:
Massively parallel sequencing: the amount of data from a single run is comparable to the amount of data from the Human Genome Project.
The reads are short: a few hundred bases per read.
Background
Potential impact:
The “$1000 genome”
Genome sequencing will become a regular medical
procedure.
Personalized medicine
Predictive medicine
Ethical issues
For statisticians:
Data mining using hundreds of thousands of genomes
Finding rare SNPs/mutations associated with diseases
New methods to analyze epigenomics/transcriptomics data
Finding interventions to improve life quality
Background
The companies use different techniques. We use Illumina’s
as an example. (http://seqanswers.com/forums/showthread.php?t=21)
Background
An incomplete list of some common platforms.
Bioinformatics and Biology insights 2015:9(s1)
Background
Advantages:
Fast and cost effective.
No need to clone DNA fragments.
Drawbacks:
Short read length (platform dependent)
Some platforms have trouble with identical repeats
Non-uniform confidence in base calling in reads.
Data less reliable near the 3’ end of each read.
Background
What deep sequencing can do:
Background
Nat Methods. 2009 Nov;6(11 Suppl):S2-5.
Alignment and Assembly
Sequence the genome of a person? --- Alignment
Can rely on existing human genome as a blue print.
Align the short reads onto the existing human genome.
Need a few fold coverage to cover most regions.
Sequence a whole new genome? --- Assembly
Overlaps are required to construct the genome.
The reads are short, so ~30-fold coverage is needed.
At 3 Gb of data per run, a new genome of similar size to the human genome needs about 30 runs.
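The arithmetic behind this estimate is coverage = total sequenced bases / genome size. A minimal sketch, with a hypothetical function name and a human-sized genome taken as roughly 3 Gb:

```python
import math


def runs_needed(genome_size_bp, target_coverage, bases_per_run):
    """Number of sequencing runs needed to reach the target fold coverage."""
    total_bases = genome_size_bp * target_coverage
    return math.ceil(total_bases / bases_per_run)


# ~3 Gb genome, 30-fold coverage, 3 Gb of data per run
print(runs_needed(3_000_000_000, 30, 3_000_000_000))
```

The same function shows why re-sequencing a person by alignment is cheaper: with only a few-fold coverage required, the number of runs drops proportionally.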
Whole genome/exome/transcriptome sequencing
Alignment
RNA-Seq
Finding novel exons.
Alternative splicing
RNA-Seq
Gene expression profiling – to replace arrays?
Exon-specific abundance.
RNA-Seq
Genome Biology 2010, 11:220
Alignment
Hash table-based alignment. Similar to BLAST in principle.
(1) Find potential locations:
(2) Local alignment.
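The two steps can be sketched on a toy example. This is a simplified illustration, not the algorithm of any particular aligner: the function names, the seed length k and the sequences are hypothetical, and step (2) is replaced by an ungapped mismatch count instead of a true local alignment.

```python
def build_index(reference, k):
    """Hash table mapping every k-mer of the reference to its start positions."""
    index = {}
    for pos in range(len(reference) - k + 1):
        index.setdefault(reference[pos:pos + k], []).append(pos)
    return index


def align_read(read, reference, index, k, max_mismatches=2):
    """Return (start, mismatches) for each candidate location that passes."""
    # Step 1: look up non-overlapping seeds of the read to find candidate starts
    candidates = set()
    for offset in range(0, len(read) - k + 1, k):
        for pos in index.get(read[offset:offset + k], []):
            start = pos - offset
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    # Step 2 (simplified): verify each candidate by counting mismatches
    hits = []
    for start in sorted(candidates):
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


reference = "ACGTTGCATGCCGATTACAGGCTTAACG"
index = build_index(reference, k=4)
read = "TGCCGTTTACAG"  # copy of reference[8:20] with one base changed
print(align_read(read, reference, index, k=4))
```

Using several seeds per read is what makes the lookup tolerant to errors: a mismatch destroys at most the seed that contains it, and the remaining seeds still point to the right location.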
Normalization
Genome Biology 2010, 11:220
Normalization
RPKM: Reads per kilobase transcript per million reads
Normalization by ERCC (External RNA Controls Consortium):
Nature Methods 12: 339–342(2015)
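The RPKM definition above translates directly into a formula: reads mapped to the transcript, divided by the transcript length in kilobases and by the total number of mapped reads in millions. A minimal sketch, with a hypothetical function name and made-up counts:

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    kilobases = transcript_length_bp / 1e3
    millions_of_reads = total_mapped_reads / 1e6
    return read_count / (kilobases * millions_of_reads)


# Hypothetical gene: 500 reads on a 2,000 bp transcript, 10 million mapped reads
print(rpkm(500, 2000, 10_000_000))
```

Dividing by length makes long and short transcripts comparable within a sample, and dividing by library size makes samples with different sequencing depths comparable.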
Sequence count models
Example: simple Poisson model for between-group testing, where
$d_i$: sequencing depth of sample i
$\beta_g$: the expression level of gene g
$\gamma_g$: the association of gene g with the covariate
Cancer Informatics 2015:14(s1)
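The model equation is not reproduced in this transcript. As a sketch only, the code below assumes a log-linear form consistent with the symbols defined above, in which the count of gene g in sample i is Poisson with mean $d_i \exp(\beta_g + \gamma_g x_i)$ and $x_i$ is the covariate (for example, 0 or 1 for group membership). The function names, the form of the mean and the numbers are all assumptions.

```python
import math


def expected_count(depth, beta, gamma, x):
    """Assumed Poisson mean: depth * exp(beta + gamma * x)."""
    return depth * math.exp(beta + gamma * x)


def poisson_logpmf(n, mu):
    """Log probability of observing count n under Poisson(mu)."""
    return n * math.log(mu) - mu - math.lgamma(n + 1)


def gene_loglik(counts, depths, covariate, beta, gamma):
    """Log-likelihood of one gene's counts across samples."""
    return sum(
        poisson_logpmf(n, expected_count(d, beta, gamma, x))
        for n, d, x in zip(counts, depths, covariate)
    )


# Hypothetical gene: two samples per group, depths in millions of reads
counts = [20, 25, 45, 38]
depths = [1.0, 1.2, 1.1, 0.9]
group = [0, 0, 1, 1]
print(gene_loglik(counts, depths, group, beta=3.0, gamma=0.0))  # no group effect
print(gene_loglik(counts, depths, group, beta=3.0, gamma=0.7))  # with group effect
```

Comparing the likelihood with gamma fixed at zero against the likelihood with gamma free is the basis of a between-group test under this kind of model.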
Sequence count models
The Poisson model doesn’t allow overdispersion.
Negative binomial model:
Φg accounts for the sample-to-sample variability
Methods like DESeq use the negative binomial distribution.
Cancer Informatics 2015:14(s1)
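Overdispersion can be made concrete with a small calculation. The sketch below assumes the parameterization commonly used for negative binomial count models, in which the variance is $\mu + \phi\mu^2$ and $\phi = 0$ recovers the Poisson case; the function names, the crude moment estimator and the counts are hypothetical, and the samples are assumed to have equal sequencing depth.

```python
from statistics import mean, variance


def nb_variance(mu, dispersion):
    """Variance under the assumed negative binomial form: mu + dispersion * mu^2."""
    return mu + dispersion * mu ** 2


def moment_dispersion(counts):
    """Crude method-of-moments dispersion estimate, floored at zero."""
    m = mean(counts)
    v = variance(counts)
    return max(0.0, (v - m) / m ** 2)


# Hypothetical counts of one gene across replicate samples
counts = [80, 120, 95, 150, 60, 110]
print(mean(counts), variance(counts))  # variance well above the mean
print(moment_dispersion(counts))
```

When the sample variance is far above the mean, as in this made-up example, a Poisson model would understate the variability and overstate the significance of between-group differences.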
RNA-Seq vs. Array
Good agreement for genes expressed at medium-level.