Microarray Data Analysis of Illumina Data
Using R/Bioconductor
Reddy Gali, Ph.D.
[email protected]
http://catalyst.harvard.edu
Agenda
• Introduction to microarrays
• Workflow of a gene expression microarray experiment
• Microarray experimental design
• Public microarray databases
• Microarray preprocessing - Quality control and Diagnostic analysis
Agenda
• Introduction to R/Bioconductor
• Installation of R and Bioconductor Packages
• General data analysis and strategies
• Data analysis using lumi package
• Data analysis using limma package
Workflow of Gene Expression
Biological question
Experimental design
QC
Tissue / sample preparation
Extraction of Total RNA
QC
Probe amplification & labeling
QC
Microarray hybridization & processing
Image analysis
QC
Data analysis (expression measures, normalization, statistical filtering, clustering, pathway analysis)
Biological verification
QC
Pitfalls of Microarray Experiment
• Gene expression changes detected by microarray analysis sometimes cannot be
validated by other methods. Common causes:
- Inadequate design
- Data quality is low
- Statistical approach is not adequate
- Expression level of gene is below detection limit
- Change in gene expression is small
- Microarray detection probe is not specific or not sensitive
Questions usually asked
• What kind of technology or microarray should I use?
• How many replicates do I need?
• What is a real replicate?
• Do I need statistical advice?
• Should I do technical replicates?
• Should I pool my samples?
• How do I analyze my dataset?
• What software should I use?
Design of Microarray Experiment
• Replicates
• Goal, resources, technology, quality, design and analysis
• Two fold change – 3 replicates
• Smaller change – 5 replicates
• Technical replicates and Biological replicates
• Sample pooling
• Amount of sample
• Replicates of pooled sample
• No way to find variance between samples
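The replicate guidance above can be sanity-checked with a rough power calculation. This is a minimal sketch using base R's power.t.test. The helper name replicates_needed is hypothetical, and the standard deviation (0.5 on the log2 scale), 80% power and 0.05 significance level are illustrative assumptions, not values from the slides. It treats one gene in isolation and ignores multiple testing, so read the result as a lower bound. A smaller assumed standard deviation gives fewer replicates.

```r
# Rough per-gene estimate of replicates per group for a two-sample t-test.
# ASSUMPTIONS (not from the slides): sd of 0.5 on the log2 scale,
# 80% power, alpha = 0.05, no multiple-testing correction.
replicates_needed <- function(fold_change, sd_log2 = 0.5,
                              power = 0.8, alpha = 0.05) {
  res <- power.t.test(delta = log2(fold_change), sd = sd_log2,
                      power = power, sig.level = alpha)
  ceiling(res$n)  # round up to whole arrays per group
}

n_two_fold <- replicates_needed(2)    # two-fold change
n_smaller  <- replicates_needed(1.5)  # smaller (1.5-fold) change
cat("two-fold:", n_two_fold, " 1.5-fold:", n_smaller, "\n")
```

Detecting the smaller change never needs fewer replicates than the two-fold change, which is the point of the two bullets above.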
Gene Expression Omnibus (GEO)
Public Microarray Databases
• BodyMap - http://bodymap.ims.u-tokyo.ac.jp/
• SMD - http://genome-www5.stanford.edu/
• RIKEN - http://read.gsc.riken.go.jp/
• MGI - http://www.informatics.jax.org/
• GEO - http://www.ncbi.nlm.nih.gov/geo/
• CIBEX - http://cibex.nig.ac.jp/index.jsp
• ArrayExpress - http://www.ebi.ac.uk/microarray-as/ae/
Microarray Platforms
• Agilent Microarrays 60-mer format
• Codelink Bioarrays 30-mer format
• Affymetrix GeneChips 25-mer format
• Illumina Beadchips
• NimbleGen 60-mer format
Illumina Bead Array Technology
Silica Beads
Each bead is covered with hundreds of thousands of copies of a
specific oligonucleotide
Some Facts
• Each bead carries copies of probes with, on average, 30
replicates of every bead type per array
• Around 10^5 copies of a particular DNA sequence of
interest are covalently attached to each bead
• DNA sequences (oligonucleotides) attached to the beads
are 75 base pairs in length, with 25 base pairs used for
decoding and 50 base pairs used for target hybridization
• A pool of different bead types is created, beads of the
same type having the same probe sequence attached
Box Plots of unnormalized data
Raw vs Normalized data
[Figure panels: Raw Data, Normalized Data]
Histograms of unnormalized data
Why Normalize
• It adjusts the individual hybridization intensities to balance them
appropriately so that meaningful biological comparisons can be
made.
• Unequal quantities of starting RNA
• Differences in labeling or detection efficiencies between the
fluorescent dyes used
• Systematic biases in the measured expression levels.
• Sample preparation
• Variability in hybridization
• Spatial effects
• Scanner settings
• Experimenter bias
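To make the idea of balancing intensities across arrays concrete, here is a minimal base-R sketch of quantile normalization, the method named later in the lumi workflow. The function quantile_normalize and the simulated intensities are illustrative assumptions. This is not lumi's implementation, and it does not handle missing values.

```r
# Minimal quantile normalization sketch (illustrative, not lumi's code).
# Rows are probes, columns are arrays. Missing values are NOT handled.
quantile_normalize <- function(x) {
  x <- as.matrix(x)
  ranks  <- apply(x, 2, rank, ties.method = "first")  # rank within each array
  sorted <- apply(x, 2, sort)                         # sort each array
  reference <- rowMeans(sorted)                       # mean of each quantile
  normalized <- apply(ranks, 2, function(r) reference[r])
  dimnames(normalized) <- dimnames(x)
  normalized
}

# Simulated raw intensities (made up): array2 is systematically brighter.
set.seed(1)
raw <- cbind(array1 = rexp(1000, rate = 1 / 200),
             array2 = rexp(1000, rate = 1 / 400))
norm <- quantile_normalize(raw)

# After normalization both arrays share the same distribution.
print(summary(raw))
print(summary(norm))
```

Each probe keeps its rank within its own array. Only the intensity scale is forced onto a common reference distribution.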
Free Software – Data analysis
• Bioconductor
– is an open source and open development software project
to provide tools for the analysis and comprehension of
genomic data.
• TMEV 4.0
– is an application that allows the viewing of processed
microarray slide representations and the identification of
genes and expression patterns of interest.
R / Bioconductor
• R and Bioconductor packages
• R (http://cran.r-project.org/) is a comprehensive
statistical environment and programming language for
professional data analysis and graphical display.
• Bioconductor (http://www.bioconductor.org/) is an
open source and open development software project for
the analysis of microarray, sequence and genome data.
• More than 300 Bioconductor packages.
• http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/R_BioCondManual.html
R / Bioconductor - Installation
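The installation slide survives only as a title, so the following is a hedged sketch rather than the presenter's procedure. It assumes a current R installation, where Bioconductor packages are installed through the BiocManager package. The helper install_workshop_packages is hypothetical, and the package list (lumi, limma) is taken from the agenda.

```r
# Hypothetical helper (not from the slides): installs the workshop packages
# through BiocManager, skipping any that are already available.
install_workshop_packages <- function(pkgs = c("lumi", "limma")) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  missing <- pkgs[!installed]
  if (length(missing) == 0) {
    return(invisible(missing))  # nothing to do
  }
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")  # BiocManager itself comes from CRAN
  }
  BiocManager::install(missing)
  invisible(missing)
}
```

Calling install_workshop_packages() once per R installation should be enough. After that, library(lumi) and library(limma) load the packages in each session.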
Preparing R for analysis
Analysis using lumi R package
- Loading data into R/Bioconductor
> library(lumi)
> lumi.Rdata <- lumiR('workshop_data.csv')
- Summary of the loaded data
> lumi.Rdata
- Quality control of loaded data
> summary(lumi.Rdata, 'QC')
> density(lumi.Rdata)
> boxplot(lumi.Rdata)
> MAplot(lumi.Rdata)
> plot(lumi.Rdata, what='sampleRelation')
> plot(lumi.Rdata, what='cv')
> plot(lumi.Rdata, what='outlier')
Variance Stabilization
> lumi.Tdata <- lumiT(lumi.Rdata)
> lumi.VSdata <- plotVST(lumi.Tdata)
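Why a variance-stabilizing transform rather than a plain log2? The lumi vst method is generally described as an arcsinh-type (generalized log) transform whose parameters are estimated from bead-level standard deviations. The base-R sketch below does not reproduce that fit. It only compares log2 with a fixed arcsinh transform on made-up intensities, to show that the two agree at high intensity while arcsinh stays finite near zero.

```r
# Illustration only: fixed arcsinh transform vs log2 (NOT lumi's fitted VST).
intensity <- c(0, 1, 10, 100, 1000, 10000)  # made-up raw intensities

comparison <- data.frame(
  intensity   = intensity,
  log2_value  = log2(intensity),                # -Inf at zero intensity
  glog2_value = asinh(intensity / 2) / log(2)   # finite at zero, close to log2 for large values
)
print(comparison)
```

Low intensities are where background-corrected Illumina values can be near zero, so a transform that behaves well there avoids inflating the variance of weakly expressed probes.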
Normalization
> lumi.Ndata <- lumiN(lumi.Tdata)
Or do all the default preprocessing in one step
> lumi.N.Q <- lumiExpresso(lumi.Rdata)
– Background Correction: bgAdjust
– Variance Stabilizing Transform method: vst
– Normalization method: quantile
– Perform all the QC again
> summary(lumi.Ndata, 'QC')
Differential expression
> library(limma)
> # samples 1-4 are controls, samples 5-8 are affected
> design <- model.matrix(~ -1 + factor(c(1, 1, 1, 1, 2, 2, 2, 2)))
> colnames(design) <- c("control", "affected")
> fit <- lmFit(lumi.Ndata, design)
> cont.matrix <- makeContrasts(signature = affected - control, levels = design)
> fit2 <- contrasts.fit(fit, cont.matrix)
> ebFit <- eBayes(fit2)
> results <- topTable(ebFit, number = 100, sort.by = "B", resort.by = "M")
> print(results)
> write.table(topTable(ebFit, coef = 1, adjust = "fdr", sort.by = "B", number = 25000),
+   file = "results.xls", row.names = FALSE, sep = "\t")
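The design matrix, the contrast and the "fdr" adjustment used above can be inspected with base R alone. This is a small sketch on made-up numbers: the expression values y and the p-values p are invented for illustration, and lm.fit stands in for the per-gene linear fit (it does not apply limma's empirical Bayes moderation).

```r
# Group labels as on the slide: four controls followed by four affected samples.
group <- factor(c(1, 1, 1, 1, 2, 2, 2, 2))
design <- model.matrix(~ -1 + group)   # one indicator column per group, no intercept
colnames(design) <- c("control", "affected")
print(design)

# One made-up gene (log2 scale): the fitted coefficients are the group means.
y <- c(5.1, 4.9, 5.0, 5.2, 6.9, 7.1, 7.0, 7.2)
coefs <- lm.fit(design, y)$coefficients

# The contrast "affected - control" is a weighted sum of the coefficients.
contrast <- c(control = -1, affected = 1)
log_fc <- sum(contrast * coefs[names(contrast)])
cat("log2 fold change:", log_fc, "\n")

# adjust = "fdr" corresponds to the Benjamini-Hochberg adjustment in p.adjust.
p <- c(0.0001, 0.004, 0.03, 0.2, 0.6)  # made-up p-values
fdr <- p.adjust(p, method = "fdr")
print(data.frame(p = p, fdr = fdr))
```

The contrast value equals the difference between the two group means, which is the quantity the limma fit estimates for every probe.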
Thank you
Reddy Gali, Ph.D.
[email protected]
Phone: 617 432 7471
http://catalyst.harvard.edu