Monday - Biostatistics

Summer Institute of Epidemiology and Biostatistics, 2008:
Gene Expression Data Analysis
8:30am-12:30pm in Room W2017
Carlo Colantuoni – [email protected]
http://www.biostat.jhsph.edu/GenomeCAFE/GeneExpressionAnalysis/GEA2008.htm
Class Outline
• Basic Biology & Gene Expression Analysis Technology
• Data Preprocessing, Normalization, & QC
• Measures of Differential Expression
• Multiple Comparison Problem
• Clustering and Classification
• The R Statistical Language and Bioconductor
• GRADES – independent project with Affymetrix data.
Class Outline - Detailed
• Basic Biology & Gene Expression Analysis Technology
  – The Biology of Our Genome & Transcriptome
  – Genome and Transcriptome Structure & Databases
  – Gene Expression & Microarray Technology
• Data Preprocessing, Normalization, & QC
  – Intensity Comparison & Ratio vs. Intensity Plots (log transformation)
  – Background correction (PM-MM, RMA, GCRMA)
  – Global Mean Normalization
  – Loess Normalization
  – Quantile Normalization (RMA & GCRMA)
  – Quality Control: Batches, plates, pins, hybs, washes, and other artifacts
  – Quality Control: PCA and MDS for dimension reduction
• Measures of Differential Expression
  – Basic Statistical Concepts
  – T-tests and Associated Problems
  – Significance analysis in microarrays (SAM) [ & Empirical Bayes]
  – Complex ANOVA’s (limma package in R)
• Multiple Comparison Problem
  – Bonferroni
  – False Discovery Rate Analysis (FDR)
• Differential Expression of Functional Gene Groups
  – Functional Annotation of the Genome
  – Hypergeometric test?, Χ2, KS, pDens, Wilcoxon Rank Sum
  – Gene Set Enrichment Analysis (GSEA)
  – Parametric Analysis of Gene Set Enrichment (PAGE)
  – geneSetTest
  – Notes on Experimental Design
• Clustering and Classification
  – Hierarchical clustering
  – K-means
  – Classification
    • LDA (PAM), kNN, Random Forests
    • Cross-Validation
• Additional Topics
  – The R Statistical Language
  – Bioconductor
  – Affymetrix data processing example!
DAY #1:
Genome Biology
The Transcriptome
Microarray Technology
The Human Genome
• 2 copies of the entire genome in each cell:
• 3.3 billion “bases” (3.3 Gb)
• ~30K genes
• millions of variants
• We each get 1 copy from MOM & 1 from DAD. Each parent passes on a “mixed copy” (from their parents).
• Each copy of the genome is contained in 23 chromosomes: 22 + X or Y (2 copies = 46 / cell).
• All in DNA!
[Figure: DAD + MOM → YOU]
• A deoxyribonucleic acid or DNA molecule is a double-stranded polymer composed of four basic molecular units called nucleotides.
• Each nucleotide contains a phosphate group, a deoxyribose sugar, and one of four nitrogenous bases: adenine (A), guanine (G), cytosine (C), and thymine (T).
• The two chains are held together by hydrogen bonds.
• Base-pairing occurs according to the following rule: G pairs with C, and A pairs with T.
• Directionality & Complementarity: Reverse Complements hybridize.
DNA
How do these molecular interactions influence directionality and complementarity?
G-C pairs are “stickier” than A-T pairs.
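The base-pairing rule and the idea that reverse complements hybridize can be illustrated with a short sketch. The function names below are hypothetical and chosen only for illustration; the check for hybridization is a simplification that accepts perfect reverse complements only, and GC fraction is used as a rough stand-in for "stickiness".

```python
# Illustrative sketch only; function names are hypothetical, not from the course.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))

def can_hybridize(strand_a, strand_b):
    """True if two 5'->3' strands are perfect reverse complements.

    Simplifying assumption: real hybridization tolerates some mismatches.
    """
    return strand_b.upper() == reverse_complement(strand_a)

def gc_fraction(seq):
    """Fraction of G or C bases (G-C pairs bind more strongly than A-T pairs)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

probe = "GATTACA"
target = reverse_complement(probe)
print(target, can_hybridize(probe, target), round(gc_fraction(probe), 3))
```

Because both strands are written 5'→3', the partner strand has to be reversed as well as complemented, which is where directionality enters.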
Another View of DNA
Where does an individual gene lie in this schematic?
Central Dogma of Modern
Cellular & Molecular Biology:
Transcription
From DNA to mRNA:
Transcription occurs at Genes
Transcript Processing
Translation
From RNA to Protein: In the exons of protein-coding genes (and their mRNA intermediates), each codon (3 bases) encodes 1 amino acid in the protein.
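As a minimal sketch of the DNA → mRNA → protein flow, the snippet below transcribes a coding-strand sequence and translates it codon by codon using the standard genetic code. The function names are hypothetical, and the example deliberately ignores introns, splicing, and UTR details: it assumes an already-spliced sequence and simply starts at the first AUG.

```python
# Illustrative sketch only; assumes a spliced sequence and the standard genetic code.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Build the 64-entry codon table; "*" marks a stop codon.
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def transcribe(coding_strand):
    """DNA coding strand (5'->3') to mRNA: same sequence with U in place of T."""
    return coding_strand.upper().replace("T", "U")

def translate(mrna):
    """Translate from the first AUG to the first in-frame stop codon."""
    seq = mrna.upper().replace("U", "T")
    start = seq.find("ATG")
    if start == -1:
        return ""
    protein = []
    for pos in range(start, len(seq) - 2, 3):
        amino_acid = CODON_TABLE[seq[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

mrna = transcribe("GGATGGCCTGGTAAAA")
print(mrna, translate(mrna))
```

The bases before the first AUG and after the stop codon play the role of the 5’ and 3’ UTRs: they are transcribed but not translated.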
Perspective: Biological Setup
Every cell in the human body contains the entire
human genome: 3.3 Gb in which ~30K genes exist.
The investigation of gene expression is meaningful
because different cells, in different environments,
doing different jobs express different genes.
Cellular “Plans”: DNA - RNA - PROTEIN
Cellular Biology, Gene Expression, and
Microarray Analysis
A protein-coding gene is a segment of
chromosomal DNA that directs the synthesis
of a protein via an mRNA intermediate.
DNA
RNA
Protein
How do we design and implement probes that will effectively assay expression of ALL (most? many?) genes simultaneously?
Laboratory Methods:
The Genome and The Transcriptome
Easy to sequence some genomic DNA.
Easy to sequence some expressed mRNA’s.
NOT EASY to catalogue all genomic DNA, all
expressed mRNA’s, and to map out the exact
relations between all these sequences.
Molecular Cell Biology:
Components of the Central Dogma
[Figure: Genomic DNA (3.3 Gb) → Transcription → mRNA (5’ UTR, START, protein coding, STOP, 3’ UTR, AAAAA) → Translation → Protein]
Gene: Protein coding unit of genomic DNA with an mRNA intermediate.
Sequence is a Necessity.
[Figure: Genomic DNA (3.3 Gb, ~30K genes) → Transcription → mRNA (5’ UTR, START, protein coding, STOP, 3’ UTR, AAAAA); DNA Probe]
From Genomic DNA to mRNA Transcripts
[Figure labels: EXONS, INTRONS; ~30K, >30K]
Protein-coding genes are not easy to find - gene density is low, and exons are interrupted by introns.
• Alternative splicing
• Alternative start & stop sites in same RNA molecule
• RNA editing & SNPs
• Transcript coverage
• Homology to other transcripts
• Hybridization dynamics
• 3’ bias
Designing DNA Probes From
Genomic DNA Sequence
Sequence & assemble the entire human genome.
Search for genes predicted to produce mRNA transcripts.
Protein-coding genes are not easy to find - gene density is low, and exons
are interrupted by introns.
Completeness?
Design DNA probes.
[ Genomic DNA databases & assembly ]
Designing DNA Probes From
mRNA Sequences
Sequence ALL expressed mRNA molecules.
Completeness?
Design DNA probes.
Unsurpassed as source of expressed sequence
Sequence Quality!
Redundancy!
Completeness?
Chaos?!?
From Genomic DNA to mRNA Transcripts
~30K
>30K
>>30K
Transcript-Based
Gene-Centered Information
DAY #1:
Genome Biology
The Transcriptome
Microarray Technology
RNA Expression Measurement: Northern Blot
[Figure: SAMPLE 1 and SAMPLE 2 → RNA Extraction → RNA 1 and RNA 2 (“target”) → electrophoretic separation → electrophoretic transfer to membrane → hybridization of labeled probe. Separately: Seq DB → Design + construction of labeled “probe”.]
RNA Expression Measurement:
Northern Blot & Microarrays
Northern: Northern blots seek to interrogate the expression of ONE gene in a SINGLE hybridization reaction.
Microarray: Microarrays seek to interrogate the expression of MANY genes simultaneously in a MULTIPLEX hybridization reaction.
SEQUENCE knowledge is REQUIRED for BOTH!
Target: unknown (sample)
Probe: known (synthetic)
Hybridization on a Northern Blot
1 Labeled Probe + MANY Unlabeled Targets (on MEMBRANE) → 1 Hybrid
Target: unknown
Probe: known
Edwin Southern et al, Nature Genetics Suppl 1999
Hybridization on a Microarray
Labeled Target (MANY) + MANY Unlabeled Probes (on Solid Support) → MANY Hybrids
Target: unknown
Probe: known
Edwin Southern et al, Nature Genetics Suppl 1999
Essentials of Microarray Experimental Design:
• Probe sequence selection & design
• Probe deposition on solid support
• Target Labeling
• Target Hybridization
• Signal detection
[Figure labels: Target, Probes, Microarray]
cDNA Microarray Fabrication
Bacterial clones in
96 well plates
Printing onto standard glass
microscope slides or nylon
cDNA Microarray
cDNA Microarray Experimentation
[Figure: Sample and Standard RNA → Cy5- and Cy3-labeled cDNA → Hybridized Microarray → Scan]
cDNA Microarray Scanning
[Figure: Cy5 → Cy5 Channel Data; Cy3 → Cy3 Channel Data; Merged Image → Quantification]
cDNA Microarray Quantification
[Figure: Log Intensity of one channel vs. Log Intensity of the other channel, followed by Log Ratio [channel / channel] vs. Log Intensity [channel + channel]]
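The ratio vs. intensity view of two-channel data can be sketched as follows. This is an assumption-laden illustration: it uses the common convention of M = log2(Cy5/Cy3) for the log ratio and A = the average of the two log2 intensities for the overall log intensity; the exact axis definitions used in the slide figures are not recoverable from the transcript, and the function name is hypothetical.

```python
import math

def log_ratio_and_intensity(cy5, cy3):
    """Return (M, A) for one spot.

    M: log2 ratio of the two channels (assumed here to be Cy5 over Cy3).
    A: average log2 intensity of the two channels (assumed convention).
    """
    m = math.log2(cy5 / cy3)
    a = 0.5 * (math.log2(cy5) + math.log2(cy3))
    return m, a

# Hypothetical background-corrected spot intensities: (Cy5, Cy3).
spots = [(1024.0, 256.0), (512.0, 512.0), (128.0, 2048.0)]
for cy5, cy3 in spots:
    m, a = log_ratio_and_intensity(cy5, cy3)
    print(f"Cy5={cy5:7.1f} Cy3={cy3:7.1f}  M={m:+.2f}  A={a:.2f}")
```

On the log scale, a 4-fold increase and a 4-fold decrease are symmetric around zero (M = +2 and M = -2), which is one reason the log transformation is applied before plotting.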
Agilent (HP) Microarrays
44,000 oligonucleotides (60 NT’s) synthesized in situ using
inkjet printing and solid phase phosphoramidite chemistry.
2-channel fluorescence
on glass slides.
NIA Microarray
10K Full Length cDNA’s Spotted on Nylon
P33, One-Channel
Affymetrix GeneChip
1,300,000 oligonucleotides (25
NT’s) in 54,000 “probe sets”
(11 PM’s and 11 MM’s).
Oligo’s synthesized in situ
on a silicon wafer using
photolithography.
One-channel data generated
using biotin labeling.
Affymetrix Probe Set Design
5’
3’
Reference sequence
…TGTGATGGTGCATGATGGGTCAGAAGGCCTCCGATGCGCCGATTGAGAAT…
GTACTACCCAGTCTTCCGGAGGCTA Perfect match (PM): NSB & SB
GTACTACCCAGTGTTCCGGAGGCTA Mismatch (MM): NSB
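The relationship between the reference sequence and the PM/MM pair above can be checked with a short sketch. The function names are hypothetical. The PM probe is written base-by-base under the reference (so it is the plain complement as displayed), and the MM probe differs from it only at the middle (13th) base, which is swapped for its complement.

```python
# Illustrative sketch only; function names are hypothetical.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(seq):
    """Base-by-base complement, written in the same left-to-right order."""
    return "".join(COMPLEMENT[base] for base in seq)

def perfect_match(target_25mer):
    """PM probe as displayed under the reference: complement of the target region."""
    return complement(target_25mer)

def mismatch(pm_probe):
    """MM probe: the PM probe with its middle base replaced by its complement."""
    mid = len(pm_probe) // 2  # index 12, i.e. the 13th base of a 25-mer
    return pm_probe[:mid] + COMPLEMENT[pm_probe[mid]] + pm_probe[mid + 1:]

reference = "TGTGATGGTGCATGATGGGTCAGAAGGCCTCCGATGCGCCGATTGAGAAT"
target = reference[10:35]  # the 25-base region interrogated by this probe pair

pm = perfect_match(target)
mm = mismatch(pm)
print(pm)
print(mm)
```

Since the MM probe should bind the intended transcript poorly, its signal is meant to estimate non-specific binding (NSB), while the PM signal contains both specific binding (SB) and NSB.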
NimbleGen Microarrays
195,000 oligonucleotides
(60 NT’s): 5 probes / gene.
One-channel data.
Oligonucleotides synthesized in situ
on a glass slide using maskless,
digital micromirror device.
Amersham’s CodeLink Arrays
54,841 oligonucleotides
(30NT’s).
Spotted into a 3-D aqueous
polyacrylamide gel surface
on a glass slide.
One-channel data.
ABI’s Human Genome Survey Array
31,077 oligonucleotides (60 NT’s).
Oligonucleotides spotted into a 3-D nylon matrix.
One-channel data using digoxigenin/AP.
Illumina’s BeadChip
1,700,000 oligonucleotides (50 NT’s)
immobilized on beads and represented
~30 times (6 full arrays per glass slide).
Oligonucleotides anchored on beads
distributed in random arrays of plasma
etched pits in the silicon wafer.
One-channel
data using biotin.
Essentials of Microarray Experimental Design:
• Probe sequence: Oligo vs. cDNA (Design: follow-up); Probe length: Specificity & Sensitivity
• Probe deposition on solid support
• Target Labeling: Signal? Amplification?
• Target Hybridization
• Signal detection: 1 vs. 2 channel most important for experimental and analysis design
Specifics of each technology will determine idiosyncrasies of data preprocessing.
[Figure labels: Target, Probes, Microarray]
An Example to Remind us of Gene Structure
and Gene Cross-Referencing Issues
2 independent probes (!) on your microarray
interrogate the same gene (!) and both show an
extreme expression change in your cell line following
treatment: YES!!!
However, the directionality of this change is opposite:
one probe shows induction while the other shows
repression: NO !?!
cDNA Microarray Quantification
[Figure: Log Intensity vs. Log Intensity, followed by Log Ratio vs. Log Intensity]
Probes designed to interrogate expression of the same gene!
From Genomic DNA to mRNA Transcripts
SF1 in Entrez Gene (RefSeq):
A Complex Transcriptional Profile
Lacks regulatory SPSP phosphorylation motif
Probe
Decreased
Probe
Increased
SF1 in AceView:
A Complex Transcriptional Profile!
UCSC Genome Browser:
Genes
Transcripts
Probes
(Live Web Demo)
UCSC example with genes, transcripts, and probe mapping – custom tracks.