Statistical analysis of DNA microarray data

Introduction to Microarray and
Related High Throughput Analysis
BMI 705
Kun Huang
Department of Biomedical Informatics
Ohio State University
What is microarray?
• Affymetrix-like arrays – single channel (background-green,
foreground-red)
• cDNA arrays – two channel (red, green, yellow)
• CGH array, DNA methylation array, SNP array, etc.
• ChIP-on-Chip
• Tissue microarray
• Future - Sequencing
How is microarray manufactured?
How does two-channel microarray work?
• Printed microarrays
• Long oligonucleotide probes (80-100 nucleotides long)
are “printed” on the glass chip
How does two-channel microarray work?
• Printing process introduces errors and
larger variance
• Comparative hybridization experiment
How does microarray work?
How is microarray manufactured?
• Affymetrix GeneChip
• silicon chip
• oligonucleotide probes lithographically synthesized
on the array
• cRNA is used instead of cDNA
How does Affymetrix microarray work?
How does microarray work?
• Fabrication expense and frequency of error
increase with the length of the probe; therefore 25-mer
oligonucleotide probes are employed.
• Problem: cross hybridization
• Solution: introduce a mismatch (MM) probe that differs from
the perfect match (PM) probe at one (central) position.
The PM − MM difference gives a more accurate
reading.
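The matched/mismatched probe idea can be sketched in a few lines. Here PM stands for the matched probe intensities and MM for the mismatched ones. The intensities are made-up numbers, and averaging the differences over the probe pairs is an assumption chosen for illustration; the slides do not say how the differences are summarized.

```python
def probe_set_signal(pm, mm):
    """Average the PM - MM differences over the probe pairs of one probe set."""
    if len(pm) != len(mm) or not pm:
        raise ValueError("pm and mm must be non-empty and of equal length")
    # Subtracting MM removes the part of the signal attributed to cross hybridization.
    diffs = [p - m for p, m in zip(pm, mm)]
    return sum(diffs) / len(diffs)


# Hypothetical intensities for four probe pairs targeting the same gene.
pm = [1200.0, 980.0, 1500.0, 1100.0]
mm = [300.0, 280.0, 450.0, 270.0]
signal = probe_set_signal(pm, mm)
print(signal)
```

A plain average is sensitive to outlying probe pairs, so a more robust summary may be preferable in practice.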
How do we use microarray?
• Profiling
• Clustering
Spatial Images of the Microarrays
• Data for the same brain voxel but for the untreated control mouse
• Background levels are much higher than those for the Parkinson’s disease
model mouse
• There appears to be something non-random affecting the background of the
green channel of this slide
How do we take readings from microarray
(measurement)?
cDNA array – ratio, log ratio
Affymetrix array
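For the two-channel (cDNA) case, a short sketch shows why the log ratio is usually preferred over the raw ratio: a twofold increase and a twofold decrease become symmetric around zero. The function name and the intensities are illustrative assumptions, and base 2 is a common convention rather than something stated on the slide.

```python
import math


def log_ratio(red, green):
    """Return log2(red / green) for one spot; both intensities must be positive."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(red / green)


up = log_ratio(2000.0, 1000.0)    # twofold higher in the red channel
down = log_ratio(1000.0, 2000.0)  # twofold lower in the red channel
same = log_ratio(1500.0, 1500.0)  # no difference between channels
print(up, down, same)
```

On the raw ratio scale the same three spots read 2, 0.5 and 1, so decreases are squeezed into the interval between 0 and 1.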
How do we process microarray data
(McShane, NCI)
How do we process microarray data
• Normalization
• Intensity imbalance between RNA samples
• Affects all genes
• Not due to biology of samples, but due to technical
reasons
• Reasons include differences in the settings of the
photodetector voltage, imbalance in the total amount of
RNA in each sample, differences in dye uptake,
etc.
• The objective is to adjust the gene expression
values of all genes so that the ones that are not
really differentially expressed have similar values
across the array(s).
Normalization
• Two major issues to consider
• Which genes to use for normalization
• Which normalization algorithm to use
• Housekeeping genes
• Genes involved in essential activities of cell maintenance and survival, but
not in cell function and proliferation. These genes will be similarly
expressed in all samples, but they may be difficult to identify and need to be
confirmed. The Affymetrix GeneChip provides a set of housekeeping genes
(but still no guarantee).
• Spiked controls
• Genes that are not usually found in the samples (both control and test
sample). E.g., yeast gene in human tissue samples. Note: Affy GeneChip
protocol includes the spiking of control oligonucleotides into each sample.
They are NOT for normalization. Instead, they are for other purposes such
as gridding of slide by the image analysis software.
• Using all genes
• Simplest approach – use all adequately expressed genes for normalization.
The assumption is that the majority of genes on the array are
housekeeping genes and the proportion of overexpressed genes is similar
to that of the underexpressed genes. If the genes on the chip are
specially selected, then this method will not work.
Normalization
• Which normalization algorithm to use
• For two-color cDNA arrays - Intra-slide normalization
[Figure: scatter plot (slope = 1) and ratio-intensity (RI), or MA, plot]
Normalization
• Linear (global) normalization
• Simplest but most consistent
• Move the median to zero (slope 1 in scatter
plot; this only changes the intercept)
• No clear nonlinearity or slope in MA plot
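A minimal sketch of linear (global) normalization for one two-color slide: compute the log ratio M of every spot and subtract the median so that the typical spot sits at zero. The function name and the intensities are illustrative assumptions, not values from the lecture.

```python
import math
import statistics


def global_median_normalize(red, green):
    """Shift all log2 ratios of one slide so that their median is zero."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    shift = statistics.median(m)
    # The same constant is subtracted from every spot: only the intercept changes.
    return [value - shift for value in m]


# Hypothetical slide: the red channel is uniformly twice as bright,
# and only the last spot is truly differentially expressed.
red = [400.0, 800.0, 1600.0, 3200.0, 6400.0]
green = [200.0, 400.0, 800.0, 1600.0, 800.0]
normalized = global_median_normalize(red, green)
print(normalized)
```

Because one constant is applied to all spots, the ordering of genes and the differences between them are untouched, which is why this method is the most conservative choice.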
Normalization
• Intensity-based (Lowess) normalization
• Overall magnitude of the spot intensity has an impact on the relative
intensity between the channels.
• “Straighten” the Lowess fit line in MA plot to a horizontal line and move it
to zero
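The "straightening" step can be sketched as follows: estimate the trend of M (log ratio) as a function of A (average log intensity) with a local fit, then subtract that trend from every spot. This is a simplified stand-in for Lowess (tricube-weighted local linear regression without the robustness iterations of the full procedure); the function names, the `frac` parameter, and the data are assumptions for illustration.

```python
import math


def local_linear_fit(a, m, frac=0.5):
    """Estimate the trend of M versus A with tricube-weighted local lines."""
    n = len(a)
    k = min(n, max(3, math.ceil(frac * n)))  # neighbours used in each local fit
    fitted = []
    for x0 in a:
        dists = sorted(abs(x - x0) for x in a)
        h = max(dists[k - 1], 1e-12)  # bandwidth: distance to the k-th nearest spot
        sw = sx = sy = sxx = sxy = 0.0
        for x, y in zip(a, m):
            d = abs(x - x0) / h
            if d >= 1.0:
                continue  # outside the local window
            w = (1.0 - d ** 3) ** 3  # tricube weight
            sw += w
            sx += w * x
            sy += w * y
            sxx += w * x * x
            sxy += w * x * y
        denom = sw * sxx - sx * sx
        if abs(denom) < 1e-12:
            fitted.append(sy / sw)  # degenerate window: fall back to weighted mean
        else:
            slope = (sw * sxy - sx * sy) / denom
            intercept = (sy - slope * sx) / sw
            fitted.append(intercept + slope * x0)
    return fitted


def intensity_normalize(a, m, frac=0.5):
    """Subtract the fitted trend so the MA cloud is centred on M = 0."""
    trend = local_linear_fit(a, m, frac)
    return [mi - ti for mi, ti in zip(m, trend)]


# Hypothetical slide with a purely intensity-dependent dye bias.
a_values = [6.0 + 0.5 * i for i in range(20)]
m_values = [0.3 * a - 2.0 for a in a_values]
normalized = intensity_normalize(a_values, m_values)
print(max(abs(v) for v in normalized))
```

Since each spot receives its own correction, a smaller `frac` follows the data more closely and raises the risk of removing real biological signal.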
Normalization
• Intensity-based (Lowess) normalization
• Nonlinear
• Gene-by-gene, could introduce bias
• Use only when there is a compelling
reason
(McShane, NCI)
Normalization
• Other normalization methods
• Combination of location and intensity-based
normalization
• Location
• Quantile
• …
Normalization
• Which normalization algorithm to use
• Inter-slide normalization
• Not just for Affymetrix arrays
Normalization
• Box plot (upper quartile, median, lower quartile)
Normalization
• Linear (global) – the chips have equal median
(or mean) intensity
• Intensity-based (Lowess) – the chips have
equal medians (means) at all intensity values
• Quantile – the chips have identical intensity
distribution
• Quantile is the “best” in terms of normalizing the
data to the desired distribution; however, it also
changes the gene expression level individually
• Avoid overfitting
• Avoid bias
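Quantile normalization across chips can be sketched as: sort each chip, average the values at each rank across chips, and give every gene the average that belongs to its rank. The function name and the data are illustrative assumptions, and ties are broken by position rather than averaged, which is a simplification.

```python
def quantile_normalize(arrays):
    """Force every chip (a list of intensities, one per gene) onto one distribution."""
    n = len(arrays[0])
    if any(len(chip) != n for chip in arrays):
        raise ValueError("all chips must measure the same number of genes")
    sorted_chips = [sorted(chip) for chip in arrays]
    # Reference distribution: mean of the i-th smallest value over all chips.
    rank_means = [sum(chip[i] for chip in sorted_chips) / len(arrays) for i in range(n)]
    result = []
    for chip in arrays:
        order = sorted(range(n), key=lambda i: chip[i])  # gene indices by intensity
        out = [0.0] * n
        for rank, gene in enumerate(order):
            out[gene] = rank_means[rank]
        result.append(out)
    return result


# Hypothetical intensities: three chips, four genes each.
chips = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(chips)
for row in normalized:
    print(row)
```

After this step every chip contains exactly the same set of values, only assigned to different genes, which illustrates why the method changes each gene's expression level individually.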
Gene Discovery and T-tests
Student’s t-test
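For one gene measured in two groups, the classical equal-variance Student's t statistic can be computed as sketched here. The function name and the expression values are illustrative assumptions; turning t into a p-value additionally needs the t distribution with the returned degrees of freedom, which is left out.

```python
import math
import statistics


def student_t(group_a, group_b):
    """Return (t, degrees of freedom) for a two-sample pooled-variance t-test."""
    na, nb = len(group_a), len(group_b)
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    df = na + nb - 2
    # Pooled variance: both groups are assumed to share one variance.
    pooled = ((na - 1) * var_a + (nb - 1) * var_b) / df
    t = (mean_a - mean_b) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, df


# Hypothetical log2 expression values of one gene.
treated = [9.0, 10.0, 11.0, 12.0, 13.0]
control = [7.0, 8.0, 9.0, 10.0, 11.0]
t_stat, dof = student_t(treated, control)
print(t_stat, dof)
```

In a microarray experiment the same computation is repeated for every gene, which is what creates the multiple-testing problem.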
Gene Discovery and Multiple T-tests
Controlling False Positives
• Statistical tests to control the false positives
• Controlling for no false positives (very
stringent, e.g., Bonferroni correction)
• Controlling the number of false positives
• Controlling the proportion of false positives
• Note that in the screening stage, a false
positive is better than a false negative, as the
latter means missing a possibly important
discovery.
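Two of these strategies can be sketched on a list of per-gene p-values. The Bonferroni rule is named on the slide; for "controlling the proportion of false positives" the sketch uses the Benjamini-Hochberg step-up procedure, which is an assumption about what the slide has in mind. The p-values are made up.

```python
def bonferroni(pvalues, alpha=0.05):
    """Reject only p-values below alpha divided by the number of tests."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def benjamini_hochberg(pvalues, alpha=0.05):
    """Step-up procedure controlling the expected proportion of false positives."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value passes its own threshold.
        if pvalues[idx] <= alpha * rank / m:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for ten genes.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
strict = bonferroni(pvals)
fdr = benjamini_hochberg(pvals)
print(sum(strict), sum(fdr))
```

On this toy input the stringent rule keeps one gene and the proportion-based rule keeps two, which matches the point that stringent control costs discoveries at the screening stage.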
Microarray Databases
• Gene Expression Omnibus (GEO) database – NCBI
– http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?DB=pubmed
• EMBL-EBI microarray database
– http://www.ebi.ac.uk/Databases/microarray.html
• ArrayExpress
• Stanford Microarray Database (SMD)
– http://genome-www5.stanford.edu/
• Other specialized, regional and aggregated databases
– http://psi081.ba.ars.usda.gov/SGMD/
– http://www.oncomine.org/main/index.jsp
– http://ihome.cuhk.edu.hk/~b400559/arraysoft_public.html
– …
Microarray Software
• dChip
• Open source R, Bioconductor
• BRB-ArrayTools (NCI Biometric Research Branch)
• Affymetrix
• GeneSpring GX
• GenePattern
• …
How do we use microarray (clustering)?
How do we process microarray data
(clustering)?
• Unsupervised learning – hierarchical clustering
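Hierarchical clustering of expression profiles can be sketched as repeated merging of the two closest clusters. The slide does not name a distance or linkage, so Euclidean distance and average linkage are assumptions here, as are the function names and the toy profiles.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def average_linkage(profiles, n_clusters):
    """Agglomerative clustering: merge the closest pair until n_clusters remain."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average distance over all gene pairs drawn from the two clusters.
                total = sum(
                    euclidean(profiles[a], profiles[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                d = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return [sorted(c) for c in clusters]


# Hypothetical profiles of five genes over three conditions.
profiles = [
    [1.0, 1.1, 0.9],
    [1.2, 1.0, 1.0],
    [5.0, 5.2, 4.9],
    [5.1, 5.0, 5.0],
    [9.0, 0.0, 9.0],
]
groups = average_linkage(profiles, n_clusters=3)
print(groups)
```

Recording the order and distance of each merge, rather than stopping at a fixed number of clusters, is what yields the dendrogram usually drawn next to expression heat maps.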
ChIP-on-chip, “also known as genome-wide location
analysis, is a technique for isolation and identification
of the DNA sequences occupied by specific DNA
binding proteins in cells.” (http://www.chiponchip.org)
• Identify protein binding sites on DNA
• Study transcription factors (TFs) – identify the genes that
are controlled by specific TFs
• Identify TFs
• Identify regulatory regions such as promoters,
enhancers, repressors, silencing elements, insulators,
and boundary elements
• Determine sequences controlling DNA replication
(e.g., histone binding sites)
ChIP-on-Chip
ChIP – Chromatin immunoprecipitation
Chip – Microarray
ChIP-on-Chip – Example
Simon I., Barnett J., Hannett N., Harbison C.T., Rinaldi N.J., Volkert T.L.,
Wyrick J.J., Zeitlinger J., Gifford D.K., Jaakkola T.S., et al. "Serial regulation of
transcriptional regulators in the yeast cell cycle", Cell, Volume: 106, (2001),
pp. 697-708.
Figure 2. Genome-wide Location of the Nine Cell Cycle Transcription Factors. (A) 213 of the 800 cell cycle genes whose
promoter regions were bound by a myc-tagged version of at least one of the nine cell cycle transcription factors (p <
0.001) are represented as horizontal lines. The weight-averaged binding ratios are displayed using a blue and white
color scheme (genes with p value < 0.001 are displayed in blue). The expression ratios of an α factor synchronization
time course from Spellman et al. (1998) are displayed using a red (induced) and green (repressed) color scheme. (B)
The circle represents a smoothed distribution of the transcription timing (phase) of the 800 cell cycle genes (Spellman et
al., 1998). The intensity of the red color, normalized by the maximum intensity value for each factor, represents the
fraction of genes expressed at that point that are bound by a specific activator. The similarity in the distribution of color
for specific factors (with Swi4, Swi6, and Mbp1, for example) shows that these factors bind to genes that are expressed
during the same time frame.
ChIP-on-Chip – Example
Simon I., et al. "Serial regulation of transcriptional regulators in the yeast cell
cycle", Cell, Volume: 106, (2001), pp. 697-708.
ChIP-on-Chip
Problem: probe design
1. Most TF binding sites are not in exons
2. Binding sequences are short
3. Cover entire genome?
4. Signal may be small
Tiling array – divide the sequence into chunks, called the tiling
path. The distance between the centers of neighboring
chunks is called the resolution. A path can be overlapping or
spaced.
Affymetrix tiling array for yeast – 5bp resolution, 3.2
million probes
Affymetrix tiling array for human – 35bp spacing, 90 million
probes
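The relation between resolution and probe placement can be sketched with a small helper. With equal-length probes, the distance between neighboring centers equals the distance between neighboring start positions. The function name, the 100 bp sequence length, and the use of 25-mer probes for tiling are assumptions for illustration.

```python
def tiling_path(seq_length, probe_length, resolution):
    """Return 0-based start positions of probes spaced `resolution` bp apart."""
    if probe_length > seq_length:
        return []
    return list(range(0, seq_length - probe_length + 1, resolution))


# Overlapping path: resolution smaller than the probe length.
overlapped = tiling_path(seq_length=100, probe_length=25, resolution=5)
# Spaced path: resolution larger than the probe length leaves uncovered gaps.
spaced = tiling_path(seq_length=100, probe_length=25, resolution=35)
gap = 35 - 25  # uncovered bases between neighboring probes in the spaced path
print(len(overlapped), spaced, gap)
```

The probe count grows roughly as sequence length divided by resolution, which is why finer resolution over a whole genome quickly reaches millions of probes.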
Sequencing
• Solexa http://www.illumina.com/pages.ilmn?ID=203
• SOLiD
Mikkelsen et al.