2_bioarray-Normalization

Biology and Cells
• All living organisms consist of cells.
• Humans have trillions of cells. Yeast - one cell.
• Cells are of many different types (blood, skin,
nerve), but all arose from a single cell (the
fertilized egg)
• Each cell contains a complete copy of the genome
(the program for making the organism), encoded
in DNA.
DNA
• DNA molecules are long double-stranded chains;
4 types of bases are attached to the backbone:
adenine (A), guanine (G), cytosine (C), and
thymine (T). A pairs with T, C with G.
• A gene is a segment of DNA that specifies how to
make a protein.
• Human DNA has about 30-35,000 genes;
• Rice -- about 50-60,000, but shorter genes.
Exons and Introns: Data and Logic?
• exons are coding DNA (translated into a protein),
which are only about 2% of human genome
• introns are non-coding DNA, which provide
structural integrity and regulatory (control)
functions
• exons can be thought of as program data, while
introns provide the program logic
• Humans have much more control structure than
rice
Gene Expression
• Cells are different because of differential gene
expression.
• About 40% of human genes are expressed at one
time.
• Gene is expressed by transcribing DNA into
single-stranded mRNA
• mRNA is later translated into a protein
• Microarrays measure the level of mRNA
expression
Gene Expression Measurement
• mRNA expression represents dynamic aspects of
cell
• mRNA expression can be measured with latest
technology
• mRNA is isolated and labeled with fluorescent
protein
• mRNA is hybridized to the target; level of
hybridization corresponds to light emission which
is measured with a laser
Molecular Biology Overview
[Figure labels: Cell; Nucleus; Chromosome; Gene (DNA); Gene (mRNA), single strand; Protein]
Graphics courtesy of the National Human Genome Research Institute
Gene Expression Microarrays
The main types of gene expression microarrays:
• Short oligonucleotide arrays (Affymetrix);
• cDNA or spotted arrays (Brown/Botstein).
• Long oligonucleotide arrays (Agilent Inkjet);
• Fiber-optic arrays
• ...
DNA Chip Microarrays
• Put a large number (~100K) of cDNA sequences or
synthetic DNA oligomers onto a glass slide in known
locations on a grid.
• Label an RNA sample and hybridize (Label 2 RNA
samples with 2 different colors of fluorescent dye: control vs. experimental)
• Mix two labeled RNAs and hybridize to the chip
• Measure amounts of RNA bound to each square in the
grid
• Make comparisons
– Cancerous vs. normal tissue
– Treated vs. untreated
– Time course
Spot your own Chip
(plans available for free from Pat Brown’s website)
Robot spotter
Ordinary glass
microscope slide
cDNA Spotted Microarrays
Affymetrix “Gene chip” system
• Uses 25 base oligos synthesized in place on a
chip (20 pairs of oligos for each gene)
• RNA labeled and scanned in a single “color”
– one sample per chip
• Can have as many as 20,000 genes on a chip
• Arrays get smaller every year (more genes)
• Chips are expensive
• Proprietary system: “black box” software,
can only use their chips
Affymetrix Microarrays
[Figure: raw image; labels: 1.28 cm, 50 µm]
~10^7 oligonucleotides,
half Perfectly Match mRNA (PM),
half have one Mismatch (MM)
Raw gene expression is intensity
difference: PM - MM
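A minimal sketch of the PM - MM idea for a single gene. The slide only states that raw expression is the intensity difference PM - MM; averaging that difference over the gene's probe pairs is an assumption made here for illustration, and the function name and the intensities are hypothetical.

```python
from statistics import mean

def average_difference(pm, mm):
    """Mean of PM - MM over the probe pairs of one gene (probe set)."""
    if len(pm) != len(mm) or not pm:
        raise ValueError("pm and mm must be non-empty and of equal length")
    return mean(p - m for p, m in zip(pm, mm))

# Hypothetical intensities for one gene with four probe pairs
# (the slides mention 20 pairs per gene).
pm = [1200.0, 950.0, 1430.0, 1010.0]
mm = [300.0, 410.0, 380.0, 290.0]
print(average_difference(pm, mm))
```

When the mismatch probes are brighter than the perfect-match probes the result is negative, which is why raw values from this kind of calculation may need thresholding before analysis.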
Data Acquisition
• Scan the arrays
• Quantitate each spot
• Subtract background
• Normalize
• Export a table of fluorescent intensities
for each gene in the array
Normalization
• Can control for many of the experimental
sources of variability (systematic, not random
or gene specific)
• Bring each image to the same average
brightness
• Can use simple math or fancy
– divide by the mean (whole chip or by sectors)
– LOESS (locally weighted regression)
• No sure biological standards
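The "divide by the mean" option can be sketched as follows, using whole-chip means. Rescaling every chip to a shared target brightness (here the average of the chip means) is one possible convention, not something the slides prescribe; the function name and the chip data are hypothetical.

```python
from statistics import mean

def scale_to_common_mean(chips, target=None):
    """Rescale each chip so that its mean intensity equals `target`.

    chips: dict mapping chip name -> list of intensities.
    If target is None, the average of the chip means is used (an assumption).
    """
    chip_means = {name: mean(values) for name, values in chips.items()}
    if target is None:
        target = mean(chip_means.values())
    return {
        name: [v * target / chip_means[name] for v in values]
        for name, values in chips.items()
    }

# Hypothetical data: chip_b was scanned twice as bright as chip_a.
chips = {"chip_a": [100.0, 200.0, 300.0], "chip_b": [200.0, 400.0, 600.0]}
normalized = scale_to_common_mean(chips)
print(normalized)
```

Normalizing by sectors would apply the same function to each region of the chip separately.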
Multiple Comparisons
• In a microarray experiment, each gene (each
probe or probe set) is really a separate
experiment
• Yet if you treat each gene as an independent
comparison, you will always find some with
significant differences
– (the tails of a normal distribution)
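The effect can be illustrated with simulated p-values for genes that have no real difference. The two corrections shown, Bonferroni and Benjamini-Hochberg, are standard procedures that the slides do not name; they are included here only as an assumption about how one might respond to the problem.

```python
import random

def bonferroni(p_values, alpha=0.05):
    """Indices of tests that stay significant after Bonferroni correction."""
    cutoff = alpha / len(p_values)
    return [i for i, p in enumerate(p_values) if p <= cutoff]

def benjamini_hochberg(p_values, alpha=0.05):
    """Indices of tests rejected by the Benjamini-Hochberg FDR procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    n_reject = 0
    for rank, i in enumerate(order, start=1):
        # Remember the largest rank whose p-value is under its threshold.
        if p_values[i] <= alpha * rank / m:
            n_reject = rank
    return sorted(order[:n_reject])

# 10,000 simulated genes with no true difference: p-values are uniform.
rng = random.Random(0)
null_p = [rng.random() for _ in range(10_000)]
naive_hits = sum(p < 0.05 for p in null_p)
print("genes with p < 0.05 although nothing changed:", naive_hits)
```

With 10,000 genes and a 0.05 cutoff, roughly 500 genes are expected to look significant by chance alone.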
Microarray Potential Applications
• Biological discovery
– new and better molecular diagnostics
– new molecular targets for therapy
– finding and refining biological pathways
• Recent examples
– molecular diagnosis of leukemia, breast cancer, ...
– appropriate treatment for genetic signature
– potential new drug targets
Microarray Data Analysis Types
• Gene Selection
– find genes for therapeutic targets
– avoid false positives (FDA approval ?)
• Classification (Supervised)
– identify disease (biomarker study)
– predict outcome / select best treatment
• Clustering (Unsupervised)
– find new biological classes / refine existing ones
– Understanding regulatory relationship/pathway
– exploration
•…
Microarray Data Mining Challenges
• too few records (samples), usually < 100
• too many columns (genes), usually > 1,000
• Too many columns likely to lead to False positives
• for exploration, a large set of all relevant genes is
desired
• for diagnostics or identification of therapeutic
targets, the smallest set of genes is needed
• model needs to be explainable to biologists
Data Preparation Issues
• Thresholding: usually min 20, max 16,000
– For older Affy chips (new Affy chips do not have negative
values)
• Filtering - remove genes with insufficient variation
– e.g. MaxVal - MinVal < 500 and MaxVal/MinVal < 5
– biological reasons
– feature reduction for algorithmic reasons
• For clustering, normalize each gene (sample) separately
to Mean = 0, Std. Dev = 1
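The three preparation steps can be sketched as below, using the example limits from the bullets (20, 16,000, 500, 5). The function names and the sample data are hypothetical, and the use of the population standard deviation in the last step is an assumption.

```python
from statistics import mean, pstdev

def threshold(values, low=20.0, high=16000.0):
    """Clip intensities into [low, high]."""
    return [min(max(v, low), high) for v in values]

def has_enough_variation(values, min_diff=500.0, min_ratio=5.0):
    """False when MaxVal - MinVal < min_diff and MaxVal / MinVal < min_ratio."""
    hi, lo = max(values), min(values)
    return not (hi - lo < min_diff and hi / lo < min_ratio)

def standardize(values):
    """Shift and scale to mean 0 and standard deviation 1."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]

# Hypothetical expression table: gene -> intensities across four samples.
genes = {"g1": [-50.0, 30.0, 18000.0, 400.0], "g2": [100.0, 120.0, 110.0, 130.0]}
clipped = {g: threshold(v) for g, v in genes.items()}
kept = {g: v for g, v in clipped.items() if has_enough_variation(v)}
scaled = {g: standardize(v) for g, v in kept.items()}
print(sorted(kept))
```

Thresholding first guarantees a positive minimum, so the MaxVal/MinVal ratio in the filter is always defined.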
Normalization issues
Within-slide
– What genes to use
– Location
– Scale
Paired-slides (dye swap)
– Self-normalization
Between slides
[Workflow figure]
• Control RNA Sample, Test RNA Sample
• Reverse-Transcription
• radio-labelled cDNA probes
• Hybridization to microarray filters
• Use Phosphor Imager laser scanner to obtain densities of each spot on filter.
• Compare densities at each spot to determine if treatment changes gene expression. Compile subset of differentially expressed genes.

Gene   Control   Test
A      1X        3X
:      :         :
Z      1X        0.5X
Normalization continued
• Intensity-dependent normalization (Yang, YH, 2002)
– Do M-A plot to check the data distribution, where
$M = \log_2(T/C)$ and $A = \tfrac{1}{2}\log_2(T \cdot C)$ (T = test, C = control intensity)
– Use Lowess function in R to perform normalization
$\log_2(T/C) \rightarrow \log_2(T/C) - c(A) = \log_2\bigl(T/(kC)\bigr)$
where c(A) is the lowess fit to the M-A plot and $k = 2^{c(A)}$
– Transform data by M' = M - c(A).
– A locally fitted nonparametric method that is robust to a small
number of differentially expressed genes.
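The procedure above can be sketched in Python. This is not R's Lowess function: it is a simplified local linear fit with tricube weights and no robustness iterations, written with the standard library only. The function names, the `frac` value, and the synthetic data are all assumptions made for illustration, and A is taken as the average log intensity.

```python
import math
import random
from statistics import mean

def local_linear_fit(x, y, frac=0.4):
    """Tricube-weighted local linear regression, evaluated at every x[i].

    A simplified stand-in for lowess (no robustness iterations).
    """
    n = len(x)
    k = min(n, max(2, math.ceil(frac * n)))  # points in each local window
    fitted = []
    for x0 in x:
        dist = [abs(xj - x0) for xj in x]
        h = sorted(dist)[k - 1] or 1e-12  # window half-width
        w = [(1 - min(d / h, 1.0) ** 3) ** 3 for d in dist]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

def normalize_ma(test, control, frac=0.4):
    """Return (M, A, M_norm) where M_norm = M - c(A)."""
    M = [math.log2(t / c) for t, c in zip(test, control)]
    A = [0.5 * math.log2(t * c) for t, c in zip(test, control)]
    c_of_A = local_linear_fit(A, M, frac)
    M_norm = [m - c for m, c in zip(M, c_of_A)]
    return M, A, M_norm

# Synthetic data (not from the slides): genes are unchanged, but the test
# channel carries an intensity-dependent bias plus noise.
rng = random.Random(1)
control = [2 ** rng.uniform(6, 14) for _ in range(200)]
test = [c * 2 ** (0.5 + 0.1 * (math.log2(c) - 10) + rng.gauss(0, 0.2))
        for c in control]
M, A, M_norm = normalize_ma(test, control)
print(f"mean M before: {mean(M):.3f}, after: {mean(M_norm):.3f}")
```

Because the fit follows the bulk of the points at each intensity, the normalized M values are centred near zero across the whole range of A rather than only on average.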
(R,G) ↔ (M,A) Transformation
“Observed” data {(R,G)}
R = red channel signal
G = green channel signal
(background corrected or not)
Transformed data {(M,A)}
$M = \log_2(R/G)$ (ratio),
$A = \log_2 (R \cdot G)^{1/2} = \tfrac{1}{2}\log_2(R \cdot G)$ (intensity)
⇔ $R = (2^{2A+M})^{1/2}$, $G = (2^{2A-M})^{1/2}$
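A short round-trip check of the transformation, assuming strictly positive channel signals. The function names are hypothetical.

```python
import math

def to_ma(r, g):
    """(R, G) -> (M, A): log ratio and average log intensity."""
    m = math.log2(r / g)
    a = 0.5 * math.log2(r * g)
    return m, a

def from_ma(m, a):
    """(M, A) -> (R, G), the inverse of to_ma."""
    r = 2 ** ((2 * a + m) / 2)
    g = 2 ** ((2 * a - m) / 2)
    return r, g

m, a = to_ma(1024.0, 256.0)
r, g = from_ma(m, a)
print(m, a, r, g)
```

A fourfold red-over-green ratio gives M = 2, and the geometric mean of the two signals, 512, gives A = 9.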
Normalization
• Regression normalization:
– Fit the linear regression model: $y_i = \alpha + \beta x_i + \varepsilon_i$
– Assumption: all the genes on the array have the same
variance (homogeneity)
– Test the significance of the intercept $\alpha$. Fit a linear
regression without $\alpha$ if it is insignificant.
– Transform the treatment data:
$y_i' = (y_i - \hat{\alpha}) / \hat{\beta}$
– Problem:
• assumption may not hold
• nonlinear trend (the third replicate of the RL95 data has a slight
quadratic trend).
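A minimal sketch of regression normalization by ordinary least squares, assuming x holds the control log intensities and y the treatment log intensities. The significance test on the intercept is omitted, and the function names and data are hypothetical.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares estimates (alpha_hat, beta_hat) for y = alpha + beta * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = my - beta * mx
    return alpha, beta

def regression_normalize(x, y):
    """Map treatment values y onto the scale of x: (y - alpha_hat) / beta_hat."""
    alpha, beta = fit_line(x, y)
    return [(yi - alpha) / beta for yi in y]

# Hypothetical log intensities: the treatment array is offset by 0.5 and
# stretched by a factor of 1.2 relative to the control array.
x = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [0.5 + 1.2 * xi for xi in x]
alpha_hat, beta_hat = fit_line(x, y)
y_norm = regression_normalize(x, y)
print(alpha_hat, beta_hat, y_norm)
```

A single straight line cannot remove a curved trend, which is the limitation noted in the last bullet above.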
Scatter plot of log intensity before and after
regression normalization
[Figure: for each of the three replicates, a panel titled "scatter plot of DMSO vs BAP" and a panel titled "scatter plot after norm"; axes are log(dmso1), log(dmso2), log(dmso3) and log(bap1), log(bap2), log(bap3).]