Expression Data and Microarrays

CMMB
November 29, 2001
Todd Scheetz
Overview
Gene expression
– mRNA
– protein
Northern Blots
RT-PCR
SAGE
MicroArray
Gene Expression Review
Transcription
– generation of mRNA from genomic DNA
a complete copy is made, including both
introns and exons.
→ pre-mRNA
[Figure: genomic DNA and the resulting pre-mRNA (AAAA...)]
Gene Expression Review
Processing / Splicing
– removal of the introns from the pre-mRNA
→ mature mRNA
– also exported from the nucleus to the
cytoplasm
– alternative splicing
[Figure: one pre-mRNA (AAAA...) spliced into several mature mRNAs (splice variants)]
Gene Expression Review
Translation
– takes an mRNA molecule and uses it to
construct an amino acid sequence.
– the ribosome is the underlying machinery used
in the process of translation.
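As a rough illustration of translation, the sketch below reads an mRNA string codon by codon, from the first AUG to the first stop codon, using the standard genetic code. The function name `translate` and the example sequence are made up for illustration, and the code assumes the input contains only A, C, G, and U.

```python
from itertools import product

# Standard genetic code, codons ordered U, C, A, G at each position.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def translate(mrna):
    """Translate from the first AUG until a stop codon ('*') or the end."""
    start = mrna.find("AUG")
    if start < 0:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

# Hypothetical mature mRNA fragment: AUG GCC UGG UAA with flanking bases.
peptide = translate("GGAUGGCCUGGUAAGG")
```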
Measuring Gene Expression
Two major differentiating factors…
Quantitative vs. Qualitative
mRNA vs protein
Most techniques can be used to determine
quantitative expression levels.
Ex. EST sequencing
Measuring Gene Expression
More sophisticated experiments…
Comparing expression levels of multiple genes
Comparing co-regulation or differential
regulation.
Ex. EST sequencing
Northern Blot
Measure relative expression levels of mRNA
1. mRNA isolation and purification
2. electrophorese on a gel
3. Transfer the RNA from the gel to a membrane, which is
probed by hybridizing with a labeled clone for the gene under study.
RT-PCR
Measures relative expression of mRNA
1. Isolate and purify mRNA
2. reverse transcription
3. PCR amplification
4. run on gel and probe/hybridize
RT-PCR
Why use RT?
Can observe very low levels of expression
Requires very small amounts of mRNA
The bad…
Potential expression-level skew due to nonlinearity of PCR
Have to design multiple custom primers for each
gene.
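The expression-level skew can be illustrated with an idealized model in which every PCR cycle multiplies the copy number by (1 + efficiency). This is a sketch under that assumption only: real reactions plateau, and the efficiencies and copy numbers below are invented for illustration.

```python
def amplified_copies(start_copies, efficiency, cycles):
    """Copies after PCR, assuming each cycle multiplies by (1 + efficiency)."""
    return start_copies * (1.0 + efficiency) ** cycles

# Two hypothetical transcripts present at equal abundance before amplification,
# differing only slightly in per-cycle amplification efficiency.
gene_a = amplified_copies(1000, efficiency=0.95, cycles=30)
gene_b = amplified_copies(1000, efficiency=0.90, cycles=30)

apparent_ratio = gene_a / gene_b
print(f"apparent A/B ratio after 30 cycles: {apparent_ratio:.2f}")
```

Under this model a small per-cycle efficiency difference compounds over the cycles, so two equally abundant transcripts can appear to differ by roughly twofold.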
SAGE
Tags are isolated and
concatemerized.
Relative expression
levels can be
compared between
cells in different
states.
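A minimal sketch of how tag counts might be compared between two cell states: each tag's count is divided by the total number of tags in its sample, and the two frequencies are then compared. The function names and the 10-base tags are hypothetical; real tags come from sequencing the concatemers.

```python
from collections import Counter

def tag_frequencies(tags):
    """Return each tag's share of all tags sequenced from one sample."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}

def compare_states(tags_state1, tags_state2):
    """Ratio of tag frequencies (state 2 / state 1) for tags seen in both."""
    f1 = tag_frequencies(tags_state1)
    f2 = tag_frequencies(tags_state2)
    return {tag: f2[tag] / f1[tag] for tag in f1.keys() & f2.keys()}

# Hypothetical tag lists for two cell states.
normal = ["CCCATCGTCC"] * 2 + ["TGCACGTTTT"] * 6
treated = ["CCCATCGTCC"] * 6 + ["TGCACGTTTT"] * 2
ratios = compare_states(normal, treated)
```

Tags seen in only one state are dropped here because their ratio is undefined; a real analysis would need to handle them explicitly.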
SAGE
--gene to tag mapping
http://www.ncbi.nlm.nih.gov/SAGE/SAGEcid.cgi?cid=28726
MicroArray
What are they?
Arrays that allow thousands of expression analyses to be
performed concurrently.
What technologies are used?
How to analyze the image?
How to analyze the expression data?
What bioinformatics challenges are there?
Potential Microarray Applications
• Drug discovery / toxicology studies
• Mutation/polymorphism detection
• Differing expression of genes over:
– Time
– Tissues
– Disease States
• Sub-typing complex genetic diseases
DNA Array Technology
Array Type             Spot Density (per cm²)  Probe   Target  Labeling
Nylon Macroarrays      < 100                   cDNA    RNA     Radioactive
Nylon Microarrays      < 5000                  cDNA    mRNA    Radioactive/Fluorescent
Glass Microarrays      < 10,000                cDNA    mRNA    Fluorescent
Oligonucleotide Chips  < 250,000               oligos  mRNA    Fluorescent
Physical Spotting
MicroArray
Glass Microarray
326 Rat Heart Genes, 2x spotting
Photolithographic
MicroArray
Overview of data capture
two different mRNA populations, labeled with
different fluors
excited by a laser
each fluor emits at a different wavelength,
which is captured using a photodetector
attached to a filter tuned to the particular fluor
MicroArray
Overview of image analysis
spot identification
grid alignment
skew
image normalization
variable background
uneven hybridization
Microarray Data Pipeline
Image Analysis/Data Quantization
• Feature (target/probe) segmentation
• Data extraction and quantization of:
– Background
– Feature
• Correlation of feature identity and location
within image
• Display of pseudo-color image
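The quantization step can be sketched for a single spot as follows: summarize the feature and background pixels, subtract, and express the two channels as a log ratio. Using the median as the summary and flooring the signal at 1 are assumptions made here for illustration, and the pixel values are invented.

```python
import math
from statistics import median

def spot_signal(feature_pixels, background_pixels):
    """Background-subtracted feature intensity, floored at 1 to keep logs finite."""
    return max(median(feature_pixels) - median(background_pixels), 1)

def log2_ratio(ch1_feature, ch1_background, ch2_feature, ch2_background):
    """log2(channel 2 / channel 1) for one spot after background subtraction."""
    s1 = spot_signal(ch1_feature, ch1_background)
    s2 = spot_signal(ch2_feature, ch2_background)
    return math.log2(s2 / s1)

# Hypothetical pixel intensities for a single spot in two channels.
m = log2_ratio([1200, 1100, 1150], [100, 150, 120],
               [4300, 4200, 4250], [130, 120, 125])
```

A log ratio near 2 here corresponds to roughly fourfold more signal in channel 2 than in channel 1.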
Image Segmentation
Microarray Experiment Design
• Type I: (n = 2)
– How is this gene expressed in target 1 as compared to
target 2?
– Which genes show up/down regulation between the two
targets?
• Type II: (n > 2)
– How does the expression of gene A vary over time,
tissues, or treatments?
– Do any of the expression profiles exhibit similar patterns
of expression?
Motivation & Design Constraints
• Probe set design involves the prioritizing and parsing of an
initial data set containing potentially hundreds of
thousands of probe candidates to define a reasonably sized
set for use in a microarray experiment
• A single hybridization can produce several thousand data
tuples, each containing multiple (n>10) measurements
• No “all-in-one” software package is currently available;
communication of data between the packages must
therefore be facilitated by the pipeline
Probe Set Design
• Goal of probe set design is to identify a reasonably sized
subset of probes from a much larger starting set from a
variety of sources
• By defining a set of criteria, an investigator should be able
to create new probe sets or refine existing sets
• Pruning a data set should be done in several stages:
– Use readily available information to limit scope of data
– Obtain more information about remaining probes
– Narrow focus based on additional information
– Iterate until desired data set is obtained
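The staged pruning described above can be sketched as a sequence of filters applied in order. The record fields, stage criteria, and clone names below are hypothetical and stand in for whatever information an investigator has available at each stage.

```python
def prune(probes, stages):
    """Apply each (description, keep_fn) stage in order, reporting what survives."""
    for description, keep in stages:
        probes = [p for p in probes if keep(p)]
        print(f"{description}: {len(probes)} probes remain")
    return probes

# Hypothetical probe records; the field names are illustrative only.
candidates = [
    {"clone": "c1", "species": "rat", "tissue": "heart", "cluster_size": 12},
    {"clone": "c2", "species": "rat", "tissue": "liver", "cluster_size": 3},
    {"clone": "c3", "species": "mouse", "tissue": "heart", "cluster_size": 8},
    {"clone": "c4", "species": "rat", "tissue": "heart", "cluster_size": 1},
]

# Cheap, readily available criteria come first; later stages use extra information.
stages = [
    ("species is rat", lambda p: p["species"] == "rat"),
    ("tissue is heart", lambda p: p["tissue"] == "heart"),
    ("cluster size >= 2", lambda p: p["cluster_size"] >= 2),
]
selected = prune(candidates, stages)
```

Keeping the stages in a list makes it easy to add, reorder, or rerun criteria when iterating toward the desired set size.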
Sample Probe Set Design Criteria
• 1° -- Direct
– Species
– Tissue
– Chromosome
– Sequence Available
  • Quality
  • Tail/Poly(A) signal
• 2° -- Indirect
– Map position known?
– Cluster size
– Blast results
  • Confidence value
  • Homology (or lack of)
  • Annotation contains words like “transfer”
  • 3’ & 5’ EST reads hit same gene
– Syntenic Map Information
– Known phenotypes in other species
cDNA Microarray Slide Creation
• cDNA clones defining a probe set must be re-arrayed from
their sources (e.g. local storage or commercial) into a
format suitable for amplification and printing (e.g. 96-well
microtiter plates)
• Based on the size of the probe set and the limitations of the
printer, a parameter set (# of pens, spot spacing, grid
dimensions,…) must be defined for printing the probe set
onto the slide(s)
• A mapping operation must be performed to track each
probe from source to destination, so that known
information can be correlated with a particular “spot”
in a microarray image
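One way to picture the mapping operation is a lookup table from slide grid position back to plate, well, and clone. The sketch below assumes a single 96-well plate printed in row-major order with one spot per well; printers with several pens interleave wells differently, and all identifiers here are hypothetical.

```python
from string import ascii_uppercase

def plate_wells(rows=8, cols=12):
    """Well names of a 96-well plate in row-major order: A1..A12, B1..H12."""
    return [f"{ascii_uppercase[r]}{c + 1}" for r in range(rows) for c in range(cols)]

def build_spot_map(clone_by_well, plate_id, grid_cols):
    """Map (grid_row, grid_col) on the slide to (plate, well, clone).

    Assumes wells are printed in row-major order, one spot per well.
    """
    spot_map = {}
    for index, well in enumerate(plate_wells()):
        if well in clone_by_well:
            position = divmod(index, grid_cols)  # (grid_row, grid_col)
            spot_map[position] = (plate_id, well, clone_by_well[well])
    return spot_map

# Hypothetical clone identifiers for two wells of plate P1.
spots = build_spot_map({"A1": "clone-001", "B3": "clone-027"},
                       plate_id="P1", grid_cols=16)
```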
MicroArray
Overview of data analysis
vs. time
vs. other genes
co-regulation
differential regulation
pathway identification
Data Analysis
• Data analysis consists of several post-quantization
steps:
– Statistics/Metrics Calculations
– Scaling/Normalization of the Data
– Differential Expression
– Coordinated Gene Expression (aka clustering)
• Most software packages perform only a limited
number of analysis tasks
• Databases can facilitate the movement of data
between packages
Scaling and/or Normalization
• Positive Controls
– ‘Spiked’ DNA
– Housekeeping Genes
– Total Array
• Negative Controls
– Foreign DNA
– ‘Empty’ spots
Scaling and/or Normalization
• Linear regression
• Log-linear regression
• Ratio statistics
• Log(ratio) mean/median centering
• Nonlinear regression
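Of the methods listed, log(ratio) median centering is the simplest to sketch: compute the log2 ratio for every spot and subtract the array-wide median, on the assumption that most genes are not differentially expressed. The intensities below are invented, and the function name is hypothetical.

```python
import math
from statistics import median

def median_center_log_ratios(ch1, ch2):
    """Return log2(ch2/ch1) per spot, shifted so the array-wide median is 0."""
    log_ratios = [math.log2(b / a) for a, b in zip(ch1, ch2)]
    shift = median(log_ratios)
    return [lr - shift for lr in log_ratios]

# Hypothetical background-subtracted intensities; channel 2 is about
# twice as bright overall, which centering is meant to remove.
ch1 = [100.0, 200.0, 400.0, 800.0, 1600.0]
ch2 = [200.0, 400.0, 3200.0, 1600.0, 800.0]
centered = median_center_log_ratios(ch1, ch2)
```

After centering, only the third and fifth spots keep a nonzero log ratio; the overall twofold brightness difference between channels is gone.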
MicroArray
Bioinformatics challenges
1. data management
2. utilizing data from multiple experiments
(type II)
3. utilizing data from multiple groups
* with different technologies
* with only processed data available
[Figure: genes A to E by conditions 1 to 4, marked +, -, 0, or ?; expression vs. time (0 to 180) across timepoints 1 to 4 with a search window; local alignment of sequences (3’ … A C G G G C … … ATG … 5’); database(s)]
MicroArray
data management
clone - spot
clone - gene
raw expression level
normalized expression level
annotation/links
expression profile
Microarray Experiment Management Redux
Experiment 5-Tuple:
(Probe Set_ID, Target_ID, Hyb Condition_ID, Hyb Iteration_ID,
GenePix_Analysis_ID)
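The experiment 5-tuple can be represented as an immutable record that serves as a lookup key for results. This is only a sketch: the class name, the integer ID types, and the in-memory dictionary are assumptions, not the schema used by the databases listed below.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExperimentKey:
    """The experiment 5-tuple; frozen so it can serve as a dictionary key."""
    probe_set_id: int
    target_id: int
    hyb_condition_id: int
    hyb_iteration_id: int
    genepix_analysis_id: int

# Hypothetical in-memory store: one list of spot measurements per experiment.
results = {}
key = ExperimentKey(1, 7, 2, 1, 3)
results[key] = [("clone-001", 1030.0), ("clone-027", 4125.0)]

# An equal tuple of IDs retrieves the same experiment.
same = results[ExperimentKey(1, 7, 2, 1, 3)]
```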
Database Support (EBI Schema)
http://www.ebi.ac.uk/arrayexpress/
http://www.bioinf.man.ac.uk/microarray/maxd
Differential Expression
• Type I analysis
• Look for genes with vastly different
expression under different conditions
– How do you measure “vastly different”?
– What role should derived statistics play?
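One common, simple answer to "vastly different" is a fold-change cutoff on the log2 ratio. The sketch below uses a twofold cutoff in either direction, which is an arbitrary choice for illustration; the gene names and intensities are invented.

```python
import math

def differentially_expressed(intensities, min_fold=2.0):
    """Return genes whose condition 2 / condition 1 ratio is at least
    min_fold in either direction, with the log2 ratio for each."""
    threshold = math.log2(min_fold)
    hits = {}
    for gene, (cond1, cond2) in intensities.items():
        log_ratio = math.log2(cond2 / cond1)
        if abs(log_ratio) >= threshold:
            hits[gene] = log_ratio
    return hits

# Hypothetical normalized intensities: gene -> (condition 1, condition 2).
data = {
    "geneA": (500.0, 2000.0),
    "geneB": (800.0, 900.0),
    "geneC": (1200.0, 300.0),
}
hits = differentially_expressed(data)
```

A fixed cutoff ignores measurement variability, which is where statistics derived from replicate hybridizations would come in.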
Type I: Differential Expression
[Figure: scatter plot, Gene 1 vs Gene 2; both axes run from 0 to 60000]
Coordinated Gene Expression
• Type II analysis
• “Eisen”ized data (dendrograms)
• Self-Organizing Maps
• Principal Component Analysis
• k-means Clustering
Hierarchical Clustering
Self Organizing Maps
Current Software
Software Name       Provider
Array Explorer      Spotfire Inc
Array Gauge         FujiFilm
ArrayDesigner       Premier Biosoft Intl Inc
ArraySCOUT          Lion Bioscience
ArrayStat           Imaging Research Inc
ArrayViewer         TIGR
ArrayVision         Imaging Research Inc
arrayWoRx           Applied Precision Inc
Cluster/Xcluster    Stanford University
Crazy Quant         U of Washington
GeneCluster         MIT
GenePix Pro         Axon Instruments
GeneSight           Biodiscovery
GeneSpring          Silicon Genetics
GeneTAC             Genomic Solutions
Imagene             Biodiscovery
MicroArray Suite    Scanalytics
Micromax            NEN
OmniGrid            GeneMachines
Pathways Analysis   Research Genetics
Quant Array         Packard Instrument Co
ScanAlyze           Stanford University
SeqArray            GCG
Spotfinder          TIGR
XDotsReader         Cose
Resolver            Rosetta
[Table columns marking each package's capabilities (Probe Set Design, Quantization, Statistics, Normalization, Diff Exp, CGE) are not recoverable here]
Software/Pipeline Integration
• A centralized database facilitates the archival,
manipulation, and mining of all microarray data
• Most analysis programs can output data in a textual format
which is easily input into the database
• Output from one program can be used as input to a second
program either directly or through a filtering operation
facilitated by the database and a set of programs to mine
and manipulate the data
• Data from multiple hybridizations may need to be
combined in order to perform coordinated gene expression
analysis
Standards...
Want ability to exchange microarray
experiment data using a common format.
MGED -- Microarray Gene Expression Data group
www.mged.org
MAGEML
Rosetta Inpharmatics
GEML -- www.geml.org
MIAME -- Minimum Information About a Microarray Experiment
Data and Limitations
Current Controversy:
Should the raw data be archived?
If so, who should do it?
Each slide (25 mm x 75 mm) is scanned at
200 pixels per mm.
Typical spot size = 100 µm
Center-to-center = 195 µm
Potential spots = 42,000
“Raw” image size = ~250 MB
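The figures above can be sanity-checked with a little arithmetic. The slide size, scan resolution, and spot spacing come from the text; the 16-bit pixel depth and the two channels are assumptions made for this sketch.

```python
SLIDE_MM = (25, 75)        # slide dimensions from the text
PIXELS_PER_MM = 200        # scan resolution from the text
PITCH_MM = 0.195           # center-to-center spacing from the text
BYTES_PER_PIXEL = 2        # assumption: 16-bit intensities
CHANNELS = 2               # assumption: one image per fluor

# Total pixels in one scanned image of the whole slide.
pixels = (SLIDE_MM[0] * PIXELS_PER_MM) * (SLIDE_MM[1] * PIXELS_PER_MM)

# Uncompressed size of all channels, in megabytes.
raw_megabytes = pixels * BYTES_PER_PIXEL * CHANNELS / 1e6

# Upper bound on spots if the entire slide surface were printed.
max_spots = int(SLIDE_MM[0] / PITCH_MM) * int(SLIDE_MM[1] / PITCH_MM)

print(pixels, raw_megabytes, max_spots)
```

Under these assumptions the raw data comes to a few hundred megabytes per slide, the same order as the quoted ~250 MB. The spot upper bound is somewhat above the quoted 42,000, which would be consistent with part of the slide being left unprinted.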
Other Types of Microarrays
• Genomic BAC arrays
– allow assessment of “small” deletions
• Tissue arrays
– allow assessment of protein expression
Type II: Data Partitioning
• Identify genes with similar expression
• Grouping unknown genes with known
genes may provide insight into function of
unknown genes
• Only useful for genes with varying
expression levels
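A minimal sketch of grouping by similar expression: drop genes whose profiles barely vary, then report genes whose profiles correlate strongly with a known gene. Pearson correlation and the cutoffs are choices made here for illustration, and the profiles are invented; the query gene itself is assumed to have a varying profile.

```python
from statistics import mean, pvariance

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    norm = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return cov / norm

def similar_genes(profiles, query, min_r=0.9, min_variance=0.01):
    """Genes whose profile correlates with the query gene's profile.

    Flat profiles are skipped first: correlation is undefined or
    uninformative for genes whose expression does not vary.
    """
    varying = {g: p for g, p in profiles.items() if pvariance(p) >= min_variance}
    return sorted(g for g, p in varying.items()
                  if g != query and pearson(varying[query], p) >= min_r)

# Hypothetical log-ratio profiles over four timepoints.
profiles = {
    "known":    [0.0, 1.0, 2.0, 1.0],
    "unknown1": [0.1, 1.1, 2.2, 0.9],
    "unknown2": [2.0, 1.0, 0.0, 1.0],
    "flat":     [0.5, 0.5, 0.5, 0.5],
}
neighbors = similar_genes(profiles, "known")
```

Here `unknown1` tracks the known gene and is a candidate for shared function, `unknown2` is anti-correlated, and `flat` is excluded before any correlation is computed.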
Protein Expression
Protein expression may not correlate with
mRNA expression.
How to measure levels of protein expression?
Immunochemistry
2-antibody approach
Protein Expression
Indirect Immunofluorescence
1. fix the cells
2. permeabilize the cells
3. incubate with primary antibody
4. incubate with secondary antibody
Protein Expression
Immunofluorescence
green -- tubulin
red -- gamma tubulin
blue -- DNA
Protein Expression
Immunofluorescence
red -- alpha tubulin
green -- vimentin (cytoskeletal protein)
blue -- DNA
Protein Expression
High-throughput methods
array multiple tissue samples onto slide, and
hybridize