Microarray Data Analysis

The Bioinformatics side of the bench
The anatomy of your data files
from Affymetrix array analysis
• .DAT= image file (~10^7 pixels)
• .CEL= measured cell intensities
• .CDF= chip description file (identifies
probe sets and probe pairs)
• .CHP= calculated probe set data
• .RPT= report generated from .CHP
Quality Control (QC) of the
chip – visual inspection
• Look at the .DAT file or the .CHP file
image
– Scratches? Spots?
– Corners and outside border
checkerboard appearance (B2 oligo)
• Positive hybridization control
• Used by software to place grid over image
– Array name is written out in oligos!
Chip defects
Internal controls
• B. subtilis genes (added poly-A tails)
– Assessment of quality of sample preparation
– Also as hybridization controls
• Hybridization controls (bioB, bioC, bioD, cre)
– E. coli and P1 bacteriophage biotin-labeled cRNAs
– Spiked into the hybridization cocktail
– Assess hybridization efficiency
• Actin and GAPDH assess RNA sample/assay quality
– Compare signal values from 3’ end to signal values
from 5’ end
• ratio generally should not exceed 3
• Percent genes present (%P)
– Replicate samples - similar %P values
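The 3'/5' check above can be sketched in a few lines of Python. This is a minimal illustration, not the vendor's QC code: the function names and the signal values are hypothetical, and the cutoff of 3 is the rule of thumb from the slide.

```python
def three_to_five_ratio(signal_3p: float, signal_5p: float) -> float:
    """Ratio of the 3' probe set signal to the 5' probe set signal."""
    if signal_5p <= 0:
        raise ValueError("5' signal must be positive")
    return signal_3p / signal_5p


def passes_qc(signal_3p: float, signal_5p: float, max_ratio: float = 3.0) -> bool:
    """True when the 3'/5' ratio does not exceed the cutoff."""
    return three_to_five_ratio(signal_3p, signal_5p) <= max_ratio


# Hypothetical (3' signal, 5' signal) values, for illustration only
controls = {"GAPDH": (1200.0, 1000.0), "actin": (2000.0, 500.0)}
flags = {name: passes_qc(s3, s5) for name, (s3, s5) in controls.items()}
print(flags)
```

A ratio well above the cutoff suggests degraded RNA or inefficient labeling, so the sample deserves a closer look before further analysis.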
Microarray Data Process/Outline
1. Experimental Design
2. Image Analysis – scan to intensity measures (raw
data)
3. Normalization – “clean” data
4. More “low level” analysis-fold change, ANOVA,
data filtering
5. Data mining-how to interpret > 6000 measures
– Databases
– Software
– Techniques-clustering, pattern recognition etc.
– Comparing to prior studies, across platforms?
6. Validation
Experimental Design
A good microarray design has 4 elements
1. A clearly defined biological question or hypothesis
2. Treatment, perturbation and observation of biological
materials should minimize systematic bias
3. Simple and statistically sound arrangement that minimizes
cost and gains maximal information
4. Compliance with MIAME (Minimum Information About a
Microarray Experiment)
• The goal of statistics is to find signals in a sea of noise
• The goal of exp. design is to reduce the noise so signals can
be found with as small a sample size as possible
Observational Study vs.
Designed Experiment
• Observational study– Investigator is a passive observer who
measures variables of interest, but does not
attempt to influence the responses
• Designed Experiment– Investigator intervenes in natural course of
events
What type is our DMSO exp?
Experimental Replicates
• Why?
– In any exp. system there is a certain amount of
noise—so even 2 identical processes yield slightly
different results
– Sources?
– In order to understand how much variation there is
it is necessary to repeat an exp a # of independent
times
– Replicates allow us to use statistical tests to
ascertain if the differences we see are real
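To make the last point concrete, here is a minimal sketch of one statistical test that replicates make possible: Welch's t statistic for two groups. The choice of test and the log2 signal values are assumptions for illustration; the slides do not specify which test is used.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent groups of replicates."""
    var_a, var_b = variance(a), variance(b)  # sample variances (n - 1 denominator)
    standard_error = sqrt(var_a / len(a) + var_b / len(b))
    return (mean(a) - mean(b)) / standard_error


# Hypothetical log2 signals for one gene, three replicates per condition
control = [7.9, 8.1, 8.0]
treated = [9.0, 9.2, 8.8]
t = welch_t(treated, control)
print(round(t, 2))
```

With a single array per condition the variance cannot be estimated at all (`statistics.variance` raises an error), which is the practical reason replicates are required.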
Technical vs. Biological Replicates
As we progress from the starting material to the scanned
image we are moving from a system dominated by biological
effects through one dominated by chemistry and physics noise
Within Affy platform the dominant variation is usually of a
biological nature thus best strategy is to produce replicates as
high up the experimental tree as possible
Low level data analysis / pre-processing
• Varying biological or cellular
composition among sample
types.
• Differences in sample
preparation, labeling or
hybridization
• Non-specific cross-hybridization of target to
probes.
Lead to systematic differences
between individual arrays
• Raw Data Quality Control
• Scaling
• Normalization and
filtering.
Image Analysis - Raw Data
From probe level signals to gene abundance
estimates
The job of the expression summary algorithm is
to take a set of Perfect Match (PM) and MisMatch (MM) probes, and use these to generate
a single value representing the estimated
amount of transcript in solution, as measured
by that probeset.
To do this, .DAT files containing array images are first
processed to produce a .CEL file, which contains
measured intensities for each probe on the array.
It is the .CEL files that are analyzed by the expression
calling algorithm.
MAS 5.0 output files
• For each transcript (gene) on the chip:
– signal intensity
– a “present” or “absent” call (presence call)
– p-value (significance value) for making that
call
• Each gene associated with GenBank
accession number (NCBI database)
How are transcripts determined to be
present or absent?
• Probe pair (PM vs. MM) intensities
– generate a detection p-value
• assign “Present”, “Absent”, or “Marginal”
call for transcript
• Every probe pair in a probe SET has
a potential “vote” for presence call
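The "vote" idea can be sketched as follows. This is a simplified illustration under stated assumptions: it uses the discrimination score R = (PM - MM) / (PM + MM) and a small threshold (0.015 is the commonly cited default, treated here as an assumption). The real algorithm turns these scores into a detection p-value with a signed-rank test, which is not reproduced here. All intensities are hypothetical.

```python
TAU = 0.015  # assumed threshold; commonly cited as the MAS 5.0 default


def discrimination_score(pm: float, mm: float) -> float:
    """Score near +1 when PM dominates, near 0 or negative when it does not."""
    return (pm - mm) / (pm + mm)


def count_votes(probe_pairs, tau: float = TAU):
    """Return (pairs voting for presence, total pairs) for one probe set."""
    scores = [discrimination_score(pm, mm) for pm, mm in probe_pairs]
    return sum(1 for r in scores if r > tau), len(scores)


# Hypothetical (PM, MM) intensities for one probe set
pairs = [(1200, 300), (900, 850), (400, 600), (1500, 200)]
n_for, n_total = count_votes(pairs)
print(n_for, "of", n_total, "probe pairs vote for presence")
```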
PM and MM Probes
• The purpose of each MM probe is to provide a direct
measure of background and stray-signal (perhaps due
to cross-hybridization) for its perfect-match partner. In
most situations the signal from each probe-pair is simply
the difference PM - MM.
• For some probe-pairs, however, the MM signal is
greater than the PM value; we have an apparently
impossible measure of background.
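A short sketch of the problem described above, using hypothetical intensities: computing PM - MM for each pair and listing the pairs where the difference is not positive. Such values cannot be log-transformed, which is why summary algorithms need a special rule for them.

```python
def pair_signals(probe_pairs):
    """Naive probe-pair signal: PM minus MM."""
    return [pm - mm for pm, mm in probe_pairs]


# Hypothetical (PM, MM) intensities
pairs = [(1200, 300), (900, 850), (400, 600)]
diffs = pair_signals(pairs)

# Indices of pairs where MM is at least as large as PM
problem_pairs = [i for i, d in enumerate(diffs) if d <= 0]
print(diffs, problem_pairs)
```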
Thank goodness for software!!!
• MAS 5.0 does these calculations for you
– .CHP file
• Basic analysis in MAS 5.0, but it won’t
handle replicates
• Import MAS 5.0 (.CHP) data into other
software, Genesifter, GCOS, SpotFire,
and many others
Signal Intensity
• Following these calculations, the MAS 5.0
algorithm now has a measure of the
signal for each probe in a probeset.
• Other algorithms, e.g. RMA, GCRMA,
dChip, PLIER and others, have been
developed (mostly by academic teams) to improve
the precision and accuracy of this
calculation
• In our Exp we will use RMA and GCRMA
How do we want to analyze
this data?
• Pairwise analysis is most appropriate
– Control vs. DMSO
• List of genes that are “upregulated” or
“downregulated”
• Determine fold up or down cutoffs
– What is significant?
• 1.5 fold up/down?
• 2 fold up/down?
• 10 fold up/down?
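The cutoff question can be explored with a small sketch like the one below. The gene names and signal values are hypothetical, and the default cutoff of 2-fold is only one of the options listed above. Working on the log2 scale makes up- and down-regulation symmetric (2-fold up is +1, 2-fold down is -1).

```python
from math import log2


def log2_fold_change(treated: float, control: float) -> float:
    """log2 of the treated/control signal ratio."""
    return log2(treated / control)


def classify(treated: float, control: float, fold_cutoff: float = 2.0) -> str:
    """Label a gene 'up', 'down' or 'unchanged' for a given fold cutoff."""
    lfc = log2_fold_change(treated, control)
    threshold = log2(fold_cutoff)
    if lfc >= threshold:
        return "up"
    if lfc <= -threshold:
        return "down"
    return "unchanged"


# Hypothetical (DMSO signal, control signal) per gene
genes = {"geneA": (400.0, 100.0), "geneB": (100.0, 130.0), "geneC": (50.0, 200.0)}
calls = {name: classify(t, c) for name, (t, c) in genes.items()}
print(calls)
```

Changing `fold_cutoff` to 1.5 or 10 shows how strongly the length of the gene list depends on this choice.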
Normalization - “clean” data
• “Normalizing” data allows
comparisons ACROSS different
chips
– Intensity of fluorescent markers might
be different from one batch to the other
– Normalization allows us to compare
those chips without altering the
interpretation of changes in GENE
EXPRESSION
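One simple way to make chips comparable is global scaling: multiply every value on a chip by a factor so that all chips share the same average intensity. The sketch below is an illustration only. It uses a plain mean and an arbitrary target of 500, whereas production software typically uses a trimmed mean; the intensities are hypothetical.

```python
from statistics import mean


def scale_to_target(intensities, target: float = 500.0):
    """Rescale one chip so its mean intensity equals the target."""
    factor = target / mean(intensities)
    return [x * factor for x in intensities]


# Hypothetical chips: same expression pattern, but chip B was scanned brighter
chip_a = [100.0, 200.0, 300.0]
chip_b = [200.0, 400.0, 600.0]

scaled_a = scale_to_target(chip_a)
scaled_b = scale_to_target(chip_b)
print(scaled_a, scaled_b)
```

After scaling, the two chips agree, so the overall brightness difference is no longer mistaken for a change in gene expression.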
Why Normalize Data?
•The experimental goal is to identify biological variation
(expression changes between samples)
•Technical variation can hide the real data
•Unavoidable systematic bias should be recognized and
corrected
•Normalization is necessary to effectively make comparisons
between chips-and sometimes within a single chip.
•There are different methods of normalization; the
assumptions about where variation exists will determine the
normalization technique used.
•Always look at data before and after normalization
•Spike in controls can help show which method may be best
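As one example of a method with a strong assumption, quantile normalization (the approach used within RMA) assumes that every chip should have the same overall intensity distribution. The sketch below is a minimal pure-Python version with hypothetical values; it ignores ties, which real implementations handle more carefully.

```python
def quantile_normalize(chips):
    """chips: list of equal-length intensity lists, one list per array."""
    n = len(chips[0])
    sorted_chips = [sorted(chip) for chip in chips]
    # Reference distribution: mean of the k-th smallest value across chips
    reference = [sum(s[k] for s in sorted_chips) / len(chips) for k in range(n)]

    normalized = []
    for chip in chips:
        # Probe indices ordered from lowest to highest intensity on this chip
        order = sorted(range(n), key=lambda i: chip[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]  # replace value by the reference at its rank
        normalized.append(out)
    return normalized


# Hypothetical intensities for three probes on two chips
chips = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
result = quantile_normalize(chips)
print(result)
```

Each probe keeps its rank within its own chip, but all chips end up with identical sets of values. Plotting the distributions before and after is a quick way to follow the "always look at data" advice above.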
Caveat…
• There is NO standard way to
analyze microarray data
• Still figuring out how to get the “best”
answers from microarray
experiments
• Best to combine knowledge of
biology, statistics, and computers to
get answers
Venn Diagrams
[Figure: two Venn diagrams, each with the sets MAS 5.0, GCRMA and RMA]
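The regions of a three-way Venn diagram correspond directly to set operations. The sketch below assumes each algorithm has produced a list of changed genes; the gene identifiers are placeholders, not real results.

```python
# Placeholder gene lists, one per summary algorithm (hypothetical)
mas5 = {"g1", "g2", "g3", "g4"}
rma = {"g2", "g3", "g5"}
gcrma = {"g3", "g4", "g5", "g6"}

all_three = mas5 & rma & gcrma              # centre of the diagram
only_mas5 = mas5 - rma - gcrma              # found by MAS 5.0 alone
rma_and_gcrma_only = (rma & gcrma) - mas5   # shared by RMA and GCRMA, missed by MAS 5.0

print(sorted(all_three), sorted(only_mas5), sorted(rma_and_gcrma_only))
```

Genes in the centre region are called by every method and are often the safest candidates for follow-up.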
Data processing is completed
now what?
Fold change, ANOVA, Data filtering
Where are we now?
• Ran analysis, output is a GENE
LIST
– List indicates what genes are up or
down regulated
– p values for t-test
– Graphs of signal levels
• Absolute numbers not as important here as
the trends you see
– Now what????
What is the first set of genes on our chips
that will be “filtered” out?
Follow the links
• Click on a gene
• Find links to other databases
• Follow links to discover what the
protein does
• Now the fun part begins….
Back to Biology
• Do the changes you see in gene
expression make sense
BIOLOGICALLY?
• If they don’t make sense, can you
hypothesize as to why those genes
might be changing?
• Leads to many, many more
experiments
The Gene Ontologies
A Common Language for Annotation of
Genes from
Yeast, Flies and Mice
…and Plants and Worms
…and Humans
…and anything else!
Gene Ontology
Objectives
• GO represents concepts used to classify
specific parts of our biological knowledge:
– Biological Process
– Molecular Function
– Cellular Component
• GO develops a common language applicable
to any organism
• GO terms can be used to annotate gene
products from any species, allowing
comparison of information across species
Srinija Srinivasan, Chief Ontologist, Yahoo!
The ontology. Dividing human
knowledge into a clean set of categories
is a lot like trying to figure out where to
find that suspenseful black comedy at
your corner video store. Questions
inevitably come up, like are Movies part
of Art or Entertainment? (Yahoo! lists
them under the latter.) -Wired
Magazine, May 1996
The 3 Gene Ontologies
• Molecular Function = elemental activity/task
– the tasks performed by individual gene products; examples are
carbohydrate binding and ATPase activity
• Biological Process = biological goal or
objective
– broad biological goals, such as mitosis or purine metabolism, that are
accomplished by ordered assemblies of molecular functions
• Cellular Component = location or complex
– subcellular structures, locations, and macromolecular complexes;
examples include nucleus, telomere, and RNA polymerase II
holoenzyme
Example:
Gene Product = hammer
Function (what)            Process (why)
Drive nail (into wood)     Carpentry
Drive stake (into soil)    Gardening
Smash roach                Pest Control
Clown’s juggling object    Entertainment
Biological Examples
[Figure: examples for Biological Process, Molecular Function and Cellular Component]
Validation
• Not enough to just do microarrays
• Usually “validate” microarray results
via some other technique
– RT-PCR
– TaqMan
– Northern analysis
– Protein level analysis
• No technique is perfect…
Yeast Genome and Data Mining
Dynamic Nature of Yeast Genome
eORF= essential
kORF= known
hORF= homology identified
shORF= short
tORF= transposon identified
qORF= questionable
dORF= disabled
First published sequence claimed 6274 genes– a # that
has been revised many times, why?
[Figure values: 6603, 4373, 1410, 820]
The Affy detection oligonucleotide sequences are frozen at the time
of synthesis, how does this impact downstream data analysis?
Terms, Definitions, IDs
term: MAPKKK cascade (mating sensu
Saccharomyces)
goid: GO:0007244
definition: MAPKKK cascade involved in transduction
of mating pheromone signal, as described in
Saccharomyces
definition_reference: PMID:9561267
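The four fields above map naturally onto a small record type. The sketch below is one possible representation, not an official GO data model: the class name `GOTerm`, the annotation dictionary and the gene name `YOUR_GENE` are hypothetical, while the field values are copied from the term shown above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GOTerm:
    goid: str
    term: str
    definition: str
    definition_reference: str


mating_cascade = GOTerm(
    goid="GO:0007244",
    term="MAPKKK cascade (mating sensu Saccharomyces)",
    definition=(
        "MAPKKK cascade involved in transduction of mating pheromone "
        "signal, as described in Saccharomyces"
    ),
    definition_reference="PMID:9561267",
)

# Hypothetical annotation table: gene name -> list of GO terms
annotations = {"YOUR_GENE": [mating_cascade]}

for gene, terms in annotations.items():
    for t in terms:
        print(gene, t.goid, t.term)
```

Because the ID is stable, gene lists from different organisms or databases can be compared by `goid` even when gene names differ.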
SGD
SGD public microarray data sets available
for public query
Homework
1. Go to http://www.yeastgenome.org/ and find 3 candidate genes of
known f(x) and one of undefined f(x) that you might predict to be
altered by DMSO treatment.
2. What GO biological processes and molecular functions are
associated with your candidate genes?
3. Where, subcellularly, does the protein reside in the cell?
4. What other proteins are known or inferred to interact with yours? How
was this interaction determined? Is this a genetic or physical
interaction?
5. Find the expression of at least one of your known genes in another
publicly deposited microarray data set.
   1. Name of data set and how you found it?
   2. What is the largest fold change observed for this gene in the public study?
6. Now that you are microarray technology experts, can you give me 3
reasons why the observed transcript level difference may not be
confirmed through a second technology like RT-qPCR?
Suggested Reading