Microarray Data Analysis
The Bioinformatics side of the bench
The anatomy of your data files
from MAS 5.0 (Microarray Suite
5.0)
• .DAT
• .CEL
• .EXP
• .CHP
• .txt files generated from .CHP
Quality Control (QC) of the
chip – visual inspection
• Look at the .DAT file or the .CHP file
image
– Scratches? Spots?
– Corners and outside border
checkerboard appearance (B2 oligo)
• Positive hybridization control
• Used by software to place grid over image
– Array name is written out in oligos!
Scratch on a chip
Possible chip
contamination
Internal controls
• B. subtilis genes (added poly-A tails)
– Assessment of quality of sample
preparation
– Also as hybridization controls
– Not used in our module
More internal controls
• Eukaryotic Hybridization controls
(bioB, bioC, bioD, cre)
– E. coli and P1 bacteriophage biotin-labeled cRNAs
– Spiked into the hybridization cocktail
– Assess hybridization efficiency
And still more internal
controls
• Actin and GAPDH assess RNA
sample/assay quality
– Compare signal values from 3’ end to
signal values from 5’ end
• ratio generally should not exceed 3
• Percent genes present (%P)
– Replicate samples - similar %P values
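The 3’/5’ ratio check above can be sketched in a few lines of Python. This is only an illustration: the function names and the signal values are made up for the example, and the cutoff of 3 is the rule of thumb from the slide.

```python
def three_prime_to_five_prime_ratio(signal_3p, signal_5p):
    """Ratio of the 3' probe set signal to the 5' probe set signal."""
    if signal_5p <= 0:
        raise ValueError("5' signal must be positive")
    return signal_3p / signal_5p


def passes_ratio_check(signal_3p, signal_5p, max_ratio=3.0):
    """True when the 3'/5' ratio does not exceed the cutoff."""
    return three_prime_to_five_prime_ratio(signal_3p, signal_5p) <= max_ratio


# Hypothetical GAPDH signal values from two chips (not real data)
print(passes_ratio_check(1200.0, 900.0))  # ratio is about 1.33
print(passes_ratio_check(1200.0, 300.0))  # ratio is 4.0
```

A ratio well above 3 suggests the 5’ end of the transcript is under-represented, which is worth checking before trusting that chip.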
MAS 5.0 output files
• For each transcript (gene) on the chip:
– signal intensity
– a “present” or “absent” call (presence call)
– p-value (significance value) for making that
call
• Each gene associated with GenBank
accession number (NCBI database)
How are transcripts
determined to be present or
absent?
• Probe pair (PM vs. MM) intensities
– generate a detection p-value
• assign “Present”, “Absent”, or “Marginal”
call for transcript
• Every probe pair in a probe SET has
a potential “vote” for presence call
Discrimination score
• Probe pairs “vote” via discrimination
score (R)
• R compared to a predetermined
threshold: Tau
– R > Tau = present
– R < Tau = absent
• Voting result expressed as p-value
– Reflects confidence of expression call
Altering Tau
• You can fine tune Tau yourself within
MAS 5.0
• Increase Tau: reduce “false
positives”, may also reduce number
of TRUE present calls
• Our rule: use the default!
Calculation of R
R = (PM - MM) / (PM + MM)
– (PM – MM):
• intensity difference of probe pair
– (PM + MM):
• overall hybridization intensity
– R value closer to 1: lower p-value (detection call is
more significant)
• PM >> MM
– R value close to 0 or negative: higher p-value
(detection call is less significant)
• MM ≥ PM
– One-sided Wilcoxon’s Signed Rank test used to
determine Detection p-value
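To make the voting idea concrete, here is a minimal sketch in plain Python of the discrimination score and a one-sided Wilcoxon signed rank test on (R - Tau). It is not MAS 5.0's own code. The Tau value of 0.015 and the p-value cutoffs of 0.04 and 0.06 are assumed defaults that are not given in these slides, the intensities are invented, and the exact test simply enumerates every sign pattern, which is only practical for a probe set of roughly 20 pairs or fewer.

```python
from itertools import product


def discrimination_scores(pm, mm):
    """R = (PM - MM) / (PM + MM) for each probe pair (intensities assumed > 0)."""
    return [(p - m) / (p + m) for p, m in zip(pm, mm)]


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def detection_p_value(pm, mm, tau=0.015):
    """Exact one-sided Wilcoxon signed rank p-value for median(R - tau) > 0."""
    diffs = [r - tau for r in discrimination_scores(pm, mm)]
    diffs = [d for d in diffs if d != 0]  # zero differences carry no vote
    if not diffs:
        return 1.0
    ranks = average_ranks([abs(d) for d in diffs])
    observed = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Under the null hypothesis every sign pattern is equally likely.
    at_least_as_large = 0
    total = 0
    for signs in product((0, 1), repeat=len(ranks)):
        total += 1
        if sum(r for r, s in zip(ranks, signs) if s) >= observed:
            at_least_as_large += 1
    return at_least_as_large / total


def detection_call(p, alpha1=0.04, alpha2=0.06):
    """Turn a detection p-value into a call (cutoffs are assumed defaults)."""
    if p < alpha1:
        return "Present"
    if p < alpha2:
        return "Marginal"
    return "Absent"


# Hypothetical intensities for an 11-pair probe set (not real data)
pm = [1000, 1200, 900, 1100, 1300, 950, 1050, 1150, 1250, 980, 1020]
mm = [200, 250, 180, 220, 300, 190, 210, 230, 260, 200, 205]
p = detection_p_value(pm, mm)
print(p, detection_call(p))
```

When every PM is far above its MM, all 11 pairs vote "present" and the p-value is as small as it can be for 11 pairs. Swapping the PM and MM lists gives a p-value of 1 and an "Absent" call.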
Calculating signal
• One-Step Tukey Biweight Estimate
– Yields robust weighted mean
– Relatively insensitive to even extreme
outliers
• Signal intensity value is created
– related to amount of transcript present
for that gene
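A one-step Tukey biweight can be sketched as follows. This shows the general technique only: the tuning constant c = 5 and the small epsilon are assumed values not stated in these slides, the input numbers are invented, and the real signal calculation works on background-adjusted, log-transformed probe values, which this sketch leaves out.

```python
from statistics import median


def tukey_biweight(values, c=5.0, epsilon=1e-4):
    """One-step Tukey biweight: a weighted mean that down-weights outliers."""
    m = median(values)
    # Median absolute deviation (MAD) measures the typical spread.
    s = median([abs(x - m) for x in values])
    weights = []
    for x in values:
        u = (x - m) / (c * s + epsilon)  # distance from the median in scaled units
        weights.append((1 - u * u) ** 2 if abs(u) < 1 else 0.0)
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


# Hypothetical log-scale probe values with one extreme outlier (15.0)
probe_values = [8.0, 8.1, 7.9, 8.2, 8.0, 15.0]
print(sum(probe_values) / len(probe_values))  # plain mean, pulled up by the outlier
print(tukey_biweight(probe_values))           # biweight, stays near 8
```

The outlier sits so far from the median that its weight is zero, so it has no effect on the estimate.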
Thank goodness for
software!!!
• MAS 5.0 does these calculations for you
– .CHP file
• Basic analysis in MAS 5.0, but it won’t
handle replicates
• Import MAS 5.0 (.CHP) data into
GeneSifter
– web based microarray data analysis software
package designed BY biologists FOR
biologists
How do we want to analyze
this data?
• Pairwise analysis is most appropriate
– Control vs. DMSO
• List of genes that are “upregulated” or
“downregulated”
• Determine fold up or down cutoffs
– What is significant?
• 1.5 fold up/down?
• 2 fold up/down?
• 10 fold up/down?
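A fold-change cutoff can be applied like this. It is a sketch with made-up signal values; the signed convention (+2 for 2 fold up, -2 for 2 fold down) and the function names are assumptions for the example, not necessarily how GeneSifter reports it.

```python
def fold_change(control, treated):
    """Signed fold change: +2.0 means 2 fold up, -2.0 means 2 fold down."""
    if control <= 0 or treated <= 0:
        raise ValueError("signals must be positive")
    ratio = treated / control
    return ratio if ratio >= 1 else -1 / ratio


def classify(control, treated, cutoff=2.0):
    """Label a gene as up, down, or unchanged for a given fold cutoff."""
    fc = fold_change(control, treated)
    if fc >= cutoff:
        return "up"
    if fc <= -cutoff:
        return "down"
    return "unchanged"


# Hypothetical mean signals (control, treated) for three genes
print(classify(100.0, 250.0))              # 2.5 fold up
print(classify(400.0, 100.0))              # 4 fold down
print(classify(100.0, 150.0, cutoff=2.0))  # 1.5 fold up, below a 2 fold cutoff
```

Changing the cutoff from 2 to 1.5 moves the third gene into the "up" list, which is why the choice of cutoff matters.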
Normalization
• “Normalizing” data allows
comparisons ACROSS different
chips
– Intensity of fluorescent markers might
be different from one batch to the other
– Normalization allows us to compare
those chips without altering the
interpretation of changes in GENE
EXPRESSION
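One simple way to normalize is to scale every chip so that its average signal matches a common target. The sketch below assumes that approach, a target of 500, and invented signal values; it is only one of several normalization methods and may differ from what your software does.

```python
def scale_to_target(signals, target=500.0):
    """Multiply every signal on a chip so the chip mean equals the target."""
    mean_signal = sum(signals) / len(signals)
    factor = target / mean_signal
    return [s * factor for s in signals]


# Hypothetical chips: chip_b is simply twice as bright as chip_a
chip_a = [100.0, 200.0, 300.0]
chip_b = [200.0, 400.0, 600.0]
print(scale_to_target(chip_a))
print(scale_to_target(chip_b))
```

After scaling, the two chips give the same values, so the overall brightness difference is no longer mistaken for a change in gene expression.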
Statistics
• Statistical tests allow us to determine how
SIGNIFICANT the data are
• t-test statistic
– compares the means of two groups while
taking into account the variability (standard
deviation) within each group
• p value (probability value) of ≤ 0.05
– (if there were NO real change in gene
expression, a difference this large would show
up by chance only 5 times out of 100 or less)
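The t statistic itself is easy to compute by hand. The sketch below uses Welch's version of the t-test (it does not assume equal variances in the two groups), which may not be the exact variant your software runs, and the replicate values are invented.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic and approximate degrees of freedom for two groups."""
    va = variance(group_a) / len(group_a)  # squared standard error, group A
    vb = variance(group_b) / len(group_b)  # squared standard error, group B
    t = (mean(group_b) - mean(group_a)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (
        va ** 2 / (len(group_a) - 1) + vb ** 2 / (len(group_b) - 1)
    )
    return t, df


# Hypothetical signal values for one gene, three replicates per group
control = [100.0, 110.0, 90.0]
treated = [200.0, 210.0, 190.0]
t, df = welch_t(control, treated)
print(t, df)
```

Turning t and the degrees of freedom into a p value requires the t distribution, which a statistics package handles for you (for example, `scipy.stats.ttest_ind(control, treated, equal_var=False)` in Python).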
Present or absent?
• Can do analysis on genes that are
considered “absent” under all
conditions
• At least ONE of the two samples should call
the transcript “present” in a pairwise analysis
Thresholds/cutoffs
• What is a significant change in gene
expression?
– Some think 2 fold at the lowest
– Judgement call
– Can also set upper limit of expression
changes
• Remember we are talking about
changes in mRNA expression
– does that always mean more protein?
The output
• Run analysis, get output of a GENE
LIST
– List indicates what genes are up or
down regulated
– p values for t-test
– Graphs of signal levels
• Absolute numbers not as important here as
the trends you see
– Now what????
Follow the links
• Click on a gene
• Find links to other databases
• Follow links to discover what the
protein does
• Now the fun part begins….
Back to Biology
• Do the changes you see in gene
expression make sense
BIOLOGICALLY?
• If they don’t make sense, can you
hypothesize as to why those genes
might be changing?
• Leads to many, many more
experiments
Validation
• Not enough to just do microarrays
• Usually “validate” microarray results
via some other technique
– RT-PCR
– TaqMan
– Northern analysis
– Protein level analysis
• No technique is perfect…
Why microarrays?
• Ask a single question, and get more
answers than you dreamed of!
• Can assess GLOBAL changes in
gene expression under a certain
experimental condition
• Can discover new pathways, gene
regulation, the possibilities are
almost endless
Caveat…
• There is NO standard way to
analyze microarray data
• Still figuring out how to get the “best”
answers from microarray
experiments
• Best to combine knowledge of
biology, statistics, and computers to
get answers
One last note
• Microarrays are “cutting edge”
technology
• You now have experience doing a
technique that most Ph.D.s have
never done
• Looks great on a resume…