Figure omitted for copyright reasons
A printed version can be found at
Leung YF, Lam DSC, Pang CP. The miracle of microarray data analysis.
Genome Biol. 2001 Aug 29; 2: 4021.1-4021.2.
“I think you should be more explicit here in step two”
~ Normal science consists largely of mopping-up operations. Experimentalists carry out
modified versions of experiments that have
been carried out many times before ~
Thomas S. Kuhn
The biologist’s FAQ:
What is the best microarray
analysis software?
Different kinds of microarray
software
Image analysis software
Data mining software
– Statistics software
• R packages for microarray analysis
SNPs analysis software
Database/ LIMS software
Public Expression Database
Primer design
Software for further data mining: annotation,
promoter analysis & pathway reconstruction
Software not discussed today
Hardware control software
– Arrayer controlling – ArrayMaker
– Scanner controlling/ Image acquisition
Statistics on current microarray
software
                       28 Feb 2002   Jan 2001
Image analysis         17            17
Data mining            39
R packages
SNP analysis           14            1
Database/ LIMS         14            4
Public Database        16            8
Accessory              8             -
Further data mining    9             -
Total                  116           29
* Extracted from http://ihome.cuhk.edu.hk/~b400559/arraysoft.html
Image analysis software
Spot recognition
Segmentation
– Foreground calculation
– Background calculation
Spot quality measures
Major image analysis software
AIDA array
ArrayPro
ArrayVision
Dapple
F-scan
GenePix Pro 3.0.5
ImaGene 4.0
Iconoclust
Iplab
Lucidea Automated Spotfinder
Phoretix Array3
P-scan
QuantArray 3.0
ScanAlyze 2
Spot
TIGR Spotfinder
UCSF Spot
Examples of commonly used image
analysis software
ScanAlyze 2 (Mike Eisen, LBNL)
GenePix Pro 3.0.5 (Axon Instruments)
QuantArray 3.0 (Packard Instrument)
ImaGene 4.0 (Biodiscovery)
Spot recognition
ArrayPro from Media Cybernetics
Automated and fast grid, subgrid and spot-finding
algorithms
Segmentation
Purpose – classification between foreground
and background
– Fixed circle
– Adaptive circle
– Adaptive shape
– Histogram method
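As an illustration of the simplest method listed above, here is a minimal sketch of fixed-circle segmentation in Python. It is not taken from any of the packages named in these slides. The function names, the 5x5 pixel patch, and the choice of mean foreground minus median background are all assumptions made for the example.

```python
from statistics import mean, median


def fixed_circle_segment(image, center_row, center_col, radius):
    """Classify pixels: inside the fixed circle = foreground, outside = background."""
    foreground, background = [], []
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            inside = (r - center_row) ** 2 + (c - center_col) ** 2 <= radius ** 2
            (foreground if inside else background).append(value)
    return foreground, background


def spot_intensity(image, center_row, center_col, radius):
    """Background-corrected signal: mean foreground minus median background (an assumed convention)."""
    fg, bg = fixed_circle_segment(image, center_row, center_col, radius)
    return mean(fg) - median(bg)


# Hypothetical 5x5 patch around one spot; values are made-up pixel intensities.
patch = [
    [10, 10, 10, 10, 10],
    [10, 10, 90, 10, 10],
    [10, 90, 100, 90, 10],
    [10, 10, 90, 10, 10],
    [10, 10, 10, 10, 10],
]
signal = spot_intensity(patch, center_row=2, center_col=2, radius=1)
```

The adaptive methods differ only in how the foreground mask is chosen: the circle radius, or the whole shape, is estimated per spot instead of being fixed in advance.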
Segmentation
Using an extra dye (DAPI) avoids morphology
assumptions
UCSF Spot
Spot quality measure
E.g. QuantArray 3.0
– Diameter
– Spot Area
– Footprint
– Circularity
– Spot Signal/Noise
– Spot Uniformity
– Background Uniformity
– Replicate Uniformity
Problem: spot quality lacks a rigorous definition
and experimental verification
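Since the measures above have no agreed definition, any formula is only one possible convention. The sketch below shows two commonly used ones, written in Python: circularity as 4πA/P² (1.0 for a perfect circle) and signal/noise as the foreground-background difference divided by the background standard deviation. These definitions are assumptions for illustration and are not claimed to be the ones QuantArray uses.

```python
import math
from statistics import mean, pstdev


def circularity(area, perimeter):
    """Assumed definition: 4*pi*area / perimeter**2, equal to 1.0 for a perfect circle."""
    return 4 * math.pi * area / perimeter ** 2


def signal_to_noise(foreground, background):
    """Assumed definition: (mean foreground - mean background) / std. dev. of background."""
    noise = pstdev(background)
    if noise == 0:
        return math.inf  # perfectly flat background
    return (mean(foreground) - mean(background)) / noise
```

A spot could then be flagged when, say, its circularity falls below a chosen cutoff, but any such cutoff would need the experimental verification that the slide says is missing.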
Future Image analysis software
Rigorous definition of quality measures
Extra dye for better segmentation
Automated analysis
Data mining software
Main purposes
1. Filtering and normalization
2. Statistical inference of differentially
expressed genes
3. Identification of biologically meaningful
patterns, i.e. expression profile; expression
fingerprint/ signature
4. Visualization
5. Other analyses, such as pathway reconstruction
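To make purpose 1 concrete, here is a minimal Python sketch of filtering followed by a global normalization for a two-colour array. The spot values, the intensity threshold, and the choice of median-centring the log2 ratios are assumptions for the example. Real packages offer many other filters and normalization methods.

```python
import math
from statistics import median


def filter_spots(spots, min_intensity):
    """Keep only spots whose red and green intensities both exceed the threshold."""
    return [s for s in spots if s["red"] > min_intensity and s["green"] > min_intensity]


def normalized_log_ratios(spots):
    """Compute log2(red/green) per gene, then subtract the array-wide median."""
    raw = {s["gene"]: math.log2(s["red"] / s["green"]) for s in spots}
    center = median(raw.values())
    return {gene: value - center for gene, value in raw.items()}


# Hypothetical spots; g4 is too dim and is removed by the filter.
spots = [
    {"gene": "g1", "red": 800.0, "green": 200.0},
    {"gene": "g2", "red": 400.0, "green": 200.0},
    {"gene": "g3", "red": 200.0, "green": 200.0},
    {"gene": "g4", "red": 30.0, "green": 20.0},
]
kept = filter_spots(spots, min_intensity=100.0)
ratios = normalized_log_ratios(kept)
```

Median-centring assumes that most genes on the array are not differentially expressed, which may not hold for small or specialty arrays.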
Different categories
Turnkey system
Comprehensive software
Specific analysis software
Extension/ accessory of other software
Major data mining software
AIDA Array
AMADA
ANOVA program for microarray data
ArrayMiner
arraySCOUT
ArrayStat
BRB ArrayTools
CHIPSpace
Cleaver
CIT
CLUSFAVOR
Cluster
Cyber T
DNA-arrays analysis tools
dchip
Expression Profiler
Expressionist
Freeview & FreeOView
Gene Cluster
GeneLinker Gold
GeneMaths
GeneSight
GeneSpring
Genesis
Genetraffic
J-Express
MAExplorer
Partek
R cluster
Rosetta Resolver
SAM
SpotFire Decision Site
SNOMAD
TIGR ArrayViewer
TIGR Multiple Experiment Viewer
TreeView
Xcluster
Xpression NTI
Turnkey system
Definition: A computer system that has been
customized for a particular application. The term
derives from the idea that the end user can just
turn a key and the system is ready to go.
For microarray, this includes everything from OS,
server software, database, client software,
statistics software and even hardware
Examples
– Genetraffic (Iobion)
• Uses open-source software: Linux, the R statistical
language, PostgreSQL, and the Apache web server
– Rosetta Resolver (Rosetta Biosoftware)
• Sun Fire server and drive array, Oracle 8i, Rosetta server and
client side software
Turnkey system
Advantages
– Performance
– Security
– Support multiple users
– Incorporate the experiment and data standards in design
Disadvantages
– Expensive
– Not suitable for small labs
– Require dedicated supporting staff
– Closed system
Comprehensive software
Definition: Software that incorporates many
different analyses for different stages in a
single package.
Examples
– Cluster (Mike Eisen, LBNL)
– GeneMaths (Applied Maths)
– GeneSight (Biodiscovery)
– GeneSpring (Silicon Genetics)
Comprehensive software
Cluster
– Filter data
– Adjust data: normalization, log
transform, etc.
– Clustering
– Self-Organizing Maps
(SOMs)
– Principal Component
Analysis (PCA)
GeneSpring
– & Promoter analysis
– Gene annotation with
public database
information
– Scripting tools
– Access Open DataBase
Connectivity (ODBC)
databases
Comprehensive software
GeneMaths
– & Bootstrap analysis
for clustering
– Fast clustering
algorithms
– Access Open DataBase
Connectivity (ODBC)
database
GeneSight
– & confidence analysis
for replicated data
– statistical analysis for
significant genes
– Graphical data set
builder
Comprehensive software
Advantages
– Standardized operation
– Generate various analysis easily
– Shorter learning curve for biologist
– Script language for automated process control
– Some brilliant ideas or analysis within
particular software
– “False” Sense of security?
Comprehensive software
Disadvantages
– Slow to incorporate the latest analysis developments
– Generate various analysis too easily
– Implicit data analysis/ statistics background and
definitions
– Proprietary script language
– Data compatibility with other software
– Necessity to design and maintain your own database
– Commercial software can be expensive!
– Analyses added for marketing purposes mean
extra spending on unnecessary functions
– Sometimes only available on a few computing platforms
Specific analysis software
Definition: Software that performs one or a few
specific analyses
Examples
– GeneCluster (Whitehead Institute Center
for Genome Research)
– INCLUSive - INtegrated CLustering, Upstream
Sequence retrieval and motif Sampler
(Katholieke Universiteit Leuven)
– SAM – Significance Analysis of Microarrays
(Stanford University)
Specific analysis software
GeneCluster – performs normalization,
filtering and SOM
Specific analysis software
INCLUSive - INtegrated CLustering,
Upstream Sequence retrieval and motif
Sampler
SAM – finds statistically significant
differentially expressed genes
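As a rough illustration of the kind of statistic behind such a tool, the Python sketch below computes a moderated t-like "relative difference" for one gene: the difference of group means divided by a pooled standard error plus a small constant s0. This is a sketch from the general idea only. The function name and the user-supplied s0 are assumptions, and the permutation step SAM uses to estimate a false discovery rate is omitted.

```python
import math
from statistics import mean


def relative_difference(group_a, group_b, s0):
    """(mean_a - mean_b) / (pooled standard error + s0) for one gene.

    s0 is a small positive constant that damps genes with tiny variance;
    here it is simply passed in by the caller (an assumption of this sketch).
    Requires at least 3 measurements in total.
    """
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = mean(group_a), mean(group_b)
    sum_sq = (sum((x - mean_a) ** 2 for x in group_a)
              + sum((x - mean_b) ** 2 for x in group_b))
    pooled_se = math.sqrt((1 / n_a + 1 / n_b) * sum_sq / (n_a + n_b - 2))
    return (mean_a - mean_b) / (pooled_se + s0)
```

With s0 = 0 this reduces to the ordinary two-sample t statistic with pooled variance. A larger s0 pulls down the scores of genes whose replicates happen to agree very closely.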
Specific analysis software
Advantages
– Better statistical background reference, usually
with literature support
Disadvantages
– Non-standardized environment – Java, web,
Excel, etc.
– Data compatibility problem
– Data preprocessing problem
Extension/ accessory of other
software
Definition: extension of other software’s
capability
Examples:
– Freeview: Visualization and Optimization of
Gene Clustering Dendrograms for Cluster
– ArrayMiner: extension of GeneSpring
Statistics software
Excel
MATLAB
Octave
SAS
SPSS
S-PLUS
Statistica
R
Statistics software
Advantages
– Highly flexible
– High level, multivariate analyses are either
standard or easily programmable
Disadvantages
– Usually command line driven, impossible to
learn intuitively (a disadvantage??)
– Require a much better understanding of
statistical data analysis to follow the steps (a
disadvantage??)
R-packages
A language and environment for statistical
computing and graphics.
Highly compatible with S/ S-PLUS
Open source under GNU General Public
Licence
Runs on many UNIX, Linux, Windows
and MacOS platforms
A growing number of microarray
analysis packages are written in R
R-packages
Dedicated to
microarray analysis
– affy
– Bioconductor
– SMA extension
– Cyber T
– GeneSOM
– Permax
– OOMAL (S-Plus)
– SMA
– YASMA
General packages
– cclust
– cluster
– mclust
– multiv
– mva
– …etc!
R-packages
SMA - Statistical Microarray Analysis
(Terry Speed, UC Berkeley)
Bioconductor
R-packages
SMA
– Performs intensity- and spatially dependent
normalization
– Analyzes replicated array data with an empirical
Bayes approach
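To show what intensity-dependent normalization operates on, the Python sketch below computes the usual M (log ratio) and A (average log intensity) values for a spot, then removes an intensity-dependent trend in M. This does not use SMA's actual method: the trend is estimated here by a crude median within bins of A, chosen only to keep the example short, and the function names are hypothetical.

```python
import math
from statistics import median


def ma_values(red, green):
    """M = log2(red/green); A = average of log2(red) and log2(green)."""
    m = math.log2(red / green)
    a = 0.5 * math.log2(red * green)
    return m, a


def binned_median_normalize(pairs, n_bins):
    """pairs is a list of (M, A). Sort spots by A, split them into bins of
    roughly equal size, and subtract each bin's median M from its members.
    A stand-in for a smooth curve fit, not the method used by SMA."""
    if not pairs:
        return []
    order = sorted(range(len(pairs)), key=lambda i: pairs[i][1])
    bin_size = math.ceil(len(pairs) / n_bins)
    normalized = [0.0] * len(pairs)
    for start in range(0, len(order), bin_size):
        members = order[start:start + bin_size]
        center = median(pairs[i][0] for i in members)
        for i in members:
            normalized[i] = pairs[i][0] - center
    return normalized
```

Spatially dependent normalization follows the same idea, with spots grouped by their position on the array (for example by print-tip block) rather than by intensity.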
R-packages
Result for replicated data: B vs M plot
R-packages
Bioconductor
– open source software project to provide infrastructure
in terms of design and software to assist biologists and
statisticians for analysing genomic data, with primary
emphasis on inference using DNA microarrays
– Most software produced by the Bioconductor project
will be in the form of R libraries
• Variation 1: provide basic infrastructure support that will help
other developers produce high quality software
• Variation 2: provide innovative methodology for analyzing
genomic data
– Provide some form of graphical user interface
for selected libraries
– A mechanism for linking together different
groups with common goals
Future Data mining software
Standardized, open-source (free) platform?
– EMBOSS - European Molecular Biology Open
Software Suite.
More supervised analysis and
pathway prediction packages?
Plugin modules
– J-Express
– GeneSpring
Mutation analysis software
Chip based SNP or chromosomal aberration
analysis (arrayCGH)
Various forms of protocols, e.g. primer
extension, ligase chain reaction, MALDI-TOF MS, hybridization, etc.
Result is in the form of base calling or
allelic imbalance
Example – Genorama
Database
Definition: large collection of data organized
especially for rapid search and retrieval
Two categories
– Within laboratory/ institute database; LIMS
– Public expression database
Standardized definition of data
– Minimum Information About a Microarray Experiment
(MIAME)
• Experimental design
• Array design
• Samples
• Hybridizations
• Measurements
• Normalization controls
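One way a lab database could enforce this standard is to store the six MIAME sections as fields of a record and check that none is empty before data are submitted. The Python sketch below is a hypothetical illustration: the class name, field names, and free-text representation are assumptions, and the real MIAME specification asks for far more structured detail in each section.

```python
from dataclasses import dataclass, fields


@dataclass
class MiameRecord:
    """One experiment's annotation, one free-text field per MIAME section (simplified)."""
    experimental_design: str = ""
    array_design: str = ""
    samples: str = ""
    hybridizations: str = ""
    measurements: str = ""
    normalization_controls: str = ""


def missing_sections(record):
    """Return the names of sections that are still empty, in declaration order."""
    return [f.name for f in fields(record) if not getattr(record, f.name).strip()]


# Hypothetical, partially annotated experiment.
record = MiameRecord(experimental_design="time course", samples="retina RNA")
todo = missing_sections(record)
```

Running such a check at data entry, rather than at publication time, is one way in-house data management can protect the quality of what later reaches a public repository.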
Database/ LIMS software
The database within your lab/ institute
The quality of in house data management
will affect the quality of final public data
repository
Database structure may be relatively
simple
Major Database/ LIMS software
AMAD
ARGUS
ArrayDB
ArrayInformatics
Clonetracker
GeNet
Genetraffic
GeneX
MAD
Maxd
NOMAD
Partisan Array LIMS
Phoretix Array2 Database
Rosetta Resolver
SMD
Public Expression Database
Necessities
– Provide raw data to validate published array
result and develop new analysis tools
– Further understanding of your data
– Compare among different groups, meta-data
mining
– Source for specialty array design
Different categories
– Generic
– Species specific
– Disease specific
The importance of data standardization
Major public gene expression
databases
3D-GeneExpression Database
ArrayExpress
BodyMap
ChipDB
ExpressDB
Gene Expression Omnibus (GEO)
Gene Expression Database (GXD)
Gene Resource Locator
GeneX
Human Gene Expression Index (HuGE Index)
RIKEN cDNA Expression Array Database (READ)
RNA Abundance Database (RAD)
Saccharomyces Genome Database (SGD)
Stanford Microarray Database (SMD)
TissueInfo
yeast Microarray Global Viewer (yMGV)
Primer/ probe design
Array designer
GAP (Genome-wide Automated Primer
finder servers)
OligoArray
Primer3
ProbeWiz Server
Other useful software for further
data mining
Data annotation
– DRAGON
– Gene Ontology
– PubGene
– Resourcerer
Promoter analysis
– AlignACE
– INCLUSive
– MEME
– Sequence Logo
Pathway reconstruction
– GenMAPP
– PathFinder
Data annotation
– Link GI to a particular name
– Literature mining to infer network
Network reconstruction
– Cluster + promoter analysis
– statistical inference from experimental data
Some suggestions for biologists who
are serious about microarray studies
Communicate or even collaborate with
Statisticians, Mathematicians and
bioinformaticians
Learn a high level statistical language, e.g. R
Learn programming, e.g. C
Learn databases, e.g. SQL
Learn Linux
Revise your statistics, probability and maybe even
calculus
Lucky…?!
Picture omitted for copyright reasons
Conclusion – the future
A unified open environment for standard
analysis and development
The best microarray analysis software?
~ Exploratory data analysis can never be the
whole story, but nothing else can serve as
the foundation stone -- as the first step. ~
John W. Tukey