
Figure omitted for copyright reasons
A printed version can be found in:
Leung YF, Lam DSC, Pang CP. The miracle of microarray data analysis.
Genome Biol. 2001 Aug 29; 2: 4021.1-4021.2.
“I think you should be more explicit here in step two”
~ Normal science consists largely of mopping-up operations. Experimentalists carry out
modified versions of experiments that have
been carried out many times before ~
Thomas S. Kuhn
The biologist's FAQ:
What is the best microarray
analysis software?
Different kinds of microarray
software
 Image analysis software
 Data mining software
– Statistics software
• R packages for microarray analysis
 SNP analysis software
 Database/ LIMS software
 Public expression database
 Primer design
 Software for further data mining: annotation,
promoter analysis & pathway reconstruction
Software not discussed today
 Hardware control software
– Arrayer controlling – ArrayMaker
– Scanner controlling/ Image acquisition
Statistics on current microarray
software
Category               28 Feb 2002    Jan 2001
Image analysis         17             17
Data mining            39
R packages             14
SNP analysis           1
Database/ LIMS         14             4
Public Database        16             8
Accessory              8              -
Further data mining    9              -
Total                  116            29
* Extracted from http://ihome.cuhk.edu.hk/~b400559/arraysoft.html
Image analysis software
 Spot recognition
 Segmentation
– Foreground calculation
– Background calculation
 Spot quality measures
Major image analysis software
 AIDA array
 ArrayPro
 ArrayVision
 Dapple
 F-scan
 GenePix Pro 3.0.5
 ImaGene 4.0
 Iconoclust
 Iplab
 Lucidea Automated Spotfinder
 Phoretix Array3
 P-scan
 QuantArray 3.0
 ScanAlyze 2
 Spot
 TIGR Spotfinder
 UCSF Spot
Examples of commonly used image
analysis software
 ScanAlyze 2 (Mike Eisen, LBNL)
 GenePix Pro 3.0.5 (Axon Instruments)
 QuantArray 3.0 (Packard Instrument)
 ImaGene 4.0 (Biodiscovery)
Spot recognition
 ArrayPro from Media Cybernetics
 Automated, fast grid, subgrid and spot-finding
algorithms
Segmentation
 Purpose – classification between foreground
and background
– Fixed circle
– Adaptive circle
– Adaptive shape
– Histogram method
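The fixed circle method can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the spot patch is a small list of lists, the spot centre and radius are already known, and the names `fixed_circle_segment`, `cx`, `cy` and `radius` are hypothetical rather than taken from any package listed here.

```python
from statistics import mean, median

def fixed_circle_segment(image, cx, cy, radius):
    """Split the pixels of one spot patch into foreground and background.

    Pixels whose centre lies within `radius` of (cx, cy) count as
    foreground; every other pixel in the patch counts as background.
    """
    foreground, background = [], []
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                foreground.append(value)
            else:
                background.append(value)
    return foreground, background

# Synthetic 5x5 patch: a bright spot (about 200) on a dim background (about 20).
patch = [
    [20, 21, 19, 20, 22],
    [20, 190, 200, 195, 21],
    [19, 205, 210, 200, 20],
    [21, 198, 202, 196, 19],
    [20, 22, 20, 21, 20],
]
fg, bg = fixed_circle_segment(patch, cx=2, cy=2, radius=1.5)
signal = mean(fg) - median(bg)  # background-corrected spot intensity
```

The adaptive methods differ mainly in how the foreground mask is chosen; the foreground and background calculation that follows stays the same.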
Segmentation
 Using an extra dye (DAPI) avoids the morphology
assumption
UCSF Spot
Spot quality measure
 E.g. QuantArray 3.0
– Diameter
– Spot Area
– Footprint
– Circularity
– Spot Signal/Noise
– Spot Uniformity
– Background Uniformity
– Replicate Uniformity
 Problem: spot quality lacks a rigorous definition
and experimental verification
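Two of the measures above can be illustrated with commonly used textbook definitions. These are assumptions for the sake of example, not QuantArray's own formulas, which the slides do not give: circularity is taken as 4*pi*area/perimeter^2, and signal/noise as the background-corrected mean divided by the background standard deviation.

```python
import math
from statistics import mean, stdev

def circularity(area, perimeter):
    """Return 4*pi*area / perimeter**2: 1.0 for a perfect circle, lower otherwise."""
    return 4 * math.pi * area / perimeter ** 2

def signal_to_noise(foreground, background):
    """Background-corrected mean signal divided by the background standard deviation."""
    return (mean(foreground) - mean(background)) / stdev(background)

# A circle of radius 5 versus a 10 x 10 square.
round_spot = circularity(area=math.pi * 5 ** 2, perimeter=2 * math.pi * 5)
square_spot = circularity(area=100, perimeter=40)

# Hypothetical pixel intensities for one spot and its local background.
snr = signal_to_noise([200, 210, 190, 205], [20, 22, 18, 21, 19])
```

Different packages may define the same measure differently, which is part of the problem noted above.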
Future image analysis software
 Rigorous definition of quality measures
 Extra dye for better segmentation
 Automated analysis
Data mining software
 Main purposes
1. Filtering and normalization
2. Statistical inference of differentially
expressed genes
3. Identification of biologically meaningful
patterns, i.e. expression profile; expression
fingerprint/ signature
4. Visualization
5. Other analyses, such as pathway reconstruction
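As a rough illustration of purpose 1, the sketch below filters out weak spots, log-transforms the two-channel ratios and applies a global median normalization. The intensity threshold of 50 and the gene names are arbitrary assumptions, and the function names are hypothetical.

```python
import math
from statistics import median

def log_ratios(red, green, min_intensity=50):
    """Return {gene: log2(red/green)} for genes passing an intensity filter."""
    ratios = {}
    for gene in red:
        r, g = red[gene], green[gene]
        if r >= min_intensity and g >= min_intensity:  # filtering step
            ratios[gene] = math.log2(r / g)
    return ratios

def median_center(ratios):
    """Global normalization: subtract the median log ratio from every gene."""
    shift = median(ratios.values())
    return {gene: value - shift for gene, value in ratios.items()}

red = {"geneA": 800, "geneB": 400, "geneC": 200, "geneD": 30}
green = {"geneA": 200, "geneB": 200, "geneC": 200, "geneD": 400}

# geneD is dropped by the filter; the rest are centred on a median of zero.
normalized = median_center(log_ratios(red, green))
```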
Different categories
 Turnkey system
 Comprehensive software
 Specific analysis software
 Extension/ accessory of other software
Major data mining software
 AIDA Array
 AMADA
 ANOVA program for microarray data
 ArrayMiner
 arraySCOUT
 ArrayStat
 BRB ArrayTools
 CHIPSpace
 Cleaver
 CIT
 CLUSFAVOR
 Cluster
 Cyber T
 DNA-arrays analysis tools
 dchip
 Expression Profiler
 Expressionist
 Freeview & FreeOView
 Gene Cluster
 GeneLinker Gold
 GeneMaths
 GeneSight
 GeneSpring
 Genesis
 Genetraffic
 J-Express
 MAExplorer
 Partek
 R cluster
 Rosetta Resolver
 SAM
 SpotFire Decision Site
 SNOMAD
 TIGR ArrayViewer
 TIGR Multiple Experiment Viewer
 TreeView
 Xcluster
 Xpression NTI
Turnkey system
 Definition: A computer system that has been
customized for a particular application. The term
derives from the idea that the end user can just
turn a key and the system is ready to go.
 For microarray, this includes everything from OS,
server software, database, client software,
statistics software and even hardware
 Examples
– Genetraffic (Iobion)
• Uses open-source software: Linux, the R statistical
language, PostgreSQL and the Apache web server
– Rosetta Resolver (Rosetta Biosoftware)
• Sun Fire server and drive array, Oracle 8i, Rosetta server and
client side software
Turnkey system
 Advantages
– Performance
– Security
– Support multiple users
– Incorporate the experiment and data standards in design
 Disadvantages
– Expensive
– Not suitable for small labs
– Require dedicated supporting staff
– Closed system
Comprehensive software
 Definition: Software that incorporates many
different analyses for different stages in a
single package.
 Examples
– Cluster (Mike Eisen, LBNL)
– GeneMaths (Applied Maths)
– GeneSight (Biodiscovery)
– GeneSpring (Silicon Genetics)
Comprehensive software
 Cluster
– Filter data
– Adjust data: normalization, log
transform, etc.
– Clustering
– Self-Organizing Maps
(SOMs)
– Principal Component
Analysis (PCA)
 GeneSpring
– & Promoter analysis
– Gene annotation with
public database
information
– Scripting tools
– Access Open DataBase
Connectivity (ODBC)
databases
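The clustering step offered by comprehensive packages can be illustrated with a small average-linkage hierarchical clustering that uses 1 minus the Pearson correlation as the distance. This is a minimal sketch with hypothetical names and made-up profiles, not the implementation used by Cluster or any other package named here.

```python
from itertools import combinations
from statistics import mean

def pearson_distance(a, b):
    """Return 1 - Pearson correlation between two expression profiles."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return 1 - cov / (var_a * var_b) ** 0.5

def average_linkage(profiles):
    """Agglomerative clustering; returns the merges as (cluster, cluster, distance)."""
    def linkage(pair):
        # Average pairwise distance between the members of two clusters.
        left, right = pair
        return mean(pearson_distance(profiles[i], profiles[j])
                    for i in left for j in right)

    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        left, right = min(combinations(clusters, 2), key=linkage)
        merges.append((left, right, linkage((left, right))))
        clusters = [c for c in clusters if c not in (left, right)] + [left + right]
    return merges

# Two rising and two falling profiles across four arrays (made-up values).
profiles = {
    "gene1": [1.0, 2.0, 3.0, 4.0],
    "gene2": [2.0, 4.1, 6.0, 8.2],
    "gene3": [4.0, 3.0, 2.0, 1.0],
    "gene4": [8.0, 6.2, 3.9, 2.0],
}
merges = average_linkage(profiles)
```

The list of merges is what a dendrogram viewer would draw; the rising pair and the falling pair join first, and the two groups join last at a distance close to 2.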
Comprehensive software
 GeneMaths
– & Bootstrap analysis
for clustering
– Fast clustering
algorithms
– Access Open DataBase
Connectivity (ODBC)
database
 GeneSight
– & confidence analysis
for replicated data
– statistical analysis for
significant genes
– Graphical data set
builder
Comprehensive software
 Advantages
– Standardized operation
– Generates various analyses easily
– Shorter learning curve for biologist
– Script language for automated process control
– Some brilliant ideas or analysis within
particular software
– “False” Sense of security?
Comprehensive software
 Disadvantages
– Inflexible to the latest analysis developments
– Generates various analyses too easily
– Implicit data analysis/ statistics background and
definitions
– Proprietary script language
– Data compatibility with other software
– Necessity to design and maintain your own database
– Commercial software can be expensive!
– Particular analyses added for marketing
purposes, extra spending on unnecessary functions
– Sometimes only available on a few computing platforms
Specific analysis software
 Definition: Software performing a few/ one
specific analysis
 Examples
– GeneCluster (Whitehead Institute Centre
for Genome Research)
– INCLUSive - INtegrated CLustering, Upstream
Sequence retrieval and motif Sampler
(Katholieke Universiteit Leuven)
– SAM – Significance Analysis of Microarrays
(Stanford University)
Specific analysis software
 GeneCluster – performs normalization,
filtering and SOM
Specific analysis software
 INCLUSive - INtegrated CLustering,
Upstream Sequence retrieval and motif
Sampler
 SAM – finds statistically significant
differentially expressed genes
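The statistic at the heart of SAM is a t-like "relative difference" with a small constant added to the denominator. The sketch below computes it for one gene in two groups. It assumes a fixed, illustrative `s0` rather than one estimated from the data, and it leaves out the permutation step that SAM uses to estimate the false discovery rate; the function name and the data are hypothetical.

```python
from statistics import mean

def relative_difference(group_a, group_b, s0=0.1):
    """SAM-style d statistic for one gene measured in two groups.

    d = (mean_a - mean_b) / (s + s0), where s is the pooled standard error
    and s0 is a small constant that damps genes with tiny variance.
    Here s0 is a fixed illustrative value, not a fitted one.
    """
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = mean(group_a), mean(group_b)
    ss = (sum((x - mean_a) ** 2 for x in group_a)
          + sum((x - mean_b) ** 2 for x in group_b))
    s = ((1 / n_a + 1 / n_b) * ss / (n_a + n_b - 2)) ** 0.5
    return (mean_a - mean_b) / (s + s0)

# Log ratios for two hypothetical genes in treated versus control arrays.
changed = relative_difference([2.1, 1.9, 2.0, 2.2], [0.1, -0.1, 0.0, 0.2])
unchanged = relative_difference([0.1, -0.2, 0.0, 0.1], [0.0, 0.1, -0.1, 0.1])
```

Genes are then ranked by d, with large absolute values suggesting differential expression.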
Specific analysis software
 Advantages
– Better statistical background reference, usually
with literature support
 Disadvantages
– Non-standardized environment – Java, web,
Excel, etc.
– Data compatibility problem
– Data preprocessing problem
Extension/ accessory of other
software
 Definition: extension of other software’s
capability
 Examples:
– Freeview: Visualization and Optimization of
Gene Clustering Dendrograms for Cluster
– ArrayMiner: extension of GeneSpring
Statistics software
 Excel
 MATLAB
 Octave
 SAS
 SPSS
 S-PLUS
 Statistica
 R
Statistics software
 Advantages
– Highly flexible
– High level, multivariate analyses are either
standard or easily programmable
 Disadvantages
– Usually command line driven, impossible to
learn intuitively (a disadvantage??)
– Require a much better understanding of the
statistical data analysis to follow the steps (a
disadvantage??)
R-packages
 A language and environment for statistical
computing and graphics.
 Highly compatible with S/ S-PLUS
 Open source under the GNU General Public
Licence
 Runs on many UNIX/ Linux platforms, the Windows
family and MacOS
 A growing number of microarray analysis
software packages are written in R
R-packages
 Dedicated for
microarray analysis
– affy
– Bioconductor
– SMA extension
– Cyber T
– GeneSOM
– Permax
– OOMAL (S-Plus)
– SMA
– YASMA
 General packages
– cclust
– cluster
– mclust
– multiv
– mva
– …etc!
R-packages
 SMA - Statistical Microarray Analysis
(Terry Speed, UC Berkeley)
 Bioconductor
R-packages
 SMA
– Performs intensity- and spatially dependent
normalization
– Analyses replicated array data by an empirical
Bayes approach
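Intensity-dependent normalization works on the log ratio M = log2(R/G) and the average log intensity A = 0.5 * log2(R*G) of each spot. The Python sketch below computes both and then removes the intensity trend with a binned median, which is a crude stand-in chosen for brevity and not the smoother that SMA itself uses. All names and intensities are hypothetical.

```python
import math
from statistics import median

def ma_values(red, green):
    """Return (M, A) per spot: M = log2(R/G), A = 0.5 * log2(R*G)."""
    return [(math.log2(r / g), 0.5 * math.log2(r * g)) for r, g in zip(red, green)]

def binned_median_normalize(ma, n_bins=2):
    """Subtract from each M the median M of spots with similar A.

    Spots are sorted by A and split into n_bins groups of roughly equal
    size; a smooth curve fit of M on A would replace this in practice.
    """
    order = sorted(range(len(ma)), key=lambda i: ma[i][1])
    size = math.ceil(len(ma) / n_bins)
    normalized = [0.0] * len(ma)
    for start in range(0, len(order), size):
        group = order[start:start + size]
        shift = median(ma[i][0] for i in group)
        for i in group:
            normalized[i] = ma[i][0] - shift
    return normalized

# Hypothetical intensities with a dye bias that grows with intensity.
red = [110, 120, 100, 2200, 1800, 2000]
green = [100, 100, 100, 1000, 1000, 1000]
ma = ma_values(red, green)
m_normalized = binned_median_normalize(ma)
```

After normalization the bright spots no longer show a systematic red excess, while differences between spots of similar intensity are preserved.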
R-packages
 Result of replicated data analysis, output as a B vs M plot
R-packages
 Bioconductor
– open source software project to provide infrastructure
in terms of design and software to assist biologists and
statisticians for analysing genomic data, with primary
emphasis on inference using DNA microarrays
– Most software produced by the Bioconductor project
will be in the form of R libraries
• Variation 1: provide basic infrastructure support that will help
other developers produce high quality software
• Variation 2: provide innovative methodology for analyzing
genomic data
– Provide some form of graphical user interface
for selected libraries
– A mechanism for linking together different
groups with common goals
Future Data mining software
 Standardized, open-source (free) platform?
– EMBOSS - European Molecular Biology Open
Software Suite.
 More supervised analysis packages and
pathway prediction packages?
 Plugin modules
– J-Express
– GeneSpring
Mutation analysis software
 Chip-based SNP or chromosomal aberration
analysis (arrayCGH)
 Various forms of protocols, e.g. primer
extension, ligase chain reaction, MALDI-TOF MS, hybridization, etc.
 Result is in the form of base calling or
allelic imbalance
 Example – Genorama
Database
 Definition: large collection of data organized
especially for rapid search and retrieval
 Two categories
– Within laboratory/ institute database; LIMS
– Public expression database
 Standardized definition of data
– Minimum Information About a Microarray Experiment
(MIAME)
• Experimental design
• Array design
• Samples
• Hybridizations
• Measurements
• Normalization controls
Database/ LIMS software
 The database within your lab/ institute
 The quality of in-house data management
will affect the quality of the final public data
repository
 Database structure may be relatively
simple
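To show how simple such a structure can be, here is a sketch of a two-table in-house database using SQLite from Python. The table and column names, the sample name and the values are all hypothetical; a real LIMS would track far more, for example the MIAME items listed earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
    CREATE TABLE hybridization (
        hyb_id   INTEGER PRIMARY KEY,
        sample   TEXT NOT NULL,
        hyb_date TEXT NOT NULL
    );
    CREATE TABLE measurement (
        hyb_id    INTEGER NOT NULL REFERENCES hybridization(hyb_id),
        gene      TEXT NOT NULL,
        log_ratio REAL NOT NULL
    );
""")
conn.execute("INSERT INTO hybridization VALUES (1, 'sample1', '2002-02-28')")
conn.executemany(
    "INSERT INTO measurement VALUES (?, ?, ?)",
    [(1, "geneA", 1.8), (1, "geneB", -0.2), (1, "geneC", -1.5)],
)

# Genes with at least a two-fold change (|log2 ratio| >= 1) in one sample.
rows = conn.execute(
    """
    SELECT m.gene, m.log_ratio
    FROM measurement AS m
    JOIN hybridization AS h ON h.hyb_id = m.hyb_id
    WHERE h.sample = ? AND ABS(m.log_ratio) >= 1
    ORDER BY m.gene
    """,
    ("sample1",),
).fetchall()
```

Keeping sample information and measurements in separate, linked tables is what later makes export to a public repository straightforward.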
Major Database/ LIMS software
 AMAD
 ARGUS
 ArrayDB
 ArrayInformatics
 Clonetracker
 GeNet
 Genetraffic
 GeneX
 MAD
 Maxd
 NOMAD
 Partisan Array LIMS
 Phoretix Array2 Database
 Rosetta Resolver
 SMD
Public Expression Database
 Necessities
– Provide raw data to validate published array
result and develop new analysis tools
– Further understanding of your data
– Compare among different groups, meta-data
mining
– Source for specialty array design
 Different categories
– Generic
– Species specific
– Disease specific
 The importance of data standardization
Major public gene expression
databases
 3D-GeneExpression Database
 ArrayExpress
 BodyMap
 ChipDB
 ExpressDB
 Gene Expression Omnibus (GEO)
 Gene Expression Database (GXD)
 Gene Resource Locator
 GeneX
 Human Gene Expression Index (HuGE Index)
 RIKEN cDNA Expression Array Database (READ)
 RNA Abundance Database (RAD)
 Saccharomyces Genome Database (SGD)
 Stanford Microarray Database (SMD)
 TissueInfo
 yeast Microarray Global Viewer (yMGV)
Primer/ probe design
 Array designer
 GAP (Genome-wide Automated Primer
finder servers)
 OligoArray
 Primer3
 ProbeWiz Server
Other useful software for further
data mining
 Data annotation
– DRAGON
– Gene Ontology
– PubGene
– Resourcerer
 Promoter analysis
– AlignACE
– INCLUSive
– MEME
– Sequence Logo
 Pathway reconstruction
– GenMAPP
– PathFinder
 Data annotation
– Link GI to a particular name
– Literature mining to infer network
 Network reconstruction
– Cluster + promoter analysis
– statistical inference from experimental data
Some suggestions for biologists who
are serious about microarray studies
 Communicate or even collaborate with
statisticians, mathematicians and
bioinformaticians
 Learn a high-level statistical language, e.g. R
 Learn programming, e.g. C
 Learn a database language, e.g. SQL
 Learn Linux
 Revise your statistics, probability and maybe even
calculus
 Lucky…?!
Picture omitted for copyright reasons
Conclusion – the future
 A unified open environment for standard
analysis and development
 The best microarray analysis software?
~ Exploratory data analysis can never be the
whole story, but nothing else can serve as
the foundation stone -- as the first step. ~
John W. Tukey