Poster - Predictive Models for Biomedicine & Environment

MGED 7
Interfacing predictive models with MIAME compliant databases
Cesare Furlanello, Maria Serafini, Silvano Paoli, Giuseppe Jurman
September 8-10, 2004
Toronto, ON, Canada
ITC-irst, Trento, Italy -- http://mpa.itc.it
INTRODUCTION
(b) EXAMPLE: Interfacing to sample-tracking profiles. We study
the influence of gene panel size on predictive classification error, on
a sample-by-sample basis. For each sample, errors are accumulated over
the replicated runs in which that sample is in the test set, and plotted
for increasing panel sizes. Specific sample-tracking profiles may then be
investigated to discover patterns (potential outliers, subtypes).
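The accumulation step described above can be sketched as follows. This is a minimal illustration, not the E-RFE implementation: the record layout `(sample_id, panel_size, is_error)`, the function name `sample_tracking_profiles`, and the toy records are all assumptions made for the example.

```python
from collections import defaultdict

def sample_tracking_profiles(records):
    """Build one error profile per sample.

    records: iterable of (sample_id, panel_size, is_error) tuples, one per
    replicated run in which the sample was in the test set (assumed layout).
    Returns {sample_id: [(panel_size, error_rate), ...]}, sorted by panel size.
    """
    counts = defaultdict(lambda: [0, 0])  # (sample, size) -> [errors, runs]
    for sample_id, panel_size, is_error in records:
        cell = counts[(sample_id, panel_size)]
        cell[0] += int(is_error)
        cell[1] += 1

    profiles = defaultdict(list)
    for (sample_id, panel_size), (errors, runs) in sorted(counts.items()):
        profiles[sample_id].append((panel_size, errors / runs))
    return dict(profiles)

# Toy records (invented for illustration): two samples, two panel sizes.
records = [
    ("s1", 10, True), ("s1", 10, False), ("s1", 100, False), ("s1", 100, False),
    ("s2", 10, True), ("s2", 10, True), ("s2", 100, True), ("s2", 100, False),
]
profiles = sample_tracking_profiles(records)
```

Each value in `profiles` is one sample-tracking curve: the fraction of test runs in which the sample was misclassified, as a function of gene panel size.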
GOAL: to provide novel types of interaction between
classification systems and MIAME-compliant databases
Zoom on plot
How can we automate the discovery of patterns and link the
investigation to the experimental, biological and clinical data
about the microarray?
We present a prototype module aimed at providing graphical interaction
between systems for gene-profiling and MIAME-compliant databases.
Interface: Profile Browser, Working Area, Query Tools
Browse through the samples, then select/remove the current curve from the working area
The prototype has been developed to support outlier analysis and semi-supervised class discovery in microarray data experiments.
The module is designed to integrate the newly developed PostgreSQL
port of the GUS/RAD platform [1,2] with a display built automatically
in Scalable Vector Graphics (SVG).
The display organizes the graphical outputs from a predictive
classification system, supporting query construction and retrieval of
MIAME annotation linked to automatically or manually selected curves.
(a) Gene profiling tasks require intensive computational
resources. Our E-RFE system for gene profiling [3] is
currently implemented on a high-throughput computing
facility, the MPA-HTC Linux Cluster.
Build query
Save in JPG format the selected (blue) curve or all those displayed in the working area
The discovery of outlier patterns and of potential subtypes,
and the analysis of gene importance, may be derived as a by-product of the computation (e.g. the one required by a complete
validation setup to avoid selection bias).
Query the Database for info on the selected (blue) sample, or for all those listed in the working area or displayed in the image:
QUESTIONS
1. Interact with the resources (Cluster+Algorithms)
for understanding and refining machine learning
results
2. Provide access to the gene profiling algorithms and
their outcomes through a web service
3. Connect to MIAME-compliant information to
support investigation and discovery
Automating discovery: DTW-based clustering
THE PROTOTYPE
This first version provides an interface to sample-tracking curves
(profiles of classification errors of single samples as a function of gene
panel sizes), as derived from the ERFE-SVM gene ranking system [3].
We automatically cluster these curves according to a Dynamic Time
Warping (DTW) metric [4], obtaining hypotheses on the potential
presence of outliers and of subtypes. The analysis is a by-product of the
ERFE-SVM complete cross-validation set-up, which is run on an OpenMosix
Linux cluster facility.
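As an illustration of DTW-based clustering of sample-tracking curves, the sketch below uses the classic dynamic-programming DTW with an absolute-difference local cost, followed by a simple average-linkage agglomerative clustering. Both choices, the function names, and the toy curves are assumptions for the example; the poster does not specify the DTW variant or linkage actually used.

```python
def dtw_distance(a, b):
    """Classic DTW distance between two numeric sequences (assumed variant)."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local cost
            cost[i][j] = d + min(cost[i - 1][j],      # stretch b
                                 cost[i][j - 1],      # stretch a
                                 cost[i - 1][j - 1])  # step both
    return cost[n][m]

def cluster_curves(curves, n_clusters):
    """Average-linkage agglomerative clustering on pairwise DTW distances.

    curves: {sample_id: [error_rate, ...]}. Returns a list of clusters,
    each a list of sample ids.
    """
    names = list(curves)
    dist = {(p, q): dtw_distance(curves[p], curves[q])
            for p in names for q in names}
    clusters = [[name] for name in names]

    def linkage(c1, c2):
        return sum(dist[(p, q)] for p in c1 for q in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Merge the closest pair of clusters.
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy curves (invented): s3 is misclassified at every panel size.
curves = {
    "s1": [0.5, 0.2, 0.0, 0.0],
    "s2": [0.6, 0.1, 0.0, 0.0],
    "s3": [1.0, 1.0, 0.9, 1.0],
    "s4": [0.4, 0.2, 0.1, 0.0],
}
clusters = cluster_curves(curves, n_clusters=2)
```

In this toy case the persistently misclassified curve ends up in its own cluster, which is the kind of hypothesis (a potential outlier) that the clustering is meant to raise for further inspection.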
Scalable Vector Graphics
SVG is a language for describing two-dimensional graphics and graphical
applications in XML.
Choose the cluster you are interested in and display the curves for the selected cluster
SVG 1.1 is a W3C Recommendation and
forms the core of the current SVG
developments.
Selection of sample-tracking curves is obtained from DTW-based clustering.
Curves from the selected cluster are added to the sample analysis area and are
ready for query.
Scripts based on the trellis (lattice) graphics library of the R computing
environment are interfaced to the classification system.
The SVG directives providing the interactive display are also directly
built by R, according to an adaptation of the RSVG driver package.
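To give an idea of the kind of SVG markup such a display relies on, here is a minimal sketch that turns one sample-tracking curve into an SVG 1.1 polyline whose `id` is the sample identifier. The prototype builds its SVG from R; this Python version, the function name `curve_to_svg`, and the layout parameters are assumptions used only for illustration.

```python
import xml.etree.ElementTree as ET

def curve_to_svg(sample_id, errors, width=300, height=150):
    """Render one error curve (values in [0, 1]) as an SVG document string."""
    svg = ET.Element("svg", {
        "xmlns": "http://www.w3.org/2000/svg",
        "version": "1.1",
        "width": str(width),
        "height": str(height),
    })
    # Spread the points evenly along x; SVG's y axis points downward,
    # so an error of 1.0 maps to the top of the plot.
    step = width / max(len(errors) - 1, 1)
    points = " ".join(
        f"{i * step:.1f},{height * (1 - e):.1f}" for i, e in enumerate(errors)
    )
    ET.SubElement(svg, "polyline", {
        "id": sample_id,       # lets a script map the curve back to a sample
        "points": points,
        "fill": "none",
        "stroke": "blue",
    })
    return ET.tostring(svg, encoding="unicode")

svg_text = curve_to_svg("s1", [0.5, 0.2, 0.0, 0.0])
```

Because every curve carries its sample identifier as an element `id`, a selection made in the display can be translated directly into the list of samples to query.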
REFERENCES
[1] Manduchi, E., Pizarro, A., Stoeckert, C. (2001). RAD (RNA Abundance Database): an
infrastructure for array data analysis. Proc. SPIE, vol 4266, pp. 68-78.
[2] Manduchi, E. et al. (2004). RAD and the RAD Study-Annotator: an approach to collection,
organization, and exchange of all relevant information for high-throughput gene expression
studies. Bioinformatics, 20(4):452-459.
[3] Furlanello, C., Serafini, M., Merler, S., and Jurman, G. (2003). Entropy-based gene ranking
without selection bias for the predictive classification of microarray data. BMC Bioinformatics,
4:54.
[4] Aach, J. and Church, G. M. (2001). Aligning gene expression time series with time warping
algorithms. Bioinformatics, 17(6):495-508.
[5] Furlanello, C., Merler, S., Jurman, G., and Serafini, M. Unsupervised Discovery from Gene
Tracking with RFE Classification Systems. ISMB/ECCB 2004.
FEATURES
The user may pick one or more curves from the display, or follow the
indications of unsupervised hierarchical clustering (from the standard
R clustering package), and construct specific queries.
In particular, given a potential outlier sample [5], the user may retrieve
information on the biomaterial, or on the experimental conditions.
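Query construction for a set of selected samples could look like the sketch below. Everything schema-related here is hypothetical: the table `sample_annotation`, its columns, and the toy rows are invented for illustration and do not reflect the GUS/RAD schema. An in-memory SQLite database stands in for PostgreSQL so the example is self-contained; a PostgreSQL driver would typically use a different placeholder style.

```python
import sqlite3

def build_sample_query(sample_ids,
                       columns=("sample_id", "biomaterial", "treatment")):
    """Return (sql, params) selecting annotation for the given samples.

    Table and column names are hypothetical. Sample ids are passed as
    bound parameters, never interpolated into the SQL text.
    """
    placeholders = ", ".join("?" for _ in sample_ids)
    sql = (
        f"SELECT {', '.join(columns)} FROM sample_annotation "
        f"WHERE sample_id IN ({placeholders}) ORDER BY sample_id"
    )
    return sql, list(sample_ids)

# Stand-in database with invented rows, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sample_annotation "
    "(sample_id TEXT, biomaterial TEXT, treatment TEXT)"
)
conn.executemany(
    "INSERT INTO sample_annotation VALUES (?, ?, ?)",
    [("s1", "tissue A", "control"),
     ("s3", "tissue A", "treated"),
     ("s4", "tissue B", "treated")],
)

# Samples picked in the working area, e.g. a potential outlier.
sql, params = build_sample_query(["s3"])
rows = conn.execute(sql, params).fetchall()
conn.close()
```

The same builder accepts a whole cluster of sample ids, which matches the option of querying either the selected curve or all curves listed in the working area.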
We plan to fit the new module within the RAD (RNA Abundance
Database) schema and to further support the interaction with the
classification setup.
The prototype is currently interfaced to a standalone PostgreSQL
database, and a few elementary features have been implemented to
relate the selected samples to any phenotype information
present in the dataset.
DATA
In this example, the prototype is connected to PostgreSQL data
tables.
Microarray data: mouse model of Myocardial Infarction from the
Cardiogenomics PGA - Genomics of Cardiovascular Development,
Adaptation, and Remodeling - NHLBI Program for Genomic
Applications, Harvard Medical School.
http://www.cardiogenomics.org
In its final version, GUS/RAD will become the prototype's natural
interface to the data.
The development of the PostgreSQL port of GUS is under way.
The MPBA group at ITC-irst is a member of the team involved in the
project.