InforSense Company Overview - The International Conference on

Bioinformatics workflow integration
Yike Guo/Jiancheng Lin
InforSense Ltd.
3 April 2016
Life Science Challenges



Information resides on different:
• Granularity levels (individual records vs. massive repositories)
• Abstraction levels (models ranging from entire systems to compound patterns)
• Domain levels (clinical, sequence, instrument…)
Researchers
• Grouped in Virtual Organizations (VOs)
• Working on the Grid
• Need to communicate across physical and scientific/cultural barriers
Tools
• Legacy, well-established in the process
• Novel, essential to innovation
• In need of a consistent infrastructure to connect the two groups
Discovery Informatics in Post-Genome Era
[Figure: data types in discovery informatics. Labels: sequences and alignments (shown as example nucleotide strings such as ATGCAAGTCCCT); secondary structure; tertiary structure; linkage maps; cytogenetic maps; physical maps; polymorphism; expression patterns; receptors; signals; pathways; physiology; patient records; epidemiology.]
Integrative Analytics Workflow Environment
[Figure: architecture diagram. Labels in the diagram:
Layers: Portal, Workflow, Warehouse
Users: Informatician; Data Analysis Group; End Users (of the deployed web app)
Data access: Files, Data, DB, Oracle, Web Services, Oracle Data Preprocess
Inbuilt Analytics Components
Analytics Applications: R, Matlab, WEKA, Oracle DM, S-Plus, SAS, KXEN
3rd Party & Custom Apps: BioTeam iNquiry, MDL, Spotfire, Daylight
Healthcare]
InforSense Workflow Life Cycle
• Constructing a ubiquitous workflow: by scientists
  • Integrate your information resources and software applications across domains
  • Support innovation and capture the best practice of your scientific research
• Warehousing workflows: for scientists
  • Manage discovery processes in your organisation
  • Construct an enterprise process knowledge bank
• Deploying workflows: to scientists
  • Turn your workflows into reusable applications
  • Turn every scientist into a solution builder
Workflow Creation, Integration, and Deployment
1 Select:
Data Sources
Data Mining / Statistics
Data Processing / Transformation
3rd Party applications (e.g. Haploview)
Interactive data visualization / reporting
2 Connect:
Connect data and components in GUI
Workflow describes complex
data processing and analysis
3 Execute:
“In database” processing & analytics
“Cluster / Grid” execution
4 Deploy:
Define parameters of workflow to expose
Publish as: portlet, web application,
SOAP service, command line app
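The four steps above can be illustrated with a minimal sketch. It is not the InforSense implementation: the `Workflow` class, its `add`, `expose` and `run` methods, and the component names are all hypothetical, chosen only to show components being selected, connected into a graph, executed, and deployed with a chosen set of exposed parameters. The sketch assumes the graph is acyclic.

```python
from typing import Any, Callable, Dict, List, Sequence


class Workflow:
    """Toy workflow: named components wired into a directed acyclic graph."""

    def __init__(self) -> None:
        self.funcs: Dict[str, Callable[..., Any]] = {}
        self.inputs: Dict[str, List[str]] = {}
        self.exposed: Dict[str, Any] = {}  # parameters published at deploy time

    def add(self, name: str, func: Callable[..., Any],
            inputs: Sequence[str] = ()) -> None:
        """Select a component and connect it to its upstream components."""
        self.funcs[name] = func
        self.inputs[name] = list(inputs)

    def expose(self, param: str, default: Any) -> None:
        """Mark a parameter as settable by users of the deployed workflow."""
        self.exposed[param] = default

    def run(self, target: str, **params: Any) -> Any:
        """Execute the graph up to `target`, accepting exposed parameters only."""
        unknown = set(params) - set(self.exposed)
        if unknown:
            raise KeyError(f"parameters not exposed: {sorted(unknown)}")
        settings = {**self.exposed, **params}
        cache: Dict[str, Any] = {}

        def evaluate(name: str) -> Any:
            # Each component runs once; its result is reused downstream.
            if name not in cache:
                upstream = [evaluate(dep) for dep in self.inputs[name]]
                cache[name] = self.funcs[name](*upstream, settings)
            return cache[name]

        return evaluate(target)


# 1 Select and 2 Connect: a data source, a filter, and a report.
wf = Workflow()
wf.add("source", lambda settings: [3.0, 8.0, 1.5, 9.2])
wf.add("filter",
       lambda rows, settings: [r for r in rows if r >= settings["threshold"]],
       inputs=["source"])
wf.add("report",
       lambda rows, settings: {"count": len(rows), "max": max(rows)},
       inputs=["filter"])

# 4 Deploy: only `threshold` is visible to end users.
wf.expose("threshold", 5.0)

# 3 Execute: once with the default, once with a user-supplied value.
default_result = wf.run("report")
strict_result = wf.run("report", threshold=9.0)
```

Keeping the exposed parameters separate from the graph is what lets the same workflow be published unchanged as a portlet, web application, service, or command line tool.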
Biology to Chemistry

Novel sequences are compared to known protein structures

The resulting set of ligands on these matching structures is used to search
small molecule databases for similar compounds

Compounds are then analyzed using KDE tools such as PCA and clustering
to provide a diverse, representative subset for further assays
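As an illustration of the last step, the sketch below picks a diverse subset from compound descriptors. It does not reproduce the KDE PCA and clustering nodes; it swaps in greedy max-min selection, a simple and commonly used diversity heuristic. The descriptor values are made up, and the function name `maxmin_subset` is hypothetical.

```python
import math
from typing import List, Sequence


def maxmin_subset(points: Sequence[Sequence[float]], k: int) -> List[int]:
    """Greedy max-min picking.

    Start from the first point, then repeatedly add the point whose
    distance to its nearest already-chosen point is largest.
    Returns the indices of the chosen points in selection order.
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    chosen = [0]
    # Distance from every point to its nearest chosen point so far.
    nearest = [math.dist(p, points[0]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=nearest.__getitem__)
        chosen.append(nxt)
        nearest = [min(d, math.dist(p, points[nxt]))
                   for d, p in zip(nearest, points)]
    return chosen


# Made-up 2-D descriptors (for example, two principal components per compound).
descriptors = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
picked = maxmin_subset(descriptors, 3)
```

With these values the near-duplicate compounds at indices 1 and 3 are skipped, so one compound from each region of descriptor space goes forward.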
Navigating KEGG pathways

Gene names from EMBL are used to query KEGG via its
Web Service API for appropriate pathways

Further Web Service API calls allow navigation of the data to find:

Pathway compounds

Other genes in the pathways

Visualization of query genes on their pathways
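The navigation step can be sketched as follows. The slides do not say which calls were used, so this example makes no network request. It assumes the service returns tab-separated gene and pathway pairs, one per line, which is the layout of the present-day KEGG REST `link` operation. The identifiers in `SAMPLE` are invented placeholders, and both function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple


def parse_links(text: str) -> Tuple[Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Build gene -> pathways and pathway -> genes maps from link text."""
    gene_to_paths: Dict[str, Set[str]] = defaultdict(set)
    path_to_genes: Dict[str, Set[str]] = defaultdict(set)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        gene, pathway = line.split("\t")
        gene_to_paths[gene].add(pathway)
        path_to_genes[pathway].add(gene)
    return gene_to_paths, path_to_genes


def pathway_neighbours(query_gene: str,
                       gene_to_paths: Dict[str, Set[str]],
                       path_to_genes: Dict[str, Set[str]]) -> Set[str]:
    """Other genes that share at least one pathway with the query gene."""
    neighbours: Set[str] = set()
    for pathway in gene_to_paths.get(query_gene, ()):
        neighbours |= path_to_genes[pathway]
    neighbours.discard(query_gene)
    return neighbours


# Placeholder response text; these are not real KEGG identifiers.
SAMPLE = (
    "geneA\tpath:p1\n"
    "geneB\tpath:p1\n"
    "geneB\tpath:p2\n"
    "geneC\tpath:p2\n"
)

g2p, p2g = parse_links(SAMPLE)
shared_with_b = pathway_neighbours("geneB", g2p, p2g)
```

Pathway compounds could be handled the same way with a second pair of maps from pathway to compound.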
cDNA sequence annotation and alignment

A novel cDNA is annotated using EMBOSS tools, and a
BLAST similarity search is performed against human
proteins

Annotations used to aid identification of predicted
proteins derived from the cDNA
Ortholog analysis using BLAST

Sequence libraries from two organisms are cross-compared
using BLAST to determine the best bi-directional
matches of sufficient quality
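The best bi-directional match step can be sketched as below. The sketch assumes both searches were saved in BLAST's 12-column tabular format, with the E-value in column 11 and the bit score in column 12. The E-value cutoff of 1e-5 is an arbitrary stand-in for "sufficient quality", and the sequence names and scores are invented.

```python
from typing import Dict, List, Tuple


def best_hits(blast_tab: str, max_evalue: float = 1e-5) -> Dict[str, str]:
    """Map each query to its highest-scoring subject that passes the cutoff."""
    best: Dict[str, Tuple[str, float]] = {}
    for line in blast_tab.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        query, subject = cols[0], cols[1]
        evalue, bitscore = float(cols[10]), float(cols[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b: str, b_vs_a: str,
                         max_evalue: float = 1e-5) -> List[Tuple[str, str]]:
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(a_vs_b, max_evalue)
    reverse = best_hits(b_vs_a, max_evalue)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


def row(query: str, subject: str, evalue: float, bitscore: float) -> str:
    """Build one tabular line; the alignment columns are dummy values."""
    fields = [query, subject, "90.0", "100", "10", "0",
              "1", "100", "1", "100", str(evalue), str(bitscore)]
    return "\t".join(fields)


# Invented results for organism A searched against B, and B against A.
A_VS_B = "\n".join([
    row("a1", "b1", 1e-50, 200.0),
    row("a1", "b2", 1e-8, 50.0),
    row("a2", "b2", 1e-45, 180.0),
    row("a3", "b1", 1e-20, 90.0),
])
B_VS_A = "\n".join([
    row("b1", "a1", 1e-52, 210.0),
    row("b2", "a2", 1e-40, 170.0),
    row("b2", "a1", 1e-9, 60.0),
])

orthologs = reciprocal_best_hits(A_VS_B, B_VS_A)
```

Here `a3` finds `b1` as its best hit, but `b1` prefers `a1`, so the pair is not reported.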
Clustering of Affymetrix data with R

Native Affymetrix CEL files are loaded using R/Bioconductor

Differentially expressed genes calculated using KDE statistical nodes

The resulting list of genes is then clustered using hclust in R
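To show what the clustering step does, here is a small pure-Python sketch of agglomerative clustering with complete linkage and Euclidean distance. It is a stand-in for illustration, not the R code used in the workflow: R's `hclust` returns a full dendrogram, whereas this version stops at a requested number of clusters. The gene names and expression values are invented.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence


def complete_linkage(profiles: Dict[str, Sequence[float]],
                     n_clusters: int) -> List[List[str]]:
    """Merge the two closest clusters until n_clusters remain.

    The distance between two clusters is the largest Euclidean distance
    between any pair of members (complete linkage).
    """
    if not 1 <= n_clusters <= len(profiles):
        raise ValueError("n_clusters must be between 1 and the number of genes")
    clusters: List[List[str]] = [[name] for name in profiles]

    def linkage(c1: List[str], c2: List[str]) -> float:
        return max(math.dist(profiles[a], profiles[b]) for a in c1 for b in c2)

    while len(clusters) > n_clusters:
        # combinations yields i < j, so deleting j leaves index i valid.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(cluster) for cluster in clusters]


# Invented log-ratio profiles for four differentially expressed genes.
profiles = {
    "g1": (1.0, 1.1, 0.9),
    "g2": (1.1, 1.0, 1.0),
    "g3": (-2.0, -2.1, -1.9),
    "g4": (-2.1, -1.9, -2.0),
}
groups = complete_linkage(profiles, 2)
```

The two up-regulated genes end up in one group and the two down-regulated genes in the other.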
Microarray analysis using text mining



Microarray data normalized in KDE
Upregulated genes annotated from PubMed to obtain a set of related
scientific papers
Text mining used to mine the paper collection and extract information most
relevant to the researcher
BAIR project
Biological Atlas of Insulin Resistance
[Figure: study timeline. Normal Diet and Fat Fed arms, 6 to 10 animals, followed over time to an endpoint (culling or death).]
Recorded across the study:
•Genetic data
•Mouse ID
•Cage ID
•Environmental conditions
•Management records
•Sampling conditions
•Sample storage conditions
•Ref of biological assays used across the study
Physiological data prior to change in diet:
•Weight
•Blood analysis
•Urine analysis
Physiological data after change in diet (one time point in end-point experiment, several time points in longitudinal study):
•Weight
•Blood analysis (physiological parameters, metabonomics)
•Urine analysis (physiological parameters, metabonomics)
Endpoint (culling or death):
•Tissue sampling (liver, fat, muscle, kidney)
•Metabonomics
•Proteomics (general, glyco-, phospho-proteomics)
•Transcriptomics
•Culling conditions
Similar data will be recorded regarding experiments performed with cell lines.
Data Formats
Affymetrix
cDNA arrays (ATF, GAL files)
XLS files
Chromatograms
Filemaker Pro
Metabonomics
NMR spectra
•Raw Data
•Normalised Data
•Processed Data
Collaborative Visualisation
Literature mining and compound analysis
Grid Computing
BAIR Portal
Integrative support
• Information:
  • Data models to support individual domains (sequences, NMR profiles…) and methods to map them into generic analysis (tables, text)
  • Annotation databases integrated through Web Service APIs
• Researchers
  • Sharing of work and knowledge through reusable workflow components
  • Aim for minimum technical overhead when linking new resources
• Tools
  • Focus on integration methods rather than one-off tool linkage
  • Researchers able to link to standard tools without the need for an IT specialist
  • Databases accessed through aggregators (SRS, BioMart…)