BeeSpace Software

Plans, Design, and Development
Outline








Goals
Context
Approach
Software Process
Functionality
Design
Implementation Details
Future Prospects
Project Goals & Parameters






“This project will analyze social behavior… using Apis Mellifera as the model organism”.

Goal: support research and analysis of the Western honey bee.
Using “biology research (that) will generate a unique database of gene expressions…” and
“microarray experiments (that will) utilize the recently sequenced genome, supported by
state-of-the-art statistics.”

Goal: support application of biological methods and techniques for exploratory
analysis.
And using “informatics research (that) will develop an interactive environment to analyze all
information sources relevant to bee social behavior.”

Goal: support application of language processing methods for exploratory analysis.
“The BeeSpace environment will enable users to navigate a uniform space of diverse
databases and literature sources for hypothesis development and testing.” (Ref:
http://www.beespace.uiuc.edu/)

Goal: support dual analysis methodologies via an integrated analysis environment.
Parameter: 5 years to complete the project, including research, development, deployment,
outreach and documentation.
Parameter: annual milestones and workshops expected.
Context

There are voluminous amounts of biomedical and genomic literature containing
valuable knowledge and research results.

Implication: Too much for human processing, and not in a machine-ready format for
reasoning-based systems.

There exist novel language processing techniques that have been primarily applied
in niche applications.

Implication: Emerging technologies (NLP, TM, etc.) can provide the backbone for a strategic
solution, but their risks must be mitigated through controlled development cycles.

There exist numerous, but currently isolated, tools for processing
bioinformatics data.

Implication: Opportunities exist for interoperability with disparate systems, but
success hinges on standardization.

The web is seeing an increase in smaller, highly focused communities-of-interest.

Implication: Opportunities exist for supporting the creation and management of
localized “knowledge-spaces”.
Context – Related Tools & Projects




3rd Millennium Inc. – “…development of an integration framework for genomic, gene
expression, and interaction data (protein-protein as well as protein-DNA) from multiple
sources and model organisms that can enable the display of the relationships between
biochemical objects into the context of biological pathways and networks.”
iHOP – Information Hyperlinked Over Proteins: supports lookup and summarization of
genes/proteins. “In general more than 90% of all active relations between proteins in the
literature are expressed syntactically as ‘protein verb protein’”. Ref.
IntAct Database – “IntAct provides a freely available, open source database system and
analysis tools for protein interaction data. All interactions are derived from literature
curation or direct user submissions and are freely available.”
Entrez eUtils – A web services (SOAP) interface for programmatically querying and
interacting with NCBI databases.
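The ‘protein verb protein’ observation quoted for iHOP can be illustrated with a very small pattern matcher. This is only a sketch of the idea, not iHOP's actual method: it assumes the protein names and interaction verbs are already known, and the names ProtA, ProtB and ProtC are hypothetical placeholders.

```python
import re

def extract_relations(sentence, proteins, verbs):
    """Return (protein, verb, protein) triples for adjacent 'protein verb protein' tokens.

    `proteins` and `verbs` are assumed, pre-built lexicons (hypothetical here).
    """
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    triples = []
    # Slide a three-token window over the sentence.
    for i in range(len(tokens) - 2):
        left, verb, right = tokens[i], tokens[i + 1], tokens[i + 2]
        if left in proteins and verb.lower() in verbs and right in proteins:
            triples.append((left, verb.lower(), right))
    return triples

# Toy usage with placeholder protein names.
demo = extract_relations(
    "ProtA binds ProtB and ProtB activates ProtC.",
    proteins={"ProtA", "ProtB", "ProtC"},
    verbs={"binds", "activates"},
)
```

A real relation extractor would need entity recognition and some tolerance for intervening words; the strict three-token window is chosen here only to keep the pattern visible.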
Software Process
System Development Life Cycle (SDLC)







Identify project goals and critical success factors.
Investigate current methodologies and tools that have functional or domain
overlap with project objectives.
Research the applicability of novel analysis techniques for extracting deeply
embedded and stratified knowledge structures.
Build an integrated software suite that will allow for interactive analysis and
augmentation of rich data sets.
Test and deploy software to focused user groups.
Document and publish research results.
Re-iterate above process for continuous quality improvement.
Functionality






Should be a web-based system supporting lightweight GUI components and having
minimal end-user requirements.
Should accommodate user-directed query-by-navigation (QBN) of “concept
space”.
Should extract and normalize concepts as “equivalence classes” of things with
highly similar meaning. Should recognize and denote entities.
Should allow the user to drill-down, drill-up and drill-across concept space, e.g.
text-to-concept, concept-to-concept, concept-to-theme, and the reverse directions as
well.
Should allow user to perform encyclopedia-style lookup of entities.
Should provide hooks for tie-in to 3rd party bioinformatics tools.
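The requirement to normalize concepts as “equivalence classes” can be sketched with a union-find structure. This is a minimal illustration under stated assumptions, not the BeeSpace implementation: the class name ConceptSpace is hypothetical, and the pairs of terms with highly similar meaning are assumed to be supplied by some upstream step.

```python
from collections import defaultdict

class ConceptSpace:
    """Groups terms into equivalence classes (hypothetical helper, union-find based)."""

    def __init__(self):
        self._parent = {}

    @staticmethod
    def _norm(term):
        # Case and surrounding whitespace are ignored when comparing terms.
        return term.strip().lower()

    def _find(self, term):
        term = self._norm(term)
        self._parent.setdefault(term, term)
        while self._parent[term] != term:
            # Path halving keeps later lookups short.
            self._parent[term] = self._parent[self._parent[term]]
            term = self._parent[term]
        return term

    def merge(self, a, b):
        """Declare that terms a and b denote the same concept."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            # Keep the alphabetically smaller root so results are deterministic.
            keep, drop = sorted((root_a, root_b))
            self._parent[drop] = keep

    def concept_of(self, term):
        """Drill up: term -> representative of its equivalence class."""
        return self._find(term)

    def classes(self):
        """Drill down: representative -> set of member terms."""
        groups = defaultdict(set)
        for term in list(self._parent):
            groups[self._find(term)].add(term)
        return dict(groups)

space = ConceptSpace()
space.merge("honey bee", "Apis mellifera")
space.merge("honeybee", "honey bee")
space.merge("foraging", "forage behavior")
```

Here concept_of plays the role of a text-to-concept step and classes the reverse direction; concept-to-theme navigation would need an additional layer that this sketch does not cover.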
Design Principles










Maintainability
Portability
Extensibility
Efficiency
Organization
Interoperability
Configurability
Ease-of-use
Trustworthiness
“Quality without a Name”
References: “Code Complete”, 2nd ed., “Pattern-Oriented Software Architecture”, volume 1.
Design – Use Case Diagram
Design – Component Diagram
BeeSpace Design
Application Layer: BeeSpace Navigator
Query & Data Access Layer: Fuzzy Query Engine, Data Access Component
Annotated Data, Meta-Data and Indices: XML Schemas, XML Data, Indices
Data Processing Layer: Entity Recognizer, POS Tagger, NP Chunker, Inverter, Concept Normalizer, Concept Generator
Data Sources: Text Bases
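The component diagram names an Inverter in the data processing layer and Indices in the annotated data layer. Assuming the Inverter builds an inverted index over the text bases (the slides do not spell this out), a minimal sketch could look like the following; the function names and the sample documents are hypothetical.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase alphanumeric tokens; a stand-in for the real tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(documents):
    """Map each token to the sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            postings[token].add(doc_id)
    return {token: sorted(ids) for token, ids in postings.items()}

def search(index, query):
    """Return ids of documents that contain every query token (boolean AND)."""
    tokens = tokenize(query)
    if not tokens:
        return []
    result = set(index.get(tokens[0], []))
    for token in tokens[1:]:
        result &= set(index.get(token, []))
    return sorted(result)

# Toy text base with made-up document ids.
docs = {
    "d1": "Foraging behavior in the honey bee",
    "d2": "Gene expression in bee brains",
    "d3": "The fly gene Glued",
}
index = build_inverted_index(docs)
hits = search(index, "bee gene")
```

In the layered design, an index like this would be produced once by the processing pipeline and then read by the query layer, which is why the sketch separates building from searching.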
Design – Deployment Scenarios
BeeSpace Software Packaging
Web App
Standalone GUI App
Core Library: Data Processing Components, Query/Access Components
Agents/P2P Clients
Extension Library: Communication Components
Design – Class Diagram
Implementation Details
The current system is being constructed as follows:

The (v1.0) application is being developed as a web-based application.

Design Decision: The interface is built on top of lightweight technologies (e.g. HTML, DHTML &
JavaScript). Typical web-app challenges, such as sessioning and security, need to be addressed.

The output of the data processing pipeline is a set of indices and annotated data files that the
client application depends on.

Design Decision: There is a clear separation-of-concerns between the server-side processing and the
client-side interface. XML is used as the data interchange format between software
components.

The pipeline is composed of independent software components, but these components need
to be inter-connected.

Design Decision: Components are called as executables with defined interfaces.

Some components need to be able to store their data aggregations persistently (and other
components may need access to this data).

Design Decision: Currently each component handles this problem independently. A better long-term
solution is to extract this concern and address it globally, for example by using an ORDBMS.
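The idea of independent components that exchange XML through defined interfaces can be sketched as a single pipeline stage. The element names (document, sentence, entity) and the lexicon are assumptions made for illustration; the actual BeeSpace schemas are not given in these slides.

```python
import xml.etree.ElementTree as ET

def annotate_entities(xml_text, lexicon):
    """One pipeline stage: XML in, XML out.

    Reads <document><sentence>...</sentence></document> (hypothetical schema)
    and appends an <entity type="..."> child for every word found in `lexicon`.
    """
    root = ET.fromstring(xml_text)
    # Materialize the list first so the tree is not modified while iterating.
    for sentence in list(root.iter("sentence")):
        for word in (sentence.text or "").split():
            key = word.strip(".,;").lower()
            if key in lexicon:
                entity = ET.SubElement(sentence, "entity")
                entity.set("type", lexicon[key])
                entity.text = key
    return ET.tostring(root, encoding="unicode")

# Toy input and lexicon.
sample_in = "<document><sentence>Glued is a fly gene.</sentence></document>"
sample_out = annotate_entities(sample_in, {"glued": "gene", "fly": "organism"})
```

Run as an executable, such a stage would read its input document from standard input or a file and write the annotated document for the next stage, so each component depends only on the agreed XML format and not on its neighbours' internals.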
Future Implementation Details





Support both a web interface (HTML, CSS, DHTML, JavaScript) and a full-blown GUI
interface (Java Web Start app).
Consistent Java implementation for portability, maintainability, RAD, etc.
Incorporate a DBMS for consistent handling of “persistent storage”.
Library extensions for communication between distributed, heterogeneous applications
(perhaps KIF).
Optimized data processing and communication.
Climbing the Pyramid
Pyramid of Knowledge
[Figure: two parallel pyramids, Text Mining and Data Mining, each topped by “?”]
Text Mining (bottom to top; side labels: Text, Concepts, Relations, Knowledge)
Raw Text (Lit.)
Semantics (Nodes)
Hidden Relationships (Network)
Intelligent-driven Research (Profit)
Computer Automated Research (Success)
Data Mining (bottom to top; side labels: Data, Information, Patterns, Knowledge)
Raw Data (Txns)
Aggregations (Reports)
Predictions (Trends)
Intelligent-driven Business (Profit)
Computer Automated Business (Success)
Future Prospects







Generalize the system so that it is NOT domain-specific and can be readily applied to other domains.
Allow for persistent sessioning and sharing of knowledge-spaces amongst communities-of-interest.
Support a visual query system (VQS) interface and/or a query-by-example (QBE) interface.
Support all kinds of hypothesis generation: deduction, abduction & induction.
Support personalized annotations. (What constitutes a “good” KR structure: clarity, logic, expressiveness?)
Smooth the integration between the BeeSpace Navigator and the myriad web-based tools.
Support n-ary, semantically rich relations as opposed to just dyadic.
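The difference between dyadic and n-ary relations can be made concrete with two small data structures. This is a sketch only; the class names and role names are hypothetical, and the example values reuse the Glued / p150Glued pair that appears in the visual query slide.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DyadicRelation:
    """A plain subject-predicate-object triple."""
    subject: str
    predicate: str
    obj: str

@dataclass(frozen=True)
class NaryRelation:
    """A relation with any number of named roles (hypothetical design)."""
    predicate: str
    roles: tuple  # tuple of (role_name, value) pairs

    def role(self, name):
        """Return the value bound to a role, or None if the role is absent."""
        for role_name, value in self.roles:
            if role_name == name:
                return value
        return None

    def to_dyadic(self, subject_role, object_role):
        """Project onto two roles; all other roles are lost."""
        return DyadicRelation(self.role(subject_role), self.predicate, self.role(object_role))

rel = NaryRelation(
    predicate="HasProduct",
    roles=(("gene", "Glued"), ("product", "p150Glued"), ("organism", "fly")),
)
flat = rel.to_dyadic("gene", "product")
```

The projection shows what a dyadic store gives up: the organism role has no place in the triple and would need a second, separate fact.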
Visual Query in Text Mining Application
[Figure: query graph]
Gene: ?x, Found-In, Org: bee
Gene: ?x, HasProduct, Protein: ?y
Gene: Glued, Found-In, Org: fly
Gene: Glued, HasProduct, Polypeptide: p150Glued
Protein: ?y, SimilarTo (Threshold: 0.9), Polypeptide: p150Glued
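The visual query on this slide asks for a bee gene ?x whose product ?y is similar, above a 0.9 threshold, to p150Glued, the product of the fly gene Glued. One way to read it is as a set of triple patterns matched against a fact base, followed by a similarity filter. The sketch below assumes that reading; the bee gene and protein identifiers and the similarity scores are invented purely for illustration.

```python
def is_var(term):
    return term.startswith("?")

def match(patterns, facts, binding=None):
    """Yield every variable binding that satisfies all triple patterns."""
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    (subj, pred, obj), rest = patterns[0], patterns[1:]
    for f_subj, f_pred, f_obj in facts:
        if f_pred != pred:
            continue
        candidate = dict(binding)
        ok = True
        for term, value in ((subj, f_subj), (obj, f_obj)):
            if is_var(term):
                # Bind the variable, or check it against an earlier binding.
                if candidate.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(rest, facts, candidate)

# Made-up fact base; only Glued and p150Glued come from the slide.
FACTS = [
    ("gene:bee_gene_1", "Found-In", "org:bee"),
    ("gene:bee_gene_1", "HasProduct", "protein:bee_protein_1"),
    ("gene:bee_gene_2", "Found-In", "org:bee"),
    ("gene:bee_gene_2", "HasProduct", "protein:bee_protein_2"),
    ("gene:Glued", "Found-In", "org:fly"),
    ("gene:Glued", "HasProduct", "polypeptide:p150Glued"),
]
# Invented similarity scores against p150Glued.
SIMILARITY = {
    ("protein:bee_protein_1", "polypeptide:p150Glued"): 0.95,
    ("protein:bee_protein_2", "polypeptide:p150Glued"): 0.40,
}
QUERY = [
    ("?x", "Found-In", "org:bee"),
    ("?x", "HasProduct", "?y"),
    ("gene:Glued", "Found-In", "org:fly"),
    ("gene:Glued", "HasProduct", "polypeptide:p150Glued"),
]

def run_query(threshold=0.9):
    """Match the graph pattern, then keep bindings whose SimilarTo score passes."""
    results = []
    for b in match(QUERY, FACTS):
        score = SIMILARITY.get((b["?y"], "polypeptide:p150Glued"), 0.0)
        if score >= threshold:
            results.append((b["?x"], b["?y"], score))
    return results

answers = run_query()
```

Treating SimilarTo as a scored filter instead of a stored fact is a design choice of this sketch; it keeps the threshold adjustable at query time, as the slide's Threshold field suggests.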
Future BeeSpace Components
Future BeeSpace Design
Application Layer: BeeSpace Analyzer, BeeSpace Workflow Manager, BeeSpace Navigator
Query & Data Access Layer: Q/A Component, Expert Shell Component, Fuzzy Query Engine, Data Access Component, Text Miner, Entity Mapper, Concept Generator
Central Knowledge Base: ORDBMS
Data Processing Layer: Entity Recognizer, POS Tagger, NP Chunker, Inverter, Concept Normalizer, Topic Detection, Relation Extractor, Rule Miner, Ontology Detector
Data Sources: Data Bases, Text Bases, Web Bases
Snake Space?