Transcript Document

(def functor BeeSpace v3)
Core BSv3 Features
• Personalized Collections
– All functions operate on virtual collections.
• Gene Analysis Functions
– Gene Annotation, Summarization, etc.
• Topic Exploration
– Evolve/extract/expand/map/compare topics.
Challenges
New problems:
• Modifying all functions to operate on virtual collections.
• Build Intelligent Gene Retrieval functionality.
• DB-supported apps.
• Optimize & parallelize implementations.
• Modify indexing strategies.
Constraint: 5 month timeline
Old problems:
• Better tokenization needed.
• Teaming and code sharing.
• Upgrade of Lemur & Indri.
• Structured Queries.
• Multiple languages; diverse skill sets.
Big Hill, Little Time
Accomplishments
• Gene-focused tokenization scheme implemented.
• Intelligent Gene Retrieval function near completion.
• Optimized & parallelized (EM) Theme Clustering.
• Developed DB infrastructure for application support: multiple DBs, tables, DAO access.
• Developed Common Library (4K lines C++) for sharing across applications.
Accomplishments cont…
• Upgraded Lemur/Indri and normalized indexing.
• Developed (6) Collection operations and Boolean Query support.
Development Life Cycle
[Diagram: System Development Life Cycle (Waterfall). Stages: Requirements Definition → Analysis → Design → Implementation → Testing → Deployment. Artifacts produced along the way: Project Goals, Documentation & Requirements Tracking, Conceptual Models, Logical/Physical Models, Scaffolding & Testing Software, Software Libraries & Applications, Deployment & Backup.]
Logical View
• Application Layer: BeeSpace Navigator
• Query & Analysis Layer: Search Engine, Gene Analysis, Topic Exploration, Clustering; Common Software Library
• Data & Knowledge Layer: XML Full Text, Indexes, RDBMS
• Data Processing Layer: Tokenization, Named Entity Recognition, Classifiers, Indexers
Mining & Analysis of EAR Graphs
• Need to be able to quickly analyze and mine knowledge nets and EAR graphs with ad-hoc operations. User-driven exploration via 4GL query language.
• Many data models already fit within an Entity/Attribute/Relationship (EAR) model. Very flexible design.
• 4GL approach to operations increases the target audience and allows for query optimization, e.g.:
select src, sum(wt) from edge group by src order by src;
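As a rough illustration (the Edge type and the sample rows below are hypothetical, not the actual BSv3 edge schema), the same group-by aggregation can be written against an in-memory edge list with the STL:

// Hypothetical sketch: sum edge weights grouped by source node, mirroring
// "select src, sum(wt) from edge group by src order by src".
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Edge { std::string src, trg; double wt; };

int main() {
    std::vector<Edge> edges = {
        {"amfor", "forage", 0.10}, {"amfor", "dance", 0.05}, {"forage", "amfor", 0.03}
    };
    std::map<std::string, double> total;          // std::map keeps keys sorted (order by src)
    for (const Edge& e : edges) total[e.src] += e.wt;
    for (const auto& kv : total)
        std::cout << kv.first << "\t" << kv.second << "\n";
    return 0;
}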
Sample EAR Data Model
[Diagram: sample EAR data model with labeled examples of Entities, Attributes, and Relations. One entity is the term "amfor" (Class: gene, Year: 2001, Prior: 0.07, Doc: pmid:123, Length: 133, DFreq: 4); another is the term "forage" (Class: behavior, Year: 1995, Prior: 0.12, Doc: pmid:444, Length: 401, DFreq: 83). Relation attributes include Topic (11, 23), Collection (fly, bee), Weight (0.10), Joint (0.11, 0.03), and MI (0.023, 0.012).]
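To suggest how flexible the EAR design is, the sketch below (hypothetical types, not the actual BSv3 schema) models entities and relations as records whose attributes live in an open key/value map, using the attribute names from the sample diagram:

// Hypothetical sketch of an EAR record layout; attribute names follow the
// sample diagram (Class, Year, Prior, Topic, Collection, Weight, Joint, MI).
#include <map>
#include <string>
#include <vector>

struct Entity {
    std::string id;                              // e.g. "term:amfor", "doc:pmid:123"
    std::map<std::string, std::string> attrs;    // open attribute set: Class, Year, Prior, ...
};

struct Relation {
    std::string srcId;                           // source entity
    std::string trgId;                           // target entity
    std::map<std::string, std::string> attrs;    // Topic, Collection, Weight, Joint, MI, ...
};

struct EarGraph {
    std::vector<Entity> entities;
    std::vector<Relation> relations;
};

int main() {
    EarGraph g;

    Entity amfor;
    amfor.id = "term:amfor";
    amfor.attrs["Class"] = "gene";
    amfor.attrs["Year"]  = "2001";
    amfor.attrs["Prior"] = "0.07";
    g.entities.push_back(amfor);

    Relation r;
    r.srcId = "term:amfor";
    r.trgId = "term:forage";
    r.attrs["Topic"]      = "11";
    r.attrs["Collection"] = "fly";
    r.attrs["MI"]         = "0.023";
    g.relations.push_back(r);
    return 0;
}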
Approach
• Adopt a layered system (stack) approach: Data layer, core software layer, interpreter, GUI. Possibility for administration client.
• Data Layer: currently implemented on top of RDBMS. Achieves flexibility, outreach/reuse, and is often cluster-compatible.
• Core Software Layer: C++ STL implementation with functional programming paradigm.
• Best-of-Worlds Effect: combines salient features of relational modeling with functional power and adaptive-object modeling methodology.
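A minimal sketch of the layering described above, under assumed names (EdgeStore, InMemoryEdgeStore, and foldWeights are illustrative, not BSv3 code): the data layer hides storage behind a DAO-style interface, and the core layer applies STL-style functional operations to whatever that interface returns.

// Illustrative layering sketch (hypothetical names): a DAO-style data layer,
// with a functional core layer on top that is parameterized by functors.
#include <algorithm>
#include <functional>
#include <iostream>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

struct Edge { std::string src, trg; double wt; };

// Data layer: abstract DAO; a concrete class would issue SQL against the RDBMS.
class EdgeStore {
public:
    virtual ~EdgeStore() {}
    virtual std::vector<Edge> edgesFrom(const std::string& src) const = 0;
};

// Stand-in implementation so the sketch runs without a database.
class InMemoryEdgeStore : public EdgeStore {
    std::vector<Edge> edges_;
public:
    explicit InMemoryEdgeStore(std::vector<Edge> e) : edges_(std::move(e)) {}
    std::vector<Edge> edgesFrom(const std::string& src) const override {
        std::vector<Edge> out;
        std::copy_if(edges_.begin(), edges_.end(), std::back_inserter(out),
                     [&src](const Edge& e) { return e.src == src; });
        return out;
    }
};

// Core software layer: an operation parameterized by a functor (a fold over edges).
double foldWeights(const std::vector<Edge>& es,
                   const std::function<double(double, const Edge&)>& op) {
    double acc = 0.0;
    for (const Edge& e : es) acc = op(acc, e);
    return acc;
}

int main() {
    InMemoryEdgeStore store({{"amfor", "forage", 0.10}, {"amfor", "dance", 0.05}});
    double sum = foldWeights(store.edgesFrom("amfor"),
                             [](double a, const Edge& e) { return a + e.wt; });
    std::cout << "sum(wt) for amfor = " << sum << "\n";
    return 0;
}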
Motivation #1
(define (deriv exp var)
  (cond ((constant? exp) 0)
        ((variable? exp)
         (if (same-variable? exp var) 1 0))
        ((sum? exp)
         (make-sum (deriv (addend exp) var)
                   (deriv (augend exp) var)))
        …))
Ref: “Structure and Interpretation of Computer Programs”, Abelson et al.
Motivation #2
• C++ STL:
• class myFunctor : public std::unary_function<int, double>
• { double operator()(int x) const { return 2.0*x; } };
• std::for_each(list.begin(), list.end(), myFunctor());
• std::set_intersection(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(result));
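For concreteness, a minimal self-contained version of the fragment above (container contents are made up; the unary_function base is omitted here since nothing in the calls shown needs it):

// Runnable version of the STL fragment above; data values are illustrative.
#include <algorithm>
#include <iostream>
#include <iterator>
#include <list>
#include <set>
#include <vector>

// Doubles its argument; the kind of small functor the core library passes around.
struct myFunctor {
    double operator()(int x) const { return 2.0 * x; }
};

int main() {
    std::list<int> list;
    list.push_back(1); list.push_back(2); list.push_back(3);

    // for_each applies the functor to every element (return values are discarded).
    std::for_each(list.begin(), list.end(), myFunctor());

    std::set<int> a; a.insert(1); a.insert(2); a.insert(3);
    std::set<int> b; b.insert(2); b.insert(3); b.insert(4);
    std::vector<int> result;

    // set_intersection writes the common elements {2, 3} into result.
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(result));

    for (int v : result) std::cout << v << " ";
    std::cout << "\n";
    return 0;
}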
Applications
• Concept switching.
• Theme extraction.
• Theme expansion, shrinking, morphing.
• Path finding; net flow analysis.
• Support for propagation nets, belief nets.
• Clustering, clique finding, etc.
*** Not just standard CS/statistical algorithms, but utilizing semantic information and user directives.
Modeling a Concept Space
• Alternative definitions: $F \subseteq \{\, f : CS \to CS \,\}$
• Case 1 (powerset): $CS := 2^{C}$, $F \subseteq CS^{CS}$; $F$ is a monoid with respect to function composition $(\circ)$: $k, j \in \mathbb{Z} \Rightarrow f^{k} \circ f^{j} \in F$; $\operatorname{coset}(g) := \{\, f \circ g \mid f \in F \,\}$
• Case 2 (finite vector space): $F$ is $\mathbb{R}^{n}$ (choose generalized velocities); $F \subseteq L(CS, CS)$
• Case 3 (random vectors): $F$ can be modeled with functions over random variables: $F \sim f(x_1, \dots, x_n; \theta)$
• Theme ⇒ point; Region ⇒ set of subsets / compact set / distribution ⇒ need to capture variances, $\sim N(\mu, \sigma)$
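A small sketch of Case 1 under stated assumptions (CS is represented as a set of concept ids, operators as std::function values; the add7/drop1 operators are made up): concept-space operators compose, and coset(g) is obtained by composing each f in F with g.

// Case 1 sketch (illustrative types): CS = 2^C as a set of concept ids,
// operators f : CS -> CS closed under composition; coset(g) = { f o g | f in F }.
#include <functional>
#include <iostream>
#include <set>
#include <vector>

typedef std::set<int> CS;                    // one element of the powerset 2^C
typedef std::function<CS(const CS&)> Op;     // f : CS -> CS

// Function composition: (f o g)(s) = f(g(s)).
Op compose(const Op& f, const Op& g) {
    return [f, g](const CS& s) { return f(g(s)); };
}

// coset(g) := { f o g | f in F }
std::vector<Op> coset(const std::vector<Op>& F, const Op& g) {
    std::vector<Op> out;
    for (const Op& f : F) out.push_back(compose(f, g));
    return out;
}

int main() {
    Op add7  = [](const CS& s) { CS t = s; t.insert(7); return t; };  // add concept 7
    Op drop1 = [](const CS& s) { CS t = s; t.erase(1);  return t; };  // remove concept 1
    std::vector<Op> F = {add7, drop1};

    CS theme = {1, 2, 3};                        // a theme as a point in CS
    for (const Op& h : coset(F, add7)) {         // apply each f o add7 to the theme
        for (int c : h(theme)) std::cout << c << " ";
        std::cout << "\n";
    }
    return 0;
}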
Other ideas…
• EM ~ separation: $w$ independent of $d$ given $c$ ⇒ matrix factorization. Golub discusses Jacobi iterations (parallelizable).
• Cons. inference: Markov net w/ Gibbs over max. cliques.
• What is the min # of operators needed for mining of EAR…

$\big[\, p(w_i, d_j) \,\big]_{i=1..n,\; j=1..m} = \big[\, p(w_i \mid c_k) \,\big] \, \mathrm{diag}\!\big( p(c_k) \big) \, \big[\, p(d_j \mid c_k) \,\big]^{t}$

$\big[\, p(w_i, w_j) \,\big]_{i,j=1..n} = \big[\, p(w_i \mid c_k) \,\big] \, \mathrm{diag}\!\big( p(c_k) \big) \, \big[\, p(w_j \mid c_k) \,\big]^{t}$

$p(c_i) := g(c_i) \Big/ \sum_i g(c_i)$
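To make the factorization above concrete, a tiny numeric sketch (the probabilities are made up; 2 words, 2 documents, 2 latent classes) that computes the joint matrix as [p(w|c)] · diag(p(c)) · [p(d|c)]^t:

// Numeric sketch of p(w_i, d_j) = sum_k p(w_i|c_k) p(c_k) p(d_j|c_k);
// all probabilities below are illustrative.
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    std::vector<std::vector<double>> pw_c = {{0.8, 0.3},   // p(w_i | c_k), one row per word
                                             {0.2, 0.7}};
    std::vector<double> pc = {0.6, 0.4};                   // p(c_k)
    std::vector<std::vector<double>> pd_c = {{0.5, 0.1},   // p(d_j | c_k), one row per document
                                             {0.5, 0.9}};

    for (std::size_t i = 0; i < pw_c.size(); ++i) {
        for (std::size_t j = 0; j < pd_c.size(); ++j) {
            double joint = 0.0;                            // sum over latent classes c_k
            for (std::size_t k = 0; k < pc.size(); ++k)
                joint += pw_c[i][k] * pc[k] * pd_c[j][k];
            std::cout << "p(w" << i << ",d" << j << ") = " << joint << "  ";
        }
        std::cout << "\n";
    }
    return 0;
}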