A Risk Minimization Framework for Information Retrieval

BeeSpace Informatics Research: From Information Access to Knowledge Discovery
ChengXiang Zhai
Nov. 14, 2007
BeeSpace Technology: From V3 to V4
[Diagram components: Query, Docs, Genes, Function; Search & Navigation; Function Analysis; Literature; Question, Answers; ER Graph Mining; Inference Engine; Entities; Relations; Knowledge Base; Expert Knowledge]
New Functions in V4
• Massive Entity/Relation Extraction
• Graph Indexing and Mining
• Integration of Expert Knowledge & Reasoning
• Personalization & Info/Knowledge Sharing
• “Plug and Play” (PnP)
Massive Entity Recognition
• Class1: Small Variation (Dictionary/Ontology)
– Organism, Anatomy, Biological Process, Pathway,
Protein Family
• Class2: Medium Variation
– Gene, cis Regulatory Element
• Class3: Large Variation
– Phenotype, Behavior
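For Class 1 entities, where surface forms vary little, a dictionary lookup is often sufficient. The following is a minimal sketch under that assumption. The dictionary entries, the function name `tag_entities`, and the sample sentence are illustrative and are not taken from BeeSpace.

```python
import re

# Hypothetical dictionary: surface form -> entity class (illustrative entries only).
DICTIONARY = {
    "apis mellifera": "Organism",
    "mushroom body": "Anatomy",
    "circadian rhythm": "Biological Process",
}

def tag_entities(text, dictionary=DICTIONARY):
    """Return (start, end, surface, entity_class) for each dictionary hit.

    Matching is case-insensitive and restricted to whole words. Longer
    terms are tried first so they win over shorter overlapping terms.
    """
    hits = []
    taken = [False] * len(text)  # marks characters already covered by a hit
    for term in sorted(dictionary, key=len, reverse=True):
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            if not any(taken[m.start():m.end()]):
                hits.append((m.start(), m.end(), m.group(0), dictionary[term]))
                for i in range(m.start(), m.end()):
                    taken[i] = True
    return sorted(hits)

sentence = "Studies of circadian rhythm in Apis mellifera often examine the mushroom body."
entities = tag_entities(sentence)
for start, end, surface, cls in entities:
    print(f"{surface!r} [{start}:{end}] -> {cls}")
```

Class 2 and Class 3 entities (genes, phenotypes, behaviors) vary too much for plain lookup, so a sketch like this would only cover the first class.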
Massive Relation Extraction
• Expression Location
– the expression of a gene in some location (tissues, body parts)
• Homology/Orthology
– one gene is homologous to another gene
• Biological process
– one gene has some role in a biological process
• Genetic/Physical/Regulatory Interaction
– one gene interacts with another gene in a certain fashion (3 types of relations)
– a simple case: Protein-Protein Interaction (PPI)
Entity Relation Graph Mining
• The extracted entities and relations form a
weighted graph
• Need to develop techniques to mine the graph
for knowledge
– Store graphs
– Index graphs
– Mining algorithms (neighbor finding, path finding,
entity comparison, outlier detection, frequent
subgraphs, …)
– Mining language
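One way to picture the "store" and "index" steps is an adjacency index keyed by node and relation type, with weights that accumulate as the same relation is extracted repeatedly. This is a sketch only: the class name `ERGraph`, its methods, the undirected treatment of edges, and the sample genes are assumptions made for illustration.

```python
from collections import defaultdict

class ERGraph:
    """Weighted entity-relation graph with a per-node, per-relation index."""

    def __init__(self):
        # index[node][relation] -> {neighbor: accumulated weight}
        self.index = defaultdict(lambda: defaultdict(dict))
        self.node_type = {}

    def add_entity(self, name, etype):
        self.node_type[name] = etype

    def add_relation(self, a, relation, b, weight=1.0):
        # Repeated extractions of the same relation strengthen the edge.
        # Edges are stored in both directions (treated as undirected here).
        for u, v in ((a, b), (b, a)):
            edges = self.index[u][relation]
            edges[v] = edges.get(v, 0.0) + weight

    def neighbors(self, node, relation):
        """Neighbors of `node` via `relation`, strongest edge first."""
        if node not in self.index:
            return []
        edges = self.index[node].get(relation, {})
        return sorted(edges, key=lambda n: (-edges[n], n))

g = ERGraph()
for gene in ("GeneA", "GeneB", "GeneC"):
    g.add_entity(gene, "Gene")
g.add_relation("GeneA", "interacts", "GeneB")
for _ in range(3):  # the same relation extracted from three sentences
    g.add_relation("GeneA", "interacts", "GeneC")

ranked = g.neighbors("GeneA", "interacts")
print(ranked)
```

Keying the index by relation type keeps neighbor finding restricted to one relation cheap, which is the access pattern the mining operations listed above would rely on.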
Integration of Expert Knowledge
• How can we combine expert knowledge with
knowledge extracted from literature?
• Possible strategies:
– Interactive mining (human knowledge is used to
guide the next step of mining)
– Trainable programs (focused miners targeting
certain kinds of knowledge)
– Inference-based integration
Inference-Based Discovery
• Encode all kinds of knowledge in the same knowledge representation language
• Perform logic inferences
• Example
– Regulate(GeneA, GeneB, ContextC). [Literature mining]
– SeqSimilar(GeneA, GeneA’) [Sequence mining]
– Regulate(X,Y,C) ← Regulate(Z,Y,C) & SeqSimilar(X,Z) [Human knowledge]
– ⇒ Regulate(GeneA’, GeneB, ContextC)
– ADD: InPathway(GeneB, P1)
– InPathway(X,P) ← Regulate(X,Y,C) & InPathway(Y,P) [Human knowledge]
– ⇒ InvolvedInPathway(GeneA’, P1)
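The example above can be reproduced with a small forward-chaining loop. This is a sketch under stated assumptions, not the BeeSpace inference engine: facts are plain tuples, the two rules are hand-coded functions, and SeqSimilar is treated as symmetric. The slide names the second rule's head InPathway but its conclusion InvolvedInPathway; the sketch uses InvolvedInPathway for the derived predicate.

```python
def seq_similar(facts):
    """Symmetric closure of SeqSimilar (symmetry is an assumption of this sketch)."""
    pairs = {(f[1], f[2]) for f in facts if f[0] == "SeqSimilar"}
    return pairs | {(b, a) for a, b in pairs}

def rule_regulate(facts):
    # Regulate(X,Y,C) <- Regulate(Z,Y,C) & SeqSimilar(X,Z)
    sim = seq_similar(facts)
    derived = set()
    for f in facts:
        if f[0] == "Regulate":
            _, z, y, c = f
            for x, z2 in sim:
                if z2 == z:
                    derived.add(("Regulate", x, y, c))
    return derived

def rule_pathway(facts):
    # InvolvedInPathway(X,P) <- Regulate(X,Y,C) & InPathway(Y,P)
    in_pathway = {(f[1], f[2]) for f in facts if f[0] == "InPathway"}
    derived = set()
    for f in facts:
        if f[0] == "Regulate":
            _, x, y, _c = f
            for gene, p in in_pathway:
                if gene == y:
                    derived.add(("InvolvedInPathway", x, p))
    return derived

def forward_chain(facts, rules):
    """Apply all rules repeatedly until no new fact is derived."""
    facts = set(facts)
    while True:
        new = set().union(*(rule(facts) for rule in rules)) - facts
        if not new:
            return facts
        facts |= new

facts = {
    ("Regulate", "GeneA", "GeneB", "ContextC"),  # literature mining
    ("SeqSimilar", "GeneA", "GeneA'"),           # sequence mining
    ("InPathway", "GeneB", "P1"),                # the fact added in the example
}
closure = forward_chain(facts, [rule_regulate, rule_pathway])
for fact in sorted(closure - facts):
    print(fact)
```

Running to a fixpoint lets the second rule build on facts produced by the first, which is how the pathway conclusion for GeneA’ is reached.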
Personalization & Workflow Management
• Different users have different tasks ⇒
personalization
– Tracking a user’s history and learning a user’s
preferences
– Exploiting the preferences to customize/optimize
the support
– Allowing a user to define/build special function
modules
• Workflow management
Information/Knowledge Sharing
• Different users may perform similar tasks ⇒
information/knowledge sharing
– Capturing user intentions
– Recommending information/knowledge
– How do we solve the problem of privacy?
• Massive collaborations?
– Each user contributes a small amount of knowledge
– All the knowledge can be combined to infer new
knowledge
Plug and Play
• Users’ tasks vary significantly
• Need flexible combinations of basic modules
• Need to move toward a “discovery
workbench”
– How do we design basic modules?
– How do we support synthesis of information and
knowledge?
BeeSpace V4
[Diagram components: User; Vertical Search Services; Search & Navigation; User; PnP Function Analyzers; Text Mining; Literature; User; Customized Knowledge Base; ER Graph Mining; Inference Engine; Entities; Relations; Knowledge Base; Expert Knowledge]
Discussion
• Task Model?
• PnP Modules?
• Massive Collaboration?
BeeSpace V4: System Architecture
[Diagram components: User; User Interface/Workflow Manager; User Modeling & Personalization; Special Search; Search & Navigation; Topic Modeling; Literature; Inference Engine; PnP Function Analyzers; Machine Learning; NLP; Information Extraction; ER Graph Mining; Entities; Relations; Hypothesis; Knowledge Base; Expert Knowledge; …; NCBI Genome Databases]
BeeSpace V4: System Architecture
[Same architecture diagram, with each component annotated with team members: Yuanhua, Moushumi, Yue, Xin, Xu, Peixiang]
Modules
• Navigation & Search (Improve V3) [Yuanhua]
• Information Extraction [Yue]
• ER Graph Mining [Peixiang]
• Specialized Search [Xu]
• Function Analyzers [Xin]
• User Modeling, Personalization, Workflow
[Yuanhua]
• Inference Engine [Yue]
Informatics Research Themes
• Specialized Search
– Hypothesis search
• Information Extraction
– Entities, relations
• Graph Mining
– Indexing, query language, mining algorithms
• Function analyzers
– Gene set annotator
• Personalization
– User model
• Inference engine
– Knowledge representation language, uncertainty
Example of Interactive Graph Mining
[Graph figure: Behaviors B1 to B4 and Genes A1, A1’, A2, A3, A4, A4’, A5, linked by edges labeled isa, Co-occur-fly, Co-occur-bee, Co-occur-mos, Orth-mos, orth, and Reg]
1. X = NeighborOf(B4, Behavior, {co-occur, isa}) ⇒ {B1, B2, B3}
2. Y = NeighborOf(X, Gene, {co-occur, orth}) ⇒ {A1, A1’, A2, A3}
3. Y = Y + {A5, A6} ⇒ {A1, A1’, A2, A3, A5, A6}
4. Z = NeighborOf(Y, Gene, {reg}) ⇒ {A4, A4’}
5. X = PathBetween({A4, A4’}, B4, {co-occur, reg, isa})
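The mining steps on this slide can be traced on a small typed graph. The sketch below is illustrative only: the edge list is hypothetical (the figure's exact edges cannot be recovered from the transcript) and was chosen so that the four NeighborOf steps return the sets shown on the slide. Species-specific labels such as Co-occur-fly are collapsed to a single co-occur relation, and the function names `neighbor_of` and `path_between` are assumptions.

```python
from collections import defaultdict, deque

# Hypothetical node types and edges (not the figure's actual edge list).
NODE_TYPE = {**{b: "Behavior" for b in ("B1", "B2", "B3", "B4")},
             **{g: "Gene" for g in ("A1", "A1'", "A2", "A3", "A4", "A4'", "A5")}}
EDGES = [
    ("B1", "isa", "B4"), ("B2", "isa", "B4"), ("B3", "co-occur", "B4"),
    ("A1", "co-occur", "B2"), ("A1'", "co-occur", "B1"),
    ("A2", "co-occur", "B3"), ("A3", "co-occur", "B3"),
    ("A1", "orth", "A1'"), ("A4", "orth", "A4'"),
    ("A1'", "reg", "A4'"), ("A2", "reg", "A4"),
    ("A3", "reg", "A4"), ("A5", "reg", "A4"),
]

ADJ = defaultdict(set)  # node -> {(relation, neighbor)}, edges treated as undirected
for u, rel, v in EDGES:
    ADJ[u].add((rel, v))
    ADJ[v].add((rel, u))

def neighbor_of(sources, node_type, relations):
    """Nodes of `node_type` reachable in one hop from any source via `relations`."""
    sources = {sources} if isinstance(sources, str) else set(sources)
    return {v for s in sources for rel, v in ADJ.get(s, ())
            if rel in relations and NODE_TYPE.get(v) == node_type}

def path_between(sources, target, relations):
    """Shortest path (by hop count) from each source to `target`, or None."""
    paths = {}
    for s in sources:
        prev, queue = {s: None}, deque([s])
        while queue:  # breadth-first search over allowed relations
            u = queue.popleft()
            if u == target:
                break
            for rel, v in sorted(ADJ.get(u, ())):
                if rel in relations and v not in prev:
                    prev[v] = u
                    queue.append(v)
        if target in prev:
            path, node = [], target
            while node is not None:
                path.append(node)
                node = prev[node]
            paths[s] = path[::-1]
        else:
            paths[s] = None
    return paths

X = neighbor_of("B4", "Behavior", {"co-occur", "isa"})   # step 1
Y = neighbor_of(X, "Gene", {"co-occur", "orth"})          # step 2
Y = Y | {"A5", "A6"}                                      # step 3: user adds genes
Z = neighbor_of(Y, "Gene", {"reg"})                       # step 4
paths = path_between(Z, "B4", {"co-occur", "reg", "isa"}) # path finding
for src, path in sorted(paths.items()):
    print(src, "->", path)
```

Step 3 is where human knowledge enters: A6 is not in the graph at all, and the sketch simply tolerates nodes it has never seen.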
Inference-Based Discovery
• Encode all kinds of knowledge in the same knowledge representation language
• Perform logic inferences
• Example
– Regulate(GeneA, GeneB, ContextC). [Literature mining]
– SeqSimilar(GeneA, GeneA’) [Sequence mining]
– Regulate(X,Y,C) ← Regulate(Z,Y,C) & SeqSimilar(X,Z) [Human knowledge]
– ⇒ Regulate(GeneA’, GeneB, ContextC)
– ADD: InPathway(GeneB, P1)
– InPathway(X,P) ← Regulate(X,Y,C) & InPathway(Y,P) [Human knowledge]
– ⇒ InvolvedInPathway(GeneA’, P1)
PnP Function Analyzers
• Basic objects
– GeneSet, DocSet, SentSet, TermSet
• Basic operators
– Gene summarizer
– GeneSet annotator
– …
– Splitter
– Filter/Attractor
– Converter
– …
[Diagram labels: EntitySet, GeneSet, BehaviorSet, …, Doc/SentSet, ModelOrg, ….]
GeneSearch: GeneSet → Doc/SentSet
DocSplitter: Doc/SentSet → {Set1, …, Setk}
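The two operator signatures suggest typed, composable functions over set-valued objects. A minimal sketch of that idea follows. The class names, the toy corpus, substring matching, and keyword-based splitting are all assumptions made for illustration; the slides do not say how GeneSearch or DocSplitter work internally.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GeneSet:
    genes: frozenset

@dataclass(frozen=True)
class DocSet:
    docs: tuple  # document or sentence texts

# Hypothetical toy corpus.
CORPUS = (
    "GeneA is expressed in the brain.",
    "GeneB regulates foraging behavior.",
    "GeneA interacts with GeneB in the brain.",
    "An unrelated sentence.",
)

def gene_search(genes: GeneSet, corpus=CORPUS) -> DocSet:
    """GeneSearch: GeneSet -> Doc/SentSet (texts mentioning any gene in the set)."""
    return DocSet(tuple(d for d in corpus if any(g in d for g in genes.genes)))

def doc_splitter(docs: DocSet, keywords) -> dict:
    """DocSplitter: Doc/SentSet -> {Set1, ..., Setk}, here one subset per keyword."""
    return {k: DocSet(tuple(d for d in docs.docs if k in d)) for k in keywords}

# Plugging the output of one operator into the next.
hits = gene_search(GeneSet(frozenset({"GeneA"})))
parts = doc_splitter(hits, ["brain", "interacts"])
for keyword, subset in parts.items():
    print(keyword, len(subset.docs))
```

Because each operator consumes and produces one of the basic object types, any operator whose input type matches another's output type can be chained, which is the "plug and play" property.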