A General Optimization Framework for Smoothing Language

A Researcher’s Workbench in 2020:
Intelligent Information Systems for
Knowledge Synthesis and Discovery
ChengXiang (“Cheng”) Zhai
Department of Computer Science
Institute for Genomic Biology
Graduate School of Library and Information Science
Department of Statistics
University of Illinois at Urbana-Champaign
http://www.cs.uiuc.edu/homes/czhai
1
Assuming data sharing isn’t a problem,
what kind of systems are needed
to effectively support
representing, integrating, and reasoning over
human knowledge?
What are the key computational challenges?
2
Computer-Aided Research (CAR) in 2020
[Figure: researchers, each with personal data/info/knowledge, connected through a network to public data/info/knowledge sources. Labeled components:]
1. Multi-level integration of data/info/knowledge
2. Multimode info access
3. Research task support
4. Personalized CAR
5. Collaborative research
3
1. We need multiple levels of
integration
4
Five Levels of Integration
• Level 1: “Syntactic” integration of multiple sources
– Scalable, robust, but minimum support for discovery
• Level 2: Semantic integration (ontology)
– Scalable, less robust, better support for discovery
• Level 3: Synthesis of knowledge (entities, relations)
– Less scalable, not robust, support for interactive discovery
• Level 4: Synthesis of knowledge + Inference rules
– Only applicable to a limited domain, but potentially support
automatic discovery
• Level 5: Specialized discovery model
– Automatic hypothesis testing, but limited to a special
discovery/prediction task
5
Multi-level support is needed
because…
• Knowledge extraction is far from 100%
accurate (NLP is difficult)
• Interpretation of knowledge is inherently
context-sensitive and low-level support is
needed for context and provenance
• Automation-scalability tradeoff will not
disappear (soon)
• …
6
Automation-Scalability Tradeoff
[Figure: approaches plotted along two axes, automation of discovery (the goal) and scalability/generality. Labeled points:]
• Specialized statistical prediction models
• Logic-based inference systems
• “Beyond ontology” integration: ER graph analysis engine
• Ontology-based semantic integration
• “Ontology-Free” integration: federated search engines
7
Interactive ER Graph Analysis
• The extracted entities and relations form a
weighted graph
• Need to develop techniques to mine the
graph for knowledge
– Store graphs
– Index graphs
– Mining algorithms (neighbor finding, path
finding, entity comparison, outlier detection,
frequent subgraphs,….)
– Mining language
8
Example of Interactive Graph Mining
[Figure: an entity-relation graph linking behaviors B1, B2, B3, B4 and genes A1, A1’, A2, A3, A4, A4’, A5 through edges labeled isa, orth, Reg, Co-occur-fly, Co-occur-bee, Co-occur-mos, and Orth-mos.]
1. X = NeighborOf(B4, Behavior, {co-occur, isa}) → {B1, B2, B3}
2. Y = NeighborOf(X, Gene, {co-occur, orth}) → {A1, A1’, A2, A3}
3. Y = Y + {A5, A6} → {A1, A1’, A2, A3, A5, A6}
4. Z = NeighborOf(Y, Gene, {reg}) → {A4, A4’}
X = PathBetween({A4, A4’}, B4, {co-occur, reg, isa})
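The mining steps above can be sketched in Python. This is only a minimal illustration, not the BeeSpace implementation: the function names `neighbor_of` and `path_between` mirror the slide's operators, but their exact semantics are assumed here (one-hop typed neighbors, and a breadth-first shortest path), edge weights are ignored, and the toy edge list is hypothetical rather than a copy of the figure.

```python
from collections import deque

# Hypothetical toy ER graph as (source, relation, target) triples.
# These edges are illustrative only; they are NOT the edges of the slide's figure.
EDGES = [
    ("B4", "isa", "B1"),
    ("B4", "co-occur", "B2"),
    ("B1", "co-occur", "A1"),
    ("B2", "co-occur", "A2"),
    ("A1", "orth", "A1'"),
    ("A4", "reg", "A1"),
    ("A4'", "reg", "A2"),
]
TYPES = {
    "B1": "Behavior", "B2": "Behavior", "B4": "Behavior",
    "A1": "Gene", "A1'": "Gene", "A2": "Gene", "A4": "Gene", "A4'": "Gene",
}


def build_adjacency(edges):
    """Index every edge in both directions so relations can be followed either way."""
    adj = {}
    for src, rel, dst in edges:
        adj.setdefault(src, []).append((rel, dst))
        adj.setdefault(dst, []).append((rel, src))
    return adj


ADJ = build_adjacency(EDGES)


def neighbor_of(nodes, entity_type, relations):
    """Entities of entity_type one hop from any start node via an allowed relation."""
    if isinstance(nodes, str):
        nodes = {nodes}
    found = set()
    for node in nodes:
        for rel, other in ADJ.get(node, []):
            if rel in relations and TYPES.get(other) == entity_type:
                found.add(other)
    # Assumed design choice: the start nodes themselves are not reported as neighbors.
    return found - set(nodes)


def path_between(sources, target, relations):
    """Breadth-first search for one shortest path from any source to target."""
    ordered = sorted(sources)              # fixed order keeps the result deterministic
    queue = deque([src] for src in ordered)
    seen = set(ordered)
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for rel, other in ADJ.get(path[-1], []):
            if rel in relations and other not in seen:
                seen.add(other)
                queue.append(path + [other])
    return None


X = neighbor_of("B4", "Behavior", {"co-occur", "isa"})
Y = neighbor_of(X, "Gene", {"co-occur", "orth"})
Z = neighbor_of(Y, "Gene", {"reg"})
P = path_between(Z, "B4", {"co-occur", "reg", "isa"})
print(X, Y, Z, P)
```

Each call returns a plain set, so a user can edit an intermediate result by hand (as step 3 does with `Y + {A5, A6}`) before passing it to the next operator, which is what makes the mining interactive.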
9
Inference-Based Discovery
• Encode all kinds of knowledge in the same knowledge
representation language
• Perform logic inferences
• Example
Regulate(GeneA, GeneB, ContextC). [Text mining]
SeqSimilar(GeneA, GeneA’). [Sequence mining]
Regulate(X,Y,C) ← Regulate(Z,Y,C) & SeqSimilar(X,Z). [Human knowledge]
⇒ Regulate(GeneA’, GeneB, ContextC)
ADD: InPathway(GeneB, P1)
InPathway(X,P) ← Regulate(X,Y,C) & InPathway(Y,P). [Human knowledge]
⇒ InvolvedInPathway(GeneA’, P1)
10
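The inference example on the slide above can be reproduced with a small forward-chaining loop. This is a sketch under stated assumptions, not a proposed engine: the two rules are hard-coded rather than parsed from a rule language, and SeqSimilar is assumed to be symmetric (the slide's conclusion about GeneA’ only follows if SeqSimilar(GeneA’, GeneA) holds). The derived pathway fact uses the rule's head predicate, InPathway, where the slide labels the conclusion InvolvedInPathway.

```python
# Facts are tuples of the form (predicate, arg1, arg2, ...).
FACTS = {
    ("Regulate", "GeneA", "GeneB", "ContextC"),  # from text mining
    ("SeqSimilar", "GeneA", "GeneA'"),           # from sequence mining
    ("InPathway", "GeneB", "P1"),                # the fact added on the slide
}


def apply_rules(facts):
    """Apply every rule once and return all facts that can be concluded."""
    regulate = [f for f in facts if f[0] == "Regulate"]
    similar = [f for f in facts if f[0] == "SeqSimilar"]
    pathway = [f for f in facts if f[0] == "InPathway"]
    derived = set()

    # ASSUMPTION (not stated on the slide): sequence similarity is symmetric.
    for _, a, b in similar:
        derived.add(("SeqSimilar", b, a))

    # Regulate(X,Y,C) <- Regulate(Z,Y,C) & SeqSimilar(X,Z)
    for _, z, y, c in regulate:
        for _, x, z2 in similar:
            if z2 == z:
                derived.add(("Regulate", x, y, c))

    # InPathway(X,P) <- Regulate(X,Y,C) & InPathway(Y,P)
    for _, x, y, _c in regulate:
        for _, y2, p in pathway:
            if y2 == y:
                derived.add(("InPathway", x, p))

    return derived


def forward_chain(facts):
    """Repeat rule application until no new fact appears (a fixpoint)."""
    known = set(facts)
    while True:
        new = apply_rules(known) - known
        if not new:
            return known
        known |= new


closure = forward_chain(FACTS)
for fact in sorted(closure):
    print(fact)
```

Because all facts share one representation, knowledge from text mining, sequence mining, and human experts can be combined by the same loop, which is the point of encoding everything in a single knowledge representation language.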
Integration of Expert Knowledge
• How can we combine expert knowledge
with knowledge extracted from literature?
• Possible strategies:
– Interactive mining (human knowledge is used
to guide the next step of mining)
– Inference-based integration
– Trainable programs (focused miner, targeting
at certain kind of knowledge)
11
2. We need multiple-mode information
access
Querying/Browsing
Researcher
Recommendation
How can we connect the right information
with the right user at the right time?
12
Collaborative Surfing [Wang et al. 09]
Browsing and querying are tightly integrated
Search log organized as a topic map
A sustained way of collaborative surfing
13
News Recommender for Facebook
[Gupta et al. 09]
Recommendation of research papers?
14
3. We need to go beyond
information access to support tasks
• Research topic identification
– “hot topic” retrieval, interdisciplinary topic
retrieval, topic recommendation
• Literature review
– automatic survey generation
• Collaborator recommendation
– To work on an emerging interdisciplinary topic
– To work on a joint grant proposal
• Hypothesis generation & testing (question
answering)
15
Topical Trends in KDD [Mei & Zhai 05]
[Figure: line plot of normalized strength of theme (y-axis, 0 to 0.02) against time in years (x-axis, 1999 to 2004) for seven themes: Biology Data, Web Information, Time Series, Classification, Association Rule, Clustering, Business.]
16
Theme Evolution Graph [Mei & Zhai 05]
[Figure: themes laid out along a time axis (1999, 2000, 2001, 2002, 2003, 2004) and linked by evolution edges. Each theme is shown as its top words with probabilities:]
• gene 0.0173, expressions 0.0096, probability 0.0081, microarray 0.0038, …
• marketing 0.0087, customer 0.0086, model 0.0079, business 0.0048, …
• rules 0.0142, association 0.0064, support 0.0053, …
• SVM 0.007, criteria 0.007, classification 0.006, linear 0.005, …
• decision 0.006, tree 0.006, classifier 0.005, class 0.005, Bayes 0.005, …
• web 0.009, classification 0.007, features 0.006, topic 0.005, …
• mixture 0.005, random 0.006, cluster 0.006, clustering 0.005, variables 0.005, …
• Classification 0.015, text 0.013, unlabeled 0.012, document 0.008, labeled 0.008, learning 0.007, …
• Information 0.012, web 0.010, social 0.008, retrieval 0.007, distance 0.005, networks 0.004, …
• topic 0.010, mixture 0.008, LDA 0.006, semantic 0.005, …
17
Comparing News Articles [Zhai et al. 04]
Iraq War (30 articles) vs. Afghan War (26 articles)
The common theme indicates that “United Nations” is involved in both wars
Cluster 1
Common theme: united 0.042, nations 0.04, …
Iraq theme: n 0.03, Weapons 0.024, Inspections 0.023, …
Afghan theme: Northern 0.04, alliance 0.04, kabul 0.03, taleban 0.025, aid 0.02, …
Cluster 2
Common theme: killed 0.035, month 0.032, deaths 0.023, …
Iraq theme: troops 0.016, hoon 0.015, sanches 0.012, …
Afghan theme: taleban 0.026, rumsfeld 0.02, hotel 0.012, front 0.011, …
Cluster 3
…
Collection-specific themes indicate different roles of “United Nations” in the two wars
Imagine we can compare literature in two related areas…
18
BeeSpace System [He et al. 10]
Task support + ER Question answering
4. Personalization & Workflow
Management
• Different users have different tasks → personalization
– Tracking a user’s history and learning a user’s
preferences
– Exploiting the preferences to
customize/optimize the support
– Allowing a user to define/build special function
modules
• Workflow management
20
UCAIR: User-Centered Adaptive IR [Shen et al. 05]
When a user clicks on the “back” button after viewing a document, UCAIR reranks unseen results to pull up documents similar to the one the user has viewed
21
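The reranking behavior described for UCAIR can be illustrated with a short Python sketch. It is not the UCAIR algorithm from [Shen et al. 05]: cosine similarity over raw term counts is swapped in as a simple stand-in for whatever user model the system actually uses, and the function names and sample documents are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for a whitespace-tokenized, lowercased text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rerank_unseen(results, seen_ids, viewed_id):
    """Reorder results the user has not seen by similarity to the viewed document.

    results   -- list of (doc_id, text) in the original rank order
    seen_ids  -- ids of results the user has already looked at
    viewed_id -- id of the document the user just returned from
    """
    texts = dict(results)
    viewed_vec = vectorize(texts[viewed_id])
    unseen = [(doc_id, text) for doc_id, text in results if doc_id not in seen_ids]
    # sorted() is stable, so ties keep their original rank order.
    ranked = sorted(
        unseen,
        key=lambda item: cosine(viewed_vec, vectorize(item[1])),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked]


# Hypothetical result list for the ambiguous query "jaguar".
results = [
    ("d1", "jaguar car dealer prices"),
    ("d2", "jaguar animal habitat"),
    ("d3", "jaguar car review"),
    ("d4", "jaguar animal rainforest"),
]
# The user viewed d2 (the animal sense) and pressed "back".
reranked = rerank_unseen(results, seen_ids={"d1", "d2"}, viewed_id="d2")
print(reranked)
```

In this toy case the unseen document about the animal moves above the one about the car, since it shares more terms with the document the user chose to view.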
5. Collaborative Research
Information/Knowledge/Workflow Sharing
• Different users may perform similar tasks → Information/Knowledge/workflow sharing
– Capturing user intentions
– Recommend information/knowledge/workflow
– How do we solve the problem of privacy?
• Massive collaborations?
– Each user contributes a small amount of
knowledge
– All the knowledge can be combined to infer new
knowledge
– An ESP-like online game for discovery?
22
Knowledge Synthesis & Discovery Game
(inspired by the ESP game)
Which of the following genes is
likely associated with
foraging behavior?
Hypothesis
Selection
Ontology
Mapping
…
Which of the following
concepts can also
describe “car”?
Bonus score
based on validation
in publication
Immediate Scoring
based on Consensus
Hypothesis
Selection
Ontology
Mapping
…
…
23
Big Challenges
1. What’s the right system architecture (= sharing model?)?
centralized vs. distributed, client vs. server
2. How can we sustain sharing and massive collaboration?
open system, “plug and play”, KSD game …
3. How can we seamlessly support multiple-level integration?
4. Specific computational challenges:
-- Large-scale NLP, particularly information extraction (Large-scale machine learning and knowledge base?)
-- Large-scale semantic mapping (ontology)
-- Interactive fuzzy ER graph mining
-- Scalable inference engines (probabilistic datalog)
…
24
A Possible System Architecture
[Figure: system architecture diagram. Labeled components: User; User Interface / Workflow Manager; Search & Navigation; Special Search; Information Retrieval; Data/Info + Ontology; Inference Engine; Modeling & Personalization; Analysis Engine; NLP; Machine Learning; Information Extraction; ER Graph Mining; Entities; Relations; Hypothesis; Knowledge Base; Expert Knowledge; NCBI Genome Databases; …]
25
References
[1] Xuanhui Wang, Bin Tan, Azadeh Shakery, ChengXiang Zhai. Beyond Hyperlinks: Organizing
Information Footprints in Search Logs to Support Effective Browsing. Proceedings of the
18th ACM International Conference on Information and Knowledge Management (CIKM'09),
pages 1237-1246, 2009. http://doi.acm.org/10.1145/1645953.1646110
[2] Manish Agrawal, Maryam Karimzadehgan, ChengXiang Zhai. An Online News
Recommender System for Social Networks. Proceedings of the ACM SIGIR 2009 Workshop on
Search in Social Media, 2009. http://times.cs.uiuc.edu/czhai/pub/sigir09ssm-facebook.pdf
[3] Qiaozhu Mei, ChengXiang Zhai. Discovering Evolutionary Theme Patterns from Text: An
Exploration of Temporal Text Mining. Proceedings of the 2005 ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (KDD'05), pages 198-207, 2005.
http://doi.acm.org/10.1145/1081870.1081895
[4] ChengXiang Zhai, Atulya Velivelli, Bei Yu. A cross-collection mixture model for comparative
text mining. Proceedings of ACM KDD 2004 (KDD'04), pages 743-748, 2004.
http://doi.acm.org/10.1145/1014052.1014150
[5] Xin He, Yanen Li, Radhika Khetani, Barry Sanders, Yue Lu, Xu Ling, ChengXiang Zhai, Bruce
Schatz. BSQA: integrated text mining using entity relation semantics extracted from
biological literature of insects. Nucleic Acids Research, 38(Web Server issue):W175-W181, 2010.
http://nar.oxfordjournals.org/cgi/content/full/38/suppl_2/W175
[6] Xuehua Shen, Bin Tan, ChengXiang Zhai. Implicit User Modeling for Personalized Search.
Proceedings of the 14th ACM International Conference on Information and Knowledge
Management (CIKM'05), pages 824-831, 2005. http://doi.acm.org/10.1145/1099554.1099747
[7] Qiaozhu Mei, ChengXiang Zhai. Generating Impact-Based Summaries for Scientific Literature.
Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human
Language Technologies (ACL-08:HLT), pages 816-824, 2008.
http://www.aclweb.org/anthology/P/P08/P08-1093.pdf
26