Case Studies and Future - Artificial Intelligence Laboratory


Knowledge Management Systems:
Development and Applications
Part III: Case Studies and Future
Hsinchun Chen, Ph.D.
McClelland Professor,
Director, Artificial Intelligence Lab and Hoffman E-Commerce Lab
The University of Arizona
Founder, Knowledge Computing Corporation
Acknowledgement: NSF DLI1, DLI2, NSDL, DG, ITR, IDM, CSS, NIH/NLM, NCI, NIJ, CIA, NCSA, HP, SAP
Knowledge Management Systems:
Case Studies
Multi-lingual Knowledge Portal
(1M):
Meta searching, post-retrieval
analysis, summarization,
categorization, AI Lab toolkits
• Knowledge Portals are online searching systems that provide a large amount of information resources and services within a specific domain.
– Providing frequently updated and highly domain-specific information.
– Providing efficient and precise searching service.
– Providing advanced analysis functionalities which can help users find the information needed among a huge amount of data.
– Providing additional tools such as Personalization and Alerting System to facilitate the searching tasks.
NanoPort: Knowledge Portal for Nanotechnology Researchers
• Goal:
– Providing information services to nanotechnology researchers.
– The design of the content and function is based on the feedback of Nanoscale Science and Engineering (NSSE) experts.
• Content:
– 1,000,000 high quality nanotechnology-related webpages in database.
– Meta-search 4 search engines, 5 online databases and 3 online journals
• Key Features:
– Dynamic summarization
– Folder display
– Visualization using self-organizing map (SOM)
– Patent analysis
• Funding:
– US National Science Foundation (NSF) Nano Initiative
• Demo:
– http://nanoport.org/
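The meta-search step can be illustrated with a minimal sketch. It assumes each source returns a ranked list of URLs and merges them round-robin while dropping duplicates; the function name `merge_results` and the example URLs are hypothetical, and the slides do not say how NanoPort actually merges its sources.

```python
from itertools import zip_longest


def merge_results(result_lists):
    """Interleave ranked result lists round-robin, dropping duplicate URLs."""
    seen = set()
    merged = []
    # zip_longest yields the rank-1 hits of every source, then the rank-2 hits, ...
    for tier in zip_longest(*result_lists):
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


# Hypothetical ranked results from two search engines and one journal site
engine_a = ["http://a.example/1", "http://b.example/2"]
engine_b = ["http://b.example/2", "http://c.example/3"]
journal = ["http://d.example/4"]
merged = merge_results([engine_a, engine_b, journal])
```

Round-robin interleaving keeps the top hit of every source near the top of the merged list, which is one simple way to avoid favoring a single engine.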
[Screenshots: folder display; visualization with SOM; the original page and its summary]
Search page: input keywords; select search engines; select online databases; select online journals
Summarizer: summarize result dynamically; highlight the summary in the original page with corresponding color; click on a summary sentence to jump to its position in the original page
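A rough sketch of query-biased summarization that also records where each summary sentence sits in the original page, so a viewer could highlight it or jump to it. The scoring rule (count of query keywords per sentence) and the function name `summarize` are assumptions for illustration, not the AI Lab summarizer.

```python
import re


def summarize(text, keywords, max_sentences=3):
    """Return (offset, sentence) pairs for the highest-scoring sentences."""
    keywords = {k.lower() for k in keywords}
    scored = []
    for match in re.finditer(r"[^.!?]+[.!?]", text):
        raw = match.group()
        sentence = raw.strip()
        words = re.findall(r"\w+", sentence.lower())
        score = sum(1 for w in words if w in keywords)
        if score > 0:
            # Offset of the first non-space character of the sentence
            offset = match.start() + (len(raw) - len(raw.lstrip()))
            scored.append((score, offset, sentence))
    # Best scores first, ties broken by position; then restore page order
    top = sorted(scored, key=lambda s: (-s[0], s[1]))[:max_sentences]
    return sorted((offset, sentence) for _, offset, sentence in top)


page = ("Nanotubes are strong. The weather is nice. "
        "Carbon nanotubes conduct heat well.")
summary = summarize(page, ["nanotubes", "carbon"], max_sentences=2)
```

Because each sentence comes back with its character offset, `page[offset:offset + len(sentence)]` recovers the exact span to highlight in the original page.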
MedTextus: English Medical Intelligence
• Goal:
– Providing information services to researchers in medical domain.
• Content:
– Meta-search 5 large medicine-related online databases and journals.
• Key Features:
– Keyword suggester
– Folder display
– Visualization using SOM
• Funding:
– US National Library of Medicine (NLM)
• Demo:
– http://ai23.bpa.arizona.edu/medtextus/
[Screenshots: folder display; visualization with SOM; result page]
Search page: select databases; input keywords; keyword suggester (keywords suggested by the system); advanced search options
eBizPort: English Business Intelligence
• Goal:
– Providing business, trading and financial information services to
commercial users.
• Content:
– 500,000 high quality webpages in database.
– Meta-search 10 authoritative online business magazines.
• Key Features:
– Search by date
– Keyword suggester
– Dynamic summarization
– Folder display
– Visualization using SOM
• Demo:
– http://ai18.bpa.arizona.edu:8080/ebizport/
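The SOM visualization listed among the key features can be sketched in a few lines of plain Python. This is a toy self-organizing map under stated assumptions: documents are already represented as small numeric vectors with values in [0, 1], and the grid size, learning rate, and neighborhood schedule are arbitrary illustrative choices rather than the portal's settings.

```python
import math
import random


def best_matching_unit(grid, vec):
    """Grid cell whose weight vector is closest to vec."""
    return min(grid, key=lambda node: math.dist(grid[node], vec))


def train_som(vectors, rows=3, cols=3, epochs=50, seed=0):
    """Train a small SOM; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    grid = {(r, c): [rng.random() for _ in range(dim)]
            for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs          # decays from 1 toward 0
        rate = 0.5 * frac                    # learning rate
        radius = 1.0 + max(rows, cols) / 2 * frac
        for vec in vectors:
            bmu = best_matching_unit(grid, vec)
            for node, weights in grid.items():
                dist = math.dist(node, bmu)  # distance on the map grid
                if dist <= radius:
                    influence = math.exp(-dist ** 2 / (2 * radius ** 2))
                    for i in range(dim):
                        weights[i] += rate * influence * (vec[i] - weights[i])
    return grid


# Hypothetical 3-term document vectors: two topics
docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0], [0.0, 0.1, 0.9]]
grid = train_som(docs)
cells = [best_matching_unit(grid, d) for d in docs]
```

Each document is assigned to the map cell with the nearest weight vector; documents with similar vectors tend to land on the same or neighboring cells, which is what the map display exploits.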
[Screenshots: result page; folder display and SOM]
Search page: keyword suggester (keywords suggested by the system); limit the date of the result pages
Result page: date of the result page
Chinese Medical Intelligence (CMI)
• Goal:
– Providing medical and health information services to both researchers and the public.
• Content:
– 350,000 high quality medical-related webpages collected from mainland China,
Hong Kong and Taiwan.
– Meta-search 3 large general Chinese search engines.
• Key Features:
– Built-in Simplified/Traditional Chinese encoding conversion
– Dynamic summarization for both Simplified and Traditional Chinese
– Automatic categorization
– Visualization using SOM
• Demo:
– http://128.196.40.169:8000/gbmed/
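A minimal sketch of Simplified/Traditional conversion, assuming a character-level mapping table and the GB2312 and Big5 byte encodings. The five-character table is purely illustrative (a real table covers thousands of characters, and some Simplified characters map to several Traditional ones); `convert` and `big5_to_gb` are hypothetical names, not the portal's code.

```python
# Tiny illustrative Traditional -> Simplified table
TRAD_TO_SIMP = {"醫": "医", "學": "学", "資": "资", "訊": "讯", "灣": "湾"}
SIMP_TO_TRAD = {simp: trad for trad, simp in TRAD_TO_SIMP.items()}


def convert(text, table):
    """Map each character through the table, leaving unknown ones as-is."""
    return "".join(table.get(ch, ch) for ch in text)


def big5_to_gb(raw):
    """Decode Big5 bytes, convert to Simplified, re-encode as GB2312."""
    return convert(raw.decode("big5"), TRAD_TO_SIMP).encode("gb2312")


simplified = convert("醫學資訊", TRAD_TO_SIMP)
gb_bytes = big5_to_gb("醫學資訊".encode("big5"))
```

Converting every page to one script before indexing is one way a single query could retrieve results from both Simplified and Traditional sources.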
[Screenshots: Simplified Chinese summary; Traditional Chinese summary; Chinese folder display; Chinese visualization with SOM]
Search page: select search engines from mainland China, Hong Kong and Taiwan; select websites from mainland China, Hong Kong and Taiwan
Result page: results are from both Simplified and Traditional Chinese; original encoding of the result; Traditional Chinese results have been converted into Simplified Chinese
Summarizer: Simplified/Traditional Chinese summarization
Chinese Business Intelligence (CBI)
• Goal:
– Providing business, trading and financial information services to Chinese
commercial users.
• Content:
– 300,000 high quality webpages collected from Mainland China, Hong Kong
and Taiwan.
• Key Features:
– Built-in Simplified/Traditional Chinese encoding conversion
– Dynamic summarization for both Simplified and Traditional Chinese
– Folder display
– Visualization using SOM
• Demo
– http://ai14.bpa.arizona.edu:8081/nanoport/
[Screenshots: Chinese folder display; Chinese summarizer with Simplified Chinese summary and Traditional Chinese summary; Chinese visualization with SOM]
Search page: the largest business, trading and financial websites in mainland China, Hong Kong and Taiwan
Result page: both Simplified and Traditional Chinese results are returned
Spanish Business Intelligence Portal
Search Page
• Keyword: comercio electronico
• Keyword suggestion from Scirus and Concept Space
• Detailed directory of Spanish business resources on the Web
• Search, Organize, or Visualize results
• Meta searches 7 major sources and provides searching of its own collection (PIN)
• Supports boolean searching and allows the display of 10, 20, 30, 50, or 100 results per meta searcher
Result Page
• Results organized by meta searchers
• Automatic keyword suggestion
Summarizer
• Summarize in 3 or 5 sentences
• Three-sentence summary on left; original page shown on right
Categorizer
• Web pages grouped by key phrases extracted by mutual information algorithm (non-exclusive categorization)
Visualizer
• Web pages visualized by self-organizing map (SOM) algorithm
Spanish Business Taxonomy
• Web sites about the topic “Electronic Commerce” in Spanish speaking countries
Arabic Medical Intelligence Portal
[Screenshots: search page, which provides a virtual Arabic keyboard to facilitate input; result page; categorizer; visualizer]
Lessons Learned
• The content selection and functionality design of a knowledge portal should meet the needs of real users.
• Using meta-search together with other traditional data
collecting methods can improve the recall without sacrificing
the precision of the knowledge portal.
• The structure of the webpage may introduce noise into the
dynamic summary.
• The AI Lab toolkits support scalable multi-lingual spidering,
indexing, searching, summarization, and categorization
• New Spanish and Arabic portals completed
• New cross-lingual web retrieval engine completed
Biomedical Informatics (10M):
Biomedical content,
biomedical ontologies,
linguistic phrasing,
categorization, text mining
HelpfulMED Search of Medical Websites
HelpfulMED search of Evidence-based Databases
What does database cover?
Search which databases?
How many documents?
Enter search term
Consulting HelpfulMED Cancer Space (Thesaurus)
Enter search term
Select relevant search terms
New terms are posted
Search again...
Or find relevant webpages
Browsing HelpfulMED Cancer Map
[Screenshots, steps 1-5: Visual Site Browser; top level map; Diagnosis, Differential; Brain Neoplasms; Brain Tumors]
Genescene Overview
Knowledge Base
Integrate gene relations from
literature and outside databases
and provide knowledge for
learning and evaluation in data
mining
Text Mining
Process Medline abstracts
and extract gene relations
automatically from the text
Data Mining
Process gene expression data
(and existing knowledge) and
use different algorithms to
extract regulatory networks
Interface & Visualization
Allow searching for keywords, display a map of the
relations extracted from the text and/or from the
microarray
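As a simple baseline for the text mining component, gene relations can be approximated by counting which gene names co-occur in the same sentence. This sketch assumes a small hypothetical gene lexicon and naive tokenization; it is not the GeneScene parser, which the slides describe as using linguistic parsing.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrences(sentences, lexicon):
    """Count sentence-level co-occurrences between lexicon terms."""
    counts = Counter()
    for sentence in sentences:
        tokens = set(re.findall(r"[A-Za-z0-9]+", sentence))
        found = sorted(tokens & lexicon)
        # Every unordered pair of genes mentioned together counts once
        counts.update(combinations(found, 2))
    return counts


# Sentences taken from the E2F1 abstract used in this section
sentences = [
    "E2F1 signals p53-dependent apoptosis",
    "E2F1 acts upstream of p53",
    "Inactivation of the pRb proteins induces p53-dependent apoptosis",
]
genes = {"E2F1", "p53", "pRb"}  # hypothetical lexicon
pairs = cooccurrences(sentences, genes)
```

Co-occurrence says only that two genes are mentioned together; it does not say which one acts on the other, which is why a relation grammar is needed for directed pathways.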
Genescene Overview
[Architecture diagram. Labelled components:
– Sources: publications (Medline) through an XML parser, giving titles & abstracts and publications & meta information; ontologies and external databases (HUGO, GO, UMLS, JIF); microarray data
– Knowledge base
– Text mining: relation parsers (lexical lookup with UMLS, AZ Noun Phraser, POS tagging, adjuster & tagger, full parser, FSA relation grammar, feature structures) producing relations in flat files; Concept Space producing co-occurrence relations in flat files; GeneScene Text Mart
– Data mining: Bayesian networks, association rule mining; GeneScene Data Mart
– GeneScene: information retrieval and visualization (spring algorithm)]
Problem: Gene Pathway
• Title: Key roles for E2F1 in signaling p53-dependent apoptosis and in cell division within developing tumors.
• Abstract: Apoptosis induced by the p53 tumor suppressor can attenuate cancer growth in preclinical animal models. Inactivation of the pRb proteins in mouse brain epithelium by the T121 oncogene induces aberrant proliferation and p53-dependent apoptosis. p53 inactivation causes aggressive tumor growth due to an 85% reduction in apoptosis. Here, we show that E2F1 signals p53-dependent apoptosis since E2F1 deficiency causes an 80% apoptosis reduction. E2F1 acts upstream of p53 since transcriptional activation of p53 target genes is also impaired. Yet, E2F1 deficiency does not accelerate tumor growth. Unlike normal cells, tumor cell proliferation is impaired without E2F1, counterbalancing the effect of apoptosis reduction. These studies may explain the apparent paradox that E2F1 can act as both an oncogene and a tumor suppressor in experimental systems.
Action / Protocols / Graphic Representation
[Figure: an expert's think-aloud protocol while reading the abstract, with the pathway graph redrawn after each step; graph nodes: E2F1, p53, apoptosis, tumor growth]
• reads: "E2F1 signals p53-dependent apoptosis"
• infers: "So, I'm assuming... a straight line pathway..."
• Expert errs and corrects
• reads: "E2F1 acts upstream of p53"
• reads: "E2F1 deficiency does not accelerate tumor growth"
• Final graph
Prepositions: OF/BY/IN
[Diagram: finite state automaton for relation parsing, states q0 to q18. Transition labels: NP; Aux; mod; verb; Negation; Nominalization (-ion); Adjective, noun, verb (-ed); OF; BY; IN]
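To make the idea of a preposition-driven relation automaton concrete, here is a deliberately tiny sketch that accepts phrases of the form "nominalization OF NP [BY NP]". The six states, the tagging rule (words ending in "-ion" are nominalizations), and the slot names are all assumptions for illustration; they do not reproduce the q0-q18 automaton on the slide.

```python
# Hypothetical transition table: (state, token label) -> next state
TRANSITIONS = {
    ("q0", "NOM"): "q1",
    ("q1", "OF"): "q2",
    ("q2", "NP"): "q3",
    ("q3", "BY"): "q4",
    ("q4", "NP"): "q5",
}
ACCEPTING = {"q3", "q5"}


def tag(token):
    """Crude tagger: prepositions, '-ion' nominalizations, else noun phrase."""
    lower = token.lower()
    if lower in ("of", "by"):
        return lower.upper()
    if lower.endswith("ion"):
        return "NOM"
    return "NP"


def parse_relation(phrase):
    """Run the automaton; return filled slots or None if the phrase is rejected."""
    state = "q0"
    slots = {}
    for token in phrase.split():
        label = tag(token)
        state = TRANSITIONS.get((state, label))
        if state is None:
            return None
        if label == "NOM":
            slots["action"] = token
        elif label == "NP":
            # The NP after OF is the theme; the NP after BY is the agent
            slots["theme" if state == "q3" else "agent"] = token
    return slots if state in ACCEPTING else None


relation = parse_relation("inactivation of pRb by T121")
```

Here the phrase is read as "T121 inactivates pRb": the OF slot gives what is acted on and the BY slot gives what acts.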
Example Map (one abstract)
[Screenshots: select interesting relations to visualize; overview; double click to expand; expanded node]
Finding the truth: p38 acts as a negative feedback for Ras signaling
Lessons Learned:
• Biomedical information is precise but terminologies are fluid
• SOM performance for medical documents = 80%
• Biomedical professionals need search and analysis help
• Biomedical linguistic parsing and ontologies are promising for biomedical text mining
• The need for integrated biomedical data (gene microarray) and text mining (literature)
• New testbeds completed: p53, AP1, and yeast
COPLINK Crime Data Mining
(10M):
Intelligence and security
informatics, crime association,
crime network analysis and
visualization
COPLINK Connect
Consolidating & Sharing Information promotes
problem solving and collaboration
Records
Management
Systems (RMS)
Gang Database
Mugshots
Database
COPLINK Connect Functionality
• Generic, common XML based criminal elements
representation
• Data migration (batch and incremental) and mapping for all
major databases and legacy systems
• Database independent: ODBC-compliant data warehouse
• Multi-layered Web-based architecture: database server, Web
server, browser
• Powerful and flexible search tools for various reports, e.g.,
incidents, warrants, pawns, etc.
• Graphical browser-based GUI interface for ease of use,
training and maintenance
H. Chen, J. Schroeder, R. V. Hauck, L. Ridgeway, H. Atabakhsh, H. Gupta, C. Boarman, K.
Rasmussen, and A. W. Clements, “COPLINK Connect: Information and Knowledge Management
for Law Enforcement,” Decision Support Systems, Special Issue on Digital Government, 2003.
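A common XML representation for criminal elements might look roughly like the following. The element names (`incident`, `person`, `vehicle`, `location`) and the record layout are hypothetical; the slides do not show the actual COPLINK schema.

```python
import xml.etree.ElementTree as ET


def incident_to_xml(record):
    """Serialize one incident record into a generic XML element."""
    root = ET.Element("incident", id=record["id"])
    for kind in ("person", "vehicle", "location"):
        for value in record.get(kind, []):
            ET.SubElement(root, kind).text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical record migrated from a records management system
record = {"id": "2003-0001", "person": ["John Doe"], "vehicle": ["ABC123"]}
xml_text = incident_to_xml(record)
parsed = ET.fromstring(xml_text)
```

Mapping every source system into one element vocabulary like this is what lets the same search tools run over records, gang, and mugshot databases.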
COPLINK Detect
Consolidated information enables targeted problem solving
via powerful investigative criminal association analysis
COPLINK Detect Functionality
• Simple association rule mining applied to criminal
elements relationships
• Generic, common XML based representation for
criminal relationships
• Incremental data migration and association analysis
on databases
• Support powerful, multi-attribute queries using
partial crime information
• Graphical browser-based GUI interface for simple
crime relationship analysis and case retrieval
H. Chen, D. Zeng, H. Atabakhsh, W. Wyzga, J. Schroeder, “COPLINK: Managing
Law Enforcement Data and Knowledge,” Communications of the ACM, 2003.
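Simple association analysis over criminal elements can be sketched as pairwise support and confidence computed from incident records. The entity labels, the `min_support` threshold, and the function name are assumptions for illustration; the cited papers describe the deployed method.

```python
from collections import Counter
from itertools import combinations


def association_rules(incidents, min_support=2):
    """Return {(a, b): confidence of a -> b} for pairs seen together often enough."""
    item_counts = Counter()
    pair_counts = Counter()
    for entities in incidents:
        unique = sorted(set(entities))
        item_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    rules = {}
    for (a, b), n in pair_counts.items():
        if n >= min_support:
            rules[(a, b)] = n / item_counts[a]  # share of a's incidents that include b
            rules[(b, a)] = n / item_counts[b]
    return rules


# Hypothetical incidents, each listing the entities recorded in it
incidents = [
    ["person:A", "vehicle:V1"],
    ["person:A", "vehicle:V1", "person:B"],
    ["person:A", "person:B"],
    ["person:B"],
]
rules = association_rules(incidents)
```

In this toy data every incident involving vehicle V1 also involves person A, so a query on the partial clue "V1" would rank person A first.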
COPLINK Detect 2.0/2.5
COPLINK Connect/Detect Status
• Systems stable and shown useful. Commercialized
and supported by KCC
• Systems deployed at: TPD, UAPD, PPD, Phoenix,
Huntsville (TX), Des Moines (Iowa), Ann Arbor
(Michigan), Boston (Massachusetts), Montgomery
County (sniper investigation)
• Systems under deployment: Salt River (AZ),
Cambridge (Massachusetts), Redmond
(Washington), many others
• COPLINK acclaimed in the LA Times, the New York
Times, and Newsweek (sniper investigation)
COPLINK Visual Data Mining
Research
COPLINK Criminal Network Analysis: Association Tree,
Association Network Analysis, Temporal-Spatial Visualization
• P1000: A Picture is worth 1000 words.
• Use visual representations and effective HCI to assist in more
efficient and effective crime analysis
• Leverage different representations and algorithms: hyperbolic
trees, network placement algorithms, structural analysis, geospatial mapping, time visualization
H. Chen, D. Zeng, H. Atabakhsh, W. Wyzga, J. Schroeder, “COPLINK: Managing
Law Enforcement Data and Knowledge,” Communications of the ACM, 2003.
A 9/11 Terrorist Network
COPLINK Association Tree and Network (2nd generation)
Figure 1a: Relations among multiple criminal elements are shown on both a hyperbolic tree (right) and a hierarchical list (left).
Figure 1b: A hyperbolic tree with multiple levels of investigative leads.
Figure 2a: The initial layout of a criminal network before analysis.
Figure 2b: The network is analyzed and automatically adjusted to reflect subgroups and central criminal figures.
Figure 2c: A user may choose only the type that is of interest (e.g., person) and view crime associations (e.g., person name, address).
COPLINK Criminal Structural Analysis (3rd
generation)
• Criminal association identification
– Using shortest-path algorithms to find the
strongest associations between two or more
criminals in a network
• SNA (Social Network Analysis)
– Using blockmodel analysis to detect subgroups
and patterns of interactions between groups
– Identifying leaders, gatekeepers, and outliers
from a criminal network
J. Xu & H. Chen, “Criminal Network Analysis: A Data Mining Perspective,” Decision
Support Systems, 2004, forthcoming.
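One way to use a shortest-path algorithm to find the strongest association is to treat each link weight as a strength in (0, 1], define a path's strength as the product of its link strengths, and run Dijkstra's algorithm on the negative logarithms. That transformation is an assumption made for this sketch; the slide names shortest-path algorithms without giving the weighting scheme.

```python
import heapq
import math


def strongest_path(graph, source, target):
    """graph: {node: {neighbor: strength in (0, 1]}}. Returns (path, strength)."""
    best = {source: 0.0}
    previous = {}
    queue = [(0.0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            break
        if cost > best.get(node, math.inf):
            continue  # stale queue entry
        for neighbor, weight in graph.get(node, {}).items():
            # Minimizing the sum of -log(w) maximizes the product of w
            new_cost = cost - math.log(weight)
            if new_cost < best.get(neighbor, math.inf):
                best[neighbor] = new_cost
                previous[neighbor] = node
                heapq.heappush(queue, (new_cost, neighbor))
    if target not in best:
        return None, 0.0
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return path[::-1], math.exp(-best[target])


# Hypothetical association strengths between three individuals
network = {
    "A": {"B": 0.9, "C": 0.5},
    "B": {"A": 0.9, "C": 0.8},
    "C": {"A": 0.5, "B": 0.8},
}
path, strength = strongest_path(network, "A", "C")
```

In the example the indirect route through B (0.9 × 0.8 = 0.72) is stronger than the direct link (0.5), so the intermediary B surfaces as an investigative lead.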
The proposed framework
COPLINK SNA Experiment
• Data Sets
– TPD incident summaries
• Time period: Narcotics: 2000-present; Gangs: 1995-present
• Size

              Total # individuals   # sub-networks   Size of sub-networks
  Narcotics   12,842                2,628            1-10: 2,587; 11-20: 31; 21-100: 9; 502: 1
  Gangs       4,376                 289              1-10: 264; 11-20: 20; 21-100: 4; 2,595: 1

– Two testing networks
• Narcotics (60 individuals)
• Gang (24 individuals)
A narcotic network example
[Screenshot callouts:]
• A bubble represents a subgroup labeled by its leader's name
• The size of a bubble is proportional to the number of individuals in the group
• A line implies that some individuals in one group interact with some individuals in the other group. The thicker the link, the more individual interactions between the two groups
• A point represents an individual labeled by his name
• A line represents a link between two persons
• The rankings of the members of a selected group (green).
• Switch between narcotic network and gang network
• Show network and reset network
• Adjust level of details
A gang network example
The leader
The reduced
network structure
A clique
A gatekeeper
Patterns Found
• The chain structure of the
narcotic network
• Implications: disrupt the
network by breaking the
chain
• The star structure of the
gang network
• Implications: disrupt the
network by removing the
leader
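The two disruption strategies can be checked on toy networks with degree centrality: in a star the leader touches every link, while in a chain removing one member splits the network but leaves the remaining links intact. The member names and helper functions are hypothetical, and degree centrality is only one of several measures that could be used to identify a leader.

```python
def degree_centrality(edges):
    """Fraction of the other nodes each node is directly linked to."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    n = len(neighbors)
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


def remove_node(edges, node):
    """Drop every link that touches the given node."""
    return [(a, b) for a, b in edges if node not in (a, b)]


# Hypothetical star-shaped gang network and chain-shaped narcotic network
star = [("leader", m) for m in ("m1", "m2", "m3", "m4")]
chain = [("c1", "c2"), ("c2", "c3"), ("c3", "c4")]

star_scores = degree_centrality(star)
top = max(star_scores, key=star_scores.get)
star_after = remove_node(star, top)
chain_after = remove_node(chain, "c2")
```

Removing the top-ranked node of the star leaves no links at all, whereas removing a middle member of the chain leaves the c3-c4 segment still connected.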
Expert Validation
[Network screenshot with subgroup labels and expert comments:]
• A group of black gangs
• White gangs who were involved in murders and shootings
• White gangs who sold crack cocaine
• “(211) and (173) are best friends”
• “Yes, these two groups are together very often”
• “He is very important. He has a lot of money and sells drugs. His girl friend brings a lot of dancers in the city and buy drugs.”
Lessons Learned:
• Data warehousing and gateway approaches are needed for information consolidation
• XML and data normalization are critical
• Co-occurrence analysis and link analysis are extremely useful for crime investigation
• Visual data mining is essential for criminal network analysis
• Wireless (laptop, PDA, cell phone) application is essential
• KM techniques may create unintended cultural and practice side effects
GetSmart Concept Maps:
Knowledge creation, transfer
and mapping
Meaningful Learning: A Continuum
Meaningful Learning (Creative Production)
• Substantive synthesis
• Relate to experiences
• Intentionally connect to prior knowledge
Most School Learning
• Practice, rehearsal and thoughtful replication contribute to meaningful learning.
Rote Learning
• Memorization
• Unrelated to experience
• No effort to link to existing knowledge
* Adapted from Novak’s model of meaningful learning
Six Steps of Information Search:
A Constructivist Approach
Learners are actively involved in building on what they already know to come to a new understanding of the subject under study.
1. Initiation: Introduce a problem.
2. Selection: Identify a general area for investigation.
3. Exploration: Explore information to form a focus.
4. Formulation
5. Collection: Gather information that defines/supports the focus.
6. Presentation: Summarize the topic and prepare to present to the intended audience.
GetSmart Learning Tools
Digital Library
Curriculum
Keyword Suggestion
Filtered Material
A Place to Store Work
Assignments
Announcements
Linked Resources
Knowledge
Representation
Concept Map
Customized Resources
A Concept Map about Concept Maps
GetSmart Interface
[Screenshot callouts: navigation bar; search tools; concept map management tools; meta search options]
1. By right clicking on a node you can delete the node, change the properties of the node, or add a resource to the node. Resources can be URLs, Maps, or Notes.
2. You can either type a URL, or click the “Add From URL Clipboard” button.
3. This is the clipboard. Simply highlight the URL you would like to add to a node and click OK.
4. Your URL will appear in the window; click the Done button to add it to your map.
Printing
Choosing the Print option will cause a new window to open. This window will show your map, the title of the map, and any URLs, notes, or maps you have linked to your map.
Usage: Overall at UA and VT
• 114 student users – all UA students (54) turned in all
assignments (VT assignments still pending)
• 4,000+ user sessions
• 1,000+ maps created for homework and presentations
• 600+ searches performed
• 50+ maps created as a group
• 40,000+ relationships represented in the maps
Results (1)
• 120 cue phrases were used to extract 37,674 links,
which accounted for 93% of the pool.
• These cue phrases were categorized into the proposed
link types:
– About 50 cue phrases map to the five previously
determined link types: hierarchical,
componential, comparative, influential, and
procedural.
– Over 50% of cue phrases expressed hierarchical and
componential relationships.
– Descriptive relationships accounted for a large
portion (30%), which were analyzed further.
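Categorizing links by cue phrase can be sketched as a lookup over the link label. The cue phrases below are the representative ones listed for each link type in this section; the substring matching rule and the `classify_link` name are assumptions, and substring matching is crude (for example, "like" also matches "unlikely").

```python
# Representative cue phrases per link type, as listed in this section
CUE_PHRASES = {
    "hierarchical": ["example", "such as", "type", "member", "is a"],
    "componential": ["consist", "contain", "include", "part", "made of"],
    "comparative": ["like", "compare", "similar", "differ", "alternative"],
    "influential": ["lead to", "cause", "result", "influence", "determine"],
    "procedural": ["next", "go to", "procedure"],
}


def classify_link(label):
    """Return the first link type whose cue phrase occurs in the label."""
    text = label.lower()
    for link_type, cues in CUE_PHRASES.items():
        if any(cue in text for cue in cues):
            return link_type
    return "descriptive"  # fallback for labels such as "uses" or "implements"


labels = ["is a type of", "consists of", "causes", "uses"]
types = [classify_link(label) for label in labels]
```

Labels that match none of the five predefined types fall into the descriptive bucket, mirroring the large residual category reported above.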
Link Type Distribution
[Bar chart: percentage of links by link type, y-axis 0% to 35%]
• Over 50% of the links expressed hierarchical or componential relationships
• Descriptive relationships accounted for a large portion at 30%, so we further analyze this link type

  Link type      Number*   Percentage   Representative cue phrases
  Hierarchical     8,026     21.30%     example, such as, case, type, member, is a
  Componential    12,307     32.67%     consist, contain, include, compose, part, made of
  Comparative      1,455      3.86%     like, compare, similar, differ, alternative
  Influential      3,635      9.65%     lead to, cause, result, influence, determine
  Procedural       1,097      2.91%     next, go to, procedure
  Descriptive     11,153     29.60%     use/implement/present/advantages/feature
  Sum             37,673    100.00%

* The number of links which had those identified cue phrases in them
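The percentages in the link type table can be re-derived from the link counts with a few lines of arithmetic; only the counts reported above are used.

```python
# Link counts per type, as reported in the table
counts = {
    "Hierarchical": 8026,
    "Componential": 12307,
    "Comparative": 1455,
    "Influential": 3635,
    "Procedural": 1097,
    "Descriptive": 11153,
}

total = sum(counts.values())
shares = {name: round(100 * n / total, 2) for name, n in counts.items()}
structural = shares["Hierarchical"] + shares["Componential"]
```

The counts sum to 37,673 and reproduce the reported percentages; hierarchical and componential links together come to about 54%, consistent with the "over 50%" statement. The total is one less than the 37,674 links quoted in Results (1), a difference the source does not explain.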
Lessons Learned:
• Digital library and concept maps support meaningful learning
• Digital library systems provide support for community knowledge creation.
• Semi-open link systems are useful for capturing knowledge and learning process
• NSDL is not a “library.” It should be a learning or knowledge creation environment.
Knowledge Management Systems:
Future
Other Emerging Categorization
Challenges/Opportunities:
• Multilingual terminology and semantic issues
• Web analysis and categorization issues
• E-Commerce information (transactions)
classification issues
• Multimedia content and wireless delivery
issues
• Future: semantic web, multilingual web,
multimedia web, wireless web!
The Road Ahead
• The Semantic Web: XML, RDF, Ontologies
• The Wireless Web: WML, WIFI, display
• The Multimedia Web: content indexing and analysis
• The Multilingual Web: cross-lingual MT and IR
Requirements For Successful KMS
Implementation (General)
• Sponsor for the application
• Business case for the application clearly
understood and measurable
• High likelihood of having a significant impact
on the business
• Good quality, relevant data in sufficient
quantities
• The right people – business domain, data
management, and data mining experts
Requirements For Successful KMS
Implementation (KM Specific)
• Information overload is more than anyone can
handle
• Productivity gained and decision improvements
evident among knowledge workers
• Organization’s IT infrastructure ready
• Need to integrate with consulting, process,
content, and policy considerations
For Project Information at AI Lab:
• http://ai.bpa.arizona.edu
• [email protected]