Transcript Slide 1

Indexing Mathematical Abstracts by Metadata and Ontology
IMA Workshop, April 26-27, 2004
Su-Shing Chen, University of Florida
Abstract



• OAI extensions to federated search and other services for MathML-based metadata indexing and subject classification of mathematical abstracts.
• Construction of an ontology or conceptual maps of mathematics. Mathematical formulas are treated as elements of the ontology.
• Ontology indexing by clustering mathematical abstracts or full papers into an information visualization interface, so that users may select by ontology as well as by metadata.
A DL Server with OAI Extensions:
Managing the Metadata Complexity
[Figure: DL server architecture. Components shown: a Harvester with its Harvest API; data providers exposing OAI_DC and OAI_XXX metadata; service providers 1 … N (federated search, data mining); users; the Internet; and a Java DataBase Connectivity (JDBC) layer over the harvested metadata, the digested metadata, and the service providers' data.]
A DL Server with OAI Extensions:
Managing the Metadata Complexity
Built-in capabilities:
• Harvester – harvests various OAI-compliant data providers
• Data provider – exposes harvested and existing metadata sets
• Service provider – federated search and data mining capabilities on metadata sets
Harvester
Harvester Interface:
• URL to harvest
• Selective harvesting parameters
[Figure: the Harvester, driven through the Harvester Interface, pulls metadata from the data providers via the Harvest API and stores the harvested metadata in the DL server.]
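The two inputs of the harvester interface map naturally onto an OAI-PMH ListRecords request. The sketch below is an illustration only, not the DL server's code: the base URL, the set name, and the sample response are invented, while `verb`, `metadataPrefix`, `from`, `until`, and `set` are the standard OAI-PMH request arguments.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_list_records_url(base_url, metadata_prefix="oai_dc",
                           from_date=None, until_date=None, set_spec=None):
    """Build a ListRecords URL; the optional arguments select what to harvest."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Extract (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iter("{%s}record" % OAI_NS):
        identifier = record.findtext("{%s}header/{%s}identifier" % (OAI_NS, OAI_NS))
        title = record.findtext(".//{%s}title" % DC_NS)
        records.append((identifier, title))
    return records


# Hand-written sample response (hypothetical record) so the sketch runs offline.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A sample abstract</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""
```

A real harvester would fetch the URL, follow any resumption token the provider returns, and store each record; those steps are left out to keep the sketch short.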
Data Provider

• Expose single or combined harvested metadata sets to other harvesters
• Reformat metadata from different data providers so that it can be harvested by other service providers (e.g., metadata originally in Dublin Core is reformatted to MARC before being exposed)
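The Dublin Core to MARC example can be pictured as a small crosswalk table. This is a sketch under stated assumptions: the four mappings loosely follow the Library of Congress Dublin Core to MARC crosswalk and should be checked against it, and `DC_TO_MARC` and `dc_to_marc` are illustrative names, not part of the DL server.

```python
# Unqualified Dublin Core element -> (MARC tag, subfield code).
# Assumption: a small illustrative subset; verify the tags before relying on them.
DC_TO_MARC = {
    "title": ("245", "a"),
    "creator": ("720", "a"),
    "subject": ("653", "a"),
    "description": ("520", "a"),
}


def dc_to_marc(dc_record):
    """Convert {dc_element: [values]} into sorted (tag, subfield, value) tuples."""
    fields = []
    for element, values in dc_record.items():
        if element not in DC_TO_MARC:
            continue  # elements outside the illustrative mapping are skipped
        tag, code = DC_TO_MARC[element]
        for value in values:
            fields.append((tag, code, value))
    return sorted(fields)
```

Keeping the mapping in a table rather than in code makes it easy to add another target format for a different service provider.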
Service Provider: Federated Search

• Emulating a federated search service on existing and combined harvested metadata sets
• Federated search, potentially across other search protocols
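One way to picture federated search emulated over harvested sets: the same query runs against each locally stored metadata set and the hits are merged. The record layout (`id`, `title`, `description`) and both function names are assumptions made for this sketch.

```python
def search_set(records, query):
    """Return records whose title or description contains every query term."""
    terms = query.lower().split()
    hits = []
    for rec in records:
        text = (rec.get("title", "") + " " + rec.get("description", "")).lower()
        if all(term in text for term in terms):
            hits.append(rec)
    return hits


def federated_search(metadata_sets, query):
    """Run one query over every set and merge the hits, tagging each with its source."""
    merged, seen = [], set()
    for source, records in metadata_sets.items():
        for rec in search_set(records, query):
            if rec["id"] in seen:
                continue  # the same record may have been harvested from two providers
            seen.add(rec["id"])
            merged.append({"source": source, **rec})
    return merged
```

Because the sets are already harvested, the query never leaves the server; searching remote collections over other protocols would need a per-protocol adapter in place of `search_set`.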
Federated Search
Service Provider: Data Mining
• Knowledge discovery on harvested metadata sets
• Metadata classification using the Self-Organizing Map (SOM) algorithm
• Improving retrieval effectiveness by providing concept browsing and search services

Self-Organizing Map Algorithm
• Competitive and unsupervised learning algorithm
• Artificial neural network algorithm for visualizing and interpreting complex data sets
• Provides a mapping from a high-dimensional input space to a two-dimensional output space
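The three properties above can be seen in a minimal SOM. This is a teaching sketch, not the SOM Categorizer's implementation: the grid size, the linear learning-rate decay, the Gaussian neighbourhood, and the class name `SOM` are all choices made for this example.

```python
import math
import random


class SOM:
    """Tiny self-organizing map on a rows x cols grid (illustrative only)."""

    def __init__(self, rows, cols, dim, seed=0):
        self._rng = random.Random(seed)
        self.rows, self.cols, self.dim = rows, cols, dim
        # One weight vector per grid node, initialised uniformly in [0, 1).
        self.weights = {
            (r, c): [self._rng.random() for _ in range(dim)]
            for r in range(rows)
            for c in range(cols)
        }

    def bmu(self, x):
        """Return the grid coordinates of the best-matching unit for x."""
        return min(
            self.weights,
            key=lambda node: sum((w - v) ** 2 for w, v in zip(self.weights[node], x)),
        )

    def train(self, data, iterations=500, lr0=0.5, sigma0=None):
        """Competitive, unsupervised training: no labels are used anywhere."""
        sigma0 = sigma0 or max(self.rows, self.cols) / 2.0
        for t in range(iterations):
            frac = t / iterations
            lr = lr0 * (1.0 - frac)                  # learning rate decays linearly
            sigma = max(0.5, sigma0 * (1.0 - frac))  # neighbourhood radius shrinks
            x = self._rng.choice(data)
            br, bc = self.bmu(x)                     # the winning node
            for (r, c), w in self.weights.items():
                d2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-d2 / (2.0 * sigma * sigma))
                # Pull the winner and its grid neighbours towards the input.
                for i in range(self.dim):
                    w[i] += lr * h * (x[i] - w[i])
```

After training, `som.bmu(vector)` maps an input vector of any dimension to a pair of grid coordinates, which is the two-dimensional output space used for concept browsing.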

Data Mining Service Provider
System Architecture
[Figure: system architecture. Browsers send concept browsing and concept search requests to the Concept Harvester and receive its responses. The SOM Categorizer, together with the Input Vector Generator and the Noun Phraser, fetches metadata from the Metadata Database and saves the resulting SOM.]
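The slide names the Noun Phraser and the Input Vector Generator but not how they work. As a rough stand-in, the sketch below replaces noun phrasing with simple stop-word filtering and builds one binary term-presence vector per metadata record; the stop-word list and both function names are assumptions.

```python
import re

# Assumption: a tiny stop-word list. A real noun phraser would rely on
# part-of-speech tagging rather than this filter.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def extract_terms(text):
    """Crude stand-in for the Noun Phraser: lowercase words minus stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def build_input_vectors(documents):
    """Return (vocabulary, vectors) with one binary vector per document."""
    term_lists = [extract_terms(doc) for doc in documents]
    vocabulary = sorted({term for terms in term_lists for term in terms})
    vectors = [
        [1 if term in terms else 0 for term in vocabulary]
        for terms in term_lists
    ]
    return vocabulary, vectors
```

Each vector has one position per vocabulary term, so its length grows with the collection; this is the high-dimensional input that a SOM then projects onto a grid.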

Screenshot of the SOM Categorizer
Construction of Two-level Concept Hierarchy
• A SOM is constructed for each harvested metadata set.
• The SOMs of the lower layer are added to the upper-layer SOM.
VTETD
Top-level Concept Browsing
Bottom-level Concept Browsing
MEDLINE Database
• Developed by the National Library of Medicine (NLM)
• Bibliographic citations and abstracts from more than 4,600 biomedical journals published in the United States and 70 other countries
• Covers the fields of medicine, nursing, dentistry, veterinary medicine, the health care system, and the preclinical sciences
• Over 12 million citations
• Searchable via PubMed or the NLM Gateway
MeSH (Medical Subject Headings)
• MEDLINE uses MeSH as its controlled vocabulary for indexing database articles
• Indexers scan an entire article and assign MeSH headings (or MeSH descriptors) to each article
• MeSH descriptors are arranged in both an alphabetic list and a hierarchical structure
• Updated annually to reflect changes in medicine and medical terminology
Our Experimentation

Problems
• Searching by descriptors is well known to greatly improve search precision.
• However, it is very difficult for naïve users to know and use the exact MeSH descriptors to search with.
• In addition, as the MEDLINE database grows, information overload can prevent users from finding the information relevant to them.
Proposed Approach
• Categorization according to MeSH terms, MeSH major topics, and the co-occurrence of MeSH descriptors
• Clustering, through the Knowledge Grid, using the results of the MeSH term categorization
• Visualization of categories and hierarchical clusters
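Co-occurrence of MeSH descriptors can be counted directly from the descriptor lists attached to each citation. The sketch below shows only that counting step; the function name is hypothetical, and how the experiment actually turned such counts into categories is not stated on the slide.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(citations):
    """Count how often each unordered descriptor pair is assigned to the same citation."""
    counts = Counter()
    for descriptors in citations:
        # Sorting gives every pair a canonical order; set() ignores repeats.
        for pair in combinations(sorted(set(descriptors)), 2):
            counts[pair] += 1
    return counts
```

For example, with three made-up citations indexed as `["Neoplasms", "Humans", "Mice"]`, `["Humans", "Neoplasms"]`, and `["Humans", "Mice"]`, the pair `("Humans", "Neoplasms")` is counted twice and `("Mice", "Neoplasms")` once.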
Data Access Services
MeSH Major Topic Tree View
SOM Tree View
Knowledge Grid

Knowledge Grid Architecture
High-level K-Grid layer
• DA: Data Access Service
• TAAS: Tools and Algorithms Access Service
• EPM: Execution Plan Management
• RPS: Result Presentation Service
Core K-Grid layer
• KDS: Knowledge Directory Service
• RAEM: Resource Allocation and Execution Management
• Repositories: KBR, KEPR, KMR
Generic and Data Grid Services
Courtesy of Cannataro and Talia
(Knowledge Grid: An Architecture for Distributed Knowledge Discovery)
Future Directions




• Develop a federated search service for OAI-compliant mathematical abstracts.
• Develop an ontology or conceptual maps for mathematics.
• Develop an ontology search service for mathematical abstracts and full papers.
• Develop an architecture that interoperates with other services, such as OCR of mathematical formulas.
Acknowledgement
• Many thanks to the NSF NSDL Program.
• Collaborators – Joe Futrelle (NCSA), Ed Fox (Virginia Tech)
• Student Team – Hyunki Kim, Chee Yoong Choo, Xiaoou Fu, Yu Chen
