Archon - A Digital Library that Federates Physics Collections
with Varying Degrees of Metadata Richness
Department of Computer Science
Old Dominion University, Norfolk, VA 23529
K. Maly, M. Zubair, M. Nelson
In Collaboration With
Los Alamos National Laboratory (R. Luce)
&
American Physical Society (M. Doyle)
JISC/NSF PI Meeting, June 24-25
Motivation
Lack of a federation service that provides a unified interface
to diverse collections in the physics domain whose metadata
differ in richness, syntax, and semantics
Motivation
• Dissemination and discovery of Physics resources
• Contributors
LANL, APS, AIP, CERN
researchers, teachers
• Users
Students, teachers, researchers
Arc: The Basic Federation Engine
[Diagram: Arc architecture. Labels: User Interface; Search Engine (Servlet); Cache; Data Normalization; Harvester; History Harvest; Daily Harvest; JDBC; Oracle; MySQL; two Data Providers.]
Arc: The Basic Federation Engine
[Diagram: Arc search components. Labels: Grouper; Local Query Cache and Session Related Data; Session Manager; Database (Metadata & Index); Searcher; Displayer.]
Challenges
• Resource Discovery
– Diversity in metadata richness
– Lack of controlled vocabulary
– Ease of discovering (formula based discovery)
– Cross linking support
– Classification
• Creation and Maintenance
– Freshness of metadata
– Dynamic nature of collections
– Filtering
• Economic Sustainability
– Rights management
– Who pays? For what?
Issues – No controlled vocabulary
• Different subject classifications
• Same authors but different rendering
• Same affiliation but different form
Interactive resource discovery approach components
1. The user interacts with the interface to identify the collections to be searched and the search options to use.
2. The user executes the search based on the selected options.
[Diagram, annotated with steps 1 and 2. Labels: Harvester; Harvested Metadata; Index Generator for Union of Key Metadata Fields; Indexed field contents; Search Engine; User Interface.]
Issues - Equation based search
• Representing search query
• Rendering of equations and embedding them into the
HTML display
• Integrating into search interface
• Identifying equations inside the metadata
• Filtering equations
• Equation storage
[Diagram: equation search components. Labels: Servlet; oai.search.Search; EqnSearch; Image Converter; DisplayEqn; Eqn2Gif; MathEqn; Formula Extractor; DC Metadata; cHotEqn; Eqn Data; Img2Gif; EqnExtractor; EqnRecorder; EqnCleaner; EqnFilter; Acme.JPM.Encoders.GifEncoder; Formula Filter.]
Filtering Equations
• Errors in equation encoding, some examples:
– missing "$" in LaTeX representation
– illegal LaTeX symbols
• Simple equations like "n=3"
Filtering/categorizing Equations
Approach:
Use a "Stop Equation File", similar to the "Stop Word File" used for
indexing.
In the equation filtering context, the stop equation file consists of rules in
the form of regular expressions that describe the LaTeX strings to be
dropped. The regular expression approach gives us the flexibility to
easily describe a variety of strings to be filtered.
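A minimal sketch of the stop equation idea, assuming the rules are kept as a list of Python regular expressions. The three rules, the function names, and the sample equations are hypothetical and are not taken from the Archon code base.

```python
import re

# Hypothetical stop-equation rules: each regular expression describes
# LaTeX strings that should be dropped before indexing.
STOP_EQUATION_RULES = [
    r"^\s*[A-Za-z]\s*=\s*\d+(\.\d+)?\s*$",  # trivial assignments such as "n=3"
    r"^\s*\d+(\.\d+)?\s*$",                  # bare numbers
    r"^\s*[A-Za-z]\s*$",                     # single symbols
]
STOP_PATTERNS = [re.compile(rule) for rule in STOP_EQUATION_RULES]


def is_stop_equation(latex):
    """Return True if any stop rule matches the LaTeX string."""
    return any(pattern.search(latex) for pattern in STOP_PATTERNS)


def filter_equations(equations):
    """Keep only the equations that no stop rule matches."""
    return [eq for eq in equations if not is_stop_equation(eq)]


# Made-up candidate equations for illustration.
candidates = ["n=3", "E = mc^2", "x", r"\int_0^\infty e^{-x^2} dx", "42"]
kept = filter_equations(candidates)
print(kept)
```

In a real deployment the rules would presumably be read from the stop equation file, one regular expression per line, so that they can be tuned without touching the code.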
How to search for records using
equations?
Three search alternatives (or any combination of these) for the user:
• Search for docs containing all formulae found in a) abstracts, b)
subject fields of documents containing user input 'keywords'
• Search for docs containing formulae defined by category (e.g.
integrals, moments, limits)
• Browse formulae by various categorizations and search for docs
containing selected formulae
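The category-based alternative could be sketched as follows, assuming a category is recognized by the LaTeX commands a formula contains. The rules, the record layout, and the document ids are hypothetical, and only two of the categories named above are illustrated.

```python
import re

# Hypothetical category rules keyed on LaTeX commands. The negative
# lookahead keeps "\limits" from being counted as a limit.
CATEGORY_RULES = {
    "integrals": re.compile(r"\\(?:i{1,3}nt|oint)(?![A-Za-z])"),
    "limits": re.compile(r"\\lim(?![A-Za-z])"),
}


def categorize(latex):
    """Return the sorted names of all categories whose rule matches."""
    return sorted(name for name, rule in CATEGORY_RULES.items()
                  if rule.search(latex))


def docs_with_category(records, category):
    """records maps a document id to the list of its LaTeX formulae."""
    return sorted(doc_id for doc_id, formulae in records.items()
                  if any(category in categorize(f) for f in formulae))


# Made-up records for illustration.
records = {
    "doc-1": [r"\int_0^1 x^2 \, dx"],
    "doc-2": [r"\lim_{x \to 0} \frac{\sin x}{x}", r"E = mc^2"],
    "doc-3": [r"\sum\limits_{i=1}^{n} i"],
}
integral_docs = docs_with_category(records, "integrals")
limit_docs = docs_with_category(records, "limits")
print(integral_docs, limit_docs)
```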
Issues - Cross Linking References
• Obtaining references from full-text
documents or parallel metadata sets
• Bad format of such references when
obtained from full text
• Need a standard way to represent references across
collections
Issues – Name similarity
• Authors use different names for themselves
and their affiliation
• Could use authority files, but they are difficult to create
and maintain across different collections
Similarity approach
Clustering
Iterative refinement approach:
• Coarse level clusters based on approximate string matching
(edit-distance, soundex, n-gram)
• Refining clusters based on affiliation where available
Presentation
Allow the user to follow up a search by clicking on authors and then
selecting the appropriate one, i.e., no authority files are needed
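The two clustering steps can be sketched as below, assuming edit distance as the approximate matcher (soundex or n-grams could be swapped in). The greedy single-pass strategy, the distance threshold, the record layout, and the sample records are assumptions made for illustration, not the Archon implementation.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalize(name):
    """Lowercase the name and drop punctuation and whitespace."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def coarse_clusters(records, max_distance=2):
    """Greedy pass: a record joins the first cluster whose first member
    is within max_distance, otherwise it starts a new cluster."""
    clusters = []
    for rec in records:
        key = normalize(rec["name"])
        for cluster in clusters:
            if edit_distance(key, normalize(cluster[0]["name"])) <= max_distance:
                cluster.append(rec)
                break
        else:
            clusters.append([rec])
    return clusters


def refine_by_affiliation(cluster):
    """Split a coarse cluster by affiliation; records with no affiliation
    are grouped under None because they cannot be refined."""
    groups = {}
    for rec in cluster:
        groups.setdefault(rec.get("affiliation"), []).append(rec)
    return groups


# Made-up author records for illustration.
records = [
    {"name": "Maly, K.", "affiliation": "Old Dominion University"},
    {"name": "Maly, K", "affiliation": "Old Dominion University"},
    {"name": "Maly K.", "affiliation": None},
    {"name": "Zubair, M.", "affiliation": "Old Dominion University"},
    {"name": "Malt, K.", "affiliation": "Example Institute"},
]
coarse = coarse_clusters(records)
refined = refine_by_affiliation(coarse[0])
print(len(coarse), {k: len(v) for k, v in refined.items()})
```

The coarse step deliberately over-merges ("Malt, K." lands with the "Maly" variants); the affiliation step then separates it again, and the user makes the final choice among what remains.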
Homogenizing User Space
• Enabling Web users to discover information
in OAI collections (DP-9 Service)
– http://arc.cs.odu.edu:8080/dp9/
• Enabling OAI users to discover information
in Web enabled non-OAI compliant
collections/databases/web sites
DP-9 Service for Exposing OAI
Collections to Web
Vac: Gateway Service for Harvesting Non-OAI
Collections
[Diagram: three Web Enabled Non-OAI Compliant Collections/Databases/Web Sites, each with a WIDL Description (XML based language); Gateway to Non-OAI Collections; OAI Service Provider.]
Sample Description in WIDL of a Web enabled Non-OAI Collection
<WIDL NAME="NonOAIGateway" Template="TRcollector"
BASEURL="http://www.princeton.edu" VERSION="2.0">
<SERVICE NAME="getURL" METHOD="GET" URL="" INPUT=""
OUTPUT="urlOutput" />
<BINDING NAME="urlOutput" TYPE="OUTPUT">
<VARIABLE NAME="link" TYPE="String" REFERENCE="doc.p[1].text" />
<VARIABLE NAME="title" TYPE="String" REFERENCE="title" />
<VARIABLE NAME="author" TYPE="String" REFERENCE="author" />
<VARIABLE NAME="descriptionr" TYPE="String"
REFERENCE="abstract" />
</BINDING>
</WIDL>
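To illustrate how a gateway might read such a description, here is a small sketch that parses the sample with Python's standard xml.etree.ElementTree and lists the variables of each output binding. The function name is hypothetical, the embedded copy of the sample uses plain double quotes and omits the unmatched closing BINDING tag so that it is well-formed XML, and nothing here reflects how the Vac gateway itself is implemented.

```python
import xml.etree.ElementTree as ET

# Copy of the sample description, with plain double quotes and the
# unmatched closing BINDING tag omitted so that it parses as XML.
WIDL_TEXT = """\
<WIDL NAME="NonOAIGateway" Template="TRcollector"
      BASEURL="http://www.princeton.edu" VERSION="2.0">
  <SERVICE NAME="getURL" METHOD="GET" URL="" INPUT="" OUTPUT="urlOutput" />
  <BINDING NAME="urlOutput" TYPE="OUTPUT">
    <VARIABLE NAME="link" TYPE="String" REFERENCE="doc.p[1].text" />
    <VARIABLE NAME="title" TYPE="String" REFERENCE="title" />
    <VARIABLE NAME="author" TYPE="String" REFERENCE="author" />
    <VARIABLE NAME="descriptionr" TYPE="String" REFERENCE="abstract" />
  </BINDING>
</WIDL>
"""


def output_bindings(widl_text):
    """Map each OUTPUT binding name to its {variable name: reference} pairs."""
    root = ET.fromstring(widl_text)
    result = {}
    for binding in root.findall("BINDING"):
        if binding.get("TYPE") == "OUTPUT":
            result[binding.get("NAME")] = {
                var.get("NAME"): var.get("REFERENCE")
                for var in binding.findall("VARIABLE")
            }
    return result


bindings = output_bindings(WIDL_TEXT)
for name, reference in bindings["urlOutput"].items():
    print(name, "<-", reference)
```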
Federation/archives Consistency
[Diagram: Arc architecture. Labels: User Interface; Search Engine (Servlet); Cache; Data Normalization; Harvester; History Harvest; Daily Harvest; JDBC; Oracle; MySQL; two Data Providers.]
Future Tasks
• Post processing of search results for easier navigation
• Exploiting richer metadata and handling diversity in metadata
across all participating collections
• Concentrate on interactive search interface for resource
discovery
• Data normalization, authority files, filtering
• Investigating different schemes for maintaining
federation/archives consistency
• More high level services beyond formula based search and
cross-linking
• User testing!!!!
Links
• ODU DL research group:
– http://dlib.cs.odu.edu/
• Main federation engine:
– http://arc.cs.odu.edu/
• NSDL research:
– http://archon.cs.odu.edu/
• ITR/IM research:
– http://kepler.cs.odu.edu/
Not used
[Diagram labels: Los Alamos Collection; American Physical Society Collection; Arc Service Provider; TRI Service Provider; four OAI Layer boxes; Registration Server (XML mapping for each DP); Harvester; Harvested Metadata; Metadata Processor; Search Engine; User Interface; Unified and Normalized Metadata; Name authority file.]
Automated metadata mapping approach