OAI: Past, Present and Future


OAI Overview
Michael L. Nelson
Old Dominion University
Norfolk Virginia, USA
http://www.cs.odu.edu/~mln/
Bioinformatics Seminar
ODU CS 791/891
Feb 3 2003
The Rise and Fall of
Distributed Searching
• wholesale distributed searching, popular at
the time, is attractive in theory but
troublesome in practice
– Davis & Lagoze, JASIS 51(3), pp. 273-80
– Powell & French, Proc 5th ACM DL, pp. 264-265
• distributed searching of N nodes still viable,
but only for small values of N
• NCSTRL: N > 100; bad
• NTRS/NIX: N<=20; ok (but could be better)
The Rise and Fall of
Distributed Searching
• Other problems of distributed searching (from STARTS)
– source-metadata problem
• how do you know which nodes to search?
– query-language problem
• syntax varies and drifts over time between the various nodes
– rank-merging problem
• how do you meaningfully merge multiple result sets?
• Temptations:
– centralize all functions
• “everything will be done at X”
– standardize on a single product
• “everyone will use system Y”
Universal Preprint Service
• A cross-archive DL that provides services on a collection of
metadata harvested from multiple archives
– based on NCSTRL+; a modified version of Dienst
• support for “clustering”
• support for “buckets”
• Demonstrated at Santa Fe NM, October 21-22, 1999
– http://ups.cs.odu.edu/
– D-Lib Magazine, 6(2) 2000 (2 articles)
• http://www.dlib.org/dlib/february00/02contents.html
– UPS was soon renamed the Open Archives Initiative (OAI)
http://www.openarchives.org/
Data and Service Providers
• Data Providers
– publishing into an archive
– providing methods for metadata “harvesting”
• also provide non-technical context for sharing information
• Service Providers
– harvest metadata from providers
– implement user interface to data
• Self-describing archives
– Much of the learning about the constituent UPS
archives occurred out of band…
Even if these are done by the same DL, these are distinct roles
Metadata Harvesting
• Move away from distributed searching
• Extract metadata from various sources
• Build services on local copies of metadata
– data remains at remote repositories
[Figure: metadata is harvested offline from each node into a local copy of metadata; the user's search for “cfd applications” and all other searching, browsing, etc. are performed on the metadata in that local copy. Each node is independently maintained, and individual nodes can still support direct user interaction.]
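The harvesting model above can be sketched in a few lines: services run against a local index built from metadata that was harvested earlier, with no query sent to the remote nodes. This is only an illustration; the record identifiers, titles, and function names are hypothetical.

```python
from collections import defaultdict

def build_index(records):
    """Map each lowercase title word to the ids of the records containing it."""
    index = defaultdict(set)
    for rec_id, title in records.items():
        for word in title.lower().split():
            index[word].add(rec_id)
    return index

def search(index, query):
    """Return ids of records whose titles contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    hits = [index.get(word, set()) for word in words]
    return set.intersection(*hits)

# Hypothetical metadata, harvested offline from two independent nodes
harvested = {
    "oai:nodeA:1": "CFD applications in wing design",
    "oai:nodeB:7": "Applications of metadata harvesting",
}
index = build_index(harvested)
results = search(index, "cfd applications")  # answered from the local copy only
```

The remote nodes are never contacted at search time, which is the point of moving away from distributed searching.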
Result… OAI
• http://www.openarchives.org/
• The OAI was the result of the demonstration and discussion
during the Santa Fe meeting
• Initial focus was on federating collections of scholarly e-print
materials…
• …however, interest grew and the scope and application of OAI
expanded to become a generic bulk metadata transport protocol
• Note:
– OAI is only about metadata -- not full text!
– OAI is neutral with respect to the nature of the metadata or the resources
the metadata describes
• read: commercial publishers have an interest in OAI too...
          | Santa Fe convention | OAI-PMH v.1.0/1.1       | OAI-PMH v.2.0
nature    | experimental        | experimental            | stable
verbs     | Dienst              | OAI-PMH                 | OAI-PMH
requests  | HTTP GET/POST       | HTTP GET/POST           | HTTP GET/POST
responses | XML                 | XML                     | XML
transport | HTTP                | HTTP                    | HTTP
metadata  | OAMS                | unqualified Dublin Core | unqualified Dublin Core
about     | eprints             | document-like objects   | resources
model     | metadata harvesting | metadata harvesting     | metadata harvesting
Dublin Core
• Dublin Core Metadata Initiative
– http://www.dublincore.org/
– from 1994-1995, recognizing the need for simple, interoperable metadata
for resource discovery
– good overview of metadata & DC: http://www.dlib.org/dlib/january01/lagoze/01lagoze.html
– 15 elements (qualifiers possible)
Title
Creator
Subject
Description
Publisher
Contributor
Date
Type
Format
Identifier
Source
Language
Relation
Coverage
Rights
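As a rough illustration of what an unqualified Dublin Core record looks like on the wire, the sketch below serializes a few of the 15 elements with Python's standard library. The two namespace URIs are the ones used by the oai_dc format in OAI-PMH v.2.0; the function name and the sample values are hypothetical.

```python
import xml.etree.ElementTree as ET

OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# The 15 unqualified Dublin Core elements (lowercase, as used in XML)
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)

ET.register_namespace("oai_dc", OAI_DC_NS)
ET.register_namespace("dc", DC_NS)

def make_dc_record(fields):
    """Serialize {element: [values]} as an oai_dc XML fragment."""
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for name, values in fields.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name}")
        for value in values:  # every element is optional and repeatable
            ET.SubElement(root, f"{{{DC_NS}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")

# Hypothetical sample values
xml_text = make_dc_record({
    "title": ["An Example Report"],
    "creator": ["Author, A.", "Author, B."],
})
```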
Overview of OAI Verbs
Group             | Verb                | Function
archival metadata | Identify            | description of archive
archival metadata | ListMetadataFormats | metadata formats supported by archive
archival metadata | ListSets            | sets defined by archive
harvesting verbs  | ListIdentifiers     | OAI unique ids contained in archive
harvesting verbs  | ListRecords         | listing of N records
harvesting verbs  | GetRecord           | listing of a single record
most verbs take arguments: dates, sets, ids, metadata formats
and resumption token (for flow control)
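Since every request is a verb plus a handful of key=value arguments sent over HTTP GET/POST, building one can be sketched as below. The base URL is a placeholder and the function name is hypothetical; only the six verb names come from the protocol.

```python
from urllib.parse import urlencode

VERBS = {
    "Identify", "ListMetadataFormats", "ListSets",
    "ListIdentifiers", "ListRecords", "GetRecord",
}

def build_request(base_url, verb, args=None):
    """Return the GET URL for an OAI-PMH request.

    `args` is a dict because "from" is a reserved word in Python and
    cannot be passed as a keyword argument.
    """
    if verb not in VERBS:
        raise ValueError(f"unknown verb: {verb}")
    params = {"verb": verb}
    params.update(args or {})
    return f"{base_url}?{urlencode(params)}"

# Placeholder base URL, not a real archive
url = build_request(
    "http://example.org/oai",
    "ListRecords",
    {"metadataPrefix": "oai_dc", "from": "2002-01-01"},
)
```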
Argument Summary
Verb                | metadataPrefix | from     | until    | set      | resumptionToken | identifier
Identify            |                |          |          |          |                 |
ListMetadataFormats |                |          |          |          |                 | optional
ListSets            |                |          |          |          | exclusive       |
ListIdentifiers     | required       | optional | optional | optional | exclusive       |
ListRecords         | required       | optional | optional | optional | exclusive       |
GetRecord           | required       |          |          |          |                 | required
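One way an archive might check incoming arguments against this summary is sketched below. The rule table reflects my reading of the OAI-PMH v.2.0 specification (metadataPrefix required for ListIdentifiers and ListRecords; identifier and metadataPrefix required for GetRecord); the function and variable names are hypothetical.

```python
_LIST_RULE = {
    "required": {"metadataPrefix"},
    "optional": {"from", "until", "set"},
    "exclusive": "resumptionToken",
}

# Assumed argument rules, per verb (OAI-PMH v.2.0)
RULES = {
    "Identify": {"required": set(), "optional": set(), "exclusive": None},
    "ListMetadataFormats": {"required": set(), "optional": {"identifier"}, "exclusive": None},
    "ListSets": {"required": set(), "optional": set(), "exclusive": "resumptionToken"},
    "ListIdentifiers": _LIST_RULE,
    "ListRecords": _LIST_RULE,
    "GetRecord": {"required": {"identifier", "metadataPrefix"}, "optional": set(), "exclusive": None},
}

def check_arguments(verb, args):
    """Return "badVerb", "badArgument", or None when the request is acceptable."""
    rule = RULES.get(verb)
    if rule is None:
        return "badVerb"
    names = set(args)
    exclusive = rule["exclusive"]
    if exclusive is not None and exclusive in names:
        # An exclusive argument must be the only argument besides the verb.
        return None if names == {exclusive} else "badArgument"
    if not rule["required"] <= names:
        return "badArgument"  # a required argument is missing
    if names - rule["required"] - rule["optional"]:
        return "badArgument"  # an argument this verb does not accept
    return None
```

A real archive would also validate argument values (date syntax, known metadataPrefix, and so on); this sketch only checks which argument names are present.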
Error Summary
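Verb                | Possible errors
Identify            | BA
ListMetadataFormats | BA, NMF, IDDNE
ListSets            | BA, BRT, NSH
ListIdentifiers     | BA, BRT, CDF, NRM, NSH
ListRecords         | BA, BRT, CDF, NRM, NSH
GetRecord           | BA, CDF, IDDNE
Key: BA = badArgument, BRT = badResumptionToken, CDF = cannotDisseminateFormat, IDDNE = idDoesNotExist, NMF = noMetadataFormats, NRM = noRecordsMatch, NSH = noSetHierarchy
Generate badVerb on any input not matching the 6 defined verbs
this is an inversion of the table in section 3.6 of the OAI-PMH specification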
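The "inversion" mentioned above is a mechanical transformation, sketched here: start from a verb-to-errors mapping and derive the error-to-verbs view. The mapping is my reading of the OAI-PMH v.2.0 error conditions, with the abbreviations expanded to the protocol's error codes; the function name is hypothetical.

```python
# Assumed verb -> possible error codes (badVerb is handled before dispatch)
ERRORS_BY_VERB = {
    "Identify": ["badArgument"],
    "ListMetadataFormats": ["badArgument", "noMetadataFormats", "idDoesNotExist"],
    "ListSets": ["badArgument", "badResumptionToken", "noSetHierarchy"],
    "ListIdentifiers": ["badArgument", "badResumptionToken",
                        "cannotDisseminateFormat", "noRecordsMatch", "noSetHierarchy"],
    "ListRecords": ["badArgument", "badResumptionToken",
                    "cannotDisseminateFormat", "noRecordsMatch", "noSetHierarchy"],
    "GetRecord": ["badArgument", "cannotDisseminateFormat", "idDoesNotExist"],
}

def invert(errors_by_verb):
    """Turn {verb: [errors]} into {error: [verbs]}, keeping verb order."""
    verbs_by_error = {}
    for verb, errors in errors_by_verb.items():
        for error in errors:
            verbs_by_error.setdefault(error, []).append(verb)
    return verbs_by_error

VERBS_BY_ERROR = invert(ERRORS_BY_VERB)
```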
Flow Control
• ListSets, ListIdentifiers, ListRecords are all
allowed to return partial responses, via a
combination of:
– resumptionToken – an opaque, archive-defined data
string that when passed back to the archive allows the
response to begin where it left off
• each archive defines its own resumptionToken syntax; it may
have visible semantics or not
– 503 HTTP status code – “retry after”
• up to the harvester to understand this code and respect it, and
up to the archive to enforce it
resumptionToken
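scenario: harvesting 277 records in 3 separate 100-record “chunks” (archive backed by an RDBMS)
harvester → archive: ListRecords
archive → harvester: Records 1-100, resumptionToken=AXad31
harvester → archive: ListRecords, resumptionToken=AXad31
archive → harvester: Records 101-200, resumptionToken=pQ22-x
harvester → archive: ListRecords, resumptionToken=pQ22-x
archive → harvester: Records 201-277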
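The harvester side of this exchange is a simple loop, sketched below with a simulated archive standing in for the HTTP round trips. The token syntax used here (a numeric offset) is an arbitrary choice for the simulation, since tokens are opaque to the harvester; all function names are hypothetical, and 503 "retry after" handling is left out.

```python
def harvest(fetch_page, first_args):
    """Collect all records, following resumptionTokens until none is returned.

    `fetch_page(args)` stands in for one HTTP request and returns
    (records, resumption_token_or_None).
    """
    records = []
    args = dict(first_args)
    while True:
        page, token = fetch_page(args)
        records.extend(page)
        if not token:
            return records
        # resumptionToken is exclusive: it replaces all other arguments
        args = {"resumptionToken": token}

def make_archive(total=277, chunk=100):
    """Simulated archive serving `total` records in `chunk`-sized responses."""
    calls = []
    def fetch_page(args):
        calls.append(dict(args))
        start = int(args.get("resumptionToken", 0))
        end = min(start + chunk, total)
        page = list(range(start + 1, end + 1))
        token = str(end) if end < total else None
        return page, token
    return fetch_page, calls

fetch_page, calls = make_archive()
records = harvest(fetch_page, {"metadataPrefix": "oai_dc"})
```

With the defaults this reproduces the scenario: three requests, returning records 1-100, 101-200, and 201-277.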
OAI Links & Demos
• Data providers
– not really meant for end-user interaction, but Suleman’s
“Repository Explorer” is an excellent tool
• http://purl.org/net/oai_explorer
• ~100 registered data providers
– http://oaisrv.nsdl.cornell.edu/Register/BrowseSites.pl
– many being used for internal purposes; not registered
• Service providers
– Arc, the first known SP harvesting from OAI data providers
• http://arc.cs.odu.edu/
• ~20 registered service providers
– http://www.openarchives.org/service_provider/oai_sp.htm
– several more known to be in testing or creation
Field of Dreams
• It should be easy to be a data provider, even if it makes more work for
the service provider.
– if enough data providers exist, the service providers will come (DPs >>
SPs)
• Open-source / freely available tools
– “drop-in” data providers:
• industrial strength: http://www.eprints.org/
• personal size: http://kepler.cs.odu.edu/
– tools to make your existing DL a data provider:
• http://www.openarchives.org/tools/tools.htm
• also: OAI-implementers mailing list / mail archive!
– service providers:
• only bits and pieces currently publicly available...
OAI Observation: Front-End Only
• No input/registry mechanism
– OAI harvesting protocol is always a front-end for something else
• filesystem, Dienst, RDBMS, LDAP, etc.
– convenient for pre-existing DLs, but does not address “new” DLs
• e.g., “we want to do OAI”
• Bounds the scope of OAI
– responsibilities and domain of OAI are still being discussed
– tension between functionality and simplicity
OAI Observation: No T&C
• No terms & conditions provisions in protocol
– assumes all metadata has uniform access rights
• how to restrict metadata to certain hosts?
– introducing T&C would increase the scope of
application, but at the expense of simplicity
• how expensive do we want to make a “just-a-front-end
protocol” ?
• maybe T&C is a good application for sets?
OAI Observation: No T&C
• Possible to use multiple OAI servers in a
DMZ-like configuration…
[Figure: OAI requests from arbitrary hosts go to a Public OAI Server; OAI requests from trusted hosts go to a Private OAI Server; both servers sit in front of the same source database (could even use a separate copy of the database…)]
OAI Observation: No T&C
• Possible to use OAI harvesting protocol in
closed, restricted systems
[Figure: four DLs, OAI 1 through OAI 4, forming a closed system; all OAI requests originate from these 4 DLs]
OAI Observation: Monolithic
• An OAI server has no protocol-defined
concept of “other” OAI servers
– backups, mirrors, etc. have to be resolved
outside of the scope of OAI
• scope vs. complexity again
– fully connected graph of DLs harvesting from
each other is unnecessary
• cf. web crawlers vs. “gatherers” in U of Colorado’s
Harvest System
– 3rd party harvesting interfaces raise more T&C and data
coherency issues
302 Load Balancing
• Interactive users on main DL machine should not be
impacted by metadata harvesting
– don’t take deliveries through the front door
– not part of the protocol; defined outside the protocol
[Figure: 302 load balancing]
1. harvester requests http://blah/oai/?verb=ListIdentifiers from the OAI server at naca.larc.nasa.gov/oai/
2. if load > 0.05, that server redirects the request with HTTP Status Code 302
3. harvester repeats http://blah/oai/?verb=ListIdentifiers against the OAI server at buckets.dsi.internet2.edu/naca/oai/
4. that server responds:
<?xml version="1.0" encoding="UTF-8"?>
…
<ListIdentifiers>
…
</ListIdentifiers>
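The redirect decision itself is tiny. A minimal sketch, assuming the load figure is measured elsewhere and passed in; the 0.05 threshold comes from the slide, while the mirror host and function name are placeholders.

```python
def route(load, request_uri, threshold=0.05, mirror="http://mirror.example.org"):
    """Decide how the main DL machine answers a harvest request.

    Returns (status, headers): 302 with a Location header when the machine
    is busy with interactive users, 200 when it will answer the request itself.
    """
    if load > threshold:
        return 302, {"Location": mirror + request_uri}
    return 200, {}

busy = route(0.50, "/oai/?verb=ListIdentifiers")
idle = route(0.01, "/oai/?verb=ListIdentifiers")
```

Because this relies only on standard HTTP behavior, harvesters that follow redirects need no OAI-specific support for it.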
OAI Observation: Data Coherency
• In the interest of OAI implementer simplicity,
several issues are left for the service provider
to interpret
– what is an update vs. addition?
• in the NACA OAI interface, they are reported as the
same and it's up to the harvesting system to figure it out
– deletions?
• it is currently optional for OAI systems to mark records
as deleted or not…
– still left to the harvester to interpret
OAI Observation: Harvest Model
• Frequency of harvests
– all-at-once harvests?
• initial harvest
• resolving data coherency
– frequent incremental harvests?
• far more efficient for both service and data providers
• Webcrawling vs. digital library models
– webcrawlers: little to no a priori information about target
– DLs: frequent harvesting of a small number of known targets
• Realization: we know very little about harvesting
behavior…
– are we optimizing for all-at-once, when incremental will be more
common?
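An incremental harvest amounts to remembering the newest datestamp seen and sending it as the from argument next time. The sketch below also shows the coherency point from the earlier slide: on the harvester's side an update and an addition are handled identically. The record layout (dicts with "identifier", "datestamp", and an optional "deleted" flag) and the function names are hypothetical.

```python
def harvest_args(last_datestamp, metadata_prefix="oai_dc"):
    """Arguments for the next ListRecords request; no `from` on the initial harvest."""
    args = {"metadataPrefix": metadata_prefix}
    if last_datestamp is not None:
        args["from"] = last_datestamp
    return args

def apply_harvest(local, harvested, last_datestamp=None):
    """Merge harvested records into the local store; return the newest datestamp."""
    newest = last_datestamp
    for rec in harvested:
        if rec.get("deleted"):
            local.pop(rec["identifier"], None)  # only if the archive reports deletions
        else:
            local[rec["identifier"]] = rec      # update and addition look the same
        # ISO 8601 datestamps of equal granularity compare correctly as strings
        if newest is None or rec["datestamp"] > newest:
            newest = rec["datestamp"]
    return newest

local = {}
newest = apply_harvest(local, [
    {"identifier": "oai:x:1", "datestamp": "2003-01-10"},
    {"identifier": "oai:x:2", "datestamp": "2003-01-12"},
])
next_args = harvest_args(newest)
```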
Other Uses For the OAI-PMH
• Assumptions:
– Traditional DLs / SPs will continue on their
present path of increasing sophistication
• citation indexing, search results viz, personalization,
recommendations, subject-based filtering, etc.
– growth rates remain the same (5x DPs as SPs)
• Premise: OAI-PMH is applicable to any
scenario that needs to update / synchronize
distributed state
– Future opportunities are possible by creatively
interpreting the OAI-PMH data model
OAI-PMH Data Model
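[Figure: OAI-PMH data model. A resource (the example is David) is represented by an item; item = identifier, the entry point to all available metadata about David; set-membership is an item-level property. The item yields records in individual formats (Dublin Core metadata, MARC metadata, SPECTRUM metadata); record = identifier + metadata format + datestamp.]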
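The relationships in the data model can be written down as two small types: an item carries the identifier and set membership, and each record is the item's metadata in one format with its own datestamp. This is a sketch of the model only; the class and field names, and the sample values, are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Record:
    """record = identifier + metadata format + datestamp (plus the metadata)."""
    identifier: str
    metadata_prefix: str
    datestamp: str
    metadata: str

@dataclass
class Item:
    """item = identifier; set membership is an item-level property."""
    identifier: str
    sets: set = field(default_factory=set)
    records: dict = field(default_factory=dict)  # metadata_prefix -> Record

    def add_record(self, metadata_prefix, datestamp, metadata):
        record = Record(self.identifier, metadata_prefix, datestamp, metadata)
        self.records[metadata_prefix] = record
        return record

# Hypothetical item with two metadata formats
item = Item("oai:example:david", sets={"sculpture"})
item.add_record("oai_dc", "2003-01-15", "<dc>...</dc>")
item.add_record("marc", "2003-01-20", "<marc>...</marc>")
```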
Typical Values
• repository
– collection of publications
• resource
– scholarly publication
• item
– all metadata (DC + MARC)
• record
– a single metadata format
• datestamp
– last update / addition of a record
• metadata format
– bibliographic metadata format
• set
– originating institution or subject categories
Repositories…
• Stretching the idea of a repository a bit:
– contextually sensitive repositories
• “personalization for harvesters”
• communication between strangers, or communication
between friends?
– OAI-PMH for individual complex objects?
• OAI-PMH without MySQL?!
– Fedora, Multi-valent documents, buckets
– tar, jar, zip, etc. files
Resource
• What if resource were:
– computer system status
• uptime, who, w, df, ps, etc.
– or generalized “system” status
• e.g., sports league standings
– people
• personnel databases
• authority files for authors
Item
• What if item were:
– software
• union of versions + formats
– all forms of metadata
• administrative + structural
• citations, annotations, reviews, etc.
– data
• e.g., newsfeeds and other XML expressible content
– metadataPrefixes or sets could be defined to be different
versions
Record
• What if record were:
– specific software instantiations / updates
– access / retrieval logs for DLs (or computer systems)
– push / pull model inversion
• put a harvester on the client behind a firewall, the client
contacts a DP and receives “instructions” on how to submit
the desired document (e.g., send email to a specified
address)
Datestamp
• semantics of datestamp are strongly influenced by
the choice of resource / item / record /
metadataPrefix, but it could be used to:
– signify change of set membership (e.g., workflow: item
moves from “submitted” to “approved”)
– change datestamp to reflect access to the DP
• e.g., in conjunction with metadataPrefixes of “accessed” or
“mirrored”
metadataPrefix
• what if metadataPrefix were:
– instructions for extracting / archiving / scraping the
resource
• verb=ListRecords&metadataPrefix=extract_TIFFs
– code fragments to run locally
• (harvested from a trusted source!)
– XSLT for other metadataPrefixes
• branding container is at the repository-level, this could be
record- or item-level
Set
• sets are already used for tunneling OAI-PMH
extensions (see Suleman & Fox, D-Lib 7(12))
• other uses:
– in aggregators, automatically create 1 set per baseURL
– have “hidden” sets (or metadataPrefix) that have
administrative or community-specific values (or triggers)
• set=accessed>1000&from=2001-01-01
• set=harvestMeWithTheseARGS&until=2002-05-05&metadataPrefix=oai_marc
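The "1 set per baseURL" idea for aggregators could look like the sketch below. The naming scheme is an assumption of this example, not something the protocol prescribes: it keeps the host and path of the baseURL and replaces any character outside the URI unreserved set (the characters permitted in a setSpec component, as I understand v.2.0) with an underscore. The URLs and identifiers are placeholders.

```python
import re
from urllib.parse import urlparse

def set_spec_for(base_url):
    """Derive a setSpec-safe name from a source archive's baseURL."""
    parsed = urlparse(base_url)
    raw = parsed.netloc + parsed.path
    # ":" and "/" are replaced too, so the result is a single flat set
    return re.sub(r"[^A-Za-z0-9_.!~*'()-]", "_", raw).strip("_")

def group_by_source(harvested):
    """Group (base_url, identifier) pairs into one set per source archive."""
    sets = {}
    for base_url, identifier in harvested:
        sets.setdefault(set_spec_for(base_url), []).append(identifier)
    return sets

# Placeholder sources and identifiers
sets = group_by_source([
    ("http://a.example.org/oai/", "oai:a:1"),
    ("http://b.example.org:8080/cgi/oai", "oai:b:9"),
    ("http://a.example.org/oai/", "oai:a:2"),
])
```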
Interesting Services
• DP9
– gateway to expose repository contents in HTML suitable
for web crawlers
• Celestial
– OAI “cache”, also 1.1 -> 2.0 converter
• Static (mini-) repositories
– XML files, based on OLAC work
• OpenURL metadata format registries
– record = metadata format