oai-icsep - Old Dominion University

New Digital Library Possibilities Using the
Open Archives Initiative Protocol for
Metadata Harvesting (OAI-PMH)
Michael L. Nelson
Old Dominion University
Norfolk Virginia, USA
[email protected]
http://www.cs.odu.edu/~mln/icsep/
International Conference on Scientific Electronic Publishing
in Developing Countries
Valparaiso, Chile
October 2, 2002
Several Slides Also from Van de Sompel & Warner
Random Thoughts
1. Thanks to the Organizing Committee for
inviting me
2. I wish I had paid attention to my
high school Spanish classes…
3. Publishers & Editors: if you want
increased coverage, exposure and
readership, you must “do” OAI…
Outline
• OAI-PMH history and technical highlights
– a full technical review is out of the scope of
this presentation
• Example data provider user
• Example service provider uses
• Implications for authors and editors
• Looking to the future
Open Archives Initiative
The protocol is openly
documented, and metadata
is “exposed” to at least some
peer group (note: rights
management can still apply!)
Archive defined as a
“collection of stuff” -not the archivist’s
definition of “archive”.
“Repository” used in
most OAI documents.
OAI is happening
at break-neck speed...
The Rise and Fall of
Distributed Searching
• wholesale distributed searching, popular at
the time, is attractive in theory but
troublesome in practice
– Davis & Lagoze, JASIS 51(3), pp. 273-80
– Powell & French, Proc 5th ACM DL, pp. 264-265
• distributed searching of N nodes still
viable, but only for small values of N
• NCSTRL: N > 100; bad
• NTRS/NIX: N<=20; ok (but could be better)
The Rise and Fall of
Distributed Searching
• Other problems of distributed searching
(from STARTS)
– source-metadata problem
• how do you know which nodes to search?
– query-language problem
• syntax varies and drifts over time between the various nodes
– rank-merging problem
• how do you meaningfully merge multiple result sets?
• Temptations:
– centralize all functions
• “everything will be done at X”
– standardize on a single product
• “everyone will use system Y”
Santa Fe Convention [02/2000]
• goal: optimize discovery of e-prints
http://www.dlib.org/dlib/february00/vandesompel-oai/02vandesompel-oai.html
• input:
• the UPS prototype
http://www.dlib.org/dlib/february00/vandesompel-ups/02vandesompel-ups.html
• RePEc /SODA “data provider / service
provider model”
• Dienst protocol
• deliberations at Santa Fe meeting [10/99]
Data and Service Providers
• Data Providers
– publishing into an archive
– providing methods for metadata “harvesting”
• also provide non-technical context for sharing information
• Service Providers
– harvest metadata from providers
– implement user interface to data
• Self-describing archives
– Much of the learning about the constituent UPS
archives occurred out of band…
Even if these
are done by
the same DL,
these are
distinct roles
Metadata Harvesting
• Move away from distributed searching
• Extract metadata from various sources
• Build services on local copies of metadata
– data remains at remote repositories
[Figure: a user searches (e.g. for “cfd applications”) a service provider's local copy of metadata; all searching, browsing, etc. is performed on the metadata there. The metadata is harvested offline from each node. Each node is independently maintained, and individual nodes can still support direct user interaction.]
OAI-PMH v.1.0 [01/2001]
• low-barrier interoperability specification
• metadata harvesting model: data provider / service
provider
• focus on document-like objects
• autonomous protocol
• HTTP based
• XML responses
• unqualified Dublin Core
• experimental: 12-18 months
          | Santa Fe convention | OAI-PMH v.1.0/1.1       | OAI-PMH v.2.0
nature    | experimental        | experimental            | stable
verbs     | Dienst              | OAI-PMH                 | OAI-PMH
requests  | HTTP GET/POST       | HTTP GET/POST           | HTTP GET/POST
responses | XML                 | XML                     | XML
transport | HTTP                | HTTP                    | HTTP
metadata  | OAMS                | unqualified Dublin Core | unqualified Dublin Core
about     | eprints             | document like objects   | resources
model     | metadata harvesting | metadata harvesting     | metadata harvesting
OAI-PMH 2.0
• Good news: OAI-PMH is still
Six Verbs + Dublin Core
• Incremental improvements
– single XML schema
– ambiguities removed
– more expressive options
– cleaner separation of roles & responsibilities
• Bad news: not backwards compatible with 1.1
Dublin Core
• Dublin Core Metadata Initiative
– http://www.dublincore.org/
– from 1994-1995, recognizing the need for simple, interoperable
metadata for resource discovery
– good overview of metadata & DC:
http://www.dlib.org/dlib/january01/lagoze/01lagoze.html
– 15 elements (qualifiers possible)
Title
Creator
Subject
Description
Publisher
Contributor
Date
Type
Format
Identifier
Source
Language
Relation
Coverage
Rights
Request is encoded in http
Response is encoded in XML
XML Schemas for the
responses are defined
in the OAI-PMH document
OAI Mechanics
Overview of OAI-PMH Verbs
Verb | Function
metadata about the repository:
Identify | description of archive
ListMetadataFormats | metadata formats supported by archive
ListSets | sets defined by archive
harvesting verbs:
ListIdentifiers | OAI unique ids contained in archive
ListRecords | listing of N records
GetRecord | listing of a single record
most verbs take arguments: dates, sets, ids, metadata formats
and resumption token (for flow control)
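As a rough illustration of how a verb and its arguments are encoded into an HTTP GET request, here is a minimal sketch. The helper name build_request is hypothetical, the base URL is the arXiv one used in the response examples of these slides, and no network access is performed.

```python
from urllib.parse import urlencode

# The six OAI-PMH verbs listed in the overview table.
VERBS = {
    "Identify", "ListMetadataFormats", "ListSets",
    "ListIdentifiers", "ListRecords", "GetRecord",
}

def build_request(base_url, verb, **arguments):
    """Return an HTTP GET URL for an OAI-PMH request (hypothetical helper)."""
    if verb not in VERBS:
        raise ValueError(f"not an OAI-PMH verb: {verb}")
    params = {"verb": verb}
    for name, value in arguments.items():
        # 'from' is a Python keyword, so callers pass it as 'from_'.
        params[name.rstrip("_")] = value
    return f"{base_url}?{urlencode(params)}"

url = build_request(
    "http://arXiv.org/oai2", "ListRecords",
    metadataPrefix="oai_dc", from_="2001-12-01", set="cs",
)
print(url)
```

The verb always travels as the `verb` parameter; everything else is an ordinary key=value pair in the query string.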
protocol vs periphery
• clear distinction between protocol and periphery
• fixed protocol document
• extensible implementation guidelines:
• e.g. sample metadata formats, description
containers, about containers
• allows for OAI guidelines and community
guidelines
OAI-PMH vs HTTP
• clear separation of OAI-PMH and HTTP
• OAI-PMH error handling
• all OK at HTTP level? => 200 OK
• something wrong at OAI-PMH level? =>
OAI-PMH error (e.g. badVerb)
• http codes 302, 503, etc. still available to
implementers, but no longer represent OAI-PMH
events
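One way a harvester might keep the two layers apart is sketched below: the HTTP status is checked first, and only a 200 response body is inspected for an OAI-PMH error element. The function name classify and the returned tuples are assumptions made for this sketch; the XML namespace is the OAI-PMH 2.0 one.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def classify(http_status, body):
    """Separate HTTP-level outcomes from OAI-PMH-level errors."""
    if http_status != 200:
        # e.g. 302 or 503: a transport matter, not an OAI-PMH event
        return ("http", str(http_status))
    root = ET.fromstring(body)
    error = root.find(f"{OAI_NS}error")
    if error is not None:
        # HTTP was fine, but the protocol layer reports a problem
        return ("oai-error", error.get("code"))
    return ("ok", None)

BAD_VERB = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request>http://arXiv.org/oai2</request>
  <error code="badVerb">ShowMe is not a valid OAI-PMH verb</error>
</OAI-PMH>"""

print(classify(200, BAD_VERB))
print(classify(503, ""))
```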
resource – item - record
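[Figure: a resource (e.g. David) is represented in the repository by an item; item = identifier, standing for all available metadata about the resource. Set-membership is an item-level property. Records are disseminated from the item, e.g. Dublin Core metadata, MARC metadata, SPECTRUM metadata.]
record = identifier + metadata format + datestamp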
other general changes
• better definitions of harvester,
repository, item, unique identifier, record,
set, selective harvesting
• oai_dc schema builds on DCMI XML
Schema for unqualified Dublin Core
• usage of must, must not etc. as in RFC2119
• wording on response compression
other general changes
• all protocol responses can be validated with
a single XML Schema
• easier for data providers
• no redundancy in type definitions
• SOAP-ready
• clean for error handling
response no errors
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
<responseDate>2002-02-08T08:55:46Z</responseDate>
<request verb="GetRecord"… …>http://arXiv.org/oai2</request>
<GetRecord>
<record>
<header>
<identifier>oai:arXiv:cs/0112017</identifier>
<datestamp>2001-12-14</datestamp>
<setSpec>cs</setSpec>
<setSpec>math</setSpec>
</header>
<metadata>
…..
</metadata>
</record>
</GetRecord>
</OAI-PMH>
note no http encoding of the OAI-PMH request
response with error
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
<responseDate>2002-02-08T08:55:46Z</responseDate>
<request>http://arXiv.org/oai2</request>
<error code="badVerb">ShowMe is not a valid OAI-PMH verb</error>
</OAI-PMH>
with errors, only the correct attributes are echoed in <request>
resumptionToken
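scenario: harvesting 2770 records in 3 separate 1000 record “chunks”
harvester → repository (RDBMS): ListRecords
repository → harvester: Records 1-1000, resumptionToken=AXad31
harvester → repository: ListRecords, resumptionToken=AXad31
repository → harvester: Records 1001-2000, resumptionToken=pQ22-x
harvester → repository: ListRecords, resumptionToken=pQ22-x
repository → harvester: Records 2001-2770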
resumptionToken
• idempotency of resumptionToken: return same incomplete
list when rT is reissued
• while no changes occur in the repo: strict
• while changes occur in the repo: all items with unchanged
datestamp
• new, optional attributes for the resumptionToken:
• expirationDate
• completeListSize
• cursor
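The flow control in the 2770-record scenario can be sketched as a simple loop. This is a toy under stated assumptions: make_repository simulates a repository in memory, and its tokens are just stringified offsets rather than the opaque values (AXad31, pQ22-x) a real repository would issue.

```python
def make_repository(total=2770, chunk=1000):
    """Simulated repository; returns a ListRecords-like function."""
    def list_records(resumption_token=None):
        # In this toy, the token is the next start offset as a string.
        start = int(resumption_token) if resumption_token else 0
        end = min(start + chunk, total)
        records = list(range(start + 1, end + 1))
        # No token on the final chunk signals a complete list.
        next_token = str(end) if end < total else None
        return records, next_token
    return list_records

def harvest(list_records):
    """Follow resumptionTokens until the list is complete."""
    harvested, requests = [], 0
    token = None
    while True:
        records, token = list_records(token)
        requests += 1
        harvested.extend(records)
        if token is None:
            break
    return harvested, requests

records, requests = harvest(make_repository())
print(len(records), requests)
```

The harvester never interprets the token; it only echoes back whatever the repository returned with the previous chunk.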
harvesting granularity
• harvesting granularity
• mandatory support of YYYY-MM-DD
• optional support of YYYY-MM-DDThh:mm:ssZ
• other granularities considered, but ultimately rejected
• granularity of from and until must be the
same
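A harvester or repository could check the granularity rule along these lines. The function names are hypothetical, and answering a mismatch with badArgument is an assumption of this sketch rather than something stated on the slide.

```python
import re

DAY = re.compile(r"\d{4}-\d{2}-\d{2}")
SECONDS = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z")

def granularity(stamp):
    """Name the granularity of a datestamp, or raise ValueError."""
    if DAY.fullmatch(stamp):
        return "YYYY-MM-DD"
    if SECONDS.fullmatch(stamp):
        return "YYYY-MM-DDThh:mm:ssZ"
    raise ValueError(f"unsupported datestamp: {stamp}")

def check_range(from_, until, repository_granularity="YYYY-MM-DD"):
    """Return 'ok' or 'badArgument' for a from/until pair."""
    try:
        g_from, g_until = granularity(from_), granularity(until)
    except ValueError:
        return "badArgument"
    if g_from != g_until:
        # from and until must use the same granularity
        return "badArgument"
    if g_from == "YYYY-MM-DDThh:mm:ssZ" and repository_granularity == "YYYY-MM-DD":
        # seconds granularity is optional; this repository only does days
        return "badArgument"
    return "ok"

print(check_range("2001-12-01", "2001-12-14"))
print(check_range("2001-12-01", "2001-12-14T00:00:00Z"))
```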
Identify
• Identify more expressive
<Identify>
<repositoryName>Library of Congress 1</repositoryName>
<baseURL>http://memory.loc.gov/cgi-bin/oai</baseURL>
<protocolVersion>2.0</protocolVersion>
<adminEmail>[email protected]</adminEmail>
<adminEmail>[email protected]</adminEmail>
<deletedRecord>transient</deletedRecord>
<earliestDatestamp>1990-02-01T00:00:00Z</earliestDatestamp>
<granularity>YYYY-MM-DDThh:mm:ssZ</granularity>
<compression>deflate</compression>
header
• header contains set membership of item
<record>
<header>
<identifier>oai:arXiv:cs/0112017</identifier>
<datestamp>2001-12-14</datestamp>
<setSpec>cs</setSpec>
<setSpec>math</setSpec>
</header>
<metadata>
…..
</metadata>
</record>
eliminates the need for the “double
harvest” 1.x required to get all records
and all set information
ListIdentifiers
• ListIdentifiers returns headers
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
<responseDate>2002-02-08T08:55:46Z</responseDate>
<request verb="…" …>http://arXiv.org/oai2</request>
<ListIdentifiers>
<header>
<identifier>oai:arXiv:hep-th/9801001</identifier>
<datestamp>1999-02-23</datestamp>
<setSpec>physic:hep</setSpec>
</header>
<header>
<identifier>oai:arXiv:hep-th/9801002</identifier>
<datestamp>1999-03-20</datestamp>
<setSpec>physic:hep</setSpec>
<setSpec>physic:exp</setSpec>
</header>
……
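Because each header carries its setSpec elements, set membership can be collected in the same pass as the identifiers. The sketch below assumes a completed, namespaced version of the response above (the slide elides the closing tags and the xmlns declaration); the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Completed version of the slide's excerpt, with closing tags added.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>oai:arXiv:hep-th/9801001</identifier>
      <datestamp>1999-02-23</datestamp>
      <setSpec>physic:hep</setSpec>
    </header>
    <header>
      <identifier>oai:arXiv:hep-th/9801002</identifier>
      <datestamp>1999-03-20</datestamp>
      <setSpec>physic:hep</setSpec>
      <setSpec>physic:exp</setSpec>
    </header>
  </ListIdentifiers>
</OAI-PMH>"""

def sets_by_identifier(xml_text):
    """Map each identifier to the list of sets named in its header."""
    root = ET.fromstring(xml_text)
    membership = {}
    for header in root.iterfind(".//oai:header", NS):
        identifier = header.findtext("oai:identifier", namespaces=NS)
        membership[identifier] = [
            spec.text for spec in header.findall("oai:setSpec", NS)
        ]
    return membership

membership = sets_by_identifier(SAMPLE)
print(membership)
```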
provenance
• introduction of provenance container to
facilitate tracing of harvesting history
<about>
<provenance>
<originDescription>
<baseURL>http://an.oa.org</baseURL>
<identifier>oai:r1:plog/9801001</identifier>
<datestamp>2001-08-13T13:00:02Z</datestamp>
<metadataPrefix>oai_dc</metadataPrefix>
<harvestDate>2001-08-15T12:01:30Z</harvestDate>
<originDescription>
… … …
</originDescription>
</originDescription>
</provenance>
</about>
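The nesting of originDescription elements can be read as a chain of harvesting hops. The sketch below follows the element layout shown on this slide, without namespaces. The inner originDescription values are invented for illustration, since the slide elides them, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<about>
  <provenance>
    <originDescription>
      <baseURL>http://an.oa.org</baseURL>
      <identifier>oai:r1:plog/9801001</identifier>
      <datestamp>2001-08-13T13:00:02Z</datestamp>
      <metadataPrefix>oai_dc</metadataPrefix>
      <harvestDate>2001-08-15T12:01:30Z</harvestDate>
      <originDescription>
        <!-- hypothetical earlier hop; elided on the slide -->
        <baseURL>http://first.example.org/oai</baseURL>
        <identifier>oai:r0:plog/9801001</identifier>
        <datestamp>2001-08-01T09:00:00Z</datestamp>
        <metadataPrefix>oai_dc</metadataPrefix>
        <harvestDate>2001-08-10T10:00:00Z</harvestDate>
      </originDescription>
    </originDescription>
  </provenance>
</about>"""

def harvest_chain(about_xml):
    """Return (baseURL, harvestDate) pairs, most recent hop first."""
    node = ET.fromstring(about_xml).find("provenance/originDescription")
    chain = []
    while node is not None:
        chain.append((node.findtext("baseURL"), node.findtext("harvestDate")))
        # find() matches direct children only, so this steps one level down
        node = node.find("originDescription")
    return chain

chain = harvest_chain(SAMPLE)
print(chain)
```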
friends
• introduction of friends container to
facilitate discovery of repositories
<description>
<friends>
<baseURL>http://cav2001.library.caltech.edu/perl/oai</baseURL>
<baseURL>http://formations2.ulst.ac.uk/perl/oai</baseURL>
<baseURL>http://cogprints.soton.ac.uk/perl/oai</baseURL>
<baseURL>http://wave.ldc.upenn.edu/OLAC/dp/aps.php4</baseURL>
</friends>
</description>
NASA <friends> example (1)
• A light weight, DP-centric method
to communicate the existence of
“others”
http://techreports.larc.nasa.gov/ltrs/oai2.0/?verb=Identify
..
<description>
<friends ..namespace stuff..>
<baseURL>http://naca.larc.nasa.gov/oai2.0</baseURL>
<baseURL>http://ntrs.nasa.gov/oai2.0</baseURL>
<baseURL>http://horus.riacs.edu/perl/oai/</baseURL>
<baseURL>http://ston.jsc.nasa.gov/collections/TRS/oai/</baseURL>
</friends>
</description>
..
NASA <friends> example (2)
[Figure: the harvester issues Identify to http://techreports.larc.nasa.gov/ltrs/oai2.0/, reads the <friends>…</friends> container in the response, and from it discovers:]
http://naca.larc.nasa.gov/oai2.0/
http://ston.jsc.nasa.gov/collections/TRS/oai/
http://ntrs.nasa.gov/oai2.0/
http://horus.riacs.edu/perl/oai/
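Discovery through friends amounts to a graph traversal starting from one known repository. The sketch below simulates the Identify responses with a dictionary instead of making network requests. The back-link from the NACA server to the LTRS server is an assumption added to show that cycles are handled, and the function name is hypothetical.

```python
from collections import deque

START = "http://techreports.larc.nasa.gov/ltrs/oai2.0/"

# Simulated Identify responses: baseURL -> baseURLs in its friends container.
FRIENDS = {
    START: [
        "http://naca.larc.nasa.gov/oai2.0/",
        "http://ntrs.nasa.gov/oai2.0/",
        "http://horus.riacs.edu/perl/oai/",
        "http://ston.jsc.nasa.gov/collections/TRS/oai/",
    ],
    # Assumed for illustration: a friend that points back at the start.
    "http://naca.larc.nasa.gov/oai2.0/": [START],
}

def discover(start, get_friends):
    """Breadth-first walk over friends links; returns repositories in visit order."""
    seen = [start]
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for friend in get_friends(current):
            if friend not in seen:
                seen.append(friend)
                queue.append(friend)
    return seen

found = discover(START, lambda url: FRIENDS.get(url, []))
print(found)
```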
branding
• introduction of branding container for
DPs to suggest rendering & association hints
<branding xmlns="http://www.openarchives.org/OAI/2.0/branding/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/branding/
http://www.openarchives.org/OAI/2.0/branding.xsd">
<collectionIcon>
<url>http://my.site/icon.png</url>
<link>http://my.site/homepage.html</link>
<title>MySite(tm)</title>
<width>88</width>
<height>31</height>
</collectionIcon>
<metadataRendering
metadataNamespace="http://www.openarchives.org/OAI/2.0/oai_dc/"
mimeType="text/xsl">http://some.where/DCrender.xsl</metadataRendering>
<metadataRendering
metadataNamespace="http://another.place/MARC"
mimeType="text/css">http://another.place/MARCrender.css</metadataRendering>
</branding>
oai-identifier
• revision of oai-identifier
<description>
<oai-identifier xmlns="http://www.openarchives.org/OAI/2.0/oai-identifier"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai-identifier
http://www.openarchives.org/OAI/2.0/oai-identifier.xsd">
<scheme>oai</scheme>
<repositoryIdentifier>oai-stuff.foo.org</repositoryIdentifier>
<delimiter>:</delimiter>
<sampleIdentifier>oai:oai-stuff.foo.org:5324</sampleIdentifier>
</oai-identifier>
</description>
domain based
repository names
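The scheme, repositoryIdentifier and delimiter declared above are enough to take a sample identifier apart. A minimal sketch, with a hypothetical function name:

```python
def parse_oai_identifier(identifier, delimiter=":"):
    """Split scheme:repositoryIdentifier:localIdentifier into its parts."""
    # maxsplit=2 keeps any further delimiters inside the local identifier
    scheme, repository, local = identifier.split(delimiter, 2)
    if scheme != "oai":
        raise ValueError(f"unexpected scheme: {scheme}")
    return repository, local

print(parse_oai_identifier("oai:oai-stuff.foo.org:5324"))
print(parse_oai_identifier("oai:arXiv:cs/0112017"))
```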
did not make it into OAI-PMH v.2.0
• SOAP implementation
• Result set filtering
• Multiple / “best” metadata
• GetRecord -> GetRecords
• Machine readable rights management
• XML format for “mini-archives”
So What Does OAI-PMH Mean
for Your Digital Library?
• Resources on DL projects are typically
spent in 2 areas:
– creating & maintaining the collection
• data provider
– developing access services for the collection
(searching, browsing, etc.)
• service provider
• OAI-PMH allows for specialization based
on resources / interest
NACA Report 1345
as seen through its native DL
http://naca.larc.nasa.gov/
NACA Report 1345
as seen through MAGiC
http://www.magic.ac.uk/
NACA Report 1345
as seen through Scirus
(Elsevier)
http://www.scirus.com/
NACA Report 1345
as seen through my.OAI
(FS Consulting)
http://www.myoai.com/
Scientific Communication
• With only some exceptions, which interface is used
for discovery is not as important as the fact that
discovery occurred in the first place…
– “control” of the discovered objects is not “lost” by data
providers
• however, higher level mirroring services can be built on top of
OAI (cf. NACA & ARC mirroring between NASA LaRC and
MAGiC)
• The real power of OAI-PMH derives as much from
what it does not do as what it actually does
What Does OAI-PMH Mean
for Authors?
• On the surface, absolutely nothing!
– the ideal OAI deployment should be absolutely invisible to
normal DL operations
– uninterested users should not even notice or care
• Indirectly, they should enjoy the benefits of the
critical mass of current and developing DL tools &
systems
– personal, institutional data providers
– proliferation of targeted, value-added service providers
What Does OAI-PMH Mean
For Editors?
• Absolutely everything…
• The decoupling of SPs and DPs will have significant and
profound implications on scientific and technical information
exchange
– OAI-PMH is actually just one component in a larger engineering
effort for scholarly communication (e.g. OpenURL)
• Service and resource integration will be the focus of journals,
professional societies, universities, etc.
– OAI-PMH will be as basic and core a technology for scientific publishing
as http & XML
Field of Dreams
• It should be easy to be a data provider, even if it
makes more work for the service provider.
– if enough data providers exist, the service providers will
come (DPs >> SPs)
• Open-source / freely available tools
– “drop-in” data providers:
• industrial strength: http://www.eprints.org/
• personal size: http://kepler.cs.odu.edu/
– tools to make your existing DL a data provider:
• http://www.openarchives.org/tools/tools.htm
• also: OAI-implementers mailing list / mail archive!
– service providers:
• Arc: http://sourceforge.net/projects/oaiarc/
OAI Observation:
Front-End Only
• No input/registry mechanism
– OAI harvesting protocol is always a front-end for
something else
• filesystem, Dienst, RDBMS, LDAP, etc.
– convenient for pre-existing DLs, but does not address
“new” DLs
• e.g., “we want to do OAI”
• Bounds the scope of OAI
– responsibilities and domain of OAI are still being discussed
– tension between functionality and simplicity
OAI Observation: No T&C
• Possible to use multiple OAI servers in a
DMZ-like configuration…
[Figure: OAI requests from arbitrary hosts go to a Public OAI Server; OAI requests from trusted hosts go to a Private OAI Server; both servers sit in front of the same source database.]
could even use a separate copy of the database…
OAI Observation: No T&C
• Possible to use OAI harvesting protocol in
closed, restricted systems
[Figure: four interconnected DLs: OAI 1, OAI 2, OAI 3, OAI 4]
all OAI requests originate from these 4 DLs
Metadata
– Q: “Which format should I use?”
• A: any/all of them…
– lowest common denominator: unqualified Dublin
Core
– Again, little known about actual behavior
• will DC actually be useful? or too lossy?
• will communities create/adopt specific formats?
• will native (presumably richer) formats be harvested?
“The Return of MARC” ?!
we very much want
this to happen...
The Future: Community Building
• Ultimately, protocols and metadata formats are
not what makes a difference
• Rather, the critical mass afforded by a common
set of utilities (cf. http, Dublin Core, XML)
• The best current example: The Open Language
Archives Community
– http://www.language-archives.org
• OAI-PMH provides the basis for communication
between strangers, but allows even richer
communication between friends
http://www.openarchives.org
[email protected]
Backup Slides
Detailed Review of the
OAI-PMH 2.0 Verbs
Identify
1.1
• Arguments
– none
• Errors
– none
2.0
• Arguments
– none
• Errors
– badArgument
ListMetadataFormats
1.1
• Arguments
– identifier (OPTIONAL)
• Errors
– id does not exist
2.0
• Arguments
– identifier (OPTIONAL)
• Errors
– badArgument
– noMetadataFormats
– idDoesNotExist
ListSets
1.1
• Arguments
– resumptionToken (EXCLUSIVE)
• Errors
– no set hierarchy
2.0
• Arguments
– resumptionToken (EXCLUSIVE)
• Errors
– badArgument
– badResumptionToken
– noSetHierarchy
ListIdentifiers
1.1
• Arguments
– from (OPTIONAL)
– until (OPTIONAL)
– set (OPTIONAL)
– resumptionToken (EXCLUSIVE)
• Errors
– no records match
2.0
• Arguments
– from (OPTIONAL)
– until (OPTIONAL)
– set (OPTIONAL)
– resumptionToken (EXCLUSIVE)
– metadataPrefix (REQUIRED)
• Errors
– badArgument
– cannotDisseminateFormat
– badResumptionToken
– noSetHierarchy
– noRecordsMatch
ListRecords
1.1
• Arguments
– from (OPTIONAL)
– until (OPTIONAL)
– set (OPTIONAL)
– resumptionToken (EXCLUSIVE)
– metadataPrefix (REQUIRED)
• Errors
– no records match
– metadata format cannot be disseminated
2.0
• Arguments
– from (OPTIONAL)
– until (OPTIONAL)
– set (OPTIONAL)
– resumptionToken (EXCLUSIVE)
– metadataPrefix (REQUIRED)
• Errors
– noRecordsMatch
– cannotDisseminateFormat
– badResumptionToken
– noSetHierarchy
– badArgument
GetRecord
1.1
• Arguments
– identifier (REQUIRED)
– metadataPrefix (REQUIRED)
• Errors
– id does not exist
– metadata format cannot be disseminated
2.0
• Arguments
– identifier (REQUIRED)
– metadataPrefix (REQUIRED)
• Errors
– badArgument
– cannotDisseminateFormat
– idDoesNotExist
Argument Summary
verb                | metadataPrefix | from     | until    | set      | resumptionToken | identifier
Identify            | -              | -        | -        | -        | -               | -
ListMetadataFormats | -              | -        | -        | -        | -               | optional
ListSets            | -              | -        | -        | -        | exclusive       | -
ListIdentifiers     | required       | optional | optional | optional | exclusive       | -
ListRecords         | required       | optional | optional | optional | exclusive       | -
GetRecord           | required       | -        | -        | -        | -               | required
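Error Summary
Identify: BA
ListMetadataFormats: BA, NMF, IDDNE
ListSets: BA, BRT, NSH
ListIdentifiers: BA, BRT, CDF, NRM, NSH
ListRecords: BA, BRT, CDF, NRM, NSH
GetRecord: BA, CDF, IDDNE
(BA = badArgument, BRT = badResumptionToken, CDF = cannotDisseminateFormat, IDDNE = idDoesNotExist, NMF = noMetadataFormats, NRM = noRecordsMatch, NSH = noSetHierarchy)
Generate badVerb on any input not matching the 6 defined verbs
this is an inversion of the table in section 3.6 of the OAI-PMH specification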
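The argument summary lends itself to a table-driven check on the repository side. The sketch below covers only badVerb and badArgument, not the other error conditions. The function name and the dictionary layout are assumptions made for illustration, with the required/optional/exclusive values taken from the 2.0 columns of the verb slides.

```python
SELECTIVE = {
    "metadataPrefix": "required", "from": "optional", "until": "optional",
    "set": "optional", "resumptionToken": "exclusive",
}

VERB_ARGUMENTS = {
    "Identify": {},
    "ListMetadataFormats": {"identifier": "optional"},
    "ListSets": {"resumptionToken": "exclusive"},
    "ListIdentifiers": SELECTIVE,
    "ListRecords": SELECTIVE,
    "GetRecord": {"identifier": "required", "metadataPrefix": "required"},
}

def check_request(params):
    """Return 'ok', 'badVerb' or 'badArgument' for a dict of request parameters."""
    params = dict(params)
    verb = params.pop("verb", None)
    if verb not in VERB_ARGUMENTS:
        return "badVerb"
    allowed = VERB_ARGUMENTS[verb]
    if any(name not in allowed for name in params):
        return "badArgument"
    if "resumptionToken" in params:
        # exclusive: must be the only argument besides the verb
        return "ok" if len(params) == 1 else "badArgument"
    missing = [
        name for name, kind in allowed.items()
        if kind == "required" and name not in params
    ]
    return "badArgument" if missing else "ok"

print(check_request({"verb": "ShowMe"}))
print(check_request({"verb": "ListRecords", "resumptionToken": "AXad31"}))
```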