OAI: Past, Present and Future
Michael L. Nelson [email protected]
several slides stolen from Herbert Van de Sompel
Open Archives Meeting
Institute of Mechanical Engineers
London
07/11/01
Outline
• Past
– original goals, participants
• Present
– evolution of goals, terms, definitions, current status
• Future
– observations, use in the U.S., next steps
Background
• I met Herbert Van de Sompel in April 1999...
– we spoke of a demonstration project he had in mind, for which he had received sponsorship from Paul Ginsparg and Rick Luce
– We wanted to demonstrate a multi-disciplinary DL
that leveraged the large number of high quality, yet
often isolated, tech report servers, e-print servers,
etc.
• most DLs had grown up along single disciplines
– little to no interoperability, “gardens” of DLs
The Rise and Fall of
Distributed Searching
• wholesale distributed searching, popular at
the time, is attractive in theory but
troublesome in practice
– Davis & Lagoze, JASIS 51(3), pp. 273-80
– Powell & French, Proc 5th ACM DL, pp. 264-265
• distributed searching of N nodes still viable,
but only for small values of N
• NCSTRL: N > 100; bad
• NTRS/NIX: N<=20; ok (but could be better)
The Rise and Fall of
Distributed Searching
• Other problems of distributed searching (from STARTS)
– source-metadata problem
• how do you know which nodes to search?
– query-language problem
• syntax varies and drifts over time between the various nodes
– rank-merging problem
• how do you meaningfully merge multiple result sets?
• Temptations:
– centralize all functions
• “everything will be done at X”
– standardize on a single product
• “everyone will use system Y”
Universal Preprint Service
• A cross-archive DL that provides services on a collection of metadata harvested from multiple archives
– based on NCSTRL+; a modified version of Dienst
• support for “clustering”
• support for “buckets”
• Demonstrated at Santa Fe NM, October 21-22, 1999
– http://ups.cs.odu.edu/
– D-Lib Magazine, 6(2) 2000 (2 articles)
• http://www.dlib.org/dlib/february00/02contents.html
– UPS was soon renamed the Open Archives Initiative (OAI)
http://www.openarchives.org/
UPS Participants (totals ca. July 1999)
Archive / DL                        Records in DL   Buckets in UPS   Buckets Linked to Full Content
arXiv (www.arxiv.org)                      128943            85204                            85204
CogPrints (cogprints.soton.ac.uk)             743              742                              659
NACA (naca.larc.nasa.gov)                    3036             3036                             3036
NCSTRL (www.ncstrl.org)                     29680            25184                             9084
NDLTD (www.ndltd.org)                        1590             1590                              951
RePEc (netec.mcc.ac.uk)                     71359            71359                            13582
Totals:                                    235361           187115                           112516
Metadata Harvesting
• Getting metadata out of archives
– not all archives support metadata extraction
• some archives have undocumented metadata
extraction procedures
– not all archives support rich criteria for
extraction
• single dump concept only
• Intellectual property and use rights not
always clear
– many policies akin to “don’t ask, don’t tell”
Metadata Formatting and Quality
• Quality problems with:
– record duplication
– crucial missing fields
– internal errors
– ambiguous references to people, places, and publications
• Different formats!
– arXiv: (local)
– CogPrints: (local)
– NACA: refer
– RePEc: ReDIF
– NDLTD: MARC
– NCSTRL: RFC-1807
• unproven intuition: n digital libraries results in O(n) metadata formats
Buckets: Information
Surrogates in UPS
• Limitations on intellectual property,
file size, transmission time, system
load, etc. caused us to focus on
metadata only
• Metadata was collected into
“buckets”, with pointers back to the
data files (still at the original sites)
Value Added Services Attached to the Buckets
• SFX reference linking service, developed at the Univ of Ghent, Belgium
– provides a layer of indirection between reference services available at a local site and the object itself
• SFX “buttons” are attached to the buckets themselves
– communication occurs between the SFX server and the bucket
• Adding other services to the buckets is easy...
Data and Service Providers
• Data Providers
– publishing into an archive
– providing methods for metadata “harvesting”
• also provide non-technical context for sharing information
• Service Providers
– harvest metadata from providers
– implement user interface to data
• Even if provided by the same DL, these are
distinct functions
Data and Service Providers
• Self-describing archives
– Much of the learning about the constituent UPS archives occurred
out of band…
– Given an unknown archive, we should be able to algorithmically
determine the nature of the archive
[Diagram: two archive configurations. An archive with only an input interface and a native end-user interface offers no machine-based way to extract metadata; an archive that also exposes a native harvesting interface offers both machine and user interfaces for extracting metadata.]
Data and Service Providers
[Diagram: a service provider exposes a native end-user interface; its input and harvesting interfaces are optional. A data provider exposes input and native harvesting interfaces; its native end-user interface is optional (e.g., RePEc).]
Result… OAI
• The OAI was the result of the demonstration and
discussion during the Santa Fe meeting
• Initial focus was on federating collections of scholarly
e-print materials…
• …however, interest grew and the scope and application
of OAI expanded to become a
generic bulk metadata transport protocol
• Note:
– OAI is only about metadata -- not full text!
– OAI is neutral with respect to the nature of the metadata or
the resources the metadata describes
• read: commercial publishers have an interest in OAI too...
OAI Timeline Highlights
• October 21-22, 1999 - initial UPS meeting
• February 15, 2000 - Santa Fe Convention published in D-Lib Magazine
– precursor to the OAI metadata harvesting protocol
• June 3, 2000 - workshop at ACM DL 2000 (Texas)
• August 25, 2000 - OAI steering committee formed, DLF/CNI support
• September 7-8, 2000 - technical meeting at Cornell University
– defined the core of the current OAI metadata harvesting protocol
• September 21, 2000 - workshop at ECDL 2000 (Portugal)
• November 1, 2000 - Alpha test group announced (~15 organizations)
• January 23, 2001 - OAI protocol 1.0 announced, OAI Open Day in the
U.S. (Washington DC)
– purpose: freeze protocol for 12-16 months, generate critical mass
• February 26, 2001 - OAI Open Day in Europe (Berlin)
• July 3, 2001 - OAI protocol 1.1 announced
– to reflect changes in the W3C’s latest XML Schema recommendation
• September 8, 2001 - workshop at ECDL 2001 (Darmstadt)
Open Archives Initiative
• The protocol is openly documented, and metadata is “exposed” to at least some peer group (note: rights management can still apply!)
• “Archive” defined as a “collection of stuff” -- not the archivist’s definition of “archive”; “Repository” used in most OAI documents.
• OAI is happening at break-neck speed...
Open Archives Initiative vs. Open Archival Information System
• Open Archives Initiative (OAI): exposure of metadata for harvesting
• Open Archival Information System (OAIS): ensuring long-term preservation of archival materials
• the two can be combined: an OAIS with an OAI interface
http://www.dlib.org/dlib/april01/04editorial.html
http://www.dlib.org/dlib/may01/05letters.html
http://ssdoo.gsfc.nasa.gov/nost/isoas/us/overview.html
OAI Metadata Harvesting Protocol
• Then:
– OAI harvesting protocol originally a subset of the
Dienst (NCSTRL) protocol
• and originally called the “Santa Fe Convention”
– originally defined an OAI-specific metadata format
• Now:
– OAI metadata format dropped in favor of unqualified
Dublin Core
• other formats possible, but DC is required as lowest common
denominator
– No longer dependent on Dienst
• defined independently (though still easily mappable)
Overview of OAI Verbs
• supporting protocol requests:
– Identify: description of archive
– ListMetadataFormats: metadata formats supported by archive
– ListSets: sets defined by archive
• archival metadata harvesting verbs:
– ListIdentifiers: OAI unique ids contained in archive
– ListRecords: listing of N records
– GetRecord: listing of a single record
• most verbs take arguments: dates, sets, ids, metadata formats, and a resumption token (for flow control)
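To make the request model concrete, here is a minimal sketch (in Python, with a hypothetical base URL and helper name) of how any of these verbs maps onto a single HTTP GET against a data provider's base URL:

    import urllib.parse
    import urllib.request

    BASE_URL = "http://an.example.org/oai"    # hypothetical data provider

    def oai_request(verb, **args):
        """Issue one protocol request and return the raw XML response."""
        params = {"verb": verb, **args}
        url = BASE_URL + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as response:
            return response.read().decode("utf-8")

    # e.g. oai_request("Identify")
    #      oai_request("ListRecords", metadataPrefix="oai_dc", set="klm")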
supporting protocol requests: Identify
service provider / harvester → data provider / repository
Request: Identify
Response (Identify / Time / Request):
• Repository identifier
• Base-URL
• Admin e-mail
• OAI protocol version
• Description
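For illustration, a sketch of the Identify round trip; the base URL is made up and the element names used in the lookups are illustrative rather than the exact 1.x tag set:

    import urllib.request
    import xml.etree.ElementTree as ET

    url = "http://an.example.org/oai?verb=Identify"      # hypothetical base URL
    root = ET.fromstring(urllib.request.urlopen(url).read())

    # strip XML namespaces so the lookups below work regardless of version
    for el in root.iter():
        el.tag = el.tag.split("}")[-1]

    print(root.findtext(".//repositoryName"))    # human-readable archive name
    print(root.findtext(".//baseURL"))
    print(root.findtext(".//adminEmail"))
    print(root.findtext(".//protocolVersion"))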
supporting protocol requests: ListMetadataFormats
service provider / harvester → data provider / repository
Request: ListMetadataFormats (argument: identifier=oai:mlib:123a)
Response (ListMetadataFormats / Time / Request), repeated for each format:
• Format prefix
• Format XML schema
supporting protocol requests: ListSets
service provider / harvester → data provider / repository
Request: ListSets (argument: resumptionToken)
Response (ListSets / Time / Request), repeated for each set:
• SetSpec
• SetName
harvesting requests: ListRecords
service provider / harvester → data provider / repository
Request: ListRecords (arguments: from=a, until=b, set=klm, metadataPrefix=dc, resumptionToken)
Response (ListRecords / Time / Request), repeated for each record:
• Identifier
• Datestamp
• Metadata
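A hedged sketch of such a harvesting request with the arguments above; the repository URL, dates, and set value are illustrative, and the conventional metadata prefix for unqualified Dublin Core is oai_dc, which the slide abbreviates to dc:

    import urllib.parse, urllib.request

    base = "http://an.example.org/oai"            # hypothetical repository
    query = urllib.parse.urlencode({
        "verb": "ListRecords",
        "metadataPrefix": "oai_dc",               # unqualified Dublin Core
        "from": "2001-01-01",                     # records changed since this date...
        "until": "2001-07-01",                    # ...and no later than this one
        "set": "klm",                             # optional set restriction
    })
    records_xml = urllib.request.urlopen(base + "?" + query).read()
    # each <record> carries an identifier, a datestamp, and the metadata itself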
harvesting requests: ListIdentifiers
service provider / harvester → data provider / repository
Request: ListIdentifiers (arguments: from=a, until=b, set=klm, resumptionToken)
Response (ListIdentifiers / Time / Request), repeated for each record:
• Identifier
• Datestamp
harvesting requests: GetRecord
service provider / harvester → data provider / repository
Request: GetRecord (arguments: identifier=oai:mlib:123a, metadataPrefix=dc)
Response (GetRecord / Time / Request):
• Identifier
• Datestamp
• Metadata
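And the single-record case, again only a sketch, reusing the illustrative identifier from the slide:

    import urllib.parse, urllib.request

    base = "http://an.example.org/oai"            # hypothetical repository
    query = urllib.parse.urlencode({
        "verb": "GetRecord",
        "identifier": "oai:mlib:123a",            # identifier value from the slide
        "metadataPrefix": "oai_dc",
    })
    record_xml = urllib.request.urlopen(base + "?" + query).read()
    # the response holds one identifier / datestamp / metadata triple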
Flow Control
• ListSets, ListIdentifiers, ListRecords are all
allowed to return partial responses, via a
combination of:
– resumptionToken – an opaque, archive-defined data
string that when passed back to the archive allows the
response to begin where it left off
• each archive defines their own resumptionToken syntax; it may
have visible semantics or not
– 503 http status code – “retry after”
• up to the harvester to understand this code and respect it, and
up to the archive to enforce it
resumptionToken scenario: harvesting 277 records in 3 separate 100-record “chunks”
• harvester → repository: ListRecords
• repository → harvester: Records 1-100, resumptionToken=AXad31
• harvester → repository: ListRecords, resumptionToken=AXad31
• repository → harvester: Records 101-200, resumptionToken=pQ22-x
• harvester → repository: ListRecords, resumptionToken=pQ22-x
• repository → harvester: Records 201-277
(the repository in this scenario is backed by an RDBMS)
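A minimal sketch of the harvester side of this flow control, assuming a hypothetical base URL: keep re-issuing ListRecords with whatever resumptionToken comes back, and honour a 503 "retry after" if the archive sends one.

    import time
    import urllib.error, urllib.parse, urllib.request
    import xml.etree.ElementTree as ET

    BASE = "http://an.example.org/oai"            # hypothetical data provider

    def harvest_all(metadata_prefix="oai_dc"):
        args = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        while True:
            url = BASE + "?" + urllib.parse.urlencode(args)
            try:
                xml_text = urllib.request.urlopen(url).read()
            except urllib.error.HTTPError as e:
                if e.code == 503:                 # archive says: come back later
                    time.sleep(int(e.headers.get("Retry-After", "60")))
                    continue
                raise
            yield xml_text                        # hand one "chunk" to the caller
            root = ET.fromstring(xml_text)
            # the token is opaque; we simply pass it straight back
            token = next((el.text for el in root.iter()
                          if el.tag.endswith("resumptionToken")), None)
            if not token:
                break
            args = {"verb": "ListRecords", "resumptionToken": token}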
OAI Demos
• Data providers
– not really meant for end-user interaction, but
Suleman’s “Repository Explorer” is an excellent tool
• http://purl.org/net/oai_explorer
• 30+ registered data providers
– http://oaisrv.nsdl.cornell.edu/Register/BrowseSites.pl
– many being used for internal purposes; not registered
• Service providers
– Arc, the first known SP harvesting from OAI data
providers
• http://arc.cs.odu.edu/
• 3 registered service providers
– http://www.openarchives.org/service_provider/oai_sp.htm
– several more known to be in testing or creation
Field of Dreams
• It should be easy to be a data provider, even if it
makes more work for the service provider.
– if enough data providers exist, the service providers
will come (DPs >> SPs)
• Open-source / freely available tools
– “drop-in” data providers:
• industrial strength: http://www.eprints.org/
• personal size: http://kepler.cs.odu.edu/
– tools to make your existing DL a data provider:
• http://www.openarchives.org/tools/tools.htm
• also: OAI-implementers mailing list / mail archive!
– service providers:
• only bits and pieces currently publicly available...
OAI Observation: Front-End Only
• No input/registry mechanism
– OAI harvesting protocol is always a front-end for
something else
• filesystem, Dienst, RDBMS, LDAP, etc.
– convenient for pre-existing DLs, but does not address
“new” DLs
• e.g., “we want to do OAI”
• Bounds the scope of OAI
– responsibilities and domain of OAI are still being discussed
– tension between functionality and simplicity
OAI Observation: No T&C
• No terms & conditions provisions in
protocol
– assumes all metadata has uniform access rights
• how to restrict metadata to certain hosts?
– introducing T&C would increase the scope of
application, but at the expense of simplicity
• how expensive do we want to make a “just-a-frontend protocol” ?
• maybe T&C is a good application for sets?
OAI Observation: No T&C
• Possible to use multiple OAI servers in a
DMZ-like configuration…
[Diagram: OAI requests from arbitrary hosts go to a public OAI server; OAI requests from trusted hosts go to a private OAI server; both sit in front of the source database.]
• could even use a separate copy of the database…
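One way to sketch the public/private split, purely as an assumption (the protocol itself says nothing about access control): the private OAI server only answers requests from a configured list of trusted harvester hosts.

    TRUSTED_HOSTS = {"10.0.0.5", "10.0.0.6"}      # hypothetical trusted harvesters

    def allow_private_interface(remote_addr):
        """Admit only trusted harvesters to the private OAI server."""
        return remote_addr in TRUSTED_HOSTS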
OAI Observation: No T&C
• Possible to use OAI harvesting protocol in
closed, restricted systems
[Diagram: four DLs (OAI 1, OAI 2, OAI 3, OAI 4) harvest from one another; all OAI requests originate from these 4 DLs.]
OAI Observation: Monolithic
• An OAI server has no protocol-defined
concept of “other” OAI servers
– backups, mirrors, etc. have to be resolved
outside of the scope of OAI
• scope vs. complexity again
– fully connected graph of DLs harvesting from
each other is unnecessary
• cf. web crawlers vs. “gatherers” in U of Colorado’s Harvest System
– 3rd party harvesting interfaces raise more T&C and data
coherency issues
302 Load Balancing
• Interactive users on main DL machine should not
be impacted by metadata harvesting
– don’t take deliveries through the front door
– not part of the protocol; defined outside the protocol
[Diagram: if load > 0.05, redirect the request]
• harvester → OAI server at naca.larc.nasa.gov/oai/: http://blah/oai/?verb=ListIdentifiers
• OAI server → harvester: HTTP status code 302
• harvester → mirror OAI server at buckets.dsi.internet2.edu/naca/oai/: http://blah/oai/?verb=ListIdentifiers
• mirror → harvester:
  <?xml version="1.0" encoding="UTF-8"?>
  …
  <ListIdentifiers>
  …
  </ListIdentifiers>
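A sketch of the server-side check in the figure, with the 0.05 threshold and mirror URL taken from the slide; the function shape is an assumption, and a urllib-based harvester will follow the 302 to the mirror transparently.

    import os

    MIRROR = "http://buckets.dsi.internet2.edu/naca/oai/"   # mirror from the figure

    def maybe_redirect(query_string):
        """Return a (status, location) pair for an incoming OAI request."""
        load_1min = os.getloadavg()[0]            # one-minute load average (Unix)
        if load_1min > 0.05:                      # threshold from the slide
            return 302, MIRROR + "?" + query_string
        return 200, None                          # handle the request locally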
OAI Observation: Data Coherency
• In the interest of OAI implementer simplicity,
several issues are left for the service provider
to interpret
– what is an update vs. addition?
• in the NACA OAI interface, they are reported as the same, and it’s up to the harvesting system to figure it out (sketched below)
– deletions?
• it is currently optional for OAI systems to mark records
as deleted or not…
– still left to the harvester to interpret
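A sketch of how a harvesting system might figure it out: key the local store on the OAI identifier, so anything already seen is treated as an update and anything new as an addition (the in-memory dict stands in for whatever storage the service provider actually uses).

    local_store = {}    # identifier -> (datestamp, metadata)

    def ingest(identifier, datestamp, metadata):
        """Classify a harvested record as an update or an addition."""
        action = "update" if identifier in local_store else "addition"
        local_store[identifier] = (datestamp, metadata)
        return action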
OAI Observation: Harvest Model
• Frequency of harvests
– all-at-once harvests?
• initial harvest
• resolving data coherency
– frequent incremental harvests?
• far more efficient for both service and data providers
• Webcrawling vs. digital library models
– webcrawlers: little to no a priori information about target
– DLs: frequent harvesting of a small number of known
targets
• Realization: we know very little about actual harvesting behavior…
– are we optimizing for all-at-once, when incremental will
be more common?
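A sketch of the incremental case, under the assumption that the harvester simply remembers the date of its last run and passes it as the from argument (file name, URL, and date handling are illustrative):

    import datetime, pathlib, urllib.parse, urllib.request

    BASE = "http://an.example.org/oai"            # hypothetical repository
    STAMP = pathlib.Path("last_harvest.txt")      # where the last run date is kept

    def incremental_harvest():
        args = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
        if STAMP.exists():                        # otherwise: initial all-at-once pass
            args["from"] = STAMP.read_text().strip()
        url = BASE + "?" + urllib.parse.urlencode(args)
        xml_text = urllib.request.urlopen(url).read()
        STAMP.write_text(datetime.date.today().isoformat())
        return xml_text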
Potentially Good Ideas
(but we’re not sure yet)
• Sets
– intuition: we’ll be glad we included them
– arXiv the first to implement sets
• their DL is roughly built on “sets”, so it was an easy mapping for
them
• a few other repositories have since adopted sets
• Flow control
– harvesting == denial of service attack ?
– is the “resumptionToken” solution not enough? too much?
• need data providers with large collections and enough service
providers to generate a load
Potentially Good Ideas
(but we’re not sure yet)
• Metadata
– Q: “Which format should I use?”
• A: any/all of them…
– lowest common denominator: unqualified
Dublin Core
– Again, little known about actual behavior
• will DC actually be useful? or too lossy?
• will communities create/adopt specific formats?
• will native (presumably richer) formats be
harvested?
“The Return of MARC”?! -- we very much want this to happen...
XML Observations
• Not too much of a problem for data providers
– XML is easier to write than read
• Service providers…
– XML can be pretty picky… a large “ListRecords” result
can be invalidated with a single error
• harvest in chunks? individual records?
– author-contributed metadata is particularly a problem (e.g., control characters from copy-n-paste)
– one advantage of resumptionToken is that it
compartmentalizes bad data
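A sketch of the “compartmentalize bad data” idea: try to parse each harvested chunk, and when a chunk fails, fall back to fetching the suspect records one at a time and keeping only the ones that parse (helper names are illustrative; get_record is assumed to issue a GetRecord request).

    import xml.etree.ElementTree as ET

    def parse_chunk(xml_text):
        """Return a parsed tree, or None if the chunk is not well-formed XML."""
        try:
            return ET.fromstring(xml_text)
        except ET.ParseError:
            return None                           # caller falls back to record-by-record

    def salvage(identifiers, get_record):
        """Re-fetch records individually, keeping only the ones that parse."""
        good = []
        for oai_id in identifiers:
            tree = parse_chunk(get_record(oai_id))
            if tree is not None:
                good.append(tree)
        return good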
Current NTRS / NIX Architecture
• NASA-wide page that federates N center/project
specific servers through distributed searching
[Diagram: a user submits a search for “cfd applications” to NTRS/NIX (http://techreports.larc.nasa.gov/cgi-bin/NTRS, http://nix.nasa.gov/), which re-issues the same “cfd applications” search to each of the N constituent servers; each node is independently maintained.]
Current NTRS / NIX Architecture
• Or users can interact directly with the nodes
of NTRS/NIX…
[Diagram: the user bypasses NTRS/NIX and sends the “cfd applications” search directly to an individual node.]
Proposed Strategy: Data Providers
• Reduce the high interoperability expectations of
distributed searching…
• Each current node of NTRS, NIX and other NASA
DLs become an OAI “data provider”
– LTRS & NACA already have test OAI interfaces
• LTRS http://techreports.larc.nasa.gov/ltrs/oai/
• NACA http://naca.larc.nasa.gov/oai/
– each node is free to run their own software /
architecture / system / etc., but the method of metadata
exposure is standardized
• very low interoperability requirements
• each node can continue to have a “user interface”
Proposed Strategy: Service Providers
• NTRS, NIX and other well known,
“destination DLs” become OAI service
providers
– no longer relying on distributed searching
– harvest metadata from their constituent data
providers
– provide their value added services on local copies
of the metadata
• data remains resident at the local data providers
NTRS OAI Architecture
[Diagram: the user searches for “cfd applications” at NTRS; all searching, browsing, etc. are performed there on a local copy of the metadata, which is harvested offline through the OAI interface from the constituent nodes (LTRS, ATRS, GTRS, CASITRS, ...). Each node is independently maintained, individual nodes can still support direct user interaction, and the content (reports) remains archived at the local sites.]
Additional Models
• First step
– OAI interfaces for data providers
– DLs use OAI interfaces to move from distributed
searching to metadata harvesting
• Other possibilities
– hierarchical harvesting
• exposing metadata to other, possibly non-NASA DLs
• harvesting from other, possibly non-NASA DLs
– multi-genre DLs
– re-apply the OAI protocol for harvesting / replicating
content (not just metadata)
– 3rd party service providers
NASA DLs in the Larger STI Realm
[Diagram: NTRS (aggregating LTRS, ATRS, …, CASITRS) alongside Publishers, Universities, International, DOD, DOE, ...; this could be a fully connected graph.]
• NTRS could also be a data provider from the point of view of other DLs, allowing the harvesting of NASA report metadata.
• NTRS could also harvest metadata from other DLs, and provide access to non-NASA content.
• We hope to influence the direction of the science.gov effort to use OAI.
New Kinds of DLs
• Drawing from the same pool of DPs
– different interfaces, capabilities and collection policies for:
• public affairs
• K-12 education
• science & research
• authors / librarians / managers
– NTRS and NIX could harvest from the same sources…
• be the same DL, but with different interfaces?
• be replaced with a new, all-encompassing DL?
– DL creators can now focus on collection management
• “a la carte-ing” their collections and sub-collections
• instead of fussing over syntax synchronization of remote search services
A Generic Harvesting Protocol
• The actual uses of OAI depend on your relative
position and concerns:
– What is metadata vs. data?
– Who is a SP vs. a DP?
• Multiple OAI interfaces make many things
possible:
– restricted / public interfaces
– Arc-like description of harvested archives
– updates of log files, authority lists, etc.
• Additional services can be built on top of OAI
– content replication
– awareness services
OAI Impact
• Lightweight interoperability protocol
– an OAI layer is added to your existing DL
• Separation of responsibilities
– service providers
– data providers
• http://www.openarchives.org/