Interoperable DLs

Metadata Harvesting
Interoperable digital collections
Distributed libraries
• The reality in most digital libraries is that no
one location has all the materials that may be
of interest.
• It is often more efficient to allow a number of
sites each to retain some of the materials.
• How can we assure clients that they will see
all relevant resources, regardless of which
library they search?
Two basic approaches
• One service provider with access to
resources stored in multiple locations
– Information about all the resources is held at the
service provider.
– Services (DL scenarios) use the information to
provide connections to resources at multiple
locations
• Distributed services
– Information kept with the resources
– Services, local to each collection, interact with
other collection sites
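The difference between the two approaches can be sketched with a toy example. This is only an illustration: the three sites and their titles are invented, and real systems would exchange metadata over a network rather than share a dictionary.

```python
# Invented toy collections standing in for three library sites.
SITES = {
    "site_a": ["Metadata Harvesting Basics", "Dublin Core in Practice"],
    "site_b": ["Distributed Search Protocols"],
    "site_c": ["Harvesting at Scale"],
}


def harvest(sites):
    """Approach 1, step 1: copy information about every resource to one place."""
    return [(name, title) for name, titles in sites.items() for title in titles]


def search_central(index, term):
    """Approach 1, step 2: the service provider searches only its own copy."""
    return [(name, title) for name, title in index if term.lower() in title.lower()]


def search_distributed(sites, term):
    """Approach 2: every site is consulted at the moment of the request."""
    return [
        (name, title)
        for name, titles in sites.items()
        for title in titles
        if term.lower() in title.lower()
    ]


index = harvest(SITES)
SITES["site_b"].append("Harvesting Revisited")  # added after the harvest

central_hits = search_central(index, "harvesting")
distributed_hits = search_distributed(SITES, "harvesting")
print(len(central_hits), "hits from the central index")
print(len(distributed_hits), "hits from querying every site")
```

The central index misses the title added after the harvest until the next harvest runs; the distributed search sees it, but only by contacting every site for every query.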
Two protocols
• Z39.50
– Developed before the web
– Protocol for communicating with collection
holders in order to provide services.
• Open Archives Initiative
– Relatively recent innovation
– Central service provider gathers
information from collection holders
Z39.50 - briefly
• Information Retrieval Service Definition and
Protocol Specifications for Library Applications
• Initially developed over the OSI network
standards
• Protocol for information exchange
– Free the information seeker from the need to know the
details of the target database configuration
• Each site provides services
– Each service queries remote sites for needed
information
• Information requests mapped to database queries at the
collection site.
• Some inconsistency in the interpretation of queries.
Distributed Resources
Multiple Services
Approach 1 - One service provider gathers information about data
and uses it to provide services
[Diagram: five data providers feeding a single service provider that
offers search, browse, compare, etc.]
Distributed data and services
Approach 2: Each system is both a data repository and a service
provider. Services query other data providers as needed.
[Diagram: peer systems, each offering services such as search,
browse, and compare]
Hybrid systems
Each server is likely to have its own clients. The difference
is whether the information exchange is periodic or ad hoc.
[Diagram: five data providers and a service provider offering
search, browse, compare, etc.]
Open Archives Initiative (OAI)
• Web-based
– Uses HTTP to communicate between sites
• Centralized server
– Services provided from a site that has
already gathered the information it needs
for those services from a distributed
collection of sites.
OAI PMH
• Interoperability through Metadata Exchange
• The Open Archives Initiative Protocol for Metadata
Harvesting (OAI-PMH) is a low-barrier mechanism
for repository interoperability. Data Providers are
repositories that expose structured metadata via
OAI-PMH. Service Providers then make OAI-PMH
service requests to harvest that metadata. OAI-PMH
is a set of six verbs or services that are
invoked within HTTP.
http://www.openarchives.org/pmh/
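Because each verb is invoked as an ordinary HTTP request, a request is just a base URL plus a query string. A minimal sketch of building such URLs follows; the function name and the repository address are hypothetical and used only for illustration.

```python
from urllib.parse import urlencode

# The six verbs defined by OAI-PMH 2.0.
OAI_VERBS = {
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
}


def build_request_url(base_url, verb, **arguments):
    """Return the HTTP GET URL for one OAI-PMH request."""
    if verb not in OAI_VERBS:
        raise ValueError(f"not an OAI-PMH verb: {verb}")
    # The verb is sent as a query argument, followed by any other arguments.
    query = urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"


# Hypothetical repository address, not a real endpoint.
BASE_URL = "http://repository.example.org/oai"
print(build_request_url(BASE_URL, "Identify"))
print(build_request_url(BASE_URL, "ListRecords", metadataPrefix="oai_dc"))
```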
OAI - ORE
• Aggregations of Web Resources
• Open Archives Initiative Object Reuse and Exchange (OAI-ORE)
defines standards for the description and exchange of aggregations of
Web resources. These aggregations, sometimes called compound
digital objects, may combine distributed resources with multiple media
types including text, images, data, and video. The goal of these
standards is to expose the rich content in these aggregations to
applications that support authoring, deposit, exchange, visualization,
reuse, and preservation. Although a motivating use case for the work
is the changing nature of scholarship and scholarly communication,
and the need for cyberinfrastructure to support that scholarship, the
intent of the effort is to develop standards that generalize across all
web-based information including the increasingly popular social
networks of “web 2.0”.
http://www.openarchives.org/ore/
OAI-ORE example
1. The URI http://arxiv.org/abs/astro-ph/0601007
of the human start page.
2. The formats in which the document is
available, i.e. PostScript, PDF, etc. These are
effectively the constituents of the aggregation
that is the arXiv document. For the remainder
of this example we will consider this human
start page, the splash page, as also a
constituent of the aggregation
3. The title of the arXiv document.
4. The authors of the arXiv document.
5. The creation and last modification date of the
arXiv document.
6. Identifiers of entities that are in some manner
comparable to this arXiv document. For
example, a version of this document was later
published as an article in a peer-reviewed
journal, and the Digital Object Identifier of that
article is shown.
7. The versions of this document.
8. Links to other arXiv documents in the same
collection (i.e., astro-ph).
9. Citations made by this arXiv document, and
citations it received from other documents.
The problem is that this URI identifies only the human-readable
landing page; it does not really represent the resource as a whole.
http://www.openarchives.org/ore/1.0/primer
OAI - ORE
• ORE allows aggregation of related web
pages to form a logical unit
– The representation allows access to all of
the components of a resource at once.
Our focus
• We will concentrate on OAI-PMH
– Allowing us to know about other resources
of interest to our societies
– Allowing others to know about the
resources we have available
Spot check
• What sort of resources are handled by your site?
Are the resources well represented by the
landing page? Do you have complex resources
that need structural description as well as the
usual Dublin Core fields?
• Spend a few minutes talking to someone not on
your team about the resources you have and
what it takes to describe them. Then switch and
listen to the other person’s analysis of their
resources.
• Report your conclusions
Older approaches - 1
• Z39.50
– Special purpose protocol (machine to
machine, not web interface)
– Gathers information when it is requested,
not on a scheduled basis.
OAI Compared to Z39.50
                      Z39.50           OAI
Content (Objects)     Distributed      Distributed
World View            Bibliographic    Bibliographic
Object Presentation   Data provider    Data provider
Searching is          Distributed      Centralized
Search done by        Data provider    Service provider
Metadata searched is  Up to date       Stale
Semantic Mapping      When searching   Metadata delivery
Source: oai.grainger.uiuc.edu/FinalReport/JCDL_2003_OAI_Intro.ppt
Open Archives Initiative Protocol for Metadata Harvesting -- OAI-PMH
• OAI-PMH defines an interface between the Harvester and
any number of Repositories
• The OAI Harvester (Service Provider) sends an HTTP request
carrying an OAI verb
• The OAI Repository (Metadata Provider) answers with an
HTTP response in XML
• The repository interface is implemented as CGI, ASP, PHP, or other
• Any system may serve as a harvester, repository, or both
OAI - PMH components
• Service Providers and Data Providers
• Requests and Responses
http://www.oaforum.org/tutorial/english/page3.htm#section3
Records
• Metadata of a resource.
• Three parts
– Header (required)
• Identifier (required: 1 only)
• Datestamp (required: 1 only)
• setSpec elements (optional: 0, 1, or more)
• Status attribute for deleted item
– Metadata (required)
• XML encoded metadata with root tag, namespace
• Repositories must support Dublin Core, other formats optional
– “About” statement (optional)
• Rights statements
• Provenance statements
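The three parts of a record can be picked apart with a standard XML parser. The sketch below assumes an invented sample record (the title and set name are made up) and uses hypothetical helper names; it is an illustration, not a complete harvester.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

# Invented sample record; title and setSpec are made up for illustration.
SAMPLE_RECORD = """
<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:etd.vt.edu:etd-1234567890</identifier>
    <datestamp>2012-03-28</datestamp>
    <setSpec>theses</setSpec>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>An Invented Thesis Title</dc:title>
    </oai_dc:dc>
  </metadata>
</record>
"""


def summarize_record(xml_text):
    """Pull the header fields, Dublin Core titles, and 'about' flag from a record."""
    record = ET.fromstring(xml_text.strip())
    header = record.find(f"{OAI}header")
    metadata = record.find(f"{OAI}metadata")
    return {
        "identifier": header.findtext(f"{OAI}identifier"),
        "datestamp": header.findtext(f"{OAI}datestamp"),
        "sets": [s.text for s in header.findall(f"{OAI}setSpec")],
        # A deleted item is flagged by a status attribute on the header.
        "deleted": header.get("status") == "deleted",
        "titles": [] if metadata is None else [t.text for t in metadata.iter(f"{DC}title")],
        "has_about": record.find(f"{OAI}about") is not None,
    }


summary = summarize_record(SAMPLE_RECORD)
print(summary)
```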
Identifiers
• Globally unique identifier
• Valid URI
– Examples
• oai:<archiveId>:<recordId>
• oai:etd.vt.edu:etd-1234567890
– Must resolve to one item
• No duplicates
• No reuse of previously used identifiers
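A harvester may want to check identifiers as they arrive. The sketch below assumes the oai:<archiveId>:<recordId> form shown above; both function names are hypothetical.

```python
from typing import NamedTuple


class OaiIdentifier(NamedTuple):
    archive_id: str
    record_id: str


def parse_oai_identifier(identifier):
    """Split oai:<archiveId>:<recordId> into its two variable parts."""
    scheme, _, rest = identifier.partition(":")
    archive_id, _, record_id = rest.partition(":")
    if scheme != "oai" or not archive_id or not record_id:
        raise ValueError(f"not an oai-scheme identifier: {identifier!r}")
    return OaiIdentifier(archive_id, record_id)


def find_duplicates(identifiers):
    """Return identifiers that occur more than once, in first-seen order."""
    seen, duplicates = set(), []
    for identifier in identifiers:
        if identifier in seen and identifier not in duplicates:
            duplicates.append(identifier)
        seen.add(identifier)
    return duplicates


parsed = parse_oai_identifier("oai:etd.vt.edu:etd-1234567890")
print(parsed.archive_id, parsed.record_id)
```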
Datestamps
• Date of last modification of a record
– Used only for harvesting (meta metadata?)
• Mandatory for each item in the repository
• Two levels of granularity possible
– YYYY-MM-DD
– YYYY-MM-DDThh:mm:ssZ
• T separates the date from the time; Z marks the time as UTC (GMT), which is required
• Allows harvesting incrementally -- get only
what is new since last visit
– Accessed by arguments from and until
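Incremental harvesting amounts to remembering the date of the last visit and passing it as the from argument. A minimal sketch at day granularity follows; the function name, repository address, and dates are assumptions made for illustration.

```python
from datetime import date
from urllib.parse import urlencode


def incremental_harvest_url(base_url, metadata_prefix, last_visit, today):
    """Build a ListRecords URL covering changes from last_visit through today."""
    arguments = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        # Day granularity (YYYY-MM-DD); date.isoformat() produces this form.
        "from": last_visit.isoformat(),
        "until": today.isoformat(),
    }
    return f"{base_url}?{urlencode(arguments)}"


# Hypothetical repository address and visit dates.
harvest_url = incremental_harvest_url(
    "http://repository.example.org/oai", "oai_dc", date(2012, 3, 1), date(2012, 3, 28)
)
print(harvest_url)
```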
The question of time
• What time is it?
– How do you represent this moment in time
in a message that goes to people in several
different places around the world?
• There is a standard for that.
– Look up (Wikipedia will do) the ISO 8601
standard for unambiguous specification of
time.
– Write down what time it is right now (use
minutes, but not seconds) Yes, the time will
change during our discussion.
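For checking an answer to the exercise, here is a small sketch that writes a moment in ISO 8601 form, in UTC, to the minute. The sample time and its five-hour offset are invented. Note that the finer of the two OAI-PMH granularities also includes seconds.

```python
from datetime import datetime, timedelta, timezone


def utc_minute_stamp(moment):
    """Format a time-zone-aware datetime as ISO 8601 in UTC, to the minute."""
    if moment.tzinfo is None:
        raise ValueError("moment must carry a time zone")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%MZ")


# 16:30 in a zone five hours behind UTC is 21:30 UTC.
local = datetime(2012, 3, 28, 16, 30, tzinfo=timezone(timedelta(hours=-5)))
print(utc_minute_stamp(local))
print(utc_minute_stamp(datetime.now(timezone.utc)))
```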
The OAI-PMH verbs
• Each requests a specific response from
a data repository
Identify
• Function: Description of the archive
• Example: http://www.language-archives.org/cgi-bin/olaca3.pl?verb=Identify
• Parameters: none
• Errors/exceptions:
– badArgument (the request includes an argument; there should not be any)
• Response format:
Element            Example                              Ordinality ‡
repositoryName     My Archive                           1
baseURL            http://archive.org/oai               1
protocolVersion    2.0                                  1
earliestDatestamp  1999-01-01                           1
deletedRecord      no, transient, persistent            1
granularity        YYYY-MM-DD, YYYY-MM-DDThh:mm:ssZ     1
adminEmail         [email protected]                    +
compression        deflate, compress                    *
description        oai-identifier, eprints, friends, …  *
‡ Ordinality: 1 = mandatory, 1 only; + = mandatory, 1 or more; * = optional, 0 or more
Actual response from
http://www.language-archives.org/cgi-bin/olaca3.pl?verb=Identify
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2012-03-28T21:30:33Z</responseDate>
<request verb="Identify">http://www.language-archives.org/cgi-bin/olaca3.pl</request>
<Identify>
<repositoryName>OLAC Aggregator</repositoryName>
<baseURL>http://www.language-archives.org/cgi-bin/olaca3.pl</baseURL>
<protocolVersion>2.0</protocolVersion>
<adminEmail>[email protected]</adminEmail>
<earliestDatestamp>1873-04-18</earliestDatestamp>
<deletedRecord>no</deletedRecord>
<granularity>YYYY-MM-DD</granularity>
<!-- maybe later <compression>identity</compression> -->
<!-- The two description elements below are expanded on the following slides -->
<description>...</description>
<description>...</description>
</Identify>
</OAI-PMH>
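A harvester typically reads the Identify response to decide how to harvest. The sketch below parses a trimmed copy of the response above; the adminEmail value is a placeholder (the address is not recoverable from the transcript), and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Trimmed copy of the response above; adminEmail is a placeholder value.
IDENTIFY_XML = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2012-03-28T21:30:33Z</responseDate>
  <request verb="Identify">http://www.language-archives.org/cgi-bin/olaca3.pl</request>
  <Identify>
    <repositoryName>OLAC Aggregator</repositoryName>
    <baseURL>http://www.language-archives.org/cgi-bin/olaca3.pl</baseURL>
    <protocolVersion>2.0</protocolVersion>
    <adminEmail>admin@example.org</adminEmail>
    <earliestDatestamp>1873-04-18</earliestDatestamp>
    <deletedRecord>no</deletedRecord>
    <granularity>YYYY-MM-DD</granularity>
  </Identify>
</OAI-PMH>"""


def harvesting_policy(identify_xml):
    """Extract the fields a harvester needs before its first harvest."""
    identify = ET.fromstring(identify_xml).find("oai:Identify", NS)
    granularity = identify.findtext("oai:granularity", namespaces=NS)
    deleted = identify.findtext("oai:deletedRecord", namespaces=NS)
    return {
        "name": identify.findtext("oai:repositoryName", namespaces=NS),
        "earliest": identify.findtext("oai:earliestDatestamp", namespaces=NS),
        # Only the finer granularity allows from/until values with a time part.
        "accepts_times": granularity == "YYYY-MM-DDThh:mm:ssZ",
        # "no" means the repository keeps no information about deletions.
        "tracks_deletions": deleted != "no",
    }


policy = harvesting_policy(IDENTIFY_XML)
print(policy)
```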
Continued
First expansion
<description>
<oai-identifier xmlns="http://www.openarchives.org/OAI/2.0/oai-identifier"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai-identifier
http://www.openarchives.org/OAI/2.0/oai-identifier.xsd">
<scheme>oai</scheme>
<repositoryIdentifier>OLACA.language-archives.org</repositoryIdentifier>
<delimiter>:</delimiter>
<sampleIdentifier>oai:ethnologue.com:aaa</sampleIdentifier>
</oai-identifier>
</description>
Continued
<description>
<olac-archive xmlns="http://www.language-archives.org/OLAC/1.1/olac-archive"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" type="institutional"
xsi:schemaLocation="http://www.language-archives.org/OLAC/1.1/olac-archive
http://www.language-archives.org/OLAC/1.1/olac-archive.xsd" currentAsOf="2012-03-28">
<archiveURL>http://www.language-archives.org/archive_records/</archiveURL>
<participant name="Steven Bird" role="Curator" email="[email protected]"/>
<participant name="Gary Simons" role="Curator" email="[email protected]"/>
<participant name="Haejoong Lee" role="Administrator" email="[email protected]"/>
<institution>Open Language Archives Community</institution>
<institutionURL>http://www.language-archives.org/</institutionURL>
<shortLocation>Philadelphia, U.S.A.</shortLocation>
<location/>
<synopsis>
This repository contains all records from OLAC-registered archives. It is intended to be used
by services which do not want to harvest individual OLAC archives.
</synopsis>
<access>
Metadata may be used only subject to the access permissions given by the individual
archives.
</access>
</olac-archive>
</description>
ListMetadataFormats
• Function: retrieve available metadata formats
from archive
• Parameters: identifier (optional)
• Errors/exceptions:
– badArgument
– idDoesNotExist
– noMetadataFormats
Response to http://www.language-archives.org/cgi-bin/olaca3.pl?verb=ListMetadataFormats
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2012-03-28T21:38:46Z</responseDate>
<request verb="ListMetadataFormats">http://www.language-archives.org/cgi-bin/olaca3.pl</request>
<ListMetadataFormats>
<metadataFormat>
<metadataPrefix>olac</metadataPrefix>
<schema>http://www.language-archives.org/OLAC/1.1/olac.xsd</schema>
<metadataNamespace>http://www.language-archives.org/OLAC/1.1/</metadataNamespace>
</metadataFormat>
<metadataFormat>
<metadataPrefix>olac_display</metadataPrefix>
<schema>http://www.language-archives.org/OLAC/1.1/olac.xsd</schema>
<metadataNamespace>http://www.language-archives.org/OLAC/1.1/</metadataNamespace>
</metadataFormat>
<metadataFormat>
<metadataPrefix>olac_dla</metadataPrefix>
<schema>http://www.language-archives.org/OLAC/1.1/olac.xsd</schema>
<metadataNamespace>http://www.language-archives.org/OLAC/1.1/</metadataNamespace>
</metadataFormat>
<metadataFormat>
<metadataPrefix>oai_dc</metadataPrefix>
<schema>http://www.openarchives.org/OAI/2.0/oai_dc.xsd</schema>
<metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
</metadataFormat>
</ListMetadataFormats>
</OAI-PMH>
ListSets
• Function: retrieve set structure of a repository
• Example: archive.org/oaiscript?verb=ListSets
• Parameters: resumptionToken (exclusive)
• Errors/exceptions:
– badArgument
– badResumptionToken
– noSetHierarchy
Sets are optional and are used to divide
a repository into separate units that will
be of interest to different harvesters.
ListIdentifiers
• Function: abbreviated form of ListRecords, retrieving only
headers
• Parameters:
– from (optional)
– until (optional)
– metadataPrefix (required)
– set (optional)
– resumptionToken (exclusive)
• Errors/exceptions:
– badArgument
– badResumptionToken
– cannotDisseminateFormat
– noRecordsMatch
– noSetHierarchy
ListRecords
• Function: harvest records from a repository
• Parameters:
– from (optional)
– until (optional)
– metadataPrefix (required)
– set (optional)
– resumptionToken (exclusive)
• Errors/exceptions:
– badArgument
– badResumptionToken
– cannotDisseminateFormat
– noRecordsMatch
– noSetHierarchy
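The "exclusive" resumptionToken is what lets a repository return a long list in pieces. A minimal sketch of the harvesting loop follows. The repository is faked in memory so the example runs without a network: fake_fetch, the page helper, the token value, and the identifiers are all invented, and the stub records omit the metadata part for brevity.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def harvest_identifiers(base_url, metadata_prefix, fetch):
    """Yield record identifiers, following resumptionTokens until none remains.

    fetch is any callable that takes a URL and returns the response XML text.
    """
    arguments = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(fetch(f"{base_url}?{urlencode(arguments)}"))
        for header in root.iter(f"{OAI}header"):
            yield header.findtext(f"{OAI}identifier")
        token = root.findtext(f"{OAI}ListRecords/{OAI}resumptionToken")
        if not token:  # missing or empty token: the list is complete
            break
        # Exclusive argument: the token replaces every argument except the verb.
        arguments = {"verb": "ListRecords", "resumptionToken": token}


def page(identifiers, token=None):
    """Build one stub ListRecords response (headers only) for the fake repository."""
    records = "".join(
        f"<record><header><identifier>{i}</identifier>"
        "<datestamp>2012-03-28</datestamp></header></record>"
        for i in identifiers
    )
    resumption = f"<resumptionToken>{token}</resumptionToken>" if token else "<resumptionToken/>"
    return (
        '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
        f"<ListRecords>{records}{resumption}</ListRecords></OAI-PMH>"
    )


def fake_fetch(url):
    """Stand-in for an HTTP GET: serves two pages of invented records."""
    if "resumptionToken=page2" in url:
        return page(["oai:example.org:3"])
    return page(["oai:example.org:1", "oai:example.org:2"], token="page2")


harvested = list(harvest_identifiers("http://repository.example.org/oai", "oai_dc", fake_fetch))
print(harvested)
```

Passing the fetch step in as a function keeps the loop testable; against a real repository it could be replaced by a function that performs the HTTP GET.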
GetRecord
• Function: retrieve an individual metadata record
from a repository
• Parameters:
– identifier (required)
– metadataPrefix (required)
• Errors/exceptions:
– badArgument
– cannotDisseminateFormat
– idDoesNotExist
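The error conditions listed for each verb are reported inside the XML response, as an error element with a code attribute. A sketch of checking for them follows; the sample response text, the repository address, and the class and function names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

# Invented sample of an error response to a GetRecord request.
ERROR_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2012-03-28T21:45:00Z</responseDate>
  <request>http://repository.example.org/oai</request>
  <error code="idDoesNotExist">No matching identifier</error>
</OAI-PMH>"""


class OaiError(Exception):
    """An error condition reported by the repository."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code


def check_for_errors(response_xml):
    """Return the parsed response, or raise OaiError if it reports an error."""
    root = ET.fromstring(response_xml)
    error = root.find(f"{OAI}error")
    if error is not None:
        raise OaiError(error.get("code"), (error.text or "").strip())
    return root


failure_code = None
try:
    check_for_errors(ERROR_RESPONSE)
except OaiError as problem:
    failure_code = problem.code
    print("request failed:", problem)
```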
Spot Check
• Go back to the site from which we retrieved
the responses above and try the other
OAI-PMH verbs there.
Interoperability
• The goal: communication, without human
intervention, between information sources
– Books that “talk to each other”
• Live links for references
• Knowledge of how to find relevant resources
when needed
• Ability to query other information locations
Protocols
• Precise rules for interactions between
independent processes
– Format of the messages
• Both structure and content
– Specified behavior in response to specific
messages
• Many ways to accomplish the same result,
but both sides must have the same
understanding of the rules of engagement.
Protocol Types
• RPC model
– Point to point
– Completely open to definition by developer
• Verbs (methods)
• Nouns (objects, resources)
– Useful to closed community or group who
know about the availability of the resource.
SOAP
• The original expansion of the acronym (Simple Object
Access Protocol) has been dropped.
• Initially developed as part of the Microsoft .NET
paradigm
– Now in W3C committee
• Stateless, one-way message exchange paradigm
• XML encoded
• Flexibility of RPC, but more constrained in the
way communication is formatted.
SOAP is a lightweight protocol intended for
exchanging structured information in a
decentralized, distributed environment. SOAP
uses XML technologies to define an extensible
messaging framework, which provides a
message construct that can be exchanged over
a variety of underlying protocols. The framework
has been designed to be independent of any
particular programming model and other
implementation specific semantics.
http://msdn.microsoft.com/en-us/library/ms995800.aspx
REST
• REpresentational State Transfer
• An after-the-fact definition of the architecture of
the World Wide Web
• The model is
– Client/server
– Stateless
– Cacheable
– Layered
• Resource interface constrained
– Restricted verbs
– Restricted content types
• RESTful applications use HTTP requests to post data
(create and/or update), read data (e.g., make queries), and
delete data. Thus, REST uses HTTP for all four CRUD
(Create/Read/Update/Delete) operations.
• REST is a lightweight alternative to mechanisms like RPC
(Remote Procedure Calls) and Web Services (SOAP,
WSDL, et al.). Later, we will see how much more simple
REST is.
• Despite being simple, REST is fully-featured; there's
basically nothing you can do in Web Services that can't be
done with a RESTful architecture.
http://rest.elkstein.org/
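The CRUD mapping can be sketched with the standard library by building, but not sending, the request for each operation. Using PUT for update is a common convention assumed here (the text above only mentions posting data), and the URL is hypothetical.

```python
from urllib.request import Request

# Assumed mapping; PUT for update is a convention, not taken from the slides.
CRUD_TO_HTTP = {"create": "POST", "read": "GET", "update": "PUT", "delete": "DELETE"}


def rest_request(operation, url, body=None):
    """Build (but do not send) the HTTP request for one CRUD operation."""
    return Request(url, data=body, method=CRUD_TO_HTTP[operation])


# Hypothetical resource URL.
RESOURCE = "http://repository.example.org/records/42"
for operation in CRUD_TO_HTTP:
    request = rest_request(operation, RESOURCE)
    print(operation, "->", request.get_method(), request.full_url)
```

The same URL names the resource in every case; only the HTTP method changes, which is the "restricted verbs" constraint in practice.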
REST and RPC
• RPC provides flexibility for any type of
interaction between any type of
resources
• REST provides consistency to allow
interaction among resources without
prior discovery of accepted actions and
responses.
SOAP and REST
• Debate in the Web community about
which is the better paradigm for
application development
• REST -- restricted, but simple extension
of existing Web processes
• SOAP -- added flexibility with cost in
terms of bandwidth, security, complexity
for development
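The bandwidth point can be made concrete by expressing one request both ways. This is a rough sketch only: the service namespace, operation name, and repository address are invented, and a plain HTTP GET with query arguments stands in for the REST side.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

SOAP_ENV = "http://www.w3.org/2003/05/soap-envelope"  # SOAP 1.2 envelope namespace
SERVICE_NS = "http://repository.example.org/service"  # hypothetical service namespace


def soap_message(operation, **parameters):
    """Wrap one operation and its parameters in a SOAP envelope."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in parameters.items():
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = value
    return ET.tostring(envelope, encoding="unicode")


def http_get_url(base_url, **parameters):
    """Express the same request as a plain HTTP GET URL."""
    return f"{base_url}?{urlencode(parameters)}"


record_id = "oai:etd.vt.edu:etd-1234567890"
soap = soap_message("GetRecord", identifier=record_id)
url = http_get_url("http://repository.example.org/oai", verb="GetRecord", identifier=record_id)
print(len(soap), "characters as a SOAP message")
print(len(url), "characters as a GET URL")
```

The SOAP message would additionally be sent as the body of an HTTP POST, so the envelope is overhead on top of the HTTP request itself.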
References
• Giving SOAP a REST
http://www.devx.com/DevX/Article/8155
• SOAP Version 1.2 Part 0: Primer
http://www.w3.org/TR/2003/REC-soap12-part0-20030624/#L1153
• OAI For Beginners - The Open Archives Forum online
tutorial: http://www.oaforum.org/tutorial/index.php
• Z39.50 Resource Page:
http://www.niso.org/standards/resources/Z3950_Resources.html
• Z39.50 An Overview of Development and the Future (1995)
http://www.cqs.washington.edu/~camel/z/z.html
Plus a few other sites as noted in the slides