OAI frontend for Unicorn - Bibliothèques de l`Université Libre de

Extracting XML from Unicorn
with OAI and SRU
European Unicorn User Group Conference
Glasgow Caledonian University
September 7th & 8th, 2006
Benoit PAUWELS
Université Libre de Bruxelles (ULB)
Brussels
Agenda
• Introduction – Unicorn interfaces
• Part 1: An OAI frontend for Unicorn
• Part 2: An SRU frontend for Unicorn
– Short description of OAI and SRU protocols
– Overview of technical implementation
– Use cases and demos
Introduction
• OAI and SRU are ‘open’ protocols that permit
exchange of metadata between information
systems
• Well-known Unicorn interfaces:
– Unicorn API server
– Unicorn Webcat/iBistro/iLink server
– Unicorn Z39.50 server
• All follow the same pattern: a sequence of
requests and responses
Unicorn interfaces: API server
[Diagram: a client system sends an API request over a TCP/IP socket to the API server on the Unicorn server, in front of the catalogue database (records and indexes); the API response returns API datacodes/values]
• Client system: Character client, C Workflows client, Java Themes client
• Communication protocol: TCP/IP socket
• Information exchange protocol: proprietary SirsiDynix API requests/responses
• Returned record structure: proprietary SirsiDynix format (data codes and values)
Unicorn interfaces: iLink
[Diagram: a client system sends an iLink request (URL) over HTTP to the iLink web server on the Unicorn server, in front of the catalogue database (records and indexes); the response is an HTML page]
• Client system: any Web browser
• Communication protocol: HTTP
• Information exchange protocol: URL requests / HTML responses
• Returned record structure: HTML
Unicorn interfaces: Z39.50
[Diagram: a client system sends a Z39.50 request to the Z39.50 server on the Unicorn server, in front of the catalogue database (records and indexes); the Z39.50 response returns MARC21 records]
• Client system: any Z39.50 client
• Communication protocol: Z39.50 specific
• Information exchange protocol: Z39.50 specific
• Returned record structure: typically MARC21
Unicorn interfaces
• API: Proprietary
– low interoperability level
• HTML: Record data not well structured
– low reusability level
• Z39.50: Protocol specific
– more difficult to implement (steep learning curve)
– Z39.50 is stateful
All three are difficult to integrate into today’s web services
environments. What such environments expect:
– communication: use HTTP
– information exchange: use open protocols (like OAI and SRU)
– record data structure: use XML (according to a well-defined XML
Schema)
2 new Unicorn interfaces
• HTTP / Open / XML
• OAI-PMH: Open Archives Initiative –
Protocol for Metadata Harvesting
• SRU: Search and Retrieve via URL
OAI-PMH: the protocol
[Diagram: a Service Provider sends HTTP-embedded OAI requests to the OAI frontend on the web server of a Data Provider, which sits in front of the document archive, and receives HTTP-embedded OAI responses]
OAI-PMH: the protocol
• ‘Harvester collects metadata from
archives’
• Stateless protocol: sequence of OAI
requests/responses over HTTP
• Just harvesting -- NOT searching
OAI-PMH: the protocol
OAI requests
• HTTP GET|POST requests
• Syntax
– BASE URL
• host + port + path of OAI request handler
– key=value pairs
• Examples:
– http://www.cible.ulb.ac.be:80/cgi-bin/OAI20/catalog?verb=Identify
– http://www.biomedcentral.com/oai/1.1/bmcoai.asp?verb=GetRecord&identifier=oai:bmc:1471-2105-11&metadataPrefix=oai_dc
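The request syntax above (base URL plus key=value pairs) can be sketched in a few lines of Python. This is only an illustration: the function name `build_oai_request` is hypothetical, and keeping the colon unescaped is a choice made here so that the result reads like the URLs on the slide.

```python
from urllib.parse import urlencode


def build_oai_request(base_url, verb, arguments=None):
    """Return an OAI-PMH GET URL: base URL + '?' + key=value pairs.

    `arguments` is a dict because 'from' is a reserved word in Python
    and cannot be passed as a keyword argument.
    """
    pairs = [("verb", verb)]
    for key, value in (arguments or {}).items():
        if value is not None:
            pairs.append((key, value))
    # safe=":" leaves the colons of OAI identifiers readable (a choice
    # made for this sketch; percent-encoding them is equally valid).
    return base_url + "?" + urlencode(pairs, safe=":")


url = build_oai_request(
    "http://www.biomedcentral.com/oai/1.1/bmcoai.asp",
    "GetRecord",
    {"identifier": "oai:bmc:1471-2105-11", "metadataPrefix": "oai_dc"},
)
print(url)
```

The same helper covers every verb, since only the key=value pairs change from one request to the next.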
OAI-PMH: the protocol
OAI responses
• XML encoded bytestreams, containing the records
• Record = triplet
– header (unique OAI identifier)
– metadata
– about
• Metadata schemes
– XML Schema
– Minimum: unqualified Dublin Core
– Community specific
• Example of a record (catkey 450000 from ULB catalogue):
– oai_dc marc21 umods
OAI-PMH: the protocol
Simple: 6 OAI requests/responses (arguments in [brackets] are optional)
• Identify
– http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=Identify
• ListMetadataFormats: [identifier]
– http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListMetadataFormats
• ListSets
– http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListSets
• GetRecord: identifier, metadataPrefix
– http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=GetRecord&identifier=oai:ulbcat:245000&metadataPrefix=marc21
OAI-PMH: the protocol
Simple: 6 OAI requests/responses (arguments in [brackets] are optional)
• ListRecords: metadataPrefix, [from, until, set]
– http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListRecords&metadataPrefix=oai_dc
– http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListRecords&metadataPrefix=mhld21&set=elper
– http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListRecords&metadataPrefix=marc21&from=2006-08-01
• ListIdentifiers: metadataPrefix, [from, until, set]
– http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListIdentifiers&metadataPrefix=oai_dc
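The six verbs and their arguments can be captured in a small lookup table that a request handler consults before doing any work. The sketch below is illustrative only: `OAI_VERBS` and `check_arguments` are hypothetical names, and the `resumptionToken` argument defined by the protocol for the list verbs is left out because the slides do not show it. The `badVerb` and `badArgument` labels are the protocol's own error codes.

```python
# verb -> (required arguments, optional arguments), as listed on the slides
OAI_VERBS = {
    "Identify": (set(), set()),
    "ListMetadataFormats": (set(), {"identifier"}),
    "ListSets": (set(), set()),
    "GetRecord": ({"identifier", "metadataPrefix"}, set()),
    "ListRecords": ({"metadataPrefix"}, {"from", "until", "set"}),
    "ListIdentifiers": ({"metadataPrefix"}, {"from", "until", "set"}),
}


def check_arguments(verb, arguments):
    """Return a list of problems; an empty list means the request is acceptable."""
    if verb not in OAI_VERBS:
        return ["badVerb: " + str(verb)]
    required, optional = OAI_VERBS[verb]
    given = set(arguments)
    problems = []
    for name in sorted(required - given):
        problems.append("badArgument: missing " + name)
    for name in sorted(given - required - optional):
        problems.append("badArgument: unexpected " + name)
    return problems


print(check_arguments("GetRecord", {"identifier": "oai:ulbcat:245000"}))
```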
OAI frontend for Unicorn
• Implementation of the data provider
functionality (2001)
• http://www.openarchives.org/tools/tools.html:
pick a template and interface it with Unicorn
through Unicorn database tools
• Our choice: Object Oriented Perl frontend (H.
Suleman – Virginia Tech)
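OAI frontend for Unicorn
[Diagram: an HTTP-embedded OAI request reaches the HTTP server on the Unicorn server; the CGI entry point is an OAI C wrapper, with a fork in the ‘sirsi’ environment to OAI.pl, which works against the Unicorn database; the result goes back as an HTTP-embedded OAI response]
Processing steps shown in the diagram:
• call the appropriate OAI request handler
• retrieve metadata from Unicorn database
• format in XML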
OAI frontend for Unicorn
Example: implementation of the GetRecord request
http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=GetRecord&identifier=oai:ulbcat:245000&metadataPrefix=oai_dc
1. Get metadata from Unicorn for catkey 245000
$record = `echo $catkey | catalogdump -of | filtermarc -iALL -od -Ds`;
@dates = split('\|', `echo $catkey | selcatalog -iK opr`);
2. Convert ANSEL character set into ISO-LATIN-1
3. Map from MARC to oai_dc
4. Format into XML
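Steps 3 and 4 can be sketched in Python as follows. This is not the Perl code of the actual frontend: the input shape (a list of tag/subfield pairs), the function name `marc_to_oai_dc` and the five-entry crosswalk are assumptions made for illustration, and the character-set conversion of step 2 is assumed to have happened already. Only the two namespace URIs are fixed by the OAI and Dublin Core specifications.

```python
import xml.etree.ElementTree as ET

OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Partial, illustrative crosswalk: (MARC tag, subfield) -> Dublin Core element.
CROSSWALK = [
    ("245", "a", "title"),
    ("100", "a", "creator"),
    ("650", "a", "subject"),
    ("260", "b", "publisher"),
    ("260", "c", "date"),
]


def marc_to_oai_dc(fields):
    """Map MARC fields to an oai_dc XML string.

    `fields` is a list of (tag, {subfield_code: value}) tuples, a
    hypothetical in-memory form of the dumped record.
    """
    ET.register_namespace("oai_dc", OAI_DC_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("{%s}dc" % OAI_DC_NS)
    for tag, code, dc_name in CROSSWALK:
        for field_tag, subfields in fields:
            if field_tag == tag and code in subfields:
                child = ET.SubElement(root, "{%s}%s" % (DC_NS, dc_name))
                child.text = subfields[code]
    return ET.tostring(root, encoding="unicode")
```

Repeatable fields such as 650 produce one Dublin Core element per occurrence, which matches the way unqualified Dublin Core allows every element to repeat.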
OAI frontend for Unicorn
Example: implementation of the ‘set’ parameter of the
ListRecords request
http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListRecords&metadataPrefix=oai_dc&set=elper
• Precompile each set as a file of catkeys
– name of file: « name of set_catkeys »
• einstein_albert_catkeys
• elper_catkeys
• sd_catkeys
• all_catkeys
– through periodic execution of « mkoaisets » custom report
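Resolving a `set` value to its precompiled file then becomes a simple lookup. The sketch below assumes one catkey per line and a single directory holding all set files; both are assumptions, as are the function names.

```python
import os


def set_catkeys_path(directory, set_name):
    # Naming convention from the slide: "<name of set>_catkeys"
    return os.path.join(directory, set_name + "_catkeys")


def read_set_catkeys(directory, set_name):
    """Return the catkeys of a precompiled set, or None if the set is unknown."""
    # Only accept plain set names, so a request cannot point outside the directory.
    if not set_name.replace("_", "").isalnum():
        return None
    path = set_catkeys_path(directory, set_name)
    if not os.path.isfile(path):
        return None
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]
```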
OAI frontend for Unicorn
Example: implementation of the ‘from/until’ parameters of the
ListRecords request
http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListRecords&metadataPrefix=oai_dc&from=2006-08-01&until=2006-08-31
• BRS index on creation/modification date?
• Every Unicorn record that gets created or modified is ‘touched’
in the ‘textedit’ and ‘browsedit’ directories
• Custom report ‘cadutext’
– saves catkeys to <ud>/Savedkeys/adutext/rptid
– adds line ‘rptid|date|status’ to <ud>/Lastruns/cadutext
• Example: « from=2006-08-01&until=2006-08-31 »
– obtain report ids for all runs of cadutext after 2006-08-01 and before
2006-08-31 from the file <ud>/Lastruns/cadutext
– for each of these report ids: obtain catkeys from
<ud>/Savedkeys/adutext/rptid and save them to
randomnumber_catkeys file
– sort and uniq the randomnumber_catkeys file
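The three steps of this example translate into a short Python sketch. Several details are assumptions: the date column of the Lastruns file is taken to be in YYYY-MM-DD form, the status column is ignored, both bounds are treated as inclusive (as OAI-PMH defines `from` and `until`), and the result is returned as a list instead of being written to a randomnumber_catkeys file.

```python
import os
from datetime import date


def catkeys_modified_between(unicorn_dir, from_date, until_date):
    """Collect the catkeys saved by all 'cadutext' runs within a date range."""
    # Step 1: report ids of the runs that fall inside the range.
    report_ids = []
    with open(os.path.join(unicorn_dir, "Lastruns", "cadutext")) as handle:
        for line in handle:
            parts = line.strip().split("|")  # rptid|date|status
            if len(parts) < 3:
                continue
            run_date = date.fromisoformat(parts[1])  # assumed date format
            if from_date <= run_date <= until_date:
                report_ids.append(parts[0])
    # Step 2: catkeys saved by each of those runs.
    catkeys = set()
    for rptid in report_ids:
        path = os.path.join(unicorn_dir, "Savedkeys", "adutext", rptid)
        with open(path) as handle:
            catkeys.update(line.strip() for line in handle if line.strip())
    # Step 3: "sort and uniq" (the set removed duplicates; catkeys are
    # assumed to be numeric).
    return sorted(catkeys, key=int)
```

A record modified in several runs appears in several saved-key files, which is why the deduplication step is needed.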
OAI frontend for Unicorn
• Limitations of implementation:
– ListRecords/ListIdentifiers:
• The from and until parameters are not permitted if the set
parameter is given on the request
• The from and until parameters are permitted if the set
parameter is not given on the request, but their values
should fall within a certain date range (at this moment
arbitrarily set to ‘today - 2 months’ and ‘today’)
– Deleted records
• Complete source code and documentation
available on the API Repository
(http://sirsiapi.org)
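The two ListRecords/ListIdentifiers restrictions can be expressed as a check that runs before any records are collected. This is a sketch, not the frontend's code: the function names are hypothetical, and ‘today - 2 months’ is read here as the same day of the month two calendar months earlier, clamped to the end of shorter months.

```python
import calendar
from datetime import date


def months_before(day, months):
    """Same day of the month `months` earlier, clamped to that month's last day."""
    index = day.year * 12 + (day.month - 1) - months
    year, month = divmod(index, 12)
    month += 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day.day, last_day))


def check_limits(arguments, today):
    """Return an error message, or None when the request respects the limits."""
    has_dates = "from" in arguments or "until" in arguments
    if "set" in arguments and has_dates:
        return "from/until are not permitted together with set"
    window_start = months_before(today, 2)
    for name in ("from", "until"):
        if name in arguments:
            value = date.fromisoformat(arguments[name])
            if not (window_start <= value <= today):
                return "%s must fall between %s and %s" % (name, window_start, today)
    return None


print(check_limits({"from": "2006-08-01", "until": "2006-08-31"}, date(2006, 9, 7)))
```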
OAI frontend - use cases @ ULB
Use case 1: Vlink - OpenURL resolver system
joint project with Vrije Universiteit Brussel (VUB)
[Diagram: sources (ISI Web of Science, OVID WebSpirs, ULB iLink, Elsevier ScienceDirect, JSTOR) send an OpenURL to Vlink, which consults the Vlink knowledge base and returns an HTML page of extended services]
OAI frontend - use cases @ ULB
Use case 1: Vlink - OpenURL resolver system
• OpenURL sent from iLink
http://bibdev.vub.ac.be/cgi-bin/openurlulb?sid=ULB:Webcat&id=oai:ulbcat:617924
• This OpenURL does not contain enough metadata for
the specific item ==> Vlink does a fetch back to
Unicorn through an OAI GetRecord request to obtain a
full MARC21 bibliographic description
http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=GetRecord&identifier=oai:ulbcat:617924&metadataPrefix=marc21
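The fetch back can be sketched as a URL rewrite: take the `id` value from the incoming OpenURL and use it as the `identifier` of a GetRecord request. How Vlink does this internally is not described in the slides, so the function below is purely illustrative and its name is hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

OAI_BASE_URL = "http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog"


def getrecord_url_for_openurl(openurl, metadata_prefix="marc21"):
    """Build the OAI GetRecord URL for the item named in an OpenURL.

    Returns None when the OpenURL carries no OAI identifier in 'id'.
    """
    query = parse_qs(urlsplit(openurl).query)
    identifiers = query.get("id", [])
    if not identifiers or not identifiers[0].startswith("oai:"):
        return None
    pairs = [
        ("verb", "GetRecord"),
        ("identifier", identifiers[0]),
        ("metadataPrefix", metadata_prefix),
    ]
    # safe=":" keeps the identifier readable, as in the URLs on the slide.
    return OAI_BASE_URL + "?" + urlencode(pairs, safe=":")


print(getrecord_url_for_openurl(
    "http://bibdev.vub.ac.be/cgi-bin/openurlulb?sid=ULB:Webcat&id=oai:ulbcat:617924"
))
```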
OAI frontend - use cases @ ULB
Use case 1: Vlink - OpenURL resolver system
• Feed Vlink Knowledge Base through OAI harvesting
[Diagram: Vlink harvests records from Unicorn over OAI-PMH into the Vlink Knowledge Base]
http://www.cible.ulb.ac.be/cgi-bin/OAI20/catalog?verb=ListRecords&metadataPrefix=mhld21&set=elper
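OAI frontend - use cases @ ULB
Use case 2: Unicat - Virtual Union Catalog of Belgium
[Diagram: data providers (university library catalogs on Unicorn, Aleph, VIRTUA and VUBIS, plus Public, Museum and Other) are harvested over OAI by the Unicat Harvester into the central repository (Union OAI Archive); the Unicat Indexer builds the search/browse indexes; the end user reaches them as HTML through the Unicat WWW Gateway; OAI and SRU access also appear in the diagram]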
SRU: the protocol
[Diagram: a client system sends an SRU request over HTTP to the SRU frontend on the web server of the Unicorn server, in front of the catalogue database (records and indexes); the SRU response returns XML over HTTP]
• Communication protocol: HTTP
• Information exchange protocol: SRU
• Returned record structure: XML
SRU: the protocol
• ‘Client searches and retrieves metadata
records from an archive’
• Stateless protocol: sequence of SRU
requests/responses over HTTP
• Search and Retrieve (<-> OAI: harvesting)
SRU: the protocol
SRU requests
• HTTP GET requests
• Syntax
– BASE URL
• host + port + path of SRU request handler
– key=value pairs
• 3 possible requests (operations)
– explain
• describes the facilities available at an SRU server
• used by clients to self-configure
• returned explain record is in XML and follows the ZeeRex Schema
• Example: http://z3950.loc.gov:7090/voyager?version=1.1&operation=explain
– scan
• allows the client to request a range of the available terms at a given point within a
list of indexed terms
• enables clients to present an ordered list of values and, if supported, how many hits
there would be for a search on that term
– searchRetrieve
SRU: the protocol
searchRetrieve operation
• searchRetrieve (principal) parameters
– version: version of the request; current protocol version: 1.1
– query: query expressed in CQL
– startRecord: position within the sequence of matched records of the first
record to be returned
– maximumRecords: number of records requested to be returned
– recordSchema: schema requested for the records to be returned
– stylesheet: URL for an XML stylesheet. The client requests that the server
simply return this URL in the response.
• CQL
« Traditionally, query languages have fallen into two camps: Powerful, expressive languages,
not easily readable nor writable by non-experts (e.g. SQL, PQF, and XQuery); or simple and
intuitive languages not powerful enough to express complex concepts (e.g. CCL and google).
CQL tries to combine simplicity and intuitiveness of expression for simple, every day queries,
with the richness of more expressive languages to accommodate complex concepts when
necessary. »
(http://www.loc.gov/standards/sru/cql)
SRU: the protocol
searchRetrieve operation
Examples of CQL queries:
• dinosaur
title = "complete dinosaur"
title exact "the complete dinosaur"
dinosaur not reptile
dinosaur and bird or dinobird
publicationYear < 1980
• title all "complete dinosaur"
title contains all of the words: ‘complete’, and ‘dinosaur’
• title any "dinosaur bird reptile"
title contains any of the words: ‘dinosaur’, ‘bird’, or ‘reptile’
• ribs prox/distance<=5 chevrons
a more specific proximity query: ‘ribs’ within 5 words of ‘chevrons’
SRU: the protocol
searchRetrieve operation -- examples
• http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&query=author=einstein
• http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=author=einstein
• http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=author=einstein&recordSchema=dc
• http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=author all "einstein albert"
• http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=title all "einstein albert"
• http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=title all "einstein albert"&stylesheet=http://bib49.ulb.ac.be/cibleCanevas.xsl
• http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=title all "einstein albert"&stylesheet=http://bib49.ulb.ac.be/cibleTypo3.xsl
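The example URLs are written with literal spaces and quotes for readability; a client sending them programmatically would percent-encode the CQL query. A minimal sketch of such a client-side helper follows. The function name and its defaults are assumptions, while the parameter names are the searchRetrieve parameters of SRU 1.1.

```python
from urllib.parse import quote, urlencode


def build_sru_search(base_url, query, start_record=1, maximum_records=10,
                     record_schema=None, stylesheet=None, version="1.1"):
    """Return a searchRetrieve URL with the CQL query percent-encoded."""
    pairs = [
        ("version", version),
        ("operation", "searchRetrieve"),
        ("query", query),
        ("startRecord", start_record),
        ("maximumRecords", maximum_records),
    ]
    if record_schema:
        pairs.append(("recordSchema", record_schema))
    if stylesheet:
        pairs.append(("stylesheet", stylesheet))
    # quote_via=quote encodes spaces as %20 rather than '+'
    return base_url + "?" + urlencode(pairs, quote_via=quote)


print(build_sru_search("http://bib49.ulb.ac.be:9000/Cible",
                       'title all "einstein albert"'))
```

Encoding the query also protects the `=` inside `author=einstein`, which would otherwise be ambiguous next to the `=` that separates keys from values.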
SRU frontend for Unicorn
[Diagram: a client system sends an SRU request over HTTP to an SRU frontend on the web server of the Unicorn server, in front of the catalogue database (records and indexes); the SRU response returns XML over HTTP]
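SRU frontend for Unicorn
[Diagram: the client system sends an SRU request over HTTP to an SRU/Z39.50 gateway on a web server; the gateway passes it on as a Z39.50 request to the Z39.50 frontend of the Unicorn server (catalogue database: records and indexes), receives the Z39.50 response, and returns it to the client as an XML SRU response over HTTP]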
SRU frontend for Unicorn
• SRU/Z39.50 Gateway: YAZ Proxy (Index Data)
– Implemented at ULB: 7/2006 (2 days)
– config.xml
<target name="cible" default="1">
<url>bib7.ulb.ac.be:2200</url>
<xi:include href="explain.xml"/>
<cql2rpn>pqf.properties</cql2rpn>
</target>
<target name="slavko" default="1">
<url>velma.library.mun.ca:2200</url>
<xi:include href="explain.slavko.xml"/>
<cql2rpn>pqf.slavko.properties</cql2rpn>
</target>
– explain.xml
• ZeeRex XML record as response to ‘explain’ operation
– pqf.properties
• specifies the mapping of various CQL indexes, relations,
etc. into Type-1 query attributes
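To give a feel for what the CQL-to-Type-1 mapping does, here is a toy translator for the simplest CQL forms (`term`, `index=term`, `index = "quoted phrase"`) into PQF notation. It is not the YAZ Proxy implementation, and the mapping table is a hypothetical excerpt rather than the contents of the ULB pqf.properties file; the Bib-1 use attributes 4 (title), 1003 (author) and 1016 (any) are standard values.

```python
import re

# Hypothetical excerpt of a CQL index -> Bib-1 use attribute mapping
INDEX_TO_ATTR = {"title": "1=4", "author": "1=1003"}
DEFAULT_ATTR = "1=1016"  # used when the query names no index

SIMPLE_CQL = re.compile(r'\s*(?:(\w+)\s*=\s*)?("[^"]*"|\S+)\s*')


def simple_cql_to_pqf(cql):
    """Translate 'term' or 'index=term' into a PQF string.

    Raises ValueError for anything more complex (booleans, 'all', 'any',
    proximity), which a real gateway would of course handle.
    """
    match = SIMPLE_CQL.fullmatch(cql)
    if match is None:
        raise ValueError("unsupported CQL in this sketch: " + cql)
    index, term = match.groups()
    if index is None:
        attr = DEFAULT_ATTR
    elif index in INDEX_TO_ATTR:
        attr = INDEX_TO_ATTR[index]
    else:
        raise ValueError("unknown index: " + index)
    return "@attr %s %s" % (attr, term)


print(simple_cql_to_pqf("author=einstein"))
```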
SRU frontend for Unicorn
• YAZ Proxy
– http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=title all "einstein albert"&stylesheet=http://bib49.ulb.ac.be/cibleTypo3.xsl
– http://bib49.ulb.ac.be:9000/Slavko?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=title all "einstein albert"&stylesheet=http://bib49.ulb.ac.be/cibleTypo3.xsl
SRU frontend: use case @ ULB
• Seamless integration of catalog searches in CMS
• Typo3
• Example
– HTML page containing a biography of the famous Belgian historian
Henri Pirenne
– frame pointing to the following URL:
http://bib49.ulb.ac.be:9000/Cible?version=1.1&operation=searchRetrieve&maximumRecords=10&startRecord=1&query=pirenne%20and%20epub-dnu-*&stylesheet=http://bib49.ulb.ac.be/cibleTypo3.xsl
• Project
– Unicorn contains descriptions of databases, websites, etc. with
local thematic classification codes in field 653
– create thematic websites within our CMS, containing frames
that list the available databases per theme