Protocol for Metadata Harvesting

Open Archives Initiative –
Protocol for Metadata Harvesting
Iztok Kavkler, University of Ljubljana
Some slides by
Stefaan Ternier, KUL
Bram Vandenputte, KUL
Joris Klerkx, KUL
What is OAI?

Harvesting standard, documented at
http://www.openarchives.org/OAI/openarchivesprotocol.html

Six service verbs
– Identify
– ListMetadataFormats
– GetRecord
– ListRecords
– ListIdentifiers
– ListSets

Allows multiple metadata formats
– DC (Dublin Core) format mandatory
How OAI works

OAI “VERBS”
– Identify
– ListMetadataFormats
– GetRecord
– ListIdentifiers
– ListRecords
– ListSets

[Diagram: the Service Provider (OAI harvester) sends an HTTP Request (OAI Verb) to the Metadata Provider (OAI repository), which answers with an HTTP Response (Valid XML).]
Try it




Install Apache-Tomcat or any other Java
servlet container
Download WAR file from
http://fire.eun.org/Iztok/OAILREApp.war
Deploy WAR
Demo html
http://localhost:8080/OAILREApp/

Or type a service verb, e.g.
http://localhost:8080/OAILREApp/oaiHandler?verb=Identify
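
The same requests can be composed in code. A minimal sketch, assuming the base URL of the demo application above; the class OaiRequestUrl and its build method are hypothetical helpers, not part of OAICat. It only builds the URL string and does not contact the server.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of OAICat): composes OAI-PMH request URLs.
class OaiRequestUrl {

    // Appends the verb and any further arguments as URL-encoded query parameters.
    static String build(String baseUrl, String verb, Map<String, String> args) {
        StringBuilder url = new StringBuilder(baseUrl);
        url.append("?verb=").append(URLEncoder.encode(verb, StandardCharsets.UTF_8));
        for (Map.Entry<String, String> arg : args.entrySet()) {
            url.append('&')
               .append(URLEncoder.encode(arg.getKey(), StandardCharsets.UTF_8))
               .append('=')
               .append(URLEncoder.encode(arg.getValue(), StandardCharsets.UTF_8));
        }
        return url.toString();
    }

    public static void main(String[] args) {
        String base = "http://localhost:8080/OAILREApp/oaiHandler";
        // LinkedHashMap keeps the arguments in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("metadataPrefix", "oai_lom");
        params.put("set", "exercises");
        System.out.println(build(base, "Identify", new LinkedHashMap<>()));
        System.out.println(build(base, "ListRecords", params));
    }
}
```

The resulting URL can be pasted into a browser or passed to any HTTP client.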
The raw XML


By default, the resulting XML has a stylesheet attached for pretty rendering
To remove the stylesheet, comment out the line
OAIHandler.styleSheet=testoai/oaicat.xsl
in the file oaicat.properties (in the WAR file or the web-app dir)
OAI XML example
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" ...>
  <responseDate>2007-06-11T06:48:58Z</responseDate>
  <request metadataPrefix="oai_lom"
           verb="ListRecords">http://localhost:8080/OAILREApp/oaiHandler</request>
  <ListRecords>
    <record>
      <header>
        <identifier>oai:oai.xyz-repository.com:exercises/112553</identifier>
        <datestamp>2007-06-09T22:38:28Z</datestamp>
        <setSpec>exercises</setSpec>
      </header>
      <metadata>
        <lom xmlns=...> ... </lom>
      </metadata>
    </record>
    ....
    <resumptionToken expirationDate="2007-06-11T07:48:58Z"
                     completeListSize="42" cursor="10">1181544538265</resumptionToken>
  </ListRecords>
</OAI-PMH>
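
A harvester has to pull the record identifiers and the resumption token out of such a response. A minimal sketch using only the JDK's DOM parser; the class OaiResponseReader and its methods are hypothetical names, and the sample input in main is a shortened version of the response above with the metadata element left out.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical harvester-side helper: reads identifiers and the resumption token.
class OaiResponseReader {
    static final String OAI_NS = "http://www.openarchives.org/OAI/2.0/";

    // Parses the response; parser errors are rethrown unchecked to keep the sketch short.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for the namespace-based lookups below
            return factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a parsable OAI response", e);
        }
    }

    // Collects the text of every <identifier> element in the OAI namespace.
    static List<String> identifiers(Document doc) {
        List<String> ids = new ArrayList<>();
        NodeList nodes = doc.getElementsByTagNameNS(OAI_NS, "identifier");
        for (int i = 0; i < nodes.getLength(); i++) {
            ids.add(nodes.item(i).getTextContent().trim());
        }
        return ids;
    }

    // Returns the resumption token, or null when it is absent or empty.
    static String resumptionToken(Document doc) {
        NodeList nodes = doc.getElementsByTagNameNS(OAI_NS, "resumptionToken");
        if (nodes.getLength() == 0) return null;
        String token = nodes.item(0).getTextContent().trim();
        return token.isEmpty() ? null : token;
    }

    public static void main(String[] args) {
        String xml = "<OAI-PMH xmlns=\"" + OAI_NS + "\"><ListRecords>"
                + "<record><header>"
                + "<identifier>oai:oai.xyz-repository.com:exercises/112553</identifier>"
                + "<datestamp>2007-06-09T22:38:28Z</datestamp>"
                + "</header></record>"
                + "<resumptionToken cursor=\"10\">1181544538265</resumptionToken>"
                + "</ListRecords></OAI-PMH>";
        Document doc = parse(xml);
        System.out.println(identifiers(doc));
        System.out.println(resumptionToken(doc));
    }
}
```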
OAICat - a Java implementation

OAICat home at
http://www.oclc.org/research/software/oai/cat.htm

Takes care of
– web service details
– OAI XML specification

The implementer has to provide three classes
– RepositoryOAICatalog
– RepositoryRecordFactory
– Repository2oai_dc (lom, ...) - usually more than one
A sample implementation
(Source code and libs in
http://fire.eun.org/Iztok/OAILREApp.zip)


Create a new web module
Add servlet oaiHandler to web.xml
<servlet>
  <servlet-name>LreOAIHandler</servlet-name>
  <servlet-class>ORG.oclc.oai.server.OAIHandler</servlet-class>
  <load-on-startup>5</load-on-startup>
</servlet>
<servlet-mapping>
  <servlet-name>LreOAIHandler</servlet-name>
  <url-pattern>/oaiHandler</url-pattern>
</servlet-mapping>
(cont)
Define properties file location
<context-param>
  <param-name>properties</param-name>
  <param-value>oaicat.properties</param-value>
</context-param>

Welcome file for testing
<welcome-file-list>
  <welcome-file>testoai/index.html</welcome-file>
</welcome-file-list>
Sample record


A record with basic fields
id, url, title, descr and date
SampleOAICatalog contains an array with 3
sample records
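
A record of this shape could look roughly as follows. This is a sketch, not the class shipped in OAILREApp.zip: the field types, the holder class SampleRecords and all field values are placeholders invented for illustration.

```java
// Sketch of a record with the five basic fields named on the slide.
// Storing the date as an ISO 8601 string is an assumption of this example.
class SampleRecord {
    final String id;
    final String url;
    final String title;
    final String descr;
    final String date;

    SampleRecord(String id, String url, String title, String descr, String date) {
        this.id = id;
        this.url = url;
        this.title = title;
        this.descr = descr;
        this.date = date;
    }
}

// Hypothetical holder for three sample records (all values are placeholders).
class SampleRecords {
    static final SampleRecord[] ALL = {
        new SampleRecord("id001", "http://example.org/res/1", "First sample",
                "Placeholder description 1", "2007-06-01T00:00:00Z"),
        new SampleRecord("id002", "http://example.org/res/2", "Second sample",
                "Placeholder description 2", "2007-06-05T12:00:00Z"),
        new SampleRecord("id003", "http://example.org/res/3", "Third sample",
                "Placeholder description 3", "2007-06-09T22:38:28Z"),
    };

    public static void main(String[] args) {
        for (SampleRecord rec : ALL) {
            System.out.println(rec.id + " " + rec.title + " " + rec.date);
        }
    }
}
```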
SampleOAICatalog.listIdentifiers

Parameters
– from – date to harvest from (String in ISO 8601 format)
  – date or datetime, depending on granularity
– to – date to harvest to
– set – a set name; list only records from this set (if null, list all records)
  – set names classify objects in natural groups
  – every record may belong to multiple sets (or none)
– metadataPrefix – list only records that support this format (sample formats: oai_dc, oai_lom, ...)
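
How the from, to and set parameters narrow the result can be sketched as below. The class ListIdentifiersFilter and the nested Rec type are hypothetical stand-ins. The sketch assumes that all datestamps are UTC strings of the same granularity, so that plain string comparison orders them correctly, and it leaves out the metadataPrefix check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the selection logic behind listIdentifiers.
class ListIdentifiersFilter {

    // Minimal stand-in for a repository record.
    static class Rec {
        final String id;
        final String datestamp;   // ISO 8601, UTC, e.g. 2007-06-09T22:38:28Z
        final List<String> sets;  // a record may belong to several sets or to none

        Rec(String id, String datestamp, String... sets) {
            this.id = id;
            this.datestamp = datestamp;
            this.sets = Arrays.asList(sets);
        }
    }

    // Returns the ids of records dated within [from, to] that belong to the set.
    // A null argument means "no restriction" for that criterion.
    static List<String> select(List<Rec> all, String from, String to, String set) {
        List<String> ids = new ArrayList<>();
        for (Rec r : all) {
            boolean notBefore = from == null || r.datestamp.compareTo(from) >= 0;
            boolean notAfter = to == null || r.datestamp.compareTo(to) <= 0;
            boolean inSet = set == null || r.sets.contains(set);
            if (notBefore && notAfter && inSet) {
                ids.add(r.id);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Rec> all = Arrays.asList(
                new Rec("a", "2007-06-01T00:00:00Z", "exercises"),
                new Rec("b", "2007-06-09T22:38:28Z", "exercises", "tests"),
                new Rec("c", "2007-06-10T10:00:00Z"));
        System.out.println(select(all, "2007-06-05T00:00:00Z", null, null));
        System.out.println(select(all, null, null, "exercises"));
    }
}
```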
SampleOAICatalog.listIdentifiers

Must return a map with two fields
– headers – a String iterator of OAI headers
– identifiers – a String iterator of OAI identifiers

Both created by the call (rec is a SampleRecord)
String[] header = getRecordFactory().createHeader(rec);
headers.add(header[0]);
identifiers.add(header[1]);

Create result
Map<String, Object> listIdMap = new HashMap<String, Object>();
listIdMap.put("headers", headers.iterator());
listIdMap.put("identifiers", identifiers.iterator());
return listIdMap;
getRecordFactory().createHeader(rec)

Creates header by calling the methods in SampleRecordFactory

String getOAIIdentifier(Object rec)
– returns the full OAI identifier, e.g. “oai:oay.rep.com:id001”
String getDatestamp(Object rec)
– returns the date in ISO 8601 format
Iterator<String> getSetSpecs(Object rec)
  ArrayList<String> list = new ArrayList<String>();
  list.add(...);
  return list.iterator();
Iterator<String> getAbouts(Object rec)
String fromOAIIdentifier(String id)
– helper method – converts the OAI identifier to a local id
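
The identifier and datestamp helpers can be sketched as follows. This is an illustration under assumptions: the class OaiIdentifierSketch is hypothetical, the repository id is taken from the example identifier above, and the methods take a local id or an Instant directly instead of the Object rec used by the record factory.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Hypothetical sketch of the identifier and datestamp conversions.
class OaiIdentifierSketch {
    // Repository id as it appears in the example identifier (placeholder value).
    static final String REPOSITORY_ID = "oay.rep.com";
    static final String PREFIX = "oai:" + REPOSITORY_ID + ":";

    // Local id -> full OAI identifier, e.g. id001 -> oai:oay.rep.com:id001
    static String getOAIIdentifier(String localId) {
        return PREFIX + localId;
    }

    // Full OAI identifier -> local id; rejects identifiers of other repositories.
    static String fromOAIIdentifier(String oaiId) {
        if (!oaiId.startsWith(PREFIX)) {
            throw new IllegalArgumentException("Not an identifier of this repository: " + oaiId);
        }
        return oaiId.substring(PREFIX.length());
    }

    // Formats a modification time as an ISO 8601 UTC datestamp with seconds granularity.
    static String getDatestamp(Instant modified) {
        return modified.truncatedTo(ChronoUnit.SECONDS).toString();
    }

    public static void main(String[] args) {
        String oaiId = getOAIIdentifier("id001");
        System.out.println(oaiId);
        System.out.println(fromOAIIdentifier(oaiId));
        System.out.println(getDatestamp(Instant.now()));
    }
}
```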
SampleOAICatalog.listSets

takes no parameters, returns the list of all
sets in this repository
–
each ListIdentifiers or ListRecords query may
contain a set name, limiting the results to just one
set
SampleOAICatalog.getSchemaLocations

like GetRecord, but returns the Vector of all
metadata schema locations the record
supports
–
to obtain them, just call
getRecordFactory().getSchemaLocations(rec);
SampleOAICatalog.getRecord

String getRecord(String id, String metadataPrefix)
– find the record and convert it to an XML string (<record> element)
– id is in global format – to get the local value call getRecordFactory().fromOAIIdentifier(id)
– throw IdDoesNotExistException if the record is not found
– to generate the XML use constructRecord(rec, metadataPrefix)
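
The lookup steps above can be sketched without the OAICat classes. Everything here is a stand-in: GetRecordSketch, its in-memory store and the locally defined IdNotFoundException are hypothetical, a missing format is reported with a plain IllegalArgumentException, and the generated <record> is simplified (a real header also carries a datestamp and set specs).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the getRecord flow: resolve id, look up, build <record>.
class GetRecordSketch {

    // Local stand-in for the "id does not exist" error, to keep the sketch self-contained.
    static class IdNotFoundException extends Exception {
        IdNotFoundException(String id) {
            super("No record with id " + id);
        }
    }

    static final String PREFIX = "oai:oay.rep.com:"; // placeholder repository prefix

    // local id -> (metadataPrefix -> ready-made metadata XML)
    private final Map<String, Map<String, String>> store = new HashMap<>();

    void add(String localId, String metadataPrefix, String metadataXml) {
        store.computeIfAbsent(localId, k -> new HashMap<>()).put(metadataPrefix, metadataXml);
    }

    String getRecord(String oaiId, String metadataPrefix) throws IdNotFoundException {
        // Convert the global identifier to the local one.
        String localId = oaiId.startsWith(PREFIX) ? oaiId.substring(PREFIX.length()) : null;
        Map<String, String> formats = localId == null ? null : store.get(localId);
        if (formats == null) {
            throw new IdNotFoundException(oaiId);
        }
        String metadata = formats.get(metadataPrefix);
        if (metadata == null) {
            throw new IllegalArgumentException("Format not available: " + metadataPrefix);
        }
        return constructRecord(oaiId, metadata);
    }

    // Simplified <record> element: identifier in the header plus the metadata payload.
    private static String constructRecord(String oaiId, String metadataXml) {
        return "<record><header><identifier>" + oaiId + "</identifier></header>"
                + "<metadata>" + metadataXml + "</metadata></record>";
    }

    public static void main(String[] args) throws IdNotFoundException {
        GetRecordSketch catalog = new GetRecordSketch();
        catalog.add("id001", "oai_dc", "<dc>...</dc>");
        System.out.println(catalog.getRecord("oai:oay.rep.com:id001", "oai_dc"));
    }
}
```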
16
SampleOAICatalog.listRecords


just like ListIdentifiers, only generates a list of XML <record> elements
return a map with one element
Map<String, Object> listRecMap = new HashMap<String, Object>();
listRecMap.put("records", records.iterator());
return listRecMap;
Crosswalks


Conversions of native record type to XML, like Sample2oai_lom or Sample2oai_dc
Only two methods per implementation
– boolean isAvailableFor(Object rec)
– String createMetadata(Object rec)
  SampleRecord record = (SampleRecord) rec;
  return LOMFormat.writeStringWithSchema(record.toLOM());

throw CannotDisseminateFormatException if the metadata is not available in this format
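
For the Dublin Core side, a crosswalk with the same two methods might look like the sketch below. It is a standalone illustration that does not extend the OAICat crosswalk base class: Sample2DcSketch and Rec are hypothetical, the mapping of descr to dc:description and url to dc:identifier is an assumption, and an IllegalArgumentException stands in for CannotDisseminateFormatException.

```java
// Hypothetical standalone crosswalk sketch: native record -> oai_dc XML string.
class Sample2DcSketch {

    // Minimal stand-in for the native record type.
    static class Rec {
        final String url;
        final String title;
        final String descr;
        final String date;

        Rec(String url, String title, String descr, String date) {
            this.url = url;
            this.title = title;
            this.descr = descr;
            this.date = date;
        }
    }

    // This sketch treats a record as available in oai_dc when it has a title.
    static boolean isAvailableFor(Object rec) {
        return rec instanceof Rec && ((Rec) rec).title != null;
    }

    static String createMetadata(Object rec) {
        if (!isAvailableFor(rec)) {
            throw new IllegalArgumentException("oai_dc is not available for this record");
        }
        Rec r = (Rec) rec;
        return "<oai_dc:dc xmlns:oai_dc=\"http://www.openarchives.org/OAI/2.0/oai_dc/\""
                + " xmlns:dc=\"http://purl.org/dc/elements/1.1/\">"
                + element("dc:title", r.title)
                + element("dc:description", r.descr)
                + element("dc:identifier", r.url)
                + element("dc:date", r.date)
                + "</oai_dc:dc>";
    }

    // Emits one element, or nothing when the value is missing.
    private static String element(String name, String value) {
        if (value == null) return "";
        return "<" + name + ">" + escape(value) + "</" + name + ">";
    }

    // Escapes the characters that are unsafe in XML text; '&' must be replaced first.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        Rec rec = new Rec("http://example.org/res/1", "Fractions & decimals",
                null, "2007-06-09T22:38:28Z");
        System.out.println(createMetadata(rec));
    }
}
```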
SampleRecord.toLOM

uses LOM-j lib to quickly hack together LOM
http://sourceforge.net/projects/lom-j/
– automatic serialization/deserialization of LOM and DC XML formats
Example
lom.newGeneral().newIdentifier(0).newCatalog().setString("lre");
lom.newGeneral().newIdentifier(0).newEntry().setString("sample:" + id);
lom.newTechnical().newLocation(-1).setString(url);
lom.newGeneral().newTitle().newString(0).newLanguage().setValue("en");
lom.newGeneral().newTitle().newString(0).setString(title);
Resumption

A repository usually has a fixed limit on the number of records to return in one call
– if more records are available, it returns a resumption token that lets the harvester request the next batch
– implemented by the functions listIdentifiers(String resumptionToken) and listRecords(String resumptionToken)
– see XYZOAICatalog for details
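
The paging idea can be sketched in isolation as below. This is not the XYZOAICatalog code: ResumptionSketch, its Page type, the page size of 2 and the token format are all assumptions, tokens never expire here, and an unknown token is reported with a plain IllegalArgumentException.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of resumption: fixed-size pages linked by tokens.
class ResumptionSketch {
    static final int PAGE_SIZE = 2; // placeholder limit per call

    // One response: a batch of identifiers plus a token, or null when the list is complete.
    static class Page {
        final List<String> items;
        final String resumptionToken;

        Page(List<String> items, String resumptionToken) {
            this.items = items;
            this.resumptionToken = resumptionToken;
        }
    }

    private final List<String> allIds;
    private final Map<String, Integer> offsets = new HashMap<>(); // token -> next offset
    private int tokenCounter = 1;

    ResumptionSketch(List<String> allIds) {
        this.allIds = allIds;
    }

    // First call: start at the beginning of the list.
    Page listIdentifiers() {
        return page(0);
    }

    // Follow-up call: continue where the token says the previous batch ended.
    Page listIdentifiers(String resumptionToken) {
        Integer offset = offsets.get(resumptionToken);
        if (offset == null) {
            throw new IllegalArgumentException("Unknown resumption token: " + resumptionToken);
        }
        return page(offset);
    }

    private Page page(int offset) {
        int end = Math.min(offset + PAGE_SIZE, allIds.size());
        List<String> items = new ArrayList<>(allIds.subList(offset, end));
        String token = null;
        if (end < allIds.size()) { // more records remain, so hand out a token
            token = "t" + tokenCounter++;
            offsets.put(token, end);
        }
        return new Page(items, token);
    }

    public static void main(String[] args) {
        ResumptionSketch repo = new ResumptionSketch(Arrays.asList("a", "b", "c", "d", "e"));
        Page page = repo.listIdentifiers();
        System.out.println(page.items);
        while (page.resumptionToken != null) { // the harvester loops until no token is returned
            page = repo.listIdentifiers(page.resumptionToken);
            System.out.println(page.items);
        }
    }
}
```

Keeping the offset on the server side and handing out an opaque token is one possible design; a token could equally encode the offset and the original query itself.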
References

http://www.openarchives.org/OAI/openarchivesprotocol.html
http://www.fmf.uni-lj.si/~kavkler/
http://www.oclc.org/research/software/oai/cat.htm
http://www.cs.kuleuven.ac.be/~hmdb/SqiOaiMelt
http://sourceforge.net/projects/lom-j/

SIO/Trubar OAI url
http://sio.edus.si/LreTomcat/