Introduction to Digital Libraries
Week 14: OAI & Complex Objects for Preservation
Old Dominion University
Department of Computer Science
CS 751/851 Fall 2006
Michael L. Nelson
Joan A. Smith
11/29/06
several slides borrowed from Van de Sompel, Liu, Lagoze, Warner & Harrison
Outline
1. Digital Preservation: Concepts & Issues
2. OAI-PMH Mechanics
3. Complex Objects
4. Preservation Using OAI-PMH & Complex Objects
5. Implementation Example: mod_oai
CS 751/851: OAI-PMH & Complex Objects
Digital Preservation
Durable or fragile?
“Digital information lasts forever, or 5 years, whichever comes first” -- Jeff Rothenberg
• Do you still have a copy of your first email?
• Can you still compile and run the first program you ever wrote? BASIC compilers are hard to find these days…
• If lightning fried your computer, how much information would you have lost?
• How many versions of your website have you made? How many do you still have?
Digital information is very fragile
DP Strategy Example: LOCKSS Caches
• LOCKSS seeks to ensure long-term availability of digital publications even if the publisher goes out of business
• Peer-to-peer network is used to maintain and repair content
• Ensures content is only available to authorized subscribers
In this example, each LOCKSS cache (oval) collects journal content from the publisher's web site as it is published. Readers (circles) can get content from the publisher site. When the publisher's web site is not available (gray) to a local community, readers from that community get content from their local institution's cache. The caches "talk" to each other to maintain the content's integrity over time.
3 Goals of LOCKSS:
1. Preserve content (bits)
2. Preserve access (to bits)
3. Preserve understanding of bits (as content)
The point of LOCKSS is to ensure rights of publishers and accessibility to subscribers
DP Strategy Example: VERS
VERS Objects
VERS Process
Note the emphasis on digital signatures:
A key element of official records
The final object contains a
wealth of metadata
The point of VERS is to ensure evidentiary-quality official records
Web Site Preservation
• Internet Archive’s Wayback Machine
  – Philanthropic effort by B. Kahle
  – By-request and general web crawls
• WARP
  – Japan’s national web archiving program
  – Japanese-origin sites
  – Many countries have similar efforts
• Google Groups/Usenet
  – Restored ~80% of original Usenet archives
  – Primarily text-based content
• Sitemaps
  – Search Engine standard (Google, Yahoo, MSN) to map site resources
  – Not preservation-oriented per se: an entry point to preservation
  – Today/near-future focus
  – Search engines are saying: I give up!
• Mirroring Strategies
  – Can ease migration of resources
  – Short-term backup rather than long-term preservation
Crawling is Complicated
Web Site Preservation: 2 Problems
The counting problem
• How many pages are on that site?
• To save it you have to find it
The representation problem
• What’s that page all about?
• Future use requires understanding
Digital Preservation Issues
1. Refreshing: If you don’t have it, you can’t preserve it
  – Resources disappear over time (Cong. Foley’s web site)
  – Resources change over time ( http://www.cs.odu.edu/index.html )
  – Resources can decay/degrade over time (damaged files, lost links)
2. Migration: If you don’t upgrade it, you can’t use it
  – Format obsolescence (WordPerfect vs. PDF)
  – Format modification (XBM vs. JPEG)
  – System obsolescence (TRS-80 vs PowerPC)
3. Emulation: If you can’t access it, you can’t use it
  – Original bits and bytes only work in the original environment (PDP-11)
  – Obsolete systems can be emulated in a newer environment (Frogger)
  – Physical characteristics have to be interpreted in new environments
These issues apply to every digital preservation effort, web or DL
Open Archival Information System: OAIS
A General Reference Model for Preservation (physical or digital)
  – SIP = Submission Information Package
  – AIP = Archival Information Package
  – DIP = Dissemination Information Package
[Figure: timeline labeled "today", "from today through all tomorrows", and "future"]
Note the complicated, active, on-going role of the archivist
Outline: 2
2. OAI-PMH Mechanics
Libraries: Inspiration for a Digital Age
Anatomy of a city library:
• Organized
  – Grouped
    • Topics
    • subtopics
  – Numbered
• Searchable
  – By author, title
  – By topic
  – By edition
• Lots of metadata
Digital library is similar
• Expands on physical library concepts
• Special protocols let librarians organize and find resources & information
• OAI-PMH is one of these “library” protocols
Example catalog record (MARC):
GV943.25 .B74 1990
Brenner, Richard J., 1941-
Make the team. Soccer : a heads up guide to super soccer! / Richard J. Brenner. -- 1st ed. -- Boston : Little, Brown, c1990.
127 p. : ill. ; 19 cm.
"A Sports illustrated for kids book."
Summary: Instructions for improving soccer skills. Discusses dribbling, heading, playmaking, defense, conditioning, mental attitude, how to handle problems with coaches, parents, and other players, and the history of soccer.
ISBN 0316107514 : $12.95
Soccer--Juvenile literature. 2. Soccer. II. Title: Heads up guide to super soccer. II. Title.
Dewey Class no.: 796.334/2 -- dc 20
89-48230
OAI-PMH data model
[Figure: OAI-PMH data model with three layers]
• resource: smith.pdf (MimeType=pdf), grouped into OAI-PMH sets
• item: OAI-PMH identifier /foo/refs/smith.pdf, the entry point to all records pertaining to the resource
• records: metadata pertaining to the resource (Dublin Core metadata, MARCXML metadata)
Each record is addressed in OAI-PMH by identifier, metadataPrefix, and datestamp
Overview of OAI-PMH Verbs
Verb                   Function
Metadata about the repository:
  Identify             description of repository
  ListMetadataFormats  metadata formats supported by repository
  ListSets             sets defined by repository
Harvesting verbs:
  ListIdentifiers      OAI unique ids contained in repository
  ListRecords          listing of N records
  GetRecord            listing of a single record
Most verbs take arguments: dates, sets, ids, metadata formats, and resumption token (for flow control)
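The verb and argument structure above can be sketched as a small URL builder. This is a minimal illustration, not part of the original slides: the function name `build_request` is hypothetical, and the base URL is the placeholder site www.xyz.us that the slides use in a later example.

```python
from urllib.parse import urlencode

# The six OAI-PMH verbs listed in the table above.
VERBS = {"Identify", "ListMetadataFormats", "ListSets",
         "ListIdentifiers", "ListRecords", "GetRecord"}

def build_request(base_url, verb, **args):
    """Return the HTTP GET URL for an OAI-PMH request (hypothetical helper).

    `args` are protocol arguments such as metadataPrefix, from, until,
    set, identifier, or resumptionToken. Because `from` is a Python
    keyword, callers pass it as `from_`.
    """
    if verb not in VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    params = {"verb": verb}
    for name, value in args.items():
        if value is not None:
            params[name.rstrip("_")] = value
    return f"{base_url}?{urlencode(params)}"

# All records updated in a date window, as Dublin Core.
url = build_request("http://www.xyz.us/oai", "ListRecords",
                    from_="2005-10-05", until="2006-06-11",
                    metadataPrefix="oai_dc")
print(url)
```

Rejecting unknown verbs on the client side is a design choice of this sketch; a repository would answer such a request with an OAI-PMH error response anyway.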
Repositories and Harvesters
Data Providers / Repositories
REPOSITORY:
• Network accessible
• Processes OAI-PMH style requests
• Exposes metadata to harvesters
Service Providers / Harvesters
HARVESTER:
• Client application
• Issues OAI-PMH style requests
• Collects metadata from repositories
SERVICE PROVIDER:
• Aggregates metadata from multiple repositories
• Facilitates discovery of resources
Aggregators
aggregators allow for:
• scalability for OAI-PMH
• load balancing
• community building
• discovery
[Figure: data providers (repositories) feed an aggregator, which feeds service providers (harvesters)]
OAI-PMH Verbs & Special Features
• Verbs:
  – Identify
    • Provides descriptive metadata about the DL
  – ListMetadataFormats
    • Specifies types of metadata tracked by the site
    • Options include Dublin Core, MARC, DIDL, RFC1807, others…
    • Dublin Core is required by OAI specification
  – ListSets
    • Defined locally via scripts to aggregate common record groups
    • Facilitates selective harvesting of site
    • MIME-Type sets are automatically supported by mod_oai
  – ListIdentifiers
    • Returns record headers only
    • Resumption token manages lengthy data set
    • Unique identifier for each site resource
  – ListRecords
    • Sequential transfer of each record
    • Can limit to N records (flow control for crawler)
  – GetRecord
    • Selects specific, single record from site
    • Identified by the OAI unique identifier
• Special Features:
  – Datestamp harvesting
    • Example: Give me all records updated between 2005-10-05 and today
      “http://www.xyz.us/oai?verb=ListRecords&from=2005-10-05&until=2006-06-11&metadataPrefix=oai_dc”
  – Metadata only, or:
    • Full record, encapsulated as DIDL, or:
    • A complete package with all of this information
      – Akin to OAIS AIP
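The flow-control role of the resumption token can be sketched as a harvesting loop. This is an illustration under stated assumptions: `fetch` is a hypothetical caller-supplied function that performs the HTTP request and returns the response XML, and the function names are not from the slides. The XML namespace is the standard OAI-PMH 2.0 one.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def parse_list_identifiers(xml_text):
    """Return (identifiers, resumption_token) from one ListIdentifiers page."""
    root = ET.fromstring(xml_text)
    ids = [h.findtext(f"{OAI_NS}identifier") for h in root.iter(f"{OAI_NS}header")]
    token_el = root.find(f".//{OAI_NS}resumptionToken")
    # An absent or empty resumptionToken element means the list is complete.
    token = token_el.text.strip() if token_el is not None and token_el.text else None
    return ids, token

def harvest_identifiers(fetch, first_args):
    """Collect all identifiers, following resumption tokens page by page.

    `fetch` (hypothetical) maps a dict of OAI-PMH arguments to the
    response XML text, e.g. by issuing an HTTP GET against the baseURL.
    """
    args, collected = dict(first_args), []
    while True:
        ids, token = parse_list_identifiers(fetch(args))
        collected.extend(ids)
        if not token:
            return collected
        # A follow-up request carries only the verb and the token.
        args = {"verb": "ListIdentifiers", "resumptionToken": token}
```

Passing `fetch` in as a parameter keeps the loop testable without a live repository.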
Example: Identify Verb Response Content
HTTP request: http://beatitude.cs.odu.edu:8080/modoai/?verb=Identify
Example: ListIdentifiers Verb Response Content
Resource Harvesting: Use cases
• Discovery: use content itself in the creation of services
– search engines that make full-text searchable
– citation indexing systems that extract references from the full-text
content
– browsing interfaces that include thumbnail versions of high-quality
images from cultural heritage collections
• Preservation:
– periodically transfer digital content from a data repository to one or
more trusted digital repositories
– trusted digital repositories need a mechanism to automatically
synchronize with the originating data repository
Existing OAI-PMH based approaches
Typical scenario:
1. An OAI-PMH harvester harvests Dublin Core records from
the OAI-PMH repository.
2. The harvester analyzes each Dublin Core record, extracting
dc.identifier information in order to determine the network
location of the described resource.
3. A separate process, out-of-band from the OAI-PMH, collects
the described resource from its network location.
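Step 2 of this scenario can be sketched as follows. The sketch assumes the harvester already holds a ListRecords response as a string; the function names are hypothetical, and the namespaces are the standard OAI-PMH 2.0 and Dublin Core element namespaces.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def dc_identifiers(list_records_xml):
    """Map each OAI identifier to the dc:identifier values of its record."""
    root = ET.fromstring(list_records_xml)
    result = {}
    for record in root.iterfind(".//oai:record", NS):
        oai_id = record.findtext("oai:header/oai:identifier", namespaces=NS)
        values = [el.text.strip()
                  for el in record.iterfind(".//dc:identifier", NS) if el.text]
        result[oai_id] = values
    return result

def looks_like_url(value):
    """Crude check for a dereferenceable value; a DOI or citation fails it."""
    return value.startswith(("http://", "https://"))
```

Step 3 would then fetch each value that passes `looks_like_url`, outside of OAI-PMH, without knowing whether it is the resource or a page about it.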
Existing OAI-PMH based approaches : Issue 1
Locating the resource based on information provided
in dc.identifier
dc.identifier is used to convey a variety of identifiers (sometimes simultaneously): URL, DOI, bibliographic citation, …
It is not expressive enough to distinguish between identifier and locator.
Several dereferencing attempts may be required
URI provided in dc.identifier is commonly that of a
bibliographic “splash page”
How to know it is a bibliographic “splash page”, not the
resource?
If it is a bibliographic “splash page”, where is the resource?
Existing OAI-PMH based approaches : Issue 2
Using the OAI-PMH datestamp of the Dublin Core record
to trigger incremental harvesting:
Datestamp of DC record does not necessarily change when
resource changes
                       no DC datestamp change    DC datestamp change
no resource update     OK                        unnecessary resource download
resource update        missed resource update    OK
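The four cases of this issue can be written out as a small decision function. This is only a restatement of the slide's table; the function name is hypothetical.

```python
def harvest_outcome(dc_datestamp_changed, resource_updated):
    """Outcome when a harvester uses the DC record datestamp as its trigger."""
    if dc_datestamp_changed and not resource_updated:
        # The harvester re-fetches a resource that did not change.
        return "unnecessary resource download"
    if resource_updated and not dc_datestamp_changed:
        # The harvester never learns that the resource changed.
        return "missed resource update"
    return "OK"

for dc_changed in (False, True):
    for updated in (False, True):
        print(dc_changed, updated, harvest_outcome(dc_changed, updated))
```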
Existing OAI-PMH based approaches :
Conventions
Cannot really address issue 2 (datestamps) with
metadata conventions
Issue 1 (identifier & locator of the resource) is currently
addressed with a range of conventions
First dc.identifier is locator of the resource
what if the resource is not digital?
Use of dc.format and/or dc.relation to convey locator
Existing OAI-PMH based approaches :
Conventions
<oai_dc:dc>
<dc:title>A Simple Parallel-Plate Resonator Technique for Microwave.
Characterization of Thin Resistive Films</dc:title>
<dc:creator>Vorobiev, A.</dc:creator>
<dc:subject>ING-INF/01 Elettronica</dc:subject>
<dc:description>A parallel-plate resonator method is proposed for
non-destructive characterisation of resistive films used in
microwave integrated circuits. A slot made in one ... </dc:description>
<dc:publisher>Microwave engineering Europe</dc:publisher>
<dc:date>2002</dc:date>
<dc:type>Documento relativo ad una Conferenza o altro Evento</dc:type>
<dc:type>PeerReviewed</dc:type>
<dc:identifier>http://amsacta.cib.unibo.it/archive/00000014/</dc:identifier>
<dc:format>pdf
http://amsacta.cib.unibo.it/archive/00000014/01/GaAs_1_Vorobiev.pdf
</dc:format>
</oai_dc:dc>
In this record, dc:identifier holds the splash page and dc:format holds the locator of the resource.
Existing OAI-PMH based approaches :
Conventions
…
<dc:identifier>http://amsacta.cib.unibo.it/archive/00000014/</dc:identifier>
<dc:relation>
http://amsacta.cib.unibo.it/archive/00000014/01/GaAs_1_Vorobiev.pdf
</dc:relation>
…
In this record, dc:identifier holds the splash page and dc:relation holds the locator of the resource.
Existing OAI-PMH based approaches :
Conventions
…
<dc:identifier> http://amsacta.cib.unibo.it/archive/00000014/</dc:identifier>
<dc:relation>
http://resolver.unibo.it/00000014/
</dc:relation>
<dc:relation>
http://amsacta.cib.unibo.it/archive/00000014/01/GaAs_1_Vorobiev.pdf
</dc:relation>
…
In this record, dc:identifier and the first dc:relation point to splash pages; the second dc:relation holds the locator of the resource.
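The three conventions shown in these examples can be sketched as one function that gathers candidate locators from a Dublin Core record. The function name and the candidate ordering are assumptions made for illustration. Note that the function cannot tell a splash page from the resource itself, which is exactly the weakness these slides point out.

```python
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"

def candidate_locators(dc_xml):
    """Return (source element, URL) pairs that might locate the resource."""
    root = ET.fromstring(dc_xml)
    candidates = []
    # Convention: dc:format holds a format name followed by a URL.
    for el in root.iter(f"{DC}format"):
        for token in (el.text or "").split():
            if token.startswith(("http://", "https://")):
                candidates.append(("dc:format", token))
    # Convention: dc:relation holds a URL (the locator or another splash page).
    for el in root.iter(f"{DC}relation"):
        url = (el.text or "").strip()
        if url:
            candidates.append(("dc:relation", url))
    # Convention: the first dc:identifier is the locator (often a splash page).
    first = root.find(f".//{DC}identifier")
    if first is not None and first.text:
        candidates.append(("dc:identifier", first.text.strip()))
    return candidates
```

A harvester following this sketch would still have to dereference each candidate and inspect what comes back.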
Existing OAI-PMH based approaches :
Other attempts
dc.identifier leads to splash page & splash page contains special
purpose XHTML link to resource(s)
What if there is no splash page?
How does a harvester recognize this situation?
OA-X: protocol extension
OK in local context
Strategic problem to generalize
How to consolidate with OAI-PMH data model
Qualified Dublin Core
Could bring expressiveness to distinguish between locator & identifier
But what about the datestamp issue?
Outline: 3
3. Complex Objects
Complex Objects
• Representation of a digital object by means of a wrapper XML document
• Represented resource can be:
  – simple digital object (consisting of a single datastream): foo.txt
  – compound digital object (consisting of multiple datastreams): foo.asp
• Unambiguous approach to convey identifiers of the digital object and its constituent datastreams.
• Include datastream:
  – By-Value: embedding of base64-encoded datastream
  – By-Reference: embedding network location of the datastream
  – not mutually exclusive; equivalent
• Include a variety of secondary information
  – By-Value
  – By-Reference
  – Descriptive metadata, rights information, technical metadata, …
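The difference between By-Value and By-Reference inclusion can be sketched with plain dictionaries standing in for the XML wrapper. The dictionary keys and function names here are hypothetical and chosen only for illustration; a real complex object format defines its own elements.

```python
import base64

def by_value(data: bytes, mime_type: str) -> dict:
    """Embed the datastream itself, base64-encoded so it is safe inside XML."""
    return {"mimeType": mime_type, "encoding": "base64",
            "content": base64.b64encode(data).decode("ascii")}

def by_reference(url: str, mime_type: str) -> dict:
    """Embed only the network location of the datastream."""
    return {"mimeType": mime_type, "ref": url}

def recover(entry: dict) -> bytes:
    """Decode a By-Value entry back to the original bytes."""
    return base64.b64decode(entry["content"])

entry = by_value(b"hello", "text/plain")
print(entry["content"])
print(recover(entry))
```

By-Value makes the record self-contained at the cost of size; By-Reference keeps the record small but requires a second fetch.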
Complex Object Formats: Characteristics
• Representation of a digital object by means of a wrapper XML document.
• Represented resource can be:
  – simple digital object (consisting of a single datastream)
  – compound digital object (consisting of multiple datastreams)
• Include datastream:
  – By-Value: embedding of base64-encoded datastream
  – By-Reference: embedding network location of the datastream
  – Descriptive metadata, rights information, technical metadata, …
• MPEG-21 DIDL is one type of complex object format
  – Can be used in OAI-PMH
  – Metadata prefix for mod_oai is “oai_didl”
In other words:
  – Instead of just looking at the index card about the book, we can actually get the book, too
Let’s look at an example GetRecord verb for a very simple resource ( http://beatitude.cs.odu.edu/modoaitest/joan.html )
MPEG-21 DIDL Data Model
[Figure: MPEG-21 DIDL data model. Entities shown with 1-to-n relationships: DIDL, Container, Item, Descriptor, Component, Resource]
How to encode Archive?
• 1 file = 1 DID
• 1 archive = 1 container
• 1 archive = 1 component
• 1 file = 1 component
Descriptors are used to convey:
• digital item identification (DII)
• digital item processing (DIP)
• rights expression language (REL)
• digital item relations (DIR)
• creation date (DIDT)
All resources within a component are equivalent by definition
Example DIDL
<didl:DIDL>
<didl:Item>
<didl:Descriptor><didl:Statement mimeType="text/xml; charset=UTF-8">
<dii:Identifier>
http://amsacta.cib.unibo.it/archive/00000014/
</dii:Identifier>
</didl:Statement></didl:Descriptor>
<didl:Descriptor><didl:Statement mimeType="text/xml; charset=UTF-8">
<oai_dc:dc>
<dc:title>A Simple Parallel-Plate Resonator Technique for
Microwave. Characterization of Thin Resistive Films
</dc:title>
<dc:creator>Vorobiev, A.</dc:creator>
<dc:identifier>
http://amsacta.cib.unibo.it/archive/00000014/</dc:identifier>
<dc:format>application/pdf</dc:format>
…
</oai_dc:dc>
</didl:Statement></didl:Descriptor>
<didl:Component>
<didl:Resource mimeType="application/pdf"
ref="http://amsacta.cib.unibo.it/archive/00000014/01/GaAs_1_Vorobiev.pdf"/>
</didl:Component>
</didl:Item>
</didl:DIDL>
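A DIDL document like the one above can be read with a standard XML parser. The slide omits namespace declarations, so the two namespace URIs below are assumptions (they are the ones commonly used for MPEG-21 DIDL and DII), and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs; the slide's example does not declare them.
NS = {
    "didl": "urn:mpeg:mpeg21:2002:02-DIDL-NS",
    "dii": "urn:mpeg:mpeg21:2002:01-DII-NS",
}

def summarize_didl(didl_xml):
    """Return (identifier, [(mimeType, ref), ...]) for one DIDL document."""
    root = ET.fromstring(didl_xml)
    ident = root.findtext(".//dii:Identifier", default="", namespaces=NS).strip()
    resources = [(r.get("mimeType"), r.get("ref"))
                 for r in root.iterfind(".//didl:Resource", NS)]
    return ident, resources

SAMPLE = """<didl:DIDL xmlns:didl="urn:mpeg:mpeg21:2002:02-DIDL-NS"
                       xmlns:dii="urn:mpeg:mpeg21:2002:01-DII-NS">
<didl:Item>
<didl:Descriptor><didl:Statement mimeType="text/xml; charset=UTF-8">
<dii:Identifier>
http://amsacta.cib.unibo.it/archive/00000014/
</dii:Identifier>
</didl:Statement></didl:Descriptor>
<didl:Component>
<didl:Resource mimeType="application/pdf"
ref="http://amsacta.cib.unibo.it/archive/00000014/01/GaAs_1_Vorobiev.pdf"/>
</didl:Component>
</didl:Item>
</didl:DIDL>"""

identifier, resources = summarize_didl(SAMPLE)
print(identifier)
print(resources)
```

Unlike the Dublin Core conventions, the identifier and the locator sit in separate, dedicated elements, so no guessing is needed.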
Complex Object Formats & OAI-PMH
• Resource represented via XML wrapper
=> OAI-PMH <metadata>
• Uniform solution for simple & compound objects
• Unambiguous expression of locator of datastream
• Disambiguation between locators & identifiers
• OAI-PMH datestamp changes whenever the resource changes
  – data streams & secondary information
  – Resource or its metadata
• OAI-PMH semantics apply: “about” containers, set membership
GetRecord: Get the Id and the Data
http://beatitude.cs.odu.edu:8080/modoai?verb=GetRecord
&identifier=http://beatitude.cs.odu.edu:8080/modoaitest/joan.html
&metadataPrefix=oai_didl
• oai_didl metadata format (prefix)
• Complex object response
– Encapsulates resource within the response
– Encodes it as base64
• Everything known about the URL is in the response
– All of the metadata types and the contents
• Dublin Core
• HTTP Headers
• Any others that might be used by that server…
Example: GetRecord/oai_didl Response
“joan.html” encoded
in base64
OAI-PMH based approach using Complex Object Formats
Typical scenario:
1. An OAI-PMH harvester checks for support of a locally understood
complex object format using the ListMetadataFormats verb
2. The harvester harvests the complex object metadata. Semantics of the
OAI-PMH datestamp guarantee that new and modified resources are
detected.
3. A parser at the end of the harvesting application analyzes each
harvested complex object record:
  • The parser extracts the bitstreams that were delivered By-Value.
  • The parser extracts the unambiguous references to the network location of bitstreams delivered By-Reference.
4. A separate process, out-of-band from the OAI-PMH, collects the
bitstreams delivered By-Reference from the extracted network locations.
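Steps 3 and 4 of this scenario can be sketched as a parser that separates the two delivery modes. The sketch assumes MPEG-21 DIDL records in which By-Value resources carry base64 text and By-Reference resources carry a `ref` attribute; the namespace URI and the function name are assumptions.

```python
import base64
import xml.etree.ElementTree as ET

# Assumed DIDL namespace URI.
DIDL = "{urn:mpeg:mpeg21:2002:02-DIDL-NS}"

def split_resources(didl_xml):
    """Return (by_value, by_reference) for one harvested DIDL record.

    by_value is a list of decoded byte strings; by_reference is a list
    of URLs to hand to a separate, out-of-band download process.
    """
    by_value, by_reference = [], []
    for res in ET.fromstring(didl_xml).iter(f"{DIDL}Resource"):
        if res.get("ref"):
            by_reference.append(res.get("ref"))
        elif res.text and res.text.strip():
            by_value.append(base64.b64decode(res.text.strip()))
    return by_value, by_reference
```

The URLs in `by_reference` are the only part that still needs HTTP requests outside OAI-PMH.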
Complex Object Formats & OAI-PMH : issues
• Which Complex Object Format(s)
• How to Profile Complex Object Format(s) for OAI-PMH Harvesting
• Large records
• Making resources re-harvestable
• Because the resource is represented as <metadata>, can rights pertaining to the resource be expressed according to the “rights for metadata” OAI-rights guideline?
• Tools:
– Software library to write compliant complex objects
– Integration of this library with repository systems (Fedora, DSpace,
eprints.org, ….)
Complex Object Formats & OAI-PMH:
Existing implementations
• LANL Repository
  – Local storage of terabytes of scholarly assets
  – Assets stored as MPEG-21 DIDL documents
  – DIDL documents made accessible to downstream applications via the OAI-PMH
• Mirroring of American Physical Society collection at LANL
  – Maps APS document model to MPEG-21 DIDL Transfer Profile
  – Exposes MPEG-21 DIDL documents through OAI-PMH infrastructure
  – Includes digests/signatures
• DSpace & Fedora plug-ins
  – Maps DSpace/Fedora document model to MPEG-21 DIDL Transfer Profile
  – Exposes MPEG-21 DIDL documents through OAI-PMH infrastructure
• mod_oai
Outline: 4
4. Preservation Using OAI-PMH & Complex Objects
Digital Preservation: A New Strategy
1. OAIS
2. OAI-PMH
3. Complex Objects
+
Digital Preservation
We can leverage these existing technologies to
create a unique approach to web preservation
OAI-PMH Data Model with Complex Objects
[Figure: OAI-PMH data model extended with complex objects]
• resource
• item: OAI-PMH identifier = entry point to all records pertaining to the resource
• records, of two kinds:
  – metadata pertaining to the resource: Dublin Core metadata (simple model), MARCXML metadata (more expressive model)
  – modeled representation of the resource: MPEG-21 DIDL (complex model), METS (complex model)
Complex Object Formats & OAI-PMH : archive
export/ingest
2 Problems: Counting & Representation
Counting Problem (Itemizing Resources)
• Finding all URLs on a site is hard
• Can’t preserve a resource if you can’t find it…
• Access-restrictions may exist
• Pages may be orphaned intentionally or accidentally
• URL normalization complicated, time-consuming
Representation Problem (Characterizing Resources)
• Resource types in use migrate over time
• Mechanisms for accessing resources evolve
• Old formats may not be recognizable
• Other metadata might be desirable
• Keeping the bits & bytes alone is insufficient
Can the web server help to solve these problems?
CRATE: A Model for Web Resource Preservation
• Fits with OAIS Preservation Model
• Text-based protocol for long-term survivability
• Complex object format supported by HTTP via OAI-PMH
• Utilizes web-server to support preservation via mod_oai
Outline: 5
5. Implementation Example: mod_oai
What if we could:
• Get a list of all URLs for the site
  – Including those not linked from root
  – Maybe even CGI-related links
• Get a list of everything new since last visit
  – Any pages that have changed
  – Any new pages added
  – Any pages that have been deleted
• Get a list of all <put your mime type here>
  – Images (specific subtype or all of them)
  – HTML pages only
  – PDFs only
  – Whatever mime spec you want…
• Package resource and metadata together in one object
I.e., solve the Counting and Representation problems
mod_oai solution
Integrate OAI-PMH functionality into the web server itself…
1. Use mod_oai
  • an Apache 2.0 module
  • automatically answers OAI-PMH requests for an http server
  • written in C
  • respects values in .htaccess, httpd.conf
2. Install mod_oai on http://www.foo.edu/
3. Define baseURL: http://www.foo.edu/modoai
→ Result: web harvesting with OAI-PMH syntax (e.g., from, until, sets)
http://www.foo.edu/modoai?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2004-09-15&set=mime:video:mpeg
From site foo, using OAI-PMH, give me a list of all resources dating from 9/15/2004 through today that are MIME type video-MPEG and have Dublin Core metadata available
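The MIME-type set in the request above can be derived mechanically from a MIME type. This is a sketch: the function names are hypothetical, and the set name for a whole major type (for example `mime:image`) is an assumption, since the slides only show the full form `mime:video:mpeg`.

```python
from urllib.parse import urlencode

def mime_set_spec(mime_type):
    """Turn 'video/mpeg' into the set name 'mime:video:mpeg'."""
    major, _, minor = mime_type.partition("/")
    # Assumption: a bare major type maps to 'mime:<major>'.
    return f"mime:{major}:{minor}" if minor else f"mime:{major}"

def list_identifiers_url(base_url, mime_type, since):
    """Build a selective ListIdentifiers request for one MIME type."""
    params = {"verb": "ListIdentifiers", "metadataPrefix": "oai_dc",
              "from": since, "set": mime_set_spec(mime_type)}
    # Keep ':' unescaped so the set name reads as on the slide.
    return f"{base_url}?{urlencode(params, safe=':')}"

print(list_identifiers_url("http://www.foo.edu/modoai", "video/mpeg", "2004-09-15"))
```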
How does mod_oai work?
• Source Code
  – Written in C
  – Designed to be platform-independent
    • Requires Apache 2
    • Uses APXS2 calls
    • Linux, Mac compatible
• Runs as a web server process
  – Installed on web server like mod_perl or mod_deflate, for example
  – Config file handles module specifics (baseURL location, etc)
  – Enables OAI-PMH verbs to appear in the HTTP request
    • baseURL + verb gets OAI-PMH response
• The rest of the site works as normal
  – Users see no change
  – Standard crawlers can operate as usual
OAI-PMH concepts : typical repository
OAI-PMH Entity           value           description
Resource                 URL             PDF, PS, XML, HTML or other file
Item    identifier       OAI Identifier  DNS-based name of metadata about resource
        set membership   LCSH            Library of Congress Subject Heading
Record  metadataPrefix   oai_dc          bibliographic metadata in Dublin Core
        datestamp        2004-10-18      modification date of DC record
Record  metadataPrefix   oai_marc        bibliographic metadata in MARC
        datestamp        2004-07-31      modification date of MARC record
OAI-PMH concepts : mod_oai
OAI-PMH Entity           value         description
Resource                 URL           HTML, GIF, PDF or other web file
Item    identifier       URL           same URL as the resource
        set membership   MIME type     MIME type of the resource
Record  metadataPrefix   http_header   the http headers that would have been returned via HTTP GET/HEAD
        datestamp        2004-07-31    modification date of resource
Record  metadataPrefix   oai_dc        a subset of http_header in DC
        datestamp        2004-07-31    modification date of resource
Record  metadataPrefix   oai_didl      MPEG-21 DIDL: base64 encoded resource + http_header metadata
        datestamp        2004-07-31    modification date of resource
Efficient, Automatic Harvesting
A better way: using OAI-PMH to crawl a site
– Identify
• Gives essential repository information
– ListRecords/ListIdentifiers
• Lists all of the resources on the site
• Can be “tweaked”:
– Only those that are new since YYYY-MM-DD
– Only those of MIME type <???>
• Streamlines crawling process
– ListSets
• Tells the crawler what kind of groupings the site supports
– 6 Verbs in All
– Streamlined initial crawl, fast update crawls
Performance of mod_oai vs wget
• All crawlers
  – Must ask for every resource
  – Discovery faster, automatic for mod_oai
• ListIdentifiers
  – Only an OAI-PMH verb
  – Could be used to create an index of resource names
  – Gets unlinked and linked resources
• ListRecords
  – Only an OAI-PMH verb
  – Returns metadata plus resource
  – Gets unlinked and linked resources
• wget
  – Behaves like common crawler
  – Can only find linked resources
• Update performance improved using mod_oai (OAI-PMH)
  – Conditional request is streamlined
• If only new/changed pages are requested:
  – OAI-PMH crawler:
    • “GET from yyyy-mm-dd” (last visit date)
    • One request gets all the new data
  – Standard crawler
    • “GET if-modified-since”
    • Must ask for every page
Data from performance on www.cs.odu.edu:
                             wget                                            mod_oai
                             index.html as seed   "find . -type f" as seed   files
# of files in baseline       709                  5739                       5268
# of files in update (25%)   114                  1318                       1335
For more detail: “mod_oai: An Apache Module for Metadata Harvesting”, http://arxiv.org/abs/cs.DL/0503069
Improving Crawls Using mod_oai
• Google sitemaps for OAI-PMH sites
  – currently harvests Dublin Core only
  – Uses your baseURL to crawl your site
  – Uses the date feature to get newest information
• Complex-object format/MPEG-21 DIDL
  – New OAI-PMH approach combines resource + metadata
  – Big files, but:
    • Could use gzip, deflate if server supports it (many do)
    • Still more efficient than traditional crawling
    • Can provide lots of useful metadata
  – Simplifies crawls
    • ListRecords gets everything
    • ListRecords + date range = fast updates
    • Any crawler could request MPEG-21 DIDL format (oai_didl)
• Google could easily adopt it since they already use ListRecords
• Any search engine looking for a competitive edge could implement DIDL metadata prefix to streamline crawls
• Intranets could adopt this approach for archiving their internal web
• Encoded base64 resource is also easy to decode for analysis or restoration
Addressing the Counting Problem: ListIdentifiers
CRAWLER:
• issues a ListIdentifiers,
• finds URLs of updated resources
• does HTTP GET updates only
• can get URLs of resources with specified MIME types
EXTEND mod_oai “counting”:
• Web log lists
• File system lists
• Configuration information
Addressing the Representation Problem: ListRecords in DIDL Format
CRAWLER:
• Makes a ListRecords query,
• Gets updates as MPEG-21 DIDL records (HTTP headers, resource By Value or By Reference)
• can get resources with specified MIME types
EXTEND mod_oai “representation”:
• Add ability to incorporate other metadata output
• Build metadata-rich complex object response
• Encapsulate within existing OAI-PMH DIDL metadata format response
Actual GetRecord Response (oai_didl)
“joan.html” encoded
in base64
Advantages of mod_oai
• Search engines are taking a real interest in OAI-PMH as a means to improve crawling
• mod_oai is an Apache 2.0 module that provides OAI-PMH interface for your site (currently Linux & Mac)
• You can send the baseURL to Google
• The module is relatively simple to install
• It won’t affect regular site users and regular web crawlers
• Any changes to your site will be reflected by the mod_oai server
• It makes crawling much faster, more efficient, more useful
Search Engine Use of OAI-PMH
• Google sitemaps: OAI-PMH or Do-It-Yourself
  – Via OAI-PMH
    • Just send them the baseURL!
    • Google does a ListRecords query on your site
  – Via Google’s tool or manually constructed
    • XML-formatted file; URI/IRI compliant
    • Follow schema: http://www.google.com/schemas/sitemap/0.84/sitemap.xsd
    • ASCII and UTF-8 encoded (escaped quotes, ampersands, etc)
    • Limited size: 50,000 URLs, 10 MB max (per sitemap file)
• Yahoo
  – No sign-up guidelines for OAI-PMH-enabled sites
  – Yet… research showed good coverage of OAI-PMH Repositories
    • Outsourced OAI-PMH crawls [1]
    • OAIster (U Michigan Library) provides Yahoo with OAI repository information
• MSN Academic Live
  – Digital-library-centric (not general web)
  – Specifically states it can access OAI-PMH repositories
  – Unclear if role will grow to include MSN Search
    • http://academic.live.com/Publishers_Faq.htm
• Professional Digital Libraries
  – Many support OAI-PMH
  – Many are not open to commercial search engines
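The "manually constructed" sitemap route can be sketched with the constraints listed above: an XML file in the 0.84 schema namespace, escaped special characters, and at most 50,000 URLs per file. The function name is hypothetical, and the sketch does not enforce the 10 MB size limit.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.google.com/schemas/sitemap/0.84"
MAX_URLS = 50_000  # per-file URL limit quoted on the slide

def build_sitemaps(urls):
    """Return sitemap XML strings, each holding at most MAX_URLS entries."""
    files = []
    for start in range(0, len(urls), MAX_URLS):
        urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
        for u in urls[start:start + MAX_URLS]:
            # ElementTree escapes ampersands and angle brackets in text.
            ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = u
        files.append(ET.tostring(urlset, encoding="unicode"))
    return files

print(build_sitemaps(["http://www.foo.edu/a?x=1&y=2"])[0])
```

A site running an OAI-PMH interface could feed the output of ListIdentifiers into a function like this, although the slides suggest simply sending the baseURL instead.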
Google Sitemaps Using OAI-PMH
http://www.google.com/support/webmasters/bin/answer.py?answer=34655&ctx=sibling
XML Format info here:
https://www.google.com/webmasters/sitemaps/docs/en/protocol.html#sitemapXMLFormat
Issues, Current Research, and Future Work
• For a given server, there is a set of URLs, U, and a set of files, F
  – Apache maps U → F
  – mod_oai maps F → U
• Neither function is 1-1 nor onto
  – We can easily check if a single u maps to F, but given F we cannot (easily) generate U
• Short-term issues:
  – dynamic files
    • exporting unprocessed server-side files would be a security hole
  – IndexIgnore
    • httpd will “hide” valid URLs
  – File permissions
    • httpd will advertise files it cannot read
• Long-term issues
  – Alias, Location
    • files can be covered up by the httpd
  – UserDir
    • interactions between the httpd and the filesystem
• Preservation research
  – Plug-in metadata harvesters
  – Efficient packaging of resource with metadata
  – Impact of processes on web server performance
  – Suitability of CRATE model for preservation
IndexIgnore & File Permissions
Alias: Covering Up Files
httpd.conf:
Alias /A /usr/local/web/htdocs/B
Alias /B /usr/local/web/htdocs/A
the files “A” and “B” will be different from the URLs
http://server/A
http://server/B
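The effect of these two Alias lines can be sketched as a toy URL-to-file mapping. This is a deliberately simplified model, not Apache's actual resolution logic: the DocumentRoot value is an assumption inferred from the paths above, and both function names are hypothetical.

```python
# The two Alias directives from httpd.conf above.
ALIASES = {"/A": "/usr/local/web/htdocs/B", "/B": "/usr/local/web/htdocs/A"}
DOCROOT = "/usr/local/web/htdocs"  # assumed DocumentRoot

def url_path_to_file(url_path):
    """Simplified U -> F mapping: an Alias prefix wins over DocumentRoot."""
    for prefix, target in ALIASES.items():
        if url_path == prefix or url_path.startswith(prefix + "/"):
            return target + url_path[len(prefix):]
    return DOCROOT + url_path

def naive_file_to_url_path(file_path):
    """Naive F -> U guess: strip DocumentRoot and ignore Alias directives."""
    return file_path[len(DOCROOT):]

file_a = "/usr/local/web/htdocs/A"
guessed = naive_file_to_url_path(file_a)
print(guessed, "->", url_path_to_file(guessed))
```

The guessed URL for file A actually serves file B, which is why a module that walks the filesystem cannot simply strip the document root to produce URLs.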
UserDir: “Just in Time” mounting of directories
whiskey.cs.odu.edu:/ftp/WWW/conf% ls /home
liu_x/ mln/
whiskey.cs.odu.edu:/ftp/WWW/conf% ls -d /home/tharriso
/home/tharriso/
whiskey.cs.odu.edu:/ftp/WWW/conf % ls /home
liu_x/ mln/ tharriso/
whiskey.cs.odu.edu:/ftp/WWW/conf %
Example CRATE Plug-Ins for mod_oai
Name       Description
Jhove      Image analysis
Kea        Key-phrase extraction
OTS        Open Text Summarizer
ExifTool   Image/video metadata extractor
Pdflib     Extract PDF metadata
MP3-Tag    Extract audio file tags
Essence    Customized information extraction
GDFR
MIME++
• Plug-in design allows for any type of extraction tool to be included
• Flexible architecture elements: Tags | Argument-Name | Version | CDATA output
• Simple Apache configuration file modification to enable plug-in
• Plug-ins written by 3rd-party programmers
• Validity of metadata is not verified by CRATE
OAI-PMH + Complex Objects: A New Model for
Web Resource Harvesting & Preservation
• Better web harvesting can be achieved through:
  – OAI-PMH
  – Complex object formats
• Use cases:
  – Preservation (ListRecords)
  – Web crawling (ListIdentifiers)
• mod_oai: reference implementation
  – Better performance than wget
  – static files only; dynamic files in the future
  – not a replacement for DSpace, Fedora, eprints.org, etc.
• New version of mod_oai
  – Plug-in compatible
  – Flexible architecture
For more information
• A website with mod_oai releases, demos and documentation is maintained by Old Dominion University and LANL: http://www.modoai.org/
  – New release next month
  – Improved installation process
• The Open Archives Initiative also maintains a web site: http://www.openarchives.org/
  – Forum, tutorials, news, research
  – OAI-PMH information
• There are active research projects at ODU using mod_oai
  – Web preservation
  – Repository ingestion/handling
  – See http://www.cs.odu.edu/~mln/research