Transcript Metadata

DL:Lesson 5
Classification Schemas
Luca Dini
[email protected]
[Figure: three ways to search across sources]
Web crawling: a “central” index
Metadata harvesting: metadata (Author, Title, Abstract, Identifier)
Metasearching: a metasearch engine
What is metasearching?

Given many document sources and a query, a
metasearcher:
– Finds the good sources for the query
– Evaluates the query at these sources
– Merges the results from these sources
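
The three steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: every name here (select_sources, merge, metasearch) is hypothetical, sources are modelled as plain functions returning (document, score) pairs, source selection is a naive word overlap with a source description, and merging rescales each source's scores by its top score. Real metasearchers use far more careful selection and rank merging.

```python
def select_sources(query, descriptions, top_k):
    """Step 1: keep the sources whose description shares words with the query."""
    terms = set(query.lower().split())
    scored = []
    for name, text in descriptions.items():
        overlap = len(terms & set(text.lower().split()))
        if overlap > 0:
            scored.append((overlap, name))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_k]]


def merge(result_lists):
    """Step 3: rescale each source's scores to [0, 1] and build one ranking."""
    best = {}
    for results in result_lists:
        if not results:
            continue
        top = max(score for _, score in results)
        for doc, score in results:
            norm = score / top if top else 0.0
            best[doc] = max(best.get(doc, 0.0), norm)
    return sorted(best.items(), key=lambda pair: (-pair[1], pair[0]))


def metasearch(query, sources, descriptions, top_k=2):
    """Steps 1-3: select sources, evaluate the query at each, merge the results."""
    chosen = select_sources(query, descriptions, top_k)
    return merge([sources[name](query) for name in chosen])


# Hypothetical sources: each one scores documents on its own scale.
SOURCES = {
    "med": lambda q: [("doc-a", 10.0), ("doc-b", 5.0)],
    "news": lambda q: [("doc-c", 0.9)],
    "law": lambda q: [("doc-z", 1.0)],
}
DESCRIPTIONS = {
    "med": "biomedical journal articles",
    "news": "biomedical news and press",
    "law": "court decisions",
}
ranking = metasearch("biomedical articles", SOURCES, DESCRIPTIONS)
```

Rescaling by the top score is one simple way to compare rankings from sources that score on different scales; it is a design choice for this sketch, not a recommendation.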
[Figure: a metasearcher in front of three kinds of sources: unindexed documents, a legacy database / WAIS / etc., and an existing web application]
Main Issues


How to query different types of sources?
How to combine results and rankings from multiple
data sources?
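[Figure: the metasearcher has to phrase the same query differently for each source]
Unindexed documents: grep ‘biomedical’ *.txt
Legacy database: SELECT title FROM articles . . .
Web application: http://…/getTitle?title=‘biomedical’&…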
Other Issues


How to choose among multiple data sources?
How to get metadata about multiple data
sources?
[Figure: ways a metasearcher can learn what a source holds]
Unindexed documents: cat *.txt
Legacy database: SELECT SCHEMA …….
Best: http://….?getMetaData
Worst: “Hi. What do you have?”
Cost/Functionality
[Figure: cost of acceptance plotted against function for Z39.50, SDLIP/STARTS, Metadata Harvesting, and Google]
Z39.50

http://www.loc.gov/z3950/agency/
Goals
• Permits one computer, the client, to search and retrieve
information on another, the database server
• Important both technically and for its wide use in library
systems
• Most development has concentrated on bibliographic data
• Most implementations emphasize searches that use a
bibliographic set of attributes to search databases of
MARC records
Principles
Abstract view of database searching.
• Server stores a set of databases with searchable
indexes
• Interactions are based on a session
• The client opens a connection with the server, carries
out a sequence of interactions and then closes the
connection.
• During the course of the session, both the server and
the client remember the state of their interaction.
The results
• The server carries out the search and builds a results set
• Server saves the results set.
• Subsequent message from the client can reference the
result set.
• Thus the client can narrow a large set with increasingly
precise requests, or can request a presentation of any
record in the set, without searching the entire database again.
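
A toy model of this stateful behaviour, as a sketch only: ToyServer and its methods are hypothetical names, not part of Z39.50, and queries are plain Python predicates rather than protocol messages. It shows the one idea from this slide: the server keeps named result sets, so a later request can refine or present from a saved set instead of searching the whole database again.

```python
class ToyServer:
    """Keeps named result sets for the lifetime of one session."""

    def __init__(self, records):
        self._records = list(records)
        self._result_sets = {}

    def search(self, name, predicate, within=None):
        """Run a search and save the hits under `name`.

        If `within` names an earlier result set, only that set is scanned.
        Returns the number of hits.
        """
        pool = self._result_sets[within] if within else self._records
        self._result_sets[name] = [r for r in pool if predicate(r)]
        return len(self._result_sets[name])

    def present(self, name, start, count):
        """Return `count` records from a saved set, starting at 1-based `start`."""
        return self._result_sets[name][start - 1:start - 1 + count]


server = ToyServer([
    {"title": "Evangeline", "author": "Longfellow", "year": 1847},
    {"title": "Hiawatha", "author": "Longfellow", "year": 1855},
    {"title": "Walden", "author": "Thoreau", "year": 1854},
])
hits_a = server.search("A", lambda r: r["author"] == "Longfellow")
hits_b = server.search("B", lambda r: r["year"] < 1850, within="A")
first_of_b = server.present("B", 1, 1)
```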
Services
init -- client connects to the server and exchanges initial
information, e.g., preferred message size
explain -- client inquires of the server what databases are
available for searching, the fields that are available, the syntax
and formats supported, and other options
search -- client presents a query to a database; there are choices of syntax
for specifying searches
• only Boolean queries widely implemented
• one or more records may be returned to the client
manipulation of results sets -- e.g., sort or delete
present -- requests the server to send specified records from
the results set to the client in a specified format
• options:
– for controlling content and formats
– for managing large records or large results sets
Example
In the database named "Books" find
all records for which the access
point title contains the
value "evangeline" and the access
point author contains the value
"longfellow."
Z39.50 defines a rich variety of search access
points that can be extended by implementers
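
The example query can be written down as a small Boolean tree over access points. The sketch below is not Z39.50 syntax: the tuple encoding, the matches function, and the sample records are all hypothetical, chosen only to show what "title contains X and author contains Y" means operationally.

```python
def matches(record, query):
    """Evaluate a Boolean query tree against one record.

    A query is ("term", access_point, value) or
    ("and" | "or", sub_query, ...) or ("not", sub_query).
    """
    op = query[0]
    if op == "term":
        _, access_point, value = query
        return value.lower() in record.get(access_point, "").lower()
    if op == "and":
        return all(matches(record, sub) for sub in query[1:])
    if op == "or":
        return any(matches(record, sub) for sub in query[1:])
    if op == "not":
        return not matches(record, query[1])
    raise ValueError("unknown operator: %r" % (op,))


# Hypothetical contents of the database named "Books".
BOOKS = [
    {"title": "Evangeline: A Tale of Acadie", "author": "Longfellow, Henry W."},
    {"title": "The Song of Hiawatha", "author": "Longfellow, Henry W."},
    {"title": "Walden", "author": "Thoreau, Henry D."},
]

QUERY = ("and",
         ("term", "title", "evangeline"),
         ("term", "author", "longfellow"))

found = [book["title"] for book in BOOKS if matches(book, QUERY)]
```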
Problems

Very difficult to implement
– There are freely available implementations, but they are
complex
Outdated assumptions
– Searching is expensive computationally
– Bandwidth is limited (ASN.1 compression)
Originally designed for bibliographic record retrieval,
and not full documents or other objects
“Overspecified”
(Almost) Nobody Implements Explain!
Assumes questionable user model (stateful)
Simple Digital Library
Interoperability Protocol

http://www-diglib.stanford.edu/~testbed/doc2/SDLIP/
SDLIP




Compromise between a full-scale, all-encompassing
search middleware design such as Z39.50 and the
“anything goes” approach typical for ad-hoc search
interface design on the web
Support for stateful and stateless operation by the
server
Support for thin clients, such as handheld devices
Developed jointly by Stanford, Berkeley, and UC
Santa Barbara
SDLIP – Search Middleware
Interfaces
Interfaces



Search Interface – defines a simple query
language; the protocol can then include other
languages
Result Interface – parking meter metaphor
supports varying notions of result sets
Source Metadata Interface – provides an
extension mechanism through discovery of
server capabilities
Result access interface


This interface allows client applications to
access the set of result documents, wherever
that set is maintained
Four services:
– getSessionInfo
– getDocs
– extendStateTimeout
– cancelRequest
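
One way to picture the parking meter metaphor together with these four services: the server holds a result set only while time remains on its meter, and the client can add time. The sketch below is an assumption-laden toy, not the SDLIP interface. The method names echo the four service names, but their signatures, the open method, and the injected clock are all hypothetical.

```python
class ResultStore:
    """Holds result sets that expire unless the client extends them."""

    def __init__(self, clock):
        self._clock = clock          # function returning the current time
        self._sessions = {}

    def open(self, session_id, docs, timeout):
        """Hypothetical helper: store results and start the meter."""
        self._sessions[session_id] = {
            "docs": list(docs),
            "expires": self._clock() + timeout,
        }

    def _live(self, session_id):
        session = self._sessions.get(session_id)
        if session is None or session["expires"] <= self._clock():
            self._sessions.pop(session_id, None)
            raise KeyError("no live state for session %r" % (session_id,))
        return session

    def get_session_info(self, session_id):
        session = self._live(session_id)
        return {"num_docs": len(session["docs"]),
                "remaining": session["expires"] - self._clock()}

    def get_docs(self, session_id, start, count):
        return self._live(session_id)["docs"][start:start + count]

    def extend_state_timeout(self, session_id, extra):
        self._live(session_id)["expires"] += extra

    def cancel_request(self, session_id):
        self._sessions.pop(session_id, None)


now = [0]                                    # fake clock for the example
store = ResultStore(lambda: now[0])
store.open("s1", ["d1", "d2", "d3"], timeout=10)
first_two = store.get_docs("s1", 0, 2)
now[0] = 8
store.extend_state_timeout("s1", 10)         # meter now runs until t = 20
now[0] = 15
info = store.get_session_info("s1")
```

A stateless server would simply never keep such a store and would re-run the query on each request; the sketch covers only the stateful case.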
Source metadata interface

Provides information about the service and
server itself, such as
– Collections served
– Collection metadata/content information
– Searchable properties
Three operations
– getInterface
– getSubcollectionInfo
– getPropertyInfo
OAI Metadata Harvesting
[Figure labels: Z39.50, Google, Functionality; SGML, HTTP, Dublin Core; metadata (Author, Title, Abstract, Identifier)]
History



Increasing interest in alternative scholarly
publishing solutions – e.g., LANL arXiv
Increasing impact through federation
UPS Mtg., Santa Fe, October 1999
– Representatives of various ePrint, library, and
publishing communities
– Goal: definition of an interoperability framework
among ePrint providers
– Result: Santa Fe Convention, interoperability
through metadata harvesting
Umbrella model
Reference
Libraries
Museums
Publishers
E-Print
Archives
…that can be exploited by different communities
Key Technical features





Deploy now technology – 80/20 rule
Two-party model – providers (data providers) and
consumers (service providers)
Simple HTTP encoding
XML schema for some degree of protocol
conformance
Extensibility
– Multiple item-level metadata
– Collection level metadata
Roles
Service Providers
Discovery
Current
Awareness
Data Providers
Preservation
Key Features

definitions & concepts
– repository
– record
– identifier
– datestamp
– set
protocol features
– HTTP encoding
– metadata prefix & schema
– flow control
protocol requests
– supporting requests
– harvesting requests
Record
<record>
  <header>
    <identifier>oai:eg:001</identifier>
    <datestamp>1999-01-01</datestamp>
  </header>
  <metadata>
    <dc xmlns="http://purl.org/dc">
      <title>My Example</title>
    </dc>
  </metadata>
  <about>
    <ea xmlns="http://www.arXiv.org/ea">
      <usage>No restrictions</usage>
    </ea>
  </about>
</record>
header: protocol support
metadata: format-specific metadata
about: community-specific record data
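
A harvester has to pull the three parts of such a record apart. Here is a minimal sketch using Python's standard xml.etree.ElementTree; the function name parse_record and the returned dictionary layout are hypothetical, and the XML is the slide's example written out in well-formed form.

```python
import xml.etree.ElementTree as ET

RECORD_XML = """<record>
  <header>
    <identifier>oai:eg:001</identifier>
    <datestamp>1999-01-01</datestamp>
  </header>
  <metadata>
    <dc xmlns="http://purl.org/dc">
      <title>My Example</title>
    </dc>
  </metadata>
  <about>
    <ea xmlns="http://www.arXiv.org/ea">
      <usage>No restrictions</usage>
    </ea>
  </about>
</record>"""

# Prefixes used in the paths below; the URIs are the ones in the example.
NAMESPACES = {"dc": "http://purl.org/dc", "ea": "http://www.arXiv.org/ea"}


def parse_record(xml_text):
    """Split a record into header fields, metadata, and 'about' data."""
    root = ET.fromstring(xml_text)
    return {
        "identifier": root.findtext("header/identifier"),
        "datestamp": root.findtext("header/datestamp"),
        "title": root.findtext("metadata/dc:dc/dc:title",
                               namespaces=NAMESPACES),
        "usage": root.findtext("about/ea:ea/ea:usage",
                               namespaces=NAMESPACES),
    }


record = parse_record(RECORD_XML)
```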
Identifiers
locally unique key for extracting a record
from a repository
oai-identifier = oai:archive-identifier:record-identifier
– oai: registered URI scheme
– archive-identifier: archive identifier, registered
– record-identifier: unique ID within archive (syntax is archive specific)
example = oai:ncstrl:ncstrl.cornellcs/TR94-1418
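
Splitting an identifier into its three parts is a one-liner, shown here as a sketch. The function name parse_oai_identifier is hypothetical. The split is limited to the first two colons on the assumption that the record-identifier part, being archive specific, might itself contain a colon.

```python
def parse_oai_identifier(identifier):
    """Return (archive_identifier, record_identifier) for an oai: identifier."""
    parts = identifier.split(":", 2)   # at most three pieces
    if len(parts) != 3 or parts[0] != "oai" or not parts[1] or not parts[2]:
        raise ValueError("not an oai-identifier: %r" % (identifier,))
    return parts[1], parts[2]


archive_id, record_id = parse_oai_identifier(
    "oai:ncstrl:ncstrl.cornellcs/TR94-1418")
```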
Request from the harvester (service provider) to the repository (data provider): Identify
Response:
• Repository name
• Base-URL
• Admin e-mail
• OAI protocol version
• Description Container
Request from the harvester (service provider) to the repository (data provider): ListMetadataFormats
Response:
REPEAT
• Format prefix
• Format XML schema
/REPEAT
Request from the harvester (service provider) to the repository (data provider): ListSets
Response:
REPEAT
• Set Specification
• Set Name
/REPEAT
Request from the harvester (service provider) to the repository (data provider): ListRecords
* from=a
* until=b
* set=klm
* metadataPrefix=oai_dc
Response:
REPEAT
• Identifier
• Datestamp
• Metadata
• About Container
/REPEAT
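
On the repository side, the from, until, and set arguments amount to a filter over datestamps and set membership. The sketch below is a toy under stated assumptions: list_records and the record layout are hypothetical, both date bounds are treated as inclusive, and a record may belong to several sets.

```python
from datetime import date


def list_records(records, from_=None, until=None, set_=None):
    """Return the records whose datestamp and sets match the arguments.

    `from_` and `until` are ISO dates (YYYY-MM-DD), treated as inclusive
    bounds; omitted arguments place no restriction.
    """
    selected = []
    for rec in records:
        stamp = date.fromisoformat(rec["datestamp"])
        if from_ is not None and stamp < date.fromisoformat(from_):
            continue
        if until is not None and stamp > date.fromisoformat(until):
            continue
        if set_ is not None and set_ not in rec.get("sets", ()):
            continue
        selected.append(rec)
    return selected


# Hypothetical repository contents.
REPOSITORY = [
    {"identifier": "oai:eg:001", "datestamp": "1999-01-01", "sets": ["klm"]},
    {"identifier": "oai:eg:002", "datestamp": "1999-06-15", "sets": ["klm"]},
    {"identifier": "oai:eg:003", "datestamp": "1999-06-20", "sets": ["xyz"]},
    {"identifier": "oai:eg:004", "datestamp": "2000-02-01", "sets": ["klm"]},
]

harvested = list_records(REPOSITORY, from_="1999-06-01",
                         until="1999-12-31", set_="klm")
```

A harvester that remembers the date of its last run and passes it as from_ only has to fetch what changed since then, which is what the datestamp is for.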
Request from the harvester (service provider) to the repository (data provider): ListIdentifiers
* from=a
* until=b
* set=klm
Response:
REPEAT
• Identifier
• Datestamp
/REPEAT
Request from the harvester (service provider) to the repository (data provider): GetRecord
* identifier=oai:mlib:123a
* metadataPrefix=oai_dc
Response:
• Identifier
• Datestamp
• Metadata
• About
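
With the simple HTTP encoding, each of these requests becomes a URL: the repository's base URL plus the request name and its arguments as query parameters. The sketch below assumes the request name travels in a parameter called verb, as in the published OAI protocol; the base URL http://example.org/oai and the helper name build_request are hypothetical.

```python
from urllib.parse import urlencode


def build_request(base_url, verb, **arguments):
    """Build the URL for one protocol request.

    Because `from` is a reserved word in Python, callers pass `from_`;
    the trailing underscore is stripped before encoding.
    """
    params = {"verb": verb}
    for name, value in arguments.items():
        if value is not None:
            params[name.rstrip("_")] = value
    return base_url + "?" + urlencode(params)


BASE_URL = "http://example.org/oai"   # hypothetical repository

identify_url = build_request(BASE_URL, "Identify")
list_url = build_request(BASE_URL, "ListRecords",
                         from_="1999-01-01", until="1999-12-31",
                         set="klm", metadataPrefix="oai_dc")
get_url = build_request(BASE_URL, "GetRecord",
                        identifier="oai:mlib:123a",
                        metadataPrefix="oai_dc")
```

Note that urlencode percent-encodes the colons in the identifier, so the repository receives it intact after decoding.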





http://www.google.com.tw/webmasters/sitemaps/docs/en/other.html#oai
http://www.nla.gov.au/digicoll/oai/getRecord.html
oai_dc
http://www.nla.gov.au/digicoll/oai/
http://www.nla.gov.au/digicoll/oai/listMetadataFormats.html (oai:nla.gov.au:nla.pican22111591)