Transcript Metadata

Metadata
Andy Powell
Technical Development and Research
UKOLN
University of Bath
http://www.ukoln.ac.uk/
[email protected]
1
Metadata
• What is metadata?
• an introduction
• The Dublin Core
• metadata for the Web
• Metadata management
• Models for dealing with Web-site
metadata
• UKOLN metadata projects
• overviews (and problems)
2
What is metadata?
• by definition:
..data about data..
..data which provides information
about a resource..
• by example:
• title, author, subject classification, shelf
mark
• digital format, terms and conditions,
location (URL)
3
What is metadata? (2)
• by usage:
• Resource discovery
– Searching, location
– Authentication
– Quality/rating
• Semantic interoperability
• Resource management
• User interface
– Grouping resources for printing
– 3-D visualisations
4
Range of formats
[Diagram: formats arranged along two scales, Simple to Rich and robot generated to hand crafted: Dublin Core, MARC, NetFirst, IAFA, TEI headers, Lycos, SOIF, CIMI, Alta Vista]
5
Where is metadata?
• Embedded within resource
• HTML <META> tags
• Linked to resource
• Remote database
• distributed
• union (centralised)
6
Who creates metadata?
• Publisher side
• author
• webmaster
• institution
• Service side
• search service
• third party creators
robot generated
hand crafted
7
Dublin Core
• 15 element core metadata set
• Primarily intended to aid resource
discovery on the Web
• Main usage currently embedded into
HTML META tags
• All elements optional and repeatable
• Status?
• Agreed syntax for embedding in HTML
• Still discussion about the use of some of the
elements
http://www.ukoln.ac.uk/metadata/resources/dc.html
8
Dublin Core History
• 4 DC meetings
• Dublin, Warwick, Dublin, Canberra
• (DC-5 - Helsinki coming soon)
• Mailing list discussions
• [email protected]
• W3C interest
• RDF (PICS-NG), MCF
• Various projects
• Still no significant interest yet from
the big search engines :-(
9
DC Elements - 1
• Title
• Subject
• intended to promote use of controlled vocabularies but
in practice likely to be used for uncontrolled list of
keywords
• Description
• abstract
• Creator
• Publisher
10
DC Elements - 2
• Contributor
• Date
• the date ‘the resource was made available in its
present form’. Agreed default format uses subset of
ISO 8601, e.g. 1997-09-15
• Type
• category of resource - document, image, sound, home
page, novel, poem, etc. Still much discussion about
the content of this element
• Format
• MIME type
• Identifier
11
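The ISO 8601 subset named for the Date element can be produced with Python's standard library. This is a minimal sketch only; the helper name dc_date_tag is hypothetical and not part of any Dublin Core tool.

```python
from datetime import date
from html import escape


def dc_date_tag(d: date) -> str:
    """Return a DC.date META tag with the date written as YYYY-MM-DD."""
    # date.isoformat() yields the YYYY-MM-DD form used in the slide's example.
    return f'<META NAME="DC.date" CONTENT="{escape(d.isoformat())}">'


print(dc_date_tag(date(1997, 9, 15)))
```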
DC Elements - 3
• Source
• Language
• language of the resource - NOT the metadata
• Relation
• no guidelines for usage currently
• Coverage
• separate working party looking at usage
• Rights
• rights management seen as too complex for DC. This
will give a URL to some external information
12
Simple Example
13
<HTML><HEAD>
<TITLE>UKOLN Home Page</TITLE>
<META NAME="DC.title" CONTENT="UKOLN: UK Office for
Library and Information Networking">
<META NAME="DC.subject" CONTENT="national centre,
network information support, library community,
awareness, research, information services, public library
networking, bibliographic management, distributed library
systems, metadata, resource discovery, conferences,
lectures, workshops">
<META NAME="DC.description" CONTENT="UKOLN is a
national centre for support in network information
management in the library and information communities. It
provides awareness, research and information services">
<META NAME="DC.creator" CONTENT="Stark, Isobel">
</HEAD>
...
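Tags like those in the example can be generated rather than typed, which avoids mismatched quotes. The sketch below is illustrative only: the function name dc_meta_tags and the list-of-pairs input are assumptions, not part of any UKOLN tool. A list of pairs is used because all elements are repeatable.

```python
from html import escape


def dc_meta_tags(pairs):
    """Build one META tag per (element, value) pair; elements may repeat."""
    return "\n".join(
        # escape() also converts double quotes so the attribute stays well formed.
        f'<META NAME="DC.{escape(name)}" CONTENT="{escape(value)}">'
        for name, value in pairs
    )


record = [
    ("title", "UKOLN: UK Office for Library and Information Networking"),
    ("creator", "Stark, Isobel"),
]
print(dc_meta_tags(record))
```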
Element qualifiers
• Need to refine meaning in some
cases
• TYPE
Refines meaning of element - sub-divides
element namespace
• SCHEME
Element value taken from external schema,
e.g. LCSH for DC.subject, Z39.53 for
DC.language
• LANGUAGE
Language of element value (not of the
resource being described!)
14
Examples - TYPE
• Original DC.creator tag
<META NAME="DC.creator" CONTENT="Stark, Isobel">
• Non-personal author
<META NAME="DC.creator.corporate"
CONTENT="UKOLN Information Services Group">
• Author’s email address
<META NAME="DC.creator.email"
CONTENT="[email protected]">
15
Examples - SCHEME
• Library of Congress Subject Heading
<META NAME="DC.subject" CONTENT="(SCHEME=LCSH)
Library information networks -- Great Britain">
<META NAME="DC.subject" CONTENT="(SCHEME=LCSH)
Information technology -- higher education">
…or…
<META NAME="DC.subject" SCHEME="LCSH"
CONTENT="Library information networks -- Great Britain">
<META NAME="DC.subject" SCHEME="LCSH"
CONTENT="Information technology -- higher education">
16
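A harvester that reads these tags has to cope with both ways of writing a qualifier. The following sketch, using only Python's standard html.parser, splits the TYPE qualifier off the NAME and accepts SCHEME either as an attribute or as a "(SCHEME=...)" prefix in CONTENT. The class name DCMetaParser and the tuple layout are assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

# Matches the "(SCHEME=LCSH) value" convention shown in the slide.
SCHEME_PREFIX = re.compile(r"^\(SCHEME=([^)]+)\)\s*(.*)$", re.DOTALL)


class DCMetaParser(HTMLParser):
    """Collect (element, type_qualifier, scheme, value) tuples from DC META tags."""

    def __init__(self):
        super().__init__()
        self.records = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":  # HTMLParser lower-cases tag and attribute names
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        if not name.lower().startswith("dc."):
            return
        parts = name.split(".")
        element = parts[1]
        qualifier = ".".join(parts[2:]) or None
        content = attributes.get("content") or ""
        scheme = attributes.get("scheme")
        prefixed = SCHEME_PREFIX.match(content)
        if scheme is None and prefixed:
            scheme, content = prefixed.group(1), prefixed.group(2)
        self.records.append((element, qualifier, scheme, content))


html_doc = """<META NAME="DC.creator.corporate" CONTENT="UKOLN Information Services Group">
<META NAME="DC.subject" CONTENT="(SCHEME=LCSH) Library information networks -- Great Britain">
<META NAME="DC.subject" SCHEME="LCSH" CONTENT="Information technology -- higher education">"""

parser = DCMetaParser()
parser.feed(html_doc)
for record in parser.records:
    print(record)
```

Both SCHEME spellings end up in the same tuple shape, so later processing does not need to know which convention a page used.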
Metadata Management
Practical issues of using Dublin Core
for Internet resource description...
• UKOLN metadata system
• Requirements
• 3 models for metadata management
• Implementation at UKOLN
17
UKOLN metadata system
requirements
• Easy to use
• Work with a variety of methods of
creating HTML
• Simple migration to future metadata
formats
• Separate metadata from resource
18
Managing Dublin Core (1)
HTML Authoring tool
Embed by hand using HTML or text editor
Pros…
• Simple
• May be useful for training and familiarisation
Cons…
• May not be possible with all editors
• Maintenance problems
• Easy to make errors
19
DC-dot
• A Web based tool for creating
Dublin Core <meta> tags
• Automatic generation of some tags
based on content of the resource
• Forms based editing of tags
• Cut-and-paste output into HTML
• Conversion to other formats…
• SOIF, ROADS/WHOIS++, USMARC,
GILS...
http://www.ukoln.ac.uk/metadata/dcdot/
20
Managing Dublin Core (2)
Web-site management tool
Use Web-site management tool,
for example NetObjects Fusion
Pros…
• Use of Web-site management tools likely to increase
• Object-oriented database approach
Cons…
• Proprietary formats
• Early days - too early to evaluate use for metadata yet?
21
Managing Dublin Core (3)
On the fly generation
Hold Dublin Core separately and embed
on-the-fly using server-side include (SSI)
Pros…
• Separates metadata from resource
• Future migration fairly simple
Cons…
• Performance
• Lack of integration with HTML tools
• Server specific
22
UKOLN metadata system (1)
• Embed on-the-fly
• Apache SSI script
• Store metadata using SOIF records
• Use MS-Access as tool to create the records
• Associate metadata with resource by co-locating them in the Web server filestore
23
UKOLN metadata system (2)
Apache syntax for calling server-side script
<!--#exec cmd="getmeta" -->
[Diagram: HTML editor and intro.html; MS-Access Database and intro.html.soif]
intro.html
<html>
<head>
<title>…</title>
<!--#exec cmd="getmeta" -->
</head>
...
intro.html.soif
@FILE { http://www.ukoln.ac.
...
keywords{13}: xxx, yyy, zzz
description{14}: blah blah b
author{13}: Stark, Isobel
...
}
24
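The conversion a script such as getmeta performs, from a SOIF record to META tags, can be sketched as follows. This is not the UKOLN script. The mapping from SOIF attribute names to Dublin Core elements is an assumption, the sample URL is hypothetical, and the parser handles single-line values only, ignoring the declared sizes.

```python
import re
from html import escape

# One SOIF attribute line: name{size}: value
ATTR_LINE = re.compile(r"^([\w-]+)\{(\d+)\}:\s*(.*)$")

# Assumed mapping from SOIF attribute names to Dublin Core elements.
SOIF_TO_DC = {"keywords": "subject", "description": "description", "author": "creator"}


def parse_soif(text):
    """Return (attribute, value) pairs; the @FILE line and braces are skipped."""
    pairs = []
    for line in text.splitlines():
        match = ATTR_LINE.match(line.strip())
        if match:
            pairs.append((match.group(1), match.group(3)))
    return pairs


def soif_to_meta(text):
    """Render the mapped attributes of a SOIF record as DC META tags."""
    tags = []
    for name, value in parse_soif(text):
        element = SOIF_TO_DC.get(name.lower())
        if element:
            tags.append(f'<META NAME="DC.{element}" CONTENT="{escape(value)}">')
    return "\n".join(tags)


# Hypothetical record; the URL in the slide is truncated and is not completed here.
sample = """@FILE { http://www.example.org/intro.html
keywords{13}: xxx, yyy, zzz
author{13}: Stark, Isobel
}"""

print(soif_to_meta(sample))
```

Because the metadata lives in its own file, a change of target format needs only a new rendering function, which is the migration benefit the slides claim for this model.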
UKOLN metadata system (3)
MS-Access front end...
[Screenshot labels: Filename browser, Text boxes, Name choosers, UKOLN specific metadata]
25
UKOLN metadata system (4)
[Diagram: numbered steps 1 to 6 linking Web robot, UKOLN Web server, SSI script, intro.html and intro.html.soif]
intro.html
<html>
<head>
<title>…</title>
<!--#exec cmd="getmeta" -->
</head>
...
intro.html.soif
@FILE { http://www.ukoln.ac.
...
keywords{13}: xxx, yyy, zzz
description{14}: blah blah b
author{13}: Stark, Isobel
...
}
26
Issues
• Performance
• Interaction with Web caches
• Dublin Core vs Alta Vista style
metadata
<META NAME="Description" CONTENT="blah, blah">
<META NAME="Keywords" CONTENT="xxx, yyy, zzz">
• Granularity
• Which pages should have metadata?
27
What's the point...
…of embedding DC <meta> tags?
• Alta Vista isn't going to look for them
• But, worth doing...
• within individual projects
• within specific communities (e.g. eLib)
• Improve local search facilities
• e.g. load SOIF records into a Netscape
Catalogue Server
• Web-site management benefits
28
UKOLN Metadata projects
• ROADS
• Software for Subject Service
• DESIRE
• European Web indexing
• NewsAgent
• Current awareness service for Library
and Information Staff
• BIBLINK
• Information flow from publishers to
National Bibliographic Agencies
29
ROADS
• Resource Organisation and Discovery
in Subject-based Services
• Web based tools for Subject Services
• SOSIG, ADAM, OMNI, …
• Manage and search Internet resource
descriptions
• ROADS templates (based on IAFA
templates)
• WHOIS++
http://www.ukoln.ac.uk/roads/
30
ROADS - WHOIS++ (1)
• Simple client-server search and
retrieve protocol
• Developed originally for ‘white
pages’ applications
• Offer search facilities across several
Subject Services
• Distribute a Subject Service across
several physical servers
• Query routing - centroids and CIP
31
ROADS - WHOIS++ (2)
• Centroid generated by ADAM contains… “you’ll
find the string ‘mona’ in the ‘title’ attribute of at
least one record in the ADAM database”.
[Diagram: Web browser, CGI-based WHOIS++ client and the SOSIG, ADAM and OMNI servers, with numbered steps 1 to 6 and CIP sharing of centroids]
32
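The centroid idea can be illustrated in a few lines: each server publishes, per attribute, the set of words that occur in at least one of its records, and a client forwards a query only to servers whose centroid contains the search word. The sketch below is a simplification, not the WHOIS++ or CIP wire format; the function names and the sample records are hypothetical.

```python
import re
from collections import defaultdict


def build_centroid(records):
    """Map each attribute to the set of unique lower-cased words found in it."""
    centroid = defaultdict(set)
    for record in records:
        for attribute, value in record.items():
            centroid[attribute].update(re.findall(r"\w+", value.lower()))
    return dict(centroid)


def servers_to_query(centroids, attribute, word):
    """Return the servers whose centroid lists the word for that attribute."""
    return sorted(
        server
        for server, centroid in centroids.items()
        if word.lower() in centroid.get(attribute, set())
    )


# Hypothetical records standing in for two Subject Service databases.
centroids = {
    "ADAM": build_centroid([{"title": "Mona Lisa studies"}]),
    "SOSIG": build_centroid([{"title": "Social science gateway"}]),
}

print(servers_to_query(centroids, "title", "mona"))
```

A centroid only says that a word occurs somewhere on a server, so it rules servers out cheaply but the query itself still has to be run on the ones that remain.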
DESIRE
European Web cataloguing
• Subject Services
• EuroSOSIG (Bristol), EELS (Lund),
Arts (Koninklijke Bibliotheek)
• Manually created ROADS templates
• European Web Index
• based on Nordic Web Index (NWI)
• Robot generated, all resources
• Multiple servers linked with Z39.50
• GILS
http://www.nic.surfnet.nl/surfnet/projects/desire/desire.html
33
DESIRE - current work (1)
• Internationalisation of ROADS
• Use of robots to:
• aid manual cataloguing of resources
• build indexes based on list of URLs in
a ROADS database
• Robot will use embedded Dublin Core
if available
34
DESIRE - current work (2)
• Re-design of EWI robot - including:
• support for Dublin Core
• EWI records GILS-II compatible
• Allow users to search across subject
services and the EWI using Z39.50
• by converting ROADS records into
GILS records
• by building a WHOIS++ to Z39.50
gateway
http://roads.ukoln.ac.uk/cgi-bin/egwcgi/egwirtcl/targets.egw
35
NewsAgent
Current awareness service for LIS...
• Distributed database
• servers at LITC, FD, UKOLN - Z39.50
• metadata (and some full-text)
• based on DALI
• Mixture of content streams
• Variety of access methods
• Web, e-mail and Z39.50 clients
• user-configurable profiles
http://www.ukoln.ac.uk/metadata/NewsAgent/
36
NewsAgent - Content
• Journals
• Program, VINE, Journal of
Librarianship and Information Science
• News and briefing material
• LA, IIS, UKOLN (Ariadne), BL, LITC
• Web pages
• E-mail lists and USENET news
37
NewsAgent - Harvesting
• Web crawler
• looking for embedded Dublin Core
• Limiting the harvest
– simple heuristics
– use of Dublin Core Relation element
• E-mail parser
http://www.ukoln.ac.uk/metadata/NewsAgent/dcusage.html
38
BIBLINK
Information flow between publishers
• traditional
• new - CD-ROM or Web (new to publishing)
and National Bibliographic Agencies
• British Library, UK
• Biblioteca Nacional, Madrid, Spain
• Bibliothèque Nationale de France, Paris
• Koninklijke Bibliotheek, Den Haag, Netherlands
• Nasjonalbiblioteket, Rana, Norway
• Universitat Oberta de Catalunya, Barcelona, Spain
http://www.ukoln.ac.uk/metadata/BIBLINK/
39
BIBLINK - research
• Scope
• Electronic publications suitable for inclusion in
National Bibliographies
• Metadata
• Dublin Core (with extensions!), SGML DTD
• Identifiers
• ISBN, ISSN, SICI, DOI, URN
• Transmission
• Simple e-mail or Web crawler
• Authentication
• MD5 hash assigned to each resource
40
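An MD5 hash of a resource can be computed with Python's hashlib and compared on receipt to detect changes in transit. How BIBLINK actually assigns and checks the hash is not described in the slides, so the function names and the comparison step below are assumptions for illustration.

```python
import hashlib


def md5_checksum(data: bytes) -> str:
    """Return the MD5 digest of the resource as a hexadecimal string."""
    return hashlib.md5(data).hexdigest()


def is_unchanged(data: bytes, expected_checksum: str) -> bool:
    """True if the received bytes still match the checksum sent with them."""
    return md5_checksum(data) == expected_checksum.lower()


resource = b"hello"
checksum = md5_checksum(resource)
print(checksum)
print(is_unchanged(resource, checksum))
```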
BIBLINK - data set
• Minimum data set
– Author, Title, Publisher, Place of Publication, Price,
Extent (size), Keywords, Description,
Edition/Version, Date of Publication, System
Requirements, Format, Language, Terms and
Conditions, Frequency, Identifier, Contributor,
Checksum
• Similar to DC but some don’t fit…
<META NAME="BIBLINK.placePublication"
CONTENT="Bath, UK">
<META NAME="BIBLINK.frequency"
CONTENT="monthly">
• Issues over conversion to MARC
41
BIBLINK - demonstrator
[Diagram: flow between Publishers and NBAs/National Libraries, labelled Dublin Core, E-mail, Dublin Core, UNIMARC, ??MARC]
• Cataloguing in Publication (CIP) level records
• Enhanced records optionally returned to publishers
• Conversion on to local MARC format using USEMARCON
42
Conclusions
• Think about metadata as a ‘process’
• Dublin Core syntax now stable
enough to use
• Use within projects initially
• Choose metadata management
model appropriate to your site
• Consider long term maintenance
and transition to other formats
43