LIS 450EP Case Study:
The Illinois Digital Library Initiative Project
Timothy W. Cole
William H. Mischo
Grainger Engineering Library Information Center
University of Illinois at Urbana-Champaign
http://dli.grainger.uiuc.edu/Publications/WHMischo/LIS450EP/
Outline
• Digital Libraries, Publishers, XML, &
the Scholarly Information Environment.
• The Illinois DLI / D-Lib Testbed Project.
• XML Technologies in Journal Publishing
• Current work: linking, metadata, metasearch, & the
Open Archives Initiative Protocol for Metadata
Harvesting.
References
Cole, Timothy W., William H. Mischo, Thomas G. Habing, and Robert H.
Ferrer. "Using XML and XSLT to Process and Render Online Journals."
Library Hi Tech 19, no. 3 (2001): 210-222. Available:
http://dx.doi.org/10.1108/07378830110405067
Shreeves, Sarah L., Joanne S. Kaczmarek, and Timothy W. Cole.
"Harvesting Cultural Heritage Metadata Using the OAI Protocol."
Library Hi Tech 21, no. 2 (2003): 159-169. Available:
http://dx.doi.org/10.1108/07378830310479802
Lagoze, Carl, and Herbert Van de Sompel. "The Making of the Open
Archives Initiative Protocol for Metadata Harvesting." Library Hi Tech
21, no. 2 (2003): 118-128. Available:
http://dx.doi.org/10.1108/07378830310479776
XML Schemas for Qualified Dublin Core, see bottom of Web page at URL:
http://www.dublincore.org/schemas/xmls/
Overview
• We now have the tools to pursue the grand
challenges of Information retrieval:
– Standard retrieval environment (Web) and
interface/client (Web Browser).
– Standardized search/retrieval mechanisms (HTTP
Post/Get, SQL, Z39.50, OAI).
– Standard language for describing and transforming
content and metadata (XML, XSLT, XML Schemas).
– Standard interoperability mechanisms to connect
heterogeneous content (HTTP, SOAP, OAI).
XML and Publishers
• Tim Gill of Quark, “…the use of XML could
lead to a drop in the cost of Web publishing by
30% to 50% and a significant reduction in the
time it takes to produce sites.”
• Gill: “I don’t believe that there is any
innovation in print that is going to save us even
10% in costs.”
• AIP all-XML Journal
• Issues and Challenges remain.
• Use of XML behind the scenes commonplace
XML and Publishers
• Vendor-Neutral, platform-independent structured
information standard.
• Document representation & interchange standard.
• Applications can externalize their data/metadata
as XML.
• Based on Document Object Model (DOM), std.
OOP-style components (XSLT, CSS, …)
• Issues with full-text representation: PDF,
XML/HTML. Value in indexing, retrieval.
The Digital Library
• ‘Digital’, ‘Virtual’, ‘Electronic’ Library as
network-based library without regard to place
and time.
• Digital Collections vs. Digital Library.
– Tendency to call collections & resources DLs.
– IMLS Framework of Guidance for Building Good
Digital Collections
• Emphasis on the integration of collections and
creation of DL services (e.g., NSDL).
• Application of standards and protocols enables
and facilitates development of services.
Scholarly Communication Overview
• Web-based E-Resources still publisher-centric.
– Not user-centric or topic-centric
• Growth of Heterogeneous Distributed Repositories.
– Value-added services and ‘branding’ of journals.
– Prestige of Journals and Publishers.
– Reciprocal linking relationships between publishers.
– Cooperation on linking standards (DOI, CrossRef).
• Alternative publishing models - Academia (e.g.,
SPARC), Preprint Servers, disintermediation.
Full-Text Technologies
• Continuum of Web-Enabled technologies
presently being utilized.
• Evolving technologies and standards.
• Role and history of markup.
• Increasing role and importance of XML.
• Towards a “Smart Document”
Distributed Repositories
• Current Resources:
– publisher repositories; A & Is (remote and local);
course management systems; OAI and preprint
servers; Web search engines; vendor portals;
institutional repositories
• Goal for distributed repositories:
Integration of discrete publisher repositories,
locally loaded full-text, local and remote A & I
services, OPAC, Web resources, and local data.
Distributed Repository - Needs
• Support simultaneous searching of A & I Services,
Distributed Repositories, OPACs, Web search
engines, local files. Integrate TOC, full-text.
• Remote Reference 24 X 7.
• Metadata harvesting
• Digital archiving.
• Local Resolver services for locally loaded or
Aggregator Resources.
Illinois Testbed Project
• Funded under DLI-I by NSF, DARPA, and
NASA, 1994--1998. Awards made to 6
universities.
• Large-scale Testbed, Distributed Repository
models, evaluation, Web software.
• Funded under CNRI D-Lib Test Suite
Program, 1998--2001.
• Collaborating Partners Program. AIP, APS,
ASCE, IEE, NRL, ASM, ACM, NTT
Learning Systems, Elsevier.
• All XML Journals -- AIP, APS, ACM.
Illinois Testbed
• American Institute of Physics--APL, JAP, RSI
– 18,000+ articles, 1995--.
• American Physical Society--PRL
– 14,000+ articles, 1995--, weekly updates.
• ASCE Journals (25 titles)
– 10,000+ articles, 1995--.
• IEE Proceedings and Electronics Letters
– 8,500+ articles, 1993--.
• IEEE Computer Society.
• ASM (ASM International, formerly the American Society for Metals) Handbook.
• ACM (Association for Computing Machinery)
Transactions.
• Elsevier Science.
Project Issues
• Evolution of the Document.
• Distributed information environment.
• Use of Metalanguages & Transformations
(SGML, XML).
• Searching over full-text of journals vs. document
surrogates in A & I format.
• Rendering and styling (SGML, XML, MathML).
• Dynamic metadata for normalization, linking.
• Breadth and depth of collections.
• User needs.
Accomplishments
• Process & retrieve from multiple publishers &
heterogeneous DTDs.
• Metadata specification that uses RDF, Qualified
Dublin Core, XML Schemas, XML Namespaces.
• Cross-repository searching (Testbed & D-LIB
Test Suite). Full-Text and Metadata.
• SGML to XML Conversion.
• XSLT, CSS, for transformation & rendering,
including Mathematics.
Accomplishments (2)
• Linking: Forward/Backward within Testbed,
from/to A & I Services.
• Conversion of ISO 12083 math markup to
MathML; rendering of MathML.
• Enhanced Web retrieval mechanisms: Author
Word Wheels, Co-Occurrence Matrices.
• Detailed user transaction logs, gathered at the
search argument level, with identification of
characteristics of each user search session.
• Simultaneous search within DeLiver of Testbed
repositories, A & Is, NCSTRL, …
Ongoing Investigations
• Support federated/broadcast searching of A & I
Services, Distributed Repositories, enhanced
navigation, expanded gateway functions.
• Interoperability models, e.g.,
Metadata harvesting vs. Federated (Broadcast)
• Z39.50 protocols, HTTP harvesting, Spider
technology (gathering).
• E-Journal Archiving (AIP).
• Local link server with context-sensitive resources.
• MathML & other ENTS (Essential Non-Text
Stuff)
XML Parser APIs: Tree-Based and
Event-Based
• DOM (Document Object Model for XML & HTML).
– DOM Level 1 and Level 2 W3C recommendation. Widely
implemented, Tree-Based. Hierarchy of nodes. Loads entire
document into memory. Level 2 adds namespace support,
traversal, stylesheets, events, triggers. Level 3 W3C candidate
recommendation. Parsers allow developers to iterate through
documents, change document content.
• SAX (Simple API for XML).
– Open-source, not W3C. Initially Java-based. Event-based,
fires events as it reads document, need not load entire
document into memory. Good for single-pass processing.
Xerces, XML4C, Sun Project X (Crimson), MSXML.
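The tree-based versus event-based contrast above can be sketched with the two parsers in the Python standard library. This is a minimal illustration, not the Testbed's code: the sample document and its element names (issue, article, title) are hypothetical.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical sample document; element names are illustrative only.
SAMPLE = (
    "<issue>"
    "<article><title>A</title></article>"
    "<article><title>B</title></article>"
    "</issue>"
)


def titles_dom(xml_text):
    """Tree-based: load the whole document into memory, then walk the nodes."""
    doc = minidom.parseString(xml_text)
    return [node.firstChild.data for node in doc.getElementsByTagName("title")]


class TitleHandler(xml.sax.ContentHandler):
    """Event-based: collect title text as the parser fires events."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so buffer it until the end tag.
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._in_title = False


def titles_sax(xml_text):
    """Single pass over the document; no tree is ever built."""
    handler = TitleHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.titles
```

Both functions return the same list of titles; the SAX version keeps only the current title in memory, which is why it suits single-pass processing of large files.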
XML Schema and Structure
• DTD
– Original schema representation, defines structural rules for a
class of XML documents. Inherited from SGML.
• XML Schema http://www.w3.org/XML/Schema
– W3C recommendation. Also sets out standardized structure
for class of XML documents. Is coded in XML, can be parsed
and edited with standard software. Two separate parts:
structures and datatypes.
• Namespaces http://www.w3.org/TR/REC-xml-names/
– W3C recommendation (1.1 candidate in work) Allows
developers to qualify element and attribute names with unique
URIs, avoids recognition errors.
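A small sketch of why namespace qualification avoids recognition errors: two elements share the local name "title" but are told apart by their URIs. The Dublin Core namespace URI is the real one; the second namespace and the record contents are hypothetical.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
LOCAL_NS = "http://example.org/local-terms/"  # hypothetical namespace

SAMPLE = f"""<record xmlns:dc="{DC_NS}" xmlns:loc="{LOCAL_NS}">
  <dc:title>Using XML and XSLT to Process and Render Online Journals</dc:title>
  <loc:title>Internal working title</loc:title>
</record>"""


def titles_by_namespace(xml_text):
    """Return both 'title' elements, keyed by the namespace they belong to."""
    root = ET.fromstring(xml_text)
    # ElementTree spells a qualified name as "{namespace-uri}localname".
    return {
        "dc": root.findtext(f"{{{DC_NS}}}title"),
        "local": root.findtext(f"{{{LOCAL_NS}}}title"),
    }
```

The prefixes (dc, loc) are only shorthand inside the document; matching is done on the URI, so a harvester does not depend on which prefix a provider chose.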
XML, XSLT, and CSS
• Use XML full-text articles as ordered hierarchy
of content objects.
• Generate item-level metadata in XML, using
RDF and Dublin Core syntax and semantics.
• XSLT and CSS used to present metadata and
articles in either XML or HTML format
depending on Browser.
• Mathematics rendering using MathML tools
(conversion from ISO 12083 to MathML).
• Real-time transformation between XML and
HTML using XSLT (scalability issues).
XML Linking
• XML Base http://www.w3.org/TR/xmlbase
– W3C recommendation. Permits use of relative URI path prefixes. Can
then shorten references.
• XLink http://www.w3.org/TR/xlink/
– W3C recommendation. Method for specifying navigational links. Allows
enforcement of specific path order through links. xlink:type=“simple”
corresponds to HTML <a> or <img> tags. May be used with XPointer.
• XInclude http://www.w3.org/TR/xinclude
– W3C working draft. Copies entire XML documents or selected portions
into current document. Uses XPath and XPointer to specify document
elements to include. Unlike XML external entities, no DTD is required.
• XML Pointer Language http://www.w3.org/XML/Linking
– Composed of multiple W3C recommendations and working drafts. A
language to be used for fragment identifier in XML. Uses XPath. Permits
string searches and range specifiers.
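The XML Base idea (shorten references by declaring a base URI once) can be sketched as follows. This is a simplified illustration that honors xml:base on the root element only; the table-of-contents markup and the URLs are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# The "xml" prefix is bound to this namespace by the XML specification.
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"

# Hypothetical table of contents with relative links.
SAMPLE = """<toc xml:base="http://journals.example.org/apl/v74/">
  <item href="i21/p3140.xml"/>
  <item href="i21/p3143.xml"/>
</toc>"""


def resolve_links(xml_text):
    """Resolve each item's relative href against the root's xml:base."""
    root = ET.fromstring(xml_text)
    base = root.get(XML_BASE, "")
    return [urljoin(base, item.get("href")) for item in root.iter("item")]
```

A full implementation would also look for xml:base on intermediate ancestors, since the recommendation allows it on any element.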
Searching and Transformation
• XPath http://www.w3.org/TR/xpath
– W3C recommendation. Defines pattern-matching syntax used
by XSLT and XPointer. Method for selecting data (e.g. nodes,
attributes, …) in a document.
• XSL-FO http://www.w3.org/TR/xsl/
– W3C recommendation. FO similar to CSS but more powerful
for XML document formatting.
• XSLT http://www.w3.org/TR/xslt
– W3C recommendation. (2.0 working draft) Mechanism for
transforming XML documents. Can be used for
normalization of XML documents from different schemas.
• XML Query http://www.w3.org/XML/Query
– Composed of multiple W3C working drafts. Designed to
bring database-style queries to XML documents.
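As a rough illustration of XPath as a selection syntax, the sketch below uses the limited XPath subset supported by Python's standard ElementTree module (a full XPath 1.0 engine would need a third-party library). The journal markup is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical journal fragment.
SAMPLE = """<journal>
  <article year="1999"><title>Thin films</title><au>Tom</au></article>
  <article year="2001"><title>Quantum dots</title><au>Tim</au></article>
  <article year="2001"><title>Lattice models</title><au>Bob</au></article>
</journal>"""


def select_titles(xml_text, year):
    """Select titles of articles whose 'year' attribute matches."""
    root = ET.fromstring(xml_text)
    # Path: any descendant <article> with the given year, then its <title> child.
    path = f".//article[@year='{year}']/title"
    return [el.text for el in root.findall(path)]
```

For example, select_titles(SAMPLE, "2001") returns the two 2001 titles in document order. The same path expression could serve as the match pattern of an XSLT template.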
Converting XML to HTML (XSLT)
• Simple one-to-one conversions:
<sect> becomes <span class="sect">
– span.sect {display:block;margin-left:2em}
• Attribute based conversions:
<emph type="1"> becomes <span class="emph_1">
– span.emph_1 {font-style:italic}
• Generated text, such as punctuation:
<ag><au>Tom</au><au>Tim</au><au>Bob</au></ag>
becomes Tom, Tim, Bob.
• Rearranged children:
<au><sn>Habing</sn><fn>Tom</fn></au> becomes
Tom Habing
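The four conversion rules above can be sketched in a few lines. The project used XSLT for this; the sketch below swaps in plain Python with ElementTree so it runs without an XSLT processor, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape


def author_name(au):
    """Rearranged children: emit first name before surname when both exist."""
    fn, sn = au.findtext("fn"), au.findtext("sn")
    if fn is not None and sn is not None:
        return f"{fn} {sn}"
    return (au.text or "").strip()


def to_html(el):
    """Recursively convert one XML element into an HTML string."""
    if el.tag == "ag":
        # Generated text: comma separators and a closing period.
        names = ", ".join(author_name(au) for au in el.findall("au"))
        return escape(names + ".")
    if el.tag == "au":
        return escape(author_name(el))
    if el.tag == "emph":
        # Attribute-based conversion: the type attribute picks the CSS class.
        kind = el.get("type")
        css_class = f"emph_{kind}" if kind else "emph"
    else:
        # Simple one-to-one conversion: element name becomes the CSS class.
        css_class = el.tag
    inner = escape(el.text or "")
    for child in el:
        inner += to_html(child) + escape(child.tail or "")
    return f'<span class="{css_class}">{inner}</span>'


def convert(xml_text):
    return to_html(ET.fromstring(xml_text))
```

In XSLT each branch would be a separate template rule; the CSS shown on the slide then styles the generated span classes.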
XSLT Where Should It Happen
• Client-side
– IE5+, Netscape 7+/Mozilla
• Not Netscape 6 and earlier
• IE5 not fully compliant w/ XSLT and XPath standard
– Can reduce the load on your servers
– But performance on low-end clients can be BAD
• Server-side
– Performance could be a problem on busy servers, serving
large, complex documents
– More control & flexibility over the conversion
(metamerge)
• Offline Preconversion
– Best performance
– Not best for dynamic documents (metamerge)
Remote Object Access
• Web Services:
– Based on XML, SOAP (Simple Object Access
Protocol – W3C), UDDI (Universal Description,
Discovery, and Integration), and WSDL (Web
Services Description Language). Applications are
assembled on the fly in XML, exposed to the world,
and accessed via the Web from different devices.
– Supported by Microsoft .net, IBM WebSphere, SUN
One.
• OCLC looking at implementing Web Services
(e.g., for Name Authority lookup)
Schemas vs. DTDs
• Both are systems of representing a data model
that defines the data’s elements and attributes,
and the relationship among elements.
• Schemas add namespaces, address limitations of
DTDs & facilitate data-typing.
• W3C XML Schema Working Group: two
documents: XML structures and datatypes.
– Alternatives to XML Schema:
RELAX-NG
Schematron
Examples from DLI / D-Lib
• ACM Search
• XML & XSLT for layered views of content
(publisher.toc, journal.toc, XSLT, HTML)
• Transforms of SGML to MathML
(png image, SGML math, MathML)
• On the fly XML to HTML
• Transforms of Qualified DC to Simple DC
Qualified, Simple, XSLT, Alt. XSLT
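The Qualified DC to Simple DC transform can be pictured as a lookup from each refinement to the element it refines. The project did this with XSLT; the sketch below uses Python instead, represents a record as (term, value) pairs, and covers only a partial, illustrative subset of the DCMI refinements.

```python
# Partial map of DCMI element refinements to the simple element each refines.
REFINES = {
    "dcterms:alternative": "dc:title",
    "dcterms:abstract": "dc:description",
    "dcterms:tableOfContents": "dc:description",
    "dcterms:created": "dc:date",
    "dcterms:issued": "dc:date",
    "dcterms:modified": "dc:date",
    "dcterms:extent": "dc:format",
    "dcterms:medium": "dc:format",
    "dcterms:isPartOf": "dc:relation",
    "dcterms:references": "dc:relation",
    "dcterms:isReferencedBy": "dc:relation",
    "dcterms:spatial": "dc:coverage",
    "dcterms:temporal": "dc:coverage",
}


def dumb_down(statements):
    """Map (term, value) pairs from Qualified DC to Simple DC.

    Simple DC terms pass through unchanged, known refinements are replaced
    by the element they refine, and unknown terms are dropped.
    """
    simple = []
    for term, value in statements:
        if term.startswith("dc:"):
            simple.append((term, value))
        elif term in REFINES:
            simple.append((REFINES[term], value))
    return simple


# Hypothetical record.
QUALIFIED = [
    ("dc:title", "Applied Physics Letters"),
    ("dcterms:issued", "1999-05-24"),
    ("dcterms:isPartOf", "Illinois Testbed"),
    ("local:shelfmark", "Q1"),
]
```

Dropping unknown terms is one possible policy; a real transform might instead log them or keep them in a local extension.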
Linking & Metadata Aggregation
• Digital Object Identifier (DOI) and
CrossRef.
• OpenURL and Value-Added Service
Components (SFX, Encompass).
• Local Resolver Servers.
• OAI-PMH, Dublin Core (DC) & Qualified
DC.
Metadata in DLI
• To normalize & augment presentation.
• To normalize searching (e.g. Names).
• To store dynamic links.
• Types of links:
– Articles referenced by item (Backward).
– Articles that reference the item (Forward).
– A & I Records for references and items.
– Other relationships (TOC, Other items by
Author, Collaborative Data).
– Known item and presumptive linking.
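Backward and forward links are two views of the same citation data. The slides do not say how the Testbed computed them; the sketch below is one simple possibility, assuming each article's reference list has already been matched to item identifiers (the identifiers shown are hypothetical).

```python
from collections import defaultdict


def forward_links(backward):
    """Invert {citing_id: [cited_ids]} into {cited_id: [citing_ids]}."""
    forward = defaultdict(list)
    for citing, cited_list in backward.items():
        for cited in cited_list:
            forward[cited].append(citing)
    # Sort for stable output regardless of input order.
    return {cited: sorted(citers) for cited, citers in forward.items()}


# Hypothetical item identifiers: each key cites the items in its list.
BACKWARD = {
    "apl-1999-001": ["prl-1997-010", "jap-1996-200"],
    "apl-2000-050": ["prl-1997-010"],
}
```

Because new citing articles keep arriving, forward links change over time, which is one reason to store them as dynamic metadata rather than in the article itself.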
Digital Object Identifier (DOI)
• DOI is both a unique identifier of a piece of
digital content AND a system to access that
content digitally. Persistent object identifier.
– ‘The ISBN for the 21st Century’ -- Norman Paskin.
– DOI system has two main parts (the identifier and a
directory system) and a third logical component, a
database.
– Developed by AAP (Association of American
Publishers), now managed by International DOI
Foundation.
• 5 million+ DOI records in CrossRef
DOI Construction
• First real open standard for content identification.
• DOI is a number that identifies a digital object:
– 10.1063/S000369519903216
• 10                  Registration Agency Prefix
• 1063                Publisher Prefix
• S000369519903216    Suffix (Publisher-assigned ID)
• Suffix can be SICI or PII.
• The DOI and the URL pointing to the digital object
are registered with the International DOI
Foundation, e.g.:
– 10.1063/333 | http://www.pubsite.org/apr99/artl1.pdf
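The DOI structure described above can be taken apart mechanically. A minimal sketch, using the slide's own labels for the parts; the function names are hypothetical, and the proxy address is the dx.doi.org resolver used elsewhere in these slides.

```python
from typing import NamedTuple
from urllib.parse import quote


class DOI(NamedTuple):
    agency: str     # "10" in the slide's example
    publisher: str  # "1063"
    suffix: str     # publisher-assigned ID


def parse_doi(doi):
    """Split a DOI into its prefix parts and suffix."""
    # Only the first "/" separates prefix from suffix; suffixes may contain "/".
    prefix, slash, suffix = doi.partition("/")
    agency, dot, publisher = prefix.partition(".")
    if not (slash and suffix and dot and agency and publisher):
        raise ValueError(f"not a well-formed DOI: {doi!r}")
    return DOI(agency, publisher, suffix)


def resolver_url(doi, proxy="http://dx.doi.org/"):
    """Build the URL that asks the DOI proxy to redirect to the object."""
    return proxy + quote(doi, safe="/")
```

For the slide's example, parse_doi("10.1063/S000369519903216") yields the three parts listed above, and resolver_url gives the address a browser would follow to reach the registered URL.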
Reference Linking
• Alternatives to DOI:
– Proprietary Link Managers (AIP, APS)
– Even then, most still use DOIs as well
• CrossRef Project: major Sci-Tech professional
societies and commercial publishers.
– 252 members
– 9.3 million registered items (journal articles &
conference papers).
• Appropriate Copy Problem (OhioLink, Los
Alamos, NRL).
Local Resolver
• Issue: Directing users to locally held or
licensed version of Digital Object (locally
loaded or from Aggregator).
• Appropriate Copy problem.
• Additional desire to direct users to local
value-added services: local print holdings,
interlibrary borrowing, other articles in A &
I Services.
• Special Services
– http://g118.grainger.uiuc.edu/linker/
[Diagram: DOI and OpenURL link resolution. Labels: OpenURL-aware client (Web browser) with cookie; DOI proxy (dx.doi.org/10.1063/1234); nosfx=y; Handle server; CrossRef metadata database; Illinois local link server; UIUC metadata registry; AIP, IEE, Elsevier; local AIP, IEE; local value-added services.]
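A local resolver receives the citation as an OpenURL, that is, as query parameters appended to the resolver's address. The sketch below assumes OpenURL 0.1-style key names (genre, volume, issue, spage, date, sid, id); the resolver address and the sid value are hypothetical.

```python
from urllib.parse import urlencode

# Citation keys passed through if present (assumed OpenURL 0.1-style names).
CITATION_KEYS = ("genre", "issn", "volume", "issue", "spage", "date", "aulast", "atitle")


def build_openurl(resolver_base, citation, sid=None):
    """Append citation metadata to a resolver address as query parameters."""
    params = []
    if sid:
        params.append(("sid", sid))  # identifies the referring service
    for key in CITATION_KEYS:
        if citation.get(key):
            params.append((key, citation[key]))
    if citation.get("doi"):
        params.append(("id", "doi:" + citation["doi"]))
    return resolver_base + "?" + urlencode(params)


# Citation data taken from the first reference in these slides.
CITATION = {
    "genre": "article",
    "volume": "19",
    "issue": "3",
    "spage": "210",
    "date": "2001",
    "doi": "10.1108/07378830110405067",
}
```

Because the resolver address is supplied by the user's own institution, the same citation can lead different users to different (appropriate) copies.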
Open Archives Initiative (OAI)
• Version 1 released Jan ‘01, V.2 released June ‘02
• Mechanism for data providers to expose their
metadata through an HTTP protocol and a
mechanism for harvesting records containing
metadata from repositories.
• Roots in e-print archives.
• Lightweight, low-barrier. Easy to implement on
standard Web servers to handle OAI protocol
requests; need to incorporate into workflow used
to create / maintain metadata.
OAI Continued
• Requires repositories to support the Dublin Core
schema as lowest common denominator.
• Allows communities to expose metadata in other
formats as long as records are structured as XML
data with corresponding XML schema.
– Application for discipline specific portals, institutional
repositories, NSDL, IMLS
• Over 250 OAI 2.0 metadata providers.
– http://oai.grainger.uiuc.edu/registry
• OAI extensions in development:
– OAI Static Repository Gateway
– OAI Rights
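Since every repository must expose Dublin Core, a harvester can always fall back on reading oai_dc records. A minimal sketch of that step, parsing a hand-written sample response rather than a live repository; the namespace URIs are those of OAI-PMH 2.0 and Dublin Core, while the record content is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written sample in the shape of a ListRecords response.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:item-1</identifier>
        <datestamp>2003-01-15</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Roman coin</dc:title>
          <dc:subject>Numismatics</dc:subject>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def extract_records(xml_text):
    """Return each record's identifier and its Dublin Core fields."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        fields = {}
        dc_root = rec.find("oai:metadata/oai_dc:dc", NS)
        if dc_root is not None:  # deleted records carry no metadata
            for el in dc_root:
                name = el.tag.rsplit("}", 1)[-1]  # strip the namespace URI
                fields.setdefault(name, []).append(el.text or "")
        records.append({
            "identifier": rec.findtext("oai:header/oai:identifier", namespaces=NS),
            "fields": fields,
        })
    return records
```

Fields are kept as lists because every Dublin Core element is repeatable.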
How OAI Works
OAI “VERBS”: Identify, ListMetadataFormats, ListSets, ListIdentifiers, ListRecords, GetRecord
[Diagram: the Service Provider (OAI harvester) sends an HTTP request (OAI verb) to the Metadata Provider (OAI repository), which returns an HTTP response (valid XML).]
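An OAI request is an ordinary HTTP request whose query string names one of the six verbs plus its arguments. A minimal sketch of building such a request URL; the repository address is hypothetical, and the function only constructs the URL without contacting any server.

```python
from urllib.parse import urlencode

OAI_VERBS = {
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
}


def oai_request_url(base_url, verb, args=None):
    """Build the URL a harvester would fetch for one OAI-PMH request."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI verb: {verb!r}")
    args = dict(args or {})
    # In OAI-PMH 2.0 a resumptionToken is an exclusive argument.
    if "resumptionToken" in args and len(args) > 1:
        raise ValueError("resumptionToken must be the only argument besides verb")
    # Arguments are sorted only to make the output deterministic.
    params = [("verb", verb)] + sorted(args.items())
    return base_url + "?" + urlencode(params)
```

For example, oai_request_url("http://repo.example.org/oai", "ListRecords", {"metadataPrefix": "oai_dc", "from": "2003-01-01"}) asks for all Dublin Core records changed since the given date; the repository answers with an XML document.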
Metadata Schemas Used By
OAI Metadata Providers
Illinois-Mellon OAI Project
• Funded to create a web portal to scholarly information
resources in cultural heritage harvested via OAI-PMH
• Primary objectives:
– Build harvesting and search service
– Investigate viability and utility of searching OAI
harvested resources
• Explore issues of advanced search/indexing/display
• Document user needs & usage patterns
– Identify critical issues and best practices for using
OAI-PMH with cultural heritage material
Technical achievements (Mellon)
• Developed harvesting tools (OpenSource)
• Refined data provider tools (OpenSource)
• Investigated logistics and scalability of
harvesting activities
• Created XSL stylesheets for metadata
transformations
• Experimented w/configurations for
scalability and performance issues
Metadata aggregation (Mellon)
• 39 providers (OAI-compliant and surrogates)
– Metadata describing resources of 580 institutions
• 1.1 million original records
– 2.6 million including item-level records derived
from EAD finding aids
[Pie chart: Types of Institutions in UIUC Repository. Categories: Academic Libraries, Digital Libraries, Museums/Cultural/Historical Orgs, Public Libraries. Slice values shown: 5%, 18%, 41%, 36%.]
Type of resources (Mellon)
• Hidden web
• Other includes:
– archival collections
– websites
– moving images
– audio
• 30% of metadata describes digitized
objects (of any type)
[Pie chart: Text & Sheet Music 50%, Images 25%, Artifacts 20%, Other 5%.]
DC element usage (Mellon)
– Records containing subject & description element

                                                             SUBJECT  DESCRIPTION
  Digital libraries (10 total, 122,719 records)              78%      36%
  Museums, hist. societies, etc. (6 total, 255,800 records)  93%      93%
  Academic libraries (7 total, 235,294 records)              15%      13%

– Many different controlled and local vocabularies in use
– Granularity: a record may describe a collection
of coins — or one coin
Related ongoing & future work
• Test usability with targeted user community
• Linking resources
– Including linking using MathML
• Simultaneous search, automated metadata
generation, & automated metadata normalization
• NSF National Science Digital Library Projects
– Mathematics resources & MathML
– Combining sci-tech journals with other Web resources
• Additional OAI Implementations
– IMLS NLG
– CIC
– DLF - DODL
Open Issues
• Role of Authors, Academic Institutions, Libraries,
Publishers, Abstracting & Indexing Services.
• Disintermediation may affect both Libraries and
Publishers.
• Information as Function not Place.
• Provide ‘Digital Library’ services built atop
digital collections.
• Role of XML technology.
• Service mechanisms: processing & archiving,
search and discovery, presentation, linking.