XML: a very brief introduction

(and other related acronyms)
John Miller, KU, February 28, 2002
General outline:
What is XML?
How is it related to other stuff?
  the context
  other standards and tools
Why is it potentially important to libraries?
Examples
Q&A
It’s an acronym soup ... some of the ingredients
XML, W3C, RDF, XSL, SGML, TEI, XSLT, HTML, EAD, DTD, XHTML, MARC, XML Schema, DOM, VRA, CSS, PICS, DC
XML: What is it?
eXtensible
Markup
Language
Language?
conveys meaning -- provides a means for others to understand your intent
has rules
has a syntax
but ... NOT a programming language
Markup?
conveys meaning by “marking up” other text and data with tags
for example:
  <name>John Miller</name>
  <city>Lawrence</city>
  <shoe_size>13</shoe_size>
generically:
  <element>value</element>
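A program reads markup like this by parsing it into elements. The following is a minimal sketch in Python using the standard library's xml.etree.ElementTree. The <person> wrapper is an assumption added for illustration, since the three sample tags need a single root element to form one document.

```python
import xml.etree.ElementTree as ET

# The <person> root is hypothetical; the slide shows only the three inner tags.
doc = """<person>
  <name>John Miller</name>
  <city>Lawrence</city>
  <shoe_size>13</shoe_size>
</person>"""

root = ET.fromstring(doc)

# Each child follows the generic pattern <element>value</element>.
record = {child.tag: child.text for child in root}
print(record)
```

Note that every value arrives as text; deciding that shoe_size is a number is left to the program or to a schema.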
eXtensible?
“capable of being extended”
extend = “to increase the scope, meaning, or application of; broaden”
not tied to a single model or data definition
handles both text and data
What else is it?
it is a family of technologies (more later)
it is a “simplified” version or subset of SGML (more later)
it is a means of separating the description of document structure from document appearance
combined with style sheets, it can be used to create formatted documents in any style you want
What else is it? -- cont.
it is modular -- one can define a new document by combining and reusing other existing formats
it is an open standard, not tied to any one company or software
it can be read both by humans and by programs (unlike a MARC record, for example)
perhaps most importantly, “it is the basis for RDF and the Semantic Web” (more later)
What else is it? -- cont.
To repeat ... it is a single system that can be used as the basis both for
  storing, searching, formatting, & displaying TEXT
  storing, searching, formatting, & displaying DATA
but, there is some disagreement ...

“XML is not a markup language -- it is a toolkit for creating, shaping, and using markup languages” (Erik T. Ray, Learning XML, O’Reilly, 2001)

“XML is a markup language, and only a markup language. It’s important to remember this fact. The XML hype has become so extreme that some people expect XML to do everything up to, and including, washing the family dog.” (Elliotte Rusty Harold & W. Scott Means, XML in a Nutshell, O’Reilly, 2001)
How is it related to other stuff?
The Context
Or, ... what the h*@# are RDF and the Semantic Web?
[hint: it all depends on your ontology]
Definitions: W3C
World Wide Web Consortium
founded 1994 by Tim Berners-Lee (with MIT & CERN); now has 506 institutional members; TBL still leads
“... develops interoperable technologies (specifications, guidelines, software, and tools) to lead the Web to its full potential as a forum for information, commerce, communication, and collective understanding.”
XML, XSL, CSS, HTML, and many others are W3C standards
www.w3c.org
Definitions: ontology / ontologies
Webster's: “a branch of metaphysics concerned with the nature and relations of being”
W3C: “Formal descriptions of terms in a certain area (shopping or manufacturing, for example) are called ontologies and are a necessary part of the semantic web.”
TBL et al.: [Ontologies are] “collections of statements written in a language such as RDF that define the relations between concepts and specify logical rules for reasoning about them. Computers will "understand" the meaning of semantic data on a Web page by following links to specified ontologies.”
Ontologies -- cont.
... or, more precisely:
“Artificial-intelligence and Web researchers have co-opted the term for their own jargon, and for them an ontology is a document or file that formally defines the relations among terms. The most typical kind of ontology for the Web has a taxonomy and a set of inference rules.” (TBL et al.)
Namespaces
What are they?
  means of linking a tag to a metadata standard and/or DTD
  spaces within which an ontology is defined
Why are they needed?
  XML is modular: a single document can combine portions of different XML documents that conform to different DTDs, i.e., which use different ontologies
  Example: Both HTML and Dublin Core have an element called <title>
  format = <prefix:element>, where the prefix stands for a namespace
  for example: <dc:title>
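The <title> clash can be seen from the program's side as well. The sketch below, in Python with the standard library's xml.etree.ElementTree, reads a document that mixes the two vocabularies. The <record> wrapper, the "h" prefix, and the sample titles are assumptions made up for illustration; the two namespace URIs are the published ones for Dublin Core elements and XHTML.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map used for lookups; the prefixes are local shorthand only.
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "h": "http://www.w3.org/1999/xhtml",
}

# Hypothetical document mixing two vocabularies that both define <title>.
doc = """<record xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:h="http://www.w3.org/1999/xhtml">
  <dc:title>Moby Dick</dc:title>
  <h:title>Catalog page for Moby Dick</h:title>
</record>"""

root = ET.fromstring(doc)

# The namespace, not the bare word "title", tells the two elements apart.
dc_title = root.find("dc:title", NS).text
html_title = root.find("h:title", NS).text
print(dc_title)
print(html_title)
```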
Definitions: RDF
Resource Description Framework
W3C: “Resource Description Framework (RDF) is a foundation for processing metadata; it provides interoperability between applications that exchange machine-understandable information on the Web. RDF emphasizes facilities to enable automated processing of Web resources.”
and ... “The broad goal of RDF is to define a mechanism for describing resources that makes no assumptions about a particular application domain, nor defines (a priori) the semantics of any application domain.”
the RDF data model
3 types of objects:
  resource -- web site, web page, individual tagged element on a page, etc. -- always named by a URI
  property -- “a specific aspect, characteristic, attribute, or relation used to describe a resource” (W3C) -- also identifiable by a URI
  statement -- combination of a resource, a property, and a value for the property
RDF: statement example
“John Smith is the creator of www.xyz.edu”
  subject (resource):      www.xyz.edu
  predicate (property):    creator
  object (value/literal):  John Smith
RDF -- cont.
So ... how can this framework be implemented? ... how can automated communication occur across the web?
XML !
Another definition of RDF:
Scientific American: “A scheme for defining information on the Web. RDF provides the technology for expressing the meaning of terms and concepts in a form that computers can readily process. RDF can use XML for its syntax and URIs to specify entities, concepts, properties and relations.”
an RDF “message” (“xmlns” = namespace declaration)
<?xml version="1.0"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:s="http://description.org/schema/">
  <rdf:Description about="http://www.lib.xyz.edu/">
    <s:Creator>John Smith</s:Creator>
  </rdf:Description>
</rdf:RDF>
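To connect the message back to the statement model, the sketch below pulls (subject, predicate, object) triples out of it with Python's standard xml.etree.ElementTree. It is not an RDF parser: it assumes the simple shape shown above (Description elements whose children are literal-valued properties), and the function name triples is invented here.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

message = """<?xml version="1.0"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:s="http://description.org/schema/">
  <rdf:Description about="http://www.lib.xyz.edu/">
    <s:Creator>John Smith</s:Creator>
  </rdf:Description>
</rdf:RDF>"""

def triples(rdf_xml):
    """Yield (subject, predicate, object) for each property of each Description."""
    root = ET.fromstring(rdf_xml)
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        # Accept the unqualified 'about' used on the slide or 'rdf:about'.
        subject = desc.get("about") or desc.get(f"{{{RDF_NS}}}about")
        for prop in desc:
            # ElementTree spells a qualified name as {namespace-URI}localname.
            yield (subject, prop.tag, prop.text)

statements = list(triples(message))
for s, p, o in statements:
    print(s, p, o)
```

The predicate comes back as the full namespace URI plus "Creator", which is the point of the namespace declaration: the property is identified by a URI, not by the short prefix "s".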
Definitions: URI vs. URL
URI: “Uniform Resource Identifier. The generic set of all names/addresses that are short strings that refer to resources.” (W3C)
URL: “Uniform Resource Locator. An informal term (no longer used in technical specifications) associated with popular URI schemes: http, ftp, mailto, etc.” (W3C)
Definitions: Semantic Web
SEMANTICS: “the study of meanings” -- “the meaning or relationship of meaning of a sign or set of signs” (Webster’s)
“The Semantic Web” by Tim Berners-Lee, James Hendler, & Ora Lassila. (see bibl. at end)
Some quotes from the article:
  “The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation.”
  “The Semantic Web will bring structure to the meaningful content of Web pages, creating an environment where software agents roaming from page to page can readily carry out sophisticated tasks for users.”
Semantic web -- cont.
More quotes:
  “For the semantic web to function, computers must have access to structured collections of information and sets of inference rules that they can use to conduct automated reasoning.”
  “Human language thrives when using the same term to mean somewhat different things, but automation does not.”
  “Two important technologies for developing the Semantic Web are already in place: eXtensible Markup Language (XML) and the Resource Description Framework (RDF).”
How is it related to other stuff?
standards and tools
XML: “well-formed” vs. “valid”
well-formed: follows the rules for XML internal structure and consistency
valid: follows a standard definition of the structure and content of a document, either a
  Document Type Definition (DTD), or an
  XML Schema
3 basic kinds of XML docs
1) well-formed, but unvalidated
2) well-formed and valid, based upon a Document Type Definition (DTD) -- either internal or external
3) well-formed and “valid”, based upon specifications in an XML Schema
Requirements for being “well-formed”
1) a declaration at the top of a document signaling what it is: <?xml version="1.0"?> (recommended; strictly optional in XML 1.0)
2) if conforming to a DTD, a declaration of that DTD: <!DOCTYPE TEI SYSTEM "teixlite.dtd">
3) a root element: <document> or <letter> or <shoe> or ...
4) every start tag must have an end tag or, if empty, have a special format: <a></a> or <a/>
5) tags must nest cleanly: <a><b></b></a>
6) attribute values must be in quotation marks
7) tags are case-sensitive and must match
8) some characters must be rendered in a special way (e.g., &lt; for < and &amp; for &)
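Several of these rules can be tried out directly, because any XML parser refuses input that is not well-formed. The sketch below uses Python's standard xml.etree.ElementTree; the helper name is_well_formed and the sample strings are made up for illustration. The parser used here is non-validating, so it checks well-formedness only and says nothing about conformance to a DTD.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

samples = {
    "nested cleanly": "<a><b></b></a>",
    "empty element": "<a/>",
    "overlapping tags": "<a><b></a></b>",
    "missing end tag": "<a><b></a>",
    "case mismatch": "<a></A>",
    "unquoted attribute": "<a x=1></a>",
    "bare ampersand": "<a>Q & A</a>",
    "escaped ampersand": "<a>Q &amp; A</a>",
}

results = {label: is_well_formed(xml) for label, xml in samples.items()}
for label, ok in results.items():
    print(f"{label}: {ok}")
```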
Valid? Document Type Definition (DTD)
What is a DTD?
It is a set of rules that define:
1) what elements may appear in a document
2) what elements must appear in a document
3) what elements may be repeated
4) the hierarchical relationship among elements
5) what attributes are allowed for each element
6) other structural requirements
Generally, a separate document, but definitions also can be inside an XML document
Valid? XML Schema
“an XML language for describing and constraining the content of XML documents” (W3C)
a schema document is itself an XML document
an alternative to a DTD -- both can exist (as alternatives) for a particular format, but only one is needed
XSL & XSLT
eXtensible Stylesheet Language
  “XSL is a language for expressing style sheets. An XSL style sheet is, like with CSS, a file that describes how to display an XML document of a given type” (W3C)
  includes XSL FO: XSL Formatting Objects
eXtensible Stylesheet Language for Transformations
  “Originally intended to perform complex styling operations, like the generation of tables of contents and indexes, it is now used as a general purpose XML processing language. XSLT is thus widely used for purposes other than XSL, like generating HTML web pages from XML data.” (W3C)
Stylesheets - why do we need them?
XML is not a fixed tag set -- a generic processor/browser has no idea what the tags “mean”
XML markup generally does not include any formatting instruction
want to store XML data in one format and present it in a different form
want to present same XML data in many different ways
CSS
Cascading Style Sheets
a simple styling language for defining and attaching styles to HTML (or XML) elements. Each element type and each of its occurrences within a document can be given a unique style
defines margins, positioning, fonts, color, size, box and list properties, etc.
not NEEDED to render XML-based HTML, but useful
How do XSL and CSS compare?
XSL uses XML notation; CSS uses its own
CSS formatting follows the document “object tree”; XSL formatting can radically move objects around
both can be used to directly format XML documents
XSL transforms while it formats
CSS & XSL (cont.)
                            CSS    XSL
Can be used with HTML?      yes    no
Can be used with XML?       yes    yes
Transformation language?    no     yes
Syntax                      CSS    XML
(from W3C’s “What are style sheets” @ http://www.w3.org/Style)
XLink & XPointer
XML replacements for the HTML tags:
  XLink replaces <a href ...>
  XPointer replaces <a name ...>
in XML, ANY element can have a linking capability
in XML, can link to any point in a document with a tag
in XML, a link can import content rather than only transfer the reader to it
[Diagram of the processing chain. Components shown: DTD or Schema, XML doc, XSL doc, XSLT transformer software, HTML doc, CSS]
SGML
Standard Generalized Markup Language
an ISO standard for defining the structural descriptions of electronic documents
“SGML is very large, powerful, and complex. It has been in heavy industrial and commercial use for over a decade, and there is a significant body of expertise and software to go with it. XML is a lightweight cut-down version of SGML which keeps enough of its functionality to make it useful but removes all the optional features which make SGML too complex to program for in a Web environment.” (from Peter Flynn’s “XML FAQ” @ http://www.ucc.ie/xml/#sgml)
valid, well-formed XML is valid SGML, but not necessarily vice versa
XHTML
eXtensible Hypertext Markup Language
the successor to HTML, almost the same as HTML 4.01 (their DTDs are identical except for some differences between SGML and XML)
HTML defined according to XML rules -- the HTML spec is a specific XML DTD/schema
“The emergence of XHTML is just another chapter in the often tumultuous history of HTML and the World Wide Web, where confusion for authors is the norm, not the exception.” (Chuck Musciano & Bill Kennedy, HTML & XHTML: the Definitive Guide, O’Reilly, 2000)
Why is XML potentially important to libraries?
Library-related uses
Cross-database searching -- integration of multiple data definitions
Digital library metadata (EAD, VRA, etc.)
Textual markup for presentation
Public Interface design (e.g., OPAC) -- send XML (formatted by XSL), not HTML
Library-related uses -- cont.
exchanging data & metadata
  between libraries
    an alternative to MARC ?
    an alternative to Z39.50 ?
    NCIP (NISO Circulation Interchange Protocol)
  between libraries and vendors
  between library system & other campus systems (e.g., Voyager and PeopleSoft)
  Open Archives Initiative (OAI) metadata harvesting
Example: Endeavor ENCompass
Federated Searching
  Multi-protocol searching
    Z39.50
    HTTP
    XML Gateways
Search and Navigation
  Web based, using XSL technology for ultimate customization of displays
Example of an XML-based search system
[Diagram. Components shown: Browser, HTML search form, query, Perl / CGI control program, XML search software, collection of XML docs, DTD or Schema (validation), results as XML doc, XSL doc, XSLT transformer software, CSS, HTML response, happy user]
Data vs. Metadata
Like HTML, XML can contain both data and metadata
metadata can be explicit (i.e., between <metadata></metadata> tags)
metadata can be individual elements
metadata also can appear as attributes
but ... at least with text, what’s the difference once everything is tagged according to content & structure? -- one person’s metadata is another’s data
Some Common metadata schemes
DC: Dublin Core
VRA Core: Visual Resources Association
EAD: Encoded Archival Description
TEI: Text Encoding Initiative (also TEI Lite)
MARC: MAchine-Readable Cataloging
CSDGM: Content Standard for Digital Geospatial Metadata
FGDC: Federal Geographic Data Committee metadata
Example: DTD
<!-- This is a sample DTD for a record/CD collection -->
<!ELEMENT MYMUSIC (album+)>
<!ELEMENT album
    (title, ((artist+, genre+) | (genre+, artist+)),
     year_produced, year_purchased?, label?, song_list?)
>
<!ATTLIST album
    id         ID     #REQUIRED
    ref        IDREF  #IMPLIED
    condition  (n.p. | bad | worn | good | excellent) "n.p."
>
<!ELEMENT title (#PCDATA)>
<!ELEMENT artist (#PCDATA)>
<!ELEMENT genre (#PCDATA)>
<!ATTLIST genre
    type (folk | rock | country | blues | jazz | classical) #REQUIRED>
...
Example: XML
<MYMUSIC>
  <album id="LedZeppelin-1969-1" condition="worn">
    <title>Led Zeppelin II</title>
    <artist>Led Zeppelin</artist>
    <genre type="rock">Rock</genre>
    <year_produced>1969</year_produced>
    <year_purchased>1988</year_purchased>
    <label>Atlantic Records</label>
    <song_list>
      <song length="5:34">Whole Lotta Love</song>
      <song length="6:19">The Lemon Song</song>
      ...
      <song length="4:24">Ramble On</song>
      <song length="4:21">Moby Dick</song>
      <song length="4:19">Bring It On Home</song>
    </song_list>
  </album>
</MYMUSIC>
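One way to see what the album rule in the sample DTD demands is to treat its content model as a regular expression over the names of an element's children. The sketch below does that by hand in Python. It is only an illustration of the idea, not a DTD validator (the standard library's ElementTree does not validate against DTDs), and the names ALBUM_MODEL and matches_album_model, along with the two short test albums, are invented here.

```python
import re
import xml.etree.ElementTree as ET

# The album content model from the sample DTD, rewritten as a regular
# expression over a space-terminated sequence of child element names.
ALBUM_MODEL = re.compile(
    r"title "
    r"((artist )+(genre )+|(genre )+(artist )+)"
    r"year_produced "
    r"(year_purchased )?(label )?(song_list )?"
)

def matches_album_model(album):
    """Check child element order only; attributes and text are ignored."""
    names = "".join(child.tag + " " for child in album)
    return ALBUM_MODEL.fullmatch(names) is not None

# Hypothetical albums: the first follows the model, the second puts
# <artist> before <title>, which the model does not allow.
good = ET.fromstring(
    "<album><title>t</title><artist>a</artist><genre>g</genre>"
    "<year_produced>1969</year_produced><label>l</label></album>"
)
bad = ET.fromstring(
    "<album><artist>a</artist><title>t</title><genre>g</genre>"
    "<year_produced>1969</year_produced></album>"
)

print(matches_album_model(good))
print(matches_album_model(bad))
```

The DTD operators map directly onto regular-expression operators: the comma is sequence, | is choice, + is one or more, and ? is optional.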
Example: XSL
(part 1)
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
<xsl:output method="html" indent="yes" />
<xsl:template match="/">
<html>
<head><title>My Music</title></head>
<body>
<xsl:apply-templates select="MYMUSIC/album">
<xsl:sort select="year_produced"/>
</xsl:apply-templates>
</body>
</html>
</xsl:template>
Example: XSL
(part 2)
...
<xsl:template match="album">
<xsl:apply-templates select="title"/>
<i><xsl:value-of select="artist"/></i><br/>
<xsl:value-of select="year_produced"/>
<ol>
<xsl:for-each select="song_list/song">
<li><xsl:value-of select="."/></li>
</xsl:for-each>
</ol>
</xsl:template>
<xsl:template match="title">
<h2><xsl:apply-templates /></h2>
</xsl:template>
</xsl:stylesheet>
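For readers more at home in a general-purpose language, the sketch below writes out by hand, in Python, roughly what the two-part stylesheet asks an XSLT processor to do: sort the albums by year_produced, then emit a heading, the artist, the year, and a numbered song list for each. It does not run the stylesheet (Python's standard library has no XSLT processor); the function name render and the two-album sample are assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

def render(xml_text):
    """Build the HTML page that the templates describe, album by album."""
    root = ET.fromstring(xml_text)
    # <xsl:sort select="year_produced"/> compares values as text by default;
    # sorting on the string value here does the same.
    albums = sorted(root.findall("album"),
                    key=lambda a: a.findtext("year_produced", ""))
    parts = ["<html>", "<head><title>My Music</title></head>", "<body>"]
    for album in albums:
        parts.append(f"<h2>{escape(album.findtext('title', ''))}</h2>")
        parts.append(f"<i>{escape(album.findtext('artist', ''))}</i><br/>")
        parts.append(escape(album.findtext("year_produced", "")))
        parts.append("<ol>")
        for song in album.findall("song_list/song"):
            parts.append(f"<li>{escape(song.text or '')}</li>")
        parts.append("</ol>")
    parts += ["</body>", "</html>"]
    return "\n".join(parts)

# Shortened, hypothetical sample in the shape of the MYMUSIC document.
sample = """<MYMUSIC>
  <album id="b"><title>Later</title><artist>B</artist>
    <genre type="rock">Rock</genre><year_produced>1975</year_produced>
    <song_list><song length="3:00">Two</song></song_list></album>
  <album id="a"><title>Earlier</title><artist>A</artist>
    <genre type="rock">Rock</genre><year_produced>1969</year_produced>
    <song_list><song length="5:34">One</song></song_list></album>
</MYMUSIC>"""

page = render(sample)
print(page)
```

The stylesheet expresses the same steps declaratively, as templates matched against the document, which is why the presentation can be swapped by changing the stylesheet alone.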
Example: the resulting HTML display
Led Zeppelin II
Led Zeppelin
1969
1. Whole Lotta Love
2. What Is and What Should Never Be
3. The Lemon Song
4. Thank You
5. Heartbreaker
6. Living Loving Maid (She's Just a Woman)
7. Ramble On
8. Moby Dick
9. Bring It On Home
Dublin Core examples:

DTD declaration:
<!DOCTYPE rdf:RDF PUBLIC "-//DUBLIN CORE//DCMES DTD 2001 11 28//EN"
    "http://dublincore.org/documents/2001/11/28/dcmes-xml/dcmes-xml-dtd.dtd">

RDF/namespace declaration:
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/">
MARC & XML
MARC: MAchine-Readable Cataloging
  a structure / communications format, not a set of cataloging rules
  like XML, elements indicate structure/meaning rather than presentation -- can be extensible
  unlike XML, is a fixed set of tags (i.e., 3-digit numbers), designed principally to accommodate cataloging elements -- allows precise coding and facilitates precise searching and retrieval
  well-tested and implemented
LC and others are working on creating a MARC XML DTD
245 Title Statement (NR)
(the way it looks in the USMARC manual)
1st indicator
  0    No added entry
  1    Added entry
2nd indicator
  0-9  number of nonfiling characters
Subfield Codes
  $a   Title (NR)
  $b   Remainder of title (NR)
LC’s MARC 245 definition in XML DTD
(edited slightly for clarity)
<!ELEMENT mrcb245 ((mrcb245-6 | mrcb245-8 | mrcb245-a |
    mrcb245-b | mrcb245-c | mrcb245-d | mrcb245-e | mrcb245-f |
    mrcb245-g | mrcb245-h | mrcb245-k | mrcb245-n | mrcb245-p |
    mrcb245-s)*)
>
<!ATTLIST mrcb245
    name        CDATA  #FIXED  "TITLE STATEMENT"
    obsolete    CDATA  #FIXED  "no"
    repeatable  CDATA  #FIXED  "no"
    i1  (i1-0 | i1-1)  #REQUIRED
    i2  (i2-0 | i2-1 | i2-2 | i2-3 | i2-4 |
         i2-5 | i2-6 | i2-7 | i2-8 | i2-9)  #REQUIRED
>
MARC 245 subfields $a & $b definitions in XML
<!ELEMENT mrcb245-a (#PCDATA)>
<!ATTLIST mrcb245-a
    name        CDATA  #FIXED  "Title"
    obsolete    CDATA  #FIXED  "no"
    repeatable  CDATA  #FIXED  "no"
>
<!ELEMENT mrcb245-b (#PCDATA)>
<!ATTLIST mrcb245-b
    name        CDATA  #FIXED  "Remainder of title"
    obsolete    CDATA  #FIXED  "no"
    repeatable  CDATA  #FIXED  "no"
>
245 10 $a Moby Dick; $b or, The Whale.
becomes
<mrcb245 i1="1" i2="0">
  <mrcb245-a>Moby Dick;</mrcb245-a>
  <mrcb245-b>or, The Whale.</mrcb245-b>
</mrcb245>
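The conversion is mechanical enough to sketch in a few lines of Python with the standard xml.etree.ElementTree. The function name marc_field_to_xml and its argument layout are assumptions made up for illustration, and the attribute values simply mirror the instance shown on the slide; this is not LC's conversion software.

```python
import xml.etree.ElementTree as ET

def marc_field_to_xml(tag, indicators, subfields):
    """Build an element named mrcb<tag> with one child per subfield.

    tag: three-digit field tag, e.g. "245"
    indicators: two-character string, e.g. "10"
    subfields: list of (code, value) pairs in record order
    """
    field = ET.Element(f"mrcb{tag}", {"i1": indicators[0], "i2": indicators[1]})
    for code, value in subfields:
        # Each subfield code becomes part of the child element's name.
        child = ET.SubElement(field, f"mrcb{tag}-{code}")
        child.text = value
    return field

field = marc_field_to_xml(
    "245", "10", [("a", "Moby Dick;"), ("b", "or, The Whale.")]
)
xml_text = ET.tostring(field, encoding="unicode")
print(xml_text)
```

Because the field tag and subfield code are baked into the element names, this style needs a separate DTD declaration for every MARC field and subfield.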
or, an alternative: an XML schema from OAI
<element name="varfield">
  <complexType>
    <sequence>
      <element ref="oai_marc:subfield"
               maxOccurs="unbounded"/>
    </sequence>
    <attribute name="id"
               type="oai_marc:idType"
               use="required"/>
    <attribute name="i1"
               type="oai_marc:iType"
               use="required"/>
    <attribute name="i2"
               type="oai_marc:iType"
               use="required"/>
  </complexType>
</element>
generic idType and iType indicator definitions
<simpleType name="idType">
<restriction base="string">
<pattern value="[0-9]{1,3}"/>
</restriction>
</simpleType>
<simpleType name="iType">
<restriction base="string">
<pattern value="[0-9a-z\s]?"/>
</restriction>
</simpleType>
and … generic schema for a subfield:
<element name="subfield">
  <complexType>
    <simpleContent>
      <extension base="string">
        <attribute name="label"
                   type="oai_marc:subfieldType"
                   use="required"/>
      </extension>
    </simpleContent>
  </complexType>
</element>
and … generic definition of subfieldType
<simpleType name="subfieldType">
<restriction base="string">
<pattern value="[0-9a-z]"/>
</restriction>
</simpleType>
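The three pattern restrictions (idType, iType, subfieldType) can be tried out with ordinary regular expressions. The sketch below uses Python's re module; the helper name conforms and the test values are made up for illustration. It is an approximation: XML Schema patterns always apply to the whole value, which re.fullmatch imitates, and Python's \s matches a slightly wider set of whitespace than the schema language's.

```python
import re

# Patterns copied from the simpleType definitions.
PATTERNS = {
    "idType": r"[0-9]{1,3}",
    "iType": r"[0-9a-z\s]?",
    "subfieldType": r"[0-9a-z]",
}

def conforms(type_name, value):
    """Return True if the whole value matches the pattern for type_name."""
    return re.fullmatch(PATTERNS[type_name], value) is not None

checks = [
    ("idType", "245"),       # three digits: accepted
    ("idType", "2450"),      # four digits: rejected
    ("iType", "1"),          # one digit: accepted
    ("iType", " "),          # blank indicator: accepted
    ("iType", ""),           # empty: accepted, because of the trailing ?
    ("iType", "A"),          # upper case: rejected
    ("subfieldType", "a"),   # one lower-case letter: accepted
    ("subfieldType", "ab"),  # two characters: rejected
]
for type_name, value in checks:
    print(type_name, repr(value), conforms(type_name, value))
```

Unlike the LC DTD, which enumerates every field and indicator value, these generic types accept any tag or code of the right shape.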
OAI XML to match the OAI Schema
<varfield id="100" i1="1" i2="0">
<subfield label="a">Melville, Herman,
</subfield>
<subfield label="d">1819-1891
</subfield>
</varfield>
<varfield id="245" i1="1" i2="3">
<subfield label="a">Moby Dick;
</subfield>
<subfield label="b">or, The Whale
</subfield>
</varfield>
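Going the other way, a program can read varfield elements like these and print them in the familiar tag, indicators, and subfield layout. The sketch below uses Python's standard xml.etree.ElementTree. The <record> wrapper and the function name to_marc_lines are assumptions added for illustration, and the elements are read without a namespace, as they appear on the slide.

```python
import xml.etree.ElementTree as ET

# The slide's two varfields, wrapped in a hypothetical <record> root so the
# fragment parses as one document.
fragment = """<record>
<varfield id="100" i1="1" i2="0">
<subfield label="a">Melville, Herman,
</subfield>
<subfield label="d">1819-1891
</subfield>
</varfield>
<varfield id="245" i1="1" i2="3">
<subfield label="a">Moby Dick;
</subfield>
<subfield label="b">or, The Whale
</subfield>
</varfield>
</record>"""

def to_marc_lines(xml_text):
    """Render each varfield as 'tag indicators $code value ...'."""
    lines = []
    for field in ET.fromstring(xml_text).findall("varfield"):
        subfields = " ".join(
            # strip() drops the line break that precedes each end tag
            f"${sf.get('label')} {(sf.text or '').strip()}"
            for sf in field.findall("subfield")
        )
        lines.append(
            f"{field.get('id')} {field.get('i1')}{field.get('i2')} {subfields}"
        )
    return lines

marc_lines = to_marc_lines(fragment)
for line in marc_lines:
    print(line)
```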
Brief bibliography: books
Chuck Musciano & Bill Kennedy. HTML & XHTML: the Definitive Guide. O’Reilly, 2000.
Elliotte Rusty Harold & W. Scott Means. XML in a Nutshell. O’Reilly, 2001.
Erik T. Ray. Learning XML. O’Reilly, 2001.
Doug Tidwell. XSLT. O’Reilly, 2001.
Eric A. Meyer. Cascading Style Sheets: the Definitive Guide. O’Reilly, 2000.
Bob DuCharme. XML: the Annotated Specification. Prentice-Hall, 1999.
Brief bibliography: web
www.w3c.org [everything you ever wanted to know]
www.xml.com [O’Reilly site]
www.xml.org
xml.coverpages.org
xml.apache.org
www.sciam.com/2001/0501issue/0501berners-lee.html [“The Semantic Web” -- Scientific American article by Tim Berners-Lee, James Hendler & Ora Lassila]
www.iath.virginia.edu/ead/xml.html [EAD and XML]
The End
Questions?