Transcript DOM_SAX

Cspp51038
Parsing XML into programming
languages
Parsing XML
• Goal: read XML files into data structures in
programming languages
• Possible strategies
– Parse by hand with some reusable libraries
– Parse into generic tree structure
– Parse as sequence of events
– Automagically parse to language-specific objects
Parsing by-hand
• Advantages
– Complete control
– Good if simple needs – build off of regex package
• Disadvantages
– Must write the initial code yourself, even if it becomes
generalized
– Pretty tedious and error prone.
– Gets very hard when using schema or DTD to validate
– No one does this anymore
Parsing into generic tree structure
• Advantages
– Industry-wide, language neutral W3C standard exists called DOM
(Document Object Model)
– Learning DOM for one language makes it easy to learn for any
other
– As of JAXP 1.2, support for Schema
– Have to write much less code to get XML to something you want
to manipulate in your program
• Disadvantages
– Non-intuitive API, doesn’t take full advantage of Java
– Still quite a bit of work
What is JAXP?
• JAXP: Java API for XML Processing
– In the Java language, the definitions of these standard
APIs (together with the XSLT API) comprise a set of
interfaces known as JAXP
– Java also provides standard implementations together
with a vendor pluggability layer
– Some of these come standard with J2SDK, others are
only available with the Java Web Services Developer Pack
– We will study these shortly
Another alternative
• JDOM: Native Java published API for
representing XML as tree
• Like DOM but much more Java-specific,
object oriented
• However, not supported by other languages
• Also, no support for schema
• Dom4j another alternative
JAXB
• JAXB: Java Architecture for XML Binding
• Defines an API for automagically representing
XML schema as collections of Java classes.
• Most convenient for application programming
• Will cover next class
DOM
About DOM
• Stands for Document Object Model
• A World Wide Web Consortium (W3C) standard
• Standard constantly adding new features – Level 3
Core became a W3C Recommendation in April 2004
• We’ll cover most of the basics. There’s always
more, and it’s always changing.
DOM abstraction layer in Java: architecture
• Emphasis is on allowing vendors to supply their own DOM
implementation without requiring change to source code
• The factory returns a specific parser implementation, which
produces an org.w3c.dom.Document
Sample Code
DocumentBuilderFactory factory =
    DocumentBuilderFactory.newInstance();
/* set some factory options here */
DocumentBuilder builder =
    factory.newDocumentBuilder();
Document doc = builder.parse(xmlFile);
• The factory instance determines the parser implementation. It can be
changed with a runtime system property. The JDK has a default; Xerces is
much better.
• From the factory one obtains an instance of the parser
• xmlFile can be a java.io.File, an InputStream, etc.
• For reference, the classes used are:
javax.xml.parsers.DocumentBuilderFactory
javax.xml.parsers.DocumentBuilder
org.w3c.dom.Document
• Notice that the Document class comes from the W3C-specified bindings.
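The factory, builder, document pattern can be tried out without any file on disk. The following is a minimal sketch, not the course's code: the class name DomParseDemo, the helper rootName, and the sample XML string are all made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class DomParseDemo {
    // Parses the given XML text and returns the tag name of its root element.
    static String rootName(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        // parse() also accepts a java.io.File; an InputStream is used here
        // so that the example needs no external file.
        Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return doc.getDocumentElement().getTagName();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(rootName("<students><student>Ann</student></students>"));
    }
}
```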
Validation
• Note that by default the parser will not
validate against a schema or DTD
• As of JAXP 1.2, Java provides a default
parser that can handle most schema
features
• See next slide for details on how to set this up
Important: Schema validation
String JAXP_SCHEMA_LANGUAGE =
"http://java.sun.com/xml/jaxp/properties/schemaLanguage";
String W3C_XML_SCHEMA =
"http://www.w3.org/2001/XMLSchema";
Next, you need to configure DocumentBuilderFactory to generate a
namespace-aware, validating parser that uses XML Schema:
… DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
factory.setValidating(true);
try {
    factory.setAttribute(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
}
catch (IllegalArgumentException x) {
    // Happens if the parser does not support JAXP 1.2 ...
}
Associating document with schema
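• An XML file can be associated with a
schema in two ways
1. Directly in the XML file, using the xsi:schemaLocation or
xsi:noNamespaceSchemaLocation attribute
2. Programmatically from Java
• The latter is done as:
String JAXP_SCHEMA_SOURCE =
    "http://java.sun.com/xml/jaxp/properties/schemaSource";
factory.setAttribute(JAXP_SCHEMA_SOURCE,
    new File(schemaSource));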
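Putting the validation settings and the programmatic schema association together might look like the sketch below. It is an illustration under assumptions: the one-element schema, the class name SchemaValidationDemo, and the validate helper are invented here, and the schema is written to a temporary file only so the example is self-contained. An ErrorHandler is installed because validation problems are reported through it rather than thrown.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

class SchemaValidationDemo {
    static final String JAXP_SCHEMA_LANGUAGE =
            "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
    static final String JAXP_SCHEMA_SOURCE =
            "http://java.sun.com/xml/jaxp/properties/schemaSource";
    static final String W3C_XML_SCHEMA = "http://www.w3.org/2001/XMLSchema";

    // Hypothetical schema: a <student> element whose content must be an integer.
    static final String SCHEMA =
            "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
          + "<xs:element name='student' type='xs:int'/>"
          + "</xs:schema>";

    // Parses xml with schema validation on; returns the reported error messages.
    static List<String> validate(String xml) throws Exception {
        File schemaFile = File.createTempFile("student", ".xsd");
        schemaFile.deleteOnExit();
        Files.write(schemaFile.toPath(), SCHEMA.getBytes(StandardCharsets.UTF_8));

        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setValidating(true);
        // The schema language must be set before the schema source.
        factory.setAttribute(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
        factory.setAttribute(JAXP_SCHEMA_SOURCE, schemaFile);

        final List<String> errors = new ArrayList<>();
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { }
            public void error(SAXParseException e) { errors.add(e.getMessage()); }
            public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
        });
        builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return errors;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(validate("<student>42</student>"));   // valid content
        System.out.println(validate("<student>abc</student>"));  // not an integer
    }
}
```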
A few notes
• Factory allows ease of switching parser
implementations
– Java provides simple DOM implementation, but
much better to use vendor-supplied when doing
serious work
– Xerces, part of the Apache project, is installed on the
cluster as an Eclipse plugin. We’ll use it next week.
– Note that some properties are not supported by
all parser implementations.
Document object
• Once a Document object is obtained, a rich API is available to
manipulate it.
• First call is usually
Element root = doc.getDocumentElement();
This gets the root element of the Document as an
instance of the Element interface
• Note that Element extends Node and has methods
getNodeType(), getNodeName(), getNodeValue(), and
getChildNodes()
Types of Nodes
• Note that there are many types of Nodes (i.e.
subinterfaces of Node):
Attr, CDATASection, Comment, Document, DocumentFragment,
DocumentType, Element, Entity, EntityReference, Notation,
ProcessingInstruction, Text
Each of these has a special and non-obvious associated type, value, and name.
Standards are language-neutral and are specified on chart on following slide
Important: keep this chart nearby when using DOM
Node                  | nodeName()                | nodeValue()                         | Attributes   | nodeType()
Attr                  | Attr name                 | Value of attribute                  | null         | 2
CDATASection          | #cdata-section            | CDATA content                       | null         | 4
Comment               | #comment                  | Comment content                     | null         | 8
Document              | #document                 | null                                | null         | 9
DocumentFragment      | #document-fragment        | null                                | null         | 11
DocumentType          | Doc type name             | null                                | null         | 10
Element               | Tag name                  | null                                | NamedNodeMap | 1
Entity                | Entity name               | null                                | null         | 6
EntityReference       | Name of entity referenced | null                                | null         | 5
Notation              | Notation name             | null                                | null         | 12
ProcessingInstruction | target                    | Entire content excluding the target | null         | 7
Text                  | #text                     | Actual text                         | null         | 3
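A quick way to get a feel for the name, value, and type of each kind of node is to print them for a tiny document. The sketch below is illustrative only: the class NodeInfoDemo, its describe and parse helpers, and the sample XML are assumptions made for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class NodeInfoDemo {
    // Formats one node as "name|value|type".
    static String describe(Node node) {
        return node.getNodeName() + "|" + node.getNodeValue() + "|" + node.getNodeType();
    }

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<a id='7'>hi<!--note--></a>");
        Element root = doc.getDocumentElement();
        System.out.println(describe(doc));                        // Document node
        System.out.println(describe(root));                       // Element node
        System.out.println(describe(root.getAttributeNode("id"))); // Attr node
        System.out.println(describe(root.getFirstChild()));       // Text node
        System.out.println(describe(root.getLastChild()));        // Comment node
    }
}
```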
DOM Exercise
Write a function to do a depth-first printout of the node information of a given XML file, called as:
recursePrint(root);
Assume you have access to the following:
printNodeInfo(Node node): prints the name, type, and value of the input node.
boolean Node.hasChildNodes(): checks whether a node has any children
NodeList Node.getChildNodes(): gets a list of all child nodes
int NodeList.getLength(): gets the number of nodes in the list
Node NodeList.item(int num): selects the num’th child node
public static void recursePrint(Node node){
}
DOM Exercise Answer
public static void recursePrint(Node node){
    printNodeInfo(node);
    if (!node.hasChildNodes()) return;
    NodeList nodes = node.getChildNodes();
    for (int i = 0; i < nodes.getLength(); ++i){
        // recurse into each child in document order
        recursePrint(nodes.item(i));
    }
}
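To check the visiting order of such a depth-first walk, the same recursion can collect node names into a list instead of printing them. This is a sketch under assumptions: the class RecursePrintDemo and the helpers recurseCollect and visitOrder are hypothetical names, not part of the exercise.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class RecursePrintDemo {
    // Depth-first walk: record this node, then recurse into each child in order.
    static void recurseCollect(Node node, List<String> out) {
        out.add(node.getNodeName());
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            recurseCollect(children.item(i), out);
        }
    }

    // Parses xml and returns the node names in the order they are visited,
    // starting from the root element.
    static List<String> visitOrder(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        List<String> names = new ArrayList<>();
        recurseCollect(doc.getDocumentElement(), names);
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(visitOrder("<a><b><c/></b><d/></a>"));
    }
}
```

The nested element c is visited before its parent's sibling d, which is what distinguishes depth-first from breadth-first order.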
Transforming XML
The JAXP Transformation Packages
• JAXP Transformation APIs:
– javax.xml.transform
• This package defines the factory class you use to get a Transformer object. You then
configure the transformer with input (Source) and output (Result) objects, and invoke its
transform() method to make the transformation happen. The source and result objects are
created using classes from one of the other three packages.
– javax.xml.transform.dom
• Defines the DOMSource and DOMResult classes that let you use a DOM as an input to or
output from a transformation.
– javax.xml.transform.sax
• Defines the SAXSource and SAXResult classes that let you use a SAX event generator as
input to a transformation, or deliver SAX events as output to a SAX event processor.
– javax.xml.transform.stream
• Defines the StreamSource and StreamResult classes that let you use an I/O stream as an
input to or output from a transformation.
Transformer Architecture
Writing DOM to XML
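import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class WriteDOM{
    public static void main(String[] argv) throws Exception{
        // parse the input file into a DOM tree
        File f = new File(argv[0]);
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document document = builder.parse(f);
        // an identity transformer copies the DOM source to the stream result
        TransformerFactory tFactory =
            TransformerFactory.newInstance();
        Transformer transformer = tFactory.newTransformer();
        DOMSource source = new DOMSource(document);
        StreamResult result = new StreamResult(System.out);
        transformer.transform(source, result);
    }
}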
Creating a DOM from scratch
• Sometimes you may want to create a DOM
tree directly in memory. This is done with:
DocumentBuilderFactory factory
    = DocumentBuilderFactory.newInstance();
DocumentBuilder builder
    = factory.newDocumentBuilder();
Document document = builder.newDocument();
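An empty Document is only a starting point; nodes are then created through the document and attached to it. The sketch below builds a small tree in memory and serializes it with an identity Transformer. The class name BuildDomDemo, the helper buildSequence, and the element names are assumptions chosen for illustration.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class BuildDomDemo {
    // Builds <sequence> with two <number> children and returns it as XML text.
    static String buildSequence() throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document document = builder.newDocument();

        Element sequence = document.createElement("sequence");
        document.appendChild(sequence); // becomes the root element
        for (String value : new String[] {"3", "5"}) {
            Element number = document.createElement("number");
            number.appendChild(document.createTextNode(value));
            sequence.appendChild(number);
        }

        // Identity transformation: DOM tree in, character stream out.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(document), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(buildSequence());
    }
}
```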
Manipulating Nodes
• Once the root node is obtained, typical tree
methods exist to manipulate other elements:
boolean node.hasChildNodes()
NodeList node.getChildNodes()
Node node.getNextSibling()
Node node.getParentNode()
String node.getNodeValue()
String node.getNodeName()
String node.getTextContent()
void node.setNodeValue(String nodeValue)
Node node.insertBefore(Node newChild, Node refChild)
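A few of these tree methods in action, as a small sketch: the class NodeManipulationDemo, the helper childSummary, and the element names are hypothetical and exist only for this example.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;

class NodeManipulationDemo {
    // Builds a two-child tree using insertBefore and setNodeValue, then walks
    // the children with getNextSibling and returns "name=text" pairs.
    static String childSummary() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element root = doc.createElement("list");
        doc.appendChild(root);

        Element second = doc.createElement("second");
        root.appendChild(second);
        Element first = doc.createElement("first");
        root.insertBefore(first, second); // first now precedes second

        Text text = doc.createTextNode("old");
        first.appendChild(text);
        text.setNodeValue("new");         // replaces the text content

        StringBuilder summary = new StringBuilder();
        for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
            summary.append(n.getNodeName()).append('=')
                   .append(n.getTextContent()).append(' ');
        }
        return summary.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(childSummary());
    }
}
```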
JDOM
JDOM Motivation
(from Elliotte Rusty Harold)
• Unfortunately DOM suffers from a number of design flaws and
limitations that make it less than ideal as a Java API for processing
XML
– DOM had to be backwards compatible with the hackish, poorly thought out,
unplanned object models used in third generation web browsers.
– DOM was designed by a committee trying to reconcile differences between
the object models implemented by Netscape, Microsoft, and other vendors.
They needed a solution that was at least minimally acceptable to
everybody, which resulted in an API that’s maximally acceptable to no one.
– DOM is a cross-language API defined in IDL, and thus limited to those
features and classes that are available in essentially all programming
languages, including not fully-object oriented scripting languages like
JavaScript and Visual Basic. It is a lowest common denominator API. It
does not take full advantage of Java, nor does it adhere to Java best
practices, naming conventions, and coding standards.
– DOM must work for both HTML (not just XHTML, but traditional malformed
HTML) and XML.
Some sample JDOM
To create the element <fibonacci/>
In JDOM:
Element element = new Element("fibonacci");
In DOM:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
DOMImplementation impl = builder.getDOMImplementation();
Document doc = impl.createDocument(null, "Fibonacci_Numbers", null);
Element element = doc.createElement("fibonacci");
To also give the element text and an attribute, in JDOM:
Element element = new Element("fibonacci");
element.setText("8");
element.setAttribute("index", "6");
Extremely simple and intuitive!
More JDOM
• To create this element
<sequence>
<number>3</number>
<number>5</number>
</sequence>
Element element = new Element("sequence");
Element firstNumber = new Element("number");
Element secondNumber = new Element("number");
firstNumber.setText("3");
secondNumber.setText("5");
element.addContent(firstNumber);
element.addContent(secondNumber);
Parsing XML file with JDOM
import org.jdom.*;
import org.jdom.input.SAXBuilder;
import java.io.IOException;
import java.util.*;

public class ElementLister {
    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("Usage: java ElementLister URL");
            return;
        }
        SAXBuilder builder = new SAXBuilder();
        try {
            Document doc = builder.build(args[0]);
            Element root = doc.getRootElement();
            listChildren(root, 0);
        }
        // indicates a well-formedness error
        catch (JDOMException e) {
            System.out.println(args[0] + " is not well-formed.");
            System.out.println(e.getMessage());
        }
        catch (IOException e) {
            System.out.println(e);
        }
    }

    public static void listChildren(Element current, int depth) {
        printSpaces(depth);
        System.out.println(current.getName());
        List children = current.getChildren();
        Iterator iterator = children.iterator();
        while (iterator.hasNext()) {
            Element child = (Element) iterator.next();
            listChildren(child, depth+1);
        }
    }

    private static void printSpaces(int n) {
        for (int i = 0; i < n; i++) {
            System.out.print(' ');
        }
    }
}
SAX
Simple API for XML
About SAX
• SAX in Java is hosted on SourceForge
• SAX is not a W3C standard
• Originated purely in Java
• Other languages have chosen to implement in their
own ways based on this prototype
SAX vs. …
• Please don’t compare unrelated things:
– SAX is an alternative to DOM, but realize that
DOM is often built on top of SAX
– SAX and DOM do not compete with JAXP
– They do both compete with JAXB
implementations
How a SAX parser works
• A SAX parser scans an XML stream on the fly and responds to
certain parsing events as it encounters them.
• This is very different from digesting an entire XML
document into memory.
• Much faster, requires less memory.
• However, need to reparse if you need to revisit data.
Obtaining a SAX parser
• Important classes
javax.xml.parsers.SAXParserFactory;
javax.xml.parsers.SAXParser;
javax.xml.parsers.ParserConfigurationException;
//get the parser
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser saxParser = factory.newSAXParser();
//parse the document
saxParser.parse( new File(argv[0]), handler);
DefaultHandler
• Note that an event handler has to be passed to the
SAX parser.
• This must implement the interface
org.xml.sax.ContentHandler;
• Easier to extend the adapter
org.xml.sax.helpers.DefaultHandler
Overriding Handler methods
• Most important methods to override
– void startDocument()
• Called once when document parsing begins
– void endDocument()
• Called once when parsing ends
– void startElement(...)
• Called each time an element begin tag is encountered
– void endElement(...)
• Called each time an element end tag is encountered
– void characters(...)
• Called one or more times between startElement and endElement calls
to deliver accumulated character data
startElement
• public void startElement(
    String namespaceURI, // namespace URI, if one is associated
    String sName,        // nonqualified (local) name
    String qName,        // qualified name
    Attributes attrs)    // list of attributes
• Attribute info is obtained by querying the Attributes
object.
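The pieces so far (factory, parser, a DefaultHandler subclass, and startElement with its Attributes argument) fit together roughly as in the sketch below. The class SaxStartElementDemo, the helper elementSummary, and the sample XML are assumptions for illustration. Since the factory is not namespace-aware by default, the sketch relies on qName rather than the local name.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class SaxStartElementDemo {
    // Returns one "qName:attributeCount" entry per start tag, in document order.
    static List<String> elementSummary(String xml) throws Exception {
        final List<String> seen = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String namespaceURI, String sName,
                                     String qName, Attributes attrs) {
                seen.add(qName + ":" + attrs.getLength());
            }
        };
        // get the parser
        SAXParserFactory factory = SAXParserFactory.newInstance();
        SAXParser saxParser = factory.newSAXParser();
        // parse the document; events are delivered to the handler as they occur
        saxParser.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return seen;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(elementSummary("<a x='1' y='2'><b/><c z='3'/></a>"));
    }
}
```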
Characters
• public void characters(
    char buf[],  // buffer of chars accumulated
    int offset,  // index of the first char to read in the buffer
    int len)     // number of chars
• Note, characters may be called more than once between
begin tag / end tag
• Also, mixed-content elements require careful handling
Entity references
• Recall that entity references are special character
sequences for referring to characters that have
special meaning in XML syntax
– ‘<’ is &lt;
– ‘>’ is &gt;
• In SAX these are automatically converted and
passed to the characters stream unless they are part
of a CDATA section
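Because characters may be called several times for one run of text, a common approach is to append every chunk to a buffer and read it only once the text is complete. The sketch below does this and also shows entity references arriving already converted. The class SaxTextDemo, the helper allText, and the sample input are assumptions for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

class SaxTextDemo {
    // Concatenates all character data reported while parsing xml.
    static String allText(String xml) throws Exception {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] buf, int offset, int len) {
                // Only the slice [offset, offset + len) belongs to this event.
                text.append(buf, offset, len);
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(allText("<msg>1 &lt; 2 &amp; ok</msg>"));
    }
}
```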
Choosing a Parser
• Choosing your Parser Implementation
– If no other factory class is specified, the default SAXParserFactory
class is used. To use a different manufacturer's parser, you can
change the value of the system property that points to it. You
can do that from the command line, like this:
• java -Djavax.xml.parsers.SAXParserFactory=yourFactoryHere ...
• The factory name you specify must be a fully qualified
class name (all package prefixes included). For more
information, see the documentation in the newInstance()
method of the SAXParserFactory class.
Validating SAX Parsers
String JAXP_SCHEMA_LANGUAGE =
"http://java.sun.com/xml/jaxp/properties/schemaLanguage";
String W3C_XML_SCHEMA =
"http://www.w3.org/2001/XMLSchema";
Next, you need to configure SAXParserFactory to generate a
namespace-aware, validating parser that uses XML Schema:
… SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setNamespaceAware(true);
factory.setValidating(true);
SAXParser saxParser = factory.newSAXParser();
try {
    // for SAX the property is set on the parser, not on the factory
    saxParser.setProperty(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
}
catch (SAXNotRecognizedException x) {
    // Happens if the parser does not support JAXP 1.2 ...
}
Transforming arbitrary data
structures using SAX and
Transformer
Goal
• Now that we know SAX and a little about
Transformations, there are some cool things we
can do.
• One immediate thing is to create xml files from
plain text files using the help of a faux SAX parser
• Turns out to be more robust than doing by hand
Transformers
• Recall that transformers easily let us go between
any source and result by arbitrary wirings of
– StreamSource / StreamResult
– SAXSource / SAXResult
– DOMSource / DOMResult
• We used this to write a DOM tree to an XML file
• Now we will use a SAXSource together with a
StreamResult to convert our text file
Strategy
• We construct our own SAX “parser”, i.e. a class that
implements the XMLReader interface
• This class must have a parse method (among
others)
• We use parse to read our input file and fire the
appropriate SAX events, rather than handcoding
the Strings ourselves.
Main snippet
public static void main(String[] argv) throws Exception {
    // Create SAX "parser"
    StudentReader parser = new StudentReader();
    // Create transformer
    TransformerFactory tFactory =
        TransformerFactory.newInstance();
    Transformer transformer = tFactory.newTransformer();
    // Use text file as Transformer source
    FileReader fr = new FileReader("students.txt");
    BufferedReader br = new BufferedReader(fr);
    InputSource inputSource = new InputSource(br);
    SAXSource source = new SAXSource(parser, inputSource);
    // Write the resulting XML text to standard output
    StreamResult result = new StreamResult(System.out);
    transformer.transform(source, result);
}
XMLReader implementation
• To have a valid SAXSource we need a class that implements
XMLReader interface
public void parse(InputSource input)
public void setContentHandler(ContentHandler handler)
public ContentHandler getContentHandler()
.
.
.
• Shown are the important methods for a simple app
• See Course Examples for details
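A complete, if simplified, reader along these lines might look like the sketch below. It is not the course's StudentReader: the class name StudentReaderSketch, the one-name-per-line input format, the element names, and the convert helper are all assumptions. The feature, property, and resolver methods are left as no-ops because a transformer may call them during setup.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.DTDHandler;
import org.xml.sax.EntityResolver;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;

class StudentReaderSketch implements XMLReader {
    private ContentHandler handler;

    // Reads one student name per line and fires the matching SAX events.
    public void parse(InputSource input) throws IOException, SAXException {
        AttributesImpl noAttrs = new AttributesImpl();
        BufferedReader in = new BufferedReader(input.getCharacterStream());
        handler.startDocument();
        handler.startElement("", "students", "students", noAttrs);
        String line;
        while ((line = in.readLine()) != null) {
            if (line.trim().isEmpty()) continue; // skip blank lines
            handler.startElement("", "student", "student", noAttrs);
            handler.characters(line.toCharArray(), 0, line.length());
            handler.endElement("", "student", "student");
        }
        handler.endElement("", "students", "students");
        handler.endDocument();
    }

    public void parse(String systemId) throws SAXException {
        throw new SAXException("this sketch only reads character streams");
    }

    public void setContentHandler(ContentHandler h) { handler = h; }
    public ContentHandler getContentHandler() { return handler; }

    // The remaining XMLReader methods are not needed here: accept and ignore.
    public boolean getFeature(String name) { return false; }
    public void setFeature(String name, boolean value) { }
    public Object getProperty(String name) { return null; }
    public void setProperty(String name, Object value) { }
    public void setEntityResolver(EntityResolver r) { }
    public EntityResolver getEntityResolver() { return null; }
    public void setDTDHandler(DTDHandler h) { }
    public DTDHandler getDTDHandler() { return null; }
    public void setErrorHandler(ErrorHandler h) { }
    public ErrorHandler getErrorHandler() { return null; }

    // Wires the reader into a SAXSource and lets the transformer write XML text.
    static String convert(String text) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        SAXSource source = new SAXSource(new StudentReaderSketch(),
                new InputSource(new StringReader(text)));
        StringWriter out = new StringWriter();
        transformer.transform(source, new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(convert("Alice\nBob\n"));
    }
}
```

Letting the transformer's serializer produce the markup means that escaping of characters such as < and & in the input is handled for you, which is one reason this tends to be more robust than concatenating strings by hand.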
JAXB
Java Architecture for XML Binding
What is JAXB?
• JAXB defines the behavior of a standard set of tools and
interfaces that automatically generate Java class files from
XML schema
• JAXB is a framework or architecture, not an
implementation.
• Sun provides a reference implementation of JAXB with the
Java Web Services Developer Pack, available as a separate
download
http://java.sun.com/webservices/downloads/webservicespack.html
JAXB vs. DOM and SAX
• JAXB is a higher level construct than DOM or SAX
– DOM represents XML documents as generic trees
– SAX represents XML documents as generic event streams
– JAXB represents XML documents as Java classes with properties
that are specific to the particular XML document
• E.g. book.xml becomes Book.java with getTitle, setTitle, etc.
• JAXB thus requires almost no knowledge of XML to be
able to programmatically process XML documents!
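To make the contrast concrete, the sketch below writes by hand what a binding would give you for free: a Book class with getTitle and setTitle, filled from a book document through DOM. This is not JAXB-generated code and uses no JAXB API; the fromXml helper and the sample document layout are assumptions for illustration. With JAXB, the class and the copying logic would instead be derived from the schema.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class Book {
    private String title;

    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }

    // Hand-rolled "unmarshal": copies the first <title> element into a Book.
    static Book fromXml(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Book book = new Book();
        book.setTitle(doc.getElementsByTagName("title").item(0).getTextContent());
        return book;
    }

    public static void main(String[] args) throws Exception {
        Book book = fromXml("<book><title>XML Basics</title></book>");
        // Application code works with plain Java properties, not tree nodes.
        System.out.println(book.getTitle());
    }
}
```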
High-level comparison
• Before diving into details of JAXB, it’s good to
see a bird’s-eye-view of the difference between
JAXB and SAX and/or DOM-like parsers
• Study the books/ examples under the
examples/jaxb directory on the course website
JAXB steps
• We start by assuming that you have a
valid installation of Java Web Services
Developer Pack version 3. We cover
these installation details later
• Using JAXB then requires several
steps:
1. Run the binding compiler on the
schema file to automagically produce
the appropriate Java class files
2. Compile the Java class files (ant tool
helps here)
3. Study the autogenerated API to learn
what Java types have been created
4. Create a program that unmarshals an
XML document into these elementary
data structures
Running binding compiler
• <install_dir>/jaxb/bin/xjc.sh -p test.jaxb books.xsd -d work
– xjc.sh : executes binding compiler
– -p test.jaxb : place resulting class files in package test.jaxb
– books.xsd : run compiler on schema books.xsd
– -d work : place resulting files in directory called work/
• Note that this creates a huge number of files that together represent the
content of the books.xsd schema as a set of Java classes
• It is not necessary to know all of these classes. We’ll study them only
at a high level so we can understand how to use them
Example: students.xsd
Generated interfaces
• xjc.sh -p test.lottery students.xsd
• This generates the following interfaces
– test/lottery/ObjectFactory.java
• Contains methods for generating instances of the interfaces
– test/lottery/Students.java
• Represents the root node <students>
– test/lottery/StudentsType.java
• Represents the unnamed type of each student object
Generated implementations
• Each interface is implemented in the impl
directory
– test/lottery/impl/StudentsImpl.java
• Vendor-specific implementation of the Students interface
– test/lottery/impl/StudentsTypeImpl.java
• Vendor-specific implementation of the StudentsType Interface
Compilation
• Next, the generated classes must be compiled:
– javac students/*.java students/impl/*.java
• CLASSPATH requires many jar files:
– jaxb/lib/*.jar
– jwsdp-shared/lib/*.jar
– jaxp/lib/**/*.jar
• Note: an ant buildfile (like a java makefile) makes
this much easier. More on this later
Generated docs
• Java API docs for these classes are
generated in
– students/docs/api/*.html
• After bindings are generated, one usually
works directly through these API docs to
learn how to access/manipulate the XML
data.
Sample Programs
• Easiest way to learn is to cover certain generic sample
cases. These are all on the course website under
ace104/lesson6/examples
• Summary of examples:
– student/
• Use JAXB to read an xml document composed of a single student
complex type
– student/
• Same, but for an xml document composed of a sequence of such
student types of indefinite length
– purchaseOrder/
• Another read example, but for a more complex schema
Sample programs, cont
• Course examples, cont
– create-marshal
• Purchase-order example modified to create in memory and
write to XML
– modify-marshal
• Purchase-order example modified to read XML, change it and
write back to XML
• Study these examples!
Some additional JAXB details
Binding Data Types
• Default java datatype bindings can be found at:
http://java.sun.com/webservices/docs/1.3/tutorial/doc/JAXBWorks5.html
• These defaults can be changed if required for an
application
• Also, name bindings are fairly standard conversions of XML names into
identifiers acceptable in the Java programming language
• See other binding rules on subsequent pages
Default binding rules summary
• The JAXB binding model follows the default binding rules summarized below:
• Bind the following to Java package:
– XML Namespace URI
• Bind the following XML Schema components to Java content interface:
– Named complex type
– Anonymous inlined type definition of an element declaration
• Bind to typesafe enum class:
– A named simple type definition with a basetype that derives from "xsd:NCName" and has enumeration facets.
• Bind the following XML Schema components to a Java Element interface:
– A global element declaration to a Element interface.
– Local element declaration that can be inserted into a general content list.
• Bind to Java property:
– Attribute use
– Particle with a term that is an element reference or local element declaration.
• Bind model group with a repeating occurrence and complex type definitions with mixed {content type} to:
– A general content property; a List content-property that holds Java instances representing element information items and character
data items.
End