XMLParsing - People.cs.uchicago.edu

Parsing XML into programming
languages
JAXP, DOM, SAX, JDOM/DOM4J,
Xerces, Xalan, JAXB
Parsing XML
• Goal: read XML files into data structures in
programming languages
• Possible strategies
– Parse by hand with some reusable libraries
– Parse into generic tree structure
– Parse as sequence of events
– Automagically parse to language-specific objects
Parsing by-hand
• Advantages
– Complete control
– Good if simple needs – build off of regex package
• Disadvantages
– Must write the initial code yourself, even if it becomes
generalized
– Pretty tedious and error prone.
– Gets very hard when using schema or DTD to validate
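A minimal sketch of the by-hand approach built on the regex package. The class name ByHandSketch and the <number> element are illustrative assumptions, and the pattern only copes with simple elements that have no attributes or nested tags, which is the limitation noted above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ByHandSketch {
    // Matches <number>...</number> with no attributes and no nested tags.
    private static final Pattern NUMBER =
        Pattern.compile("<number>([^<]*)</number>");

    /** Returns the text of every simple <number> element, in document order. */
    public static List<String> numbers(String xml) {
        List<String> out = new ArrayList<>();
        Matcher m = NUMBER.matcher(xml);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String xml = "<sequence><number>3</number><number>5</number></sequence>";
        System.out.println(numbers(xml));
    }
}
```

Adding a single attribute to <number> is already enough to make this pattern miss the element, which is why the approach is only reasonable for very simple needs.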
Parsing into generic tree structure
• Advantages
– Industry-wide, language neutral standard exists called DOM
(Document Object Model)
– Learning DOM for one language makes it easy to learn for any
other
– As of JAXP 1.2, support for Schema
– Have to write much less code to get XML to something you want
to manipulate in your program
• Disadvantages
– Non-intuitive API, doesn’t take full advantage of Java
– Still quite a bit of work
What is JAXP?
• JAXP: Java API for XML Processing
– In the Java language, the definitions of these standard
APIs (together with the XSLT API) comprise a set of
interfaces known as JAXP
– Java also provides standard implementations together
with a vendor pluggability layer
– Some of these come standard with J2SDK, others are
only available with the Web Services Developer Pack
– We will study these shortly
Another alternative
• JDOM: Native Java published API for
representing XML as tree
• Like DOM but much more Java-specific,
object oriented
• However, not supported by other languages
• Also, no support for schema
• Dom4j another alternative
JAXB
• JAXB: Java Architecture for XML Binding
• Defines an API for automagically representing
XML schema as collections of Java classes.
• Most convenient for application programming
• Will cover next class
DOM
About DOM
• Stands for Document Object Model
• A World Wide Web Consortium (W3C) standard
• Standard constantly adding new features – Level 3
Core just released this month
• We'll cover most of the basics. There's always
more, and it's always changing.
DOM abstraction layer in Java - architecture
Emphasis is on allowing vendors to supply their own DOM
implementation without requiring changes to source code
[Figure: architecture diagram. Surviving labels: "Returns specific parser implementation", "org.w3c.dom.Document"]
Sample Code
DocumentBuilderFactory factory =
    DocumentBuilderFactory.newInstance();
/* set some factory options here */
DocumentBuilder builder =
    factory.newDocumentBuilder();
Document doc = builder.parse(xmlFile);
• The factory instance determines the parser implementation. It can be changed
with a runtime System property. The JDK has a default; Xerces is much better.
• From the factory one obtains an instance of the parser.
• xmlFile can be a java.io.File, an InputStream, etc.
• For reference, the classes used are listed below. Notice that the Document
class comes from the W3C-specified bindings.
javax.xml.parsers.DocumentBuilderFactory
javax.xml.parsers.DocumentBuilder
org.w3c.dom.Document
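The sample code above can be put into a small self-contained program. This is a sketch only: the class name DomParseSketch and the rootName helper are made up for illustration, and the XML is read from a string rather than a file so that the example runs on its own.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class DomParseSketch {
    /** Parses an XML string into a DOM Document and returns the root tag name. */
    public static String rootName(String xml) {
        try {
            // The factory picks the parser implementation (JDK default here).
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            // parse() also accepts a java.io.File or an InputSource.
            Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getTagName();
        } catch (Exception e) {
            // ParserConfigurationException, SAXException, or IOException
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rootName("<sequence><number>3</number></sequence>"));
    }
}
```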
Validation
• Note that by default the parser will not
validate against a schema or DTD
• As of JAXP 1.2, Java provides a default
parser that can handle most schema
features
• See next slide for details on how to set up
Important: Schema validation
First, define the property constants:
String JAXP_SCHEMA_LANGUAGE =
    "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
String W3C_XML_SCHEMA =
    "http://www.w3.org/2001/XMLSchema";
Next, you need to configure DocumentBuilderFactory to generate a
namespace-aware, validating parser that uses XML Schema:
…
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
factory.setValidating(true);
try {
    factory.setAttribute(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
} catch (IllegalArgumentException x) {
    // Happens if the parser does not support JAXP 1.2
    ...
}
Associating document with schema
• An XML file can be associated with a
schema in two ways
1. Directly in the XML file, in the regular way
2. Programmatically from Java
• The latter is done as:
– factory.setAttribute(JAXP_SCHEMA_SOURCE,
      new File(schemaSource));
  where JAXP_SCHEMA_SOURCE is the string
  "http://java.sun.com/xml/jaxp/properties/schemaSource"
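A hedged sketch of the programmatic route, combining the schema-language and schema-source attributes. The class name SchemaValidationSketch, the tiny one-element schema, and the isValid helper are assumptions made for illustration; the schema is supplied as an InputStream instead of a File so that the example is self-contained. Note that validation errors are only reported to the error handler by default, so the sketch installs one that rethrows.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class SchemaValidationSketch {
    static final String JAXP_SCHEMA_LANGUAGE =
        "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
    static final String JAXP_SCHEMA_SOURCE =
        "http://java.sun.com/xml/jaxp/properties/schemaSource";
    static final String W3C_XML_SCHEMA = "http://www.w3.org/2001/XMLSchema";

    // Hypothetical schema: a single <n> element holding an integer.
    static final String SCHEMA =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='n' type='xs:int'/></xs:schema>";

    private static InputStream stream(String s) {
        return new ByteArrayInputStream(s.getBytes(StandardCharsets.UTF_8));
    }

    /** Returns true if the document parses and validates against SCHEMA. */
    public static boolean isValid(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setValidating(true);
            factory.setAttribute(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
            factory.setAttribute(JAXP_SCHEMA_SOURCE, stream(SCHEMA));
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Turn reported validation errors into exceptions.
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) throws SAXException {
                    throw e;
                }
            });
            builder.parse(stream(xml));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed, or not valid against the schema
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<n>5</n>"));
        System.out.println(isValid("<n>abc</n>"));
    }
}
```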
A few notes
• Factory allows ease of switching parser
implementations
– Java provides simple DOM implementation, but
much better to use vendor-supplied when doing
serious work
– Xerces, part of apache project, is installed on
cluster as Eclipse plugin. We’ll use next week.
– Note that some properties are not supported by
all parser implementations.
Document object
• Once a Document object is obtained, rich API to
manipulate.
• First call is usually
Element root = doc.getDocumentElement();
This gets the root element of the Document as an
instance of the Element class
• Note that Element extends Node and has methods
getNodeType(), getNodeName(), getNodeValue(), and
getChildNodes()
Types of Nodes
• Note that there are many types of Nodes (i.e.
subinterfaces of Node):
Attr, CDATASection, Comment, Document, DocumentFragment,
DocumentType, Element, Entity, EntityReference, Notation,
ProcessingInstruction, Text
• Each of these has a special and non-obvious associated type, value, and name.
• Standards are language-neutral and are specified in the chart that follows.
• Important: keep this chart nearby when using DOM
Node                   getNodeName()              getNodeValue()                        getAttributes()  getNodeType()
Attr                   Attr name                  Value of attribute                    null             2
CDATASection           #cdata-section             CDATA content                         null             4
Comment                #comment                   Comment content                       null             8
Document               #document                  null                                  null             9
DocumentFragment       #document-fragment         null                                  null             11
DocumentType           Doc type name              null                                  null             10
Element                Tag name                   null                                  NamedNodeMap     1
Entity                 Entity name                null                                  null             6
EntityReference        Name of entity referenced  null                                  null             5
Notation               Notation name              null                                  null             12
ProcessingInstruction  Target                     Entire content excluding the target   null             7
Text                   #text                      Actual text                           null             3
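To see the chart in action, the following sketch prints the name, type, and value of a few nodes. NodeInfoSketch, describe, and parseRoot are illustrative names and not part of the DOM API; describe plays the role of a printNodeInfo-style helper that returns a string instead of printing.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class NodeInfoSketch {
    /** Formats the three chart columns for one node. */
    public static String describe(Node node) {
        return node.getNodeName()
            + " type=" + node.getNodeType()
            + " value=" + node.getNodeValue();
    }

    /** Parses an XML string and returns its root element. */
    public static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element root = parseRoot("<a id=\"1\">hi<!--c--></a>");
        System.out.println(describe(root));                         // Element row
        System.out.println(describe(root.getAttributes().item(0))); // Attr row
        System.out.println(describe(root.getFirstChild()));         // Text row
        System.out.println(describe(root.getLastChild()));          // Comment row
    }
}
```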
DOM Exercise
Write a function to do a depth-first printout of the node information of a given XML file, called as:
recursePrint(root);
Assume you have access to the following:
– printNodeInfo(Node node): prints the name, type, and value of the input node
– boolean Node.hasChildNodes(): checks whether a node has any children
– NodeList Node.getChildNodes(): gets a list of all child nodes
– int NodeList.getLength(): gets the number of nodes in the list
– Node NodeList.item(int num): selects the num'th child node
public static void recursePrint(Node node) {
}
DOM Exercise Answer
public static void recursePrint(Node node) {
    printNodeInfo(node);
    if (!node.hasChildNodes()) return;
    NodeList nodes = node.getChildNodes();
    for (int i = 0; i < nodes.getLength(); ++i) {
        // Recurse into each child in document order
        recursePrint(nodes.item(i));
    }
}
Transforming XML
The JAXP Transformation Packages
• JAXP Transformation APIs:
– javax.xml.transform
• This package defines the factory class you use to get a Transformer object. You then
configure the transformer with input (Source) and output (Result) objects, and invoke its
transform() method to make the transformation happen. The source and result objects are
created using classes from one of the other three packages.
– javax.xml.transform.dom
• Defines the DOMSource and DOMResult classes that let you use a DOM as an input to or
output from a transformation.
– javax.xml.transform.sax
• Defines the SAXSource and SAXResult classes that let you use a SAX event generator as
input to a transformation, or deliver SAX events as output to a SAX event processor.
– javax.xml.transform.stream
• Defines the StreamSource and StreamResult classes that let you use an I/O stream as an
input to or output from a transformation.
Transformer Architecture
Writing DOM to XML
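import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class WriteDOM {
    public static void main(String[] argv) throws Exception {
        // Parse the input file into a DOM tree
        File f = new File(argv[0]);
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document document = builder.parse(f);
        // A Transformer created without a stylesheet copies source to result
        TransformerFactory tFactory = TransformerFactory.newInstance();
        Transformer transformer = tFactory.newTransformer();
        DOMSource source = new DOMSource(document);
        StreamResult result = new StreamResult(System.out);
        transformer.transform(source, result);
    }
}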
Creating a DOM from scratch
• Sometimes you may want to create a DOM
tree directly in memory. This is done with:
DocumentBuilderFactory factory =
    DocumentBuilderFactory.newInstance();
DocumentBuilder builder =
    factory.newDocumentBuilder();
Document document = builder.newDocument();
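As a sketch of how an empty Document is then populated and written out, the example below builds a small tree in memory and serializes it with an identity Transformer. The class name BuildDomSketch and the sequence/number element names are assumptions chosen for illustration; the XML declaration is omitted to keep the output short.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class BuildDomSketch {
    /** Builds a sequence element with two number children and serializes it. */
    public static String buildAndSerialize() {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document document = builder.newDocument();

            // Nodes are created by the Document, then attached to the tree.
            Element sequence = document.createElement("sequence");
            document.appendChild(sequence);
            for (String value : new String[] {"3", "5"}) {
                Element number = document.createElement("number");
                number.appendChild(document.createTextNode(value));
                sequence.appendChild(number);
            }

            // Identity transform: DOMSource in, character stream out.
            Transformer transformer =
                TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(document), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildAndSerialize());
    }
}
```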
Manipulating Nodes
• Once the root node is obtained, typical tree
methods exist to manipulate other elements:
boolean node.hasChildNodes()
NodeList node.getChildNodes()
Node node.getNextSibling()
Node node.getParentNode()
String node.getNodeValue()
String node.getNodeName()
String node.getTextContent() (DOM Level 3)
void node.setNodeValue(String nodeValue)
Node node.insertBefore(Node newChild, Node refChild)
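A short sketch exercising several of these tree methods together. ManipulateNodesSketch, demo, and childNames are hypothetical names; the tree is built in memory so the example does not depend on an input file.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;

public class ManipulateNodesSketch {
    /** Builds a list element, then inserts an a child before its b child. */
    public static Document demo() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
            Element root = doc.createElement("list");
            doc.appendChild(root);

            Element b = doc.createElement("b");
            root.appendChild(b);
            Element a = doc.createElement("a");
            root.insertBefore(a, b);   // a now precedes b

            Text text = doc.createTextNode("old");
            b.appendChild(text);
            text.setNodeValue("new");  // replaces the text content
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Walks the children with getFirstChild/getNextSibling, joining names. */
    public static String childNames(Node parent) {
        StringBuilder sb = new StringBuilder();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (sb.length() > 0) sb.append(',');
            sb.append(n.getNodeName());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Element root = demo().getDocumentElement();
        System.out.println(childNames(root));
        System.out.println(root.getLastChild().getTextContent());
    }
}
```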
JDOM
JDOM Motivation
(from Elliotte Rusty Harold)
• Unfortunately DOM suffers from a number of design flaws and
limitations that make it less than ideal as a Java API for processing
XML
– DOM had to be backwards compatible with the hackish, poorly thought out,
unplanned object models used in third generation web browsers.
– DOM was designed by a committee trying to reconcile differences between
the object models implemented by Netscape, Microsoft, and other vendors.
They needed a solution that was at least minimally acceptable to everybody,
which resulted in an API that's maximally acceptable to no one.
– DOM is a cross-language API defined in IDL, and thus limited to those
features and classes that are available in essentially all programming
languages, including not fully-object oriented scripting languages like
JavaScript and Visual Basic. It is a lowest common denominator API. It
does not take full advantage of Java, nor does it adhere to Java best
practices, naming conventions, and coding standards.
– DOM must work for both HTML (not just XHTML, but traditional malformed
HTML) and XML.
Some sample JDOM
To create the element <fibonacci/>
In JDOM:
Element element = new Element("fibonacci");
In DOM:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
DOMImplementation impl = builder.getDOMImplementation();
Document doc = impl.createDocument(null, "Fibonacci_Numbers", null);
Element element = doc.createElement("fibonacci");
Adding text and an attribute in JDOM:
Element element = new Element("fibonacci");
element.setText("8");
element.setAttribute("index", "6");
Extremely simple and intuitive!
More JDOM
• To create this element
<sequence>
<number>3</number>
<number>5</number>
</sequence>
Element element = new Element("sequence");
Element firstNumber = new Element("number");
Element secondNumber = new Element("number");
firstNumber.setText("3");
secondNumber.setText("5");
element.addContent(firstNumber);
element.addContent(secondNumber);
Parsing XML file with JDOM
import org.jdom.*;
import org.jdom.input.SAXBuilder;
import java.io.IOException;
import java.util.*;

public class ElementLister {
    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("Usage: java ElementLister URL");
            return;
        }
        SAXBuilder builder = new SAXBuilder();
        try {
            Document doc = builder.build(args[0]);
            Element root = doc.getRootElement();
            listChildren(root, 0);
        }
        // indicates a well-formedness error
        catch (JDOMException e) {
            System.out.println(args[0] + " is not well-formed.");
            System.out.println(e.getMessage());
        }
        catch (IOException e) {
            System.out.println(e);
        }
    }

    // Prints the element name indented by depth, then recurses into children
    public static void listChildren(Element current, int depth) {
        printSpaces(depth);
        System.out.println(current.getName());
        List children = current.getChildren();
        Iterator iterator = children.iterator();
        while (iterator.hasNext()) {
            Element child = (Element) iterator.next();
            listChildren(child, depth+1);
        }
    }

    private static void printSpaces(int n) {
        for (int i = 0; i < n; i++) {
            System.out.print(' ');
        }
    }
}
import org.jdom.*;
import org.jdom.input.SAXBuilder;
import java.io.IOException;
import java.util.*;
public class NodeLister {
public static void main(String[] args) {
if (args.length == 0) {
System.out.println("Usage: java NodeLister URL");
return;
}
SAXBuilder builder = new SAXBuilder();
try {
Document doc = builder
SAX
Simple API for XML
About SAX
• SAX in Java is hosted on SourceForge
• SAX is not a W3C standard
• Originated purely in Java
• Other languages have chosen to implement in their
own ways based on this prototype
SAX vs. …
• Please don't compare unrelated things:
– SAX is an alternative to DOM, but realize that
DOM is often built on top of SAX
– SAX and DOM do not compete with JAXP
– They do both compete with JAXB
implementations
How a SAX parser works
• SAX parser scans an xml stream on the fly and responds to
certain parsing events as it encounters them.
• This is very different than digesting an entire XML
document into memory.
• Much faster, requires less memory.
• However, need to reparse if you need to revisit data.
Obtaining a SAX parser
• Important classes
javax.xml.parsers.SAXParserFactory;
javax.xml.parsers.SAXParser;
javax.xml.parsers.ParserConfigurationException;
//get the parser
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser saxParser = factory.newSAXParser();
//parse the document
saxParser.parse( new File(argv[0]), handler);
DefaultHandler
• Note that an event handler has to be passed to the
SAX parser.
• This must implement the interface
org.xml.sax.ContentHandler;
• Easier to extend the adapter
org.xml.sax.helpers.DefaultHandler
Overriding Handler methods
• Most important methods to override
– void startDocument()
  • Called once when document parsing begins
– void endDocument()
  • Called once when parsing ends
– void startElement(...)
  • Called each time an element begin tag is encountered
– void endElement(...)
  • Called each time an element end tag is encountered
– void characters(...)
  • Called zero or more times between startElement and endElement calls
  to deliver accumulated character data
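The order in which these callbacks fire can be observed with a small logging handler. This is a sketch under stated assumptions: SaxEventsSketch and its events helper are illustrative names, and the document is parsed from a string rather than a file.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxEventsSketch {
    /** Parses the XML and returns the document and element events in order. */
    public static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void endDocument() { log.add("endDocument"); }
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                log.add("start:" + qName);
            }
            @Override
            public void endElement(String uri, String localName, String qName) {
                log.add("end:" + qName);
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(events("<a><b/></a>"));
    }
}
```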
startElement
• public void startElement(
    String namespaceURI, // if namespace assoc
    String sName,        // nonqualified name
    String qName,        // qualified name
    Attributes attrs)    // list of attributes
• Attribute info is obtained by querying the Attributes
object.
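Querying the Attributes object might look like the sketch below, which records every attribute seen during a parse. SaxAttributesSketch, attributesOf, and the "element@attribute" key format are assumptions made purely for illustration.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxAttributesSketch {
    /** Maps "element@attribute" to the attribute value for the whole document. */
    public static Map<String, String> attributesOf(String xml) {
        Map<String, String> found = new LinkedHashMap<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                // Attributes is index-based; it can also be queried by name.
                for (int i = 0; i < attrs.getLength(); i++) {
                    found.put(qName + "@" + attrs.getQName(i), attrs.getValue(i));
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(attributesOf("<fibonacci index=\"6\">8</fibonacci>"));
    }
}
```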
Characters
• public void characters(
    char buf[],  // buffer of chars accumulated
    int offset,  // begin element of chars
    int len)     // number of chars
• Note, characters may be called more than once between
begin tag / end tag
• Also, mixed-content elements require careful handling
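Because characters may be called several times for one element, a common pattern is to append to a buffer and read it at the end tag. The sketch below assumes simple, non-mixed content; SaxTextSketch and textOf are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxTextSketch {
    /** Returns the text of each element with the given name, in order. */
    public static List<String> textOf(String xml, String name) {
        List<String> out = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder buf = new StringBuilder();

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                buf.setLength(0); // new element: discard earlier text
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                buf.append(ch, start, length); // may run more than once
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equals(name)) {
                    out.add(buf.toString());
                }
                buf.setLength(0);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        String xml = "<sequence><number>3</number><number>5</number></sequence>";
        System.out.println(textOf(xml, "number"));
    }
}
```

Resetting the buffer at every start tag is what makes this unsuitable for mixed content, where text before and after a child element belongs to the same parent.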
Entity references
• Recall that entity references are special character
sequences for referring to characters that have
special meaning in XML syntax
– '<' is &lt;
– '>' is &gt;
• In SAX these are automatically converted and
passed to the characters stream. Inside a CDATA
section the text is passed through unchanged
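A small sketch contrasting the two cases; SaxEntitySketch and its text helper are illustrative names only.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxEntitySketch {
    /** Returns all character data delivered to characters(), concatenated. */
    public static String text(String xml) {
        StringBuilder buf = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                buf.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        // Entity reference: arrives as the character it stands for.
        System.out.println(text("<a>1 &lt; 2</a>"));
        // CDATA section: the same five characters arrive literally.
        System.out.println(text("<a><![CDATA[1 &lt; 2]]></a>"));
    }
}
```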
Choosing a Parser
• Choosing your Parser Implementation
– If no other factory class is specified, the default SAXParserFactory
class is used. To use a different manufacturer's parser, you can
change the value of the system property that points to it. You
can do that from the command line, like this:
• java -Djavax.xml.parsers.SAXParserFactory=yourFactoryHere ...
• The factory name you specify must be a fully qualified
class name (all package prefixes included). For more
information, see the documentation in the newInstance()
method of the SAXParserFactory class.
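To check which implementation is actually in effect, one option is to print the system property and the concrete factory class. This is only a sketch; ParserChoiceSketch and factoryClassName are hypothetical names, and the class name printed depends on the JDK and classpath in use.

```java
import javax.xml.parsers.SAXParserFactory;

public class ParserChoiceSketch {
    /** Returns the concrete class that newInstance() selected. */
    public static String factoryClassName() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        // null unless set with -Djavax.xml.parsers.SAXParserFactory=...
        System.out.println(System.getProperty("javax.xml.parsers.SAXParserFactory"));
        System.out.println(factoryClassName());
    }
}
```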
Validating SAX Parsers
String JAXP_SCHEMA_LANGUAGE =
    "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
String W3C_XML_SCHEMA =
    "http://www.w3.org/2001/XMLSchema";
Next, you need to configure SAXParserFactory to generate a
namespace-aware, validating parser that uses XML Schema.
Unlike DocumentBuilderFactory, SAXParserFactory has no setAttribute();
the property is set on the SAXParser itself:
…
SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setNamespaceAware(true);
factory.setValidating(true);
SAXParser saxParser = factory.newSAXParser();
try {
    saxParser.setProperty(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
} catch (SAXNotRecognizedException x) {
    // Happens if the parser does not support JAXP 1.2
    ...
}
Transforming arbitrary data
structures using SAX and
Transformer
Goal
• Now that we know SAX and a little about
Transformations, there are some cool things we
can do.
• One immediate thing is to create xml files from
plain text files using the help of a faux SAX parser
• Turns out to be more robust than doing by hand
Transformers
• Recall that transformers easily let us go between
any source and result by arbitrary wirings of
– StreamSource / StreamResult
– SAXSource / SAXResult
– DOMSource / DOMResult
• We used this to write a DOM tree to an XML file
• Now we will use a SAXSource together with a
StreamResult to convert our text file
Strategy
• We construct our own SAX "parser", i.e. a class that
implements the XMLReader interface
• This class must have a parse method (among
others)
• We use parse to read our input file and fire the
appropriate SAX events, rather than hand-coding
the XML strings ourselves.
Main snippet
public static void main(String[] argv) throws Exception {
    // Create SAX "parser"
    StudentReader parser = new StudentReader();
    // Create transformer
    TransformerFactory tFactory =
        TransformerFactory.newInstance();
    Transformer transformer = tFactory.newTransformer();
    // Use text file as Transformer source
    FileReader fr = new FileReader("students.txt");
    BufferedReader br = new BufferedReader(fr);
    InputSource inputSource = new InputSource(br);
    SAXSource source = new SAXSource(parser, inputSource);
    // Write the result as text to System.out
    StreamResult result = new StreamResult(System.out);
    transformer.transform(source, result);
}
XMLReader implementation
• To have a valid SAXSource we need a class that implements
XMLReader interface
public void parse(InputSource input)
public void setContentHandler(ContentHandler handler)
public ContentHandler getContentHandler()
.
.
.
•Shown are the important methods for a simple app
See Course Examples for details
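The course's own StudentReader is not reproduced in this transcript, so the following is only a guess at its general shape. It assumes a hypothetical input format of one student name per line and made-up element names students/student; features and properties are accepted and ignored.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.DTDHandler;
import org.xml.sax.EntityResolver;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;

public class StudentReader implements XMLReader {
    private ContentHandler contentHandler;
    private ErrorHandler errorHandler;
    private DTDHandler dtdHandler;
    private EntityResolver entityResolver;

    /** Reads one name per line and fires the matching SAX events. */
    @Override
    public void parse(InputSource input) throws IOException, SAXException {
        BufferedReader in = new BufferedReader(input.getCharacterStream());
        AttributesImpl noAttrs = new AttributesImpl();
        contentHandler.startDocument();
        contentHandler.startElement("", "students", "students", noAttrs);
        String line;
        while ((line = in.readLine()) != null) {
            char[] name = line.trim().toCharArray();
            if (name.length == 0) continue; // skip blank lines
            contentHandler.startElement("", "student", "student", noAttrs);
            contentHandler.characters(name, 0, name.length);
            contentHandler.endElement("", "student", "student");
        }
        contentHandler.endElement("", "students", "students");
        contentHandler.endDocument();
    }

    @Override
    public void parse(String systemId) throws SAXException {
        throw new SAXException("This sketch only reads from an InputSource");
    }

    // Features and properties are accepted and ignored in this sketch.
    @Override public boolean getFeature(String name) { return false; }
    @Override public void setFeature(String name, boolean value) { }
    @Override public Object getProperty(String name) { return null; }
    @Override public void setProperty(String name, Object value) { }

    @Override public void setContentHandler(ContentHandler h) { contentHandler = h; }
    @Override public ContentHandler getContentHandler() { return contentHandler; }
    @Override public void setErrorHandler(ErrorHandler h) { errorHandler = h; }
    @Override public ErrorHandler getErrorHandler() { return errorHandler; }
    @Override public void setDTDHandler(DTDHandler h) { dtdHandler = h; }
    @Override public DTDHandler getDTDHandler() { return dtdHandler; }
    @Override public void setEntityResolver(EntityResolver r) { entityResolver = r; }
    @Override public EntityResolver getEntityResolver() { return entityResolver; }

    /** Wires the reader into a SAXSource and serializes the events as XML. */
    public static String toXml(String text) {
        try {
            Transformer transformer =
                TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            SAXSource source = new SAXSource(
                new StudentReader(), new InputSource(new StringReader(text)));
            StringWriter out = new StringWriter();
            transformer.transform(source, new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXml("Ann\nBob\n"));
    }
}
```

Letting the Transformer serialize the events means escaping of characters such as < and & in the names is handled by the serializer rather than by string concatenation.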
End