2.4 Programming Interfaces for Document Structure

In the previous sections, we showed how to read and parse an XML document. Next, we explain how to process an XML document by accessing its internal structure through APIs. The XML 1.0 Recommendation defines the precise behavior of an XML processor when reading and parsing a document, but it says nothing about which API to use. In this section, we discuss two widely used APIs.

The Document Object Model (DOM), a tree structure-based API by W3C. The specification consists of Level 1 (Recommendation in October 1998), Level 2 (Recommendation in November 2000), and Level 3 (currently a Working Draft) documents. Xerces 1.4.3 supports most of DOM Level 2.

The Simple API for XML (SAX), an event-driven API developed by David Megginson and a number of people on the xml-dev mailing list. Although not sanctioned by any standards body, SAX is supported by most of the available XML processors. Xerces 1.4.3 supports both SAX (version 1.0) and SAX2; SAX2 adds support for namespaces. In this book, the word "SAX" refers to both SAX (version 1.0) and SAX2.

Figure 2.2 depicts the difference between the DOM and SAX APIs. When an application uses a DOM-based parser, the parser parses the XML document and passes a Document instance to the application. The application must wait until the parser has processed the whole XML document. When an application uses a SAX-based parser, the parser starts parsing the XML document and passes an event stream to the application in the course of parsing. The next sections discuss in detail the pros and cons of using these APIs.

Figure 2.2. DOM versus SAX
NOTE In SAX2, some interfaces have been changed and renamed to support namespaces. Xerces supports both the SAX and SAX2 APIs, but the old SAX interfaces are now deprecated.

2.4.1 DOM: Tree-Based API

The term "document object model" has been used to refer to a model that defines the structure of an HTML document, thereby allowing scripting languages, such as JavaScript, to access the elements of the structure. You might have written JavaScript programs that manipulate the value of an input field in a form element in an HTML document. For example, document.forms[0].username.value refers to the value of the input field with the name username in the first form element in an HTML document. This expression is used to access the HTML DOM on HTML browsers like Microsoft Internet Explorer (IE) and Netscape Navigator. However, current HTML object models and the APIs to access them are browser-dependent (though the problem is being resolved). Thus you generally have to prepare different pages suited to each type of browser that might execute your scripts. One goal of the DOM specification is to define a common, interoperable document object model for HTML as well as XML.

The first edition of this book is based on the DOM Level 1 Recommendation. The DOM Level 2 Recommendation was published on November 13, 2000. Handling of namespaces, events, traversal, range, and views was introduced in DOM Level 2. Standardization of DOM Level 3 is in progress. It will support load and save functions and other new functions. The details of using the DOM API are discussed in Chapter 4.

In DOM, an XML document is represented as a tree whose nodes are elements, text, and so on. An XML processor generates the tree and hands it to an application. A DOM-based XML processor (for example, DOMParser or DocumentBuilder) creates the entire structure of an XML document in memory (though Xerces defers the creation of DOM nodes until they are accessed).

XML is a language for describing tree-structured data.
In XML, an element is represented by a start tag and a matching end tag (or an empty-element tag). An element may contain one or more other elements between its start and end tags. Thus an entire document is represented as a nested tree. For example, our previous department example, department.xml, can be represented in a tree structure, as shown in Figure 2.3.

Figure 2.3. Tree expression for department.xml
Each pair of start and end tags corresponds to an Element node, represented by the boxes in the figure, such as department and employee. Each chunk of text surrounded by two tags corresponds to a Text node, represented by the strings in the figure. These nodes are defined as objects in DOM, and the DOM specification defines a platform- and language-neutral interface for application programs in terms of a standard set of objects. To help interoperability, it defines APIs, called language bindings, for Java, ECMAScript (JavaScript), and the Interface Definition Language (IDL) from the Object Management Group (OMG).

From an object-oriented programming viewpoint, the DOM API is a set of interfaces that should be implemented by a particular DOM implementation. Table 2.1 shows the interfaces (and some classes) that define the DOM (Core) Level 1 specification. Figure 2.4 shows the class/interface hierarchy of the interfaces and classes. Note that Node is the primary data structure from which a tree is built. DOM tree constituents, such as Element, Text, and Attr, are all defined as interfaces derived from the Node interface.

Figure 2.4. Class/interface hierarchy
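
To make the tree model concrete, the following is a minimal sketch (not one of the book's listings) that parses a small document through JAXP and walks the resulting tree using only the Node interface. The class name DomTreeWalk and the inline SAMPLE string are assumptions made for illustration; the sample is only loosely modeled on department.xml and contains no whitespace between tags, so every Text node carries real content.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class DomTreeWalk {
    // Hypothetical sample data, loosely modeled on department.xml
    static final String SAMPLE =
        "<department><employee id=\"J.D\"><name>John Doe</name>"
        + "<email>John.Doe@foo.com</email></employee></department>";

    // Builds the whole DOM tree in memory from an XML string
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse sample", e);
        }
    }

    // Recursively counts the nodes of the given type (for example Node.ELEMENT_NODE)
    static int count(Node node, short type) {
        int total = node.getNodeType() == type ? 1 : 0;
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            total += count(c, type);
        }
        return total;
    }

    // Prints Element and Text nodes, indented by their depth in the tree
    static void print(Node node, String indent) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            System.out.println(indent + "Element: " + node.getNodeName());
        } else if (node.getNodeType() == Node.TEXT_NODE) {
            System.out.println(indent + "Text: " + node.getNodeValue());
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            print(c, indent + "  ");
        }
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        print(doc.getDocumentElement(), "");
        System.out.println("elements=" + count(doc, Node.ELEMENT_NODE));
    }
}
```

Because the tree stays in memory, the same Document can be traversed as many times as needed; here it is walked once for printing and once for counting. Attributes such as id are not children of their element, so the walk over child nodes does not visit them.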
2.4.2 SAX: Event-Driven API

In the previous section, we showed a method to access the structure and content of an XML document represented as a tree with the DOM API. An alternative to the tree-based API is an event-driven API. An application that wants information about document structure, such as element and attribute names and values in the document, can register handlers, a kind of callback function, with an XML processor. The processor notifies the handlers of events such as the start of a tag, an attribute, and the existence of data characters. Unlike the DOM API, which lets an application traverse the structure of the document multiple times, parsing with the SAX API is a single pass, and the events are reported to the application as parsing proceeds.

SAX is designed as a lightweight API that does not generate a tree structure of an input document. Applications must register event handlers with a parser instance that implements the org.xml.sax.XMLReader interface. SAX has several event handler interfaces, including ContentHandler, DTDHandler, and ErrorHandler. It also provides the default implementation class org.xml.sax.helpers.DefaultHandler, which implements all the interface methods as methods that do nothing. If you extend DefaultHandler as your event handler, you need to implement only the methods necessary for your application.

ContentHandler is the most often used interface because its methods are called whenever an element, with its attributes, is found. You can see how it is used in Listing 2.15, a simple program that reads an XML document and reports events according to the document structure.

Listing 2.15 Tracing SAX events, chap02/TraceEventsNS.java
package chap02;
/**
* TraceEventsNS.java
*/
import java.io.IOException;
import org.xml.sax.helpers.XMLReaderFactory;
import org.xml.sax.XMLReader;
import org.xml.sax.SAXException;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;
[12] public class TraceEventsNS extends DefaultHandler {
[13] public TraceEventsNS() {
[14] }
static public void main(String[] argv) {
if (argv.length != 1) {
System.err.println(
"Usage: chap02.TraceEventsNS <filename>");
System.exit(1);
}
try {
[23] // Creates SAX Parser object. When SAX2.0 is used,
[24] // XMLReader class is used. Implementation class is
[25] // specified by using
[26] // System property org.xml.sax.driver
[27] XMLReader parser =
[28] XMLReaderFactory.createXMLReader();
[29] // Tells the parser to be aware of namespace
[30] parser.setFeature(
[31] "http://xml.org/sax/features/namespaces", true);
[32] // Creates document handler and registers the handler
[33] TraceEventsNS handler = new TraceEventsNS();
[34] parser.setContentHandler(handler);
[35] parser.setDTDHandler(handler);
[36] parser.setErrorHandler(handler);
// Parses input document
parser.parse(argv[0]);
} catch (SAXException se) {
se.printStackTrace();
} catch (IOException ioe) {
ioe.printStackTrace();
}
}
public void startDocument() throws SAXException {
System.out.println("startDocument is called:");
}
public void endDocument() throws SAXException {
System.out.println("endDocument is called:");
}
public void startElement(String uri, String localpart,
String name, Attributes amap) {
System.out.println("startElement is called: localpart="
+ localpart
+ ", namespace URI="+uri);
for (int i = 0; i < amap.getLength(); i++) {
String attname = amap.getQName(i);
String type = amap.getType(i);
String value = amap.getValue(i);
System.out.println(" attribute name="+attname+" type="
+type+" value="+value);
}
}
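// Note: this one-argument method does not override
// ContentHandler.endElement(String uri, String localName, String qName),
// so the parser never calls it. Declare the three-argument form
// to receive end-of-element events.
public void endElement(String name) throws SAXException {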
System.out.println("endElement is called: " + name);
}
public void characters(char[] ch, int start, int length)
throws SAXException {
System.out.println("characters is called: " +
new String(ch, start, length));
}
}
[12] public class TraceEventsNS extends DefaultHandler {...
}
In line 12, the TraceEventsNS class extends the DefaultHandler class, which is a helper class used to catch all the events provided by the SAX API.
[27] XMLReader parser = XMLReaderFactory.createXMLReader();
org.xml.sax.XMLReader is an interface that represents a SAX-based parser, and the createXMLReader() method returns an instance of it. The object is associated with an implementation class of an XML processor (SAX parser). One way to choose that class is to set the org.xml.sax.driver system property on the command line:
java -Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser ...
Using this property keeps your program independent of specific parser implementations.
[30] parser.setFeature("http://xml.org/sax/features/namespaces", true);
If you want the parser to be aware of namespaces, you can set a feature of the parser (see line 30). The feature http://xml.org/sax/features/namespaces is the same as for the DOM parser (see Section 2.3.1).
[33] TraceEventsNS handler = new TraceEventsNS();
[34] parser.setContentHandler(handler);
[35] parser.setDTDHandler(handler);
[36] parser.setErrorHandler(handler);
In lines 33-36, an instance of TraceEventsNS, which extends DefaultHandler, is created and registered with the parser instance as its event handler. Although all three handlers are set, only methods of ContentHandler are actually implemented in TraceEventsNS. Because the DefaultHandler class provides "empty" method implementations for the handler interfaces, you do not have to write the other methods.
parser.parse(argv[0]);
The parse() method starts parsing. The SAX parser reads an input XML document and sends events to the instance that implements the handler. Table 2.2 shows the methods defined in the ContentHandler interface.
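
As a complement to the walkthrough above, here is a minimal sketch of a handler that records start, end, and text events in a list. It is not one of the book's listings: the class name SaxEventLog and its trace() helper are hypothetical, and the XMLReader is obtained through JAXP (SAXParserFactory) rather than XMLReaderFactory so that the example runs without setting the org.xml.sax.driver property. The @Override annotations make the compiler verify that each method really matches a ContentHandler callback signature.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class SaxEventLog extends DefaultHandler {
    // Events are collected in document order as the parser reports them
    private final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        events.add("start:" + localName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("end:" + localName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // A parser may deliver one run of text in several calls;
        // whitespace-only chunks are skipped here to keep the log short
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) {
            events.add("text:" + text);
        }
    }

    // Hypothetical helper: parses an XML string and returns the recorded events
    static List<String> trace(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            SaxEventLog handler = new SaxEventLog();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.events;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        for (String event : trace("<a xmlns=\"urn:x\"><b>hi</b></a>")) {
            System.out.println(event);
        }
    }
}
```

Nothing is retained from the document except what the handler chooses to store, which is the main difference from the tree-based approach.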
The following is the output of TraceEventsNS.
R:\samples>java -Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser chap02.TraceEventsNS department-ns.xml
startDocument is called:
startElement is called: localpart=department, namespace URI=http://www.schema.org/department/
startElement is called: localpart=employee, namespace URI=http://www.schema.org/department/
 attribute name=id type=CDATA value=J.D
startElement is called: localpart=name, namespace URI=http://www.schema.org/department/
characters is called: John Doe
startElement is called: localpart=email, namespace URI=http://www.schema.org/department/
characters is called: John.Doe@foo.com
startElement is called: localpart=employee, namespace URI=http://www.schema.org/department/
 attribute name=id type=CDATA value=B.S
startElement is called: localpart=name, namespace URI=http://www.schema.org/department/
characters is called: Bob Smith
startElement is called: localpart=email, namespace URI=http://www.schema.org/department/
characters is called: Bob.Smith@foo.com
startElement is called: localpart=employee, namespace URI=http://www.schema.org/department/
 attribute name=id type=CDATA value=A.M
startElement is called: localpart=name, namespace URI=http://www.schema.org/department/
characters is called: Alice Miller
startElement is called: localpart=url, namespace URI=http://www.schema.org/department/
 attribute name=href type=CDATA value=http://www.foo.com/~amiller/
endDocument is called:

In Listing 2.13 we showed a DOM-based program with JAXP. In Listing 2.16, we show a SAX-based program with JAXP.

Listing 2.16 Tracing SAX events, chap02/TraceEventsJAXP.java
package chap02;
/**
* TraceEventsJAXP.java
*/
import java.io.File;
import java.io.IOException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.FactoryConfigurationError;
import org.xml.sax.SAXException;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;
public class TraceEventsJAXP extends DefaultHandler {
public TraceEventsJAXP() {
}
static public void main(String[] argv) {
try {
if (argv.length != 1) {
System.err.println(
"Usage: java chap02.TraceEventsJAXP <filename>");
System.exit(1);
}
[26] // Creates SAX Parser factory
[27] SAXParserFactory factory =
[28] SAXParserFactory.newInstance();
[29] // Tells the parser to be aware of namespaces
[30] factory.setNamespaceAware(true);
[31] // Creates parser object
[32] SAXParser parser = factory.newSAXParser();
[33] DefaultHandler handler = new TraceEventsJAXP();
[34] // Parses input document
[35] parser.parse(new File(argv[0]), handler);
} catch (SAXException e) {
e.printStackTrace();
} catch (IOException ioe) {
ioe.printStackTrace();
} catch (ParserConfigurationException pce) {
pce.printStackTrace();
}
}
public void startDocument() throws SAXException {
System.out.println("startDocument is called:");
}
public void endDocument() throws SAXException {
System.out.println("endDocument is called:");
}
public void startElement(String uri, String localpart,
String name, Attributes amap) {
System.out.println("startElement is called: localpart="
+ localpart
+ ", namespace URI="+uri);
for (int i = 0; i < amap.getLength(); i++) {
String attname = amap.getQName(i);
String type = amap.getType(i);
String value = amap.getValue(i);
System.out.println(" attribute name="+attname+" type="
+type+" value="+value);
}
}
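// Note: this one-argument method does not override
// ContentHandler.endElement(String uri, String localName, String qName),
// so the parser never calls it. Declare the three-argument form
// to receive end-of-element events.
public void endElement(String name) throws SAXException {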
System.out.println("endElement is called: " + name);
}
public void characters(char[] ch, int start, int length)
throws SAXException {
System.out.println("characters is called: " +
new String(ch, start, length));
}
}
The output of this program is the same as in Listing 2.15.
[27] SAXParserFactory factory = SAXParserFactory.newInstance();
Line 27 creates an instance of the factory class SAXParserFactory for the SAX API.
[30] factory.setNamespaceAware(true);
Line 30 tells the factory to be aware of namespaces.
[32] SAXParser parser = factory.newSAXParser();
Line 32 creates a SAXParser instance from the factory.
[33] DefaultHandler handler = new TraceEventsJAXP();
[35] parser.parse(new File(argv[0]), handler);
Lines 33 and 35 create an instance of an event handler (DefaultHandler) implementation (TraceEventsJAXP itself) and pass it to the parser as an argument of the parse() method.

We have described methods for parsing XML documents with the JAXP API. The DocumentBuilderFactory (for DOM) and SAXParserFactory (for SAX) factories hide the implementation of the XML processors. However, how do the factories find the factory implementations? How can developers use another implementation? In DOM, DocumentBuilderFactory.newInstance() returns an instance of an implementation class. The JAXP 1.1 specification defines a lookup procedure for determining the class name, which includes checking a system property named after the factory class (refer to the specification for details).
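
To see which implementation classes the JAXP lookup selects in a given runtime, a small probe such as the following can help. It is a sketch under stated assumptions: the class name ShowJaxpFactories and its two helper methods are hypothetical, and the names printed depend entirely on the JDK and classpath in use, so no particular output is claimed.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;

public class ShowJaxpFactories {
    // Name of the concrete class behind DocumentBuilderFactory.newInstance()
    static String domFactoryClass() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    // Name of the concrete class behind SAXParserFactory.newInstance()
    static String saxFactoryClass() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println("DOM factory: " + domFactoryClass());
        System.out.println("SAX factory: " + saxFactoryClass());
    }
}
```

Running the probe with and without a -Djavax.xml.parsers.DocumentBuilderFactory=... option on the java command line shows whether the override took effect, provided the named class is on the classpath.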
For example, an implementation of DocumentBuilderFactory provided by Xerces is org.apache.xerces.jaxp.DocumentBuilderFactoryImpl. When you want to specify the class at runtime, pass it as a system property before the name of the main class:
R:\samples>java -Djavax.xml.parsers.DocumentBuilderFactory=org.apache.xerces.jaxp.DocumentBuilderFactoryImpl chap02.SimpleParseJAXP chap02/department-dtd.xml
In SAX, replace the string "DocumentBuilderFactory" with "SAXParserFactory".

2.4.3 Design Point: DOM versus SAX

When you design and develop Web applications, the choice of the access API is very important. The use of DOM is best suited for the following situations:
On the other hand, an XML processor with SAX does not create a tree structure. Instead, it scans an input XML document and generates events. Application programs receive these events and do whatever is appropriate for the task, such as getting an element type name and its text content. Because no tree is built, SAX is more efficient than DOM, so it is well suited to the following situations:
Listing 2.17 summarizes the programming patterns with the Xerces and JAXP APIs.
Listing 2.17 Programming patterns for the Xerces and JAXP APIs
(1) Xerces: DOM parser
import org.apache.xerces.parsers.DOMParser;
import org.w3c.dom.Document;
String filename;
...
DOMParser parser = new DOMParser();
parser.parse(filename);
Document doc = parser.getDocument();
(2) Xerces: SAX parser
import org.xml.sax.helpers.XMLReaderFactory;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
DefaultHandler handler;
String filename;
...
XMLReader parser = XMLReaderFactory.createXMLReader();
parser.setContentHandler(handler);
parser.setDTDHandler(handler);
parser.setErrorHandler(handler);
parser.parse(filename);
(3) Xerces: handling namespaces
parser.setFeature("http://xml.org/sax/features/namespaces", true);
(4) Xerces: Validation
parser.setFeature("http://xml.org/sax/features/validation", true);
(5) Xerces: Schema validation
parser.setFeature("http://apache.org/xml/features/validation/schema", true);
(1) JAXP: DOM parser
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.DocumentBuilder;
import org.w3c.dom.Document;
String filename;
...
DocumentBuilderFactory factory =
DocumentBuilderFactory.newInstance();
DocumentBuilder builder =
factory.newDocumentBuilder();
Document doc = builder.parse(filename);
(2) JAXP: SAX parser
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;
DefaultHandler handler;
String filename;
...
SAXParserFactory factory =
SAXParserFactory.newInstance();
SAXParser parser = factory.newSAXParser();
parser.parse(filename, handler);
(3) JAXP: Handling namespaces
factory.setNamespaceAware(true);
(4) JAXP: (Schema) validation
factory.setValidating(true);
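
The effect of the namespace setting in the JAXP pattern can be observed with a short sketch like the one below. It is illustrative only: the class name NamespaceAwareDemo, the rootNamespace() helper, and the one-element SAMPLE document are assumptions, with the namespace URI borrowed from the department-ns.xml output shown earlier.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class NamespaceAwareDemo {
    // Hypothetical one-element document using a namespace prefix
    static final String SAMPLE =
        "<d:department xmlns:d=\"http://www.schema.org/department/\"/>";

    // Returns the namespace URI the parser reports for the root element
    static String rootNamespace(String xml, boolean namespaceAware) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(namespaceAware);
            Element root = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            return root.getNamespaceURI();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse sample", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("aware:     " + rootNamespace(SAMPLE, true));
        System.out.println("not aware: " + rootNamespace(SAMPLE, false));
    }
}
```

With namespace awareness switched off, the parser treats d:department as a plain name and reports no namespace URI for the element, which is why the factory setting matters for any application that dispatches on namespaces.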