
XML and Java: Developing Web Applications, Second Edition


2.4 Programming Interfaces for Document Structure

In the previous sections, we showed how to read and parse an XML document. Next, we explain how to process an XML document by accessing its internal structure through APIs.

The XML 1.0 Recommendation defines the precise behavior of an XML processor when reading and parsing a document, but it says nothing about which API to use. In this section, we discuss two widely used APIs.

The Document Object Model (DOM), a tree structure-based API by W3C. The specification consists of Level 1 (Recommendation in October 1998), Level 2 (Recommendation in November 2000), and Level 3 (currently a Working Draft) documents. Xerces 1.4.3 supports most of DOM Level 2.

The Simple API for XML (SAX), an event-driven API developed by David Megginson and a number of people on the xml-dev mailing list. Although not sanctioned by any standards body, SAX is supported by most of the available XML processors. Xerces 1.4.3 supports SAX and SAX2, which supports namespaces. In this book, the word "SAX" refers to SAX (version 1.0) and SAX2.

Figure 2.2 depicts the difference between the DOM and SAX APIs. A DOM-based parser parses the whole XML document and then passes a Document instance to the application, so the application must wait until parsing is complete. A SAX-based parser passes a stream of events to the application while it is still parsing the document. The next sections discuss in detail the pros and cons of using these APIs.

Figure 2.2. DOM versus SAX

graphics/02fig02.gif

NOTE

In SAX2, some interfaces have been changed and renamed to support namespaces. Xerces supports both the SAX and SAX2 APIs, but the old SAX interfaces are now deprecated.


2.4.1 DOM: Tree-Based API

The term "document object model" has been used to refer to a model that defines the structure of an HTML document, thereby allowing scripting languages, such as JavaScript, to access the elements of the structure. You might have written JavaScript programs that manipulate the value of an input field in a form element in an HTML document. For example, document.forms(1).username.value refers to the value of the input field with the name username in the first form element in an HTML document. This expression is used to access the HTML DOM on HTML browsers like Microsoft Internet Explorer (IE) and Netscape Navigator.

However, current HTML object models and the APIs to access them are browser-dependent (though the problem is being resolved). Thus you generally should prepare different pages suited for each type of browser that might execute your scripts. One goal of the DOM specification is to define a common, interoperable document object model for HTML as well as XML. The first edition of this book is based on the DOM Level 1 Recommendation. The DOM Level 2 Recommendation was published on November 13, 2000. Support for namespaces, events, traversal, range, and views was introduced in DOM Level 2. Standardization of DOM Level 3 is in progress. It will support load and save functions and other new functions. The details of using the DOM API are discussed in Chapter 4.

In DOM, an XML document is represented as a tree whose nodes are elements, text, and so on. An XML processor generates the tree and hands it to an application. A DOM-based XML processor (for example, DOMParser or DocumentBuilder) creates the entire structure of an XML document in memory (though Xerces defers the creation of DOM nodes until they are accessed).

XML is a language for describing tree-structured data. In XML, an element is represented by a start tag and a matching end tag (or an empty-element tag). An element may contain one or more other elements between its start and end tags. Thus an entire document is represented as a nested tree. For example, our previous department example, department.xml, can be represented in a tree structure, as shown in Figure 2.3.

Figure 2.3. Tree expression for department.xml

graphics/02fig03.gif

Each pair of start and end tags corresponds to an Element node, represented by the boxes in the figure, such as department and employee. Each chunk of text surrounded by two tags corresponds to a Text node, represented by the strings in the figure.
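The correspondence between tags and nodes can be checked with a few lines of code. The following is a minimal sketch, not one of the book's listings: it uses the JAXP DocumentBuilder that ships with the JDK, and the class name DomTreeDump and the inline sample document (loosely modeled on department.xml) are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Sketch: parse a small document and print its DOM tree, one node per line.
class DomTreeDump {

    // Hypothetical sample data, loosely modeled on department.xml.
    static final String SAMPLE =
        "<department><employee id=\"J.D\"><name>John Doe</name></employee></department>";

    // Parses an XML string into a DOM Document; checked exceptions are wrapped.
    static Document parse(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Appends an indented description of the node and all of its descendants.
    static void dump(Node node, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            out.append("Element: ").append(node.getNodeName());
        } else if (node.getNodeType() == Node.TEXT_NODE) {
            out.append("Text: ").append(node.getNodeValue());
        } else {
            out.append("Other: ").append(node.getNodeName());
        }
        out.append('\n');
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            dump(c, depth + 1, out);
        }
    }

    // Returns the tree rooted at the document element as indented text.
    static String dump(String xml) {
        StringBuilder out = new StringBuilder();
        dump(parse(xml).getDocumentElement(), 0, out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump(SAMPLE));
    }
}
```

Note that the id attribute does not appear in the dump: attributes are reached through the element that owns them rather than through the child list.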

These nodes are defined as objects in DOM, and the DOM specification defines a platform- and language-neutral interface for application programs in terms of a standard set of objects. To help interoperability, it defines APIs, called language bindings, for Java, ECMAScript (JavaScript), and the Interface Definition Language (IDL) from the Object Management Group (OMG).

From an object-oriented programming viewpoint, the DOM API is a set of interfaces that should be implemented by a particular DOM implementation. Table 2.1 shows the interfaces (and some classes) that define the DOM (Core) Level 1 specification. Figure 2.4 shows the class/interface hierarchy of the interfaces and classes. Note that Node is the primary data structure that constructs a tree structure. DOM tree constituents, such as Element, Text, and Attr, are all defined as interfaces derived from the Node interface.

Figure 2.4. Class/interface hierarchy

graphics/02fig04.gif

Table 2.1. Interfaces in the org.w3c.dom Package

INTERFACE NAME          DESCRIPTION
Node                    The primary data type representing a single node in the document tree.
Document                Represents the entire XML document.
Element                 Represents an element and any contained nodes.
Attr                    Represents an attribute in an Element object.
ProcessingInstruction   Represents a processing instruction.
CDATASection            Represents a CDATA section.
DocumentFragment        A lightweight document object used for representing multiple subtrees or partial documents.
Entity                  Represents an entity, either parsed or unparsed, in the XML document.
EntityReference         Represents an entity reference, as it appears in the document tree.
DocumentType            Represents a DTD, which contains a list of entities.
Notation                Represents a notation declared in the DTD. A notation declares, by name, the format of an unparsed entity.
CharacterData           A parent interface of Text and others, which provides operations such as inserting and deleting strings.
Comment                 Represents a comment.
Text                    Represents text.
DOMException            An exception thrown when no further processing is possible. Normal errors are reported by return values.
DOMImplementation       Provides methods for operations that are independent of any particular document instance.
NodeList                Represents an ordered collection of nodes. The items in the NodeList are accessible via an integral index, starting from 0.
NamedNodeMap            Represents a collection of nodes that can be accessed by name.
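To show how several of these interfaces work together, here is a short sketch (an illustration, not a listing from the book). It reads the id attribute of every employee element through NodeList, NamedNodeMap, and Attr. The class name DomInterfacesSketch and the sample document are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Sketch: NodeList gives indexed access, NamedNodeMap gives access by name.
class DomInterfacesSketch {

    // Hypothetical sample data.
    static final String SAMPLE =
        "<department><employee id=\"J.D\"/><employee id=\"B.S\"/></department>";

    static List<String> employeeIds(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        List<String> ids = new ArrayList<>();
        // NodeList: ordered collection, indexed from 0.
        NodeList employees = doc.getElementsByTagName("employee");
        for (int i = 0; i < employees.getLength(); i++) {
            Node node = employees.item(i);       // every item is a Node
            Element employee = (Element) node;   // here each Node is an Element
            // NamedNodeMap: the attributes of an element, looked up by name.
            NamedNodeMap attributes = employee.getAttributes();
            Attr id = (Attr) attributes.getNamedItem("id");
            ids.add(id.getValue());
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(employeeIds(SAMPLE));
    }
}
```

The casts are safe here only because getElementsByTagName returns elements and getAttributes returns attributes; code that walks arbitrary children should check getNodeType() first.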

2.4.2 SAX: Event-Driven API

In the previous section, we showed a method to access the structure and content of an XML document represented as a tree with the DOM API. An alternative to the tree-based API is an event-driven API. An application that wants information about document structure, such as element and attribute names and values, can register handlers, a kind of callback function, with an XML processor. The processor notifies the handlers of events such as the start of an element, its attributes, and the presence of character data. Unlike the DOM API, which lets an application traverse the document structure multiple times, parsing with the SAX API is a single pass, and the events are delivered to the application in document order as the parser encounters them.

SAX is designed as a lightweight API that does not generate a tree structure of an input document. Applications must register event handlers with a parser instance that implements the org.xml.sax.XMLReader interface. SAX has several event handler interfaces, including ContentHandler, DTDHandler, and ErrorHandler. It also provides the default implementation class org.xml.sax.helpers.DefaultHandler, which implements all the interface methods as methods that do nothing. If you extend DefaultHandler as your event handler, you need to override only the methods your application needs.
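As a minimal sketch of this pattern (our illustration, with a hypothetical class name ElementCounter), the handler below overrides a single callback and inherits the do-nothing implementations of all the others. It obtains the XMLReader through JAXP so that it runs on a plain JDK.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Sketch: count elements by overriding only startElement.
class ElementCounter extends DefaultHandler {

    private int count;

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        count++; // called once per start tag (or empty-element tag)
    }

    // Parses the XML string and returns the number of elements seen.
    static int count(String xml) {
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            ElementCounter handler = new ElementCounter();
            reader.setContentHandler(handler);
            reader.setErrorHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.count;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count("<a><b/><b/></a>")); // prints 3
    }
}
```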

ContentHandler is the most often used interface because it is called whenever an element and an attribute are found. You can see how it is used in Listing 2.15, a simple program that reads an XML document and notifies events according to the document structure.

Listing 2.15 Tracing SAX events, chap02/TraceEventsNS.java
       package chap02;
       /**
        * TraceEventsNS.java
        */
       import java.io.IOException;
       import org.xml.sax.helpers.XMLReaderFactory;
       import org.xml.sax.XMLReader;
       import org.xml.sax.SAXException;
       import org.xml.sax.Attributes;
       import org.xml.sax.helpers.DefaultHandler;

[12]   public class TraceEventsNS extends DefaultHandler {
[13]      public TraceEventsNS() {
[14]      }

          static public void main(String[] argv) {
             if  (argv.length != 1) {
                 System.err.println(
                       "Usage: chap02.TraceEventsNS <filename>");
                 System.exit(1);
             }
             try {
[23]             // Creates SAX Parser object. When SAX2.0 is used,
[24]             // XMLReader class is used. Implementation class is
[25]             // specified by using
[26]             // System property org.xml.sax.driver
[27]             XMLReader parser =
[28]                XMLReaderFactory.createXMLReader();
[29]             // Tells the parser to be aware of namespace
[30]             parser.setFeature(
[31]                   "http://xml.org/sax/features/namespaces", true);
[32]             // Creates document handler and registers the handler
[33]             TraceEventsNS handler = new TraceEventsNS();
[34]             parser.setContentHandler(handler);
[35]             parser.setDTDHandler(handler);
[36]             parser.setErrorHandler(handler);

                 // Parses input document
                 parser.parse(argv[0]);

             } catch (SAXException se) {
                 se.printStackTrace();
             } catch (IOException ioe) {
                 ioe.printStackTrace();
             }
           }

           public void startDocument() throws SAXException {
              System.out.println("startDocument is called:");
           }

           public void endDocument() throws SAXException {
              System.out.println("endDocument is called:");
            }

           public void startElement(String uri, String localpart,
                                    String name, Attributes amap) {
              System.out.println("startElement is called: localpart="
                                 + localpart
                                 + ", namespace URI="+uri);
              for (int i = 0; i < amap.getLength(); i++) {
                  String attname = amap.getQName(i);
                  String type = amap.getType(i);
                  String value  = amap.getValue(i);
                  System.out.println("  attribute name="+attname+" type="
                                     +type+" value="+value);
              }
           }

           // SAX2 ContentHandler.endElement takes the namespace URI,
           // the local name, and the qualified name.
           public void endElement(String uri, String localpart,
                                  String name) throws SAXException {
              System.out.println("endElement is called: " + name);
           }

           public void characters(char[] ch, int start, int length)
              throws SAXException {
              System.out.println("characters is called: " +
                                 new String(ch, start, length));
           }
       }

[12]   public class TraceEventsNS extends DefaultHandler {...
       }

In line 12, the TraceEventsNS class extends the DefaultHandler class, which is a helper class used to catch all the events provided by the SAX API.

[27]   XMLReader parser = XMLReaderFactory.createXMLReader();

The org.xml.sax.XMLReader is an interface that represents a SAX-based parser, and an instance of the interface is given by the createXMLReader() method.

The object is associated with an implementation class of an XML processor (SAX parser) by using one of the following two methods.

  • Specify the class name as the argument of the createXMLReader() method.

  • Specify the class name as the value of the org.xml.sax.driver system property. In this case, createXMLReader() is called with no arguments.

java -Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser
...

Using this property enables your program to be independent of specific parser implementations.

[30]   parser.setFeature("http://xml.org/sax/features/namespaces", true);

If you want the parser to be aware of namespaces, you can set a feature of the parser (see line 30). The feature http://xml.org/sax/features/namespaces is the same as for the DOM parser (see Section 2.3.1).
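To see what namespace awareness changes in the callbacks, consider the following sketch. It is not from the book: the class name NamespaceNames is hypothetical, and it switches namespace processing on with JAXP's SAXParserFactory.setNamespaceAware(true) rather than with setFeature(), so that it runs on a plain JDK without naming a parser class. With namespace processing enabled, startElement receives the namespace URI and the local name of each element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Sketch: record each element as {namespace URI}local-name.
class NamespaceNames extends DefaultHandler {

    // Hypothetical sample data using the namespace URI from this chapter.
    static final String SAMPLE =
        "<d:department xmlns:d=\"http://www.schema.org/department/\">"
        + "<d:employee/></d:department>";

    private final List<String> names = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        names.add("{" + uri + "}" + localName);
    }

    static List<String> expandedNames(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // JAXP switch for namespace processing
            NamespaceNames handler = new NamespaceNames();
            factory.newSAXParser()
                   .parse(new InputSource(new StringReader(xml)), handler);
            return handler.names;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (String name : expandedNames(SAMPLE)) {
            System.out.println(name);
        }
    }
}
```

The prefix d never appears in the recorded names; two documents that bind different prefixes to the same URI yield the same result.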

[33]   TraceEventsNS handler = new TraceEventsNS();
[34]   parser.setContentHandler(handler);
[35]   parser.setDTDHandler(handler);
[36]   parser.setErrorHandler(handler);

In lines 33-36, an instance of TraceEventsNS, which extends DefaultHandler, is created and registered with the parser instance as its event handler. Although three handlers are set, only the methods for ContentHandler are actually implemented in TraceEventsNS. The DefaultHandler class provides "empty" method implementations for the other interfaces, so you do not have to write those methods yourself.

parser.parse(argv[0]);

The parse() method starts parsing. The SAX parser reads an input XML document and sends events to the instance that implements the handler. Table 2.2 shows the methods defined in the ContentHandler interface.

Table 2.2. Methods of the ContentHandler Interface

METHOD NAME                                                                DESCRIPTION
startDocument()                                                            Receives notification of the beginning of the document.
endDocument()                                                              Receives notification of the end of the document.
startElement(String uri, String localpart, String name, Attributes amap)   Receives notification of the beginning of an element.
endElement(String uri, String localpart, String name)                      Receives notification of the end of an element.
characters(char ch[], int start, int length)                               Receives notification of character data.
ignorableWhitespace(char ch[], int start, int length)                      Receives notification of ignorable whitespace in element content.
processingInstruction(String target, String data)                          Receives notification of a processing instruction.
setDocumentLocator(Locator locator)                                        Receives an object for locating the origin of SAX document events. The Locator object gives information on the location of the event, such as line number and column position.
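The following sketch (ours, with the hypothetical class name LocatingHandler) shows setDocumentLocator() working together with startElement() and the three-argument endElement(). It records each event along with the line number reported by the Locator. A parser is not required to supply a Locator, so the handler guards against a missing one.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// Sketch: record element events together with their line numbers.
class LocatingHandler extends DefaultHandler {

    private Locator locator;                          // set by the parser, if supported
    final List<String> events = new ArrayList<>();    // "start a", "end a", ...
    final List<Integer> lines = new ArrayList<>();    // line number per event, or -1

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        record("start " + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        record("end " + qName);
    }

    private void record(String event) {
        events.add(event);
        lines.add(locator != null ? locator.getLineNumber() : -1);
    }

    // Parses the XML string and returns the populated handler.
    static LocatingHandler trace(String xml) {
        try {
            LocatingHandler handler = new LocatingHandler();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        LocatingHandler h = trace("<a>\n<b/>\n</a>");
        for (int i = 0; i < h.events.size(); i++) {
            System.out.println(h.events.get(i) + " (line " + h.lines.get(i) + ")");
        }
    }
}
```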

The following shows the startDocument, startElement, characters, and endDocument events printed by TraceEventsNS for department-ns.xml.

R:\samples>java -Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser chap02.TraceEventsNS department-ns.xml
startDocument is called:
startElement is called: localpart=department, namespace URI=http://www.schema.org/department/
startElement is called: localpart=employee, namespace URI=http://www.schema.org/department/
  attribute name=id type=CDATA value=J.D
startElement is called: localpart=name, namespace URI=http://www.schema.org/department/
characters is called: John Doe
startElement is called: localpart=email, namespace URI=http://www.schema.org/department/
characters is called: John.Doe@foo.com
startElement is called: localpart=employee, namespace URI=http://www.schema.org/department/
  attribute name=id type=CDATA value=B.S
startElement is called: localpart=name, namespace URI=http://www.schema.org/department/
characters is called: Bob Smith
startElement is called: localpart=email, namespace URI=http://www.schema.org/department/
characters is called: Bob.Smith@foo.com
startElement is called: localpart=employee, namespace URI=http://www.schema.org/department/
  attribute name=id type=CDATA value=A.M
startElement is called: localpart=name, namespace URI=http://www.schema.org/department/
characters is called: Alice Miller
startElement is called: localpart=url, namespace URI=http://www.schema.org/department/
  attribute name=href type=CDATA value=http://www.foo.com/~amiller/
endDocument is called:

In Listing 2.13 we showed a DOM-based program with JAXP. In Listing 2.16, we show a SAX-based program with JAXP.

Listing 2.16 Tracing SAX events, chap02/TraceEventsJAXP.java
       package chap02;
       /**
        * TraceEventsJAXP.java
        */
       import java.io.File;
       import java.io.IOException;
       import javax.xml.parsers.SAXParser;
       import javax.xml.parsers.SAXParserFactory;
       import javax.xml.parsers.ParserConfigurationException;
       import javax.xml.parsers.FactoryConfigurationError;
       import org.xml.sax.SAXException;
       import org.xml.sax.Attributes;
       import org.xml.sax.helpers.DefaultHandler;

       public class TraceEventsJAXP extends DefaultHandler {
         public TraceEventsJAXP() {
         }

         static public void main(String[] argv) {
           try {
             if (argv.length != 1) {
               System.err.println(
                            "Usage: java chap02.TraceEventsJAXP <filename>");
               System.exit(1);
             }
[26]         // Creates SAX Parser factory
[27]         SAXParserFactory factory =
[28]                                SAXParserFactory.newInstance();
[29]         // Tells the parser to be aware of namespaces
[30]         factory.setNamespaceAware(true);
[31]         // Creates parser object
[32]         SAXParser parser = factory.newSAXParser();
[33]         DefaultHandler handler = new TraceEventsJAXP();
[34]         // Parses input document
[35]         parser.parse(new File(argv[0]), handler);
           } catch (SAXException e) {
             e.printStackTrace();
           } catch (IOException ioe) {
             ioe.printStackTrace();
           } catch (ParserConfigurationException pce) {
             pce.printStackTrace();
           }
         }

         public void startDocument() throws SAXException {
           System.out.println("startDocument is called:");
         }

         public void endDocument() throws SAXException {
           System.out.println("endDocument is called:");
         }

         public void startElement(String uri, String localpart,
                                  String name, Attributes amap) {
           System.out.println("startElement is called: localpart="
                              + localpart
                              + ", namespace URI="+uri);
           for (int i = 0; i < amap.getLength(); i++) {
             String attname = amap.getQName(i);
             String type = amap.getType(i);
             String value = amap.getValue(i);
             System.out.println("  attribute name="+attname+" type="
                                             +type+" value="+value);
           }
         }

         // SAX2 ContentHandler.endElement takes the namespace URI,
         // the local name, and the qualified name.
         public void endElement(String uri, String localpart,
                                String name) throws SAXException {
           System.out.println("endElement is called: " + name);
         }

         public void characters(char[] ch, int start, int length)
                                               throws SAXException {
           System.out.println("characters is called: " +
                                     new String(ch, start, length));
         }
     }

The output of this program is the same as in Listing 2.15.

[27]   SAXParserFactory factory = SAXParserFactory.newInstance();

Line 27 creates an instance of the factory class SAXParserFactory for the SAX API.

[30]   factory.setNamespaceAware(true);

Line 30 tells the factory to be aware of namespaces.

[32]   SAXParser parser = factory.newSAXParser();

Line 32 creates a SAXParser instance from the factory.

[33]   DefaultHandler handler = new TraceEventsJAXP();
[35]   parser.parse(new File(argv[0]), handler);

Line 33 creates the event handler, an instance of TraceEventsJAXP, which extends DefaultHandler. Line 35 passes the handler to the parser as the second argument of the parse() method.

We described methods for parsing XML documents with the JAXP API. The DocumentBuilderFactory (for DOM) and SAXParserFactory (for SAX) factories hide the implementation of the XML processors. However, how do the factories call the factory implementations? How can developers use another implementation?

In DOM, DocumentBuilderFactory.newInstance() gives an instance of an implementation class. The JAXP 1.1 specification defines a procedure for determining the class name as follows (refer to the specification for details).

  1. Use the javax.xml.parsers.DocumentBuilderFactory system property.

  2. Use lib/jaxp.properties in the JRE directory.

  3. Use the META-INF/services/javax.xml.parsers.DocumentBuilderFactory file in jars (provided by the Jar Services API).

  4. Use the platform default.

For example, an implementation of DocumentBuilderFactory provided by Xerces is org.apache.xerces.jaxp.DocumentBuilderFactoryImpl. When you want to specify the class at runtime, do as follows:

R:\samples>java -Djavax.xml.parsers.DocumentBuilderFactory=org.apache.xerces.jaxp.DocumentBuilderFactoryImpl chap02.SimpleParseJAXP chap02/department-dtd.xml

In SAX, replace the string "DocumentBuilderFactory" with "SAXParserFactory".
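One way to check which implementation the lookup procedure actually selected is to ask the factory instance for its class name. The sketch below is an illustration with a hypothetical class name, FactoryLookup. The names it prints depend on the JRE, the classpath, and any system properties in effect, so no particular output should be expected.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;

// Sketch: report which JAXP factory implementations were selected.
class FactoryLookup {

    // Class name of the DocumentBuilderFactory implementation in use.
    static String domFactoryClass() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    // Class name of the SAXParserFactory implementation in use.
    static String saxFactoryClass() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        // The system property is step 1 of the lookup; it is null when unset.
        System.out.println("property: "
            + System.getProperty("javax.xml.parsers.DocumentBuilderFactory"));
        System.out.println("DOM factory: " + domFactoryClass());
        System.out.println("SAX factory: " + saxFactoryClass());
    }
}
```

Running it once without and once with the -D option on the command line shows whether the system property took effect.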

2.4.3 Design Point: DOM versus SAX

When you design and develop Web applications, the choice of the access API is very important.

The use of DOM is best suited for the following situations:

  • When structurally modifying an XML document, for example, sorting elements in a particular order or moving some elements from one place in the tree to another.

  • When sharing the document in memory with other applications. Applications can share a Document instance after the parsing process.

  • When the size of the XML documents to be parsed is not so large. Generally, creation of Java objects has a performance penalty. Xerces is designed to delay the creation of element and text node objects until they are requested, but DOM-based programming is still more costly than using SAX.

  • When applications want to start processing after finishing validation.
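The first case in the list above, reordering elements, can be sketched as follows. This is our illustration rather than a listing from the book; the class name SortEmployees and the sample data are hypothetical. It relies on the DOM rule that appendChild() moves a node that is already in the tree instead of copying it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Sketch: sort employee elements by name, in place, in a DOM tree.
class SortEmployees {

    // Hypothetical sample data without whitespace between elements.
    static final String SAMPLE = "<department>"
        + "<employee><name>John Doe</name></employee>"
        + "<employee><name>Bob Smith</name></employee>"
        + "<employee><name>Alice Miller</name></employee>"
        + "</department>";

    // Text of the first name child; assumes every employee has one.
    static String nameOf(Element employee) {
        return employee.getElementsByTagName("name").item(0)
                       .getFirstChild().getNodeValue();
    }

    // Sorts the employees and returns their names in the new document order.
    static List<String> sortedNames(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        Element root = doc.getDocumentElement();

        // Copy the live NodeList into a list before changing the tree.
        NodeList found = root.getElementsByTagName("employee");
        List<Element> employees = new ArrayList<>();
        for (int i = 0; i < found.getLength(); i++) {
            employees.add((Element) found.item(i));
        }
        employees.sort(Comparator.comparing(SortEmployees::nameOf));

        // appendChild moves each existing node to the end of the child list.
        for (Element employee : employees) {
            root.appendChild(employee);
        }

        List<String> names = new ArrayList<>();
        NodeList after = root.getElementsByTagName("employee");
        for (int i = 0; i < after.getLength(); i++) {
            names.add(nameOf((Element) after.item(i)));
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(sortedNames(SAMPLE));
    }
}
```

Doing the same with SAX would require buffering all the employees yourself, which amounts to building a tree by hand.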

On the other hand, an XML processor with SAX does not create a tree structure. Instead, it scans an input XML document and generates events. Application programs receive these events and do whatever is appropriate for the task, such as getting an element type name and its text content. Because no tree is built, SAX is more efficient than DOM, which makes it a good fit in the following cases:

  • When your task is performance and memory sensitive.

  • When your task does not need to recognize the (complex) structure of an XML document. SAX scans the XML document in a single pass, so your application must keep track of where it is in the document during parsing. When the XML document represents a set of attribute/value pairs, you can get them very efficiently with SAX.
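The state keeping mentioned in the last item can be quite small. The sketch below is an illustration with a hypothetical class name, EmployeeEmails, and hypothetical sample data. It collects name/email pairs in one pass by buffering character data and remembering the most recent name.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Sketch: extract name -> email pairs with SAX, keeping minimal state.
class EmployeeEmails extends DefaultHandler {

    // Hypothetical sample data.
    static final String SAMPLE = "<department>\n"
        + " <employee><name>John Doe</name><email>John.Doe@foo.com</email></employee>\n"
        + " <employee><name>Bob Smith</name><email>Bob.Smith@foo.com</email></employee>\n"
        + "</department>";

    private final StringBuilder text = new StringBuilder(); // text since last start tag
    private String currentName;                             // last name element seen
    private final Map<String, String> emails = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        text.setLength(0); // start collecting text for this element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // A parser may deliver one text run in several chunks, so append.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("name".equals(qName)) {
            currentName = text.toString();
        } else if ("email".equals(qName)) {
            emails.put(currentName, text.toString());
        }
    }

    static Map<String, String> extract(String xml) {
        try {
            EmployeeEmails handler = new EmployeeEmails();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.emails;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extract(SAMPLE));
    }
}
```

No node objects are created for the document; the only memory that grows with the input is the result map.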

Listing 2.17 summarizes the programming patterns with the Xerces and JAXP APIs.

Listing 2.17 Programming patterns for the Xerces and JAXP APIs

(1) Xerces: DOM parser

import org.apache.xerces.parsers.DOMParser;
import org.w3c.dom.Document;

String filename;
...
DOMParser parser = new DOMParser();
parser.parse(filename);
Document doc = parser.getDocument();

(2) Xerces: SAX parser

import org.xml.sax.helpers.XMLReaderFactory;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

DefaultHandler handler;
String filename;
...
XMLReader parser = XMLReaderFactory.createXMLReader();
parser.setContentHandler(handler);
parser.setDTDHandler(handler);
parser.setErrorHandler(handler);
parser.parse(filename);

(3) Xerces: handling namespaces

parser.setFeature("http://xml.org/sax/features/namespaces", true);

(4) Xerces: Validation

parser.setFeature("http://xml.org/sax/features/validation", true);

(5) Xerces: Schema validation

parser.setFeature("http://apache.org/xml/features/validation/schema", true);

(1) JAXP: DOM parser

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.DocumentBuilder;
import org.w3c.dom.Document;

String filename;
...
DocumentBuilderFactory factory =
                 DocumentBuilderFactory.newInstance();
DocumentBuilder builder =
                 factory.newDocumentBuilder();
Document doc = builder.parse(filename);

(2) JAXP: SAX parser

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

DefaultHandler handler;
String filename;
...
SAXParserFactory factory =
                 SAXParserFactory.newInstance();
SAXParser parser = factory.newSAXParser();
parser.parse(filename, handler);

(3) JAXP: Handling namespaces

factory.setNamespaceAware(true);

(4) JAXP: (Schema) validation

factory.setValidating(true);
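
Turning validation on is only half of the pattern, because validity errors are reported to an ErrorHandler rather than thrown. The following sketch is ours, with a hypothetical class name, ValidationSketch, and a hypothetical inline DTD. It collects validity errors from a JAXP DocumentBuilder, and it validates against a DTD only; schema validation is not shown.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Sketch: DTD validation with JAXP, collecting errors through an ErrorHandler.
class ValidationSketch {

    // Hypothetical internal DTD subset for a tiny department document.
    static final String DTD = "<!DOCTYPE department ["
        + "<!ELEMENT department (employee*)>"
        + "<!ELEMENT employee (name)>"
        + "<!ELEMENT name (#PCDATA)>"
        + "]>";

    // Returns the validity error messages reported while parsing.
    static List<String> validationErrors(String xml) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // validate against the DTD
            DocumentBuilder builder = factory.newDocumentBuilder();
            // DefaultHandler implements ErrorHandler; override error() only.
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    errors.add(e.getMessage());
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Well-formedness (fatal) errors end up here.
            throw new IllegalStateException(e);
        }
        return errors;
    }

    public static void main(String[] args) {
        String valid = DTD
            + "<department><employee><name>John Doe</name></employee></department>";
        String invalid = DTD + "<department><employee/></department>";
        System.out.println("valid document errors:   " + validationErrors(valid).size());
        System.out.println("invalid document errors: " + validationErrors(invalid).size());
    }
}
```

The wording of the error messages is implementation specific, which is why the sketch reports only how many were raised.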