XML and Java: Developing Web Applications, Second Edition

5.2 Basic Tips for Using SAX

In Chapter 2, Sections 2.4 (see Figure 2.2) and 2.4.2 describe the basic concepts of SAX and the programming model for SAX. The concept of SAX is simple. A SAX parser reads an XML document from the beginning, and the parser tells an application what it finds by using the callback methods of ContentHandler or other interfaces.

However, a few details commonly trip up developers. We discuss them in this section.

5.2.1 ContentHandler

In this section, we discuss a major trap for beginning users of SAX and the parser feature mechanism, an important feature introduced in SAX2.

Trap of the characters() Events

The characters() method of ContentHandler confuses SAX beginners. Consider the following document:

<root>
   Hello,
   XML &#x26; Java!
</root>

A programmer might expect the parsing of this document to generate five events:

  • startDocument()

  • startElement() for the root element

  • characters(): "\n Hello,\n XML & Java!\n"

  • endElement() for the root element

  • endDocument()

Actually, the SAX parser of Xerces produces three characters() events between startElement() and endElement(). They are:

  • characters(): "\n Hello,\n XML "

  • characters(): "&"

  • characters(): " Java!\n"

The SAX parser of Crimson produces eight characters() events:

  • characters(): ""

  • characters(): "\n"

  • characters(): " Hello,"

  • characters(): "\n"

  • characters(): " XML "

  • characters(): "&"

  • characters(): " Java!"

  • characters(): "\n"

These behaviors are not bugs in these parsers. The SAX specification allows splitting a text segment into several events. So take care when you write an application that processes character data.
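To see how a particular parser splits the text, it can help to record each characters() call. The following is a minimal sketch, assuming the JAXP parser bundled with the JDK; the class name CharactersChunks and the helper chunksOf() are hypothetical names introduced for illustration. The number of chunks depends on the parser, but their concatenation should always be the same text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Records every characters() chunk so that the splitting can be inspected. */
class CharactersChunks extends DefaultHandler {
    final List<String> chunks = new ArrayList<>();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Copy the slice: the parser may reuse the array after this call returns.
        chunks.add(new String(ch, start, length));
    }

    /** Parses the XML text and returns the chunks in arrival order. */
    static List<String> chunksOf(String xml) {
        try {
            CharactersChunks handler = new CharactersChunks();
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.chunks;
        } catch (Exception e) {
            // Wrapped only to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        List<String> chunks =
            chunksOf("<root>\n   Hello,\n   XML &#x26; Java!\n</root>");
        // The count is parser-dependent; the joined text is not.
        System.out.println(chunks.size() + " characters() event(s)");
        System.out.println(String.join("", chunks));
    }
}
```

An application that needs the whole text should therefore work on the joined chunks and never on a single event.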

Listing 5.1 is a program that checks whether the text in an element matches a given string. The program shows a way to solve the problem of split characters() events.

Listing 5.1 A correct way to process text, chap05/TextMatch.java
package chap05;

import java.io.IOException;
import java.util.Stack;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

public class TextMatch extends DefaultHandler {
   StringBuffer buffer;
   String pattern;
   Stack context;

   public TextMatch(String pattern) {
      this.buffer = new StringBuffer();
      this.pattern = pattern;
      this.context = new Stack();
   }

   protected void flushText() {
      if (this.buffer.length() > 0) {
          String text = new String(this.buffer);
          if (pattern.equals(text)) {
             System.out.print("Pattern '"+this.pattern
                           +"' has been found around ");
             for (int i = 0; i < this.context.size();  i++) {
                 System.out.print("/"+this.context.elementAt(i));
             }
             System.out.println("");
          }
      }
      this.buffer.setLength(0);
   }

   public void characters(char[] ch, int start, int len)
      throws SAXException {
      this.buffer.append(ch, start, len);
   }
   public void ignorableWhitespace(char[] ch, int start, int len)
       throws SAXException {
       this.buffer.append(ch, start, len);
   }
   public void processingInstruction(String target, String data)
      throws SAXException {
      // Nothing to do because PI does not affect the meaning
      // of a document.
   }
   public void startElement(String uri, String local,
                            String qname, Attributes atts)
      throws SAXException {
      this.flushText();
      this.context.push(local);
   }
   public void endElement(String uri, String local, String qname)
      throws SAXException {
      this.flushText();
      this.context.pop();
   }

   public static void main(String[] argv) {
      if (argv.length != 2) {
          System.out.println("TextMatch <pattern> <document>");
          System.exit(1);
      }
      try {
         XMLReader xreader = XMLReaderFactory.createXMLReader(
                "org.apache.xerces.parsers.SAXParser");
         xreader.setContentHandler(new TextMatch(argv[0]));
         xreader.parse(argv[1]);
      } catch (IOException ioe) {
          ioe.printStackTrace();
      } catch (SAXException se) {
          se.printStackTrace();
      }
   }
}

This program assumes that the start tags and end tags split the text and that the comments and processing instructions do not. Character data is saved to a buffer in the characters() method, and a matching process against the buffer is invoked in tag events.

Let's run TextMatch against the XML document shown in Listing 5.2.

Listing 5.2 A sample document for TextMatch, chap05/match.xml
<?xml version="1.0" encoding="us-ascii"?>
<root>
   <movie>A 3x3 Matrix</movie>
   <book>XM<!-- -->L &#x26; Jav<?target?>a</book>
</root>

R:\samples>java chap05.TextMatch "XML & Java" file:./chap05/match.xml
Pattern 'XML & Java' has been found around /root/book

TextMatch finds "XML & Java" in the book element, the character data of which is split by a comment, a character reference, and a processing instruction.

Parser Features

The SAX2 specification defines two standard features that every SAX2 parser must recognize: the namespace feature and the namespace-prefix feature. The default feature settings of SAX2-compliant parsers are as follows.

  • Namespace feature, http://xml.org/sax/features/namespaces, is true.

  • Namespace-prefix feature, http://xml.org/sax/features/namespace-prefixes, is false.

The default settings have these meanings.

  • The parser provides information about namespace URIs and local names via ContentHandler.startElement(), ContentHandler.endElement(), Attributes.getURI(), and Attributes.getLocalName().

  • ContentHandler.startPrefixMapping() and ContentHandler.endPrefixMapping() are called when elements declaring namespaces are visited and left, respectively.

  • An Attributes instance contains no namespace declarations.

  • The availability of qualified names is implementation-dependent.

If the namespace feature is turned off, the availability of namespace URIs and local names is implementation-dependent, start/endPrefixMapping() are not called, and an Attributes instance contains namespace declarations.

If the namespace-prefix feature is turned on, qualified names are available, and an Attributes instance contains namespace declarations.

Table 5.1 shows a summary of these features.

Table 5.1. SAX Features
NAMESPACE FEATURE   NAMESPACE-PREFIX FEATURE   NS URI/LOCAL NAME   QUALIFIED NAME   CALLS *PrefixMapping()   NS DECLS IN Attributes
true                false                      x                   -                x                        -
true                true                       x                   x                x                        x
false               false                      -                   -                -                        x
false               true                       -                   x                -                        x

(x: guaranteed; -: not provided or implementation-dependent)

In general, you need not disable the namespace feature. Turn it off only when the slight overhead of this feature is unacceptable. Turn on the namespace-prefix feature if you need qualified names or namespace declarations as attributes.

According to the JAXP specification, a SAX parser created by SAXParserFactory is not namespace-aware by default. In the JAXP implementation of Xerces, SAXParserFactory.setNamespaceAware() affects the setting of the namespace feature. As for Crimson in the JAXP 1.1 reference implementation, SAXParserFactory.setNamespaceAware() seems to affect neither the namespace feature nor the namespace-prefix feature. We recommend that you always get an XMLReader instance by using SAXParser.getXMLReader() and that you set these features explicitly.
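The recommendation above can be sketched as follows. This is a minimal example, assuming the JAXP parser bundled with the JDK; FeatureDemo and its parse() helper are hypothetical names. It obtains the XMLReader through SAXParser.getXMLReader() and then sets both features explicitly, so the result does not depend on how a given JAXP implementation maps setNamespaceAware() to SAX features.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Logs namespace-related events under explicitly chosen feature settings. */
class FeatureDemo extends DefaultHandler {
    static final String NAMESPACES =
        "http://xml.org/sax/features/namespaces";
    static final String NAMESPACE_PREFIXES =
        "http://xml.org/sax/features/namespace-prefixes";

    final List<String> log = new ArrayList<>();

    @Override
    public void startPrefixMapping(String prefix, String uri) {
        log.add("prefix " + prefix + "=" + uri);
    }

    @Override
    public void startElement(String uri, String local, String qname,
                             Attributes atts) {
        log.add("element {" + uri + "}" + local);
        // Includes xmlns declarations only if namespace-prefixes is true.
        log.add("attributes " + atts.getLength());
    }

    /** Parses xml with namespaces on and namespace-prefixes as requested. */
    static List<String> parse(String xml, boolean namespacePrefixes) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // JAXP-level switch; the SAX features are still set explicitly below.
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setFeature(NAMESPACES, true);
            reader.setFeature(NAMESPACE_PREFIXES, namespacePrefixes);

            FeatureDemo handler = new FeatureDemo();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.log;
        } catch (Exception e) {
            // Wrapped only to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<p:item xmlns:p=\"urn:example\" id=\"1\"/>";
        System.out.println(parse(xml, false));
        System.out.println(parse(xml, true));
    }
}
```

With the namespace-prefix feature off, the Attributes instance should hold only the id attribute; with it on, the xmlns:p declaration is reported as an attribute as well, which matches the first two rows of Table 5.1.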

5.2.2 Using and Writing SAX Filters

A SAX filter receives SAX events from a SAX parser, modifies these events, and forwards them to a handler, as shown in Figure 5.1. As far as the SAX parser is concerned, the SAX filter can be seen as a handler. On the other hand, as far as the handler is concerned, the SAX filter can be seen as a SAX parser.

Figure 5.1. SAX filter


The SAX2 specification provides the XMLFilter interface for SAX filters. This interface is derived from XMLReader, the interface for SAX parsers.

Typical uses of SAX filters are the following.

Modifying XML documents

When you write a program for modifying XML documents, you might want to reuse XMLSerializer for serializing SAX events to an XML document. Then you only have to write a SAX filter that modifies SAX events, and insert the filter between a SAX parser and XMLSerializer.

Convenience for the next handler

You can simplify handlers for complicated tasks by creating preprocessing SAX filters. For example, suppose that you want to write a SAX handler that supports both <book title="foobar">...</book> and <book><title>foobar</title>...</book>. The SAX handler becomes simpler if you write a filter for canonicalizing events to one of the two formats. Another example is the characters() trap discussed in Section 5.2.1. You can avoid the trap by implementing a SAX filter that concatenates consecutive characters() events.

Control of event flow

Suppose that you want to use two handlers for a single XML document at the same time. Unfortunately, you cannot register two or more handlers of the same type to one XMLReader instance. So you implement a handler as a SAX filter (see Figure 5.2), or you make a filter that accepts the registration of two handlers and duplicates the input events (see Figure 5.3).

Figure 5.2. A handler performs as a filter.


Figure 5.3. A filter duplicates events.

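The event-duplicating arrangement of Figure 5.3 might look like the following. This is a minimal sketch, not the book's code: TeeFilter and TeeDemo are hypothetical names, and for brevity only element and text events are duplicated. The first handler is registered in the usual way with setContentHandler(); the second is passed to the constructor.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical filter that also forwards events to a second handler. */
class TeeFilter extends XMLFilterImpl {
    private final ContentHandler second;   // must not be null

    TeeFilter(XMLReader parent, ContentHandler second) {
        super(parent);
        this.second = second;
    }

    @Override
    public void startElement(String uri, String local, String qname,
                             Attributes atts) throws SAXException {
        super.startElement(uri, local, qname, atts);  // registered handler
        second.startElement(uri, local, qname, atts); // duplicate
    }

    @Override
    public void endElement(String uri, String local, String qname)
            throws SAXException {
        super.endElement(uri, local, qname);
        second.endElement(uri, local, qname);
    }

    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        super.characters(ch, start, length);
        second.characters(ch, start, length);
    }
}

class TeeDemo {
    /** Counts elements with one handler and collects text with another. */
    static String run(String xml) {
        final int[] elements = {0};
        final StringBuilder text = new StringBuilder();
        DefaultHandler counter = new DefaultHandler() {
            @Override
            public void startElement(String u, String l, String q, Attributes a) {
                elements[0]++;
            }
        };
        DefaultHandler collector = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            XMLReader parser =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            TeeFilter tee = new TeeFilter(parser, collector);
            tee.setContentHandler(counter);
            tee.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e); // keeps the sketch short
        }
        return elements[0] + ":" + text;
    }

    public static void main(String[] args) {
        System.out.println(run("<a><b>hi</b><c>!</c></a>"));
    }
}
```

A complete version would duplicate the remaining ContentHandler callbacks (startDocument(), ignorableWhitespace(), and so on) in the same way.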

Using Filters

A typical code fragment for using a SAX parser follows.

XMLReader parser = XMLReaderFactory.createXMLReader();
// or parser = new SAXParser() if you use Xerces.
parser.setContentHandler(handler);
parser.parse(...);

If you want a filter between the parser and the handler, modify this code fragment to this:

XMLReader parser = ...
XMLFilter filter = new SomethingFilter();
filter.setParent(parser);
filter.setContentHandler(handler);
filter.parse(...);

or to this:

// If the constructor for the filter takes a parent
// (parser or filter) as a parameter.
XMLReader parser = ...
XMLReader filter = new SomethingFilter(parser);
filter.setContentHandler(handler);
filter.parse(...);

The following two code fragments use a parser and two filters.

First fragment:

XMLReader parser = ...
XMLFilter filter1 = new SomethingFilter();
filter1.setParent(parser);
XMLFilter filter2 = new OtherFilter();
filter2.setParent(filter1);

filter2.setContentHandler(handler);
filter2.parse(...);

Second fragment:

XMLReader parser = ...
XMLReader filter2 = new OtherFilter(new SomethingFilter(parser));
filter2.setContentHandler(handler);
filter2.parse(...);

These code fragments make an event chain, as shown in Figure 5.4.

Figure 5.4. A parser, two filters, and a handler

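A runnable version of such a chain may make the event order easier to see. The sketch below uses a hypothetical RenameFilter in place of SomethingFilter and OtherFilter, and assumes the JAXP parser bundled with the JDK. The filter nearest the parser sees each event first, so renaming a to b and then b to c delivers c to the handler.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical filter that renames one element (no namespace handling). */
class RenameFilter extends XMLFilterImpl {
    private final String from;
    private final String to;

    RenameFilter(XMLReader parent, String from, String to) {
        super(parent);
        this.from = from;
        this.to = to;
    }

    private String map(String name) {
        return name.equals(from) ? to : name;
    }

    @Override
    public void startElement(String uri, String local, String qname,
                             Attributes atts) throws SAXException {
        super.startElement(uri, map(local), map(qname), atts);
    }

    @Override
    public void endElement(String uri, String local, String qname)
            throws SAXException {
        super.endElement(uri, map(local), map(qname));
    }
}

class ChainDemo {
    /** Returns the local names the final handler sees, in document order. */
    static List<String> elementNames(String xml) {
        final List<String> names = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qname,
                                     Attributes atts) {
                names.add(local);
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);  // so that local names are reported
            XMLReader parser = factory.newSAXParser().getXMLReader();
            XMLReader filter1 = new RenameFilter(parser, "a", "b");
            XMLReader filter2 = new RenameFilter(filter1, "b", "c");
            filter2.setContentHandler(handler);
            filter2.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e); // keeps the sketch short
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(elementNames("<a><x/></a>"));
    }
}
```

Swapping the two filters would change the result: the b-to-c filter would run first and find nothing to rename, and the handler would see b instead of c.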

Writing Filters

The XMLFilter interface is derived from the XMLReader interface by adding getParent() and setParent(). The XMLFilter is merely an interface definition, and it does not help us to implement a filter. As a base class for implementing filters, SAX provides the XMLFilterImpl class.

As demonstrated earlier, if a filter constructor takes an XMLReader as an argument, the application code becomes simpler.

Listing 5.3 is an example of a SAX filter. It replaces elements like <email>foo@example.com</email> with <uri>mailto:foo@example.com</uri>.

Listing 5.3 An example of a SAX filter, chap05/MailFilter.java
package chap05;

import org.apache.xml.serialize.OutputFormat;
import org.apache.xml.serialize.XMLSerializer;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;
import org.xml.sax.helpers.XMLReaderFactory;

/**
 * <email>foo@bar.test</email>
 *   -> <uri>mailto:foo@bar.test</uri>
 */
public class MailFilter extends XMLFilterImpl {

   public MailFilter(XMLReader parent) {
      super(parent);
   }

   /**
    * Replace `email' with `uri',
    * and make a characters event for "mailto:".
    */
   public void startElement(String uri, String local, String qname,
                            Attributes atts)
      throws SAXException {
      ContentHandler ch = this.getContentHandler();
      if (ch == null)
         return;
      if (uri.length() == 0 && local.equals("email")) {
         ch.startElement("", "uri", "uri", atts);
         String mailto = "mailto:";
         ch.characters(mailto.toCharArray(), 0, mailto.length());
      } else
         ch.startElement(uri, local, qname, atts);
   }

   /**
    * Replace `email' with `uri'.
    */
   public void endElement(String uri, String local, String qname)
      throws SAXException {
      ContentHandler ch = this.getContentHandler();
      if (ch == null)
         return;
      if (uri.length() == 0 && local.equals("email")) {
         ch.endElement("", "uri", "uri");
      } else
         ch.endElement(uri, local, qname);
   }

   public static void main(String[] argv) throws Exception {
       OutputFormat format
              = new OutputFormat("xml", "UTF-8", false);
       format.setPreserveSpace(true);
       ContentHandler handler = new XMLSerializer(System.out, format);

       XMLReader parser = XMLReaderFactory.createXMLReader(
              "org.apache.xerces.parsers.SAXParser");
       XMLReader filter = new MailFilter(parser);
       filter.setContentHandler(handler);
       filter.parse(argv[0]);

       System.out.println("");
   }
}

In the overriding methods of your filter, remember to forward (modified) SAX events to the appropriate methods of the registered handler. Note that getXxxHandler() methods may return null. So you have to check whether the next handler is null before calling it.

To see how this program works, type the following:

R:\samples>type chap05\addresses.xml
<?xml version="1.0" encoding="us-ascii"?>
<addresses>
   <email>John.Doe@bar.test</email>
   <email>George.Smith@bar.test</email>
   <email>Anna.Millers@bar.test</email>
</addresses>

R:\samples> java chap05.MailFilter file:./chap05/addresses.xml
<?xml version="1.0" encoding="UTF-8"?>
<addresses>
   <uri>mailto:John.Doe@bar.test</uri>
   <uri>mailto:George.Smith@bar.test</uri>
   <uri>mailto:Anna.Millers@bar.test</uri>
</addresses>
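The same base class can address the characters() trap from Section 5.2.1. The following is a minimal sketch of a filter that concatenates consecutive characters() events, assuming the JAXP parser bundled with the JDK; CoalescingFilter and chunksOf() are hypothetical names. It forwards through the super methods of XMLFilterImpl, which pass the event on only if a handler is registered, so no explicit null check is needed. Unlike TextMatch, this sketch flushes the pending text at a processing instruction so that the order of events is preserved.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical filter that merges consecutive characters() events. */
class CoalescingFilter extends XMLFilterImpl {
    private final StringBuilder pending = new StringBuilder();

    CoalescingFilter(XMLReader parent) {
        super(parent);
    }

    /** Sends buffered text downstream as a single characters() event. */
    private void flush() throws SAXException {
        if (pending.length() == 0) {
            return;
        }
        char[] text = pending.toString().toCharArray();
        pending.setLength(0);
        super.characters(text, 0, text.length);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        pending.append(ch, start, length);   // hold back until a boundary
    }

    @Override
    public void startElement(String uri, String local, String qname,
                             Attributes atts) throws SAXException {
        flush();
        super.startElement(uri, local, qname, atts);
    }

    @Override
    public void endElement(String uri, String local, String qname)
            throws SAXException {
        flush();
        super.endElement(uri, local, qname);
    }

    @Override
    public void processingInstruction(String target, String data)
            throws SAXException {
        flush();                             // keep the event order intact
        super.processingInstruction(target, data);
    }

    /** Returns the text chunks a handler behind the filter receives. */
    static List<String> chunksOf(String xml) {
        final List<String> chunks = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                chunks.add(new String(ch, start, length));
            }
        };
        try {
            XMLReader parser =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            XMLReader filter = new CoalescingFilter(parser);
            filter.setContentHandler(handler);
            filter.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e); // keeps the sketch short
        }
        return chunks;
    }
}
```

Behind this filter, a handler receives the text between two tags as one event regardless of how the parser split it, for example chunksOf("<b>XM<!-- -->L &#x26; Java</b>") yields the single chunk "XML & Java".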

5.2.3 New Features of SAX2

In this section, we summarize the new features of SAX2 for developers who have experience with SAX1.

Namespace support

SAX1 was finalized before the "Namespaces in XML" specification became a W3C Recommendation. So SAX1 has no namespace support. With SAX2, applications can receive namespace information as described in Section 5.2.1.

SAX filters

SAX1 has no interface for filters, though we can write filters without such an interface. SAX2 introduced a standard XMLFilter interface. It makes writing and using filters easier.

More information about an XML document

With SAX1, applications cannot learn anything about comments, CDATA sections, and many types of declarations in DTDs. SAX2 supports them with new interfaces.

Feature/property mechanism

SAX2 provides a generic mechanism to enable or disable the features of SAX parsers and to set or get extra information about SAX parsers.
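As an illustration of the property side of this mechanism, the sketch below registers a LexicalHandler through XMLReader.setProperty() in order to receive comments. It assumes the JAXP parser bundled with the JDK and uses the DefaultHandler2 convenience class from the org.xml.sax.ext package; CommentCollector and commentsIn() are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

/** Collects comments, which are not reported through ContentHandler. */
class CommentCollector extends DefaultHandler2 {
    static final String LEXICAL_HANDLER =
        "http://xml.org/sax/properties/lexical-handler";

    final List<String> comments = new ArrayList<>();

    @Override
    public void comment(char[] ch, int start, int length) {
        comments.add(new String(ch, start, length));
    }

    /** Parses xml and returns the text of every comment found. */
    static List<String> commentsIn(String xml) {
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            CommentCollector handler = new CommentCollector();
            reader.setContentHandler(handler);
            // There is no setLexicalHandler(); the handler is set as a property.
            reader.setProperty(LEXICAL_HANDLER, handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.comments;
        } catch (Exception e) {
            throw new IllegalStateException(e); // keeps the sketch short
        }
    }

    public static void main(String[] args) {
        System.out.println(commentsIn("<a><!-- note -->text</a>"));
    }
}
```

A parser that does not support the property is expected to throw SAXNotRecognizedException or SAXNotSupportedException from setProperty(), so production code would normally catch those separately.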

Name changes to classes and interfaces

Some interfaces of SAX1 were made obsolete by SAX2. We recommend using the SAX2 interfaces even if you don't need the new features of SAX. Table 5.2 summarizes the name changes.

Table 5.2. Interface Changes between SAX1 and SAX2
SAX1                SAX2               CHANGES
Parser              XMLReader          Support of new interfaces
ParserFactory       XMLReaderFactory   Support of new interfaces
DocumentHandler     ContentHandler     Support of namespace
HandlerBase         DefaultHandler     Support of new interfaces
AttributeList       Attributes         Support of namespace
AttributeListImpl   AttributesImpl     Support of new interfaces
N/A                 DeclHandler        Receive declarations in DTDs
N/A                 LexicalHandler     Receive lexical information such as comments and CDATA sections
N/A                 XMLFilter          New filter interface
