5.2 Basic Tips for Using SAX

In Chapter 2, Sections 2.4 (see Figure 2.2) and 2.4.2 describe the basic concepts of SAX and its programming model. The concept of SAX is simple: a SAX parser reads an XML document from the beginning and tells the application what it finds by calling the callback methods of ContentHandler or other interfaces. However, there are some things you should know. We discuss them in this section.

5.2.1 ContentHandler

In this section, we discuss a major trap for beginning users of SAX, and then the parser feature mechanism, an important feature introduced in SAX2.

Trap of the characters() Events

The characters() method of ContentHandler confuses SAX beginners. Consider the following document:

<root> Hello, XML &amp; Java! </root>

A programmer might expect the parsing of this document to produce five events: startDocument(), startElement(), a single characters() event for the whole text, endElement(), and endDocument().
In fact, the SAX parser of Xerces produces three characters() events between startElement() and endElement(), and the SAX parser of Crimson produces eight.
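The splitting can be observed directly with a small counting handler. The following is a minimal sketch, not one of the book's samples: the class name CharactersCount is hypothetical, and it uses whatever JAXP parser is bundled with the JDK instead of naming Xerces or Crimson explicitly, so the number of events it reports may differ from the counts quoted above.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the book's samples.
class CharactersCount extends DefaultHandler {
    int events = 0;                                  // number of characters() calls
    final StringBuilder text = new StringBuilder();  // all chunks joined together

    @Override
    public void characters(char[] ch, int start, int len) {
        events++;
        text.append(ch, start, len);
    }

    /** Parses the given XML string and returns the handler that observed it. */
    static CharactersCount run(String xml) {
        try {
            CharactersCount handler = new CharactersCount();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        CharactersCount h = run("<root> Hello, XML &amp; Java! </root>");
        // The count depends on the parser; only the joined text is guaranteed.
        System.out.println(h.events + " characters() event(s): [" + h.text + "]");
    }
}
```

Whatever count is printed, the joined text is the same, which is the only property an application can rely on.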
These behaviors are not bugs in these parsers. The SAX specification allows a parser to split a text segment into several events, so take care when you write an application that processes character data. Listing 5.1 is a program that checks whether the text in an element matches a given string. It shows one way to solve the problem of split characters() events.

Listing 5.1 A correct way to process text, chap05/TextMatch.java
package chap05;

import java.io.IOException;
import java.util.Stack;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

public class TextMatch extends DefaultHandler {
    StringBuffer buffer;   // text collected since the last tag
    String pattern;        // string to look for
    Stack context;         // local names of the currently open elements

    public TextMatch(String pattern) {
        this.buffer = new StringBuffer();
        this.pattern = pattern;
        this.context = new Stack();
    }

    // Compares the buffered text with the pattern, then clears the buffer.
    protected void flushText() {
        if (this.buffer.length() > 0) {
            String text = new String(this.buffer);
            if (pattern.equals(text)) {
                System.out.print("Pattern '"+this.pattern
                                 +"' has been found around ");
                for (int i = 0; i < this.context.size(); i++) {
                    System.out.print("/"+this.context.elementAt(i));
                }
                System.out.println("");
            }
        }
        this.buffer.setLength(0);
    }

    // May be called several times for one text segment, so only buffer here.
    public void characters(char[] ch, int start, int len)
        throws SAXException {
        this.buffer.append(ch, start, len);
    }

    public void ignorableWhitespace(char[] ch, int start, int len)
        throws SAXException {
        this.buffer.append(ch, start, len);
    }

    public void processingInstruction(String target, String data)
        throws SAXException {
        // Nothing to do because PI does not affect the meaning
        // of a document.
    }

    // A start tag ends the current text segment.
    public void startElement(String uri, String local,
                             String qname, Attributes atts)
        throws SAXException {
        this.flushText();
        this.context.push(local);
    }

    // An end tag ends the current text segment as well.
    public void endElement(String uri, String local, String qname)
        throws SAXException {
        this.flushText();
        this.context.pop();
    }

    public static void main(String[] argv) {
        if (argv.length != 2) {
            System.out.println("TextMatch <pattern> <document>");
            System.exit(1);
        }
        try {
            XMLReader xreader = XMLReaderFactory.createXMLReader(
                "org.apache.xerces.parsers.SAXParser");
            xreader.setContentHandler(new TextMatch(argv[0]));
            xreader.parse(argv[1]);
        } catch (IOException ioe) {
            ioe.printStackTrace();
        } catch (SAXException se) {
            se.printStackTrace();
        }
    }
}
This program assumes that start tags and end tags split the text and that comments and processing instructions do not. Character data is saved to a buffer in the characters() method, and the matching process against the buffer is invoked on tag events. Let's run TextMatch against the XML document shown in Listing 5.2.

Listing 5.2 A sample document for TextMatch, chap05/match.xml
<?xml version="1.0" encoding="us-ascii"?>
<root>
<movie>A 3x3 Matrix</movie>
<book>XM<!-- -->L &amp; Jav<?target?>a</book>
</root>
R:\samples>java chap05.TextMatch "XML & Java" file:./chap05/match.xml
Pattern 'XML & Java' has been found around {}root/{}book
TextMatch finds "XML & Java" in the book element, the character data of which is split by a comment, an entity reference, and a processing instruction.

Parser Features

The SAX2 specification defines two standard features: namespace (http://xml.org/sax/features/namespaces) and namespace-prefix (http://xml.org/sax/features/namespace-prefixes). The default feature settings of SAX2-compliant parsers are as follows.
The default settings have these meanings.
If the namespace feature is turned off, the availability of namespace URIs and local names is implementation-dependent, start/endPrefixMapping() are not called, and an Attributes instance contains namespace declarations. If the namespace-prefix feature is turned on, qualified names are available, and an Attributes instance contains namespace declarations. Table 5.1 shows a summary of these features.
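The effect of the namespace-prefix feature on Attributes can be checked with a short program. This is a sketch under stated assumptions: the class FeatureDemo and its method are hypothetical names, the parser is the JAXP parser bundled with the JDK, and the feature URIs are the standard SAX2 identifiers for the two features discussed above.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the book's samples.
class FeatureDemo {
    static final String NAMESPACES =
        "http://xml.org/sax/features/namespaces";
    static final String NAMESPACE_PREFIXES =
        "http://xml.org/sax/features/namespace-prefixes";

    /** Returns how many attributes startElement() reports for the root element. */
    static int rootAttributeCount(String xml, boolean namespacePrefixes) {
        final int[] count = { -1 };
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            // Set both features explicitly instead of relying on defaults.
            reader.setFeature(NAMESPACES, true);
            reader.setFeature(NAMESPACE_PREFIXES, namespacePrefixes);
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String local,
                                         String qname, Attributes atts) {
                    if (count[0] < 0) {
                        count[0] = atts.getLength();  // root element only
                    }
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        String xml = "<p:a xmlns:p='urn:example' id='1'/>";
        System.out.println("namespace-prefix off: "
            + rootAttributeCount(xml, false) + " attribute(s)");
        System.out.println("namespace-prefix on:  "
            + rootAttributeCount(xml, true) + " attribute(s)");
    }
}
```

With the namespace-prefix feature on, the xmlns:p declaration should be reported as an attribute in addition to id; with it off, only id should remain.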
Basically, you need not disable the namespace feature. Turn it off only when the slight overhead of this feature is unacceptable. Turn on the namespace-prefix feature if you need qualified names or namespace declarations as attributes.

According to the JAXP specification, a SAX parser created by SAXParserFactory is not namespace-aware by default. In the JAXP implementation of Xerces, SAXParserFactory.setNamespaceAware() affects the setting of the namespace feature. As for Crimson in the JAXP 1.1 reference implementation, SAXParserFactory.setNamespaceAware() seems to affect neither the namespace feature nor the namespace-prefix feature. We recommend that you always get an XMLReader instance by using SAXParser.getXMLReader() and that you set these features explicitly.

5.2.2 Using and Writing SAX Filters

A SAX filter receives SAX events from a SAX parser, modifies these events, and forwards them to a handler, as shown in Figure 5.1. As far as the SAX parser is concerned, the SAX filter can be seen as a handler. On the other hand, as far as the handler is concerned, the SAX filter can be seen as a SAX parser.

Figure 5.1. SAX filter
The SAX2 specification provides the XMLFilter interface for SAX filters. This interface is derived from XMLReader, the interface for SAX parsers. Typical uses of SAX filters are the following.

Modifying XML documents

When you write a program for modifying XML documents, you might want to reuse XMLSerializer for serializing SAX events to an XML document. Then you only have to write a SAX filter that modifies SAX events and insert the filter between a SAX parser and XMLSerializer.

Convenience for the next handler

You can simplify handlers for complicated tasks by creating preprocessing SAX filters. For example, suppose that you want to write a SAX handler that supports both <book title="foobar">...</book> and <book><title>foobar</title>...</book>. The SAX handler becomes simpler if you write a filter that canonicalizes events to one of the two formats. Another example is the characters() trap discussed in Section 5.2.1. You can avoid the trap by implementing a SAX filter that concatenates consecutive characters() events.

Control of event flow

Suppose that you want to use two handlers for a single XML document at the same time. Unfortunately, you cannot register two or more handlers of the same type with one XMLReader instance. So you either implement a handler as a SAX filter (see Figure 5.2) or make a filter that accepts the registration of two handlers and duplicates the input events (see Figure 5.3).

Figure 5.2. A handler performs as a filter.
Figure 5.3. A filter duplicates events.
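The arrangement in Figure 5.3 can be sketched as a filter that forwards each event to two handlers. The sketch below is illustrative only: TeeFilter and its nested Counter class are hypothetical names, the filter extends the SAX helper class XMLFilterImpl, it uses the JAXP parser bundled with the JDK, and it duplicates only element and text events (a complete version would forward the remaining ContentHandler callbacks in the same way).

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: sends element and text events to a second handler too.
class TeeFilter extends XMLFilterImpl {
    private final ContentHandler second;

    TeeFilter(XMLReader parent, ContentHandler second) {
        super(parent);
        this.second = second;
    }

    @Override
    public void startElement(String uri, String local, String qname,
                             Attributes atts) throws SAXException {
        super.startElement(uri, local, qname, atts);  // registered handler, if any
        second.startElement(uri, local, qname, atts);
    }

    @Override
    public void endElement(String uri, String local, String qname)
        throws SAXException {
        super.endElement(uri, local, qname);
        second.endElement(uri, local, qname);
    }

    @Override
    public void characters(char[] ch, int start, int len) throws SAXException {
        super.characters(ch, start, len);
        second.characters(ch, start, len);
    }

    /** Trivial handler used for the demonstration: counts start tags. */
    static class Counter extends DefaultHandler {
        int elements = 0;

        @Override
        public void startElement(String uri, String local, String qname,
                                 Attributes atts) {
            elements++;
        }
    }

    /** Parses xml once and returns the element counts seen by both handlers. */
    static int[] countBoth(String xml) {
        try {
            Counter a = new Counter();
            Counter b = new Counter();
            XMLReader parser =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            XMLReader filter = new TeeFilter(parser, b);
            filter.setContentHandler(a);
            filter.parse(new InputSource(new StringReader(xml)));
            return new int[] { a.elements, b.elements };
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        int[] counts = countBoth("<a><b/><c/></a>");
        System.out.println(counts[0] + " / " + counts[1]);
    }
}
```

Passing the second handler to the constructor keeps the standard setContentHandler() method for the first one, so the filter can still be used wherever an XMLReader is expected.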
Using Filters

A typical code fragment for using a SAX parser follows.

XMLReader parser = XMLReaderFactory.createXMLReader();
// or parser = new SAXParser() if you use Xerces.
parser.setContentHandler(handler);
parser.parse(...);

If you want a filter between the parser and the handler, modify this code fragment to this:

XMLReader parser = ...
XMLFilter filter = new SomethingFilter();
filter.setParent(parser);
filter.setContentHandler(handler);
filter.parse(...);

or to this:

// If the constructor for the filter takes a parent
// (parser or filter) as a parameter.
XMLReader parser = ...
XMLReader filter = new SomethingFilter(parser);
filter.setContentHandler(handler);
filter.parse(...);

The following two code fragments use a parser and two filters. First fragment:

XMLReader parser = ...
XMLFilter filter1 = new SomethingFilter();
filter1.setParent(parser);
XMLFilter filter2 = new OtherFilter();
filter2.setParent(filter1);
filter2.setContentHandler(handler);
filter2.parse(...);

Second fragment:

XMLReader parser = ...
XMLReader filter2 = new OtherFilter(new SomethingFilter(parser));
filter2.setContentHandler(handler);
filter2.parse(...);

These code fragments make an event chain, as shown in Figure 5.4.

Figure 5.4. A parser, two filters, and a handler
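As a runnable instance of the setParent() pattern, the sketch below places a text-joining filter between a parser and a handler, following the earlier suggestion of a filter that concatenates consecutive characters() events. The names TextJoinFilter and countTextEvents are hypothetical, the filter extends the SAX helper class XMLFilterImpl, and the parser is the JAXP parser bundled with the JDK. Unlike TextMatch, this filter also flushes pending text before a processing instruction so that the order of forwarded events is preserved.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLFilter;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: merges consecutive characters() events into one.
class TextJoinFilter extends XMLFilterImpl {
    private final StringBuilder pending = new StringBuilder();

    // Forwards the buffered text, if any, as a single characters() event.
    private void flush() throws SAXException {
        if (pending.length() > 0) {
            char[] chars = pending.toString().toCharArray();
            pending.setLength(0);
            super.characters(chars, 0, chars.length);
        }
    }

    @Override
    public void characters(char[] ch, int start, int len) {
        pending.append(ch, start, len);  // hold back until the next boundary
    }

    @Override
    public void startElement(String uri, String local, String qname,
                             Attributes atts) throws SAXException {
        flush();
        super.startElement(uri, local, qname, atts);
    }

    @Override
    public void endElement(String uri, String local, String qname)
        throws SAXException {
        flush();
        super.endElement(uri, local, qname);
    }

    @Override
    public void processingInstruction(String target, String data)
        throws SAXException {
        flush();
        super.processingInstruction(target, data);
    }

    /** Returns the number of characters() events seen behind the filter. */
    static int countTextEvents(String xml) {
        final int[] n = { 0 };
        try {
            XMLReader parser =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            XMLFilter filter = new TextJoinFilter();
            filter.setParent(parser);
            filter.setContentHandler(new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int len) {
                    n[0]++;
                }
            });
            filter.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return n[0];
    }

    public static void main(String[] args) {
        System.out.println(
            countTextEvents("<root> Hello, XML &amp; Java! </root>"));
    }
}
```

A handler placed behind this filter can treat each characters() call as a complete text segment between two tags, however the underlying parser chooses to split it.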
Writing Filters

The XMLFilter interface is derived from the XMLReader interface by adding getParent() and setParent(). XMLFilter is merely an interface definition, and it does not help us to implement a filter. As a base class for implementing filters, SAX provides the XMLFilterImpl class. As demonstrated earlier, if a filter constructor takes an XMLReader as an argument, the application code becomes simpler. Listing 5.3 is an example of a SAX filter. It replaces elements like <email>foo@example.com</email> with <uri>mailto:foo@example.com</uri>.

Listing 5.3 An example of a SAX filter, chap05/MailFilter.java
package chap05;

import org.apache.xml.serialize.OutputFormat;
import org.apache.xml.serialize.XMLSerializer;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;
import org.xml.sax.helpers.XMLReaderFactory;

/**
 * <email>foo@bar.test</email>
 *  -> <uri>mailto:foo@bar.test</uri>
 */
public class MailFilter extends XMLFilterImpl {
    public MailFilter(XMLReader parent) {
        super(parent);
    }

    /**
     * Replace `email' with `uri',
     * and make a characters event for "mailto:".
     */
    public void startElement(String uri, String local, String qname,
                             Attributes atts)
        throws SAXException {
        ContentHandler ch = this.getContentHandler();
        if (ch == null)
            return;     // no handler is registered; nothing to forward
        if (uri.length() == 0 && local.equals("email")) {
            ch.startElement("", "uri", "uri", atts);
            String mailto = "mailto:";
            ch.characters(mailto.toCharArray(), 0, mailto.length());
        } else {
            ch.startElement(uri, local, qname, atts);
        }
    }

    /**
     * Replace `email' with `uri'.
     */
    public void endElement(String uri, String local, String qname)
        throws SAXException {
        ContentHandler ch = this.getContentHandler();
        if (ch == null)
            return;     // no handler is registered; nothing to forward
        if (uri.length() == 0 && local.equals("email")) {
            ch.endElement("", "uri", "uri");
        } else {
            ch.endElement(uri, local, qname);
        }
    }

    public static void main(String[] argv) throws Exception {
        OutputFormat format
            = new OutputFormat("xml", "UTF-8", false);
        format.setPreserveSpace(true);
        // XMLSerializer acts as the handler at the end of the chain.
        ContentHandler handler = new XMLSerializer(System.out, format);
        XMLReader parser = XMLReaderFactory.createXMLReader(
            "org.apache.xerces.parsers.SAXParser");
        XMLReader filter = new MailFilter(parser);
        filter.setContentHandler(handler);
        filter.parse(argv[0]);
        System.out.println("");
    }
}
In the overriding methods of your filter, remember to forward the (modified) SAX events to the appropriate methods of the registered handler. Note that the getXxxHandler() methods may return null, so you have to check whether the next handler is null before calling it. To see how this program works, type the following:

R:\samples>type chap05\addresses.xml
<?xml version="1.0" encoding="us-ascii"?>
<addresses>
<email>John.Doe@bar.test</email>
<email>George.Smith@bar.test</email>
<email>Anna.Millers@bar.test</email>
</addresses>

R:\samples>java chap05.MailFilter file:./chap05/addresses.xml
<?xml version="1.0" encoding="UTF-8"?>
<addresses>
<uri>mailto:John.Doe@bar.test</uri>
<uri>mailto:George.Smith@bar.test</uri>
<uri>mailto:Anna.Millers@bar.test</uri>
</addresses>

5.2.3 New Features of SAX2

In this section, we summarize the new features of SAX2 for developers who have experience with SAX1.

Namespace support

SAX1 was finalized before the "Namespaces in XML" specification became a W3C Recommendation, so SAX1 has no namespace support. With SAX2, applications can receive namespace information as described in Section 5.2.1.

SAX filters

SAX1 has no interface for filters, though we can write filters without such an interface. SAX2 introduced a standard XMLFilter interface, which makes writing and using filters easier.

More information about an XML document

With SAX1, applications cannot learn anything about comments, CDATA sections, and many types of declarations in DTDs. SAX2 supports them with new interfaces.

Feature/property mechanism

SAX2 provides a generic mechanism to enable or disable the features of SAX parsers and to set or get extra information about SAX parsers.

Name changes to classes and interfaces

Some interfaces of SAX1 were made obsolete by SAX2. We recommend using the SAX2 interfaces even if you don't need the new features of SAX2. Table 5.2 summarizes the name changes.
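Two of the SAX2 additions summarized in this section, the property mechanism and access to comments, can be tried together. The sketch below is an illustration under stated assumptions: CommentLister is a hypothetical class name, the parser is the JAXP parser bundled with the JDK, and the handler is registered through the standard SAX2 lexical-handler property, using DefaultHandler2 from the org.xml.sax.ext package as a convenience base class that implements LexicalHandler.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical demo class: collects the comments found in a document.
class CommentLister extends DefaultHandler2 {
    static final String LEXICAL_HANDLER =
        "http://xml.org/sax/properties/lexical-handler";

    final List<String> comments = new ArrayList<>();

    // Callback of LexicalHandler, one of the interfaces added by SAX2.
    @Override
    public void comment(char[] ch, int start, int length) {
        comments.add(new String(ch, start, length));
    }

    /** Parses xml and returns the text of every comment it contains. */
    static List<String> list(String xml) {
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            CommentLister handler = new CommentLister();
            reader.setContentHandler(handler);
            // Lexical events are requested through a property, not a setter.
            reader.setProperty(LEXICAL_HANDLER, handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.comments;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(list("<a><!-- first --><b/><!--second--></a>"));
    }
}
```

A parser that does not support the property is expected to reject the setProperty() call with an exception, which this sketch simply wraps and rethrows.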