7.3 Pros and Cons of XSLT, XPath, DOM, and SAX

Earlier in this chapter, we described XPath and XSLT and showed how to use DOM and SAX. You now have four ways to write a program that processes XML documents. Which one is best for which kind of program in terms of rapid development, easy maintenance, and time and space efficiency? This section discusses that question.

7.3.1 Execution Efficiency

In general, time and space efficiency in Java is hard to discuss because it depends heavily on many factors, such as the developer's programming style, the behavior of the just-in-time compiler and the garbage collector, and the implementation of the libraries. There are also other factors specific to XML processing.
As you can see, time and space efficiency is difficult to predict before you develop an application. Therefore, if you are not experienced in XML and Java programming, we recommend that you focus on how rapidly you can develop a program and how easily you can maintain it rather than on how fast or compact it is. If your program is designed simply and flexibly, it is relatively easy to switch to another approach later. The rest of this section focuses on development efficiency rather than time and space efficiency.

7.3.2 Development Efficiency

Here are some questions you should consider when you want to improve development efficiency.
Because quantitative analysis is not always easy, we discuss the pros and cons for development efficiency through two typical cases of XML processing. We first review a case in which the combination of XPath and DOM works better than DOM alone for traversing and modifying an XML document. Then we compare SAX and DOM with XSLT for translating one XML document into another. We hope these discussions give you a clear view of development efficiency.

Combination of XPath and DOM

A DOM tree faithfully represents the tree structure of an XML document. One approach to finding a target element in a DOM tree is to write a recursive program or to use the org.w3c.dom.traversal package, but neither is very simple. Combining XPath with DOM makes such a program more compact and therefore easier to read. This approach may be slower than hand-written traversal, but it is worth taking when you don't need to traverse the tree many times or when performance is not critical.

Let's look at the XML document in Listing 7.18. It is Section 1.3.1 of the first edition of this book in SmartDoc format.

Listing 7.18 SmartDoc example showing keywords, chap07/data/XMLandJava_1_3_1.sdoc
<?xml version='1.0' encoding="UTF-8"?>
<doc>
<head>
<author locale="en">Hiroshi Maruyama</author>
<date><time/></date>
</head>
<body>
<subsection>
<title locale="en">1.3.1 Background of XML</title>
<p locale="en">
HyperText Markup Language (<em>HTML</em>) has been widely used in
describing Web contents since it was defined in 1992. It has a simple
syntax, it is easy to create multimedia documents by incorporating
images, audio and so on, and it allows many other documents to be
linked together. With free browsers being deployed universally, it
has become one of the primary means to deliver information via the
Internet.
</p>
<p locale="en">
<em>HTML</em> has been enhanced many times in its history. Because
HTML has a fixed set of tags, the only way to add new functionality
into HTML is to bring it to <em>W3C</em> and put it on the discussion
table. This may be a lengthy process, and not all the tags are
general enough to be included in HTML.
</p>
<p locale="en">
With Extensible Markup Language (<em>XML</em>), one can define his or
her own tag set by means of document type definition (<em>DTD</em>).
At this moment, not many Web pages are authored in XML, but many
emerging proposals in the fields of document processing, meta
contents, database, and messaging are based on XML.
</p>
</subsection>
</body>
</doc>
In this document, a keyword is enclosed in an em tag only the first time it appears in each paragraph (a p element). Using this feature, we can make a keyword index. First, we add an id attribute to each em element in the input XML document. Each id attribute carries a sequence number. For example, the em element in the first paragraph is changed as follows:
<p locale="en">
HyperText Markup Language (<em id="id-0">HTML</em>) has been widely
used in ...
</p>
Listing 7.19 is a keyword index file. Each keyword in the input XML document is associated with a keyword element that holds the keyword itself in its id attribute. A keyword element has one or more ref elements, one for each occurrence of the keyword enclosed in an em tag. The href attribute of a ref element refers to the id attribute of the corresponding em element in the document.

Listing 7.19 Keyword index file, XMLandJava_1_3_1-glossary.xml
<?xml version="1.0" encoding="UTF-8"?>
<glossary>
<keyword id="HTML">
<ref href="#id-0"/>
<ref href="#id-1"/>
</keyword>
<keyword id="W3C">
<ref href="#id-2"/>
</keyword>
<keyword id="XML">
<ref href="#id-3"/>
</keyword>
<keyword id="DTD">
<ref href="#id-4"/>
</keyword>
</glossary>
We use MakeSmartDocGlossary.java to perform the process just described. The following command displays the input document after it is processed, the keyword index, and a list of XPath expressions that refer to the em elements in the input document.

R:\examples\>java chap07.MakeSmartDocGlossary file:./chap07/data/XMLandJava_1_3_1.xml

Here are the processing steps of MakeSmartDocGlossary.java.
The following code fragment performs step 3. It finds all the em elements in sdoc by evaluating the XPath expression //em. In this particular case, you could get the same result with the Element#getElementsByTagNameNS() method of the DOM API, but XPath is more appropriate if you consider the flexibility and extensibility of the program.

// Finds all "em" elements from the input DOM-tree
NodeIterator ni = XPathAPI.selectNodeIterator(sdoc, "//em");

The following three code fragments perform step 4. The first code fragment is a loop to process each em element found in step 3.
int nextID = 0;
Node node;
// For each node found
while ((node = ni.nextNode()) != null) {
    ...
}
Then, we add an id attribute to each em element. This is a good example of the combination of DOM and XPath.
// Sets an id for the node
String id = "id-"+ nextID++;
((Element)node).setAttribute("id", id);
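The fragments above rely on the XPathAPI helper class (presumably the one that ships with Xalan), which is not part of the public JDK API. As a rough equivalent, the following sketch performs the same "find every em and number it" step with the standard javax.xml.xpath package. The class name AssignEmIds, the parse() helper, and the inline sample document are assumptions made for illustration; they are not taken from MakeSmartDocGlossary.java.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class AssignEmIds {
    // Parses an XML string into a DOM tree (helper for this sketch only)
    public static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Adds id="id-N" to every em element found by //em and returns
    // how many elements were tagged
    public static int assignIds(Document sdoc) {
        try {
            XPath xp = XPathFactory.newInstance().newXPath();
            NodeList ems = (NodeList)
                    xp.evaluate("//em", sdoc, XPathConstants.NODESET);
            int nextID = 0;
            for (int i = 0; i < ems.getLength(); i++) {
                ((Element) ems.item(i)).setAttribute("id", "id-" + nextID++);
            }
            return nextID;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document sdoc = parse("<doc><p>(<em>HTML</em>) and <em>W3C</em></p>"
                + "<p><em>HTML</em> again</p></doc>");
        System.out.println(assignIds(sdoc) + " em elements tagged");
    }
}
```

Changing the selection later (for example, to em elements inside p only) would mean editing the expression string and nothing else, which is the flexibility argument made above.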
The following code fragment extracts a keyword from each em element and checks whether its associated keyword element is already registered in gdoc. We use XPath for the check because it would be cumbersome with the DOM API alone. If the keyword is not found, we register it in gdoc. Finally, we create a ref element with a reference to the em element and add it to gdoc.
// Gets the keyword (the text of the first child of the em element)
String keyword = node.getFirstChild().getNodeValue();
// Checks if the keyword is already registered
// (assumes the keyword contains no single quote)
String xpath = ("//keyword[@id=normalize-space('" +
                keyword + "')]");
Element elemKeyword =
    (Element)XPathAPI.selectSingleNode(gdoc, xpath);
if (elemKeyword == null) {
    // If it has not been registered yet,
    // registers it
    elemKeyword = gdoc.createElement("keyword");
    elemKeyword.setAttribute("id", keyword);
    groot.appendChild(elemKeyword);
}
// Adds a ref element that points back to the em element
Element ref = gdoc.createElement("ref");
ref.setAttribute("href", "#" + id);
elemKeyword.appendChild(ref);
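Building the XPath expression by string concatenation works for the keywords in Listing 7.18, but it would break for a keyword that contains a single quote. The following sketch shows one possible alternative, assuming the standard javax.xml.xpath package rather than XPathAPI: the keyword is bound to an XPath variable ($kw) through an XPathVariableResolver. The class name GlossaryBuilder and its methods are hypothetical, and the whitespace normalization in Java is only an approximation of normalize-space().

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class GlossaryBuilder {
    private final Document gdoc = newDocument();
    private final Element groot = gdoc.createElement("glossary");

    public GlossaryBuilder() {
        gdoc.appendChild(groot);
    }

    private static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns the keyword element whose id equals key, or null if none exists
    private Element findKeyword(String key) {
        XPath xp = XPathFactory.newInstance().newXPath();
        // Binds $kw to the keyword instead of pasting it into the expression
        xp.setXPathVariableResolver(
                name -> "kw".equals(name.getLocalPart()) ? key : null);
        try {
            return (Element) xp.evaluate("//keyword[@id=$kw]",
                    gdoc, XPathConstants.NODE);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Records one occurrence of a keyword; creates its keyword element
    // the first time the keyword is seen
    public void addReference(String keyword, String id) {
        // Trims and collapses whitespace, roughly like normalize-space()
        String key = keyword.trim().replaceAll("\\s+", " ");
        Element elemKeyword = findKeyword(key);
        if (elemKeyword == null) {
            elemKeyword = gdoc.createElement("keyword");
            elemKeyword.setAttribute("id", key);
            groot.appendChild(elemKeyword);
        }
        Element ref = gdoc.createElement("ref");
        ref.setAttribute("href", "#" + id);
        elemKeyword.appendChild(ref);
    }

    public Document document() {
        return gdoc;
    }

    public static void main(String[] args) {
        GlossaryBuilder g = new GlossaryBuilder();
        g.addReference("HTML", "id-0");
        g.addReference("HTML", "id-1");
        g.addReference("W3C", "id-2");
        System.out.println(g.document()
                .getElementsByTagName("keyword").getLength() + " keywords");
    }
}
```

Storing the normalized form in the id attribute keeps the lookup and the registration consistent, so a keyword with stray whitespace is not registered twice.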
The following code fragment performs step 7. It extracts all the references from gdoc and finds the em elements they refer to. We use XPath both to find the href attributes of the ref elements and to find the matching em elements.
// Finds the href attributes of all "ref" elements in the glossary DOM-tree
ni = XPathAPI.selectNodeIterator(gdoc, "//keyword/ref/@href");
// For each node found
while ((node = ni.nextNode()) != null) {
    // Strips the leading "#" from the reference
    String id = node.getNodeValue().substring(1);
    // Finds the "em" element whose id matches the reference
    String xpath = "//em[@id='" + id + "']";
    Node node2 = XPathAPI.selectSingleNode(sdoc, xpath);
    // Prints its XPath expression
    System.out.println("The ID '" + id +
                       "' found at " + getXPath(node2));
}
The following is the full source code of getXPath().
// Creates XPath expression from Node
public static String getXPath(Node node) {
    Node localNode = node;
    switch (localNode.getNodeType()) {
    case Node.ELEMENT_NODE:
        // Finds all previous nodes that have the same node name
        int index = 1;
        String nodeName = node.getNodeName();
        while ((localNode = localNode.getPreviousSibling()) != null)
            if (localNode.getNodeName().equals(nodeName))
                index++;
        return (getXPath(node.getParentNode()) + "/" +
                node.getNodeName() + "[" + index + "]");
    case Node.TEXT_NODE:
        return (getXPath(node.getParentNode()) + "/text()");
    case Node.DOCUMENT_NODE:
        return "";
    default:
        throw new UnknownError("Unexpected node type");
    }
}
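One way to check getXPath() is to evaluate the expression it returns and confirm that it selects the original node again. The sketch below does this with the standard javax.xml.xpath package. The class name XPathRoundTrip, the roundTrips() and parse() helpers, and the sample document are made up for illustration, and the copy of getXPath() here is reduced to element and document nodes.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class XPathRoundTrip {
    // Same idea as getXPath() above, for element and document nodes only
    public static String getXPath(Node node) {
        if (node.getNodeType() == Node.DOCUMENT_NODE) {
            return "";
        }
        // Counts the preceding siblings that share this node's name
        int index = 1;
        for (Node n = node.getPreviousSibling(); n != null;
                n = n.getPreviousSibling()) {
            if (n.getNodeName().equals(node.getNodeName())) {
                index++;
            }
        }
        return getXPath(node.getParentNode()) + "/"
                + node.getNodeName() + "[" + index + "]";
    }

    // Returns true if evaluating getXPath(node) selects that same node
    public static boolean roundTrips(Node node) {
        try {
            Node found = (Node) XPathFactory.newInstance().newXPath()
                    .evaluate(getXPath(node), node.getOwnerDocument(),
                              XPathConstants.NODE);
            return found != null && found.isSameNode(node);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Parses an XML string into a DOM tree (helper for this sketch only)
    public static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<doc><body><p>a</p>"
                + "<p><em>x</em> and <em>y</em></p></body></doc>");
        Node secondEm = doc.getElementsByTagName("em").item(1);
        System.out.println(getXPath(secondEm)); // /doc[1]/body[1]/p[2]/em[2]
        System.out.println(roundTrips(secondEm));
    }
}
```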
The getXPath() method builds an XPath expression to uniquely identify a node in a DOM tree. An example of such an XPath expression is:

/doc[1]/body[1]/subsection[1]/p[2]/em[2]

The getXPath() method is a typical example of traversing a DOM tree recursively. Because more than one XPath expression can identify the same node in a DOM tree, there is no standard library to perform such an operation even though it is often needed. This implementation traverses from the given node up to the document root. To determine the position of the target node, it scans all the preceding siblings of the node at each level. Therefore, the performance may not be very good.

Another implementation example is IndexCreator, described in Section 11.5.2. It recursively traverses a DOM tree from a parent down to its child nodes. While processing a parent node, it uses an XPath expression to uniquely identify a child node. This approach is efficient if you need to build XPath expressions for all the nodes in a DOM tree.

Now you have learned that a combination of the XPath API and the DOM API allows you to traverse a DOM tree simply and flexibly.

Comparison of SAX/DOM and XSLT for XML Document Conversion

A SAX event handler can generate another XML document from input SAX events. In this book, you have learned two ways to convert an XML document to another using SAX. One way is to use a SAX filter to convert SAX events (shown in Section 5.2.2). The other way is to use a SAX event converter based on XSLT. Here we discuss which way is better in which situations.

In general, a SAX event handler extracts document contexts from input SAX events and saves them in instance variables. As dependencies between the contexts become complex, a SAX event handler becomes hard to implement. For example, as shown in Figure 7.6, if element A depends on elements B, C, and D, the context that the SAX event handler has to manage becomes quite complicated.

Figure 7.6. Context dependencies in a SAX event handler
A SAX filter has to generate SAX events to be passed on to the next event handler in a pipeline. For example, if the processing of element A depends on element D, the handler must hold all the events in a queue until element D appears. It may need more than one queue if there are other dependencies, and the program may become unrealistically complicated and hard to maintain.

Let's reconsider the MailFilter example in Section 5.2.2, which converts this:

<email>foo@bar.test</email>

to this:

<uri>mailto:foo@bar.test</uri>

In this example, the only dependency of the email element is its child text node. This dependency is easy to handle because the dependent nodes appear very close together.

Another simple example shown in Section 5.2.2 converts between <book title="foobar">...</book> and <book><title>foobar</title>...</book>. The conversion from the former to the latter can be performed right after receiving the event for the book element because it does not depend on any other element. The reverse direction has a dependency: the book element cannot be converted until the title event is received.

The final example is a SAX filter program that sorts sibling elements alphabetically by their attribute values. The program has to save all the events of the sibling elements in a buffer and then sort the elements in the buffer. It may become complicated if each sibling element has child elements. This example suggests the limits of SAX for XML document conversion.

Because SAX is known as the faster API, it is worth using if the conversion rule is simple enough and is not subject to change. The DOM API is appropriate for complicated cases in which the SAX API is hard to use.
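To make the simple end of this spectrum concrete, here is a minimal sketch of a filter that performs the email-to-uri conversion. It is not the MailFilter from Section 5.2.2; the class name MailToFilter and the convert() helper are hypothetical, and the sketch assumes that email elements are not in a namespace and that the JDK's identity Transformer is used to serialize the filtered events.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

public class MailToFilter extends XMLFilterImpl {
    public MailToFilter(XMLReader parent) {
        super(parent);
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        if ("email".equals(qName)) {
            // Renames the element and emits the "mailto:" prefix at once;
            // the original text events then pass through unchanged
            super.startElement(uri, "uri", "uri", atts);
            char[] prefix = "mailto:".toCharArray();
            super.characters(prefix, 0, prefix.length);
        } else {
            super.startElement(uri, localName, qName, atts);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        if ("email".equals(qName)) {
            super.endElement(uri, "uri", "uri");
        } else {
            super.endElement(uri, localName, qName);
        }
    }

    // Runs the filter over an XML string and serializes the resulting events
    public static String convert(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader filter =
                    new MailToFilter(factory.newSAXParser().getXMLReader());
            Transformer identity =
                    TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(
                    new SAXSource(filter, new InputSource(new StringReader(xml))),
                    new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(
                convert("<card><email>foo@bar.test</email></card>"));
    }
}
```

No queue or instance variable is needed here because the output for the start tag does not depend on anything that arrives later. The title-to-attribute direction of the book example would need to buffer the book start event until the title text has been seen.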
If memory efficiency is not a serious issue, it is a good idea, as we described in this section, to use the combination of the XPath API and the DOM API for traversing a DOM tree and producing the resulting XML document or SAX events.

We have not raised any such issues with XSLT because it is a high-level language designed for these conversions. It is powerful enough to perform a conversion, a sort, or an evaluation even if there are complex dependencies between the elements selected by XPath expressions. The advantage of using XSLT is that you can write a conversion program as a reasonably simple stylesheet, which keeps the program flexible in the face of future specification changes or functional enhancements. With XSLT, for example, a change in the position of a dependent element requires only a change to the corresponding XPath expression, which is much easier than changing DOM or SAX code.

Another advantage is that a stylesheet can be generated by a tool or a runtime library because a stylesheet is itself an XML document. One example is the Web services scenario described in Chapter 13. Suppose that a service provider publishes a WSDL, and a service requester has an interface based on a different but similar WSDL. In such a case, it may be possible to generate a stylesheet to convert a service requester's WSDL to a service provider's WSDL so that they can communicate with each other. Another example is the data binding scenario described in Chapter 8. It may be possible to generate a stylesheet for unidirectional or bidirectional conversion between two different XML document fragments mapped from Java objects. These are promising areas in which XSLT can play an important role in the future.

The limitation of XSLT lies in writing complicated logic other than pattern matching. As the logic becomes complicated, a stylesheet becomes complicated and hard to read.[10] XSLT uses XPath expressions in search patterns. This is another limitation of XSLT because XPath cannot express computations such as date calculations and currency conversion. In such cases, DOM or SAX is more appropriate than XSLT.
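As an illustration of the point about sorting, the sibling-sorting task that is awkward for a SAX filter can be sketched in XSLT with a single xsl:sort instruction. The stylesheet, the element names list and item, the name attribute, and the class name SortSiblings are all assumptions made for this sketch; the stylesheet is run through the standard javax.xml.transform API.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class SortSiblings {
    // Copies the item children of list, ordered by their name attribute;
    // child elements of each item are carried along by xsl:copy-of
    static final String XSL =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='/list'>"
        + "<list>"
        + "<xsl:for-each select='item'>"
        + "<xsl:sort select='@name'/>"
        + "<xsl:copy-of select='.'/>"
        + "</xsl:for-each>"
        + "</list>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet to an XML string and returns the result
    public static String sort(String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)),
                        new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sort("<list><item name='pear'/>"
                + "<item name='apple'><note>x</note></item>"
                + "<item name='fig'/></list>"));
    }
}
```

Sorting by a different attribute, or in descending order, would be a change to the xsl:sort element only, whereas the buffering logic of a SAX filter would have to be revisited.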
In this section, we discussed the pros and cons of using DOM, SAX, XPath, and XSLT in terms of execution and development efficiency.