3.5 Handling Whitespace

In previous sections, we showed programming examples that create and serialize XML documents. When you create and serialize XML documents, a few issues deserve serious consideration. One of them is the handling of whitespace.

According to the XML 1.0 Recommendation, whitespace is one or more space (#x20) characters, carriage returns, line feeds, or tabs. Whitespace is used to delimit tokens, and many production rules in the XML 1.0 Recommendation explicitly include S, the nonterminal symbol that represents whitespace.

In some applications, whitespace is meaningful (for example, in poetry and program source code). However, whitespace is also used simply to improve the readability of XML documents. In that case, the whitespace itself carries no meaning.

To demonstrate how much appropriate whitespace improves readability, we removed all the nonessential whitespace from our example in Chapter 2, department.xml. The result is a single long line, shown as the file departmentNoWS.xml in Listing 3.6 (because of the page-width limitation, the line is wrapped, but notice that there are no newline characters between the lines). Obviously, this is much less readable than the original, which contains appropriate newlines and indentation. As we have shown in previous sections, when an XML document is generated by a program with the DOM API, the generated document contains no whitespace added for readability.

Listing 3.6 XML document without indentation, chap03/departmentNoWS.xml

<?xml version="1.0"?>
<department><employee id="J.D"><name>John Doe</name><email>John.Doe@foo.com</email></employee><employee id="B.S">
<name>Bob Smith</name><email>Bob.Smith@foo.com</email></employee>
<employee id="A.M"><name>Alice Miller</name><url href="http://www.foo.com/~amiller/"/></employee></department>

An XML document without whitespace is difficult for humans to read. But what about computer programs?
Figure 3.2 shows the DOM trees for department.xml and departmentNoWS.xml using the visualization tool TreeWalker, which is included in the Xerces distribution package.

Figure 3.2. Visualization of an XML document (department.xml on the left, departmentNoWS.xml on the right)
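The difference between the two trees can also be observed in code. The following is a minimal sketch, not part of the book's sample code: it parses shortened stand-ins for department.xml and departmentNoWS.xml from strings with a non-validating JAXP parser and inspects the children of the root element. The class name FirstChildDemo and the helper firstChildElement() are hypothetical names introduced only for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class FirstChildDemo {
    // Shortened stand-ins (assumptions) for department.xml and departmentNoWS.xml
    static final String INDENTED =
        "<department>\n"
        + "  <employee id=\"J.D\">\n"
        + "    <name>John Doe</name>\n"
        + "  </employee>\n"
        + "</department>";
    static final String COMPACT =
        "<department><employee id=\"J.D\"><name>John Doe</name>"
        + "</employee></department>";

    // Parses an XML string and returns its root element
    static Element parseRoot(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Hypothetical helper: returns the first child that is an Element,
    // skipping Text nodes (including whitespace-only ones), or null
    static Element firstChildElement(Node parent) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                return (Element) n;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Element indented = parseRoot(INDENTED);
        Element compact = parseRoot(COMPACT);
        // The indented root has whitespace Text nodes around its child element
        System.out.println(indented.getChildNodes().getLength()); // prints 3
        System.out.println(compact.getChildNodes().getLength());  // prints 1
        // getFirstChild() returns the whitespace Text node, not the element
        System.out.println(
            indented.getFirstChild().getNodeType() == Node.TEXT_NODE); // prints true
        System.out.println(firstChildElement(indented).getTagName()); // prints employee
    }
}
```

Skipping by node type, rather than by testing whether the text is blank, keeps the helper independent of any DTD. It does mean the helper also skips non-blank text, which is acceptable only when the caller wants an element.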
Note that in the DOM tree of department.xml, all the whitespace, including both newlines and space characters, is preserved. This is required behavior for a conforming XML processor because some whitespace is in fact meaningful in some types of applications. Without knowing the semantics of the application, a processor cannot know which whitespace is significant and which is not.

One of the most frequent questions from developers about the DOM API concerns the getFirstChild() method of the org.w3c.dom.Node interface. When developers want the first child element of an element, they simply call getFirstChild(). However, as you can see in Figure 3.2 (on the left), in many cases the first child node is a Text node that contains only whitespace.

In our department.xml example, whitespace is not explicitly mentioned in the content models in the DTD (department.dtd). For example, the content model of the element employee allows only name, email, and url as possible child elements. Does whitespace (Text nodes consisting only of whitespace characters) violate the validity of the document? To allow whitespace to be inserted for readability without explicitly specifying it in the content models of a DTD, the XML 1.0 Recommendation defines the following rule as one of the validity constraints.

"The declaration matches children and the sequence of child elements belongs to the language generated by the regular expression in the content model, with optional white space (characters matching the nonterminal S) between each pair of child elements."

Thus, although the DOM tree may contain extraneous whitespace, validation ignores whitespace that is not defined in the content model. Keep in mind that if no DTD is specified, it is impossible to determine whether whitespace is ignorable. Even with a DTD, whitespace is significant wherever the content model permits character data. For example, if the content model of the department element is as follows, all the whitespace in the department element is meaningful.
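<!ELEMENT department (#PCDATA|employee|name|email|url)*>

If certain whitespace is significant, there are two ways to tell an XML processor or an application about it: declare the element with mixed content (#PCDATA) in the DTD, as in the declaration above, or use the xml:space attribute in the document.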
The XML document shown in Listing 3.7 indicates that all whitespace within the department element is to be preserved. Note that xml:space is a hint for an XML application, not for an XML processor.

Listing 3.7 XML document with xml:space attribute, chap03/department-preserved.xml
<?xml version="1.0"?>
<!DOCTYPE department SYSTEM "department-preserved.dtd">
<department xml:space="preserve">
<employee id="J.D">
<name>John Doe</name>
<email>John.Doe@foo.com</email>
</employee>
<employee id="B.S">
<name>Bob Smith</name>
<email>Bob.Smith@foo.com</email>
</employee>
<employee id="A.M">
<name>Alice Miller</name>
<url href="http://www.foo.com/~amiller/"/>
</employee>
</department>
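
Because xml:space is only a hint, the application itself has to look for it. The following is a minimal sketch of one way to do that, not the book's own code. It relies on the XML 1.0 rule that the declared value applies to all elements within the content of the element that carries the attribute, unless a nested element overrides it. The class XmlSpaceDemo, the method preservesSpace(), and the note element in the sample string are hypothetical and introduced only for illustration. When no element declares xml:space, the sketch simply returns false.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class XmlSpaceDemo {
    // Sample input (an assumption): the root asks for preservation,
    // and the hypothetical note element switches back to the default
    static final String XML =
        "<department xml:space=\"preserve\">"
        + "<employee id=\"J.D\"><name>John Doe</name></employee>"
        + "<note xml:space=\"default\"><p>temporary</p></note>"
        + "</department>";

    // Parses an XML string with a non-validating JAXP parser
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Walks from the node up through its ancestors and returns true
    // if the nearest xml:space declaration is "preserve"
    static boolean preservesSpace(Node node) {
        for (Node n = node; n != null; n = n.getParentNode()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                // getAttribute() returns "" when the attribute is absent
                String value = ((Element) n).getAttribute("xml:space");
                if ("preserve".equals(value)) {
                    return true;
                }
                if ("default".equals(value)) {
                    return false;
                }
            }
        }
        return false; // no declaration found on any ancestor
    }

    public static void main(String[] args) {
        Document doc = parse(XML);
        Node name = doc.getElementsByTagName("name").item(0);
        Node p = doc.getElementsByTagName("p").item(0);
        System.out.println(preservesSpace(name)); // prints true
        System.out.println(preservesSpace(p));    // prints false
    }
}
```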

Xerces provides the isIgnorableWhitespace() method in org.apache.xerces.dom.TextImpl to test whether a text node is ignorable whitespace. The method is Xerces-specific. The program (RemoveIgnorableWSNodes) shown in Listing 3.8 removes all the ignorable whitespace from an input XML document.

Listing 3.8 Removing ignorable whitespace, chap03/RemoveIgnorableWSNodes.java
package chap03;

/**
 * RemoveIgnorableWSNodes.java
 **/
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.apache.xerces.dom.TextImpl;
import org.apache.xml.serialize.OutputFormat;
import org.apache.xml.serialize.XMLSerializer;
import share.util.MyErrorHandler;

public class RemoveIgnorableWSNodes {
    public static void main(String[] argv) {
        if (argv.length != 1) {
            System.err.println(
                "Usage: java chap03.RemoveIgnorableWSNodes <filename>");
            System.exit(1);
        }
        try {
            // Creates document builder factory
            DocumentBuilderFactory factory =
                DocumentBuilderFactory.newInstance();
            // Tells the parser to validate documents
            factory.setValidating(true);
            // Creates builder object
            DocumentBuilder builder =
                factory.newDocumentBuilder();
            // Sets an ErrorHandler
            builder.setErrorHandler(new MyErrorHandler());
            // Parses the document
            Document doc = builder.parse(argv[0]);
            // Removes ignorable whitespace
            removeIgnorableWSNodes(doc.getDocumentElement());
            // Prepares output format
            OutputFormat formatter = new OutputFormat();
            formatter.setPreserveSpace(true);
            // The XML document is output to standard output
            XMLSerializer serializer =
                new XMLSerializer(System.out, formatter);
            // Serializes the DOM tree as an XML document
            serializer.serialize(doc);
        } catch (Exception e) {
            e.printStackTrace();
            System.exit(1);
        }
    }

    public static void removeIgnorableWSNodes(Element parent) {
        Node child = parent.getFirstChild();
        while (child != null) {
            // Saves the next sibling first, because removeChild()
            // detaches the child from its siblings
            Node nextNode = child.getNextSibling();
            if (child.getNodeType() == Node.TEXT_NODE) {
                // Checks if the text node is ignorable
                // (Xerces-specific; the cast assumes a Xerces DOM)
                if (((TextImpl) child).isIgnorableWhitespace()) {
                    parent.removeChild(child);
                }
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                removeIgnorableWSNodes((Element) child);
            }
            child = nextNode;
        }
    }
}

Now we execute the program with an XML document that has a DTD (department-dtd.xml).

R:\samples>java chap03.RemoveIgnorableWSNodes department-dtd.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE department SYSTEM "department.dtd">
<department><employee id="J.D"><name>John Doe</name><email>John.Doe@foo.com</email></employee>
<employee id="B.S"><name>Bob Smith</name><email>Bob.Smith@foo.com</email></employee>
<employee id="A.M"><name>Alice Miller</name><url href="http://www.foo.com/~amiller/"/></employee></department>

Xerces recognizes the content model for each element in the DTD, checks whether whitespace is ignorable according to that content model, and removes the ignorable whitespace.
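
The same effect can usually be obtained without a Xerces-specific cast. The following is a minimal sketch, not the book's own code, that uses the standard JAXP setting DocumentBuilderFactory.setIgnoringElementContentWhitespace(). Because this setting relies on the content model, the parser must be in validating mode. The class name IgnoreWhitespaceDemo and the simplified inline DTD are assumptions made to keep the example self-contained; the DTD is not the book's department.dtd.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class IgnoreWhitespaceDemo {
    // Indented document with a simplified internal DTD (an assumption)
    static final String XML =
        "<!DOCTYPE department [\n"
        + "<!ELEMENT department (employee)*>\n"
        + "<!ELEMENT employee (name)>\n"
        + "<!ATTLIST employee id CDATA #REQUIRED>\n"
        + "<!ELEMENT name (#PCDATA)>\n"
        + "]>\n"
        + "<department>\n"
        + "  <employee id=\"J.D\">\n"
        + "    <name>John Doe</name>\n"
        + "  </employee>\n"
        + "</department>";

    // Parses with validation on; dropIgnorable asks the parser to omit
    // whitespace found in element-only content from the DOM tree
    static Document parse(String xml, boolean dropIgnorable) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            factory.setIgnoringElementContentWhitespace(dropIgnorable);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Turns validity errors into exceptions instead of ignoring them
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) throws SAXException {
                    throw e;
                }
                public void fatalError(SAXParseException e) throws SAXException {
                    throw e;
                }
            });
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Element kept = parse(XML, false).getDocumentElement();
        Element dropped = parse(XML, true).getDocumentElement();
        // Number of child nodes of department with and without the setting
        System.out.println(kept.getChildNodes().getLength());
        System.out.println(dropped.getChildNodes().getLength());
    }
}
```

On a parser that honors the setting, the second tree should contain only the employee element under department, while the text inside name is left untouched because name has a #PCDATA content model.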