3.4 Serializing a DOM Tree

In the previous sections, you learned how to create a DOM tree from scratch. To store the tree as a file or to send it to other applications, you should serialize the DOM object tree as an XML document. The serialization process depends on the XML processor because there is no standard API for serialization. Xerces provides a utility class, org.apache.xml.serialize.XMLSerializer, to output a DOM tree in various formats.

3.4.1 Using the XMLSerializer Package

Remember the MakeDocumentWithFactory class shown in Listing 3.2? In this program, the serialization process begins at line 46.
[9] import org.apache.xml.serialize.XMLSerializer;
[10] import org.apache.xml.serialize.OutputFormat;
...
// Prepares output format
[46] OutputFormat formatter = new OutputFormat();
// Preserves whitespace
[48] formatter.setPreserveSpace(true);
// The XML document will be output to standard output
[50] XMLSerializer serializer =
new XMLSerializer(System.out, formatter);
// Serializes the DOM tree as an XML document
[53] serializer.serialize(doc);
The OutputFormat class is used to specify the format of serialization, and the class has four constructors.
The constructors let you specify the encoding and the indenting of a serialized document. In the previous example, the object was created with no arguments. We can modify the first part of the program shown in Listing 3.2 as follows:

OutputFormat formatter = new OutputFormat("xml", "Shift_JIS", true);

The first argument, "xml", names the output method. The second argument specifies the encoding, here "Shift_JIS", which is a Japanese encoding. The third argument is a boolean that tells Xerces whether to indent the output document to make it easier to read. When the modified program is compiled and executed, this is the result.
R:\samples>java chap03.MakeDocumentWithFactoryModified
<?xml version="1.0" encoding="Shift_JIS"?>
<department><!--The first employee description.--><employee id="J.D">
<name>John Doe</name>
<email>John.Doe@foo.com</email>
</employee><?application commandForApp?></department>
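The same two options, encoding and indenting, can also be tried without Xerces on the classpath. The following is a minimal sketch, assuming the JDK's javax.xml.transform API is an acceptable stand-in for XMLSerializer; it is not the Xerces API described in this section. The class name SerializeWithOptions and its two helper methods are hypothetical, and UTF-8 is used instead of Shift_JIS so that the example does not depend on optional charsets. The output goes to a ByteArrayOutputStream, so the document is obtained as a string instead of being printed directly.

```java
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical class; uses only JDK APIs, not org.apache.xml.serialize.
final class SerializeWithOptions {

    // Builds a small department/employee tree similar to the chapter's example.
    static Document buildSampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element department = doc.createElement("department");
            Element employee = doc.createElement("employee");
            employee.setAttribute("id", "J.D");
            Element name = doc.createElement("name");
            name.appendChild(doc.createTextNode("John Doe"));
            employee.appendChild(name);
            department.appendChild(employee);
            doc.appendChild(department);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("Could not build the document", e);
        }
    }

    // Serializes the tree to bytes in the given encoding, then decodes
    // those bytes back into a String using the same encoding.
    static String serialize(Document doc, String encoding, boolean indent) {
        try {
            Transformer transformer =
                    TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.METHOD, "xml");
            transformer.setOutputProperty(OutputKeys.ENCODING, encoding);
            transformer.setOutputProperty(OutputKeys.INDENT, indent ? "yes" : "no");
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString(encoding);
        } catch (Exception e) {
            throw new IllegalStateException("Serialization failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize(buildSampleDocument(), "UTF-8", true));
    }
}
```

The exact whitespace produced by indenting differs between JDK versions, so the output is not reproduced here.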
The setPreserveSpace() method is used to specify whether any whitespace should be preserved. In this chapter, there is no need to call this method, because the DOM tree is created by a program.

Let's continue with the code fragment from Listing 3.2. After creating an OutputFormat object, we create an XMLSerializer object with it.

// The XML document will be output to standard output
XMLSerializer serializer = new XMLSerializer(System.out, formatter);

In this case, the following constructor for the XMLSerializer class is used (refer to the Xerces API document for other constructors).

public XMLSerializer(OutputStream output, OutputFormat format)

The first argument takes the java.io.OutputStream object for output. In this program, the standard output (System.out) is used, but you can supply other output stream objects to write to a file or to capture the result as a string. The second argument takes the OutputFormat object created before. Finally, we execute serialization with the serialize() method.

// Serializes the DOM tree as an XML document
serializer.serialize(doc);

3.4.2 Discussions about Serialization

We showed methods to create a DOM tree from scratch and serialize it as an XML document with sample programs. You can easily modify the programs for other purposes, for example, to connect to your database and automatically generate an employee list of your whole department in the form of XML. Another possibility is to create a DOM tree interactively by bringing up a series of dialogs (wizards, in Windows terminology) and asking the user to supply the data.

You may think that it would have been easier to generate an XML document directly using System.out.println() (or printf() in C or cout in C++, or …) as follows.
System.out.println("<?xml version=\"1.0\"?>");
System.out.println("<department>");
System.out.println(" <employee>");
System.out.println(" <name>John Doe</name>");
...
In fact, most current CGI programs and servlets generate HTML pages this way. Why do we bother creating a complex object structure rather than just using the println() method? We can give you three good reasons.

First, XML documents must be well-formed. The current browsers are amazingly forgiving about errors in HTML markup. This is partly because even though some of the tags are ignored or handled incorrectly, nothing serious happens. The information will be displayed on a screen in a reasonable way, and the human user is responsible for making sense of it. On the other hand, XML tags are supposed to be interpreted by application programs. The well-formedness of XML documents is strictly defined in the XML 1.0 Recommendation, and all conforming XML processors are required to report errors to application programs if the parsed XML document is not well-formed. There should not be errors such as missing end tags, unknown entities, and unknown characters. Creating a toy program that generates very simple XML documents may be possible with the println() method. But for complex enterprise Web applications, you should let an XML processor be responsible for generating well-formed and valid documents.

Second, creating well-formed and valid XML documents is not as easy as it may seem. XML is intended to be a simple, lightweight markup language. Yet, understanding every detail of the specification is not easy. For example, how can you distinguish between ignorable whitespace and unignorable whitespace? Or how can you include a newline character within an attribute value? The XML 1.0 Recommendation precisely defines these details, and the XML processors in your business partner's application program expect that your XML document complies with them. Even if you are familiar with these details, it is not productive to develop code that takes care of them every time you call the println() method. One of the biggest values you can expect from an XML processor is that it can handle these details for you.

Third, generation by string processing using the println() method sometimes produces security holes. Suppose we create an HTML document in which a person's name from an input form is embedded.
String name;
...
out.println("<td>Name</td><td>" + name + "</td>");
...
The previous code works correctly when the name variable refers to "John Doe" and outputs the following string.

<td>Name</td><td>John Doe</td>

However, if the name variable refers to a string that contains special characters, such as "<", the code outputs an incorrect HTML document, because the string is embedded without any check that it is a correct fragment of the HTML document. A more serious problem occurs if someone embeds a string that conforms to HTML but contains a malicious program. Cross-site scripting (CSS), described in Section 10.2.1, is a typical example. You should make sure that printing an HTML or XML document doesn't cause a security hole. By creating and validating an HTML or XML document as a whole before printing, you can avoid including illegal characters.

For these reasons, we recommend that you create an XML document from a DOM tree. It may involve a cost, but we believe it is a necessary investment. XML processors are the result of intensive intellectual work. Using them frees you from worrying about the proper nesting of tags, escaping special characters such as an ampersand and a left angle bracket, and handling international character sets. So why not use XML processors?

Another important point when you generate an XML document is the encoding of the document. In this book, we recommend that you use UTF-8 or UTF-16 because every XML processor must handle these two encodings. This is not so serious a problem if an application handles only English; however, there are many language-specific encodings, for Japanese, for example. Some legacy systems may use these encodings, but even so, you should use UTF-8 or UTF-16 when data in XML is exchanged between systems. One reason why business applications employ XML is to keep interoperability independent of implementations and platforms. Using widely accepted encodings improves interoperability.
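Returning to the escaping problem described earlier in this section, the difference between string concatenation and DOM-based generation can be sketched as follows. This is a minimal illustration under stated assumptions: it uses the JDK's javax.xml.transform API in place of the Xerces XMLSerializer, and the class name EscapingDemo and its methods are hypothetical. It shows only that a serializer escapes markup characters in text nodes; it is not a complete defense against the attacks discussed in Section 10.2.1.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical class; uses only JDK APIs, not org.apache.xml.serialize.
final class EscapingDemo {

    // Approach from the text: the value is pasted into the markup unchanged.
    static String byConcatenation(String name) {
        return "<td>" + name + "</td>";
    }

    // DOM approach: the value becomes a text node, and the serializer
    // escapes markup characters when it writes the node out.
    static String byDom(String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element td = doc.createElement("td");
            td.appendChild(doc.createTextNode(name));
            doc.appendChild(td);

            Transformer transformer =
                    TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.METHOD, "xml");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter writer = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(writer));
            return writer.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Serialization failed", e);
        }
    }

    public static void main(String[] args) {
        String hostile = "<script>alert(1)</script> & more";
        System.out.println("Concatenated: " + byConcatenation(hostile));
        System.out.println("From DOM:     " + byDom(hostile));
    }
}
```

In the concatenated version the script element reaches the output as live markup, whereas the DOM version writes the same characters as escaped text (for example, the left angle bracket as &lt; and the ampersand as &amp;).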