XML and Java: Developing Web Applications, Second Edition

6.2 General Tricks

You can solve many XML application problems by using general tricks that are parser independent. This section teaches you how to solve some of the most common problems that XML developers encounter.

6.2.1 Namespace Validation with DTDs

The "Namespaces in XML" specification defines how to place XML elements and attributes in separate namespaces to avoid name collisions when documents are mixed together. DTDs, however, were not designed to validate documents that use namespaces, but there is a trick you can use to add namespace support to them. Although this trick isn't really a parser trick, it is useful enough to warrant its inclusion in this chapter.

We use an example to show how to add namespace support to a DTD. The following grammar defines an XML document that can be used to store a music collection.

<!ELEMENT collection (album)*>
<!ELEMENT album (artist,title)>
<!ATTLIST album cd-id CDATA #IMPLIED>
<!ELEMENT artist (#PCDATA)>
<!ELEMENT title (#PCDATA)>

The first step is to add three literal parameter entities to the DTD: a prefix, a suffix, and a namespace declaration. The names of the entities are not important, but remember that a name collision can occur if the DTD defines parameter entities with the same names.

<!ENTITY % prefix ''>
<!ENTITY % suffix ''>
<!ENTITY % xmlns 'xmlns%suffix;'>

The prefix and suffix parameter entities redefine the namespace prefix of the XML instance document. When you leave the value of these entities empty, your DTD retains the same element and attribute names by default. So existing documents that conform to the DTD can be used without modification. However, instance documents can redefine these values in the internal subset of the DTD to add a namespace prefix. To accommodate this new prefix, we defined the xmlns parameter entity, which we will use in an upcoming step.

Be aware, however, that validating existing documents against the new namespace-aware DTD changes their information set, because elements and attributes gain namespace information.

The next step is to define literal parameter entities for the name of each element declared in the DTD. The value of each entity must be a reference to the namespace prefix you defined in the first step, followed immediately by the name of the element. For example:

<!ENTITY % collection '%prefix;collection'>
<!ENTITY % album '%prefix;album'>
<!ENTITY % artist '%prefix;artist'>
<!ENTITY % title '%prefix;title'>

Once you declare the entities for element names, modify all the element declarations to include the element names by reference. For example:

<!ELEMENT %collection; (%album;)*>
<!ELEMENT %album; (%artist;,%title;)>
<!ATTLIST %album; cd-id CDATA #IMPLIED>
<!ELEMENT %artist; (#PCDATA)>
<!ELEMENT %title; (#PCDATA)>

Next, add a namespace declaration attribute to all the elements that may be used as root elements in your instance documents. In our example, the <collection> element is the top-level element, so we declare the xmlns attribute for this element. The namespace URI that we've assigned as the default, fixed value was arbitrary, but you should choose your URI appropriately for your grammar.

<!ATTLIST %collection; %xmlns; CDATA #FIXED 'http://www.example.com/music'>

Putting everything together, we have the modified DTD shown in Listing 6.1.[1]

[1] Note that in all cases, parameter entities must be declared before they are referenced.

Listing 6.1 Modified music collection DTD, chap06/data/collection-ns.dtd
<!ENTITY % prefix ''>
<!ENTITY % suffix ''>
<!ENTITY % xmlns 'xmlns%suffix;'>

<!ENTITY % collection '%prefix;collection'>
<!ENTITY % album '%prefix;album'>
<!ENTITY % artist '%prefix;artist'>
<!ENTITY % title '%prefix;title'>

<!ELEMENT %collection; (%album;)*>
<!ATTLIST %collection; %xmlns; CDATA #FIXED 'http://www.example.com/music'>
<!ELEMENT %album; (%artist;,%title;)>
<!ATTLIST %album; cd-id CDATA #IMPLIED>
<!ELEMENT %artist; (#PCDATA)>
<!ELEMENT %title; (#PCDATA)>

Now both of the documents shown in Listings 6.2 and 6.3 can be validated using the same DTD grammar with our namespace modifications.

Listing 6.2 Sample using default namespace
<!DOCTYPE collection SYSTEM 'collection-ns.dtd'>
<collection>
   <album cd-id='189EFCF'>
      <artist>They Might Be Giants</artist>
      <title>Flood</title>
   </album>
</collection>

Listing 6.3 Sample using namespace prefixes
<!DOCTYPE a:collection SYSTEM 'collection-ns.dtd' [
   <!ENTITY % prefix 'a:'>
   <!ENTITY % suffix ':a'>
]>
<a:collection xmlns:a='http://www.example.com/music'>
   <a:album cd-id='2A77609'>
      <a:artist>Shonen Knife</a:artist>
      <a:title>Brand New Knife</a:title>
   </a:album>
</a:collection>

Notice the redefinition of the prefix and suffix parameter entities in the internal subset of the DTD in Listing 6.3. If the namespace is other than the default namespace (that is, a specific namespace prefix is bound to the namespace URI), the value of the prefix parameter entity must be the namespace prefix followed by a colon, and the suffix parameter entity value must be a colon followed by the namespace prefix. The added colons allow the DTD parser to correctly expand the element names and the namespace declaration attribute to contain the namespace prefix.
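
To see the trick at work, the following sketch validates both sample documents against the DTD from Listing 6.1 using the JAXP parser bundled with the JDK. It is a minimal illustration rather than part of the book's sample code: the class name NamespaceDtdCheck is hypothetical, and the DTD is served from memory by a small entity resolver only so that the example needs no files on disk.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class NamespaceDtdCheck {

    // the DTD from Listing 6.1, kept in memory for this sketch
    static final String DTD =
          "<!ENTITY % prefix ''>\n"
        + "<!ENTITY % suffix ''>\n"
        + "<!ENTITY % xmlns 'xmlns%suffix;'>\n"
        + "<!ENTITY % collection '%prefix;collection'>\n"
        + "<!ENTITY % album '%prefix;album'>\n"
        + "<!ENTITY % artist '%prefix;artist'>\n"
        + "<!ENTITY % title '%prefix;title'>\n"
        + "<!ELEMENT %collection; (%album;)*>\n"
        + "<!ATTLIST %collection; %xmlns; CDATA #FIXED 'http://www.example.com/music'>\n"
        + "<!ELEMENT %album; (%artist;,%title;)>\n"
        + "<!ATTLIST %album; cd-id CDATA #IMPLIED>\n"
        + "<!ELEMENT %artist; (#PCDATA)>\n"
        + "<!ELEMENT %title; (#PCDATA)>\n";

    // Listing 6.2: no prefix, entities keep their empty defaults
    static final String DEFAULT_DOC =
          "<!DOCTYPE collection SYSTEM 'collection-ns.dtd'>"
        + "<collection><album cd-id='189EFCF'>"
        + "<artist>They Might Be Giants</artist><title>Flood</title>"
        + "</album></collection>";

    // Listing 6.3: the internal subset redefines prefix and suffix
    static final String PREFIXED_DOC =
          "<!DOCTYPE a:collection SYSTEM 'collection-ns.dtd' ["
        + "<!ENTITY % prefix 'a:'><!ENTITY % suffix ':a'>]>"
        + "<a:collection xmlns:a='http://www.example.com/music'>"
        + "<a:album cd-id='2A77609'>"
        + "<a:artist>Shonen Knife</a:artist><a:title>Brand New Knife</a:title>"
        + "</a:album></a:collection>";

    // parses with validation on; any validation error is thrown to the caller
    static Document parseValidating(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setValidating(true);
        DocumentBuilder builder = factory.newDocumentBuilder();

        // hand the parser the in-memory DTD instead of a file
        builder.setEntityResolver((publicId, systemId) -> {
            if (systemId == null || !systemId.endsWith("collection-ns.dtd")) {
                return null;
            }
            InputSource source = new InputSource(new StringReader(DTD));
            source.setSystemId(systemId);
            return source;
        });
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { /* ignored */ }
            public void error(SAXParseException e) throws SAXParseException { throw e; }
            public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
        });
        return builder.parse(new InputSource(new StringReader(xml)));
    }
}
```

Calling NamespaceDtdCheck.parseValidating() with either document should return a DOM tree without raising a validation error, whereas a document that uses the a: prefix without redefining the two entities in its internal subset would be rejected.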

You can use this simple trick to update your old DTDs to be namespace-aware as a first step in your migration to using XML Schemas.

6.2.2 Entity Resolution

Entity resolution is perhaps the most useful feature of XML parsers that application developers often overlook. Simply stated, entity resolution allows the application to control how the parser locates parts of the document. Every separate part of the XML document is an entity: for example, the DTD external subset, external general entities referenced within the document, and external parameter entities referenced within the DTD. You can use an entity resolver to redirect the location of the declared entity.

Using entity resolution offers many advantages. One benefit is that you can improve application performance by redirecting system identifiers that specify resources located on the network to copies on the local file system. You can also use this feature to prevent a document from using an untrusted DTD grammar, which is especially important in a business-to-business scenario.

Simple Entity Resolver

You can create an entity resolver by writing a simple class that implements the org.xml.sax.EntityResolver interface. The example in Listing 6.4 maps an entity with the system identifier http://www.example.com/grammar.dtd to a local copy of the DTD file. Notice that the code always sets the public and system identifiers on the new InputSource object. This is important because it allows the parser to resolve other entities that may be declared relative to the resolved entity.

Listing 6.4 Creating an entity resolver, chap06/resolver/SimpleEntityResolver.java
package chap06.resolver;

import java.io.FileInputStream;
import java.io.InputStream;
import java.io.IOException;

import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class SimpleEntityResolver
   implements EntityResolver {

   public InputSource resolveEntity(String publicId, String systemId)
      throws SAXException, IOException {

      // resolve known entity using system identifier; calling equals()
      // on the literal is safe even if the system identifier is null
      if ("http://www.example.com/grammar.dtd".equals(systemId)) {
         // open local file
         InputStream inputStream =
            new FileInputStream("c:\\xml\\grammar.dtd");

         // create input source and return
         InputSource inputSource = new InputSource(inputStream);
         inputSource.setPublicId(publicId);
         inputSource.setSystemId(systemId);
         return inputSource;
      }

      // don't know how to resolve entity, let parser resolve it
      return null;

   }

}

To use your custom entity resolver, create an instance and register it with the parser of your choice. JAXP allows you to register an entity resolver on either a DocumentBuilder or a SAXParser.[2]

[2] We assume that you know how to instantiate a DocumentBuilder and a SAXParser using JAXP, as discussed in Chapter 2.

// import javax.xml.parsers.DocumentBuilder;
// import javax.xml.parsers.SAXParser;
// import org.xml.sax.EntityResolver;
// import chap06.resolver.SimpleEntityResolver;

// instantiate custom entity resolver
EntityResolver entityResolver = new SimpleEntityResolver();

// set entity resolver on document builder
DocumentBuilder documentBuilder = ...;
documentBuilder.setEntityResolver(entityResolver);

// set entity resolver on SAX parser
SAXParser saxParser = ...;
saxParser.getXMLReader().setEntityResolver(entityResolver);

This trick is an extremely powerful weapon and should be in every XML application developer's arsenal.
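
The resolver in Listing 6.4 returns null for unknown entities, which lets the parser fetch them itself. When the goal is to keep documents from pulling in untrusted grammars, a stricter variant can refuse anything it does not recognize. The sketch below shows one possible shape for such a resolver; the class name StrictEntityResolver and its trust() method are hypothetical, and the entity content is held as strings purely to keep the example self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class StrictEntityResolver implements EntityResolver {

    // system identifier -> entity content the application trusts
    private final Map<String, String> trusted = new ConcurrentHashMap<>();

    // registers content to serve for a given system identifier
    void trust(String systemId, String content) {
        trusted.put(systemId, content);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId)
            throws SAXException, IOException {
        String content = (systemId == null) ? null : trusted.get(systemId);
        if (content == null) {
            // refuse instead of returning null, so the parser
            // never falls back to fetching the entity on its own
            throw new SAXException("untrusted entity: " + systemId);
        }
        InputSource inputSource = new InputSource(new StringReader(content));
        inputSource.setPublicId(publicId);
        inputSource.setSystemId(systemId);
        return inputSource;
    }
}
```

Throwing from resolveEntity() aborts the parse, so this variant suits applications that would rather reject a document than read a grammar they did not supply.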

Caching Common Entities in Memory

As stated in the previous section, an entity resolver can improve application performance by redirecting network access to a local copy of an entity. However, disk access is slower than memory access, so performance can be further improved by caching often-used entities (such as DTDs) in memory. The code in Listing 6.5 implements an entity resolver that caches entities in memory.

Listing 6.5 Caching entities in memory, chap06/resolver/MemoryEntityResolver.java
package chap06.resolver;

import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.IOException;
import java.util.Hashtable;

import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class MemoryEntityResolver
   implements EntityResolver {

   protected Hashtable cache = new Hashtable();

   public void put(String systemId, File file) throws IOException {

      // create array; keep the length as a long so that the
      // size check below is meaningful
      long length = file.length();
      if (length > Integer.MAX_VALUE) {
         throw new IOException("file too large to cache");
      }
      int size = (int)length;
      byte[] array = new byte[size];

      // load file into array
      InputStream inputStream = new FileInputStream(file);
      try {
         while (size > 0) {
            int count = inputStream.read(array, array.length - size, size);
            if (count == -1) {
               throw new EOFException("unexpected end of file");
            }
            size -= count;
         }
      }
      finally {
         // close the file even if reading fails
         inputStream.close();
      }

      // add to cache
      cache.put(systemId, array);
   }

   public InputSource resolveEntity(String publicId, String systemId)
      throws SAXException, IOException {
      // resolve known entity using system identifier
      byte[] array = (byte[])cache.get(systemId);
      if (array != null) {
         // wrap array with input stream
         InputStream inputStream = new ByteArrayInputStream(array);

         // create input source and return
         InputSource inputSource = new InputSource(inputStream);
         inputSource.setPublicId(publicId);
         inputSource.setSystemId(systemId);
         return inputSource;
      }

      // don't know how to resolve entity, let parser resolve it
      return null;

   }

}

To use MemoryEntityResolver, create an instance, add files to the cache by calling the put() method, and register the entity resolver with your parser of choice. For example:

// import java.io.File;
// import javax.xml.parsers.SAXParser;
// import org.xml.sax.EntityResolver;
// import chap06.resolver.MemoryEntityResolver;

// instantiate entity resolver and add files to cache
MemoryEntityResolver memoryEntityResolver = new MemoryEntityResolver();
memoryEntityResolver.put("http://www.example.com/music/collection.dtd",
                         new File("c:\\xml\\collection.dtd"));
memoryEntityResolver.put("http://www.foobar.com/candy/chocolate.dtd",
                         new File("c:\\xml\\chocolate.dtd"));

// set entity resolver on SAX parser
SAXParser saxParser = ...;
saxParser.getXMLReader().setEntityResolver(memoryEntityResolver);

Enforcing Validation Using Specific Grammars

Resolving entities referenced in a document can improve application performance, but it can also be used to enforce that documents use a specific grammar determined by the application. However, registering an entity resolver is not enough to force the document to use a specific grammar because certain things are beyond the entity resolver's control. For example, what if the instance document does not contain a DOCTYPE declaration or does not reference an external DTD using public or system identifiers? Then the registered entity resolver will not be called by the parser. At the very least, the instance document must contain a DOCTYPE declaration.

Even if the document contains the necessary information in the DOCTYPE declaration to allow the entity resolver to properly resolve the DTD grammar to be used, problems still exist. What if the document's DOCTYPE "lies" and references an invalid document grammar? In addition, a conformant XML parser must process the internal subset of the DTD, and these declarations take precedence over the declarations in the external subset. In either case, the document can "spoof" the application into using the wrong DTD grammar or the wrong declarations for validating the document.[3]

[3] Spoofing is a trick used to capture or transmit incorrect information to make a receiver falsely believe the information is correct.
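
Both limitations are easy to observe. The following sketch, which assumes the JAXP parser bundled with the JDK, counts how often the resolver is consulted and reads back a defaulted attribute. The class name DoctypeSpoofDemo, the root element, and its kind attribute are all hypothetical names chosen for the illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class DoctypeSpoofDemo {

    // the grammar the application wants every document to use
    static final String TRUSTED_DTD =
          "<!ELEMENT root EMPTY>\n"
        + "<!ATTLIST root kind CDATA 'trusted'>\n";

    // number of times the parser has asked the resolver for an entity
    static int resolverCalls = 0;

    // parses the document and returns the value of root's kind attribute
    static String kindOf(String xml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();

        // always answer with the trusted DTD, whatever the document names
        builder.setEntityResolver((publicId, systemId) -> {
            resolverCalls++;
            InputSource source = new InputSource(new StringReader(TRUSTED_DTD));
            source.setSystemId(systemId);
            return source;
        });

        Document document = builder.parse(new InputSource(new StringReader(xml)));
        return document.getDocumentElement().getAttribute("kind");
    }
}
```

A document consisting only of <root/> never triggers the resolver, so the trusted default for kind is not applied. A document whose DOCTYPE names an external DTD picks up the trusted default, but one that also declares <!ATTLIST root kind CDATA 'spoofed'> in its internal subset ends up with the spoofed value, because the internal declaration is read first and takes precedence.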

Entity resolution is not enough to enforce validation rules defined by the application. However, entity resolution can be used in conjunction with more advanced techniques to solve this problem. This problem is best solved by grammar caching performed by the parser implementation. Unfortunately, at the time of this writing, the Xerces parser does not have a grammar caching facility. You can use the Xerces Native Interface, described in Section 6.4, Advanced Xerces Tricks, to implement a solution to this problem, but the actual implementation is outside the scope of this book. Later, Xerces will incorporate a general grammar caching facility into the standard release of the parser.

Taking control of entity resolution is only one of the general tricks that you can use in your XML application. The next section presents a solution to a set of common problems developers experience when working with sockets.

6.2.3 Working with Sockets

Client-server applications are traditionally written to communicate via sockets using a proprietary binary format. Increasingly, both existing and new client-server applications use XML as the communication medium to achieve data independence and greater flexibility. As a result, XML documents are often written to and read from socket streams, and this is where problems occur. This section provides a detailed description of these problems and presents a general solution.

The Problem

The primary problem with writing an XML document to a socket stream is that the stream usually doesn't close after the XML document is serialized. Because XML does not define a definitive end to the document stream, an XML parser will not stop parsing the document until the stream ends or closes.[4] Because neither is the case for a typical socket stream, the parser cannot know where the document ends.

[4] Section 2.1 of the XML 1.0 specification (second edition) states that a well-formed XML document matches the production:


[1] document ::= prolog element Misc*

This states that an XML document can be followed by zero or more comments or processing instructions. In short, the parser cannot know that a document ends until the stream closes!

Because opening and closing network connections is time-consuming, we want to avoid closing the socket and, instead, reuse the stream to write multiple documents (or other data). An XML parser must have a definitive end to the document, so we must find a way to separate the documents within the socket stream. First, we briefly look at some common attempts to solve this problem and the reasons they don't generally work. Then we present a solution that works regardless of the length or content of the document.
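
A quick experiment shows why documents cannot simply be written one after another on the same stream. The sketch below, which assumes the JAXP SAX parser bundled with the JDK and uses the hypothetical class name BackToBackDocuments, feeds the parser a string standing in for the bytes that arrive on the socket.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class BackToBackDocuments {

    // returns true if the stream contents parse as one well-formed document
    static boolean parses(String streamContents) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        try {
            parser.parse(new InputSource(new StringReader(streamContents)),
                         new DefaultHandler());
            return true;
        } catch (SAXParseException e) {
            // the parser kept reading past the first root element
            // and found content the document production does not allow
            return false;
        }
    }
}
```

A single document such as <a/> parses cleanly, but <a/><b/> is rejected: after the first root element the parser expects only comments, processing instructions, or white space, so the second document looks like a well-formedness error rather than a new document.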

Solutions with Problems

Several approaches to solving this problem are well intentioned but are naive in that they don't fully address the nature of the problem. Usually, these solutions are limited in their usefulness, don't perform well, and are ultimately doomed to fail for use with arbitrary XML documents.

One approach to solving this problem is to insert either a special non-XML character or a processing instruction to mark the end of a document. Then a specialized reader detects the marker and makes the document stream "appear" as if it has closed so that the parser can finish parsing the document. But this solution does not work for most XML files, and, depending on the implementation, does not perform well.

This solution fails for a variety of reasons, but the primary one is the document's character encoding. XML is based on Unicode, and a document can be encoded using any of several character encodings. Therefore, unless you control the character encoding of the transmitted documents, you cannot reliably insert a special character or even a processing instruction. Additional code is needed to detect the encoding of each document and write the marker using the same character encoding.
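
The encoding problem can be made concrete with a few lines of code. In this sketch the processing instruction <?end-of-document?> and the class name MarkerEncodingDemo are hypothetical; the point is only that the same marker turns into different bytes under different encodings, so a reader scanning for one byte pattern misses the other.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MarkerEncodingDemo {

    // a made-up end-of-document marker
    static final String MARKER = "<?end-of-document?>";

    // the bytes that actually travel over the socket for a given encoding
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    // naive byte scan, as a marker-detecting reader might perform;
    // returns the index of the first match or -1 if there is none
    static int indexOf(byte[] haystack, byte[] needle) {
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            int j = 0;
            while (j < needle.length && haystack[i + j] == needle[j]) {
                j++;
            }
            if (j == needle.length) {
                return i;
            }
        }
        return -1;
    }
}
```

A reader that looks for the UTF-8 bytes of the marker finds them in a UTF-8 stream, but the same scan over a UTF-16 stream finds nothing, because every character of the marker now occupies two bytes.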

Another solution with limitations uses a "superdocument." In this approach, a document is written to the socket stream in which the real transmitted documents appear as child elements of the new root element. This solution, however, is prohibitive because it requires extra processing on both the server and the client. In addition, arbitrary documents cannot be inserted into the superdocument as is, because the document may use a different character encoding or may contain an XML declaration (for example, <?xml version='1.0' ...?>) or DOCTYPE line that is not allowed to appear within the body of a document.

Although each of these solutions will work with limited success depending on how well you can control the environment and the XML document content, another solution must be found. So we will develop a solution that works regardless of the document's encoding and contents.

A Stream within a Stream

Because the parser can't know the end of the document until the stream closes, we embed a stream within the socket stream. In short, we "wrap" the output stream so that we can write documents of arbitrary size and encoding, and "unwrap" the input stream at the other end for the parser. The embedded stream then correctly signals an end-of-file condition to the parser. For this solution to work, you must have access to both the server and client code.

To embed a stream within a stream, we use two I/O classes called WrappedOutputStream and WrappedInputStream. The output stream is used by the program sending the data (which is an XML document, but it also works for arbitrary-length binary files), whereas the input stream is used by the receiver to read the data. By using the output and input stream in conjunction, we can hide from the application the details of how the inner stream is encoded.[5]

[5] The WrappedInputStream and WrappedOutputStream classes are distributed as an Apache Xerces2 sample. The length of these classes prevents them from being listed in this text. Please refer to the accompanying CD-ROM for the complete source code.

Here are some additional facts for the technically minded. WrappedOutputStream is a filter output stream that writes packets of data to the underlying stream. Each packet contains a data size followed by the bytes of the data. A packet size of zero "closes" the stream. On the other end, WrappedInputStream first reads the packet data size, followed by the bytes in the data. This operation is transparent to the application using the input stream. To the application, it appears as if the packet data is contiguous, with no header information to indicate the packet size.
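
The packet scheme just described can be sketched in well under a hundred lines. The classes below are not the Xerces sample classes; FramedOutputStream and FramedInputStream are hypothetical names, and the wire format (a 4-byte big-endian length before each packet) is an assumption made for this illustration.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Sends every write as one packet: a 4-byte length, then the bytes. */
class FramedOutputStream extends FilterOutputStream {
    private final DataOutputStream data;
    private boolean ended = false;

    FramedOutputStream(OutputStream out) {
        super(out);
        data = new DataOutputStream(out);
    }

    @Override
    public void write(int b) throws IOException {
        write(new byte[] { (byte) b }, 0, 1);
    }

    @Override
    public void write(byte[] bytes, int offset, int length) throws IOException {
        if (length > 0) {   // size zero is reserved for the end marker
            data.writeInt(length);
            data.write(bytes, offset, length);
        }
    }

    /** Writes the zero-size packet; the underlying stream stays open. */
    @Override
    public void close() throws IOException {
        if (!ended) {
            ended = true;
            data.writeInt(0);
            data.flush();
        }
    }
}

/** Hides the packet headers and reports end-of-file at the zero-size packet. */
class FramedInputStream extends InputStream {
    private final DataInputStream data;
    private int remaining = 0;      // unread bytes in the current packet
    private boolean ended = false;

    FramedInputStream(InputStream in) {
        data = new DataInputStream(in);
    }

    @Override
    public int read() throws IOException {
        byte[] one = new byte[1];
        return read(one, 0, 1) == -1 ? -1 : one[0] & 0xFF;
    }

    @Override
    public int read(byte[] buffer, int offset, int length) throws IOException {
        if (length == 0) return 0;
        if (ended) return -1;
        if (remaining == 0) {
            remaining = data.readInt();     // next packet header
            if (remaining < 0) throw new IOException("corrupt packet size");
            if (remaining == 0) {
                ended = true;               // end of the embedded stream
                return -1;
            }
        }
        int count = data.read(buffer, offset, Math.min(length, remaining));
        if (count == -1) throw new EOFException("stream ended inside a packet");
        remaining -= count;
        return count;
    }

    /** Skips to the end marker so the underlying stream can be reused. */
    @Override
    public void close() throws IOException {
        byte[] discard = new byte[512];
        while (read(discard, 0, discard.length) != -1) {
            // drop whatever is left of the embedded document
        }
    }
}
```

Because the receiver learns the size of each packet before reading it, the content never needs to be inspected, which is why this approach is indifferent to the document's character encoding. Several documents can follow one another on the same underlying stream, each wrapped in its own pair of framed streams.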

Listing 6.6 uses WrappedOutputStream to write an XML document to a socket stream.

Listing 6.6 Wrapping an output stream
// import java.io.FileInputStream;
// import java.io.InputStream;
// import java.io.OutputStream;
// import chap06.socket.WrappedOutputStream;

// assumed to have socket stream open
OutputStream socketOutputStream = ...;

// wrap output stream
OutputStream wrappedOutputStream =
   new WrappedOutputStream(socketOutputStream);

// write document to output stream
InputStream xmlInputStream =
   new FileInputStream("c:\\xml\\document.xml");
byte[] array = new byte[2048];
int count = 0;
while ((count = xmlInputStream.read(array)) != -1) {
   wrappedOutputStream.write(array, 0, count);
}
xmlInputStream.close();

// "close" wrapped output stream
wrappedOutputStream.close();

When using WrappedOutputStream, the application is required to call the close() method. This "closes" the wrapped output stream but does not close the underlying stream. On the other end of the connection, the reader must use WrappedInputStream to read the contents written by WrappedOutputStream. Listing 6.7 shows how to wrap the incoming stream and parse its contents.

Listing 6.7 Unwrapping an input stream for parsing
// import java.io.InputStream;
// import java.io.IOException;
// import javax.xml.parsers.SAXParser;
// import org.xml.sax.InputSource;
// import chap06.socket.WrappedInputStream;

// assumed to have socket stream and parser
InputStream socketInputStream = ...;
SAXParser parser = ...;

// wrap input stream
InputStream wrappedInputStream =
   new WrappedInputStream(socketInputStream);

// parse document
try {
       InputSource inputSource = new InputSource(wrappedInputStream);
       parser.parse(inputSource);
}
finally {
       // "close" wrapped input stream
       wrappedInputStream.close();
}

Closing the wrapped input stream does not close the underlying stream. However, the application must call the close() method on WrappedInputStream. This is especially important for a fatal parsing error because the embedded stream will be left "open." Without an out-of-band method of communicating this error to the sender, the data stream will become corrupted. Closing the wrapped input stream skips to the end of the embedded document and leaves the underlying stream ready for reuse.

We have shown how to solve a few simple XML application problems, but many times the features provided by the standard interfaces are not sufficient. In those cases, options are available to application developers using the Xerces parser. We explore a few of these options in the next two sections.
