2.2 Basics of Parsing Documents

This section describes how to parse well-formed and valid XML documents, and shows the differences between them.

2.2.1 Parsing Well-Formed Documents

In this section, we show how to read a simple XML document, called department.xml, using Xerces. This document represents a set of employee records in a department (see Listing 2.1). The meanings of the tags should be self-explanatory.

Listing 2.1 Simple XML document, employee records for a department, chap02/department.xml
<?xml version="1.0" encoding="utf-8"?>
<department>
<employee id="J.D">
<name>John Doe</name>
<email>John.Doe@foo.com</email>
</employee>
<employee id="B.S">
<name>Bob Smith</name>
<email>Bob.Smith@foo.com</email>
</employee>
<employee id="A.M">
<name>Alice Miller</name>
<url href="http://www.foo.com/~amiller/"/>
</employee>
</department>
This is a well-formed XML document, and it should be parsed by a non-validating XML processor. The first task of this book is to read and parse the document by using Xerces. We run the sample program, SimpleParse, located in samples\chap02 on the CD-ROM, using the following commands:

R:\samples>java chap02.SimpleParse chap02/department.xml
R:\samples>

This program, as in the previous section, produces no output. However, we know that Xerces did its job, because SimpleParse did the following:

The fact that you see no output means that there were no violations of well-formedness (missing end tags, improper nesting, and so on). Listing 2.2 gives the source code of SimpleParse. Although a very short program, it shows the basics of how you can use Xerces.

Listing 2.2 Parsing an XML document (non-validating), chap02/SimpleParse.java
package chap02;
/**
* SimpleParse.java
**/
[5] import org.w3c.dom.Document;
[6] import org.apache.xerces.parsers.DOMParser;
import org.xml.sax.SAXException;
import java.io.IOException;
public class SimpleParse {
public static void main(String[] argv) {
if (argv.length != 1) {
System.err.println(
"Usage: java chap02.SimpleParse <filename>");
System.exit(1);
}
try {
[18] // Creates a parser object
[19] DOMParser parser = new DOMParser();
[20] // Parses an XML Document
[21] parser.parse(argv[0]);
[22] // Gets a Document object
[23] Document doc = parser.getDocument();
[24] // Does something
[25] } catch (SAXException se) {
[26] System.out.println("Parser error found: "
[27] +se.getMessage());
[28] System.exit(1);
} catch (IOException ioe) {
System.out.println("IO error found: "
+ ioe.getMessage());
System.exit(1);
}
}
}
Now we'll look at the program SimpleParse line by line, referring to the numbers in square brackets on the left side of the program listing. First, this class imports two classes to use with Xerces: org.w3c.dom.Document (line 5) and org.apache.xerces.parsers.DOMParser (line 6).
Also, two exception classes (SAXException and IOException) are imported. The heart of this program is in lines 19 to 22.

[19] DOMParser parser = new DOMParser();

Line 19 creates a DOM-based processor to parse an XML document.

[21] parser.parse(argv[0]);

Next, line 21 parses an XML document specified by a command-line argument (argv[0]). In this case, the parse() method takes the filename of the XML document. The method has the following argument patterns (signatures), and you can choose the appropriate one.
The third one requires an object of the org.xml.sax.InputSource class, which is useful to wrap various input formats for an XML document to be parsed. Though it is originally from the SAX 1.0 API, it is widely used for a DOM parser as well as a SAX parser. The class has four constructors:
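As defined in the SAX API, the four InputSource constructors take no argument, a system identifier, a byte stream, and a character stream. The following is a minimal sketch of each; the class name InputSourceConstructors, the helper method names, and the inline XML text are made up for illustration and are not part of the book's samples.

```java
import org.xml.sax.InputSource;

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;

// Illustrative only: class, method names, and XML text are assumptions.
class InputSourceConstructors {
    static final String XML =
        "<department><employee id=\"J.D\"/></department>";

    /** 1. No-argument constructor; the input is supplied later via a setter. */
    static InputSource empty() {
        InputSource is = new InputSource();
        is.setSystemId("chap02/department.xml");
        return is;
    }

    /** 2. From a system identifier (a URI naming the document). */
    static InputSource fromSystemId() {
        return new InputSource("chap02/department.xml");
    }

    /** 3. From a byte stream; the parser detects the character encoding. */
    static InputSource fromByteStream() {
        InputStream in =
            new ByteArrayInputStream(XML.getBytes(StandardCharsets.UTF_8));
        return new InputSource(in);
    }

    /** 4. From a character stream; the characters are already decoded. */
    static InputSource fromCharacterStream() {
        Reader reader = new StringReader(XML);
        return new InputSource(reader);
    }

    public static void main(String[] args) {
        System.out.println(empty().getSystemId());
        System.out.println(fromSystemId().getSystemId());
        System.out.println(fromByteStream().getByteStream() != null);
        System.out.println(fromCharacterStream().getCharacterStream() != null);
    }
}
```

Any of these objects can be handed to a parse(InputSource) method, which is why a single InputSource parameter covers files, URLs, and in-memory text alike.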
If you want to write a method (say, processWithParse()) that takes an input file name as an argument, processWithParse(InputSource is) is more reusable than processWithParse(File f) or processWithParse(String url).

[23] Document doc = parser.getDocument();

Line 23 receives the Document instance. The org.w3c.dom.Document interface is specified by the DOM specification from W3C. The variable doc actually refers to an instance of an implementation class (org.apache.xerces.dom.DocumentImpl) provided by Xerces. The instance represents the whole XML document and can contain (1) at most one DocumentType instance that represents a DTD, (2) one Element instance that represents a root element (which is called a document element), and (3) zero or more Comment and ProcessingInstruction instances. The interface provides methods to visit and modify child nodes of the root element. For example, an application can get the root (document) element of an XML document by using the getDocumentElement() method of the Document interface. This sample program is simple, but you can see many other programs in this book.

When something goes wrong, the program throws an exception. The program shown in Listing 2.2 catches the following two exceptions:
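In Listing 2.2 these are org.xml.sax.SAXException, caught when the parser reports a problem in the document, and java.io.IOException, caught when the input cannot be read. The sketch below combines the points of this walkthrough: a processWithParse(InputSource) helper, a call to getDocumentElement(), and handling of both exceptions. It is an assumption-laden stand-in rather than the book's code: it uses the JDK's JAXP DocumentBuilder instead of the Xerces DOMParser class so that it runs without extra libraries, which also introduces ParserConfigurationException; the class name WellFormednessCheck and the isWellFormed() helper are hypothetical.

```java
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import java.io.IOException;
import java.io.StringReader;

// Illustrative only: JAXP is swapped in for the Xerces DOMParser class.
class WellFormednessCheck {

    /** Parses any InputSource and returns the name of the document element. */
    static String processWithParse(InputSource is)
            throws SAXException, IOException, ParserConfigurationException {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(is);          // non-validating by default
        Element root = doc.getDocumentElement();   // the root (document) element
        return root.getTagName();
    }

    /** Returns true if the XML text parses without a parser error. */
    static boolean isWellFormed(String xml) {
        try {
            processWithParse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException se) {
            // Raised for well-formedness violations such as a wrong end tag.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        String good = "<department><employee id=\"J.D\"/></department>";
        String bad = "<department><email>John.Doe@foo.com</email1></department>";
        System.out.println(processWithParse(
            new InputSource(new StringReader(good))));
        System.out.println(isWellFormed(good));
        System.out.println(isWellFormed(bad));
    }
}
```

Because the helper accepts an InputSource, the same method works for a file, a URL, or a string held in memory; only the caller changes.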
You might think that this program has no practical value because it does not produce any output. However, it is useful as a syntax checker. It can tell you whether the input XML document is well-formed or not. To show you how this works, we give SimpleParse an XML document that is not well-formed, department2.xml, shown in Listing 2.3.

Listing 2.3 Not well-formed XML document, chap02/department2.xml
<?xml version="1.0" encoding="utf-8"?>
<department>
<employee id="J.D">
<name>John Doe</name>
<email>John.Doe@foo.com</email1>
</employee>
<employee id="B.S">
<name>Bob Smith</name>
<email>Bob.Smith@foo.com</email>
</employee>
<employee id="A.M">
<name>Alice Miller</name>
<url href="http://www.foo.com/~amiller/"/>
</employee>
</department>
This document is not well-formed, because the end tag of the first email element is </email1>, not </email>. The result of parsing the document is as follows:

R:\samples>java chap02.SimpleParse chap02/department2.xml
Parser error found: The element type "email" must be terminated by the matching end-tag "</email>".

The XML processor recognizes the mismatch of the start and end tags, and reports it to applications by an exception (SAXException). In Listing 2.2, the exception is caught in lines 25 to 28.

2.2.2 Parsing Valid Documents

In this section, we parse a valid XML document according to a DTD. An example called department-dtd.xml is shown in Listing 2.4. The DOCTYPE declaration (the second line) tells an XML processor the location of the DTD.

Listing 2.4 XML document with DTD, chap02/department-dtd.xml
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE department SYSTEM "department.dtd">
<department>
<employee id="J.D">
<name>John Doe</name>
<email>John.Doe@foo.com</email>
</employee>
<employee id="B.S">
<name>Bob Smith</name>
<email>Bob.Smith@foo.com</email>
</employee>
<employee id="A.M">
<name>Alice Miller</name>
<url href="http://www.foo.com/~amiller/"/>
</employee>
</department>
The DTD for the document is shown in Listing 2.5.

Listing 2.5 DTD for XML document, chap02/department.dtd
<!ELEMENT department (employee)*>
<!ELEMENT employee (name, (email | url))>
<!ATTLIST employee id CDATA #REQUIRED>
<!ELEMENT name (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT url EMPTY>
<!ATTLIST url href CDATA #REQUIRED>

As shown in Section 1.4.2, a DTD specifies the structure of an XML document. For example, the first element declaration in Listing 2.5 says a department element must have zero or more employee elements. The second declaration says an employee element must have a name element as the first child element and an email or url element as the second child element. The third one indicates an employee element must have an id attribute. The word #PCDATA means characters, and the url element cannot have any children (it is an empty element). Refer to the XML 1.0 specification for the details.

Xerces is a validating processor, but it does not validate by default. So we must tell Xerces to validate an input XML document against the DTD. Listing 2.6 shows a sample program for the validation.

Listing 2.6 Parsing an XML document (validating), chap02/SimpleParseWithValidation.java
package chap02;
/**
* SimpleParseWithValidation.java
**/
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.ErrorHandler;
import org.apache.xerces.parsers.DOMParser;
import share.util.MyErrorHandler;
import java.io.IOException;
public class SimpleParseWithValidation {
public static void main(String[] argv) {
if (argv.length != 1) {
System.err.println("Usage: java "+
"chap02.SimpleParseWithValidation <filename>");
System.exit(1);
}
try {
// Creates parser object
DOMParser parser = new DOMParser();
[25] // Sets an ErrorHandler
[26] parser.setErrorHandler(new MyErrorHandler());
[27] // Tells the parser to validate documents
[28] parser.setFeature(
"http://xml.org/sax/features/validation",
true);
[31] // Parses an XML Document
[32] parser.parse(argv[0]);
[33] // Gets a Document object
[34] Document doc = parser.getDocument();
// Does something
} catch (Exception e) {
e.printStackTrace();
}
}
}
Again, let's look at the program in detail. First, a DOMParser object is created. In SimpleParse, shown in Listing 2.2, we caught a SAXException exception when an input XML document was not well-formed. An XML processor provides an error handler to handle errors more flexibly. The error handler recognizes fatal errors that prevent it from continuing a parsing process, errors that are defined in the XML 1.0 Recommendation, and warnings for other problems. Error handlers should implement the org.xml.sax.ErrorHandler interface. To create an error handler, there are two well-known methods.
If you can work with a general error handler that can be shared with other applications, the latter approach is good in terms of software reuse. If you want to use an application-specific handler, or you don't want to create a new class for the handler for some reason, the former approach may be better. This book employs the latter approach. MyErrorHandler, shown in Listing 2.7, is a typical implementation of an error handler.

Listing 2.7 Handling errors, share/util/MyErrorHandler.java
package share.util;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
public class MyErrorHandler implements ErrorHandler {
/** Constructor. */
public MyErrorHandler(){
}
/** Warning. */
public void warning(SAXParseException ex) {
System.err.println("[Warning] "+
getLocationString(ex)+": "+
ex.getMessage());
}
/** Error. */
public void error(SAXParseException ex) {
System.err.println("[Error] "+
getLocationString(ex)+": "+
ex.getMessage());
}
/** Fatal error. */
public void fatalError(SAXParseException ex) {
System.err.println("[Fatal Error] "+
getLocationString(ex)+": "+
ex.getMessage());
}
/** Returns a string of the location. */
private String getLocationString(SAXParseException ex) {
StringBuffer str = new StringBuffer();
String systemId = ex.getSystemId();
if (systemId != null) {
int index = systemId.lastIndexOf('/');
if (index != -1)
systemId = systemId.substring(index + 1);
str.append(systemId);
}
str.append(':');
str.append(ex.getLineNumber());
str.append(':');
str.append(ex.getColumnNumber());
return str.toString();
}
}
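To see what location prefix the handler in Listing 2.7 builds, its getLocationString() logic can be exercised without running a parser, by constructing SAXParseException objects directly. This is only a sketch: the class name LocationStringDemo, the sample message, and the file URI are hypothetical, and the method body is a copy of the listing's logic (using StringBuilder) rather than a call into the book's class.

```java
import org.xml.sax.SAXParseException;

// Illustrative only: reproduces getLocationString() from Listing 2.7.
class LocationStringDemo {

    /** Builds "<file name>:<line>:<column>" from the exception's location. */
    static String locationString(SAXParseException ex) {
        StringBuilder str = new StringBuilder();
        String systemId = ex.getSystemId();
        if (systemId != null) {
            // Keep only the part after the last '/', that is, the file name.
            int index = systemId.lastIndexOf('/');
            if (index != -1) {
                systemId = systemId.substring(index + 1);
            }
            str.append(systemId);
        }
        str.append(':').append(ex.getLineNumber());
        str.append(':').append(ex.getColumnNumber());
        return str.toString();
    }

    public static void main(String[] args) {
        // Arguments: message, public ID, system ID, line number, column number.
        SAXParseException withId = new SAXParseException(
            "sample message", null,
            "file:///samples/chap02/department-dtd2.xml", 4, 13);
        SAXParseException withoutId = new SAXParseException(
            "sample message", null, null, 4, 13);

        System.out.println(locationString(withId));    // department-dtd2.xml:4:13
        System.out.println(locationString(withoutId)); // :4:13
    }
}
```

When the parser does not know the system identifier (for example, when the input is an in-memory stream), the file name part is simply omitted.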
The org.xml.sax.ErrorHandler interface defines fatalError(), error(), and warning(). The MyErrorHandler class implements these methods to show a filename, line and column numbers, and the content of an error. In SimpleParseWithValidation (see Listing 2.6), MyErrorHandler is created in line 26 and set to a parser object.

[25] // Sets an ErrorHandler
[26] parser.setErrorHandler(new MyErrorHandler());

Next, we tell the XML processor to turn on validation by using the setFeature() method. This is a method of the org.xml.sax.XMLReader interface, which is implemented by the DOMParser class. The method is used to set various features of an XML processor. In this book, we use some of the features (see Section 6.3.1 for more on these features). Refer to http://xml.apache.org/xerces-j/features.html for the complete list of features. Note that the default value of the validation feature ("http://xml.org/sax/features/validation") is false, so SimpleParse in the previous section did not check the validity of the XML document.
[27] // Tells the parser to validate documents
[28] parser.setFeature("http://xml.org/sax/features/validation", true);
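The effect of switching validation on can be sketched with a small self-contained program. Several things here are assumptions made so that the example runs on a stock JDK: it uses JAXP's DocumentBuilderFactory.setValidating(true) as a stand-in for the setFeature() call on the Xerces DOMParser, it embeds the DTD of Listing 2.5 as an internal subset instead of referring to department.dtd, and the names ValidationSwitch and validityErrors() are hypothetical.

```java
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: JAXP stands in for the Xerces DOMParser class.
class ValidationSwitch {
    // Well-formed but invalid: the employee element lacks the required id.
    static final String INVALID =
        "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE department [\n"
        + "  <!ELEMENT department (employee)*>\n"
        + "  <!ELEMENT employee (name, (email | url))>\n"
        + "  <!ATTLIST employee id CDATA #REQUIRED>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "  <!ELEMENT email (#PCDATA)>\n"
        + "  <!ELEMENT url EMPTY>\n"
        + "  <!ATTLIST url href CDATA #REQUIRED>\n"
        + "]>\n"
        + "<department>\n"
        + "<employee><name>John Doe</name>"
        + "<email>John.Doe@foo.com</email></employee>\n"
        + "</department>\n";

    /** Parses the text and returns the messages passed to error(). */
    static List<String> validityErrors(String xml, boolean validate) {
        final List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(validate);   // false is the default
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException ex) { }
                public void error(SAXParseException ex) {
                    // Validity violations arrive here; parsing continues.
                    errors.add(ex.getLineNumber() + ": " + ex.getMessage());
                }
                public void fatalError(SAXParseException ex)
                        throws SAXParseException {
                    throw ex;   // well-formedness violations stop the parse
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println("validation off: " + validityErrors(INVALID, false));
        System.out.println("validation on:  " + validityErrors(INVALID, true));
    }
}
```

With validation off, the missing id attribute goes unnoticed because the document is still well-formed; with validation on, the same document produces a report through error().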
Finally, we start parsing. This is the same process as in SimpleParse.

[31] // Parses an XML Document
[32] parser.parse(argv[0]);
[33] // Gets a Document object
[34] Document doc = parser.getDocument();

Now we run this program to parse a valid XML document, department-dtd.xml.

R:\samples>java chap02.SimpleParseWithValidation chap02/department-dtd.xml
R:\samples>

Because department-dtd.xml shown in Listing 2.4 conforms to department.dtd (see Listing 2.5), it should be parsed without error. The next example is an invalid document, department-dtd2.xml, shown in Listing 2.8.

Listing 2.8 Invalid XML document, chap02/department-dtd2.xml
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE department SYSTEM "department.dtd">
<department>
<employee>
<name>John Doe</name>
<email>John.Doe@foo.com</email>
</employee>
<employee id="B.S">
<name>Bob Smith</name>
<email>Bob.Smith@foo.com</email>
</employee>
<employee id="A.M">
<name>Alice Miller</name>
<url href="http://www.foo.com/~amiller/"/>
</employee>
</department>
When we parse the document with SimpleParseWithValidation, we can see an error because the document does not conform to the DTD.

R:\samples>java chap02.SimpleParseWithValidation chap02/department-dtd2.xml
[Error] 4:13 Attribute "id" is required and must be specified for element type "employee".

As shown in the previous output, the fourth line of department-dtd2.xml has an error. The employee element does not have an id attribute, although it is required. Errors and warnings with line numbers make it possible to recognize where and why they occurred.

NOTE The difference between an error and a fatal error is defined in the XML 1.0 specification. An error is a violation of the rules of the specification. A conforming XML processor may detect and report an error and may recover from it. That means an application may still get the internal structure of the parsed XML document. Violations of validity constraints are errors. On the other hand, the XML processor must detect and report fatal errors to the application. Once a fatal error is detected, the processor must not continue normal processing. Violations of well-formedness constraints are fatal errors.

2.2.3 Design Point: Well-Formed versus Valid

In the previous sections, you learned how to parse well-formed and valid documents. In this section, we discuss which types of documents should be used when you design and develop real Web applications. In other words, what are the pros and cons of validation? This section discusses the design point from several viewpoints.