8.1 A Brief Introduction to XML

In its most fundamental sense, XML simply provides a way to add structure to documents. Consider the problem that someone might face when e-mailing a list of CDs to a friend. Clearly, this e-mail will need to contain a list of artists, albums, and tracks: the same entities dealt with when constructing a CD database in Chapter 6. One approach might be to use tab stops to group information together, as in Listing 8.1.

Listing 8.1 Structuring a document with tabs
The Crüxshadows
    Telemetry of a Fallen Angel (1996)
        Descension
        Monsters
        Jackal-Head
    The Mystery of the Whisper (1999)
        Isis & Osiris (Life/Death)
        Cruelty
        Leave me Alone
    Wishfire (2002)
        Before the Fire
        Return (Coming Home)
        Binary
Although this is certainly easy for a human to read, and not even too difficult for a computer, a lot of information is lacking. The numbers in parentheses indicate the year the CD was released, but to someone unfamiliar with that convention, the numbers will appear meaningless. Also, simply looking at any particular word does not indicate what it represents. "Jackal-Head" could be an artist, album, or track, or even the name of a store where the CD was purchased, a club where the band played, or a restaurant. If the recipient does not know to expect a list in exactly this form, the file becomes meaningless because the semantics of the information (what each piece means and how the pieces relate to one another) are not present in the file.

In addition to that fundamental problem, this format has no standard. Perhaps one person will choose to use tab stops of four spaces, whereas someone else will use eight. Maybe someone will choose to have one new line between each album and two before the start of each new artist. Although none of these changes will greatly impact the ability of a person to read the file, they may complicate the creation of a program to manage such lists.

For simple data, such as a CD collection that deals with only three kinds of objects and two relationships, these problems are manageable. But for much more complex systems, these problems quickly become insurmountable. In a system that manages hundreds of relationships, six tab stops might mean one thing in one place in a file and something else in another, and determining which meaning applies cannot be done without mentally processing the whole document.

XML offers a way out of this nightmare by providing a very simple syntax with which to add semantic information to documents. This syntax looks very much like HTML, which is not surprising, as both XML and HTML have a common ancestor: SGML (Standard Generalized Markup Language).
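To make the fragility of the tab-based format concrete, here is a minimal Python sketch of a program that reads a list like Listing 8.1. The function name parse_indented and the abbreviated sample data are hypothetical, introduced only for illustration. The parser assumes exactly one tab per nesting level, which is precisely the kind of unwritten convention described above.

```python
# Hypothetical parser for a tab-indented CD list. It only works if the
# writer used exactly one tab character per nesting level.
SAMPLE = (
    "The Crüxshadows\n"
    "\tWishfire (2002)\n"
    "\t\tBefore the Fire\n"
    "\t\tBinary\n"
)


def parse_indented(text, indent="\t"):
    """Return {artist: {album: [tracks]}} based purely on indentation depth."""
    collection = {}
    artist = album = None
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank separator lines
        stripped = line.lstrip(indent)
        depth = (len(line) - len(stripped)) // len(indent)
        name = stripped.strip()
        if depth == 0:
            artist, album = name, None
            collection[artist] = {}
        elif depth == 1:
            album = name
            collection[artist][album] = []
        else:
            collection[artist][album].append(name)
    return collection


as_written = parse_indented(SAMPLE)

# The same list typed with four spaces per level instead of tabs:
# every line now looks like a top-level artist.
with_spaces = parse_indented(SAMPLE.replace("\t", "    "))

print(as_written)
print(list(with_spaces))
```

Note that even in the case that parses as intended, the year stays buried inside the album string, because nothing in the file says what the parenthesized number means.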
An HTML tag, such as <H1>...</H1>, was originally intended to convey a semantic meaning: that the body of the tag is a level 1 header. Over time, this meaning has become diluted; today, HTML is generally used to specify how data should be presented rather than what the data means. In the terms that have been used throughout this book, HTML has gone from describing a model to describing a view.

Despite HTML's changing role, the fundamental idea of using such tags to denote meaning is still sound. The only major piece missing is a way to create new tags to describe arbitrary kinds of entities instead of a fixed set of headers, images, and so on. This is where the "extensible" in Extensible Markup Language comes in. Creating an XML document can be as simple as deciding what tags to use and how they relate. Listing 8.1 could be rewritten in a much better, more structured way using XML, as shown in Listing 8.2.

Listing 8.2 Structuring a document with XML
<?xml version='1.0' encoding='iso-8859-1'?>
<artist name="The Crüxshadows">
    <album name="Telemetry of a Fallen Angel" year="1996">
        <track>Descension</track>
        <track>Monsters</track>
        <track>Jackal-Head</track>
    </album>
    <album name="The Mystery of the Whisper" year="1999">
        <track>Isis &amp; Osiris (Life/Death)</track>
        <track>Cruelty</track>
        <track>Leave me Alone</track>
    </album>
    <album name="Wishfire" year="2002">
        <track>Before the Fire</track>
        <track>Return (Coming Home)</track>
        <track>Binary</track>
    </album>
</artist>
As this listing shows, the rules of XML are very much like those of HTML, despite some important differences in terminology. First, the file starts with a declaration of what kind of document it is and the character set it is using.[1] In XML, the items in angle brackets, called tags in HTML, are referred to here as nodes (the XML specification itself calls them elements). Every node has a name, which is the primary identifier. Listing 8.2 has nodes named artist, album, and track. Nodes are allowed to have attributes, as in HTML. The album node has the attributes name and year. The use of the word name as an attribute may be a bit misleading but is seen quite often. Here, name refers to the name of the album, not the name of the node.
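The distinction between a node's name, its attributes, and its body can be seen by loading the document into a parser. The following is a small sketch using Python's standard xml.etree.ElementTree module. The document is an abbreviated copy of Listing 8.2 with the XML declaration left out so that it can live in a plain string, and the helper describe is a hypothetical name used only here.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of Listing 8.2 (one album, two tracks). The XML
# declaration is omitted because the text is already a decoded string.
DOC = """<artist name="The Crüxshadows">
  <album name="The Mystery of the Whisper" year="1999">
    <track>Isis &amp; Osiris (Life/Death)</track>
    <track>Cruelty</track>
  </album>
</artist>"""

root = ET.fromstring(DOC)


def describe(node):
    """Return (node name, attributes, body text) for a single node."""
    return node.tag, dict(node.attrib), (node.text or "").strip()


# The node's name is 'artist'; the attribute called 'name' holds the band.
print(describe(root))

for album in root.findall("album"):
    print(album.get("name"), album.get("year"))
    for track in album.findall("track"):
        print("   ", track.text)  # &amp; is decoded back to a plain &
```

Here node.tag is the name of the node, while node.get("name") reads the attribute that happens to be called name, which mirrors the distinction drawn above.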
Nodes can be nested arbitrarily, but a document must have one and only one top-level node, called the root node. In Listing 8.2, the artist node is the root. It would not be legal to list the CDs from another artist in this same document by simply adding a new artist node. Instead, both artist nodes would need to be contained within another node, which might be called collection. Besides containing other nodes, a node can contain a block of plain text, as the track nodes in Listing 8.2 do.

There is considerable freedom in deciding on the format of an XML document. For example, the name of each track could be placed in an attribute, such as <track name="Binary"/>, instead of in the body of the track node. The choice is completely free, although experience will often suggest one way over another. Note that if a node has no body, it must end with a slash (/>) to indicate that the node does not have a corresponding close tag.

Listing 8.2 constitutes what is called a well-formed XML document, meaning that it follows the rules of XML syntax, such as providing a single root node, properly matching opening and closing tags, and so on. Beyond following these simple rules, an XML document can and should have much more information. Listing 8.2 implies certain things about the nodes that are used, such as the existence of the artist, album, and track nodes; that artist may have a name attribute; and so on. However, these rules are not explicitly stated; nor does the listing specify any others that may be important to enforce. Placing an album node within a track node would still result in well-formed XML, but this information would now be meaningless in context. The mechanism to fix this is called a document type definition (DTD). The DTD describes all the nodes that a document will use, their attributes, and their relationships.
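The well-formedness rules just described can be checked mechanically, because a conforming parser rejects any document that breaks them. The sketch below again uses Python's standard xml.etree.ElementTree module; the function is_well_formed and the sample strings are hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Two artists wrapped in a single root node: legal.
one_root = '<collection><artist name="A"/><artist name="B"/></collection>'

# Two top-level nodes: not legal, a document has exactly one root.
two_roots = '<artist name="A"/><artist name="B"/>'

# The album node is opened but never closed.
unclosed = '<artist name="A"><album name="X" year="2002"></artist>'

# An album inside a track: meaningless, yet still well formed.
misplaced = '<artist name="A"><track><album name="X" year="2002"/></track></artist>'

for label, doc in [("one_root", one_root), ("two_roots", two_roots),
                   ("unclosed", unclosed), ("misplaced", misplaced)]:
    print(label, is_well_formed(doc))
```

The last case is the interesting one: the parser has no objection to it, which is why something beyond well-formedness is needed to catch it.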
This information, and more, could also be specified using an XML schema; however, schemas are beyond the scope of this book, as are the art and science of creating DTDs. A possible DTD for describing a CD collection is shown in Listing 8.3.

Listing 8.3 The document type definition
<!ELEMENT artist (album*)>
<!ATTLIST artist name CDATA #REQUIRED>
<!ELEMENT album (track*)>
<!ATTLIST album name CDATA #REQUIRED>
<!ATTLIST album year CDATA #REQUIRED>
<!ELEMENT track (#PCDATA)>

Once such a DTD is created, the document can reference it with a single line near the top, after the XML declaration and before the root node:

<!DOCTYPE artist SYSTEM "cd.dtd">

With the inclusion of a DTD, like Listing 8.3, an XML document can be not only well formed but also valid. Such a document not only is syntactically correct but also follows all the rules stated in the DTD and is therefore semantically correct. Flipping tags around in a meaningless way would now render a document invalid. This check can be done very early, when the document is first parsed, avoiding any potential errors that could result from bad data getting farther into the system. In addition, providing a DTD will often allow the data to be parsed and represented more efficiently. Many XML editors are also able to read a DTD and can ensure that the rules are followed while the document is being created or changed.
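To give a feel for what validity adds beyond well-formedness, the following Python sketch hand-transcribes the rules of Listing 8.3 into a dictionary and checks a parsed document against them. This is only an illustration of the idea, not a DTD validator: Python's standard xml.etree.ElementTree module does not validate against DTDs, and the names RULES and find_violations are hypothetical. The sketch also ignores parts of the DTD, such as the ban on undeclared attributes and on stray text inside artist and album nodes.

```python
import xml.etree.ElementTree as ET

# Hand-transcribed from Listing 8.3: allowed child nodes and required
# attributes for each node. Illustration only, not a DTD parser.
RULES = {
    "artist": {"children": {"album"}, "required": {"name"}},
    "album": {"children": {"track"}, "required": {"name", "year"}},
    "track": {"children": set(), "required": set()},
}


def find_violations(node, rules=RULES):
    """Return a list of human-readable rule violations at or below node."""
    rule = rules.get(node.tag)
    if rule is None:
        return [f"unknown node <{node.tag}>"]
    problems = []
    for attr in sorted(rule["required"] - set(node.attrib)):
        problems.append(f"<{node.tag}> is missing attribute '{attr}'")
    for child in node:
        if child.tag not in rule["children"]:
            problems.append(f"<{child.tag}> is not allowed inside <{node.tag}>")
        else:
            problems.extend(find_violations(child, rules))
    return problems


good = ET.fromstring(
    '<artist name="A"><album name="X" year="2002">'
    '<track>Binary</track></album></artist>'
)
flipped = ET.fromstring(
    '<artist name="A"><track><album name="X" year="2002"/></track></artist>'
)

print(find_violations(good))
print(find_violations(flipped))
```

Both sample documents are well formed, but only the first passes the rules; the second is the "flipped tags" case that a real validating parser would reject when the document is first read.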