| [ directory ] |
|
11.3 Mapping from an XML Document to TablesOne of the most important characteristics of XML is that it is semistructured data. It is difficult to define this word precisely, but it means data for which a strict schema cannot be defined, or for which it is difficult to define a strict schema. XML allows an iteration of an arbitrary number of elements. This flexibility of XML makes it difficult to map an XML document to one or more relational tables because a table has a strict schema. This section concentrates on mapping a semistructured XML document with relational tables. Chapter 15 covers a broader topic on data binding. Figure 11.2 shows the mapping between an XML document and relational tables. The upper part of the figure (a) shows a mapping from an XML document to a set of tables. An XML document is decomposed into one or more tables. In this case, we need a technique to convert structured (and ordered) data to table (and unordered) data. We should determine what information should be stored (some information may not be needed). We also should determine how many tables are required to cover the XML document based on a given XML schema. There is no established strategy for the decomposition at this moment. We introduce some approaches in this section. Figure 11.2. Mapping between an XML document and relational tables
The lower part of Figure 11.2 (b) shows the mapping from relational tables to an XML document. In this case, table data is converted to structured data. Most techniques are common with the opposite direction, but there is a design choice for representing the table data as a structured XML document. The figure shows two possibilities for the mapping, which are covered in the next section. For example, how can we define the schema of an XML document (po.xml) to store it in a database? Listing 11.1 shows the XML document. Listing 11.1 A DTD for a purchase order, chap11/po.xml
<?xml version="1.0"?>
<!DOCTYPE purchaseOrder [
<!ELEMENT purchaseOrder (shipTo, items)>
<!ATTLIST purchaseOrder invoiceNo CDATA #REQUIRED>
<!ELEMENT shipTo (name, street?)>
<!ATTLIST shipTo country CDATA #REQUIRED>
<!ELEMENT name (#PCDATA)>
<!ELEMENT street (#PCDATA)>
<!ELEMENT items (item*)>
<!ELEMENT item (#PCDATA)>
<!ATTLIST item qty CDATA #REQUIRED>
]>
<purchaseOrder invoiceNo="2001-08-31-12345">
<shipTo country="US">
<name>Alice Smith</name>
<street>123 Maple Street</street>
</shipTo>
<items>
<item qty="2">ThinkPad X21</item>
<item qty="1">ThinkPad T22</item>
</items>
</purchaseOrder>
If a schema (or a DTD) for the document is finalized and a program knows the details of the schema, we can define a schema that depends on the document to be stored. The schemas for two tables to store the document po.xml are shown in Tables 11.1 and 11.2. Data items appearing once in a document (such as invoiceNo and name) can be represented in a table (see Table 11.1). Items appearing multiple times (such as item) should be defined in separate tables (see Table 11.2). That means XML documents are semistructured data so that it is impossible to map a single table in most cases. When you want to submit a query to find "people who ordered the ThinkPad X21," the two tables must be joined, because the name and item elements are mapped in different tables. However, if the structure of an XML document is not very complex and its schema can be defined, this approach is worth trying. If an XML document can be decomposed into tables, the data can be accessed via SQL. It can be integrated with other (non-XML) data stored in databases and existing applications, which is an advantage for managing XML documents in an RDBMS. As mentioned before, the mapping between an XML document and tables is a nontrivial work. Database vendors like IBM and Oracle provide tools to help developers (see http://www.ibm.com/software/data/db2/extenders/xmlext/). For example, DB2 XML Extender provides an XML-based mapping language. Users can create the mapping file with a wizard-like GUI. Oracle provides XML Developer's Kit (XDK) (see http://otn.oracle.com/tech/xml/xdkhome.html) that contains an XML processor that supports XML Schema and useful classes to develop real applications by using XML and databases. Using these tools can reduce the cost of schema design and mapping. Though some tools for mapping are available, there is no tool that maps them automatically. Some design points are covered in the next section.
11.3.1 Designing Relational TablesMost practical XML documents (for example, a purchase order or a contract) are represented with multiple tables. The structure of a typical XML document consists of the following components:
First, we can create a table for a top-level concept represented by the root element of an XML document. Generally speaking, an XML document can depict a contract, a purchase order, a technical report, and so on. It can be represented as a single table called the parent table. For the document po.xml, a table shown in Table 11.1 is the parent table. The table defines the invoice_ID, country, name, and street elements. 11.3.2 Defining the Primary Keys in a TableThe parent table contains a column whose value is unique, so it can be used as a primary key. A primary key is a value or a set of values that can uniquely identify a row in a table. In po.xml, the value of the invoice_ID attribute can be used as the primary key. RDBMS prohibits storing multiple records that have the same primary key value, so we should make sure the value is actually unique. If an XML document contains no such unique identifier, a primary key should be generated and added to the table to join with other tables. In Listings 11.6 and 11.11 later in this chapter, we define a column name that does not appear in po.xml. Its value is generated as a unique value by using the java.rmi.UID class. Some of the constraints defined for XML schema languages can be mapped to the constraints for tables. For example, if an element is not optional, the NOT NULL constraint should be applied to the corresponding column name. 11.3.3 Designing a Table for Multiply Occurring ElementsElements iterated in an XML document (defined by using "*" and "+" in a DTD) can be represented in a separate table. In po.xml, the element item is represented in a table called a child table. A parent table and child tables are joined when an RDBMS receives a query. For child tables, defining the primary key is important, though in most cases the original XML document contains the identifier for the element. RDMBSs provide various types of constraints among tables. For example, if a record in a parent table is deleted, related records should be deleted from child tables. The following constraint does this automatically.
FOREIGN KEY ("invoice_ID") REFERENCES PO_TBL ("invoice_ID")
ON DELETE CASCADE) ;
This constraint indicates that the invoice_ID column in the item table shown in Table 11.2 (ITEM_TBL) is the foreign key associated with the invoice_ID column in the purchase order table shown in Table 11.1 (PO_TBL). And if a record that has a certain invoice_ID value is deleted from the PO_TBL table, all the records that have same invoice_ID value are automatically deleted from the ITEM_TBL table. 11.3.4 Datatype MappingA database schema strictly defines datatypes for each column. On the other hand, a DTD does not support datatypes such as numerical and fixed-length character types. Therefore, to express datatype information in a DTD, you should represent them explicitly in an XML document by using the dtype attribute, for example. In the first edition of this book, Chapter 6 introduced a sample program using this approach. Now XML Schema has come into play, and it makes it possible to define datatypes in schemas (see Chapter 9). RELAX, another XML schema language described in Chapter 16, also employs the datatype specification the same as XML Schema. Section 9.2.2 discusses mapping among the datatypes of Java, XML Schema, and SQL. A DTD does not constrain the length of characters (though XML Schema does support it). You should take care when you extract a string from an element's content and store it into a database. For example, if a column is defined as CHARACTER(32), it is impossible to store any string exceeding this limitation. If the maximum length of the data cannot be estimated, you should use the VARCHAR type, which allows you to change the maximum size after you create a table. If the length of characters is very long, use the Character Large Object (CLOB) or the Binary Large Object (BLOB) datatype. 11.3.5 Semantics of DataAll elements that appear once are not always represented as a single table. When a relational table is designed, there are many techniques for data modeling. Creating an Entity-Relationship Diagram (ER diagram) is one well-known method of modeling. The table design essentially depends on what an XML document represents, and this fact causes difficulty for automatic mapping between an XML document and tables without knowing the semantics of the data to be stored. Next we introduced some points for designing relational schemas for a given XML document. 11.3.6 Two ApproachesIn Section 11.2, we showed how to map between an XML document and database tables. When we know the schema of an XML document, we can define tables that can store the XML document. In this section, we call this approach the mapping method. However, if we don't know the schema for the document, how can we store the document? One solution is to decompose XML documents by using XPath and store a pair of an XPath expression and the content addressed by the expression. We call this approach the XPath method. By using the XPath method, we can store po.xml in the table shown in Table 11.3. Figure 11.3 depicts the two approaches. The mapping method requires a DTD or XML Schema to associate a node value with a column value in a table with an appropriate datatype. On the other hand, the XPath method does not require a schema, but all node values are stored in a column with a single datatype. Figure 11.3. The mapping and XPath methods
In Section 11.5, we further discuss these two methods using sample programs.
|
| [ directory ] |
|