
XML and Java: Developing Web Applications, Second Edition


3.6 Internationalization

XML is well suited to internationalization. Because it is based on Unicode, XML can represent text written in many natural languages. XML also supports many non-Unicode encodings, which are already in wide use, by converting them to Unicode. In fact, Xerces supports more than 40 encodings.

However, XML programming still requires special care where internationalization is concerned. Here is a brief summary of what you should know.

3.6.1 XML Declarations

We recommend that XML documents and DTDs (as well as external DTD subsets, external parsed entities, and external parameter entities) always begin with an XML declaration.

<?xml version="1.0" encoding="charset-name"?>

Here "charset-name" announces which character-encoding scheme is used to represent the document. The charset-name you must specify depends on which natural language and which editor you use.
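As a minimal sketch of the rule above, the following Java program writes a small document twice, each time encoding the bytes with the same charset that the XML declaration names, and then reads it back with the JAXP parser bundled with the JDK. The class name DeclaredEncodingDemo, its method names, and the greeting element are hypothetical and chosen only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class: the declared encoding and the actual byte encoding agree.
final class DeclaredEncodingDemo {

    // Serializes a one-element document; 'text' must not contain markup characters.
    static byte[] encode(String text, String charsetName) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + charsetName + "\"?>"
                   + "<greeting>" + text + "</greeting>";
        // The bytes are produced with the same charset that the declaration announces.
        return xml.getBytes(Charset.forName(charsetName));
    }

    // Parses the bytes and returns the text of the root element.
    // The parser picks the decoder from the XML declaration inside the byte stream.
    static String parseText(byte[] bytes) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new ByteArrayInputStream(bytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            // Wrapped so callers of this sketch need no checked-exception handling.
            throw new IllegalStateException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // contains one non-ASCII character
        System.out.println(text.equals(parseText(encode(text, "iso-8859-1"))));
        System.out.println(text.equals(parseText(encode(text, "utf-8"))));
    }
}
```

The two byte arrays differ (the accented character occupies one byte in ISO-8859-1 and two in UTF-8), yet the parser recovers the same text from each because the declaration tells it which decoder to use.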

If you use only characters in US-ASCII (to be precise, ANSI X3.4-1986), specify "us-ascii" as the charset-name or omit encoding="charset-name" entirely. If you use European languages, you are probably using one of the ISO 8859 family (typically ISO-8859-1); in that case, specify the corresponding name, for example "iso-8859-1".

If your text editor supports Unicode, we strongly recommend using it. Although XML supports legacy encodings, different XML processors implement the conversion of such encodings to Unicode differently. For example, different implementations provide different conversions for the Shift-JIS encoding. Such non-unique conversions to Unicode are very harmful for Web applications because a digital signature is computed after the conversion, so the same document can yield different signature values on different processors. In fact, one of the authors has experienced errors caused by such non-unique conversions. According to the XML 1.0 specification, every XML processor must support UTF-8 and UTF-16. Support for other encodings depends on the implementation of the XML processor.
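To make the non-unique conversion concrete, here is a small Java sketch that decodes the same two bytes with two charsets that are both commonly labeled "Shift-JIS" in practice. The class and method names are hypothetical, and the sketch assumes that the JDK in use ships the extended charsets "Shift_JIS" and "windows-31j"; it skips any charset that is missing.

```java
import java.nio.charset.Charset;

// Hypothetical demo class: one byte sequence, two Shift-JIS flavored decoders.
final class ShiftJisVariants {

    // Decodes the bytes with the named charset.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    // Formats a string as a space-separated list of U+XXXX code points.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 0x81 0x60 is the double-byte code for the Japanese wave dash.
        byte[] waveDash = {(byte) 0x81, (byte) 0x60};
        for (String name : new String[] {"Shift_JIS", "windows-31j"}) {
            if (Charset.isSupported(name)) {
                System.out.println(name + " -> " + codePoints(decode(waveDash, name)));
            } else {
                System.out.println(name + " is not available in this runtime");
            }
        }
    }
}
```

On JDKs that provide both charsets, the first typically yields U+301C (WAVE DASH) and the second U+FF5E (FULLWIDTH TILDE). A signature computed over one result will not verify against the other, which is the kind of failure described above.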

Unicode provides many encodings, such as UTF-8, UTF-16, and UTF-32, and each encoding has variations. Although the details of such encodings are beyond the scope of this book, we can give you a rule of thumb. If you use Notepad in Windows 2000, choose UTF-16 and specify "utf-16" as the encoding name.
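One visible difference between these variations is the byte order mark (BOM) that may precede the document. The following Java sketch, with the hypothetical names BomSniffer and sniff, guesses the Unicode encoding from the leading bytes. It is an illustration only, not the full detection procedure an XML processor uses, and it returns null when no BOM is present.

```java
// Hypothetical helper: guesses a Unicode encoding from a leading byte order mark.
final class BomSniffer {

    // Returns an encoding name, or null if the data does not start with a known BOM.
    static String sniff(byte[] data) {
        // The 4-byte marks are tested first because the UTF-32LE mark
        // begins with the same two bytes as the UTF-16LE mark.
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    // Compares the first bytes of 'data' with 'prefix', treating bytes as unsigned.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Little-endian UTF-16 with BOM: FF FE, then '<' as the bytes 3C 00.
        byte[] sample = {(byte) 0xFF, (byte) 0xFE, 0x3C, 0x00};
        System.out.println(sniff(sample)); // prints UTF-16LE
    }
}
```

In either byte order, a document that starts with a UTF-16 BOM can carry "utf-16" as its encoding name, because the BOM tells the processor which order is in use.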

3.6.2 Charset Parameter

Although an encoding declaration can be specified within an XML document, it is used only when the XML document is stored in a file on your hard disk.

If an XML document is sent or received via a protocol such as SOAP or HTTP, the encoding of the document is determined differently. Metadata about the document is transmitted together with it via the protocol. This metadata is the MIME header of the document. The encoding of the document is specified by the charset parameter of the "Content-Type" field in the MIME header.
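A receiving application therefore has to read the charset parameter from the header before it decodes the body. The following Java sketch shows one simple way to extract it. The names ContentTypeCharset and charsetOf are hypothetical, and the parsing is deliberately naive: it splits on semicolons and so does not handle quoted parameter values that themselves contain a semicolon.

```java
// Hypothetical helper: extracts the charset parameter from a Content-Type value.
final class ContentTypeCharset {

    // Returns the charset parameter, or null if the header carries none.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        // parts[0] is the media type itself; the parameters follow it.
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq < 0) {
                continue; // not a name=value pair
            }
            String name = param.substring(0, eq).trim();
            if (!name.equalsIgnoreCase("charset")) {
                continue; // parameter names are case-insensitive
            }
            String value = param.substring(eq + 1).trim();
            // Strip optional surrounding double quotes.
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1);
            }
            return value.isEmpty() ? null : value;
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("application/xml; charset=utf-8")); // prints utf-8
        System.out.println(charsetOf("application/xml"));                // prints null
    }
}
```

One way to use the result is to pass it to org.xml.sax.InputSource.setEncoding before handing the input source to the parser, so that the encoding from the protocol is used instead of a guess based on the document alone.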

We strongly recommend using the MIME type "application/xml" together with an appropriate charset parameter when an XML document is exchanged via a protocol. When a DTD is exchanged, use the MIME type "application/xml-dtd" with a charset parameter. You can find further discussions on this topic in Chapter 10.
