To process an XML document correctly, the encoding used to represent the document must be identified. XML permits this encoding to be specified in an encoding declaration within the document. When data is transferred between different data processing systems, the encoding is generally converted as well, but the contents are left unchanged. As a result, the encoding specified in the XML document may no longer match the encoding actually used to represent it.
To locate the encoding declaration in the XML document, an assumption about the encoding must already have been made so that the document can be read at all. This is possible, at least approximately, because a well-formed XML document must always begin with the string <?xml. The encoding currently used for the XML document can be derived by comparing the start of the document with the representation of this characteristic string in the various encodings supported by the parser.
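The comparison described above can be sketched as follows. This is a minimal illustration, not the parser's actual implementation: the function name `sniff_family` and the candidate list are hypothetical, Python's `cp037` codec is used only as a stand-in for an EBCDIC variant, and byte order marks are not handled.

```python
from typing import Optional

XML_PREFIX = "<?xml"

# (family name, Python codec used to encode the prefix).
# Assumption: cp037 stands in for EBCDIC; the BS2000 variants are not Python codecs.
CANDIDATES = [
    ("UTF-16", "utf-16-be"),
    ("UTF-16", "utf-16-le"),
    ("EBCDIC", "cp037"),
    ("UTF", "utf-8"),  # same bytes for UTF-8 and ISO 8859 variants
]

def sniff_family(head: bytes) -> Optional[str]:
    """Return the encoding family whose form of '<?xml' starts `head`."""
    for family, codec in CANDIDATES:
        if head.startswith(XML_PREFIX.encode(codec)):
            return family
    return None  # start of document not recognized
```

Only a family of encodings can be determined this way, because the characters of the prefix have the same byte values in many related encodings.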
Furthermore, BS2000/OSD allows a file attribute (CODED-CHARACTER-SET) that names an encoding to be assigned to a file; however, this does not force the file content to be represented in that encoding. When XML documents are made available in working memory (which is also possible in COBOL), the specifications in the program likewise allow the encoding used to represent the document to be derived; see the “ASSIGN clause” section in the “COBOL 2000 Compiler” manual [1].
Consequently, there are three sources from which the encoding used to represent the document can be derived:
Z1 | inferred, assumed encoding from examining the start of the document |
Z2 | external specification of the encoding as a file attribute or specifications in the program |
Z3 | encoding declaration in the XML document |
To avoid manual intervention before an XML document is processed wherever possible, the COBOL system tolerates, to some extent, missing or contradictory encoding specifications from these three sources.
The table below determines which encoding is ultimately assumed for processing or, where contradictions cannot be resolved, which I-O status results. A dash (–) means that the existence or compatibility in question plays no part in the decision.
Existing situation | | | | | | Decision taken | |
Z1 identified * | Z2 exists ** | Z3 exists ** | Z2 compatible with Z1 *** | Z3 compatible with Z1 *** | Z3 compatible with Z2 *** | Encoding | I-O status |
yes | yes | yes | yes | – | yes | Z3 | |
yes | yes | yes | yes | – | no | Z2 | |
yes | yes | yes | no | yes | – | | 3D |
yes | yes | yes | no | no | – | | 3D |
yes | yes | no | yes | – | – | Z2 | |
yes | yes | no | no | – | – | | 3D |
yes | no | yes | – | yes | – | Z3 | |
yes | no | yes | – | no | – | Z1 | |
yes | no | no | – | – | – | Z1 | |
no | yes | – | – | – | – | Document | Document in |
no | no | – | – | – | – | | 3D |
– | Unknown | – | – | – | – | | 3D |
– | – | Unknown | – | – | – | | 3D |
* | Only UTF-16, EBCDIC or UTF can be identified as encoding Z1. Here EBCDIC stands for an (imprecise) superset of all special variants (such as EDF03IRV, EDF041, etc.) and UTF for an (imprecise) superset of UTF-8 and all ISO variants supported by XHCS. |
** | Only UTF-8, UTF-16, EBCDIC, ISO646, and the special EBCDIC variants and ISO variants, i.e. all those which XHCS also knows, are understood under the term 'exists' as Z2 for documents in files and as Z3. All other encodings are regarded as 'unknown'. Only EBCDIC (for alphanumeric data items) and UTF-16 (for national data items) are possible as Z2 for documents in memory. |
*** | 'Encoding Zx compatible with encoding Zy' means that Zx and Zy designate the same encoding, or that Zx is a more precisely named encoding from the (imprecise) superset Zy. |
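The decision table and the compatibility rule from the last footnote can be expressed as a short function. This is only a sketch under stated assumptions: it reads 3D as the I-O status value and the three comparison columns as "Z2 with Z1", "Z3 with Z1" and "Z3 with Z2"; the names `decide`, `compatible` and `SUPERSETS` are hypothetical; the superset members are limited to the names mentioned in the text; and the row "Z1 not identified, Z2 exists" is deliberately left out.

```python
from typing import Optional, Tuple

UNKNOWN = "unknown"

# Assumption: only the members named in the text are listed here.
SUPERSETS = {
    "EBCDIC": {"EDF03IRV", "EDF041"},
    "UTF": {"UTF-8"},
}

def compatible(zx: str, zy: str) -> bool:
    """Zx equals Zy, or Zx is a more precise member of superset Zy."""
    return zx == zy or zx in SUPERSETS.get(zy, set())

def decide(z1: Optional[str], z2: Optional[str],
           z3: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Return (encoding, io_status); None means 'not set'."""
    if z2 == UNKNOWN or z3 == UNKNOWN:
        return (None, "3D")
    if z1 is None:
        if z2 is None:
            return (None, "3D")
        # Row "Z1 no, Z2 yes" is not modeled in this sketch.
        raise NotImplementedError("Z1 not identified but Z2 exists")
    if z2 is not None:
        if not compatible(z2, z1):
            return (None, "3D")
        if z3 is not None and compatible(z3, z2):
            return (z3, None)
        return (z2, None)
    if z3 is not None and compatible(z3, z1):
        return (z3, None)
    return (z1, None)
```

For example, `decide("EBCDIC", "EBCDIC", "EDF041")` selects the more precise declaration Z3, whereas `decide("UTF-16", "EBCDIC", None)` yields the I-O status because Z2 contradicts Z1.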
If the encoding ultimately selected only designates the imprecise superset EBCDIC, the special variant available at the time the program is compiled is used.
If the encoding ultimately selected only designates the imprecise superset UTF, UTF-8 is used.
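The two rules for imprecise supersets amount to a final mapping step, sketched here for illustration. The function name `resolve_superset` and the parameter `compile_time_ebcdic` are hypothetical; the latter stands for the special EBCDIC variant available when the program was compiled.

```python
def resolve_superset(encoding: str, compile_time_ebcdic: str) -> str:
    """Replace an imprecise superset name by a concrete encoding."""
    if encoding == "EBCDIC":
        # Use the special variant known at compile time.
        return compile_time_ebcdic
    if encoding == "UTF":
        return "UTF-8"
    # Already a precise encoding name: keep it unchanged.
    return encoding
```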
This encoding identification takes place in every OPEN DOCUMENT statement (without an AT phrase), and during an XML PARSE statement both for the primary XML document and for the external entities or DTDs that are referenced in this document.