In the course of increasing internationalization, the Unicode character set is also becoming extremely important in BS2000 and its applications. Detailed information on this subject can be found in the manual “Unicode in BS2000”.
Unicode support in BS2000 is embedded in the existing concept of coded character sets (CCSs), see the manual “XHCS (BS2000)”.
A prerequisite for using Unicode is a suitable BS2000 system environment. The products which are currently being relevant for SESAM/SQL are ESQL-COBOL, COBOL, CRTE and XHCS, see the Release Notice for SESAM/SQL-Server.
The concept of Unicode support in SESAM/SQL enables Unicode characters to be used in the columns of tables and takes into account coded character sets in databases, in I/O files and for user programs. This concept influences the SQL language description, the utility functions, and the SESAM/SQL user programs.
EBCDIC character sets
The standard character set in BS2000 is EBCDIC.DF.03IRV (CCS name EDF03IRV
), a 7-bit character set whose character repertoire is the same as that of the ASCII 7-bit character set, extended by the second control character block of ISO8859-1.
EDF03IRV comprises only 191 characters, 95 of these being printable characters. In their character repertoire the 8-bit character sets EDF04x (x=1, 2, 3, 4, 5, 7, 8, 9, F) are the same as the corresponding character sets ISO8859-x.
EBCDIC character sets all contain the same EBCDIC core (79 characters). The 8-bit character sets also contain language-specific characters, e.g. Greek characters in EDF047. No special aspects had to be taken into account for your application in SESAM/SQL V4.0 or earlier. The database’s coded character set (CODE_TABLE
) had comment characters.
One byte is sufficient for the binary presentation (coding) of an EBCDIC character. The length of an EBCDIC character is consequently one byte.
To ensure compatibility with the Unicode character set, the length in code units (1 code unit = 1 character = 1 byte) is also defined for EBCDIC strings.
Values and columns with alphanumeric (EBCDIC) data types are compared and sorted in binary format:
Latin lower-case letters (a-z) < Latin upper-case letters (A-Z) < numbers (0-9).
In the SESAM/SQL suite of manuals the term “alphanumeric” expresses the affiliation to an EBCDIC character set, e.g. alphanumeric data type, alphanumeric value, alphanumeric literal.
Unicode character set
Unicode incorporates all the known text character in the world in a single character set. Furthermore, Unicode is not dependent on different vendors, systems and countries.
In Unicode each character is assigned a number, which is referred to as the code point. A Unicode code point is generally specified in the form U+n
, where n
consists of 4 to 6 hexadecimal digits. Example: the Euro character €: U+0020AC
.
The Unicode character set contains over one million code points.
So-called surrogate pairs are used to represent Unicode values above FFFF
hexadecimal):
Code point range | UTF-16 encoding (2-byte) | Comments |
| Leading Surrogates | |
| Trailing Surrogates | |
| Surrogate Pair | Leading Surrogates followed by |
Table 2: Surrogate representation
“Noncharacters” are code points which are reserved in Unicode for internal purposes. In SESAM/SQL they may not be used in Unicode character strings.
The table below shows the code point ranges of the noncharacters and how these characters are encoded in UTF-16.
Code point range | UTF-16 encoding (2-byte) | Comments |
| Not permitted | 32 code points |
| Not permitted | 32 code points |
| Not permitted | 2 code points |
Table 3: Unicode noncharacters (x is a hexadecimal digit)
Code points are represented in bytes (encoded) for use in data processing in various ways. For this purpose the Unicode Consortium defines three different encoding forms: UTF-8, UTF-16 and UTF-32.
The length of Unicode strings in the particular encoding form is specified in “code units” here. If only the so-called “Basic Multilingual Plane” (BMP, corresponds to UCS2) is used for encoding form UTF-16, then 1 code unit = 2 bytes.
SESAM/SQL uses the encoding form UTF-16, to represent Unicode characters in the databases.
Consequently precisesly two bytes, i.e. one code unit in UTF-16, are required for binary representation (encoding) of a UTF-16 character in SESAM/SQL. Values and columns with Unicode data types are compared and sorted (in relation to UTF-16) in binary format:Numbers (0-9) < Latin upper-case letters (A-Z) < Latin lower-case letters (a-z)
In the SESAM/SQL suite of manuals the term “national” expresses the affiliation to a Unicode character set, e.g. national data type, national value, national literal.
The technical report of the Unicode standard describes (by analogy to the encoding for UTF-8 for ASCII servers) a UTF-EBCDIC encoding for EBCDIC servers. The characters of UTF-8 are represented in one byte and correspond to the characters of the ASCII 7-bit character set. Analogously, the characters of UTF-EBCDIC are represented in one byte and correspond to the standard character set EDF03IRV of BS2000. The advantage of UTF-EBCDIC (BS2000, CCS name UTFE
) is that all BS2000 system programs which are traditionally restricted to the character repertoire EDF03IRV (e.g. the BS2000 command processor) can also process character strings in the encoding form UTF-EBCDIC without any adaptations.
Support of Unicode in SESAM/SQL
The data types NATIONAL CHARACTER (NCHAR) and NATIONAL CHARACTER VARYING (NVARCHAR) are provided in SESAM/SQL in order to support Unicode. Unicode data can be stored in columns with these data types.
In SESAM/SQL the data is stored in the encoding form UTF-16 (in the scope of the BMP). Since two bytes of memory (one code unit) are required for a UTF-16 character in the internal SESAM/SQL representation, the following maximum lengths apply for the Unicode data types:
NCHAR: 128 characters (256 bytes)
NVARCHAR: 16000 characters (32000 bytes)
The Unicode data types can be used in operations such as compare, concatenate and assign.
All functions that were previously provided in SESAM/SQL for CHARACTER (VARYING) are also available for the new data types NATIONAL CHARACTER (VARYING). The relevant constraints must be borne in mind here.
The new data types are also supported in the corresponding DDL and Utility statements, see the CREATE TABLE and ALTER TABLE statements in the „SQL Reference Manual Part 1: SQL Statements“ and the CREATE CATALOG, ALTER CATALOG, LOAD, UNLOAD, EXPORT and IMPORT statements in the „SQL Reference Manual Part 2: Utilities“.
Coded character set of the database
To interpret (alphanumeric) character sets in columns with the CHARACTER and CHARACTER VARYING data types correctly, SESAM/SQL must know the coded (EBCDIC) character set in which the data has been encoded.
When a new database is created (using CREATE CATALOG) or an existing database is modified (using ALTER CATALOG), you can therefore specify or change the database’s CCS name using the CODE_TABLE
parameter. In SESAM/SQL the database’s CCS name is used when data of the type (VAR)CHAR is converted to N(VAR)CHAR and vice versa. This is, for example, necessary when the data type of a column is changed and in some variants of the utility statements LOAD and UNLOAD.
Coded character set of the application
To interpret character sets correctly, SESAM/SQL must know the coded (EBCDIC) character set with which the application interprets the character strings.
With the new connection module parameter CCSN
(CCS name) you notify the independent DBH of the character set which the application uses to interpret character strings. You notify the linked-in DBH of this using the DBH option LINKED-IN-ATTRIBUTES.
SQL statements can be processed in the DBH only if the application’s CCS name is the same as the database’s CCS name or if no coded character set is used for the database (CODE_TABLE
has the value _NONE_
).
Activating Unicode in an existing database
In SESAM/SQL databases up to and including V4.0 the CODE_TABLE
parameter in CREATE CATALOG had comment characters and was stored as specified in the metadata. As this parameter now has a meaning, the first time an existing database is accessed by SESAM/SQL as of V5.0 it is assigned the value _NONE_
, i.e. the database does not use a coded character set.
In order to use Unicode the database administrator must enter the coded (EBCDIC) character set currently used for the database using the utility statement ALTER CATALOG, e.g. EDF03IRV or EDF041 for a German BS2000 environment. The coded character set should also be defined for the applications (see above).
If the value _NONE_
is retained for the database, no implicit conversions from (VAR)CHAR to N(VAR)CHAR and vice versa are possible. All requests to the database which require such a conversion are rejected.
Example 1
A STAFF
table contains the column LASTNAME
with the data type CHARACTER(40). The CCS name of the database is EDF041
; the data is therefore stored with the CCS name EDF041
. You can use Unicode in two different ways:
The following DDL statement changes the data type of the
LASTNAME
column to NATIONAL CHARACTER, and the data stored in this column is converted implicitly from EDF041 to UTF16:ALTER TABLE STAFF ALTER COLUMN LASTNAME SET NCHAR(40)
The following statement defines an additional column:
ALTER TABLE STAFF ADD COLUMN ALIAS_LASTNAME NCHAR(40)
After this, the data is converted explicitly from their original column of data type CHARACTER to the target column of data type UTF16. The database’s coded character set EDF041 is used here:
UPDATE STAFF SET ALIAS_LASTNAME=TRANSLATE(LASTNAME USING CATALOG_DEFAULT)
Example 2
A STAFF
table contains the column LASTNAME
with the data type CHARACTER(40). The CCS name of the database has the value _NONE_
. An additional column is defined using the following statement:
ALTER TABLE STAFF ADD COLUMN ALIAS_LASTNAME NCHAR(40)
After this, the data is converted explicitly from their original column of data type CHARACTER to the target column of data type UTF16. The coded character set EDF041 is specified explicitly here:
UPDATE STAFF SET ALIAS_LASTNAME = TRANSLATE(LASTNAME USING EDF041)
The following statement is rejected becasue the database’s CCS name is _NONE_
and the conversion can therefore not be performed.
UPDATE STAFF SET ALIAS_LASTNAME=TRANSLATE(LASTNAME USING CATALOG_DEFAULT)
Transliteration, transcoding
The SQL function TRANSLATE() is provided in SESAM/SQL to convert alphanumeric strings (data type (VAR)CHAR) to Unicode strings (data type N(VAR)CHAR, character set UTF-16) and vice versa (transliteration).
TRANSLATE() also converts alphanumeric strings in the Unicode character set UTF-EBCDIC (BS2000, CCS name UTFE
) to the Unicode character set UTF-16 and vice versa (transcoding). See the “SQL Reference Manual Part 1: SQL Statements”.
Normalization
The encoding of a character in Unicode is not unique, i.e. there could be more than one encoding for a character.
A typical example of this is provided by the German umlauts. For example, the character Ä has both the code point U+00C4
(composed form) and the code point combination U+0041
and U+0308
(decomposed form). In normalized presentation forms these differences do not occur. If two normalized strings differ, it is in their different code point presentations.
Non-normalized strings can result in consequent errors after a collation.
In SESAM/SQL the SQL function NORMALIZE() converts a string with national characters (data type N(VAR)CHAR) to a normalized form. Only those characters being taken into account which have code points in the range U+0000
through U+2FFF
. Other characters, e.g. surrogates, remain unchanged. See the “SQL Reference Manual Part 1: SQL Statements”.
The normalization procedure requires CPU performance. The data of a SESAM/SQL database should therefore be available in normalized and compressed form.
If it is not certain whether (input) data in normalized form is available, normalization to normalization form C should be performed using the SQL function NORMALIZE().
Example
The code point U+1ED6
corresponds to the Latin upper-case letter „O“ with circumflex and tilde. This character can be generated with the help of three code points: U+00D4
for “Ô” and U+0303
for tilde or U+004F
for “O” and U+0302
for circumflex and U+0303
for tilde. The code-point sequence U+00D5
for the „Õ“ with tilde and U+0302
for circumflex also produces this character. The only rule is that the base character must come before the diacritical marks linked to it.
In other words the result of normalization form C (compose) for NORMALIZE() for the uppercase Latin letter “O” with circumflex and tilde is U+1ED6
, and the result of the normalization form D (decompose) is the code point combination U+004F
, U+0302
, U+0303.
Sort sequence
In most programs, character strings are compared in binary form. The binary sort sequence of the characters depends on their encoding:
For all ISO8859 and Unicode encodings, the following applies:
Numbers (1-9) < Latin upper-case letters (A-Z) < Latin lower-case letters (a-z)For EBCDIC and UTFE, the following applies:
Latin lower-case letters (a-z) < Latin upper-case letters (A-Z) < numbers (1-9).
Umlauts and diacritical characters are not arranged unambiguously in these collating sequences. This can result in poor collating sequences, e.g. the name “Zuse” can appear ahead of the name “Öhler” in the collation of a list of names.
The Unicode standard defines a linguistic sort algorithm. Each Unicode character is assigned a collation element. The sequence in which the Unicode characters are sorted is defined with the aid of these collation elements. The collation elements are defined by means of a table supplied by XHCS (Default Unicode Collation Table, DUCET). This table contains a priority for the character at various levels. SESAM/SQL recognizes three levels, and these are displayed in the table below. The individual characters are always compared from left to right. The first difference determines the result of the comparison.
Comparison level | Description | Example |
Level 1 | Base character | a < b |
Level 2 | Diacritics | A < Å |
Level 3 | Uppercase/lowercase | a < A |
Level 1: | Each base character (a,b,c, etc.) is assigned a permanent priority in the Default |
Level 2: | The base character has a diacritic. A diacritic is an additional character (e.g. |
Level 3: | The sort sequence is defined by the distinction between upper- and lowercase |
In BS2000 you can obtain the collation element via XHCS, see also the „XHCS (BS2000)“ manual.
The SQL function COLLATE() supplies the corresponding collation element of the Default Unicode Collation Table for national strings.
See the “SQL Reference Manual Part 1: SQL Statements”.