SQL-92 Character Sets and Collations for International Database Applications

Databases can contain Greek characters, or Cyrillic (Russian and Ukrainian), or Latin characters with the special accent marks that appear only in East European languages such as Polish, Slovenian, Romanian, or Turkish. Even well-known West European languages such as French, German, and Spanish have special rules of collation -- "how to sort" -- that baffle incomplete SQL DBMSs.

Consider the letter A. You probably know that your computer stores this value using the 8-bit sequence 01000001, which is the hexadecimal value 41, which is the decimal value 65, and so you've probably heard that "the letter A has the ASCII value 65". Another thing that you know is that "A comes before B", but that has nothing to do with the ASCII value. The fact that A comes before B is a fact about the alphabet -- you learned the rule in kindergarten before you ever saw a computer, and you know that if we used a value of 0 to encode B and 5 billion to encode A, that wouldn't make any difference to the rule.
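To make the distinction concrete, here is a minimal Python sketch. It is only an illustration: the odd encoding (B as 0, A as 5 billion) comes from the paragraph above, while the names ODD_ENCODING, ALPHABET, and alphabet_rank are invented for this example.

```python
# The code value of "A" in ASCII, shown as decimal, 8-bit binary, and hex.
code = ord("A")
bits = format(code, "08b")
hexa = format(code, "02X")
print(code, bits, hexa)

# A deliberately odd, hypothetical encoding: B is 0 and A is 5 billion.
ODD_ENCODING = {"B": 0, "A": 5_000_000_000}

# The alphabet rule, stated independently of any encoding.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def alphabet_rank(letter):
    """Position of a letter in the alphabet, regardless of its code value."""
    return ALPHABET.index(letter)


letters = ["B", "A"]
by_code_value = sorted(letters, key=ODD_ENCODING.get)  # follows the encoding
by_alphabet = sorted(letters, key=alphabet_rank)       # follows the rule
print(by_code_value)
print(by_alphabet)
```

Sorting by code value puts B first under the odd encoding, while sorting by the alphabet rule still puts A first. The rule and the encoding are separate things.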

Now, the statement "the letter A has the ASCII value 65" is a statement about the CHARACTER SET. Internally, a character set is a series of statements about what symbols exist and what code values they have. Actually, ASCII is only a specification for 7 bits (the values 0 to 127); nowadays we use 8-bit character sets which all start off with the ASCII values but which differ for the values 128 through 255. These 8-bit character sets differ because of the operating system's whim (consider the difference between the "OEM" and the "Windows" character set) or because of national language support (consider that Greeks and Russians use different alphabets). THE OCELOT SQL DBMS supports several "national" character sets, as well as the FIPS subsets SQL_CHARACTER, GRAPHIC_IRV, LATIN1, and ISO8BIT. That's a lot of 8-bit character sets, and to top it off, the DBMS also supports a 16-bit character set whose repertoire includes all the characters in all the 8-bit character sets. This 16-bit character set is named SQL_TEXT, and all the encodings in this 16-bit character set are as specified by the Unicode standard.
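The following Python sketch shows how 8-bit character sets agree on the ASCII range and disagree above it. The codec names (cp437 for an "OEM" set, cp1252 for a "Windows" set, iso8859_7 for Greek, iso8859_5 for Cyrillic) are Python's names, chosen here as stand-ins; they are assumptions, not the names the DBMS uses. UTF-16 stands in for a 16-bit Unicode encoding.

```python
# One byte value above 127, interpreted under four 8-bit character sets.
RAW = bytes([0xE1])
CODECS = ["cp437", "cp1252", "iso8859_7", "iso8859_5"]

decoded = {name: RAW.decode(name) for name in CODECS}
for name, char in decoded.items():
    print(name, repr(char), "U+%04X" % ord(char))

# Every one of these sets starts off with the same 128 ASCII values.
ascii_bytes = bytes(range(128))
shared_ascii = all(
    ascii_bytes.decode(name) == ascii_bytes.decode("ascii") for name in CODECS
)
print(shared_ascii)

# A 16-bit Unicode encoding has room for all four characters at once.
wide = {name: char.encode("utf-16-be").hex() for name, char in decoded.items()}
print(wide)
```

The same byte comes out as a sharp s, an accented a, a Greek alpha, or a Cyrillic letter depending on the character set, which is why the character set has to be known before the data means anything.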

As for "A comes before B", that's a specification about the COLLATION. Our default collation is the order in which the characters appear in Unicode, which happens to correspond to the usual expectations for English, French, Italian, Russian, etc. But most languages have special rules which go beyond this (and which incidentally cannot be handled with simplistic lookup tables). To support these languages, THE OCELOT SQL DBMS includes special collations for: ALBANIAN, AUSTRIAN, CROATIAN, CZECH, DUTCH, ESTONIAN, GERMAN, HUNGARIAN, LATVIAN, LITHUANIAN, NORDIC, POLISH, ROMANIAN, SLOVAK, SLOVENIAN, SPANISH, TURKISH, UKRAINIAN, WELSH. Effectively, this covers 99% of text in European languages.

Most practical needs will be met by the built-in character sets and collations, but there are SQL-92 statements (CREATE CHARACTER SET and CREATE COLLATION) for making more. There is also an object called TRANSLATION, made with CREATE TRANSLATION, designed for converting from one character set to another.
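The idea behind a translation can be sketched in Python as a decode from the source character set followed by an encode into the target one. This is a rough analogy under stated assumptions, not how the DBMS implements TRANSLATION: the function name translate and the codec names are chosen for this example only.

```python
def translate(data, source, target):
    """Convert bytes from one character set to another via Unicode.

    Raises UnicodeEncodeError if a character is not in the target's
    repertoire, since there is then nothing to translate it to.
    """
    return data.decode(source).encode(target)


# A Greek alpha stored in an 8-bit Greek set, converted to 16-bit Unicode.
greek_alpha = bytes([0xE1])
as_unicode = translate(greek_alpha, "iso8859_7", "utf-16-be")
print(as_unicode.hex())

# The same character has no place in an 8-bit West European set.
try:
    translate(greek_alpha, "iso8859_7", "latin-1")
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False
print(fits_in_latin1)
```

Converting into a 16-bit set whose repertoire covers every 8-bit set always succeeds. Converting between two 8-bit sets works only for the characters they have in common.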
