SQL-92 Character Sets and Collations for International Database Applications
Databases can contain Greek characters, or Cyrillic (Russian and Ukrainian),
or Latin characters but with the special accent marks that appear only in
East European languages such as Polish, Slovenian, Romanian, or Turkish. Even
with well-known West European languages such as French, German, and Spanish,
there are special rules of collation -- "how to sort" -- that baffle less
complete SQL DBMSs.
Consider the letter A. You probably know that your computer stores this value
using the 8-bit sequence 01000001, which is the hexadecimal value 41, which is
the decimal value 65, and so you've probably heard that "the letter A has the
ASCII value 65". Another thing that you know is that "A comes before B",
but that has nothing to do with the ASCII value. The fact that A
comes before B is a fact about the alphabet -- you learned the rule in
kindergarten before you ever saw a computer, and you know that if we used a
value of 0 to encode B and 5 billion to encode A, that wouldn't make any
difference to the rule.
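
The two facts can be kept apart in a few lines of Python. This is an illustration only; the names `odd_encoding` and `alphabet_position` are made up for the sketch. The code value of A belongs to the encoding, while the order of A and B comes from a separate rule:

```python
# The code value of "A" is a property of the encoding.
code = ord("A")
print(code, hex(code), format(code, "08b"))   # 65 0x41 01000001

# A deliberately odd, hypothetical encoding: B gets 0, A gets 5 billion.
odd_encoding = {"A": 5_000_000_000, "B": 0}

# The alphabet rule is stated separately from any code values.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def alphabet_position(letter: str) -> int:
    """Return the position of a letter in the alphabet (A is 0)."""
    return ALPHABET.index(letter)

letters = ["B", "A"]
by_code = sorted(letters, key=odd_encoding.get)      # ordered by code value
by_alphabet = sorted(letters, key=alphabet_position)  # ordered by the rule

print(by_code)       # ['B', 'A']
print(by_alphabet)   # ['A', 'B']
```

Sorting by code value follows whatever numbers the encoding happens to use; sorting by the alphabet rule gives A before B regardless.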
Now, the statement "the letter A has the ASCII value 65" is a statement about
the CHARACTER SET. Internally, a character set is a series of statements
about what symbols exist and what code values they have. Actually, ASCII is
only a specification for 7 bits (the values 0 to 127); nowadays we use 8-bit
character sets which all start off with the ASCII values but which differ for
the values 128 through 255. These 8-bit character sets differ because of the
operating system's whim (consider the difference between the "OEM" and the
"Windows" character set) or because of national language support (consider
that Greeks and Russians use different alphabets). THE OCELOT SQL DBMS supports
several "national" character sets, as well as the FIPS subsets SQL_CHARACTER,
GRAPHIC_IRV, LATIN1, and ISO8BIT. That's a lot of 8-bit character sets,
and to top it off, the DBMS also supports a 16-bit character set whose repertoire includes
all the characters in all the 8-bit character sets. This 16-bit character
set is named SQL_TEXT, and all the encodings in this 16-bit character set are
as specified by the Unicode standard.
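
As a rough illustration of how 8-bit character sets agree below 128 and disagree above it, the sketch below decodes the same two bytes with four code pages that ship with Python. The choice of code pages and the labels in `CODE_PAGES` are assumptions made for the example; they are not the character sets built into the DBMS.

```python
# Code pages chosen for illustration; the labels are informal descriptions.
CODE_PAGES = {
    "cp437": "OEM (DOS)",
    "cp1252": "Windows Western",
    "iso8859_7": "Greek",
    "cp1251": "Cyrillic",
}

ascii_byte = bytes([0x41])   # inside the 7-bit ASCII range
high_byte = bytes([0xE1])    # inside the 128..255 range

# Decode the same byte under every code page.
ascii_decoded = {name: ascii_byte.decode(name) for name in CODE_PAGES}
high_decoded = {name: high_byte.decode(name) for name in CODE_PAGES}

for name, label in CODE_PAGES.items():
    print(f"{label:16} 0x41 -> {ascii_decoded[name]!r}   0xE1 -> {high_decoded[name]!r}")

# In a 16-bit Unicode encoding each of these characters has one code value.
print("A".encode("utf-16-be"))   # b'\x00A'
```

Every code page decodes 0x41 as A, but 0xE1 comes out as four different characters, which is why a code value means nothing until the character set is known.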
As for "A comes before B", that's a specification about the COLLATION. Our
default collation is the order in which the characters appear in Unicode, which,
for the basic letters of each alphabet, happens to correspond to the usual
expectations for English, French, Italian, Russian, etc.
But most languages have special rules which go beyond this (and which
incidentally cannot be handled with simplistic lookup tables). To support
these languages, THE OCELOT SQL DBMS includes special collations for:
ALBANIAN, AUSTRIAN, CROATIAN, CZECH, DUTCH, ESTONIAN, GERMAN, HUNGARIAN, LATVIAN,
LITHUANIAN, NORDIC, POLISH, ROMANIAN, SLOVAK, SLOVENIAN, SPANISH, TURKISH, UKRAINIAN,
and WELSH. Effectively, this covers 99% of text in European languages.
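
One reason a per-character lookup table falls short is that some rules treat a pair of characters as a single letter. The sketch below uses the traditional Spanish convention, in which "ch" sorts after "c" and "ll" after "l", as an example. It is a simplified illustration and not the DBMS's SPANISH collation; `SPANISH_LETTERS` and `traditional_spanish_key` are names invented here, and accented vowels are not handled.

```python
# Traditional Spanish alphabet, with the digraphs "ch" and "ll" as letters.
SPANISH_LETTERS = [
    "a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k", "l", "ll",
    "m", "n", "ñ", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z",
]
RANK = {letter: position for position, letter in enumerate(SPANISH_LETTERS)}

def traditional_spanish_key(word: str) -> list[int]:
    """Build a sort key, consuming "ch" and "ll" as single letters."""
    word = word.lower()
    key = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in ("ch", "ll"):
            key.append(RANK[pair])
            i += 2
        else:
            # Characters outside the list sort after every listed letter.
            key.append(RANK.get(word[i], len(SPANISH_LETTERS) + ord(word[i])))
            i += 1
    return key

words = ["luz", "chico", "llama", "dedo", "curioso"]

by_code_point = sorted(words)
by_spanish_rule = sorted(words, key=traditional_spanish_key)

print(by_code_point)     # ['chico', 'curioso', 'dedo', 'llama', 'luz']
print(by_spanish_rule)   # ['curioso', 'chico', 'dedo', 'luz', 'llama']
```

The key function has to look at two characters at once, which a table that maps one character to one weight cannot express.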
Most practical needs will be met by the built-in character sets and collations,
but SQL-92 provides statements for making more (CREATE CHARACTER SET and
CREATE COLLATION). There is also an object called a TRANSLATION, made with
CREATE TRANSLATION, designed for converting from one character set to another.
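
The idea behind converting between character sets can be sketched outside SQL. The example below is Python, not an SQL TRANSLATION object, and the helper name `translate` is hypothetical. It maps bytes from the "OEM" code page cp437 to the "Windows" code page cp1252 by going through Unicode, and shows that a conversion fails when the target repertoire lacks a character.

```python
def translate(data: bytes, source: str, target: str) -> bytes:
    """Re-encode data from one 8-bit character set to another via Unicode."""
    return data.decode(source).encode(target)

# "café" stored in cp437: the accented e is byte 0x82.
oem_text = b"caf\x82"
windows_text = translate(oem_text, "cp437", "cp1252")
print(windows_text)   # b'caf\xe9'

# cp437 contains a Greek alpha at 0xE0, but cp1252 has no such character,
# so this conversion cannot succeed.
try:
    translate(b"\xe0", "cp437", "cp1252")
    alpha_translated = True
except UnicodeEncodeError:
    alpha_translated = False
print(alpha_translated)   # False
```

The same character keeps its identity while its code value changes from 0x82 to 0xE9; a character missing from the target set has to be rejected or replaced by some explicit rule.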
Copyright (c) 1997-2002 by Ocelot Computer Services Inc. All rights reserved.