Font Basics
A font file contains graphic representations of characters called glyphs. Computers store information, including text, as numbers. So, how does a computer know which glyph in a font file to display for the character typed by the user? The font file contains one or more character maps that assign a number to every glyph in the font, according to the input language chosen by the user. When a key is pressed, the computer uses the map associated with that language to look up the correct glyph in the font, retrieve the data, and draw it on the screen.
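As a rough illustration of the lookup described above, a character map can be pictured as a table from character codes to glyph indices. The sketch below is a toy model only: `toy_cmap`, `glyph_index` and the glyph numbers are made up for illustration and do not reflect any real font's table layout.

```python
# Toy character map: character code -> glyph index (all values hypothetical).
NOTDEF = 0  # index 0 is conventionally the "missing glyph" placeholder

toy_cmap = {
    0x0041: 36,   # A
    0x0061: 68,   # a
    0x00E9: 112,  # e with acute
}

def glyph_index(char, cmap):
    """Return the glyph index for one character, or NOTDEF if it is unmapped."""
    return cmap.get(ord(char), NOTDEF)

# "?" is not in the toy map, so it falls back to the missing-glyph index.
indices = [glyph_index(c, toy_cmap) for c in "A\u00e9?"]
print(indices)
```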
There have been many problems with this method. The mapping tables in older font files used 8-bit codes, and so could only handle a maximum of 256 characters. The first half of these were reserved, generally, for basic ASCII characters, leaving very little room for other characters, so separate fonts were required for different languages. Add to this the fact that different operating systems had different maps for the same characters, and the result is confusion.
Enter ISO/IEC 10646 and Unicode. A Unicode font can map up to 65 536 characters (using 16-bit codes), allowing enough room for all currently used characters to be placed in one font. It is also an independent standard, so any computer using Unicode can communicate with any other.
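The confusion between 8-bit maps is easy to reproduce. The sketch below decodes the same byte with three legacy code pages that ship with Python's standard codecs; the choice of byte and of code pages is mine, for illustration.

```python
# One byte, three legacy 8-bit character maps.
raw = bytes([0xE9])

decoded = {
    name: raw.decode(name)
    for name in ("latin-1", "cp437", "mac_roman")
}

for name, char in decoded.items():
    # Show which Unicode code point each code page assigns to byte 0xE9.
    print(f"{name:10} -> U+{ord(char):04X} {char}")
```

Each code page assigns a different character to the byte, so text written on one system displays wrongly on another unless the map is known.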
ISO/IEC 10646 and Unicode
ISO/IEC 10646 is a relatively new character set standard, first published in 1993 by the International Organization for Standardization (ISO). Its name is "Universal Multiple-Octet Coded Character Set" or UCS. UCS is the first officially standardized coded character set intended eventually to include all characters used in all the written languages in the world (and, in addition, all mathematical and other symbols). To be able to give every character of this grand repertoire a unique coded representation, the designers of UCS chose a uniform encoding, using bit sequences consisting of 16 or 31 bits (in the two coding forms, UCS-2 and UCS-4). This is the reason for the phrase "Multiple-Octet" in the name of the standard.
Unicode (as of 2005-06-30 at version 4.1) is a coded character set specified by a consortium of major American computer manufacturers, primarily to overcome the chaos of different coded character sets in use when creating multilingual programs and internationalizing software. From version 1.1 on, Unicode is scrupulously kept compatible with ISO/IEC 10646 and its extensions. The consortium is also an important contributor to the ISO work to further develop ISO/IEC 10646.
In short, Unicode can be characterized as the (restricted) 2-byte form of UCS on (the most general) implementation level 3, with the addition of a more precise specification of the bi-directional behavior of characters, when used in the Arabic and Hebrew scripts. Extensions in version 2.0 and greater make it possible to access the wider coding space of UCS-4, within this 16-bit encoding.
Unicode is intended to be usable both for internal data representation in computer systems and in data communication. It is already employed in commercial products from Microsoft, Novell, Apple and others. It is implemented in free software like Linux, and is the default for XML and Java.
Glyphs, characters and compatibility
In order to enable conversion between older character sets and Unicode, the standard contains many compatibility characters. For example, character combinations [Lj Dž ...], symbols [µ Å], superscripts [² ™] and alternates [ſ] are present in Unicode, tho they should ideally be encoded as the ordinary characters, with additional formatting if necessary. Many compatibility characters for the Han character set also exist to enable round trip conversions to other standards.
Some glyphs are represented by more than one character in order to provide conversions between lower and upper cases. A prime example of this is the various encodings for d with stroke: Croatian [Đđ], Icelandic [Ðð], African [Ɖɖ]. These pairs share the upper case glyph, tho all have a different lower case glyph. Thus six characters are encoded, instead of just four. Without this, conversion from upper case to lower case would be ambiguous. (Font designers need only one glyph to represent all three upper case letters, since each character can point to the same glyph in the font.)
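The three look-alike capitals can be told apart by their code points, and each one lowercases to its own letter. A small check using Python's built-in case mapping (the dictionary labels are mine):

```python
# Three capitals that can share one glyph but are distinct characters.
capitals = {
    "Croatian d with stroke": "\u0110",
    "Icelandic eth": "\u00D0",
    "African d": "\u0189",
}

# Lowercasing is unambiguous because each capital has its own code point.
lowered = {name: upper.lower() for name, upper in capitals.items()}

for name, lower in lowered.items():
    upper = capitals[name]
    print(f"{name}: U+{ord(upper):04X} -> U+{ord(lower):04X}")
```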
Since the actual form of a glyph is independent of its encoding, multiple glyphs for one character can exist in a single font, tho this is strictly speaking a software feature, rather than a Unicode specification. (Different fonts may also employ alternate glyphs for a particular character: d with ̌ or ď.) It is possible using AAT or OpenType fonts to enable glyph substitution, which can provide for display of ligatures [fi, ff] or glyph variants (swashes, alternates), among many other things. Some of these character combinations or alternates are also present in the form of compatibility characters [fi ff ſ]. (These advanced features are generally not accessible in XHTML, thus I can’t readily show examples.)
Extending this, one could treat the 600+ Latin, Cyrillic and Greek characters in Unicode that consist of a letter and diacritic(s) [č й ή] as compatibility characters that should be represented by the base character and a combining diacritical mark. In fact, the decomposition of such characters is part of the Unicode specification, and the normalization forms utilize this information.
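The decomposition data mentioned above is exposed by Python's standard `unicodedata` module. A minimal sketch, where the helper name `decompose` is my own:

```python
import unicodedata

def decompose(ch):
    """Return the code points of a character's decomposition mapping.

    Compatibility mappings carry a tag such as "<compat>", which is skipped.
    An empty list means the character has no decomposition.
    """
    fields = unicodedata.decomposition(ch).split()
    return [int(f, 16) for f in fields if not f.startswith("<")]

# c with caron, Cyrillic short i, Greek eta with tonos
for ch in "\u010D\u0439\u03AE":
    parts = " + ".join(f"U+{cp:04X}" for cp in decompose(ch))
    print(f"U+{ord(ch):04X} = {parts}")
```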
Normalization
Normalization forms designate whether to compose or decompose characters consisting of a base character and a diacritic, such as those used in most European languages.
- Normalization Form D (NFD): when possible, present a character in its decomposed form. Thus the precomposed character á [U+00E1] should be broken down into a + ́ [U+0061 U+0301].
- Normalization Form C (NFC): when possible, present a character in its precomposed form. Thus a + ́ [U+0061 U+0301] should be composed into á [U+00E1].
Both forms also replace characters that are canonical duplicates of another character (e.g., they use Å [U+00C5, A with ring above] instead of Å [U+212B, angstrom sign]). Additional forms (NFKD and NFKC) exist that avoid compatibility characters (using [U+0032] and CSS 2 or HTML <sup>2</sup> as opposed to ² [U+00B2]).
Mac OS X, for example, will generally display the precomposed glyph of a character combination, if available in the selected font, tho it does not change the character encoding. I think this is more a case of glyph substitution than normalization, since the actual characters are left as entered.
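The four forms can be tried out with `unicodedata.normalize` from Python's standard library. This is only a quick demonstration on the characters mentioned above; the variable names are mine.

```python
import unicodedata

# NFD: precomposed a-acute becomes a + combining acute accent.
nfd = unicodedata.normalize("NFD", "\u00E1")

# NFC: the combining sequence is composed back into one character.
nfc = unicodedata.normalize("NFC", "a\u0301")

# The angstrom sign is replaced by A with ring above under NFC.
angstrom = unicodedata.normalize("NFC", "\u212B")

# NFKC also replaces compatibility characters: superscript two becomes "2".
# Note that the superscript formatting itself is lost in plain text.
squared = unicodedata.normalize("NFKC", "\u00B2")

for label, value in [("NFD", nfd), ("NFC", nfc),
                     ("NFC angstrom", angstrom), ("NFKC squared", squared)]:
    print(label, [f"U+{ord(c):04X}" for c in value])
```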
UCS coding space
| Category | BMP (Plane 0) | Supplements (Planes 1-16) | Total (UTF-16) |
|---|---|---|---|
| Glyphs | 54,264 | 52,892 | 107,156 |
| Format | 35 | 105 | 140 |
| Private Use (E000-F8FF) | 6,400 | 131,068 | 137,468 |
| Surrogate (for UTF-16, D800-DFFF) | 2,048 | N/A | 2,048 |
| Control (00-1F, 7F-9F) | 65 | 0 | 65 |
| Noncharacter (FDD0-FDEF, xxFFFE-xxFFFF) | 34 | 32 | 66 |
| Total allocated | 62,846 | 184,097 | 246,943 |
| Reserved (not yet allocated/finalized) | 2,690 | 864,479 | 867,169 |
| Total available | 65,536 | 1,048,576 | 1,114,112 |
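Several of the fixed counts in the table follow directly from the code ranges involved. The sketch below recomputes them, assuming the ranges given in the Unicode standard: surrogates D800-DFFF, controls 00-1F and 7F-9F, BMP private use E000-F8FF, planes 15 and 16 reserved for private use, and the last two positions of every plane set aside as noncharacters.

```python
PLANE = 0x10000  # 65 536 code positions per plane

total_available = 17 * PLANE                      # plane 0 plus planes 1-16
surrogates = 0xDFFF - 0xD800 + 1
controls = (0x1F - 0x00 + 1) + (0x9F - 0x7F + 1)
bmp_private_use = 0xF8FF - 0xE000 + 1
# Planes 15 and 16 are private use, minus the 2 noncharacters in each.
supplementary_private_use = 2 * (PLANE - 2)
# FDD0-FDEF plus FFFE and FFFF in the BMP; 2 per plane in planes 1-16.
bmp_noncharacters = (0xFDEF - 0xFDD0 + 1) + 2
supplementary_noncharacters = 16 * 2

print(total_available, surrogates, controls, bmp_private_use,
      supplementary_private_use, bmp_noncharacters,
      supplementary_noncharacters)
```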
In the 4-byte form of UCS (UCS-4) more than 2 billion different characters can be represented (2 147 483 648 — the first bit of the first byte must be 0 so only 31 of the 32 bits are used). There are 128 groups, each containing 256 planes. Those characters that can be represented by the 2-byte form of UCS (UCS-2) belong to plane 0 of group 0, which is called the Basic Multilingual Plane, or BMP, and is the basis for Unicode. (Plane 1 of group 0 is the Supplementary Multilingual Plane, SMP, and contains some ancient, little used and fictional character sets.)
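The group and plane structure described above amounts to reading the four bytes of a UCS-4 code one at a time. A minimal sketch, with `ucs4_position` as a made-up helper name (the row and cell fields anticipate the layout of a plane described next):

```python
def ucs4_position(code):
    """Split a 31-bit UCS-4 code into (group, plane, row, cell)."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("UCS-4 codes use only 31 bits")
    group = (code >> 24) & 0x7F   # 128 groups
    plane = (code >> 16) & 0xFF   # 256 planes per group
    row = (code >> 8) & 0xFF      # 256 rows per plane
    cell = code & 0xFF            # 256 cells per row
    return group, plane, row, cell

# U+00E1 is in the BMP; U+1D11E (a musical symbol) is in plane 1, the SMP.
print(ucs4_position(0x00E1))
print(ucs4_position(0x1D11E))
```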
The 65 536 positions in UCS-2 are divided into 256 rows with 256 cells in each. The first byte of a character representation gives the row number, the second the cell number. The first row, row 0, contains exactly the same characters as ISO/IEC 8859-1 (ISO Latin 1). The first 128 characters are thus the ASCII characters. The byte representing an ISO/IEC 8859-1 character is easily transformed to the representation in UCS, by putting a 0 byte in front of it.
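The zero-byte transformation can be checked directly. In the sketch below, Python's `utf-16-be` codec stands in for big-endian UCS-2, which is the same thing for characters inside the BMP; the sample word is my own.

```python
text = "caf\u00e9"

# One byte per character in ISO/IEC 8859-1.
latin1 = text.encode("latin-1")

# Put a 0 byte (row 0) in front of every Latin-1 byte (the cell number).
ucs2 = b"".join(b"\x00" + bytes([b]) for b in latin1)

print(latin1.hex(" "))
print(ucs2.hex(" "))
```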
To guarantee that the coding space will not be filled up even in the future, a transformation format (UTF-16) defines a way to access supplementary planes 1 thru 16 of group 0 from within UCS-2, using pairs of 2-byte codes, for an additional 1 048 576 characters. In Unicode 5.2 (released 2009-10-01) 62 846 code points have been assigned in the BMP, with an additional 184 097 assigned in various supplementary planes.
Adaptation to data communication needs
Many data communication protocols treat bytes with values in the hexadecimal range 00-1F specially; they represent control characters in most 7-bit and 8-bit character sets. The most used protocol for electronic mail, classical SMTP, even explicitly forbids the 128 code positions above ASCII (i.e., greater than hex 7F). In certain datatypes used in data communication, e.g. domain names on the Internet, even stricter restrictions are imposed on allowed bytes. In most operating systems, even some bytes that in ASCII represent glyphs cannot be used in file names (/ in Unix, : in Mac OS, \ in DOS).
When UCS is used in these contexts, the simple solution of just partitioning the 16-bit or 31-bit codes into 2 or 4 bytes does not work. For many graphic characters this will produce bytes in the ranges forbidden by the above mentioned protocols and operating system designs.
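A few sample characters show the problem. The sketch below splits 16-bit codes into two bytes (again using `utf-16-be` as a stand-in for UCS-2 within the BMP) and flags control bytes and the Unix path separator; the helper names and the sample characters are my own.

```python
def ucs2_bytes(text):
    """Naive split of each 16-bit code into two bytes, high byte first."""
    return text.encode("utf-16-be")

def has_troublesome_byte(data):
    """True if any byte is a control code (00-1F) or the '/' separator."""
    return any(b < 0x20 or b == 0x2F for b in data)

# Plain "A" yields a NUL byte; U+0101 yields two control bytes;
# U+012F contains the byte for "/", although no slash was written.
for ch in ("A", "\u0101", "\u012F"):
    data = ucs2_bytes(ch)
    print(f"U+{ord(ch):04X}", data.hex(" "), has_troublesome_byte(data))
```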
For these reasons, several algorithmic transformation methods have been defined for UCS data:
UTF-8: The codes in the first half of the first row of the BMP, i.e. the characters that can also be found in ASCII, are in this transformation format replaced by their ASCII codes, which are bytes in the range hex 00-7F. The other codes of UCS are transformed to between two and six bytes in the range hex 80-FF. A text containing only characters in the BMP is transformed to the same byte sequence, irrespective of whether it was coded with UCS-2 or UCS-4. UTF-8 can be used to encode all characters in the 31-bit UCS-4 code range, tho at present UTF-8 sequences longer than 4 bytes are not well supported. (Three bytes suffice to encode plane 0 of group 0, the BMP, and 4 cover the first 32 planes of group 0. Nothing past plane 16 has been used to date.)
UTF-16: This transformation reduces UCS-4-coded text to a 16-bit encoding and the result can only be used by so called 16-bit safe programs and processes, where all byte values are allowed. BMP code points in the range U+D800 through U+DFFF (2048 code points) are reserved for use by UTF-16, where a pair of surrogate code units (a high surrogate followed by a low surrogate) “stand in” for a supplementary code point. This makes 1 048 576 characters in planes 1 thru 16 available in a 16-bit coded character set. The other code positions in UCS-4 are still unusable in the UTF-16 transformation format. One motivation for defining UTF-16 has been that it will make it possible for software implementing Unicode to cope with the expansion of UCS outside the BMP for the foreseeable future.
UTF-32: Not really for communication purposes as such, this transformation format simply restricts the 32-bit space of UCS-4 to the range reachable by UTF-16. All codes greater than U+0010FFFF are considered invalid. It is a result of the realization that we are unlikely to ever need to utilize any codes above the 17 planes provided for by UTF-16.
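The three transformation formats can be sketched in a few lines each. The functions below are illustrative only and their names are mine. `utf8_encode` follows the original scheme of up to six bytes as described above, which Python's own codec no longer accepts beyond four, and for brevity it does not reject surrogate code points.

```python
def utf8_encode(code):
    """Encode one UCS code (up to 31 bits) as 1 to 6 bytes."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("code outside the 31-bit UCS-4 range")
    if code < 0x80:
        return bytes([code])          # ASCII is passed through unchanged
    # (bits that fit, total bytes, marker bits of the lead byte)
    for bits, length, lead in ((11, 2, 0xC0), (16, 3, 0xE0), (21, 4, 0xF0),
                               (26, 5, 0xF8), (31, 6, 0xFC)):
        if code < (1 << bits):
            break
    out = []
    for _ in range(length - 1):
        out.append(0x80 | (code & 0x3F))  # continuation byte: 6 bits each
        code >>= 6
    out.append(lead | code)               # remaining bits go in the lead byte
    return bytes(reversed(out))

def utf16_encode(code):
    """Encode one code point as a list of one or two 16-bit code units."""
    if not 0 <= code <= 0x10FFFF or 0xD800 <= code <= 0xDFFF:
        raise ValueError("not encodable in UTF-16")
    if code < 0x10000:
        return [code]
    offset = code - 0x10000               # 20 bits, split 10 + 10
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]

def utf32_encode(code):
    """Encode one code point as four bytes, most significant first."""
    if not 0 <= code <= 0x10FFFF:
        raise ValueError("codes above U+10FFFF are invalid in UTF-32")
    return code.to_bytes(4, "big")

# U+1D11E lies in plane 1, so it needs 4 bytes in UTF-8
# and a surrogate pair in UTF-16.
print(utf8_encode(0x1D11E).hex(" "))
print([f"{unit:04X}" for unit in utf16_encode(0x1D11E)])
print(utf32_encode(0x1D11E).hex(" "))
```

For code points that Python itself accepts, the results can be compared against the built-in `utf-8`, `utf-16-be` and `utf-32-be` codecs.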
Some of this is borrowed from an excellent overview by Olle Järnefors. I have rewritten most of it in order to update it, making corrections and my own additions. Any errors are probably mine.