The Linux Cyrillic HOWTO: Characters and codesets

2. Characters and codesets

To understand and print characters of various languages, the system and software must be able to distinguish them from other characters. That is, each unique character must have a unique representation inside the operating system or the particular software package. The collection of all unique characters that the system is able to represent at once is called a codeset.

When most operating systems were created, nobody cared about software being multilingual. Therefore, the most popular codeset was (and actually still is) ASCII (American Standard Code for Information Interchange).

Standard ASCII (aka 7-bit ASCII) comprises 128 unique codes. ASCII defines some of them as real printable characters and the rest as so-called control characters, which had special meanings in the old communication protocols. Each element of the set is identified by an integer character code (0-127). The subset of printable characters represents those found on a typewriter's keyboard, with some minor additions. Each character occupies the 7 least significant bits of a byte, whereas the most significant bit was used for control purposes (say, transmission control in old communication packages).

The 7-bit ASCII concept was extended by 8-bit ASCII (aka extended ASCII). In this codeset, the character codes range from 0 to 255. The lower half (0-127) is pure ASCII, whereas the upper half (128-255) contains 128 more characters. Since this codeset is backward compatible with ASCII (a character still fits in 8 bits, and the lower codes correspond to the old ASCII), it gained wide popularity.
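The split between the two halves can be checked mechanically. The following is only a minimal sketch in Python; the helper names is_seven_bit and split_halves are made up for this illustration and are not part of any package mentioned in this HOWTO.

```python
def is_seven_bit(data):
    """Return True if every code lies in the lower half (0-127), i.e. pure ASCII."""
    return all(code < 128 for code in data)


def split_halves(data):
    """Return (lower, upper): the bytes coded 0-127 and the bytes coded 128-255."""
    lower = bytes(code for code in data if code < 128)
    upper = bytes(code for code in data if code >= 128)
    return lower, upper


if __name__ == "__main__":
    sample = bytes([0x41, 0xC1, 0x42])   # 'A', one upper-half code, 'B'
    print(is_seven_bit(b"Linux"))        # pure ASCII text
    print(is_seven_bit(sample))          # contains an upper-half code
    print(split_halves(sample))
```

What an upper-half code such as 0xC1 actually means is not decided by extended ASCII itself; it depends on which implementation of the upper half is in use.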

Although extended ASCII doesn't define the contents of the upper half of the codeset, its most popular and widespread implementation is the Latin 1 codeset. In Latin 1, the upper half of the table defines various characters which are not part of the English alphabet but are present in various European languages (German umlauts, French accents, etc.). Another popular extended ASCII implementation is IBM (named after some computer company that developed this codeset for its infamous personal computers). This one contains pseudo-graphic characters in the upper half.

Software that doesn't make any assumptions about the 8th bit of the ASCII data is called 8-bit clean. Some older programs, designed with 7-bit ASCII in mind, are not 8-bit clean and may work incorrectly with your extended ASCII data. Most packages, however, are able to deal with extended ASCII by default, or require only some very basic setup. NOTE: before posting the question "I did all setup right, but I cannot enter/view Cyrillic characters!", please consult the section misc-setup for notes on the program you are using.
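To make the failure concrete, here is a rough sketch in Python of what a program that is not 8-bit clean effectively does to its input. The function names are hypothetical, and masking the top bit is just one typical way such programs misbehave, not a description of any particular package.

```python
def strip_high_bit(data):
    """Mimic a non-8-bit-clean program: clear the most significant bit of every byte."""
    return bytes(code & 0x7F for code in data)


def pass_through(data):
    """Mimic an 8-bit clean program: every byte is handed on unchanged."""
    return bytes(data)


if __name__ == "__main__":
    # Six upper-half codes; their meaning depends on the codeset in use.
    sample = bytes([0xF0, 0xD2, 0xC9, 0xD7, 0xC5, 0xD4])
    print(pass_through(sample) == sample)   # nothing is lost
    print(strip_high_bit(sample))           # every code fell into the lower half
```

Once the top bit is cleared, the upper-half codes are silently turned into plain ASCII codes, and the original text cannot be recovered from the output.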

For information about making your software 8-bit clean, see section programmers-tools.

Since on most systems a character occupies 8 bits, there is no way to keep extending ASCII further. The way to implement new symbols in ASCII-based codesets is to create other extended ASCII implementations. This is how the Cyrillic ASCII set is implemented.

Although there used to be many of them, nowadays there are three, and only two of those are popular and widespread. One is the Alt codeset (the so-called "alternative codeset"); the other is KOI-8, which is specified in RFC 1489 ("Registration of a Cyrillic Character Set").

These two standards differ only in the positions of the Cyrillic characters in the table (that is, in the Cyrillic character codes).
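The difference in character codes can be seen with a short Python sketch. It rests on an assumption: Python's koi8_r codec is taken to stand for KOI-8 and its cp866 codec for the Alt codeset. The helper names codes_of and recode are invented for this example.

```python
def codes_of(text, codec):
    """Return the character codes of text in the given codeset, as hex strings."""
    return [hex(code) for code in text.encode(codec)]


def recode(data, source, target):
    """Convert bytes from one codeset to another by going through Unicode."""
    return data.decode(source).encode(target)


if __name__ == "__main__":
    letter = "\u0410"                   # CYRILLIC CAPITAL LETTER A
    print(codes_of(letter, "koi8_r"))   # its code in KOI-8 (assumed: koi8_r)
    print(codes_of(letter, "cp866"))    # its code in Alt (assumed: cp866)
    print(recode(b"\xe1", "koi8_r", "cp866"))
```

Since only the codes differ, text that uses characters present in both tables can be converted between the two codesets without loss.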

The principal difference is that the Alt codeset is used by MS-DOS users only, whereas KOI-8 is used in Unix as well as in MS-DOS (though in the latter KOI-8 is much less popular). Since we are doing the right thing (namely working in the Unix operating system), we shall focus mostly on KOI-8.

There are other standards, which are different from ASCII and much more flexible. Unicode is the best known. However, they are not implemented as well as the basic ones in Unix in general and Linux in particular. Therefore, I am not describing them here.