Asian Languages

Many Asian languages are written with ideographic scripts that require large numbers of characters. Japanese and Korean can be practically encoded in 16 bits per character. Daily-use Chinese can be as well, but archives and scholars frequently need more characters, so Chinese is often encoded with up to four bytes per character.

Some Standards

Various Asian character sets have been developed, some of which are considered standard. Encodings for these sets are less standardized. Asian character sets usually require larger-than-byte character types like those described in "Multibyte Characters." Table 6-11 lists some of these standard character sets. Note that some of these character sets have multiple associated codesets, usually designated by appending the year the codeset was adopted to the character set name. (For example, JIS X 0208-1983 is different from JIS X 0208-1990.)

Table 6-11. Character Sets for Asian Languages
Language	Character Set Standard	Support
Japanese	JIS X 0201.1976-0	Katakana
Japanese	JIS X 0208.1983-0	Kanji, kana, Latin, Greek, Cyrillic, symbols, others
Japanese	JIS X 0212.1990-0	Supplemental kanji, others
Chinese	GB 2312.1980-0
Korean	KSC 5601.1987-0	Hangul
Taiwan	CNS 11643
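As a rough illustration of why these character sets need larger-than-byte character types, the following sketch encodes one character from each of three of the sets and counts the bytes. It assumes Python's built-in euc_jp, gb2312, and euc_kr codecs as stand-ins for encodings of JIS X 0208, GB 2312, and KSC 5601; the sample characters and the helper name encoded_lengths are chosen only for illustration.

```python
# One sample character per codec (written as escapes to stay ASCII-safe).
SAMPLES = {
    "euc_jp": "\u6f22",  # a Japanese kanji from JIS X 0208
    "gb2312": "\u6c49",  # a simplified Chinese character from GB 2312
    "euc_kr": "\ud55c",  # a Korean hangul syllable
}

def encoded_lengths(samples=SAMPLES):
    """Return the number of bytes each sample character occupies."""
    return {codec: len(ch.encode(codec)) for codec, ch in samples.items()}

if __name__ == "__main__":
    for codec, length in encoded_lengths().items():
        print(codec, length)
```

Each sample occupies two bytes in its encoding, while an ASCII letter encoded with the same codecs still occupies one.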

EUC

EUC (Extended UNIX Code) is an encoding method that supports concurrent use of up to four codesets in one encoding. It employs two special single-shift bytes, called ss1 and ss2 here (most EUC references name them SS2 and SS3):

ss1 = 0x8e
ss2 = 0x8f

These are used to identify codesets within a string. The EUC encoding scheme uses the following patterns to indicate which codeset is in use at any given time:

Codeset #0: 0xxxxxxx
Codeset #1: 1xxxxxxx [ 1xxxxxxx ...]
Codeset #2: ss1 1xxxxxxx [ 1xxxxxxx ...]
Codeset #3: ss2 1xxxxxxx [ 1xxxxxxx ...]

So if ss1 appears in a string, the next character (however many bytes long it is) should be interpreted as a character from codeset #2. If there are multiple characters in a row from codeset #2, each one is preceded by ss1. Similarly, ss2 indicates that the following character belongs to codeset #3. If any other byte whose high bit is 1 appears in the string (without being preceded by ss1 or ss2), it is interpreted as all or part of a character from codeset #1.
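The rules above can be sketched as a small routine that splits an EUC byte string into characters and labels each with its codeset. This is a minimal sketch, not a definitive implementation: the function name split_euc is hypothetical, and the WIDTHS table is an assumption, because the number of bytes per character in each codeset is defined by the locale. The values used here follow the common Japanese EUC layout.

```python
SS1 = 0x8E  # shift byte that introduces a codeset #2 character
SS2 = 0x8F  # shift byte that introduces a codeset #3 character

# Assumed bytes per character, not counting the shift byte.
# These widths are locale-defined; the values below are an assumption
# modeled on Japanese EUC.
WIDTHS = {0: 1, 1: 2, 2: 1, 3: 2}

def split_euc(data: bytes, widths=WIDTHS):
    """Return a list of (codeset, raw_bytes) pairs, one per character."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:            # 0xxxxxxx: codeset #0
            codeset, start = 0, i
        elif b == SS1:          # ss1 prefix: codeset #2
            codeset, start = 2, i + 1
        elif b == SS2:          # ss2 prefix: codeset #3
            codeset, start = 3, i + 1
        else:                   # any other high-bit byte: codeset #1
            codeset, start = 1, i
        end = start + widths[codeset]
        body = data[start:end]
        # Every byte of a codeset #1-#3 character must have its high bit set.
        if len(body) != widths[codeset] or (
            codeset != 0 and any(c < 0x80 for c in body)
        ):
            raise ValueError(f"malformed character at offset {i}")
        out.append((codeset, data[i:end]))
        i = end
    return out

if __name__ == "__main__":
    # 'A', then a two-byte codeset #1 character, then ss1 + one byte.
    for codeset, raw in split_euc(b"A\xb4\xc1\x8e\xb1"):
        print(codeset, raw.hex())
```

Because the shift byte applies to one character only, the routine returns to codeset #0 or #1 interpretation as soon as that character has been consumed.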

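In EUC, codeset #0 is always ASCII. The other codesets are implementation- or user-defined. Because every byte with its high bit set is claimed by codesets #1 through #3, the upper half of Latin 1 cannot be used as single-byte characters alongside them. This is why EUC cannot support Latin 1 in Asian locales.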
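The conflict with Latin 1 can be seen with a short experiment. The sketch below assumes Python's built-in euc_jp codec (Japanese EUC) as a representative EUC implementation; the helper name decodes_as_euc_jp is hypothetical. It encodes a word containing an accented letter as Latin 1 and then tries to read the bytes back as EUC.

```python
def decodes_as_euc_jp(data: bytes) -> bool:
    """Return True if the bytes form a valid Japanese EUC string."""
    try:
        data.decode("euc_jp")
        return True
    except UnicodeDecodeError:
        return False

# "cafe" with an acute accent on the e; Latin 1 stores it as byte 0xE9.
latin1_bytes = "caf\u00e9".encode("latin-1")

if __name__ == "__main__":
    print(latin1_bytes.hex())
    print(decodes_as_euc_jp(b"cafe"))        # plain ASCII is valid EUC
    print(decodes_as_euc_jp(latin1_bytes))   # the Latin 1 bytes are not
```

The ASCII string passes because it lies entirely in codeset #0. The Latin 1 string fails because its final byte has the high bit set, so the decoder treats it as the start of a multibyte character that never completes.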

EUC implementations exist (but are not standardized) for all ideographic Asian languages.