
Asian Languages

Many Asian languages use ideographic scripts and need large numbers of characters for their representation. Japanese and Korean, for example, can be encoded practically in 16 bits. Daily-use Chinese can be as well, but archives and scholars frequently need more characters, so Chinese is often encoded with up to four bytes per character.


Some Standards

Various Asian character sets have been developed, some of which are considered standard. Encodings for these sets are less standardized. Asian character sets usually require larger-than-byte character types like those described in "Multibyte Characters." Table 6-11 lists some of these standard character sets. Note that some of these character sets have multiple associated codesets, usually designated by appending the year the codeset was adopted to the character set name. (For example, JIS X 0208-1983 is different from JIS X 0208-1990.)

Character Sets for Asian Languages

Language   Character Set Standards   Support
Japanese   JIS X 0201.1976-0         Katakana
           JIS X 0208.1983-0         Kanji, kana, Latin, Greek, Cyrillic, symbols, others
           JIS X 0212.1990-0         Supplemental kanji, others
Chinese    GB 2312.1980-0
Korean     KSC 5601.1987-0           Hangul
Taiwan     CNS 11643


EUC

EUC (Extended UNIX Code) is an encoding methodology that supports concurrent use of four codesets in one encoding. It employs two special "shift state" bytes:

ss1 = 0x8e
ss2 = 0x8f

These bytes identify codesets within a string. (Other descriptions of EUC call them SS2 and SS3, the ISO 2022 single-shift characters.) The EUC encoding scheme uses the following patterns to indicate which codeset is in use at any given time:

Codeset #0: 0xxxxxxx
Codeset #1: 1xxxxxxx [ 1xxxxxxx ...]
Codeset #2: ss1 1xxxxxxx [ 1xxxxxxx ...]
Codeset #3: ss2 1xxxxxxx [ 1xxxxxxx ...]

So if ss1 appears in a string, it means that the next character--however many bytes long it is--should be interpreted as a character from codeset #2. If there are multiple characters in a row from codeset #2, each one is preceded by ss1. Similarly, ss2 indicates that the following character belongs to codeset #3. If any other byte whose high bit is 1 appears in the string (without being preceded by ss1 or ss2), it is interpreted as all or part of a character from codeset #1.
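
The byte patterns above can be sketched as a small scanner that splits an EUC string into characters and reports the codeset of each. This is only an illustration, not part of any EUC implementation: the function name split_euc and the table ASSUMED_WIDTHS are hypothetical, and the number of bytes per character in codesets #1 to #3 is an assumption (the values below follow the common Japanese EUC layout), since the patterns themselves leave the widths open.

```python
SS1 = 0x8E  # prefix byte: the next character comes from codeset #2
SS2 = 0x8F  # prefix byte: the next character comes from codeset #3

# Assumed bytes per character, not counting the ss1/ss2 prefix.
# These widths are NOT given in the text; they mirror a typical
# Japanese EUC locale and are here only to make the sketch runnable.
ASSUMED_WIDTHS = {1: 2, 2: 1, 3: 2}


def split_euc(data, widths=ASSUMED_WIDTHS):
    """Split EUC bytes into a list of (codeset, character_bytes) pairs."""
    chars = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte < 0x80:
            # 0xxxxxxx: a one-byte character from codeset #0
            codeset, start, end = 0, i, i + 1
        elif byte == SS1:
            # ss1 followed by one character from codeset #2
            codeset, start = 2, i + 1
            end = start + widths[2]
        elif byte == SS2:
            # ss2 followed by one character from codeset #3
            codeset, start = 3, i + 1
            end = start + widths[3]
        else:
            # any other high-bit byte starts a codeset #1 character
            codeset, start = 1, i
            end = start + widths[1]
        body = bytes(data[start:end])
        if len(body) != end - start:
            raise ValueError("truncated character at offset %d" % i)
        if codeset != 0 and any(b < 0x80 for b in body):
            raise ValueError("high bit not set at offset %d" % i)
        chars.append((codeset, body))
        i = end
    return chars


sample = b"A\xb4\xc1\x8e\xb1\x8e\xb2\x8f\xb0\xa1"
for codeset, body in split_euc(sample):
    print(codeset, body.hex())
```

Note that the two consecutive codeset #2 characters in the sample each carry their own ss1 prefix, as the text describes; the prefix is consumed by the scanner and is not part of the returned character bytes.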

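In EUC, codeset #0 is always ASCII. The other codesets are implementation- or user-defined. Because every byte with the high bit set is interpreted as part of a character from codeset #1, #2, or #3, the single-byte Latin 1 characters above 0x7f have no place in the encoding. This is why EUC cannot support Latin 1 in Asian locales.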
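
The clash between Latin 1 and EUC can be seen by feeding Latin 1 bytes to an EUC decoder. The sketch below uses Python's standard euc_jp codec as a stand-in for an EUC implementation; that choice, and the helper name decodes_as_euc_jp, are assumptions made for illustration only.

```python
def decodes_as_euc_jp(data):
    """Return True if the bytes form a valid string for Python's euc_jp codec."""
    try:
        data.decode("euc_jp")
        return True
    except UnicodeDecodeError:
        return False


plain_ascii = "cafe".encode("latin-1")    # every byte is below 0x80
accented = "café".encode("latin-1")       # the final byte is 0xe9

# Pure ASCII passes straight through, because it is a subset of both.
print(decodes_as_euc_jp(plain_ascii))

# The lone 0xe9 is taken as the first byte of a two-byte character
# whose second byte is missing, so decoding fails.
print(decodes_as_euc_jp(accented))

# Two Latin 1 letters in a row are worse: they are silently read as
# a single two-byte character instead of two accented letters.
pair = "éé".encode("latin-1")             # b'\xe9\xe9'
print(len(pair), len(pair.decode("euc_jp")))
```

The last case is the more troublesome one in practice, since no error is raised and the text is simply misread.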

EUC implementations exist (but are not standardized) for all ideographic Asian languages.

