Japanology A (日本学 A)

コンピュータ内の文字の扱い方・Handling Characters on the Computer

June 5, 2006

Warm-up Exercises

Coverage (e.g. what scripts/languages,...)
Units (e.g. words, ligatures, strokes,...): easy for some scripts, but difficult for others
Coded character set: Character table
Encoding: From numbers to bytes (computer memory)
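The last two items can be illustrated with a small Python sketch. It is only an illustration, not part of the original slides: the sample character 漢 and the choice of encodings are assumptions made here. The coded character set assigns the character a number (its code point), and each encoding turns that number into a different byte sequence.

```python
# A minimal sketch: one character, one code point, several byte encodings.
# The sample character and the list of encodings are illustrative choices.
char = "漢"

# Coded character set: the character's number in the Unicode table.
code_point = ord(char)
print(f"U+{code_point:04X}")

# Encoding: the same character expressed as bytes in different encodings.
encodings = ["utf-8", "utf-16-be", "shift_jis", "euc_jp"]
encoded = {name: char.encode(name) for name in encodings}

for name, data in encoded.items():
    print(f"{name:10s} {data.hex()}")
    # Decoding with the matching encoding recovers the original character.
    assert data.decode(name) == char
```

Decoding the bytes with the wrong encoding is what produces garbled text (文字化け), which is why the encoding has to be known or declared.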

Available online, but not printable

Characters: abstract, small units of information exchange
Glyphs: Identifiable character variations, e.g. $ with one or two (connected or broken) strokes, a/g/Q vs. a/g/Q (set in different typefaces); dependent on context (ligatures) or font
Glyph shapes: minor font variations (incl. size, weight,...)
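As a hedged illustration of the character/glyph distinction: ligatures are normally a matter for the font, but Unicode does contain a few ligatures as compatibility characters. The sketch below assumes the "fi" ligature (U+FB01) as an example and uses Python's standard unicodedata module to map it back to the two underlying characters.

```python
import unicodedata

# U+FB01 is a compatibility character that encodes a glyph-level ligature.
ligature = "\ufb01"
ligature_name = unicodedata.name(ligature)
print(ligature_name)

# NFKC normalization replaces it with the abstract characters it stands for.
plain = unicodedata.normalize("NFKC", ligature)
print(len(ligature), "character ->", len(plain), "characters:", plain)
```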

The Difficulty of Encoding Kanji (漢字)

All problems can ultimately be traced back to their large number

Unification problems inside a country are very similar to cross-country unification problems

X: Semantics (士 vs. 土, 大 vs. 犬 vs. 太, but not 机 (table) vs. 机 (Chinese abbr. of 機)) → separate
Y: Abstract Shape (体 vs. 躯 vs. 體, 沢 vs. 澤,...) → separate
Z: Type Face: minor differences such as semantically irrelevant strokes, stroke sequence, protrusions, contacts, termination shapes, and so on → do not separate

Additionally: Source separation rule: Separate if already separated in a major (national/regional) source standard
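A small sketch can confirm that the X and Y examples above are encoded as separate characters. The grouping labels in the code are illustrative names chosen here, not terms from the Unicode Standard.

```python
# Characters from the X (semantics) and Y (abstract shape) examples above.
# The dictionary keys are hypothetical labels used only for this sketch.
examples = {
    "X: semantics": ["士", "土", "大", "犬", "太"],
    "Y: abstract shape": ["体", "躯", "體", "沢", "澤"],
}

code_points = {}
for label, chars in examples.items():
    # Each character is separately encoded, so each has its own code point.
    code_points[label] = [f"U+{ord(ch):04X}" for ch in chars]
    print(label, dict(zip(chars, code_points[label])))
```

Differences on the Z axis (typeface) do not show up this way, because they share a single code point and are left to the font.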

Based on work in Japan in the 1970s; very closely following the unification rules of JIS X 0208

Ultimately based on how average people use Han characters in average texts

(Texts about font design, character history, stroke order,... are on a meta-level. Designing an en

Very strong in the mid to late 1990s, slowly dying out
Close identification of national culture, national language, and national script
National independence and preservation of culture at stake?
Protecting investment in older (more complex) technology
Used as 'placeholder' battleground for general battle for basic Japanese software technology (OS, e.g. TRON)
Criticism mostly by 'technologists' with narrow view of correct Kanji from bad school experience, and by 'literates' who didn't check the facts
Mostly based on rumors, mostly written with tools that used Unicode inside (Ichitaro, MS Word)

Unicode has only about 20'000 Han characters:
JIS X 0208 only has about 6'000, Unicode currently has 70'000
Latin, Cyrillic, and Greek are separated, but Chinese, Korean, and Japanese Han characters are unified:
LCG are very often used distinctively in the same text (e.g. Mathematics), have been separated for more than 1000 years, are very small in number, and are also separated e.g. in JIS X 0208
Unicode Han Unification is totally wrong:
As seen with the 迂 example, Han Unification is based on sound principles
Unicode messes up sort order of Japanese characters:
The order of Kanji is different, but neither is really good. Computers can do the mapping easily.
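The point about sort order can be sketched as follows. This is an illustration under stated assumptions: the three sample Kanji are chosen here, and the bytes produced by Python's euc_jp codec are used as a stand-in for JIS X 0208 row/cell order, which they mirror for these characters.

```python
# Three sample Kanji (an illustrative choice, not from the original slides).
kanji = ["右", "一", "亜"]

# Default string sorting in Python compares Unicode code points.
by_unicode = sorted(kanji)

# Sorting by EUC-JP bytes approximates JIS X 0208 order for these characters.
by_jis = sorted(kanji, key=lambda ch: ch.encode("euc_jp"))

print("Unicode order:   ", by_unicode)
print("JIS X 0208 order:", by_jis)
```

Neither result is a dictionary order a reader would expect in general, which matches the slide's remark that neither order is really good; practical sorting of Japanese text relies on readings rather than on code values.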

[Unicode is not the best of all worlds, but it is an extremely useful compromise]

Check out the links from this presentation
Look around yourself and find various different glyphs/shapes of characters you know (Latin, 漢字,....)
For checking Han character variants in Unicode, use the Lookup of the Unihan Database
Prepare for the next lecture by reading Computers and Culture: The Case of the Japanese Writing System