Japanology A (日本学 A)
コンピュータ内の文字の扱い方・Handling Characters on the Computer
June 5, 2006
http://www.sw.it.aoyama.ac.jp/2006/Japanology/lecture1.html
© 2006 Martin J. Dürst 青山学院大学
Warm-up Exercises
- Connect the dots: How many characters?
- Can you spot the difference? Bones are bones
Self-Introduction
Overview・目次
- Characters as numbers・文字と番号
- Proprietary, national, regional, and global encodings・文字符号化: 企業の提案から世界的な符号化へ
- Unicode・ユニコード
- What is a character・文字の特定
- Han Unification・漢字の包摂
- Unicode criticism in Japan・日本でのユニコードへの批判
Character representations
- Human mind: very flexible
- Paper and other visual media: quite flexible
- Computer: not so flexible
Steps to Encode Characters in Computers
- Coverage (e.g. what scripts/languages,...)
- Units (e.g. words, ligatures, strokes,...): easy for some scripts, but
difficult for others
- Coded character set: Character table
- Encoding: From numbers to bytes (computer memory)
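The last two steps can be sketched in a few lines of Python. This is an illustration only, not part of the original slides: the sample character and the variable names are assumptions. It looks up the number that the Unicode character table assigns to a character, then turns that number into bytes using two different encodings.

```python
# Coded character set: every character is assigned a number (code point).
ch = "日"
code_point = ord(ch)            # the number in the Unicode character table
print(f"U+{code_point:04X}")    # U+65E5

# Encoding: the same number becomes different byte sequences,
# depending on which encoding is chosen.
utf8_bytes = ch.encode("utf-8")
utf16_bytes = ch.encode("utf-16-be")
print(utf8_bytes.hex())         # e697a5
print(utf16_bytes.hex())        # 65e5
```

The character table and the byte representation are separate decisions: one code point, several possible encodings.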
Character Encoding History
- Company-based: US (196x), Japan (197x),..., EBCDIC (IBM, still
used)
- National:
- ISO 646 (variant for each Western European Country),...
- JIS X 0208 (Japan) (Encodings: JIS, Shift_JIS, EUC)
- GB 2312 (China)
- KS C 5601 (Korea)
- BIG5 (Taiwan/Hong Kong)
- Supranational: ISO-8859-1: Western Europe covered
- Global: Unicode/ISO 10646: Roadmap, Code Charts
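As a rough illustration of how one national character table (JIS X 0208) is carried by three different encodings, the following Python sketch encodes a single kanji with the standard library codecs `iso2022_jp` (commonly called "JIS"), `shift_jis`, and `euc_jp`. The sample character and variable names are assumptions made for this example. Stripping the escape sequences by fixed offsets also assumes a one-character input.

```python
ch = "日"

jis = ch.encode("iso2022_jp")   # 7-bit bytes, wrapped in escape sequences
sjis = ch.encode("shift_jis")
euc = ch.encode("euc_jp")

for name, data in [("JIS", jis), ("Shift_JIS", sjis), ("EUC", euc)]:
    print(f"{name:10} {data.hex()}")

# For a single kanji, the JIS output is ESC $ B + two bytes + ESC ( B.
# Dropping the two 3-byte escape sequences leaves the JIS X 0208 code.
payload = jis[3:-3]

# EUC uses the same two bytes with the high bit set on each.
euc_from_jis = bytes(b | 0x80 for b in payload)
print(euc_from_jis == euc)      # True
```

All three byte sequences decode back to the same character. Only the mapping from table position to bytes differs.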
Unicode Equivalents
- ISO/IEC 10646 (developed in close coordination with the Unicode
Consortium)
- JIS X 0221 (国際符号化文字集合 (UCS))
- GB 13000 (China)
Unicode 4.0 Book
- ~1500-page, roughly A4-size, $74.99
- ~400 pages of explanations about characters and scripts
- ~800 pages of code charts
- ~300 pages of additional reference material (indices,...)
- CD with data (character properties,...)
Available online, but not printable
What is a character・文字の特定
- Characters: abstract, small units of information exchange
- Glyphs: Identifiable character variations, e.g. $ with one or two
(connected or broken) strokes, a/g/Q in one typeface vs. a/g/Q in
another; dependent on context (ligatures) or font
- Glyph shapes: minor font variations (incl. size, weight,...)
Encoding Han Characters: What's so special?・漢字の符号化の難しさ
All problems can ultimately be traced back to their large number
Unification problems inside a country are very similar to cross-country
unification problems
Han Unification・漢字の包摂
XYZ conceptual model:
- X: Semantics (士 vs. 土, 大 vs. 犬 vs. 太, but not 机 (table) vs.
机 (Chinese abbr. of 機)) → separate
- Y: Abstract Shape (体 vs. 躯 vs. 體, 沢 vs. 澤,...) →
separate
- Z: Type Face: minor differences such as semantically irrelevant
strokes, stroke sequence, protrusions, contacts, termination shapes, and
so on → do not separate
Additionally: Source separation rule: Separate if already
separated in a major (national/regional) source standard
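The effect of the X and Y rules can be checked directly. The following is a small Python sketch, not part of the original slides, that prints the Unicode code points of some of the example pairs above. The pair list and its labels are assumptions made for illustration. Variants on the Z axis share one code point by design, so they cannot be compared this way.

```python
# Pairs taken from the X (semantics) and Y (abstract shape) examples above.
pairs = [
    ("士", "土"),   # X: different meaning
    ("大", "犬"),   # X: different meaning
    ("体", "體"),   # Y: different abstract shape
    ("沢", "澤"),   # Y: different abstract shape
]

for a, b in pairs:
    # Each member of a pair has its own code point: they are not unified.
    print(f"{a} U+{ord(a):04X}   {b} U+{ord(b):04X}   separate: {a != b}")
```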
Based on work in Japan in the 1970s; very closely following the
unification rules of JIS X 0208
Ultimately based on how average people use Han characters in
average texts
(Texts about font design, character history, stroke order,... are on a
meta-level. Designing an en
Criticism of Unicode in Japan・日本でのユニコード批判
- Very strong in the mid to late 1990s, slowly dying out
- Close identification of national culture, national language, and
national script
- National independence and preservation of culture at stake?
- Protecting investment in older (more complex) technology
- Used as 'placeholder' battleground for general battle for basic
Japanese software technology (OS, e.g. TRON)
- Criticism mostly by 'technologists' with narrow view of correct Kanji
from bad school experience, and by 'literates' who didn't check the
facts
- Mostly based on rumors, mostly written with tools that used Unicode
inside (Ichitaro, MS Word)
Example Criticism
- Unicode has only about 20'000 Han characters:
JIS X 0208 only has about 6'000, Unicode currently has 70'000
- Latin, Cyrillic, and Greek are separated, but Chinese, Korean, and
Japanese Han characters are unified:
Latin, Cyrillic, and Greek (LCG) are very often used distinctively in the
same text (e.g. Mathematics), have been separate for more than 1000 years,
are very small in number, and are also separated e.g. in JIS X 0208
- Unicode Han Unification is totally wrong:
As seen with the 迂 example, Han Unification is based on sound
principles
- Unicode messes up sort order of Japanese characters:
The order of Kanji is different, but neither is really good. Computers
can do the mapping easily.
[Unicode is not the best of all worlds, but it is an extremely useful
compromise]
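The sort-order point can be illustrated with a short Python sketch. It is not from the original slides, and the three sample kanji are an assumption chosen for the example. It sorts them once by Unicode code point and once by their JIS X 0208 position, using the bytes of the `euc_jp` codec as a stand-in for that position.

```python
kanji = ["丁", "亜", "一"]

# Python compares strings by code point, so this is Unicode order.
by_unicode = sorted(kanji)

# EUC-JP bytes follow the JIS X 0208 row/cell order for these characters,
# so sorting on them reproduces the JIS order.
by_jis = sorted(kanji, key=lambda c: c.encode("euc_jp"))

print(by_unicode)   # ['一', '丁', '亜']
print(by_jis)       # ['亜', '一', '丁']
```

The two orders differ, but converting between them takes a single sort key, which is the sense in which the mapping is easy for a computer.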
Homework (no need to submit)
- Check out the links from this presentation
- Look around yourself and find various different glyphs/shapes of
characters you know (Latin, 漢字,....)
- For checking Han character variants in Unicode, use the Lookup of the
Unihan Database
- Prepare for next lecture by reading Computers and Culture: The
Case of the Japanese Writing System