Tutorial
Equivalence, Mapping, and Normalization 
http://www.sw.it.aoyama.ac.jp/2013/pub/NormEquivTut
IUC 37, Santa Clara, CA, U.S.A.,
20 October 2013
Martin J. DÜRST
duerst@it.aoyama.ac.jp
Aoyama Gakuin University

© 2013 Martin J. Dürst, Aoyama Gakuin University
About the Slides
The most up-to-date version of the slides, as well as additional materials,
is available at http://www.sw.it.aoyama.ac.jp/2013/pub/NormEquivTut.
Audience
  - Basic knowledge about Unicode (e.g. you attended some tutorials
  today)
 
  - Responsible for Unicode data or programs dealing with Unicode data
 
Main Points
  - Unicode Normalization in Context
 
  - Some technical details
 
  - Some history and politics
 
  - Some strategy
 
Speaker Normalization
 
Tell me the Difference
What's the difference between the following characters:
৪ and 8 and ∞
Bengali 4, 8, and infinity
க௧
Tamil ka and 1
骨 and 骨
meaning 'bone', same codepoint, different language preference/font
樂, 樂 and 樂
different Korean pronunciation, different codepoints
 
A Big List
  - Linguistic and Semantic Similarities: color, colour, Farbe, 色

  - Accidental Graphical Similarities: T, ⊤, ┳, ᅮ

  - Script Similarities: K, Κ, К; ヘ、へ

  - Cross-script 'Equivalences': な, ナ

  - Numeric 'Equivalence': 7, ۷७৭௭౭๗໗༧፯៧᠗⁷₇⑦⑺⒎7⓻❼➆➐
    (59 in total, not including a few Han ideographs)

  - Case 'Equivalence': g, G; dž, Dž, DŽ; Σ, ς, σ

  - Compatibility Equivalence: ⁴, ₄, ④, ⒋, 4

  - Canonical Equivalence: Å, Å; Ṝ, Ṛ +  ̄, R + ̣ + ̄

  - Unicode Encoding Forms: UTF-8: E9 9D 92 E5 B1 B1, UTF-16: 9752 5C71

  - (Legacy) Encodings: ISO-8859-X, Shift_JIS, ...
 
Why all these Similarities and Variants?
  - Historical cultural evolution
    
      - Scripts and characters borrowed
 
      - Changed by writing tools and customs
 
      - From stylistic to orthographic distinctions
 
      - Independent quest for simple graphics
 
    
   
  - Encoding realities 
    
      - Encoding structure choices
 
      - Round-tripping requirements
 
      - Encoding compromises
 
      - Encoding accidents
 
    
   
 
Applications
  - Identification: 
    
      - Domain Names/Filenames/Usernames/Passwords
 
      - Element names/attribute names/class names in XML/HTML/CSS
 
      - Variable names in programming languages
 
    
   
  - Searching
 
  - Sorting
 
Equivalences/Mappings Defined by Unicode
  - Canonical Equivalence 
    
      - Normalization Form D (NFD)
 
      - Normalization Form C (NFC)
 
    
   
  - Kompatibility Equivalence 
    
      - Normalization Form KD (NFKD)
 
      - Normalization Form KC (NFKC)
 
    
   
  - Case equivalence: Case folding
 
  - Sorting: Sorting keys
 
  - Numeric Values
 
  - Accidental graphical similarities: confusability skeletons
 
 
Equivalence vs. Mapping
  - Equivalence is defined between pairs of strings
    Two strings A and B are equivalent (or not)
   
  - Mapping is defined from one string to another
    String C maps to string D 
  - Equivalence is often defined using mapping
    Example: For equivalence FOOeq and mapping FOOmap,
    FOOmap(A) = FOOmap(B) ⇔ FOOeq(A, B)
 
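This pattern can be sketched in Python with the standard unicodedata module, letting NFC normalization play the role of FOOmap (a minimal illustration, not part of the original slides):

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    """FOOeq defined via FOOmap: map both strings, then compare for equality."""
    return unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)

# U+00C5 (precomposed Å) vs. 'A' + U+030A (combining ring above)
assert canonically_equivalent('\u00C5', 'A\u030A')
assert not canonically_equivalent('a', 'b')
```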
Sorting
  - Sorting isn't just about equivalence (=), but about ordering (≤)
 
  - Mapping to sort keys serves the same purpose:
    String A sorts before B if sort_key(A) ≤ sort_key(B)
  - Sorting is language/locale-dependent (viewer, not data)
 
  - Sorting uses levels, such as: 
    
      - Base character (A<B)
 
      - Diacritics (u<ü)
 
      - Case (g<G or G<g)
 
      - Others (space, articles,...)
 
    
   
 
Casing
  - Case is only relevant for Latin, Greek, Cyrillic, Armenian, and ancient
    Georgian
 
  - Unicode knows three cases: lowercase, titlecase, and uppercase
 
  - Case may be language/locale-dependent
    (e.g. English: i↔I; Turkish: i↔İ, ı↔I) 
  - Case may not be character-to-character
    (e.g. ß↔SS) 
  - Case may be context-dependent
    e.g. σ↔Σ, but at the end of a word: ς↔Σ 
  - Case may depend on typographic tradition
    (accents on French uppercase letters,...) 
 
Case Equivalence/Mapping
  - Case Folding: 
    
      - Aggressively maps case-related strings together
 
      - To lower case
 
      - Suited e.g. for search
 
    
   
  - Default Case Mapping: 
    
      - Not as aggressive as case folding
 
      - Goes both ways
 
      - Use if no context information available
 
      - Use after context-specific mappings
 
    
   
 
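In Python, str.casefold() implements Unicode default case folding, while str.lower() is the less aggressive lowercase mapping; a small illustration of the difference:

```python
# Case folding aggressively maps case-related strings together:
# German ß folds to "ss", so a folded comparison matches where
# plain lowercasing does not.
assert 'Maße'.casefold() == 'MASSE'.casefold()   # both fold to 'masse'
assert 'Maße'.lower() != 'MASSE'.lower()         # 'maße' vs. 'masse'
```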
Other Equivalences/Mappings
  - Numerical equivalence
 
  - Character similarities for spoofing detection
 
 
Canonical Equivalence
From Unicode Standard Annex (UAX) #15, Section 1.1:
  Canonical equivalence is a fundamental equivalency between
  characters or sequences of characters that represent the same abstract
  character, and when correctly displayed should always have the same
  visual appearance and behavior.
Why Canonical Equivalence
  - Order of combining characters
    R + ̣ + ̄ vs. R + ̄ + ̣
   
  - Precomposed vs. Decomposed
    Ṝ vs. Ṛ +  ̄ vs. R + ̣ + ̄ 
  - Singletons
    Å vs. Å, 樂 vs. 樂 vs. 樂 
  - Hangul syllables vs. Jamos
    가 vs. ᄀ + ᅡ 
Combining Characters
  - Many scripts use base characters and diacritics
 
  - Diacritics and similar characters are called Combining Characters
 
  - Often, all combinations may appear (e.g. Arabic, Hebrew)
 
  - Often, combinations limited per language,
    but open-ended for the script (Latin) 
  - Separate encoding seems advantageous
    (was the only form in Unicode 1.0) 
  - Order meaningful if e.g. all diacritics above
 
  - Order arbitrary if e.g. above and below
 
  - ⇒Canonical Ordering
 
 
Canonical Combining Class
  - A character property, integer between 0 and 254
 
  - All base characters have ccc=0
 
  - Only combining marks (but not all of them) have ccc≠0
 
  - Ccc is the same for combining characters that
    
      - Go on the same side as the base character
 
      - Therefore, need to maintain order
 
    
   
  - UnicodeData file (4th field, 0 when no entry)
    (see also DerivedCombiningClass.txt) 
 
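Python exposes this property directly via unicodedata.combining(), which returns the ccc of a character (0 for base characters):

```python
import unicodedata

# unicodedata.combining() returns the canonical combining class (ccc)
assert unicodedata.combining('s') == 0          # base character
assert unicodedata.combining('\u0327') == 202   # combining cedilla (attached below)
assert unicodedata.combining('\u0323') == 220   # combining dot below
assert unicodedata.combining('\u0302') == 230   # combining circumflex (above)
```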
Canonical Ordering
  - For any two consecutive characters ab,
    reorder as ba if ccc(a) > ccc(b) > 0;
    repeat until no more such reorderings remain 
  - Very similar to bubblesort (but not the same!)
 
  - Local operation: most characters have ccc=0, and don't move
 
  - Example (ş̤̂̏, ccc in parentheses, characters being exchanged):

      original:           s,   ̂ (230),   ̧ (202),   ̏ (230),   ̤ (220)
      after first step:   s,   ̧ (202),   ̂ (230),   ̏ (230),   ̤ (220)
      after second step:  s,   ̧ (202),   ̂ (230),   ̤ (220),   ̏ (230)
      final:              s,   ̧ (202),   ̤ (220),   ̂ (230),   ̏ (230)

    4 diacritics, 24 permutations, of which 12 are equivalent  
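The swap rule above can be written out directly in Python (a sketch for illustration; real implementations use a more efficient stable sort) and checked against the library's NFD, since the example string contains no precomposed characters:

```python
import unicodedata

def canonical_order(s: str) -> str:
    """Reorder combining marks: swap adjacent a, b while ccc(a) > ccc(b) > 0."""
    chars = list(s)
    changed = True
    while changed:
        changed = False
        for i in range(len(chars) - 1):
            a = unicodedata.combining(chars[i])
            b = unicodedata.combining(chars[i + 1])
            if a > b > 0:
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
                changed = True
    return ''.join(chars)

# s + circumflex (230) + cedilla (202) + double grave (230) + diaeresis below (220)
s = 's\u0302\u0327\u030F\u0324'
assert canonical_order(s) == 's\u0327\u0324\u0302\u030F'
assert canonical_order(s) == unicodedata.normalize('NFD', s)
```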
 
Precomposed Characters
  - Legacy encodings had characters including diacritics
 
  - Limited transcoding technology
 
  - Limited display technology
 
  - European national pride and voting power
 
  - Unicode - ISO 10646 merger
 
  - Precomposed characters in Unicode 1.1
 
  - Equivalences between precomposed characters and combining sequences
    defined
 
  - ⇒Normalization Form D
 
Hangul Syllables
  - Korean is written in square syllables (Hangul, 한글)
 
  - Syllables consist of 
    
      - Consonant(s) + vowel(s) (가)
 
      - Consonant(s) + vowel(s) + consonant(s) (민)
 
    
   
  - Individual pieces (Conjoining Jamo) are also encoded: 
    
      - Leading consonant(s) (ᄍ)
 
      - Vowel(s) (ᅱ)
 
      - Trailing consonant(s) (ᆹ)
 
    
   
  - For all L, V, and T in modern use, every LV and LVT syllable is encoded
    (over 10'000)
 
  - Syllables with historical L (ᅑ), V (ᆑ), and T (ᇷ) must use Jamo
 
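Because the LV(T) syllables are laid out arithmetically, decomposition into Jamo needs no table; a sketch using the conjoining Jamo constants from the Unicode Standard:

```python
import unicodedata

# Hangul syllable decomposition is purely arithmetic (constants from the
# Unicode Standard's conjoining Jamo algorithm).
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28

def decompose_hangul(ch: str) -> str:
    index = ord(ch) - S_BASE
    l = chr(L_BASE + index // (V_COUNT * T_COUNT))
    v = chr(V_BASE + (index % (V_COUNT * T_COUNT)) // T_COUNT)
    t = chr(T_BASE + index % T_COUNT) if index % T_COUNT else ''
    return l + v + t

assert decompose_hangul('가') == '\u1100\u1161'                      # LV syllable
assert decompose_hangul('민') == unicodedata.normalize('NFD', '민')  # LVT syllable
```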
 
Normalization Form D (NFD)
(D stands for Decomposed)
Apply Canonical Decomposition, consisting of:
  - Recursive application of canonical decomposition mappings
    (incl. algorithmic Hangul decomposition)

  - Canonical Reordering
Advantages of NFD:
  - Flat and straightforward
 
  - Fast operations
 
Disadvantages of NFD:
  - Most data is not in NFD
 
  - Difficult to force everybody to use it
 
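A quick NFD demonstration in Python (unicodedata module), covering a singleton and a recursive decomposition from the examples above:

```python
import unicodedata

# Singleton: U+212B ANGSTROM SIGN decomposes to A + U+030A (combining ring above)
assert unicodedata.normalize('NFD', '\u212B') == 'A\u030A'
# Ṝ (U+1E5C) decomposes recursively, with marks in canonical order:
# dot below (ccc 220) precedes macron (ccc 230)
assert unicodedata.normalize('NFD', '\u1E5C') == 'R\u0323\u0304'
```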
 
Usage Layers
  - Electric current, light, electromagnetic waves
 
  - TCP/IP: Byte streams
 
  - DNS/HTTP/SMTP/FTP: Bytes or characters
 
  - User application: Meaningful characters
 
Different Mindsets
  - Multilingual Editor: 
    
      - Character boundaries, complex display
 
      - Preferred internal normalization
 
      - Normalization not a major burden
 
    
   
  - Application Protocols (e.g. IETF): 
    
      - Textual content just transported as bytes+charset info
 
      - Identifiers ASCII only, comparison bytewise or ASCII-case
      insensitive
 
    
   
  - Internationalized Identifiers: 
    
      - E.g. XML element names (W3C)
 
      - Byte or codepoint comparison
 
      - Normalization close to actual practice desired
 
    
   
  - ⇒Normalization Form C
 
 
Normalization Form C (NFC)
(C stands for Composed)
After Canonical Decomposition (see NFD):
  - Canonical Recomposition
Advantages of NFC:
  - Very close to usage on Web
    (design goal, in some ways actually too close) 
  - More compact
 
Disadvantages of NFC:
 
Canonical Recomposition
  - Works repeatedly pairwise
 
  - Starts with base character and first combining character
 
  - Only look at first character of each combining class
    (to maintain canonical equivalence) 
  - When combination exists, combine and restart
 
  - Exclude Composition Exclusions
 
  - Data from code charts (using ≡) or UnicodeData file
    (6th field, when no <...> tag) 
  - Hangul syllable composition (algorithmic)
   
 
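Recomposition behavior, including a singleton exclusion, can be observed in Python:

```python
import unicodedata

# Pairwise recomposition: A + combining ring above composes to U+00C5
assert unicodedata.normalize('NFC', 'A\u030A') == '\u00C5'
# Singleton: U+212B ANGSTROM SIGN decomposes to A + ring, which then
# recomposes to U+00C5 (never back to U+212B)
assert unicodedata.normalize('NFC', '\u212B') == '\u00C5'
```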
Composition Exclusions
  - Singletons (nothing to recombine)
 
  - Script specific:
    (precomposed rarely used) 
    
      - Indic (क़ख़ग़ज़ड़ढ़फ़य़ড়ঢ়য়ਖ਼ਗ਼ਜ਼ੜਫ਼ଡ଼ଢ଼)
 
      - Tibetan
 
      - Hebrew (שׁשׂשּׁשּׂאַאָאּבּגּדּהּוּ...)
 
    
   
  - Non-starter decompositions
    (cases with oddball ccc) 
  - Post Composition Exclusions
 
  - Data from CompositionExclusions.txt
 
 
Stability
  - Normalized text should stay so in future Unicode versions
 
  - Okay for NFD
 
  - Needs some work for NFC
 
  - Problem is new precomposed characters
    old: q +  ̂, new: q̂ (precomposed) 
  - New (post composition) precomposed characters: 
    
      - Need to be excluded from composition (NFC)
        (unless both parts are new) 
    
    
      - Discourages encoding in the first place
 
    
   
  - Careful stability guarantees
 
  - Effective stop for encoding precomposed characters 
    
      - Strengthens compromise between Unicode and ISO
 
      - Named character sequences
 
    
   
 
Normalization Corrigenda
(4 of the 9 corrigenda issued for all of Unicode concern normalization)
 
NFC/NFD Variants
  - Macintosh file system NFD (roughly: Hangul NFC, rest NFD)
 
  - Traditional Hangul (improve display)
 
  - Compatibility Han Ideographs (use variant selectors)
 
Kompatibility Equivalence
From Unicode Standard Annex (UAX) #15, Section 1.1:
  Compatibility equivalence is a weaker equivalence between characters
  or sequences of characters that represent the same abstract character, but
  may have a different visual appearance or behavior.
  - May not conserve semantics: 2³=8, but 2³ ≈ 23
 
  - Is not necessarily very consistent (e.g. ④≈4, but ➃≁4)
 
  - Use different characters when semantically different (e.g. Mathematical
    notation)
 
  - Use markup/styling for stylistic differences (e.g.
  emphasis)
 
A <tag> indicates and classifies a kompatibility decomposition in the UnicodeData file:
  <noBreak>: Difference in breaking properties only 
  <super>: Superscripts 
  <sub>: Subscripts 
  <fraction>: Fractions 
  <circle>: Circled letters 
  <square>: Square blocks 
  <font>: Font variants (HEBREW LETTER WIDE;
    MATHEMATICAL/ARABIC MATHEMATICAL symbols) 
  <wide>: Fullwidth (double width) variants 
  <narrow>: Halfwidth variants 
  <small>: Small variants 
  <vertical>: Vertical variants 
  <isolated>, <final>,
    <initial>, <medial>: Arabic
    contextual variants and ligatures 
  <compat>: General compatibility (e.g.
    spacing/nonspacing), ligatures, double characters (e.g. double prime),
    roman numerals, script variants: long s, greek theta, space width variants
    (EM space,...), parenthesized, with full stop/COMMA, IDEOGRAPHIC
    TELEGRAPH; (CJK/KANGXI) radicals, HANGUL LETTER
  variants 
 
Kompatibility Decomposition
  - Kompatibility decompositions as such proceed in one go (e.g. FDFA ≈
    0635 0644 0649 0020 0627 0644 0644 0647 0020 0639 0644 064A 0647 0020 0648
    0633 0644 0645)
 
  - However, kompatibility and canonical decomposition need to be applied
    repeatedly: 
    
      - First compatibility decomposition, then canonical decomposition:
        U+01C4, LATIN CAPITAL LETTER DZ WITH CARON
        ≈<compat> 0044 017D ≡> 0044 005A
        030C 
      - First canonical decomposition, then compatibility
        decomposition:
        U+0385, GREEK DIALYTIKA TONOS ≡> 00A8 0301
        ≈<compat> 0020 0308 0301 
    
   
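Both orderings above can be verified in Python, since normalize('NFKD') applies the two decompositions repeatedly until a fixed point is reached:

```python
import unicodedata

# U+01C4 DZ WITH CARON: compatibility decomposition to D + Ž first,
# then canonical decomposition of Ž to Z + combining caron
assert unicodedata.normalize('NFKD', '\u01C4') == 'DZ\u030C'
# U+0385 GREEK DIALYTIKA TONOS: canonical decomposition to U+00A8 + U+0301 first,
# then compatibility decomposition of U+00A8 to space + combining diaeresis
assert unicodedata.normalize('NFKD', '\u0385') == ' \u0308\u0301'
```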
Normalization Form KD (NFKD)
  - Kompatibility Decomposition
 
  - Canonical Reordering
 
Normalization Form KC (NFKC)
  - Kompatibility Decomposition
 
  - Canonical Reordering
 
  - Canonical Recomposition
 
Take Care with Normalization Forms
  - Normalization forms interact with case conversion
 
  - Normalization forms interact with string operations
    e.g. concatenation 
  - Normalization forms interact with markup
    e.g. ≮≡<+ ̸ 
  - Normalization forms interact with escaping
 
  - Normalization interacts with Unicode Versions
    (but greatest care has been taken to limit this) 
Similar concerns may apply to other kinds of mappings
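The concatenation interaction in particular is easy to demonstrate in Python: two strings that are each in NFC on their own can concatenate to a string that is not.

```python
import unicodedata

a, b = 'e', '\u0301'   # each string is already in NFC on its own
concat = a + b
# The concatenation is not in NFC: normalizing recomposes it to é (U+00E9)
assert unicodedata.normalize('NFC', concat) == '\u00E9'
assert concat != unicodedata.normalize('NFC', concat)
```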
 
Case Study: From IDNA2003 to Precis
  - IETF need for 'normalized' identifiers
 
  - IDNA 2003: 
    
      - Case folding
 
      - NFKC
 
      - Wide repertoire (what's not forbidden is allowed)
 
      - Based on Unicode 3.2
 
    
   
  - IDNA 2008: 
    
      - Mappings (incl. case) outside spec
 
      - NFC
 
      - Contextual rules
 
      - Narrow repertoire (what's not allowed is forbidden)
 
      - Based on character properties
 
    
   
  - Precis (framework): 
    
      - Allowed characters based on character properties
 
      - Width mapping (<wide>/<narrow>) 
      - Additional mappings
 
      - Case mapping
 
      - Normalization (NFC or NFKC)
 
    
   
 
Case Study: Windows File System vs. IDNs
  - Both are case-insensitive
 
  - Windows File System keeps original casing for display
    (needs data in two forms for efficiency) 
  - IDNs don't keep original casing
    (on the Web, there's no IBM, only ibm) 
Case Study: Normalization Break-in
(actual case!)
  - Username normalization different for 
    
      - Account creation
 
      - Login
 
      - Actual access
 
    
   
  - Create an account with a username compatibility-equivalent to the targeted
    one
 
  - Log in and take over
 
Strategy
  - To check equivalence, use (one of) the corresponding mappings and compare
    for equality
 
  - Mapping can appear at system boundary or on actual equivalence check
 
  - System boundary may vary
    
      - All of the Internet/WWW
 
      - Specific servers, clients
 
      - Some kinds/types of data
 
    
   
  - Check for consistency between system components
 
  - Check for consistency between software versions
 
More about Normalization Implementation
Talk tomorrow (Tuesday, 22 October), Track 3, Session 4 (14:30 –
15:20):
Implementing Normalization in Pure Ruby - the Fast and Easy Way
Questions and Answers
Questions may be unnormalized, but answers will be normalized!
Acknowledgments
Mark Davis for many years of collaboration (and some disagreements) on
normalization, and for proposing a wider approach to the topic of
normalization.
The IME Pad for facilitating character input.
Amaya and Web technology for slide editing and display.