Tutorial
Equivalence, Mapping, and Normalization
http://www.sw.it.aoyama.ac.jp/2014/pub/NormEquivTut
IUC 38, Santa Clara, CA, U.S.A.,
3 November 2014
Martin J. DÜRST
duerst@it.aoyama.ac.jp
Aoyama Gakuin University
© 2013-4 Martin J. Dürst, Aoyama Gakuin University
About the Slides
The most up-to-date version of the slides, as well as additional materials,
is available at http://www.sw.it.aoyama.ac.jp/2014/pub/NormEquivTut.
These slides are created in HTML, for projection with Opera
(≤12.16, Windows/Mac/Linux; use F11). Texts in gray, like this one, are comments/notes which do not appear
on the slides. Please note that depending on the browser and OS you use, some
of the characters and character combinations may not display as intended, but
e.g. as empty boxes, question marks, or apart rather than composed.
Abstract
The wealth of characters in Unicode means that there are
many ways in which characters or strings can be equivalent, similar, or
otherwise related. In this tutorial, you will learn about all these
relationships, in order to be able to use this knowledge for dealing with
Unicode data or programs handling Unicode data.
Character relationships and similarities in Unicode range
from linguistic and semantic similarities at the 'top' to equivalent
representations in different character encodings and Unicode encoding forms at
the bottom, with numerical and case equivalences, compatibility and canonical
equivalences, and graphic similarities in the middle. The wealth of
equivalences and relationships is due to the rich history of human writing as
well as to the realities of character encoding policies and decisions.
Each of these relationships is ignorable in some processing
contexts, but may be crucial in others. Processing contexts may range from use
as identifiers (e.g. user ids and passwords) to searching and sorting. For most
of the equivalences, data is available in the Unicode Standard and its
associated data files, but the use of this data or the functions provided by
various libraries requires understanding the background of the equivalences.
When testing for equivalence of two strings, the general strategy is to
normalize both strings to a form that eliminates accidental (in the given
context) differences, and then compare the strings on a binary level.
The tutorial will not only look at officially defined
equivalences, but will also discuss variants that may be necessary in practice
to cover specialized needs. We will also discuss the relationships between
various classes of equivalences, necessary to avoid pitfalls when combining
them, and the stability of the equivalences over time and under various
operations such as string concatenation.
The tutorial assumes that participants have a basic
understanding about the scope and breadth of Unicode, possibly from attending
tutorials earlier in the day.
Audience
- Basic knowledge about Unicode (e.g. you attended some tutorials
today)
- Responsible for Unicode data or programs dealing with Unicode data
Main Points
- Lots of characters: Lots of similarities
- Why should you care?
- Equivalences and mappings:
- Casing
- Numbers
- Security confusables
- Sorting
- Canonical/Kompatibility Equivalence
- Normal forms
- Technical details
- History and politics
- Strategic advice
Speaker Normalization
Let's normalize the speaker, or in other words see how the
speaker was involved in normalization!
Tell me the Difference
What's the difference between the following characters:
৪ and 8 and ∞
Bengali 4, 8, and infinity
க௧
Tamil ka and Tamil 1
骨 and 骨
same codepoint, different language
preference/font
樂, 樂 and 樂
different Korean pronunciation, different codepoints
A Big List
Linguistic and Semantic Similarities | color, colour, Farbe, 色 | not discussed
Transliterations/Transcriptions | Putin, Путин, 푸틴, プーチン, 普京, پوتین, بوتين | not discussed
Accidental Graphical Similarities | T, ⊤, ┳, ᅮ | this talk's topic
Script Similarities | K, Κ, К; ヘ、へ | this talk's topic
Cross-script 'Equivalences' | な, ナ | this talk's topic
Numeric 'Equivalence' | 7, ۷७৭௭౭๗໗༧፯៧᠗⁷₇⑦⑺⒎7⓻❼➆➐ (59 in total, not including a few Han ideographs) | this talk's topic
Case 'Equivalence' | g, G; dž, Dž, DŽ; Σ, ς, σ | this talk's topic
Compatibility Equivalence | ⁴, ₄, ④, ⒋, 4 | this talk's topic
Canonical Equivalence | Å, Å; Ṝ, Ṛ + ̄, R + ̣ + ̄ | this talk's topic
Unicode Encoding Forms | UTF-8: E9 9D 92 E5 B1 B1, UTF-16: 9752 5C71 | not discussed
(Legacy) Encodings | ISO-8859-X, Shift_JIS,... | not discussed
Similarities and equivalences range from binary (at the
bottom) to semantic (at the top). In this tutorial, we assume a uniform
Unicode Encoding Form (e.g. UTF-8) and don't deal with linguistic and semantic
similarities, but discuss all the layers in between.
Why all these Similarities and Variants?
- Historical cultural evolution
- Scripts and characters borrowed
- Changed by writing tools and customs
- From stylistic to orthographic distinctions
- Independent quest for simple graphics
- Encoding realities
- Encoding structure choices
- Round-tripping requirements
- Encoding compromises
- Encoding accidents
Why Does it Matter?
(actual case!)
- Possible to hijack user account
- How:
- Create account with compatibility equivalent to the targeted one
(e.g. ᴮᴵᴳᴮᴵᴿᴰ)
- Ask for a password reset
- Log in and take over
- Why:
- Username normalization incomplete
- ᴮᴵᴳᴮᴵᴿᴰ → BIGBIRD
- BIGBIRD → bigbird
- Correct: ᴮᴵᴳᴮᴵᴿᴰ → bigbird
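The correct pipeline can be sketched with Python's standard unicodedata module (`canonicalize_username` is a hypothetical helper, not from the slides):

```python
import unicodedata

def canonicalize_username(name):
    # Kompatibility-normalize first (maps e.g. the modifier-letter
    # "superscript" capital U+1D2E to a plain B), then case-fold,
    # then normalize once more, because case folding can itself
    # produce unnormalized output.
    mapped = unicodedata.normalize('NFKC', name)
    return unicodedata.normalize('NFKC', mapped.casefold())
```

With this, both ᴮᴵᴳᴮᴵᴿᴰ and BIGBIRD map to the same stored key, bigbird, closing the hole.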
Applications
- Identification:
- Domain Names/Filenames/Usernames/Passwords
- Element names/attribute names/class names in XML/HTML/CSS
- Variable names in programming languages
- Searching
- Sorting
Equivalences/Mappings Defined by Unicode
- Canonical Equivalence
- Normalization Form D (NFD)
- Normalization Form C (NFC)
- Kompatibility Equivalence
- Normalization Form KD (NFKD)
- Normalization Form KC (NFKC)
- Case equivalence: Case folding/mapping
- Sorting: Sorting keys
- Numeric Values
- Accidental graphical similarities: confusability skeletons
It's Compatibility Equivalence officially,
but we will use Kompatibility Equivalence (as in NFKC/NFKD)
hereafter for better distinction with Canonical Equivalence.
Definitions by Others
- Normalization/Casing
- Character (type) restrictions
- Bidi restrictions
- Script restrictions
Something Familiar: Casing
A ↔ a
...
Z ↔ z
Is it actually that easy?
Special Casings
- Case is only relevant for Latin, Greek, Cyrillic, Armenian, Deseret, old
Hungarian, Cherokee (?), and ancient Georgian
- Unicode knows three cases: lowercase, titlecase, and uppercase
- Case may be language/locale-dependent
(e.g. English: i↔I; Turkish: i↔İ, ı↔I)
- Case may depend on typographic tradition
(accents on French uppercase letters,...)
- Case may not be character-to-character
(e.g. ß↔SS)
- Case may be context-dependent
e.g. σ↔Σ, but at the end of a word: ς↔Σ
- Very vaguely related: Japanese カタカナ↔ひらがな
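Several of these points can be observed directly in Python, whose str.upper()/lower()/casefold() implement Unicode's default (not locale-aware) mappings (a demonstration, not part of the slides):

```python
# Case mapping need not be character-to-character (the string may grow):
assert '\u00df'.upper() == 'SS'                              # ß → SS

# Lowercasing capital sigma is context-dependent (Final_Sigma rule):
assert '\u03a3\u0391\u03a3'.lower() == '\u03c3\u03b1\u03c2'  # ΣΑΣ → σας

# Case folding maps final and medial sigma together:
assert '\u03c2'.casefold() == '\u03c3'                       # ς → σ

# But there is no Turkish locale handling: 'I' always lowers to 'i'
assert 'I'.lower() == 'i'
```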
Case Equivalence/Mapping
- Case Folding:
- Aggressively maps case-related strings together
- To lower case (T/Τ/Т but t/τ/т, see Wikipedia)
- Suited e.g. for search
- Default Case Mapping:
- Not as aggressive as case folding
- Goes both ways
- Use if no context information available
- Use after context-specific mappings
- Language-dependent Case Mappings
- Simple Case Mapping:
- Character-to-character only
- No change of string length
Data for Casing
Equivalence vs. Mapping
- Equivalence is defined between pairs of strings
Two strings HELLO and hello are equivalent (or
not)
- Mapping is defined from one string to another
String Good Bye maps to string good bye
- Equivalence is often defined using mapping
Example: For equivalence FOOeq and mapping FOOmap,
FOOmap(A) = FOOmap(B) ⇔ FOOeq(A,
B)
Equivalence is often defined using a mapping. As an example,
an equivalence FOOeq may be defined with a mapping FOOmap. Two strings
A and B are FOOequivalent if (and only if) the FOOmapping
of A and the FOOmapping of B are equal. Put the other way
round, every mapping defines an equivalence.
Security Considerations:
- Stable mappings
- Store only mapped form? (but may want to keep a display form)
- Make sure mapping is idempotent, or check for a
fixpoint
Idempotence
A function is idempotent if applying it more than once gives the same result
as applying it once.
f(f(x)) = f(x) ⇔
f is idempotent
Caution: The combination of two independently idempotent functions may not
be an idempotent function
f(f(x)) = f(x),
g(g(x)) = g(x) ↛
f(g(f(g(x)))) = f(g(x)),
i.e. the composition f(g(x)) may not be idempotent
Examples:
- Canonical decomposition and Kompatibility decomposition
- Normalization and case mapping
Fixpoints
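A fixpoint check can be sketched in Python (hypothetical helper names; the NFC + case-fold combination serves only as an example mapping):

```python
import unicodedata

def nfc_casefold(s):
    # Composition of two operations that are each idempotent on their own
    return unicodedata.normalize('NFC', s).casefold()

def to_fixpoint(f, s, max_rounds=5):
    # Apply f repeatedly until the output stops changing; this guards
    # against compositions that are not idempotent in a single pass.
    for _ in range(max_rounds):
        t = f(s)
        if t == s:
            return s
        s = t
    raise ValueError('no fixpoint reached in %d rounds' % max_rounds)
```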
Numerical Equivalence
- Columns 6 (decimal), 7 (digit), and 8 (fraction) of UnicodeData.txt
- (ASCII) hexadecimal digits in PropList.txt
- Other ways to represent numbers:
- Greek, Hebrew: Letters with specific values
- Numbers as spoken out
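The three numeric fields of UnicodeData.txt are exposed directly by Python's unicodedata module (an illustration, not part of the slides):

```python
import unicodedata

# decimal/digit/numeric correspond to fields 6, 7, and 8:
assert unicodedata.decimal('\u06f7') == 7     # ۷ EXTENDED ARABIC-INDIC DIGIT SEVEN
assert unicodedata.digit('\u2466') == 7       # ⑦ CIRCLED DIGIT SEVEN
assert unicodedata.numeric('\u00bc') == 0.25  # ¼ VULGAR FRACTION ONE QUARTER
```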
Security Confusables
Multilevel complexity because not only characters, but also
scripts are involved. Confusability is not necessarily an equivalence
relation: a may be confusable with b, and b with c, but a not with c.
Sorting
- Sorting isn't just about equivalence (=), but about ordering (≤)
- Mapping to sort keys serves the same purpose:
String A sorts before B if sort_key(A) ≤
sort_key(B)
- Sorting is language/locale-dependent (viewer, not data)
- Sorting uses levels, such as:
- Base character (A<B)
- Diacritics (u<ü)
- Case (g<G or G<g)
- Others (space, articles,...)
Sorting is related to equivalence and mappings, but is a
talk of its own (going on now in track 1).
Canonical Equivalence
From Unicode
Standard Annex (UAX) #15, Section 1.1:
Canonical equivalence is a fundamental equivalency between
characters or sequences of characters that represent the same abstract
character, and when correctly displayed should always have the same
visual appearance and behavior.
(emphasis added by presenter)
Why Canonical Equivalence
- Order of combining characters
R + ̣ + ̄ vs. R + ̄ + ̣
- Precomposed vs. Decomposed
Ṝ vs. Ṛ + ̄ vs. R + ̣ + ̄
- Singletons
Å vs. Å, 樂 vs. 樂 vs. 樂
- Hangul syllables vs. Jamos
가 vs. ᄀ + ᅡ
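All four cases can be verified with Python's unicodedata (an illustration, not part of the slides):

```python
import unicodedata

angstrom = '\u212b'   # Å ANGSTROM SIGN (a singleton)
a_ring   = '\u00c5'   # Å LATIN CAPITAL LETTER A WITH RING ABOVE
combined = 'A\u030a'  # A + COMBINING RING ABOVE

# All three are canonically equivalent; NFC picks one representative:
assert unicodedata.normalize('NFC', angstrom) == a_ring
assert unicodedata.normalize('NFC', combined) == a_ring

# Hangul syllables and conjoining jamo sequences are equivalent, too:
assert unicodedata.normalize('NFD', '\uac00') == '\u1100\u1161'  # 가 → ᄀ + ᅡ
```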
Combining Characters
- Many scripts use base characters and diacritics
- Diacritics and similar characters are called Combining Characters
- Often, all combinations may appear (e.g. Arabic, Hebrew)
- Often, combinations limited per language,
but open-ended for the script (Latin)
- Separate encoding seems advantageous
(only one available in Unicode 1.0)
- Order meaningful if e.g. all diacritics above
- Order arbitrary if e.g. above and below
- ⇒Canonical Ordering
Canonical Combining Class
- A character property, integer between 0 and 254
- All base characters have ccc=0
- Only combining marks (but not all of them) have ccc≠0
- Ccc is the same for combining characters that:
- Go on the same side of the base character
- Therefore need to maintain their relative order
- UnicodeData
file (4th field, 0 when no entry)
(see also DerivedCombiningClass.txt)
Canonical Ordering
- For any two consecutive characters ab
reorder as ba if ccc(a) >
ccc(b) > 0
until you find no more such reorderings
- Very similar to bubblesort (but not the same!)
- Local operation: most characters have ccc=0, and don't move
- Example (ş̤̂̏, ccc in parentheses, characters being exchanged at each step):

original | s | ̂ (230) | ̧ (202) | ̏ (230) | ̤ (220)
after first step | s | ̧ (202) | ̂ (230) | ̏ (230) | ̤ (220)
after second step | s | ̧ (202) | ̂ (230) | ̤ (220) | ̏ (230)
final | s | ̧ (202) | ̤ (220) | ̂ (230) | ̏ (230)
4 diacritics, 24 permutations, of which 12 are equivalent
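A straightforward rendition of this reordering in Python, using unicodedata.combining() for the ccc (a sketch, not the exact UAX #15 wording):

```python
import unicodedata

def canonical_order(s):
    # Bubble-style canonical reordering: swap adjacent characters when
    # both are combining marks (ccc > 0) and the left one outranks the
    # right one. Characters with ccc = 0 never move.
    chars = list(s)
    swapped = True
    while swapped:
        swapped = False
        for i in range(len(chars) - 1):
            a = unicodedata.combining(chars[i])
            b = unicodedata.combining(chars[i + 1])
            if a > b > 0:
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
                swapped = True
    return ''.join(chars)
```

On the example above (s + circumflex + cedilla + double grave + diaeresis below) it produces the order 202, 220, 230, 230, matching what NFD's canonical reordering yields.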
Precomposed Characters
- Legacy encodings had characters including diacritics
- Limited transcoding technology
- Limited display technology
- European national pride and voting power
- Unicode - ISO 10646 merger
- Precomposed characters in Unicode 1.1
- Equivalences between precomposed characters and combining sequences
defined
- ⇒Normalization Form D
Hangul Syllables
- Korean is written in square syllables (Hangul, 한글)
- Syllables consist of
- Consonant(s) + vowel(s) (가)
- Consonant(s) + vowel(s) + consonant(s) (민)
- Individual pieces (Conjoining Jamo) are also encoded:
- Leading consonant(s) (ᄍ)
- Vowel(s) (ᅱ)
- Trailing consonant(s) (ᆹ)
- For all L, V, and T in modern use, every LV and LVT syllable is encoded
(over 10'000)
- Syllables with historical L (ᅑ), V (ᆑ), and T (ᇷ) must use Jamo
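Hangul syllable decomposition is purely arithmetic; a sketch following the constants in the Unicode Standard, Section 3.12:

```python
# Base codepoints and counts from the Unicode Standard, Section 3.12
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28

def decompose_hangul(ch):
    s = ord(ch) - S_BASE
    if not 0 <= s < L_COUNT * V_COUNT * T_COUNT:
        return ch  # not a precomposed Hangul syllable
    l = chr(L_BASE + s // (V_COUNT * T_COUNT))        # leading consonant
    v = chr(V_BASE + (s % (V_COUNT * T_COUNT)) // T_COUNT)  # vowel
    t = s % T_COUNT                                   # trailing consonant index
    return l + v + (chr(T_BASE + t) if t else '')
```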
Normalization Form D (NFD)
(D stands for Decomposed)
NFD is the result of applying Canonical Decomposition:
- Apply Decomposition Mappings
- Code charts, using ≡, or UnicodeData file (6th field, when no <...> tag)
- Hangul syllable decomposition (algorithmic)
until you find no more decompositions
- Canonical Reordering
(Dis)Advantages of NFD
Advantages of NFD:
- Flat and straightforward
- Fast operations
Disadvantages of NFD:
- Most data is not in NFD
- Difficult to force everybody to use it
Usage Layers
- topmost: User application: Meaningful characters
- higher: DNS/HTTP/SMTP/FTP: Bytes or characters
- lower: TCP/IP: Byte streams
- bottom: Electric current, light, electromagnetic waves
Different Mindsets
- Multilingual Editor:
- Character boundaries, complex display
- Preferred internal normalization
- Normalization not a major burden
- Application Protocols (e.g. IETF):
- Textual content just transported as bytes+charset info
- Identifiers ASCII only, comparison bytewise or ASCII-case
insensitive
- Internationalized Identifiers:
- E.g. XML element names (W3C)
- Byte or codepoint comparison
- Normalization close to actual practice desired
- ⇒Normalization Form C
Normalization Form C (NFC)
(C stands for Composed)
Definition:
- Canonical Decomposition (see NFD)
- Canonical Recomposition
Advantages of NFC:
- Very close to usage on Web
(design goal, in some ways actually too close)
- More compact
Disadvantages of NFC:
The phrase "in some ways actually too close" refers to the
fact that NFC is so close to actual usage on the Web that there isn't too much
motivation to actively normalize content. This is not exactly the purpose of a
normalization form, as it means that there is still some content out there that
is not normalized.
Canonical Recomposition
- Works repeatedly pairwise
- Starts with base character, ends before next base character
- When combination exists, combine and continue with next combining
character
- Skip additional characters of same combining class after combining
failure
- Data from code charts, using ≡, or UnicodeData.txt (field 6, when no <...> tag)
- Exclude Composition Exclusions
- Hangul syllable composition (algorithmic)
Composition Exclusions
- Singletons (nothing to recombine)
- Script specific:
(precomposed rarely used)
- Indic (क़ख़ग़ज़ड़ढ़फ़य़ড়ঢ়য়ਖ਼ਗ਼ਜ਼ੜਫ਼ଡ଼ଢ଼)
- Tibetan
- Hebrew (שׁשׂשּׁשּׂאַאָאּבּגּדּהּוּ...)
- Non-starter decompositions
(cases with oddball ccc)
- Post Composition Exclusions
- Data from CompositionExclusions.txt
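The effect of composition exclusions is easy to observe with Python's unicodedata (an illustration, not part of the slides):

```python
import unicodedata

# क़ (U+0958) is a composition exclusion: NFD takes it apart,
# and NFC deliberately does not put it back together.
decomposed = unicodedata.normalize('NFD', '\u0958')
assert decomposed == '\u0915\u093c'  # क + combining nukta
assert unicodedata.normalize('NFC', decomposed) == decomposed

# Singletons such as Å (U+212B ANGSTROM SIGN) also never reappear:
assert unicodedata.normalize('NFC', '\u212b') == '\u00c5'
```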
Stability
- Normalized text should stay so in future Unicode versions
- Okay for NFD
- Needs some work for NFC
- Problem is new precomposed characters
old: q + ̂, new: q̂ (precomposed)
- New (post composition) precomposed characters:
- Need to be excluded from composition (NFC)
(unless both parts are new)
- Discourages encoding in the first place
- Careful
stability guarantees
- Effective stop for encoding precomposed characters
- Strengthens compromise between Unicode and ISO
- Named character sequences
- Danger to interpret stability guarantee strictly formally
(anything can be added if it's not defined to be equivalent)
Normalization Corrigenda
(4 out of 9 for all of Unicode)
Kompatibility Equivalence
From Unicode
Standard Annex (UAX) #15, Section 1.1:
Compatibility equivalence is a weaker equivalence between characters
or sequences of characters that represent the same abstract character, but
may have a different visual appearance or behavior.
(emphasis added by presenter)
- May not preserve semantics: 2³ = 8, but 2³ ≈ 23
- Is not necessarily very consistent (e.g. ④≈4, but ➃≁4)
- Use different characters when semantically different (e.g. Mathematical
notation)
- Use markup/styling for stylistic differences (e.g.
emphasis)
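Both quirks can be checked directly with Python's unicodedata (an illustration, not part of the slides):

```python
import unicodedata

# Kompatibility mapping may lose meaning: 2³ becomes the string '23'
assert unicodedata.normalize('NFKC', '2\u00b3') == '23'

# ...and is not fully consistent: ④ maps to 4, but ➃ stays as it is
assert unicodedata.normalize('NFKC', '\u2463') == '4'   # CIRCLED DIGIT FOUR
assert unicodedata.normalize('NFKC', '\u2783') == '\u2783'  # dingbat circled four
```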
All Things Kompatibility
A <tag> in UnicodeData.txt indicates and classifies kompatibility:

- <noBreak>: Difference in breaking properties only
- <super>: Superscripts
- <sub>: Subscripts
- <fraction>: Fractions
- <circle>: Circled letters
- <square>: Square blocks
- <font>: HEBREW LETTER WIDE; MATHEMATICAL/ARABIC MATHEMATICAL
- <wide>: Fullwidth (double width) variants
- <narrow>: Halfwidth variants
- <small>: Small variants
- <vertical>: Vertical variants
- <isolated>, <final>, <initial>, <medial>: Arabic contextual variants and ligatures
- <compat>: General compatibility (e.g. spacing/nonspacing), ligatures, double characters (e.g. double prime), roman numerals, script variants (long s, Greek theta), space width variants (EM space,...), parenthesized, with full stop/comma, IDEOGRAPHIC TELEGRAPH; (CJK/KANGXI) radicals, HANGUL LETTER variants
Kompatibility Decomposition
- Kompatibility decompositions as such proceed in one go (e.g. FDFA ≈
0635 0644 0649 0020 0627 0644 0644 0647 0020 0639 0644 064A 0647 0020 0648
0633 0644 0645)
- However, kompatibility and canonical decomposition need to be applied
repeatedly:
- First kompatibility decomposition, then canonical decomposition:
U+01C4, LATIN CAPITAL LETTER DZ WITH CARON ≈<compat> 0044 017D ≡> 0044 005A 030C
- First canonical decomposition, then kompatibility decomposition:
U+0385, GREEK DIALYTIKA TONOS ≡> 00A8 0301 ≈<compat> 0020 0308 0301
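NFKD applies both kinds of decomposition exhaustively, so both orderings can be verified with Python's unicodedata (an illustration, not part of the slides):

```python
import unicodedata

# U+01C4 DŽ: kompatibility step (D + Ž), then canonical step (Ž → Z + caron)
assert unicodedata.normalize('NFKD', '\u01c4') == 'DZ\u030c'

# U+0385: canonical step (¨ + acute), then kompatibility step (¨ → space + ̈)
assert unicodedata.normalize('NFKD', '\u0385') == ' \u0308\u0301'
```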
Normalization Form KD (NFKD)
Definition:
- Kompatibility Decomposition
- Canonical Reordering
Normalization Form KC (NFKC)
Definition:
- Kompatibility Decomposition
- Canonical Reordering
- Canonical Recomposition
Take Care with Normalization Forms
- Normalization forms interact with case conversion
- Normalization forms interact with string operations
e.g. concatenation
- Normalization forms interact with markup
e.g. ≮ ≡ < + ̸
- Normalization forms interact with escaping
- Normalization interacts with Unicode Versions
(but greatest care has been taken to limit this)
Similar concerns may apply to other kinds of mappings
Because of the wide range of phenomena encoded by Unicode,
there is always the chance that different phenomena interact in strange and
unpredictable ways. Before assuming no interaction, carefully check. If you
don't find any interactions, don't assume that will be necessarily so in all
future versions.
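The concatenation interaction, for instance, is easy to demonstrate with Python's unicodedata (an illustration, not part of the slides):

```python
import unicodedata

a, b = 'e', '\u0301'   # 'e' and COMBINING ACUTE ACCENT, each already in NFC
assert unicodedata.normalize('NFC', a) == a
assert unicodedata.normalize('NFC', b) == b

# Their concatenation is NOT in NFC: normalizing composes it to é
concat = a + b
assert unicodedata.normalize('NFC', concat) == '\u00e9'
assert concat != unicodedata.normalize('NFC', concat)
```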
Limits of Normalization
- Designed mainly for Latin
- Koreans don't like it for historical Korean
- Macintosh file system NFD (roughly: Hangul NFC, rest NFD)
- Canonical equivalence of compatibility Han Ideographs
(use variant sequences)
- Diacritic reordering may be too strict or too loose
- Stability guarantees may be interpreted purely formally
Case Study: From IDNA2003 to PRECIS
- IETF need for 'normalized' identifiers
- IDNA 2003:
- Case folding
- NFKC
- Wide repertoire (what's not forbidden is allowed)
- Based on Unicode 3.2
- IDNA 2008:
- Mappings (incl. case) outside spec
- NFC
- Contextual rules
- Narrow repertoire (what's not allowed is forbidden)
- Based on character properties
- PRECIS (framework):
- Allowed characters based on character properties
- Width mapping (<wide>/<narrow>)
- Additional mappings
- Case mapping
- Normalization (NFC or NFKC)
Case Study: Windows File System vs. IDNs
- Both are case-insensitive
- Windows File System keeps original casing for display
(needs data in two forms for efficiency)
- IDNs don't keep original casing
(on the Web, there's no IBM, only ibm)
Strategy
- To check equivalence, use (one of) the corresponding mappings and compare
for equality
- Mapping can appear at system boundary or on actual equivalence check
- System boundary may vary
- All of the Internet/WWW
- Specific servers, clients
- Some kinds/types of data
- Check for consistency between system components
- Check for consistency between software versions
Questions and Answers
Questions may be unnormalized, but answers will be normalized!
Acknowledgments
Mark Davis for many years of collaboration (and some disagreements) on
normalization, and for proposing a wider approach to the topic of
normalization.
The IME Pad for facilitating character input.