Tutorial
Equivalence, Mapping, and Normalization
http://www.sw.it.aoyama.ac.jp/2014/pub/NormEquivTut
IUC 38, Santa Clara, CA, U.S.A.,
3 November 2014
Martin J. DÜRST
duerst@it.aoyama.ac.jp
Aoyama Gakuin University
© 2013-4 Martin J. Dürst, Aoyama Gakuin University
About the Slides
The most up-to-date version of the slides, as well as additional materials,
is available at http://www.sw.it.aoyama.ac.jp/2014/pub/NormEquivTut.
These slides are created in HTML, for projection with Opera
(≤12.16, Windows/Mac/Linux; use F11). Texts in gray, like this one, are comments/notes which do not appear
on the slides. Please note that depending on the browser and OS you use, some
of the characters and character combinations may not display as intended, but
e.g. as empty boxes, question marks, or apart rather than composed.
Abstract
The wealth of characters in Unicode means that there are
many ways in which characters or strings can be equivalent, similar, or
otherwise related. In this tutorial, you will learn about all these
relationships, in order to be able to use this knowledge for dealing with
Unicode data or programs handling Unicode data.
Character relationships and similarities in Unicode range
from linguistic and semantic similarities at the 'top' to equivalent
representations in different character encodings and Unicode encoding forms at
the bottom, with numerical and case equivalences, compatibility and canonical
equivalences, and graphic similarities in the middle. The wealth of
equivalences and relationships is due to the rich history of human writing as
well as to the realities of character encoding policies and decisions.
Each of these relationships is ignorable in some processing
contexts, but may be crucial in others. Processing contexts may range from use
as identifiers (e.g. user ids and passwords) to searching and sorting. For most
of the equivalences, data is available in the Unicode Standard and its
associated data files, but the use of this data or the functions provided by
various libraries requires understanding the background of the equivalences.
When testing for equivalence of two strings, the general strategy is to
normalize both strings to a form that eliminates accidental (in the given
context) differences, and then compare the strings on a binary level.
The tutorial will not only look at officially defined
equivalences, but will also discuss variants that may be necessary in practice
to cover specialized needs. We will also discuss the relationships between
various classes of equivalences, necessary to avoid pitfalls when combining
them, and the stability of the equivalences over time and under various
operations such as string concatenation.
The tutorial assumes that participants have a basic
understanding about the scope and breadth of Unicode, possibly from attending
tutorials earlier in the day.
Audience
- Basic knowledge about Unicode (e.g. you attended some tutorials
today)
- Responsible for Unicode data or programs dealing with Unicode data
Main Points
- Lots of characters: Lots of similarities
- Why should you care?
- Equivalences and mappings:
- Casing
- Numbers
- Security confusables
- Sorting
- Canonical/Kompatibility Equivalence
- Normal forms
- Technical details
- History and politics
- Strategic advice
Speaker Normalization
Let's normalize the speaker, or in other words see how the
speaker was involved in normalization!
Tell me the Difference
What's the difference between the following characters:
৪ and 8 and ∞
Bengali 4, 8, and infinity
க௧
Tamil ka and Tamil 1
骨 and 骨
same codepoint, different language
preference/font
樂, 樂 and 樂
different Korean pronunciation, different codepoints
A Big List
Linguistic and Semantic Similarities | color, colour, Farbe, 色 | not discussed
Transliterations/Transcriptions | Putin, Путин, 푸틴, プーチン, 普京, پوتین, بوتين | not discussed
Accidental Graphical Similarities | T, ⊤, ┳, ᅮ | this talk's topic
Script Similarities | K, Κ, К; ヘ、へ | this talk's topic
Cross-script 'Equivalences' | な, ナ | this talk's topic
Numeric 'Equivalence' | 7, ۷७৭௭౭๗໗༧፯៧᠗⁷₇⑦⑺⒎7⓻❼➆➐ (59 in total, not including a few Han ideographs) | this talk's topic
Case 'Equivalence' | g, G; dž, Dž, DŽ; Σ, ς, σ | this talk's topic
Compatibility Equivalence | ⁴, ₄, ④, ⒋, 4 | this talk's topic
Canonical Equivalence | Å, Å; Ṝ, Ṛ + ̄, R + ̣ + ̄ | this talk's topic
Unicode Encoding Forms | UTF-8: E9 9D 92 E5 B1 B1, UTF-16: 9752 5C71 | not discussed
(Legacy) Encodings | ISO-8859-X, Shift_JIS,... | not discussed
Similarities and equivalences range from binary (at the
bottom) to semantic (at the top). In this tutorial, we assume a uniform
Unicode Encoding Form (e.g. UTF-8) and don't deal with linguistic and semantic
similarities, but discuss all the layers in between.
Why all these Similarities and Variants?
- Historical cultural evolution
- Scripts and characters borrowed
- Changed by writing tools and customs
- From stylistic to orthographic distinctions
- Independent quest for simple graphics
- Encoding realities
- Encoding structure choices
- Round-tripping requirements
- Encoding compromises
- Encoding accidents
Why Does it Matter?
(actual case!)
- Possible to hijack user account
- How:
- Create account with compatibility equivalent to the targeted one
(e.g. ᴮᴵᴳᴮᴵᴿᴰ)
- Ask for a password reset
- Log in and take over
- Why:
- Username normalization incomplete
- ᴮᴵᴳᴮᴵᴿᴰ → BIGBIRD
- BIGBIRD → bigbird
- Correct: ᴮᴵᴳᴮᴵᴿᴰ → bigbird
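The correct pipeline can be sketched with Python's standard unicodedata module (`canonicalize_username` is a hypothetical helper, not from the slides):

```python
import unicodedata

def canonicalize_username(name):
    # Kompatibility-normalize first (maps e.g. the modifier-letter
    # "superscript" capital U+1D2E to a plain B), then case-fold,
    # then normalize once more, because case folding can itself
    # produce unnormalized output.
    mapped = unicodedata.normalize('NFKC', name)
    return unicodedata.normalize('NFKC', mapped.casefold())
```

With this, both ᴮᴵᴳᴮᴵᴿᴰ and BIGBIRD map to the same stored key, bigbird, closing the hole.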
Applications
- Identification:
- Domain Names/Filenames/Usernames/Passwords
- Element names/attribute names/class names in XML/HTML/CSS
- Variable names in programming languages
- Searching
- Sorting
Equivalences/Mappings Defined by Unicode
- Canonical Equivalence
- Normalization Form D (NFD)
- Normalization Form C (NFC)
- Kompatibility Equivalence
- Normalization Form KD (NFKD)
- Normalization Form KC (NFKC)
- Case equivalence: Case folding/mapping
- Sorting: Sorting keys
- Numeric Values
- Accidental graphical similarities: confusability skeletons
It's Compatibility Equivalence officially,
but we will use Kompatibility Equivalence (as in NFKC/NFKD)
hereafter for better distinction with Canonical Equivalence.
Definitions by Others
- Normalization/Casing
- Character (type) restrictions
- Bidi restrictions
- Script restrictions
Something Familiar: Casing
A ↔ a
...
Z ↔ z
Is it actually that easy?
Special Casings
- Case is only relevant for Latin, Greek, Cyrillic, Armenian, Deseret, old
Hungarian, Cherokee (?), and ancient Georgian
- Unicode knows three cases: lowercase, titlecase, and uppercase
- Case may be language/locale-dependent
(e.g. English: i↔I; Turkish: i↔İ, ı↔I)
- Case may depend on typographic tradition
(accents on French uppercase letters,...)
- Case may not be character-to-character
(e.g. ß↔SS)
- Case may be context-dependent
e.g. σ↔Σ, but at the end of a word: ς↔Σ
- Very vaguely related: Japanese カタカナ↔ひらがな
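Several of these points can be observed directly in Python, whose str.upper()/lower()/casefold() implement Unicode's default (not locale-aware) mappings (a demonstration, not part of the slides):

```python
# Case mapping need not be character-to-character (the string may grow):
assert '\u00df'.upper() == 'SS'                              # ß → SS

# Lowercasing capital sigma is context-dependent (Final_Sigma rule):
assert '\u03a3\u0391\u03a3'.lower() == '\u03c3\u03b1\u03c2'  # ΣΑΣ → σας

# Case folding maps final and medial sigma together:
assert '\u03c2'.casefold() == '\u03c3'                       # ς → σ

# But there is no Turkish locale handling: 'I' always lowers to 'i'
assert 'I'.lower() == 'i'
```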
Case Equivalence/Mapping
- Case Folding:
- Aggressively maps case-related strings together
- To lower case (T/Τ/Т but t/τ/т, see Wikipedia)
- Suited e.g. for search
- Default Case Mapping:
- Not as aggressive as case folding
- Goes both ways
- Use if no context information available
- Use after context-specific mappings
- Language-dependent Case Mappings
- Simple Case Mapping:
- Character-to-character only
- No change of string length
Data for Casing
Equivalence vs. Mapping
- Equivalence is defined between pairs of strings
Two strings HELLO and hello are equivalent (or
not)
- Mapping is defined from one string to another
String Good Bye maps to string good bye
- Equivalence is often defined using mapping
Example: For equivalence FOOeq and mapping FOOmap,
FOOmap(A) = FOOmap(B) ⇔ FOOeq(A,
B)
Equivalence is often defined using a mapping. As an example,
an equivalence FOOeq may be defined with a mapping FOOmap. Two strings
A and B are FOOequivalent if (and only if) the FOOmapping
of A and the FOOmapping of B are equal. Put the other way
round, every mapping defines an equivalence.
Security Considerations:
- Stable mappings
- Store only mapped form? (but may want to keep a display form)
- Make sure mapping is idempotent, or check for a
fixpoint
Idempotence
A function is idempotent if applying it more than once gives the same result
as applying it once.
f(f(x)) = f(x) ⇔
f is idempotent
Caution: The combination of two independently idempotent functions may not
be an idempotent function
f(f(x)) = f(x),
g(g(x)) = g(x) ↛
f(g(f(g(x)))) = f(g(x)),
i.e. the composition f(g(x)) may not be idempotent
Examples:
- Canonical decomposition and Kompatibility decomposition
- Normalization and case mapping
Fixpoints
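A fixpoint check can be sketched in Python (hypothetical helper names; the NFC + case-fold combination serves only as an example mapping):

```python
import unicodedata

def nfc_casefold(s):
    # Composition of two operations that are each idempotent on their own
    return unicodedata.normalize('NFC', s).casefold()

def to_fixpoint(f, s, max_rounds=5):
    # Apply f repeatedly until the output stops changing; this guards
    # against compositions that are not idempotent in a single pass.
    for _ in range(max_rounds):
        t = f(s)
        if t == s:
            return s
        s = t
    raise ValueError('no fixpoint reached in %d rounds' % max_rounds)
```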
Numerical Equivalence
- Columns 6 (decimal), 7 (digit), and 8 (fraction) of UnicodeData.txt
- (ASCII) hexadecimal digits in PropList.txt
- Other ways to represent numbers:
- Greek, Hebrew: Letters with specific values
- Numbers as spoken out
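The three numeric fields of UnicodeData.txt are exposed directly by Python's unicodedata module (an illustration, not part of the slides):

```python
import unicodedata

# decimal/digit/numeric correspond to fields 6, 7, and 8:
assert unicodedata.decimal('\u06f7') == 7     # ۷ EXTENDED ARABIC-INDIC DIGIT SEVEN
assert unicodedata.digit('\u2466') == 7       # ⑦ CIRCLED DIGIT SEVEN
assert unicodedata.numeric('\u00bc') == 0.25  # ¼ VULGAR FRACTION ONE QUARTER
```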
Security Confusables
Multilevel complexity because not only characters, but also
scripts are involved. Confusability is not necessarily an equivalence
relation: a may be confusable with b, and b with c, but a not with c.
Sorting
- Sorting isn't just about equivalence (=), but about ordering (≤)
- Mapping to sort keys serves the same purpose:
String A sorts before B if sort_key(A) ≤
sort_key(B)
- Sorting is language/locale-dependent (viewer, not data)
- Sorting uses levels, such as:
- Base character (A<B)
- Diacritics (u<ü)
- Case (g<G or G<g)
- Others (space, articles,...)
Sorting is related to equivalence and mappings, but is a
talk of its own (going on now in track 1).
Canonical Equivalence
From Unicode
Standard Annex (UAX) #15, Section 1.1:
Canonical equivalence is a fundamental equivalency between
characters or sequences of characters that represent the same abstract
character, and when correctly displayed should always have the same
visual appearance and behavior.
(emphasis added by presenter)
Why Canonical Equivalence
- Order of combining characters
R + ̣ + ̄ vs. R + ̄ + ̣
- Precomposed vs. Decomposed
Ṝ vs. Ṛ + ̄ vs. R + ̣ + ̄
- Singletons
Å vs. Å, 樂 vs. 樂 vs. 樂
- Hangul syllables vs. Jamos
가 vs. ᄀ + ᅡ
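All four cases can be verified with Python's unicodedata (an illustration, not part of the slides):

```python
import unicodedata

angstrom = '\u212b'   # Å ANGSTROM SIGN (a singleton)
a_ring   = '\u00c5'   # Å LATIN CAPITAL LETTER A WITH RING ABOVE
combined = 'A\u030a'  # A + COMBINING RING ABOVE

# All three are canonically equivalent; NFC picks one representative:
assert unicodedata.normalize('NFC', angstrom) == a_ring
assert unicodedata.normalize('NFC', combined) == a_ring

# Hangul syllables and conjoining jamo sequences are equivalent, too:
assert unicodedata.normalize('NFD', '\uac00') == '\u1100\u1161'  # 가 → ᄀ + ᅡ
```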
Combining Characters
- Many scripts use base characters and diacritics
- Diacritics and similar characters are called Combining Characters
- Often, all combinations may appear (e.g. Arabic, Hebrew)
- Often, combinations limited per language,
but open-ended for the script (Latin)
- Separate encoding seems advantageous
(only one available in Unicode 1.0)
- Order meaningful if e.g. all diacritics above
- Order arbitrary if e.g. above and below
- ⇒Canonical Ordering
Canonical Combining Class
- A character property, integer between 0 and 254
- All base characters have ccc=0
- Only combining marks (but not all of them) have ccc≠0
- Ccc is the same for combining characters that:
- Go on the same side of the base character
- Therefore need to maintain their relative order
- UnicodeData
file (4th field, 0 when no entry)
(see also DerivedCombiningClass.txt)
Canonical Ordering
- For any two consecutive characters ab
reorder as ba if ccc(a) >
ccc(b) > 0
until you find no more such reorderings
- Very similar to bubblesort (but not the same!)
- Local operation: most characters have ccc=0, and don't move
- Example (ş̤̂̏, ccc in parentheses, characters being exchanged at each step):

original | s | ̂ (230) | ̧ (202) | ̏ (230) | ̤ (220)
after first step | s | ̧ (202) | ̂ (230) | ̏ (230) | ̤ (220)
after second step | s | ̧ (202) | ̂ (230) | ̤ (220) | ̏ (230)
final | s | ̧ (202) | ̤ (220) | ̂ (230) | ̏ (230)
4 diacritics, 24 permutations, of which 12 are equivalent
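A straightforward rendition of this reordering in Python, using unicodedata.combining() for the ccc (a sketch, not the exact UAX #15 wording):

```python
import unicodedata

def canonical_order(s):
    # Bubble-style canonical reordering: swap adjacent characters when
    # both are combining marks (ccc > 0) and the left one outranks the
    # right one. Characters with ccc = 0 never move.
    chars = list(s)
    swapped = True
    while swapped:
        swapped = False
        for i in range(len(chars) - 1):
            a = unicodedata.combining(chars[i])
            b = unicodedata.combining(chars[i + 1])
            if a > b > 0:
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
                swapped = True
    return ''.join(chars)
```

On the example above (s + circumflex + cedilla + double grave + diaeresis below) it produces the order 202, 220, 230, 230, matching what NFD's canonical reordering yields.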
Precomposed Characters
- Legacy encodings had characters including diacritics
- Limited transcoding technology
- Limited display technology
- European national pride and voting power
- Unicode - ISO 10646 merger
- Precomposed characters in Unicode 1.1
- Equivalences between precomposed characters and combining sequences
defined
- ⇒Normalization Form D
Hangul Syllables
- Korean is written in square syllables (Hangul, 한글)
- Syllables consist of
- Consonant(s) + vowel(s) (가)
- Consonant(s) + vowel(s) + consonant(s) (민)
- Individual pieces (Conjoining Jamo) are also encoded:
- Leading consonant(s) (ᄍ)
- Vowel(s) (ᅱ)
- Trailing consonant(s) (ᆹ)
- For all L, V, and T in modern use, every LV and LVT syllable is encoded
(over 10'000)
- Syllables with historical L (ᅑ), V (ᆑ), and T (ᇷ) must use Jamo
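Hangul syllable decomposition is purely arithmetic; a sketch following the constants in the Unicode Standard, Section 3.12:

```python
# Base codepoints and counts from the Unicode Standard, Section 3.12
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28

def decompose_hangul(ch):
    s = ord(ch) - S_BASE
    if not 0 <= s < L_COUNT * V_COUNT * T_COUNT:
        return ch  # not a precomposed Hangul syllable
    l = chr(L_BASE + s // (V_COUNT * T_COUNT))        # leading consonant
    v = chr(V_BASE + (s % (V_COUNT * T_COUNT)) // T_COUNT)  # vowel
    t = s % T_COUNT                                   # trailing consonant index
    return l + v + (chr(T_BASE + t) if t else '')
```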
Normalization Form D (NFD)
(D stands for Decomposed)
NFD is the result of applying Canonical Decomposition:
- Apply Decomposition Mappings
- Code charts, using ≡, or UnicodeData file (6th field, when no <...> tag)
- Hangul syllable decomposition (algorithmic)
until you find no more decompositions
- Canonical Reordering
(Dis)Advantages of NFD
Advantages of NFD:
- Flat and straightforward
- Fast operations
Disadvantages of NFD:
- Most data is not in NFD
- Difficult to force everybody to use it
Usage Layers
- topmost: User application: Meaningful characters
- higher: DNS/HTTP/SMTP/FTP: Bytes or characters
- lower: TCP/IP: Byte streams
- bottom: Electric current, light, electromagnetic waves
Different Mindsets
- Multilingual Editor:
- Character boundaries, complex display
- Preferred internal normalization
- Normalization not a major burden
- Application Protocols (e.g. IETF):
- Textual content just transported as bytes+charset info
- Identifiers ASCII only, comparison bytewise or ASCII-case
insensitive
- Internationalized Identifiers:
- E.g. XML element names (W3C)
- Byte or codepoint comparison
- Normalization close to actual practice desired
- ⇒Normalization Form C
Normalization Form C (NFC)
(C stands for Composed)
Definition:
- Canonical Decomposition (see NFD)
- Canonical Recomposition
Advantages of NFC:
- Very close to usage on Web
(design goal, in some ways actually too close)
- More compact
Disadvantages of NFC:
The phrase "in some ways actually too close" refers to the
fact that NFC is so close to actual usage on the Web that there isn't too much
motivation to actively normalize content. This is not exactly the purpose of a
normalization form, as it means that there is still some content out there that
is not normalized.
Canonical Recomposition
- Works repeatedly pairwise
- Starts with base character, ends before next base character
- When combination exists, combine and continue with next combining
character
- Skip additional characters of same combining class after combining
failure
- Data from code charts, using ≡, or UnicodeData.txt (field 6, when no <...> tag)
- Exclude Composition Exclusions
- Hangul syllable composition (algorithmic)
Composition Exclusions
- Singletons (nothing to recombine)
- Script specific:
(precomposed rarely used)
- Indic (क़ख़ग़ज़ड़ढ़फ़य़ড়ঢ়য়ਖ਼ਗ਼ਜ਼ੜਫ਼ଡ଼ଢ଼)
- Tibetan
- Hebrew (שׁשׂשּׁשּׂאַאָאּבּגּדּהּוּ...)
- Non-starter decompositions
(cases with oddball ccc)
- Post Composition Exclusions
- Data from CompositionExclusions.txt
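The effect of composition exclusions is easy to observe with Python's unicodedata (an illustration, not part of the slides):

```python
import unicodedata

# क़ (U+0958) is a composition exclusion: NFD takes it apart,
# and NFC deliberately does not put it back together.
decomposed = unicodedata.normalize('NFD', '\u0958')
assert decomposed == '\u0915\u093c'  # क + combining nukta
assert unicodedata.normalize('NFC', decomposed) == decomposed

# Singletons such as Å (U+212B ANGSTROM SIGN) also never reappear:
assert unicodedata.normalize('NFC', '\u212b') == '\u00c5'
```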
Stability
- Normalized text should stay so in future Unicode versions
- Okay for NFD
- Needs some work for NFC
- Problem is new precomposed characters
old: q + ̂, new: q̂ (precomposed)
- New (post composition) precomposed characters:
- Need to be excluded from composition (NFC)
(unless both parts are new)
- Discourages encoding in the first place
- Careful
stability guarantees
- Effective stop for encoding precomposed characters
- Strengthens compromise between Unicode and ISO
- Named character sequences
- Danger to interpret stability guarantee strictly formally
(anything can be added if it's not defined to be equivalent)
Normalization Corrigenda
(4 out of 9 for all of Unicode)
Kompatibility Equivalence
From Unicode
Standard Annex (UAX) #15, Section 1.1:
Compatibility equivalence is a weaker equivalence between characters
or sequences of characters that represent the same abstract character, but
may have a different visual appearance or behavior.
(emphasis added by presenter)
- May not preserve semantics: 2³ = 8, but 2³ ≈ 23
- Is not necessarily very consistent (e.g. ④≈4, but ➃≁4)
- Use different characters when semantically different (e.g. Mathematical
notation)
- Use markup/styling for stylistic differences (e.g.
emphasis)
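Both quirks can be checked directly with Python's unicodedata (an illustration, not part of the slides):

```python
import unicodedata

# Kompatibility mapping may lose meaning: 2³ becomes the string '23'
assert unicodedata.normalize('NFKC', '2\u00b3') == '23'

# ...and is not fully consistent: ④ maps to 4, but ➃ stays as it is
assert unicodedata.normalize('NFKC', '\u2463') == '4'   # CIRCLED DIGIT FOUR
assert unicodedata.normalize('NFKC', '\u2783') == '\u2783'  # dingbat circled four
```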
All Things Kompatibility
A <tag> in UnicodeData.txt indicates and classifies kompatibility:

- <noBreak>: Difference in breaking properties only
- <super>: Superscripts
- <sub>: Subscripts
- <fraction>: Fractions
- <circle>: Circled letters
- <square>: Square blocks
- <font>: HEBREW LETTER WIDE; MATHEMATICAL/ARABIC MATHEMATICAL
- <wide>: Fullwidth (double width) variants
- <narrow>: Halfwidth variants
- <small>: Small variants
- <vertical>: Vertical variants
- <isolated>, <final>, <initial>, <medial>: Arabic contextual variants and ligatures
- <compat>: General compatibility (e.g. spacing/nonspacing), ligatures, double characters (e.g. double prime), roman numerals, script variants (long s, Greek theta), space width variants (EM space,...), parenthesized, with full stop/comma, IDEOGRAPHIC TELEGRAPH; (CJK/KANGXI) radicals, HANGUL LETTER variants
Kompatibility Decomposition
- Kompatibility decompositions as such proceed in one go (e.g. FDFA ≈
0635 0644 0649 0020 0627 0644 0644 0647 0020 0639 0644 064A 0647 0020 0648
0633 0644 0645)
- However, kompatibility and canonical decomposition need to be applied
repeatedly:
- First kompatibility decomposition, then canonical decomposition:
U+01C4, LATIN CAPITAL LETTER DZ WITH CARON ≈<compat> 0044 017D ≡> 0044 005A 030C
- First canonical decomposition, then kompatibility decomposition:
U+0385, GREEK DIALYTIKA TONOS ≡> 00A8 0301 ≈<compat> 0020 0308 0301
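NFKD applies both kinds of decomposition exhaustively, so both orderings can be verified with Python's unicodedata (an illustration, not part of the slides):

```python
import unicodedata

# U+01C4 DŽ: kompatibility step (D + Ž), then canonical step (Ž → Z + caron)
assert unicodedata.normalize('NFKD', '\u01c4') == 'DZ\u030c'

# U+0385: canonical step (¨ + acute), then kompatibility step (¨ → space + ̈)
assert unicodedata.normalize('NFKD', '\u0385') == ' \u0308\u0301'
```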
Normalization Form KD (NFKD)
Definition:
- Kompatibility Decomposition
- Canonical Reordering
Normalization Form KC (NFKC)
Definition:
- Kompatibility Decomposition
- Canonical Reordering
- Canonical Recomposition
Take Care with Normalization Forms
- Normalization forms interact with case conversion
- Normalization forms interact with string operations
e.g. concatenation
- Normalization forms interact with markup
e.g. ≮ ≡ < + ̸
- Normalization forms interact with escaping
- Normalization interacts with Unicode Versions
(but greatest care has been taken to limit this)
Similar concerns may apply to other kinds of mappings
Because of the wide range of phenomena encoded by Unicode,
there is always the chance that different phenomena interact in strange and
unpredictable ways. Before assuming no interaction, carefully check. If you
don't find any interactions, don't assume that will be necessarily so in all
future versions.
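The concatenation interaction, for instance, is easy to demonstrate with Python's unicodedata (an illustration, not part of the slides):

```python
import unicodedata

a, b = 'e', '\u0301'   # 'e' and COMBINING ACUTE ACCENT, each already in NFC
assert unicodedata.normalize('NFC', a) == a
assert unicodedata.normalize('NFC', b) == b

# Their concatenation is NOT in NFC: normalizing composes it to é
concat = a + b
assert unicodedata.normalize('NFC', concat) == '\u00e9'
assert concat != unicodedata.normalize('NFC', concat)
```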
Limits of Normalization
- Designed mainly for Latin
- Koreans don't like it for historical Korean
- Macintosh file system NFD (roughly: Hangul NFC, rest NFD)
- Canonical equivalence of compatibility Han Ideographs
(use variant sequences)
- Diacritic reordering may be too strict or too loose
- Stability guarantees may be interpreted purely formally
Case Study: From IDNA2003 to PRECIS
- IETF need for 'normalized' identifiers
- IDNA 2003:
- Case folding
- NFKC
- Wide repertoire (what's not forbidden is allowed)
- Based on Unicode 3.2
- IDNA 2008:
- Mappings (incl. case) outside spec
- NFC
- Contextual rules
- Narrow repertoire (what's not allowed is forbidden)
- Based on character properties
- PRECIS (framework):
- Allowed characters based on character properties
- Width mapping (<wide>/<narrow>)
- Additional mappings
- Case mapping
- Normalization (NFC or NFKC)
Case Study: Windows File System vs. IDNs
- Both are case-insensitive
- Windows File System keeps original casing for display
(needs data in two forms for efficiency)
- IDNs don't keep original casing
(on the Web, there's no IBM, only ibm)
Strategy
- To check equivalence, use (one of) the corresponding mappings and compare
for equality
- Mapping can appear at system boundary or on actual equivalence check
- System boundary may vary
- All of the Internet/WWW
- Specific servers, clients
- Some kinds/types of data
- Check for consistency between system components
- Check for consistency between software versions
Questions and Answers
Questions may be unnormalized, but answers will be normalized!
Acknowledgments
Mark Davis for many years of collaboration (and some disagreements) on
normalization, and for proposing a wider approach to the topic of
normalization.
The IME Pad for facilitating character input.