http://www.sw.it.aoyama.ac.jp/2013/pub/RubyNorm
© 2013 Martin J. Dürst, Aoyama Gakuin University
Unicode normalization eliminates encoding variants for what is essentially the same character. We present a new efficient way to implement normalization in scripting languages such as Ruby. Our implementation takes advantage of two facts: First, that large percentages of many kinds of text do not need to be changed when normalizing, and second, that the number of different sequences that need to be normalized in a certain language or group of languages is always much lower than the number of theoretically possible sequences. We exploit each of these facts with built-in functionality of Ruby. Specially-crafted regular expressions capture sequences of characters in possible need of normalization, and a hash structure is used to look up their normalization. The hash structure starts empty and is updated by a default block of code whenever a new sequence of characters needing normalization is identified. As a result, no actual normalization code is executed once the hash structure incorporates the character sequences used in a particular language.
Performance tests with many different implementations and a wide range of languages (German, Vietnamese, Korean,...) with different normalization needs show that our method is one or more orders of magnitude faster than straightforward implementations in scripting languages, and in some cases close to the speed of implementations in compiled languages. Because the code is written in pure Ruby, it is very easily portable and configurable, and similar techniques may be useful for other internationalization operations such as case conversions.
The most up-to-date version of the slides with links is available at http://www.sw.it.aoyama.ac.jp/2013/pub/RubyNorm.
These slides are created in HTML, for projection with Opera (≤12.16 Windows/Mac/Linux, use F11). Texts in gray, like this one, are comments/notes which do not appear on the slides. Please note that depending on the browser and OS you use, some of the characters and character combinations may not display as intended, but e.g. as empty boxes, question marks, or as separate marks rather than composed.
Fundamental for Unicode: Are two strings the same or not?
Original | NFD | NFC | NFKD | NFKC |
---|---|---|---|---|
R + ̄ + ̣; Ṛ + ̄ | R + ̣ + ̄ | Ṝ | R + ̣ + ̄ | Ṝ |
Å | A + ̊ | Å | A + ̊ | Å |
晴 | 晴 | 晴 | 晴 | 晴 |
가; ᄀ + ᅡ | ᄀ + ᅡ | 가 | ᄀ + ᅡ | 가 |
⁴₄④¾⑷⒋ | ⁴₄④¾⑷⒋ | ⁴₄④¾⑷⒋ | 4443⁄4(4)4. | 4443⁄4(4)4. |
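The table's entries can be checked with Ruby's built-in String#unicode_normalize (in the standard library since Ruby 2.2); a few examples:

```ruby
# Reproducing rows of the table above with Ruby's built-in normalizer.
s = "A\u030A"                        # "Å" as A + combining ring above (U+030A)
s.unicode_normalize(:nfc)            # => "\u00C5", the single precomposed Å
s.unicode_normalize(:nfd)            # => "A\u030A", unchanged (already decomposed)
"\u2074".unicode_normalize(:nfkc)    # => "4", superscript four loses its formatting
```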
A straightforward implementation works on characters one by one
⇒ Lots of calculations and lookups
⇒ Slow in scripting language
Scripting languages work in two layers:
An interpretation engine for commands
Powerful primitives
Scripting languages are inherently slow because they are interpreted. Very simply put, every line of the program is analysed again every time it is executed. The implementation of a scripting language usually consists of two parts. One is the interpretation engine for the scripting language commands. The other is the implementation of the powerful primitives.
The way to speed up programs in a scripting language is to use the powerful scripting primitives wherever possible. Let's look at an example.
Problem: Replace all occurrences of "&lt;" in a string by "<"

Low-level solution: Scan the string for '&', check if the next three characters are "lt;", and if so replace the four characters with "<" (the remainder of the string needs to be shifted).

Solution (in a scripting language such as Ruby): Replace "&lt;" with "<":

string.gsub! "&lt;", "<"
This shows an example that we will be able to use later.
The problem is unescaping for XML/HTML. Adding "&gt;", "&amp;", "&apos;", and "&quot;" into the mix, the problem is solved by the better students in my sophomore programming class in around 100 lines of C code. Such a solution could be transferred quite mechanically to Ruby, too, but it would be very slow, apart from being very unwieldy.
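A sketch of the gsub approach extended to all five predefined XML entities; the constant and method names here are illustrative, not from the paper:

```ruby
# Unescaping the five predefined XML entities in a single pass.
XML_ENTITIES = {
  "&lt;" => "<", "&gt;" => ">", "&amp;" => "&",
  "&apos;" => "'", "&quot;" => "\"",
}

def unescape_xml(string)
  # gsub accepts a Hash as replacement: each match is looked up as a key
  string.gsub(/&(?:lt|gt|amp|apos|quot);/, XML_ENTITIES)
end

unescape_xml("a &lt; b &amp;&amp; c &gt; d")  # => "a < b && c > d"
```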
The real solution in Ruby (or any similar scripting
language) is trivial. It uses the gsub
operation, which globally
searches for a string in another string and replaces it with a third string.
Now we come to the basic idea underlying our implementation. This idea is based on an analysis of the actual data patterns occurring in text normalization. The number of combinations of characters that may be involved in normalization is potentially unlimited. Each of these character combinations has to be dealt with, and it's impossible to predict which ones will appear in the data. Any kind of optimization or shortcut seems to be futile.
However, the situation is not that bleak. Texts usually consist of only a single natural language, or at most a few natural languages. And each natural language has only a limited number of characters or character combinations affected by normalization. Even in large text corpora, the number of character combinations affected by normalization is bound to be limited by the orthographic needs of the languages involved. On the other hand, the same normalization pattern may appear repeatedly.
This is the core idea for our simple and efficient implementation of normalization.
Letters in core German affected by normalization:
Ä, ä, Ö, ö, Ü, ü
Converting German to NFC:
string.gsub! "A\u0308", "Ä"
string.gsub! "a\u0308", "ä"
...
To explain the idea with a simplistic example, let's take a look at German. Core German (not including loanwords) only has six letters that are affected by normalization. Normalization to NFC could be done simply by replacing all occurrences of "A" followed by a combining diaeresis (U+0308) with "Ä", and so on for the other umlaut characters.
Of course, this approach is not directly usable. Even for German, the string would have to be scanned six times, which is way slower than just scanning it one time, and the solution would have to be hand-tailored to different languages and language combinations.
But we can make things more automatic using more of Ruby's primitives.
Regular expressions: delimited by / ... /, with alternatives (|), repetitions (*, +), ...

string.gsub! /(P|p)(erl|ython)/, "Ruby"
Diacritics: R + ̄ + ̣; Ṛ + ̄ ; Singleton: Å; CJK Compat: 晴; Hangul: 가; Compat: ④
Hashes (class Hash):

{ "A\u0308" => "Ä",
  "a\u0308" => "ä",
  ... }
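Combining the two primitives, a single gsub pass over an alternation pattern with a lookup Hash handles the German case; this hand-written sketch (pattern, table, and method name are illustrative, not eprun's generated ones) replaces the six separate scans:

```ruby
# One scan instead of six: alternation regexp + replacement Hash.
GERMAN_NFC = {
  "A\u0308" => "\u00C4", "a\u0308" => "\u00E4",   # Ä, ä
  "O\u0308" => "\u00D6", "o\u0308" => "\u00F6",   # Ö, ö
  "U\u0308" => "\u00DC", "u\u0308" => "\u00FC",   # Ü, ü
}

def german_nfc(string)
  string.gsub(/[AaOoUu]\u0308/, GERMAN_NFC)
end

german_nfc("U\u0308ber gru\u0308n")  # => "Über grün"
```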
NFC_HASH = Hash.new do |h, run|   # create hash for NFC
  h[run] = nfc_run run            # add NFC of run to hash
end

class String
  def nfc                         # add nfc to String class
    gsub NFC_REGEXP, NFC_HASH
  end
end
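The lazy-hash technique can be exercised as a self-contained sketch by substituting the standard library's normalizer for eprun's nfc_run, and a toy German-only pattern for the generated NFC_REGEXP:

```ruby
# Lazy variant: the Hash starts empty and fills itself via its default block.
# NFC_REGEXP is a toy German-only pattern here (eprun generates a complete one),
# and unicode_normalize stands in for eprun's own nfc_run.
NFC_REGEXP = /[AaOoUu]\u0308/

NFC_HASH = Hash.new do |hash, run|
  # Runs only on the first lookup of a given sequence; the result is cached.
  hash[run] = run.unicode_normalize(:nfc)
end

def to_nfc(string)
  string.gsub(NFC_REGEXP, NFC_HASH)   # cache hits after the first occurrence
end

to_nfc("Gru\u0308n")  # => "Grün"
NFC_HASH.size         # => 1 (only "u\u0308" has been cached so far)
```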
Chosen based on performance tests: gsub with a Hash, where the Hash is updated on demand

(German text; more results at GitHub; times in milliseconds)
| eprun | unicode_utils (1.4.0) | twitter_cldr (2.4.2) |
---|---|---|---|
NFD | 12.48 | 75.19 | 3884 |
NFC | 15.60 | 191.73 | 4072 |
NFKD | 18.56 | 118.71 | 8627 |
NFKC | 22.00 | 215.75 | 10577 |
(Perl v5.14.2 built for cygwin-thread-multi-64int, in milliseconds)
| eprun | Perl (native) | Perl (pure) |
---|---|---|---|
NFD | 12.48 | 47.7 | 683.4 |
NFC | 15.60 | 30.3 | 253.3 |
NFKD | 18.56 | 48.8 | 742.4 |
NFKC | 22.00 | 36.0 | 259.0 |
Gets faster if the implementations of gsub, Regexp, and Hash get faster
Depending on the time available, we will show a demo of testing, and/or some performance demos.
ActiveSupport::Multibyte::Chars#normalize (done)

Ayumu NOJIMA for parts of implementation, performance measurement, and testing.
Cameron Dutro for interesting discussions and encouragement to publish the code.
The IME Pad for facilitating character input.
Amaya and Web technology for slide editing and display.