http://www.sw.it.aoyama.ac.jp/2013/pub/RubyNorm
© 2013 Martin J. Dürst, Aoyama Gakuin University
Unicode normalization eliminates encoding variants for what is essentially the same character. We present a new efficient way to implement normalization in scripting languages such as Ruby. Our implementation takes advantage of two facts: First, that large percentages of many kinds of text do not need to be changed when normalizing, and second, that the number of different sequences that need to be normalized in a certain language or group of languages is always much lower than the number of theoretically possible sequences. We exploit each of these facts with built-in functionality of Ruby. Specially-crafted regular expressions capture sequences of characters in possible need of normalization, and a hash structure is used to look up their normalization. The hash structure starts empty and is updated by a default block of code whenever a new sequence of characters needing normalization is identified. As a result, no actual normalization code is executed once the hash structure incorporates the character sequences used in a particular language.
Performance tests with many different implementations and a wide range of languages (German, Vietnamese, Korean,...) with different normalization needs show that our method is one or more orders of magnitude faster than straightforward implementations in scripting languages, and in some cases close to the speed of implementations in compiled languages. Because the code is written in pure Ruby, it is very easily portable and configurable, and similar techniques may be useful for other internationalization operations such as case conversions.
The most up-to-date version of the slides with links is available at http://www.sw.it.aoyama.ac.jp/2013/pub/RubyNorm.
These slides are created in HTML, for projection with Opera (≤12.16 Windows/Mac/Linux, use F11). Texts in gray, like this one, are comments/notes which do not appear on the slides. Please note that depending on the browser and OS you use, some of the characters and character combinations may not display as intended, but e.g. as empty boxes, question marks, or as separate marks rather than composed.
Fundamental for Unicode: Are two strings the same or not?
Original | NFD | NFC | NFKD | NFKC |
---|---|---|---|---|
R + ̄ + ̣; Ṛ + ̄ | R + ̣ + ̄ | Ṝ | R + ̣ + ̄ | Ṝ |
Å | A + ̊ | Å | A + ̊ | Å |
晴 | 晴 | 晴 | 晴 | 晴 |
가; ᄀ + ᅡ | ᄀ + ᅡ | 가 | ᄀ + ᅡ | 가 |
⁴₄④¾⑷⒋ | ⁴₄④¾⑷⒋ | ⁴₄④¾⑷⒋ | 4443⁄4(4)4. | 4443⁄4(4)4. |
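The table's entries can be checked with Ruby's built-in String#unicode_normalize (in the standard library since Ruby 2.2); a few examples:

```ruby
# Reproducing rows of the table above with Ruby's built-in normalizer.
s = "A\u030A"                        # "Å" as A + combining ring above (U+030A)
s.unicode_normalize(:nfc)            # => "\u00C5", the single precomposed Å
s.unicode_normalize(:nfd)            # => "A\u030A", unchanged (already decomposed)
"\u2074".unicode_normalize(:nfkc)    # => "4", superscript four loses its formatting
```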
A straightforward implementation works on characters one by one
⇒ Lots of calculations and lookups
⇒ Slow in scripting language
Scripting languages work in two layers:
An interpretation engine for commands
Powerful primitives
Scripting languages are inherently slow because they are interpreted. Very simply put, every line of the program is analysed again every time it is executed. The implementation of a scripting language usually consists of two parts. One is the interpretation engine for the scripting language commands. The other is the implementation of the powerful primitives.
The way to speed up programs in a scripting language is to use the powerful scripting primitives wherever possible. Let's look at an example.
Problem: Replace all occurrences of "&lt;" in a string by "<"

Low-level solution: Scan the string for '&', check if the next three characters are "lt;", and if so replace the four characters with "<" (the remainder of the string needs to be shifted).

Solution (in a scripting language such as Ruby): Replace "&lt;" with "<":

string.gsub! "&lt;", "<"
This shows an example that we will be able to use later.
The problem is unescaping for XML/HTML. Adding "&gt;", "&amp;", "&apos;", and "&quot;" into the mix, the problem is solved by the better students in my sophomore programming class in around 100 lines of C code. Such a solution could be transferred quite mechanically to Ruby, too, but it would be very slow, apart from being very unwieldy.
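A sketch of the gsub approach extended to all five predefined XML entities; the constant and method names here are illustrative, not from the paper:

```ruby
# Unescaping the five predefined XML entities in a single pass.
XML_ENTITIES = {
  "&lt;" => "<", "&gt;" => ">", "&amp;" => "&",
  "&apos;" => "'", "&quot;" => "\"",
}

def unescape_xml(string)
  # gsub accepts a Hash as replacement: each match is looked up as a key
  string.gsub(/&(?:lt|gt|amp|apos|quot);/, XML_ENTITIES)
end

unescape_xml("a &lt; b &amp;&amp; c &gt; d")  # => "a < b && c > d"
```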
The real solution in Ruby (or any similar scripting
language) is trivial. It uses the gsub
operation, which globally
searches for a string in another string and replaces it with a third string.
Now we come to the basic idea underlying our implementation. This idea is based on an analysis of the actual data patterns occurring in text normalization. The number of combinations of characters that may be involved in normalization is potentially unlimited. Each of these character combinations has to be dealt with, and it's impossible to predict which ones will appear in the data. Any kind of optimization or shortcut seems to be futile.
However, the situation is not that bleak. Texts usually consist of only a single natural language, or at most a few natural languages. And each natural language has only a limited number of characters or character combinations affected by normalization. Even in large text corpora, the number of character combinations affected by normalization is bound to be limited by the orthographic needs of the languages involved. On the other hand, the same normalization pattern may appear repeatedly.
This is the core idea for our simple and efficient implementation of normalization.
Letters in core German affected by normalization:
Ä, ä, Ö, ö, Ü, ü
Converting German to NFC:
string.gsub! "A\u0308", "Ä"
string.gsub! "a\u0308", "ä"
...
To explain the idea with a simplistic example, let's take a look at German. Core German (not including loanwords) only has six letters that are affected by normalization. Normalization to NFC could be done simply by replacing all occurrences of "A" followed by a combining diaeresis (U+0308) with "Ä", and so on for the other umlaut characters.
Of course, this approach is not directly usable. Even for German, the string would have to be scanned six times, which is way slower than just scanning it one time, and the solution would have to be hand-tailored to different languages and language combinations.
But we can make things more automatic using more of Ruby's primitives.
Regular expressions: delimited by / ... /, with alternatives (|), repetitions (*, +), ...

string.gsub! /(P|p)(erl|ython)/, "Ruby"
Diacritics: R + ̄ + ̣; Ṛ + ̄ ; Singleton: Å; CJK Compat: 晴; Hangul: 가; Compat: ④
Hashes (class Hash):

{ "A\u0308" => "Ä",
  "a\u0308" => "ä",
  ... }
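Combining the two primitives, a single gsub pass over an alternation pattern with a lookup Hash handles the German case; this hand-written sketch (pattern, table, and method name are illustrative, not eprun's generated ones) replaces the six separate scans:

```ruby
# One scan instead of six: alternation regexp + replacement Hash.
GERMAN_NFC = {
  "A\u0308" => "\u00C4", "a\u0308" => "\u00E4",   # Ä, ä
  "O\u0308" => "\u00D6", "o\u0308" => "\u00F6",   # Ö, ö
  "U\u0308" => "\u00DC", "u\u0308" => "\u00FC",   # Ü, ü
}

def german_nfc(string)
  string.gsub(/[AaOoUu]\u0308/, GERMAN_NFC)
end

german_nfc("U\u0308ber gru\u0308n")  # => "Über grün"
```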
NFC_HASH = Hash.new do |h, run|   # create hash for NFC
  h[run] = nfc_run run            # add NFC of run to hash
end

class String
  def nfc                         # add nfc to String class
    gsub NFC_REGEXP, NFC_HASH
  end
end
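The lazy-hash technique can be exercised as a self-contained sketch by substituting the standard library's normalizer for eprun's nfc_run, and a toy German-only pattern for the generated NFC_REGEXP:

```ruby
# Lazy variant: the Hash starts empty and fills itself via its default block.
# NFC_REGEXP is a toy German-only pattern here (eprun generates a complete one),
# and unicode_normalize stands in for eprun's own nfc_run.
NFC_REGEXP = /[AaOoUu]\u0308/

NFC_HASH = Hash.new do |hash, run|
  # Runs only on the first lookup of a given sequence; the result is cached.
  hash[run] = run.unicode_normalize(:nfc)
end

def to_nfc(string)
  string.gsub(NFC_REGEXP, NFC_HASH)   # cache hits after the first occurrence
end

to_nfc("Gru\u0308n")  # => "Grün"
NFC_HASH.size         # => 1 (only "u\u0308" has been cached so far)
```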
Chosen based on performance tests: gsub with a Hash, where the Hash is updated on demand

(German text; more results at GitHub; times in milliseconds)
| eprun | unicode_utils (1.4.0) | twitter_cldr (2.4.2) |
---|---|---|---|
NFD | 12.48 | 75.19 | 3884 |
NFC | 15.60 | 191.73 | 4072 |
NFKD | 18.56 | 118.71 | 8627 |
NFKC | 22.00 | 215.75 | 10577 |
(Perl v5.14.2 built for cygwin-thread-multi-64int, in milliseconds)
| eprun | Perl (native) | Perl (pure) |
---|---|---|---|
NFD | 12.48 | 47.7 | 683.4 |
NFC | 15.60 | 30.3 | 253.3 |
NFKD | 18.56 | 48.8 | 742.4 |
NFKC | 22.00 | 36.0 | 259.0 |
Gets faster if the implementations of gsub, Regexp, and Hash get faster
Depending on the time available, we will show a demo of testing, and/or some performance demos.
ActiveSupport::Multibyte::Chars#normalize (done)

Ayumu NOJIMA for parts of implementation, performance measurement, and testing.
Cameron Dutro for interesting discussions and encouragement to publish the code.
The IME Pad for facilitating character input.
Amaya and Web technology for slide editing and display.