http://www.sw.it.aoyama.ac.jp/2018/pub/IUC42_properties/
duerst@it.aoyama.ac.jp, Aoyama Gakuin University
© 2017-8 Martin J. Dürst, Aoyama Gakuin University
Unicode defines a large number of character properties for character classification and algorithms. This talk compares the two main ways of implementing property support, and discusses their advantages and disadvantages based on actual implementation experience. This talk is suited for users dealing with Unicode properties as well as for implementers.
Examples of Unicode character properties range from general category (letter, number, symbol,...) to specialized properties for tasks such as bidirectional layout and normalization. Many programming languages, libraries, and regular expression engines provide access to these properties. We provide some insights and results based on experimental work on the Onigmo regular expression engine that is used by the programming language Ruby.
Leaving special properties such as character name aside, the two main ways of implementing property support are inversion lists and folded tries. Moving from the former to the later, we succeeded to increase the number of properties covered from 62 to 76, and the number of property values covered from 554 to 1009, all while reducing the memory necessary from 240kB to 214kB. In addition, elimination of binary search made raw property checks up to 9 times faster, and lookup of specific property values from characters up to 65 times faster.
Comparing the two methods from a conceptual viewpoint, it is interesting to observe that when arranging property values in a huge table with characters as rows and properties as columns, inversion lists look at this table one column a time, whereas folded tries look at it one row at a time. Folded tries take advantage of the fact that most property value combinations are shared by a large number of characters, whereas inversion lists take advantage of the fact that subsequent characters often share the same property value. This understanding can lead to further improvements.
The work described in this presentation was done jointly with Takumi Koyama in 2016/17.
These slides have been created in HMTL, for projection with Opera (≤12.17 Windows/Mac/Linux; use F11 to switch to projection mode). The slides have been optimized for a screen size of 1920x1080pixels, but can easily be viewed on other screens, too. Texts in gray, like this one, are comments/notes which do not appear on the slides. Please note that depending on the browser and OS you use, some rare characters or special character combinations may not display as intended, but e.g. as empty boxes, question marks, or apart rather than composed.
Regexp
'Юに코δ' =~ /\p{Hiragana}/
'Ю'
?
5.times { puts "Hello Ruby!" }
{
... }
or
do
... end
)Code is mostly green, monospace
puts 'Hello Ruby!'
Variable parts are orange
puts "some string"
Results are indicated with "⇒"
1 + 1
⇒ 2
Matching of regular expressions is indicated with ≟, ≅, or ≇
'abc'
≅ /a/
; 'def'
≇
/a/
Matching substrings are indicated with orage background
'abcba'
≅ /b/
This slide shows you the conventions we'll use throughout the talk. (Code is in green monospace font. Variable parts are in orange. Where necessary, the character encoding of a string is indicated with a blue subsccript. Results are show after a double arrow. We'll use two symbols to show when regular expressions match or don't match.)
Mainly in the following areas:
String#encode
, Ruby 1.9)String#unicode-normalize
, Ruby
2.2)String#upcase
,..., Ruby 2.4)As a Ruby committer, I'm mostly responsible for the areas listed on this slide.
Year (y) | Ruby version (VRuby) | Supported Unicode version (VUnicode) |
published around Christmas | published in Summer | |
2014 | 2.2 | 7.0.0 |
2015 | 2.3 | 8.0.0 |
2016 | 2.4 | 9.0.0 |
2017 | 2.5 | 10.0.0 |
2018 (planned) | 2.6 | 11.0.0 |
NOTE about Ruby versions and Unicode versions: The Ruby core team is very conservative (in my view too conservative) in introducing new Unicode versions as bug fixes. Update to new Unicode versions therefore only happens for new Ruby versions.
RbConfig::CONFIG["UNICODE_VERSION"]
⇒
'10.0.0'
VUnicode = y - 2007; VRuby = (y - 1992) · 0.1
VRuby = 1.5 + VUnicode · 0.1 VUnicode = VRuby · 10 - 15
Don't extrapolate too far!
For a list of properties, see Unicode Character Database, UAX #44, in particular Table 9
For some background on Unicode properties, see The Unicode Property Model, UTR #23 (caution: very dry reading)
b
s in the string 'abcba'
:
'abcba'
≅ /b/
See also Unicode Regular Expressions, UTS #18
\p{property}
\p{Digit}
(match a digit)\p{Hiragana}
(match a Hiragana)\P{property}
or
\p{^property}
Examples:
\p{^Digit}
or \P{Digit}
(match everything else
than a digit)
\p{^Hiragana}
or \P{Hiragana}
(match everything
else than a Hiragana)
\p{property=value}
'a€青ሙឈ'
≅ /\p{Age=2.0}/
(all characters defined in Unicode 2.0)'a'
, '青'
, "\u03A2"
'あ'.property('script')
⇒ Hiragana
, or'あ'.script
⇒ Hiragana
Properties | |||||
---|---|---|---|---|---|
Category | Bidi | Name | ... | ||
Cha- | 1 | ||||
rac- | 2 | ||||
ters | a | ||||
b | |||||
c | |||||
青 | |||||
... |
⇒ List them from top to bottom at the left
⇒ List them from left to right at the top
⇒ Put each property value in the corresponding cell
⇒ All properties represented in a big table
Last row in Excel (2013): 1'048'576 = 220 < 1'114'112
Properties of properties that we may be able to exploit:
Properties | |||||
---|---|---|---|---|---|
P1 | P2 | P3 | ... | ||
Cha- | 1 | T | T | F | |
rac- | 2 | T | T | T | |
ters | a | F | T | F | |
b | F | T | T | ||
c | F | T | F | ||
青 | T | F | T | ||
... |
Properties | |||||
---|---|---|---|---|---|
P1 | P2 | P3 | ... | ||
Cha- | 1 | ||||
rac- | 2 | ||||
ters | a | Z | |||
b | |||||
c | |||||
青 | X | ||||
... |
Hash
for Canonical Combining Class
class_table = { "\u0300"=>230, "\u0301"=>230, "\u0302"=>230, "\u0303"=>230, "\u0304"=>230, "\u0305"=>230, "\u0306"=>230, "\u0307"=>230, "\u0308"=>230, "\u0309"=>230, "\u030A"=>230, "\u030B"=>230,
Properties | |||||
---|---|---|---|---|---|
P1 | P2 | P3 | ... | ||
Cha- | 1 | 0 | 2 | ||
rac- | 2 | 99 | 101 | ||
ters | a | 5 | 7 | ||
b | 44 | 46 | |||
c | 7 | 9 | |||
青 | 20 | 22 | |||
... |
Properties | |||||
---|---|---|---|---|---|
P1 | P2 | P3 | ... | ||
Cha- | 1 | 1 | |||
rac- | 2 | 2 | |||
ters | 3 | 3 | |||
4 | 4 | ||||
5 | 5 | ||||
青 | |||||
... |
CJK UNIFIED IDEOGRAPH-3400
Properties | |||||
---|---|---|---|---|---|
P1 | P2 | P3 | ... | ||
Cha- | 1 | x | 7 | ||
rac- | 2 | x | 7 | ||
ters | a | y | 7 | ||
b | y | 7 | |||
c | y | 7 | |||
青 | z | 9 | |||
... |
Properties | |||||
---|---|---|---|---|---|
P1 | P2 | P3 | ... | ||
Cha- | 1 | 1 | X | Q | |
rac- | 2 | 2 | Y | P | |
ters | a | 1 | X | Q | |
b | 3 | Z | P | ||
c | 2 | Y | P | ||
青 | 3 | Z | P | ||
... |
How each exploit works:
\p{}
\p{Digit}
(match a digit)\p{Hiragana}
(match a Hiragana)\p{^Digit}
or \P{Digit}
(match everything
else than a digit)\p{^Hiragana}
or \P{Hiragana}
(match
everything else than a Hiragana)[\p{Digit}\p{Hiragana}]
(match a digit or a Hiragana)[\p{Tamil}&&\p{Digit}]
(match Tamil digits: ௦௧௨௩௪௫௬௭௮௯)In regular expressions, we can check for character properties with backslash-p. We can use this also inside character classes, where logical operations are also available.
(as of Unicode 9.0.0)
Due to Onigmo, Ruby currently has very good support for Boolean properties. But the support for enumerated properties is not up to par. As we will see soon, the reason for this lies in the details of the Onigmo implemenation.
From enc/unicode/10.0.0/name2ctype.h
/* 'Radical': Binary Property */ static const OnigCodePoint CR_Radical[] = { 3, /* number of intervals */ 0x2e80, 0x2e99, /* ⺀ ~ ⺙ */ 0x2e9b, 0x2ef3, /* ⺛ ~ ⺀ */ 0x2f00, 0x2fd5, /* ⼀ ~ ⺀ */ }; /* CR_Radical */
This is the data used for implementing the "Radical" property, which indicates whether a character is a Kanji radical or not. The values are the boundaries of intervals. Inside the boundaries, the property is true, outside, it is false.
0x2e80, 0x2e99, /* ⺀ ~ ⺙ */
U+2e80
and goes up to U+2e99
,
and so onU+2e80
U+2e99
The data values are the boundaries of the intervals. At the boundaries, the value of the property is inverted. This kind of data structure is therefore called "Inversion List".
Inversion lists work well for binary properties. But they cannot be used
directly for enumerated properties. For enumerated properties, an inversion
list has to be provided for each property value. As an example, the
Script Property needs 138 inversion lists.
/* 'Katakana': Script */ static const OnigCodePoint CR_Katakana[] = { 8, /* number of intervals */ 0x30a1, 0x30fa, /* ァ ~ ヺ */ 0x30fd, 0x30ff, /* ヽ ~ ヿ */ 0x31f0, 0x31ff, /* ㇰ ~ ㇿ */ 0x32d0, 0x32fe, /* ㋐ ~ ㋾ */ 0x3300, 0x3357, /* ㌀ ~ ㍗ */ 0xff66, 0xff6f, /* ヲ ~ ッ */ 0xff71, 0xff9d, /* ア ~ ン */ 0x1b000, 0x1b000, /* KATAKANA LETTER ARCHAIC E */ }; /* CR_Katakana */
Here's another example, the inversion list for the property value "Katakana" of the property "Script".
From regcomp.c
, function onig_is_in_code_range
:
for (low = 0, high = n; low < high; ) { x = (low + high) >> 1; if (code > data[x * 2 + 1]) low = x + 1; else high = x; } return ((low < n && code >= data[low * 2]) ? 1 : 0);
Binary search!
Execution time dependent on length of inversion list
As you can see from this code, properties are checked by using binary search on an inversion list. The execution time depends on the length of the inversion list.
enc/unicode/10.0.0/name2ctype.h
is 38'088 lines,
767'874 bytes/* 'Grapheme_Base': Derived Property */ static const OnigCodePoint CR_Grapheme_Base[] = { 791, 0x0020, 0x007e,
OnigCodePoint
is 4 bytesenc/unicode.o
(688'263 bytes on Cygwin)This slide shows the calculations for the total memory needed by the current implementation. For Unicode Version 9, there are 535 inversion lists, about 30'000 intervals. This results in a total memory usage of about 250KB.
enc/unicode/10.0.0/name2ctype.h
is 38'088 lines,
767'874 bytes
Trying to download from https://svn.ruby-lang.org/:
Display of files larger than 512 KB disallowed by configuration
403 Forbidden
Fortunately, checkout with svn (or github) works
One more way to show how large the file is.
See Feature #13240
So can we improve on the current implementation? We'll try with the following approach. First, in our big spreadsheet, we look not at columns, but rows. We find that many rows are exactly identical. We group such rows/characters into equivalence classes. In Unicode Version 9, there are 3747 equivalence classes.
We represent each property with a number of bits. Boolean propertios need only one bit. The Script property, which has 137 varues, needs 8 bits.
/* get Property Value data */ propVal = propValInfo[ctype]; /* get bit pattern for equivalence class */ return (propertyData[equivClass] /* select word */ [propVal.word] /* extract property value bit pattern */
& propVal.mask /* check for matching property value pattern */ ) == propVal.value ? 1 : 0;
Constant-time lookup!
The processing needed to check for a property value is slightly complicated, but can be done in constant time.
/* data for equivalence class of
Hiragana 「こ」 */
10000111001000000011000010100010
/* mask for Script property
*/
& 00000000111111110000000000000000
/* bit pattern for property value
Hiragana */
= 00000000001000000000000000000000
/* result is same as bit pattern
⇒「こ」 is Hiragana */
== 00000000001000000000000000000000
This is an example with actual data. We are checking whether the Hiragana character "ko" is a Hiragana or not.
In this new implementation, we need to know the equivalence class of each code point. That still needs too much memory. We need another improvement.
By using a two-step lookup for equivalance classes, we can save a lot of memory.
As a result, we happen to get by with 213KB of data.
This is the setup for our performance experiments.
As you can see, in the current implementation, processing gets slower if the number of intervals in the inversion list increases. Consistent with binary lookup, the increase is proportional to the logarithm of the number of intervals. In the new implementation, for every property value, we achieve about the same low time as for the property value with the lowest time in the current implemenation.
\p
checks for one specific value'Юに코δ' =~ /\p{Hiragana}/
'Ю '
?See Feature#13241
With regular expressions, you can check characters for specific property values. However, it's not possible to ask what property value for a property that a character has. With an implementation based on eqivalence classes, providing this functionality becomes easy.
Old (current) | New | ||
---|---|---|---|
Properties Supported | Boolean Properties | 55 | 56 |
Enumerated Properties | 7 | 20 | |
Property Values | 554 | 1009 | |
Memory Usage (bytes) | 252'284 | 213'908 | |
Lookup Speedup | depending on list length | up to 9 times faster | |
Discrimination Speed | very slow | very fast |
This is a comparison of the old and the new implementation.
Regexp
'Юに코δ' =~ /\p{Hiragana}/
'Ю'
?Comming back to the slide from Ruby Kaigi 2016, we see that most of our predictions were realized.
The new implementation also has some problems, but we are continuing research.
I want to deeply thank the many people who have contributed to this research and this preparation.
For more questions and comments, please contact me
(mailto:duerst@it.aoyama.ac.jp)
The latest version of this presentation is available at: