Slicing and Dicing Unicode Properties

Abstract

Unicode defines a large number of character properties for character classification and algorithms. This talk compares the two main ways of implementing property support, and discusses their advantages and disadvantages based on actual implementation experience. This talk is suited for users dealing with Unicode properties as well as for implementers.

Examples of Unicode character properties range from general category (letter, number, symbol,...) to specialized properties for tasks such as bidirectional layout and normalization. Many programming languages, libraries, and regular expression engines provide access to these properties. We provide some insights and results based on experimental work on the Onigmo regular expression engine that is used by the programming language Ruby.

Leaving special properties such as character name aside, the two main ways of implementing property support are inversion lists and folded tries. Moving from the former to the later, we succeeded to increase the number of properties covered from 62 to 76, and the number of property values covered from 554 to 1009, all while reducing the memory necessary from 240kB to 214kB. In addition, elimination of binary search made raw property checks up to 9 times faster, and lookup of specific property values from characters up to 65 times faster.

Comparing the two methods from a conceptual viewpoint, it is interesting to observe that when arranging property values in a huge table with characters as rows and properties as columns, inversion lists look at this table one column a time, whereas folded tries look at it one row at a time. Folded tries take advantage of the fact that most property value combinations are shared by a large number of characters, whereas inversion lists take advantage of the fact that subsequent characters often share the same property value. This understanding can lead to further improvements.

The work described in this presentation was done jointly with Takumi Koyama in 2016/17.

For Best Viewing

These slides have been created in HMTL, for projection with Opera (≤12.17 Windows/Mac/Linux; use F11 to switch to projection mode). The slides have been optimized for a screen size of 1920x1080pixels, but can easily be viewed on other screens, too. Texts in gray, like this one, are comments/notes which do not appear on the slides. Please note that depending on the browser and OS you use, some rare characters or special character combinations may not display as intended, but e.g. as empty boxes, question marks, or apart rather than composed.

Introduction

Motivation:
Slide from Ruby Kaigi 2016

Outline

Audience Self-Intro

Speaker Self-Intro

Ruby

Basic Ruby

Conventions Used

Matching of regular expressions is indicated with ≟, ≅, or ≇
'abc' ≅ /a/; 'def' ≇ /a/

This slide shows you the conventions we'll use throughout the talk. (Code is in green monospace font. Variable parts are in orange. Where necessary, the character encoding of a string is indicated with a blue subsccript. Results are show after a double arrow. We'll use two symbols to show when regular expressions match or don't match.)

Our Contributions to Ruby

Ruby Versions and Unicode Versions

NOTE about Ruby versions and Unicode versions: The Ruby core team is very conservative (in my view too conservative) in introducing new Unicode versions as bug fixes. Update to new Unicode versions therefore only happens for new Ruby versions.

Georgian Mtavruli

Outline

Unicode is ...

Examples of Properties and Algorithms

Properties and Regular Expressions

Properties in Regular Expressions

How to Think about Properties

How to Organize Property Data

How Big is this Table?

Any Better Ideas?

Exploiting Property Values

Exploiting Sparsity

Exploiting Sparsity: Example 1

Exploiting Sparsity: Example 2

Exploiting Dependence

Exploiting Regularity

Exploiting Continuity

Exploiting Equivalence

Any Better Ideas?

Outline

Improving Unicode Property Support in Ruby

Importance of Unicode Properties in Ruby

Ruby Regular Expression Engine

Properties in Regular Expressions

In regular expressions, we can check for character properties with backslash-p. We can use this also inside character classes, where logical operations are also available.

Currently Supported Properties

Due to Onigmo, Ruby currently has very good support for Boolean properties. But the support for enumerated properties is not up to par. As we will see soon, the reason for this lies in the details of the Onigmo implemenation.

Current Implementation: Data for Boolean Property

This is the data used for implementing the "Radical" property, which indicates whether a character is a Kanji radical or not. The values are the boundaries of intervals. Inside the boundaries, the property is true, outside, it is false.

Current Implementation: Inversion Lists

The data values are the boundaries of the intervals. At the boundaries, the value of the property is inverted. This kind of data structure is therefore called "Inversion List".

Current Implementation: Enumerated Properties

	Properties
P1	P2	P3	...
Cha-	1	T	T	F
rac-	2	T	T	T
ters	a	F	T	F
b	F	T	T
c	F	T	F
青	T	F	T
...

	Properties
P1	P2	P3	...
Cha-	1		0	2
rac-	2		99	101
ters	a		5	7
b		44	46
c		7	9
青		20	22
...

	Properties
P1	P2	P3	...
Cha-	1		x	7
rac-	2		x	7
ters	a		y	7
b		y	7
c		y	7
青		z	9
...

	Properties
P1	P2	P3	...
Cha-	1	1	X	Q
rac-	2	2	Y	P
ters	a	1	X	Q
b	3	Z	P
c	2	Y	P
青	3	Z	P
...

Inversion lists work well for binary properties. But they cannot be used directly for enumerated properties. For enumerated properties, an inversion list has to be provided for each property value. As an example, the Script Property needs 138 inversion lists.

Additional Example: Data for Katakana

Here's another example, the inversion list for the property value "Katakana" of the property "Script".

Current Implementation: Property Check

As you can see from this code, properties are checked by using binary search on an inversion list. The execution time depends on the length of the inversion list.

Current Implementation: Data Size

This slide shows the calculations for the total memory needed by the current implementation. For Unicode Version 9, there are 535 inversion lists, about 30'000 intervals. This results in a total memory usage of about 250KB.

File Size

An Exception Has Occurred

HTTP Response Status

New Approach: Equivalence Classes

So can we improve on the current implementation? We'll try with the following approach. First, in our big spreadsheet, we look not at columns, but rows. We find that many rows are exactly identical. We group such rows/characters into equivalence classes. In Unicode Version 9, there are 3747 equivalence classes.

New Approach: Property Representation

We represent each property with a number of bits. Boolean propertios need only one bit. The Script property, which has 137 varues, needs 8 bits.

New Approach: Property Check

The processing needed to check for a property value is slightly complicated, but can be done in constant time.

Property Check Example

    /* data for equivalence class of
Hiragana 「こ」 */

    10000111001000000011000010100010

    /* bit pattern for property value
Hiragana */

=   00000000001000000000000000000000

    /* result is same as bit pattern
⇒「こ」 is Hiragana */

==  00000000001000000000000000000000

This is an example with actual data. We are checking whether the Hiragana character "ko" is a Hiragana or not.

New Approach: Data Size (first try)

In this new implementation, we need to know the equivalence class of each code point. That still needs too much memory. We need another improvement.

New Approach: Two-Step Equivalence Class Lookup

By using a two-step lookup for equivalance classes, we can save a lot of memory.

New Approach: Data Size (second try)

Performance Experiments

Processing Speedsparsity

As you can see, in the current implementation, processing gets slower if the number of intervals in the inversion list increases. Consistent with binary lookup, the increase is proportional to the logarithm of the number of intervals. In the new implementation, for every property value, we achieve about the same low time as for the property value with the lowest time in the current implemenation.

Property Discrimination

With regular expressions, you can check characters for specific property values. However, it's not possible to ask what property value for a property that a character has. With an implementation based on eqivalence classes, providing this functionality becomes easy.

Old versus New

	Old (current)	New
Properties Supported	Boolean Properties	55	56
Enumerated Properties	7	20
Property Values	554	1009
Memory Usage (bytes)	252'284	213'908
Lookup Speedup	depending on list length	up to 9 times faster
Discrimination Speed	very slow	very fast

Slide from Ruby Kaigi 2016

Comming back to the slide from Ruby Kaigi 2016, we see that most of our predictions were realized.

Future Work

Acknowledgments

I want to deeply thank the many people who have contributed to this research and this preparation.

Year (`y`)	Ruby version (`V`_Ruby)	Supported Unicode version (`V`_Unicode)
	published around Christmas	published in Summer
2014	2.2	7.0.0
2015	2.3	8.0.0
2016	2.4	9.0.0
2017	2.5	10.0.0
2018 (planned)	2.6	11.0.0

Internationalization and Unicode Conference 42 Santa Clara, CA, U.S.A., September 11, 2018

Martin J. DÜRST

Abstract

For Best Viewing

Introduction

Motivation: Slide from Ruby Kaigi 2016

Outline

Audience Self-Intro

Audience Self-Intro

Audience Self-Intro

Audience Self-Intro

Audience Self-Intro

Speaker Self-Intro

Ruby

Basic Ruby

Conventions Used

Our Contributions to Ruby

Ruby Versions and Unicode Versions

Georgian Mtavruli

Outline

Unicode is ...

Unicode is ...

Unicode is ...

Unicode is ...

Examples of Properties and Algorithms

Properties and Regular Expressions

Properties in Regular Expressions

How to Think about Properties

How to Organize Property Data

How Big is this Table?

How Big is this Table?

Any Better Ideas?

Exploiting Property Values

Exploiting Sparsity

Exploiting Sparsity: Example 1

Exploiting Sparsity: Example 2

Exploiting Dependence

Exploiting Regularity

Exploiting Continuity

Exploiting Equivalence

Any Better Ideas?

Outline

Improving Unicode Property Support in Ruby

Importance of Unicode Properties in Ruby

Ruby Regular Expression Engine

Properties in Regular Expressions

Currently Supported Properties

Current Implementation: Data for Boolean Property

Current Implementation: Inversion Lists

Current Implementation: Enumerated Properties

Additional Example: Data for Katakana

Current Implementation: Property Check

Current Implementation: Data Size

File Size

An Exception Has Occurred

HTTP Response Status

New Approach: Equivalence Classes

New Approach: Property Representation

New Approach: Property Check

Property Check Example

New Approach: Data Size (first try)

New Approach: Two-Step Equivalence Class Lookup

New Approach: Data Size (second try)

Performance Experiments

Processing Speedsparsity

Property Discrimination

Old versus New

Slide from Ruby Kaigi 2016

Future Work

Acknowledgments

Summary

Q & A

Internationalization and Unicode Conference 42
Santa Clara, CA, U.S.A., September 11, 2018

Motivation:
Slide from Ruby Kaigi 2016