Design Considerations for Internationalization in Ruby 2.2

Abstract

Ruby is a purely object-oriented scripting language which is easy to learn for beginners and highly appreciated by experts for its productivity and depth. This presentation discusses how to add internationalization functionality such as Unicode normalization, case conversion, and number formatting to Ruby in a true Ruby way.

Since Ruby 1.9, Ruby has had a pervasive, if somewhat unique, framework for character encoding, allowing different applications to choose different internationalization models. In practice, Ruby is most often used with UTF-8.

Support for internationalization facilities beyond character encoding is available in various external libraries, but not yet in the Ruby core. As a result, libraries and applications may use conflicting and confusing ways to invoke internationalization functionality. To use case conversion as an example, Ruby comes with built-in methods for upcasing and downcasing strings, but these only work on ASCII. An internationalization library may add separate functions for case conversion for the whole range of Unicode characters.

We study the interface of internationalization functions/methods in a wide range of programming languages and Ruby libraries. Based on this study, we propose to extend the current built-in Ruby methods with additional parameters to allow language-dependent, purpose-based, and explicitly specified functionality, in a true Ruby way.

This presentation is intended for users and potential users of the programming language Ruby, and people interested in internationalization of programming languages and libraries in general.

[These slides have been created in HTML, for projection with Opera (≤12.17 Windows/Mac/Linux; use F11 to switch to projection mode). Texts in gray, like this one, are comments/notes which do not appear on the slides. Please note that depending on the browser and OS you use, some of the characters and character combinations may not display as intended, but e.g. as empty boxes, question marks, or apart rather than composed.]

Introductions

Overview

Conventions Used

Ruby

Ruby Implementations

Basic Ruby

Ruby Library Hierarchy

Object-Orientation

Current I18N Support

Point 2 is fundamentally different from other programming languages, which in general use just a single encoding internally (Java, JavaScript: UTF-16; Python: 8-bit/UTF-16/UTF-32; Perl: 8-bit/UTF-8; C/C++: mb/wc).
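A minimal sketch of what this difference looks like in practice, assuming "Point 2" refers to every String carrying its own encoding (the list itself is not reproduced in these notes). Only core String and Encoding methods are used; the variable names and sample characters are illustrative.

```ruby
# Each String object carries its own encoding; there is no single
# internal encoding that all text is converted to.
greek = "\u03B4"                        # Greek delta, UTF-8 by construction
latin = "\u00E9".encode("ISO-8859-1")   # e-acute, transcoded to Latin-1

greek.encoding                # => Encoding::UTF_8
latin.encoding                # => Encoding::ISO_8859_1
latin.bytes                   # => [233]       (one byte in Latin-1)
latin.encode("UTF-8").bytes   # => [195, 169]  (two bytes in UTF-8)

# Combining non-ASCII strings in incompatible encodings raises an
# error instead of silently converting one of them.
mixed =
  begin
    greek + latin
  rescue Encoding::CompatibilityError => e
    e.class
  end
mixed                         # => Encoding::CompatibilityError
```

Conversion between encodings therefore has to be requested explicitly with encode, which is what lets different applications pick different models.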

Important Encodings

In Ruby, all encodings are equal, but some encodings are more equal than others:

Missing I18N Support

External Libraries

Design Considerations

Case Studies

Case Study: Normalization

Normalization in Other Languages

Normalization in Ruby Libraries

Observations

I18N Functionality Where It Belongs

Monkey Patching

Memory Footprint

Dynamic Loading to the Rescue (0)

Dynamic Loading to the Rescue (1)

Implementation in separate module String::UnicodeNormalize (including internal methods)

Dynamic Loading to the Rescue (2)

Dynamic Loading to the Rescue (3)

We have to use metaprogramming (send, define_method) to redefine the method because Ruby doesn't allow defining a method directly inside a method.

Three New Methods

Why the unicode_ prefix?

Symbols for Normalization Forms

Pure Ruby vs. Extension

Efficient Pure Ruby

Normalization in Various Languages

Ruby clearly is shorter. But it's not shortness for shortness' sake, it just feels better. Ruby is a programmer's best friend!
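For reference, the Ruby side of such a comparison might look like the sketch below. It uses the three unicode_-prefixed String methods as shipped with Ruby 2.2 (unicode_normalize, unicode_normalize!, unicode_normalized?) and symbols for the normalization forms; the sample strings are illustrative and not taken from the slides.

```ruby
decomposed = "e\u0301"                 # "e" followed by combining acute accent

decomposed.unicode_normalize           # => "\u00E9"  (NFC is the default form)
decomposed.unicode_normalize(:nfd)     # => "e\u0301" (already decomposed)
decomposed.unicode_normalized?         # => false     (not in NFC)
"\uFB01".unicode_normalize(:nfkc)      # => "fi"      (ligature split by NFKC)

# The bang variant normalizes the receiver in place.
copy = decomposed.dup
copy.unicode_normalize!
copy.length                            # => 1
```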

Case Study: Casing

Casing in Ruby Now

Widening Method Interface

In languages with strict typing (C++, Java, ...), this can be achieved by method overloading.
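Ruby has no overloading, so the same widening is done with optional arguments on the existing method. The sketch below illustrates the style with the case-mapping options that String#upcase and String#downcase accept from Ruby 2.4 onward (:ascii, :turkic, :fold). In Ruby 2.2 itself these arguments are not accepted, so read it as an illustration of the interface rather than of 2.2 behaviour.

```ruby
word = "r\u00E9sum\u00E9"        # "resume" written with two e-acute characters

# No argument: full Unicode case mapping (Ruby 2.4 and later).
word.upcase                      # => "R\u00C9SUM\u00C9"

# Explicitly specified: only a-z are affected, as in Ruby 2.2.
word.upcase(:ascii)              # => "R\u00E9SUM\u00E9"

# Language-dependent: Turkic dotted and dotless i.
"i".upcase(:turkic)              # => "\u0130" (capital I with dot above)
"I".downcase(:turkic)            # => "\u0131" (dotless small i)

# Purpose-based: case folding for caseless comparison.
"Stra\u00DFe".downcase(:fold)    # => "strasse"
```

Existing calls without arguments keep working, and no new method names are needed.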

Implementation Reuse

One Encoding or All

Default: To Unicode or not to Unicode

Case Study: Properties

Case Study: Iterators

Existing String Iterators

Bytes: "Юに코δ".bytes.to_a
⇒ [208, 174, 227, 129, 171, 236, 189, 148, 206, 180]

As the result of chars shows, Ruby treats characters as Strings of length 1. They were integers up to Ruby 1.8. Using the same class for both strings and characters avoids the distinction between characters and strings of length 1. This matches Ruby's "big classes" policy. It also leaves the door open for 'characters' other than single codepoints.
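A short sketch of this point, using the same four-character sample string written with \u escapes so that it does not depend on the source file encoding. Only core String methods are used.

```ruby
sample = "\u042E\u306B\uCF54\u03B4"   # the sample string "Юに코δ"

sample.length                    # => 4   (characters)
sample.bytesize                  # => 10  (bytes in UTF-8)

# Every "character" is itself a String of length 1.
sample.chars.map(&:class).uniq   # => [String]
sample.chars.map(&:length)       # => [1, 1, 1, 1]

# Indexing returns a one-character String, not an Integer as in Ruby 1.8.
sample[1]                        # => "\u306B"
sample[1].ord                    # => 12395 (the codepoint, when a number is wanted)
```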

each_line, each_char, each_codepoint, and each_byte are the older names for lines, chars, codepoints, and bytes. In Ruby 1.9, lines, chars, codepoints, and bytes return Enumerators (since Ruby 2.0 they return arrays, and the each_ forms called without a block return the Enumerators). Here we just use to_a to produce arrays. Enumerators can be used directly for iteration with each, or combined with other iterators: map for mapping, select/reject for selection.

The need for several enumerators on a single object, resulting from the change of string/character representation between Ruby 1.8 and 1.9, was one of the main motivators for introducing enumerators into Ruby. This is an interesting example of how internationalization concerns can affect more 'fundamental' language features. The orthogonality resulting from separating what to enumerate over and how to iterate results in very expressive code.
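The separation of "what to enumerate over" from "how to iterate" can be sketched as follows. The sketch calls the each_ forms without a block, which return Enumerators in every Ruby version since 1.9; the sample string is again "Юに코δ", and the particular combinations are illustrative.

```ruby
sample = "\u042E\u306B\uCF54\u03B4"

# What to enumerate over: characters, codepoints, or bytes.
chars      = sample.each_char        # an Enumerator (no block given)
codepoints = sample.each_codepoint
bytes      = sample.each_byte

# How to iterate: any Enumerable method combines with any of them.
chars.map(&:bytesize)                    # => [2, 3, 3, 2]
codepoints.select { |cp| cp > 0x3000 }   # => [12395, 53076]
bytes.each_slice(2).first                # => [208, 174]
chars.with_index.map { |c, i| "#{i}:#{c.ord.to_s(16)}" }
# => ["0:42e", "1:306b", "2:cf54", "3:3b4"]
```

Three enumerators times the whole Enumerable vocabulary gives many combinations without String needing a dedicated method for each.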

Implementation Status

A note about Ruby versions and Unicode versions: The Ruby core team is very conservative (in my view too conservative) in introducing new Unicode versions as bug fixes. Updates to new Unicode versions therefore happen only with new Ruby versions.

Conclusions

Q & A

Send questions and comments to Martin Dürst, duerst@it.aoyama.ac.jp

The latest version of this presentation is available at:

Acknowledgments

Ayumu Nojima (野島歩) and Kimihito Matsui (松井仁人) for help with research and implementations. Yui Naruse (成瀬ゆい), Nobu Nakada (中田伸悦) and many other Ruby committers for help and support. Matz (まつもとゆきひろ) for Ruby, a programmer's best friend.