Internationalization in Ruby 2.4

Abstract

Ruby is a purely object-oriented scripting language which is easy to learn for beginners and highly appreciated by experts for its productivity and depth. This presentation discusses the progress of adding internationalization functionality to Ruby for the version 2.4 release expected towards the end of 2016. One focus of the talk will be the currently ongoing implementation of locale-aware case conversion.

Since Ruby 1.9, Ruby has a pervasive if somewhat unique framework for character encoding, allowing different applications to choose different internationalization models. In practice, Ruby is most often and most conveniently used with UTF-8.

Support for internationalization facilities beyond character encoding has been available via various external libraries. As a result, applications may use conflicting and confusing ways to invoke internationalization functionality. To use case conversion as an example, up to version 2.3, Ruby comes with built-in methods for upcasing and downcasing strings, but these only work on ASCII. Our implementation extends this to the whole Unicode range for version 2.4, and efficiently reuses data already available for case-sensitive matching in regular expressions.

We study the interfaces of internationalization functions and methods in a wide range of programming languages and Ruby libraries. Based on this study, we propose to extend the current built-in Ruby methods, e.g. for case conversion, with additional parameters that allow language-dependent, purpose-based, and explicitly specified functionality, in a true Ruby way. Both the design and the implementation of the new functionality for Ruby 2.4 will be described.

This presentation is intended for users and potential users of the programming language Ruby, and people interested in internationalization of programming languages and libraries in general.

For Best Viewing

These slides have been created in HTML, for projection with Opera (≤12.17 Windows/Mac/Linux). Use F11 to switch to projection mode and back. Texts in gray, like this one, are comments/notes which do not appear on the slides. Please note that depending on the browser and OS you use, some rare characters or special character combinations may not display as intended, but e.g. as empty boxes, question marks, or apart rather than composed.

Introduction

Introductions

Overview

Ruby Basics

Ruby

Ruby Implementations

Basic Ruby

Conventions Used in This Talk

Encoding is indicated with a _subscript
'Юに코δ'_UTF-8, 'ユニコード'_SJIS
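As a rough illustration of this convention, the following sketch shows how the encoding that a subscript denotes can be inspected with `String#encoding`. It assumes the source file is saved as UTF-8 (Ruby's default source encoding); the variable names are illustrative only.

```ruby
# Assumes this source file is saved as UTF-8.
utf8 = 'Юに코δ'                          # 'Юに코δ'_UTF-8
sjis = 'ユニコード'.encode('Shift_JIS')    # 'ユニコード'_SJIS

puts utf8.encoding   # UTF-8
puts sjis.encoding   # Shift_JIS

# length counts characters, bytesize counts bytes in the string's own encoding.
puts utf8.length     # 4
puts utf8.bytesize   # 10
puts sjis.length     # 5
puts sjis.bytesize   # 10
```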

Frequent Example Юに코δ

Up and Running

String Basics

Using the same class for both strings and characters avoids the distinction between characters and strings of length 1. This matches Ruby's "big classes" policy. It also leaves the door open for 'characters' other than single codepoints. Strings are not Arrays, but where it makes sense, operations work the same for both classes. This is called duck typing.
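A minimal sketch of what this note describes, assuming a UTF-8 source file. The variable names are illustrative only.

```ruby
str = 'Юに코δ'

# Indexing a String yields another String of length 1, not a separate character type.
first = str[0]
puts first.class    # String
puts first.length   # 1

# Duck typing: where it makes sense, the same call works on String and Array.
arr = str.chars               # array of four one-character strings
puts str[1, 2]                # に코
puts arr[1, 2].join           # に코
puts str.reverse              # δ코にЮ
puts arr.reverse.join         # δ코にЮ
```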

Encoding Basics

Ruby Likes UTF-8

Ruby Versions

Ruby Versions and Unicode Versions

A note about Ruby versions and Unicode versions: The Ruby core team is very conservative (in my view too conservative) about introducing new Unicode versions as bug fixes. Updates to new Unicode versions therefore happen only with new Ruby versions.

New in Ruby 2.4:

Non-ASCII Case Conversion

Case Conversion Functions in Ruby

Case Conversion in Ruby 2.3

'Résumé ĭñŧėřŋãţĳňőńæłĩżàťïōņ'.upcase
⇒ 'RéSUMé ĭñŧėřŋãţĳňőńæłĩżàťïōņ'

Case Conversions NOT in Ruby 2.3

'Résumé ĭñŧėřŋãţĳňőńæłĩżàťïōņ'.upcase
⇒ 'RÉSUMÉ ĬÑŦĖŘŊÃŢĲŇŐŃÆŁĨŻÀŤÏŌŅ'

But in Ruby 2.4!

Case Conversion Around the World

Case Distinction History

Modern Case Usage

German:
der Gefangene floh - the prisoner fled, but
der gefangene Floh - the captive flea

Isn't ASCII-only Case Conversion Enough?

But: Backwards Compatibility?

Backwards Compatibility Problems

Backwards Compatibility: :ascii Option

Use if you find a case where you really don't want to convert non-ASCII characters

'Résumé ĭñŧėřŋãţĳňőńæłĩżàťïōņ'.upcase
⇒ 'RÉSUMÉ ĬÑŦĖŘŊÃŢĲŇŐŃÆŁĨŻÀŤÏŌŅ'

'Résumé ĭñŧėřŋãţĳňőńæłĩżàťïōņ'.upcase :ascii
⇒ 'RéSUMé ĭñŧėřŋãţĳňőńæłĩżàťïōņ'
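The two behaviours can be compared side by side as follows. This is a small sketch for Ruby 2.4 or later, assuming a UTF-8 source file; the variable names are illustrative.

```ruby
text = 'Résumé ĭñŧėřŋãţĳňőńæłĩżàťïōņ'

full  = text.upcase           # Unicode-wide case mapping (Ruby 2.4 and later)
ascii = text.upcase(:ascii)   # only a-z are converted, as in Ruby 2.3 and earlier

puts full    # RÉSUMÉ ĬÑŦĖŘŊÃŢĲŇŐŃÆŁĨŻÀŤÏŌŅ
puts ascii   # RéSUMé ĭñŧėřŋãţĳňőńæłĩżàťïōņ
```

The `:ascii` option is accepted in the same way by `downcase`, `capitalize`, and `swapcase`.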

Implementation Choices

Where to Get the Data From?

Special Cases: Not 1-to-1

Special Case: Simple Case Mapping

Special Case: Turkic

Special Case: Lithuanian

Special Case: Case Folding

Special Case: Titlecase

More Special Cases
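A few of the special cases listed above can be tried directly in Ruby 2.4 or later. This is only a sketch with sample strings chosen for illustration, and it assumes a UTF-8 source file.

```ruby
# Not 1-to-1: one character can map to several.
puts 'Straße'.upcase            # STRASSE

# Turkic: dotted and dotless i are distinct letters.
puts 'i'.upcase                 # I
puts 'i'.upcase(:turkic)        # İ
puts 'I'.downcase(:turkic)      # ı

# Case folding: meant for caseless comparison rather than display.
puts 'Straße'.downcase(:fold)   # strasse

# Titlecase: some digraphs have a distinct titlecase form.
puts 'ǆungla'.capitalize        # ǅungla
```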

Implementation

12 Methods to Implement

Internally, a Single Function

`String` (functional)	`String` (destructive)	`Symbol`
`upcase`	`upcase!`	`upcase`
`downcase`	`downcase!`	`downcase`
`capitalize`	`capitalize!`	`capitalize`
`swapcase`	`swapcase!`	`swapcase`

capitalize: ONIGENC_CASE_TITLECASE | ONIGENC_CASE_UPCASE
(changed to ONIGENC_CASE_DOWNCASE after first character)

swapcase: ONIGENC_CASE_UPCASE | ONIGENC_CASE_DOWNCASE
(both upcasing and downcasing needed)
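The idea of steering one function with flags can be modelled in a few lines of Ruby. This is only a sketch, not the actual C implementation: the flag values and the name `case_map_sketch` are hypothetical, and the first character is simply upcased rather than mapped to a true titlecase form.

```ruby
# Hypothetical flag values for illustration; the real ONIGENC_CASE_* constants
# are defined in Ruby's C sources and may differ.
UPCASE    = 1
DOWNCASE  = 2
TITLECASE = 4

# A single mapping routine, steered only by flags (hypothetical name).
def case_map_sketch(str, flags)
  str.each_char.map { |ch|
    mapped =
      if (flags & UPCASE) != 0 && ch.match?(/\p{Lower}/)
        ch.upcase
      elsif (flags & DOWNCASE) != 0 && ch.match?(/\p{Upper}/)
        ch.downcase
      else
        ch
      end
    # capitalize: once the first character is done, continue with downcasing only
    flags = DOWNCASE if (flags & TITLECASE) != 0
    mapped
  }.join
end

puts case_map_sketch('hello World', UPCASE)               # HELLO WORLD
puts case_map_sketch('hello World', DOWNCASE)             # hello world
puts case_map_sketch('hello World', TITLECASE | UPCASE)   # Hello world
puts case_map_sketch('hello World', UPCASE | DOWNCASE)    # HELLO wORLD
```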

Option Handling

String Expansion

Handling Encodings: The Ruby Way

[1] 松本行弘, 縄手雅彦. スクリプト言語 Ruby の拡張可能な多言語テキスト処理の実装. 情報処理学会論文誌. 2005 Nov 15;46(11):2633-42. / Yukihiro Matsumoto and Masahiko Nawate: Multilingual Text Manipulation Method for Ruby Language. Journal of Information Processing (JIP); 2005 Nov 15; Vol. 46, No. 11, pp. 2633-42. (in Japanese)

Implementation Choice:
UTF-8 only or Primitives

Implementation Choice:
New or Reused Primitive

The case_map Primitive

Implementations of case_map Primitive

The Primitive of Primitives:
onigenc_unicode_case_map

More case_map Primitives

So What about Shift_JIS and Friends?

Reusing Case Folding Data

Folding Data: Before and After

/*  before  */
  {0x0041, {1, {0x0061}}},  /*  A → a  */
  {0x00df, {2, {0x0073, 0x0073}}},  /*  ß → ss  */
  {0x01c4, {1, {0x01c6}}},  /*  Ǆ → ǆ  */
  {0x01c5, {1, {0x01c6}}},  /*  ǅ → ǆ  */
  {0xab73, {1, {0x13a3}}},  /*  ꭳ → Ꭳ (Cherokee)  */

/*  after  */
  {0x0041, {1|F|D, {0x0061}}},  /*  A → a  */
  {0x00df, {2|F|ST|SU|I(1), {0x0073, 0x0073}}},  /*  ß → ss  */
  {0x01c4, {1|F|D|ST|I(8), {0x01c6}}},  /*  Ǆ → ǆ  */
  {0x01c5, {1|F|D|IT|SU|I(9), {0x01c6}}},  /*  ǅ → ǆ  */
  {0xab73, {1|F|U, {0x13a3}}},  /*  ꭳ → Ꭳ (Cherokee)  */

Folding Data: Flags

/*  data is available here  */
/*  (flags are the same as for options)  */
#define U ONIGENC_CASE_UPCASE
#define D ONIGENC_CASE_DOWNCASE
#define F ONIGENC_CASE_FOLD
/*  data is in special additional array  */
#define ST ONIGENC_CASE_TITLECASE
#define SU ONIGENC_CASE_UP_SPECIAL
#define SL ONIGENC_CASE_DOWN_SPECIAL
#define IT ONIGENC_CASE_IS_TITLECASE
/*  index into special array
    (size: around 420 words only)  */
#define I(n) OnigSpecialIndexEncode(n)

Small Implementation Detail

upcase

seems useful

downcase

seems useful

capitalize

seems useful

swapcase

Who would use swapcase?

Nobody?

Why swapcase?

implementing swapcase

must be easy
UPPER ⇒ upper
lower ⇒ LOWER

But what about titlecase?

ǲ, ǅ, ǈ, ǋ
ᾼ, ᾈ, ᾉ, ᾊ, ᾋ, ᾌ, ᾍ, ᾎ, ᾏ
ῌ, ᾘ, ᾙ, ᾚ, ᾛ, ᾜ, ᾝ, ᾞ, ᾟ
ῼ, ᾨ, ᾩ, ᾪ, ᾫ, ᾬ, ᾭ, ᾮ, ᾯ

Choice 1
"ǅunGLA".swapcase
⇓ leave as is
"ǅUNgla"

Choice 2
"ǅunGLA".swapcase
⇓ upcase
"ǄUNgla"

Choice 3
"ǅunGLA".swapcase
⇓ downcase
"ǆUNgla"

Choice 4
"ǅunGLA".swapcase
⇓ swap
"dŽUNgla"

Implemented
swap ⇒"dŽUNgla"

useless?, but 'correct'
additional effort for implementation
additional effort for testing
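The four choices can be reproduced by treating the leading titlecase digraph separately from the rest of the word. This is a sketch for Ruby 2.4 or later; the variable names are illustrative, and the digraph is written as an escape so that the code point is unambiguous.

```ruby
word = "\u01C5unGLA"            # starts with the titlecase digraph ǅ (U+01C5)
head = word[0]
tail = word[1..-1]
swapped_tail = tail.swapcase    # "UNgla"

puts head + swapped_tail            # Choice 1: leave the digraph as is
puts head.upcase + swapped_tail     # Choice 2: upcase it   (Ǆ, U+01C4)
puts head.downcase + swapped_tail   # Choice 3: downcase it (ǆ, U+01C6)
puts word.swapcase                  # Choice 4: what the built-in method returns
```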

Commit Date
April 1st, 2016

(April Fools' Day)
Japan Time 20:58:33 ⇒ same date in most timezones
please draw your own conclusions

Testing

Test-Driven Development

Files:
test/ruby/enc/test_case_options.rb
test/ruby/enc/test_case_mapping.rb

Year (`y`)	Ruby version (`V`_Ruby)	Unicode version (`V`_Unicode)
	published around Christmas	published in Summer
2014	2.2	7.0.0
2015	2.3	8.0.0
2016	2.4	9.0.0

40th Internationalization and Unicode Conference

Santa Clara, California, U.S.A., November 3, 2016

Martin J. DÜRST

Future:

Ideas, Problems, Questions