http://www.sw.it.aoyama.ac.jp/2014/pub/RubyI18N/
© 2014 Martin J. Dürst, Aoyama Gakuin University
Ruby is a purely object-oriented scripting language which is easy to learn for beginners and highly appreciated by experts for its productivity and depth. This presentation discusses how to add internationalization functionality such as Unicode normalization, case conversion, and number formatting to Ruby in a true Ruby way.
Since Ruby 1.9, Ruby has a pervasive if somewhat unique framework for character encoding, allowing different applications to choose different internationalization models. In practice, Ruby is most often used with UTF-8.
Support for internationalization facilities beyond character encoding is available in various external libraries, but not yet in the Ruby core. As a result, libraries and applications may use conflicting and confusing ways to invoke internationalization functionality. To use case conversion as an example, Ruby comes with built-in methods for upcasing and downcasing strings, but these only work on ASCII. An internationalization library may add separate functions for case conversion for the whole range of Unicode characters.
We study the interface of internationalization functions/methods in a wide range of programming languages and Ruby libraries. Based on this study, we propose to extend the current built-in Ruby methods with additional parameters to allow language-dependent, purpose-based, and explicitly specified functionality, in a true Ruby way.
This presentation is intended for users and potential users of the programming language Ruby, and people interested in internationalization of programming languages and libraries in general.
[These slides have been created in HTML, for projection with Opera (≤12.17 Windows/Mac/Linux; use F11 to switch to projection mode). Texts in gray, like this one, are comments/notes which do not appear on the slides. Please note that depending on the browser and OS you use, some of the characters and character combinations may not display as intended, but e.g. as empty boxes, question marks, or apart rather than composed.]
(Ruby) code is green, monospace
puts "Hello Ruby!"
Variable parts are orange
puts "some string"
Encoding is indicated with a subscript
"Юに코δ" UTF-8
"ユニコード" SJIS
Results are indicated with "⇒": 1 + 1
⇒ 2
irb: Interactive Ruby Shell
5.times { puts "Hello Ruby!" }
{ } or do end
String, Array, Hash, ...
require
String
String, StringBuffer, StringBuilder, byte[], char[]
main object as file context
(since Ruby 1.9)
"Юに코δ".length ⇒ 4
"Юに코δ".encoding ⇒ UTF-8
"abcδ".encode 'SJIS' ⇒ "abcδ" SJIS
"Юに코δ".match /δ/
Point 2 is fundamentally different from other programming languages, which in general use just a single encoding internally (Java, JavaScript: UTF-16; Python: 8-bit/UTF-16/UTF-32; Perl: 8-bit/UTF-8; C/C++: mb/wc)
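The per-string encoding model described above can be tried out directly. The following is a minimal sketch, assuming Ruby 2.0 or later (where source files default to UTF-8). It uses only built-in String and Encoding methods, and spells the encoding name as Shift_JIS rather than the SJIS alias used on the slides.

```ruby
s = "Юに코δ"
puts s.encoding.name   # UTF-8
puts s.length          # 4 characters
puts s.bytesize        # 2 + 3 + 3 + 2 = 10 bytes

# Transcode a string whose characters all exist in Shift_JIS.
sjis = "abcδ".encode("Shift_JIS")
puts sjis.encoding.name   # Shift_JIS
puts sjis.bytesize        # 3 single-byte letters + 1 double-byte letter = 5

# Strings in different encodings coexist in one program. A character that
# the target encoding lacks (here the Hangul syllable) makes the conversion
# fail instead of silently degrading.
begin
  s.encode("Shift_JIS")
  puts "converted"
rescue Encoding::UndefinedConversionError => e
  puts "cannot convert: #{e.class}"
end
```

Each String object carries its own encoding, so the two strings above can live side by side without either being converted to a common internal form.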
In Ruby, all encodings are equal, but some encodings are more equal than others:
[Adapted from George Orwell's Animal Farm.]
Most I18N support is currently available via various libraries:
Unicode Normalization Forms NFC, NFD, NFKC, NFKD
Defined in UAX #15, yesterday's tutorial
Example: NFC of R + ◌̣ + ◌̄ is Ṝ
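As a sketch of this example in Ruby itself: the method String#unicode_normalize, which this presentation goes on to propose, is assumed here to be available (it is part of Ruby 2.2 and later). The combining characters are written as escapes so that they survive copy and paste.

```ruby
decomposed = "R\u0323\u0304"   # R + COMBINING DOT BELOW + COMBINING MACRON
composed   = decomposed.unicode_normalize(:nfc)

# Show both strings as lists of codepoints.
p decomposed.codepoints.map { |cp| format("U+%04X", cp) }
p composed.codepoints.map { |cp| format("U+%04X", cp) }   # one codepoint, U+1E5C

# NFD goes the other way and restores the three-codepoint sequence.
p composed.unicode_normalize(:nfd) == decomposed          # true

# Canonical reordering: the two marks may also arrive in the other order.
p "R\u0304\u0323".unicode_normalize(:nfc) == composed     # true
```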
Java:
import java.text.Normalizer;
Normalizer.normalize(string, Normalizer.Form.NFC);
C#:
using System.Text;
string.Normalize(NormalizationForm.FormKD);
Perl:
require Unicode::Normalize;
Unicode::Normalize->import(qw(NFD NFC NFKD NFKC));
NFKD($string);
Python:
import unicodedata
unicodedata.normalize('NFKD', string)
Ruby (unicode_utils):
require "unicode_utils/nfc"
require "unicode_utils/nfkd"
require "unicode_utils/nfkc"
require "unicode_utils/canonical_decomposition"
UnicodeUtils.nfkc string
Ruby (ActiveSupport):
require 'active_support/multibyte/chars'
ActiveSupport::Multibyte::Chars.new(string).normalize :kc
Ruby (twitter_cldr):
require 'twitter_cldr'
TwitterCldr::Normalization::NFKC.normalize string
require/using/import necessary
We can do better!
String
String
Date
/Time
String
⇒ Libraries are reluctant to patch core classes
⇒ Adding functionality to language core
⇒ Dynamic loading
Starting point: Add method to class String
class String
  def unicode_normalize(form)
    # implementation goes here
  end
end
Step 1: Separate invocation and implementation
class String
  def unicode_normalize(form)
    UnicodeNormalize.normalize(self, form)
  end
end
Implementation in separate module String::UnicodeNormalize (including internal methods)
Step 2: Conditionally require implementation (only on first call)
class String # reopen class String to add new methods
  def unicode_normalize(form = :nfc)
    unless defined? UnicodeNormalize
      require 'unicode_normalize/normalize.rb'
    end
    UnicodeNormalize.normalize(self, form)
  end
end
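The two steps above can be sketched in a form that runs on its own. Everything named here is hypothetical and only stands in for the real thing: the module ShoutImpl plays the role of UnicodeNormalize, the file shout_impl.rb plays the role of unicode_normalize/normalize.rb, and String#shout plays the role of String#unicode_normalize. The implementation file is written to a temporary directory so that the sketch has no external dependencies.

```ruby
require 'tmpdir'

# Write a stand-in implementation file (hypothetical name) to a temporary
# directory and make that directory visible to `require`.
impl_dir = Dir.mktmpdir
File.write(File.join(impl_dir, 'shout_impl.rb'), <<~IMPL)
  module ShoutImpl
    def self.shout(string)
      string + '!'
    end
  end
IMPL
$LOAD_PATH.unshift(impl_dir)

class String
  # Step 1: the method only delegates to the implementation module.
  # Step 2: the implementation is loaded on the first call, not before.
  def shout
    require 'shout_impl' unless defined? ShoutImpl
    ShoutImpl.shout(self)
  end
end

puts Object.const_defined?(:ShoutImpl)   # false: nothing loaded yet
puts 'Hello Ruby'.shout                  # Hello Ruby!
puts Object.const_defined?(:ShoutImpl)   # true: loaded by the first call
```

Programs that never call the method never pay for loading the implementation, which matters when the implementation includes large data tables.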
Speed optimization: Dynamically redefine method without require
class String # reopen class String to add new methods
  def unicode_normalize(form = :nfc)
    unless defined? UnicodeNormalize
      require 'unicode_normalize/normalize.rb'
    end
    # Redefine this method so that later calls skip the check above
    String.send(:define_method,
                :unicode_normalize, ->(form = :nfc) do
      UnicodeNormalize.normalize(self, form)
    end
    )
    UnicodeNormalize.normalize(self, form)
  end
end
:define_method: Same as (1), but using metaprogramming. We use metaprogramming (send, define_method) here to redefine the method from inside the method itself.
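The self-redefinition technique can be observed in isolation with a small sketch. The class Greeter and its counter are hypothetical and exist only for illustration; the counter stands in for the defined? check and the require, which should run on the first call only.

```ruby
class Greeter   # hypothetical class, for illustration only
  @slow_path_calls = 0
  class << self
    attr_accessor :slow_path_calls
  end

  def greet(name = "Ruby")
    # Slow path: in the slides, this is where the check and the require happen.
    Greeter.slow_path_calls += 1
    # Replace this very method with a version that skips the slow path.
    Greeter.send(:define_method, :greet, ->(who = "Ruby") { "Hello #{who}!" })
    "Hello #{name}!"
  end
end

greeter = Greeter.new
puts greeter.greet            # Hello Ruby!   (original method)
puts greeter.greet("World")   # Hello World!  (redefined method)
puts Greeter.slow_path_calls  # 1
```

Both versions of the method must produce the same result for the same arguments, because callers cannot tell which one they are invoking.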
Creating a new normalized String
string.unicode_normalize(form)
string
string.unicode_normalize!(form)
string for normalization
string.unicode_normalized?(form)
unicode_ prefix?:
:nfc, :nfd, :nfkc, :nfkd
gsub
)Hash
Java:
import java.text.Normalizer;
Normalizer.normalize(string, Normalizer.Form.NFC);
C#:
using System.Text;
string.Normalize(NormalizationForm.FormKD);
Perl:
require Unicode::Normalize;
Unicode::Normalize->import(qw(NFD NFC NFKD NFKC));
NFKD($string);
Python:
import unicodedata
unicodedata.normalize('NFKD', string)
Ruby:
string.unicode_normalize :nfkd
Ruby clearly is shorter. But it's not shortness for shortness' sake, it just feels better. Ruby is a programmer's best friend!
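The three proposed methods and the form argument can be exercised as follows. This is a sketch assuming Ruby 2.2 or later, where these methods are available without any explicit require.

```ruby
s = "A\u030A"                        # A + COMBINING RING ABOVE

# Creating a new normalized String; the default form is :nfc.
nfc = s.unicode_normalize
p nfc == "\u00C5"                    # true: precomposed Å
p s.length                           # 2: the receiver is unchanged

# Checking for normalization.
p s.unicode_normalized?              # false
p nfc.unicode_normalized?            # true
p nfc.unicode_normalized?(:nfd)      # false

# Normalizing in place.
copy = s.dup
copy.unicode_normalize!
p copy == nfc                        # true

# Compatibility forms also fold characters such as the fi ligature (U+FB01).
p "\uFB01".unicode_normalize(:nfkc)          # "fi"
p "\uFB01".unicode_normalize(:nfc).length    # 1: NFC leaves the ligature alone
```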
'Hello'.upcase
⇒ 'HELLO'
'Hello'.downcase
⇒ 'hello'
'hello'.capitalize
⇒ 'Hello'
'Hello'.swapcase
⇒ 'hELLO'
'Hello'.casecmp 'hello'
⇒ 0
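The built-in case methods listed above, together with their destructive (!) counterparts, can be tried as follows. The sketch uses ASCII-only input, so it does not depend on how a given Ruby version treats non-ASCII letters.

```ruby
word = "Hello"
p word.upcase             # "HELLO"
p word.downcase           # "hello"
p "hello".capitalize      # "Hello"
p word.swapcase           # "hELLO"

# casecmp compares like <=>, ignoring case: 0 means "equal".
p word.casecmp("hello")   # 0

# Destructive variants modify the receiver and return nil if nothing changed.
buffer = word.dup
p buffer.upcase!          # "HELLO"
p buffer.upcase!          # nil
p word                    # "Hello": the original is untouched
```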
+ destructive variants
'München'.upcase
⇒ 'MüNCHEN'
'München'.upcase :unicode
⇒ 'MÜNCHEN'
'Türkiye'.upcase 'tr'
⇒ 'TÜRKİYE'
'Türkiye'.upcase tr_locale
⇒ 'TÜRKİYE'
In languages with strict typing (C++, Java,...), this can be achieved by method overloading.
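For comparison, case conversion with an optional argument can be sketched with the interface that later Ruby versions provide. This assumes Ruby 2.4 or later, where String#upcase handles non-ASCII letters by default and accepts option symbols such as :ascii and :turkic. These option names differ from the 'tr' and :unicode arguments shown on the slides, which were proposals at the time.

```ruby
munich = "M\u00FCnchen"   # München
turkey = "T\u00FCrkiye"   # Türkiye

puts munich.upcase            # full Unicode case mapping: ü becomes Ü
puts munich.upcase(:ascii)    # only a-z are changed: ü stays as it is
puts turkey.upcase(:turkic)   # Turkic rules: i becomes dotted capital İ (U+0130)
puts turkey.upcase            # default rules: i becomes plain I
```

The same method name covers all cases; the argument selects the behavior, which is the design direction argued for above.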
/regexp/i option
Should 'München'.upcase (no argument) be 'MÜNCHEN' or 'MüNCHEN'?
'MüNCHEN'
'MÜNCHEN'
upcase should be very rare
\p{Hiragana}
)String
Strings can be cut up, and iterated over, in four ways:
Lines: "Юに코δ".lines.to_a
⇒ ["Юに코δ"]
Characters: "Юに코δ".chars.to_a
⇒ ["Ю", "に", "코", "δ"]
Codepoints: "Юに코δ".codepoints.to_a
⇒ [1070, 12395, 53076, 948]
Bytes: "Юに코δ".bytes.to_a
⇒ [208, 174, 227, 129, 171, 236, 189, 148, 206, 180]
As the result of chars shows, Ruby treats characters as Strings of length 1. They were integers up to Ruby 1.8. Using the same class for both strings and characters avoids the distinction between characters and strings of length 1. This matches Ruby's "big classes" policy. It also leaves the door open for 'characters' other than single codepoints.
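A short sketch of what "characters are Strings of length 1" means in practice. The last two lines assume Ruby 2.5 or later, where String#grapheme_clusters is available, and illustrate one possible notion of a 'character' that spans several codepoints.

```ruby
s = "Юに코δ"
first = s[0]

p first.class                           # String
p first.length                          # 1
p first.ord                             # 1070
p 1070.chr(Encoding::UTF_8) == first    # true: back from integer to String
p ?a == "a"                             # true: character literals are Strings too

# A user-perceived character made of three codepoints.
r = "R\u0323\u0304"                     # R + dot below + macron
p r.length                              # 3
p r.grapheme_clusters.length            # 1
```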
each_line, each_char, each_codepoint, and each_byte are older names for lines, chars, codepoints, and bytes. In Ruby 1.9, the methods lines, chars, codepoints, and bytes return Enumerators; since Ruby 2.0 they return Arrays, and the each_ variants return Enumerators when called without a block. Here we just use to_a to produce arrays. Enumerators can be used directly for iteration with each, or with separate iterators for mapping with map, selection with select/reject.
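The separation of what to enumerate over (characters, codepoints, bytes) from how to iterate (map, select, with_index) can be sketched as follows. The each_ methods are called without a block, which returns an Enumerator in Ruby 1.9 and later.

```ruby
s = "Юに코δ"

# What: codepoints. How: map.
p s.each_codepoint.map { |cp| format("U+%04X", cp) }
# ⇒ ["U+042E", "U+306B", "U+CF54", "U+03B4"]

# What: characters. How: map.
p s.each_char.map(&:bytesize)
# ⇒ [2, 3, 3, 2]

# What: bytes. How: select.
p s.each_byte.select { |b| b >= 0x80 }.length
# ⇒ 10

# What: characters. How: with_index, then map.
p s.each_char.with_index.map { |c, i| "#{i}:#{c}" }
# ⇒ ["0:Ю", "1:に", "2:코", "3:δ"]
```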
The need for several enumerators on a single object, resulting from the change of string/character representation between Ruby 1.8 and 1.9, was one of the main motivators for introducing enumerators into Ruby. This is an interesting example of how internationalization concerns can affect more 'fundamental' language features. The orthogonality resulting from separating what to enumerate over and how to iterate results in very expressive code.
string.unicode_normalize
A note about Ruby versions and Unicode versions: The Ruby core team is very conservative (in my view too conservative) about introducing new Unicode versions as bug fixes. Updates to new Unicode versions therefore only happen with new Ruby versions.
Send questions and comments to Martin Dürst, duerst@it.aoyama.ac.jp
The latest version of this presentation is available at:
http://www.sw.it.aoyama.ac.jp/2014/pub/RubyI18N/
Ayumu Nojima (野島 歩) and Kimihito Matsui (松井 仁人) for help with research and implementations. Yui Naruse (成瀬 ゆい), Nobu Nakada (中田 伸悦) and many other Ruby committers for help and support. Matz (まつもと ゆきひろ) for Ruby, a programmer's best friend.
The IME Pad for facilitating character input.
Image credits: Christmas tree: Frits Ahlefeldt-Laurvig