http://www.sw.it.aoyama.ac.jp/2016/pub/IUC40-Ruby2.4/
© 2016 Martin J. Dürst, Aoyama Gakuin University
Ruby is a purely object-oriented scripting language which is easy to learn for beginners and highly appreciated by experts for its productivity and depth. This presentation discusses the progress of adding internationalization functionality to Ruby for the version 2.4 release expected towards the end of 2016. One focus of the talk will be the currently ongoing implementation of locale-aware case conversion.
Since Ruby 1.9, Ruby has a pervasive if somewhat unique framework for character encoding, allowing different applications to choose different internationalization models. In practice, Ruby is most often and most conveniently used with UTF-8.
Support for internationalization facilities beyond character encoding has been available via various external libraries. As a result, applications may use conflicting and confusing ways to invoke internationalization functionality. To use case conversion as an example, up to version 2.3, Ruby comes with built-in methods for upcasing and downcasing strings, but these only work on ASCII. Our implementation extends this to the whole Unicode range for version 2.4, and efficiently reuses data already available for case-sensitive matching in regular expressions.
We study the interface of internationalization functions/methods in a wide range of programming languages and Ruby libraries. Based on this study, we propose to extend the current built-in Ruby methods, e.g. for case conversion, with additional parameters to allow language-dependent, purpose-based, and explicitly specified functionality, in a true Ruby way. Both the design and the implementation of the new functionality for Ruby 2.4 will be described.
This presentation is intended for users and potential users of the programming language Ruby, and people interested in internationalization of programming languages and libraries in general.
These slides have been created in HTML, for projection with Opera (≤12.17 Windows/Mac/Linux). Use F11 to switch to projection mode and back. Texts in gray, like this one, are comments/notes which do not appear on the slides. Please note that depending on the browser and OS you use, some rare characters or special character combinations may not display as intended, but e.g. as empty boxes, question marks, or apart rather than composed.
String#encode (Ruby 1.9)
String#unicode_normalize (Ruby 2.2)
String#upcase,... (Ruby 2.4)
This tutorial is about MRI/C-Ruby, the reference implementation
3.times { puts 'Hello Ruby!' }
⇒
Hello Ruby! Hello Ruby! Hello Ruby!
({ ... } or do ... end)
Code is mostly green, monospace
puts 'Hello Ruby!'
Variable parts are orange
puts "some string"
Encoding is indicated after the string (as a subscript on the slides)
'Юに코δ' (UTF-8), 'ユニコード' (SJIS)
Results are indicated with "⇒"
1 + 1
⇒ 2
Юに코δ
Ю: Cyrillic uppercase YU
に: Hiragana NI
코: Hangul KO
δ: Greek delta
chcp 65001
irb (Interactive Ruby)
"Юに코δ".length
⇒ 4
"Юに코δ".bytesize
⇒ 10
String: "Юに코δ".class
⇒ String
"Юに코δ"[0]
⇒ "Ю"
"Юに코δ"[0].length
⇒ 1
Using the same class for both strings and characters avoids the distinction between characters and strings of length 1. This matches Ruby's "big classes" policy. It also leaves the door open for 'characters' other than single codepoints. Strings are not Arrays, but where it makes sense, operations work the same for both classes. This is called duck typing.
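The points above can be tried out with a few lines. This is a minimal sketch, assuming a UTF-8 source file; the variable name `s` is only for illustration.

```ruby
s = 'Юに코δ'              # four characters from four scripts

puts s.length            # 4 characters
puts s.bytesize          # 10 bytes (2 + 3 + 3 + 2 in UTF-8)
puts s[0]                # Ю
puts s[0].class          # String: a character is a string of length 1

# Duck typing: indexing reads the same for a String and an Array
puts s.chars[0] == s[0]  # true
```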
String has an encoding
'Юに코δ' (UTF-8) + 'Юに코δ' (UTF-16)
⇒ Encoding::CompatibilityError
Trying to combine strings with different encodings, as here with concatenation (+), leads to an exception. There are some exceptions (sic!) to this rule that we will look at later. The reasoning for the error here is that transcoding should not happen without the programmer being aware of it.
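The following sketch reproduces the error and shows the explicit transcoding that avoids it. It assumes a UTF-8 source file and uses UTF-16LE as the concrete UTF-16 variant; the variable names are illustrative.

```ruby
utf8  = 'Юに코δ'                    # UTF-8 (source encoding)
utf16 = utf8.encode('UTF-16LE')    # explicit transcoding

begin
  utf8 + utf16                     # mixing encodings is refused
rescue Encoding::CompatibilityError => e
  puts "refused: #{e.message}"
end

# Transcoding back explicitly makes the concatenation legal
combined = utf8 + utf16.encode('UTF-8')
puts combined.length               # 8
```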
'Dürst' (ISO-8859-1) == 'Dürst' (ISO-8859-2)
⇒ false
Comparing two strings that are character-by-character identical but in different encodings produces false, even if these strings are, as in the above example, also byte-for-byte identical. Again, the reason is that encoding mismatches should be detected early. In addition, a simple byte-for-byte comparison could produce false positives.
except if their content is ASCII-only (bytes)
'abc' (ISO-8859-1) == 'abc' (Shift_JIS)
⇒ true
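Both comparison rules can be checked as follows. This is a sketch only: the strings are built from raw bytes with `force_encoding` so that the example does not depend on the source encoding, and the variable names are illustrative.

```ruby
# "\xFC" is the byte for ü in both ISO-8859-1 and ISO-8859-2
latin1 = "D\xFCrst".dup.force_encoding('ISO-8859-1')
latin2 = "D\xFCrst".dup.force_encoding('ISO-8859-2')

puts latin1.bytes == latin2.bytes   # true: byte-for-byte identical
puts latin1 == latin2               # false: different encodings, non-ASCII content

ascii1 = 'abc'.dup.force_encoding('ISO-8859-1')
ascii2 = 'abc'.dup.force_encoding('Shift_JIS')
puts ascii1 == ascii2               # true: ASCII-only content compares equal
```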
⇒ Just use Unicode, just use UTF-8
# encoding: UTF-8 (encoding pragma)
A string with \u escapes is always UTF-8
"abc\u03B4"
⇒ 'abcδ' (UTF-8)
-U option if not in a UTF-8 context: ruby -U myscript.rb
Year (y) | Ruby version (VRuby), published around Christmas | Unicode version (VUnicode), published in summer |
---|---|---|
2014 | 2.2 | 7.0.0 |
2015 | 2.3 | 8.0.0 |
2016 | 2.4 | 9.0.0 |
A note about Ruby versions and Unicode versions: The Ruby core team is very conservative (in my view too conservative) about introducing new Unicode versions as bug fixes. Updates to new Unicode versions therefore only happen in new Ruby versions.
RbConfig::CONFIG["UNICODE_VERSION"]
⇒ '9.0.0'
VUnicode = y - 2007
VRuby = 1.5 + VUnicode · 0.1
VUnicode = VRuby · 10 - 15
Don't extrapolate too far!
'Unicode Everywhere'.upcase
⇒ 'UNICODE EVERYWHERE'
'Unicode Everywhere'.downcase
⇒ 'unicode everywhere'
'Unicode Everywhere'.capitalize
⇒ 'Unicode everywhere'
'Unicode Everywhere'.swapcase
⇒ 'uNICODE eVERYWHERE'
'Résumé ĭñŧėřŋãţijňőńæłĩżàťïōņ'.upcase
⇒ 'RéSUMé ĭñŧėřŋãţijňőńæłĩżàťïōņ' (up to Ruby 2.3)
'Résumé ĭñŧėřŋãţijňőńæłĩżàťïōņ'.upcase
⇒ 'RÉSUMÉ ĬÑŦĖŘŊÃŢIJŇŐŃÆŁĨŻÀŤÏŌŅ' (Ruby 2.4)
Case conversion up to and including Ruby 2.3 is ASCII-only!
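The difference can be checked on a shorter word. This is a minimal sketch, assuming Ruby 2.4 or later and a UTF-8 source file; the `:ascii` option restores the ASCII-only behavior of earlier versions.

```ruby
word = 'Résumé'

puts word.upcase           # RÉSUMÉ: full Unicode case mapping (Ruby 2.4 and later)
puts word.upcase(:ascii)   # RéSUMé: only a-z are converted, as up to Ruby 2.3
```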
(details vary by language)
German:
der Gefangene floh - the prisoner fled, but
der gefangene Floh - the captive flea
(Regexp: //i)
'Résumé'.upcase
⇒ 'RéSUMé'
'Résumé'.upcase :unicode
⇒ 'RÉSUMÉ'
grep your code base for upcase and friends
E.g. DNS servers (but you used Encoding::ASCII_8BIT there anyway?!)
downcased with Ruby 2.3 to help users (Соколов in DB)
:ascii Option
Use if you find a case where you really don't want to convert non-ASCII characters
'Résumé ĭñŧėřŋãţijňőńæłĩżàťïōņ'.upcase
⇒ 'RÉSUMÉ ĬÑŦĖŘŊÃŢIJŇŐŃÆŁĨŻÀŤÏŌŅ'
'Résumé ĭñŧėřŋãţijňőńæłĩżàťïōņ'.upcase :ascii
⇒ 'RéSUMé ĭñŧėřŋãţijňőńæłĩżàťïōņ'
Use a library?
Integrate ICU?
Write new code?
Data and other specifications available from the Unicode Consortium:
'ß'.upcase
⇒ 'SS' (German sz/sharp s)
'ﬃ'.upcase
⇒ "FFI" (ffi ligature)
'ß'.upcase.downcase
⇒ 'ss' (not 'ß')
'σ'.upcase
⇒ 'Σ' (Greek sigma)
'ς'.upcase
⇒ 'Σ' (Greek final sigma)
'ς'.upcase.downcase
⇒ 'σ' (not 'ς')
'Σ'.downcase should be context-dependent
⇒ Not implemented!
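These special cases can be verified directly. The sketch below assumes Ruby 2.4 or later and writes the characters as `\u` escapes so that ligature and final sigma are unambiguous; the variable names are illustrative.

```ruby
sharp_s     = "\u00DF"   # ß
ligature    = "\uFB03"   # ﬃ (a single character)
final_sigma = "\u03C2"   # ς

puts sharp_s.upcase                 # SS
puts sharp_s.upcase.length          # 2: the string grows
puts ligature.upcase                # FFI
puts sharp_s.upcase.downcase        # ss: the round trip does not restore ß
puts final_sigma.upcase.downcase    # σ: the round trip does not restore ς
```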
'i'.upcase
⇒ 'I'
'I'.downcase
⇒ 'i'
In Turkic languages (Turkish, Azeri):
'i'.upcase
⇒ 'İ' (uppercase I with dot)
'İ'.downcase
⇒ 'i'
'ı'.upcase
⇒ 'I' (i without dot)
'I'.downcase
⇒ 'ı'
'Türkiye'.upcase :turkic
⇒ 'TÜRKİYE'
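A minimal sketch of the `:turkic` option next to the default behavior, assuming Ruby 2.4 or later and a UTF-8 source file:

```ruby
puts 'i'.upcase                  # I (default mapping)
puts 'i'.upcase(:turkic)         # İ (uppercase I with dot)
puts 'I'.downcase(:turkic)       # ı (lowercase i without dot)

puts 'Türkiye'.upcase            # TÜRKIYE
puts 'Türkiye'.upcase(:turkic)   # TÜRKİYE
```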
'Í'.downcase
⇒'í'
(accent replaces dot)
'Í'.downcase :lithuanian
⇒'í'
(accent above visible dot; may not show because of technology limits)
upcase/downcase/capitalize/swapcase
downcase: :fold option
'ß'.downcase :fold
⇒ 'ss'
'ﬃ'.downcase :fold
⇒ 'ffi'
'ς'.downcase :fold
⇒ 'σ'
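Case folding is mainly useful for caseless comparison. The sketch below assumes Ruby 2.4 or later; the helper `same_ignoring_case?` is a hypothetical name introduced only for this illustration.

```ruby
puts "\u00DF".downcase(:fold)   # ss  (from ß)
puts "\uFB03".downcase(:fold)   # ffi (from the single-character ligature ﬃ)
puts "\u03C2".downcase(:fold)   # σ   (from final sigma ς)

# Hypothetical helper: compare two strings after folding both
def same_ignoring_case?(a, b)
  a.downcase(:fold) == b.downcase(:fold)
end

puts same_ignoring_case?('Straße', 'STRASSE')   # true
```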
capitalize
'ǆungla'.capitalize
⇒ 'Ǆungla' (wrong: uppercase digraph)
'ǆungla'.capitalize
⇒ 'ǅungla' (right: titlecase digraph)
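The titlecase behavior can be checked with explicit code points, since the three digraph characters look very similar to their two-letter spellings. A minimal sketch, assuming Ruby 2.4 or later:

```ruby
word = "\u01C6ungla"                     # starts with the single character ǆ (U+01C6)

puts word.capitalize == "\u01C5ungla"    # true: titlecase ǅ (U+01C5)
puts word.upcase     == "\u01C4UNGLA"    # true: uppercase Ǆ (U+01C4)
puts word.capitalize.length              # 6: still one character for the digraph
```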
String (functional) | String (destructive) | Symbol |
---|---|---|
upcase | upcase! | upcase |
downcase | downcase! | downcase |
capitalize | capitalize! | capitalize |
swapcase | swapcase! | swapcase |
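The three columns of the table differ as follows. This is a short sketch, assuming Ruby 2.4 or later and a UTF-8 source file; `dup` is used because string literals may be frozen.

```ruby
s = 'résumé'.dup

t = s.upcase         # functional: returns a new string
puts t               # RÉSUMÉ
puts s               # résumé (unchanged)

s.upcase!            # destructive: changes s in place
puts s               # RÉSUMÉ
p s.upcase!          # nil, because there was nothing left to change

p :résumé.upcase     # :RÉSUMÉ (Symbol)
```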
Not dealt with: String#casecmp
Why: Includes sorting
Flags to indicate operation needed (in file include/ruby/oniguruma.h):
#define ONIGENC_CASE_UPCASE (1<<13) /* uppercase mapping */
#define ONIGENC_CASE_DOWNCASE (1<<14) /* lowercase mapping */
#define ONIGENC_CASE_TITLECASE (1<<15) /* titlecase mapping */
Usage to indicate operation type:
upcase: ONIGENC_CASE_UPCASE (upcasing needed)
downcase: ONIGENC_CASE_DOWNCASE (downcasing needed)
capitalize: ONIGENC_CASE_TITLECASE | ONIGENC_CASE_UPCASE (changed to ONIGENC_CASE_DOWNCASE after first character)
swapcase: ONIGENC_CASE_UPCASE | ONIGENC_CASE_DOWNCASE (both upcasing and downcasing needed)
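The flag usage above can be modeled in a few lines of Ruby. This is an illustration only, not the C implementation: the bit values are those of the `#define` lines, while the constant names, the `INITIAL_FLAGS` table and the helper `flags_after_first_character` are hypothetical.

```ruby
CASE_UPCASE    = 1 << 13
CASE_DOWNCASE  = 1 << 14
CASE_TITLECASE = 1 << 15

# Hypothetical table: flags each method starts with
INITIAL_FLAGS = {
  upcase:     CASE_UPCASE,
  downcase:   CASE_DOWNCASE,
  capitalize: CASE_TITLECASE | CASE_UPCASE,
  swapcase:   CASE_UPCASE | CASE_DOWNCASE
}.freeze

# Hypothetical helper: capitalize switches to downcasing once the
# first character is done; the other operations keep their flags.
def flags_after_first_character(flags)
  return flags if (flags & CASE_TITLECASE).zero?
  (flags & ~(CASE_TITLECASE | CASE_UPCASE)) | CASE_DOWNCASE
end

puts flags_after_first_character(INITIAL_FLAGS[:capitalize]) == CASE_DOWNCASE   # true
puts flags_after_first_character(INITIAL_FLAGS[:swapcase]) == INITIAL_FLAGS[:swapcase]   # true
```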
Flags also used for options:
:fold (for case folding; only on downcase)
:turkic
:lithuanian (not yet implemented)
:ascii
Corresponding flags:
#define ONIGENC_CASE_FOLD (1<<19) /* has/needs case folding */
#define ONIGENC_CASE_FOLD_TURKISH_AZERI (1<<20) /* Turkic */
#define ONIGENC_CASE_FOLD_LITHUANIAN (1<<21) /* Lithuanian */
#define ONIGENC_CASE_ASCII_ONLY (1<<22) /* limited to ASCII */
Handles string expansion (e.g. "ﬃ".upcase ⇒ "FFI")
Common to all casing operations
[1] 松本行弘, 縄手雅彦. スクリプト言語 Ruby の拡張可能な多言語テキスト処理の実装. 情報処理学会論文誌. 2005 Nov 15;46(11):2633-42. / Yukihiro Matsumoto and Masahiko Nawate: Multilingual Text Manipulation Method for Ruby Language. Journal of Information Processing (JIP); 2005 Nov 15; Vol. 46, No. 11, pp. 2633-42. (in Japanese)
(//i)
mbc_case_fold
apply_all_case_fold
get_case_fold_codes_by_str
⇒ New primitive
case_map Primitive
OnigCaseFoldType flags
case_map Primitive Examples:
"Résumé" (UTF-8).upcase calls onigenc_unicode_case_map in enc/unicode.c (most complex case)
OnigEncodingDefine in enc/utf_8.c
"Résumé" (UTF-16LE).upcase calls onigenc_unicode_case_map in enc/unicode.c
OnigEncodingDefine in enc/utf_16le.c
"Résumé" (ISO-8859-1).upcase calls case_map in enc/iso_8859_1.c
OnigEncodingDefine in the same file
onigenc_unicode_case_map
while loop, one source character at a time
ONIGENC_CASE_MODIFIED flag
if/else if/else
|/& with flag bits
gotos
gperf-created hash lookups: onigenc_unicode_fold_lookup, onigenc_unicode_unfold1_lookup
case_map Primitives
Students (sophomores/juniors/seniors) at Aoyama Gakuin University
For East Asian encodings (Shift_JIS, EUC-JP, GB2312, EUC-KR, Big-5, EUC-TW,...)
data could be shared between //i and case mapping
but case folding for //i only works for ASCII
None of the main Japanese committers thought this was needed anymore
Talk to me if you need it
downcase
upcase
in enc/unicode/9.0.0/casefold.h
/* before */
{0x0041, {1, {0x0061}}}, /* A → a */
{0x00df, {2, {0x0073, 0x0073}}}, /* ß → ss */
{0x01c4, {1, {0x01c6}}}, /* Ǆ → ǆ */
{0x01c5, {1, {0x01c6}}}, /* ǅ → ǆ */
{0xab73, {1, {0x13a3}}}, /* Ꭳ → ꭳ (Cherokee) */
/* after */
{0x0041, {1|F|D, {0x0061}}}, /* A → a */
{0x00df, {2|F|ST|SU|I(1), {0x0073, 0x0073}}}, /* ß → ss */
{0x01c4, {1|F|D|ST|I(8), {0x01c6}}}, /* Ǆ → ǆ */
{0x01c5, {1|F|D|IT|SU|I(9), {0x01c6}}}, /* ǅ → ǆ */
{0xab73, {1|F|U, {0x13a3}}}, /* Ꭳ → ꭳ (Cherokee) */
(squeezed into an int where only 2 bits were used), see enc/unicode.c
/* data is available here */
/* (flags are the same as for options) */
#define U ONIGENC_CASE_UPCASE
#define D ONIGENC_CASE_DOWNCASE
#define F ONIGENC_CASE_FOLD
/* data is in special additional array */
#define ST ONIGENC_CASE_TITLECASE
#define SU ONIGENC_CASE_UP_SPECIAL
#define SL ONIGENC_CASE_DOWN_SPECIAL
#define IT ONIGENC_CASE_IS_TITLECASE
/* index into special array (size: around 420 words only) */
#define I(n) OnigSpecialIndexEncode(n)
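The idea of squeezing a code point count, flags and an index into one integer can be sketched in Ruby as follows. The flag values are those given in the slides; the layout of the low bits (3 bits for the count, 10 bits for the index) and all names in the module `PackedEntry` are assumptions made for this illustration, not a description of the actual macros.

```ruby
module PackedEntry                  # hypothetical name
  # Flag values as in the #define lines of the slides
  UPCASE    = 1 << 13
  DOWNCASE  = 1 << 14
  TITLECASE = 1 << 15
  FOLD      = 1 << 19

  # Assumed layout of the low bits (illustration only)
  COUNT_MASK  = 0b111               # number of code points in the mapping
  INDEX_SHIFT = 3                   # index into the special array
  INDEX_MASK  = ((1 << 10) - 1) << INDEX_SHIFT

  def self.pack(count, flags, index = 0)
    count | flags | (index << INDEX_SHIFT)
  end

  def self.count(entry)
    entry & COUNT_MASK
  end

  def self.index(entry)
    (entry & INDEX_MASK) >> INDEX_SHIFT
  end

  def self.flag?(entry, flag)
    (entry & flag) != 0
  end
end

# Analogous to {1|F|D|ST|I(8), ...} for U+01C4
flags = PackedEntry::FOLD | PackedEntry::DOWNCASE | PackedEntry::TITLECASE
entry = PackedEntry.pack(1, flags, 8)

puts PackedEntry.count(entry)                          # 1
puts PackedEntry.index(entry)                          # 8
puts PackedEntry.flag?(entry, PackedEntry::DOWNCASE)   # true
puts PackedEntry.flag?(entry, PackedEntry::UPCASE)     # false
```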
(or my attempt at using the Takahashi method)
upcase
downcase
capitalize
swapcase
swapcase?
Well, I did, when testing swapcase!
swapcase?
Python has it?! (Matz)
To revert accidental Caps Lock output?! (on Unicode list)
swapcase
"ǅunGLA".swapcase
⇒ "ǅUNgla"
preferred by Unicode Consortium (never ever need any new standardization)
preserves reversibility (X.swapcase.swapcase == X)
"ǅunGLA".swapcase
⇒ "ǄUNgla"
"ǅunGLA".swapcase
⇒ "ǆUNgla"
"ǅunGLA".swapcase
⇒ "dŽUNgla"
proposed by Nobuyoshi Nakada
"dŽUNgla"
useless?, but 'correct'
additional effort for implementation
additional effort for testing
(April Fools' Day)
Japan Time 20:58:33 ⇒ same date in most timezones
please draw your own conclusions
Files:
test/ruby/enc/test_case_options.rb
test/ruby/enc/test_case_mapping.rb
Files:
test/ruby/enc/test_case_comprehensive.rb
413 tests, 2'212'391 assertions, 0 failures, 0 errors, 0 skips
:lithuanian (because last to be actually implemented)
Regexp
'Юに코δ' =~ /\p{Hiragana}/
'Ю'?
What I want:
loc = Locale.new 'de-CH'
(German as used in Switzerland)
1.2345678E5.to_s
⇒ "123456.78"
1.2345678E5.to_s(loc)
⇒
"123'456,78"
Internationalization support in libraries:
UnicodeUtils.nfkc string
ActiveSupport::Multibyte::Chars.new(string).normalize :kc
TwitterCldr::Normalization::NFKC.normalize string
string.unicode_normalize :nfkc
Libraries avoid monkey patching
⇒ not Ruby-like (using libraries does not feel like Ruby)
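Of the four calls listed, only the last one is built into Ruby (since 2.2) and can be run without installing a library. A minimal sketch, using `\u` escapes so that composed and decomposed forms are unambiguous:

```ruby
ligature   = "\uFB03"     # ﬃ as a single character
decomposed = "e\u0301"    # e followed by a combining acute accent

puts ligature.unicode_normalize(:nfkc)                  # ffi
puts decomposed.unicode_normalize(:nfc) == "\u00E9"     # true: composed é
puts decomposed.unicode_normalized?(:nfc)               # false
```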
Possible solution:
loc = Locale.new 'tr'
'Türkiye'.upcase loc
⇒ 'TÜRKİYE'
//i case folding: all encodings not at end of test/ruby/enc/test_regex_casefold.rb)
Regexp data
More information about case conversion implementation internals:
http://www.sw.it.aoyama.ac.jp/2016/pub/RubyKaigi/
(video at http://rubykaigi.org/2016/presentations/duerst.html)
Send questions and comments to Martin Dürst
(mailto:duerst@it.aoyama.ac.jp)
or open a bug report or feature request
for Ruby
The latest version of this presentation is available at: