http://www.sw.it.aoyama.ac.jp/2016/pub/RubyKaigi/
(Please call me Martin)
© 2016 Martin J. Dürst, Aoyama Gakuin University
Currently many of Ruby's String methods, such as upcase and downcase, are limited to ASCII and ignore the rest of the world. This is finally going to change in Ruby 2.4, where this functionality will be extended to cover full Unicode. You will get to know what will change, how your programs may be affected, and how these changes are implemented behind the scenes. We will also look at the overall state of internationalization functionality in Ruby, and potential future directions.
These slides have been created in HTML, for projection with Opera (≤12.17 Windows/Mac/Linux; use F11 to switch to projection mode). Texts in gray, like this one, are comments/notes which do not appear on the slides. Please note that depending on the browser and OS you use, some rare characters or special character combinations may not display as intended, but e.g. as empty boxes, question marks, or apart rather than composed.
Code is mostly green, monospace
puts "Hello Ruby!"
Variable parts are orange
puts "some string"
Encoding is indicated with a subscript
"Юに코δ"
UTF-8,
"ユニコード"
SJIS
Results are indicated with "⇒"
1 + 1
⇒ 2
Mainly in the following areas:
String#encode (Ruby 1.9)
String#unicode_normalize (Ruby 2.2)
String#upcase, ... (Ruby 2.4)
Year (y) | Ruby version (VRuby) | Unicode version (VUnicode) |
---|---|---|
 | published around Christmas | published in Summer |
2014 | 2.2 | 7.0.0 |
2015 | 2.3 | 8.0.0 |
2016 | 2.4 | 9.0.0 (as of yesterday) |
A note about Ruby versions and Unicode versions: The Ruby core team is very conservative (in my view too conservative) about introducing new Unicode versions as bug fixes. Updates to a new Unicode version therefore only happen with a new Ruby version.
RbConfig::CONFIG["UNICODE_VERSION"]
⇒ '9.0.0'
VUnicode = y - 2007
VRuby = 1.5 + VUnicode · 0.1
VUnicode = VRuby · 10 - 15
Don't extrapolate too far!
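The rule of thumb above can be checked against the table with a few lines of Ruby. This is only a sketch: the method names unicode_major_version and ruby_version are made up for illustration, and the Ruby version is computed in tenths to avoid floating-point noise.

```ruby
# Rule of thumb from the slide: VUnicode = y - 2007.
def unicode_major_version(year)
  year - 2007
end

# VRuby = 1.5 + VUnicode * 0.1, computed in integer tenths.
def ruby_version(year)
  tenths = 15 + unicode_major_version(year)
  "#{tenths / 10}.#{tenths % 10}"
end

[2014, 2015, 2016].each do |year|
  puts "#{year}: Ruby #{ruby_version(year)}, Unicode #{unicode_major_version(year)}.0.0"
end
# 2014: Ruby 2.2, Unicode 7.0.0
# 2015: Ruby 2.3, Unicode 8.0.0
# 2016: Ruby 2.4, Unicode 9.0.0
```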
'Ruby Kaigi 2016'.upcase
⇒ 'RUBY KAIGI 2016'
'Ruby Kaigi 2016'.downcase
⇒ 'ruby kaigi 2016'
'Ruby Kaigi 2016'.capitalize
⇒ 'Ruby kaigi 2016'
'Ruby Kaigi 2016'.swapcase
⇒ 'rUBY kAIGI 2016'
Up to Ruby 2.3 (ASCII only):
'Résumé ĭñŧėřŋãţijňőńæłĩżàťïōņ'.upcase
⇒ 'RéSUMé ĭñŧėřŋãţijňőńæłĩżàťïōņ'
'Юрий Соколов'.upcase (aka funny.falcon)
⇒ 'Юрий Соколов'
From Ruby 2.4 (full Unicode):
'Résumé ĭñŧėřŋãţijňőńæłĩżàťïōņ'.upcase
⇒ 'RÉSUMÉ ĬÑŦĖŘŊÃŢIJŇŐŃÆŁĨŻÀŤÏŌŅ'
'Юрий Соколов'.upcase
⇒ 'ЮРИЙ СОКОЛОВ'
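A short script along the following lines can be used to see the new behaviour for all four methods. It assumes Ruby 2.4 or later; on earlier versions the non-ASCII characters come back unchanged.

```ruby
# Requires Ruby 2.4 or later for the non-ASCII results shown in the comments.
name = 'Юрий Соколов'

puts name.upcase      # ЮРИЙ СОКОЛОВ
puts name.downcase    # юрий соколов
puts name.capitalize  # Юрий соколов
puts name.swapcase    # юРИЙ сОКОЛОВ
puts 'Résumé'.upcase  # RÉSUMÉ
```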
(details vary by language)
German:
"der Gefangene floh" (the prisoner fled),
but
"der gefangene Floh" (the captive flea)
(Regexp: //i)
'Résumé'.upcase
⇒ 'RéSUMé'
'Résumé'.upcase :unicode
⇒ 'RÉSUMÉ'
grep your code base for upcase and friends
E.g. DNS servers
(but you used Encoding::ASCII_8BIT there anyway, didn't you?)
downcased with Ruby 2.3 to help users (Соколов in DB)
:ascii Option
Use if you find a case where you really don't want to convert non-ASCII characters
'Résumé ĭñŧėřŋãţijňőńæłĩżàťïōņ'.upcase
⇒ 'RÉSUMÉ ĬÑŦĖŘŊÃŢIJŇŐŃÆŁĨŻÀŤÏŌŅ'
'Résumé ĭñŧėřŋãţijňőńæłĩżàťïōņ'.upcase :ascii
⇒ 'RéSUMé ĭñŧėřŋãţijňőńæłĩżàťïōņ'
'Юрий Соколов'.upcase
⇒ 'ЮРИЙ СОКОЛОВ'
'Юрий Соколов'.upcase :ascii
⇒ 'Юрий Соколов'
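The :ascii option can be tried side by side with the default as follows (Ruby 2.4 or later assumed). With :ascii, only the letters A-Z and a-z are touched.

```ruby
word = 'Résumé'

puts word.upcase           # RÉSUMÉ   (full Unicode, the new default)
puts word.upcase(:ascii)   # RéSUMé   (old ASCII-only behaviour)

# A string without ASCII letters is left alone by :ascii.
puts 'Юрий Соколов'.downcase(:ascii)  # Юрий Соколов
```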
Data and other specifications available from the Unicode Consortium:
UnicodeData.txt
CaseFolding.txt
SpecialCasing.txt
'ß'.upcase
⇒ 'SS'
(German sz/sharp s)
'ﬃ'.upcase
⇒ "FFI"
(ffi ligature)
'ß'.upcase.downcase
⇒ 'ss' (not 'ß')
'σ'.upcase
⇒ 'Σ'
(Greek sigma)
'ς'.upcase
⇒ 'Σ'
(Greek final sigma)
'ς'.upcase.downcase
⇒ 'σ' (not 'ς')
'Σ'.downcase should be context-dependent
⇒ Not implemented!
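The special cases above can be reproduced directly (Ruby 2.4 or later assumed). The ligature is written as an escape so that it survives copy and paste; the round trips show that case conversion is not reversible in general.

```ruby
sharp_s  = 'ß'
ligature = "\uFB03"   # the single-character ffi ligature
final_s  = 'ς'        # Greek final sigma

puts sharp_s.upcase            # SS
puts sharp_s.upcase.downcase   # ss  (the original ß is gone)
puts ligature.upcase           # FFI (one character becomes three)
puts final_s.upcase            # Σ
puts final_s.upcase.downcase   # σ   (always the non-final form)
```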
Most languages:
'i'.upcase
⇒ 'I'
'I'.downcase
⇒ 'i'
Turkic languages (Turkish, Azeri):
'i'.upcase
⇒ 'İ'
(uppercase I with dot)
'İ'.downcase
⇒ 'i'
'ı'.upcase
⇒ 'I'
(i without dot)
'I'.downcase
⇒ 'ı'
'Türkiye'.upcase :turkic
⇒ 'TÜRKİYE'
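The difference between the default mapping and the :turkic option can be seen with the same inputs (Ruby 2.4 or later assumed):

```ruby
puts 'Türkiye'.upcase            # TÜRKIYE  (default: plain I)
puts 'Türkiye'.upcase(:turkic)   # TÜRKİYE  (Turkic: I with dot)

puts 'I'.downcase                # i
puts 'I'.downcase(:turkic)       # ı  (dotless i)
puts 'İ'.downcase(:turkic)       # i
puts 'ı'.upcase(:turkic)         # I
```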
'Í'.downcase
⇒ 'í'
(accent replaces dot)
'Í'.downcase :lithuanian
⇒ 'í'
(accent above visible dot)
upcase/downcase/capitalize/swapcase
downcase
downcase
'ß'.downcase :fold
⇒ 'ss'
'ﬃ'.downcase :fold
⇒ 'ffi'
'ς'.downcase :fold
⇒ 'σ'
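Case folding is mainly useful for caseless comparison. The following sketch (Ruby 2.4 or later assumed) compares two spellings of a German word after folding; the variable names are only illustrative.

```ruby
a = 'Straße'
b = 'STRASSE'

puts a.downcase                   # straße
puts a.downcase(:fold)            # strasse
puts a.downcase(:fold) == b.downcase(:fold)   # true

puts "\uFB03".downcase(:fold)     # ffi  (ligature folded to three letters)
puts 'ς'.downcase(:fold)          # σ
```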
capitalize
"ǆungla".capitalize
⇒ "Ǆungla" (first character uppercased: DŽ)
"ǆungla".capitalize
⇒ "ǅungla" (first character titlecased: Dž)
Not implemented (yet?)
tr('ABC...', 'abc...')
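The titlecase behaviour of capitalize shown earlier can be checked with explicit code points, because the three digraph characters are hard to tell apart on screen. This sketch assumes Ruby 2.4 or later.

```ruby
lower = "\u01C6"   # ǆ  LATIN SMALL LETTER DZ WITH CARON
upper = "\u01C4"   # Ǆ  LATIN CAPITAL LETTER DZ WITH CARON
title = "\u01C5"   # ǅ  capital D with small z with caron (titlecase form)

word = lower + 'ungla'

puts word.upcase      == upper + 'UNGLA'   # true
puts word.capitalize  == title + 'ungla'   # true: titlecase, not uppercase
```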
String (functional) | String (destructive) | Symbol |
---|---|---|
upcase | upcase! | upcase |
downcase | downcase! | downcase |
capitalize | capitalize! | capitalize |
swapcase | swapcase! | swapcase |
Not dealt with: String#casecmp
Why: Includes sorting, very difficult for all of Unicode
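The three columns of the table above behave slightly differently, which the following lines illustrate (Ruby 2.4 or later assumed). Note that the destructive variants return nil when nothing changed.

```ruby
word = 'résumé'

upper = word.upcase        # functional: new string, word is unchanged
copy  = word.dup
same  = copy.upcase!       # destructive: modifies copy and returns it
none  = 'ABC'.dup.upcase!  # destructive: nil, because nothing changed

puts upper                 # RÉSUMÉ
puts word                  # résumé
puts same.equal?(copy)     # true
puts none.inspect          # nil
puts :résumé.upcase.inspect
```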
string.c: Init_String
Tells Ruby what C functions to use for each method
rb_define_method(rb_cString, "upcase", rb_str_upcase, -1);
rb_define_method(rb_cString, "downcase", rb_str_downcase, -1);
rb_define_method(rb_cString, "capitalize", rb_str_capitalize, -1);
rb_define_method(rb_cString, "swapcase", rb_str_swapcase, -1);
rb_define_method(rb_cString, "upcase!", rb_str_upcase_bang, -1);
rb_define_method(rb_cString, "downcase!", rb_str_downcase_bang, -1);
rb_define_method(rb_cString, "capitalize!", rb_str_capitalize_bang, -1);
rb_define_method(rb_cString, "swapcase!", rb_str_swapcase_bang, -1);
rb_define_method(rb_cSymbol, "upcase", sym_upcase, -1);
rb_define_method(rb_cSymbol, "downcase", sym_downcase, -1);
rb_define_method(rb_cSymbol, "capitalize", sym_capitalize, -1);
rb_define_method(rb_cSymbol, "swapcase", sym_swapcase, -1);
string.c: sym_upcase
static VALUE
sym_upcase(int argc, VALUE *argv, VALUE sym)
{
    return rb_str_intern(rb_str_upcase(argc, argv, rb_sym2str(sym)));
}
Equivalent in Ruby:
class Symbol
  def upcase(*args)
    to_s.upcase(*args).to_sym
  end
end
string.c: rb_str_upcase
static VALUE
rb_str_upcase(int argc, VALUE *argv, VALUE str)
{
    str = rb_str_dup(str);
    rb_str_upcase_bang(argc, argv, str);
    return str;
}
Equivalent in Ruby:
class String
  def upcase(*args)
    copy = dup
    copy.upcase!(*args) # returns nil if nothing changed, so return the copy explicitly
    copy
  end
end
string.c: rb_str_upcase_bang
Here the real work starts
static VALUE
rb_str_upcase_bang(int argc, VALUE *argv, VALUE str)
{
    ...
    /* set flags for upcase */
    OnigCaseFoldType flags = ONIGENC_CASE_UPCASE;
    /* check options */
    flags = check_case_options(argc, argv, flags);
    ...
    /* shortcuts for ASCII-only */
    if (...) {
        ...
    }
    else if (flags&ONIGENC_CASE_ASCII_ONLY)
        rb_str_ascii_casemap(str, &flags, enc);
    else /* actual hard work */
        str_shared_replace(str, rb_str_casemap(str, &flags, enc));
    if (ONIGENC_CASE_MODIFIED&flags) return str;
    return Qnil;
}
include/ruby/oniguruma.h: OnigCaseFoldType
Flags used to indicate operation needed (upcase/downcase/capitalize/swapcase):
#define ONIGENC_CASE_UPCASE (1<<13) /* uppercase mapping */
#define ONIGENC_CASE_DOWNCASE (1<<14) /* lowercase mapping */
#define ONIGENC_CASE_TITLECASE (1<<15) /* titlecase mapping */
Usage to indicate operation type:
upcase: ONIGENC_CASE_UPCASE (upcasing needed)
downcase: ONIGENC_CASE_DOWNCASE (downcasing needed)
capitalize: ONIGENC_CASE_TITLECASE | ONIGENC_CASE_UPCASE (changed to ONIGENC_CASE_DOWNCASE after first character)
swapcase: ONIGENC_CASE_UPCASE | ONIGENC_CASE_DOWNCASE (both upcasing and downcasing needed)
string.c: check_case_options
Common to upcase/downcase/capitalize/swapcase
Checks case options, sets flags, or produces error messages
Possible options:
:fold (for case folding; only on downcase)
:turkic
:lithuanian (not yet implemented; usable with :turkic)
:ascii
Corresponding flags:
#define ONIGENC_CASE_FOLD (1<<19) /* has/needs case folding */
#define ONIGENC_CASE_FOLD_TURKISH_AZERI (1<<20) /* Turkic */
#define ONIGENC_CASE_FOLD_LITHUANIAN (1<<21) /* Lithuanian */
#define ONIGENC_CASE_ASCII_ONLY (1<<22) /* limited to ASCII */
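The option checking described above could be modelled in Ruby roughly as follows. This is a simplified sketch, not the C implementation: the constant names are shortened, the error messages are illustrative, and :fold is simply OR-ed into the flags. Only the bit values are taken from the slides.

```ruby
# Bit values as listed on the slides (include/ruby/oniguruma.h).
CASE_UPCASE        = 1 << 13
CASE_DOWNCASE      = 1 << 14
CASE_FOLD          = 1 << 19
CASE_TURKISH_AZERI = 1 << 20
CASE_LITHUANIAN    = 1 << 21
CASE_ASCII_ONLY    = 1 << 22

# Hypothetical model: takes the option symbols and the operation flags,
# returns the combined flags or raises ArgumentError.
def check_case_options(options, flags)
  raise ArgumentError, 'too many options' if options.size > 2
  raise ArgumentError, 'duplicate option' if options.uniq.size != options.size

  options.each do |option|
    case option
    when :turkic     then flags |= CASE_TURKISH_AZERI
    when :lithuanian then flags |= CASE_LITHUANIAN
    when :ascii, :fold
      # :ascii and :fold cannot be combined with anything else
      raise ArgumentError, 'too many options' if options.size > 1
      if option == :ascii
        flags |= CASE_ASCII_ONLY
      elsif (flags & (CASE_UPCASE | CASE_DOWNCASE)) == CASE_DOWNCASE
        flags |= CASE_FOLD
      else
        raise ArgumentError, 'option :fold only allowed for downcasing'
      end
    else
      raise ArgumentError, "invalid option: #{option.inspect}"
    end
  end
  flags
end

flags = check_case_options([:turkic, :lithuanian], CASE_DOWNCASE)
puts (flags & CASE_TURKISH_AZERI) != 0   # true
```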
string.c: rb_str_casemap
Handles string expansion (e.g. "ﬃ".upcase ⇒ "FFI")
Common to all casing operations
static VALUE
rb_str_casemap(VALUE source, OnigCaseFoldType *flags, rb_encoding *enc)
{
    /* general preparations */
    while (source_current < source_end) {
        /* calculate next buffer length */
        size_t capa = (source_end - source_current) * ++buffer_count + 20;
        ... /* prepare and link next buffer */
        buffer_length_or_invalid = enc->case_map(flags,
                &source_current, source_end,
                current_buffer->space,
                current_buffer->space+current_buffer->capa,
                enc);
        ... /* check for errors (invalid input string) */
    }
    /* prepare for copy to final location */
    while (...)
        /* copy current_buffer and move to next buffer */
    /* cleanup and return */
}
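Why the output cannot simply reuse the input buffer is easy to see from Ruby: the result may need more bytes than the source. The sketch below shows this for UTF-16LE (Ruby 2.4 or later assumed) and restates the capacity formula from the C code; next_buffer_capacity is a made-up helper name.

```ruby
# Capacity of the n-th buffer, following the formula in the C code above:
# remaining source bytes * buffer number + 20.
def next_buffer_capacity(remaining_bytes, buffer_number)
  remaining_bytes * buffer_number + 20
end

source = 'ß'.encode('UTF-16LE')
result = source.upcase

puts source.bytesize                          # 2
puts result.bytesize                          # 4
puts next_buffer_capacity(source.bytesize, 1) # 22
```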
enc->case_map and Encoding Primitives
enc is the encoding of our string, a struct OnigEncodingTypeST
case_map is a function pointer, an encoding primitive [1]
"Résumé".upcase (string in UTF-8) calls onigenc_unicode_case_map in enc/unicode.c
OnigEncodingDefine in enc/utf_8.c
"Résumé".upcase (string in UTF-16LE) calls onigenc_unicode_case_map in enc/unicode.c
OnigEncodingDefine in enc/utf_16le.c
"Résumé".upcase (string in ISO-8859-1) calls case_map in enc/iso_8859_1.c
OnigEncodingDefine in the same file
[1] 松本行弘, 縄手雅彦. スクリプト言語 Ruby の拡張可能な多言語テキスト処理の実装. 情報処理学会論文誌. 2005 Nov 15;46(11):2633-42. / Yukihiro Matsumoto and Masahiko Nawate: Multilingual Text Manipulation Method for Ruby Language. Journal of Information Processing (JIP); 2005 Nov 15; Vol. 46, No. 11, pp. 2633-42. (in Japanese)
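From the Ruby side, the dispatch through the encoding primitive is invisible: the same call works whatever the encoding of the receiver. A small check along these lines (Ruby 2.4 or later assumed) runs the three cases from the slide:

```ruby
utf8 = 'Résumé'

%w[UTF-8 UTF-16LE ISO-8859-1].each do |name|
  upcased = utf8.encode(name).upcase
  puts "#{name}: #{upcased.encode('UTF-8')} (#{upcased.bytesize} bytes)"
end
# UTF-8: RÉSUMÉ (8 bytes)
# UTF-16LE: RÉSUMÉ (12 bytes)
# ISO-8859-1: RÉSUMÉ (6 bytes)
```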
//i
mbc_case_fold
apply_all_case_fold
get_case_fold_codes_by_str
⇒ New primitive
case_map Primitive
OnigCaseFoldType flags
case_map Primitive (enc/iso_8859_1.c)
static int
case_map(...)
{
    /* initializations */
    while (*pp<end && to<to_end) {
        code = *(*pp)++;
        if (code==SHARP_s)
            /* German ß special case */
        else if (/* have upper case && want lower case */)
            code += 0x20, flags |= ONIGENC_CASE_MODIFIED;
        else if (/* lower without upper: ª, º, µ, ÿ */)
            ;
        else if ((EncISO_8859_1_CtypeTable[code]&BIT_CTYPE_LOWER)
                 && (flags&ONIGENC_CASE_UPCASE))
            code -= 0x20, flags |= ONIGENC_CASE_MODIFIED;
        *to++ = code;
        if (flags&ONIGENC_CASE_TITLECASE)
            /* titlecase → lowercase for capitalize */
    }
    /* cleanup and return */
}
Good example when creating a new primitive!
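To make the structure of such a primitive easier to follow, here is a rough Ruby model of the ISO-8859-1 loop. It is a sketch under several assumptions: the flag values, the helper names and the handling of ß under capitalize ('Ss') are chosen for illustration and are not taken from the C source.

```ruby
# Hypothetical flag values for this sketch only.
UPCASE    = 1
DOWNCASE  = 2
TITLECASE = 4

SHARP_S = 0xDF                                          # ß
LOWER_WITHOUT_UPPER = [0xAA, 0xBA, 0xB5, 0xFF].freeze   # ª, º, µ, ÿ

def latin1_upper?(code)
  (0x41..0x5A).cover?(code) || ((0xC0..0xDE).cover?(code) && code != 0xD7)
end

def latin1_lower?(code)
  (0x61..0x7A).cover?(code) || ((0xE0..0xFE).cover?(code) && code != 0xF7)
end

def latin1_case_map(string, flags)
  out = []
  string.each_byte do |code|
    if code == SHARP_S
      if (flags & UPCASE) != 0
        out << 0x53                                     # first 'S'
        code = (flags & TITLECASE) != 0 ? 0x73 : 0x53   # then 's' or 'S'
      end
    elsif latin1_upper?(code) && (flags & DOWNCASE) != 0
      code += 0x20
    elsif LOWER_WITHOUT_UPPER.include?(code)
      # no uppercase counterpart in ISO-8859-1: leave unchanged
    elsif latin1_lower?(code) && (flags & UPCASE) != 0
      code -= 0x20
    end
    out << code
    # capitalize: after the first character, everything is downcased
    flags = DOWNCASE if (flags & TITLECASE) != 0
  end
  out.pack('C*').force_encoding(Encoding::ISO_8859_1)
end

latin1 = 'Résumé ß'.encode('ISO-8859-1')
puts latin1_case_map(latin1, UPCASE).encode('UTF-8')   # RÉSUMÉ SS
```

As in the C code, swapcase falls out of setting both UPCASE and DOWNCASE, and capitalize out of switching the flags after the first character.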
case_map Primitives and Who Wrote Them
Students (sophomores/juniors/seniors) at Aoyama Gakuin University
For East Asian encodings
(Shift_JIS, EUC-JP, GB2312, EUC-KR, Big-5, EUC-TW, ...)
data could be shared between //i and case mapping,
but case folding for //i only works for ASCII
None of the main Japanese committers thought this was needed anymore
Talk to me if you need it
onigenc_unicode_case_map
while loop, one source character at a time
ONIGENC_CASE_MODIFIED flag
if/else if/else
|/& with flag bits
gotos
gperf-created hash lookups:
onigenc_unicode_fold_lookup
onigenc_unicode_unfold1_lookup
downcase
upcase
in enc/unicode/9.0.0/casefold.h
/* before */
{0x0041, {1, {0x0061}}}, /* A → a */
{0x00df, {2, {0x0073, 0x0073}}}, /* ß → ss */
{0x01c4, {1, {0x01c6}}}, /* DŽ → dž */
{0x01c5, {1, {0x01c6}}}, /* Dž → dž */
{0xab73, {1, {0x13a3}}}, /* Ꭳ → ꭳ (Cherokee) */
/* after */
{0x0041, {1|F|D, {0x0061}}}, /* A → a */
{0x00df, {2|F|ST|SU|I(1), {0x0073, 0x0073}}}, /* ß → ss */
{0x01c4, {1|F|D|ST|I(8), {0x01c6}}}, /* DŽ → dž */
{0x01c5, {1|F|D|IT|SU|I(9), {0x01c6}}}, /* Dž → dž */
{0xab73, {1|F|U, {0x13a3}}}, /* Ꭳ → ꭳ (Cherokee) */
(squeezed into an int where only 2 bits were used)
see enc/unicode.c
/* data is available here */
/* (flags are the same as for options) */
#define U ONIGENC_CASE_UPCASE
#define D ONIGENC_CASE_DOWNCASE
#define F ONIGENC_CASE_FOLD
/* data is in special additional array */
#define ST ONIGENC_CASE_TITLECASE
#define SU ONIGENC_CASE_UP_SPECIAL
#define SL ONIGENC_CASE_DOWN_SPECIAL
#define IT ONIGENC_CASE_IS_TITLECASE
/* index into special array (size: around 420 words only) */
#define I(n) OnigSpecialIndexEncode(n)
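The idea of squeezing flags into the integer that previously only held the code point count can be illustrated in Ruby. In this sketch the bit values for U, D and F are the ones given earlier in the slides; the two-bit count mask and the describe helper are assumptions made for illustration, and the special-array index I(n) is left out.

```ruby
U = 1 << 13   # ONIGENC_CASE_UPCASE
D = 1 << 14   # ONIGENC_CASE_DOWNCASE
F = 1 << 19   # ONIGENC_CASE_FOLD

# Assumption for this sketch: the code point count sits in the lowest two bits.
COUNT_MASK = 0b11

# Hypothetical helper that unpacks one table entry.
def describe(entry)
  {
    count:    entry & COUNT_MASK,
    fold:     (entry & F) != 0,
    upcase:   (entry & U) != 0,
    downcase: (entry & D) != 0
  }
end

p describe(1 | F | D)   # A → a: one code point, usable for folding and downcasing
p describe(1 | F | U)   # Cherokee: folding goes towards uppercase
p describe(2 | F)       # ß → ss: two code points, folding only
```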
(or my attempt at using the Takahashi method)
upcase
downcase
capitalize
swapcase
swapcase?
Well, I did, when testing swapcase!
swapcase?
Python has it?! (Matz)
To revert accidental Caps Lock output?! (on Unicode list)
swapcase
"ǅunGLA".swapcase
"ǅUNgla"
preferred by Unicode Consortium
(never ever need any new standardization)
preserves reversibility
(X.swapcase.swapcase == X)
"ǅunGLA".swapcase
"ǄUNgla"
"ǅunGLA".swapcase
"ǆUNgla"
"ǅunGLA".swapcase
"dŽUNgla"
proposed by Nobu (Nakada-san)
"dŽUNgla"
useless?, but 'correct'
additional effort for implementation
additional effort for testing
(April Fools' Day)
Japan Time 20:58:33 ⇒ same date in most timezones
please draw your own conclusions
Files:
test/ruby/enc/test_case_options.rb
test/ruby/enc/test_case_mapping.rb
Files:
test/ruby/enc/test_case_comprehensive.rb
413 tests, 2212391 assertions, 0 failures, 0 errors, 0 skips
:lithuanian (because last to be actually implemented)
Regexp
'Юに코δ' =~ /\p{Hiragana}/
'Ю'?
What I want:
loc = Locale.new 'de-CH'
(German as used in Switzerland)
1.2345678E5.to_s
⇒ "123456.78"
1.2345678E5.to_s(loc)
⇒ "123'456,78"
Internationalization support in libraries:
UnicodeUtils.nfkc string
ActiveSupport::Multibyte::Chars.new(string).normalize :kc
TwitterCldr::Normalization::NFKC.normalize string
string.unicode_normalize :nfkc
Libraries avoid monkey patching
⇒ not Ruby-like (using a library does not feel like Ruby)
Possible solution:
loc = Locale.new 'tr'
'Türkiye'.upcase loc
⇒ 'TÜRKİYE'
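Of the normalization calls listed above, only the last one is built into Ruby (since 2.2) and needs no library. A minimal check, using the ffi ligature as input:

```ruby
ligature = "\uFB03"   # the single-character ffi ligature

normalized = ligature.unicode_normalize(:nfkc)

puts normalized          # ffi
puts ligature.length     # 1
puts normalized.length   # 3
```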
//i case folding: all encodings not at end of test/ruby/enc/test_regex_casefold.rb
Regexp data
Send questions and comments to Martin Dürst (mailto:duerst@it.aoyama.ac.jp) or open a bug report or feature request
The latest version of this presentation is available at: