http://www.sw.it.aoyama.ac.jp/2016/pub/RubyKaigi/
(Please call me Martin)
© 2016 Martin J. Dürst, Aoyama Gakuin University
Currently many of Ruby's String methods, such as upcase and downcase, are limited to ASCII and ignore the rest of the world. This is finally going to change in Ruby 2.4, where this functionality will be extended to cover full Unicode. You will get to know what will change, how your programs may be affected, and how these changes are implemented behind the scenes. We will also look at the overall state of internationalization functionality in Ruby, and potential future directions.
These slides have been created in HTML, for projection with Opera (≤12.17 Windows/Mac/Linux; use F11 to switch to projection mode). Text in gray, like this, is a comment or note that does not appear on the slides. Please note that depending on the browser and OS you use, some rare characters or special character combinations may not display as intended, but e.g. as empty boxes, question marks, or apart rather than composed.
Code is mostly green, monospace
puts "Hello Ruby!"
Variable parts are orange
puts "some string"
Encoding is indicated with a subscript
"Юに코δ"UTF-8,
"ユニコード"SJIS
Results are indicated with "⇒"
1 + 1 ⇒ 2
Mainly in the following areas:
String#encode (Ruby 1.9)
String#unicode_normalize (Ruby 2.2)
String#upcase,... (Ruby 2.4)
| Year (y) | Ruby version (VRuby) | Unicode version (VUnicode) |
|---|---|---|
| | published around Christmas | published in Summer |
| 2014 | 2.2 | 7.0.0 |
| 2015 | 2.3 | 8.0.0 |
| 2016 | 2.4 | 9.0.0 (as of yesterday) |
A note about Ruby versions and Unicode versions: The Ruby core team is very conservative (in my view too conservative) about introducing new Unicode versions as bug fixes. Updates to new Unicode versions therefore only happen with new Ruby versions.
RbConfig::CONFIG["UNICODE_VERSION"] ⇒ '9.0.0'
VUnicode = y - 2007
VRuby = 1.5 + VUnicode · 0.1
VUnicode = VRuby · 10 - 15
Don't extrapolate too far!
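The rule of thumb above can be checked against the table with a few lines of Ruby. This is only a sketch: the two helper methods are made up for illustration, and the only real API used is RbConfig::CONFIG["UNICODE_VERSION"], which should be available from Ruby 2.4 on.

```ruby
require 'rbconfig'

# Hypothetical helpers that encode the rule of thumb from the slide.
def unicode_major_for_year(year)
  year - 2007
end

def ruby_version_for_unicode_major(unicode_major)
  (1.5 + unicode_major * 0.1).round(1)
end

# Rows of the table: year => [Ruby version, Unicode major version]
{ 2014 => [2.2, 7], 2015 => [2.3, 8], 2016 => [2.4, 9] }.each do |year, (ruby, unicode)|
  raise "Unicode mismatch for #{year}" unless unicode_major_for_year(year) == unicode
  raise "Ruby mismatch for #{year}" unless ruby_version_for_unicode_major(unicode) == ruby
end

# The Unicode version the running Ruby was built with:
puts RbConfig::CONFIG['UNICODE_VERSION']
```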
'Ruby Kaigi 2016'.upcase ⇒ 'RUBY KAIGI 2016'
'Ruby Kaigi 2016'.downcase ⇒ 'ruby kaigi 2016'
'Ruby Kaigi 2016'.capitalize ⇒ 'Ruby kaigi 2016'
'Ruby Kaigi 2016'.swapcase ⇒ 'rUBY kAIGI 2016'
Up to Ruby 2.3:
'Résumé ĭñŧėřŋãţijňőńæłĩżàťïōņ'.upcase ⇒ 'RéSUMé ĭñŧėřŋãţijňőńæłĩżàťïōņ'
'Юрий Соколов'.upcase ⇒ 'Юрий Соколов' (aka funny.falcon)
From Ruby 2.4:
'Résumé ĭñŧėřŋãţijňőńæłĩżàťïōņ'.upcase ⇒ 'RÉSUMÉ ĬÑŦĖŘŊÃŢIJŇŐŃÆŁĨŻÀŤÏŌŅ'
'Юрий Соколов'.upcase ⇒ 'ЮРИЙ СОКОЛОВ'
(details vary by language)
German:
der Gefangene floh (the prisoner fled),
but
der gefangene Floh (the captive flea)
Regexp: //i)
'Résumé'.upcase ⇒ 'RéSUMé'
'Résumé'.upcase :unicode ⇒ 'RÉSUMÉ'
grep your code base for upcase and friends
E.g. DNS servers (but you used Encoding::ASCII_8BIT there anyway, didn't you)
downcased with Ruby 2.3 to help users (Соколов in DB)
:ascii Option
Use if you find a case where you really don't want to convert non-ASCII characters
'Résumé ĭñŧėřŋãţijňőńæłĩżàťïōņ'.upcase ⇒ 'RÉSUMÉ ĬÑŦĖŘŊÃŢIJŇŐŃÆŁĨŻÀŤÏŌŅ'
'Résumé ĭñŧėřŋãţijňőńæłĩżàťïōņ'.upcase :ascii ⇒ 'RéSUMé ĭñŧėřŋãţijňőńæłĩżàťïōņ'
'Юрий Соколов'.upcase ⇒ 'ЮРИЙ СОКОЛОВ'
'Юрий Соколов'.upcase :ascii ⇒ 'Юрий Соколов'
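The difference between the default behaviour and the :ascii option can be tried directly. A minimal sketch, assuming Ruby 2.4 or later and a UTF-8 source file; the variable names are arbitrary.

```ruby
mixed = 'Résumé'
name  = 'Юрий Соколов'

p mixed.upcase          # full Unicode case mapping
p mixed.upcase(:ascii)  # only the letters a-z are converted
p name.upcase
p name.upcase(:ascii)   # no ASCII letters, so nothing changes
```

If existing code relies on the old ASCII-only behaviour, adding :ascii at the call site is probably the smallest possible change.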
Data and other specifications available from the Unicode Consortium:
UnicodeData.txt
CaseFolding.txt
SpecialCasing.txt
'ß'.upcase ⇒ 'SS' (German sz/sharp s)
'ﬃ'.upcase ⇒ "FFI" (ffi ligature)
'ß'.upcase.downcase ⇒ 'ss' (not 'ß')
'σ'.upcase ⇒ 'Σ' (Greek sigma)
'ς'.upcase ⇒ 'Σ' (Greek final sigma)
'ς'.upcase.downcase ⇒ 'σ' (not 'ς')
'Σ'.downcase should be context-dependent ⇒ Not implemented!
'i'.upcase ⇒ 'I'
'I'.downcase ⇒ 'i'
In Turkic languages (with the :turkic option):
'i'.upcase ⇒ 'İ' (uppercase I with dot)
'İ'.downcase ⇒ 'i'
'ı'.upcase ⇒ 'I' (i without dot)
'I'.downcase ⇒ 'ı'
'Türkiye'.upcase :turkic ⇒ 'TÜRKİYE'
In Lithuanian:
'Í'.downcase ⇒ 'í' (accent replaces dot)
'Í'.downcase :lithuanian ⇒ 'í' (accent above visible dot)
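The special cases listed above can be reproduced as follows. A small sketch, assuming Ruby 2.4 or later; the ffi ligature is written as "\uFB03" so that it cannot be confused with the three separate letters.

```ruby
# One-to-many mappings: the result can have more characters than the input.
p 'ß'.upcase            # German sharp s
p "\uFB03".upcase       # ffi ligature (U+FB03)

# Round trips are not always the identity.
p 'ß'.upcase.downcase
p 'ς'.upcase.downcase   # final sigma comes back as an ordinary sigma

# Turkic dotted and dotless i need an explicit option.
p 'Türkiye'.upcase
p 'Türkiye'.upcase(:turkic)
p 'I'.downcase(:turkic)
```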
upcase/downcase/capitalize/swapcase
downcase with :fold:
'ß'.downcase :fold ⇒ 'ss'
'ﬃ'.downcase :fold ⇒ 'ffi'
'ς'.downcase :fold ⇒ 'σ'
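Case folding is mainly useful for caseless comparison. The helper below is hypothetical (it is not part of Ruby) and only sketches how downcase with :fold could be used for that purpose, assuming Ruby 2.4 or later.

```ruby
# Hypothetical helper: caseless comparison built on downcase(:fold).
def same_ignoring_case?(a, b)
  a.downcase(:fold) == b.downcase(:fold)
end

p 'ß'.downcase          # already lowercase, stays as it is
p 'ß'.downcase(:fold)   # folded to two letters
p same_ignoring_case?('Straße', 'STRASSE')
p same_ignoring_case?('ΣΟΦΟΣ', 'σοφος')
```

Plain downcase would not be enough here, because 'ß' and 'ς' are already lowercase and would not be touched.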
capitalize
"ǆungla".capitalize ⇒ "ǅungla" (not "Ǆungla"; ǆ is the single character U+01C6, and ǅ is its titlecase form)
Not implemented (yet?)
tr('ABC...', 'abc...')
| String (functional) | String (destructive) | Symbol |
|---|---|---|
| upcase | upcase! | upcase |
| downcase | downcase! | downcase |
| capitalize | capitalize! | capitalize |
| swapcase | swapcase! | swapcase |
Not dealt with: String#casecmp
Why: Includes sorting, very difficult for all of Unicode
string.c: Init_String
Tells Ruby what C functions to use for each method
rb_define_method(rb_cString, "upcase", rb_str_upcase, -1);
rb_define_method(rb_cString, "downcase", rb_str_downcase, -1);
rb_define_method(rb_cString, "capitalize", rb_str_capitalize, -1);
rb_define_method(rb_cString, "swapcase", rb_str_swapcase, -1);
rb_define_method(rb_cString, "upcase!", rb_str_upcase_bang, -1);
rb_define_method(rb_cString, "downcase!", rb_str_downcase_bang, -1);
rb_define_method(rb_cString, "capitalize!", rb_str_capitalize_bang, -1);
rb_define_method(rb_cString, "swapcase!", rb_str_swapcase_bang, -1);
rb_define_method(rb_cSymbol, "upcase", sym_upcase, -1);
rb_define_method(rb_cSymbol, "downcase", sym_downcase, -1);
rb_define_method(rb_cSymbol, "capitalize", sym_capitalize, -1);
rb_define_method(rb_cSymbol, "swapcase", sym_swapcase, -1);
string.c: sym_upcase
static VALUE
sym_upcase(int argc, VALUE *argv, VALUE sym)
{
    return rb_str_intern(rb_str_upcase(argc, argv, rb_sym2str(sym)));
}
Equivalent in Ruby:
class Symbol
  def upcase(*args)
    to_s.upcase(*args).to_sym
  end
end
string.c: rb_str_upcase
static VALUE
rb_str_upcase(int argc, VALUE *argv, VALUE str)
{
    str = rb_str_dup(str);
    rb_str_upcase_bang(argc, argv, str);
    return str;
}
Equivalent in Ruby:
class String
  def upcase(*args)
    str = dup
    str.upcase!(*args) # returns nil if nothing changed, so return str explicitly
    str
  end
end
string.c: rb_str_upcase_bang
Here the real work starts
static VALUE
rb_str_upcase_bang(int argc, VALUE *argv, VALUE str)
{
    ...
    /* set flags for upcase */
    OnigCaseFoldType flags = ONIGENC_CASE_UPCASE;
    /* check options */
    flags = check_case_options(argc, argv, flags);
    ...
    /* shortcuts for ASCII-only */
    if (...) {
        ...
    }
    else if (flags&ONIGENC_CASE_ASCII_ONLY)
        rb_str_ascii_casemap(str, &flags, enc);
    else /* actual hard work */
        str_shared_replace(str, rb_str_casemap(str,&flags,enc));
    if (ONIGENC_CASE_MODIFIED&flags) return str;
    return Qnil;
}
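The last two lines of the C function are visible from Ruby: the destructive method returns the receiver only if something was modified, and nil otherwise. A minimal sketch (the strings are duplicated so that the example also runs where string literals are frozen):

```ruby
changed   = 'résumé'.dup
unchanged = 'RÉSUMÉ'.dup

p changed.upcase!    # returns the receiver, which was modified in place
p unchanged.upcase!  # returns nil, because nothing needed to change
p unchanged.upcase   # the non-destructive method always returns a string
```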
include/ruby/oniguruma.h: OnigCaseFoldType
Flags used to indicate operation needed (upcase/downcase/capitalize/swapcase):
#define ONIGENC_CASE_UPCASE    (1<<13) /* uppercase mapping */
#define ONIGENC_CASE_DOWNCASE  (1<<14) /* lowercase mapping */
#define ONIGENC_CASE_TITLECASE (1<<15) /* titlecase mapping */
Usage to indicate operation type:
upcase: ONIGENC_CASE_UPCASE (upcasing needed)
downcase: ONIGENC_CASE_DOWNCASE (downcasing needed)
capitalize: ONIGENC_CASE_TITLECASE | ONIGENC_CASE_UPCASE (changed to ONIGENC_CASE_DOWNCASE after first character)
swapcase: ONIGENC_CASE_UPCASE | ONIGENC_CASE_DOWNCASE (both upcasing and downcasing needed)
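How the four operations map onto flag bits can be modelled in Ruby. This is only an illustration: the bit values are copied from the slide, while the module, the table and the method are hypothetical and do not exist in Ruby itself.

```ruby
# Hypothetical model of the operation flags; not part of Ruby's API.
module CaseFlagsSketch
  UPCASE    = 1 << 13
  DOWNCASE  = 1 << 14
  TITLECASE = 1 << 15

  OPERATION_FLAGS = {
    upcase:     UPCASE,
    downcase:   DOWNCASE,
    capitalize: TITLECASE | UPCASE,
    swapcase:   UPCASE | DOWNCASE
  }.freeze

  # After the first character, capitalize continues as a plain downcase.
  def self.after_first_character(flags)
    return flags if (flags & TITLECASE).zero?
    (flags & ~(TITLECASE | UPCASE)) | DOWNCASE
  end
end

flags = CaseFlagsSketch::OPERATION_FLAGS[:capitalize]
p flags.to_s(2)
p CaseFlagsSketch.after_first_character(flags) == CaseFlagsSketch::DOWNCASE
```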
string.c: check_case_options
Common to upcase/downcase/capitalize/swapcase
Checks case options, sets flags, or produces error messages
Possible options:
:fold (for case folding; only on downcase)
:turkic
:lithuanian (not yet implemented; usable with :turkic)
:ascii
Corresponding flags:
#define ONIGENC_CASE_FOLD               (1<<19) /* has/needs case folding */
#define ONIGENC_CASE_FOLD_TURKISH_AZERI (1<<20) /* Turkic */
#define ONIGENC_CASE_FOLD_LITHUANIAN    (1<<21) /* Lithuanian */
#define ONIGENC_CASE_ASCII_ONLY         (1<<22) /* limited to ASCII */
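The effect of the option check can be observed from Ruby. The helper below is made up for this sketch; it just calls the real method and reports an ArgumentError instead of raising it. The exact error messages may differ between Ruby versions, so they are printed rather than relied upon.

```ruby
# Hypothetical helper: returns the result, or a description of the error raised.
def try_options(method, *options)
  'Résumé'.public_send(method, *options)
rescue ArgumentError => e
  "ArgumentError: #{e.message}"
end

p try_options(:downcase, :fold)          # case folding
p try_options(:upcase, :fold)            # rejected: :fold is only for downcase
p try_options(:upcase, :ascii)
p try_options(:upcase, :no_such_option)  # rejected: unknown option
```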
string.c: rb_str_casemap
Handles string expansion (e.g. "ﬃ".upcase ⇒ "FFI")
Common to all casing operations
static VALUE
rb_str_casemap(VALUE source, OnigCaseFoldType *flags, rb_encoding *enc)
{
    /* general preparations */
    while (source_current < source_end) {
        /* calculate next buffer length */
        size_t capa = (source_end - source_current) * ++buffer_count + 20;
        ... /* prepare and link next buffer */
        buffer_length_or_invalid = enc->case_map(flags,
                &source_current, source_end,
                current_buffer->space,
                current_buffer->space+current_buffer->capa,
                enc);
        ... /* check for errors (invalid input string) */
    }
    /* prepare for copy to final location */
    while (...)
        /* copy current_buffer and move to next buffer */
    /* cleanup and return */
}
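The reason for the chained buffers is that the result of a case mapping can differ from the input in both character count and byte count. The loop below shows this from Ruby; buffer_capacity is a hypothetical helper that merely restates the capacity formula from the C code above.

```ruby
['ß', "\uFB03", 'Résumé'].each do |s|
  up = s.upcase
  puts format('%-8s %d chars / %d bytes  =>  %-8s %d chars / %d bytes',
              s, s.length, s.bytesize, up, up.length, up.bytesize)
end

# Hypothetical restatement of: capa = remaining * ++buffer_count + 20
def buffer_capacity(remaining_bytes, buffer_number)
  remaining_bytes * buffer_number + 20
end

p buffer_capacity('Résumé'.bytesize, 1)  # capacity of the first buffer
```

Each additional buffer is sized more generously than the previous one, so even heavily expanding input should need only a few buffers.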
enc->case_map and Encoding Primitives
enc is the encoding of our string, a struct OnigEncodingTypeST
case_map is a function pointer, an encoding primitive [1]
"Résumé"UTF-8.upcase calls onigenc_unicode_case_map in enc/unicode.c (OnigEncodingDefine in enc/utf_8.c)
"Résumé"UTF-16LE.upcase calls onigenc_unicode_case_map in enc/unicode.c (OnigEncodingDefine in enc/utf_16le.c)
"Résumé"ISO-8859-1.upcase calls case_map in enc/iso_8859_1.c (OnigEncodingDefine in the same file)
[1] 松本行弘, 縄手雅彦. スクリプト言語 Ruby の拡張可能な多言語テキスト処理の実装. 情報処理学会論文誌. 2005 Nov 15;46(11):2633-42. / Yukihiro Matsumoto and Masahiko Nawate: Multilingual Text Manipulation Method for Ruby Language. Journal of Information Processing (JIP); 2005 Nov 15; Vol. 46, No. 11, pp. 2633-42. (in Japanese)
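From the Ruby side, the dispatch through the encoding primitive is invisible: the same method call works on strings in different encodings. A short sketch, assuming Ruby 2.4 or later, where case mapping is available for the three encodings named above.

```ruby
%w[UTF-8 UTF-16LE ISO-8859-1].each do |name|
  source = 'Résumé'.encode(name)
  result = source.upcase   # dispatches to the case_map primitive of this encoding
  puts "#{name}: #{result.encoding}, #{result.bytesize} bytes, #{result.encode('UTF-8')}"
end
```

The result keeps the encoding of the receiver; it is converted to UTF-8 here only for display.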
//i
mbc_case_fold
apply_all_case_fold
get_case_fold_codes_by_str
⇒ New primitive
case_map Primitive
OnigCaseFoldType flags
case_map Primitive (enc/iso_8859_1.c)
static int
case_map (...)
{
    /* initializations */
    while (*pp<end && to<to_end) {
        code = *(*pp)++;
        if (code==SHARP_s)
            /* German ß special case */
        else if (/* have upper case && want lower case */)
            code += 0x20, flags |= ONIGENC_CASE_MODIFIED;
        else if (/* lower without upper: ª, º, µ, ÿ */)
            ;
        else if ((EncISO_8859_1_CtypeTable[code]&BIT_CTYPE_LOWER)
                 && (flags&ONIGENC_CASE_UPCASE))
            code -= 0x20, flags |= ONIGENC_CASE_MODIFIED;
        *to++ = code;
        if (flags&ONIGENC_CASE_TITLECASE)
            /* titlecase → lowercase for capitalize */
    }
    /* cleanup and return */
}
Good example when creating a new primitive!
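To make the structure of such a primitive more concrete, here is a rough Ruby port of only the upcase branch for ISO-8859-1. It is a sketch under simplifying assumptions, not the actual implementation: the method name is invented, the character ranges are written out by hand instead of using a ctype table, and flags are ignored.

```ruby
# Hypothetical helper, not part of Ruby: upcases ISO-8859-1 text byte by byte.
def latin1_upcase_sketch(str)
  out = []
  str.encode('ISO-8859-1').each_byte do |code|
    if code == 0xDF                                     # ß expands to "SS"
      out << 0x53 << 0x53
    elsif (0x61..0x7A).cover?(code) ||                  # a-z
          ((0xE0..0xFE).cover?(code) && code != 0xF7)   # à-þ, except ÷
      out << (code - 0x20)
    else                                                # ª, º, µ, ÿ and non-letters
      out << code
    end
  end
  out.pack('C*').force_encoding('ISO-8859-1')
end

puts latin1_upcase_sketch('Résumé, straße, ÿ').encode('UTF-8')
# Comparison with the built-in method on the same encoding:
p latin1_upcase_sketch('Résumé') == 'Résumé'.encode('ISO-8859-1').upcase
```

The offset of 0x20 between lowercase and uppercase letters is the same in the ASCII range and in the upper half of ISO-8859-1, which is why a single subtraction covers both.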
case_map Primitives and Who Wrote Them
Students (sophomores/juniors/seniors) at Aoyama Gakuin University
For East Asian encodings
(Shift_JIS, EUC-JP, GB2312, EUC-KR, Big-5, EUC-TW,...)
data could be shared between //i and case mapping
but case folding for //i only works for ASCII
None of the main Japanese committers thought this was needed anymore
Talk to me if you need it
onigenc_unicode_case_map
while loop, one source character at a time
ONIGENC_CASE_MODIFIED flag
if/else if/else
|/& with flag bits
gotos
gperf-created hash lookups:
onigenc_unicode_fold_lookup
onigenc_unicode_unfold1_lookup
downcase
upcase
in enc/unicode/9.0.0/casefold.h
/* before */
{0x0041, {1, {0x0061}}},         /* A → a */
{0x00df, {2, {0x0073, 0x0073}}}, /* ß → ss */
{0x01c4, {1, {0x01c6}}},         /* Ǆ → ǆ */
{0x01c5, {1, {0x01c6}}},         /* ǅ → ǆ */
{0xab73, {1, {0x13a3}}},         /* ꭳ → Ꭳ (Cherokee) */
/* after */
{0x0041, {1|F|D, {0x0061}}},                    /* A → a */
{0x00df, {2|F|ST|SU|I(1), {0x0073, 0x0073}}},   /* ß → ss */
{0x01c4, {1|F|D|ST|I(8), {0x01c6}}},            /* Ǆ → ǆ */
{0x01c5, {1|F|D|IT|SU|I(9), {0x01c6}}},         /* ǅ → ǆ */
{0xab73, {1|F|U, {0x13a3}}},                    /* ꭳ → Ꭳ (Cherokee) */
(squeezed into an int where only 2 bits were used)
see enc/unicode.c
/* data is available here */
/* (flags are the same as for options) */
#define U ONIGENC_CASE_UPCASE
#define D ONIGENC_CASE_DOWNCASE
#define F ONIGENC_CASE_FOLD
/* data is in special additional array */
#define ST ONIGENC_CASE_TITLECASE
#define SU ONIGENC_CASE_UP_SPECIAL
#define SL ONIGENC_CASE_DOWN_SPECIAL
#define IT ONIGENC_CASE_IS_TITLECASE
/* index into special array (size: around 420 words only) */
#define I(n) OnigSpecialIndexEncode(n)
(or my attempt at using the Takahashi method)
upcase
downcase
capitalize
swapcase
swapcase?
Well, I did, when testing swapcase!
swapcase?
Python has it ?! (Matz)
To revert accidental Caps Lock output ?! (on Unicode list)
swapcase (ǅ below is the single titlecase character U+01C5)
"ǅunGLA".swapcase ⇒ "ǅUNgla"
    preferred by Unicode Consortium (never ever need any new standardization)
    preserves reversibility (X.swapcase.swapcase == X)
"ǅunGLA".swapcase ⇒ "ǄUNgla"
"ǅunGLA".swapcase ⇒ "ǆUNgla"
"ǅunGLA".swapcase ⇒ "dŽUNgla"
    proposed by Nobu
"dŽUNgla"
    useless?, but 'correct'
    additional effort for implementation
    additional effort for testing
(April Fools' Day)
Japan Time 20:58:33 ⇒ same date in most timezones
please draw your own conclusions
Files:
test/ruby/enc/test_case_options.rb
test/ruby/enc/test_case_mapping.rb
Files:
test/ruby/enc/test_case_comprehensive.rb
413 tests, 2212391 assertions, 0 failures, 0 errors, 0 skips
:lithuanian (because last to be actually implemented)
Regexp
'Юに코δ' =~ /\p{Hiragana}/
'Ю'?
What I want:
loc = Locale.new 'de-CH' (German as used in Switzerland)
1.2345678E5.to_s ⇒ "123456.78"
1.2345678E5.to_s(loc) ⇒ "123'456,78"
Internationalization support in libraries:
UnicodeUtils.nfkc string
ActiveSupport::Multibyte::Chars.new(string).normalize :kc
TwitterCldr::Normalization::NFKC.normalize string
string.unicode_normalize :nfkc
Libraries avoid monkey patching ⇒ not Ruby-like
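Of the four calls listed above, the last one is the built-in String#unicode_normalize (Ruby 2.2 and later); the others depend on external libraries. A minimal sketch of the built-in call, with characters written as escapes so that the composed and decomposed forms cannot be confused:

```ruby
ligature   = "\uFB03"    # ffi ligature, a single character
decomposed = "e\u0301"   # e followed by a combining acute accent

p ligature.unicode_normalize(:nfkc)   # compatibility normalization
p decomposed.unicode_normalize(:nfc)  # canonical composition
p decomposed.unicode_normalize(:nfc).length
```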
Possible solution:
loc = Locale.new 'tr'
'Türkiye'.upcase loc ⇒ 'TÜRKİYE'
//i case folding: all encodings not at end of test/ruby/enc/test_regex_casefold.rb)
Regexp data
Send questions and comments to Martin Dürst (mailto:duerst@it.aoyama.ac.jp) or open a bug report or feature request
The latest version of this presentation is available at: