Internationalization of the Ruby Scripting Language
IUC 31, San Jose, 2007
Martin J. DÜRST
duerst@it.aoyama.ac.jp
Aoyama Gakuin University
© 2007 Martin J. Dürst, Aoyama Gakuin University
Outline
- What is Ruby?
- Current Ruby Internationalization (Ruby 1.8)
- The Ruby Internationalization Masterplan
CSI vs. UCS
- Internationalization in Ruby 1.9
- Outlook
- Q & A
Assumptions
Talk assumes that you know
- Unicode basics
- Programming basics
Slides are available online
What is Ruby?
- Tiny annotations for Japanese (and Chinese) pronunciation
- A programming language
- Interpreted (like Perl or Python)
- Fully object-oriented (more than Java or C++)
- Simple and short syntax
- Intuitive semantics (after a short while)
- A lot of fun!
Ruby History
- Developed by Yukihiro MATSUMOTO ("Matz") starting in 1993
- "Better" Perl, fully object-oriented, includes elements of Smalltalk, Lisp, ...
- ca. 2000: 'discovered' by Dave Thomas (Pragmatic Programmers)
- ca. 2004: David Heinemeier Hansson publishes Ruby on Rails, a high-productivity Web application framework
- Production version is 1.8(.6); 1.9 is in the works, with a faster engine and maybe better internationalization
Ruby Highlights
- Open source
- Comes with all the trimmings: libraries, debugger, profiler,
interactive shell, documentation tools, testing framework
- Very flexible type system (open classes, duck typing)
- Extensibility (C), metaprogramming
- Very easy to create internal DSLs:
- Looking like independent, simple, domain specific language
- Part of Ruby: full power available where necessary
Internationalization in Ruby 1.8
Bad news:
- Strings are just sequences of bytes
- String methods work on bytes
Good news:
- Basic UTF-8 support available
- Unicode means UTF-8
- Regular expressions can work with UTF-8 characters
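The difference between the byte view and the character view can be sketched as follows. This is not Ruby 1.8 code: it is a sketch for a current Ruby, where String#b (a later addition, not mentioned in the talk) produces a binary copy whose methods count bytes much as 1.8 strings did, and where the /u flag on a regular expression still selects UTF-8.

```ruby
# Sketch only: approximates the Ruby 1.8 byte view on a current Ruby.
text = "青山"                      # two characters, six bytes in UTF-8

byte_view  = text.b                # binary copy: methods see bytes, as in Ruby 1.8
byte_count = byte_view.length      # counts bytes
char_count = text.scan(/./u).size  # the /u regexp matches whole UTF-8 characters

puts "bytes: #{byte_count}, characters: #{char_count}"
```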
Character Encoding Modes
- Set on regular expressions with a postfix (e.g. /./u)
- Set for the overall program using $KCODE (attention: sticky across files)
- Modes available:
- Generic single-byte: raw bytes, US-ASCII, other single-byte encodings
- EUC: EUC-JP, EUC-KR,...
- Shift_JIS
- UTF-8
Also available: [0x9752, 0x5C71].pack('U*') produces the string '青山' (UTF-8 only)
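A small sketch of the pack('U*') idiom, together with its inverse unpack('U*'); the variable names are only illustrative.

```ruby
# Build a UTF-8 string from Unicode code points, then take it apart again.
code_points = [0x9752, 0x5C71]
aoyama      = code_points.pack('U*')   # 'U' packs each number as a UTF-8 character
round_trip  = aoyama.unpack('U*')      # back to the code point numbers

puts aoyama
puts round_trip.map { |cp| format('U+%04X', cp) }.join(' ')
```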
Standard Libraries
(need to be loaded with require)
jcode: characters, not bytes
- adds some methods to String (length → jlength)
- changes some others (e.g. chop!)
- based on regular expressions, slow
iconv: Transcoding, interface to local/native library
kconv/nkf: Transcoding for Japanese encodings, preferred by Japanese programmers
Other Libraries
Various things available, but scattered
Unicode in Ruby on Rails
Own Work
- sconv: Higher-level wrapper for code conversion libraries, alpha quality
- charesc: Character escapes in strings
- langtag
Charesc
- Character escapes, not byte escapes
- Uses metaprogramming to 'cheat' on syntax
- Future: Ruby 1.9 native syntax
require 'charesc'
print "#{U677Eu672Cu884Cu5F18}"
# => 松本行弘
print U9752u5C71u5B66u9662u5927u5B66
# => 青山学院大学
print "Martin J. D#{U00FC}rst"
# => Martin J. Dürst
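One way such a syntax 'cheat' could work is to intercept lookups of undefined constants. The sketch below is hypothetical and is not the charesc source: the module name CharEscSketch and the name pattern are assumptions, and unlike the slides above it requires the module prefix.

```ruby
# Hypothetical sketch, not the charesc source: resolve constants such as
# U9752u5C71 on the fly by parsing their names as hexadecimal code points.
module CharEscSketch
  NAME_PATTERN = /\AU\h{4,6}(?:u\h{4,6})*\z/

  def self.const_missing(name)
    # Anything that does not look like an escape gets the normal NameError.
    return super unless name.to_s =~ NAME_PATTERN
    name.to_s.scan(/\h+/).map(&:hex).pack('U*')
  end
end

aoyama = CharEscSketch::U9752u5C71
name   = "Martin J. D#{CharEscSketch::U00FC}rst"
puts aoyama, name
```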
Langtag
- Support for IETF BCP47 language tags
- Well-formedness checking
- Future: Matching, validity checking
require 'langtag'
tag = Langtag.new('de-Latn-ch')
tag.script # => 'Latn'
tag.wellformed? # => true
tag.region = 'at'
print tag # => de-Latn-at
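To give an idea of what well-formedness checking involves, here is a deliberately simplified sketch. It is not the langtag library: the names SIMPLE_LANGTAG, simple_wellformed? and simple_script are hypothetical, and only the language[-script][-region] shape is handled, whereas BCP 47 also defines variants, extensions and private-use subtags.

```ruby
# Simplified, hypothetical check: only language[-script][-region] is handled.
SIMPLE_LANGTAG = /\A(?<language>[a-z]{2,3})(?:-(?<script>[a-z]{4}))?(?:-(?<region>[a-z]{2}|\d{3}))?\z/i

# True if the tag has the simplified shape above.
def simple_wellformed?(tag)
  !SIMPLE_LANGTAG.match(tag).nil?
end

# Returns the script subtag, or nil if there is none (or the tag is malformed).
def simple_script(tag)
  m = SIMPLE_LANGTAG.match(tag)
  m && m[:script]
end

puts simple_wellformed?('de-Latn-ch')
puts simple_script('de-Latn-ch')
puts simple_wellformed?('de--ch')
```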
Multilingualization
M17N, term used often in Japan
Yukihiro Matsumoto and Masahiko Nawate, Multilingual Text Manipulation
Method for Ruby Language, IPSJ Journal, Vol. 46, No. 11, Nov. 2005 (in
Japanese).
Future internationalization architecture for Ruby:
- Code Set Independent (as opposed to UCS)
- Very small set of data and functions per encoding
- Each Ruby string has an encoding
- Encoding conflicts raise exceptions
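The last two points can be illustrated with the String API as it later shipped in Ruby 1.9 and newer. This is an assumption relative to the talk, which describes a design still under development; the sketch only shows the general idea of per-string encodings and conflict exceptions.

```ruby
# Sketch using the String API of released Ruby versions (1.9 and later).
utf8 = "青山"                        # tagged as UTF-8 (default source encoding)
sjis = "青山".encode("Shift_JIS")    # same characters, tagged as Shift_JIS

conflict =
  begin
    utf8 + sjis                      # mixing non-ASCII data of two encodings
    nil
  rescue Encoding::CompatibilityError => e
    e.class                          # the conflict surfaces as an exception
  end

puts utf8.encoding, sjis.encoding, conflict.inspect
```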
UCS: Universal Code Set
- Could be something else in theory, but in practice means Unicode
- Convert to Unicode early
- Convert back to legacy encoding late
- Inside is Unicode only
- Used by Java, C#, Perl, Python, the WWW,...
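The "convert early, convert late" pattern can be sketched in Ruby itself, using String#encode from later Ruby versions (an assumption relative to the talk). ISO-8859-1 stands in for an arbitrary legacy encoding.

```ruby
# Legacy input: "Dürst" as ISO-8859-1 bytes (0xFC is ü in that encoding).
legacy_in = [0x44, 0xFC, 0x72, 0x73, 0x74].pack('C*').force_encoding('ISO-8859-1')

internal   = legacy_in.encode('UTF-8')        # convert to Unicode early
greeting   = "Hello, #{internal}!"            # all processing happens on Unicode data
legacy_out = greeting.encode('ISO-8859-1')    # convert back to the legacy encoding late

puts greeting
puts legacy_out.bytesize
```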
CSI: Code Set Independent
- Not dependent on any specific encoding
- Old style: configuring for a single encoding, often at compile time
- In proposal:
- Multiple encodings in the same application
- Encodings abstracted by (too?) small set of functions
In Defense of the CSI Approach
- Small scripts often use just one legacy encoding
- Transcoding is slow
- (Transcoding needs huge tables)
- Transcoding is messy
- Example: Shift_JIS
- Backslash or Yen sign?
- What vendor additions?
- What company or personal gaiji?
- CSI doesn't fix the problem, but
- CSI allows you to stay vague
Main Problems of CSI
- Minimal set of primitive functions (~15) to abstract encodings
→ potential efficiency problems
Potential solution: core primitives vs. extended primitives
- Raising exceptions for encoding conflicts
→ difficult to address during runtime
Will help programmers clean up their code
- Lowest common denominator approach to string functionality
From Ruby 1.8.x to Ruby 1.9
- Version progression: 1.4 → 1.6 → 1.8 → 1.9.0? → 1.9.2? → ?? 2.0
- Faster implementation (YARV, Koichi Sasada)
- New regular expression engine (Oniguruma, 鬼車)
- Basics of new internationalization framework (ongoing)
- Other improvements/changes/additions
- Timeline: Christmas 2007
Already Working in Ruby 1.9
- Encoding indication, per file:
# -*- coding: utf-8 -*-
Affects identifiers, string constants, and regular expressions
- Regular expressions:
/\p{Sm}\p{Hiragana}/
(full support only for UTF-8)
- string.force_encoding(encoding): change encoding without changing byte representation
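A short sketch of what relabeling means in practice. String#encode, used for contrast, comes from later Ruby releases and is not mentioned on this slide.

```ruby
# The UTF-8 bytes of "青山", initially tagged as binary data.
raw  = [0xE9, 0x9D, 0x92, 0xE5, 0xB1, 0xB1].pack('C*')
text = raw.dup.force_encoding('UTF-8')    # relabel only: the bytes stay the same

same_bytes = (text.bytes == raw.bytes)
transcoded = text.encode('Shift_JIS')     # by contrast, encode rewrites the bytes

puts raw.encoding, text.encoding
puts "#{raw.length} bytes seen as #{text.length} characters"
```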
Soon to Come: Character Escapes
- \u89AB for 4-digit hex escapes
- \u{} for variable-length hex escapes
- Unclear whether it will work for encodings other than UTF-8
print "\u677E\u672C\u884C\u5F18"
# 松本行弘
print "Martin J. D\u00FCrst"
# Martin J. Dürst
print "Martin J. D\u{FC}rst"
# Martin J. Dürst
Current Discussions
- Exact structure of US-ASCII-only/US-ASCII+some-8bit/binary data
- Encoding of strings in many different circumstances
Where We Need to Go
- All coded character sets are equal, but Unicode is more equal
- Numeric character escapes (\u89AB)
- Functionality not applicable to other encodings, e.g. Normalization
- Functionality not available for other encodings, e.g. Collation
- All encodings are equal, but UTF-8 is more equal
- Optimize for speed if necessary
- Convert (mainly) via UTF-8
- Allow use of the UCS model on top of the CSI architecture
Transcoding Policies
- Goals:
- More failsafe in case of encoding conflicts
- Emulating UCS model on CSI architecture
- Where to apply
- What to do
Transcoding Policies: Where
- Read/write from/to file system, console
- File/directory names in file system
- Low level (C) extensions
- Different Ruby implementations (e.g. UTF-16 for JRuby)
- Error reporting (need to be more permissive)
Transcoding Policies: What
- Throw exception
- Transcode if target character set is wider
- Transcode to common superset, or always UTF-8
- Transcode if possible without loss for actual data
- Use fallbacks (Ü→UE, ü)
- Use replacement characters
- Drop if not convertible
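Several of these policies map onto options of String#encode in Ruby versions released after this talk. The sketch below uses that later API as an assumption; it is not the policy mechanism under discussion here, and US-ASCII is chosen only as a deliberately narrow target.

```ruby
text = "Übung"   # Ü has no US-ASCII equivalent

# Policy: throw an exception.
strict =
  begin
    text.encode('US-ASCII')
  rescue Encoding::UndefinedConversionError => e
    e.class
  end

# Policy: use fallbacks (Ü → UE).
with_fallback = text.encode('US-ASCII', fallback: { 'Ü' => 'UE' })

# Policy: use replacement characters.
replaced = text.encode('US-ASCII', undef: :replace, replace: '?')

# Policy: drop if not convertible.
dropped = text.encode('US-ASCII', undef: :replace, replace: '')

puts strict, with_fallback, replaced, dropped
```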
Acknowledgements
- Yukihiro Matsumoto (Matz) and many others for Ruby
- Matz for some great discussions
- Takuya Shimada
- The members of my lab
Conclusions
- Ruby is exciting, even without I18N
- Basic support for UTF-8 is available in Ruby 1.8
- Third-party libraries provide more support
- New architecture for Ruby 1.9
- Lots of work to do
Further Information
An up-to-date version of this paper as well as slides used for the talk connected to this paper are available at http://www.sw.it.aoyama.ac.jp/2007/pub/IUC31-ruby.
Code for Demos
System.out.println("Hello World!");
print 'Hello World!'
5.times { print 'Hello World!' }
So it is possible to write "Hello".upcase (returning the string HELLO).
who = 'Unicode Conference'
print "Hello #{who}!"
[1, 2, 3.14159, "four", "five and a quarter"]
[0x9752, 0x5C71].pack('U*')
aoyama = '青山学院大学'
require 'jcode'
aoyama.length
aoyama.jlength
"abc".length
Blocks:
5.times { print "Hello World!" }
File.open(fn) do |file| ... end