Internationalization of the Ruby Scripting Language

IUC 31, San Jose, 2007

Martin J. DÜRST

duerst@it.aoyama.ac.jp

Aoyama Gakuin University

Outline

What is Ruby?
Current Ruby Internationalization (Ruby 1.8)
The Ruby Internationalization Masterplan
CSI vs. UCS
Internationalization in Ruby 1.9
Outlook
Q & A

Assumptions

Talk assumes that you know

Unicode basics
Programming basics

Slides are available online

What is Ruby?

Tiny annotations for Japanese (and Chinese) pronunciation
A programming language
Interpreted (like Perl or Python)
Fully object-oriented (more than Java or C++)
Simple and short syntax
Intuitive semantics (after a short while)
A lot of fun!

Ruby History

Developed by Yukihiro MATSUMOTO ("Matz") starting in 1993
"Better" Perl, fully object-oriented, includes elements of Smalltalk, Lisp,...
ca. 2000: 'discovered' by Dave Thomas (Pragmatic Programmers)
ca. 2004: David Heinemeier Hansson publishes Ruby on Rails high-productivity Web application framework
Production version is 1.8(.6); 1.9 is in the works, with a faster engine and maybe better internationalization

Ruby Highlights

Open source
Comes with all the trimmings: libraries, debugger, profiler, interactive shell, documentation tools, testing framework
Very flexible type system (open classes, duck typing)
Extensibility (C), metaprogramming
Very easy to create internal DSLs:
- Looking like independent, simple, domain specific language
- Part of Ruby: full power available where necessary

Internationalization in Ruby 1.8

Bad news:

Strings are just sequences of bytes
String methods work on bytes

Good news:

Basic UTF-8 support available
Unicode means UTF-8
Regular expressions can work with UTF-8 characters
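
A minimal sketch of the byte/character distinction. Assumptions: it is written to run on Ruby 1.9 and later as well, where String#length already counts characters, so bytesize stands in for the byte count that length reports in Ruby 1.8. The helper name utf8_chars is hypothetical.

```ruby
# -*- coding: utf-8 -*-
# Split a UTF-8 string into characters with a /u regular expression.
# utf8_chars is a hypothetical helper name, not part of any library.
def utf8_chars(str)
  str.scan(/./mu)
end

aoyama = '青山'
puts aoyama.bytesize            # number of bytes in the string
puts utf8_chars(aoyama).length  # number of characters found by the regexp
```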

Character Encoding Modes

Set on regular expressions with postfix
(e.g. /./u)
Set for overall program using $KCODE (attention: sticky across files)
Modes available:
1. Generic single-byte: raw bytes, US-ASCII, other single-byte encodings
2. EUC: EUC-JP, EUC-KR,...
3. Shift_JIS
4. UTF-8

Also available: [0x9752, 0x5C71].pack('U*')
produces the string '青山' (UTF-8 only)
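
The pack('U*') call above has an inverse, unpack('U*'). A small sketch of the round trip; the two helper names are hypothetical and only wrap the standard Array#pack and String#unpack calls.

```ruby
# -*- coding: utf-8 -*-
# Build a UTF-8 string from code points (hypothetical helper name).
def from_codepoints(codepoints)
  codepoints.pack('U*')
end

# Recover the code points from a UTF-8 string (hypothetical helper name).
def to_codepoints(str)
  str.unpack('U*')
end

s = from_codepoints([0x9752, 0x5C71])
puts s
puts to_codepoints(s).map { |cp| format('U+%04X', cp) }.join(' ')
```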

Standard Libraries

(need to be required)

jcode: characters, not bytes
- adds some methods to String (length → jlength)
- changes some others (e.g. chop!)
- based on regular expressions, slow
iconv: Transcoding, interface to local/native library
kconv/nkf: Transcoding for Japanese encodings, preferred by Japanese programmers
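
A sketch of what transcoding looks like in practice. Assumption: it uses String#encode, which is built into Ruby 1.9 and later, instead of the iconv and kconv libraries named above, so it illustrates the operation rather than those libraries' interfaces. The helper names are hypothetical.

```ruby
# -*- coding: utf-8 -*-
# Transcode to Shift_JIS (hypothetical helper, wraps String#encode).
def to_shift_jis(str)
  str.encode('Shift_JIS')
end

# Transcode back to UTF-8 (hypothetical helper, wraps String#encode).
def to_utf8(str)
  str.encode('UTF-8')
end

sjis = to_shift_jis('青山')
puts sjis.encoding            # encoding label of the transcoded string
puts sjis.bytesize            # two bytes per character in Shift_JIS
puts to_utf8(sjis) == '青山'  # the round trip restores the original
```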

Other Libraries

Various things available, but scattered

Normalization: http://www.yoshidam.net/unicode.txt (outdated)
Redefining operators for Unicode operations: ftp://ftp.mars.org/pub/ruby/Unicode.tar.bz2
ICU4R http://icu4r.rubyforge.org/: parallel classes (String and UString)
ActiveSupport::Multibyte (part of Ruby on Rails 1.2): http://api.rubyonrails.org/classes/ActiveSupport/Multibyte/Chars.html
Examples: '青山'.length → 6
'青山'.chars.length → 2

Unicode in Ruby on Rails

Ruby on Rails uses UTF-8 by default
ActiveSupport::Multibyte available by default in Rails 1.2
More information and pointers at
http://wiki.rubyonrails.org/rails/pages/HowToUseUnicodeStrings,
http://ruby.org.ee/wiki/Unicode_in_Ruby/Rails
Various frameworks for localization

Own Work

sconv: Higher-level wrapper for code conversion libraries, alpha quality
charesc: Character escapes in strings
langtag

Charesc

Character escapes, not byte escapes
Uses metaprogramming to 'cheat' on syntax
Future: Ruby 1.9 native syntax

require 'charesc'
print "#{U677Eu672Cu884Cu5F18}"
      # => 松本行弘
print U9752u5C71u5B66u9662u5927u5B66
      # => 青山学院大学
print "Martin J. D#{U00FC}rst"
      # => Martin J. Dürst

Langtag

Support for IETF BCP47 language tags
Well-formedness checking
Future: Matching, validity checking

require 'langtag'
tag = Langtag.new('de-Latn-ch')
tag.script             # => 'Latn'
tag.wellformed?        # => true
tag.region = 'at'
print tag              # => de-Latn-at
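
For readers without the langtag library, a much simplified well-formedness check can be sketched with one regular expression. Assumptions: this is not the langtag implementation; it only handles the language[-script][-region] subset of BCP 47, and the names SIMPLE_TAG and parse_simple_tag are hypothetical.

```ruby
# Hypothetical, simplified pattern: language, optional script, optional region.
# Real BCP 47 tags also allow extlang, variants, extensions and private use.
SIMPLE_TAG = /\A(?<language>[a-z]{2,3})(?:-(?<script>[a-z]{4}))?(?:-(?<region>[a-z]{2}|\d{3}))?\z/i

# Returns a hash of subtags in conventional case, or nil if not well-formed.
def parse_simple_tag(tag)
  m = SIMPLE_TAG.match(tag)
  return nil unless m
  {
    language: m[:language].downcase,
    script:   m[:script] && m[:script].capitalize,
    region:   m[:region] && m[:region].upcase
  }
end

p parse_simple_tag('de-Latn-ch')
p parse_simple_tag('de_ch')   # underscore is not a valid separator
```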

Multilingualization

M17N, term used often in Japan

Yukihiro Matsumoto and Masahiko Nawate, Multilingual Text Manipulation Method for Ruby Language, IPSJ Journal, Vol. 46, No. 11, Nov. 2005 (in Japanese).

Future internationalization architecture for Ruby:

Code Set Independent (as opposed to UCS)
Very small set of data and functions per encoding
Each Ruby string has an encoding
Encoding conflicts raise exceptions
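
A sketch of what "each string has an encoding" and "encoding conflicts raise exceptions" mean for a programmer. Assumption: it shows the behavior of released Ruby 1.9 and later, which may differ in detail from the proposal described here. The helper name try_concat is hypothetical.

```ruby
# -*- coding: utf-8 -*-
# Concatenate two strings, reporting an encoding conflict instead of failing.
# try_concat is a hypothetical helper name.
def try_concat(a, b)
  a + b
rescue Encoding::CompatibilityError => e
  "conflict: #{e.message}"
end

utf8 = '青山'
sjis = utf8.encode('Shift_JIS')

puts utf8.encoding
puts sjis.encoding
puts try_concat(utf8, sjis)                        # non-ASCII data in two encodings
puts try_concat(utf8, 'abc'.encode('Shift_JIS'))   # ASCII-only data mixes without conflict
```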

UCS: Universal Code Set

Could be something else in theory, but in practice means Unicode
Convert to Unicode early
Convert back to legacy encoding late
Inside is Unicode only
Used by Java, C#, Perl, Python, the WWW,...
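
The UCS pattern (convert early, Unicode inside, convert back late) can be sketched as a small wrapper. Assumptions: it uses String#encode and String#force_encoding from Ruby 1.9 and later, and the helper name process_as_unicode is hypothetical.

```ruby
# -*- coding: utf-8 -*-
# Run a block on UTF-8 text, converting at the edges only.
# process_as_unicode is a hypothetical helper name.
def process_as_unicode(bytes, legacy_encoding)
  # Convert early: label the incoming bytes, then transcode to UTF-8.
  text = bytes.dup.force_encoding(legacy_encoding).encode('UTF-8')
  # Inside, the block sees Unicode (UTF-8) only.
  result = yield(text)
  # Convert back late: return data in the legacy encoding.
  result.encode(legacy_encoding)
end

input  = '青山学院大学'.encode('EUC-JP')
output = process_as_unicode(input, 'EUC-JP') { |text| text.reverse }
puts output.encoding
puts output.encode('UTF-8')
```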

CSI: Code Set Independent

Not dependent on any specific encoding
Old style: configuring for a single encoding, often at compile time
In proposal:
- Multiple encodings in the same application
- Encodings abstracted by (too?) small set of functions

In Defense of the CSI Approach

Small scripts often use just one legacy encoding
Transcoding is slow
(Transcoding needs huge tables)
Transcoding is messy
Example: Shift_JIS
- Backslash or Yen sign?
- What vendor additions?
- What company or personal gaiji?
CSI doesn't fix the problem, but
CSI allows one to stay vague

Main Problems of CSI

Minimal set of primitive functions (~15) to abstract encodings
→ potential efficiency problems
Potential solution: core primitives vs. extended primitives
Raising exceptions for encoding conflicts
→ difficult to address during runtime
Will help programmers clean up their work
Lowest common denominator approach to string functionality

From Ruby 1.8.x to Ruby 1.9

Version progression: 1.4 → 1.6 → 1.8 → 1.9.0? → 1.9.2? →?? 2.0
Faster implementation (YARV, Koichi Sasada)
New regular expression engine
(Oniguruma, 鬼車)
Basics of new internationalization framework (ongoing)
Other improvements/changes/additions
Timeline: Christmas 2007

Already Working in Ruby 1.9

Encoding indication, per file:
# -*- coding: utf-8 -*-
Affects identifiers, string constants, and regular expressions
Regular expressions:
/\p{Sm}\p{Hiragana}/
(full support only for UTF-8)
string.force_encoding(encoding)
Change encoding without changing byte representation
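
A short sketch of the two features above, assuming a UTF-8 source file and Ruby 1.9 or later. The helper name relabel and the constant name MATH_THEN_HIRAGANA are hypothetical.

```ruby
# -*- coding: utf-8 -*-
# Relabel a copy of a string with another encoding; the bytes stay untouched.
# relabel is a hypothetical helper name.
def relabel(str, encoding)
  str.dup.force_encoding(encoding)
end

utf8   = '青山'
binary = relabel(utf8, 'ASCII-8BIT')
puts binary.bytes.to_a == utf8.bytes.to_a  # same byte representation
puts binary.length                         # length when treated as binary data
puts relabel(binary, 'UTF-8').length       # length when treated as UTF-8

# Unicode properties in a regular expression: a math symbol, then Hiragana.
MATH_THEN_HIRAGANA = /\p{Sm}\p{Hiragana}/
puts('+あ' =~ MATH_THEN_HIRAGANA ? 'match' : 'no match')
```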

Soon to Come: Character Escapes

\u89AB for 4-digit hex escapes
\u{} for variable-length hex escapes
Unclear whether it will work for encodings other than UTF-8

print "\u677E\u672C\u884C\u5F18"
      # 松本行弘
print "Martin J. D\u00FCrst"
      # Martin J. Dürst
print "Martin J. D\u{FC}rst"
      # Martin J. Dürst

Current Discussions

Exact structure of US-ASCII-only/US-ASCII+some-8bit/binary data
Encoding of strings in many different circumstances

Where We Need to Go

All coded character sets are equal, but Unicode is more equal
- Numeric character escapes (\u89AB)
- Functionality not applicable to other encodings, e.g. Normalization
- Functionality not available for other encodings, e.g. Collation
All encodings are equal, but UTF-8 is more equal
- Optimize for speed if necessary
- Convert (mainly) via UTF-8
Allow use of the UCS model on top of the CSI architecture

Transcoding Policies

Goals:
- More failsafe in case of encoding conflicts
- Emulating UCS model on CSI architecture
Where to apply
What to do

Transcoding Policies: Where

Read/write from/to file system, console
File/directory names in file system
Low level (C) extensions
Different Ruby implementations (e.g. UTF-16 for JRuby)
Error reporting (need to be more permissive)

Transcoding Policies: What

Throw exception
Transcode if target character set is wider
Transcode to common superset, or always UTF-8
Transcode if possible without loss for actual data
Use fallbacks (Ü→UE, ü)
Use replacement characters
Drop if not convertible
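
Several of these policies can be sketched with the options of String#encode. Assumptions: String#encode comes from Ruby 1.9 and later and is only one possible way to realize the policies listed here; the FALLBACKS table and the helper names are hypothetical.

```ruby
# -*- coding: utf-8 -*-
# Example fallback table (an assumption for illustration).
FALLBACKS = { "\u00DC" => 'UE', "\u00FC" => 'ue' }

# Policy: throw an exception if a character cannot be converted.
def transcode_strict(str, target)
  str.encode(target)
end

# Policy: use fallbacks for characters missing in the target.
def transcode_with_fallbacks(str, target)
  str.encode(target, fallback: FALLBACKS)
end

# Policy: use a replacement character.
def transcode_replacing(str, target)
  str.encode(target, undef: :replace, replace: '?')
end

# Policy: drop characters that are not convertible.
def transcode_dropping(str, target)
  str.encode(target, undef: :replace, replace: '')
end

name = "D\u00FCrst"
begin
  transcode_strict(name, 'US-ASCII')
rescue Encoding::UndefinedConversionError => e
  puts "exception: #{e.class}"
end
puts transcode_with_fallbacks(name, 'US-ASCII')
puts transcode_replacing(name, 'US-ASCII')
puts transcode_dropping(name, 'US-ASCII')
```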

Acknowledgements

Yukihiro Matsumoto (Matz) and many others for Ruby
Matz for some great discussions
Takuya Shimada
The members of my lab

Conclusions

Ruby is exciting, even without I18N
Basic support for UTF-8 is available in Ruby 1.8
Third-party libraries provide more support
New architecture for Ruby 1.9
Lots of work to do

Further Information

An up-to-date version of this paper as well as slides used for the talk connected to this paper are available at http://www.sw.it.aoyama.ac.jp/2007/pub/IUC31-ruby.

Code for Demos

System.out.println("Hello World!");

print 'Hello World!'

5.times { print 'Hello World!' }

So it is possible to write "Hello".upcase (returning the string HELLO).

who = 'Unicode Conference'
print "Hello #{who}!"

[1, 2, 3.14159, "four", "five and a quarter"]
[0x9752, 0x5C71].pack('U*')
aoyama = '青山学院大学'

require 'jcode'
aoyama.length
aoyama.jlength

"abc".length

Blocks:
5.times { print "Hello World!" }
File.open(fn) do |file| ... end