Internationalization in Ruby 1.9

IUC 33, San Jose, CA, U.S.A., October 2009

Martin J. DÜRST

duerst@it.aoyama.ac.jp

Aoyama Gakuin University

Outline

What is Ruby?
Internationalization in Ruby 1.8
Internationalization in Ruby 1.9
- Architecture
- Functionality
- Behind the Scenes
Outlook
Q & A

Abstract

Ruby is a purely object-oriented scripting language that is easy for beginners to learn and highly appreciated by experts for its productivity and depth. Internationalization of Ruby made a big leap forward this January with the release of Ruby 1.9.1, the first stable release of the Ruby 1.9 series. While previous versions of Ruby mostly treated text data as byte sequences, strings in Ruby 1.9 are sequences of characters. Because Ruby tags each string with encoding information internally, different applications can choose different internationalization models.

The presentation will give a short overview of Ruby as a programming language and introduce the new internationalization features in detail. We will concentrate on how to use Ruby with Unicode, which in Ruby's case means UTF-8. We will also discuss internationalization support in Ruby on Rails, the popular Web application framework written in Ruby.

Assumptions

Talk assumes that you know

Unicode basics
Programming basics

Slides available at http://www.sw.it.aoyama.ac.jp/2009/pub/IUC33-ruby1.9/

Some parts of this talk are based on:

Paper and talk at IUC 31: Internationalization of the Ruby Scripting Language, 2007
Talk together with Yui Naruse at RubyKaigi '08: Ruby M17N, 2008

What is Ruby?

Tiny annotations for Japanese pronunciation
A programming (scripting) language
Interpreted (like Perl or Python)
Simple and short syntax, easy to get into
Fully object-oriented (more than Java or C++ or ...)
Intuitive semantics
Lots of fun!

Ruby History

Developed by Yukihiro MATSUMOTO ("Matz") starting in 1993
"Better" Perl, fully object-oriented, includes elements of Smalltalk, Lisp,...
2000: 'discovered' in the West by Dave Thomas
2004: David Heinemeier Hansson publishes Ruby on Rails high-productivity Web application framework
Current production versions: 1.8(.6/7), 1.9(.1); in preparation: 1.9.2

Ruby Highlights

Open source
All the trimmings: libraries, debugger, profiler, interactive shell, documentation tools, testing framework
Very flexible type system (open classes, duck typing)
Metaprogramming (program your program), extensibility (C)
Very easy to create internal DSLs (domain specific language):
- Look like independent, simple, domain specific languages
- Part of Ruby: full programming power available where necessary
- Example by author: SVuGy

The Past: Internationalization in Ruby 1.8

The problem:

Strings are just sequences of bytes
String methods work on bytes

Good news:

Basic UTF-8 support available
Unicode means UTF-8
Regular expressions can work with UTF-8 characters
Rails provides more with ActiveSupport::Multibyte

Now much better: Ruby 1.9

Internationalization in Ruby 1.9: The Architecture

Code Set Independent (CSI, not UCS)
Each Ruby string has an encoding
Very small set of data and functions per encoding
Encoding conflicts raise exceptions

Japanese mostly use Multilingualization (M17N) instead of Internationalization (I18N)

Original proposal: Yukihiro Matsumoto and Masahiko Nawate, Multilingual Text Manipulation Method for Ruby Language, IPSJ Journal, Vol. 46, No. 11, Nov. 2005 (in Japanese).

UCS: Universal Code Set

In theory: anything, in practice: Unicode
Unicode inside
Convert to Unicode early
Convert back late
Used by Java, C#, Perl, Python, the WWW,...

CSI: Code Set Independent

Not dependent on any specific encoding
Old style: configure a single encoding at compile time
Ruby 1.9:
- Multiple encodings in the same application
- Each String is tagged with an encoding
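
A minimal sketch of what tagging means in practice, assuming Ruby 1.9 or later; the variable names are illustrative only. The same two characters are held once as UTF-8 and once as Shift_JIS in the same program:

```ruby
# Two strings with the same characters but different encodings
# can live side by side in one program.
utf8 = "\u9752\u5C71"              # "青山", tagged as UTF-8
sjis = utf8.encode("Shift_JIS")    # same characters, tagged as Shift_JIS

puts utf8.encoding                 # UTF-8
puts sjis.encoding                 # Shift_JIS
puts utf8.length == sjis.length    # true: both have 2 characters
puts utf8.bytesize                 # 6 (3 bytes per character in UTF-8)
puts sjis.bytesize                 # 4 (2 bytes per character in Shift_JIS)
```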

In Defense of the CSI Approach

Scripts often use just one legacy encoding
Transcoding is slow
(Transcoding needs huge tables)
Transcoding is messy, example: Shift_JIS
- Backslash or Yen sign?
- What vendor additions?
- What company or personal 外字 (gaiji)?
CSI does not fix these problems, but allows one to stay vague

Main Problems of CSI

Overhead for implementation and on execution
Need to be general, few primitive functions to abstract encodings
→ potential efficiency problems
Raising exceptions for encoding conflicts
→ difficult to address during runtime
Will help programmers clean up their code
Lowest common denominator approach to string functionality

Internationalization in Ruby 1.9: Functionality

String basics
Transcoding
Source file encoding
Character escapes
Input and output
Default encodings

`String` Basics

Class String represents strings
s = "青山"
string.encoding → Encoding
s.encoding → #<Encoding:UTF-8>
string.length → length in characters
s.length → 2
(same per-character behavior for other String methods)
Character-based functionality for all other string and regular expression methods
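
The basics above can be tried out with a short sketch like the following. It uses \u escapes for "青山" so that the example does not depend on the encoding of the file it is pasted into:

```ruby
s = "\u9752\u5C71"               # "青山"

puts s.encoding                  # UTF-8
puts s.length                    # 2 (characters)
puts s.bytesize                  # 6 (bytes)
puts s[0] == "\u9752"            # true: indexing is per character
puts s.reverse == "\u5C71\u9752" # true: so is reversing
```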

String Iteration

Ruby 1.8:
String.each iterates over lines in a text
Ruby 1.9:
- String.each no longer exists
- String.each_char iterates over characters (as small strings)
- String.each_line iterates over lines (as strings)
- String.each_byte iterates over bytes (as integers)
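
A small sketch of the three Ruby 1.9 iterators on the same two-line text (variable names are illustrative only):

```ruby
text = "\u9752\n\u5C71"            # "青", newline, "山"

chars = text.each_char.to_a        # characters as one-character strings
lines = text.each_line.to_a        # lines as strings, newline kept
bytes = text.each_byte.to_a        # bytes as integers

p chars.length                     # 3
p lines.length                     # 2
p bytes.length                     # 7 (3 + 1 + 3 bytes in UTF-8)
p text.respond_to?(:each)          # false in Ruby 1.9 and later
```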

Transcoding: `String.encode`

Converting a string to another encoding: s2 = string.encode(to_enc)
Explicit from-encoding: s2 = s.encode(to_enc, from_enc)
Convert to default_internal (by default no errors): s2 = s.encode
Options: s.encode(..., invalid: :ignore) (options for how to handle transcoding errors)
Destructive conversion: s1.encode!(...)
Changing encoding without changing bytes: s.force_encoding enc
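
A hedged sketch of these calls on a short string. It uses the undef: and replace: options of String#encode; which other option values a particular Ruby version accepts is not covered here:

```ruby
utf8 = "D\u00FCrst"                              # "Dürst" in UTF-8

# Converting to another encoding changes the bytes.
latin1 = utf8.encode("ISO-8859-1")
p latin1.bytes                                   # [68, 252, 114, 115, 116]

# Explicit from-encoding: treat the bytes as ISO-8859-1, whatever the label.
raw = [0x44, 0xFC].pack("C*")                    # binary bytes "D", 0xFC
p raw.encode("UTF-8", "ISO-8859-1") == "D\u00FC" # true

# Options decide what happens with characters the target cannot express.
p utf8.encode("US-ASCII", undef: :replace, replace: "?")   # "D?rst"

# force_encoding changes only the label; the bytes stay the same.
relabeled = latin1.dup.force_encoding("UTF-8")
p relabeled.bytes == latin1.bytes                # true
p relabeled.valid_encoding?                      # false: 0xFC is not valid UTF-8
```

Without the options, utf8.encode("US-ASCII") raises Encoding::UndefinedConversionError, because "ü" has no US-ASCII equivalent.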

Setting the Source File Encoding

Each source file has an encoding
Encoding is US-ASCII if not specified
Source encoding has to be ASCII-compatible (→ ~~UTF-16~~)
Encoding can be set in comment on first (sometimes second) line:
# ... coding: encoding ...
Example:
#!/usr/bin/env ruby
# -*- coding: UTF-8 -*-
UTF-8 BOM also declares UTF-8 ☹

IMPORTANT: Declare your source encoding

Using the Source File Encoding

Source file encoding sets:
- Encoding of strings
- Encoding of regular expressions
- Encoding of identifiers
- Encoding is US-ASCII for ASCII-only items
Encoding is checked (not for comments)
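
One way to observe the effect of a coding comment is to load a small script that declares its own encoding. The sketch below writes such a script to a temporary file; the global $seen and the file name prefix are hypothetical helpers used only to carry the results out of the loaded file:

```ruby
require "tempfile"

# A two-line script that declares ISO-8859-1 as its source encoding.
script = "# -*- coding: ISO-8859-1 -*-\n" \
         "$seen = [__ENCODING__, 'abc'.encoding, /abc/.encoding, :abc.encoding]\n"

Tempfile.open(["enc_demo", ".rb"]) do |f|
  f.write(script)
  f.flush
  load f.path                      # the coding comment applies to this file only
end

p $seen
# source encoding and string literal: ISO-8859-1
# ASCII-only regular expression and symbol: US-ASCII
```

In Ruby 2.0 and later the default source encoding is UTF-8 rather than US-ASCII, but an explicit declaration behaves the same way.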

Internationalization of Ruby Identifiers

Assign a value to variable π:
π = 3.1415...
Create another name (λ) for a method:
alias λ lambda
Limitations:
- Only works if encoding is the same
- Only ASCII A-Z are considered upper-case
  (constants, for class names)
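
A runnable variant of the two ideas above, assuming a UTF-8 source file. The class Circle and the Japanese method name 面積 ("area") are hypothetical and chosen only for illustration:

```ruby
# -*- coding: UTF-8 -*-
class Circle
  def initialize(radius)
    @radius = radius
  end

  def area
    π = 3.14159                  # local variable with a non-ASCII name
    π * @radius**2
  end
  alias 面積 area                 # second, non-ASCII name for the same method
end

circle = Circle.new(2)
puts circle.area
puts circle.面積 == circle.area   # true: both names call the same method
```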

Unicode Character Escapes

History: charesc
Fixed-length syntax:
"Hello \u9752\u5C71" → "Hello 青山"
"Martin D\u00FCrst" → "Martin Dürst"
Variable-length:
"Hello \u{9752 5C71}" → "Hello 青山"
"Martin D\u{FC}rst" → "Martin Dürst"
Encoding automatically is UTF-8
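
A short sketch comparing the two escape forms. The last lines add an assumption-free detail of the syntax: code points above U+FFFF do not fit into four hex digits, so they need the brace form:

```ruby
fixed    = "Hello \u9752\u5C71"      # exactly four hex digits per escape
variable = "Hello \u{9752 5C71}"     # one or more code points inside braces

puts fixed == variable               # true
puts fixed.encoding                  # UTF-8
puts "Martin D\u00FCrst" == "Martin D\u{FC}rst"   # true

# Code points beyond U+FFFF need the brace form.
beyond_bmp = "\u{20BB7}"
puts beyond_bmp.length               # 1
puts beyond_bmp.bytesize             # 4
```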

Input and Output

Setting encoding on input (bytes conserved):
open "filename", "mode:external"
open "file.txt", "r:Shift_JIS"
Transcoding on input and output:
open "filename", "mode:external:internal"
open "file.txt", "r:Shift_JIS:UTF-8"
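
The difference between the two mode strings can be seen with a temporary Shift_JIS file. This is a sketch only; the file name prefix is hypothetical and the example assumes that no default internal encoding has been set:

```ruby
require "tempfile"

text = "\u9752\u5C71"                                 # "青山"
as_read = nil
transcoded = nil

Tempfile.open("sjis_demo") do |f|
  f.binmode
  f.write(text.encode("Shift_JIS"))                   # 4 bytes on disk
  f.flush

  # External encoding only: bytes are kept, the string is tagged Shift_JIS.
  as_read = File.open(f.path, "r:Shift_JIS") { |io| io.read }

  # External and internal encoding: transcoded to UTF-8 while reading.
  transcoded = File.open(f.path, "r:Shift_JIS:UTF-8") { |io| io.read }
end

puts as_read.encoding      # Shift_JIS
puts as_read.bytesize      # 4
puts transcoded.encoding   # UTF-8
puts transcoded == text    # true
```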

Default Encodings

Encoding.default_external:
- default for external encoding
- derived from LANG on Unix/Linux
- derived from legacy system encoding on Windows
Encoding.default_internal:
- default for internal encoding
- by default undefined (≊ default external)
File name encoding: Currently concept only

Setting Default Encodings

Command line option -E:
- Setting external encoding: -E euc-jp
- Setting both external and internal: -E euc-jp:utf-8
- Setting internal encoding: -E :utf-8
Command line option -U:
- Sets default_internal to UTF-8
- Close to UCS mode for Ruby
Also possible to get and set inside program
Set only once, either outside the application (command line) or inside the application itself
Do not set in libraries
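
Reading the current defaults inside a program is harmless and can be sketched as follows; the values printed depend on the platform, the locale, and any -E or -U option given:

```ruby
ext = Encoding.default_external    # derived from the environment or set with -E
int = Encoding.default_internal    # nil unless set, e.g. with -U or -E :utf-8

puts "default_external: #{ext}"
puts "default_internal: #{int.inspect}"
```

Running the same lines with, for example, -E euc-jp:utf-8 on the command line should report EUC-JP and UTF-8 respectively.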

Internationalization in Ruby 1.9: Behind the Scenes

Encoding details
- What defines an encoding
- Important encodings
- Other encodings covered
Transcoding implementation
- Why new transcoding implementation
- Transcoding architecture

What Defines an Encoding

A few bits in flags field of String object
(hash if there are not enough bits for all encodings used)
Encoding primitives
- Allowed byte patterns
- Which bytes are one character
Regular expression functionality
- Various properties
Transcoders

Important Encodings

US-ASCII
ASCII-8BIT (alias BINARY)
UTF-8

These three are built in, others are loaded dynamically

US-ASCII

Encoding can be ascii_compatible? (or not)
For Ruby, UTF-8 and Shift_JIS are ascii_compatible?, iso-2022-jp and UTF-16[BE|LE] are not
Transcodings keep US-ASCII range fixed (0x5C is always 0x5C)
Source data that contains only US-ASCII is labeled as US-ASCII
Different encodings can be combined if the actual data is ASCII-only
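
A sketch of ASCII compatibility and of the rule for combining strings. Concatenation succeeds when one side is ASCII-only and raises Encoding::CompatibilityError when both sides contain non-ASCII data in different encodings (variable names are illustrative only):

```ruby
p Encoding::UTF_8.ascii_compatible?         # true
p Encoding::Shift_JIS.ascii_compatible?     # true
p Encoding::UTF_16LE.ascii_compatible?      # false
p Encoding::ISO_2022_JP.ascii_compatible?   # false

utf8       = "\u9752\u5C71"                 # non-ASCII, UTF-8
ascii_sjis = "abc".encode("Shift_JIS")      # ASCII-only, tagged Shift_JIS
kanji_sjis = utf8.encode("Shift_JIS")       # non-ASCII, Shift_JIS

combined = utf8 + ascii_sjis                # works: one side is ASCII-only
p combined.encoding                         # #<Encoding:UTF-8>

conflict = begin
  utf8 + kanji_sjis                         # both non-ASCII, encodings differ
  nil
rescue Encoding::CompatibilityError => e
  e
end
p conflict.class                            # Encoding::CompatibilityError
```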

ASCII-8BIT

Alias: BINARY
Use for binary data
ASCII interpreted as ASCII to simplify notation
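
A small sketch with the first four bytes of a PNG file as binary data. It shows that ASCII bytes inside a binary string can still be written and compared as ordinary ASCII text:

```ruby
header = [0x89, 0x50, 0x4E, 0x47].pack("C*")   # first four bytes of a PNG file

p header.encoding == Encoding::BINARY          # true (BINARY is ASCII-8BIT)
p header.length                                # 4: one "character" per byte
p header[1, 3] == "PNG"                        # true: ASCII bytes compare as ASCII
p header.getbyte(0)                            # 137
```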

UTF-8

Unicode in Ruby means UTF-8
Processing highly optimized
Wide range of properties supported in regular expressions,...
\uABCD escapes produce UTF-8 strings
Can be set as default_internal with -U
Used as transcoding hub
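
As an illustration of property support in regular expressions, the following sketch picks out runs of different scripts from a UTF-8 string. It assumes a UTF-8 source file, since the properties used are Unicode properties:

```ruby
text = "\u9752\u5C71 Aoyama \u3042\u304A\u3084\u307E"   # "青山 Aoyama あおやま"

p text[/\p{Han}+/] == "\u9752\u5C71"                    # true: Han ideographs
p text[/\p{Hiragana}+/] == "\u3042\u304A\u3084\u307E"   # true: hiragana
p text[/\p{Latin}+/]                                    # "Aoyama"
p text.scan(/\p{Alpha}+/).length                        # 3 words
```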

Other Encodings Supported

iso-8859-X
windows-125X (except windows-1258)
many Mac encodings
Shift_JIS, EUC-JP, iso-2022-jp
Korean and Chinese encodings (including GB 18030)
UTF-16[BE|LE], UTF-32[BE|LE] (not UTF-16, nor UTF-32)
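
Whether a given encoding is known to the running interpreter can be checked by name. A minimal sketch, using a few of the names listed above:

```ruby
# Names are matched case-insensitively.
p Encoding.find("iso-8859-1") == Encoding::ISO_8859_1   # true
p Encoding.find("EUC-JP")                               # #<Encoding:EUC-JP>
p Encoding.find("GB18030")                              # #<Encoding:GB18030>
p Encoding.name_list.include?("Windows-1252")           # true
```

Encoding.find raises ArgumentError for a name it does not know.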

Tell me if you need something else!

Why New Transcoding Library

Copyright issues
Platform-independence
One-stop shopping
Adapt to Ruby's needs (memory, performance)
Use Ruby mechanisms (e.g. dynamic loading, expose transcoders as objects,...)

Transcoding Library Architecture

UTF-8 as transcoding hub,
but bypass possible (e.g. EUC-JP ↔ Shift_JIS)
Transcoding one byte at a time
- Fits most encodings
- Reasonably small table sizes
Single core loop instruction interpreter, rest is data
→ single point of extension for new transcoding options
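
The hub can be made visible from Ruby code, assuming a Ruby version that provides the Encoding::Converter class. The sketch below asks for the conversion path between two encodings that have no direct transcoder:

```ruby
hub = Encoding::Converter.new("ISO-8859-1", "EUC-JP")
p hub.convpath
# [[#<Encoding:ISO-8859-1>, #<Encoding:UTF-8>],
#  [#<Encoding:UTF-8>, #<Encoding:EUC-JP>]]

# For a pair with a direct transcoder, the path may be a single step.
p Encoding::Converter.search_convpath("EUC-JP", "Shift_JIS")
```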

Conversion Data Example

One table per byte
(Shift_JIS ⇒ UTF-8, leading byte)
Byte value	Action
0x00
...
0x61 ('a')	Copy input (0x61, 'a')
...
0xC3 ('ﾃ')	output "\xEF\xBE\x83"
...
0xE0	Go to table for 0xE0...
...

Problems with one Table per Byte

1K memory per leading byte
Similar values are frequent
(undefined, invalid,...)
Sparse tables are frequent
(e.g. UTF-8 ⇒ something)

Two Tables per Byte

Indices (offsets):
- One byte per byte
- Table size 256 bytes
  (or 64 bytes for UTF-8 trailing bytes)
- index 0-255 into infos
Instructions (infos):
- Pointers: Proceed to next table for next byte
- Actual output: 1 byte, 2 bytes, 3 bytes (4 bytes)
- Copy input
- Invalid, undefined
- Call some function (e.g. UTF-16...)

Two Tables Example

(Shift_JIS ⇒ UTF-8, leading byte)

`offsets`
Byte value	`offset`
0x00	0
...	0
0x61 ('a')	0
...
0xC3 ('ﾃ')	23
...
0xE0	42
...

`infos`
`offset`	`info`
0	0
...
23	output "\xEF\xBE\x83"
...
42	goto table for 0xE0...
...
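
The lookup can be modeled in a few lines of Ruby. This is a toy model only: the offsets, the info values, and the method name lead_byte_action are made up for illustration, most bytes are left as a placeholder, and the real tables are generated data inside the interpreter:

```ruby
# Toy model of the two-table lookup for the leading byte (Shift_JIS to UTF-8).
INFOS = [
  :copy,          # 0: copy the input byte as is (shared by all ASCII bytes)
  :undefined,     # 1: placeholder for every byte not filled in by this toy model
  "\uFF83",       # 2: output for 0xC3, half-width katakana (EF BE 83 in UTF-8)
  :next_table     # 3: go to the table for the second byte
].freeze

OFFSETS = Array.new(256, 1)                  # 256 one-byte indices into INFOS
(0x00..0x7F).each { |b| OFFSETS[b] = 0 }     # all ASCII bytes share one info
OFFSETS[0xC3] = 2
OFFSETS[0xE0] = 3

# Look up the action for one leading byte.
def lead_byte_action(byte)
  info = INFOS[OFFSETS[byte]]
  info == :copy ? byte.chr(Encoding::UTF_8) : info
end

p lead_byte_action(0x61)                # "a"
p lead_byte_action(0xC3) == "\uFF83"    # true
p lead_byte_action(0xE0)                # :next_table
```

Assuming 4-byte info entries, as the 1K figure for one table per byte suggests, this toy leading-byte table needs 256 bytes of offsets plus 16 bytes of infos instead of 1K.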

Advantages of Two Tables per Byte

Sharing infos in same byte
Increase sharing with special info values
(e.g. copy-as-is)
Sharing tables when they are equal
- Shared code structure
  (e.g. offsets, e.g. UTF-8⇒EUC-JP, UTF-8⇒Shift_JIS)
- Shared conversion results
  (e.g. Shift_JIS⇒UTF-8, CP932⇒UTF-8)
Working on table sharing when they are close, but not equal

Next Steps for Transcoding

More encodings
Reimplement table switching (iso-2022-jp,...)
Clean up table generation script
More options (e.g. fallbacks)
Backtracking (e.g. for windows-1258)
More data optimizations

Unicode in Ruby on Rails

Ruby on Rails uses UTF-8 by default
ActiveSupport::Multibyte provides more support (e.g. normalization):
(http://api.rubyonrails.org/classes/ActiveSupport/Multibyte/Chars.html)
Core framework for localization (mailing list), very flexible

Internationalization in Ruby 1.9: Advice

Use UTF-8 as default_internal with -U option
Label your source files
For libraries, document expectations
- ASCII only
- UTF-8 only
- Handling various encodings
  example: CSV library
For the moment, use Ruby on Rails components for internationalization beyond character encoding

Future Work

Internationalization beyond character encoding
- Number formatting
- Dates and times
- ...
Standardization among various Ruby implementations
(JRuby, MacRuby, IronRuby, Rubinius, MagLev,...)

Acknowledgements

Yukihiro Matsumoto (Matz) and many others for Ruby
Matz for some great discussions
The members of my lab, in particular Takuya Shimada, Kouji Tominaga, Yoshihiro Kambayashi, and Tatsuya Mizuno

Conclusions

Ruby is exciting, even without I18N
Internationalization much better in Ruby 1.9
More work ahead

Questions & Answers

Colophon

or how these slides were produced,
and how they are best viewed

XHTML and CSS
Edited with Amaya
Ruby Logo
Best viewed with Opera presentation mode (F11)