Internationalization in Ruby 1.9
IUC 33, San Jose, CA, U.S.A., October 2009
Martin J. DÜRST
duerst@it.aoyama.ac.jp
Aoyama Gakuin University
© 2009 Martin J. Dürst, Aoyama Gakuin University
Outline
- What is Ruby?
- Internationalization in Ruby 1.8
- Internationalization in Ruby 1.9
- Architecture
- Functionality
- Behind the Scenes
- Outlook
- Q & A
Abstract
Ruby is a purely object-oriented scripting language which is easy to learn
for beginners and highly appreciated by experts for its productivity and depth.
Internationalization of Ruby made a big leap forwards when this January, Ruby
1.9.1, the first stable release of the Ruby 1.9 series, was released. While
previous versions of Ruby mostly treated text data as byte sequences, strings
in Ruby 1.9 are sequences of characters. Because Ruby tags each string with
encoding information internally, different applications can choose different
internationalization models.
The presentation will give a short overview of Ruby as a programming
language, and introduce the new internationalization features in detail. We
will be concentrating on how to use Ruby with Unicode, which in Ruby's case
means UTF-8. We will also discuss internationalization support in Ruby on
Rails, the popular Web application framework written in Ruby.
Assumptions
Talk assumes that you know
- Unicode basics
- Programming basics
Slides available at http://www.sw.it.aoyama.ac.jp/2009/pub/IUC33-ruby1.9/
Some parts of this talk are based on:
What is Ruby?
- Tiny annotations for Japanese pronunciation
- A programming (scripting) language
- Interpreted (like Perl or Python)
- Simple and short syntax, easy to get into
- Fully object-oriented (more than Java or C++ or ...)
- Intuitive semantics
- Lots of fun!
Ruby History
- Developed by Yukihiro MATSUMOTO ("Matz") starting in 1993
- "Better" Perl, fully object-oriented, includes elements of Smalltalk,
Lisp,...
- 2000: 'discovered' in the West by Dave Thomas
- 2004: David Heinemeier Hansson publishes Ruby on Rails high-productivity Web
application framework
- Current productive versions: 1.8(.6/7), 1.9(.1); in preparation: 1.9.2
Ruby Highlights
- Open source
- All the trimmings: libraries, debugger, profiler, interactive shell,
documentation tools, testing framework
- Very flexible type system (open classes, duck typing)
- Metaprogramming (program your program), extensibility (C)
- Very easy to create internal DSLs (domain-specific languages):
- Look like independent, simple, domain specific languages
- Part of Ruby: full programming power available where necessary
- Example by author: SVuGy
The Past: Internationalization in Ruby 1.8
The problem:
- Strings are just sequences of bytes
- String methods work on bytes
Good news:
- Basic UTF-8 support available
- Unicode means UTF-8
- Regular expressions can work with UTF-8 characters
- Rails provides more with ActiveSupport::Multibyte
Now much better: Ruby 1.9
Internationalization in Ruby 1.9: The Architecture
- Code Set Independent (CSI, not UCS)
- Each Ruby string has an encoding
- Very small set of data and functions per encoding
- Encoding conflicts raise exceptions
Japanese mostly use Multilingualization (M17N) instead of
Internationalization (I18N)
Original proposal: Yukihiro Matsumoto and Masahiko Nawate,
Multilingual Text Manipulation Method for Ruby Language, IPSJ Journal, Vol. 46,
No. 11, Nov. 2005 (in Japanese).
UCS: Universal Code Set
- In theory: anything, in practice: Unicode
- Unicode inside
- Convert to Unicode early
- Convert back late
- Used by Java, C#, Perl, Python, the WWW,...
CSI: Code Set Independent
- Not dependent on any specific encoding
- Old style: configure a single encoding at compile time
- Ruby 1.9:
- Multiple encodings in the same application
- Each String is tagged with an encoding
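The tagging model can be seen in a small sketch. The variable names are illustrative, and a Ruby 1.9 or later interpreter is assumed.

```ruby
# Each string carries its own encoding tag.
utf8 = "\u9752\u5C71"              # "青山"; \u escapes yield a UTF-8 string
sjis = utf8.encode("Shift_JIS")    # same characters, different bytes and tag

# Mixing non-ASCII strings with different tags raises an exception.
conflict = begin
  utf8 + sjis
  nil
rescue Encoding::CompatibilityError => e
  e.class
end

# The conflict is resolved by transcoding explicitly.
combined = utf8 + sjis.encode("UTF-8")
```

No conversion happens behind the programmer's back; the application decides where transcoding takes place.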
In Defense of the CSI Approach
- Scripts often use just one legacy encoding
- Transcoding is slow
- (Transcoding needs huge tables)
- Transcoding is messy, example: Shift_JIS
- Backslash or Yen sign?
- What vendor additions?
- What company or personal 外字 (gaiji)?
- CSI does not fix these problems, but allows one to stay vague
Main Problems of CSI
- Overhead for implementation and on execution
- Need to be general, few primitive functions to abstract encodings
→ potential efficiency problems
- Raising exceptions for encoding conflicts
→ difficult to address during runtime
Will help programmers to clean up their job
- Lowest common denominator approach to string functionality
Internationalization in Ruby 1.9: Functionality
- String basics
- Transcoding
- Source file encoding
- Character escapes
- Input and output
- Default encodings
String Basics
- Class String represents strings
s = "青山"
string.encoding → Encoding
s.encoding → #<Encoding:UTF-8>
string.length → length in characters
s.length → 2
(same per-character behavior for other String methods)
- Character-based functionality for all other string and regular expression
methods
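A minimal sketch of the character-based behavior, assuming a UTF-8 string built with \u escapes so that it does not depend on the source file encoding:

```ruby
s = "\u9752\u5C71"        # "青山"

enc      = s.encoding     # the Encoding object the string is tagged with
chars    = s.length       # counts characters
bytes    = s.bytesize     # counts bytes
first    = s[0]           # first character, as a one-character string
reversed = s.reverse      # reverses characters, not bytes
```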
String Iteration
- Ruby 1.8:
String.each iterates over lines in a text
- Ruby 1.9:
String.each is no longer available
String.each_char iterates over characters (as small strings)
String.each_line iterates over lines (as strings)
String.each_byte iterates over bytes (as integers)
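The three iterators can be compared on one small string. This is a sketch with illustrative variable names; the text is two lines, "Dürst" and "青山".

```ruby
text = "D\u00FCrst\n\u9752\u5C71\n"

chars = text.each_char.to_a   # one-character strings
lines = text.each_line.to_a   # strings, each including its line terminator
bytes = text.each_byte.to_a   # integers in the range 0..255

has_each = text.respond_to?(:each)   # String has no plain each in Ruby 1.9
```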
Transcoding: String.encode
- Converting a string to another encoding:
s2 = string.encode(to_enc)
- Explicit from-encoding:
s2 = s.encode(to_enc, from_enc)
- Convert to default_internal (by default no errors):
s2 = s.encode
- Options:
s.encode(..., invalid: :ignore)
- (options for how to handle transcoding errors)
- Destructive conversion:
s1.encode!(...)
- Changing encoding without changing bytes:
s.force_encoding enc
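The calls above can be tried on concrete data. This is a sketch with illustrative variable names; the undef: and replace: options used here are one way of handling transcoding errors and are chosen for the example, not taken from the slide.

```ruby
s = "Martin D\u00FCrst"                      # UTF-8, 13 bytes

latin1     = s.encode("ISO-8859-1")          # transcoded, 12 bytes
round_trip = latin1.encode("UTF-8")

# Explicit from-encoding: the receiver's own tag is ignored.
raw     = [0x44, 0xFC, 0x72, 0x73, 0x74].pack("C*")   # binary bytes "D\xFCrst"
decoded = raw.encode("UTF-8", "ISO-8859-1")

# Characters that the target encoding cannot represent are replaced.
replaced = "\u9752\u5C71 ok".encode("US-ASCII", undef: :replace, replace: "?")

# force_encoding changes only the tag; the bytes stay the same.
relabeled = latin1.dup.force_encoding("UTF-8")
```

After force_encoding, the relabeled string holds Latin-1 bytes under a UTF-8 tag, so valid_encoding? reports false for it.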
Setting the Source File Encoding
- Each source file has an encoding
- Encoding is US-ASCII if not specified
- Source encoding has to be ASCII-compatible (→ not UTF-16)
- Encoding can be set in comment on first (sometimes second) line:
# ... coding: encoding
...
- Example:
#!/bin/env ruby
# -*- coding: UTF-8 -*-
- UTF-8 BOM also declares UTF-8 ☹
IMPORTANT: Declare your source encoding
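The effect of the coding comment can be checked with `__ENCODING__`, which reports the encoding of the file it appears in. The helper below is hypothetical: it writes a throwaway two-line script, loads it, and passes the result back through a global variable chosen for this sketch.

```ruby
require "tmpdir"

# Hypothetical helper: returns the source encoding Ruby assigns to a file
# whose first line is the given magic comment.
def source_encoding_for(magic_comment)
  Dir.mktmpdir do |dir|
    path = File.join(dir, "demo.rb")
    File.write(path, "#{magic_comment}\n$demo_source_encoding = __ENCODING__\n")
    load path
    $demo_source_encoding
  end
end

emacs_style = source_encoding_for("# -*- coding: UTF-8 -*-")
short_style = source_encoding_for("# coding: EUC-JP")
```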
Using the Source File Encoding
- Source file encoding sets:
- Encoding of strings
- Encoding of regular expressions
- Encoding of identifiers
- Encoding is US-ASCII for ASCII-only items
- Encoding is checked (not for comments)
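The US-ASCII rule for ASCII-only items is easy to observe on symbol and regular expression literals. This sketch uses \u escapes for the non-ASCII cases, so these literals come out as UTF-8 regardless of how the file itself is labeled.

```ruby
ascii_symbol = :abc.encoding         # ASCII-only symbol
ascii_regexp = /abc/.encoding        # ASCII-only regular expression

kanji_symbol = :"\u9752".encoding    # symbol containing "青"
kanji_regexp = /\u9752/.encoding     # regular expression containing "青"
```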
Internationalization of Ruby Identifiers
- Assign a value to variable π:
π = 3.1415...
- Create another name (λ) for a method:
alias λ lambda
- Limitations:
- Only works if encoding is the same
- Only ASCII A-Z are considered upper-case (constants, for class names)
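Both examples can be put into one runnable file. This is a sketch that assumes the file is saved as UTF-8 and that the coding comment is its first line; the names area, double and result are illustrative.

```ruby
# -*- coding: UTF-8 -*-
π = 3.14159                  # non-ASCII local variable name
area = π * 2.0 ** 2          # area of a circle with radius 2

alias λ lambda               # second name for Kernel#lambda
double = λ { |x| x * 2 }
result = double.call(21)
```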
Unicode Character Escapes
- History: charesc
- Fixed-length syntax:
"Hello \u9752\u5C71" → "Hello 青山"
"Martin D\u00FCrst" → "Martin Dürst"
- Variable-length:
"Hello \u{9752 5C71}" → "Hello 青山"
"Martin D\u{FC}rst" → "Martin Dürst"
- Encoding automatically is UTF-8
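The two escape syntaxes can be compared directly. This is a small sketch with illustrative variable names; the last line uses a code point above U+FFFF, which needs the variable-length form because it has more than four hexadecimal digits.

```ruby
fixed    = "Hello \u9752\u5C71"       # exactly four hex digits per escape
variable = "Hello \u{9752 5C71}"      # one to six hex digits, space-separated

same       = (fixed == variable)
encoding   = fixed.encoding
codepoints = "\u{9752 5C71}".unpack("U*")   # code points as integers
astral     = "\u{1F600}"                    # one character outside the BMP
```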
Input and Output
- Setting encoding on input (bytes conserved):
open "filename", "mode:external"
open "file.txt", "r:Shift_JIS"
- Transcoding on input and output:
open "filename", "mode:external:internal"
open "file.txt", "r:Shift_JIS:UTF-8"
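The difference between the two mode strings shows up when the same file is read both ways. This sketch works in a temporary directory; the file names and variable names are illustrative, and it assumes that default_internal has not been set.

```ruby
require "tmpdir"

utf8 = "\u9752\u5C71"                             # "青山"

tagged, converted, written = Dir.mktmpdir do |dir|
  path = File.join(dir, "file.txt")
  File.binwrite(path, utf8.encode("Shift_JIS"))   # Shift_JIS bytes on disk

  # External encoding only: bytes are kept, the string is tagged Shift_JIS.
  a = File.open(path, "r:Shift_JIS") { |f| f.read }

  # External and internal: data is transcoded to UTF-8 while reading.
  b = File.open(path, "r:Shift_JIS:UTF-8") { |f| f.read }

  # On output, a UTF-8 string is transcoded to the external encoding.
  out = File.join(dir, "out.txt")
  File.open(out, "w:Shift_JIS") { |f| f.write(utf8) }

  [a, b, File.binread(out)]
end
```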
Default Encodings
Encoding.default_external:
- default for external encoding
- derived from LANG on Unix/Linux
- derived from legacy system encoding on Windows
Encoding.default_internal:
- default for internal encoding
- by default undefined (≊ default external)
- File name encoding: Currently concept only
Setting Default Encodings
- Command line option -E:
- Setting external encoding: -E euc-jp
- Setting both external and internal: -E euc-jp:utf-8
- Setting internal encoding: -E :utf-8
- Command line option -U:
- Sets default_internal to UTF-8
- Close to UCS mode for Ruby
- Also possible to get and set inside program
- Set only once, preferably from outside the program, otherwise at the application level
- Do not set in libraries
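Getting and setting the defaults inside a program might look as follows. This is a sketch only: it changes default_internal temporarily to show the effect on encode without arguments and then restores the previous value, which an application would normally not need to do.

```ruby
external = Encoding.default_external    # derived from the environment
saved    = Encoding.default_internal    # nil unless set, e.g. with -U or -E

eucjp = "\u9752\u5C71".encode("EUC-JP")

converted = begin
  Encoding.default_internal = Encoding::UTF_8
  eucjp.encode                          # no argument: convert to default_internal
ensure
  Encoding.default_internal = saved     # restore the previous setting
end
```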
Internationalization in Ruby 1.9: Behind the Scenes
- Encoding details
- What defines an encoding
- Important encodings
- Other encodings covered
- Transcoding implementation
- Why new transcoding implementation
- Transcoding architecture
What Defines an Encoding
- A few bits in flags field of String object
(hash if there are not enough bits for all encodings used)
- Encoding primitives
- Allowed byte patterns
- Which bytes are one character
- Regular expression functionality
- Transcoders
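What "allowed byte patterns" and "which bytes are one character" mean in practice can be seen by putting different tags on the same bytes. A minimal sketch, with illustrative variable names:

```ruby
sjis = "\u9752\u5C71".encode("Shift_JIS")   # "青山" as Shift_JIS bytes

# Same bytes, different tags: the character count depends on the encoding.
as_sjis   = sjis.length
as_binary = sjis.dup.force_encoding("ASCII-8BIT").length

# The same bytes do not form an allowed byte pattern in UTF-8.
valid_as_sjis = sjis.valid_encoding?
valid_as_utf8 = sjis.dup.force_encoding("UTF-8").valid_encoding?
```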
Important Encodings
- US-ASCII
- ASCII-8BIT (alias BINARY)
- UTF-8
These three are built in, others are loaded dynamically
US-ASCII
- Encoding can be ascii_compatible? (or not)
For Ruby, UTF-8 and Shift_JIS are ascii_compatible?,
iso-2022-jp and UTF-16[BE|LE] are not
- Transcodings keep US-ASCII range fixed (0x5C is always 0x5C)
- Source data that contains only US-ASCII is labeled as US-ASCII
- Different encodings can be combined if the actual data is ASCII-only
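These rules can be checked from within Ruby. The sketch below queries ascii_compatible? for the encodings named above and then combines two strings with different tags, which works because one of them holds only ASCII data.

```ruby
compatible = {
  "UTF-8"       => Encoding::UTF_8.ascii_compatible?,
  "Shift_JIS"   => Encoding::Shift_JIS.ascii_compatible?,
  "ISO-2022-JP" => Encoding::ISO_2022_JP.ascii_compatible?,
  "UTF-16BE"    => Encoding::UTF_16BE.ascii_compatible?,
}

utf8       = "\u9752\u5C71"               # non-ASCII data, tagged UTF-8
sjis_ascii = "abc".encode("Shift_JIS")    # ASCII-only data, tagged Shift_JIS

joined = utf8 + sjis_ascii                # no exception: actual data is ASCII-only
```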
ASCII-8BIT
- Alias: BINARY
- Use for binary data
- ASCII interpreted as ASCII to simplify notation
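A short sketch of binary data handling, using the first four bytes of a PNG file as sample data (the sample is an assumption made for this example):

```ruby
data = [0x89, 0x50, 0x4E, 0x47].pack("C*")   # pack returns a binary string

is_binary  = (data.encoding == Encoding::BINARY)
same_alias = (Encoding::BINARY == Encoding::ASCII_8BIT)

size  = data.length            # for binary strings, length counts bytes
label = (data[1, 3] == "PNG")  # the ASCII part compares equal to an ASCII literal
```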
UTF-8
- Unicode in Ruby means UTF-8
- Processing highly optimized
- Wide range of properties supported in regular expressions,...
- \uABCD escapes produce UTF-8 strings
- Can be set as default_internal with -U
- Used as transcoding hub
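As one illustration of property support in regular expressions, script properties can pick characters out of mixed text. The sample text is made up for this sketch: Latin letters, two kanji and one hiragana character.

```ruby
text = "Aoyama \u9752\u5C71 \u3042"

kanji        = text.scan(/\p{Han}/)         # characters of the Han script
hiragana_pos = text =~ /\p{Hiragana}/       # character index of the first match
latin_words  = text.scan(/\p{Latin}+/)
```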
Other Encodings Supported
- iso-8859-X
- windows-125X (except windows-1258)
- many Mac encodings
- Shift_JIS, EUC-JP, iso-2022-jp
- Korean and Chinese encodings (including GB 18030)
- UTF-16[BE|LE], UTF-32[BE|LE] (not UTF-16, nor UTF-32)
Tell me if you need something else!
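Which encodings a given interpreter knows can be queried at runtime. This is a sketch; the exact list depends on the Ruby version and build, so only a few lookups are shown.

```ruby
# Lookup by name is case-insensitive and returns an Encoding object.
latin1 = Encoding.find("iso-8859-1")
eucjp  = Encoding.find("euc-jp")
gb     = Encoding.find("GB18030")

# All names, including aliases, known to this interpreter.
names = Encoding.name_list
windows_names = names.grep(/\AWindows-125/i).sort
```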
Why New Transcoding Library
- Copyright issues
- Platform-independence
- One-stop shopping
- Adapt to Ruby's needs (memory, performance)
- Use Ruby mechanisms (e.g. dynamic loading, expose transcoders as
objects,...)
Transcoding Library Architecture
- UTF-8 as transcoding hub,
but bypass possible (e.g. EUC-JP ↔Shift_JIS)
- Transcoding one byte at a time
- Fits most encodings
- Reasonably small table sizes
- Single core loop instruction interpreter, rest is data
→ single point of extension for new transcoding options
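The hub structure is visible from Ruby through Encoding::Converter.search_convpath, which returns the chain of conversion steps between two encodings. In this sketch, the first path goes through UTF-8; the second call asks for the EUC-JP to Shift_JIS path, where a bypass may result in a single step.

```ruby
# ISO-8859-1 to EUC-JP: two steps, with UTF-8 in the middle.
via_hub = Encoding::Converter.search_convpath("ISO-8859-1", "EUC-JP")

# EUC-JP to Shift_JIS: inspect the result to see whether the hub is bypassed.
japanese = Encoding::Converter.search_convpath("EUC-JP", "Shift_JIS")
steps    = japanese.length
```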
Conversion Data Example
One table per byte (Shift_JIS ⇒ UTF-8, leading byte)
Byte value | Action
0x00 |
... |
0x61 ('a') | Copy input (0x61, 'a')
... |
0xC3 ('テ') | output "\xEF\xBE\x83"
... |
0xE0 | Go to table for 0xE0...
... |
Problems with one Table per Byte
- 1K memory per leading byte
- Similar values are frequent
(undefined, invalid,...)
- Sparse tables are frequent
(e.g. UTF-8 ⇒ something)
Two Tables per Byte
- Indices (offsets):
- One byte per byte
- Table size 256 bytes
(or 64 bytes for UTF-8 trailing bytes)
- index 0-255 into infos
- Instructions (infos):
- Pointers: Proceed to next table for next byte
- Actual output: 1 byte, 2 bytes, 3 bytes (4 bytes)
- Copy input
- Invalid, undefined
- Call some function (e.g. UTF-16...)
Two Tables Example
(Shift_JIS ⇒ UTF-8, leading byte)
offsets
Byte value | offset
0x00 | 0
... | 0
0x61 ('a') | 0
... |
0xC3 ('テ') | 23
... |
0xE0 | 42
... |
infos
offset | info
0 | 0
... |
23 | output "\xEF\xBE\x83"
... |
42 | goto table for 0xE0...
... |
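The offsets/infos idea can be modeled in a few lines of Ruby. This is a toy model, not the actual data structures of the transcoding library: the symbols :copy and :unmodeled, the index values and the convert lambda are all invented for the sketch, and double-byte lead bytes such as 0xE0 are deliberately left out.

```ruby
# infos: each distinct instruction is stored only once.
infos = [
  :copy,                # 0: copy the input byte unchanged
  :unmodeled,           # 1: not covered by this sketch (e.g. lead byte 0xE0)
  [0xEF, 0xBE, 0x83],   # 2: output these three UTF-8 bytes
]

# offsets: one small index per possible byte value (256 entries).
offsets = Array.new(256, 1)
(0x00..0x7F).each { |b| offsets[b] = 0 }   # ASCII range shares the :copy info
offsets[0xC3] = 2                          # halfwidth katakana "te"

# Core loop: look up the index, then interpret the instruction.
convert = lambda do |bytes|
  bytes.flat_map do |b|
    info = infos[offsets[b]]
    case info
    when :copy      then [b]
    when :unmodeled then raise ArgumentError, format("byte 0x%02X is not modeled", b)
    else info
    end
  end
end

out = convert.call([0x61, 0xC3]).pack("C*").force_encoding("UTF-8")
```

All 128 ASCII byte values point at the same :copy entry, which is the kind of sharing the two-table layout is meant to enable.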
Advantages of Two Tables per Byte
- Sharing infos in same byte
- Increase sharing with special info values
(e.g. copy-as-is)
- Sharing tables when they are equal
- Shared code structure
(e.g. offsets, e.g. UTF-8⇒EUC-JP, UTF-8⇒Shift_JIS)
- Shared conversion results
(e.g. Shift_JIS⇒UTF-8, CP932⇒UTF-8)
- Working on table sharing when they are close, but not equal
Next Steps for Transcoding
- More encodings
- Reimplement table switching (iso-2022-jp,...)
- Cleanup table generation script
- More options (e.g. fallbacks)
- Backtracking (e.g. for windows-1258)
- More data optimizations
Unicode in Ruby on Rails
Internationalization in Ruby 1.9: Advice
- Use UTF-8 as default_internal with -U option
- Label your source files
- For libraries, document expectations
- ASCII only
- UTF-8 only
- Handling various encodings
example: CSV library
- For the moment, use Ruby on Rails components for internationalization
beyond character encoding
Future Work
- Internationalization beyond character encoding
- Number formatting
- Dates and times
- ...
- Standardization among various Ruby implementations
(JRuby, MacRuby, IronRuby, Rubinius, MagLev,...)
Acknowledgements
- Yukihiro Matsumoto (Matz) and many others for Ruby
- Matz for some great discussions
- The members of my lab, in particular Takuya Shimada, Kouji Tominaga,
Yoshihiro Kambayashi, and Tatsuya Mizuno
Conclusions
- Ruby is exciting, even without I18N
- Internationalization much better in Ruby 1.9
- More work ahead
Questions & Answers
Colophon
or how these slides were produced,
and how they are best viewed