31st Internationalization and Unicode Conference, October 2007, San Jose, CA, USA
Martin J. Dürst
Department of Integrated Information Technology, College of Science and Engineering, Aoyama Gakuin University
Sagamihara, Kanagawa, Japan
duerst@it.aoyama.ac.jp
http://www.sw.it.aoyama.ac.jp
Keywords: Ruby (Scripting Language), Internationalization, Transcoding Policy, Unicode
Ruby is a purely object-oriented scripting language that is rapidly growing in popularity due to its high productivity. Because it was invented in Japan, some basic internationalization features are available, but there is still a lot of work to do. This paper gives a short overview of the most important features of Ruby, and introduces the available internationalization-related features, concentrating on how to use Ruby with Unicode, which in Ruby's case means UTF-8. An analysis and outlook of planned directions for further internationalization work is also given.
An up-to-date version of this paper as well as slides used for the talk connected to this paper are available at http://www.sw.it.aoyama.ac.jp/2007/pub/IUC31-ruby.
A programming language should make it possible to write programs easily and conveniently, at a high level of abstraction, with results that are compact and readable. In the view of the author of this paper, as well as many others, the scripting language Ruby has managed to move a significant step closer to this goal. Programming is fun again, more fun than ever.
Programming languages, and in particular scripting languages, deal with text, and in this day and age of the World Wide Web, text is no longer just 7-bit characters. Internationalization and Unicode support are crucial for everyday programming tasks.
This paper discusses various aspects of Ruby internationalization. We start with a short overview of Ruby itself. Then we discuss the core internationalization features of Ruby 1.8, the current stable version, in Section 3, including shortcomings. In Section 4, we look at the CSI (Character Set Independent) model underlying the internationalization architecture for future versions of Ruby, contrast it with the UCS (Universal Character Set) model that is often used in combination with Unicode, and discuss the shortcomings of the CSI approach. In Section 5 we describe the progress towards implementing the new internationalization architecture in Ruby 1.9. In Section 6, we propose transcoding policies as a measure to implement the UCS model on top of the CSI architecture and to improve the usability of the CSI model.
This section gives a short overview of the Ruby scripting language, with particular attention to the features relevant for internationalization. Ruby's core functionality is close to that of Perl [Perl], and it borrows many syntactical idioms from Perl, but avoids many of Perl's shortcomings. Ruby also uses concepts from programming languages such as Smalltalk and Lisp. The language is continuously being improved, with work on internationalization one of the main items in the current cycle.
Ruby is the creation of Yukihiro Matsumoto, who started working on it in 1993. The idea was to create a purely object-oriented scripting language that was easy to use. In the view of this author and many others, this was highly successful.
Ruby started to become more widely known in 1999 in Japan, and around 2000, it was 'discovered' for the rest of the world by Dave Thomas, who wrote the standard Pickaxe book, now in its second edition [RubyPrag]. If you are already familiar with a programming language and with the basic concepts of object-oriented programming, then this is the first Ruby book that you should buy.
Ruby became even more well-known in 2004 when David Heinemeier Hansson published the first version of Ruby on Rails [Rails], a high-productivity Web application framework that is based on the principle of convention over configuration and takes extensive advantage of Ruby.
Ruby comes with an extensive range of built-in classes, standard libraries, a framework for installing extensions called RubyGems, a debugger, a profiler, an interactive shell, documentation tools, a testing framework, and many other productivity tools. Ruby has been ported to a large number of platforms, starting with Linux/Unix and including Microsoft Windows (native and Cygwin) as well as many others.
In the meantime, there are already several implementations of the Ruby interpreter. Besides the main implementation in C, there is JRuby, implemented in Java, Rubinius, based on the Smalltalk-80 VM architecture, and IronRuby, implemented on the Microsoft Common Language Runtime. Many well-known programmers in the agile software community use Ruby, and there is even a book that predicts that Ruby will replace Java [JavaRuby].
One of the features that makes Ruby so attractive is its concise syntax. This is achieved on two levels. On a micro-level, parentheses around method arguments and semicolons at the end of a line can be left out. This comes at the cost of higher parser complexity, something which does not bother the Ruby programmer. It often produces syntax that is more reminiscent of configuration settings than of an actual programming language. This kind of style is also called an Internal Domain Specific Language [DSL], 'Internal' referring to the fact that this DSL is just a natural part of the host language Ruby.
At a higher level, conciseness is achieved by using convenience methods on the root class Object and a default object that forms the context at the start of a script. In Java, printing to standard output has to be written
System.out.println("Hello World!");
after creating an application class to execute this command. In Ruby, this is simply written
print "Hello World!"
without the need for any further code. This allows for Perl-like one-liners while keeping the object-oriented concepts clean.
For control structures, Ruby mostly uses familiar syntax, although constructs such as if/elsif/else and while use an explicit end rather than brackets or implicit delimitation. A control structure particular to Ruby, and widely used, is the block. A block is a function-like piece of code that can be passed to a method and then can be called repeatedly by this method. A very simple example is the times method on integers:
5.times { print "Hello World!" }
This repeats the statement in the block (between { and }) five times. In object-oriented terms, the object 5 is requested to execute the block 'itself' (i.e. five) times. Blocks are widely used for iteration, for higher-order functions/methods, and for providing optional behavior.
As the 5.times example above showed, being fully object-oriented means that even integers and strings are objects, and that methods can directly be applied to constants. So it is possible to write "Hello".upcase (returning the string HELLO).
Another frequently used syntactical feature is string interpolation. Using the delimiters #{ and } within a string makes it possible to embed computed values in the string. This feature is not limited to variable values (as e.g. in Perl), but can also be used with method results. As an example, the following short program
who = "Unicode Conference"
print "Hello #{who}!"
will print Hello Unicode Conference!
Ruby has a very flexible type system. All classes are derived from the Object root class. Each object is an instance of some class, and the class defines what methods the object can execute.
In contrast to objects, variables do not have types; they are just references to objects. As a consequence, data structures such as Arrays can contain mixtures of objects. Here is an example of an Array constant that mixes integers, a floating-point number, and strings:
[1, 2, 3.14159, "four", "five and a quarter"]
Because variables are not typed, they also do not need to be declared. Variable scoping is lexical, but a variable lives as long as it may be used, e.g. in a block. In the computer science literature, this is called a closure.
What makes Ruby even more flexible is the fact that methods are called strictly by name, independent of the class hierarchy. This is called duck typing, because any object, of whatever class, is considered a duck as long as it walks and quacks like a duck. This is different from languages such as C++ and Java, where a method has to be declared in a common superclass in order for this method to be used with objects of different types.
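As a minimal illustrative sketch (the class names here are invented for this example), any two objects that implement a quack method can be used interchangeably, with no common superclass needed:
class Duck
  def quack; "Quack!"; end
end
class Robot
  def quack; "Beep... quack"; end
end
[Duck.new, Robot.new].each { |thing| print thing.quack }   # works for both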
While the class hierarchy can only be extended, not changed, any class (and with it its subclasses) can easily be extended by adding new methods. Also, existing methods can be redefined. This is often combined with renaming the existing method and calling it from the new definition to avoid losing the existing functionality. This kind of extensibility is crucial, because it reduces the 'reinventing the wheel' phenomenon: It is not necessary to reimplement or subclass a class just to add some features. This kind of extensibility also serves as a great catalyst and testbed for new methods that may eventually become built-in methods.
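For illustration, here is a small hypothetical sketch of both techniques: extending the built-in String class with a new method, and redefining an existing method after keeping the old one available under a new name with alias:
class String
  def shout                    # new method added to all strings
    upcase + '!'
  end
  alias old_reverse reverse    # keep the existing method under a new name
  def reverse
    old_reverse                # additional behavior could be added here
  end
end
print "hello".shout            # => HELLO!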
Modules can be used to add a collection of methods to a class, in a scaled-down version of multiple inheritance. They also serve as a namespacing mechanism.
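A typical example uses the standard Comparable module: a class that defines <=> and includes Comparable obtains <, <=, ==, >=, >, and between? for free (the Version class here is made up for this sketch):
class Version
  include Comparable
  attr_reader :number
  def initialize(number); @number = number; end
  def <=>(other); number <=> other.number; end
end
print Version.new(2) > Version.new(1)   # => true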
Ruby also makes it easy to write glue code to a library written in C, providing additional extensibility and a performance boost when needed.
Metaprogramming means that a program can program itself, i.e. change itself. The basic working mode of the Ruby interpreter is that class and method definitions are just program code that creates classes and methods instead of integers, Strings, Arrays, and the like.
Metaprogramming is used frequently in Ruby to increase extensibility, to write more compact and consistent programs, and to provide DSL functionality.
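As a small taste of this working mode, the following invented example uses define_method inside a class body, where ordinary Ruby code runs at class definition time and creates two methods:
class Greeter
  ['hello', 'goodbye'].each do |word|
    define_method("say_#{word}") { "#{word.capitalize}, world!" }
  end
end
print Greeter.new.say_hello     # => Hello, world!
print Greeter.new.say_goodbye   # => Goodbye, world!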
This section describes the state of Ruby internationalization for Ruby 1.8, the currently stable and prevalent version. The latest release in the 1.8 series is 1.8.6, but there are no significant changes regarding internationalization within version 1.8.
In Ruby 1.8, String objects are treated just as sequences of bytes. Accessing a String with array-style indexing returns the corresponding byte as an integer, which can quickly lead to data corruption. Some support for treating strings as sequences of characters is available with regular expressions. There are four different character encoding modes for regular expressions: none (byte-oriented), EUC, Shift_JIS, and UTF-8.
These modes can be set on regular expressions by using the option letters n (none), e (EUC), s (SJIS), and u (UTF-8). As an example, the regular expression /.../n matches three bytes, while /.../u matches three UTF-8 characters. According to Matsumoto, UTF-8 was added around 1997, which is quite early for Japanese software.
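For example, assuming a script saved in UTF-8 and run under Ruby 1.8, the two modes count the same string differently:
p "青山".scan(/./u).length   # => 2 (characters)
p "青山".scan(/./n).length   # => 6 (bytes)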
The UTF-8 mode indicates that for Ruby, Unicode means UTF-8. Other encoding forms such as UTF-16 are not really under consideration. For people used to working with UTF-16, this may be unfortunate, but having all Unicode support be centered around a single encoding definitely eliminates many problems, and leads to a concentration of forces. For a scripting language, the US-ASCII compatibility of UTF-8 [PropUTF8] is definitely a big asset.
The above modes can also be set globally with the -K option or by setting the $KCODE global variable. The 'K' is derived from Kanji. However, these features have to be used with care. Due to their global nature, they may easily affect libraries where another setting is desired.
The -K option and the $KCODE global variable have a few other effects. They influence what can be used as a variable name. With $KCODE = 'utf-8', for example, any sequence of (UTF-8-encoded) non-ASCII characters can be used as a variable name. A similar feature is available in XML, and is very helpful for education [XML2003] and for dealing with complex local terminologies [Murata]. Also, the above settings may affect whether strings are escaped or not on output.
Other features helpful when working with UTF-8 are the pack method for Array and the unpack method for String. They make it possible to convert between an array of integers representing Unicode scalar values and a string encoded in UTF-8. As an example, [0x9752, 0x5C71].pack('U*') produces the string '青山'. This can also be used for detecting UTF-8; for details, see [RubyWay, Section 4.2.3].
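Conversely, unpack with the 'U*' template converts a UTF-8 string back to an array of Unicode scalar values:
'青山'.unpack('U*')   # => [38738, 23665], i.e. 0x9752, 0x5C71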
For more support, the programmer has to turn to libraries. The standard jcode library uses the built-in regular expression support to add and change some methods of the String class so that they work with character semantics. As an example, string.length returns the string length in bytes, whereas string.jlength, added by the jcode library, returns the length in characters. string.chop! is one of the modified methods; it removes the last character, rather than the last byte, if the jcode library is loaded. However, not all methods that one would expect are changed.
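A short example (assuming a UTF-8-encoded source file and $KCODE set accordingly):
require 'jcode'
$KCODE = 'u'
p "青山".length    # => 6 (bytes)
p "青山".jlength   # => 2 (characters)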
Some code conversion facilities are also available as standard libraries. nkf (Network Kanji Filter) is a library well known in Japan, and is available directly from Ruby. A higher-level interface called Kconv, based on NKF, is also available, and adds some conversion methods directly to the String class. However, these libraries only cover Japanese encodings (including parts of UTF-8), and by default they also perform MIME decoding, which should be dealt with separately.
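For example, Kconv adds methods such as toutf8 and tosjis to String; here, legacy_string stands for some String in a Japanese legacy encoding:
require 'kconv'
utf8_string = legacy_string.toutf8   # guesses the source encoding via NKF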
Another standard conversion library available in Ruby is iconv, a simple interface to the POSIX iconv library. However, it is only supported insofar as an iconv library is installed natively, and can therefore not be relied upon.
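A typical use, converting a Shift_JIS string (the placeholder sjis_string) to UTF-8:
require 'iconv'
utf8 = Iconv.conv('UTF-8', 'Shift_JIS', sjis_string)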
Because iconv in general covers a much wider range of character encodings, but nkf/Kconv are preferred by Japanese programmers for how they handle some of the conversion details, we have explored the possibility of a higher-level wrapper over these two libraries, with added programmer convenience [Shima2007]. An alpha-quality version of this work has been published as [sconv].
The list of other libraries that provide parts of the functionality necessary for internationalization is long. Hal Fulton [RubyWay, Chapter 4] provides a good start, including short explanations of the basic issues, which makes it recommended reading for people without experience in internationalization.
A library for doing Unicode normalization by Masato Yoshida is available at http://www.yoshidam.net/unicode-0.1.tar.gz, with a description at http://www.yoshidam.net/unicode.txt. However, this is based on Ruby 1.4 and Unicode as of 1999, and so is seriously outdated.
Another Unicode library, based on Unicode version 4.1.0, by Rob Leslie, is available at ftp://ftp.mars.org/pub/ruby/Unicode.tar.bz2. It redefines various operators for normalization and other operations.
The library currently most up to date for basic Unicode support is most probably ActiveSupport::Multibyte, part of Ruby on Rails 1.2. See http://api.rubyonrails.org/classes/ActiveSupport/Multibyte/Chars.html for more documentation. It supports Unicode semantics for strings encoded in UTF-8. Instead of just writing string.length, which gives the length in bytes, string.chars.length will give the string length in terms of characters. The chars method creates a proxy for each string, and delegates the methods to a handler that knows about the encoding. http://wiki.rubyonrails.org/rails/pages/HowToUseUnicodeStrings gives a short introduction to using Unicode in Ruby on Rails, which is also useful for Ruby in general. Another useful resource for using Unicode in Ruby on Rails is http://ruby.org.ee/wiki/Unicode_in_Ruby/Rails.
Another approach is ICU4R, providing ICU bindings for Ruby [ICU4R]. It provides a lot of functionality, but uses its own classes, e.g. UString instead of String. This means that advanced functionality is available, but at a high overhead for the programmer.
The author has implemented two small libraries, described in this and the next subsection, both available as RubyGems. The langtag RubyGem [langtag] provides parsing, wellformedness checking, and read/write accessors by component for IETF language tags [BCP47]. Below is a little example program showing some usage examples. The results are indicated with => in comments (starting with #).
require 'langtag'
tag = Langtag.new('de-Latn-ch')
tag.script        # => 'Latn'
tag.wellformed?   # => true
tag.region = 'at'
print tag         # => de-Latn-at
The charesc RubyGem [charesc] provides character escapes for Ruby strings. Up to version 1.8, Ruby only provides an escape syntax for bytes, \xHH for the hexadecimal case. Using metaprogramming and string interpolation, this limitation can be overcome. In Ruby, the method const_missing is called whenever a constant (an identifier starting with an upper-case letter) is not found, usually just to produce an error message. By redefining this method, we pretend that there are constants of the form UABCD, with the corresponding Unicode character U+ABCD as a value. To simplify the syntax, we also allow repeated sequences of hexadecimal digits, separated by the character u, to represent strings of Unicode characters. Again, we show some usage examples.
require 'charesc'
print "#{U677Eu672Cu884Cu5F18}"         # => 松本行弘
print U9752u5C71u5B66u9662u5927u5B66    # => 青山学院大学
print "Martin J. D#{U00FC}rst"          # => Martin J. Dürst
The third line above shows that the constants do not have to be interpolated; they can be used directly where appropriate. The escapes also work if $KCODE is set to SHIFT_JIS or EUC_JP, but only for characters available in these encodings. For $KCODE set to none/US-ASCII, UTF-8 is produced.
This section is heavily based on a 2005 paper [M17N] by Matsumoto, as well as on direct discussions with him. To those who can read it, parts of that paper read like one of the Unicode critiques that were so popular in Japan in the mid-1990s. A Unicode-centered model of character encoding is rejected in favor of a character set independent model, relegating Unicode to one coded character set among many. The next subsections explain these two models in detail.
Character encoding and processing on the Web as well as in many applications and programming languages all follow one and the same basic model, most clearly documented in Section 4.3 of [Charmod]. While that document speaks indirectly, in terms of specifications, the overall idea is very simple: All character processing occurs in (some encoding form of) Unicode; external data is converted to/from this encoding form as necessary, and other character encodings are secondary, simply being used to transmit (subsets of) Unicode. In more abstract terms, this model is called the Universal Character Set model, but in practical terms, Unicode is the only viable character set for this purpose.
Programming languages using the UCS model include Java and C#, which use UTF-16 internally, Perl, which mostly uses UTF-8, and Python, which can be compiled for both UTF-16 and UCS-4.
The many advantages of the UCS model are its conceptual simplicity and the possibility to provide rich functionality based on well-defined character semantics. Also, using a single encoding means that implementations can be highly optimized.
The Character Set Independent model tries to treat various different character encodings as equals. This model was popular with big software companies in the 1980s and early 1990s, before a UCS model using Unicode became possible. At that time, the choice of encoding usually happened at compile time. In [M17N], the encoding is a property of a String, and may be different for each string. Further implementation details are discussed in Section 4.3.
Matsumoto gives the following reasons for choosing the CSI model: the memory needed for code conversion tables, the processing cost of conversion, the existence of characters not (yet) encoded in Unicode, and the difficulties caused by encoding variants.
Among these, the memory use for code conversion tables is less and less of an issue, even for devices such as mobile phones, where the newest versions already support Unicode. Conversion speed is also not an issue, because the necessary table lookups are very fast compared to the overhead of a scripting language such as Ruby. The limitations with respect to characters not (yet) encoded in Unicode can be addressed by mapping such characters to one of the private use areas of Unicode; this is easier than implementing a new encoding.
What remains, and what is most often mentioned in direct conversation, are the difficulties with encoding variants. As an example, the encoding usually labeled Shift_JIS, used on Windows PCs and the Macintosh for over 20 years, has accumulated a lot of variants. These are due to additions and changes in the underlying coded character sets, and to the addition of vendor-specific and user-specific characters in the reserved area. Also, various vendors and libraries differ slightly in how they map the core characters of Shift_JIS to Unicode and back. For more details on Shift_JIS variants, please see [XMLja].
A particularly thorny issue for Japanese encodings (and similarly for Korean encodings) is the character encoded as byte 0x5C. It is used both as a backslash (\) and as a yen sign (¥). In its syntactic function as a backslash, e.g. in programming languages, it may look as it pleases, but may not change code position. In its function as a currency symbol, it has to look right, or economic confusion may result.
While Shift_JIS is a particularly serious example, such issues are not at all limited to Shift_JIS or to Japanese encodings [ICUcompare]. Often, just a few characters out of a few hundred or even a few thousand are affected, and frequently, the characters affected are rare, or the differences are visually minor. Also, many of the issues do not appear in simple round-trip scenarios, i.e. if the same conversion library is used for converting in both directions.
Often, conversion to Unicode would be the right thing to do, but doing it correctly would require too much work, possibly including manual intervention. Keeping the data in a legacy encoding does not mean that the legacy encoding is better suited to representing the data, but it allows the programmer to gloss over ambiguities that would otherwise have to be resolved explicitly. In particular for a scripting language, where a lot of programs are written for one-time or casual use, the ability to avoid transcoding is felt to be an important feature.
Being able to stay vague when needed is a feature often attributed to the Japanese language. Compared to Japanese, using English often means that things that are implicit or unknown have to be made explicit. To some extent, there may be some connection between the use of vagueness in the Japanese language and the desire for glossing over ambiguities in character encodings.
[M17N] describes an implementation of the CSI model. Each string carries its encoding as a property. Each encoding is represented by a C-level data structure containing mostly function pointers. Each function pointer implements a particular primitive for the specific encoding. The data structure also contains a few flags and other information about the encoding in question, such as the length of the shortest and longest byte sequence representing a character. No recompilation is needed except when a new encoding is added, and even in that case, it may be done by dynamic linking, avoiding recompilation of the interpreter itself.
Strings with different encodings can coexist in a single execution. In many cases, at least two encodings will coexist: a binary encoding (which can also be understood as no encoding) and a single platform-specific encoding. If strings or regular expressions with different encodings are used in the same operation, an exception is raised. Exceptions are not very useful in this case; it is, for example, impossible to use them to implement a general policy of 'convert if you see an encoding conflict'. The intention of using exceptions is to make programmers understand and fix their code.
The number of functions that have to be implemented for each encoding is kept very low in order to make implementing new encodings easy. The exact number has been fluctuating over the past few years, but always stayed between 15 and 20. The functions are very low-level, such as "starting from byte p, assuming that this is the start of a byte sequence representing a character, tell me how long this byte sequence is". With these basic functions, it is possible to implement all basic string functionality.
The CSI model is not without problems. This subsection discusses these problems as perceived by the author. A first problem is performance, on multiple levels. Encoding-dependent function dispatch incurs a relatively small performance penalty, because it can be done once per string operation, where the overhead of using an interpreted language is already significant.
More serious performance problems can be expected due to the use of an extremely small set of primitives. As an example, finding the length of a string is currently implemented by repeatedly calling the function to find the next character. Adding a new primitive that calculates string length directly would increase the number of primitives, but increase performance for finding the length of a string in characters. While it is difficult to judge exactly when and where such performance problems will surface, the Law of Leaky Abstractions [JoelSW] virtually guarantees that they will.
The best way to deal with this conflict is to separate between core primitives and extended primitives. The core primitives are those primitives that have to be implemented for each encoding. They can be kept extremely low in number. The extended primitives are operations that are used frequently, can be expressed in terms of the core primitives, but may be performed significantly faster if implemented directly. For each extended primitive, a default implementation in terms of core primitives is made available.
When implementing a new encoding, as soon as the core primitives are implemented, the encoding can be used, because for the extended primitives, the default implementations are used. If the encoding is used more frequently, and if performance problems show up, more and more extended primitives can be implemented for the specific encoding. Overall, this results in a triangle-shaped oblique approach as used for the fast implementation of the Graphic User Interface Application Framework on several windowing platforms [Wei1993].
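The following Ruby sketch illustrates the idea; the names are invented and do not correspond to the actual C-level implementation. The core primitive must be implemented per encoding, while the extended primitive has a default implementation in terms of the core one, which specific encodings may later override for speed:
class CharEncoding
  # Core primitive: every encoding must implement this.
  def char_byte_length(bytes, offset)
    raise NotImplementedError
  end

  # Extended primitive: default implementation via the core primitive;
  # an encoding may override it with a faster direct version.
  def char_count(bytes)
    offset, count = 0, 0
    while offset < bytes.length
      offset += char_byte_length(bytes, offset)
      count  += 1
    end
    count
  end
end

class Utf8Encoding < CharEncoding
  def char_byte_length(bytes, offset)
    lead = bytes[offset]   # in Ruby 1.8, indexing a string yields a byte value
    lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4  # ignores invalid input
  end
end

print Utf8Encoding.new.char_count("青山abc")   # => 5 (in a UTF-8 source file)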
Another problem with the CSI model is that it leads to a lowest-common-denominator approach to string functionality, whereas the UCS model tends towards a greatest-common-divisor approach. In the CSI model, only a minimum of primitives that can be applied to all encodings is implemented. This leaves the functionality virtually at the same level as the old-style interfaces designed for the ASCII-only days. The UCS model, on the other hand, encourages covering more and more functionality.
Supporting Unicode means more than just making it possible to write UCS-oriented programs on a CSI infrastructure by using UTF-8 instead of Shift_JIS or EUC-JP, for example. Unicode is not just a union of existing character encoding repertoires convenient for round-tripping, nor just one coded character set among many. Unicode is at the forefront of encoding rare scripts, it constitutes the lowest level of Web architecture, and it provides a wide range of character-related properties, conventions, and standards. Thorough support for Unicode, not only as an encoding but including advanced functionality, is expected from any programming language and application platform.
Tim Bray [World30] has helped a lot to make Unicode well known in the Ruby community. Since the implementation of UTF-8 around 1997, Unicode is on equal footing to other encodings such as Shift_JIS and EUC-JP. But this is not enough. To paraphrase George Orwell, all encodings are equal, but Unicode (meaning UTF-8 in the case of Ruby) has to become more equal than others.
This section looks at the ongoing implementation of internationalization in Ruby 1.9 and beyond. If everything goes well, some basic pieces may be finished by Christmas 2007, but this is not yet certain. This work is ongoing as of this writing, so any details should be checked before use.
The principal approach in Ruby 1.9 is that, following the lines of the CSI model above, strings and regular expressions carry a character encoding as a property. This encoding is available as a String via the encoding method. The encoding is taken into account for basic string operations such as length or indexing, to make sure that these operations work on characters rather than bytes. As a small but important detail, indexing operations that extract a single character no longer return an integer, but simply a very short string.
For working on bytes, a binary encoding is provided. Currently, as in Ruby 1.8, the binary encoding is also used for US-ASCII. Work is ongoing to distinguish simple binary data from US-ASCII data, at least by checking the range of byte values. This is used to relax the restriction on combining strings with different encodings: a string with an ASCII-compatible encoding that only contains US-ASCII data is considered combinable with another string in a US-ASCII-compatible encoding. US-ASCII-compatible here means that US-ASCII codepoints are expressed as US-ASCII byte values and that the encoding has no switching states.
The $KCODE global variable is being replaced by a comment convention at the start of a file. The start of the file here means the first line, or the second line if the first line contains a path to the Ruby binary with options (the so-called shebang, e.g. #!/usr/local/bin/ruby). As an example, the line
# -*- coding: utf-8 -*-
indicates that the file is encoded in UTF-8. This convention only affects the actual file, not other files. String constants and regular expressions in the file are given this encoding, and the encoding is checked for correctness for identifiers.
In Section 3.5, we presented our charesc RubyGem [charesc], which uses metaprogramming to make Unicode-based character escapes with reasonably short syntax available in Ruby 1.8. For Ruby 1.9, the implementation of character escapes in the Ruby core syntax is planned. The syntax allows two styles: \u followed by four hexadecimal digits, and \u followed by one to six hexadecimal digits enclosed in curly braces. Here are some of the examples from Section 3.5, rewritten for the new syntax:
print "\u677E\u672C\u884C\u5F18" # => 松本行弘 print "Martin J. D\u00FCrst" # => Martin J. Dürst print "Martin J. D\u{FC}rst" # => Martin J. Dürst
It is as yet undecided what these escapes will produce in encodings other than UTF-8. However, it seems clear that no syntax for other encodings (e.g. something like \s for Shift_JIS or \e for EUC-JP) is needed, for several reasons. There is no equivalent of a Unicode scalar value for these encodings. Byte-wise notation is already available; i.e. \x90\xc2 is too close to a potential \s90c2 to warrant implementation. Also, such a scheme would not scale to other encodings.
Ruby 1.9 uses a new (for Ruby) regular expression engine, called Oniguruma [Oniguruma]. Oniguruma is a very powerful regular expression engine, including the capability to use recursion in regular expressions. Oniguruma is relevant to Ruby internationalization in several different ways. First, Oniguruma adopted the CSI approach proposed in [M17N], and this implementation is now used by Ruby, too. The code moved from the Ruby core to Oniguruma, and with the adoption of Oniguruma by Ruby, this code is now in some sense re-imported.
Second, Oniguruma is important because, similar to the old regular expression implementation in Ruby 1.8, it is well internationalized when compared to the rest of Ruby as currently available. In particular, Oniguruma supports a wide range of character properties, including Unicode general categories and scripts. As an example, the regular expression /\p{Sm}\p{Hiragana}/ would match a mathematical symbol followed by a Hiragana character. For details, please see [OniSyntax, Section 3].
The problem with the current implementation of character properties is that except for Unicode-based encodings such as UTF-8, very few properties are supported. As an example, EUC-JP and Shift_JIS only support \p{Hiragana} and \p{Katakana}, even though these encodings also contain Han ideographs as well as Greek and Cyrillic characters. Also, properties not supported in an encoding lead to parse errors, rather than just not matching anything.
While a start has been made on the new Ruby internationalization framework in Ruby 1.9, a very large number of questions remain to be answered [M17Nqestions]. Many of these questions come from the fact that a CSI architecture allowing multiple internal encodings at the same time poses questions regarding encoding at every corner.
This section looks at what the author thinks may become a crucial piece in making the CSI model of Ruby 1.9 work easily and smoothly for all kinds of programming needs. This in particular includes the UCS model important for Internet and Web-related computing and familiar to many programmers coming from other languages. The material presented in this section is currently just a proposal. Whether and how it will be implemented will still have to be explored and discussed.
It is not too difficult to implement the UCS model on top of a CSI architecture. The two main pieces needed are a universal encoding such as UTF-8 and transcoding functionality. But even then, a basic difference remains: Transcoding is performed automatically and aggressively in the UCS model, whereas in a CSI model, it is mostly avoided or delayed. The question is how this difference can be expressed and implemented. Here, we propose to do this with transcoding policies.
In short, a transcoding policy specifies under what circumstances what kinds of transcoding should or should not take place. In general, such a policy will only apply for cases where the program does not specify the exact details. Some basic policies, e.g. a basic CSI policy and a basic UCS policy, may be made available as command options. Finer control should be made possible by exposing the policy as a Ruby object.
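To make this concrete, here is a purely hypothetical sketch of what such a policy object might look like in Ruby; none of these names exist, and the actual interface would still have to be worked out:
class TranscodingPolicy   # invented name, for illustration only
  attr_accessor :internal_encoding, :on_conflict, :fallback
  def initialize; yield self if block_given?; end
end

# A UCS-style policy: convert aggressively to one universal encoding.
ucs = TranscodingPolicy.new do |p|
  p.internal_encoding = 'UTF-8'
  p.on_conflict       = :convert    # transcode when encodings differ
  p.fallback          = :xml_ncr    # escape unconvertible characters
end

# A pure CSI policy: never convert implicitly.
csi = TranscodingPolicy.new { |p| p.on_conflict = :raise }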
In this subsection, we provide a short overview of where transcoding policies may be useful. Data enters and exits an application at various locations. Each such location is a candidate for transcoding, and therefore for the use of a policy.
In many applications, the bulk of data is read from and written to files. In modern operating systems, console input and output are abstracted as standard input and output files, but these can also be replaced by real files by using redirection. So we obtain real console input and output, redirected console input and output, real and redirected error output, and general file input and output as possible distinct locations where a policy may be applied. In many cases, the same policy will apply to several of these locations, so some kind of defaulting mechanism may be needed.
A next location to consider is the encoding of directory paths and filenames in the file system. In this case, the encoding to use may be derived from some system information or from locale settings in an environment variable.
Different C-level extensions and libraries may use different character encodings, and a policy or several policies may be used to bridge the gap between Ruby internals and these extensions. Different implementations of Ruby use different character encodings natively. For example, Java uses UTF-16 exclusively for strings, so for JRuby, a special policy may be needed.
When combining strings with different encodings, a pure CSI application will throw an exception, whereas a UCS application will want to convert the strings to an encoding that encompasses both encodings. This again can be selected by a policy. A special, more relaxed policy should be in force when reporting errors, to avoid secondary errors and endless loops.
In the previous subsection, we have looked at the various locations where a character encoding policy may be applied. In this subsection, we discuss the various options that a transcoding policy may provide.
For some cases, a fixed source or target encoding may be provided. As an example, the encoding of filenames on a modern Microsoft Windows system is fixed to UTF-16. For a CSI system, internal transcodings would simply be prohibited. An alternative would be to allow conversion to a wider encoding (e.g. from US-ASCII to UTF-8), or to allow conversion if no non-convertible codepoints are found in the actual data.
Both for explicit conversions as well as for automatic conversions, it may be helpful to be able to indicate what fallbacks should be used in case a character or character sequence cannot be transcoded. Alternatives may include simply dropping the data, replacing it with a formal replacement character or an appropriate visual equivalent (e.g. '?'), or using some kind of escaping convention, e.g. XML numeric character references or Ruby character escaping syntax.
Transcoding policies may also be used to deal with illegal code sequences (frequently detected when transcoding) and with normalization. For normalization, one approach is to create 'guaranteed normalized' versions of encodings, and to use the policy logic to invoke normalization checks when combining two strings in such an encoding. Another approach is to use the policy logic to normalize at certain well-defined interfaces.
Ruby is a scripting language that is gaining popularity rapidly due to its high productivity. We described Ruby internationalization for the current version (Ruby 1.8) including some work of the author. We described and analyzed the CSI model which forms the base for the internationalization of the next version (1.9) of Ruby. We also proposed transcoding policies as a general conceptual framework for unifying the CSI model and the UCS model. A lot of work remains to be done to fully implement internationalization as planned for Ruby 1.9, including character conversions and Unicode functionality.
My warmest thanks go to Yukihiro Matsumoto (Matz) and many others for Ruby and some great discussions, to Takuya Shimada for his work on sconv, and to Kazunari Ito and many others for providing a great research environment at Aoyama Gakuin University.