http://www.sw.it.aoyama.ac.jp/2015/pub/RubyRails/
© 2012-5 Martin J. Dürst, Aoyama Gakuin University
Ruby is a purely object-oriented scripting language designed to make programming fun and efficient. Ruby on Rails is the groundbreaking web application framework built using the programming language Ruby. This tutorial will help you understand the basics for internationalization and localization in Ruby and Ruby on Rails.
The tutorial will start with a discussion of how character encoding works in Ruby and how to make the best use of it both in throw-away scripts and in long-running applications. We will show how in Ruby, all character encodings are equal, but UTF-8 is more equal than others, and should be used with preference.
Ruby on Rails also prefers UTF-8, because this is the best choice for web applications. Ruby on Rails comes with its own internationalization and localization framework. As is typical for Ruby on Rails, this framework is very simple but easily extensible. We will discuss both the basic framework and several helpful extensions, e.g. for handling time zones or for translating user interface texts.
The tutorial assumes that participants have some experience with programming and Web applications. Experience with Ruby and/or Ruby on Rails is a plus, but is not a precondition for attending.
[Text appearing in gray is commentary that does not show up in presentation mode. The best way to view the slides as they were presented is with Opera, pressing F11.]
green, monospace: puts "Hello Ruby!"
orange: puts "some string"
"Юに코δ" (UTF-8), "ユニコード" (SJIS)
1 + 1 ⇒ 2
Юに코δ: Ю: Cyrillic uppercase YU; に: Hiragana NI; 코: Hangul KO; δ: Greek delta
5.times { puts "Hello Ruby!" } ({ ... } or do ... end)
This tutorial is about MRI/C-Ruby, the reference implementation
A note about Ruby versions and Unicode versions: The Ruby core team is very conservative in introducing new Unicode versions as bug fixes. New versions therefore only get added on a minor version upgrade (e.g. 2.0, 2.1 (Unicode 6.0), 2.2 (Unicode 7.0), ...).
chcp 65001
irb (Interactive Ruby)
"Юに코δ".length ⇒ 4
We can get a byte count with:
"Юに코δ".bytesize
⇒ 10
String: "Юに코δ".class ⇒ String
"Юに코δ"[0] ⇒ "Ю"; "Юに코δ"[0].length ⇒ 1
Using the same class for both strings and characters avoids the distinction between characters and strings of length 1. This matches Ruby's "big classes" policy. It also leaves the door open for 'characters' other than single codepoints. Strings are not Arrays, but where it makes sense, operations work the same for both classes. This is called duck typing.
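The difference between character count and byte count, and the fact that indexing returns another String, can be checked with a small sketch. The helper name `string_facts` is made up for this illustration and is not part of Ruby.

```ruby
# Hypothetical helper: collects the basic facts shown on the slides.
def string_facts(str)
  {
    length:      str.length,    # number of characters
    bytesize:    str.bytesize,  # number of bytes in the string's encoding
    first:       str[0],        # indexing returns a String of length 1
    first_class: str[0].class   # there is no separate character class
  }
end

p string_facts("Юに코δ")
```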
Internationalized: [], reverse, ..., include?, index, ..., match, sub, gsub, scan, split, tr, ...
Not yet internationalized: capitalize, downcase, ..., to_i, Int#to_s, ...
Lines: "Unicode\nЮに코δ".lines.to_a ⇒ ["Unicode\n", "Юに코δ"]
Characters: "Юに코δ".chars.to_a ⇒ ["Ю", "に", "코", "δ"]
Codepoints: "Юに코δ".codepoints.map {|cp| cp+1}.pack 'U*' ⇒ "Яぬ콕ε" (rotation by 1)
Bytes: "Юに코δ".bytes.inject {|m,b| m+b} ⇒ 1868 (sum of byte values)
each_line, each_char, each_codepoint, and each_byte are older names for lines, chars, codepoints, and bytes. In Ruby 1.9, the methods lines, chars, codepoints, and bytes return Enumerators; since Ruby 2.0 they return Arrays, and the each_ variants called without a block return the Enumerators. Here we just use to_a to produce arrays.
Enumerators can be used directly for iteration with each, or with separate iterators for mapping with map, selection with select/reject.
The need for several enumerators on a single object was one of the main motivators for introducing enumerators into Ruby. This is an interesting example of how internationalization concerns can affect more 'fundamental' language features. The orthogonality resulting from separating what to enumerate over and how to iterate produces very expressive code.
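The separation of "what to enumerate" from "how to iterate" can be sketched as follows. The three method names are invented for this example; the each_ methods are called without a block so that they return Enumerators in every Ruby version from 1.9 on.

```ruby
# What to enumerate: codepoints. How to iterate: map.
def rotate_codepoints(str, by = 1)
  str.each_codepoint.map { |cp| cp + by }.pack('U*')
end

# What to enumerate: bytes. How to iterate: inject (sum).
def byte_sum(str)
  str.each_byte.inject(0) { |sum, b| sum + b }
end

# What to enumerate: characters. How to iterate: select.
def three_byte_chars(str)
  str.each_char.select { |ch| ch.bytesize == 3 }
end

puts rotate_codepoints("Юに코δ")
puts byte_sum("Юに코δ")
p three_byte_chars("Юに코δ")
```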
(all for "Юに코δ")
(\u):
"\u042e\u306b\ucf54\u03b4"
"\u{42e}\u{306b}\u{cf54}\u{3b4}"
"\u{42e 306b cf54 3b4}"
"\xD0\xAE\xE3\x81\xAB\xEC\xBD\x94\xCE\xB4"
π = 3.14
...
Possible, but not recommended (possible exceptions: basic education, special terminology)
Problems:
C_class_name
/.../options
UTS #18 recommends: \s, \p{space}, \p{Whitespace} → Unicode whitespace
In Ruby: \s → ASCII whitespace; \p{space}, \p{Whitespace} → Unicode whitespace
Examples:
"abc def" =~ /\s/ ⇒ 3 (i.e. found)
"abc\u00A0def" =~ /\s/ ⇒ nil
"abc\u00A0def" =~ /\p{space}/ ⇒ 3 (i.e. found)
"abc\u00A0def" =~ /\p{Whitespace}/ ⇒ 3 (i.e. found)
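The examples above can be collected into one small sketch that compares the two patterns on the same input. The helper name `whitespace_positions` is hypothetical.

```ruby
# Returns the match position (or nil) for ASCII-only \s and for the
# Unicode-aware \p{Space} property on the same string.
def whitespace_positions(str)
  {
    ascii_s: str =~ /\s/,
    p_space: str =~ /\p{Space}/
  }
end

p whitespace_positions("abc def")        # ordinary space
p whitespace_positions("abc\u00A0def")   # no-break space
```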
Keeping \s to mean ASCII whitespace only was done for backwards compatibility. If somebody wrote a script that used \s to match ASCII whitespace characters, and Ruby suddenly changed \s to match more characters than before, the meaning of that program would change. Maybe it would change in just the right way, but there is also a good chance that it would change in ways not intended by the programmer. (See also https://bugs.ruby-lang.org/issues/7154.)
("Юに코δ"[2]) and ("Юに코δ".chars)
Hash, Regexp, and gsub
Example: Convert German to ASCII-only
Hash: german_hash = {'Ä'=>'Ae', 'ä'=>'ae', 'Ö'=>'Oe', 'ö'=>'oe', 'Ü'=>'Ue', 'ü'=>'ue', 'ß'=>'ss' }
german_regexp = /[ÄäÖöÜüß]/
my_german.gsub german_regexp, german_hash
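A runnable version of this example might look as follows. It wraps the Hash and the Regexp from the slide in a method called `german_to_ascii`, a name chosen here only for illustration.

```ruby
# gsub with a Hash as second argument: each match is looked up in the
# Hash and replaced by the corresponding value.
def german_to_ascii(text)
  german_hash = { 'Ä' => 'Ae', 'ä' => 'ae', 'Ö' => 'Oe', 'ö' => 'oe',
                  'Ü' => 'Ue', 'ü' => 'ue', 'ß' => 'ss' }
  german_regexp = /[ÄäÖöÜüß]/
  text.gsub(german_regexp, german_hash)
end

puts german_to_ascii("Größe, Übung, Äpfel")
```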
See also: Implementing Normalization in Pure Ruby - the Fast and Easy Way, Martin J. Dürst, IUC 37, Santa Clara, Oct 2013
"external": External encoding
"external:internal": External and internal encoding
open('my_file.txt', 'r')
open('my_file.txt', 'r:iso-8859-1')
open 'my_file.txt', 'r:KOI8-R:SJIS'
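As a minimal sketch of the "external:internal" mode string, the following writes raw ISO-8859-1 bytes to a temporary file and reads them back transcoded to UTF-8. The method name and the temporary file are assumptions made for this demonstration.

```ruby
require 'tempfile'

# Writes "Dürst" as ISO-8859-1 bytes, then reads the file with
# external encoding ISO-8859-1 and internal encoding UTF-8.
def read_latin1_as_utf8
  Tempfile.create('latin1_demo') do |tmp|
    tmp.binmode
    tmp.write("D\xFCrst".b)   # 0xFC is "ü" in ISO-8859-1
    tmp.flush
    File.open(tmp.path, 'r:iso-8859-1:utf-8') { |f| f.read }
  end
end

text = read_latin1_as_utf8
puts text
puts text.encoding
```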
String class
String is: "Юに코δ".encoding.to_s ⇒ "UTF-8"
Java, JavaScript: UTF-16; Python: 8-bit/UTF-16/UTF-32; Perl: 8-bit/UTF-8; C/C++: mb/wc
(Matz pushing for moving to UTF-8)
String: "Юに코δ"
Symbol: :Юに코δ
Regexp: /Юに코δ/
Юに코δ
(Very simplified, more details later)
encode (actual transcoding): "abcδ".encode 'SJIS' ⇒ "abcδ" (SJIS)
force_encoding (change label, keep bytes): "abc\x83\xC2".force_encoding 'SJIS' ⇒ "abcδ" (SJIS)
Unfortunately, force_encoding is destructive, no non-destructive equivalent.
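The difference between the two operations can be sketched like this. The method name `transcode_demo` is hypothetical, and the example spells the encoding name as 'Shift_JIS'; `dup` is used so that force_encoding does not modify the literal.

```ruby
def transcode_demo
  utf8 = "abcδ"
  # encode: same characters, new bytes, new label
  sjis = utf8.encode('Shift_JIS')
  # force_encoding: same bytes, new label (on a copy, since it is destructive)
  relabeled = "abc\x83\xC2".dup.force_encoding('Shift_JIS')
  {
    sjis_bytes:  sjis.bytes,
    same_string: sjis == relabeled,
    round_trip:  relabeled.encode('UTF-8')
  }
end

p transcode_demo
```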
encode
transcode ('smells' of video converters, ...)
transcode.c, RUBYSOURCE/enc/trans
Encoding::Converter (very rarely used)
"Юに코δ" (UTF-8) + "Юに코δ" (UTF-8) ⇒ "Юに코δЮに코δ" (UTF-8)
"Юに코δ" (UTF-8) + "Юに코δ" (UTF-16) ⇒ Encoding::CompatibilityError
Other examples:
"Dürst" (ISO-8859-1) == "Dürst" (ISO-8859-2) ⇒ false
π (UTF-8) ≢ π (SJIS)
Trying to combine strings with different encodings, as here with concatenation (+), leads to an exception. There are some exceptions (sic!) to this rule that we will look at later. The reasoning for the error here is that transcoding should not happen without the programmer being aware of it.
Trying to compare two character-by-character identical strings in different encodings will produce false, even if these strings are, as in the above example, also byte-for-byte identical. Again, the reason for the result is that encoding mismatches should be detected early. In addition, a simple byte-for-byte comparison could produce false positives.
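Both behaviors can be reproduced with a short sketch. The helper names are invented for this example, and UTF-16LE stands in for the UTF-16 label used on the slide.

```ruby
# Returns the exception class raised by a + b, or nil if none is raised.
def concat_error_class(a, b)
  a + b
  nil
rescue Encoding::CompatibilityError => e
  e.class
end

# The same bytes, labeled once as ISO-8859-1 and once as ISO-8859-2.
def latin_pair
  bytes = "D\xFCrst".b
  [bytes.dup.force_encoding('ISO-8859-1'),
   bytes.dup.force_encoding('ISO-8859-2')]
end

p concat_error_class("Юに코δ", "Юに코δ".encode('UTF-16LE'))
latin1, latin2 = latin_pair
p latin1.bytes == latin2.bytes   # identical bytes ...
p latin1 == latin2               # ... but the strings do not compare equal
```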
In Ruby, all encodings are equal, but some encodings are more equal than others.
[Adapted from George Orwell's Animal Farm.]
The important encodings are:
ASCII-only data very frequent in programs
Special treatment, to reduce errors and improve performance
This also makes it easy for programmers who don't think about encoding when working only with US-ASCII. Internally, Ruby caches whether a string is ASCII-only or not, to increase performance.
"Data" (SJIS).ascii_only? ⇒ true
"データ" (SJIS).ascii_only? ⇒ false
ASCII-only data does not cause Encoding::CompatibilityErrors, even if encodings clash
"Юに코δ" (UTF-8) + "Data" (SJIS) ⇒ "Юに코δData" (UTF-8)
"Data" (SJIS) + "Юに코δ" (UTF-8) ⇒ "DataЮに코δ" (UTF-8)
Similar behavior for other string operations
Test with Encoding.compatible? st1, st2
Encoding.compatible? "Юに코δ" (UTF-8), "Data" (SJIS) ⇒ UTF-8
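The special treatment of ASCII-only data can be tried out with the following sketch. The method name `ascii_only_mixing` is hypothetical, and 'Shift_JIS' is used as the encoding name.

```ruby
def ascii_only_mixing
  utf8       = "Юに코δ"
  sjis_ascii = "Data".encode('Shift_JIS')   # ASCII-only, but labeled Shift_JIS
  {
    ascii_only: sjis_ascii.ascii_only?,
    left:       (utf8 + sjis_ascii).encoding,
    right:      (sjis_ascii + utf8).encoding,
    compatible: Encoding.compatible?(utf8, sjis_ascii)
  }
end

p ascii_only_mixing
```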
Alias: BINARY
Used for binary data
Use when high-bit bytes' semantics are unknown (but 7-bit bytes are ASCII or can be treated as such)
Special shortcut (since Ruby 2.0): string.b ≡ string.dup.force_encoding('ASCII-8BIT')
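A small sketch shows that the shortcut returns a relabeled copy and leaves the original untouched. The helper name `binary_copy_facts` is made up for this example.

```ruby
def binary_copy_facts(str)
  copy = str.b   # copy with the binary label; str itself is not changed
  {
    copy_encoding:     copy.encoding,
    original_encoding: str.encoding,
    same_bytes:        copy.bytes == str.bytes,
    copy_length:       copy.length   # now counts bytes, not characters
  }
end

p binary_copy_facts("Юに코δ")
```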
\u escaping automatically makes strings UTF-8
-U
Array#pack
String to encoding: e = Encoding.find "WinDOwS-1252"
Encoding to string: e.to_s ⇒ "Windows-1252"
Methods most often accept both Strings and Encodings to indicate encodings
Constants for Encodings: Encoding::Windows_1252, Encoding::WINDOWS_1252
(We see that encoding names are case-insensitive, and are canonicalized. The internal canonicalization is specific to each encoding, and can start with a lower-case letter. Encodings are also exposed as constants in the Encoding module. There is always a constant with all letters in uppercase. There is also a constant that is the same or close to the internal canonical name, but starts with an uppercase letter. For the constants, hyphens are converted to underscores.)
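The naming rules described here can be checked with a short sketch. The method name `encoding_name_demo` is an assumption of this example.

```ruby
def encoding_name_demo
  e = Encoding.find('WinDOwS-1252')   # lookup is case-insensitive
  {
    canonical_name: e.to_s,
    # both constants refer to the very same Encoding object
    same_constant:  Encoding::Windows_1252.equal?(Encoding::WINDOWS_1252),
    found_constant: e.equal?(Encoding::WINDOWS_1252),
    # methods such as encode accept an Encoding object or a String
    accepts_both:   "abc".encode(e) == "abc".encode('windows-1252')
  }
end

p encoding_name_demo
```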
List all encodings:
Encoding.list.join("\n")
Count the encodings supported:
Encoding.list.length
Ruby currently supports around 100 encodings
Aliases are only separate names (not separate encodings)
To list all encoding names:
Encoding.name_list.join "\n"
To list all (active) aliases:
Encoding.aliases.each { |a, e| puts "#{a} => #{e}" }
Most, but not all encodings are ASCII-compatible:
Encoding::UTF_8.ascii_compatible? ⇒ true
Encoding::UTF_16.ascii_compatible? ⇒ false
ASCII-compatible: ASCII characters ≡ ASCII bytes
Encoding::Shift_JIS.ascii_compatible? ⇒ true
Encoding::ISO_2022_JP.ascii_compatible? ⇒ false
List of non-ASCII-compatible encodings:
Encoding.list.reject &:ascii_compatible?
UTF-16BE, UTF-16LE, UTF-16
UTF-32BE, UTF-32LE, UTF-32
UTF-7
ISO-2022-JP, ISO-2022-JP-2, ISO-2022-JP-KDDI
CP50220, CP50221
\x5C
→ ¥
List of dummy encodings:
Encoding.list.select &:dummy?
UTF-16[LE|BE], UTF-32[LE|BE], UTF-7
ISO-2022-JP, ISO-2022-JP-2, ISO-2022-JP-KDDI
CP50220, CP50221
Dummy encodings are treated as opaque byte sequences
Dummy encodings are encodings which are treated as opaque byte sequences. But they are labeled, and may be transcoded to some other encodings.
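The two lists can be generated on the running Ruby itself with a sketch like the following. The exact contents depend on the Ruby version, so no output is shown; the method name `encoding_groups` is hypothetical.

```ruby
# Collects the names of non-ASCII-compatible and of dummy encodings
# as reported by the running Ruby.
def encoding_groups
  {
    non_ascii_compatible: Encoding.list.reject(&:ascii_compatible?).map(&:name),
    dummy:                Encoding.list.select(&:dummy?).map(&:name)
  }
end

groups = encoding_groups
puts "Not ASCII-compatible: #{groups[:non_ascii_compatible].join(', ')}"
puts "Dummy: #{groups[:dummy].join(', ')}"
```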
Shift_JIS vs. Windows-31J (CP932 aka SJIS)
UTF8-DoCoMo, UTF8-KDDI, UTF8-SoftBank
stateless-ISO-2022-JP
# ...coding: ISO-8859-1 ...
...
#!/usr/local/ruby
Encoding.locale_charmap
Encoding.default_external= (use sparingly!)
Encoding.default_external
nil (i.e. unknown)
Encoding.default_internal= (use sparingly!)
Encoding.default_internal
IO: Class for input and output (parent of File, ...)
external_encoding/internal_encoding
set_encoding: changes external/internal encoding
-E, --encoding: Set default external or external:internal encoding
-U: Set default internal encoding to UTF-8
"abc\xFE" (UTF-8).valid_encoding? ⇒ false
"a\xC0\x80" (UTF-8).valid_encoding? ⇒ false
"abc\xC2\x80" (UTF-8).valid_encoding? ⇒ true
"abc\u03A2" (UTF-8).valid_encoding? ⇒ true (there is no uppercase final Sigma (yet!?))
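The four cases can be run in one go with the following sketch; the method name `validity_examples` is made up for this illustration.

```ruby
# Runs valid_encoding? on the four UTF-8 strings from the slide:
# a stray 0xFE byte, an overlong encoding, a well-formed two-byte
# sequence, and a well-formed but unassigned codepoint.
def validity_examples
  ["abc\xFE", "a\xC0\x80", "abc\xC2\x80", "abc\u03A2"].map(&:valid_encoding?)
end

p validity_examples
```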
Checks for code structure, not unassigned codepoints
"ユニコード" (SJIS).encode('UTF-8') ⇒ "ユニコード" (UTF-8)
encode!
string1 = "ユニコード" (SJIS)
string1.encode! 'UTF-8'
string1 ⇒ "ユニコード" (UTF-8)
Example: Force UTF-8 "double-encoding" (not recommended)
"Юに코δ" (UTF-8).force_encoding('iso-8859-1').encode('UTF-8') ⇒ "Ð®ã\u0081«ì½\u0094Î´"
Equivalent: "Юに코δ" (UTF-8).encode('UTF-8', 'iso-8859-1')
Note order of arguments: to ← from
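A sketch of both spellings, plus the reverse operation for repairing double-encoded data, could look like this. All three method names are hypothetical.

```ruby
# Relabel the UTF-8 bytes as ISO-8859-1, then transcode to UTF-8.
def double_encode(str)
  str.dup.force_encoding('ISO-8859-1').encode('UTF-8')
end

# Same effect with the two-argument form: encode(to, from).
def double_encode_via_args(str)
  str.encode('UTF-8', 'ISO-8859-1')
end

# Reverse the damage: transcode back to ISO-8859-1 bytes and relabel.
def undo_double_encode(str)
  str.encode('ISO-8859-1').force_encoding('UTF-8')
end

garbled = double_encode("Юに코δ")
p garbled.length                               # one character per original byte
p garbled == double_encode_via_args("Юに코δ")
puts undo_double_encode(garbled)
```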
(I always disliked the order of arguments in iconv, but it was unavoidable here, because usually, just the 'to' encoding is needed, so that has to come first.)
"abc\xFE" (UTF-8).encode 'SJIS' ⇒ Encoding::InvalidByteSequenceError
"Юに코δ" (UTF-8).encode 'SJIS' ⇒ Encoding::UndefinedConversionError
undef: What to do with undefined characters (:replace/:ignore (:ignore can lead to security issues!))
invalid: What to do with invalid bytes (:replace/:ignore (:ignore can lead to security issues!))
replace: String used as replacement
fallback: Object (e.g. Hash) used to look up replacement
Additional options for newline conversion and XML escaping
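How these options change the outcome of a conversion can be sketched as follows. The three helper names are invented for this example, 'Shift_JIS' is used as the target, and the fallback table is just a sample.

```ruby
# Returns the exception class raised by a plain conversion, or nil.
def conversion_error(str)
  str.encode('Shift_JIS')
  nil
rescue Encoding::InvalidByteSequenceError, Encoding::UndefinedConversionError => e
  e.class
end

# Replace both invalid bytes and undefined characters with "?".
def to_sjis_lossy(str)
  str.encode('Shift_JIS', invalid: :replace, undef: :replace, replace: '?')
end

# Look up replacements for undefined characters in a Hash.
def to_sjis_with_fallback(str, table)
  str.encode('Shift_JIS', fallback: table)
end

p conversion_error("abc\xFE")
p conversion_error("Юに코δ")
puts to_sjis_lossy("Юに코δ").encode('UTF-8')
puts to_sjis_with_fallback("Юに코δ", { "코" => "ko" }).encode('UTF-8')
```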
RUBYSOURCE/enc
RUBYSOURCE/enc/trans
(encdb.h, transdb.h)
An encoding model describes which encoding(s) can be used in what part of an application (e.g. externally, internally). It defines the conditions and restrictions with respect to string processing that the application (programmer) has to maintain. When creating a Ruby application, it is important to choose the appropriate encoding model.
For applications that use more character semantics outside the ASCII range, or that keep data for a long time, the ideal solution is to use only a single encoding. This should be UTF-8 because that covers all of Unicode and works best with Ruby.
If data (e.g. files) in other encodings also have to be handled, it will in most cases be best to adopt the model of most other programming languages: Use Unicode (i.e. UTF-8 for Ruby) inside, and convert on input/output.
Ruby would also allow the creation of an application using many different encodings internally at the same time. However, this is not what the encoding model of Ruby was created for, and it should be avoided if at all possible.
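The output half of the "UTF-8 inside, convert on input/output" model can be sketched as follows: the program keeps a UTF-8 string and lets the file's external encoding do the conversion on write. The method name and the temporary directory are assumptions of this example.

```ruby
require 'tmpdir'

# Writes a UTF-8 string to a file opened with external encoding
# ISO-8859-1 and returns the bytes that ended up on disk.
def write_as_latin1(text)
  Dir.mktmpdir do |dir|
    path = File.join(dir, 'out.txt')
    File.open(path, 'w:ISO-8859-1') { |f| f.write(text) }
    File.binread(path).bytes
  end
end

p write_as_latin1("Dürst")
```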
Important: Don't touch system encodings
(except maybe for a framework)
Ruby leaves open some important i18n support:
Various libraries are available for such tasks, mostly in the form of Ruby Gems
require statements
require "unicode_utils/upcase"
UnicodeUtils.upcase("weiß") ⇒ "WEISS"
UnicodeUtils.upcase("i", :tr) ⇒ "İ"
1337.32.localize(:es).to_s ⇒ "1.337,32"
ActiveSupport::Multibyte
Chars class as wrapper for String
translate aka t
localize aka l
In templates, replace
<h1>Hello Rails!</h1>
with
<h1><%= t 'welcome.rails' %></h1>
welcome.rails is a structured key used for looking up the translated string
Tedious but simple!
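The idea of a structured key can be illustrated in plain Ruby with nested Hashes. This is only a simplified stand-in: the method `lookup` and the sample data are invented here, and the real Rails I18n backend does much more (pluralization, interpolation, fallbacks).

```ruby
# Hypothetical sample data: one nested Hash per locale.
def demo_translations
  {
    'en' => { 'welcome' => { 'rails' => 'Hello Rails!' } },
    'de' => { 'welcome' => { 'rails' => 'Hallo Rails!' } }
  }
end

# Splits the structured key at the dots and walks down the nested Hash.
def lookup(translations, locale, key)
  key.split('.').inject(translations.fetch(locale)) do |node, part|
    node.fetch(part)
  end
end

puts lookup(demo_translations, 'en', 'welcome.rails')
puts lookup(demo_translations, 'de', 'welcome.rails')
```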
Where to get your text replacements from:
config/locales
Hashes
Send questions and comments to Martin Dürst, duerst@it.aoyama.ac.jp
The latest version of this tutorial is available at: