Ruby M17N

Ruby Kaigi'08, Tsukuba, Japan, 2008/6/21

成瀬ゆい / Martin J. Dürst

http://www.sw.it.aoyama.ac.jp/2008/pub/RubyKaigiM17N.html

Naruse-san's Part

Slides, Notes

Character Encoding Conversion

Usage
Internals
- Why our own
- Files to look at
- Layer structure
- UTF-8 conversion hub
- Data table details

[see Colophon for how best to view these slides]

`String#encode` Usage

Hint: See test/ruby/test_transcode.rb

Changing the encoding of a string:: s2 = s1.encode(to_enc)
Explicit from-encoding (avoiding .force_encoding):: s2 = s1.encode(to_enc, from_enc)
Options:: s.encode(..., invalid: :ignore); (more options to come)
Destructive conversion:: s1.encode!(...); (never returns nil)
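
A minimal sketch of the calls listed above, using small Latin-1 sample strings chosen for this illustration. Options are left out, since the slide notes that the option set was still growing; the variable names are assumptions.

```ruby
# Changing the encoding of a string.
utf8   = "caf\u00E9"                  # UTF-8: "é" takes 2 bytes, 5 bytes total
latin1 = utf8.encode("ISO-8859-1")    # same text, now 4 bytes

# Explicit from-encoding: the bytes carry no useful encoding tag,
# so we name the source encoding instead of calling force_encoding first.
raw      = "caf\xE9".b                # raw bytes, tagged as binary
from_raw = raw.encode("UTF-8", "ISO-8859-1")

# Destructive conversion: encode! changes the receiver and returns it.
work        = "abc".dup
result      = work.encode!("UTF-16BE")
same_object = result.equal?(work)
```

Here `from_raw` compares equal to `utf8`, and `same_object` is true because `encode!` hands back its receiver rather than nil.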

Why New Library

Copyright issues
Platform-independence
One-stop shopping
Adapt to Ruby's needs (memory, performance)
Use Ruby mechanisms (e.g. dynamic loading)

Naming and Files

Naming caution: the Ruby method is String#encode; everything else is named trans... (transcode.c, enc/trans/)

Core code: transcode.c
Interface definition: transcode_data.h
Conversion (data) files: enc/trans/single_byte.c, enc/trans/japanese.c,...
"Bootstrap" database: enc/trans/transdb.c, transdb.h (autogenerated, thanks to Nobu Nakada)
Tests: test/ruby/test_transcode.rb

Layer Structure

str_encode(_bang): String method implementations

str_transcode: Parameter analysis/conversion, memory handling,...

transcode_dispatch: Conversion selection

transcode_loop: Byte-by-byte loop

(top-down, in transcode.c)

UTF-8 Conversion Hub

Most conversions work well via Unicode
Unicode for Ruby means UTF-8
Direct conversion (e.g. EUC-JP⇒Shift_JIS) not excluded
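
The hub idea can be sketched as a small path-selection routine. This is a toy model, not the code in transcode.c: the registry `TOY_TRANSCODERS`, the method `toy_conversion_path`, and the list of registered pairs are all hypothetical.

```ruby
HUB = "UTF-8"

# Hypothetical registry of available single-step conversions.
TOY_TRANSCODERS = [
  ["ISO-8859-1", "UTF-8"], ["UTF-8", "ISO-8859-1"],
  ["EUC-JP", "UTF-8"],     ["UTF-8", "EUC-JP"],
  ["Shift_JIS", "UTF-8"],  ["UTF-8", "Shift_JIS"],
  ["EUC-JP", "Shift_JIS"]  # one direct conversion, which the hub does not exclude
].freeze

# Returns the list of [from, to] steps needed for a conversion.
def toy_conversion_path(from, to, registry = TOY_TRANSCODERS)
  return [] if from == to
  # Prefer a direct conversion when one is registered.
  return [[from, to]] if registry.include?([from, to])
  # Otherwise go through the hub: from => UTF-8 => to.
  via_hub = [[from, HUB], [HUB, to]]
  return via_hub if via_hub.all? { |step| registry.include?(step) }
  raise ArgumentError, "no conversion path from #{from} to #{to}"
end

latin_to_euc = toy_conversion_path("ISO-8859-1", "EUC-JP")
euc_to_sjis  = toy_conversion_path("EUC-JP", "Shift_JIS")
```

With a hub, n encodings need about 2n conversion tables instead of one per ordered pair, while a frequently used pair can still register a direct step.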

Transcoding Core

Byte-by-byte
Data-driven
Two tables per byte
One big central function (transcode_loop):
- No chance of missing options,...
Some helper functions
Examples: Pre/postconversions, code mapping, buffer extension

Conversion Data Example

One table per byte
(Shift_JIS ⇒ UTF-8, first byte)
Byte value	Action
0x00
...
0x61 ('a')	Copy input (0x61, 'a')
...
0xC3 ('ﾃ')	output "\xEF\xBE\x83"
...
0xE0	Go to table for 0xE0...
...

Problems with one Table per Byte

1K memory per leading byte
Similar values are frequent
(undefined, invalid,...)
Sparse tables are frequent
(e.g. UTF-8 ⇒ something)

Two Tables per Byte

Indices (offsets):
- One byte per byte
- Table size 256 bytes
  (or 64 bytes for UTF-8 trailing bytes)
- index 0-255 into infos
Instructions (infos):
- Pointers: Proceed to next table for next byte
- Actual output: 1 byte, 2 bytes, 3 bytes (4 bytes)
- Copy input
- Invalid, undefined
- Call some function (e.g. UTF-16...)

Two Tables Example

(Shift_JIS ⇒ UTF-8, first byte)

`offsets`
Byte value	`offset`
0x00	0
...	0
0x61 ('a')	0
...
0xC3 ('ﾃ')	23
...
0xE0	42
...

`infos`
`offset`	`info`
0	0
...
23	output "\xEF\xBE\x83"
...
42	goto table for 0xE0...
...
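
The lookup through `offsets` and `infos` can be sketched in Ruby as follows. This is a toy model under stated assumptions, not the C implementation: `Table`, `NOMAP`, `INVALID`, and `toy_transcode` are hypothetical names, the table holds only a handful of entries, and the two-byte example uses the lead byte 0x82 (0x82 0xA0 is あ in Shift_JIS) because the slide leaves the contents of the 0xE0 table open.

```ruby
NOMAP   = :nomap    # special info: copy the input byte as-is
INVALID = :invalid  # special info: byte not allowed at this position

# One pair of tables: offsets (256 small indices) and infos (instructions).
Table = Struct.new(:offsets, :infos)

# Second-byte table for lead byte 0x82; only 0xA0 is filled in here.
second_offsets = Array.new(256, 0)
second_offsets[0xA0] = 1
SECOND_0X82 = Table.new(second_offsets, [INVALID, "\xE3\x81\x82".b])

# First-byte table: offset 0 means "copy", as on the slide.
first_offsets = Array.new(256, 0)
(0x80..0xFF).each { |b| first_offsets[b] = 1 }  # toy default: invalid
first_offsets[0xC3] = 2
first_offsets[0x82] = 3
FIRST = Table.new(first_offsets,
                  [NOMAP, INVALID, "\xEF\xBE\x83".b, SECOND_0X82])

# Byte-by-byte loop: one offsets lookup, one infos lookup per input byte.
def toy_transcode(bytes, first_table)
  out     = "".b
  table   = first_table
  pending = []                       # input bytes of the current character
  bytes.each_byte do |byte|
    pending << byte
    info = table.infos[table.offsets[byte]]
    case info
    when Table  then table = info; next   # proceed to table for next byte
    when NOMAP  then out << pending.pack("C*")
    when String then out << info          # actual output bytes
    else raise ArgumentError, format("invalid byte 0x%02X", byte)
    end
    table = first_table
    pending.clear
  end
  raise ArgumentError, "truncated input" unless pending.empty?
  out.force_encoding(Encoding::UTF_8)
end

converted = toy_transcode("a\xC3\x82\xA0".b, FIRST)
```

For this input the toy tables give the same result as `String#encode` from Shift_JIS to UTF-8: the letter a, half-width ﾃ, and あ.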

Advantages of Two Tables per Byte

Sharing infos in same byte
Increase sharing with special info values
(e.g. copy-as-is)
Sharing tables when they are equal
- Shared code structure
  (e.g. offsets, e.g. UTF-8⇒EUC-JP, UTF-8⇒Shift_JIS)
- Shared conversion results
  (e.g. Shift_JIS⇒UTF-8, CP932⇒UTF-8)
Working on table sharing when they are close, but not equal
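
To make the sharing of infos within one byte position concrete, the sketch below splits a flat 256-entry table into `offsets` and `infos` and compares sizes. The 4-byte entry width is an assumption, inferred from "1K memory per leading byte" (1024 / 256); the table contents and all names are hypothetical.

```ruby
ENTRY_BYTES = 4  # assumed width of one instruction (info) entry

# One table per byte: 256 full-width entries, most of them repeated.
flat = Array.new(256, :invalid)
(0x00..0x7F).each { |b| flat[b] = :copy }
flat[0xC3] = "\xEF\xBE\x83".b
flat[0x82] = :goto_table_0x82

# Two tables per byte: keep each distinct info once, index it by offset.
infos   = flat.uniq
offsets = flat.map { |info| infos.index(info) }

flat_size  = flat.size * ENTRY_BYTES                  # 256 * 4
split_size = offsets.size + infos.size * ENTRY_BYTES  # 256 * 1 + 4 * 4

# The split form still describes exactly the same table.
round_trip_ok = offsets.map { |o| infos[o] } == flat
```

In this toy case the flat table takes 1024 bytes and the split form 272 bytes. The saving grows when whole `offsets` or `infos` tables can also be shared between conversions.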

Currently Supported Encodings

(UTF-8)
iso-8859-1...15
UTF-(16|32)(BE|LE)
EUC-JP, Shift_JIS, iso-2022-jp, ...
EUC-KR, CP949

Tell us what you need!

Next Steps

Reimplement table switching (iso-2022-jp,...)
More encodings
Clean up and publish table generation script
More testing
More options
More data optimizations
Backtracking (e.g. for windows-1258)
Expose conversions as Ruby objects
Addressing 64-bit architectures

(order mostly insignificant)

Naruse-san's Summary

Slides, Notes, Questions and Answers

Colophon

or how these slides were produced,
and how they are best viewed

XHTML and CSS
Edited with Amaya
Ruby Logo
Best viewed with Opera presentation mode (F11)