Ruby
M17N
Ruby Kaigi'08, Tsukuba, Japan, 2008/6/21
成瀬 ゆい / Martin J. Dürst
http://www.sw.it.aoyama.ac.jp/2008/pub/RubyKaigiM17N.html
© 2008 成瀬 ゆい / Martin J. Dürst
Naruse-san's Part
Slides, Notes
Character Encoding Conversion
- Usage
- Internals
- Why our own
- Files to look at
- Layer structure
- UTF-8 conversion hub
- Data table details
[see the Colophon for how best to view these slides]
String#encode
Usage
Hint: See test/ruby/test_transcode.rb
- Changing the encoding of a string:
s2 = string.encode(to_enc)
- Explicit from-encoding (avoiding .force_encoding):
s2 = s1.encode(to_enc, from_enc)
- Options:
s.encode(..., invalid: :ignore)
- (more options to come)
- Destructive conversion:
s1.encode!(...)
- (never returns nil)
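The calls above can be tried as follows. This is a minimal sketch, assuming a released Ruby (1.9 or later); there, the draft option `invalid: :ignore` shown on the slide appears to be spelled `invalid: :replace`, optionally with `replace: ""` to drop the offending bytes. The method name `encode_demo` and the sample strings are illustrative only.

```ruby
# A small tour of String#encode; sample strings are illustrative only.
def encode_demo
  utf8 = "\u30C6\u30B9\u30C8"           # katakana "tesuto", a UTF-8 string

  # Changing the encoding of a string: returns a new string.
  sjis = utf8.encode("Shift_JIS")

  # Explicit from-encoding: the bytes carry a binary tag, so the real
  # encoding is passed to encode instead of calling force_encoding.
  raw         = sjis.b                  # copy tagged ASCII-8BIT
  from_binary = raw.encode("UTF-8", "Shift_JIS")

  # Options: drop bytes that are invalid in the source encoding.
  broken  = "abc\xFFdef".dup.force_encoding("UTF-8")
  cleaned = broken.encode("ISO-8859-1", invalid: :replace, replace: "")

  # Destructive conversion: encode! changes the receiver and returns it.
  buffer      = utf8.dup
  bang_result = buffer.encode!("EUC-JP")

  { sjis: sjis, from_binary: from_binary, cleaned: cleaned,
    buffer: buffer, bang_result: bang_result }
end
```

Unlike methods such as `String#upcase!`, `encode!` hands back the receiver even when nothing changed, so the result can be chained safely.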
Why New Library
- Copyright issues
- Platform-independence
- One-stop shopping
- Adapt to Ruby's needs (memory, performance)
- Use Ruby mechanisms (e.g. dynamic loading)
Naming and Files
Naming caution: the method is String#encode; the rest uses trans (transcode.c, str_transcode, transcode_dispatch, transcode_loop)
Layer Structure
- str_encode (_bang): String method implementations
- str_transcode: Parameter analysis/conversion, memory handling,...
- transcode_dispatch: Conversion selection
- transcode_loop: Byte-by-byte loop
(top-down, in transcode.c)
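The layering can be mirrored in a few lines of Ruby. This is only a sketch of the division of labour, not the C code in transcode.c: the module `ToyTranscode`, its `toy_*` methods, and the single ISO-8859-1 ⇒ UTF-8 converter are all hypothetical names chosen for illustration.

```ruby
# Hypothetical Ruby mirror of the four layers; the real code is C.
module ToyTranscode
  # Per-byte rule for ISO-8859-1 => UTF-8 (the only toy converter here).
  LATIN1_TO_UTF8 = lambda do |b|
    b < 0x80 ? [b] : [0xC0 | (b >> 6), 0x80 | (b & 0x3F)]
  end

  CONVERTERS = { ["ISO-8859-1", "UTF-8"] => LATIN1_TO_UTF8 }.freeze

  module_function

  # Layer 1: what a String method implementation would call.
  def toy_encode(str, to_enc, from_enc = str.encoding.name)
    toy_transcode(str, to_enc, from_enc)
  end

  # Layer 2: parameter analysis and handling of the result string.
  def toy_transcode(str, to_enc, from_enc)
    to   = Encoding.find(to_enc).name
    from = Encoding.find(from_enc).name
    return str.dup if to == from
    bytes = toy_loop(toy_dispatch(from, to), str.bytes)
    bytes.pack("C*").force_encoding(to)
  end

  # Layer 3: conversion selection.
  def toy_dispatch(from, to)
    CONVERTERS.fetch([from, to]) do
      raise ArgumentError, "no toy converter for #{from} => #{to}"
    end
  end

  # Layer 4: byte-by-byte loop over the input.
  def toy_loop(per_byte, bytes)
    out = []
    bytes.each { |b| out.concat(per_byte.call(b)) }
    out
  end
end
```

For example, `ToyTranscode.toy_encode(latin1_string, "UTF-8")` walks down all four layers, while an unsupported pair stops at the dispatch layer with an `ArgumentError`.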
UTF-8 Conversion Hub
- Most conversions work well via Unicode
- Unicode for Ruby means UTF-8
- Direct conversion (e.g. EUC-JP⇒Shift_JIS) not excluded
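Why a hub helps can be seen by counting converters and by routing a conversion through UTF-8 by hand. The sketch below uses only String#encode; the helper names `converters_via_hub`, `converters_direct`, and `via_utf8` are made up for this example, and the counts are plain arithmetic rather than figures from the implementation.

```ruby
# Converters needed for n encodings when everything goes through a hub:
# one to the hub and one from it for each of the other n - 1 encodings.
def converters_via_hub(n)
  2 * (n - 1)
end

# Converters needed when every ordered pair is converted directly.
def converters_direct(n)
  n * (n - 1)
end

# Conversion through the hub, spelled out with two encode calls.
def via_utf8(str, to_enc)
  str.encode("UTF-8").encode(to_enc)
end

eucjp  = "\u65E5\u672C\u8A9E".encode("EUC-JP")   # "nihongo" in kanji
direct = eucjp.encode("Shift_JIS")
hubbed = via_utf8(eucjp, "Shift_JIS")

puts direct == hubbed
puts converters_via_hub(10)   # 18
puts converters_direct(10)    # 90
```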
Transcoding Core
- Byte-by-byte
- Data-driven
- Two tables per byte
- One big central function (transcode_loop):
  - No chance of missing options,...
- Some helper functions
  Examples: Pre/postconversions, code mapping, buffer extension
Conversion Data Example
One table per byte
(Shift_JIS ⇒ UTF-8, first byte)
Byte value | Action
0x00 |
... |
0x61 ('a') | Copy input (0x61, 'a')
... |
0xC3 ('ﾃ') | Output "\xEF\xBE\x83"
... |
0xE0 | Go to table for 0xE0...
... |
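A toy version of such a table, and a loop that follows it, might look like this. It is a sketch only: the action encoding (`:copy`, `:invalid`, a String of output bytes, or an Array as the table for the next byte) is an assumption made for readability, and only a handful of Shift_JIS entries are filled in.

```ruby
# Hypothetical action encoding used in this sketch:
#   :copy     copy the input byte       :invalid  byte not allowed here
#   String    output these bytes        Array     table for the next byte
def build_toy_sjis_tables
  # Second-level table for lead byte 0x83; only one entry is filled in.
  lead_83 = Array.new(256, :invalid)
  lead_83[0x65] = "\xE3\x83\x86".b      # U+30C6 KATAKANA LETTER TE

  first = Array.new(256, :invalid)
  (0x00..0x7F).each { |b| first[b] = :copy }
  first[0xC3] = "\xEF\xBE\x83".b        # U+FF83 HALFWIDTH KATAKANA LETTER TE
  first[0x83] = lead_83
  first
end

# Byte-by-byte loop: one table lookup per input byte.
def toy_transcode_loop(input, first_table)
  out   = String.new                    # empty binary string
  table = first_table
  input.each_byte do |byte|
    action = table[byte]
    table  = first_table                # default: next byte starts afresh
    case action
    when :copy  then out << byte
    when String then out << action
    when Array  then table = action     # one more byte is needed
    else raise ArgumentError, format("invalid byte 0x%02X", byte)
    end
  end
  raise ArgumentError, "truncated input" unless table.equal?(first_table)
  out.force_encoding("UTF-8")
end
```

Feeding the bytes 0x61, 0xC3, 0x83, 0x65 through `toy_transcode_loop` exercises all three kinds of action: copy, direct output, and a jump to a second table.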
Problems with one Table per Byte
- 1K memory per leading byte
- Similar values are frequent (undefined, invalid,...)
- Sparse tables are frequent (e.g. UTF-8 ⇒ something)
Two Tables per Byte
- Indices (offsets):
  - One byte per byte
  - Table size 256 bytes (or 64 bytes for UTF-8 trailing bytes)
  - index 0-255 into infos
- Instructions (infos):
  - Pointers: Proceed to next table for next byte
  - Actual output: 1 byte, 2 bytes, 3 bytes (4 bytes)
  - Copy input
  - Invalid, undefined
  - Call some function (e.g. UTF-16...)
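The split into offsets and infos amounts to removing duplicate entries from a 256-entry table. The sketch below shows the idea; `split_table`, `lookup`, and `table_bytes` are hypothetical helpers, and the size estimate assumes 4-byte info words and 1-byte offsets, which matches the 1K figure for a single 256-entry table but is otherwise an assumption.

```ruby
# Split one 256-entry action table into offsets + infos by storing
# each distinct action only once.
def split_table(actions)
  infos   = actions.uniq
  index   = infos.each_with_index.to_h   # action => position in infos
  offsets = actions.map { |action| index.fetch(action) }
  [offsets, infos]
end

# Two-step lookup: byte -> offset -> info.
def lookup(offsets, infos, byte)
  infos[offsets[byte]]
end

# Size estimate, ASSUMING 1-byte offsets and 4-byte info words.
def table_bytes(offsets, infos)
  offsets.size + 4 * infos.size
end

# A sparse toy table: ASCII is copied, one byte has output, rest invalid.
actions = Array.new(256, :invalid)
(0x00..0x7F).each { |b| actions[b] = :copy }
actions[0xC3] = "\xEF\xBE\x83".b

offsets, infos = split_table(actions)
puts table_bytes(offsets, infos)   # 268, versus 1024 for one table
```

The saving grows with the number of repeated entries, which is why the frequent "invalid" and "undefined" cases matter so much.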
Two Tables Example
(Shift_JIS ⇒ UTF-8, first byte)
offsets
Byte value | offset
0x00 | 0
... | 0
0x61 ('a') | 0
... |
0xC3 ('ﾃ') | 23
... |
0xE0 | 42
... |
infos
offset | info
0 | 0
... |
23 | output "\xEF\xBE\x83"
... |
42 | goto table for 0xE0...
... |
Advantages of Two Tables per Byte
- Sharing infos in same byte
- Increase sharing with special info values (e.g. copy-as-is)
- Sharing tables when they are equal
  - Shared code structure (e.g. offsets, e.g. UTF-8⇒EUC-JP, UTF-8⇒Shift_JIS)
  - Shared conversion results (e.g. Shift_JIS⇒UTF-8, CP932⇒UTF-8)
- Working on table sharing when they are close, but not equal
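Sharing equal tables can be pictured as interning: a table is stored once and every converter that needs the same contents points at that single copy. The class `TablePool` below is a hypothetical illustration of this idea in Ruby, not the mechanism used by the table generation script.

```ruby
# Hypothetical table pool: tables with identical contents are stored once.
class TablePool
  def initialize
    @pool = {}
  end

  # Returns the pooled copy of +table+, adding it if it is new.
  # Arrays hash by content, so equal tables find the same entry.
  def intern(table)
    frozen = table.dup.freeze
    @pool[frozen] ||= frozen
  end

  # Number of distinct tables stored so far.
  def size
    @pool.size
  end
end

pool = TablePool.new
# Two converters with the same input structure ask for the same offsets.
offsets_a = pool.intern([0, 0, 1, 2])
offsets_b = pool.intern([0, 0, 1, 2])
puts offsets_a.equal?(offsets_b)   # true: one shared object
puts pool.size                     # 1
```

Tables that are close but not equal are not caught by this scheme, which is the open point named in the last bullet.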
Currently Supported Encodings
- (UTF-8)
- iso-8859-1...15
- UTF-(16|32)(BE|LE)
- EUC-JP, Shift_JIS, iso-2022-jp, ...
- EUC-KR, CP949
Tell us what you need!
Next Steps
- Reimplement table switching (iso-2022-jp,...)
- More encodings
- Cleanup and publish table generation script
- More testing
- More options
- More data optimizations
- Backtracking (e.g. for windows-1258)
- Expose conversions as Ruby objects
- Addressing 64-bit architectures
(order mostly insignificant)
Naruse-san's Summary
Slides, Notes,
Questions and Answers
Colophon
or how these slides were produced, and how they are best viewed