Exploring Better Source Editing for Bidirectional XHTML and XML
Martin J. Dürst (duerst@it.aoyama.ac.jp)
with Shiro Horie and Yusaku Wada
28th Internationalization and Unicode Conference,
September 2005, Orlando, Florida, USA
This talk and related material are available at:
http://www.sw.it.aoyama.ac.jp/2005/pub/IUC28-bidi
© 2005 Martin J. Dürst, Aoyama Gakuin University
The Problem
HTML and XML source is easily readable:
<p>Hello, <img src='foo.jpg'
alt="abcd efgh"> IJKLM!</p>
HTML and XML source with Arabic or Hebrew gets garbled:
<p>Hello, <img src='foo.jpg'
alt="ابةتث جحخد"> ذرزسش!</p>
<p>Hello, <img src='foo.jpg'
alt="אבגד הוזח"> טיךכ!</p>
The Reason
- Arabic and Hebrew are written right-to-left
- Unicode Bidirectional Algorithm (UAX #9, bidi algorithm):
- defines how bidi text is rendered
- designed for free text, not markup
- uses special characters for tweaking when necessary
- punctuation is 'weak'
- Markup source:
- uses 'weak' punctuation as syntactic delimiters
- gets confused by bidi control characters
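
The point about 'weak' punctuation can be checked against the Unicode character database. The following is a minimal sketch using Python's standard `unicodedata` module; the helper name `bidi_classes` is made up for this illustration. In UAX #9 terms, most markup delimiters are neutral (ON) or weak (CS), so they take their direction from neighbouring strong characters.

```python
import unicodedata


def bidi_classes(text):
    """Return (character, bidi class) pairs for each character in text."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]


# Markup delimiters have no strong direction of their own:
# '<', '>', '=' and '"' are ON (other neutral), '/' is CS (common separator).
for ch, cls in bidi_classes('<>="/'):
    print(repr(ch), cls)

# Letters are strong: Latin is L, Hebrew is R, Arabic is AL.
print(bidi_classes("a\u05d0\u0627"))
```

Because the delimiters are not strong, a sequence such as `">` between two runs of Hebrew or Arabic text is displayed right-to-left together with that text, which is what garbles the source.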
The Solution
- The bidi algorithm allows 'higher level protocols' to affect
rendering
- Modern 'plain-text' editors do syntax highlighting
- Make better bidi source rendering part of syntax highlighting
- Make it as easy and natural as possible
Methodology: Prototyping
- Prototyping using HTML as output
- Use HTML bidi markup to control bidi behaviour
- Use browser for bidi rendering
- Start with one line, expand to multi-line
- Use a lot of coloring
- World-wide test and feedback with Web form
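
The idea of producing HTML output and letting the browser do the bidi rendering can be sketched as follows. This is not the prototype's code. It is a minimal illustration that assumes a single line of input and tags with no '>' inside attribute values. The function name `render_source_line` and the class names `tag` and `text` are hypothetical.

```python
import html
import re

# Assumption: a tag runs from '<' to the next '>' (no '>' inside attribute values).
TAG_RE = re.compile(r"(<[^>]*>)")


def render_source_line(line):
    """Return an HTML fragment that displays one line of markup source."""
    parts = []
    for piece in TAG_RE.split(line):
        if not piece:
            continue  # re.split yields empty strings at the edges
        escaped = html.escape(piece)  # show the source literally
        if piece.startswith("<"):
            # Give each tag its own left-to-right context so that its
            # delimiters are not reordered by neighbouring RTL text.
            parts.append('<span dir="ltr" class="tag">%s</span>' % escaped)
        else:
            parts.append('<span class="text">%s</span>' % escaped)
    return "".join(parts)


print(render_source_line("<p>Hello, <img src='foo.jpg'> IJKLM!</p>"))
```

The `class` attributes give a stylesheet something to hook coloring onto; the `dir` attribute is the HTML bidi markup that does the actual work.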
The Importance of Source
Why is source important?
- Most HTML and XML is produced by tools, but
- Copy-paste allowed the quick growth of HTML and the Web
- The 'real thing'
- Bootstrapping and tooling
- Education
Background
(see paper or ask)
What has to happen:
- Syntax characters have to be strengthened
- LRM and RLM (characters): show as icons, make them behave as themselves
- LRE, RLE, LRO, RLO, PDF: show as red icons (strongly discouraged), no effect (yet)
- &lrm;, &rlm; (entities, (X)HTML only): protect weak characters, make them behave as themselves
- <bdo> with dir attribute in (X)HTML: make content display with forced direction
- Other elements with dir attribute in (X)HTML: make content display as embedding
- URI values in (X)HTML: impose LTR context
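
The treatment of bidi control characters in the list above can be illustrated with a small sketch. It assumes plain-text labels such as `[LRM]` in place of icons, and a `!` prefix in place of red coloring. The names `BIDI_CONTROLS` and `show_controls` are hypothetical and not taken from the prototype.

```python
# Bidi control characters: code point -> (label, strongly discouraged?)
BIDI_CONTROLS = {
    "\u200e": ("LRM", False),
    "\u200f": ("RLM", False),
    "\u202a": ("LRE", True),
    "\u202b": ("RLE", True),
    "\u202c": ("PDF", True),
    "\u202d": ("LRO", True),
    "\u202e": ("RLO", True),
}


def show_controls(text):
    """Make bidi control characters visible in a line of source."""
    out = []
    for ch in text:
        if ch in BIDI_CONTROLS:
            name, discouraged = BIDI_CONTROLS[ch]
            if discouraged:
                # Shown with a warning marker; the character itself is
                # dropped, so it has no effect on the display.
                out.append("[!%s]" % name)
            else:
                # Shown as a label; the character is kept so that it
                # still behaves as itself.
                out.append("[%s]%s" % (name, ch))
        else:
            out.append(ch)
    return "".join(out)


print(show_controls("abc\u200e def\u202e ghi\u202c"))
```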
Implementation Details: Parsing
Three different kinds of structures:
- Low-level syntactic units:
element names, tag start characters,...: State machine
- Higher-level syntactic units:
elements,...: Stack
- Output units:
lines
The main problem: mismatch between XML syntactic units and lines. Even an
attribute value can cover more than one line.
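
How the three kinds of structure fit together can be sketched with a scanner that is fed one line at a time and carries its state machine and element stack across line boundaries. This is a simplified illustration, not the prototype's parser: it assumes well-formed XML-style input and ignores comments, CDATA sections and entities. The class name `SourceScanner` and its members are hypothetical.

```python
class SourceScanner:
    """Carries low-level state and an element stack from line to line."""

    def __init__(self):
        self.state = "text"  # "text", "tag", or "value"
        self.quote = None    # quote character of an open attribute value
        self.stack = []      # names of currently open elements
        self._tag = ""       # tag characters read so far (values skipped)

    def feed_line(self, line):
        """Scan one line; return the state in which the next line starts."""
        for ch in line:
            if self.state == "text":
                if ch == "<":
                    self.state, self._tag = "tag", ""
            elif self.state == "tag":
                if ch in "\"'":
                    self.state, self.quote = "value", ch
                elif ch == ">":
                    self._close_tag()
                    self.state = "text"
                else:
                    self._tag += ch
            elif self.state == "value":
                if ch == self.quote:
                    self.state, self.quote = "tag", None
        if self.state == "tag":
            self._tag += " "  # a line break inside a tag acts as whitespace
        return self.state

    def _close_tag(self):
        tag = self._tag.strip()
        if tag.startswith("/"):
            # End tag: pop the matching open element, if any.
            if self.stack and self.stack[-1] == tag[1:].strip():
                self.stack.pop()
        elif tag.endswith("/"):
            pass  # empty element: nothing to push
        elif tag and not tag.startswith(("!", "?")):
            self.stack.append(tag.split()[0])  # start tag: push element name


scanner = SourceScanner()
for source_line in ["<p>Hello, <img src='foo.jpg'", 'alt="abcd', 'efgh" /> IJKLM!</p>']:
    print(scanner.feed_line(source_line), scanner.stack)
```

In the usage example the attribute value spans two lines, so the second line ends in the "value" state; a renderer would need that state to know that the third line starts inside an attribute value.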
Prototype Demo
http://www.sw.it.aoyama.ac.jp/cgi-bin/bidi-source-test
Prototype Configuration Options
- Privacy: allows testing on text/data you don't want anybody else to
see
- New line: make input easier
- Markup language: tag nesting, bidi-specific markup in (X)HTML
- Global direction: overall direction for markup and content
- Color: we have some source coloring, but it's not well worked out and
probably too detailed
Conclusions
Please test the prototype and provide feedback!
Future Work
- Improve and polish prototype:
- 'upper-case as RTL'
- more sophisticated nesting
- LRE/RLE/.../PDF, e.g. in attributes
- HTML4 (missing end tags,...)
- error resiliency
- Support file upload
- Implement in actual editor(s): emacs? scintilla/notepad2? others?