Implementing Better Source Editing
for Bidirectional XHTML and XML

Martin J. Dürst (duerst@it.aoyama.ac.jp)
with Kunihiro Sato

32nd Internationalization and Unicode Conference,
September 2008, San Jose, CA, USA

This talk and related material are available at:
http://www.sw.it.aoyama.ac.jp/2008/pub/IUC32-bidi

What's Your Background?

HTML, XML, and XHTML
Unicode bidi

The Problem

HTML and XML source is easily readable:

<p>Welcome <img src='fuji.jpg'
   alt="Mt. Fuji"> to Japan!</p>

HTML and XML source with Arabic or Hebrew gets garbled:

<p>ابةتث <img src='foo.jpg'
   alt=" جحخد"> ذرزسش!</p>

<p>אבגד  <img src='foo.jpg'
   alt="הוזח"> טיךכ!</p>

The Reason

Arabic and Hebrew are written right-to-left
Unicode Bidirectional Algorithm (UAX #9):
- defines how bidi text is rendered
- designed for free text, not markup
- punctuation is 'weak'
Markup source:
- uses 'weak' punctuation as syntactic delimiters
- mixes right-to-left and left-to-right in unusual ways

Potential 'Solutions'

Sprinkle in some bidi control characters
Fix HTML and/or XML
Fix the bidi algorithm
Use lots of line breaks
Feel our way around in the dark

The Real Solution

The bidi algorithm allows 'higher level protocols' to affect rendering
Modern 'plain-text' editors do syntax highlighting
Integrate bidi source rendering with syntax highlighting
Make it as easy and natural as possible

Implementation Options

Prototype simulation (see IUC28 talk, 2005)
Emacs-bidi (using overlays)
Editor based on Scintilla editing component
JavaScript in Web browsers

Why JavaScript?

Available virtually everywhere (with variations)
Browsers already implement Unicode bidi algorithm
Editing already implemented (document.designMode)
Experience from our prototype can be reused
Existing JavaScript-based source editors

Demo: Current Application Structure

http://www.sw.it.aoyama.ac.jp/2008/pub/IUC32-bidi/bidi-source.html

Preview: What you will get in the end
Controls for examples or source import/export
Source Editor: Where you work
Bad Source: What we wanted to avoid
Backend Source: How it was done

Limitations

Re-formatting destroys selection/cursor location
→ timing set to react slowly
No direct load/save
→ raw source field and ↘↖ arrows
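
The "react slowly" workaround can be pictured as a delayed (debounced) re-format: the source is only re-formatted once typing has paused, so the selection is lost less often. The following is a minimal sketch under that assumption, not the prototype's actual code; the names makeDelayedReformatter, reformat, delayMs and timers are hypothetical.

```javascript
// Hypothetical sketch: postpone re-formatting until the user pauses.
// `timers` is injectable so the logic can be exercised without real delays.
function makeDelayedReformatter(
  reformat,
  delayMs,
  timers = {
    set: (fn, ms) => setTimeout(fn, ms),
    clear: (id) => clearTimeout(id),
  }
) {
  let pending = null; // handle of the scheduled re-format, if any

  // Call this on every change to the source (for example from a key event).
  return function onSourceChanged() {
    if (pending !== null) {
      timers.clear(pending); // a newer change cancels the older schedule
    }
    pending = timers.set(() => {
      pending = null;
      reformat();
    }, delayMs);
  };
}
```

A longer delay means fewer cursor jumps but a longer wait before coloring and bidi corrections catch up with the text.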

Tested with

Opera (9.52)
Mozilla Firefox (2.0.0.16)
IE 6 (6.0.2900.5512)
Safari Win (3.1.2)
Google Chrome (0.2.149.29)

(all on a Japanese Windows XP with SP 3)

How it Works

HTML/XML source treated as simple data

Marked up again for coloring and bidi corrections

Whenever source changes:

Strip out existing markup
Analyze source again, in three main parts:
- Main constructs (tags, comments,...)
- Start tag (complex substructure)
- Names and content (including dealing with linebreaks)
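
The strip-and-analyze cycle above can be sketched roughly as follows. This is an illustrative simplification, not the prototype's code: it only covers the first part (main constructs), it treats declarations such as a DOCTYPE as start tags, and the function names and CSS class names are assumptions.

```javascript
// Escape source text so it is displayed literally inside the editor.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Strip the markup added by the previous pass and recover the plain source.
// Assumes the editor only contains our own wrappers, <br>, and escaped text.
function stripMarkup(editorHtml) {
  return editorHtml
    .replace(/<br\s*\/?>/gi, "\n")
    .replace(/<[^>]*>/g, "")
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&amp;/g, "&"); // last, so "&amp;lt;" becomes "&lt;", not "<"
}

// Split the source into main constructs: comments, tags, and text.
function tokenize(source) {
  const pattern = /<!--[\s\S]*?-->|<\/[^>]*>|<[^>]*>|[^<]+|</g;
  const tokens = [];
  for (const match of source.matchAll(pattern)) {
    const text = match[0];
    let kind = "text"; // also used for a stray "<"
    if (text.startsWith("<!--")) kind = "comment";
    else if (text.startsWith("</")) kind = "endTag";
    else if (text.startsWith("<") && text.length > 1) kind = "startTag";
    tokens.push({ kind, text });
  }
  return tokens;
}

// Mark the source up again, one wrapper per construct, for coloring.
function highlight(source) {
  return tokenize(source)
    .map((t) => `<span class="${t.kind}">${escapeHtml(t.text)}</span>`)
    .join("");
}
```

Because stripMarkup undoes what highlight adds, the cycle can be repeated on every change without the source drifting.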

Strengthening Syntax Characters

Original: <!-- (actually, &lt;!--)
Strengthened using LRM: &lrm;<!--&lrm;
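
As a small sketch of this strengthening step, assuming a comment token has already been isolated: the delimiters are wrapped in LEFT-TO-RIGHT MARK (U+200E) so that the weak punctuation keeps its left-to-right order next to Arabic or Hebrew content. The function names are hypothetical.

```javascript
const LRM = "\u200E"; // LEFT-TO-RIGHT MARK

// Surround a syntactic delimiter with LRMs so it behaves as strong LTR.
function strengthen(delimiter) {
  return LRM + delimiter + LRM;
}

// `comment` is assumed to be a complete "<!-- ... -->" token.
function strengthenComment(comment) {
  const body = comment.slice(4, -3); // text between "<!--" and "-->"
  return strengthen("<!--") + body + strengthen("-->");
}
```

The same wrapping could be applied to other delimiters such as "<", ">", "=" and quotes; which ones need it is a design decision this sketch leaves open.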

Show Marks

Original: invisible, (maybe) with some effect
Made raw marks visible as LRM
Added actual marks on outside to activate
- for raw marks
- for Numeric Character References (&#x200E;,...)
- for named entities (&lrm;,...)
Protected marks used for correction with <span class="help">&lrm;</span> to distinguish them from raw marks
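
One way to picture the "show marks" step is sketched below, on plain source text. Raw marks are replaced by a visible label, references are kept as written, and in every case a real mark is placed on each side so the displayed source gets the directional effect. This is an assumption-laden sketch: the function name is hypothetical, only LRM and RLM are handled, and references with leading zeros are ignored.

```javascript
const LRM = "\u200E"; // LEFT-TO-RIGHT MARK
const RLM = "\u200F"; // RIGHT-TO-LEFT MARK

// Matches raw marks, numeric character references, and named entities.
const MARK_PATTERN = /[\u200E\u200F]|&(?:lrm|rlm|#[xX]200[eEfF]|#820[67]);/g;

function showMarks(sourceText) {
  return sourceText.replace(MARK_PATTERN, (found) => {
    const isLrm =
      found === LRM || /^&(?:lrm|#[xX]200[eE]|#8206);$/.test(found);
    const mark = isLrm ? LRM : RLM;
    // Raw marks are invisible, so show a label; references are already text.
    const isRaw = found === LRM || found === RLM;
    const visible = isRaw ? (isLrm ? "LRM" : "RLM") : found;
    return mark + visible + mark; // actual marks on the outside
  });
}
```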

Warn about Controls

LRE, RLE, LRO, RLO, PDF should not be used in HTML
Showing them in red, with tooltip warning
Deactivating them
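
A rough sketch of the warning step, assuming the text passed in has already been HTML-escaped: each embedding or override control is replaced by its visible name inside a wrapper that carries the tooltip. Replacing the raw character is what deactivates it. The class name "warn" and the function name are hypothetical; the red color would come from a style rule for that class.

```javascript
// The five explicit embedding/override controls and their short names.
const CONTROLS = {
  "\u202A": "LRE",
  "\u202B": "RLE",
  "\u202C": "PDF",
  "\u202D": "LRO",
  "\u202E": "RLO",
};

function warnAboutControls(escapedText) {
  return escapedText.replace(/[\u202A-\u202E]/g, (ch) => {
    const name = CONTROLS[ch];
    // The raw control is dropped, so it no longer affects rendering.
    return `<span class="warn" title="${name} should not be used in HTML">${name}</span>`;
  });
}
```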

HTML Embeddings and Overrides

Made element content work as embedding/override
Made element outside work as embedding/override
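
For the first point, a possible sketch: when an element in the source carries a dir attribute, its displayed content is wrapped so that the editor view shows the same embedding (or, for bdo, the same override). This is only one reading of the slide and not the prototype's code; the function name and parameters are hypothetical, and the "element outside" case is not covered.

```javascript
// tagName and dir come from the analyzed start tag;
// contentHtml is the already highlighted element content.
function wrapContent(tagName, dir, contentHtml) {
  if (dir !== "ltr" && dir !== "rtl") {
    return contentHtml; // no usable dir value: leave the content alone
  }
  if (tagName.toLowerCase() === "bdo") {
    return `<bdo dir="${dir}">${contentHtml}</bdo>`; // override
  }
  return `<span dir="${dir}">${contentHtml}</span>`; // embedding
}
```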

Next Steps

Keep selection when reformatting
Smooth cursor operation (automatically jump over helper marks)
Separate out line numbers
Offer alternatives for load and save
Support omitted HTML end tags, dir value inheritance, IRI attribute values
Support for XML (no <bdo>,...)

Exploring

Global direction: overall direction for markup and content
'upper-case as RTL'
more support for LRE/RLE/.../PDF, e.g. in attributes
More error hints

Conclusions

Please test prototype and provide feedback!
