Implementing Better Source Editing for Bidirectional XHTML and XML
Martin J. Dürst (duerst@it.aoyama.ac.jp)
with Kunihiro Sato
32nd Internationalization and Unicode Conference,
September 2008, San Jose, CA, USA
This talk and related material are available at:
http://www.sw.it.aoyama.ac.jp/2008/pub/IUC32-bidi
© 2005-8 Martin J. Dürst, Aoyama Gakuin University
What's Your Background?
- HTML, XML, and XHTML
- Unicode bidi
The Problem
HTML and XML source is easily readable:
<p>Welcome <img src='fuji.jpg'
alt="Mt. Fuji"> to Japan!</p>
HTML and XML source with Arabic or Hebrew gets garbled:
<p>ابةتث <img src='foo.jpg'
alt=" جحخد"> ذرزسش!</p>
<p>אבגד <img src='foo.jpg'
alt="הוזח"> טיךכ!</p>
The Reason
- Arabic and Hebrew are written right-to-left
- Unicode Bidirectional Algorithm (UAX #9):
- defines how bidi text is rendered
- designed for free text, not markup
- punctuation is 'weak'
- Markup source:
- uses 'weak' punctuation as syntactic delimiters
- mixes right-to-left and left-to-right in unusual ways
Potential 'Solutions'
- Sprinkle in some bidi control characters
- Fix HTML and/or XML
- Fix the bidi algorithm
- Use lots of line breaks
- Feel our way around in the dark
The Real Solution
- The bidi algorithm allows 'higher level protocols' to affect rendering
- Modern 'plain-text' editors do syntax highlighting
- Integrate bidi source rendering with syntax highlighting
- Make it as easy and natural as possible
Implementation Options
Why Javascript?
- Available virtually everywhere (with variations)
- Browsers already implement Unicode bidi algorithm
- Editing already implemented (document.designMode)
- Experience from our prototype can be reused
- Existing Javascript-based source editors
Demo: Current Application Structure
http://www.sw.it.aoyama.ac.jp/2008/pub/IUC32-bidi/bidi-source.html
- Preview: What you will get in the end
- Controls for examples or source import/export
- Source Editor: Where you work
- Bad Source: What we wanted to avoid
- Backend Source: How it was done
Limitations
Tested with
- Opera (9.52)
- Mozilla Firefox (2.0.0.16)
- IE 6 (6.0.2900.5512)
- Safari Win (3.1.2)
- Google Chrome (0.2.149.29)
(all on a Japanese Windows XP with SP 3)
How it Works
HTML/XML source treated as simple data
Marked up again for coloring and bidi corrections
Whenever source changes:
- Strip out existing markup
- Analyze source again, in three main parts:
- Main constructs (tags, comments,...)
- Start tag (complex substructure)
- Names and content (including dealing with linebreaks)
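The strip-and-reanalyze cycle above can be sketched roughly as follows. This is a minimal illustration, not the prototype's actual code: the function names `stripHighlighting` and `splitMainConstructs` are hypothetical, and it assumes the editor only ever adds `<span>` wrappers around entity-escaped source. Only the first analysis part (main constructs) is shown.

```javascript
// Hypothetical helper names; the slides do not show the prototype's code.

// Step 1: strip the markup added on the previous pass, leaving plain source.
// Assumes the only added markup is <span ...> and that the source itself is
// stored entity-escaped inside the editing area.
function stripHighlighting(html) {
  return html
    .replace(/<\/?span[^>]*>/g, '')
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&amp;/g, '&'); // last, so "&amp;lt;" becomes "&lt;", not "<"
}

// Step 2: split plain source into main constructs (comments, tags, text).
// Simplification: a ">" inside an attribute value ends the tag early, and
// declarations such as <!DOCTYPE ...> are reported as start tags.
function splitMainConstructs(source) {
  const pattern = /<!--[\s\S]*?-->|<\/[^>]*>|<[^>]*>|[^<]+|</g;
  const tokens = [];
  for (const match of source.matchAll(pattern)) {
    const text = match[0];
    let type = 'text'; // also used for a stray, unclosed "<"
    if (text.startsWith('<!--')) type = 'comment';
    else if (text.startsWith('</')) type = 'endTag';
    else if (text.startsWith('<') && text.length > 1) type = 'startTag';
    tokens.push({ type, text });
  }
  return tokens;
}
```

Each token could then be wrapped in a `<span>` carrying its type as a class, with start tags handed to a second pass for their attribute substructure.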
Strengthening Syntax Characters
- Original: <!-- (actually, &lt;!--)
- Strengthened using LRM: &lrm;&lt;!--&lrm;
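A minimal sketch of the strengthening step, under the assumption that every markup delimiter is treated the same way as the comment opener shown on the slide. The function name `strengthenDelimiters` and the exact set of delimiters are illustrative choices, not taken from the prototype.

```javascript
// LEFT-TO-RIGHT MARK: an invisible character with strong left-to-right direction.
const LRM = '\u200E';

// Put an LRM on each side of every markup delimiter so that the "weak"
// punctuation always sits between two strong left-to-right characters.
// The delimiter list here is an assumption; the slide only shows "<!--".
function strengthenDelimiters(source) {
  return source.replace(/<!--|-->|<\/|\/>|[<>="']/g, (d) => LRM + d + LRM);
}

// Removing the marks again recovers the original source, provided the
// source contained no LRMs of its own (those need separate treatment).
function removeStrengthening(strengthened) {
  return strengthened.split(LRM).join('');
}
```

In an editor, the added marks would have to be distinguishable from marks the author typed, which this sketch does not attempt.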
Show Marks
- Original: invisible, (maybe) with some effect
- Made raw marks visible as LRM
- Added actual marks on outside to activate
- for raw marks
- for Numeric Character References (&#x200E;, ...)
- for named entities (&lrm;, ...)
- Protected marks used for correction with <span class="help"></span> to distinguish from raw marks
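The mark handling above might look like the following sketch. It returns a list of segments rather than HTML; a renderer would presumably emit each `help` segment inside `<span class="help"></span>`. The names `showMarks` and `markFor`, the segment kinds, and the choice to place a helper mark on both sides are assumptions made for illustration.

```javascript
// Matches a raw LRM/RLM, a numeric character reference to one, or the
// named entities &lrm; / &rlm;. The single capturing group makes split()
// return text and matches alternately.
const MARK_PATTERN = /([\u200E\u200F]|&#[xX]200[EeFf];|&#820[67];|&(?:lrm|rlm);)/;

// Which actual mark character a matched piece stands for.
function markFor(piece) {
  const p = piece.toLowerCase();
  const isRlm = p === '\u200F' || p === '&#x200f;' || p === '&#8207;' || p === '&rlm;';
  return isRlm ? '\u200F' : '\u200E';
}

function showMarks(text) {
  const segments = [];
  text.split(MARK_PATTERN).forEach((piece, i) => {
    if (i % 2 === 0) { // even positions: ordinary text between marks
      if (piece) segments.push({ kind: 'text', text: piece });
      return;
    }
    const mark = markFor(piece);
    const isRaw = piece.length === 1;
    // A raw mark is shown by name; a reference is shown as written.
    const visible = isRaw ? (mark === '\u200E' ? 'LRM' : 'RLM') : piece;
    segments.push({ kind: 'help', text: mark }); // actual mark, outside
    segments.push({ kind: isRaw ? 'rawMark' : 'reference', text: visible });
    segments.push({ kind: 'help', text: mark }); // actual mark, outside
  });
  return segments;
}
```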
Warn about Controls
- LRE, RLE, LRO, RLO, PDF should not be used in HTML
- Showing them in red, with tooltip warning
- Deactivating them
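The three bullets above can be sketched in one replacement pass. This is an illustration only: the function name `warnAboutControls`, the `warn` class, and the tooltip wording are hypothetical, and the input is assumed to be already HTML-escaped text. Replacing the control character with its name is one simple way to deactivate it, since the character itself no longer reaches the renderer.

```javascript
// The five explicit embedding and override controls and their short names.
const CONTROL_NAMES = {
  '\u202A': 'LRE', // LEFT-TO-RIGHT EMBEDDING
  '\u202B': 'RLE', // RIGHT-TO-LEFT EMBEDDING
  '\u202C': 'PDF', // POP DIRECTIONAL FORMATTING
  '\u202D': 'LRO', // LEFT-TO-RIGHT OVERRIDE
  '\u202E': 'RLO', // RIGHT-TO-LEFT OVERRIDE
};

// Replace each control with a red, named placeholder carrying a tooltip.
// Class name and tooltip text are assumptions, not the prototype's.
function warnAboutControls(escapedText) {
  return escapedText.replace(/[\u202A-\u202E]/g, (c) => {
    const name = CONTROL_NAMES[c];
    return '<span class="warn" style="color: red" title="' + name +
      ' should not be used in HTML">' + name + '</span>';
  });
}
```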
HTML Embeddings and Overrides
- Made element content work as embedding/override
- Made element outside work as embedding/override
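One possible reading of these two bullets, assuming the source view reuses the CSS properties that the HTML 4 sample style sheet associates with `dir` and `<bdo>` (`unicode-bidi: embed` for an element with `dir`, `unicode-bidi: bidi-override` for `<bdo dir>`). The function names `bidiStyleFor` and `wrapForSourceView` are hypothetical, and how the prototype actually applies direction to content versus the element as a whole is not shown on the slides.

```javascript
// Map an element name and its dir attribute value to the bidi styling it
// would have in rendered HTML. Returns null when dir is absent or invalid.
function bidiStyleFor(tagName, dir) {
  const d = String(dir || '').toLowerCase();
  if (d !== 'ltr' && d !== 'rtl') return null;
  const isOverride = tagName.toLowerCase() === 'bdo';
  return { direction: d, unicodeBidi: isOverride ? 'bidi-override' : 'embed' };
}

// Wrap a piece of already-escaped source in a span with that styling, so
// the source view displays it with the same embedding or override.
function wrapForSourceView(escapedContent, style) {
  if (!style) return escapedContent;
  return '<span style="direction: ' + style.direction +
    '; unicode-bidi: ' + style.unicodeBidi + '">' + escapedContent + '</span>';
}
```

The same wrapper could be applied once to the element content and once to the whole element, matching the two bullets.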
Next Steps
- Keep selection when reformatting
- Smoothen cursor operation (automatically jump over helper marks)
- Separate out line numbers
- Offer alternatives for load and save
- Support omitted HTML end tags, dir value inheritance, IRI attribute values
- Support for XML (no <bdo>, ...)
Exploring
- Global direction: overall direction for markup and content
- 'upper-case as RTL'
- more support for LRE/RLE/.../PDF, e.g. in attributes
- More error hints
Conclusions
Please test the prototype and provide feedback!