IRIs Beyond the Napkin: A Survey of Internationalized Resource Identifier Issues and Implementation
IUC 34, Santa Clara, CA, U.S.A., 20 October 2010
Martin J. DÜRST and Addison PHILLIPS
duerst@it.aoyama.ac.jp / addison@lab126.com
Aoyama Gakuin University / Lab126
© 2010 Martin J. Dürst, Aoyama Gakuin University and Addison Phillips, Lab126
Overview
- Introduction
- IETF IRI WG
- IRI details and issues
- How you can contribute
- Conclusions
Abstract
If the Latin Alphabet is not your (or your customer's) main script, there
are many good reasons for including non-Latin characters in a Web address
(URL/URI). This presentation will tell you why, when, and how you can and
should do this, and provide the necessary background to make things work for
servers and clients.
Non-ASCII characters have been used in Web addresses for more than a decade.
Such Web addresses have been called Internationalized Resource Identifiers
(IRIs), and since 2005 have been specified in RFC 3987. Early this year, the
IETF chartered a Working Group to update RFC 3987.
The presentation will first explain the basic rules for working with IRIs,
in particular the conversion to URIs via UTF-8 and percent-encoding. To provide
a deeper understanding, we will then concentrate on the major issues that the
IRI Working Group is addressing:
- Moving from defining IRIs as a presentation element, while restricting
protocols to using URIs, to defining IRIs as protocol elements on par with
URIs.
- Balancing syntactic uniformity, for long-term simplicity, against
backward compatibility with established browser behavior, in particular
for the domain name and fragment identifier parts of an IRI.
- Moving the specification from a before-after descriptive style to a more
procedural style that covers edge cases of implementations existing
in the wild.
- Comparison, normalization, and security issues for IRIs.
- Restrictions and display advice for bidirectional IRIs.
[Text appearing in gray is commentary that does not show up in presentation
mode. The best way to view the slides as they were presented is with Opera, pressing F11.]
Speakers' Introduction
Addison:
Martin:
- IRI Internet-Draft Co-Editor
- W3C Internationalization Interest Group Chair
Slides available at: http://www.sw.it.aoyama.ac.jp/2010/pub/IUC34-iri-napkin
What's an IRI
In full: Internationalized Resource Identifier
Version of URI/URL where non-ASCII characters are allowed
URIs are often also called URLs (Uniform/Universal Resource
Locators), although strictly speaking, URLs are a subset of URIs.
Examples: http://räksmörgås.josefsson.org,
http://納豆.w3.mag.keio.ac.jp,
http://بوابة.تونس/
The first internationalized top-level domain names were finally introduced by
ICANN starting in mid-2010, after extensive agonizing, deliberation and testing.
http://ja.wikipedia.org/wiki/青山学院大学,
http://www.sw.it.aoyama.ac.jp/Dürst/,
http://www.w3.org/People/Dürst/
These just work in all modern browsers (which doesn't include IE6), but may not
always be displayed as such.
Why Internationalized?
Would you like to type ωωω.γουγλ.κομ ?
Identifiers in a well-known script are easier to:
- Remember
- Interpret
- Guess
- Create
- Transcribe
- Identify with
- Be searched (NEW!)
How IRIs Work
- IRIs are transferred as characters
- Encoding depends on substrate
E.g. Shift_JIS on disk, iso-2022-jp in email, UTF-8 in a Web page, ink on
napkin, sound waves
- When a URI is needed:
- Convert to UTF-8
- Percent-encode (used for non-ASCII bytes in URIs)
Example: 青山 (U+9752 U+5C71) → %E9%9D%92%E5%B1%B1
(use the Unicode Code Converter for help)
The extended character repertoire is essentially the only
difference between URIs and IRIs, and conversion is easy using UTF-8 and
percent-encoding. However, as in many other areas of Unicode and
internationalization, the details can be surprisingly tricky.
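The two conversion steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the full RFC 3987 mapping: the function name `iri_to_uri` is made up for this example, and it treats the host like any other component, leaving aside the question of whether the host should be converted with Punycode instead.

```python
def iri_to_uri(iri: str) -> str:
    """Percent-encode every non-ASCII character of an IRI as UTF-8 bytes.

    ASCII characters, including existing %-escapes, are left untouched.
    (Hypothetical helper for illustration; no special handling of the host.)
    """
    out = []
    for ch in iri:
        if ord(ch) < 128:
            out.append(ch)  # ASCII passes through unchanged
        else:
            # Step 1: convert to UTF-8; step 2: percent-encode each byte
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)


print(iri_to_uri("青山"))  # %E9%9D%92%E5%B1%B1
print(iri_to_uri("http://ja.wikipedia.org/wiki/青山学院大学"))
```

Because only non-ASCII characters are touched, applying the function to a string that is already a URI returns it unchanged.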
IETF IRI WG
IETF: Internet Engineering Task Force
WG: Working Group
BOF November 2009
Chartered early 2010
Was supposed to be done by June 2010 (charter)
Serious work started recently
Documents being Updated
RFC 3987 - Internationalized Resource Identifiers (IRIs)
Latest Draft: draft-ietf-iri-3987bis-02
RFC 4395 - Guidelines and Registration Procedures for New URI/IRI Schemes
Latest Draft: draft-ietf-iri-4395bis-irireg-00
There is a chance that we will create additional documents when
we split up some work (list of WG documents).
Registration Guidelines
Clarified or new:
- URI schemes are IRI schemes are URI schemes
→ only one IANA Scheme Registry
- Syntax description can be on URI or IRI level
(Syntax can be expressed either in Unicode characters or in ASCII
characters (including %-encoding). This needs some additional careful
work.)
- Register example: scheme
Main Issues for IRIs
- Conversion using decomposition into pieces
- Where and when to Punycode
- Query part
- Bidirectionality
- Normalization
- Legacy issues
- Scheme names
Decomposition of an IRI
The term resource identifier is used generically, to
denote URIs and IRIs.
scheme://userinfo@host:port/path?query#fragment
The most important parts:
scheme
: E.g. http, ftp, mailto,...
host
: reg-name (domain-name) or IP address
path
: Similar to directory hierarchy
query
: Additional data for form submission,...
fragment
: Location or part inside a document
JavaScript also has access to authority
(userinfo/host/port together)
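As a rough illustration of this decomposition, Python's standard `urllib.parse.urlsplit` splits an identifier into the same pieces and accepts non-ASCII characters in the path and query. The address used here is invented for the example.

```python
from urllib.parse import urlsplit

# Hypothetical IRI, made up to exercise every component.
parts = urlsplit("http://user@www.example.org:8080/Dürst/?q=東京#top")

print(parts.scheme)    # scheme
print(parts.netloc)    # authority: userinfo, host and port together
print(parts.username)  # userinfo
print(parts.hostname)  # host
print(parts.port)      # port, as an integer
print(parts.path)      # path
print(parts.query)     # query
print(parts.fragment)  # fragment
```

Splitting first and converting each piece separately is what makes it possible to treat the host differently from the path or the query.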
To Punycode or not to Punycode
- Internationalized Domain Names (IDNs) use Punycode
Example: بوابة.تونس → xn--mgbbbe6m.xn--pgbs0dh
- IRIs and URIs can use %-encoding
Example: تونس → %D8%AA%D9%88%D9%86%D8%B3
- Choice A: Use Punycode
- Choice B: Use %-encoding
- Not yet widely implemented
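The two choices can be compared side by side with the Python standard library. Note the assumption: Python's built-in `idna` codec implements the older IDNA 2003 rules, which is enough for this example but is not a statement about what the Working Group will specify.

```python
from urllib.parse import quote

host = "بوابة.تونس"

# Choice A: Punycode, applied label by label by the built-in codec
# (IDNA 2003 rules; an assumption of this sketch).
punycode_host = host.encode("idna").decode("ascii")

# Choice B: UTF-8 plus %-encoding, as for any other IRI component.
percent_host = quote(host)

print(punycode_host)  # xn--mgbbbe6m.xn--pgbs0dh
print(percent_host)
```

Both results are pure ASCII, but only the Punycode form can be handed to the DNS as-is; a %-encoded host has to be decoded and converted by whatever component finally resolves the name.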
Query Part
- Example: http://www.google.com/search?q=東京
- HTML Form submission uses document encoding
- IRIs are supposed to use UTF-8, but browsers use document encoding
- Conflicts between big sites and small sites
- Limit by scheme (maybe http: only)
Many schemes don't use query parts, and some definitely don't want document
encoding (e.g. mailto:)
- Display query part with %-encoding
This is used by some browsers, and can also be used when copying IRIs to
the clipboard. This transfers bytes, not characters, and therefore keeps
the encoding.
- Use UTF-8 for Web pages
This makes the distinction between document encoding and UTF-8
irrelevant.
- Use side channel
E.g. query parameter indicating encoding or attribute on <form>
element indicating query submission should use UTF-8.
Details need to be worked out.
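The conflict between UTF-8 and the document encoding is easy to reproduce. The sketch below percent-encodes the search term from the example above twice, once as UTF-8 and once as Shift_JIS (chosen here only as a plausible legacy page encoding).

```python
from urllib.parse import quote, unquote

term = "東京"

as_utf8 = quote(term, encoding="utf-8")      # what an IRI is supposed to produce
as_sjis = quote(term, encoding="shift_jis")  # what a Shift_JIS form submits

print(as_utf8)  # %E6%9D%B1%E4%BA%AC
print(as_sjis)

# A server that assumes UTF-8 cannot recover the term from the Shift_JIS bytes.
print(unquote(as_sjis, encoding="utf-8") == term)      # False
print(unquote(as_sjis, encoding="shift_jis") == term)  # True
```

The query string carries bytes only, so the receiving side has to know or guess the encoding, which is why a side channel or UTF-8 everywhere is attractive.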
Bidirectionality
- Restrictions on character combinations
- Conventions for display
- Obvious updates:
- Allow combining marks at end of RTL component
(needed for Dhivehi, Yiddish...)
- Allow numbers at end of RTL component
- Adjust rules to IDNA 2008
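To make the first two updates concrete, here is a small sketch that checks how a right-to-left component ends, using the bidirectional classes from Python's `unicodedata` module. The function names and the exact relaxed rule are assumptions for illustration only; they are not the wording of RFC 3987 or of the current draft, and the check looks only at the end of a component that is assumed to contain right-to-left characters.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}  # right-to-left letters (Hebrew, Arabic, Thaana, ...)


def ends_rtl_strict(component: str) -> bool:
    """Strict reading: the last character itself must be an RTL letter."""
    return unicodedata.bidirectional(component[-1]) in RTL_CLASSES


def ends_rtl_relaxed(component: str) -> bool:
    """Relaxed reading (hypothetical): skip trailing combining marks (NSM),
    then accept an RTL letter or a number (EN, AN)."""
    i = len(component) - 1
    while i > 0 and unicodedata.bidirectional(component[i]) == "NSM":
        i -= 1
    return unicodedata.bidirectional(component[i]) in RTL_CLASSES | {"EN", "AN"}


dhivehi = "\u078B\u07A8\u0788\u07AC\u0780\u07A8"  # ends in a Thaana vowel sign
hebrew_digit = "\u05D0\u05D1\u05D21"              # Hebrew letters followed by "1"

print(ends_rtl_strict(dhivehi), ends_rtl_relaxed(dhivehi))
print(ends_rtl_strict(hebrew_digit), ends_rtl_relaxed(hebrew_digit))
```

Under the strict check, ordinary Dhivehi words fail because they end in a vowel sign, which is the motivation for allowing combining marks at the end.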
Conventions for Bidi Display
- Same display anywhere, or
- Optimal display when we know it is an IRI?
- Overall LTR or RTL or depending on some component
- Reordering by component or by directionality run
Normalization
- Normalization important on different levels
- Upper/lower case for domain names,...
(usually works or gets fixed pretty quickly)
- Full/half width for Japanese,...
(difference is at least visible to users)
- Precomposed or decomposed for Vietnamese,...
- RFC 3987 required normalizing transcoders, but assumed NFC for Unicode data
- In practice, not implemented
- Moving from normative language to advice
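The precomposed/decomposed case can be shown with `unicodedata.normalize`. The sample word is chosen only for illustration; the point is that two strings that look identical on screen produce different URIs unless they are normalized before percent-encoding.

```python
import unicodedata
from urllib.parse import quote

word = "Việt"

nfc = unicodedata.normalize("NFC", word)  # precomposed: 4 code points
nfd = unicodedata.normalize("NFD", word)  # decomposed: 6 code points

print(len(nfc), len(nfd))
print(quote(nfc))
print(quote(nfd))
print(quote(nfc) == quote(nfd))  # False: same appearance, different URIs

# Full-width Latin letters fold to ASCII only under a compatibility form.
print(unicodedata.normalize("NFKC", "ＡＢＣ"))  # ABC
```

Normalizing to NFC before conversion makes the two spellings compare equal, which is what RFC 3987 assumed for Unicode data.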
Browser Quirks and Other Legacy
- Browsers try to be permissive (///, %xy,...)
- Make permissive implementations interoperate
- Distinguish from "Valid to send"
- XML used very old definition of IRI
- Included spaces, control characters,...
- Defined LEIRI (Legacy Extended IRI)
Scheme Names
Warning: This is not part of current work!
- Switching scripts when typing is annoying
- If we want internationalization, do it all the way
(But punctuation such as :, /, ?, #,... would still be left)
- Technology is available, but not implemented (e.g. NAPTR)
- Scripts, not languages (in what language is http?)
- Limit to widely used schemes (http, ftp,...)?
- Limit to RTL scripts only?
How You Can Contribute
- Join the Working Group by subscribing to the WG mailing list
- Read the mailing list (archives) and the drafts
(the mailing list is hosted by W3C, but otherwise the W3C is not officially
involved)
- Contribute to the discussion, attend teleconferences
- Help with editing
- Work on implementations and tests
- Attend meetings (next: IETF 79 - Beijing, China, November 7-12)
(you can also attend it via Jabber/Audio Stream)
Conclusions
- Basics work in browsers
- Update your scripts and tools
- Contribute to ongoing work
Q & A
Further Material
Lots of links everywhere throughout the talk, please use them!
Some older material (more background information):
- Update on Internationalized Domain Names and Internationalized Resource Identifiers, Martin J. Dürst, Internationalization and Unicode Conference 33, San Jose, CA, USA, October 2009.
- IRIs and IDNs: Testing, Implementation, and Specification Evolvement, Martin J. Dürst, 31st Internationalization and Unicode Conference, San Jose, CA, USA, October 2007.
- Internationalized Resource Identifiers, Internationalization & Unicode Conference 26, San Jose, CA, USA, Sept. 2004.
- Recent Progress on Internationalized Resource Identifiers (IRIs), Martin J. Dürst, 25th Internationalization and Unicode Conference, Washington DC, USA, April 2004.
- Internationalized Resource Identifiers (IRIs) - Server-side Implementation, Martin J. Dürst, 24th Internationalization & Unicode Conference, Atlanta, GA, USA, Sept. 2003.
- Internationalized Resource Identifiers: From Specification to Testing, Martin J. Dürst, 19th International Unicode Conference, San Jose, CA, USA, Sept. 2001.
- Internationalizing Internet Identifiers, Martin J. Dürst, 11th International Unicode Conference, San Jose, CA, USA, Sept. 1997.