Aoyama Gakuin University
© 2009 Martin J. Dürst, Aoyama Gakuin University
In domain names such as www.unicode.org, only a limited number of characters
are allowed. This limitation also applies to Uniform Resource Identifiers
(URIs) such as http://www.unicode.org.
Internationalized Domain Names (IDNs) and Internationalized Resource
Identifiers (IRIs) changed this a few years ago, both allowing a wide range of
characters from the Unicode repertoire. The specifications underlying these
technologies are currently facing an overhaul, major for IDNs and minor for
IRIs. The long-overdue and now imminent introduction of the first international
top-level domain names will mean that the importance of IDNs and IRIs will
significantly increase in the near future.
The presentation will give a general overview of IDNs and IRIs and discuss the current revisions of the specifications in detail. For IDNs, the set of allowed characters is now defined using an inclusion-based model rather than the earlier exclusion-based model. Fixed tables are replaced by a property-based selection process, so that the specification is no longer tied to a single version of Unicode. The mapping step (dealing with casing and normalization, among other things) is moved out of the core libraries and closer to the user, to allow adaptations for special cases and to reduce user surprises. The IRI specification is being extended with descriptions of widely used variants for handling characters that are, strictly speaking, not allowed in IRIs. Both specifications are affected by bug fixes to the bidirectionality restrictions.
Talk assumes that you know
Slides available at http://www.sw.it.aoyama.ac.jp/2009/pub/IUC33-idn-iri/
Some older material (more background):
The term resource identifier is used generically, to denote URIs and IRIs.
A simple example: http://www.sw.it.aoyama.ac.jp/visitors.html
The general pattern:
scheme://userinfo@host:port/path?query#fragment
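As a rough illustration of the general pattern, the sketch below splits a resource identifier into its components with Python's standard urllib.parse module. The identifier itself (user, www.example.org, port 8080, and so on) is a made-up example, not taken from the slides.

```python
from urllib.parse import urlsplit

# Hypothetical identifier that exercises every component of the pattern
# scheme://userinfo@host:port/path?query#fragment
parts = urlsplit("http://user@www.example.org:8080/dir/page.html?key=value#section")

print("scheme:  ", parts.scheme)
print("userinfo:", parts.username)
print("host:    ", parts.hostname)
print("port:    ", parts.port)
print("path:    ", parts.path)
print("query:   ", parts.query)
print("fragment:", parts.fragment)
```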
The most important parts:
scheme: e.g. http, ftp, mailto, ...
host: reg-name (domain name) or IP address
path: similar to a directory hierarchy
query: additional data for form submission, ...
fragment: location or part inside a document
IDN:
http://räksmörgås.josefsson.org
IRI:
http://ja.wikipedia.org/wiki/青山学院大学
http://www.sw.it.aoyama.ac.jp/Dürst/
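The examples above can be turned into plain ASCII identifiers roughly as follows. This is a minimal sketch, not a complete IRI-to-URI conversion: iri_to_uri is a hypothetical helper, it assumes the IRI has a host and no userinfo, and it relies on Python's built-in "idna" codec, which follows IDNA 2003 rather than the revised rules.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def iri_to_uri(iri: str) -> str:
    """Hypothetical helper: convert a simple IRI to an ASCII-only URI."""
    parts = urlsplit(iri)
    # Host: convert each label to its ASCII form (IDNA 2003 via the stdlib codec).
    host = parts.hostname.encode("idna").decode("ascii")
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    # Other components: %-escape the UTF-8 bytes of non-ASCII characters.
    # "%" is kept safe so that existing escapes are not escaped twice.
    return urlunsplit((
        parts.scheme,
        netloc,
        quote(parts.path, safe="/%"),
        quote(parts.query, safe="=&%"),
        quote(parts.fragment, safe="%"),
    ))

for iri in (
    "http://räksmörgås.josefsson.org",
    "http://ja.wikipedia.org/wiki/青山学院大学",
    "http://www.sw.it.aoyama.ac.jp/Dürst/",
):
    print(iri_to_uri(iri))
```

The host and the rest of the identifier are handled differently on purpose: domain labels use the ASCII-compatible "xn--" form, while path, query, and fragment use %-escaping.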
Would you like to type ωωω.γουγλ.κομ?
Identifiers in a well-known script are easier to:
%E9%9D%92%E5%B1%B1
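The escaped string above is simply the UTF-8 byte sequence of 青山 written with %-escapes. A short sketch with Python's standard library shows the round trip:

```python
from urllib.parse import quote, unquote

escaped = quote("青山")      # %-escape the UTF-8 bytes of each character
print(escaped)               # %E9%9D%92%E5%B1%B1
print(unquote(escaped))      # 青山
```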
In IETF Last Call until yesterday:
For Proposed Standard:
For Informational:
Aspect | IDNA 2003 | IDNA 2008 |
Unicode coverage | Unicode 3.2 | Unicode 5.2 and beyond |
Registration vs. lookup | same rules apply | registration is stricter |
Symbols and punctuation | mostly allowed | mostly prohibited |
Mapping and normalization | required, predefined | not needed, may vary |
Coverage definition | by table | by rules + exceptions |
Context-specific characters | no | yes (ZWJ/ZWNJ) |
Combining marks at end of RTL label | no | yes (needed for Dhivehi, Yiddish, ...) |
Numbers at end of RTL label | no | yes |
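To make the "by rules" and "mapping" rows more concrete, here is a heavily simplified sketch of a property-based check. It is not the actual IDNA 2008 derivation: the category set, roughly_allowed, label_ok, and map_locally are illustrative assumptions, and the real rules add contextual checks and an exceptions list.

```python
import unicodedata

# Assumption: a simplified stand-in for the real rules, accepting only
# lowercase/other letters, combining marks, and decimal digits.
ALLOWED_CATEGORIES = {"Ll", "Lo", "Lm", "Mn", "Mc", "Nd"}

def roughly_allowed(ch: str) -> bool:
    """Decide from Unicode properties, not from a fixed table."""
    if ch == "-":
        return True
    # Reject characters that change under case folding plus NFKC,
    # e.g. uppercase letters and compatibility characters.
    if unicodedata.normalize("NFKC", ch.casefold()) != ch:
        return False
    return unicodedata.category(ch) in ALLOWED_CATEGORIES

def label_ok(label: str) -> bool:
    return all(roughly_allowed(ch) for ch in label)

def map_locally(user_input: str) -> str:
    """Hypothetical user-side mapping step: lowercase, then normalize."""
    return unicodedata.normalize("NFC", user_input.lower())

print(label_ok("Dürst"))               # uppercase D is rejected by the check
print(label_ok(map_locally("Dürst")))  # accepted once mapped near the user
print(label_ok("i♥ny"))                # symbols are rejected
print(label_ok("青山"))                # letters from any script pass
```

Because the decision comes from character properties, the same code covers characters added in later Unicode versions. Note that this sketch rejects ß (it case-folds to "ss"), which is the kind of case the exceptions in the real rules exist for.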
Done up to now:
BOF scheduled for IETF 76 in Hiroshima in preparation for an IETF WG
Ideally, everything uses UTF-8-based %-escaping
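One way to see why a single encoding matters: the same character yields different %-escapes under different encodings, and a receiver that assumes UTF-8 cannot recover text escaped from a legacy encoding. The sketch below uses Latin-1 purely as an example of such a legacy encoding.

```python
from urllib.parse import quote, unquote

utf8_form = quote("Dürst")                          # escapes the UTF-8 bytes
latin1_form = quote("Dürst", encoding="latin-1")    # escapes the Latin-1 byte

print(utf8_form)
print(latin1_form)

# Decoding works only when the receiver guesses the right encoding.
print(unquote(utf8_form))                           # UTF-8 is the default
print(unquote(latin1_form, encoding="latin-1"))
print(unquote(latin1_form) == "Dürst")              # False: wrong guess
```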