Update on Internationalized Domain Names and
Internationalized Resource Identifiers

IUC 33, San Jose, CA, U.S.A., October 2009

Martin J. DÜRST

duerst@it.aoyama.ac.jp

Aoyama Gakuin University


© 2009 Martin J. Dürst, Aoyama Gakuin University

Overview

Abstract

In domain names such as www.unicode.org, only a limited set of characters is allowed. This limitation also applies to Uniform Resource Identifiers (URIs) such as http://www.unicode.org. Internationalized Domain Names (IDNs) and Internationalized Resource Identifiers (IRIs) changed this a few years ago; both allow a wide range of characters from the Unicode repertoire. The specifications underlying these technologies are currently facing an overhaul, major for IDNs and minor for IRIs. The long-overdue and now imminent introduction of the first internationalized top-level domain names means that the importance of IDNs and IRIs will increase significantly in the near future.

The presentation will give a general overview of IDNs and IRIs and discuss the current revisions of the specifications in detail. For IDNs, the set of allowed characters is defined using an inclusion-based model rather than the earlier exclusion-based model. Fixed tables are replaced by a property-based selection process to avoid tying the specification to a single version of Unicode. The mapping step (dealing with casing and normalization, among other things) is moved out of the core libraries and closer to the user to allow adaptations for special cases and to reduce user surprises. The IRI specification is being extended with descriptions of widely used variants for handling characters that, strictly speaking, are not allowed in IRIs. Both specifications are affected by bug fixes to bidirectionality restrictions.
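To make the inclusion-based, property-based idea concrete, here is a minimal Python sketch. It is not the actual IDNA 2008 derivation: the category set below is an assumption modelled loosely on a letters-digits-marks rule, the function name `roughly_allowed` is hypothetical, and the real rules use further properties plus an exception list.

```python
import unicodedata

# Assumption: a crude stand-in for the real selection rules. Only characters
# whose General_Category is in this set are included; everything else
# (symbols, punctuation, spaces, ...) is excluded by default.
INCLUDED_CATEGORIES = {"Ll", "Lu", "Lo", "Lm", "Nd", "Mn", "Mc"}


def roughly_allowed(ch: str) -> bool:
    """Return True if a single character passes the rough inclusion test."""
    if ch == "-":
        return True  # the hyphen is allowed in traditional host name labels
    return unicodedata.category(ch) in INCLUDED_CATEGORIES


def rough_label_check(label: str) -> bool:
    """Return True if every character of a non-empty label passes the test."""
    return bool(label) and all(roughly_allowed(ch) for ch in label)


# The answer depends on the Unicode data shipped with the interpreter,
# not on a table frozen into this code.
unicode_version_in_use = unicodedata.unidata_version
```

Because the decision is derived from character properties, a newer Unicode database changes the result for newly assigned characters without any change to the rule itself.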

Assumptions

Talk assumes that you know

Slides available at http://www.sw.it.aoyama.ac.jp/2009/pub/IUC33-idn-iri/

Some older material (more background):

Some Terms to Start With

URI: Uniform/Universal Resource Identifier (US-ASCII only)
IRI: Internationalized Resource Identifier (Unicode)
domain name: a name in the DNS
DNS: Domain Name System
IDN: Internationalized Domain Name

Anatomy of a Resource Identifier

The term resource identifier is used generically, to denote URIs and IRIs.

A simple example: http://www.sw.it.aoyama.ac.jp/visitors.html

The general pattern:

scheme://userinfo@host:port/path?query#fragment
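The pattern above can be explored with Python's standard `urllib.parse` module. This is only an illustrative sketch; the second identifier is a made-up example (host `example.org`, user `user`, port 8080) chosen so that every component is present.

```python
from urllib.parse import urlsplit

# The simple example from the slide: only scheme, host, and path are present.
simple = urlsplit("http://www.sw.it.aoyama.ac.jp/visitors.html")

# Hypothetical identifier that exercises every part of the general pattern.
full = urlsplit("http://user@example.org:8080/path?query#fragment")

components = {
    "scheme": full.scheme,
    "userinfo": full.username,
    "host": full.hostname,
    "port": full.port,
    "path": full.path,
    "query": full.query,
    "fragment": full.fragment,
}

for name, value in components.items():
    print(f"{name:9} {value}")
```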

The most important parts:

IRI and IDN Examples

IDN:

http://räksmörgås.josefsson.org

http://納豆.w3.mag.keio.ac.jp

IRI:

http://ja.wikipedia.org/wiki/青山学院大学

http://www.sw.it.aoyama.ac.jp/Dürst/
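As a rough illustration of what happens to such identifiers on the wire, the following Python sketch converts an IDN to its ASCII form and an IRI path to its URI form. It relies on Python's built-in `idna` codec, which implements the older IDNA 2003 rules, so results for some labels may differ under IDNA 2008. The helper names `host_to_ascii` and `path_to_uri` are hypothetical.

```python
from urllib.parse import quote


def host_to_ascii(host: str) -> str:
    """Convert an internationalized host name to its ASCII ("xn--") form.

    Uses Python's built-in codec, which follows IDNA 2003.
    """
    return host.encode("idna").decode("ascii")


def path_to_uri(path: str) -> str:
    """Percent-escape the UTF-8 bytes of every non-ASCII character in a path."""
    return quote(path, safe="/")


ascii_host = host_to_ascii("räksmörgås.josefsson.org")
uri_path = path_to_uri("/Dürst/")

print(ascii_host)
print(uri_path)
```

Each non-ASCII label of the host receives an `xn--` prefix, while the path keeps its structure and only the non-ASCII characters are replaced by `%XX` sequences.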

Why Internationalize Identifiers?

Would you like to type ωωω.γουγλ.κομ?

Identifiers in a well-known script are easier to:

IRI Implementation

IDN Implementation

Everything is Fine?

IDNA Update: IDNA 2008

In IETF Last Call until yesterday:

For Proposed Standard:

For Informational:

Changes from IDNA 2003 to IDNA 2008

                               IDNA 2003             IDNA 2008
Unicode coverage               Unicode 3.2           Unicode 5.2 and beyond
Registration vs. lookup        same rules apply      registration is stricter
Symbols and punctuation        mostly allowed        mostly prohibited
Mapping and normalization      required, predefined  not needed, may vary
Coverage definition            by table              by rules + exceptions
Context-specific characters    no                    yes (ZWJ/ZWNJ)
Combining marks at end of RTL  no                    yes (needed for Dhivehi, Yiddish...)
Numbers at end of RTL          no                    yes
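The "mapping and normalization" row can be illustrated with a small Python sketch. It only approximates the kind of predefined mapping that IDNA 2003 required (case folding followed by NFKC normalization); the real step also removes or prohibits certain characters. The function name `map_label` is hypothetical.

```python
import unicodedata


def map_label(label: str) -> str:
    """Approximate a predefined mapping step: case folding, then NFKC.

    Assumption: a simplification for illustration, not a complete
    implementation of any IDNA mapping profile.
    """
    return unicodedata.normalize("NFKC", label.casefold())


for raw in ("RÄKSMÖRGÅS", "Straße", "\uff25xample"):  # last one starts with a fullwidth E
    print(repr(raw), "->", repr(map_label(raw)))
```

A fixed mapping like this turns "Straße" into "strasse" whether or not the user wants that. Moving the step out of the protocol, as the IDNA 2008 column indicates, lets software near the user choose a mapping suited to its context.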

Why IDNA 2008

Problems with IDNA 2008

Top-Level IDNs

IRI Update

draft-duerst-iri-bis

Done up to now:

Remains to be done:

BOF scheduled for IETF 76 in Hiroshima in preparation for an IETF WG

Internationalizing IRI Components

Ideally, everything uses UTF-8-based %-escaping
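A short Python sketch shows why a single encoding for %-escaping matters. It escapes the same path segment once via UTF-8 and once via a legacy encoding (Latin-1 is used here purely as an example), then decodes both as a receiver that assumes UTF-8 would.

```python
from urllib.parse import quote, unquote

segment = "Dürst"

# UTF-8-based escaping (the default for quote()).
utf8_escaped = quote(segment)

# Legacy-encoding-based escaping, shown for contrast only.
latin1_escaped = quote(segment, encoding="latin-1")

# A receiver that assumes UTF-8 recovers the first form but not the second.
decoded_utf8 = unquote(utf8_escaped)
decoded_legacy = unquote(latin1_escaped)

print(utf8_escaped, "->", decoded_utf8)
print(latin1_escaped, "->", decoded_legacy)
```

Nothing in an escaped identifier says which encoding produced the `%XX` sequences, so the original characters can only be recovered reliably when both sides agree on UTF-8.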

Email Address Internationalization

Conclusions

Q & A