Abstract

On the Internet, exchanging content (documents, web pages, e-mail messages and the like) in the world's languages and scripts can be taken for granted. On the other hand, identifying content and people using anything but a very limited set of Latin characters is still difficult if not outright impossible. For URIs (Uniform Resource Identifiers), this is being changed by IRIs (Internationalized Resource Identifiers). For electronic mail addresses, this is being changed by EAI (Email Address Internationalization). IDNs (Internationalized Domain Names) are used in both cases. This paper will give an update on the latest developments in the area of internationalized identifiers. This update will discuss motivations and limitations, basic approaches and methods, the current state of implementation and standardization, and unsolved problems.

For all kinds of identifiers that extend beyond unaccented Latin characters, Unicode is indispensable, but different kinds of identifiers take different approaches to using Unicode. For processing in the DNS proper, IDNs use Punycode, a compact but cryptic and therefore often despised "ASCII-compatible" encoding. IRIs use UTF-8, and %-encoding when downgrading to URIs. EAI is the most straightforward: in a bold move that would have been unthinkable a few years ago, it extends the underlying e-mail infrastructure to "just use UTF-8".
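To make the three approaches concrete, here is a minimal sketch in Python showing how one non-ASCII label looks under each of them. The label "bücher" is an illustrative assumption, not taken from the talk, and Python's built-in "idna" codec implements the older IDNA 2003 rules, so it only approximates what current IDN software does.

```python
from urllib.parse import quote

# Hypothetical example label; any non-ASCII string would do.
label = "bücher"

# IDN: Punycode-based "ASCII-compatible" encoding, as used in the DNS proper.
# Note: the stdlib "idna" codec follows IDNA 2003, not the later IDNA 2008.
ace = label.encode("idna").decode("ascii")

# IRI downgraded to URI: encode as UTF-8, then %-encode each non-ASCII byte.
pct = quote(label, safe="")

# EAI: "just use UTF-8", i.e. the raw UTF-8 bytes travel as they are.
utf8 = label.encode("utf-8")

print(ace)   # xn--bcher-kva
print(pct)   # b%C3%BCcher
print(utf8)  # b'b\xc3\xbccher'
```

The same character ends up in three different shapes, which is one reason why software handling several kinds of identifiers has to keep track of which form it holds.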

When this abstract was written, the most serious unsolved problem relating to internationalized identifiers was bidirectional identifiers, i.e. identifiers that include characters from right-to-left scripts such as Arabic and Hebrew. We will report on progress in this area, even if we don't expect the problem to be completely solved by the time of the conference.

This presentation is of interest to anybody working on or using e-mail addresses or web addresses in an international setting. It is based on the hands-on implementation experience of the authors.

[Text appearing in gray consists of comments that do not show up in presentation mode. The best way to view the slides as they were presented is with Opera, pressing F11.]

About the Slides

Protocol/Format I18N

Why Identifier I18N?

Identifier I18N Techniques

Identifier I18N Caveats

IRIs and EAI

Email Addresses

Email Protocol/Format

ASCII Email Format

non-ASCII Email Format

Email Format Problems

EAI: Clean Approach

EAI Email Format

Why This Change?

In the Making since 2006

Compatibility

Deployment

What Can You Do?

Interlude: RFC I18N

Hacking XML2RFC

Versions

What's an IRI

URIs are often also called URLs (Uniform/Universal Resource Locators), although strictly speaking, URLs are a subset of URIs.

How IRIs Work

The extended character repertoire is essentially the only difference between URIs and IRIs, and conversion is easy using UTF-8 and percent-encoding. However, as in many other areas of Unicode and internationalization, the details can be surprisingly tricky.
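The basic conversion can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the full procedure from the IRI specification: the function name `iri_to_uri` is hypothetical, the host is mapped with the standard library's IDNA 2003 codec (whether the host should become Punycode at all is a choice this sketch simply makes), userinfo is ignored, and existing %-escapes are left untouched.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def iri_to_uri(iri: str) -> str:
    """Convert an IRI to a URI (simplified sketch, see assumptions above)."""
    parts = urlsplit(iri)

    # Host: assumption of this sketch is to use the Punycode-based form.
    host = parts.hostname or ""
    netloc = host.encode("idna").decode("ascii") if host else ""
    if parts.port is not None:
        netloc += f":{parts.port}"

    # Other components: UTF-8 plus percent-encoding of anything not allowed
    # in a URI. "%" is kept safe so existing escapes are not double-encoded.
    path = quote(parts.path, safe="/:@!$&'()*+,;=%")
    query = quote(parts.query, safe="/?:@!$&'()*+,;=%")
    fragment = quote(parts.fragment, safe="/?:@!$&'()*+,;=%")

    return urlunsplit((parts.scheme, netloc, path, query, fragment))

example = iri_to_uri("http://bücher.example/résumé?q=ü#é")
print(example)
# http://xn--bcher-kva.example/r%C3%A9sum%C3%A9?q=%C3%BC#%C3%A9
```

An all-ASCII URI passes through unchanged. The tricky details lie in everything this sketch leaves out, such as normalization, legacy encodings in the query part, and bidirectional text.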

IETF IRI WG

Documents being Updated

Registration Guidelines

Main Issues for IRIs

Decomposition of an IRI

To Punycode or not to Punycode

Query Part

Bidirectionality

Conventions for Bidi Display

Checking BIDI Classes

Separator Jumping

Jumping in IRIs

Normalization

Browser Quirks and Other Legacy

Scheme Names

How You Can Contribute

Conclusions

Q & A

Further Material

There are lots of links throughout the talk; please use them!

Some older material (more background information):

IRIs Beyond the Napkin: A Survey of Internationalized Resource Identifier Issues and Implementation, Martin J. Dürst and Addison Phillips, Internationalization and Unicode Conference 34, Santa Clara, CA, USA, October 2010.
Update on Internationalized Domain Names and Internationalized Resource Identifiers, Martin J. Dürst, Internationalization and Unicode Conference 33, San Jose, CA, USA, October 2009.
IRIs and IDNs: Testing, Implementation, and Specification Evolvement, Martin J. Dürst, 31st Internationalization and Unicode Conference, San Jose, CA, USA, October 2007.
Internationalized Resource Identifiers, Internationalization & Unicode Conference 26, San Jose, CA, USA, Sept. 2004.
Recent Progress on Internationalized Resource Identifiers (IRIs), Martin J. Dürst, 25th Internationalization and Unicode Conference, Washington DC, USA, April 2004.
Internationalized Resource Identifiers (IRIs) - Server-side Implementation, Martin J. Dürst, 24th Internationalization & Unicode Conference, Atlanta, GA, USA, Sept. 2003.
Internationalized Resource Identifiers: From Specification to Testing, Martin J. Dürst, 19th International Unicode Conference, San Jose, CA, USA, Sept. 2001.
Internationalizing Internet Identifiers, Martin J. Dürst, 11th International Unicode Conference, San Jose, CA, USA, Sept. 1997.

EAI Document	Experimental	Standards Track
Framework	RFC 4952, 2007	RFC 6530, 2011
SMTP Extension	RFC 5336, 2008	RFC 6531, 2011
Message Format	RFC 5335, 2008	RFC 6532, 2011
POP3 Extension	RFC 5721, 2009	draft-ietf-eai-rfc5721bis
IMAP Extension	RFC 5738, 2009	draft-ietf-eai-5738bis

Update on Internationalized Resource Identifiers (IRI) and Email Address Internationalization (EAI)

IUC 36, Santa Clara, CA, U.S.A., 24 October 2012

Martin J. DÜRST with Shunsuke OSHIMA and Ayumu NOJIMA

Overview