Update on Internationalized Resource Identifiers (IRI)
and Email Address Internationalization (EAI)

IUC 36, Santa Clara, CA, U.S.A., 24 October 2012

Martin J. DÜRST
with Shunsuke OSHIMA and Ayumu NOJIMA

duerst@it.aoyama.ac.jp

Aoyama Gakuin University

IETF

© 2012 Martin J. Dürst, Aoyama Gakuin University

Overview

Abstract

On the Internet, exchanging content (documents, web pages, e-mail messages and the like) in the world's languages and scripts can be taken for granted. On the other hand, identifying content and people using anything but a very limited set of Latin characters is still difficult if not outright impossible. For URIs (Uniform Resource Identifiers), this is being changed by IRIs (Internationalized Resource Identifiers). For electronic mail addresses, this is being changed by EAI (Email Address Internationalization). IDNs (Internationalized Domain Names) are used in both cases. This paper will give an update on the latest developments in the area of internationalized identifiers. This update will discuss motivations and limitations, basic approaches and methods, the current state of implementation and standardization, and unsolved problems.

For all kinds of identifiers that extend beyond non-accented Latin characters, Unicode is indispensable, but different kinds of identifiers take different approaches to using Unicode. For processing in the DNS proper, IDNs use Punycode, a compact but cryptic and therefore often despised "ASCII-compatible" encoding. IRIs use UTF-8, with %-encoding when downgrading to URIs. EAI is the most straightforward: In a bold move that would have been unthinkable a few years ago, it extends the underlying e-mail infrastructure to "just use UTF-8".

When this abstract was written, the most serious unsolved problem relating to internationalized identifiers was bidirectional identifiers, i.e. identifiers including characters written right-to-left such as Arabic and Hebrew. We will report on progress in this area, even if we don't expect it to be completely solved by the time of the conference.

This presentation is of interest to anybody working on or using e-mail addresses or web addresses in an international setting. It is based on the hands-on implementation experience of the authors.

[Text appearing in gray is commentary that does not show up in presentation mode. The best way to view the slides as they were presented is with Opera, pressing F11.]

 

About the Slides

Slides available at: http://www.sw.it.aoyama.ac.jp/2012/pub/IRI-EAI

Protocol/Format I18N

 

Why Identifier I18N?

Would you like to type ωωω.γουγλ.κομ ?

Identifiers in a well-known script are easier to:

 

Identifier I18N Techniques

Identifier I18N Caveats

IRIs and EAI

Email Addresses

 

Email Protocol/Format

 

ASCII Email Format

From: duerst@it.aoyama.ac.jp
To: attendee@example.org
Subject: IUC in Santa Clara
Date: Wed, 24 Oct 2012 14:05:05 -0700

Dear Attendee,

I hope you enjoy IUC in Santa Clara.

Regards,    Martin.

 

non-ASCII Email Format

From: Martin =?iso-8859-1?Q?D=FCrst?= <duerst@it.aoyama.ac.jp>
To: A. Attendee <attendee@example.org>
Subject: IUC in Santa Clara
Date: Wed, 24 Oct 2012 14:05:15 -0700
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: quoted-printable

Dear Attendee,

I hope you enjoy IUC in Santa Clara.

Regards,    Martin D=FCrst.
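
The message above relies on two MIME mechanisms to carry "ü" through 7-bit mail: an encoded-word in the From header and quoted-printable in the body. The following is a minimal sketch of how both can be decoded with the Python standard library; the helper names `decode_encoded_word` and `decode_qp_body` are illustrative and not part of the talk.

```python
import quopri
from email.header import Header, decode_header


def decode_encoded_word(word: str) -> str:
    """Decode a single header encoded-word such as =?iso-8859-1?Q?D=FCrst?=."""
    raw, charset = decode_header(word)[0]
    if isinstance(raw, bytes):
        # Encoded-words come back as bytes plus the charset they declared.
        return raw.decode(charset or "ascii")
    return raw  # plain ASCII text is returned unchanged


def decode_qp_body(body: bytes, charset: str) -> str:
    """Undo quoted-printable, then interpret the bytes in the given charset."""
    return quopri.decodestring(body).decode(charset)


surname = decode_encoded_word("=?iso-8859-1?Q?D=FCrst?=")
signature = decode_qp_body(b"Regards,    Martin D=FCrst.", "iso-8859-1")

# The reverse direction: wrap a non-ASCII string for use in a 7-bit header.
encoded = Header("Dürst", "iso-8859-1").encode()
```

Here `surname` is "Dürst" and `signature` ends in "Martin Dürst."; `encoded` is an ASCII-only encoded-word that decodes back to the same surname.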

 

Email Format Problems

EAI: Clean Approach

A radical departure from the "Email will always be 7bit." approach

 

EAI Email Format

(all UTF-8)

From: Martin Dürst <dürst@example.org>
To: A. Attendee <attendee@example.org>
Subject: IUC in Santa Clara
Date: Wed, 24 Oct 2012 14:05:15 -0700
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

Dear Attendee,

I hope you enjoy IUC in Santa Clara.

Regards,    Martin Dürst.
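
As a rough illustration of the all-UTF-8 format, the message above can be produced with Python's standard `email` package by selecting its `SMTPUTF8` policy. This is a sketch under that assumption, not the tooling used for the talk; the function name `build_eai_message` is made up for the example, and the Date header is omitted for brevity.

```python
from email import policy
from email.message import EmailMessage


def build_eai_message() -> bytes:
    """Serialize a message whose headers and body are raw UTF-8."""
    # policy.SMTPUTF8 writes non-ASCII header text as UTF-8
    # instead of falling back to encoded-words.
    msg = EmailMessage(policy=policy.SMTPUTF8)
    msg["From"] = "Martin Dürst <dürst@example.org>"
    msg["To"] = "A. Attendee <attendee@example.org>"
    msg["Subject"] = "IUC in Santa Clara"
    msg.set_content(
        "Dear Attendee,\n\n"
        "I hope you enjoy IUC in Santa Clara.\n\n"
        "Regards,    Martin Dürst.\n",
        charset="utf-8",
        cte="8bit",  # body bytes are sent as-is, no quoted-printable
    )
    return msg.as_bytes()


raw = build_eai_message()
```

The resulting bytes contain the address with "ü" as plain UTF-8, so the message can only be handed to servers that accept the SMTP extension listed in the table of RFCs in this talk.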

 

Why This Change?

 

In the Making since 2006

Document          Experimental      Standards Track
Framework         RFC 4952, 2007    RFC 6530, 2011
SMTP Extension    RFC 5336, 2008    RFC 6531, 2011
Message Format    RFC 5335, 2008    RFC 6532, 2011
POP3 Extension    RFC 5721, 2009    draft-ietf-eai-rfc5721bis
IMAP Extension    RFC 5738, 2009    draft-ietf-eai-5738bis

For more, see WG tools page

 

Compatibility

 

Deployment

 

What Can You Do?

Interlude: RFC I18N

Internet drafts and RFCs are ASCII only:

Internationalized Resource Identifiers                         M. Duerst
(iri)                                           Aoyama Gakuin University
Internet-Draft                                               L. Masinter
Intended status: BCP                                               Adobe
Expires: April 24, 2013                                        A. Allawi
                                                  Diwan Software Limited
                                                        October 21, 2012

 

Hacking XML2RFC

XML2RFC: Tool chain for draft and RFC production

Internationalized Resource Identifiers                          M. Dürst
(iri)                                           Aoyama Gakuin University
Internet-Draft                                                  (青山学院大学)
Intended status: BCP                                         L. Masinter
Expires: April 28, 2013                                            Adobe
                                                  A. Allawi (عادل علاوي)
                                          Diwan Software Limited (ديوان)
                                                        October 25, 2012

 

Versions

Production: Ruby and XSLT; .pdf output still needs manual work

Pieces necessary are available from IETF SVN (HOWTO.txt)

Official effort: Proposal, Mailing List

 

What's an IRI

In full: Internationalized Resource Identifier

Version of URI/URL where non-ASCII characters are allowed

URIs are often also called URLs (Uniform/Universal Resource Locators), although strictly speaking, URLs are a subset of URIs.

Examples: http://räksmörgås.josefsson.org, http://بوابة.تونس/
The first internationalized top-level domain names were finally introduced by ICANN starting in mid-2010, after extensive agonizing, deliberation and testing.

http://ja.wikipedia.org/wiki/青山学院大学,
http://www.sw.it.aoyama.ac.jp/Dürst/,
http://www.w3.org/People/Dürst/
These just work in all modern browsers (which doesn't include IE6), but may not always be displayed as such.

 

How IRIs Work

The extended character repertoire is essentially the only difference between URIs and IRIs, and conversion is easy using UTF-8 and percent-encoding. However, as in many other areas of Unicode and internationalization, the details can be surprisingly tricky.
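
The basic conversion can be sketched in a few lines of Python. This is a simplified illustration, not the full algorithm of the IRI specification: the function name `iri_to_uri` is made up, userinfo is dropped, and Python's built-in `idna` codec (which implements the older IDNA 2003 rules) is assumed as a stand-in for the host mapping.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched; "%" is included so existing escapes survive.
_SAFE_PATH = "/%:@!$&'()*+,;="
_SAFE_QUERY = "/?%:@!$&'()*+,;="


def iri_to_uri(iri: str) -> str:
    """Map an IRI to a URI: Punycode for the host, UTF-8 plus %-encoding elsewhere."""
    parts = urlsplit(iri)
    host = parts.hostname or ""
    # Assumption: the stdlib "idna" codec produces the ASCII (xn--) form.
    ascii_host = host.encode("idna").decode("ascii")
    netloc = ascii_host if parts.port is None else f"{ascii_host}:{parts.port}"
    # quote() encodes non-ASCII characters as UTF-8 and percent-escapes each byte.
    path = quote(parts.path, safe=_SAFE_PATH)
    query = quote(parts.query, safe=_SAFE_QUERY)
    fragment = quote(parts.fragment, safe=_SAFE_QUERY)
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


uri = iri_to_uri("http://www.w3.org/People/Dürst/")
idn_uri = iri_to_uri("http://räksmörgås.josefsson.org")
```

With these inputs, `uri` is `http://www.w3.org/People/D%C3%BCrst/`, and the host of `idn_uri` becomes an `xn--` label. The tricky details mentioned above (query encoding, bidi, normalization, legacy browser behavior) are deliberately left out of this sketch.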

 

IETF IRI WG

IETF: Internet Engineering Task Force

WG: Working Group

BOF November 2009

Chartered early 2010

Work is slow and spurty, but progressing

Documents being Updated

RFC 3987 - Internationalized Resource Identifiers (IRIs)

Latest Drafts:

RFC 4395 - Guidelines and Registration Procedures for New URI/IRI Schemes

Latest Draft: draft-ietf-iri-4395bis-irireg-04

List of WG documents

 

Registration Guidelines

Clarified or new:

Main Issues for IRIs

 

Decomposition of an IRI

The term resource identifier is used generically, to denote URIs and IRIs.

scheme://userinfo@host:port/path?query#fragment

The most important parts:

JavaScript also has access to authority (userinfo/host/port together)
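
For illustration, the same decomposition can be obtained in Python with `urllib.parse.urlsplit`. The example address is hypothetical, and the dictionary keys simply mirror the component names in the pattern above.

```python
from urllib.parse import urlsplit

# A made-up resource identifier that exercises every component.
parts = urlsplit("http://user@example.org:8080/path?query#fragment")

components = {
    "scheme": parts.scheme,
    "authority": parts.netloc,   # userinfo, host and port together
    "userinfo": parts.username,  # only the user name; a password is separate
    "host": parts.hostname,
    "port": parts.port,          # parsed as an integer
    "path": parts.path,
    "query": parts.query,
    "fragment": parts.fragment,
}
```

Splitting first and then treating each component separately matters for IRIs, because the host and the other components are converted with different rules.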

 

To Punycode or not to Punycode

 

Query Part

Details need to be worked out.

 

Bidirectionality

 

Conventions for Bidi Display

Checking BIDI Classes

Separator Jumping

Many things can happen with bidi

A very undesirable thing is separator jumping:

Logical: ABC1.3DEF

Visual: FED1.3CBA

Arabic: ابة1.3تجج

"1" and "3" 'jumped' over the dot to the other side

Jumping in IRIs

Preliminary simulation results:

Working on confirmation from algorithm
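
The bidi classes involved in the jumping example can be inspected with Python's `unicodedata` module. The sketch below lists the class of each character and flags a single separator that sits directly between two digits, which is the pattern in "1.3". The function names are illustrative, and the check is only a heuristic for spotting candidates, not an implementation of the Unicode Bidirectional Algorithm.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}    # Hebrew-style and Arabic-style letters
DIGIT_CLASSES = {"EN", "AN"}  # European and Arabic-Indic digits
SEPARATOR_CLASSES = {"CS", "ES"}


def bidi_classes(text: str) -> list[str]:
    """Return the Unicode bidi class of every character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


def may_jump(text: str) -> bool:
    """Heuristic: right-to-left letters plus a digit-separator-digit run."""
    classes = bidi_classes(text)
    if not RTL_CLASSES.intersection(classes):
        return False
    return any(
        a in DIGIT_CLASSES and sep in SEPARATOR_CLASSES and b in DIGIT_CLASSES
        for a, sep, b in zip(classes, classes[1:], classes[2:])
    )


# The Arabic example from the slide, written with escapes to keep the
# source readable: three letters, "1.3", three letters.
arabic = "\u0627\u0628\u0629" + "1.3" + "\u062a\u062c\u062c"
classes = bidi_classes(arabic)
flagged = may_jump(arabic)
```

For the Arabic example, `classes` is three `AL` entries, then `EN`, `CS`, `EN`, then three more `AL` entries, and `flagged` is true. The dot is a common separator (`CS`), so the bidi algorithm can treat "1.3" as one number rather than as two labels divided by a dot.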

Normalization

 

Browser Quirks and Other Legacy

 

Scheme Names

Warning: This is not part of current work!

 

How You Can Contribute

 

Conclusions

Q & A

Further Material

There are lots of links throughout the talk; please use them!

Some older material (more background information):