Update on Internationalized Resource Identifiers (IRI)
and Email Address Internationalization (EAI)
IUC 36, Santa Clara, CA, U.S.A.,
24 October 2012
Martin J. DÜRST
with Shunsuke OSHIMA and Ayumu NOJIMA
duerst@it.aoyama.ac.jp
Aoyama Gakuin University
© 2012 Martin J. Dürst, Aoyama Gakuin University
Overview
- Introduction
- Email Address I18N
- RFC/draft I18N
- IETF IRI WG
- IRI details and issues
- How you can contribute
- Conclusions
Abstract
On the Internet, exchanging content (documents, web pages, e-mail messages
and the like) in the world's languages and scripts can be taken for granted. On
the other hand, identifying content and people using anything but a very
limited set of Latin characters is still difficult if not outright impossible.
For URIs (Uniform Resource Identifiers), this is being changed by IRIs
(Internationalized Resource Identifiers). For electronic mail addresses, this
is being changed by EAI (Email Address Internationalization). IDNs
(Internationalized Domain Names) are used in both cases. This paper will give
an update on the latest developments in the area of internationalized
identifiers. This update will discuss motivations and limitations, basic
approaches and methods, the current state of implementation and
standardization, and unsolved problems.
For all kinds of identifiers that extend beyond non-accented Latin
characters, Unicode is indispensable, but different kinds of identifiers take
different approaches to using Unicode. For processing in the DNS proper, IDNs use
Punycode, a compact but cryptic and therefore often despised "ASCII-compatible"
encoding. IRIs use UTF-8 and %-encoding when downgrading to URIs. EAI is the
most straightforward: In a bold move that would have been unthinkable a few
years ago, it extends the underlying e-mail infrastructure to "just use
UTF-8".
When this abstract was written, the most serious unsolved problem relating
to internationalized identifiers was bidirectional identifiers, i.e.
identifiers including characters written right-to-left such as Arabic and
Hebrew. We will report on progress in this area, even if we don't expect it to
be completely solved by the time of the conference.
This presentation is of interest to anybody working on or using e-mail
addresses or web addresses in an international setting. It is based on the
hands-on implementation experience of the authors.
[Text appearing in gray is commentary that does not show up in presentation
mode. The best way to view the slides as they were presented is with Opera, pressing F11.]
About the Slides
Slides available at: http://www.sw.it.aoyama.ac.jp/2012/pub/IRI-EAI
Protocol/Format I18N
- Content ⇒ almost 100% done
Emails, Web pages,...
- Metainformation ⇒ mostly done
Subject, title,...
- Identifiers ⇒ see remainder of talk
Email addresses, Web addresses,...
- Protocol/format ⇒ better not
EHLO, RCPT, GET, head, h1, img,...
Why Identifier I18N?
Would you like to type ωωω.γουγλ.κομ ?
Identifiers in a well-known script are easier to:
- Remember
- Interpret
- Guess
- Create
- Transcribe
- Identify with
- Be searched
Identifier I18N Techniques
- ASCII-compatible encoding (ACE)
- Just Unicode
Identifier I18N Caveats
- Identity
- Normalization
- Best at origin
- Security
- Script mixing
- Bidirectionality
IRIs and EAI
- IRI: Internationalized Resource Identifier
I18N version of URI/web address
- EAI: Email Address Internationalization
Unicode in email address itself
+ Extensions to email protocols
Email Addresses
name@dept.company.com
LeftHandSide@RightHandSide
- RHS is domain name: I18N possible with Punycode
- LHS is restricted to ASCII
- Domain names are carefully restricted
- LHS is purely the final mail server's business
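As a rough illustration of the split described above, the following Python sketch cuts an address at the last "@" and converts only the domain to its ASCII-compatible form. It relies on Python's built-in "idna" codec, which implements the older IDNA 2003 rules. The function name split_address and the sample address are made up for this example.

```python
def split_address(address):
    """Return (local part, ASCII-compatible form of the domain)."""
    # Everything before the last "@" is the LHS, the rest is the RHS.
    local, _, domain = address.rpartition("@")
    # The "idna" codec applies Punycode label by label (IDNA 2003 rules).
    ace_domain = domain.encode("idna").decode("ascii")
    return local, ace_domain


local, domain = split_address("info@bücher.example")
print(local, domain)
# The domain now fits into ASCII-only mail; a non-ASCII local part would not.
print("local part is ASCII:", local.isascii())
```

Only the right-hand side can be rescued this way. A local part such as "dürst" has no agreed ASCII form, because its interpretation is left to the final mail server.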
Email Protocol/Format
- Protocols
- SMTP: Simple Mail Transfer Protocol
- POP3: Post Office Protocol
- IMAP: Internet Message Access Protocol
- Format
- Envelope: Addresses in SMTP
- Header: Email metadata
- Body: Email text
ASCII Email Format
From: duerst@it.aoyama.ac.jp
To: attendee@example.org
Subject: IUC in Santa Clara
Date: Wed, 24 Oct 2012 14:05:05 -0700
Dear Attendee,
I hope you enjoy IUC in Santa Clara.
Regards, Martin.
Non-ASCII Email Format
From: Martin =?iso-8859-1?Q?D=FCrst?= <duerst@it.aoyama.ac.jp>
To: A. Attendee <attendee@example.org>
Subject: IUC in Santa Clara
Date: Wed, 24 Oct 2012 14:05:15 -0700
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: quoted-printable
Dear Attendee,
I hope you enjoy IUC in Santa Clara.
Regards, Martin D=FCrst.
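To make the 7-bit encodings in this message concrete, here is a small Python sketch that decodes the encoded word in the From header and the quoted-printable body line. It uses the standard library modules email.header and quopri. The two helper names are invented for this illustration.

```python
import quopri
from email.header import decode_header, make_header


def decode_header_value(value):
    """Turn RFC 2047 encoded words in a header value into plain text."""
    return str(make_header(decode_header(value)))


def decode_qp_body(raw, charset):
    """Undo quoted-printable, then decode the bytes with the declared charset."""
    return quopri.decodestring(raw).decode(charset)


print(decode_header_value("Martin =?iso-8859-1?Q?D=FCrst?="))
print(decode_qp_body(b"Regards, Martin D=FCrst.", "iso-8859-1"))
```

In both places the byte FC only becomes "ü" once the charset label (here iso-8859-1) is taken into account.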
Email Format Problems
- Cryptic 7-bit encoding in headers and body
- Many different character encodings
- Email address is ASCII only
EAI: Clean Approach
- Allow Unicode in email addresses
- Use raw UTF-8 in headers
- Use UTF-8 only in bodies
- No ASCII-Compatible Encoding
- No alternate/fallback addresses
(eliminated in experimental phase)
Radical change from the "Email will always be 7-bit" approach
EAI Email Format
(all UTF-8)
From: Martin Dürst <dürst@example.org>
To: A. Attendee <attendee@example.org>
Subject: IUC in Santa Clara
Date: Wed, 24 Oct 2012 14:05:15 -0700
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Dear Attendee,
I hope you enjoy IUC in Santa Clara.
Regards, Martin Dürst.
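A message of this shape can be produced with Python's standard email package, as in the sketch below. It assumes a Python version that provides email.policy.SMTPUTF8, and it only builds the bytes of the message; actually sending them would also require a mail server that supports the EAI extensions.

```python
from email import policy
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "Martin Dürst <dürst@example.org>"
msg["To"] = '"A. Attendee" <attendee@example.org>'
msg["Subject"] = "IUC in Santa Clara"
msg.set_content(
    "Dear Attendee,\n"
    "I hope you enjoy IUC in Santa Clara.\n"
    "Regards, Martin Dürst.\n"
)

# policy.SMTPUTF8 serializes headers as raw UTF-8 instead of encoded words.
raw = msg.as_bytes(policy=policy.SMTPUTF8)
print(raw.decode("utf-8"))
```

The header order in the output may differ from the slide, but the address and the name appear as plain UTF-8 bytes, without any encoded words.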
Why This Change?
- Bad press for Punycode
- Sendmail 8-bit clean (for a while now)
- Interest from China
- Email people looking for new work
- It just makes sense
In the Making since 2006
For more, see WG tools page
Compatibility
- For a message to go through, sender/receiver MUA and all MTAs in between need to be
upgraded
- Down-conversion is possible for:
- bodies
- text in headers
- not for addresses
- Downgrading tries to get something through, even if it's just "you got
mail"
Deployment
- Implementing the upgrades is easy
- Upgrading a webmail system or a company is possible
(central control, flag day)
- User agents need some smarts
(remember which correspondents can handle EAI mail)
- Users need some smarts
- Forwarding and mailing lists need special care
What Can You Do?
- Prepare your infrastructure
- Use early entry as marketing advantage
("Your real name in your mail address with FooWebMail.")
Interlude: RFC I18N
Internet drafts and RFCs are ASCII only:
Internationalized Resource Identifiers M. Duerst
(iri) Aoyama Gakuin University
Internet-Draft L. Masinter
Intended status: BCP Adobe
Expires: April 24, 2013 A. Allawi
Diwan Software Limited
October 21, 2012
Hacking XML2RFC
XML2RFC: Tool chain for draft and RFC
production
Internationalized Resource Identifiers M. Dürst
(iri) Aoyama Gakuin University
Internet-Draft (青山学院大学)
Intended status: BCP L. Masinter
Expires: April 28, 2013 Adobe
A. Allawi (عادل علاوي)
Diwan Software Limited (ديوان)
October 25, 2012
Versions
Production: Ruby, XSLT, .pdf still needs manual work
Pieces necessary are available from IETF
SVN (HOWTO.txt)
Official effort: Proposal, Mailing List
What's an IRI
In full: Internationalized Resource Identifier
Version of URI/URL where non-ASCII characters are allowed
URIs are often also called URLs (Uniform/Universal Resource
Locators), although strictly speaking, URLs are a subset of URIs.
Examples: http://räksmörgås.josefsson.org,
http://بوابة.تونس/
The first internationalized
top-level domain names were finally introduced by ICANN starting mid 2010 after extensive
agonizing, deliberation and testing.
http://ja.wikipedia.org/wiki/青山学院大学,
http://www.sw.it.aoyama.ac.jp/Dürst/,
http://www.w3.org/People/Dürst/
These just work in all modern browsers (which doesn't include IE6), but may not
always be displayed as such.
How IRIs Work
- IRIs are transferred as characters
- Encoding depends on container format
E.g. Shift_JIS on disk, iso-2022-jp in email, UTF-8 in a Web page, ink on
napkin, sound waves
- When a URI is needed:
- Convert to UTF-8
- Percent-encode (used for non-ASCII bytes in URIs)
Example: 青山 (U+9752 U+5C71) →
%E9%9D%92%E5%B1%B1
(use Richard Ishida's Unicode Code Converter
for help)
The extended character repertoire is essentially the only
difference between URIs and IRIs, and conversion is easy using UTF-8 and
percent-encoding. However, as in many other areas of Unicode and
internationalization, the details can be surprisingly tricky.
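The two conversion steps can be sketched in Python as follows. This is a minimal sketch, not the full mapping defined in RFC 3987: the helper name iri_to_uri is made up here, and it percent-encodes non-ASCII characters everywhere, including the host.

```python
import re
from urllib.parse import quote


def iri_to_uri(iri):
    """Percent-encode every run of non-ASCII characters as UTF-8 bytes."""
    # quote() encodes its argument as UTF-8 by default and emits %XX escapes.
    return re.sub(r"[^\x00-\x7F]+", lambda m: quote(m.group()), iri)


# The container encoding only matters for getting at the characters.
stored = "青山".encode("shift_jis")      # bytes as they might sit on disk
characters = stored.decode("shift_jis")  # back to characters
print(iri_to_uri(characters))
print(iri_to_uri("http://www.w3.org/People/Dürst/"))
```

Whatever encoding the IRI was stored in, the resulting URI depends only on the characters, because the bytes that get percent-encoded are always UTF-8.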
IETF IRI WG
IETF: Internet Engineering Task Force
WG: Working Group
BOF November 2009
Chartered early 2010
Work is slow and spurty, but progressing
Documents being Updated
RFC 3987 -
Internationalized Resource Identifiers (IRIs)
Latest Drafts:
RFC 4395 - Guidelines and
Registration Procedures for New URI/IRI Schemes
Latest Draft: draft-ietf-iri-4395bis-irireg-04
List of WG documents
Registration Guidelines
Clarified or new:
- URI schemes are IRI schemes are URI schemes
→ only one IANA Scheme
Registry
- Syntax description can be on URI or IRI level
(Syntax can be expressed either in Unicode characters or in ASCII
characters (including %-encoding). This needs some additional careful
work.)
- Register example: scheme
Main Issues for IRIs
- Conversion using decomposition into pieces
- Where and when to Punycode
- Query part
- Bidirectionality
- Normalization
- Legacy issues
- Scheme names
Decomposition of an IRI
The term resource identifier is used generically, to
denote URIs and IRIs.
scheme://userinfo@host:port/path?query#fragment
The most important parts:
scheme: E.g. http, ftp, mailto,...
host: reg-name (domain-name) or IP address
path: Similar to directory hierarchy
query: Additional data for form submission,...
fragment: Location or part inside a document
JavaScript also has access to authority
(userinfo/host/port together)
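For experimenting with this decomposition, Python's urllib.parse.urlsplit can be used, as sketched below. The sample IRI is invented for this illustration, and urlsplit follows the generic URI syntax rather than anything specific to IRIs.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://user@example.org:8080/People/Dürst/?q=東京#top")

print("scheme:   ", parts.scheme)
print("authority:", parts.netloc)  # userinfo, host and port together
print("userinfo: ", parts.username)
print("host:     ", parts.hostname)
print("port:     ", parts.port)
print("path:     ", parts.path)
print("query:    ", parts.query)
print("fragment: ", parts.fragment)
```

The non-ASCII characters in the path and the query are passed through unchanged; splitting into components does not require converting to a URI first.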
To Punycode or not to Punycode
- Internationalized Domain Names (IDNs) use Punycode
Example: بوابة.تونس → xn--mgbbbe6m.xn--pgbs0dh
- IRIs and URIs can use %-encoding
Example: تونس → %D8%AA%D9%88%D9%86%D8%B3
- Choice A: Use Punycode
- Choice B: Use %-encoding
- Not yet widely implemented
- Conclusion: Both are okay
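Both choices can be tried out for a single host label with the Python sketch below. It is only an approximation: Python's built-in "idna" codec follows IDNA 2003 rather than IDNA 2008, and the helper name host_forms is made up for this example.

```python
from urllib.parse import quote


def host_forms(host):
    """Return the Punycode (choice A) and %-encoded (choice B) forms of a host."""
    ace = host.encode("idna").decode("ascii")  # choice A: ASCII-compatible encoding
    pct = quote(host)                          # choice B: UTF-8 plus %-encoding
    return ace, pct


ace, pct = host_forms("تونس")
print("Punycode:  ", ace)
print("%-encoding:", pct)
```

The Punycode form is what the DNS itself works with, while the %-encoded form keeps the host consistent with how the other IRI components are converted.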
Query Part
- Example: http://www.google.com/search?q=東京
- HTML Form submission uses document encoding
- IRIs are supposed to use UTF-8, but browsers use document encoding
- Conflicts between big sites and small sites
- Limit by scheme (maybe http: only)
Many schemes don't use query parts, and some definitely don't want document
encoding (e.g. mailto:)
- Display query part with %-encoding
This is used by some browsers, and can also be used when copying IRIs to
the clipboard. This transfers bytes, not characters, and therefore keeps
the encoding.
- Use UTF-8 for Web pages
This makes the distinction between document encoding and UTF-8
irrelevant.
- Use side channel
E.g. query parameter indicating encoding or attribute on <form>
element indicating query submission should use UTF-8.
Details need to be worked out.
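The conflict between document encoding and UTF-8 is easy to reproduce. The Python sketch below encodes the query value from the Google example twice, once as a UTF-8 page would and once as a Shift_JIS page would. The helper name encode_query_value is invented here, and Shift_JIS merely stands in for any non-UTF-8 document encoding.

```python
from urllib.parse import quote, unquote


def encode_query_value(text, encoding):
    """Percent-encode a query value using the given character encoding."""
    return quote(text, safe="", encoding=encoding)


utf8_form = encode_query_value("東京", "utf-8")
sjis_form = encode_query_value("東京", "shift_jis")
print("UTF-8 page:    ", "http://www.google.com/search?q=" + utf8_form)
print("Shift_JIS page:", "http://www.google.com/search?q=" + sjis_form)

# A server can only recover the text if it knows which encoding was used.
print(unquote(sjis_form, encoding="shift_jis"))
```

The same two characters produce two unrelated byte sequences, and nothing in the query itself says which one was used.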
Bidirectionality
- Restrictions on character combinations
- Conventions for display
- Obvious updates:
- Allow combining marks at end of RTL component
(needed for Dhivehi, Yiddish...)
- Allow numbers at end of RTL component
- Adjusting rules to IDNA 2008
Conventions for Bidi Display
- Same display anywhere, or
- Optimal display when we know it is an IRI?
- Overall LTR or RTL or depending on some component
- Reordering by component or by directionality run
Checking BIDI Classes
- Domain name labels:
- "." is the only separator
- Not many symbols in actual labels
- IRI components:
- Many different separators: : / ? # . = & ...
- Many symbols in components
Separator Jumping
Many things can happen with bidi
A very undesirable thing is separator jumping:
Logical: ABC1.3DEF
Visual: FED1.3CBA
Arabic: ابة1.3تجج
"1" and "3" 'jumped' over the dot to the other side
Jumping in IRIs
Preliminary simulation results:
- All major separators except # behave the same as "."
- # is slightly more dangerous
- We can probably live with about the same restrictions as for IDNA
(which isn't perfect, either)
- In many parts of an IRI, there is more coordination possible than in
DNS
Working on confirmation from algorithm
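As a starting point for this kind of checking, the bidi classes of the separators and of the characters in the separator-jumping example can be listed with Python's unicodedata module. The sketch below only looks up classes; it does not run the Unicode Bidirectional Algorithm, and the helper name bidi_classes is made up for this illustration.

```python
import unicodedata


def bidi_classes(text):
    """Return the Unicode bidi class of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


# Separators that occur in IRIs.
for ch, cls in zip(":/?#.=&", bidi_classes(":/?#.=&")):
    print(repr(ch), cls)

# Logical order of the Arabic example: letters, "1", ".", "3", letters.
print(bidi_classes("ابة1.3تجج"))
```

In the lookup, "." has class CS (common separator) and the digits have class EN (European number). The bidi algorithm treats a single common separator between two numbers as part of the number, which is how "1.3" ends up being displayed as one unit.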
Normalization
- Normalization important on different levels
- Upper/lower case for domain names,...
(usually works or gets fixed pretty quickly)
- Full/half width for Japanese,...
(difference is at least visible to users)
- Precomposed or decomposed for Vietnamese,...
- RFC 3987 required normalizing transcoders, but assumed NFC for Unicode data
- Moved from normative language to advice
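The three levels listed above can be tried out with Python's str methods and the unicodedata module, as in the sketch below. The sample strings are chosen for this illustration; which normalization, if any, should be applied to a given IRI component is exactly the open question, so this is not a recommended procedure.

```python
import unicodedata

# Upper/lower case, as for domain names.
print("WWW.Example.ORG".lower())

# Full/half width: NFKC maps compatibility variants to the regular forms.
print(unicodedata.normalize("NFKC", "ＡＢＣ"))  # fullwidth Latin letters
print(unicodedata.normalize("NFKC", "ｶ"))      # halfwidth katakana KA

# Precomposed versus decomposed, as in Vietnamese.
decomposed = "Vie\u0302\u0323t"  # e + circumflex + dot below
precomposed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(precomposed), precomposed == "Vi\u1ec7t")
```

The two Vietnamese strings look identical when displayed, but they only compare equal once both are in the same normalization form.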
Browser Quirks and Other Legacy
- Browsers try to be permissive (///, %xy,...)
- Make permissive implementations interoperate
- Distinguish from "Valid to send"
- Creates
- XML used very old definition of IRI
- Included spaces, control characters,...
- Defined LEIRI (Legacy Extended IRI)
Scheme Names
Warning: This is not part of current work!
- Switching scripts when typing is annoying
- If we want internationalization, do it all the way
(But punctuation such as :, /, ?, #,... would still be left)
- Technology is available, but not implemented (e.g. NAPTR)
- Scripts, not languages (in what language is http?)
- Limit to widely used schemes (http, ftp,...)?
- Limit to RTL scripts only?
How You Can Contribute
- Join the Working Group
by subscribing
to the WG mailing list
- Read the mailing list (archives) and
the drafts
(the mailing list is hosted by W3C, but it is an IETF WG)
- Contribute to the discussion
- Help with editing
- Work on implementations and tests
- Attend meetings (next: IETF 85 - Atlanta, USA, November 4 - 9)
(you can also attend it via Jabber/Audio Stream/meetecho)
Conclusions
- IRI basics work in browsers
- Lots of work on EAI needed
- Update your scripts and tools
- Contribute to ongoing work
Q & A
Further Material
Lots of links everywhere throughout the talk, please use them!
Some older material (more background information):
- IRIs
Beyond the Napkin: A Survey of Internationalized Resource Identifier Issues
and Implementation, Martin J. Dürst and Addison Phillips,
Internationalization and Unicode Conference 34, Santa Clara, CA, USA,
October 2010
- Update on
Internationalized Domain Names and Internationalized
Resource Identifiers, Martin J. Dürst, Internationalization and
Unicode Conference 33, San Jose, CA, USA, October 2009.
- IRIs and
IDNs: Testing, Implementation, and Specification Evolvement, Martin J.
Dürst, 31st Internationalization and Unicode Conference, San Jose, CA,
USA, October 2007.
- Internationalized
Resource Identifiers, Internationalization & Unicode
Conference 26, San Jose, CA, USA, Sept. 2004.
- Recent Progress on
Internationalized Resource Identifiers (IRIs), Martin J. Dürst, 25th Internationalization and
Unicode Conference, Washington DC, USA, April 2004.
- Internationalized
Resource Identifiers (IRIs) - Server-side Implementation, Martin J.
Dürst, 24th
Internationalization & Unicode Conference, Atlanta, GA, USA, Sept.
2003.
- Internationalized
Resource Identifiers: From Specification to Testing, Martin J. Dürst,
19th International Unicode
Conference, San Jose, CA, USA, Sept. 2001.
- Internationalizing
Internet Identifiers, Martin J. Dürst, 11th International Unicode
Conference, San Jose, CA, USA, Sept. 1997.