IRIs and IDNs: Testing, Implementations, and
Specification Evolvement
IUC 31, San Jose, 2007
Martin J. DÜRST
duerst@it.aoyama.ac.jp
Aoyama Gakuin University
© 2007 Martin J. Dürst, Aoyama Gakuin University
Overview
- What are IRIs and IDNs?
- Why IRIs and IDNs?
- Testing
- Implementations
- Specification Update
- Open Issues
- Outlook
- Q & A
Paper and Slides available online
Background
Internet/Web internationalization in waves:
- First wave: Internationalization of content (email, Web pages,...)
'Policy' documents:
- Second wave: Internationalization of Identifiers
What are IRIs and IDNs?
IRIs: Internationalized Resource Identifiers, internationalization of URIs
(/URLs)
IDNs: Internationalized Domain Names, internationalization of domain
names
Internationalization here means:
- Syntax characters stay the same
- 'Payload' characters allow almost all Unicode characters
(rather than just a US-ASCII subset)
Why IRIs and IDNs?
Native script is easier to:
- Devise identifiers (what is my identifier for X?)
- Memorize identifiers (what was the identifier of X?)
- Guess identifiers (what could be the identifier of X?)
- Understand identifiers (what does X refer to?)
- Manipulate identifiers (write, type,...)
- Correct identifiers (spelling errors)
- Identify with identifiers (nice identifier, isn't it?)
Because of higher familiarity and no need for transcription
[from a talk of mine 10 years ago]
URI/IRI structure
scheme:hierarchical-part?query#fragment-identifier
Example of an actual URI containing all four parts:
http://www.w3.org/2005/11/Translations/Query?titleMatch=HTML&lang=fr#xhtml1-2
The hierarchical-part often includes a domain name (www.w3.org in the above example)
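As a rough illustration, the example URI can be taken apart with Python's standard urllib.parse module. This is only a sketch; note that urlsplit divides the hierarchical part further into a network location (the domain name here) and a path.

```python
from urllib.parse import urlsplit

# The example URI from the slide, with all four parts present.
uri = "http://www.w3.org/2005/11/Translations/Query?titleMatch=HTML&lang=fr#xhtml1-2"
parts = urlsplit(uri)

print("scheme:  ", parts.scheme)    # the part before the first ":"
print("netloc:  ", parts.netloc)    # domain name inside the hierarchical part
print("path:    ", parts.path)      # rest of the hierarchical part
print("query:   ", parts.query)     # between "?" and "#"
print("fragment:", parts.fragment)  # after "#"
```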
Encoding of IRIs
When:
- URI-only protocols and formats
How:
- Encode characters as UTF-8
- Take bytes and encode them as %HH (percent-encoding)
- Result is a legal URI
Examples: Dürst → D%C3%BCrst, 渋谷駅 → %E6%B8%8B%E8%B0%B7%E9%A7%85
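The two steps above can be sketched in a few lines of Python. The function name percent_encode_non_ascii is made up for this illustration, and the sketch deliberately leaves all US-ASCII characters untouched, which is a simplification of the full conversion rules.

```python
def percent_encode_non_ascii(text: str) -> str:
    """Encode each non-ASCII character as UTF-8, then write each byte as %HH."""
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ch)  # US-ASCII characters are kept as they are
        else:
            # Step 1: UTF-8 bytes; step 2: percent-encode every byte
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)


print(percent_encode_non_ascii("Dürst"))
print(percent_encode_non_ascii("渋谷駅"))
```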
Encoding of IDNs
When:
- In the Domain Name System/Protocol itself
How:
- Normalize with NFKC, downcase, plus some additional mappings/prohibitions
- Encode with Punycode (elaborate special-purpose compression)
- Prefix each label with xn--
- Result is a domain name label
Examples: 渋谷駅.jp → xn--i5wq75dpjj.jp, www.résumé.jp → www.xn--rsum-bpad.jp
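The steps above can be approximated with Python's standard library. This is a simplified sketch: label_to_ace and domain_to_ace are names invented here, and plain NFKC plus lower() stands in for the full set of mappings and prohibitions. Python's built-in "idna" codec performs the complete preparation and is shown for comparison.

```python
import unicodedata

ACE_PREFIX = "xn--"


def label_to_ace(label: str) -> str:
    """Simplified per-label conversion: normalize, downcase, Punycode, prefix."""
    prepared = unicodedata.normalize("NFKC", label).lower()
    if prepared.isascii():
        return prepared  # pure ASCII labels need no encoding
    return ACE_PREFIX + prepared.encode("punycode").decode("ascii")


def domain_to_ace(domain: str) -> str:
    """Apply the conversion to every dot-separated label."""
    return ".".join(label_to_ace(label) for label in domain.split("."))


print(domain_to_ace("www.résumé.jp"))
# The built-in codec applies the complete preparation rules:
print("www.résumé.jp".encode("idna").decode("ascii"))
```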
Fallbacks
Foreign script not usable due to:
- Lacking user familiarity
- System limitations
Solutions:
- Use ASCII Compatible Encoding (ACE)
- Percent-encoding for IRIs
- Punycode for domain names
- Use hand-selected alternatives (e.g. transliterations or
translations)
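The two ASCII fallbacks can be combined to turn a whole IRI into a URI: Punycode for the domain name, percent-encoding for everything else. The following is a minimal sketch under stated assumptions: iri_to_uri and escape_non_ascii are hypothetical names, user information in the authority is ignored, and Python's "idna" codec (which implements the original IDNA specification) handles the domain name.

```python
from urllib.parse import urlsplit, urlunsplit


def escape_non_ascii(text: str) -> str:
    """Percent-encode the UTF-8 bytes of every non-ASCII character."""
    return "".join(
        ch if ord(ch) < 0x80
        else "".join(f"%{byte:02X}" for byte in ch.encode("utf-8"))
        for ch in text
    )


def iri_to_uri(iri: str) -> str:
    """Sketch: ACE for the host, percent-encoding for path, query, fragment."""
    parts = urlsplit(iri)
    host = parts.hostname or ""
    netloc = host.encode("idna").decode("ascii") if host else ""
    if parts.port is not None:
        netloc += f":{parts.port}"  # keep an explicit port, if any
    return urlunsplit((
        parts.scheme,
        netloc,
        escape_non_ascii(parts.path),
        escape_non_ascii(parts.query),
        escape_non_ascii(parts.fragment),
    ))


print(iri_to_uri("http://www.résumé.jp/Dürst"))
```

A URI that is already pure US-ASCII passes through unchanged, so the function can be applied to any identifier without first checking whether it is an IRI.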
Testing
- Purpose
- Testing requirements
- Test generation framework
- Other tests
Why Testing?
- Encourage implementations
- Increase implementation coverage
- Provide implementation evidence
- Weed out unimplemented parts of specification
Why testing first: Test-driven development
Charmod Testing Requirements
Axes listed in W3C Character Model 1.0: Resource Identifiers:
- IRIs in several document formats (HTML, CSS, SVG, Atom,...)
- IRIs in several locations in the same document format
- non-ASCII characters in different parts of an IRI (e.g. domain name
part, path part)
- IRIs in documents with various widely used character encodings and with
characters from various scripts
- Document-specific escapes in IRIs
- IRIs in various URI schemes
- Setup of various servers for IRIs
- Translation of IRIs into URIs (needed for all the above)
Over the years, IRI tests for various purposes have been created and made
available at various locations. An overview is given at http://www.w3.org/International/iri-edit/testing.html;
if some tests are not listed there, please inform the author.
Testing Framework
- Idea and history
- Test types
- Modalities
- Demo
- Next steps
Framework Idea and History
- Producing tests for many combinations is tedious
- Producing correct tests is difficult
- Test generation framework written in Perl jointly with Kazuhiro
Yonekawa in 2005/2006
- Rewritten recently in Ruby: more functionality, less and better
organized code
- Currently more than 3000 files produced, increasing
- Published today, but many improvements possible, please comment
Test Types
- Control tests: Test preconditions for actual tests
E.g.: Support for some functionality with URIs
- Verification tests: Tests that implementations have to pass
- Exploratory tests: Tests to assess implementation coverage with
potential for spec change
Modalities
Abstraction conveniently combining:
- Encodings used:
- US-ASCII
- Legacy encodings
- UTF-8
- Escapes such as Numeric Character References (NCRs)
- Encodings with normalization issues
- Examples used (language/encoding)
- Test type
- Presence or absence of explicit 'wrong' results
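To make the idea of combining modalities concrete, here is a small sketch of how such combinations might be enumerated. It is not the actual framework (which is written in Ruby); the table MODALITIES, the list TEST_TYPES, and the function names are all hypothetical and chosen only for this illustration.

```python
from itertools import product

SAMPLE = "Dürst"  # example string used elsewhere in the slides


def as_ncr(text: str) -> str:
    """Write non-ASCII characters as hexadecimal Numeric Character References."""
    return "".join(ch if ord(ch) < 0x80 else f"&#x{ord(ch):X};" for ch in text)


# Hypothetical modalities: how the IRI appears, byte for byte, in a test document.
MODALITIES = {
    "utf-8": lambda s: s.encode("utf-8"),
    "iso-8859-1": lambda s: s.encode("iso-8859-1"),      # a legacy encoding
    "us-ascii+ncr": lambda s: as_ncr(s).encode("ascii"),  # escapes in ASCII
}
TEST_TYPES = ["control", "verification", "exploratory"]


def enumerate_tests(sample: str):
    """Return one (modality, test type, document bytes) tuple per combination."""
    return [
        (modality, test_type, encode(sample))
        for (modality, encode), test_type in product(MODALITIES.items(), TEST_TYPES)
    ]


for modality, test_type, data in enumerate_tests(SAMPLE):
    print(modality, test_type, data)
```

Even this toy table yields nine cases from one sample string, which hints at why generating the tests by hand quickly becomes tedious.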
Human-oriented vs. Machine-oriented Tests
- Try to distinguish explicit test failure vs. other problem (e.g.
network failure)
- Traditionally, one frequent failure pattern: direct percent-encoding of
legacy encoding
- Provide explicit WRONG! message for human viewers
- Provide 404 Not Found responses for automatic tests
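The frequent failure pattern mentioned above can be recognized on the server side by comparing the requested path against both the correct (UTF-8 based) and the wrong (legacy-encoding based) percent-encoded forms. The sketch below assumes ISO-8859-1 as the legacy encoding; classify_request and its return values are hypothetical names for illustration.

```python
from urllib.parse import quote


def classify_request(requested_path: str, iri_path: str,
                     legacy_encoding: str = "iso-8859-1") -> str:
    """Distinguish a pass, the known wrong pattern, and anything else."""
    correct = quote(iri_path, safe="/", encoding="utf-8")
    wrong = quote(iri_path, safe="/", encoding=legacy_encoding)
    if requested_path == correct:
        return "pass"
    if requested_path == wrong:
        return "wrong"   # legacy bytes were percent-encoded directly
    return "other"       # unrelated problem, not an explicit test failure


print(classify_request("/D%C3%BCrst", "/Dürst"))
print(classify_request("/D%FCrst", "/Dürst"))
```

A test server could answer the "wrong" case with an explicit failure page or status, while leaving "other" to the usual error handling.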
Demo!
Version 0.10 of tests published today:
http://www.sw.it.aoyama.ac.jp/2005/iritest/
Next Steps
- Tests for FTP
- Tests for fragment identifiers
- Tests for directory name rather than filename
- Tests for domain names
- Tests for form submission
Can you ever have enough tests?
Other Tests
Overview page with pointers at http://www.w3.org/International/iri-edit/testing.html
If you know about some test that is not linked, please tell me!
More tests needed for other aspects than resolution:
- Display
- Generation/creation
Browser Implementations
Coverage is reasonably good:
- IE6: IRIs supported, IDNs need plugin
- Mozilla Firefox (2.0): IDNs supported, IRIs need option setting
- Opera: both IRIs and IDNs supported
- Amaya: both IRIs and IDNs supported
- Safari: both IRIs and IDNs supported
- IE7: both IRIs and IDNs supported
(mostly checked on Windows)
Other Implementations
- Is your Web site IRI-ready?
- Relevant for spiders/search engines
- Relevant for tool Web sites (e.g. W3C Validator, Link Checker)
- Are your scripts IRI-ready?
- Is your programming language/library IRI-ready?
Implementing IRIs/IDNs in cURL
- Joint work with Takeo Kobayashi
- cURL: Very powerful URI/resource manipulation tool and library
- Using libidn for IDNs
- Using iconv for transcoding
- Detecting encoding of IRI:
- via option (-Z was the last one available)
- from OS (Unix/Linux and Windows)
- ANSI?/OEM? problem remaining on Windows
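The transcoding step can be illustrated in Python; this is an analogy only and does not show cURL's actual code or options. The function typed_bytes_to_uri is a hypothetical name: it interprets the raw bytes in a given encoding (falling back to the encoding reported by the operating system), and only then applies UTF-8 based percent-encoding.

```python
import locale


def percent_encode_non_ascii(text: str) -> str:
    """Percent-encode the UTF-8 bytes of every non-ASCII character."""
    return "".join(
        ch if ord(ch) < 0x80
        else "".join(f"%{byte:02X}" for byte in ch.encode("utf-8"))
        for ch in text
    )


def typed_bytes_to_uri(raw: bytes, encoding: str = None) -> str:
    """Transcode raw input to characters first, then percent-encode as UTF-8."""
    if encoding is None:
        # Ask the OS/locale which encoding the input is likely to be in.
        encoding = locale.getpreferredencoding(False)
    text = raw.decode(encoding)
    return percent_encode_non_ascii(text)


# The same IRI path typed in two different legacy encodings:
print(typed_bytes_to_uri(b"D\xfcrst", "iso-8859-1"))
print(typed_bytes_to_uri("渋谷駅".encode("shift_jis"), "shift_jis"))
```

Whatever the input encoding, the resulting URI is the same, which is the point of detecting the encoding before conversion.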
Specification Update
- IETF specifications move forward in stages
- The IRI spec (RFC3987) is at Proposed Standard level
- The goal is to move the IRI spec to Draft Standard level
- Drafts of this update get published as draft-duerst-iri-bis-xx
- The URI spec (RFC3986) is at Standard level
The IETF Standards Track
IETF: Internet Engineering Task Force
Three standards levels:
- (full) Standard (RFC)
- Draft Standard (RFC)
- Proposed Standard (RFC)
RFC: Request for Comments (also: Experimental, Informational, Historical,
Obsolete)
For implementers, Proposed Standard is good enough, even just an RFC is
fine
Very few things make it to full Standard (currently 67, in: URIs,
out: SMTP)
Current Issues in draft-duerst-iri-bis
Issues list at http://www.w3.org/International/iri-edit
Mailing list is public-iri@w3.org
(archives at http://lists.w3.org/Archives/Public/public-iri/)
- Normalization: where and when to normalize
- Bidi limitations: needs fix for combining characters
- 'Legacy Extended' IRIs: syntax extension for some XML specs
- IRIs and HTML forms
- IRIs in context
- Coordination with ongoing IDN update
- Security issues
Open Issues
- Email Addresses
- URI schemes (http:, ftp:,...)
- Top-level domain names
Email Address Internationalization
- IETF formed EAI Working Group in March 2006
- Different approach from IRIs and IDNs:
- Sweeping upgrade from 7-bit only to UTF-8
- for basic protocol (SMTP)
- for email headers
- This means we will eventually get rid of things like
=?ISO-8859-1?Q?Patrik_F=E4ltstr=F6m?=
- Alternative address used in protocol for fallback
- First batch of specs soon available as experimental
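For readers unfamiliar with the header encoding shown above, the following sketch decodes that example with Python's standard email package and shows the plain UTF-8 bytes that would replace it. It only illustrates the existing encoded-word mechanism, not the EAI protocol extensions themselves.

```python
from email.header import decode_header, make_header

# The encoded-word example from the slide (ISO-8859-1, "Q" encoding).
encoded = "=?ISO-8859-1?Q?Patrik_F=E4ltstr=F6m?="

# decode_header splits the header into (bytes, charset) pairs;
# make_header joins them back into a single Unicode string.
decoded = str(make_header(decode_header(encoded)))
print(decoded)

# With UTF-8 used directly in headers, the name would simply be these bytes:
print(decoded.encode("utf-8"))
```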
Top-Level Domain Names
- Technically no problem, same on all levels
- ICANN is responsible, has been moving slowly (recent example.test evaluation)
Internationalization of URI Schemes
- The whole IRI will be non-ASCII, except http://
- Possible solutions:
- http:// can be left out in browsers
- Dropdown menu in browser, with entries in local language like:
- http (Web Page Address)
- ftp (File Exchange)
- mailto (Email Address)
- Local script mapping (if a Greek user starts to type ηττπ://, the browser converts it to http://)
- Scheme mapping? (NAPTR,...)
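The local script mapping idea could look roughly like the sketch below. It is purely illustrative: the alias table LOCAL_SCHEME_ALIASES and the function map_local_scheme are hypothetical, contain only the Greek example from the slide, and do not describe any existing browser behavior.

```python
# Hypothetical table of local-script spellings of scheme names.
LOCAL_SCHEME_ALIASES = {
    "ηττπ": "http",  # Greek letters typed on the same keys as h-t-t-p
}


def map_local_scheme(typed: str) -> str:
    """Replace a known local-script scheme spelling with the real scheme."""
    scheme, separator, rest = typed.partition("://")
    if not separator:
        return typed  # no scheme present, leave the input alone
    mapped = LOCAL_SCHEME_ALIASES.get(scheme.lower(), scheme)
    return mapped + separator + rest


print(map_local_scheme("ηττπ://www.résumé.jp/"))
```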
Conclusions & Outlook
- Are you IRI-ready?
- Are you IDN-ready?
- Comment on tests
- Comment on draft-duerst-iri-bis-xx (currently xx=01)
Q & A