Making Cents of Yens and Euros: Web 2.0


Making Cents of Yens and
Euros: Web 2.0
Internationalization
Achim Ruopp
http://www.digitalsilkroad.net/
© Copyright 2007 Achim Ruopp
Web 2.0 Expo 2007
Demo
A Currency Converter Application –
before and after
Web 2.0 Internationalization
Agenda

Introduction to Web Internationalization (i18n)
• Selecting and Persisting User Preferences
• Locales and Locale Identifiers
• Unicode
• Localization – Model and Tools

Multi-lingual Syndication
• RSS
• Atom

Client-side Scripting
• Javascript Internationalization
• Ajax

International Web Services Design
• REST
• SOAP
Intro to Web Internationalization
Language and Location
fr en;0.8
en-US
da-DK
Intro to Web Internationalization
User Preferences

Language
• HTTP Accept-Language header
• E.g.: en, fr-CA;q=0.8, fr;q=0.6
• Language negotiation with the server
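
A minimal sketch of server-side language negotiation, assuming the header uses the standard q=value weights. The function names parse_accept_language and negotiate, and the default of "en", are hypothetical and not taken from the demo code.

```python
def parse_accept_language(header):
    """Return (tag, q) pairs sorted by descending preference."""
    prefs = []
    for position, item in enumerate(header.split(",")):
        parts = [p.strip() for p in item.split(";")]
        tag = parts[0]
        if not tag:
            continue
        q = 1.0  # a tag without a q parameter has full weight
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed weight as "not acceptable"
        prefs.append((tag, q, position))
    # Highest weight first; ties keep the order given in the header.
    prefs.sort(key=lambda p: (-p[1], p[2]))
    return [(tag, q) for tag, q, _ in prefs]


def negotiate(header, supported, default="en"):
    """Pick the best supported language, falling back from fr-CA to fr."""
    lookup = {s.lower(): s for s in supported}
    for tag, q in parse_accept_language(header):
        if q <= 0:
            continue
        candidate = tag.lower()
        while candidate:
            if candidate in lookup:
                return lookup[candidate]
            # Drop the last subtag: "fr-ca" becomes "fr", "fr" becomes "".
            candidate = candidate.rpartition("-")[0]
    return default


print(negotiate("en, fr-CA;q=0.8, fr;q=0.6", ["fr", "de"]))
```

Here the client prefers English, but the site only offers French and German, so the fr-CA entry falls back to fr.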

Locale
• Cultural preferences for formatting, sorting etc.
• Infer from Accept-Language header
• Map IPv4 address to ccTLD (country code top-level domain)
  • Public information accessible through libraries, e.g. Perl IP::Country CPAN module
  • Commercial services offer more precision

Always provide option to change defaults
Store preferences in cookies
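
One way to persist the chosen language and locale in cookies, sketched with Python's standard http.cookies module. The cookie names lang and locale and the one-year lifetime are assumptions made for illustration.

```python
from http.cookies import SimpleCookie

ONE_YEAR = 365 * 24 * 60 * 60  # assumed lifetime, in seconds


def build_preference_cookie(language, locale_id):
    """Create Set-Cookie headers that remember the user's choice."""
    cookie = SimpleCookie()
    for name, value in (("lang", language), ("locale", locale_id)):
        cookie[name] = value
        cookie[name]["path"] = "/"
        cookie[name]["max-age"] = ONE_YEAR
    return cookie


def read_preferences(cookie_header, defaults):
    """Overlay stored preferences on top of the inferred defaults."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    prefs = dict(defaults)
    for name in prefs:
        if name in cookie:
            prefs[name] = cookie[name].value
    return prefs


print(build_preference_cookie("fr-CA", "fr_CA").output())
print(read_preferences("lang=fr-CA", {"lang": "en", "locale": "en_US"}))
```

The defaults passed to read_preferences would come from the Accept-Language header or the IP lookup. A stored cookie overrides them, so an explicit user choice wins over the inferred values.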
Intro to Web Internationalization
Internet Language Tags


IETF Language Tags (BCP 47)
Language[-Language]*3[-Script][-Region][-Variant]*[-Extension]*[-PrivateUse]*
Examples
• en-CA: English in Canada
• zh-Hant-TW: Chinese written in Traditional Chinese script, as used in Taiwan

Obsoletes RFC 3066 & RFC 1766
• Often still used in products/earlier standards
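
The tag structure can be illustrated with a deliberately simplified parser. This sketch only handles the language, script, region and variant subtags. Extended language subtags, extensions and private use are left out, so it is not a full BCP 47 implementation. The function name parse_language_tag is hypothetical.

```python
import re

# Simplified pattern: language[-script][-region][-variant]*
TAG_RE = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
    r"(?P<variants>(?:-(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*)$"
)


def parse_language_tag(tag):
    """Split a tag into subtags, or return None if the sketch cannot parse it."""
    match = TAG_RE.match(tag)
    if match is None:
        return None
    script = match.group("script")
    region = match.group("region")
    variants = [v.lower() for v in match.group("variants").split("-") if v]
    # Tags are case-insensitive; normalize to the conventional casing.
    return {
        "language": match.group("language").lower(),
        "script": script.title() if script else None,
        "region": region.upper() if region else None,
        "variants": variants,
    }


print(parse_language_tag("en-CA"))
print(parse_language_tag("zh-hant-tw"))
```

Because matching is case-insensitive, zh-hant-tw and zh-Hant-TW yield the same result. Only the conventional casing differs.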
Internationalization Changes
Intro to Web Internationalization
POSIX Locales

Cross-platform API
• Locale identifiers can have variations
  • Un*x: en_US
  • Windows: English_United States
• Results can be platform-dependent



Basis for locale functionality in all scripting languages
Provides functionality for
• Number Formatting: 1,000,000.23
• Date/Time Formatting: 8 Μάρτιος 2007 12:00:00 μμ
• Sorting
• String processing (e.g. upper-/lower-casing)
• Some translated strings like weekdays, yes/no messages
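
A small sketch of the platform dependence, using Python's locale module, which wraps the POSIX locale API. Which locale names are installed varies by system, so the helper (format_number, a hypothetical name) returns None instead of failing when a locale is missing.

```python
import locale


def format_number(value, locale_name):
    """Format value with grouping under a POSIX locale; None if unavailable."""
    saved = locale.setlocale(locale.LC_NUMERIC)  # query the current setting
    try:
        locale.setlocale(locale.LC_NUMERIC, locale_name)
    except locale.Error:
        return None  # this locale is not installed on this platform
    try:
        return locale.format_string("%.2f", value, grouping=True)
    finally:
        locale.setlocale(locale.LC_NUMERIC, saved)  # restore process-wide state


# The same request can succeed, fail, or format differently per platform.
for name in ("C", "en_US.UTF-8", "de_DE.UTF-8", "English_United States"):
    print(name, "->", format_number(1000000.23, name))
```

Only the "C" locale is guaranteed to exist everywhere, and it applies no digit grouping. Note also that setlocale changes process-wide state, which is awkward in a multi-threaded web server.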
Intro to Web Internationalization
International Components for Unicode


IBM Open Source project
Extensive locale data and APIs
• Data vetted as part of Common Locale Data Repository (CLDR) project

Java and C++ APIs
Wrappers for scripting languages
• PyICU (Python)
• ICU4R (Ruby) – abandoned?
• DIY – difficult because of API complexity and character encoding issues
Intro to Web Internationalization
Microsoft Internationalization APIs



Windows NLS API
Microsoft .NET Framework System.Globalization namespace
Similar set of data to ICU
• Data vetted by Microsoft subsidiaries

APIs accessible from all Microsoft programming languages
Intro to Web Internationalization
Unicode 5.0
99,024 of 1,114,112 code points (U+0000 to U+10FFFF) defined
[Figure: map of the Unicode code space. Planes: Basic Multilingual Plane (from 00000), Dead Languages & Math (from 10000), Han Characters (from 20000), …, Language Tags (from E0000), Private Use (from F0000 and 100000). Within the Basic Multilingual Plane (0000 to F000 in steps of 1000): Alphabets, Punctuation, Asian Languages, Han Characters, Yi, Hangul, Surrogates, Private Use, Legacy/Compatibility.]
Intro to Web Internationalization
Unicode Encoding Forms

Variable length: UTF-8/UTF-16
Fixed length: UTF-32
U+2122: ™: Trade Mark Sign
• UTF-8: 0xE2 0x84 0xA2 (11100010 10000100 10100010)
• UTF-16: 0x2122 (00100001 00100010)
• UTF-32: 0x00002122 (0…00100001 00100010)
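
The byte sequences above can be reproduced with Python's built-in codecs. This is a small illustration, with hex_bytes and encoding_forms as hypothetical helper names. The big-endian codec variants are used so that no byte order mark is added.

```python
def hex_bytes(data):
    """Render bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{byte:02X}" for byte in data)


def encoding_forms(char):
    """Encode one character in the three Unicode encoding forms."""
    return {
        "UTF-8": hex_bytes(char.encode("utf-8")),
        "UTF-16": hex_bytes(char.encode("utf-16-be")),
        "UTF-32": hex_bytes(char.encode("utf-32-be")),
    }


# U+2122 TRADE MARK SIGN: three bytes, one 16-bit unit, one 32-bit unit.
print(encoding_forms("\u2122"))

# A character beyond the Basic Multilingual Plane shows why UTF-16 is
# variable length: U+1D11E needs a surrogate pair (two 16-bit units).
print(encoding_forms("\U0001D11E"))
```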
Intro to Web Internationalization
Unicode on the Web


XML processors are required to process UTF-8/UTF-16
Encoding declaration precedence
1. HTTP Content-Type header charset declaration
2. XML encoding declaration (XHTML)
3. meta charset declaration in (X)HTML
4. link element charset attribute
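
The precedence list can be read as a first-match lookup. The sketch below assumes the four declarations have already been extracted from the response. Extracting them from markup is out of scope here. The function names and the utf-8 fallback are assumptions.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Extract the charset parameter from a Content-Type header, if any."""
    if not header_value:
        return None
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()  # None when no charset is declared


def resolve_encoding(content_type=None, xml_declaration=None,
                     meta_charset=None, link_charset=None, default="utf-8"):
    """Return the first declared encoding, in the order of the list above."""
    candidates = (
        charset_from_content_type(content_type),  # 1. HTTP header
        xml_declaration,                          # 2. XML encoding declaration
        meta_charset,                             # 3. meta charset
        link_charset,                             # 4. link charset attribute
    )
    for candidate in candidates:
        if candidate:
            return candidate.lower()
    return default  # assumed fallback when nothing is declared


print(resolve_encoding("text/html; charset=ISO-8859-1", meta_charset="utf-8"))
```

In this example the HTTP header wins over a conflicting meta declaration, which is a common source of the encoding errors mentioned on this slide.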


Approx. 4% of pages have encoding errors*
No real need for character references
• ü: write ü directly instead of &uuml; or &#252;
• Exceptions: <, >, &, "
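
As an illustration of this rule, Python's html.escape escapes only the markup-significant characters and leaves non-ASCII text such as ü untouched. The wrapper name to_html_text is hypothetical.

```python
import html


def to_html_text(text):
    """Escape only the characters that are significant in markup."""
    return html.escape(text, quote=True)


escaped = to_html_text('Zürich <b> & "München"')
print(escaped)

# The umlauts survive as real characters; serve the page as UTF-8 bytes.
body = escaped.encode("utf-8")
```

With quote=True, html.escape also escapes the single quote, which is harmless in attribute values.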

Use styles to control font selection
* source: Google presentation at IUC30
Demo
A Currency Converter Application –
globalized but not localized
Intro to Web Internationalization
Localization Recommendations
Avoid translatable text in graphics
Make sure graphics are culturally neutral
Avoid absolute sizing
Use HTML flow layout
Write complete sentences
Intro to Web Internationalization
Localization Model and Tools

Text translation
• Localization formats
  • HTML with template library
    • W3C Internationalization Tag Set (tool support?)
  • GNU gettext/PO
  • XLIFF - XML Localization Interchange File Format
• Localization tools
  • OmegaT
  • Open Language Tools (Sun)
  • The WordForge Project: Pootle
  • …

Searchability – Links/Sitemap
Demo
A Currency Converter Application –
fully internationalized Web 1.0
application
Client-side Scripting
Javascript Internationalization

ECMAScript edition 3 added a range of internationalization features (1999)
• Good support for Unicode processing
• Set of locale-sensitive functions
  • Dependent on host locale (i.e. browser)
• Set of locale-insensitive functions
• No locale-sensitive number or date/time parsing

JavaScript libraries with additional internationalization functionality
• Dojo Toolkit (i18n contributed by IBM)
• Microsoft AJAX Library
Client-side Scripting
AJAX Recommendations

Late globalization
• Transmit data in locale-independent form with XMLHttpRequest
• Might require some creative parsing/UI

Early localization
• Text localization server-side
• Browsers are missing a message-catalog facility
• Dynamically created page content is invisible to search engines
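
A sketch of what "locale-independent form" could mean on the server side of an XMLHttpRequest call: ISO 4217 currency codes, decimal strings with a plain "." separator, and UTC timestamps in ISO 8601. The function name rate_payload and the field names are hypothetical.

```python
import json
from datetime import datetime, timezone
from decimal import Decimal


def rate_payload(currency, rate, as_of):
    """Serialize an exchange rate without any locale-specific formatting."""
    return json.dumps(
        {
            "currency": currency,  # ISO 4217 code, not a localized name
            "rate": str(rate),     # Decimal keeps "1.1785" exact
            "asOf": as_of.astimezone(timezone.utc).isoformat(),
        },
        sort_keys=True,
    )


print(rate_payload("CAD", Decimal("1.1785"),
                   datetime(2007, 3, 8, 17, 0, tzinfo=timezone.utc)))
```

The client then formats the number and date for display according to the user's locale, which is the "late globalization" step.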
Multi-lingual Syndication
RSS 2.0

Character encoding
• RSS 2.0 is an XML application
• XML encoding rules apply

Language
• Element only on channel (feed), not on item
  • Create one channel per language
• Specified to comply with RFC 1766 language tags

Date/Time
• In standard RFC 822 format (including 4-digit years)
  • E.g. “Wed, 02 Oct 2002 08:00:00 EST”
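
RFC 822 dates always use English day and month names, regardless of the feed language, so they should not be produced with locale-sensitive formatting. A sketch using Python's email.utils, with rss_pub_date as a hypothetical helper name:

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def rss_pub_date(moment):
    """Format a timezone-aware datetime as an RFC 822 date in GMT."""
    return format_datetime(moment.astimezone(timezone.utc), usegmt=True)


print(rss_pub_date(datetime(2002, 10, 2, 13, 0, 0, tzinfo=timezone.utc)))

# Reading the example from the slide back into a datetime.
parsed = parsedate_to_datetime("Wed, 02 Oct 2002 08:00:00 EST")
print(parsed.isoformat())
```

Passing a naive datetime to rss_pub_date would make astimezone assume the server's local time zone, so the sketch expects aware values.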
Multi-lingual Syndication
Atom Syndication

More granular language marking
• xml:lang can be applied to any human readable text in the format
• Aggregators need to deal with this

Better date/time format: RFC 3339
• E.g. “2003-12-13T18:30:02-05:00”
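
The RFC 3339 example above can be produced directly from a timezone-aware datetime. This is a sketch, with atom_timestamp as a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone


def atom_timestamp(moment):
    """Format an aware datetime as an RFC 3339 timestamp."""
    if moment.tzinfo is None:
        raise ValueError("Atom timestamps need an explicit UTC offset")
    return moment.isoformat(timespec="seconds")


updated = datetime(2003, 12, 13, 18, 30, 2,
                   tzinfo=timezone(timedelta(hours=-5)))
print(atom_timestamp(updated))  # 2003-12-13T18:30:02-05:00
```

Unlike the RFC 822 format, this form contains no language-dependent names and sorts chronologically as plain text when all values share one offset.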

Acknowledgement: Tim Bray
Demo
A Currency Converter Application –
adding a syndication feed with
exchange rate information
International Web Services Design
Service Patterns
Locale Neutral
• Description: neutral data formats
• Request data: CAD
• Return data: 1.1785
Client Influenced
• Description: service reacts to client locale, e.g. HTTP Accept-Language
• Request data: CAD (Accept-Language: de)
• Return data: Kanadischer Dollar
Service Determined
• Description: service is locale-specific and ignores client preference
• Return data: 03/08/2007 12:00pm EST
Data Driven
• Description: service adjusts formatting and language to the locale the data refers to
• Request data: NOK; return data: norske kroner
• Request data: CHF; return data: ?
International Web Services Design
REST

REST naturally ties into i18n features in HTTP/HTML/XML
• Locale indicated with HTTP Accept-Language
• Encoding and language marking in markup

Special caution for HTTP GET parameters
• Locale-independent formatting recommended
• Text parameters
  • Encode in UTF-8 and escape in URIs
  • IRI (Internationalized Resource Identifier) functionality might provide this for you
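
A sketch of building GET parameters along these lines with Python's urllib.parse. The endpoint URL and parameter names are made up for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_query(base_url, params):
    """Percent-escape parameters as UTF-8 and append them to the URL."""
    return base_url + "?" + urlencode(params, encoding="utf-8")


url = build_query(
    "http://example.org/convert",            # hypothetical endpoint
    {
        "amount": "1000000.23",              # locale-independent number format
        "to": "CHF",
        "label": "Zürich",                   # text parameter with non-ASCII
    },
)
print(url)

# The receiving side decodes the escapes back to the original text.
print(parse_qs(urlsplit(url).query)["label"])
```

The amount is sent with a plain "." decimal separator and no grouping, so the service does not have to guess the client's number format.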
International Web Services Design
SOAP

Locale can be communicated in
• Transport header (e.g. HTTP)
• SOAP header
• SOAP message body

Beware of automatically generated SOAP interfaces
• Might be locale-dependent without allowing the locale to be specified

Use of XML Schema data types promotes locale-independence
Also consider localization of error messages
Conclusions

Unification
• One code base

Customization
• Localization and adaptation for locales

Next step: cross-language “leakage”
• Provide views in multiple languages to the same (user-generated) data
• Translate user-generated content
  • Volunteers
  • Machine Translation
Call for Contributions

Presentation and Perl CGI demo code
• http://www.digitalsilkroad.net/web2expo

Add a version in your preferred language
• Ruby on Rails
• PHP
• Python
• …

Similar ASP.NET application
• http://quickstarts.asp.net/QuickStartv20/aspnet/doc/localization/default.aspx
References

W3C Internationalization Activity
• http://www.w3.org/International/

POSIX Locale
• http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html

International Components for Unicode
• http://www-306.ibm.com/software/globalization/icu/

Unicode/Common Locale Data Repository
• http://www.unicode.org/

Microsoft Internationalization APIs
• http://msdn2.microsoft.com/en-us/library/ms776254.aspx
• http://msdn2.microsoft.com/en-us/library/system.globalization.aspx
References

OmegaT
• http://www.omegat.org/omegat/omegat_en/omegat.html

Open Language Tools
• https://open-language-tools.dev.java.net/

The WordForge Project
• http://www.wordforge.org/drupal/

Javascript Internationalization
• http://www.icu-project.org/docs/papers/internationalization_support_for_javascript.html

RSS 2.0
• http://www.rssboard.org/rss-specification

Atom Syndication
• http://www.atomenabled.org/developers/syndication

RSS 1.0
• http://web.resource.org/rss/1.0/spec

W3C Web Services Internationalization Usage Scenarios
• http://www.w3.org/TR/ws-i18n-scenarios/
Additional Slides
Multi-lingual Syndication
RSS 1.0

Character encoding
• RSS 1.0 is an XML application
• XML encoding rules apply

Complies with the RDF (Resource Description Framework) specification
• Definition of language and date/time formats is left to RDF metadata formats
  • Dublin Core Metadata Element Set
    • Language: RFC 1766/ISO 639-2
    • Date/Time: ISO 8601 (superset of RFC 3339)
  • Dublin Core also allows time periods to be specified!