Making Cents of Yens and Euros: Web 2.0 Internationalization
Achim Ruopp
http://www.digitalsilkroad.net/
© Copyright 2007 Achim Ruopp
Web 2.0 Expo 2007
Demo
A Currency Converter Application – before and after
Web 2.0 Internationalization
Agenda
Introduction to Web Internationalization (i18n)
• Selecting and Persisting User Preferences
• Locales and Locale Identifiers
• Unicode
• Localization – Model and Tools
Multi-lingual Syndication
• RSS
• Atom
Client-side Scripting
• Javascript Internationalization
• Ajax
International Web Services Design
• REST
• SOAP
Intro to Web Internationalization
Language and Location
[Figure: example language and locale preferences: fr, en;q=0.8 · en-US · da-DK]
Intro to Web Internationalization
User Preferences
Language
• HTTP Accept-Language header
• E.g.: en, fr-CA;q=0.8, fr;q=0.6
• Language negotiation with the server
Locale
• Cultural preferences for formatting, sorting etc.
• Infer from Accept-Language header
• Map IPv4 address to ccTLD (country code top-level domain)
Public information accessible through libraries
• E.g. Perl IP::Country CPAN module
Commercial services offer more precision
Always provide option to change defaults
Store preferences in cookies
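The language negotiation described on this slide can be sketched as follows. This is a minimal illustration, not the demo application's code: the function names `parse_accept_language` and `negotiate` are hypothetical, and the fallback rule (dropping trailing subtags, so fr-CA falls back to fr) is an assumed policy.

```python
def parse_accept_language(header):
    """Return the language tags of an Accept-Language header, best first."""
    prefs = []
    for position, item in enumerate(header.split(",")):
        parts = [p.strip() for p in item.split(";")]
        tag = parts[0]
        if not tag:
            continue
        quality = 1.0  # a tag without a q parameter has full weight
        for param in parts[1:]:
            if param.lower().startswith("q="):
                try:
                    quality = float(param[2:])
                except ValueError:
                    quality = 0.0  # treat a malformed weight as "not acceptable"
        if quality > 0:
            prefs.append((-quality, position, tag))
    # Sort by descending weight, then by position in the header.
    return [tag for _, _, tag in sorted(prefs)]


def negotiate(header, supported, default="en"):
    """Pick the first supported language, falling back from fr-CA to fr."""
    by_lower = {s.lower(): s for s in supported}
    for tag in parse_accept_language(header):
        candidate = tag.lower()
        while candidate:
            if candidate in by_lower:
                return by_lower[candidate]
            candidate = candidate.rpartition("-")[0]  # drop the last subtag
    return default


print(negotiate("en, fr-CA;q=0.8, fr;q=0.6", ["fr", "de"]))
```

The result is only a default. As the slide says, the user should always be able to override it, and the override can then be stored in a cookie.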
Intro to Web Internationalization
Internet Language Tags
IETF Language Tags (BCP 47)
Language[-Language]*3[-Script][-Region][-Variant]*[-Extension]*[-PrivateUse]*
Examples
• en-CA: English in Canada
• zh-Hant-TW: Chinese written in traditional Chinese script as used in Taiwan
Obsoletes RFC 3066 & RFC 1766
• Often still used in products/earlier standards
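To make the tag structure concrete, here is a deliberately simplified sketch that splits a tag into language, script and region only. It ignores extended language subtags, variants, extensions and private use, so it is not a full BCP 47 parser. The name `split_tag` is hypothetical.

```python
import re

# Simplified pattern: language, optional 4-letter script, optional region
# (2 letters or 3 digits). Everything else in BCP 47 is left out on purpose.
_TAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?$"
)


def split_tag(tag):
    """Split a simple language tag and normalize it to conventional casing."""
    match = _TAG.match(tag)
    if match is None:
        return None
    parts = match.groupdict()
    return {
        "language": parts["language"].lower(),                         # zh
        "script": parts["script"].title() if parts["script"] else None,  # Hant
        "region": parts["region"].upper() if parts["region"] else None,  # TW
    }


print(split_tag("zh-hant-tw"))
```

Tags are case-insensitive, so the casing produced here (lowercase language, title-case script, uppercase region) is only the conventional presentation.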
Internationalization Changes
Intro to Web Internationalization
POSIX Locales
Cross-platform API
• Locale-identifiers can have variations
Un*x: en_US
Windows: English_United States
• Results can be platform-dependent
Basis for locale functionality in all scripting languages
Provides functionality for
• Number Formatting: 1,000,000.23
• Date/Time Formatting: 8 Μάρτιος 2007 12:00:00 μμ
• Sorting
• String processing (e.g. upper-/lower-casing)
• Some translated strings like weekdays, yes/no messages
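Because locale identifiers differ between platforms, code that uses the POSIX locale API often has to try several names. The sketch below uses Python's standard `locale` module. The helper `format_amount` and the candidate names passed to it are illustrative assumptions, and which names succeed depends on the locales installed on the machine.

```python
import locale


def format_amount(value, candidates):
    """Format value with the first locale name this platform accepts.

    Returns (locale_name, formatted_text), or (None, None) if no candidate
    is installed. The previous numeric locale is always restored.
    """
    saved = locale.setlocale(locale.LC_NUMERIC)  # query the current setting
    try:
        for name in candidates:
            try:
                locale.setlocale(locale.LC_NUMERIC, name)
            except locale.Error:
                continue  # this identifier is unknown on this platform
            return name, locale.format_string("%.2f", value, grouping=True)
        return None, None
    finally:
        locale.setlocale(locale.LC_NUMERIC, saved)


# Try the Un*x spelling first, then the Windows spelling, then the C locale.
print(format_amount(1000000.23, ["en_US.UTF-8", "English_United States", "C"]))
```

Note that `setlocale` changes process-wide state, which is one reason results can be platform-dependent and why a multi-threaded web application may prefer a library such as ICU.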
Intro to Web Internationalization
International Components for Unicode
IBM Open Source project
Extensive locale data and APIs
• Data vetted as part of Common Locale Data Repository (CLDR) project
Java and C++ APIs
Wrappers for scripting languages
• PyICU (Python)
• ICU4R (Ruby) – abandoned?
• DIY – difficult because of API complexity and character encoding issues
Intro to Web Internationalization
Microsoft Internationalization APIs
Windows NLS API
Microsoft .NET Framework
System.Globalization namespace
Similar set of data to ICU
• Data vetted by Microsoft subsidiaries
APIs accessible from all Microsoft programming languages
Intro to Web Internationalization
Unicode 5.0
99,024 of 1,114,112 code points (U+0000 to U+10FFFF) defined
[Figure: map of the Unicode code space]
Planes: 00000 Basic Multilingual Plane · 10000 Dead Languages & Math · 20000 Han Characters · 30000 … · E0000 Language Tags · F0000 and 100000 Private Use
Basic Multilingual Plane (0000 to F000): Alphabets · Punctuation · Asian Languages · Han Characters · Yi · Hangul · Surrogates · Private Use · Legacy/Compatibility
Intro to Web Internationalization
Unicode Encoding Forms
Variable length: UTF-8/UTF-16
Fixed length: UTF-32
U+2122: ™: Trade Mark Sign
Encoding   Code units        Bits
UTF-8      0xE2 0x84 0xA2    11100010 10000100 10100010
UTF-16     0x2122            00100001 00100010
UTF-32     0x00002122        0…00100001 00100010
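The byte values for U+2122 can be reproduced with Python's built-in codecs. This is a small sketch: the helper name `code_units` is hypothetical, and the big-endian codec variants are chosen so that no byte order mark is added.

```python
def code_units(char):
    """Return the bytes of one character in the three Unicode encoding forms."""
    return {
        "UTF-8": char.encode("utf-8").hex(" "),
        "UTF-16": char.encode("utf-16-be").hex(" "),  # big-endian, no BOM
        "UTF-32": char.encode("utf-32-be").hex(" "),  # big-endian, no BOM
    }


for form, data in code_units("\u2122").items():  # U+2122 TRADE MARK SIGN
    print(f"{form:7} {data}")
```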
Intro to Web Internationalization
Unicode on the Web
XML processors are required to process UTF-8/UTF-16
Encoding declaration precedence
1. HTTP Content-Type header charset declaration
2. XML encoding declaration (XHTML)
3. meta charset declaration in (X)HTML
4. link element charset attribute
Approx. 4% of pages have encoding errors*
No real need for character references
• ü: &uuml; or &#252;
• Exceptions: <, >, &, "
Use styles to control font selection
* source: Google presentation at IUC30
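The precedence list above can be sketched as a lookup that stops at the first declaration it finds. This is a rough illustration under simplifying assumptions: the regular expressions only cover common spellings, only the first 1024 bytes are inspected, and the link element charset attribute (item 4) is left out. The function name `declared_encoding` is hypothetical.

```python
import re


def declared_encoding(content_type, body):
    """Return the declared encoding of a page in order of precedence, or None."""
    # 1. HTTP Content-Type header charset parameter
    found = re.search(r'charset\s*=\s*"?([\w.:-]+)', content_type or "", re.I)
    if found:
        return found.group(1).lower()

    # Declarations are ASCII-compatible in the common encodings, so a lossy
    # ASCII decode of the first bytes is enough to look for them.
    head = body[:1024].decode("ascii", errors="replace")

    # 2. XML encoding declaration (XHTML)
    found = re.match(r'\s*<\?xml[^>]*?encoding\s*=\s*["\']([\w.:-]+)["\']', head)
    if found:
        return found.group(1).lower()

    # 3. meta charset declaration in (X)HTML
    found = re.search(r'<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)', head, re.I)
    if found:
        return found.group(1).lower()

    return None


page = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"></head></html>'
print(declared_encoding("text/html", page))
```

When the header and the markup disagree, the header wins, which is one plausible source of the encoding errors mentioned above.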
Demo
A Currency Converter Application – globalized but not localized
Intro to Web Internationalization
Localization Recommendations
Avoid translatable text in graphics
Make sure graphics are culturally neutral
Avoid absolute sizing
Use HTML flow layout
Write complete sentences
Intro to Web Internationalization
Localization Model and Tools
Text translation
• Localization formats
HTML with template library
• W3C Internationalization Tag Set (tool support?)
GNU gettext/PO
XLIFF - XML Localization Interchange File Format
• Localization tools
OmegaT
Open Language Tools (Sun)
The WordForge Project: Pootle
…
Searchability – Links/Sitemap
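As an illustration of the gettext/PO model, the sketch below loads a message catalog with Python's standard `gettext` module and translates one complete sentence with named placeholders. The domain name `converter` and the `locale` directory are hypothetical. With `fallback=True` the original English string is returned when no compiled catalog is found.

```python
import gettext


def load_catalog(language, localedir="locale", domain="converter"):
    """Load the compiled .mo catalog for a language, or fall back to the source text."""
    return gettext.translation(
        domain, localedir=localedir, languages=[language], fallback=True
    )


catalog = load_catalog("de")
_ = catalog.gettext

# One complete sentence with named placeholders: the translator can reorder
# the parts freely, which is not possible with concatenated fragments.
message = _("Convert {amount} {source} to {target}").format(
    amount="100", source="CAD", target="EUR"
)
print(message)
```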
Demo
A Currency Converter Application – fully internationalized Web 1.0 application
Client-side Scripting
Javascript Internationalization
ECMAScript edition 3 added a range of internationalization features (1999)
• Good support for Unicode processing
• Set of locale-sensitive functions
Dependent on host locale (i.e. browser)
• Set of locale-insensitive functions
• No number or date/time parsing
Javascript libraries with additional internationalization functionality
• Dojo Toolkit (i18n contributed by IBM)
• Microsoft AJAX Library
Client-side Scripting
AJAX Recommendations
Late globalization
• Transmit data in locale-independent form with XMLHttpRequest
• Might require some creative parsing/UI
Early localization
• Text localization server-side
• Browsers are missing a message-catalog facility
• Dynamically created page content is invisible to search engines
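A server-side sketch of "late globalization": the response carries data in a locale-independent form and leaves formatting to the client. The JSON shape, the field names and the function `rate_payload` are assumptions made for illustration. The slides do not prescribe a wire format.

```python
import json
from datetime import datetime, timezone
from decimal import Decimal


def rate_payload(currency, rate, as_of):
    """Serialize an exchange rate without any locale-specific formatting."""
    return json.dumps(
        {
            "currency": currency,  # currency code rather than a localized name
            "rate": str(rate),     # "." as decimal separator, no grouping
            # as_of must be timezone-aware; it is normalized to UTC
            "asOf": as_of.astimezone(timezone.utc).isoformat(),
        },
        sort_keys=True,
    )


print(rate_payload("CAD", Decimal("1.1785"),
                   datetime(2007, 3, 8, 17, 0, tzinfo=timezone.utc)))
```

The rate is sent as a string so that the client decides how to parse and display it, instead of inheriting a server-side number format.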
Multi-lingual Syndication
RSS 2.0
Character encoding
• RSS 2.0 is an XML application
• XML encoding rules apply
Language
• Element only on channel (feed), not on item
Create one channel per language
• Specified to comply with RFC 1766 language tags
Date/Time
• In standard RFC 822 format (including 4-digit years)
E.g. “Wed, 02 Oct 2002 08:00:00 EST”
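An RFC 822 style date uses fixed English day and month names, so it should not be produced with locale-sensitive formatting. A small sketch using Python's `email.utils`, which emits these names independently of the host locale. The helper name `rss_pub_date` is hypothetical, and normalizing to GMT is a design choice rather than a requirement of RSS.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def rss_pub_date(moment):
    """Format a timezone-aware datetime as an RFC 822 style date in GMT."""
    return format_datetime(moment.astimezone(timezone.utc), usegmt=True)


print(rss_pub_date(datetime(2002, 10, 2, 13, 0, 0, tzinfo=timezone.utc)))
```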
Multi-lingual Syndication
Atom Syndication
More granular language marking
• xml:lang can be applied to any human readable text in the format
• Aggregators need to deal with this
Better date/time format: RFC 3339
• E.g. “2003-12-13T18:30:02-05:00”
Acknowledgement: Tim Bray
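The per-element language marking and the RFC 3339 timestamp can be sketched with Python's standard ElementTree. This builds only a fragment (a complete Atom entry needs further elements such as an id), and the function name `atom_entry` and the sample texts are illustrative assumptions.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone

ATOM = "http://www.w3.org/2005/Atom"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"  # serialized as xml:lang

ET.register_namespace("", ATOM)  # write Atom elements without a prefix


def atom_entry(title, title_lang, summary, summary_lang, updated):
    """Build an entry fragment whose title and summary carry their own xml:lang."""
    entry = ET.Element(f"{{{ATOM}}}entry")
    ET.SubElement(entry, f"{{{ATOM}}}title", {XML_LANG: title_lang}).text = title
    ET.SubElement(entry, f"{{{ATOM}}}summary", {XML_LANG: summary_lang}).text = summary
    # isoformat() on a timezone-aware datetime yields an RFC 3339 style timestamp
    ET.SubElement(entry, f"{{{ATOM}}}updated").text = updated.isoformat()
    return ET.tostring(entry, encoding="unicode")


eastern = timezone(timedelta(hours=-5))
print(atom_entry("Wechselkurse", "de", "Daily exchange rates", "en",
                 datetime(2003, 12, 13, 18, 30, 2, tzinfo=eastern)))
```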
Demo
A Currency Converter Application – adding a syndication feed with exchange rate information
International Web Services Design
Service Patterns
Pattern            | Description                                                          | Request data              | Return data
Locale Neutral     | Neutral data formats                                                 | CAD                       | 1.1785
Client Influenced  | Service reacts to client locale, e.g. HTTP Accept-Language           | CAD (Accept-Language: de) | Kanadischer Dollar
Service Determined | Service is locale-specific and ignores client preference             |                           | 03/08/2007 12:00pm EST
Data Driven        | Service adjusts formatting and language to locale the data refers to | NOK                       | norske kroner
                   |                                                                      | CHF                       | ?
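The difference between the locale-neutral and the client-influenced pattern can be sketched in a few lines. The lookup table `NAMES` and the function `describe_currency` are hypothetical, and for brevity this sketch reads the header in listed order and ignores quality weights.

```python
# Hypothetical name table; a real service would draw on locale data such as CLDR.
NAMES = {
    "CAD": {"en": "Canadian Dollar", "de": "Kanadischer Dollar"},
}


def describe_currency(code, accept_language=None):
    """Return a localized currency name if the client states a preference.

    Without a preference the neutral code is returned (locale-neutral pattern);
    with one, the first matching language wins (client-influenced pattern).
    """
    if not accept_language:
        return code
    for item in accept_language.split(","):
        language = item.split(";")[0].strip().lower().split("-")[0]
        name = NAMES.get(code, {}).get(language)
        if name:
            return name
    return code  # no translation available: fall back to the neutral code


print(describe_currency("CAD"))
print(describe_currency("CAD", "de"))
```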
International Web Services Design
REST
REST naturally ties into i18n features in HTTP/HTML/XML
• Locale indicated with HTTP Accept-Language
• Encoding and language marking in markup
Special caution for HTTP GET parameters
• Locale-independent formatting recommended
• Text parameters
Encode in UTF-8 and escape in URIs
IRI (Internationalized Resource Identifier) functionality might provide this for you
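Encoding text parameters in UTF-8 and percent-escaping them can be sketched with Python's `urllib.parse`. The base URL and the parameter names are made up for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_query(base_url, params):
    """Append GET parameters, UTF-8 encoded and percent-escaped."""
    return base_url + "?" + urlencode(params, encoding="utf-8")


url = build_query(
    "http://example.org/convert",  # hypothetical service endpoint
    {"to": "CAD", "amount": "1000.50", "city": "Zürich"},
)
print(url)

# The receiving side reverses the escaping, again assuming UTF-8.
print(parse_qs(urlsplit(url).query, encoding="utf-8"))
```

The amount is passed as 1000.50 rather than in a user-facing format such as 1.000,50, in line with the recommendation to keep GET parameters locale-independent.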
International Web Services Design
SOAP
Locale can be communicated in
• Transport header (e.g. HTTP)
• SOAP header
• SOAP message body
Beware of automatically generated SOAP interfaces
• Might be locale-dependent without allowing the client to specify a locale
Use of XML Schema data types promotes locale-independence
Also consider localization of error messages
Conclusions
Unification
• One code base
Customization
• Localization and adaptation for locales
Next step: cross-language “leakage”
• Provide views in multiple languages to the
same (user-generated) data
• Translate user-generated content
Volunteers
Machine Translation
Call for Contributions
Presentation and Perl CGI demo code
• http://www.digitalsilkroad.net/web2expo
Add a version in your preferred language
• Ruby on Rails
• PHP
• Python
• …
Similar ASP.NET application
• http://quickstarts.asp.net/QuickStartv20/aspnet/doc/localization/default.aspx
References
W3C Internationalization Activity
• http://www.w3.org/International/
POSIX Locale
• http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html
International Components for Unicode
• http://www-306.ibm.com/software/globalization/icu/
Unicode/Common Locale Data Repository
• http://www.unicode.org/
Microsoft Internationalization APIs
• http://msdn2.microsoft.com/en-us/library/ms776254.aspx
• http://msdn2.microsoft.com/en-us/library/system.globalization.aspx
References
OmegaT
• http://www.omegat.org/omegat/omegat_en/omegat.html
Open Language Tools
• https://open-language-tools.dev.java.net/
The WordForge Project
• http://www.wordforge.org/drupal/
Javascript Internationalization
• http://www.icu-project.org/docs/papers/internationalization_support_for_javascript.html
RSS 2.0
• http://www.rssboard.org/rss-specification
Atom Syndication
• http://www.atomenabled.org/developers/syndication
RSS 1.0
• http://web.resource.org/rss/1.0/spec
W3C Web Services Internationalization Usage Scenarios
• http://www.w3.org/TR/ws-i18n-scenarios/
Additional Slides
Multi-lingual Syndication
RSS 1.0
Character encoding
• RSS 1.0 is an XML application
• XML encoding rules apply
Complies with the RDF (Resource Description Framework) specification
• Definition of language and date/time formats is left to RDF metadata formats
Dublin Core Metadata Element Set
Language: RFC 1766/ISO 639-2
Date/Time: ISO 8601 (superset of RFC 3339)
• Dublin Core also allows specifying time periods!