Unicode & Encoding


Unicode & W3C
Jataayu Software
C. Kumar
January 2007
Agenda
About Jataayu
Unicode & Encoding
W3C Specification for multi-lingual authoring
Multilingual WEB Address
Indian WEB Sites an Overview
W3C Activity
About Jataayu
Jataayu was formed with a clear focus on delivering solutions for
wireless data services
Over 60% of the data traffic in Indian Mobile Networks for
WAP, Mobile WEB and MMS is handled by Jataayu products
Mobile Device Solution Division focusing on wireless data
applications like WAP, MMS, SyncML, IMPS, Email, Web
Browsing, Download
Active participants in OMA, W3C and MWI
Over 350 people strong with offices in UK, Singapore, Korea,
Taiwan and the US; headquartered in India with major
development center in Bangalore
Localization - Internationalization
Localization (l10n)
Adaptation of the content to meet the language, cultural
and other requirements of a specific target market
Internationalization (i18n)
Design & Development of the content that enables easy
localization for target audiences that vary in culture, region
or language.
The mission of the W3C i18n Activity is to ensure that the W3C’s
formats and protocols are usable worldwide in all
languages and in all writing systems.
Need for Unicode
Early character sets were based on 7 bits, which gave 2^7 (i.e. 128)
possible characters
Adding the 8th bit gave a total of 256 possible
characters. Still not enough for all the European
languages.
The code page mechanism helped a little by changing the
upper cells (0xA0 to 0xFF), but was very complex.
Addressing the needs of other languages requires
thousands of ideographic characters at a time.
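To illustrate why code pages were hard to work with, the Python sketch below decodes one and the same byte under three ISO 8859 code pages. The byte value 0xE4 and the three code pages are illustrative choices, not taken from the slides.

```python
# One byte from the upper range, decoded under three different code pages.
raw = bytes([0xE4])

western = raw.decode("iso8859_1")   # Latin-1 (Western European)
cyrillic = raw.decode("iso8859_5")  # Latin/Cyrillic
greek = raw.decode("iso8859_7")     # Latin/Greek

# The same stored byte turns into three unrelated characters.
for name, ch in [("ISO 8859-1", western),
                 ("ISO 8859-5", cyrillic),
                 ("ISO 8859-7", greek)]:
    print(f"{name}: {ch} (U+{ord(ch):04X})")
```

Without knowing which code page was intended, a reader of the byte cannot tell which of the three characters the author meant.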
Unicode & Encoding
Unicode, the universal character set, contains all the
characters needed for writing the majority of
living languages in use on computers.
Allows for simple display and storage of multilingual
content
An encoding refers to the way that characters
(Unicode code points) are mapped from the
character set to actual byte values.
Different encodings yield different byte sequences.
Unicode & Encoding
UTF-8 (Unicode Transformation Format)
Variable length 8-bit character encoding for Unicode
Able to represent any universal character in the
Unicode Standard
Uses one to four bytes to encode a Unicode symbol
Only one byte is needed to encode the US-ASCII
characters
Unicode & Encoding
UTF-16 (16-bit Unicode Transformation Format)
Variable length 16-bit character encoding for Unicode
Uses two or four byte sequence to encode a Unicode
symbol
Two bytes are required to encode each US-ASCII character
UCS-2 (2-byte Universal Character Set)
Fixed length encoding that always encodes characters into
a single 16-bit value
It can encode characters in the range 0x0000 to 0xFFFF
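The difference between UTF-16 and UCS-2 can be sketched in a few lines of Python. The helper name utf16_code_units is hypothetical, and the function does no validation of its input; it only shows how a code point above 0xFFFF is split into two 16-bit units (a surrogate pair), which UCS-2 cannot do.

```python
def utf16_code_units(code_point: int) -> list:
    """Return the UTF-16 code units (one or two 16-bit values) for a code point."""
    if code_point < 0x10000:
        return [code_point]           # fits in one 16-bit unit, same as UCS-2
    offset = code_point - 0x10000     # remaining 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits -> low surrogate
    return [high, low]

print([hex(u) for u in utf16_code_units(0x0041)])   # one unit
print([hex(u) for u in utf16_code_units(0x233B4)])  # two units (surrogate pair)
```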
Unicode & Encoding
UCS-4 / UTF-32 (32-bit Unicode Transformation
Format)
Fixed length 32-bit character encoding for Unicode
Every character uses 4 bytes, which makes it very space
inefficient
Little used in practice with UTF-8 and UTF-16 being the
normal ways of encoding Unicode Text
http://www.unicode.org/
Unicode & Encoding
Devanagari (0x0900 – 0x097F)
Bengali (0x0980 – 0x09FF)
Tamil (0x0B80 – 0x0BFF)
Kannada (0x0C80 – 0x0CFF)
Code Point   UTF-8         UTF-16        UTF-32
U+0041       41            00 41         00 00 00 41
U+05D0       D7 90         05 D0         00 00 05 D0
U+597D       E5 A5 BD      59 7D         00 00 59 7D
U+233B4      F0 A3 8E B4   D8 4C DF B4   00 02 33 B4
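The byte sequences for these four code points can be reproduced with Python's built-in codecs. This is a minimal sketch; it assumes big-endian byte order for UTF-16 and UTF-32 (which matches the values shown) and uses a hypothetical helper named hex_bytes for formatting.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated upper-case hex pairs."""
    return " ".join(f"{b:02X}" for b in data)

code_points = [0x0041, 0x05D0, 0x597D, 0x233B4]

rows = []
for cp in code_points:
    ch = chr(cp)
    rows.append((
        f"U+{cp:04X}",
        hex_bytes(ch.encode("utf-8")),
        hex_bytes(ch.encode("utf-16-be")),  # big-endian, no byte order mark
        hex_bytes(ch.encode("utf-32-be")),  # big-endian, no byte order mark
    ))

for row in rows:
    print(" | ".join(row))
```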
Unicode & Encoding
An alternate way to represent a character is by using
an escape value (e.g. &#x5D0; for א).
Not all documents have to be encoded as Unicode
But documents can only contain characters defined by
the Unicode Standard
Any encoding can be used as long as it is properly
declared and its character set is a subset of Unicode
Unicode encoding also allows many more languages
to be mixed on a single page
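As a small illustration of escapes, the Python sketch below converts between a character and its numeric character reference using only the standard library. The sample string is an assumption made for the example.

```python
import html

aleph = "\u05d0"  # Hebrew letter alef, U+05D0

# Decode a hexadecimal numeric character reference back to the character.
decoded = html.unescape("&#x5D0;")

# Encode text for an ASCII-only document: characters outside ASCII
# are replaced by decimal numeric character references.
escaped = ("W3C " + aleph).encode("ascii", errors="xmlcharrefreplace")

print(decoded == aleph)
print(escaped)
```

The escaped form keeps the document within a limited encoding while still referring to a Unicode character.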
Other Encoding formats …
Shift_JIS (SJIS), character encoding for the
Japanese Language
Single byte character encoding for the lower-ASCII
characters (0x00 – 0x7F)
Double-byte character encoding for the upper-ASCII bytes
GB2312, character encoding for simplified
Chinese characters
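The single-byte and double-byte behaviour of these legacy encodings can be observed with Python's bundled codecs. The sample strings below are assumptions chosen for the example: one ASCII letter followed by two ideographs.

```python
samples = {
    "shift_jis": "A\u65e5\u672c",  # "A" followed by two kanji
    "gb2312": "A\u4e2d\u6587",     # "A" followed by two simplified Chinese characters
}

lengths = {}
for codec, text in samples.items():
    legacy = text.encode(codec)
    utf8 = text.encode("utf-8")
    lengths[codec] = (len(legacy), len(utf8))
    # Round trip: decoding with the same codec restores the text.
    assert legacy.decode(codec) == text
    print(codec, legacy.hex(" "), "| UTF-8:", utf8.hex(" "))
```

In both cases the ASCII letter takes one byte and each ideograph takes two, while UTF-8 needs three bytes per ideograph.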
W3C Specification - Encoding
W3C specification for multi-lingual authoring
The encoding of the document needs to be declared, so that the
application that consumes it can interpret it.
Meta Tag
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
XML
<?xml version="1.0" encoding="UTF-8"?>
Content-type header returned from the WEB server should
also contain the character encoding of the document
Content-Type: text/html; Charset=utf-8
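A consuming application might read the declared encoding roughly as follows. This is a sketch using Python's standard email.message module to parse the header value; the function name charset_from_content_type and the UTF-8 fallback are assumptions made for the example, not something the specifications prescribe.

```python
from email.message import Message

def charset_from_content_type(header_value: str, default: str = "utf-8") -> str:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # Parameter names are matched case-insensitively; the result is lower-cased.
    return msg.get_content_charset(default)

body = "\u0916\u094b\u091c".encode("utf-8")  # bytes as they might arrive from a server
charset = charset_from_content_type("text/html; Charset=utf-8")
text = body.decode(charset)
print(charset, text)
```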
W3C Specification - Language
Author needs to specify the language of the
document (web page content)
Browsers can choose an appropriate font
using the lang attribute
Search engines can group or filter results based on
the user’s linguistic preferences (using meta)
Translation tools use it to recognize the sections of text
in a particular language
W3C Specification - Language
HTTP Content Language Header
Content-Language: hi
Language Attribute on html tag
<html lang="hi">
<html xml:lang="hi">
Content Language in meta tag
<meta http-equiv="Content-Language" content="hi" />
Language attribute on embedded content
<div lang="en" xml:lang="en">Some English Content</div>
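How a tool might pick up these language declarations can be sketched with Python's standard html.parser module. The class name LangCollector and the sample page are assumptions made for the example.

```python
from html.parser import HTMLParser

class LangCollector(HTMLParser):
    """Collect (tag, language) pairs for every element that declares a language."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # Prefer lang and fall back to xml:lang when only that is present.
        lang = attributes.get("lang") or attributes.get("xml:lang")
        if lang:
            self.found.append((tag, lang))

page = ('<html lang="hi"><body>'
        '<div lang="en" xml:lang="en">Some English Content</div>'
        '</body></html>')

collector = LangCollector()
collector.feed(page)
print(collector.found)
```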
What value to use for lang?
IANA (Internet Assigned Numbers Authority)
Provides a unique value for each language
It is available as the Subtag value in the new IANA Language
Subtag Registry
http://www.iana.org/assignments/language-subtag-registry
Hindi – hi, Kannada – kn, Tamil – ta
Bi-directional text
Information beyond the language attribute is required
to support right-to-left scripts
(like Arabic, Hebrew, Urdu)
In HTML, the dir attribute is used to specify the
direction of the text
The title says "<span dir="rtl">פעילות הבינאום, W3C</span>" in Hebrew.
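Whether a piece of text needs dir="rtl" can be estimated from the bidirectional class of its characters. The sketch below uses Python's unicodedata module and a simplified first-strong-character heuristic; the function name base_direction and the "ltr" default are assumptions made for the example, and this is not the full Unicode bidirectional algorithm.

```python
import unicodedata

def base_direction(text: str) -> str:
    """Guess "rtl" or "ltr" from the first strongly directional character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):   # Hebrew, Arabic and similar letters
            return "rtl"
        if bidi_class == "L":           # Latin, Devanagari and similar letters
            return "ltr"
    return "ltr"  # assumption: default when no strong character is found

print(base_direction("\u05e4\u05e2\u05d9\u05dc\u05d5\u05ea, W3C"))
print(base_direction("W3C"))
```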
Multilingual WEB Address
A Web address is used to point to a resource on the
WEB
Web addresses are typically expressed using URIs (Uniform
Resource Identifiers)
URIs are restricted to a small number of characters (upper & lower
case letters of the English alphabet, numerals and a few
symbols).
Users' expectations and use of the Internet have
outgrown this restriction.
There is a growing need to use characters of any language in
WEB Addresses.
Multilingual WEB Address …
A Web address in your own language and
alphabet is easier to create, memorize, interpret
and relate to. (Ex: http://खोज.com)
Punycode is a way of representing Unicode
code points using only ASCII characters. (Ex:
http://xn--21bm4l.com)
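The conversion in this example can be reproduced with the codecs that ship with Python. Note the assumption: the built-in "idna" codec follows the older IDNA 2003 rules, which is sufficient for this label but may differ from what current browsers do for other names.

```python
label = "\u0916\u094b\u091c"  # the Devanagari label from the example address

# Raw Punycode of a single label (without the "xn--" prefix).
raw = label.encode("punycode")

# Full host name conversion: each non-ASCII label gets the "xn--" prefix.
ascii_host = (label + ".com").encode("idna")

print(raw, ascii_host)
# Decoding restores the original Unicode host name.
print(ascii_host.decode("idna") == label + ".com")
```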
Indian Content an Overview
Most Indian Websites are not using Unicode
Content is generated within the ASCII range, and sites
provide proprietary fonts which map the ASCII
character set to Indian languages.
Visually it will be fine, but no other entity will be
able to interpret it
For each site, the user may need to download the
proprietary fonts, which is not user friendly
Search engines will not be able to interpret the
content as intended by the author, as it does not
follow the standard encoding.
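The problem with font-encoded content can be sketched as follows. The glyph table LEGACY_FONT_MAP is entirely hypothetical (every proprietary font has its own mapping, and real conversions often need reordering rules as well); it only illustrates that the stored text is plain ASCII until it is converted to Unicode.

```python
# Hypothetical glyph table: which Devanagari character an imaginary legacy
# font draws for each ASCII letter. Real proprietary fonts each differ.
LEGACY_FONT_MAP = {"K": "\u0916", "o": "\u094b", "j": "\u091c"}

def font_encoded_to_unicode(text: str, table: dict) -> str:
    """Replace each font-encoded ASCII character with its Unicode equivalent."""
    return "".join(table.get(ch, ch) for ch in text)

stored = "Koj"                # what the page actually contains
query = "\u0916\u094b\u091c"  # what a user would type into a search engine

print(query in stored)                                            # search fails
print(query in font_encoded_to_unicode(stored, LEGACY_FONT_MAP))  # works after conversion
```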
Indian Content an Overview
Unicode & W3C Importance
WEB is also moving towards
the mobile
W3C Mobile Web Initiative
(MWI) defines the best
practices for Mobile Browsing
Required fonts cannot be
installed at run time, as is
done on the desktop
If Unicode characters are
used, the required font may
already be available on the
device
Firefox
Firefox (http://www.getfirefox.com)
Provides extensive support for Unicode and related
fonts
Provides add-ons for typing Indian languages
in web pages on Linux. (Such tools are already
available to Windows XP users through the
language packs)
https://addons.mozilla.org/firefox/5484/author/
W3C i18n activity
Core Working group
Enable universal access to the World Wide Web by
providing adequate support to other W3C Working Groups
GEO (Guidelines, Education & Outreach)
Make the internationalization aspects of W3C technology better
understood and more widely and consistently used
ITS (Internationalization Tag Set)
Develop a set of elements and attributes that can be used
with new DTDs/Schemas to support the internationalization
and localization of documents
Thanks