Corporate Template August 2001 - Inter
Download
Report
Transcript Corporate Template August 2001 - Inter
Creating Multi-Lingual and
Multi-Locale Databases
International Unicode Conference 19
Presented by Addison Phillips
Globalization Architect
webMethods, Inc.
Introduction
Audience:
Presenter:
Beginning Developers
Addison Phillips
Globalization Architect
webMethods, Inc.
mailto:[email protected]
Presentation: http://www.inter-locale.com
Creating complex systems in a global environment requires more than
internationalized code. Since most Enterprise system rely on relational
databases, a global-ready system must also consider database design
in order to be truly effective.
Our Problem
This presentation is based on the lessons learned in developing and
deploying a B2B “conversation management” system (webMethods for
Trading Networks) and “partnership management” software
(webMethods PartnerConnect).
The products we created share a central database that allow
webMethods customers to manage their B2B trading partnerships.
Terminology:
A “trading partner” is a company that you want to do business with.
An “initiative” is a specific opportunity to work together.
Trading Networks
Companies need to store
information about transactions
and business relationships
world wide and in real time.
We call this “Global Business
Visibility”
Partner Connect Goals
Centrally served (one
instance).
Centrally managed (initiatives
can be deployed anywhere).
Localized (so partners can
interact with initiatives in their
own language).
Cultural and market sensitivity
(customized to fit different
market conditions locally).
Created and managed by the
customer entirely through
HTML interface.
Profile and Conversation Management
Enter the Database
Serve both global and local
initiatives from a single
instance.
Store data in multiple writing
systems (scripts,
languages).
Provide for actual differences
in the data due to user
location (“locale”).
Provide for localization of
global content.
Provide for local content
management.
Basic Rules for a Global DB Schema
1. Expand fields to support changes in character encoding.
2. Expand fields to support differences in the storage
requirements of other locales (cultural or linguistic expansion,
as opposed to encoding)
3. Classify data as locale-neutral, locale-intrinsic, or localerelated and re-normalize the tables accordingly.
4. Create efficient access to both global and locale-specific
information.
Selecting an Encoding
If a database instance will only serve a single locale (or compatible
locales), then the character encoding can be selected based on local
requirements (“legacy encoding”).
If the database must store data from many locales (or incompatible
writing systems), then the character encoding selected must be a
Unicode encoding.
Each database vendor has a unique approach to this.
Encodings vary in terms of performance and capability.
Generally the two choices you have are:
UTF-8
UTF-16 (formerly known as UCS-2)
Character Encodings and DDL
Each database vendor provides
their own encoding support.
Most support “legacy” encodings
and their variants.
Need a Unicode encoding to support
multiple languages (globally)*
Each vendor handles Unicode
encodings differently.
CREATE TABLE Address (
cust_id
number,
attn
varchar(50),
department
varchar(50),
street1
varchar(50),
street2
varchar(50),
city
varchar(50),
state
char(2),
zip
varchar(5),
country
varchar(18));
Example: Cloudscape
Cloudscape is a pure Java
database.
Uses java.lang.String objects to
store char and varchar data, so all
string data is stored as UCS-2.
(1) java.lang.Character equals (1)
unit in DDL
CREATE TABLE Address (
cust_id
number,
attn
varchar(50),
department
varchar(50),
street1
varchar(50),
street2
varchar(50),
city
varchar(50),
state
char(2),
zip
varchar(5),
country
varchar(18));
Example: Oracle
Oracle provides several native
Unicode encodings. The most
commonly used one is called
“UTF8”.
Characters in UTF-8 range from one
to four bytes*
Char and varchar2 types are defined
in bytes, not characters.
So a varchar2(30) can reliably store
10 Unicode 2.1.8 characters (and as
many as 30).
Note that a varchar2(60) is required
to store surrogate pairs.
CREATE TABLE Address (
cust_id
number,
attn
varchar2(150),
department
varchar2(150),
street1
varchar2(150),
street2
varchar2(150),
city
varchar2(150),
state
char(6),
zip
varchar2(15),
country
varchar2(18));
Oracle Example
Create a table and insert values.
Notice that “multibyte” values take
more room to store.
Example: MS SQL Server 2000
SQL Server 2000 provides support
for the UTF-16 encoding of Unicode
via the nchar and nvarchar
datatypes.
Char and varchar2 must use a
legacy encoding, with sizes defined
in bytes (so a varchar(30) can store
as many as 30 and as few as 15
characters in Shift-JIS [CP932]).
Nchar and nvarchar are defined in
characters, so an nvarchar(30) can
store 30 characters.
Note that an nvarchar(30) can only
store 15 characters beyond
U+FFFF.
CREATE TABLE Address (
cust_id
integer,
attn
nvarchar(50),
department
nvarchar(50),
street1
nvarchar(50),
street2
nvarchar(50),
city
nvarchar(50),
state
nchar(2),
zip
nvarchar (5),
country
nvarchar(18));
SQL Server Example
Create a table.
Insert data using “multibyte”
characters.
Insert data using “single-byte”
characters.
Oracle Unicode Encoding Variations
AL24UTFFSS. The original Unicode encoding supported by Oracle. It is
not compatible with modern Unicode and should be avoided.
UTF8. A multibyte encoding used by most versions of Oracle. This
version encodes Unicode Scalar Values larger than U+FFFF as the
UTF-8 sequence for a pair of surrogate characters.
This results in binary sorting sequences compatible with UTF-16
representations.
This violates the Unicode “shortest form” requirement (note that this is
invisible to my Java application).
*(All JDBC drivers adjust the connection to use this encoding automatically.)
AL32UTF8. A UTF-8 encoding provided in Oracle 9i that correctly
encodes Unicode Scalar Values larger than U+FFFE using the shortest
form. Note that the sorting sequence is different.
nchar/nvharchar support for UTF-16 in Oracle 9i.
MS SQL Server 2000 Issues
Code Page 65001: This is Microsoft’s code page for UTF-8.
It can not be used as a char/varchar/text encoding, even in the most recent
versions of MS SQL Server.
See http://support.microsoft.com/support/kb/articles/Q232/5/80.ASP for
more information.
You can use a different encoding (by setting the collation) for each data
column, but this is not a very convenient way to work in a global
environment.
JDBC connections to SQL Server use the JDBC-ODBC driver. This
driver cannot tell the difference between n-types and “regular” types,
and thus cannot retrieve Unicode string values.
Note that this also applies to several middleware products, notably Merant.
Note that this also applies to use of variant text types in other databases
(such as Oracle 9i).
Some Other Databases
Sybase
ASE supports UTF-8.
Sybase 11 supports UTF-8 via an add-on.
ASE is adding support for UTF-16 via a new data type.
IBM DB/2
Supports UTF-16 (as CCSID 13844).
Supports UTF-8 as a database encoding.
MySQL doesn’t support Unicode.
The Open Source folks need to get to work …
Modifying Size Constraints
UTF-8 has a maximum number of
bytes-per-character of 4.
Vast majority of characters use 3 or
less.
Older systems (JDK, for example)
cannot access the 4-byte
characters.
Determine size requirements:
Specific constraint
--or--
Arbitrary constraint.
Specific Size Limit:
Check length using code (database
fields have variable restrictions).
For UTF-8: Multiply by 3 (or 4) bytes
to get field length.
Example: varchar2(10) becomes
varchar2(30).
Arbitrary Size Limit:
Multiply the desired maximum by 3
bytes to get approximate size.
Adjust according to database and
performance requirements.
Example: varchar2(100) becomes
varchar2(255). [Was able to store
100 characters, now a “minimum
maximum” of 85.]
Cultural Data Expansion
Data also changes size (and sometimes type) because of “culture” or
locale.
Examples:
Social Security Number is 13 digits in France
“Postal code” not all numeric outside USA and may be quite long.
Different address units than “State”
Spanish users often have two or three “middle” names.
Avoid arbitrarily small char and varchar field lengths. Most databases
optimize storage of variable length fields.
But avoid performance killing sizes.
Oracle block size limitations.
Oracle JDBC character conversion “latching” maximum
(2000 bytes in 8.0.5).
Cultural Data Expansion
State (char2) becomes province (up
to 85 characters).
ZIP code (varchar9) becomes
postalcode (up to 50 characters and
probably much more).
Address fields expand from 50 bytes
to 255 bytes (or about 85
characters).
Don’t assume that the same fields
will always represent the exact
same data values.
CREATE TABLE Address (
address_id
char(24),
cust_id
char(24),
contact_id
char(24),
country_id
char(2),
department
varchar2(255),
street1
varchar2(255),
street2
varchar2(255),
city
varchar2(255),
province
varchar2(255),
postalcode
varchar2(150));
What’s Left?
So far we’ve:
Expanded storage to deal with character encodings.
Expanded storage to deal with cultural and linguistic data expansion.
We still need to:
Allow for localization of textual elements.
Allow for relational changes due to cultural or linguistic requirements.
Basic Questionnaire Table Structure
QUESTION
-------Q_ID
QUES_ID
TYPE_ID
QUESTION
SEQ_NUM
char(24)
char(24)
char(2)
varchar(255)
number
Structure with Localization
QUESTION
-------Q_ID
QUES_ID
TYPE_ID
DESCRIPTION
SEQ_NUM
char(24)PK
char(24)PK
char(2)
varchar(255)
number
QUESTION_LOCALE
--------------Q_ID
char(24)PK, FK
QUES_ID char(24)PK, FK
LCID
number PK
NAME
varchar(255)
Selecting the Locale
How Java does it:
<baseclass>+<specific language>+ <specific country>+<specific variant>
<baseclass>+<specific language>+ <specific country>
<baseclass>+<specific language>
<baseclass>+<default language>+ <default country>+<default variant>
<baseclass>+<default language>+ <default country>
<baseclass>+<default language>
<baseclass>
How can we replicate this in SQL?
One Method…
SELECT * FROM Questionnaire WHERE InitiativeID = ?
SELECT * FROM Question WHERE Q_ID = ?
(while more questions)
(do)
SELECT * FROM QuestionLocale WHERE Ques_ID = ? AND LCID=?
(until you find a record…)
(wend)
QUESTIONLOCALE
-------------Q_ID
QUES_ID
LANG_ID
TERRITORY_ID
QUESTION
• Inefficient.
• Difficult to manage.
Our Solution
Concept of “installed
locale”
Create associated
“installed locale”
records at the hub or
questionnaire level.
Perform locale
negotiation once.
No additional searches
required.
Let’s look…
Appearance of Questions
Data and Locale
Some data is locale neutral.
Formatted at display time to
match user’s locale.
Values don’t vary by locale.
Note: It may be in a language.
Some data is locale related.
Data locale implied by context.
Formatting/Validation is supplied
by context.
Locale can be inherited or
cascaded.
Some data is locale intrinsic.
Business Logic (format/validation) changes due to data’s locale.
Locale must be tagged.
Implies a separate table.
Simplify with Locale Related
QUESTIONS AND ANSWERS
Presentation Available at http://www.inter-locale.com
mailto:[email protected]