National Language Support
Download
Report
Transcript National Language Support
www.brianhitchcock.net
Brian Hitchcock
OCP DBA 8i
Global Sales IT
Sun Microsystems
[email protected]
[email protected]
www.brianhitchcock.net
Session #403
NLS and
The Case of the
Missing Kanji
NLS -- National Language Support
Kanji -- Japanese characters
www.brianhitchcock.net
How It All Started
Existing Sybase database and application
Needed to convert to Oracle
Use Oracle Migration Workbench
–
OMWB works well
I wasn’t told there was multi-byte data in the
Sybase database
After the migration to Oracle
–
Kanji data were missing
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 4
Kanji Become Lost
How was Kanji stored in Sybase?
How was application working with Sybase?
Why lost when migrated to Oracle?
How to fix in Oracle?
Would application work with Oracle?
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 5
Before Upgrade to Oracle
0xB0A1
0xB0A1
Source System
Application
Sybase Db
EUC-JP character encoding
Source system inserts bytes
of Kanji characters into db
Application
retrieves char data
generates HTML
character set ISO1
0xB0A1
Browser examines each byte,
detects multi-byte characters,
displays Kanji character
EUC-JP character code for
this character is 0xB0A1
Netscape
Browser
Select “Japanese (Auto-Detect)”
character set in Netscape to view
Kanji characters
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 6
Moving Sybase Data to Oracle
0xB0A1
Existing
Sybase Db
0x3021
SQL*Loader
Oracle defaults to US7ASCII
character set ISO1
0x3021
Oracle Db
Character set
WE8ISO8859P1
0xB0A1
Application
retrieves char data
generates HTML
Flat file produced
using Sybase bcp utility
0xB0A1
EUC-JP character code for
this character is 0xB0A1
Browser displays the characters
for byte codes 0x30 and 0x21
which are 0 and !
0x3021
Netscape
Browser
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 7
0!
Moving Sybase Data to Oracle
What happened?
–
Oracle database is US7ASCII character set
7-bits per character
–
Import stripped the 8th bit off each byte
8th bit set to 0
–
–
–
8-bit characters are now 7-bit characters
Original character data is lost
8-bit characters can’t be represented in the
US7ASCII character set
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 8
US7ASCII to WE8ISO8859P1
B0A1 hexadecimal
B
Hex Decimal Binary
A
10
1010
B
11
1011
C
12
1100
D
13
1101
E
14
1110
F
15
1111
0
A
1
1011 0000 1010 0001
strip off 8th (highest order) bit, set this bit to 0
0011 0000 0010 0001
3
0
2
1
3021
0!
EUC-JP character code for
this character is 0xB0A1
Browser reads each byte, sees the 8th bit
set to 0, decides that each byte represents
a single byte character, character codes
30 and 21 represent the characters 0 and !
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 9
Fix -- Sybase Data to Oracle
Note: WE8ISO8859P1
character set does not
support Kanji characters
0xB0A1
Existing
Sybase Db
SQL*Loader
NLS_LANG=WE8ISO8859P1
character set ISO1
0xB0A1
0xB0A1
Oracle Db
Character set
WE8ISO8859P1
0xB0A1
Application
Flat file produced
using Sybase bcp utility
0xB0A1
0xB0A1
EUC-JP character code for
this character is 0xB0A1
retrieves char data
generates HTML
Browser detects multi-byte
characters, displays Kanji
character
Netscape
Browser
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 10
Current Oracle Production
0xB0A1
0xB0A1
Source System
Application
Oracle Db
EUC-JP character encoding
Source system inserts bytes
of Kanji characters into db
Note: WE8ISO8859P1
character set does not
support Kanji characters
Application
retrieves char data
generates HTML
character set
WE8ISO8859P1
0xB0A1
Browser examines each byte,
detects multi-byte characters,
displays Kanji character
EUC-JP character code for
this character is 0xB0A1
Netscape
Browser
Select “Japanese (Auto-Detect)”
character set in Netscape to view
Kanji characters
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 11
Existing Application
How does it store/retrieve Kanji?
–
–
–
Multi-byte Kanji characters
Stored in WE8ISO8859P1 (single-byte) db
Application
JDBC retrieves bytes from WE Oracle db
Java generates HTML, sent to client browser
Netscape, view HTML using “Japanese (AutoDetect)” character set
Display Kanji
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 12
Existing Application
Application
Retrieves
character
data
Converts 8859-1
to UCS2
8859-1
Generate
HTML
UCS2
UCS2 -- 2 byte Unicode
Oracle Db
ISO1
Send HTML
to client browser
ISO1
Netscape
Browser
Japanese Auto-Detect
8859-1
Character set
WE8ISO8859P1
Browser detects multi-byte
characters, displays Kanji
character
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 13
Existing Application
Each piece of software makes some decision
(default) about character set
You need to understand this process for your
application
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 14
What Really Happened?
Source Kanji data
–
–
From EUC-JP character set
Multi-byte
Kanji multi-byte stored in Sybase db
–
Default character set ISO-1
8-bit, single-byte
Kanji multi-byte stored in Oracle db
–
Character set WE8ISO8859P1
8-bit, single-byte
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 15
Convert to UTF8
Why?
–
–
Eliminate all the issues shown so far
Store multiple languages correctly
Correctly encoded
–
–
Support clients inserting data in languages other
than Japanese Kanji
Existing application can only support languages
based on Latin characters and Kanji
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 16
Conversion is Simple -- Isn’t
It?
Export WE8ISO8859P1 database
–
–
Set export client NLS_LANG
AMERICAN_AMERICA.WE8ISO8859P1
Import into UTF8 database
–
–
Set import client NLS_LANG
AMERICAN_AMERICA.WE8ISO8859P1
Test application
–
–
Application works!
Is everything OK?
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 17
Meanwhile, Back at the Ranch
While application testing is going on
–
Insert sample bytes for Kanji into WE db
Use Oracle SQL CHR() function
–
–
–
–
Export from WE db, import into UTF8 db
Examine same bytes in UTF8 db
Compare UTF8 bytes to manually generated
UTF8 bytes for the Kanji characters
NOT the same bytes!
What does this mean?
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 18
UTF8 Encoding Process
JA16EUC to UTF8 Conversion
Think your
life is
boring?
JA16EUC encoding is 0xB0A1, Unicode code point for this
character is 4E9C (no formula for this, Oracle uses a lookup table)
Number of bytes used in UTF8 encoding based on Unicode code point:
Unicode UTF8 bytes
0x0000 - 0x007f 0xxxxxxx
0x0080 - 0x07ff 110xxxxx 10xxxxxx
0x0800 - 0xffff 1110xxxx 10xxxxxx 10xxxxxx
4E9C requires 3 bytes in UTF8
4E9C bit pattern is
0100 1110 1001 1100
Metalink
Doc ID: Note:69518.1
Determining the codepoint
for UTF8 characters
right-most 6 bits go to third UTF8 byte 01 1100
next 6 bits go to second UTF8 byte 1110 10
remaining 4 bits go in first UTF8 byte 0100
11100100 10111010 10011100
E 4 BA9 C
UTF8 character code is 0xE4BA9C
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 19
Bytes is Bytes
Unicode byte code for this
character is 0x4E9C
ISO-2022-JP character
code for this character is
0x3021
Unicode
lookup table
Unicode
UCS2
ISO-2022-JP
Formula
Formula
Formula
Formula
JIS
UTF8
JIS row/cell values for this
character are Row 16, Column 1
Shift-JIS
Shift-JIS byte code for this
character is 0x889F
EUC-JP
UTF8 byte code for this
character is 0xE4BA9C
Kanji character
encodings shown for the
various character sets
EUC-JP byte code for this
character is 0xB0A1
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 20
Conversion Issue
Correct UTF8 byte code for this
character is 0xE4BA9C
0xC2B0, 0xC2A1
Application
Oracle Db
Import loads
export file
character set
WE8ISO8859P1
Oracle Db
Character set
UTF8
character set
WE8ISO8859P1
0xB0A1
0xB0A1
Flat file
Select char data
Spool to file
EUC-JP character code for
this character is 0xB0A1
Oracle export hard
codes the source db
character set into the
export file
Export File
0xB0A1
Browser displays the characters for
character codes 0xC2B0, 0xC2A1
which are the degree sign and the
inverted exclamation mark °¡
Netscape
Browser
View file
0xC2B0, 0xC2A1
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 21
Import to UTF8 Conversion
Existing WE8ISO8859P1 data to UTF8 Conversion
EUC-JP encoding for the character is 0xB0A1, but import detects that the data
came from a single-byte export file (WE8ISO8859P1)
Import reads each byte, one at a time, 0xB0A1 becomes 0xB0 followed by 0xA1, and
converts these to the Unicode (UCS2) equivalent -- for single-byte character
codes, the Unicode equivalent simply has two leading bytes of 0's -- 0xB0 and
0xA1 become U+00B0 and U+00A1, import then converts from UCS2 to UTF8
Number of bytes used in UTF8 encoding based on Unicode code point:
Unicode
0x0000 - 0x007f
0x0080 - 0x07ff
0x0800 - 0xffff
UTF8 bytes
0xxxxxxx
110xxxxx
1110xxxx
10xxxxxx
10xxxxxx 10xxxxxx
00B0 and 00A1 both require 2 bytes in UTF8
00B0 bit pattern is
0000 0000 1011 0000
00A1 bit pattern is
0000 0000 1010 0001
right-most 6 bits go to second UTF8 byte
next 5 bits go in first UTF8 byte
1100
C
0010
2
1011
B
0000
0
1100
C
0010
2
1010
A
0001
1
WE8ISO8859P1 character code 0xB0A1 becomes UTF character codes 0xC2B0, 0xC2A1
The correct conversion of this EUC-JP character code 0xB0A1 to UTF8 is 0xE4BA9C
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 22
What Happened?
Oracle did exactly what is was told to do
–
–
–
–
–
–
–
Take bytes from WE database
Convert to UTF8 bytes
Export file was made from WE database
WE is single-byte character set
Convert each byte one at a time to UTF8
Kanji character consists of 2 bytes in WE db
Converting each byte to UTF8 not the same as
converting the pair of bytes to UTF8
Yeah, but, application works! (?)
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 23
Where’s the Problem?
UTF8 db has Kanji as 0xC2B0, 0xC2A1
Correct UTF8 encoding is 0xE4BA9C
If new, correctly encoded Kanji is inserted
–
–
Database contains two sets of bytes for same
Kanji character
How does app deal with this?
Existing app only works using Netscape
Japanese (Auto-Detect) character set
–
App is not really UTF8, only works for Japanese
characters
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 24
How Does Application Work?
Review
–
–
Oracle db created using UTF8 character set
Java retrieves char data (bytes) from UTF8 db
Converts to UCS2 (Unicode)
–
–
Java code generates HTML
Client browser displays Kanji characters
Netscape, “Japanese (Auto-Detect)” char set
Application still works
–
bytes in UTF8 db don’t represent UTF8 encoded
Kanji
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 25
Application Works (?)
Application
0xB0, 0xA1
Retrieves
character
data
Converts UTF8
to UCS2
UTF8
0xB0, 0xA1
Generate
HTML
UCS2
0xC2B0, 0xC2A1
Oracle Db
Export WE database,
import into UTF8 db
8859-1
0xB0, 0xA1
Send HTML
to client browser
ISO1
ISO1
0xB0, 0xA1
Netscape
Browser
Japanese Auto-Detect
Character set
UTF8
EUC-JP character code for
this character is 0xB0A1
Browser detects multi-byte
characters, displays Kanji
character
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 26
Test Application
Insert bytes for correctly encoded Kanji
–
–
Into UTF8 db
Use CHR() function
Display this data using existing application
–
Does NOT display Kanji!
Using “Japanese (Auto-Detect)” character set
–
Try Netscape UTF8 character set
Doesn’t display Kanji
–
UTF8 character set should work, shouldn’t it?
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 27
Where Are We?
Correctly encoded UTF8 multi-byte character
data for Kanji does not work with existing
application
Simply “converting” (export WE, import to
UTF8) doesn’t result in correctly encoded
UTF8 character data
Need to figure out what app code is doing
–
–
Whoever wrote it is gone
The usual state of affairs
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 28
How To Debug App Code?
Don’t use app code
–
write very simple Java Servlet
(The Java Diva helps with this…)
–
Servlet simply retrieves character data from db
Runs in iPlanet web server
–
generates HTML for client browser
Use servlet to retrieve correct UTF8 Kanji
–
Does not display Kanji!
Fix servlet then can fix application code?
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 29
Modified Servlet Code
res.setContentType("text/html;charset=UTF-8");
PrintWriter out = new PrintWriter(
new OutputStreamWriter(res.getOutputStream(),
"UTF-8"),true);
out.println("<META HTTP-EQUIV=" + DQ +
"Content-Type" + DQ + " CONTENT=" + DQ +
"text/html; charset=utf-8" + DQ + ">");
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 30
Fix Application
Make same changes to application code
Browser displays Kanji correctly
–
Manually generated, correctly encoded UTF8
Application interacts with Dynamo
–
Need to reconfigure Dynamo for UTF8 data
Application fixed (?)
–
Works with correctly encoded UTF8 multi-byte
data
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 31
Is Application really fixed?
Fixed app retrieves correctly encoded UTF8
character data
What about existing character data?
–
Data that was exported from WE and imported
into UTF8 db
Use fixed app code to retrieve existing data
–
Existing Kanji are not displayed
Original app did display existing data...
–
Existing data is not correctly encoded UTF8
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 32
Fixed Application
Application
0xB0, 0xA1
Retrieves
character
data
Converts UTF8
to UCS2
UTF8
0xC2B0, 0xC2A1
Generate
HTML
UCS2
0xC2B0, 0xC2A1
Oracle Db
8859-1
Character set
UTF8
Send HTML
to client browser
UTF8
0xC2B0, 0xC2A1
UTF8
Netscape
Browser
UTF8t
EUC-JP character
code for this character
is 0xB0A1
Browser displays characters for
the UTF8 bytes 0xC2B0, 0xC2A1
which are degree sign and upside
down exclamation point °¡
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 33
How To Fix Existing Data?
What’s wrong with existing data (UTF8 db)
–
–
Character data is not correctly encoded UTF8
It is UTF8 encoded Unicode of each single byte
that was exported from WE database
Before importing into UTF8 database?
–
–
EUC-JP character set (Latin ASCII and Kanji)
Stored in single-byte WE database
Need to convert UTF8 of WE of EUC-JP to
correct UTF8 bytes for Kanji
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 34
Review of Bytes is Bytes
Original Kanji character 0xB0A1 (EUC-JP)
Inserted into Oracle database
–
0xB0, 0xA1 in WE8ISO8859P1 db
Exported/imported into Oracle UTF8 db
–
Individual bytes converted to UTF8
Original Kanji character was 2 bytes
Became 4 bytes in UTF8 db
0xC2B0, 0xC2A1
Correct UTF8 bytes are 0xE4BA9C
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 35
How to Convert Existing Data?
Fix in Oracle WE before export/import
–
No point, export/import will ‘corrupt’ character
data, will need to fix after export/import
Don’t export/import
–
–
SQL select each table to flat files from WE db
SQL*Loader into UTF8 database
Use CHARACTERSET JA16EUC option
More work moving each table one at a time
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 36
SQL*Loader Option
0xE4BA9C
Application
Oracle Db
SQL*Loader
CHARACTERSET JA16EUC
character set
WE8ISO8859P1
0xB0A1
0xB0A1
Oracle Db
Character set
UTF8
0xB0A1
Application
retrieves char data
generates HTML
SQL Select to flat file
for each table
0xB0A1
EUC-JP character code for
this character is 0xB0A1
Select “UTF8” character set
to view Kanji characters
Netscape
Browser
0xE4BA9C
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 37
Convert Existing Data
Fix data after import into UTF8 database
–
–
Export from WE, import into UTF8 database
Use Oracle SQL CONVERT() function
CONVERT() from UTF8 to WE8ISO8859P1
CONVERT() from JA16EUC to UTF8
–
Need to CONVERT() each column of each table
that contains multi-byte data
How to be sure which columns to CONVERT()?
CONVERT() all columns that contain char data?
–
Must test using CONVERT() to verify it works
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 38
Fix After Import
0xB0A1
0xC2B0, 0xC2A1
Application
Oracle Db
Application
Oracle Db
character set
WE8ISO8859P1
character set
UTF8
Export File
0xB0A1
Import loads
export file
character set
WE8ISO8859P1
0xB0, 0xA1
EUC-JP character code for
this character is 0xB0A1
Convert from UTF8
to WE8ISO8859P1
0xB0, 0xA1
Bytes of original
EUC-JP Kanji
character
Convert from
EUC-JP to UTF8
0xE4BA9C
Correctly encoded UTF8 bytes
for this Kanji are 0xE4BA9C
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 39
Oracle CONVERT()
Syntax, examples
–
select CONVERT(<column>, <destination char
set>, <source char set>)
select CONVERT(<column>, WE8ISO8859P1,
UTF8)
select CONVERT(<column>, UTF8, JA16EUC)
–
Don’t re-run CONVERT() without testing
re-run may corrupt data
regenerate original source data, re-run
CONVERT()
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 40
Overall Conversion Process
What we did…
–
–
–
Identify tables/columns contain multi-byte data
Export from WE database
Import into UTF8 database
rows=n, create tables, don’t load data
–
Widen columns for UTF8 multi-byte data
increase to 3 times original width
–
Import into UTF8 database (again)
ignore=y, load data into existing tables
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 41
Overall Conversion Process
Continued
–
–
CONVERT() columns that contain multi-byte data
Test, compare with data from existing
application/data
Conversion includes converting all pieces of
the application, not just the Oracle database
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 42
Details - Source Char Set?
How did I determine this?
–
–
Original Kanji data was from EUC-JP
How was this determined?
Examine bytes of original character data
Display Original Kanji characters
Find single Kanji in Japanese dictionary
- Gives row-cell code of Kanji in JIS-0208
Using other reference sources
- manually generate bytes for the Kanji in various
encodings
Compare with bytes of original Kanji data
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 43
Rosetta Stone?
Oracle8i
National Language Support Guide
Release 2 (8.1.6)
December 1999
Part No. A76966-01
page 3-22
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 44
Reference Books Used
The New Nelson Japanese-English Character Dictionary
By John H. Haig,Andrew N. Nelson
Published by Periplus Editions, Ltd
Date Published: 11/1996
ISBN: 0804820368
The Unicode Standard: With CD-ROM
By Unicode Consortium
Published by Addison Wesley Longman, Inc.
Date Published: 04/1995
ISBN: 0201483459
CJKV Information Processing
By Ken Lunde,Gigi Estabrook (Editor)
Published by O'Reilly & Associates, Incorporated
Date Published: 01/1999
ISBN: 1565922247
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 45
Lessons Learned
Oracle (and Sybase) don’t store characters
–
They store bytes, strings of bytes
Normally, Oracle does NO checking of
character set
–
does NOT check that bytes inserted represent
correct characters in database character set
Only under specific circumstances does
Oracle “apply” a character set to char data
Changing character set affects more than just
the database
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 46
Lessons Learned
Bytes of character from any char set can be
stored in db of any charset
–
–
–
–
EUC-JP char in WE db, in UTF8 db
bytes in db are not ‘correct’ bytes for the
character in the db character set
all apps, users, dbs must know that db contains
char data from other char set
Any char set conversion may corrupt the char
data -- import WE into UTF8 db
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 47
Lessons Learned
Simply exporting db, importing into UTF8
does not solve the problems
Testing requires generating correctly encoded
character data
Every piece of an application makes a
decision about character set (default)
If all data in db really is in the db char set
–
export, import to db of other char set works
Need to see original character data
–
Verify data after char set conversion
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 48
Fill Out a Survey and Get a Chance
to Win a Compaq iPAQ!
We want to know what you think!
Fill out the survey that was handed out at the beginning
of the session for a chance to win a Compaq iPAQ.
Remember to include your name and email in the
available section and we will enter your name into two
daily drawings to win an iPAQ
www.brianhitchcock.net
Brian Hitchcock October 21, 2001
Page 49