ISO8859-1 - Progress Community

OPS-25: Unicode and the DataServer
David Moloney
Software Architect
Agenda
Unicode deployment with OpenEdge® DataServers
 Unicode:
• How did we get here ?
• What are its broader OpenEdge implications ?
• What are its DataServer implications ?
• Specific Implementation in the DataServers for:
– Oracle®
– MS SQL Server
© 2008 Progress Software Corporation
Code Pages
ASCII: 7-bit, 128-character set (codes 0–127)
Special Chars: 9 \t (Tab), 10 \n (NL), 13 \r (CR), 32 Space, 33 !, 34 ", 35 #, …
Upper Case: 65 A, 66 B, 67 C, 68 D, 69 E, 70 F, 71 G, …
Lower Case: 97 a, 98 b, 99 c, 100 d, …
… 125, 126, 127 (end of the 7-bit range)
Extended ASCII: codes 128–255
128 €, 129 (no character), 130 ‚, 131 ƒ, 132 „, 133 …, … ü, 253 ý, 254, 255
Extended character sets:
• ISO8859-1
• 1250
• IBM437/850
8-bit Code Pages
 Examples of character encoding:
Char  ISO8859-1  ISO8859-2  1252  1250  IBM437  IBM850  IBM852
a     61         61         61    61    61      61      61
á     E1         E1         E1    E1    A0      A0      A0
È     C8         n/a        C8    n/a   n/a     D4      n/a
Č     n/a        C8         n/a   C8    n/a     n/a     AC
“     n/a        n/a        93    93    n/a     n/a     n/a
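The table of character encodings above can be reproduced with a few lines of Python. This is only a sketch: the mapping from the slide's code page names to Python codec names (`cp1252`, `cp437` and so on) is an assumption of this example, not something the presentation states.

```python
# Code page names on the left are the slide's; the Python codec names on the
# right are this sketch's assumption about the closest standard-library match.
CODECS = {
    "ISO8859-1": "iso8859-1",
    "ISO8859-2": "iso8859-2",
    "1252": "cp1252",
    "1250": "cp1250",
    "IBM437": "cp437",
    "IBM850": "cp850",
    "IBM852": "cp852",
}


def encoding_row(char):
    """Return {code page: hex byte, or 'n/a' if the character is missing}."""
    row = {}
    for name, codec in CODECS.items():
        try:
            row[name] = char.encode(codec).hex().upper()
        except UnicodeEncodeError:
            row[name] = "n/a"
    return row


for ch in "aáÈČ“":
    print(ch, encoding_row(ch))
```

The same character can need a different byte in each code page, and some code pages cannot represent it at all.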
Data Corruption
Avoid This: the same byte, E8, is a different character in each code page
ISO8859-1: E8 = “è” (France)
1250: E8 = “č” (Czech Republic)
What is Unicode ? (“Unique Code”)
 A character encoding standard that:
• Replaces all legacy SBCS & MBCS systems
• Can assign more than a million numbers
– Highest code point: “U+10FFFF”; total code points = 2^20+2^16 = 1,114,112
• Gives each text symbol/character one “unique” number
• Provides one internationalization process
• Is not platform, program, country or language specific
• Is essential to the Web (HTML, XML, etc.)
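The code point arithmetic in the list above can be checked directly. A minimal sketch; the variable names are illustrative only.

```python
HIGHEST_CODE_POINT = 0x10FFFF

# Code points start at U+0000, so the count is one more than the highest value.
total_code_points = HIGHEST_CODE_POINT + 1

print(f"{total_code_points:,}")
print(total_code_points == 2**20 + 2**16)
```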
How is Unicode encoded ?
“UTF-x”
UTF = Unicode Transformation Format
x = Minimum length of coding unit
Code point ranges:
U+0000, U+0001, U+0002, U+0003 … U+00FF (ÿ): Extended ASCII (ISO8859-1)
U+0000 … U+FFFF: BMP (UTF-16)
… U+100000 … U+10FFFD, U+10FFFE, U+10FFFF: Supplementary Range
U+0000 through U+10FFFF = 1,114,112 code points
The Encoding Tradeoff: Ease of Use (UTF-32) versus Storage Space (UTF-8)
Char  ANSI Number  Unicode Number  ANSI Hex  Unicode Hex  Unicode Range
ÿ     255          255             0xFF      U+00FF       Latin-1 Supplement
UTF Encoding Examples
Unicode   UTF-8          UTF-16         UTF-32
U+004D    4D             00 4D          00 00 00 4D
U+00A1    C2 A1          00 A1          00 00 00 A1
U+00E1    C3 A1          00 E1          00 00 00 E1
U+0470    D1 B0          04 70          00 00 04 70
U+4E9C    E4 BA 9C       4E 9C          00 00 4E 9C
U+10302   F0 90 8C 82    D8 00 DF 02    00 01 03 02
U+004D through U+4E9C are in the BMP; U+10302 is not.
(Oracle) NLS_LANG encodings of U+10302:
UTF8 3-byte “Modified”: ED A0 80 ED BC 82
AL32UTF8 4-byte “Standard”: F0 90 8C 82
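The encoding examples can be recomputed with Python's built-in codecs, which is a handy way to check a table like this one. A sketch only: `hex_bytes`, `utf_forms` and `cesu8` are names invented for this example, and treating Oracle's 3-byte “Modified” UTF8 as "encode each UTF-16 surrogate separately" is an assumption about that character set, not something the slide spells out.

```python
def hex_bytes(data):
    """Format bytes as space-separated upper-case hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


def utf_forms(code_point):
    """Return the big-endian UTF-8/16/32 byte sequences for one code point."""
    ch = chr(code_point)
    return {
        "UTF-8": hex_bytes(ch.encode("utf-8")),
        "UTF-16": hex_bytes(ch.encode("utf-16-be")),
        "UTF-32": hex_bytes(ch.encode("utf-32-be")),
    }


def cesu8(code_point):
    """Encode each UTF-16 code unit on its own, giving at most 3 bytes per unit.

    Assumption: this models the 3-byte "Modified" form shown for Oracle UTF8.
    """
    utf16 = chr(code_point).encode("utf-16-be")
    units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]
    return b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)


for cp in (0x004D, 0x00A1, 0x00E1, 0x0470, 0x4E9C, 0x10302):
    print(f"U+{cp:04X}", utf_forms(cp))
print("3-byte form of U+10302:", hex_bytes(cesu8(0x10302)))
```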
Unicode Conversion
 All code pages convert to Unicode
 Unicode may not convert to other code pages
IBM437, IBM852, IBM850, 1250, 1252, ISO8859-2, ISO8859-1 to Unicode: always possible
Unicode to IBM437, IBM852, IBM850, 1250, 1252, ISO8859-2, ISO8859-1: ?
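The one-way nature of the conversion can be shown in a short sketch. The codec names are Python's and are assumed here to match the slide's code pages; `can_round_trip` is a helper invented for this example.

```python
def can_round_trip(text, codec):
    """True if every character of text exists in the target code page."""
    try:
        return text.encode(codec).decode(codec) == text
    except UnicodeEncodeError:
        return False


# Every one of the 256 byte values in these code pages maps to Unicode.
for codec in ("iso8859-1", "cp850"):
    decoded = bytes(range(256)).decode(codec)
    print(codec, "->", len(decoded), "Unicode characters")

# The reverse direction depends on the target code page.
print(can_round_trip("Čzech", "cp1250"))     # Č exists in 1250
print(can_round_trip("Čzech", "iso8859-1"))  # Č does not exist in ISO8859-1
```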
Agenda
The path to successful development & deployment
 Unicode:
• How did we get there ?
• What are its broader OpenEdge implications ?
• What are its DataServer implications ?
• Specific Implementation in the DataServers for:
– Oracle
– MS SQL Server
The Unicode “Solution” ? Yes !
 YES !
• One stop shopping for Internationalization!
 NO, there are considerations to be addressed:
• Operating System
• Web Server (XML Schemas and HTML)
• Print drivers
• Data from/to other systems
• OCX’s
• Terminal Emulators
OpenEdge Globalization Settings
For more info: See “Internationalizing Applications” Guide
Primary Parameters   Secondary Parameters   Database Settings
-cpinternal          -cplog                 _db._db-xl-name
-cpstream            -cpterm                _db._db-coll-name
-cpcoll              -cpprint
-d                   -numsep
-E                   -numdec
-cprcodein           -cprcodeout
-lng
Existing OpenEdge Constructs:
• Convmap.cp – Character Processing Tables
• Progress.ini Fonts
New OpenEdge Construct:
• ICU Library – For Linguistic Sorting
Common Mistakes
Loading or importing data with the wrong code page
UTF-8 bytes: C4 8C 7A 65 63 68
Read as UTF-8: C4 8C = Č, 7A 65 63 68 = zech, giving “Čzech”
Read as ISO8859-1: “ÄŚzech”
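The effect is easy to reproduce. In this sketch the codec names are Python's; note that in Python's `iso8859-1` codec byte 8C is a control character, and it is the `cp1250` codec that renders it as “Ś”, so both readings are shown.

```python
raw = "Čzech".encode("utf-8")
print(" ".join(f"{b:02X}" for b in raw))

right = raw.decode("utf-8")            # the code page the data was written in
wrong_1250 = raw.decode("cp1250")      # two bytes of Č become two characters
wrong_latin1 = raw.decode("iso8859-1") # 8C becomes an invisible control code

print(right)
print(wrong_1250)
print(repr(wrong_latin1))
```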
Byte Order Mark (BOM)
Čzech = EF BB BF C4 8C 7A 65 63 68 (UTF-8 BOM EF BB BF, then the text)
With the BOM, a reader that would otherwise assume ISO8859-1 displays “Čzech” correctly.
OUTPUT TO text.txt CONVERT TARGET "UTF-8".
PUT CONTROL "~357~273~277". /* BOM: octal for EF BB BF */
PUT UNFORMATTED "UTF-8 text".
OUTPUT CLOSE.
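For comparison, the same file layout can be produced and inspected from Python. A sketch only: the temporary path is made up for the example, and `utf-8-sig` is Python's name for "UTF-8 with a leading BOM".

```python
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "text.txt")

# "utf-8-sig" writes EF BB BF before the text.
with open(path, "w", encoding="utf-8-sig") as f:
    f.write("Čzech")

with open(path, "rb") as f:
    raw = f.read()

has_bom = raw.startswith(codecs.BOM_UTF8)
print(" ".join(f"{b:02X}" for b in raw), "BOM present:", has_bom)

# Reading with the same codec strips the BOM again.
with open(path, encoding="utf-8-sig") as f:
    text = f.read()
print(text)
```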
Common Mistakes
Loading or importing data with the wrong code page
(…)
"imuller" "Ian Muller" "Y" "C" 1657 283200
"jdoe" "Jane Doe" "N" "U" 3275 450010
"jsmith" "John Smith" "Y" "C" 1450 323700
"jsanchez" "Juan Sánchez" "Y" "C" 4250 323900
.
PSC
filename=users
records=0000000001133
ldbname=mydatabase
timestamp=2007/03/28-20:55:03
numformat=44,46
dateformat=mdy-1950
map=NO-MAP
cpstream=ISO8859-1
.
0000143373
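Before loading a .d file it helps to check which code page it was dumped in. The sketch below reads the `key=value` lines from a trailer laid out like the one shown above; `read_trailer` is a hypothetical helper, and the assumption that the trailer always starts at a `PSC` line and ends at the next `.` line is taken from this one sample only.

```python
SAMPLE = '''"jsmith" "John Smith" "Y" "C" 1450 323700
"jsanchez" "Juan Sánchez" "Y" "C" 4250 323900
.
PSC
filename=users
records=0000000001133
ldbname=mydatabase
timestamp=2007/03/28-20:55:03
numformat=44,46
dateformat=mdy-1950
map=NO-MAP
cpstream=ISO8859-1
.
0000143373
'''.splitlines()


def read_trailer(lines):
    """Collect the key=value pairs that follow the 'PSC' marker line."""
    settings = {}
    in_trailer = False
    for line in lines:
        line = line.strip()
        if line == "PSC":
            in_trailer = True
        elif in_trailer and line == ".":
            break
        elif in_trailer and "=" in line:
            key, _, value = line.partition("=")
            settings[key] = value
    return settings


trailer = read_trailer(SAMPLE)
print("Dumped with cpstream =", trailer["cpstream"])
```

If the reported code page differs from the -cpstream of the loading session, the load is a candidate for the corruption described on this slide.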
Common Mistakes
Updating data with the wrong code page
OS = 1252: the user types “à” = E0
_progres, -cpstream IBM850: E0 is taken to be an IBM850 character
_progres, -cpinternal ISO8859-1: converted to D3
_mprosrv, -cpinternal IBM850: E0
Database, _db-xl-name ISO8859-1: D3 is stored, which is “Ó”
Common Mistakes
Updating data with the CORRECT code page
OS = 1252: the user types “à” = E0
_progres, -cpstream 1252: E0 is taken to be a 1252 character
_progres, -cpinternal ISO8859-1: E0
_mprosrv, -cpinternal IBM850: 85
Database, _db-xl-name ISO8859-1: E0 is stored, which is “à”
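Both update paths can be simulated by decoding with the declared -cpstream and re-encoding in the next code page. This is a sketch of the byte arithmetic only, using Python codec names as stand-ins for the OpenEdge code pages; `client_converts` is a name invented for the example.

```python
def client_converts(stream_bytes, cpstream, cpinternal="iso8859-1"):
    """Interpret incoming bytes as cpstream, then convert them to cpinternal."""
    return stream_bytes.decode(cpstream).encode(cpinternal)


typed = "à".encode("cp1252")  # what the 1252 operating system delivers: E0

wrong = client_converts(typed, "cp850")   # -cpstream IBM850
right = client_converts(typed, "cp1252")  # -cpstream 1252

# The server runs with -cpinternal IBM850, so it sees the IBM850 byte.
server_view = right.decode("iso8859-1").encode("cp850")

print(wrong.hex().upper(), wrong.decode("iso8859-1"))
print(right.hex().upper(), right.decode("iso8859-1"))
print(server_view.hex().upper())
```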
Real Life Story
ASCII Linefeed (0x0A) to EBCDIC Newline (0x25)
Text: “Hi Bob,CRLF How are you?CRLF Bye” (CRLF = 0D 0A)
OpenEdge Client: -cpinternal iso8859-1, -cpstream iso8859-1 (Iso8859-1 ASCII)
Schema holder: _db-xl-name IBM037
DataServer for ODBC to the IBM037 EBCDIC data source
The linefeed stays 0x0A at every step and is never converted to 0x25.
Result: Hi Bob,▐How are you?▐Bye
Real Life Story
ASCII Linefeed (0x0A) to EBCDIC Newline (0x25)
Text: “Hi Bob,CRLF How are you?CRLF Bye” (CRLF = 0D 0A)
OpenEdge Client: -cpinternal IBM850, -cpstream IBM850 (IBM850 ASCII)
Schema holder: _db-xl-name IBM037
DataServer for ODBC to the IBM037 EBCDIC data source
The linefeed is 0x0A on the OpenEdge side and 0x25 on the EBCDIC side.
Result:
Hi Bob,
How are you?
Bye
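The two outcomes can be imitated with Python's `cp037` codec, which is assumed here to be a fair stand-in for IBM037. The sketch only illustrates the byte values involved; it does not reproduce how the DataServer or its conversion tables actually behave.

```python
text = "Hi Bob,\nHow are you?\nBye"

# Converted properly, the ASCII linefeed 0x0A becomes the EBCDIC byte 0x25.
converted = text.encode("cp037")
linefeed_in_ebcdic = "\n".encode("cp037")

# If 0x0A is passed through unconverted, EBCDIC readers no longer see a newline.
unconverted = converted.replace(b"\x25", b"\x0a")
read_back_good = converted.decode("cp037")
read_back_bad = unconverted.decode("cp037")

print(linefeed_in_ebcdic.hex().upper())
print(read_back_good.count("\n"), "line breaks after a correct conversion")
print(read_back_bad.count("\n"), "line breaks when 0x0A was passed through")
```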
Tips & Hints
Un-corrupting data
 ISO8859-1 database with data encoded in
IBM850
 Run on session with -cpinternal iso8859-1
/* Re-convert every value of the affected field */
FOR EACH myTable EXCLUSIVE-LOCK:
    RUN FixChar(INPUT-OUTPUT myTable.myField).
END.

PROCEDURE FixChar:
    DEF INPUT-OUTPUT PARAM c AS CHAR NO-UNDO.
    /* CODEPAGE-CONVERT(source, target code page, source code page) */
    c = CODEPAGE-CONVERT(c,"IBM850","ISO8859-1").
END PROCEDURE.
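The repair can be rehearsed outside the database first. The sketch below mirrors the argument order of the CODEPAGE-CONVERT call as written (target IBM850, source ISO8859-1) and then views the resulting bytes in the ISO8859-1 session; `fix_char` and its parameter names are invented for this example. If the damage in a given database ran in the other direction, the two code pages would need to be swapped.

```python
def fix_char(stored, target="cp850", session="iso8859-1"):
    """Re-encode a value in the target code page and view the bytes in the
    session code page, undoing one mistaken target -> session conversion."""
    return stored.encode(target).decode(session)


# "à" typed on a 1252 system but declared as IBM850 was stored as "Ó" (D3).
corrupted = "Ó"
repaired = fix_char(corrupted)
print(corrupted, "->", repaired)
```

A character that does not exist in the target code page raises `UnicodeEncodeError` here, which is a useful signal that the value was not corrupted in the assumed way.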
Database Sorting Rules
Are not all the same
FOR EACH table WHERE name <= CHR(126).
FOR EACH table WHERE name >= CHR(126).
Sort order with -cpinternal / _Db._Db-collate = Iso8859-1 Basic: #, $, Alphanumerics, ~
Sort order with MSS 1252: #, $, ~, Alphanumerics
Under Development
D I S C L A I M E R
 This talk includes information about potential
future products and/or product enhancements.
 What I am going to say reflects our current
thinking, but the information contained herein is
preliminary and subject to change. Any future
products we ultimately deliver may be materially
different from what is described here.
Unicode Deliverables
Unicode, ICU Collation, CLOBs
+ Unicode for MSS DataSrvr (limited)
+ Unicode for Oracle DataSrvr and MSS DataSrvr
Oracle NCLOB Support, MSS CLOB Support
+ CLOB Params To Stored Proc.’s
OpenEdge Settings
_db-xl-name, -cpinternal and -cpstream
OpenEdge Process (GUI, CHUI): -cpinternal
Keyboard, Screen, Printer, OS files: -cpstream
Database: _db-xl-name
OpenEdge code page conversions apply between -cpinternal and each of -cpstream and _db-xl-name.
OpenEdge Settings
_db-xl-name, -cpinternal and -cpstream
OpenEdge Process (GUI, CHUI): -cpinternal
Keyboard, Screen, Printer, OS files: -cpstream
Schema Holder: _db-xl-name
DataServer (layer or process), DB Driver, Foreign Data Source (Database CP)
OpenEdge code page conversions apply inside the OpenEdge process.
Driver conversions ? apply between the DB driver and the foreign data source.
OpenEdge Settings
WEBSPEED™ (_progres -web): -cpinternal, -cpstream; Web Browser, OS files
APPSERVER™ (_proapsv): -cpinternal, -cpstream; OS files
GUI CLIENT (prowin32): -cpinternal, -cpstream; Keyboard, Printer, Screen, OS files
CHUI CLIENT (_progres): -cpinternal, -cpstream; Keyboard, Printer, Screen, OS files
DATASERVER (_orasrv): -cpinternal, -cpstream; Driver, ORACLE Database, OS files
Schema Holder: _db-xl-name
Dictionary Utilities changed for Unicode
For Both Oracle and MS SQL Server
• Schema Migration *
– Including Unicode batch mode parameters
• Update/Add Table Definitions +
• Verify Table Definitions +
• Adjust Schema +
• Generate delta.sql *
• Dump as Create Table Statement *
* “Use Unicode Types” GUI selection provided
+ Modified to handle Unicode types internally
Comparing 10.1C Unicode: Oracle vs. MSS
Attribute: Unicode Definitions
OpenEdge: DB-Codepage (_db._db-xl-name)
ORACLE: DB-Codepage; Data Types
MSS: Data Types

Attribute: Data Types
OpenEdge: CHAR, LONGCHAR, CLOB
ORACLE: CHAR, VARCHAR2, LONG, CLOB, NCHAR, NVARCHAR2, NCLOB (in 10.1C01)
MSS: NCHAR, NVARCHAR, NVARCHAR(max) and NTEXT mapped to OpenEdge CHAR

Attribute: Max. Char Size
OpenEdge: CHAR: 30,000 bytes; LONGCHAR/CLOB: 1G
ORACLE: CHAR types: 4000 bytes; CLOB types: 4G
MSS: CHAR types: 8000 bytes; CLOB types: 2G

Attribute: Max. Char Size for Unicode
OpenEdge: Same as above but... CHAR: 15,000 bytes using MSS DataServer
ORACLE: 4000 bytes
MSS: 4000 chars

Attribute: Semantics
OpenEdge: Character
ORACLE: Character or Byte
MSS: (double-byte) Character

Attribute: Driver Settings
OpenEdge: N/A
ORACLE: NLS_LANG=.AL32UTF8
MSS: ACP=Active Code Page

Attribute: Database Code Pages
OpenEdge: UTF-8
ORACLE: NLS_CHARACTERSETS: AL32UTF8 & UTF8; NLS_NCHAR_CHARACTERSETS: AL16UTF16 or UTF8
MSS: UCS-2 (partial UTF-16)
Common Unicode Requirements
DataServer Migration
OpenEdge Process: -cpinternal UTF-8, -cpstream (OpenEdge code page conversions)
Schema Holder: _db-xl-name UTF-8
DataServer (layer or process), DB Driver (Driver conversions ?), Foreign Data Source: Database CP UTF-8
Migration sources: PRODB Database (_db-xl-name ANSI or UTF-8), .d file (cpstream=ISO8859-1), .d file (cpstream=ISO8859-5)
Recommended: Set $DLCDB environment variable to $DLC/prolang/utf
Build from: $DLC/prolang/utf/empty
Oracle DataServer Migration
_db-xl-name, -cpinternal and -cpstream
OpenEdge Process: -cpinternal UTF-8, -cpstream (OpenEdge conversions)
10.1C ORACLE DataServer (layer or process)
OCI Client Library: NLS_LANG=.AL32UTF8 (Driver conversions)
ORACLE 9i+ Database: Database Charset or National Charset = UTF-8
Oracle column types: VARCHAR, NVARCHAR, CLOB, CFILE, NCLOB
Schema Holder: _db-xl-name UTF-8
Migration sources: Database (_db-xl-name ANSI or UTF-8), .d file (cpstream=ISO8859-1), .d file (cpstream=ISO8859-5)
Oracle Unicode Migration
 What version of ORACLE
 Unicode Instance and
Unicode drivers must be 9i or
above
 Codepage for Schema Image
 Declares Unicode
 Collation Name
 Sets ICU collation
Oracle Unicode Migration
Two ways to configure an
ORACLE database to
store Unicode:
 Use Unicode Types
 Unchecked – Uses Database
Charset
NLS_CHARACTERSETS:
 AL32UTF8
 UTF8
 Checked – Uses National
Language Charset
 NLS_NCHAR_CHARACTERSETS:
 AL16UTF16
 UTF8
Oracle Unicode Migration
 For field widths use
 Width (recommended)
 Use SQL Width Tool
 Char semantics
 Checked –
CHAR(10) = 10 chars
(w/UTF8 = 10–30 bytes)
(w/AL32UTF8 = 10–40 bytes)
 Unchecked –
CHAR(10) = 10 bytes
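The difference between character and byte semantics shows up as soon as the data leaves the ASCII range. The sketch below uses Python's standard UTF-8 codec, which is assumed here to correspond to AL32UTF8; `byte_length` and the sample strings are illustrative only.

```python
def byte_length(text, encoding="utf-8"):
    """Number of bytes the text needs once encoded."""
    return len(text.encode(encoding))


samples = {
    "ASCII": "a" * 10,                    # 1 byte per character
    "CJK": "\u4e9c" * 10,                 # U+4E9C, 3 bytes per character
    "supplementary": chr(0x10302) * 10,   # U+10302, 4 bytes per character
}

for label, value in samples.items():
    print(f"{label}: {len(value)} chars, {byte_length(value)} bytes")
```

With byte semantics a CHAR(10) column would hold all ten ASCII characters but only three of the CJK characters.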
Oracle Unicode Migration
 Maximum char length
 Use Unicode Types
 = 2000 (assumes
NCS = AL16UTF16 )
 Use Unicode Types
 = 1000 (assumes
DB CP = AL32UTF8)
 Expand to CLOB
 Checked –
Greater than Maximum char
length produces CLOB
 Unchecked –
Greater than Maximum char
length produces LONG
(backward compatible)
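One plausible reading of the two maximum lengths is that they divide Oracle's 4000-byte character limit by the worst-case bytes per character of each character set. The slide does not state this derivation, so the sketch below is an assumption; `max_chars` is a name invented for the example.

```python
ORACLE_MAX_BYTES = 4000  # CHAR-type limit quoted in the Oracle vs. MSS comparison


def max_chars(max_bytes, worst_case_bytes_per_char):
    """Largest character count that is guaranteed to fit in max_bytes."""
    return max_bytes // worst_case_bytes_per_char


with_unicode_types = max_chars(ORACLE_MAX_BYTES, 2)  # AL16UTF16: 2 bytes per unit
with_db_charset = max_chars(ORACLE_MAX_BYTES, 4)     # AL32UTF8: up to 4 bytes

print(with_unicode_types, with_db_charset)
```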
MS SQL Server DataServer Migration
_db-xl-name, -cpinternal and -cpstream
OpenEdge Process: -cpinternal UTF-8, -cpstream UTF-8 (OpenEdge conversions)
10.1C MSS DataServer (layer or process)
ODBC Driver: ACP = OS CP (Driver conversions)
MSS 2005 Database: UCS-2 / UTF-16
SQL Server column types: NCHAR, NVARCHAR, NTEXT, NVARCHAR(max)
Schema Holder: _db-xl-name UTF-8
Migration sources: Database (_db-xl-name ANSI or UTF-8), .d file (cpstream=ISO8859-1), .d file (cpstream=ISO8859-5)
MS SQL Server Unicode Migration
 ODBC Data Source Name
 Must be Unicode Driver
 Codepage for Schema Image
 Declares Unicode
 Collation Name
 Sets ICU collation
 Use Unicode Types
 Checked – Selects Unicode
(Changes Codepage to UTF-8)
 NVARCHAR types
 Unchecked – Uses non-Unicode character types
 VARCHAR types
MS SQL Server Unicode Migration
 Maximum char length
 Use Unicode Types
 = 4000 (assumes
MSS 2005 = UCS-2
 For field width’s use
 Width (recommended)
 Use SQL Widtth Tool
 Expand width (utf-8)
 Checked – Doubles width
defined for NVARCHAR types
 NVARCHAR(1000) becomes
NVARHCAR (2000)
Linguistic Sorting and Collation
Sorting with Finnish collation
FOR EACH mytable
    BY COLLATE(myfield,"CASE-INSENSITIVE","ICU-fi"):
    DISPLAY myfield WITH FONT 8.
END.
Resulting order by collation:
Basic   ICU-UCA   ICU-fi
Aaa     Aaa       Aaa
Ááá     Ááá       Ááá
Äää     Äää       Bbb
Ççç     Bbb       Ccc
Ĉĉĉ     Ccc       Ĉĉĉ
Bbb     Ĉĉĉ       Ççç
Ccc     Ççç       Zzz
Zzz     Zzz       Äää
Linguistic Sorting and Collation
Comparing with Finnish collation
FOR EACH mytable
    WHERE COMPARE(myfield,">=","C",
                  "CASE-INSENSITIVE","ICU-fi")
    BY COLLATE(myfield,"CASE-INSENSITIVE","ICU-fi"):
    DISPLAY myfield WITH FONT 8.
END.
Rows returned by collation:
Basic   ICU-UCA   ICU-fi
Ccc     Ccc       Ccc
Zzz     Ĉĉĉ       Ĉĉĉ
        Ççç       Ççç
        Zzz       Zzz
                  Äää
Linguistic Sorting and Collation
Global Setup
Caution with performance!
Database: -cpcoll ICU-uca
AppServer: -cpcoll ICU-uca (uses client collation in COMPARE and COLLATE)
English User: -cpcoll ICU-en, TEMPTABLES
French User: -cpcoll ICU-fr, TEMPTABLES
Czech User: -cpcoll ICU-cs, TEMPTABLES
Finnish User: -cpcoll ICU-fi, TEMPTABLES
RUN ASprg.p ON hAppServer
    (INPUT SESSION:CPCOLL,
     INPUT USERID,
     INPUT <other parameters>,
     OUTPUT TABLE ttMytable).
8-bit Code Pages
 Where to find code page tables:
• 10.1B Internationalizing Applications manual (IBM850 and
ISO8859-1)
• http://www.microsoft.com/globaldev/reference/cphome.mspx
• http://www03.ibm.com/servers/eserver/iseries/software/globalization/codepages.html
• http://en.wikipedia.org
• http://www.fileformat.info/info/charset/index.htm
 Where to find Unicode Fonts:
• http://en.wikipedia.org/wiki/Code2000
 Information about Windows fonts:
http://www.microsoft.com/typography/fonts/default.aspx
http://www.microsoft.com/globaldev/getwr/steps/wrg_font.mspx
For More Information, go to…
 PSDN
• B2420-LV: From 26 to 96,000 Characters in 60 Minutes
• DEV-10: Supporting Multiple Languages in Your
Application
• DEV-23: Global Applications and Code Pages
 Progress eLearning Community:
• Understanding Internationalization – Salvador Vinals
 Documentation:
• OpenEdge Data Management: DataServer for Oracle
• OpenEdge Data Management: DataServer for Microsoft
SQL Server
• OpenEdge Development: Internationalizing Applications
Questions ?
Thank You