Salesforce

Codepage change to UTF8 German Characters are not sorted

« Go Back

Information

 
TitleCodepage change to UTF8 German Characters are not sorted
URL Nameconvert-db-to-utf-8-with-german-sorting
Article Number000111679
EnvironmentProduct: OpenEdge
Version: All supported versions
OS: All supported platforms
Other: i18N
Question/Problem Description
After converting a from Codepage 1252 with sorting "GERMAN9" to UTF-8 –cpinternal UTF-8 –cpcoll basic
extended characters are not sorted properly

Example: the german umlaute (e.g. "Ä") are sorted after the letter "Z" instead after e.g. "A"

When the ICU-de__PHONEBOOK.df is used instead of "_tran.df" , IDXBUILD finds duplicate keys on unique index entries 1127 on  indexes built with -cpinternal UTF-8  -cpcoll ICU-de__PHONEBOOK 
Steps to Reproduce
Clarifying Information
How to convert DB to UTF-8 with German sorting?
Error Message
Defect Number
Enhancement Number
Cause
To insert the German A-Umlaut = “Ä” after or equivalent to “AE” at sorts, means AE is for collation equivalent to Ä. 

In UTF-8-BASIC the Ä is the last in collation, so will not be inserted at this position.
Which is why the German umlaute (e.g. "Ä") will be sorted after the letter "Z" instead after e.g. "A".

The default BASIC collation defined in prolang/utf/_tran.df, is based on the binary values of the characters. It doesn't satisfy the sorting requirements of any language (except English perhaps)

For this reason we provide “ICU-xxxx.df”-files to change this behaviour so that extended characters are correctly sorted.

When the ICU-de__PHONEBOOK.df is chosen over the basic tran.df
Brings us to the second condition in equality sorting which, when you've been converting existing data, then results in problems on unique indexed fields (eg error 1127)

For example: If a key field belonging to an unique index contains:
"Kraemer" where there is another record with the entry "Krämer", the collation treats ae = är (equivalent)

Effectively, ICU-de_PHONEBOOK.df is the wrong choice for equality matching, it has collation rules you don't want (Ä = identical to AE)
So Ä and AE cannot be stored as different units in a unique indexed field.
 
Resolution
Which “ICU-xxxx.df”-file to use?

The collation ICU-UCA is the most basic UTF-8 collation. The ICU-UCA.df is our implementation of the DUCET (Default Unicode Collation Element Table) It will allow you to add the entries to your unique index because it considers 'ä' as different to 'ae'. If you don't have a sorting requirement for a particular locale it's a good choice.

In OpenEdge 11.6, there is more recent version (e.g. ICU_48-UCA).

The following example code is useful in evaluating the collation needed:
 
/* 
Start the session with: prowin32 -cpinternal UTF-8 
Testing “Krämer” vs “Kraemer” 
​*/ 

DEFINE VARIABLE c1 AS CHARACTER NO-UNDO. 
DEFINE VARIABLE c2 AS CHARACTER NO-UNDO. 

c1 = "Kr" + CHR(228,"UTF-8","1252") + "mer". 
c2 = "Kraemer". 

MESSAGE c1 SKIP 
c2 SKIP 
"Equal : " c1 = c2 SKIP 
"---------------------------------------------" SKIP 
"CASE-SENSITIVE : " COMPARE(c1,'=',c2,"CASE-SENSITIVE","ICU-UCA") 
SKIP 
"CASE-INSENSITIVE : " COMPARE(c1,'=',c2,"CASE-INSENSITIVE","ICU-UCA") 
SKIP 
"CAPS : " COMPARE(c1,'=',c2,"CAPS","ICU-UCA") 
SKIP 
VIEW-AS ALERT-BOX INFO BUTTONS OK.

 
Workaround
Notes
Keyword Phrase
Last Modified Date3/13/2019 8:30 AM

Powered by