Information

Title	Codepage change to UTF8 German Characters are not sorted

URL Name	convert-db-to-utf-8-with-german-sorting

Article Number	000111679

Environment	Product: OpenEdge Version: All supported versions OS: All supported platforms Other: i18N

Question/Problem Description

After converting a from Codepage 1252 with sorting "GERMAN9" to UTF-8 –cpinternal UTF-8 –cpcoll basic
extended characters are not sorted properly

Example: the german umlaute (e.g. "Ä") are sorted after the letter "Z" instead after e.g. "A"

When the ICU-de__PHONEBOOK.df is used instead of "_tran.df" , IDXBUILD finds duplicate keys on unique index entries 1127 on indexes built with -cpinternal UTF-8 -cpcoll ICU-de__PHONEBOOK

Steps to Reproduce

Clarifying Information

How to convert DB to UTF-8 with German sorting?

Error Message

Defect Number

Enhancement Number

Cause

To insert the German A-Umlaut = “Ä” after or equivalent to “AE” at sorts, means AE is for collation equivalent to Ä.

In UTF-8-BASIC the Ä is the last in collation, so will not be inserted at this position.
Which is why the German umlaute (e.g. "Ä") will be sorted after the letter "Z" instead after e.g. "A".

The default BASIC collation defined in prolang/utf/_tran.df, is based on the binary values of the characters. It doesn't satisfy the sorting requirements of any language (except English perhaps)

For this reason we provide “ICU-xxxx.df”-files to change this behaviour so that extended characters are correctly sorted.

When the ICU-de__PHONEBOOK.df is chosen over the basic tran.df
Brings us to the second condition in equality sorting which, when you've been converting existing data, then results in problems on unique indexed fields (eg error 1127)

For example: If a key field belonging to an unique index contains:
"Kraemer" where there is another record with the entry "Krämer", the collation treats ae = är (equivalent)

Effectively, ICU-de_PHONEBOOK.df is the wrong choice for equality matching, it has collation rules you don't want (Ä = identical to AE)
So Ä and AE cannot be stored as different units in a unique indexed field.

Resolution

Which “ICU-xxxx.df”-file to use?

The collation ICU-UCA is the most basic UTF-8 collation. The ICU-UCA.df is our implementation of the DUCET (Default Unicode Collation Element Table) It will allow you to add the entries to your unique index because it considers 'ä' as different to 'ae'. If you don't have a sorting requirement for a particular locale it's a good choice.

In OpenEdge 11.6, there is more recent version (e.g. ICU_48-UCA).

The following example code is useful in evaluating the collation needed:

/* 
Start the session with: prowin32 -cpinternal UTF-8 
Testing “Krämer” vs “Kraemer” 
*/ 

DEFINE VARIABLE c1 AS CHARACTER NO-UNDO. 
DEFINE VARIABLE c2 AS CHARACTER NO-UNDO. 

c1 = "Kr" + CHR(228,"UTF-8","1252") + "mer". 
c2 = "Kraemer". 

MESSAGE c1 SKIP 
c2 SKIP 
"Equal : " c1 = c2 SKIP 
"---------------------------------------------" SKIP 
"CASE-SENSITIVE : " COMPARE(c1,'=',c2,"CASE-SENSITIVE","ICU-UCA") 
SKIP 
"CASE-INSENSITIVE : " COMPARE(c1,'=',c2,"CASE-INSENSITIVE","ICU-UCA") 
SKIP 
"CAPS : " COMPARE(c1,'=',c2,"CAPS","ICU-UCA") 
SKIP 
VIEW-AS ALERT-BOX INFO BUTTONS OK.

Workaround

Notes

Keyword Phrase

Last Modified Date	3/13/2019 8:30 AM

Codepage change to UTF8 German Characters are not sorted

Information