CODEPAGE question

Posted by goo on 25-Sep-2019 07:16

11.7 / 12.0

Could anyone explain to me why this gives me two different answers?

def var myISO as longchar no-undo.
def var myUTF as longchar no-undo.

fix-codepage(myISO) = session:cpinternal.
fix-codepage(myUTF) = 'UTF-8'.

myISO = 'Ä'.

/*Changing these will give me two different results for utf.txt .... why?*/

//myUTF = codepage-convert(myISO,'UTF-8',session:cpinternal).
myUTF = codepage-convert('Ä','UTF-8',session:cpinternal).

copy-lob myISO to file 'e:\temp\iso.txt' no-convert.
copy-lob myUTF to file 'e:\temp\utf.txt' no-convert.

All Replies

Posted by frank.meulblok on 25-Sep-2019 08:09

The "codepage-convert('Ä','UTF-8',session:cpinternal)"  gives mojibake instead of the expected  'Ä', and has more bytes than expected.

So in that case there's a double conversion: the character gets converted from the single-byte ISO codepage to a multibyte UTF-8 sequence, then the individual bytes of the UTF-8 sequence get interpreted as single-byte characters and converted again.

The double conversion is probably because when you fix-codepage your longchars, there is an automatic codepage conversion that takes effect. (Rules are buried in the docs on the ASSIGN statement) -> that automatic conversion happens on top of the explicit conversion you have in the code.
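
The double conversion described above can be reproduced at byte level outside the AVM. The following is a minimal Python sketch, not ABL, and it assumes session:cpinternal is ISO8859-1 (called 'latin-1' in Python). The variable names are hypothetical and it only illustrates the byte patterns, not what the AVM does internally.

```python
# Byte-level illustration only; assumes the session code page is ISO8859-1.
char = "\u00c4"  # 'Ä', Latin capital letter A with diaeresis

iso_bytes = char.encode("latin-1")   # one byte in ISO8859-1
utf8_once = char.encode("utf-8")     # correct UTF-8: two bytes

# Double conversion: the two UTF-8 bytes are misread as two ISO8859-1
# characters, and each of those is converted to UTF-8 again.
misread = utf8_once.decode("latin-1")
utf8_twice = misread.encode("utf-8")

print(iso_bytes.hex())    # c4
print(utf8_once.hex())    # c384
print(utf8_twice.hex())   # c383c284
```

Under that assumption, a utf.txt containing the bytes C3 83 C2 84 instead of C3 84 would point to the content having been converted twice.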

Posted by slacroixak on 25-Sep-2019 09:33

I am a little bit confused by the chosen sample 'A', as it is actually encoded with one single byte in UTF-8, like all characters in the ASCII set (below 128). The 8 in UTF-8 means it can go down to 8 bits.

In UTF-8, extended characters (those above 127 in single byte encodings) are encoded with 2, 3 or 4 bytes (so 16 to 32 bits)

=> en.wikipedia.org/.../UTF-8

Said differently, strings made with only ASCII chars (below 128) should be encoded the same in all single byte codepages as well as in UTF-8

There should be differences only if extended characters are involved (like letter with accents, etc...)
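
These byte counts can be checked with a short Python sketch (an illustration, not ABL; the sample characters are chosen only for demonstration, and 'latin-1' stands in for a single-byte ISO8859-1 codepage):

```python
# Number of bytes each character takes when encoded as UTF-8.
samples = ["A", "\u00c4", "\u20ac", "\U0001F600"]  # A, A with diaeresis, euro sign, an emoji
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(utf8_lengths)  # [1, 2, 3, 4]

# Pure ASCII text has the same bytes in ISO8859-1 and in UTF-8 ...
ascii_same = "Hello".encode("latin-1") == "Hello".encode("utf-8")
# ... but an accented letter does not.
accent_same = "\u00c4".encode("latin-1") == "\u00c4".encode("utf-8")
print(ascii_same, accent_same)  # True False
```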

Posted by frank.meulblok on 25-Sep-2019 09:50

The sample isn't 'A' (Latin capital letter A, codepoint U+0041), it's 'Ä' (Latin capital letter A with diaeresis, codepoint U+00C4, assuming composed form).

Posted by slacroixak on 25-Sep-2019 09:54

Oops, I missed the double dots on my screen.

Posted by goo on 25-Sep-2019 11:01

OK, so the correct way of doing an ISO -> UTF -> ISO conversion would be something like this?

myUTF = 'Ä'.

myISO = myUTF.

I would believe that

def var myTekst as char no-undo. /*by default session:cpinternal*/

myTekst = 'Ä'.

would be the same as 'Ä'

since myTekst = 'Ä' is true.

But CODEPAGE-CONVERT converts differently depending on whether it is given 'Ä' or myTekst.

Is that the way it should be?

Posted by slacroixak on 26-Sep-2019 06:50

As far as I understand, a LONGCHAR variable is "code-page aware". Once you have set its code page with the FIX-CODEPAGE statement, you should no longer play with the CODEPAGE-CONVERT() function. The assignment should handle the conversions for you by taking into account the source and target code pages. At least, this is what I would expect.

Said differently, when you do:

myTekst =  "Some constant".

 or

myTekst =  aSimpleCharVar.

-> the ABL is aware of the code page of "Some constant" or aSimpleCharVar, aka SESSION:CPINTERNAL

So the implicit ASSIGN statement should convert it automatically to the codepage of myTekst.

Similarly, if you do myTekst = myUTF.

 -> the ABL should convert the content of myUTF from its encoding to the encoding of myTekst.

If you do something like myUTF = codepage-convert('Ä','UTF-8',session:cpinternal).

then you may somehow wrongly apply twice the conversion transformation and not obtain what you want

BTW, if you look at the new OpenEdge.Core.String type, you will see it actually encapsulates a LONGCHAR variable and exposes an Encoding property, probably to use the FIX-CODEPAGE statement.

I'd be pleased to be corrected by a PSC person if I'm wrong

Hope it Helps.

Posted by Peter Judge on 26-Sep-2019 13:42

 
From the Help, it looks like your assertion is correct.
 
Default character conversions with the ASSIGN statement

When the target field is a . . . | And the source expression results in a . . . | The AVM converts the result of the source expression to . . .
CHARACTER                       | LONGCHAR                                     | -cpinternal code page
LONGCHAR                        | CHARACTER                                    | -cpinternal or the fixed code page
 

> BTW, if you look at the new OpenEdge.Core.String type, you will see it actually encapsulates a LONGCHAR variable and exposes an Encoding property, probably to use the FIX-CODEPAGE statement.

Sebastien, is there something that's not working properly? Or not in an expected way?

The OpenEdge.Core.String object
- holds all its values in a (private) UTF-8-encoded longchar (regardless of the session's CPINTERNAL value)
- has an Encoding property that  defaults to CPINTERNAL
- Has a public Value property that's a longchar and which performs a CODEPAGE-CONVERT() when needed (when the GET runs).
 
    /** Contains the actual string value. Marked as NON-SERIALIZABLE since the actual value is derived,
        and stored in the private mUTF8Value variable */
    define public non-serializable property Value as longchar no-undo
        get():
            // no need for changes if we're using UTF-8 as CPINTERNAL
            if this-object:Encoding eq 'UTF-8':u then
                return mUTF8Value.
            else
                return codepage-convert(mUTF8Value, this-object:Encoding).
        end get.
   
 
You should be able to see this easily enough.
 
def var objString as OpenEdge.Core.String.
def var lcValue as longchar.
def var lcIn as longchar.
 
fix-codepage(lcIN) = 'utf-8'.
lcIN = '  Ä '.
 
objString = new OpenEdge.Core.String(lcin).
 
objString:Encoding = 'ISO8859-1'.
 
lcValue = objString:Value.
 
message
    'session:cpinternal = ' session:cpinternal skip // is UTF-8 in my case
    'lcValue:cpinternal = ' get-codepage(lcValue)   // should be ISO8859-1
    string(lcValue)
    view-as alert-box.
 
 
 
 

Posted by goo on 26-Sep-2019 14:14

I was concerned by the result of
 
def var myISOvalue as char no-undo. /* session:cpinternal = 'iso8859-1' */
def var UTF8value1 as longchar no-undo.
def var UTF8value2 as longchar no-undo.
myISOvalue = 'Ä'.

fix-codepage(UTF8value1) = 'UTF-8'.
fix-codepage(UTF8value2) = 'UTF-8'.

UTF8value1 = codepage-convert(myISOvalue,'UTF-8',session:cpinternal).
UTF8value2 = codepage-convert('Ä','UTF-8',session:cpinternal).
 
These give different values when I copy-lob them to file with no-convert.
 
So I just wondered why that happened. myISO is a char, not a longchar.
 
But at the end, I do not have to use codepage-convert into a fixed longchar…
 

Posted by slacroixak on 27-Sep-2019 09:14

Hi Goo, I agree with you this is weird.  As I was trying to say, this :

   UTF8value1 = codepage-convert(myIsoValue,'UTF-8',session:cpinternal).

may result in applying twice the conversion transformation from your cpinternal to UTF-8

If it does so, then I'd consider it as a bug and would open a Tech Support Ticket

IMHO, when the ASSIGN statement assigns the result of the codepage-convert function to a longchar, it should not convert the result a second time, especially if the target code page parameter of codepage-convert matches the code page of the longchar. The whole problem is what to do when these two code pages do not match... raise a runtime error perhaps?

HTH

This thread is closed