11.7 / 12.0
Could anyone explain to me why this gives me two different answers?
def var myISO as longchar no-undo.
def var myUTF as longchar no-undo.
fix-codepage(myISO) = session:cpinternal.
fix-codepage(myUTF) = 'UTF-8'.
myISO = 'Ä'.
/*Changing these will give me two different results for utf.txt .... why?*/
//myUTF = codepage-convert(myISO,'UTF-8',session:cpinternal).
myUTF = codepage-convert('Ä','UTF-8',session:cpinternal).
copy-lob myISO to file 'e:\temp\iso.txt' no-convert.
copy-lob myUTF to file 'e:\temp\utf.txt' no-convert.
The "codepage-convert('Ä','UTF-8',session:cpinternal)" gives mojibake instead of the expected 'Ä', and has more bytes than expected.
So in that case there's a double conversion: the character gets converted from the single-byte ISO codepage to a multibyte UTF-8 sequence, then the individual bytes of the UTF-8 sequence get interpreted as single-byte characters and converted again.
The double conversion probably happens because, once you fix-codepage your longchars, an automatic codepage conversion takes effect on assignment (the rules are buried in the docs on the ASSIGN statement). That automatic conversion happens on top of the explicit conversion you have in the code.
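To make the byte counts concrete, here is a minimal sketch of the double conversion described above. It is written in Python rather than ABL purely to show the bytes, and it assumes session:cpinternal is ISO8859-1 (Python's "latin-1"); with another single-byte codepage such as 1252 the second pass would produce different bytes.

```python
# Illustration only: Python stand-in for the ABL behaviour discussed above.
# Assumption: the session codepage is ISO8859-1 ("latin-1" in Python).
original = "Ä"

iso_bytes = original.encode("latin-1")    # what iso.txt should contain
utf8_bytes = original.encode("utf-8")     # what utf.txt should contain

# Second, unwanted pass: each UTF-8 byte is read as one ISO character
# and then converted to UTF-8 again.
misread = utf8_bytes.decode("latin-1")
double_bytes = misread.encode("utf-8")

print(iso_bytes.hex())     # c4
print(utf8_bytes.hex())    # c384
print(double_bytes.hex())  # c383c284
```

Under that assumption, a correct utf.txt is 2 bytes and a double-converted one is 4 bytes, which matches the "more bytes than expected" symptom.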
I am a little bit confused by the chosen sample 'A', as it is actually encoded with one single byte in UTF-8, like all characters in the ASCII set (below 128). The 8 in UTF-8 means it can go down to 8 bits.
In UTF-8, extended characters (those above 127 in single-byte encodings) are encoded with 2, 3 or 4 bytes (so 16 to 32 bits).
Said differently, strings made with only ASCII chars (below 128) should be encoded the same in all single byte codepages as well as in UTF-8
There should be differences only if extended characters are involved (like letter with accents, etc...)
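A quick way to check this claim outside ABL is the small Python sketch below. The helper name encodings is made up for the illustration, and "latin-1" (ISO8859-1) stands in for a typical single-byte session codepage.

```python
# Illustration only: compare a single-byte encoding with UTF-8.
# "encodings" is a hypothetical helper, not an ABL or library function.
def encodings(ch):
    """Return (single-byte ISO8859-1 bytes, UTF-8 bytes) for one character."""
    return ch.encode("latin-1"), ch.encode("utf-8")

for ch in ("A", "Ä"):
    single, multi = encodings(ch)
    print(repr(ch), single.hex(), multi.hex(), single == multi)
```

The plain 'A' comes out as the same single byte (41) in both encodings, while 'Ä' is one byte (c4) in ISO8859-1 and two bytes (c384) in UTF-8.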
The sample isn't 'A' (Latin capital letter A, codepoint U+0041); it's 'Ä' (Latin capital letter A with diaeresis, codepoint U+00C4, assuming the composed form).
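For what "assuming the composed form" means in bytes, here is a small Python sketch (not ABL) using the standard unicodedata module. Note that the decomposed form cannot be stored in ISO8859-1 at all, because the combining diaeresis U+0308 has no single-byte equivalent there.

```python
# Illustration only: composed vs decomposed form of the same visible letter.
import unicodedata

composed = "\u00c4"                                  # single codepoint U+00C4
decomposed = unicodedata.normalize("NFD", composed)  # 'A' followed by U+0308

print(len(composed), composed.encode("utf-8").hex())      # 1 c384
print(len(decomposed), decomposed.encode("utf-8").hex())  # 2 41cc88
```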
Oops, I missed the double dots on my screen.
OK, so the correct way of doing an ISO -> UTF -> ISO round trip would be something like this?
myUTF = 'Ä'.
myISO = myUTF.
I would believe that
def var myTekst as char no-undo. /*by default session:cpinternal*/
myTekst = 'Ä'.
would be the same as 'Ä'
since myTekst = 'Ä' is true.
But CODEPAGE-CONVERT converts differently when given 'Ä' than when given myTekst.
Is that the way it should be?
As far as I understand, a LONGCHAR variable is "code-page aware". Once you have set its code page with the FIX-CODEPAGE statement, you should no longer play with the CODEPAGE-CONVERT function. It should handle the conversions for you when you assign it to something, by taking into account the source and target code pages. At least, this is what I would expect.
Said differently, when you do:
myTekst = "Some constant".
or
myTekst = aSimpleCharVar.
-> the ABL is aware of the code page of "Some constant" or aSimpleCharVar, aka SESSION:CPINTERNAL
So the implicit ASSIGN statement should convert it automatically to the codepage of myTekst.
Similarly, if you do myTekst = myUTF.
-> the ABL should convert the content of myUTF from its encoding to the encoding of myTekst.
If you do something like myUTF = codepage-convert('Ä','UTF-8',session:cpinternal).
then you may somehow wrongly apply the conversion twice and not obtain what you want.
BTW, if you look at the new OpenEdge.Core.String type, you will see it actually encapsulates a LONGCHAR variable and exposes an Encoding property, probably to use the FIX-CODEPAGE statement.
I'd be pleased to be corrected by a PSC person if I'm wrong
Hope it Helps.
Default character conversions with the ASSIGN statement
| When the target field is a . . . | And the source expression results in a . . . | The AVM converts the result of the source expression to . . . |
| CHARACTER | LONGCHAR | -cpinternal code page |
| LONGCHAR | CHARACTER | -cpinternal or the fixed code page |
> BTW, if you look at the new OpenEdge.Core.String type, you will see it actually encapsulates a LONGCHAR variable and exposes an Encoding property, probably to use the FIX-CODEPAGE statement.
Sebastien, is there something that's not working properly? Or not in an expected way?
Hi Goo, I agree with you that this is weird. As I was trying to say, this:
UTF8value1 = codepage-convert(myIsoValue,'UTF-8',session:cpinternal).
may result in applying the conversion from your cpinternal to UTF-8 twice.
If it does so, then I'd consider it a bug and would open a Tech Support ticket.
IMHO, when the ASSIGN statement assigns the value of the codepage-convert function to a longchar, it should not convert the result a second time, especially if the target code page parameter of codepage-convert matches the codepage of the longchar. The whole problem is what to do when these two codepages do not match... raise a runtime error perhaps?
HTH