How to make a valid UTF-8 JSON when my ABL session:cpinternal is 1252

Posted by Jurjen Dijkstra on 16-Jul-2015 14:56

Hi,

Our ABL session runs with cpinternal="1252". I don't know why, but I have to deal with it.

From that session I need to make a JSON file and send it to Elasticsearch (to a REST service, using a http socket).

Some string values have characters with accents and German umlauts. The Elasticsearch server complains that the document is not valid UTF-8. When I put a NodeJS server in between, NodeJS also says it is not valid UTF-8, so it is not Elasticsearch's fault.

I have tried and tried, but I do not know how I can make a JSON longchar in "UTF-8" when cpinternal is "1252".

Here are some snippets of code that are involved in my solution (which is not really a solution). Please, what can I do?


define variable jsondocument as longchar no-undo.
jsondocument = "".
fix-codepage(jsondocument) = "UTF-8".

method public void BufferToJson (input hBuffer as handle,
                                 input-output jsondocument as longchar,
                                 input fieldstoskip as character):
    /* make 1 JSON document for 1 temp-table buffer. */
    define variable json   as Progress.Json.ObjectModel.JsonObject no-undo.
    define variable hField as handle  no-undo.
    define variable i      as integer no-undo.

    jsondocument = "".
    json = new Progress.Json.ObjectModel.JsonObject().
    do i = 1 to hBuffer:num-fields:
        hField = hBuffer:buffer-field(i).
        if not can-do(fieldstoskip, hField:name) then
        if hField:buffer-value <> ? then
        do:
            case hField:data-type:
                when "date":U        then json:Add(hField:name, DateField(hField:buffer-value)).
                when "datetime":U    then json:Add(hField:name, DateField(hField:buffer-value)).
                when "datetime-tz":U then json:Add(hField:name, DateField(hField:buffer-value)).
                otherwise
                    if hField:data-type = "character" then
                        json:Add(hField:name, FixedCodepage(hField:buffer-value)).
                    else
                        json:Add(hField:name, hField:buffer-value).
            end case.
        end.
    end.
    json:Write(input-output jsondocument, true, "UTF-8":U).
    delete object json.
end method.

method private character FixedCodepage (input databasevalue as character):
    define variable i            as integer   no-undo.
    define variable kar          as integer   no-undo.
    define variable outputstring as character no-undo.

    if session:cpinternal = "UTF-8":U then
        return databasevalue.
    else do:
        outputstring = "".
        do i = 1 to length(databasevalue, "character"):
            kar = asc(substring(databasevalue, i, 1, "character")).
            /* TODO: this is an emergency solution: just delete the characters
               that are questionable in utf-8. What else can I do? :-( */
            if kar < 128 then
                outputstring = outputstring + chr(kar, "UTF-8", session:cpinternal).
        end.
        if outputstring <> databasevalue then
            LogMessage(substitute('codepage conflict: "&1" changed to "&2"', databasevalue, outputstring)).
        return outputstring.
    end.
end method.

thanks,

Jurjen.

All Replies

Posted by Jurjen Dijkstra on 16-Jul-2015 15:04

Oops I forgot to mention the version. This is OpenEdge version 11.4

Posted by Mike Fechner on 16-Jul-2015 15:22

Try to FIX-CODEPAGE (jsondocument) within the BufferToJson method.
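Something like this (untested), right before the Write call:

    /* inside BufferToJson, so the longchar is (re)fixed to UTF-8 where it is filled */
    jsondocument = "".
    fix-codepage(jsondocument) = "UTF-8":U.
    json:Write(input-output jsondocument, true, "UTF-8":U).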

Posted by Jurjen Dijkstra on 17-Jul-2015 06:24

Thanks Mike.

Unfortunately it did not help.

Having a closer look I see that not every character causes the same kind of problem.

Most characters with accents are accepted by Elasticsearch but then show up as 4 bytes of gibberish.

There are also characters that cause Elasticsearch to simply refuse the document. For example the umlaut in "Hückelhoven".

Posted by pedrorodriguez on 17-Jul-2015 07:12

Hi,

If you can transfer the content to Elasticsearch using a MEMPTR variable, you could skip all the encoding work while generating the JSON: let it be generated in the internal codepage and, right before sending it, copy it to a MEMPTR:

COPY-LOB jsonDocument TO memDocument CONVERT TARGET CODEPAGE "UTF-8".

Something like that will be a little bit slower, but it should work OK.
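Roughly like this (just a sketch; memDocument and hSocket are placeholder names for your own variables):

    define variable memDocument as memptr no-undo.

    /* build jsonDocument in the session codepage as before, then convert once: */
    copy-lob from jsonDocument to memDocument convert target codepage "UTF-8".

    /* hand the raw UTF-8 bytes to the socket */
    hSocket:write(memDocument, 1, get-size(memDocument)).
    set-size(memDocument) = 0.  /* release the memory afterwards */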

Regards,

Posted by TheMadDBA on 17-Jul-2015 07:13

How are you verifying that the JSON is not in UTF-8? You need to make sure whatever you are looking at the JSON with knows to use UTF-8 and not the default code page for your OS.

Write the output to a file and open it with Explorer... right click and choose Encoding --> UTF8. The characters should appear as you expect them to.
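For example (untested, the file name is just an example):

    /* dump the longchar's bytes unchanged so you can inspect the real encoding */
    copy-lob from jsondocument to file "debug.json" no-convert.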

You might just be having an issue with the codepage environment settings on the Elasticsearch side.

Posted by pedrorodriguez on 17-Jul-2015 07:16

Actually, after going through the documentation for JsonObject, the write method can be used directly with a MEMPTR variable and makes the conversion to UTF-8 for you.

So, just create the JSON as CP1252 and then, instead of writing it to a LONGCHAR variable, write it to a MEMPTR variable; that should make the encoding correct.
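Roughly (memDocument is just a placeholder; check the exact Write overloads in the doc):

    define variable memDocument as memptr no-undo.

    /* the MEMPTR flavour of Write produces UTF-8 for you, no FIX-CODEPAGE needed */
    json:Write(output memDocument, true).
    /* ... send memDocument to Elasticsearch ... */
    set-size(memDocument) = 0.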

Posted by pedrorodriguez on 17-Jul-2015 07:17

Elasticsearch only works with UTF-8.

Posted by TheMadDBA on 17-Jul-2015 07:39

But it is still dependent on the OS settings being correct....

stackoverflow.com/.../elasticsearch-shows-umlauts-as

Posted by TheMadDBA on 17-Jul-2015 07:40

If you look at the documentation further you will see that it also does UTF-8 conversions for LONGCHAR.

The standard WRITE-JSON also does this for TTs and datasets.
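E.g. something along these lines (from memory; hTempTable is a placeholder for your temp-table or dataset handle):

    define variable lcJson as longchar no-undo.

    fix-codepage(lcJson) = "UTF-8":U.
    hTempTable:write-json("longchar", lcJson, false, "UTF-8":U).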

Posted by pedrorodriguez on 17-Jul-2015 08:58

I have found issues when working with LONGCHARs and the implicit conversions that Progress has to do when you move them around. So, my guess (from the information we have) was that maybe the problem happens when he dumps the longchar variable to the socket. Hence the suggestion of using a MEMPTR, which doesn't have any implicit conversions.

Posted by pedrorodriguez on 17-Jul-2015 09:00

I think that case describes issues with the OS settings for a script loading data into Elasticsearch (so basically exactly the same case we are discussing here), not Elasticsearch itself.

Posted by TheMadDBA on 17-Jul-2015 09:53

I personally haven't seen any issues with conversions unless the character is not supported in the target codepage.

It is easy enough to see if the issue is on the OpenEdge side or not by writing the JSON to disk and inspecting it with a viewer that supports UTF-8. It works perfectly for me on several versions of OE.

Elasticsearch, like most tools, is dependent on the OS not messing with the data before it gets inside its domain. Just like OpenEdge, it can handle valid codepage conversions... up until the point that something else does a double/invalid conversion.

Our beloved Tex Texin has provided this example of what is most likely happening:

www.i18nqa.com/.../utf8-debug.html

Posted by Jurjen Dijkstra on 20-Jul-2015 02:28

Thanks guys,

Good discussion with many interesting pointers (including the document from Master Texin). I think there is enough to study and I hope I can make it work now. I never realized this, but perhaps the JSON output was already valid UTF-8 and got messed up by our HTTP socket class. I will check that too.

Thanks!!!

This thread is closed