Extended UTF-8 characters are not parsed correctly by the SA

Posted by geertjeguns@hotmail.com on 01-Jun-2017 03:37

Hello everyone,

I'm having a small issue while parsing an XML file that is encoded in UTF-8.

It contains some special characters like ‘ (U+0091), ’ (U+0092), “ (U+0093), ” (U+0094), œ (U+009C) and so on.

It may not be obvious, but although the characters above look like a single quotation mark ' and a double quotation mark ", they are not the same characters.

I first fix the longchar's codepage to UTF-8 and then read the XML into it (with or without FIX-CODEPAGE, the result is the same).
No conversion is needed because the XML file was already created in UTF-8, hence the NO-CONVERT.
I then use the longchar as the input source for the SAX-READER.

Example of my code: 

/* Set a fixed codepage (UTF-8) for the longchar */
FIX-CODEPAGE ( wclong ) = "UTF-8".

/* copy the xml to a longchar */
COPY-LOB FILE wcxml TO wclong NO-CONVERT.

/* OUTPUT TO "d:\users\geegun\webservice\bal\esbpws\longcontent.txt". */
/* EXPORT wclong .                                                    */
/* OUTPUT CLOSE.                                                      */

CREATE SAX-READER whParser.
RUN saxparserprocedure.p PERSISTENT SET whHandler. 

whParser:HANDLER = whHandler.
whParser:SET-INPUT-SOURCE("LONGCHAR", wclong ).  
 
whParser:SAX-PARSE-FIRST() NO-ERROR.

ParseLoop:
REPEAT WHILE whParser:PARSE-STATUS = SAX-RUNNING:
  whParser:SAX-PARSE-NEXT() NO-ERROR.
  
  IF whParser:PRIVATE-DATA = "FatalErrorInvokedByUser"
  THEN DO:
      ASSIGN ERROR-STATUS:ERROR = TRUE.
      LEAVE ParseLoop. 
  END.
END.

IF ERROR-STATUS:ERROR
THEN DO:
  /* ... some error handling here ... */
END.
ELSE DO:
  /* get the dataset from the saxparserprocedure */
  RUN getdata IN whHandler (OUTPUT DATASET-HANDLE whdataset BIND, OUTPUT iplfuncerror, OUTPUT ipcErrorMsg).
END.

When I uncomment the 'OUTPUT TO' statements in the code above, the exported file still contains all the special characters.

But when I look at an attribute's value (using GET-VALUE-BY-INDEX(indexPosition)) during parsing, the value has already been changed.

Attached to this post you can find an excerpt of the XML file (esbpws2.xml). The text 'Vidange d’huile' in it contains one of the special characters: it is not a normal apostrophe.
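To see which code points a file like this actually contains, independent of OpenEdge, one option is to decode the raw bytes and list every non-ASCII character. The following is only a minimal sketch in Python rather than ABL, and the helper name `non_ascii_code_points` is hypothetical. It may matter here because the glyphs ‘ ’ “ ” œ are normally U+2018, U+2019, U+201C, U+201D and U+0153 in Unicode, whereas 0x91, 0x92, 0x93, 0x94 and 0x9C are their byte values in Windows-1252, and the code points U+0091 to U+009C themselves are C1 control characters.

```python
import unicodedata


def non_ascii_code_points(data: bytes, encoding: str = "utf-8") -> list[str]:
    """Decode raw file bytes and describe each distinct non-ASCII character."""
    text = data.decode(encoding)
    seen = []
    for ch in text:
        if ord(ch) > 0x7F and ch not in seen:
            seen.append(ch)
    # unicodedata.name() has no name for control characters such as U+0092,
    # so a placeholder is passed as the default instead of raising ValueError.
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed or control>')}"
        for ch in seen
    ]


# The same visible apostrophe can come from two different code points.
# Both sample strings are made up for illustration, not read from the attachment.
typographic = "Vidange d\u2019huile".encode("utf-8")  # UTF-8 bytes E2 80 99
c1_control = "Vidange d\u0092huile".encode("utf-8")   # UTF-8 bytes C2 92

print(non_ascii_code_points(typographic))
print(non_ascii_code_points(c1_control))
```

Running the helper on the real file, for example `non_ascii_code_points(open("esbpws2.xml", "rb").read())`, would show whether the apostrophe is stored as U+2019 or as U+0092, which helps when comparing the file against what the parser returns.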

I've been searching for a solution for a while and found the following KB article from 2014, which describes my problem, but unfortunately it doesn't seem to offer a solution.
http://knowledgebase.progress.com/articles/Article/000054284

Does anyone have an idea on how to solve this? 
Or has anyone had the same problem before?

Thanks in advance,

Geert

All Replies

Posted by Garry Hall on 01-Jun-2017 07:54

You don't mention the OpenEdge version. Defect PSC00315657, referenced in the KB article, was addressed in 11.5.1 and 11.6.0.

Posted by geertjeguns@hotmail.com on 01-Jun-2017 09:47

Hello Garry,

Unfortunately we are using OE 11.5.0 at the moment.  Our management is still deciding whether to update to 11.5.1, 11.6.0 or even 11.7.0.  

Thanks for the info.
