Invalid Unicode characters in XML 1.0 document - Forum - OpenEdge Development - Progress Community

Invalid Unicode characters in XML 1.0 document


Invalid Unicode characters in XML 1.0 document

This question is not answered


In our ERP system we export some data to xml-files. We do this by buffer-copying rows from tables to temp-tables, and then exporting these using WRITE-XML. This results in a file starting with <?xml version="1.0" encoding="UTF-8"?>. So it's an xml 1.0 file.

Some of our customers have data in the tables with unicode characters, for instance U+1A. In our application, that's perfectly valid data. However, it is an invalid character in xml 1.0. It's valid in xml 1.1, but only when it's written as &#x1A;

So we have two issues here:

1. As far as I can tell, WRITE-XML always writes xml 1.0 files. Is it possible to write xml 1.1 files? If so, how? I can't find anything in the docs. We could just replace the header of the resulting document, but that's a hack.

2. We need to rewrite almost all invalid characters from U+01 until U+1F to the &#x__; form. Is there a smart way to do this? As we are always interested in complete records, buffer-copy is the most efficient way for us to copy the records to the temp-table. When we need to sanitize the data, we could do a replace for all these characters on all character fields, but that's a performance issue waiting to happen.

Any thoughts? Experiences?


Martin Zantingh

All Replies
  • 1. no - the number of xml 1.1 consumers is also very disappointing

    2a. when exporting end user xml we strip out all invalid asc values - it is not smart

    2b. for all other use we try to use write-json which handles it all a lot better, but does miss a 'write-schema' option, but you can combine write-xmlschema with write-json