Speed up INDEX() and SUBSTRING() with 1-Byte-LONGCHAR under a session with UTF-8 as CPINTERNAL - OpenEdge 11.6 Community Input - Products Enhancements - Progress Community

 OpenEdge 11.6 Community Input

Speed up INDEX() and SUBSTRING() with 1-Byte-LONGCHAR under a session with UTF-8 as CPINTERNAL

  • Under Review

When ABL has to work through a large text file row by row, e.g. after reading it into a LONGCHAR variable, it is very fast if SESSION:CPINTERNAL is 1252 / iso8859-1 but very slow under UTF-8.

See article http://knowledgebase.progress.com/articles/Article/000039559

When I use FIX-CODEPAGE = 1252 for the LONGCHAR, it gets even worse, because its code page then differs from that of the UTF-8 session.

When I use INDEX or SUBSTRING on that single-byte LONGCHAR, no character conversion to CPINTERNAL should be needed in the first place.

The code example below takes milliseconds with CPINTERNAL 1252 but up to an hour with UTF-8.

/* Example */
DEFINE VARIABLE lchar     AS LONGCHAR  NO-UNDO.
DEFINE VARIABLE ld_bytes  AS DECIMAL   NO-UNDO.
DEFINE VARIABLE record    AS CHARACTER NO-UNDO.
DEFINE VARIABLE iIndex    AS INTEGER   NO-UNDO.
DEFINE VARIABLE iPosition AS INTEGER   INITIAL 1 NO-UNDO.

/* Even if you omit this under CPINTERNAL UTF-8, it still takes way too long. */
FIX-CODEPAGE(lchar) = "1252".

/* LONGCHAR */
COPY-LOB FROM FILE "400000_rows.txt" TO lchar.
ASSIGN ld_bytes = LENGTH(lchar).

ETIME(TRUE).
REPEAT:
    /* Find the end of the current row; rows are assumed to end in CR/LF. */
    ASSIGN iIndex = INDEX(lchar, CHR(13), iPosition).
    IF iIndex = 0 THEN
        LEAVE. /* no further line terminator found */

    ASSIGN record = SUBSTRING(lchar, iPosition, iIndex - iPosition).

    /* do something with the record */

    /* Skip the CR/LF pair. */
    ASSIGN iPosition = iIndex + 2.

    IF iPosition >= ld_bytes THEN
        LEAVE.
END.

/* Takes milliseconds when SESSION:CPINTERNAL = 1252 and up to an hour with UTF-8 */
MESSAGE STRING(INT(ETIME / 1000), "HH:MM:SS") VIEW-AS ALERT-BOX.

Comments
  • Wow, I could not imagine there was such a dramatic difference when using large LONGCHARs. I had to test this myself and got similar results. This should definitely be addressed in my opinion.  

    Your example code could be made somewhat faster by removing the already processed part of lchar in every iteration, but even then it is much slower than with 1252 / iso8859-1.

  • I hope that development can improve the speed of the INDEX and SUBSTRING based version when the LONGCHAR has a single-byte FIX-CODEPAGE while CPINTERNAL is UTF-8.

    SUBSTRING:

    When "lchar" has a single-byte FIX-CODEPAGE, a re-read should not be necessary. Of course, the result still has to be converted to UTF-8 before it is stored in the "record" variable.

    INDEX:

    When "lchar" has a single-byte FIX-CODEPAGE, a re-read should not be necessary either, especially when the second parameter has the same code page, e.g.:

      iIndex = INDEX(lchar , CHR(13, "1252"), iPosition)
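
The first comment's suggestion, cutting the already processed part off lchar in every iteration, could look roughly like the sketch below. It is untested and makes the same assumptions as the original example: rows end in CR/LF and each row fits into a CHARACTER variable. Whether it actually helps under UTF-8 depends on how expensive the repeated copy of the remainder is.

```abl
DEFINE VARIABLE lchar  AS LONGCHAR  NO-UNDO.
DEFINE VARIABLE record AS CHARACTER NO-UNDO.
DEFINE VARIABLE iIndex AS INTEGER   NO-UNDO.

COPY-LOB FROM FILE "400000_rows.txt" TO lchar.

REPEAT:
    /* Always search from position 1, because processed rows are cut off below. */
    ASSIGN iIndex = INDEX(lchar, CHR(13)).
    IF iIndex = 0 THEN
        LEAVE. /* no further line terminator; an unterminated last row is ignored */

    ASSIGN record = SUBSTRING(lchar, 1, iIndex - 1).

    /* do something with the record */

    /* Keep only the unprocessed remainder (this also skips the CR/LF pair). */
    ASSIGN lchar = SUBSTRING(lchar, iIndex + 2).
END.
```

Because every INDEX and SUBSTRING call now starts at position 1, the runtime never has to walk over the rows that were already handled, at the price of copying the remaining text once per row.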