JSON Parsing and performance

Posted by James Palmer on 09-Nov-2015 08:48

I'm posting this on behalf of an esteemed colleague who has been tearing his hair out over this for a while now. Does anyone have any insights into what might be the problem, or is this a bug that needs raising? Consider the code below against sports2000:

DEFINE TEMP-TABLE tt-Customer NO-UNDO LIKE Customer.

DEFINE DATASET dsCustomer SERIALIZE-NAME "Customer"
           FOR tt-Customer.

DEFINE VARIABLE ds-h AS HANDLE NO-UNDO.
ASSIGN ds-h = DATASET dsCustomer:HANDLE.         

DEFINE VARIABLE lv-JSON AS LONGCHAR NO-UNDO.
DEFINE VARIABLE lv-NewJSON AS LONGCHAR NO-UNDO.
DEFINE VARIABLE lv-i AS INTEGER NO-UNDO.

DEFINE VARIABLE lv-JSONChar AS CHARACTER NO-UNDO.
DEFINE VARIABLE lv-JSONLoop AS INTEGER NO-UNDO.

DEFINE VARIABLE lv-Time  AS INTEGER.
DEFINE VARIABLE lv-Time2 AS INTEGER.

FOR EACH Customer:
    CREATE tt-Customer.
    BUFFER-COPY Customer TO tt-Customer.
    ASSIGN lv-i = lv-i + 1.
END.   

/* Serialize the dataset to a LONGCHAR; note the target encoding is UTF-8 */
ds-h:WRITE-JSON("longchar", lv-JSON, NO, "UTF-8", NO, YES) NO-ERROR.

/* First Pass */
ETIME(YES).
DO lv-JSONLoop = 1 TO LENGTH(lv-JSON):
   ASSIGN lv-JSONChar = SUBSTRING(lv-JSON,lv-JSONLoop,1).
END.
ASSIGN lv-Time = ETIME. 

/* Second Pass */
COPY-LOB lv-JSON TO lv-NewJSON.
ETIME(YES).
DO lv-JSONLoop = 1 TO LENGTH(lv-NewJSON):
   ASSIGN lv-JSONChar = SUBSTRING(lv-NewJSON,lv-JSONLoop,1).
END.
ASSIGN lv-Time2 = ETIME. 

MESSAGE lv-i "Customer records" SKIP 
        "First Pass" SKIP 
        SUBSTITUTE("Done &1 chars in &2 ms",LENGTH(lv-JSON),lv-Time) SKIP (2)
        "Second Pass" SKIP
        SUBSTITUTE("Done &1 chars in &2 ms",LENGTH(lv-NewJSON),lv-Time2) SKIP 
        VIEW-AS ALERT-BOX.


Results:

1117 Customer records

First Pass

Done 373204 chars in 149406 ms

Second Pass

Done 373204 chars in 350 ms

What is causing the huge difference in performance here?

All Replies

Posted by James Palmer on 09-Nov-2015 08:48

This is 11.5.1 on Windows.

Posted by scott_auge on 09-Nov-2015 09:02

That is a mystery...

Posted by Tim Kuehn on 09-Nov-2015 09:12

I added this, ran the tests, and both passes ran at the same speed.

COPY-LOB lv-JSON TO lv-NewJSON.

COPY-LOB lv-NewJSON TO lv-JSON.

I'm thinking there's something under the hood that's causing the delay, which makes this something TS should hear about.

Posted by Garry Hall on 09-Nov-2015 09:14

What is your -cpinternal? I would guess 1252 or ISO8859-1.

lv-JSON is written as UTF-8, as per your WRITE-JSON call. lv-NewJSON is not fixed to a particular codepage, so it will use -cpinternal when you COPY-LOB it. UTF-8 is a multi-byte codepage: to identify char NNN of a string, the AVM has to count from char 1 all the way to char NNN, observing the number of bytes for each char. For a single-byte codepage, char NNN can easily be reached by going directly to offset NNN. Of course, if your JSON contains chars that are not in the -cpinternal codepage, you will get '?' as a replacement char.
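
A quick way to confirm this (a rough sketch using James's variable names, placed after the COPY-LOB):

/* lv-JSON was fixed to UTF-8 by WRITE-JSON; lv-NewJSON was never fixed,   */
/* so COPY-LOB converted the copy to -cpinternal (ISO8859-1 or 1252 here). */
MESSAGE "lv-JSON codepage:"    GET-CODEPAGE(lv-JSON)    SKIP
        "lv-NewJSON codepage:" GET-CODEPAGE(lv-NewJSON)
    VIEW-AS ALERT-BOX.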

Posted by Stefan Drissen on 09-Nov-2015 09:18

The DO 1 TO LENGTH will result in LENGTH being calculated for every iteration of the loop, i.e. LENGTH is evaluated 373 thousand times. What happens when you put LENGTH in a variable and use that? Not that it explains why it is so slow in the first pass, but it may point to something.
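
Something like this (a sketch - the lv-Len variable is new):

DEFINE VARIABLE lv-Len AS INTEGER NO-UNDO.

/* Evaluate LENGTH once, not on every iteration of the loop */
ASSIGN lv-Len = LENGTH(lv-JSON).

DO lv-JSONLoop = 1 TO lv-Len:
    ASSIGN lv-JSONChar = SUBSTRING(lv-JSON, lv-JSONLoop, 1).
END.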

Posted by Tim Kuehn on 09-Nov-2015 09:20

[quote user="Stefan Drissen"]The DO 1 TO LENGTH will result in LENGTH being calculated for every iteration of the loop ie LENGTH is evaluated 373 thousand times. What happens when you put length in a variable and use that? Not that it explains why it is so slow in the first pass, but it may point to something.[/quote]

Since it's in both tests, the times will be comparable. 

I moved the LENGTH() outside of the DO ... TO, and the large difference in run time remained. 

Posted by Tim Kuehn on 09-Nov-2015 09:22

I added code to copy the JSON to files for the WRITE-JSON() output, lv-JSON, and lv-NewJSON, and all three were binary identical.

Posted by James Palmer on 09-Nov-2015 09:25

This is an interesting one - yes, we have ISO8859-1 for -cpinternal, so it could well be something to do with this.

Posted by James Palmer on 09-Nov-2015 09:26

Stefan, as per Tim's comments it made no discernible difference.

Posted by Stefan Drissen on 09-Nov-2015 09:50

[quote user="Tim Kuehn"]

Since its in both tests, the times will be comparable. 

[/quote]

But there is a difference between the two tests, which are doing the same thing - so there is no reason to assume that getting the LENGTH will be comparable.

The SUBSTRING being the culprit, as per Garry's response, sounds 'reasonable' - how do you get to position x in a string when the string is multi-byte?

The LENGTH function does not seem to suffer, since all three length types (character, raw, column) are probably available as part of the longchar.

Chomping on the left side of the string may help - but since this is a longchar, performance is generally awful when performing string operations anyway.

What will help - and this is so un-ABL - is to split the longchar into approx 30K character variables and then loop through these - you will only need to 'walk' through the longchar roughly 30K times less often.
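
A rough sketch of that chunking approach (untested; the variable names are mine, and the chunk size should stay comfortably under the ~32K CHARACTER limit if the data could contain multi-byte characters):

DEFINE VARIABLE lv-Chunk    AS CHARACTER NO-UNDO.
DEFINE VARIABLE lv-Offset   AS INTEGER   NO-UNDO INITIAL 1.
DEFINE VARIABLE lv-Total    AS INTEGER   NO-UNDO.
DEFINE VARIABLE lv-ChunkLen AS INTEGER   NO-UNDO.
DEFINE VARIABLE lv-Pos      AS INTEGER   NO-UNDO.
DEFINE VARIABLE lv-Char     AS CHARACTER NO-UNDO.

ASSIGN lv-Total = LENGTH(lv-JSON).

DO WHILE lv-Offset <= lv-Total:
    /* one expensive 'walk' of the LONGCHAR per chunk instead of per character */
    ASSIGN lv-Chunk    = SUBSTRING(lv-JSON, lv-Offset, 30000)
           lv-ChunkLen = LENGTH(lv-Chunk).
    DO lv-Pos = 1 TO lv-ChunkLen:
        /* SUBSTRING on a plain CHARACTER chunk (-cpinternal, single-byte here) is cheap */
        ASSIGN lv-Char = SUBSTRING(lv-Chunk, lv-Pos, 1).
    END.
    ASSIGN lv-Offset = lv-Offset + lv-ChunkLen.
END.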

Posted by Tim Kuehn on 09-Nov-2015 09:56

The final statement reported the LENGTH() for each CLOB, and the values were the same.

  SUBSTITUTE("Done &1 chars in &2 ms",LENGTH(lv-JSON),lv-Time) SKIP (2)

  SUBSTITUTE("Done &1 chars in &2 ms",LENGTH(lv-NewJSON),lv-Time2) SKIP

I modified the code to dump the dataset and two CLOB contents, and the output was binary identical.

I added code to copy the NewJSON back to JSON, and the performance was then identical - so my inference is that there's something under the covers when the JSON CLOB is first created that is causing the problem.

Posted by Garry Hall on 09-Nov-2015 10:09

The final statement reported the length *IN CHARACTERS* of each. If you changed this to LENGTH(..., "RAW") you would see a different result.

Stefan is correct in that the AVM internally knows the length in chars and bytes of a LONGCHAR. I don't believe it stores column length.

Posted by Tim Kuehn on 09-Nov-2015 10:16

As I mentioned previously, I had the dataset and both CLOBs dumped to disk, and they were identical.

On your suggestion I added "RAW" to the LENGTH() being reported, and they matched the LENGTH() value of 373204.

SUBSTITUTE("Done &1/&2 chars in &3 ms", iJSONLen,      LENGTH(lv-JSON, "RAW"), lv-Time)    SKIP (2)

"Second Pass"                                                   SKIP

SUBSTITUTE("Done &1/&2 chars in &3 ms", iNewJSONLen,   LENGTH(lv-newJSON, "RAW"), lv-Time2)   SKIP

Posted by Garry Hall on 09-Nov-2015 10:37

@Tim, Just for kicks, could you also report the GET-CODEPAGE() of each LONGCHAR var?

Posted by Garry Hall on 09-Nov-2015 10:41

I just remembered this is the sports2000 database. This is all ASCII data, which will be the same when represented in UTF-8 and ISO8859-1. Hence the documents being the same on disk and in memory. However, the algorithm still will differ for counting NNN chars in a UTF-8 string vs an ISO8859-1 string.
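
To illustrate the character/byte distinction with a non-ASCII character (a made-up example, nothing to do with the sports2000 data):

DEFINE VARIABLE lv-Test AS LONGCHAR NO-UNDO.

FIX-CODEPAGE(lv-Test) = "UTF-8".
ASSIGN lv-Test = "café".

/* 4 characters, but 5 bytes, because the é takes two bytes in UTF-8 */
MESSAGE LENGTH(lv-Test, "CHARACTER") LENGTH(lv-Test, "RAW") VIEW-AS ALERT-BOX.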

Posted by MarkHeppenstall on 09-Nov-2015 10:52

It was I who originally posed this 'problem' to James.

I believe Garry hit the nail on the head earlier with his initial post regarding codepage.

The length of both long chars will be the same in this instance because the data in the JSON happens to all be single byte characters.

However, the SUBSTRING command will not 'know' this when processing the UTF-8 long char (lv-JSON) and hence, as Garry states, it will have to start at the beginning every time it encounters the SUBSTRING command in order to find the n-th character.

Hence, on the first pass it has to read 1 character, then 2, then 3, then 4, and so on up to 373204.

The total number of characters it has to read in the first pass is therefore given by the formula:

n(n + 1) / 2 = 373204 × 373205 / 2 = 69,640,799,410

instead of the 373,204 reads it has to do for lv-NewJSON, which has an ISO8859-1 codepage.

This more than explains the difference to my satisfaction.

Posted by Tim Kuehn on 09-Nov-2015 11:09

I think something else is going on, because if one compares 69,640,799,410 to 373,204, you're talking ~5 orders of magnitude difference, not 1-2 orders of magnitude difference.

Posted by MarkHeppenstall on 09-Nov-2015 11:32

As it is reading significantly more characters in the first instance, surely it is more surprising that it doesn't take longer, since the order of magnitude in this respect is much greater than the order of magnitude of the time difference?

It is reading (69640799410 / 373204) = 186602 times more characters in only (149406 / 350) = 427 times the time.

So rather than complaining that it is slow the first way, shouldn't we be complaining that, in comparison, it is slower the second way?

Only kidding by the way :-)

Posted by Tim Kuehn on 09-Nov-2015 11:37

Your analysis that it's processing more characters may be correct - my pushback was on the magnitude of the presumed reads. A five-order-of-magnitude difference in reads should have resulted in an elapsed time roughly 10^5 longer, not 10^2 longer.

I'd still send this to PSC TS for analysis so they can create a KB that'll describe what's going on. 

Posted by Peter Judge on 11-Nov-2015 08:35

Do you see the time increase in the 2nd pass if you use the original longchar?
 
Does anything change if you assign the LENGTH() to a variable, instead of evaluating it on each iteration?
 
 
 
 

Posted by Brian K. Maher on 11-Nov-2015 08:35

James,
 
Please report this as a bug to Tech Support.
 
I see the same behavior on 11.6 (Win64).
 
You can make the times a bit better by storing the LENGTH(...) stuff into an integer variable instead of including it in each loop, but you will still get very long times for the first loop, and I see no logical reason for it.
 
Brian

Posted by Brian K. Maher on 11-Nov-2015 08:35

Folks,
 
Garry is correct.  It is the handling of multi-byte characters by the SUBSTRING function that is the cause.  Try adding a FIX-CODEPAGE(lv-NewJSON) = "UTF-8" before the COPY-LOB and you will see basically the same results.
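
Something like this (an untested sketch using James's variable names):

/* Fix the target LONGCHAR to UTF-8 *before* the COPY-LOB so the copy stays */
/* multi-byte - the second pass should then be just as slow as the first.   */
FIX-CODEPAGE(lv-NewJSON) = "UTF-8".
COPY-LOB lv-JSON TO lv-NewJSON.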
 
It's still worth submitting as a bug to see if there is something that can be improved in the multi-byte character handling.
 
Brian
 

Posted by Tim Kuehn on 11-Nov-2015 08:35

JSON is UTF-8
NewJSON is ISO8859-1

Posted by James Palmer on 11-Nov-2015 09:40

I've raised a case for this: 00329351

For some reason the auto-linking doesn't seem to be working on this.

Posted by Garry Hall on 11-Nov-2015 09:51

To make it clearer for the TS and Dev folks assigned to this issue, please add a concise statement to the case explaining exactly what it is you want them to look for. I have already explained that there will be *some* performance degradation - the cost of doing business with UTF-8. Are you asking about Brian Maher's "see if something can be improved with multi-byte character handling"? Or about Tim Kuehn's order-of-magnitude difference?

I know that we have provided some performance improvements in ABL string functions, but I cannot enumerate exactly which. Some of it might have been to avoid unnecessary work for single-byte codepages, which makes multi-byte processing look bad by comparison because its cost didn't go down.

Posted by James Palmer on 11-Nov-2015 09:56

I created the case in response to Brian's post:

James,
 
Please report this as a bug to Tech Support.
 
I see the same behavior on 11.6 (Win64).
 
You can make the times a bit better by storing the LENGTH(...) stuff into an integer variable instead of including it in each loop, but you will still get very long times for the first loop, and I see no logical reason for it.
 
Brian
- See more at: community.progress.com/.../74968

Posted by Garry Hall on 11-Nov-2015 10:15

Yes, he did. But then he sent a followup post, saying he confirmed my analysis, and suggesting that a bug be logged "to see if something can be improved with multi-byte character handling". The original issue to which Brian suggested you log a call is addressed through my initial response. Is there any further need for this call? If so, it should be stated concisely in the call, as reading this thread is confusing.

Posted by James Palmer on 11-Nov-2015 10:18

Case closed then. You're right it's confusing. I'm confused.

Posted by Garry Hall on 11-Nov-2015 10:21

I suspect this is a victim of the email parser issues. Brian sent 2 emails at 9:35 EST today, along with Tim and Peter, and I would guess they were sent yesterday.

This thread is closed