11.6.3 promon crash/DB shutdown bug - Forum - OpenEdge RDBMS - Progress Community


  • Just an FYI for people out there running 11.6.3, I found a bug that TS says is new in 11.6.3 and is also in 11.7.0.

    I found it when running a promon script but I am also able to reproduce it running interactively.  It causes promon to crash with error 49 (memory violation).  The promon session holds a lock on the MTX latch when it crashes, so the database shuts down abnormally.

    I haven't narrowed down exactly the minimum steps for recreating the crash but I have a script that does it reliably.  The content of the script (with annotations) is below.

    m       Modify defaults
    1       page size
    1       User control
    1       all users
    4       Record locking table
    1       all users
    5       Activity
    6       Shared resources
    7       Database status
    5       Adjust monitor options
    1       Display page length
    6       Number of auto repeats
    t       Main menu
    1       Status displays
    4       Processes/clients
    2       Blocked clients
    3       Active transactions
    9       BI log
    10      AI log
    12      Startup parameters
    13      Shared resources
    14      Shared memory segments
    17      Servers by broker
    t       Main menu
    2       Activity displays
    3       Buffer cache
    5       BI log
    6       AI log
    10      Space allocation
    13      Other
    t       Main menu
    3       Other displays
    1       Performance indicators
    4       Checkpoints
    t       Main menu
    6       Hidden menu
    8       Resource queues
    11      Latch counts
    x       Exit

    You would have to remove the annotations to have a functioning script for promon stdin.  I showed them here for clarity.
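    Stripping the annotations can be automated. The following is a minimal Python sketch, not part of the original script: it assumes each promon command is the first whitespace-delimited token on its line and that blank lines are not meaningful input. The function name `strip_annotations` and the sample text are made up for illustration.

```python
def strip_annotations(annotated: str) -> str:
    """Keep only the first whitespace-delimited token of each non-blank line.

    Assumption: every command is a single token (e.g. "t", "14", "x") and
    everything after it on the line is a human-readable annotation.
    """
    commands = []
    for line in annotated.splitlines():
        tokens = line.split()
        if tokens:  # skip blank lines (assumed not to be significant input)
            commands.append(tokens[0])
    return "\n".join(commands) + "\n"


# A short excerpt in the same layout as the annotated script above.
sample = """\
    t       Main menu
    1       Status displays
    14      Shared memory segments
    x       Exit
"""

print(strip_annotations(sample), end="")
```

    The result could then be saved to a file and fed to promon on stdin, as with the original script.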

    Workaround: if "debghb" is moved from its location above to the latest point possible, i.e. between "t" and "6", six lines from the bottom, then promon does not crash and the DB does not shut down.  I hope that makes sense.
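    The reordering described in the workaround can be sketched in Python as follows. This is an illustration only: the helper name `move_debghb_late` and the short command list are hypothetical, and the sketch assumes the script is held as a list of command tokens with the annotations already removed.

```python
def move_debghb_late(commands: list[str]) -> list[str]:
    """Return a copy with 'debghb' moved between the last 't' and the '6' after it.

    Assumption: the commands are plain tokens and the last adjacent
    ('t', '6') pair is the "latest point possible" from the workaround.
    """
    if "debghb" not in commands:
        return list(commands)
    rest = [c for c in commands if c != "debghb"]
    # Scan backwards for the last place where 't' is immediately followed by '6'.
    for i in range(len(rest) - 2, -1, -1):
        if rest[i] == "t" and rest[i + 1] == "6":
            return rest[: i + 1] + ["debghb"] + rest[i + 1 :]
    raise ValueError("no 't' immediately followed by '6' in the command list")


# Hypothetical, shortened command list; not the full script from the post.
before = ["debghb", "1", "14", "t", "6", "8", "11", "x"]
after = move_debghb_late(before)
print(after)
```

    With this ordering, "14" (Shared memory segments) is visited before "debghb" is entered.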

    For those who are interested, I will post an update here when the 11.6.3 hotfix is available.

  • Is it the same issue as the defect # PSC00356177?


  • George P reported something that sounds similar on Solaris 64-bit.  What OS did you see this on?

  • George: it sounds similar, but TS said this issue is not in 11.6.2 and prior.  I'll try to repro in Linux 11.6.3.

    CJ: Sorry, I should have given the platform.  I encountered this in 64-bit OE 11.6.3 on Linux x64 but I have also reproduced it in 64-bit 11.6.3 on Windows 7.

  • Update:

    I can reproduce the error 49 following George's steps from the other thread.  So this might be related.

    I'll try my steps again without R&D 1 14 and see if that changes anything.

  • > The promon session holds a lock on the MTX latch when it crashes

    Are you getting the error 5028 for latchId 1 (MTX) or 2 (USR)?

  • Update 2:

    My script above works when R&D 1 14 is removed.  So it does indeed look like this is the same bug George reported, though it is cross-platform.

  • > Are you getting the error 5028 for latchId 1 (MTX) or 2 (USR)?

    The (5028) error was: SYSTEM ERROR: Releasing regular latch. latchId: 2

    I'm confused. I thought MTX was 2 and USR was 3.

    for each _latch no-lock:
        display _latch-id _latch-name.
    end.

    1 0
    2 MTL_MTX
    3 MTL_USR
    4 MTL_OM
    5 MTL_BIB

  • _Latch._Latch-Id = real LatchID + 1

    Common Progress rule: "plus or minus one" does not matter. ;-)

    MTX latch was the first and the only latch in V5 Progress db and it was called MT lock.

  • > Common Progress rule: "plus or minus one" does not matter. ;-)

    Well, I learned something new so today's a good day.  :)

    > _Latch._Latch-Id = real LatchID + 1

    I believe you, but this is non-obvious.  When the 5028 says "latchid" I expect it to mean "_latch-id".

    I'm aware of such cases in other tables, like _connect-id = _connect-usr + 1.  It seems like _Latch is missing a field like "_latch-num" to hold the "real" number that shows up in the db log.  And the 5028 message should be reworded.
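    The offset between the latchId in the db log and _Latch._Latch-Id discussed above can be written down as a small lookup. This is a Python sketch that uses only the five _Latch rows shown earlier in the thread; the names `LATCH_NAMES_BY_VST_ID` and `latch_name_for_log_id` are made up for illustration.

```python
# _Latch-Id -> _Latch-Name, copied from the _Latch rows posted in this thread.
LATCH_NAMES_BY_VST_ID = {
    1: "0",
    2: "MTL_MTX",
    3: "MTL_USR",
    4: "MTL_OM",
    5: "MTL_BIB",
}


def latch_name_for_log_id(log_latch_id: int) -> str:
    """Translate the latchId from a (5028) message to a _Latch-Name.

    Uses the rule quoted above: _Latch._Latch-Id = real LatchID + 1.
    """
    return LATCH_NAMES_BY_VST_ID[log_latch_id + 1]


# The (5028) message in this thread reported latchId: 2.
print(latch_name_for_log_id(2))
```

    Under that rule, latchId 2 in the log corresponds to MTL_USR and latchId 1 to MTL_MTX.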

  • > And the 5028 message should be reworded.

    And what about the 5029? ;-)

    (5029)  SYSTEM ERROR: Releasing multiplexed latch. latchId: 1489504328

    It's the BHT latch, by the way. :-)

    > 1 0      

    > 2 MTL_MTX

    BTW, Progress does use the memory for the nameless latchId 0 though it's not a real latch.

  • The issue is that when the super secret "debghb" setting is on in promon, examining "14. Shared memory segments" has the adverse side effect of zeroing a pointer it should not be zeroing.

    The next reference to this pointer will cause promon to crash.

    Depending on when the pointer is accessed, promon may be holding a resource that can cause a crash.  This is seen at disconnect, but it could happen sooner depending on the activity performed.

  • And yes George, I believe it is the same issue and is available in HotFix

  • Thanks Rich and George.  That confirms how I should edit my script for safety until I have the fix.

  • > On Apr 19, 2017, at 4:12 PM, George Potemkin wrote:

    > MTX latch was a first latch in Progress db and it was called db lock.

    wrong. sorry george. you get a red card. :)

    the first release to have shared memory was 5.2A. In that release, there were two memory locks: the DB lock and the MTX lock.

    The MTX lock served a purpose similar to what it does today although the implementation was quite different. For various reasons, /all/ database writes were performed while holding the MTX lock.

    The DB lock was used to lock the entire shared memory region when any shared data structure was accessed or modified.

    Some other fun facts about the ancient version 5:

    * max segment size was 8 MB

    * max -B was 32,000

    * db and bi block size was 1 KB

    * lock table size was limited to 32 KB

    * bi cluster size was fixed at 16 KB

    * TP1 benchmark performance was about 10 tps

    * no data servers

    * 4GL could connect to only one database at a time

    * there were no internal procedures in 4GL