CDC policies and OM latches - Forum - OpenEdge RDBMS - Progress Community

CDC policies and OM latches

 Forum

CDC policies and OM latches

This question is not answered

The -omsize value was set higher than the number of the objects in database (-omsize 2000 vs 1190 objects). 

OM latch controls Object Manager cache: “This is the Object to Area mapping information used to show what Object resides in what Area. The Area also anchors the Index Delete Chain”.

Richard T. Banville wrote:

It's not just schema changes that can page something into the secondary cache. Any update where something is not found in the primary cache will cause a latch on the secondary cache. It's just that creating a new table forces it to be in the secondary cache.

More about OM latch:

community.progress.com/.../49024 

Problem: after enabling CDC policies we got the high numbers of the OM latch locks and its naps: ~500,000 locks per sec, ~4,000-6,000 naps perc. OM latch was almost 100% busy (the most snapshots in promon shown a latch owner). In fact the OM latch locks were just 2-6 times less than BHT latch locks on any time intervals. In other words there is a correlation between the activities of these two latches. IMHO, it’s very strange - OM latch deals with db objects while BHT latch deals with the activity of db objects.

Any explanation?

BHT/OM latch lock ratio gives a hint of index tree levels.

All Replies
  • No explanation, but a few questions:

    Were there any objects in the secondary cache, according to promon R&D 1 13?  

    Was -omsize set to 2000 at DB start or increased online?

    > after enabling CDC policies we got the high numbers of the OM latch locks and its naps

    Would the online addition of CDC policies cause the creation of a large number of change tables, whose OM entries would be in the secondary cache (and result in latching) until DB restart?

  • My theory is that the changes for CDC either check the object information to see if CDC is required and it causes an om latch or that something (maybe cdc policies) are no pre-loaded into the primary om cache.  Either way, if the om cache is sized properly and you are seeing om latch contention, then this it is a bug.  I'll have someone file a bug for this so that additional investigation can be done on it.

  • Would the online addition of CDC policies cause the creation of a large number of change tables, whose OM entries would be in the secondary cache (and result in latching) until DB restart?

    Yes.

  • > Yes.

    Does that suggest this could be expected behaviour rather than a bug?

  • Yes, in that case it would be expected behavior. I did not read that George had added them online.

    If that is the case, then they go into the secondary om cache and the secondary cache is always latched when accessed.

    It does show the value of not latching the primary cache though.

    This could be an area of improvement for us when adding new policies.

  • > Were there any objects in the secondary cache, according to promon R&D 1 13?  

    09/13/18        Status: Shared Resources  
    10:36:49        
    Active transactions:               481
    Lock table entries in use:         35235 of 1500000
    Lock table high water mark:        776635
    Primary object cache (-omsize):    Total: 2000 Used: 1176 Pinned: 0
    Secondary object cache (-omsize):  Total: 2000 Used: 164 Pinned: 0
    Primary MT cache (-mtpmsize):      Total: 1024 Used: 0 Pinned: 0
    Secondary MT cache (-mtpmsize):    Total: 1024 Used: 0 Pinned: 0
    

    5 min later

    09/13/18        Status: Shared Resources  
    10:42:15        
    Active transactions:               496
    Lock table entries in use:         33006 of 1500000
    Lock table high water mark:        776635
    Primary object cache (-omsize):    Total: 2000 Used: 1176 Pinned: 0
    Secondary object cache (-omsize):  Total: 5120 Used: 164 Pinned: 0
    Primary MT cache (-mtpmsize):      Total: 1024 Used: 0 Pinned: 0
    Secondary MT cache (-mtpmsize):    Total: 1024 Used: 0 Pinned: 0
    

    > Was -omsize set to 2000 at DB start or increased online?

    -omsize 2000 at startup and it was increased online to 5000 

  • CDC policies were activated online. It caused OM latch naps. Deactivation of policies did not resolve the issue.

  • Quick test results:

    1. If we just create a CDC policy for a table then the /reading/ its records will cause OM latch activity. But only if you don't restart a database.

    2. The number of OM latch locks is increasing with the record reads. For example, 1 read -> 2 OM latch locks, 10 reads -> 600+ OM latch locks.

    3. OM latch activity stays the same even if you delete a CDC policy. Unless you restart a database.

    4. 'FOR EACH _Cdc-Table-Policy NO-LOCK' query caused the OM latch locks in production. Really a session running CDCAdminProcess.p crashed while holding the OM latch:

    ABL  1507: (49)    SYSTEM ERROR: Memory violation.

    WDOG 1507: (5028)  SYSTEM ERROR: Releasing regular latch. latchId: 3

    But the tests show no OM latch locks when the query reads one record in _Cdc-Table-Policy.

  • 1. That is expected behavior

    2. What records are you reading?  The user data or the CDC data? The ratio 10 reads : 600 OM latches seems wrong

    3. Probably something that could be optimized on our end.

    4. Should be reported and investigated by Progress.

  • > What records are you reading?  The user data or the CDC data?

    We read the history1 table in atm database (in other words, the user data).

  • Hi Richard,

    Today I created case  for this issue

    00458901 - High OM latches and abnormal shutdown of the database after CDCBatch start

  • Thanks.  That will help get the right resources to investigate the issue and determine if there is a bug here and if so how to fix it.

    If I am understanding this properly, the reads of the existing history1 table (where all object references should be in primary om cache if set large enough at startup) are incurring om latch activity after the secondary om cache contains entries for the CDC change capture tables.

    Am I properly interpreting what you are encountering?

  • Yes, you understood correctly.

    I again checked this on the ATM database and here's the result.

    1. I deleted all CDC policies and restarted the database.

    2. After starting the database for the history1 table, I added one CDC policy, but did not activate it. I did this through the Pro2 Administration Utility.

    3. Now in the promon menu R&D 1/13 Status: Shared Resources I see that I have 1 object in the Secondary object cache

    09/14/18 Status: Shared Resources
    16:59:34

    Active transactions: 0
    Lock table entries in use: 0 of 10016
    Lock table high water mark: 10
    Primary object cache (-omsize): Total: 1024 Used: 265 Pinned: 0
    Secondary object cache (-omsize): Total: 1024 Used: 1 Pinned: 1
    Primary MT cache (-mtpmsize): Total: 1024 Used: 0 Pinned: 0
    Secondary MT cache (-mtpmsize): Total: 1024 Used: 0 Pinned: 0
    Number of servers: 2 (41 allocated)
    Total users: 2 (200 allocated)
    Self-service: 0
    Remote: 2
    Batch: 0
    Watchdog status: Executing
    BIW status: Executing
    Number of page writers: 1
    Number of monitors: 2
    Number of semaphores allocated: 249
    Shared memory allocated: 852384K (1 segments. The last segment was not locked in memory)

    Enter <return>, R, P, T, or X (? for help):

    4. Now reset promon statistics in the promon menu R&D debghb 6/11 Activity: Latch Counts. And start this simple code:

    MESSAGE "Zero Activity: Latch Counts" VIEW-AS ALERT-BOX. /*manuale reset promon statistics and click OK*/

    DEF VAR i AS INT NO-UNDO.
    i = 0.
    FOR EACH atm.history1 NO-LOCK:
    DISPL history1. PAUSE 0.
    i = i + 1.
    IF i > 10 THEN
    DO:
    LEAVE.
    END.
    END.

    MESSAGE "Update Activity: Latch Counts" VIEW-AS ALERT-BOX.

    5. Update promon statistics in Activity: Latch Counts

    09/14/18 Activity: Latch Counts
    17:00:55 09/14/18 16:59 to 09/14/18 17:00 (1 min 26 sec)

    ----- Locks ----- ------ Busy ------ Naps ---------- Spins ----------- ----- Nap Max -----
    Owner Total /Sec /Sec Pct /Sec /Sec /Lock /Busy Total HWM

    MTX -- 0 0 0 0.0 0 0 0 0 0 0
    USR -- 6 0 0 0.0 0 0 0 0 0 0
    OM -- 762 8 0 0.0 0 0 0 0 0 0
    BIB -- 86 1 0 0.0 0 0 0 0 0 0
    SCH -- 0 0 0 0.0 0 0 0 0 0 0
    LKP -- 36 0 0 0.0 0 0 0 0 0 0
    GST -- 0 0 0 0.0 0 0 0 0 0 0
    TXT -- 0 0 0 0.0 0 0 0 0 0 0
    LKT -- 0 0 0 0.0 0 0 0 0 0 0
    LKT -- 0 0 0 0.0 0 0 0 0 0 0
    LKT -- 0 0 0 0.0 0 0 0 0 0 0
    LKT -- 0 0 0 0.0 0 0 0 0 0 0
    SEQ -- 0 0 0 0.0 0 0 0 0 0 0
    AIB -- 0 0 0 0.0 0 0 0 0 0 0
    TXQ -- 2 0 0 0.0 0 0 0 0 0 0
    EC -- 0 0 0 0.0 0 0 0 0 0 0
    LKF -- 54 0 0 0.0 0 0 0 0 0 0
    BFP -- 0 0 0 0.0 0 0 0 0 0 0
    BHT -- 1583 18 0 0.0 0 0 0 0 0 0
    PWQ -- 0 0 0 0.0 0 0 0 0 0 0
    CPQ -- 858 9 0 0.0 0 0 0 0 0 0
    LRU -- 8 0 0 0.0 0 0 0 0 0 0
    LRU -- 0 0 0 0.0 0 0 0 0 0 0
    BUF -- 1954 22 0 0.0 0 0 0 0 0 0
    BUF -- 434 5 0 0.0 0 0 0 0 0 0
    BUF -- 348 4 0 0.0 0 0 0 0 0 0
    BUF -- 430 5 0 0.0 0 0 0 0 0 0
    INC -- 0 0 0 0.0 0 0 0 0 0 0
    CDC -- 0 0 0 0.0 0 0 0 0 0 0
    SEC -- 0 0 0 0.0 0 0 0 0 0 0
    LG -- 0 0 0 0.0 0 0 0 0 0 0

    Enter <return>, A, L, R, S, U, Z, P, T, or X (? for help):

    As we can see, OM latch is now used very actively when only 10 records are read.

    6. Now online I delete this CDC-policy via Pro2

    09/14/18 Status: Shared Resources
    17:03:27

    Active transactions: 0
    Lock table entries in use: 0 of 10016
    Lock table high water mark: 10
    Primary object cache (-omsize): Total: 1024 Used: 265 Pinned: 0
    Secondary object cache (-omsize): Total: 1024 Used: 1 Pinned: 0
    Primary MT cache (-mtpmsize): Total: 1024 Used: 0 Pinned: 0

    Secondary object cache is still in use.

    7. I execute my code again and look into Activity: Latch Counts

    09/14/18 Activity: Latch Counts
    17:04:14 09/14/18 17:03 to 09/14/18 17:04 (14 sec)

    ----- Locks ----- ------ Busy ------ Naps ---------- Spins ----------- ----- Nap Max -----
    Owner Total /Sec /Sec Pct /Sec /Sec /Lock /Busy Total HWM

    MTX -- 0 0 0 0.0 0 0 0 0 0 0
    USR -- 2 0 0 0.0 0 0 0 0 0 0
    OM -- 762 54 0 0.0 0 0 0 0 0 0
    BIB -- 14 1 0 0.0 0 0 0 0 0 0
    SCH -- 0 0 0 0.0 0 0 0 0 0 0
    LKP -- 8 0 0 0.0 0 0 0 0 0 0
    GST -- 0 0 0 0.0 0 0 0 0 0 0
    TXT -- 0 0 0 0.0 0 0 0 0 0 0
    LKT -- 0 0 0 0.0 0 0 0 0 0 0
    LKT -- 0 0 0 0.0 0 0 0 0 0 0
    LKT -- 0 0 0 0.0 0 0 0 0 0 0
    LKT -- 0 0 0 0.0 0 0 0 0 0 0
    SEQ -- 0 0 0 0.0 0 0 0 0 0 0
    AIB -- 0 0 0 0.0 0 0 0 0 0 0
    TXQ -- 2 0 0 0.0 0 0 0 0 0 0
    EC -- 0 0 0 0.0 0 0 0 0 0 0
    LKF -- 12 0 0 0.0 0 0 0 0 0 0
    BFP -- 0 0 0 0.0 0 0 0 0 0 0
    BHT -- 1569 112 0 0.0 0 0 0 0 0 0
    PWQ -- 0 0 0 0.0 0 0 0 0 0 0

    OM latch is still in use.

    8. Again, create the CDC policy and restart the database.

    09/14/18 Status: Shared Resources
    17:07:37

    Active transactions: 0
    Lock table entries in use: 0 of 10016
    Lock table high water mark: 0
    Primary object cache (-omsize): Total: 1024 Used: 265 Pinned: 0
    Secondary object cache (-omsize): Total: 1024 Used: 0 Pinned: 0
    Primary MT cache (-mtpmsize): Total: 1024 Used: 0 Pinned: 0
    Secondary MT cache (-mtpmsize): Total: 1024 Used: 0 Pinned: 0

    Now Secondary object cache is not used.

    9. I execute my code again and look in Activity: Latch Counts

    09/14/18 Activity: Latch Counts
    17:09:35 09/14/18 17:09 to 09/14/18 17:09 (5 sec)

    ----- Locks ----- ------ Busy ------ Naps ---------- Spins ----------- ----- Nap Max -----
    Owner Total /Sec /Sec Pct /Sec /Sec /Lock /Busy Total HWM

    MTX -- 0 0 0 0.0 0 0 0 0 0 0
    USR -- 0 0 0 0.0 0 0 0 0 0 0
    OM -- 0 0 0 0.0 0 0 0 0 0 0
    BIB -- 0 0 0 0.0 0 0 0 0 0 0
    SCH -- 0 0 0 0.0 0 0 0 0 0 0
    LKP -- 4 0 0 0.0 0 0 0 0 0 0
    GST -- 0 0 0 0.0 0 0 0 0 0 0
    TXT -- 0 0 0 0.0 0 0 0 0 0 0
    LKT -- 0 0 0 0.0 0 0 0 0 0 0
    LKT -- 0 0 0 0.0 0 0 0 0 0 0
    LKT -- 0 0 0 0.0 0 0 0 0 0 0
    LKT -- 0 0 0 0.0 0 0 0 0 0 0
    SEQ -- 0 0 0 0.0 0 0 0 0 0 0
    AIB -- 0 0 0 0.0 0 0 0 0 0 0
    TXQ -- 2 0 0 0.0 0 0 0 0 0 0
    EC -- 0 0 0 0.0 0 0 0 0 0 0
    LKF -- 6 1 0 0.0 0 0 0 0 0 0
    BFP -- 0 0 0 0.0 0 0 0 0 0 0
    BHT -- 1641 328 0 0.0 0 0 0 0 0 0
    PWQ -- 0 0 0 0.0 0 0 0 0 0 0
    CPQ -- 52 10 0 0.0 0 0 0 0 0 0

    Now the OM latch is not used.

    P.S.
    Pay attention to BHT. Why is it so big, if we only read 10 records?

  • By the way, what are the "Pinned" objects in a object cache (and in MT cache)?

  • This is to provide isolation from uncommitted changes to object entries.

    When making changes to an object the "old" committed version of the object is pinned in the cache while the new uncommitted version resides on disk as it is being changed.  

    This is how we ensure all other OLTP sees only the committed version of the object entry as opposed to the changed but not committed yet version.  If the entry was not pinned in the cache, it could be paged out and when read on demand from disk NO-LOCK, the uncommitted version would show up in the cache.  

    After the transaction which changed the cache entry ends, committed or rolled back, the entry in the cache will be updated with the current on-disk version the next time it is referenced.