Activity: Performance Indicators - Latch timeouts - Forum - OpenEdge RDBMS - Progress Community

Activity: Performance Indicators - Latch timeouts

 Forum

Activity: Performance Indicators - Latch timeouts

This question is not answered

Hello,

I'm trying to learn how to manage performance in an OpenEdge database. I've set up a monitoring system. And I decided to start with setting the -spin parameter.

With the database works from 1.7K to 3.2К users depending from time of day. On the database server we have 128 CPU. Several databases are running on the same server. Currently the value of -spin is set to 10000 for all databases.  On the monitoring system I set following thresholds for Latch timeouts: less than 300 - Ok(Green zone); more that 300 and less than 500 - Warning (Yellow zone); more than 500 - Critical (Red zone). I've noticed that most databases have the Latch timeouts value in the green zone (less than 300).

But in one database, the values of Latch timeouts are almost always in the red zone. On average 809, max 1.5 K.

The question is, should I increase or decrease the -spin value for this database? Or maybe I should pay attention to some other metrics?

Thanks in advance for any tips!

All Replies
  • >>On average 809, max 1.5 K.

    That values is per second.

  • Latching is extremely sensitive to cpu affinity.  128 CPUs is a very, very bad thing for a database server.  A server with 128 CPUs probably makes a great web server.  It is not an appropriate database server.

    Aside from that...

    Noticing that you have possibly excessive latch timeouts is just the first step.  You then need to know *which* latch is responsible.  Your corrective actions will vary depending on whether the issue is excessive LRU waits or MTX waits.  Or BHT waits.  Or one of the BF latches.  It can also depend on what version of OpenEdge you are running.

    Once you know which latch is involved (hopefully just one will stick out) you might be able to "treat the symptoms" via -spin or -lruskips or something simple like that.  On the other hand you might not.  If it is MTX, for instance, then you probably have to do something like stop using RAID5 on a SAN for your bi file.

    Ultimately you want to get to the problem behind the problem.  In many cases this is going to be the behavior of 4gl code.  For instance, a very frequent driver of excessive latch requests is 4gl code that is hammering on a small set of records.  So you probably want to correlate latch requests with table and index activity at the detailed table and index level.  And then you probably want to tie that back to particular user sessions if it is narrowly focused activity, or you want to find the frequently used bits of code that are responsible.

    I should mention that ProTop does many of these things for you and would save you a lot of development time.  And has nice pretty Grafana charts too.  wss.com/.../

    --
    Tom Bascom
    tom@wss.com

  • > But in one database, the values of Latch timeouts are almost always in the red zone

    The cost of one timeout is unknown. Is it the -nap 10 or the -napmax 250 (or whatever is used by your customer)?  The difference is huge. I'd change the -napmax to the -nap just to check how it will affect the latch timeouts. Significant increasing of timeouts would mean that a bottleneck on the latches is real.

    Also I guess the -nap 1 would be a right thing to use.

    > 128 CPUs is a very, very bad thing for a database server.

    What is a percent of time when one Progress session uses the latches? 5%. Maybe 10% at most. So the most processes that are using CPUs right now are not locking the latches. The rest (~ 6 processes = 5% of 128 ) served by CPUs at the moment: only one process can lock a regular latch at time, a few processes can /try/ to lock the same latch. But if they failed then will sleep (nap 10 milliseconds) the thousands times longer than latch lock (~50-100 nanoseconds) leaving CPU time for the processes that don't need the latch locks.

    And if a box is running many databases then the large number of CPUs will not be a bad thing at all.

  • I agree that just because you see what seems like a "high" value for latch timeouts that you do not necessarily have a problem.   That is why you need to dig deeper and understand which latch or latches are involved.

    I disagree that large numbers of CPUs is not an issue.  Yes, you will get away with it if your load is not heavy.  But there is a reason that certain platforms are so frequently mentioned when users are having performance problems.

    --
    Tom Bascom
    tom@wss.com

  • Too many CPUs are not good for monolithic applications like Database Servers.  There are 2 main reasons here.

    1.) Cache Coherency – Each time a latch is being tested to see if it is free, ALL the CPUs need to clear their cache lines to make sure the latch being tested is not in another CPUs cache line being tested by a different process on a different CPU.  The more CPUs the more cache lines need to be flushed and the harder the machine works.  This shows up at high system load, without any specific process or processes being culprits for CPU usage.

    2.) NUMA – It’s unlikely today to find a 128 CPU machine that is not NUMA.  NUMA is also bad for monolithic applications.  Shared Memory will likely span nodes, hence a process on node 1 will take longer to read memory located on node 2.  The time difference for this far memory read is called the Numa Quotient and tends to be 3 or higher, meaning it’s 3 times longer to read memory on a different node than where the process is running.  As you can imagine this is bad.

    Keep the number of sockets low to avoid NUMA and get the number of CPUs you need to run your application.  Hardware prices have dropped and continue to drop, so it may be tempting to GO BIG on number of CPUs, but my advice is not to for the reasons stated above.

    Mike
    -- 
    Mike Furgal
    Director – Database and Pro2 Services
    PROGRESS
    617-803-2870 


  • To simplify a picture let’s assume that database has only one latch - MTX latch. The one I don’t like because when MTX latch is a bottleneck I don’t have any suggestions for the customers how to improve performance. Let’s assume a box has 128 CPUs. Will we get the stronger issue with Cache Coherency if we will split the whole table space, let’s say, in a hundred separate databases with their own copy of MTX latch?

  • It is an even bigger problem when too many cores are virtualized -- in that case enough free cpus need to be available before the hypervisor can schedule the VM.  Which is what often causes the "my VM is painfully slow but it isn't doing *anything*" problem.  Many times people add CPUs to the VM in an attempt to cure that problem.  Which only makes it worse.

    --
    Tom Bascom
    tom@wss.com

  • MTX is a bad example because the main improvement that you can make for that one is to address the IO subsystem.  Well, the other main improvement would be to fix the code -- but it's usually a whole lot easier to fix the IO subsystem.

    --
    Tom Bascom
    tom@wss.com

  • No.  The CPUs do not know or care which memory is being tested.  If it’s memory from DB1 or DB2.  An atomic test and set is being performed, which forces all the CPUs to surrender their cache line to make sure no other CPU is doing an atomic test and set on the same piece of memory.  The machine and CPU have no awareness that the memory the test and set is being performed on is for a specific database or shared memory region.

    Or google cache coherence problem and see many articles written on the subject.

    Mike
    -- 
    Mike Furgal
    Director – Database and Pro2 Services
    PROGRESS
    617-803-2870 


  • MTX lock duration is indeed affected by the IO subsystem but why? MTX latch can be locked during BI writes. But BIW does not seem to use MTX latch. Client's sessions do use MTX latch but in most cases they should not write to BI file.

    For our biggest customers MTX latch has the highest timeouts among all latches.

  • I see a lot of situations where LRU and BHT are much bigger problems than MTX.  And those cases are severely aggravated when the customer is on a NUMA box with dozens and dozen of cores.  IMHO a handful of cores is plenty for almost every db server that we come across.  A few fast cores is infinitely better than a horde of (slower) cores.  (If you look at Intel's chip matrix you will also see that many cores also = slower cores.)

    --
    Tom Bascom
    tom@wss.com

  • For update heavy applications, MTX is often a bottleneck.  In this note writing work, the MTX is really used to order the recording of the bi and ai notes as an atomic action and not to protect each one individually.

    Recording of OLTP actions grab MTX, then BIB, records bi notes and releases BIB, then grabs AIB and records ai notes, then releases AIB and MTX.  

    This is another reason why extending variable length bi or ai extents during OLTP operations can adversely affect performance.

    The BIWs write full/dirty bi buffers and hold the BIB latch to protect new buffers being put on the full list by OLTP activity while the BIW is grabbing one off the list.

    It would make sense to separate the latches used to record bi notes and management of the full bi buffer list.  A BIB and BIW latch for example.

    Same goes for the AIB latch but that mechanism is a bit different than the BI.

    Currently, the ai and bi writers contend with the OLTP activity for the BIB and AIB latches and that is occurring while the OLTP action is holding the MTX.

    The MTX would not be needed here if there were only one set of files to write notes to.  In other words one recovery subsystem for BI and AI as opposed to two.  The note recording latch and full buffer write latch would suffice.  One set of notes, one set of files.  It seems it would make sense to provide such an option.  Of course DBA management of such a joined recovery subsystem would be a bit different than today based on archiving and long running txns as well as involvement in replication but should be consumable easy enough.

    If such changes were ever made, it still would not remove the bottleneck of recording recovery notes since OE still maintains the write ahead logging rule database wide so managing OLTP updates will continue to be a bottleneck or have a funneling effect albeit with improved performance over today's implementation.

  • Tom, I hope you see significant improvements in BHT concurrency with those apps once they are deployed in OE 12.0.

  • -lruskips was certainly a big help!  Hopefully -hashLatchFactor will be similarly beneficial :)

    --
    Tom Bascom
    tom@wss.com

  • Hopefully not high jacking this thread...

    -hashLatchFactor also was put into 11.7.4.

    The other BHT performance enhancement only in OE 12.0 was the Optimistic Cursors feature.  This will dramatically decrease the # of BHT latch requests for queries using cursors.  Readprobe's: for each customer NO-LOCK: end. comes to mind ;)