On all Progress and OpenEdge supported platforms, multi-CPU architecture is implicitly supported as long as the host operating system also provides guaranteed multi-CPU support.
- OpenEdge does not include any specific coding that would optimize memory access performance on a given multi-CPU memory architecture.
- Progress Software has not implemented any optimization for the NUMA architecture.
For optimized performance in multi-CPU environments:
- Use an Enterprise database license to benefit from the spinlock implementation of latches instead of semaphores.
- Set an appropriate value for the -spin database startup parameter. Refer to the article "What is the -spin parameter?"
For optimized performance of the Enterprise RDBMS with the NUMA memory architecture:
- Apart from using an Enterprise Database license and spinlocks, use client/server connections to the database to reduce the number of processes accessing the database's shared memory. With shared-memory client connections, each ABL user is a separate operating system process attached to shared memory. In client/server mode, a reduced number of server processes access the database's shared memory pool (one server process serves multiple remote clients). Connecting client/server lowers the number of processes attached directly to shared memory, which reduces the cache coherency overhead.
- Bind a process and its memory to a smaller set of cores to reduce the impact of the NUMA quotient.
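The core-binding step can be illustrated with a minimal sketch, assuming a Linux host where Python's `os.sched_setaffinity` is available. The helper name `pin_to_cpus` is hypothetical and not part of OpenEdge. It restricts only the CPUs a process may run on; binding the memory itself is outside what this call does and would need an operating system facility such as numactl.

```python
import os


def pin_to_cpus(cpus):
    """Restrict the calling process to the given CPU ids (Linux only).

    Returns the resulting affinity set, or None where the call is unavailable.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # platform without this call, e.g. macOS or Windows
    allowed = os.sched_getaffinity(0)   # CPUs this process may currently use
    target = set(cpus) & allowed        # never request a CPU we cannot use
    if not target:
        raise ValueError("none of the requested CPUs are available")
    os.sched_setaffinity(0, target)     # pid 0 means the calling process
    return os.sched_getaffinity(0)
```

Child processes inherit the affinity mask, so a wrapper that pins itself and then launches another process would pass the restriction on. Which cores belong to which NUMA node is system specific and has to be looked up on the host.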
Understanding shared memory access in multi-CPU environments:
The implementation of semaphores by the operating system on SMP / NUMA systems varies across vendors.
The OpenEdge RDBMS uses shared memory to coordinate database changes amongst multiple users. Shared memory is the standard technique for providing a common buffer pool that can be accessed by various processes. Changes are controlled by a set of latches that are used to control updates but also keep the processes from interfering with one another.
In the case of the Enterprise database, a technique called spinlocks is used to implement latches. Spinlocks operate with very little interaction with the OS itself, which lets them scale to multiple CPUs with very little added overhead per CPU. That is because the coordination of the locks happens at the level of CPU/memory hardware synchronization. Spinlocks are implemented in just a few assembly-code instructions and linked directly into the database manager itself. It is the same technique that operating systems use for internal locking mechanisms.
In the case of Workgroup databases, a similar technique is used, relying on operating system semaphores instead. A semaphore is different from a spinlock in that the operating system manages the synchronization. The process must call the OS itself to perform the latching that a spinlock does in a few assembly instructions. Semaphore-based locking is less efficient because of the overhead of the context switch into the OS, as well as the added instructions in the semaphore service itself.
On a single CPU system, Workgroup database performance is good because the RDBMS itself embeds optimized algorithms. Moreover, the added code on a single CPU system for semaphores is not a huge penalty.
On a quad processor, for example, a call from a Workgroup database to use a semaphore goes into the OS and then must be reconciled in software against semaphore requests coming from processes running on the other three CPUs. This is all done at the operating system level and is far more expensive than just a few instructions that test and set locks in the application directly.
Reconciling changes at the semaphore level is an expensive operation that gets slower as more CPUs are added, because the data structures the OS maintains to implement the semaphores have to be locked many times over while they are updated. The cost of using semaphores is spread across operating system calls, context switches and complex conflict resolution algorithms.
Performance of semaphores varies by operating system, because of the different implementations.
- Linux semaphores are significantly slower than spinlocks and do not scale well.
- Windows semaphores seem to run faster than Linux semaphores and seem to scale a bit better, but still exhibit the same performance hit.
OpenEdge RDBMS code is a consumer of operating system services and we use them as efficiently and simply as we can in order to get best reliability and a high degree of portability. Some services are less efficient on some OS implementations; some are slower based on hardware architecture as well. Workgroup performance is impacted by the architecture of the present Intel-based CPUs, and the expense of using Linux semaphores in an SMP environment. This limitation is not specific to the OpenEdge product.
In the NUMA model, the system shows only a single memory image to the user even though the memory is physically distributed over the processors. Since processors can access their own memory much faster than that of other processors, memory access is non-uniform (NUMA).
The OpenEdge database engine is based on shared memory, which means that processors must spend a significant amount of kernel time (seen as high %sys) just to make sure that all processors have a coherent image of the shared memory segments.
When multiple databases are running on the server, each OpenEdge database has a shared memory cache, synchronized with mutex locks (latches). The process of doing this requires CPU caches to be synchronized. The higher the number of databases, the less effective a high spin value, as the process starts jumping around cores when it has to retry.
While spinlocks operate with very little interaction with the OS itself, the premise of -spin is to allow a number of retries when a latch is not available. Once those retries are used up, the process naps before trying again.
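The retry-then-nap behaviour can be sketched as follows. This is an illustration only, not the OpenEdge implementation: the real latches are a few assembly instructions, whereas this sketch stands in a Python `threading.Lock` for the latch. The function name `acquire_with_spin`, the `max_naps` safety limit, and the doubling of the nap time up to `napmax_ms` are all assumptions made for the example.

```python
import threading
import time


def acquire_with_spin(latch, spin=5000, nap_ms=1, napmax_ms=32, max_naps=100):
    """Try a latch up to `spin` times, then nap and start over.

    Returns (spins_used, naps_taken). Raises TimeoutError once `max_naps`
    naps have been taken and a further round of spinning still fails.
    """
    nap = nap_ms
    spins_used = 0
    naps_taken = 0
    while True:
        for _ in range(spin):
            spins_used += 1
            if latch.acquire(blocking=False):   # non-blocking attempt, no waiting in the OS
                return spins_used, naps_taken
        if naps_taken >= max_naps:
            raise TimeoutError("latch not acquired")
        time.sleep(nap / 1000.0)                # give up the CPU for the nap period
        naps_taken += 1
        nap = min(nap * 2, napmax_ms)           # assumed back-off, capped at napmax
```

A free latch is taken on the first attempt, so `acquire_with_spin(threading.Lock())` returns after one spin and no naps. A latch that stays held costs the full spin count before every nap, which is the CPU time the -spin setting trades against latency.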
For example: -spin 5000, -nap 1, -napmax 32
Assuming a latch is held for 100 nanoseconds on a 2.4 GHz CPU, with one spin attempt per clock cycle:
-spin 5000 takes about 2083 ns of spin time (~21 times longer than the latch is held by another process)
Failing to get that latch, -nap 1 sleeps for 1 ms (10,000 times longer than the latch is typically held)
During that 1 ms nap, the CPU is available to process about 480 full spin cycles of 5000, or roughly 2,400,000 operations, and the process is likely to acquire the latch on its next cycle
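The figures in this example can be reproduced with a short calculation. It is a sketch under the same simplifying assumption the example relies on, namely that one spin attempt costs one clock cycle. The function name `spin_nap_figures` and its dictionary keys are made up for illustration. Under that assumption a 1 ms nap spans 2,400,000 clock cycles, which is 480 spin cycles of 5000, so the numbers should be read as round estimates.

```python
def spin_nap_figures(clock_hz=2_400_000_000, latch_hold_ns=100, spin=5000, nap_ms=1):
    """Return the timing ratios for a given -spin / -nap combination.

    Assumes one spin attempt per clock cycle, which is a simplification.
    """
    spin_time_ns = spin * 1_000_000_000 / clock_hz   # time to burn through -spin retries
    nap_ns = nap_ms * 1_000_000                      # -nap is given in milliseconds
    cycles_per_nap = clock_hz * nap_ms // 1000       # clock cycles that pass during one nap
    return {
        "spin_time_ns": spin_time_ns,
        "spin_vs_hold": spin_time_ns / latch_hold_ns,
        "nap_vs_hold": nap_ns / latch_hold_ns,
        "spins_per_nap": cycles_per_nap // spin,
        "operations_per_nap": cycles_per_nap,
    }


if __name__ == "__main__":
    for name, value in spin_nap_figures().items():
        print(f"{name}: {value}")
```

Changing `spin` or `nap_ms` shows how quickly the nap dominates: even the minimum nap is several orders of magnitude longer than the assumed latch hold time.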
However, with multiple CPUs, each process is a separate operating system process that is penalized by the slow shared memory synchronization, or cache coherency. Every time a latch is tested for availability, all CPUs need to clear their cache lines. That requirement ensures that the tested latch is not in another CPU's cache, being tested by a different process on a different CPU. High CPU usage with no specific process being responsible typically shows up during periods of high database activity, as more cache lines need to be flushed.
With more than 16 CPUs, NUMA is enabled, and memory use is split across multiple cores with local versus remote cache. When the logical partition (LPAR) spans the NUMA zone, the effect is worse due to the NUMA quotient: the time it takes for a CPU to read memory on a remote node compared to reading memory locally. When shared memory spans nodes, a process on node 1 takes longer to read memory located on node 2. A read from a different node tends to take about 3 times longer than a read from the node where the process is running.
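The effect of the NUMA quotient on average memory access time can be sketched with a simple weighted average. This is a back-of-the-envelope model, not a measurement: the function name `effective_access_time`, the 100 ns local access time used in the usage note, and the idea of a fixed fraction of remote reads are all assumptions for illustration. Only the quotient of about 3 comes from the text above.

```python
def effective_access_time(local_ns, remote_fraction, numa_quotient=3.0):
    """Average memory read time when some reads go to a remote NUMA node.

    local_ns        -- time for a read from the local node (hypothetical value)
    remote_fraction -- share of reads served by a remote node, from 0.0 to 1.0
    numa_quotient   -- remote read time divided by local read time
    """
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0.0 and 1.0")
    remote_ns = local_ns * numa_quotient
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns
```

With a hypothetical local read of 100 ns, `effective_access_time(100, 0.5)` gives 200 ns: if half of the reads cross nodes, the average read takes twice as long. Keeping a process and its memory on one node pushes `remote_fraction` toward zero, which is the point of the binding recommendation.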