The customer had two incidents where the database went down due to a memory violation in a client holding the GST latch:
[2018/02/23@10:46:21.000+0000] P-5684 T-1 I ABL 2462: (49) SYSTEM ERROR: Memory violation.
[2018/02/23@10:46:21.000+0000] P-5684 T-1 I ABL 2462: (439) ** Save file named core for analysis by Progress Software Corporation.
[2018/02/23@10:46:25.000+0000] P-5684 T-1 I ABL 2462: (-----) Generating /home/xxxxxx/protrace.5684
[2018/02/23@16:46:55.591+0600] P-6053 T-1 I WDOG 2462: (5028) SYSTEM ERROR: Releasing regular latch. latchId: 7
[2018/02/23@16:46:55.592+0600] P-6053 T-1 I WDOG 1104: (2522) User 2462 died holding 1 shared memory locks.
We know that in both cases the sessions were preparing to disconnect from the database.
GST latch = Global Storage Table latch; it controls access to the entire shared memory area. Its activity is normally very low compared to that of the other latches. My guess is that the GST latch is mainly used when Progress needs to change the size of a structure in shared memory by decreasing the free memory.
My tests have shown:
1. The GST latch is actively used only at broker startup.
2. The promon/R&D/1/14 "Status: Shared Memory Segments" screen creates one GST latch lock (of course, without changing shared memory).
3. When a new session exceeds the previous maximum for the number of connected sessions, or a new login broker/remote client server is started, the GST latch is locked twice and free shared memory decreases by 160 bytes.
4. When a remote session connects but does not increase the HWM of connected sessions, the GST latch is still locked twice but free shared memory does not change. I guess this is just due to some legacy code used by remote client servers.
5. A "Lock table overflow" error locks the GST latch only once and free shared memory decreases by 88 bytes (= the size of a Lock Table Entry since 11.4; it was 14 bytes in V5 ;-).
6. 'proutil -C increaseto' locks the GST latch many times, and free shared memory decreases where possible or a new shared memory segment is added.
Of course, it's not the complete list.
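The byte accounting from those tests can be sketched as a quick sanity check. This is a minimal illustration, not anything from Progress itself: the per-event costs are my observed values on 11.x, and the names are made up for the example.

```python
# Observed decrease in free shared memory (bytes) per GST-latch event.
# These are empirical values from my tests on 11.x, not documented
# Progress constants; the event names are invented for this sketch.
GST_EVENT_COST = {
    "session_hwm_increase": 160,   # new HWM of connected sessions (= 32*5)
    "remote_connect_no_hwm": 0,    # latch locked twice, memory unchanged
    "lock_table_overflow": 88,     # one Lock Table Entry since 11.4
}

def expected_decrease(events):
    """Sum the expected drop in free shared memory for a list of events."""
    return sum(GST_EVENT_COST[e] for e in events)
```

For example, two session-HWM increases plus one lock-table overflow would account for 2*160 + 88 = 408 bytes of free shared memory.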
Monitoring free shared memory over the course of a day showed changes that are multiples of 448 bytes (= 64*7). This could be caused by an increase in the HWM of connected sessions, but then I would expect changes that are multiples of 160 bytes (= 32*5).
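The mismatch above can be checked arithmetically: 448 is not evenly divisible by 160, so the observed deltas cannot consist purely of session-HWM increases. A trivial check:

```python
observed_delta = 448   # multiple seen while monitoring free shared memory (= 64*7)
hwm_cost = 160         # cost of one session-HWM increase from my tests (= 32*5)

# 448 % 160 == 128, so 448-byte changes cannot be explained by
# session-HWM increases alone; something else consumes the memory.
remainder = observed_delta % hwm_cost
print(remainder)
```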
Is it correct to assume that ABL sessions will use the GST latch only under exceptional circumstances such as a "Lock table overflow" error? There is no reason to use the GST latch while disconnecting from a database, is there?
I guess the session in my example locked the GST latch only due to its "madness" (the memory violation).
The protrace file contains:
(4) 0xe0000001901cb480 ---- Signal 11 (SIGSEGV) delivered ----
(5) 0x40000000008d7b81 stVacate + 0x161 at /vobs_rkt/src/dbmgr/st/stm.c:803 [/usr/dlc/bin/_progres]
(6) 0x4000000000963840 cxDeactivateCursor + 0x150 at /vobs_rkt/src/dbmgr/cx/cxnxt.c:222 [/usr/dlc/bin/_progres]
(7) 0x40000000009634c0 dsmCursorDelete + 0x1c0 at /vobs_rkt/src/dbmgr/dsm/dsmcursr.c:1281 [/usr/dlc/bin/_progres]
(8) 0x4000000001752430 proixdlc + 0x150 at /vobs_prgs/src/prsys/profnd.c:596 [/usr/dlc/bin/_progres]
I guess the session tried to use an index, and according to the ABL stack trace it was an index in /another/ database the session was connected to. Is that a reasonable assumption?
BTW, both incidents happened soon after a migration from 10.2B08 to 11.7.2.
Does your customer use the -usernotifytime startup parameter for the client holding the GST latch?
No: BROKER 0: (18118) User Notification Time (-usernotifytime): 0 seconds
One other time the GST is used is in the index manager, under certain conditions, when it needs to make a temporary copy of an index key. Since it is not possible to allocate those at startup, they are allocated dynamically as needed.
Thanks, Gus. In our case both incidents seem to have been caused by a corrupted index:
IDXCHECK : (14684) SYSTEM ERROR: Attempt to read block 0 which does not exist in area 29, database . (210)
I was misled by the fact that the session locked the GST latch in a different database, not in the database with the corrupted index.
BTW, idxcheck ran for 6 days to check only ONE index and did not even complete phase 1. Size of the index: 154.7 GB (35,862,553 blocks).
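For scale: since phase 1 did not finish, dividing the full index size by 6 days gives only an upper bound on the effective scan rate. A rough back-of-the-envelope calculation:

```python
blocks = 35_862_553        # blocks in the index (154.7 GB total)
seconds = 6 * 24 * 3600    # 6 days of wall-clock time

# Upper bound only: even if idxcheck had touched every block in those
# 6 days (it did not finish phase 1), the rate would still be under
# ~70 blocks per second.
rate = blocks / seconds
print(round(rate, 1))
```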
> On Mar 11, 2018, at 12:33 PM, George Potemkin wrote:
> idxcheck ran for 6 days to check only ONE index and did not even complete phase 1
Yes. Index check is quite slow sometimes.
> index check is quite slow sometimes.
And sometimes it's too terse. We spent 6 days on idxcheck and got error (14684), which does not help to localize the corruption. We need to know the dbkey of the intermediate index block holding the key that refers to the dbkey mentioned in the error (dbkey 0, area 29).
We don't know how many blocks are corrupted: error 14684 is fatal and forces the process to terminate.
And idxfix seems unable to find this corruption.
I see only one option: run idxcheck against a copy of the database started in multi-user mode, but used only by idxcheck and promon. When idxcheck exits after error 14684, we will check the most recently used index blocks in promon/R&D/6/4 "Lru Chains".
Are there any other possible solutions?
How to fix the corrupted index block is another challenge. ;-(
"Highly likely" the root case of the error 14684 in our case was a failure of system memory. The memory was replaced and the second run of idxcheck did not find any index corruptions.