Hi,
were running low on ideas. Since jul 13 our databases are crashing with semaphore errors.
History
things we tried
I don't know if it's related but I noticed in the latest crash that the time on one of the entries is in the wrong timezone
[2016/07/25@06:23:11.613+0200] P-32381 T-140189001459520 I ABL 86: (453) Logout by batchclnt on batch.[2016/07/25@06:25:04.592+0200] P-28093 T-140142107518784 I SHUT 88: (453) Logout by proadmin on /dev/pts/0. Mon Jul 25 04:25:04 2016[2016/07/25@04:25:04.000+0000] P-28093 T-140142107518784 I SHUT (562) HANGUP signal received. Mon Jul 25 06:28:41 2016[2016/07/25@06:28:41.616+0200] P-31604 T-140273689978688 I SRV 5: (1132) Invalid semaphore id[2016/07/25@06:28:41.616+0200] P-31604 T-140273689978688 I SRV 5: (10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 0, errno = 22.[2016/07/25@06:28:41.616+0200] P-31604 T-140273689978688 I SRV 5: (-----) semLockLog_2: semValue val = -1[2016/07/25@06:28:41.616+0200] P-31604 T-140273689978688 I SRV 5: (1132) Invalid semaphore id[2016/07/25@06:28:41.616+0200] P-31604 T-140273689978688 I SRV 5: (10839) SYSTEM ERROR: Unable to set semaphore set : semid = 2, errno = 22.[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (2520) Stopped.[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (1132) Invalid semaphore id[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 0, errno = 22.[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (-----) semLockLog_2: semValue val = -1[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (1132) Invalid semaphore id[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (10839) SYSTEM ERROR: Unable to set semaphore set : semid = 2, errno = 22.[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (2520) Stopped.[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (1132) Invalid semaphore id[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 0, errno = 22.[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (-----) semLockLog_2: semValue val = -1[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (1132) Invalid semaphore id[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (10839) SYSTEM ERROR: Unable to set semaphore set : semid = 2, errno = 22.[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (453) Logout by proadmin on batch.[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (1132) Invalid semaphore id[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 0, errno = 22.[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (-----) semLockLog_2: semValue val = -1[2016/07/25@06:28:48.303+0200] P-27133 T-140230977447744 I BROKER 0: (-----) enssbrocon: child exited with 2, err=9[2016/07/25@06:28:48.303+0200] P-27133 T-140230977447744 I BROKER 0: (8837) 4GL server process terminated.[2016/07/25@06:29:28.457+0200] P-27133 T-140230977447744 I BROKER 0: (-----) enssbrocon: child exited with 2, err=9[2016/07/25@06:29:28.457+0200] P-27133 T-140230977447744 I BROKER 0: (8837) 4GL server process terminated.[2016/07/25@06:29:42.516+0200] P-27133 T-140230977447744 I BROKER 0: (-----) enssbrocon: child exited with 2, err=9[2016/07/25@06:29:42.516+0200] P-27133 T-140230977447744 I BROKER 0: (8837) 4GL server process terminated.
attacheg log file errors
I'll add ipcs -s to the startup script.
Yesterday we stopped using shared-memory connections and it got us through the night without crashes
Did the tcp errors disappear as well?
I haven't seen any TCP/IP errors in the db log, I do see a lot of dropped packets
ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet x.x.x.x netmask 255.255.0.0 broadcast 10.0.255.255
inet6 yyyyyyyyyyyyyyyyyy prefixlen 64 scopeid 0x20<link>
ether 00:15:5d:00:01:12 txqueuelen 1000 (Ethernet)
RX packets 72734765 bytes 25463358707 (23.7 GiB)
RX errors 0 dropped 1023562 overruns 0 frame 0
TX packets 61839003 bytes 83457857300 (77.7 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 0 (Local Loopback)
RX packets 483858734 bytes 164988488628 (153.6 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 483858734 bytes 164988488628 (153.6 GiB)
TCP/IP write error occurred with errno 32
>I haven't seen any TCP/IP errors in the db log,
Do you mean that you don't see them in the current log?
Yesterday these errors were reported in log once per 5-10 minutes through the whole day.
I missed a few during my last search, there were none between 17:00 and 06:50, there are a few errors since that time but the db stays up & running.
[2016/07/25@07:25:12.467+0200] P-37327 T-139985194260288 I SRV 16: (-----) TCP/IP write error occurred with errno 32 [2016/07/25@07:42:00.265+0200] P-38269 T-140325921544000 I SRV 19: (-----) TCP/IP write error occurred with errno 32 [2016/07/25@07:56:39.636+0200] P-37336 T-139744649774912 I SRV 18: (-----) TCP/IP write error occurred with errno 32 [2016/07/25@08:22:32.523+0200] P-46233 T-139696198965056 I SRV 34: (-----) TCP/IP write error occurred with errno 32 [2016/07/25@08:33:28.461+0200] P-46233 T-139696198965056 I SRV 34: (-----) TCP/IP write error occurred with errno 32 [2016/07/25@08:53:58.327+0200] P-51006 T-139629681858368 I SRV 38: (-----) TCP/IP write error occurred with errno 32 [2016/07/25@08:56:58.560+0200] P-52605 T-140382806456128 I SRV 40: (-----) TCP/IP write error occurred with errno 32 [2016/07/25@08:58:07.372+0200] P-53007 T-139680708523840 I SRV 41: (-----) TCP/IP write error occurred with errno 32 [2016/07/25@09:19:55.615+0200] P-52605 T-140382806456128 I SRV 40: (-----) TCP/IP write error occurred with errno 32 [2016/07/25@09:21:45.428+0200] P-40014 T-140399388444480 I SRV 21: (-----) TCP/IP write error occurred with errno 32 [2016/07/25@09:25:09.992+0200] P-56699 T-140641693988672 I SRV 43: (-----) TCP/IP write error occurred with errno 32 [2016/07/25@09:47:20.925+0200] P-33969 T-140121481381696 I SRV 2: (-----) TCP/IP write error occurred with errno 32 [2016/07/25@09:48:22.646+0200] P-52605 T-140382806456128 I SRV 40: (-----) TCP/IP write error occurred with errno 32 [2016/07/25@10:33:19.665+0200] P-40014 T-140399388444480 I SRV 21: (-----) TCP/IP write error occurred with errno 32 [2016/07/25@10:34:55.088+0200] P-53007 T-139680708523840 I SRV 41: (-----) TCP/IP write error occurred with errno 32 [2016/07/25@10:48:41.338+0200] P-33974 T-139697772349248 I SRV 4: (-----) TCP/IP write error occurred with errno 32 [2016/07/25@10:50:27.258+0200] P-52605 T-140382806456128 I SRV 40: (-----) TCP/IP write error occurred with errno 32 [2016/07/25@11:12:19.914+0200] P-10713 T-140585542993728 I SRV 46: (-----) TCP/IP write error occurred with errno 32 [2016/07/25@11:14:42.414+0200] P-10713 T-140585542993728 I SRV 46: (-----) TCP/IP write error occurred with errno 32 [2016/07/25@11:44:40.016+0200] P-51424 T-139858028005184 I SRV 39: (-----) TCP/IP write error occurred with errno 32 [2016/07/25@11:46:17.233+0200] P-11690 T-139931360241472 I SRV 48: (-----) TCP/IP write error occurred with errno 32 [2016/07/25@11:55:47.234+0200] P-33974 T-139697772349248 I SRV 4: (-----) TCP/IP write error occurred with errno 32 [2016/07/25@11:59:46.517+0200] P-10225 T-139944178841408 I SRV 45: (-----) TCP/IP write error occurred with errno 32 [2016/07/25@12:01:44.940+0200] P-51424 T-139858028005184 I SRV 39: (-----) TCP/IP write error occurred with errno 32 [2016/07/25@13:41:28.143+0200] P-4635 T-140640063956800 I SRV 29: (-----) TCP/IP write error occurred with errno 32 [2016/07/25@13:50:06.256+0200] P-3171 T-140347927385920 I SRV 12: (-----) TCP/IP write error occurred with errno 32 [2016/07/25@13:56:17.786+0200] P-3171 T-140347927385920 I SRV 12: (-----) TCP/IP write error occurred with errno 32 [2016/07/25@13:59:28.007+0200] P-4133 T-139904439777088 I SRV 25: (-----) TCP/IP write error occurred with errno 32 [2016/07/25@14:06:39.272+0200] P-4133 T-139904439777088 I SRV 25: (-----) TCP/IP write error occurred with errno 32 [2016/07/25@14:16:58.821+0200] P-5630 T-140431553333056 I SRV 37: (-----) TCP/IP write error occurred with errno 32 [2016/07/25@14:24:17.698+0200] P-5099 T-139919449716544 I SRV 33: (-----) TCP/IP write error occurred with errno 32 [2016/07/25@16:58:01.036+0200] P-19422 T-140208870037312 I SRV 17: (-----) TCP/IP write error occurred with errno 32 [2016/07/26@06:53:27.189+0200] P-23437 T-140599803811648 I SRV 29: (-----) TCP/IP write error occurred with errno 32 [2016/07/26@07:39:35.723+0200] P-27754 T-140670744590144 I SRV 36: (-----) TCP/IP write error occurred with errno 32 [2016/07/26@08:49:59.790+0200] P-60434 T-139664345315136 I SRV 48: (-----) TCP/IP write error occurred with errno 32 [2016/07/26@09:08:27.365+0200] P-20130 T-139837751461696 I SRV 20: (-----) TCP/IP write error occurred with errno 32 [2016/07/26@09:14:49.987+0200] P-20130 T-139837751461696 I SRV 20: (-----) TCP/IP write error occurred with errno 32 [2016/07/26@09:46:00.119+0200] P-57586 T-140041057462080 I SRV 46: (-----) TCP/IP write error occurred with errno 32
If I understand correctly only the batch processes (20% of all sessions) were connected db locally. Today all of them are using a remote connection.
TCP/IP write error results in 794: Usernum <n> terminated abnormally. Is there anything common for all sessions that died with tcp error? The same username? The same remote host? I guess the batch processes are now connected using localhost. Will they get the 794's today?
I don't have a "theory" but I suspect that tcp errors and semaphore errors have the same source.
Just to be clear, not only were connections changed to client/server (-Ma 1)
But that this Linux guest is now on it's OWN host (as opposed to shared with other guests)
Where the Hyper-V node is now running only the Centos Linux.
All other VM's have been moved to the second node."
I believe that this is the significant part of the change that has somehow stabilised this awful semaphore down situation.
Found the trigger for the issue and a solution.
The database is started using a script that basically does
sudo -u proadmin proserve LISA sudo -u proadmin probiw LISA sudo -u proadmin proapw LISA sudo -u proadmin proapw LISA
if someone logs on to the system with user proadmin using ssh, and executes
proshut LISA -C list
the semaphore errors occur when that user logs off. As long as the proadmin user does not log off the db stays up & running.
On our test server the issue went away with a yum update
That is why the log files of the incidents should not be trimmed. ;-)