Database continues to crash with semaphore errors, how to investigate - Forum - OpenEdge RDBMS - Progress Community

Database continues to crash with semaphore errors, how to investigate

 Forum

Database continues to crash with semaphore errors, how to investigate

This question is answered

Hi,

were running low on ideas. Since jul 13 our databases are crashing with semaphore errors.

History

  • databases migrated from oe10.2b using conv1011
  • server changed, from real redhat 5 to HyperV hosted virtual centos 7
  • OpenEdge from 10.2B to oe 11.6
  • first crash on jul 13 after changing -Mm 1024 -> 10240
  • frequent crashes of broker, server, always showing semaphore errors

things we tried

  • change kernel.sem
  • change max open file limits
  • stop one of the databases
  • stop apache httpd
  • truncate bi
  • dump & load db
  • move hyper-v vhd from CluSter Shared Volume to local disks
  • stopped clock sync with hyperv host

I don't know if it's related but I noticed in the latest crash that the time on one of the entries is in the wrong timezone

[2016/07/25@06:23:11.613+0200] P-32381      T-140189001459520 I ABL    86: (453)   Logout by batchclnt on batch.
[2016/07/25@06:25:04.592+0200] P-28093      T-140142107518784 I SHUT   88: (453)   Logout by proadmin on /dev/pts/0.

                Mon Jul 25 04:25:04 2016
[2016/07/25@04:25:04.000+0000] P-28093      T-140142107518784 I SHUT       (562)   HANGUP signal received.

                Mon Jul 25 06:28:41 2016
[2016/07/25@06:28:41.616+0200] P-31604      T-140273689978688 I SRV     5: (1132)  Invalid semaphore id
[2016/07/25@06:28:41.616+0200] P-31604      T-140273689978688 I SRV     5: (10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 0, errno = 22.
[2016/07/25@06:28:41.616+0200] P-31604      T-140273689978688 I SRV     5: (-----) semLockLog_2: semValue val = -1
[2016/07/25@06:28:41.616+0200] P-31604      T-140273689978688 I SRV     5: (1132)  Invalid semaphore id
[2016/07/25@06:28:41.616+0200] P-31604      T-140273689978688 I SRV     5: (10839) SYSTEM ERROR: Unable to set semaphore set : semid = 2, errno = 22.
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (2520)  Stopped.
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (1132)  Invalid semaphore id
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 0, errno = 22.
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (-----) semLockLog_2: semValue val = -1
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (1132)  Invalid semaphore id
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (10839) SYSTEM ERROR: Unable to set semaphore set : semid = 2, errno = 22.
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (2520)  Stopped.
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (1132)  Invalid semaphore id
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 0, errno = 22.
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (-----) semLockLog_2: semValue val = -1
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (1132)  Invalid semaphore id
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (10839) SYSTEM ERROR: Unable to set semaphore set : semid = 2, errno = 22.
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (453)   Logout by proadmin on batch.
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (1132)  Invalid semaphore id
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 0, errno = 22.
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (-----) semLockLog_2: semValue val = -1
[2016/07/25@06:28:48.303+0200] P-27133      T-140230977447744 I BROKER  0: (-----) enssbrocon: child exited with 2, err=9
[2016/07/25@06:28:48.303+0200] P-27133      T-140230977447744 I BROKER  0: (8837)  4GL server process terminated.
[2016/07/25@06:29:28.457+0200] P-27133      T-140230977447744 I BROKER  0: (-----) enssbrocon: child exited with 2, err=9
[2016/07/25@06:29:28.457+0200] P-27133      T-140230977447744 I BROKER  0: (8837)  4GL server process terminated.
[2016/07/25@06:29:42.516+0200] P-27133      T-140230977447744 I BROKER  0: (-----) enssbrocon: child exited with 2, err=9
[2016/07/25@06:29:42.516+0200] P-27133      T-140230977447744 I BROKER  0: (8837)  4GL server process terminated.

attacheg log file errors

All Replies
  • I'll add ipcs -s to the startup script.

    Yesterday we stopped using shared-memory connections and it got us through the night without crashes

  • Did the tcp errors disappear as well?

  • I haven't seen any TCP/IP errors in the db log, I do see a lot of dropped packets

    ifconfig

    eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500

           inet x.x.x.x  netmask 255.255.0.0  broadcast 10.0.255.255

           inet6 yyyyyyyyyyyyyyyyyy  prefixlen 64  scopeid 0x20<link>

           ether 00:15:5d:00:01:12  txqueuelen 1000  (Ethernet)

           RX packets 72734765  bytes 25463358707 (23.7 GiB)

           RX errors 0  dropped 1023562  overruns 0  frame 0

           TX packets 61839003  bytes 83457857300 (77.7 GiB)

           TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

    lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536

           inet 127.0.0.1  netmask 255.0.0.0

           inet6 ::1  prefixlen 128  scopeid 0x10<host>

           loop  txqueuelen 0  (Local Loopback)

           RX packets 483858734  bytes 164988488628 (153.6 GiB)

           RX errors 0  dropped 0  overruns 0  frame 0

           TX packets 483858734  bytes 164988488628 (153.6 GiB)

           TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

  • TCP/IP write error occurred with errno 32

    >I haven't seen any TCP/IP errors in the db log,

    Do you mean that you don't see them in the current log?

    Yesterday these errors were reported in log once per 5-10 minutes through the whole day.

  • I missed a few during my last search, there were none between 17:00 and 06:50, there are a few errors since that time but the db stays up & running.

    [2016/07/25@07:25:12.467+0200] P-37327      T-139985194260288 I SRV    16: (-----) TCP/IP write error occurred with errno 32
    [2016/07/25@07:42:00.265+0200] P-38269      T-140325921544000 I SRV    19: (-----) TCP/IP write error occurred with errno 32
    [2016/07/25@07:56:39.636+0200] P-37336      T-139744649774912 I SRV    18: (-----) TCP/IP write error occurred with errno 32
    [2016/07/25@08:22:32.523+0200] P-46233      T-139696198965056 I SRV    34: (-----) TCP/IP write error occurred with errno 32
    [2016/07/25@08:33:28.461+0200] P-46233      T-139696198965056 I SRV    34: (-----) TCP/IP write error occurred with errno 32
    [2016/07/25@08:53:58.327+0200] P-51006      T-139629681858368 I SRV    38: (-----) TCP/IP write error occurred with errno 32
    [2016/07/25@08:56:58.560+0200] P-52605      T-140382806456128 I SRV    40: (-----) TCP/IP write error occurred with errno 32
    [2016/07/25@08:58:07.372+0200] P-53007      T-139680708523840 I SRV    41: (-----) TCP/IP write error occurred with errno 32
    [2016/07/25@09:19:55.615+0200] P-52605      T-140382806456128 I SRV    40: (-----) TCP/IP write error occurred with errno 32
    [2016/07/25@09:21:45.428+0200] P-40014      T-140399388444480 I SRV    21: (-----) TCP/IP write error occurred with errno 32
    [2016/07/25@09:25:09.992+0200] P-56699      T-140641693988672 I SRV    43: (-----) TCP/IP write error occurred with errno 32
    [2016/07/25@09:47:20.925+0200] P-33969      T-140121481381696 I SRV     2: (-----) TCP/IP write error occurred with errno 32
    [2016/07/25@09:48:22.646+0200] P-52605      T-140382806456128 I SRV    40: (-----) TCP/IP write error occurred with errno 32
    [2016/07/25@10:33:19.665+0200] P-40014      T-140399388444480 I SRV    21: (-----) TCP/IP write error occurred with errno 32
    [2016/07/25@10:34:55.088+0200] P-53007      T-139680708523840 I SRV    41: (-----) TCP/IP write error occurred with errno 32
    [2016/07/25@10:48:41.338+0200] P-33974      T-139697772349248 I SRV     4: (-----) TCP/IP write error occurred with errno 32
    [2016/07/25@10:50:27.258+0200] P-52605      T-140382806456128 I SRV    40: (-----) TCP/IP write error occurred with errno 32
    [2016/07/25@11:12:19.914+0200] P-10713      T-140585542993728 I SRV    46: (-----) TCP/IP write error occurred with errno 32
    [2016/07/25@11:14:42.414+0200] P-10713      T-140585542993728 I SRV    46: (-----) TCP/IP write error occurred with errno 32
    [2016/07/25@11:44:40.016+0200] P-51424      T-139858028005184 I SRV    39: (-----) TCP/IP write error occurred with errno 32
    [2016/07/25@11:46:17.233+0200] P-11690      T-139931360241472 I SRV    48: (-----) TCP/IP write error occurred with errno 32
    [2016/07/25@11:55:47.234+0200] P-33974      T-139697772349248 I SRV     4: (-----) TCP/IP write error occurred with errno 32
    [2016/07/25@11:59:46.517+0200] P-10225      T-139944178841408 I SRV    45: (-----) TCP/IP write error occurred with errno 32
    [2016/07/25@12:01:44.940+0200] P-51424      T-139858028005184 I SRV    39: (-----) TCP/IP write error occurred with errno 32
    [2016/07/25@13:41:28.143+0200] P-4635       T-140640063956800 I SRV    29: (-----) TCP/IP write error occurred with errno 32
    [2016/07/25@13:50:06.256+0200] P-3171       T-140347927385920 I SRV    12: (-----) TCP/IP write error occurred with errno 32
    [2016/07/25@13:56:17.786+0200] P-3171       T-140347927385920 I SRV    12: (-----) TCP/IP write error occurred with errno 32
    [2016/07/25@13:59:28.007+0200] P-4133       T-139904439777088 I SRV    25: (-----) TCP/IP write error occurred with errno 32
    [2016/07/25@14:06:39.272+0200] P-4133       T-139904439777088 I SRV    25: (-----) TCP/IP write error occurred with errno 32
    [2016/07/25@14:16:58.821+0200] P-5630       T-140431553333056 I SRV    37: (-----) TCP/IP write error occurred with errno 32
    [2016/07/25@14:24:17.698+0200] P-5099       T-139919449716544 I SRV    33: (-----) TCP/IP write error occurred with errno 32
    [2016/07/25@16:58:01.036+0200] P-19422      T-140208870037312 I SRV    17: (-----) TCP/IP write error occurred with errno 32
    [2016/07/26@06:53:27.189+0200] P-23437      T-140599803811648 I SRV    29: (-----) TCP/IP write error occurred with errno 32
    [2016/07/26@07:39:35.723+0200] P-27754      T-140670744590144 I SRV    36: (-----) TCP/IP write error occurred with errno 32
    [2016/07/26@08:49:59.790+0200] P-60434      T-139664345315136 I SRV    48: (-----) TCP/IP write error occurred with errno 32
    [2016/07/26@09:08:27.365+0200] P-20130      T-139837751461696 I SRV    20: (-----) TCP/IP write error occurred with errno 32
    [2016/07/26@09:14:49.987+0200] P-20130      T-139837751461696 I SRV    20: (-----) TCP/IP write error occurred with errno 32
    [2016/07/26@09:46:00.119+0200] P-57586      T-140041057462080 I SRV    46: (-----) TCP/IP write error occurred with errno 32
    
  • If I understand correctly only the batch processes (20% of all sessions) were connected db locally. Today all of them are using a remote connection.

    TCP/IP write error results in 794: Usernum <n> terminated abnormally. Is there anything common for all sessions that died with tcp error? The same username? The same remote host? I guess the batch processes are now connected using localhost. Will they get the 794's today?

    I don't have a "theory" but I suspect that tcp errors and semaphore errors have the same source.

  • Just to be clear, not only were connections changed to client/server (-Ma 1)

    But that this Linux guest is now on it's OWN host (as opposed to shared with other guests)

    Where the Hyper-V node is now running only the Centos Linux.

    All other VM's have been moved to the second node."

    I believe that this is the significant part of the change that has somehow stabilised this awful semaphore down situation.

  • Found the trigger for the issue and a solution.

    The database is started using a script that basically does

    sudo -u proadmin proserve LISA
    sudo -u proadmin probiw LISA
    sudo -u proadmin proapw LISA
    sudo -u proadmin proapw LISA

    if someone logs on to the system with user proadmin using ssh, and executes

    proshut LISA -C list

    the semaphore errors occur when that user logs off. As long as the proadmin user does not log off the db stays up & running.

    On our test server the issue went away with a yum update

  • That is why the log files of the incidents should not be trimmed. ;-)