Database continues to crash with semaphore errors, how to investigate - Forum - OpenEdge RDBMS - Progress Community

Database continues to crash with semaphore errors, how to investigate

 Forum

Database continues to crash with semaphore errors, how to investigate

This question is answered

Hi,

were running low on ideas. Since jul 13 our databases are crashing with semaphore errors.

History

  • databases migrated from oe10.2b using conv1011
  • server changed, from real redhat 5 to HyperV hosted virtual centos 7
  • OpenEdge from 10.2B to oe 11.6
  • first crash on jul 13 after changing -Mm 1024 -> 10240
  • frequent crashes of broker, server, always showing semaphore errors

things we tried

  • change kernel.sem
  • change max open file limits
  • stop one of the databases
  • stop apache httpd
  • truncate bi
  • dump & load db
  • move hyper-v vhd from CluSter Shared Volume to local disks
  • stopped clock sync with hyperv host

I don't know if it's related but I noticed in the latest crash that the time on one of the entries is in the wrong timezone

[2016/07/25@06:23:11.613+0200] P-32381      T-140189001459520 I ABL    86: (453)   Logout by batchclnt on batch.
[2016/07/25@06:25:04.592+0200] P-28093      T-140142107518784 I SHUT   88: (453)   Logout by proadmin on /dev/pts/0.

                Mon Jul 25 04:25:04 2016
[2016/07/25@04:25:04.000+0000] P-28093      T-140142107518784 I SHUT       (562)   HANGUP signal received.

                Mon Jul 25 06:28:41 2016
[2016/07/25@06:28:41.616+0200] P-31604      T-140273689978688 I SRV     5: (1132)  Invalid semaphore id
[2016/07/25@06:28:41.616+0200] P-31604      T-140273689978688 I SRV     5: (10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 0, errno = 22.
[2016/07/25@06:28:41.616+0200] P-31604      T-140273689978688 I SRV     5: (-----) semLockLog_2: semValue val = -1
[2016/07/25@06:28:41.616+0200] P-31604      T-140273689978688 I SRV     5: (1132)  Invalid semaphore id
[2016/07/25@06:28:41.616+0200] P-31604      T-140273689978688 I SRV     5: (10839) SYSTEM ERROR: Unable to set semaphore set : semid = 2, errno = 22.
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (2520)  Stopped.
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (1132)  Invalid semaphore id
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 0, errno = 22.
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (-----) semLockLog_2: semValue val = -1
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (1132)  Invalid semaphore id
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (10839) SYSTEM ERROR: Unable to set semaphore set : semid = 2, errno = 22.
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (2520)  Stopped.
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (1132)  Invalid semaphore id
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 0, errno = 22.
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (-----) semLockLog_2: semValue val = -1
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (1132)  Invalid semaphore id
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (10839) SYSTEM ERROR: Unable to set semaphore set : semid = 2, errno = 22.
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (453)   Logout by proadmin on batch.
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (1132)  Invalid semaphore id
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 0, errno = 22.
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (-----) semLockLog_2: semValue val = -1
[2016/07/25@06:28:48.303+0200] P-27133      T-140230977447744 I BROKER  0: (-----) enssbrocon: child exited with 2, err=9
[2016/07/25@06:28:48.303+0200] P-27133      T-140230977447744 I BROKER  0: (8837)  4GL server process terminated.
[2016/07/25@06:29:28.457+0200] P-27133      T-140230977447744 I BROKER  0: (-----) enssbrocon: child exited with 2, err=9
[2016/07/25@06:29:28.457+0200] P-27133      T-140230977447744 I BROKER  0: (8837)  4GL server process terminated.
[2016/07/25@06:29:42.516+0200] P-27133      T-140230977447744 I BROKER  0: (-----) enssbrocon: child exited with 2, err=9
[2016/07/25@06:29:42.516+0200] P-27133      T-140230977447744 I BROKER  0: (8837)  4GL server process terminated.

attacheg log file errors

Verified Answer
  • Found the trigger for the issue and a solution.

    The database is started using a script that basically does

    sudo -u proadmin proserve LISA
    sudo -u proadmin probiw LISA
    sudo -u proadmin proapw LISA
    sudo -u proadmin proapw LISA

    if someone logs on to the system with user proadmin using ssh, and executes

    proshut LISA -C list

    the semaphore errors occur when that user logs off. As long as the proadmin user does not log off the db stays up & running.

    On our test server the issue went away with a yum update

All Replies
  • posted  trimmed logs here

    https://ccenv-my.sharepoint.com/personal/cvb_cce_be/_layouts/15/guestaccess.aspx?guestaccesstoken=FEt3KSY0aDGyDDObljDV9pIax531FM2sJq3OwCJLByQ%3d&docid=078d8fc568fd04005a1f4097a3d6f7049&rev=1

  • > I don't know if it's related but I noticed in the latest crash that the time on one of the entries is in the wrong timezone

    It's just a feature:

    knowledgebase.progress.com/.../Database-stalls-on-Linux-because-self-service-OpenEdge-client-deadlocks-if-it-receives-hangup-signal-at-the-wrong-time

    A partial fix is included in OpenEdge 11.2.1 and 11.3.x

    As a side effect of the fix, messages from signal handlers will always show as UTC time instead of the database's timezone.

  • > (10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 0, errno = 22.

    Errno 22 is an invalid argument. In this case the semid is invalid. I guess the other small values of semid are wrong as well.

    Carl, can you create the list of semid's allocated by broker at startup?

    promon/R&D/debghb/16. Semaphores (but first 5. Adjust Monitor Options / 1. Enter page length: 9999

    [2016/07/13@22:01:49.085+0200] P-60282      T-140225063741248 I SRV     5: (794)   Usernum 187 terminated abnormally. 
    [2016/07/13@22:01:49.085+0200] P-60282      T-140225063741248 I SRV     5: (-----) TCP/IP write error occurred with errno 32
    [2016/07/13@22:01:49.092+0200] P-60282      T-140225063741248 I SRV     5: (1132)  Invalid semaphore id 
    [2016/07/13@22:01:49.092+0200] P-60282      T-140225063741248 I SRV     5: (10839) SYSTEM ERROR: Unable to set semaphore set : semid = 7, errno = 22. 
    


    I saw these errors (TCP/IP write error occurred with errno 32) when all remote clients servers were napped trying to get MTX latch.
    Do you monitor the latch activity? You can do that using ProTop or my dbmon script.

  • The list of the errors from your log with the unique semid:

    (10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 5, errno = 22. 
    (10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = -32763, errno = 22. 
    (10839) SYSTEM ERROR: Unable to set semaphore set : semid = -32762, errno = 22. 
    (10839) SYSTEM ERROR: Unable to set semaphore set : semid = -32761, errno = 22. 
    (10839) SYSTEM ERROR: Unable to set semaphore set : semid = 6, errno = 22.
    (10839) SYSTEM ERROR: Unable to set semaphore set : semid = 7, errno = 22.

    Note that -32763, -32762, -32762 = 32768 minus 5, 6 and 7

    (6526)  Number of Semaphore Sets (-semsets): 3 

    Your database used 3 semsets and all their semid's became truncated to one byte.

  • Hi George,

    thanks for your reply. Not yet monitoring latch activity, will look that up

    # ipcs -s

    ------ Semaphore Arrays --------

    key        semid      owner      perms      nsems

    0x00000000 294912     proadmin   666        7

    0x00000000 327681     proadmin   666        97

    0x00000000 360450     proadmin   666        97

    0x00000000 393219     proadmin   666        7

    0x00000000 425988     proadmin   666        97

    0x00000000 458757     proadmin   666        97

    for the LISA database

    07/25/16        Semaphores

    09:07:11

    User         Id        Num

       0     327681          0

       1     360450          0

       2     327681          1

       3     360450          1

       4     327681          2

       5     360450          2

       6     327681          3

       7     360450          3

       8     327681          4

       9     360450          4

      10     327681          5

      11     360450          5

      12     327681          6

      13     360450          6

      14     327681          7

      15     360450          7

      16     327681          8

      17     360450          8

      18     327681          9

      19     360450          9

      20     327681         10

      21     360450         10

      22     327681         11

      23     360450         11

      24     327681         12

      25     360450         12

      26     327681         13

      27     360450         13

      28     327681         14

      29     360450         14

      30     327681         15

      31     360450         15

      32     327681         16

      33     360450         16

      34     327681         17

      35     360450         17

      36     327681         18

      37     360450         18

      38     327681         19

      39     360450         19

      40     327681         20

      41     360450         20

      42     327681         21

      43     360450         21

      44     327681         22

      45     360450         22

      46     327681         23

      47     360450         23

      48     327681         24

      49     360450         24

      50     327681         25

      51     360450         25

      52     327681         26

      53     360450         26

      54     327681         27

      55     360450         27

      56     327681         28

      57     360450         28

      58     327681         29

      59     360450         29

      60     327681         30

      61     360450         30

      62     327681         31

      63     360450         31

      64     327681         32

      65     360450         32

      66     327681         33

      67     360450         33

      68     327681         34

      69     360450         34

      70     327681         35

      71     360450         35

      72     327681         36

      73     360450         36

      74     327681         37

      75     360450         37

      76     327681         38

      77     360450         38

      78     327681         39

      79     360450         39

      80     327681         40

      81     360450         40

      82     327681         41

      83     360450         41

      84     327681         42

      85     360450         42

      86     327681         43

      87     360450         43

      88     327681         44

      89     360450         44

      90     327681         45

      91     360450         45

      92     327681         46

      93     360450         46

      94     327681         47

      95     360450         47

      96     327681         48

      97     360450         48

      98     327681         49

      99     360450         49

     100     327681         50

     101     360450         50

     102     327681         51

     103     360450         51

     104     327681         52

     105     360450         52

     106     327681         53

     107     360450         53

     108     327681         54

     109     360450         54

     110     327681         55

     111     360450         55

     112     327681         56

     113     360450         56

     114     327681         57

     115     360450         57

     116     327681         58

     117     360450         58

     118     327681         59

     119     360450         59

     120     327681         60

     121     360450         60

     122     327681         61

     123     360450         61

     124     327681         62

     125     360450         62

     126     327681         63

     127     360450         63

     128     327681         64

     129     360450         64

     130     327681         65

     131     360450         65

     132     327681         66

     133     360450         66

     134     327681         67

     135     360450         67

     136     327681         68

     137     360450         68

     138     327681         69

     139     360450         69

     140     327681         70

     141     360450         70

     142     327681         71

     143     360450         71

     144     327681         72

     145     360450         72

     146     327681         73

     147     360450         73

     148     327681         74

     149     360450         74

     150     327681         75

     151     360450         75

     152     327681         76

     153     360450         76

     154     327681         77

     155     360450         77

     156     327681         78

     157     360450         78

     158     327681         79

     159     360450         79

     160     327681         80

     161     360450         80

     162     327681         81

     163     360450         81

     164     327681         82

     165     360450         82

     166     327681         83

     167     360450         83

     168     327681         84

     169     360450         84

     170     327681         85

     171     360450         85

     172     327681         86

     173     360450         86

     174     327681         87

     175     360450         87

     176     327681         88

     177     360450         88

     178     327681         89

     179     360450         89

     180     327681         90

     181     360450         90

     182     327681         91

     183     360450         91

     184     327681         92

     185     360450         92

     186     327681         93

     187     360450         93

     188     327681         94

     189     360450         94

     190     327681         95

     191     360450         95

     192     327681         96

    for the lisalrap database (we also had the crashes when the second db was not running)

    07/25/16        Semaphores

    09:10:34

    User         Id        Num

       0     425988          0

       1     458757          0

       2     425988          1

       3     458757          1

       4     425988          2

       5     458757          2

       6     425988          3

       7     458757          3

       8     425988          4

       9     458757          4

      10     425988          5

      11     458757          5

      12     425988          6

      13     458757          6

      14     425988          7

      15     458757          7

      16     425988          8

      17     458757          8

      18     425988          9

      19     458757          9

      20     425988         10

      21     458757         10

      22     425988         11

      23     458757         11

      24     425988         12

      25     458757         12

      26     425988         13

      27     458757         13

      28     425988         14

      29     458757         14

      30     425988         15

      31     458757         15

      32     425988         16

      33     458757         16

      34     425988         17

      35     458757         17

      36     425988         18

      37     458757         18

      38     425988         19

      39     458757         19

      40     425988         20

      41     458757         20

      42     425988         21

      43     458757         21

      44     425988         22

      45     458757         22

      46     425988         23

      47     458757         23

      48     425988         24

      49     458757         24

      50     425988         25

      51     458757         25

      52     425988         26

      53     458757         26

      54     425988         27

      55     458757         27

      56     425988         28

      57     458757         28

      58     425988         29

      59     458757         29

      60     425988         30

      61     458757         30

      62     425988         31

      63     458757         31

      64     425988         32

      65     458757         32

      66     425988         33

      67     458757         33

      68     425988         34

      69     458757         34

      70     425988         35

      71     458757         35

      72     425988         36

      73     458757         36

      74     425988         37

      75     458757         37

      76     425988         38

      77     458757         38

      78     425988         39

      79     458757         39

      80     425988         40

      81     458757         40

      82     425988         41

      83     458757         41

      84     425988         42

      85     458757         42

      86     425988         43

      87     458757         43

      88     425988         44

      89     458757         44

      90     425988         45

      91     458757         45

      92     425988         46

      93     458757         46

      94     425988         47

      95     458757         47

      96     425988         48

      97     458757         48

      98     425988         49

      99     458757         49

     100     425988         50

     101     458757         50

     102     425988         51

     103     458757         51

     104     425988         52

     105     458757         52

     106     425988         53

     107     458757         53

     108     425988         54

     109     458757         54

     110     425988         55

     111     458757         55

     112     425988         56

     113     458757         56

     114     425988         57

     115     458757         57

     116     425988         58

     117     458757         58

     118     425988         59

     119     458757         59

     120     425988         60

     121     458757         60

     122     425988         61

     123     458757         61

     124     425988         62

     125     458757         62

     126     425988         63

     127     458757         63

     128     425988         64

     129     458757         64

     130     425988         65

     131     458757         65

     132     425988         66

     133     458757         66

     134     425988         67

     135     458757         67

     136     425988         68

     137     458757         68

     138     425988         69

     139     458757         69

     140     425988         70

     141     458757         70

     142     425988         71

     143     458757         71

     144     425988         72

     145     458757         72

     146     425988         73

     147     458757         73

     148     425988         74

     149     458757         74

     150     425988         75

     151     458757         75

     152     425988         76

     153     458757         76

     154     425988         77

     155     458757         77

     156     425988         78

     157     458757         78

     158     425988         79

     159     458757         79

     160     425988         80

     161     458757         80

     162     425988         81

     163     458757         81

     164     425988         82

     165     458757         82

     166     425988         83

     167     458757         83

     168     425988         84

     169     458757         84

     170     425988         85

     171     458757         85

     172     425988         86

     173     458757         86

     174     425988         87

     175     458757         87

     176     425988         88

     177     458757         88

     178     425988         89

     179     458757         89

     180     425988         90

     181     458757         90

     182     425988         91

     183     458757         91

     184     425988         92

     185     458757         92

     186     425988         93

     187     458757         93

     188     425988         94

     189     458757         94

     190     425988         95

     191     458757         95

     192     425988         96

  • 294912 = 0x48000
    327681 = 0x50001
    360450 = 0x58002
    
    393219 = 0x60003
    425988 = 0x68004
    458757 = 0x70005
    

    I guess Remote Client Servers get the semid's from db shared memory. When one of servers corrupts this information in shared memory then the other ones begin to report the 10839's. "Mad" server "updates" semid's for its clients. If it served more than one client it may corrupt more than one semid. Can you use the -Ma 1 as a workaround?

    BTW, the first semset (the one with 7 semaphores) is used only during db startup and during logins/logouts of self-serviced clients. I guess you will not see the last byte of this semid in the 10839's

  • I ran dbmon.sh, output uploaded here

    https://ccenv-my.sharepoint.com/personal/cvb_cce_be/_layouts/15/guestaccess.aspx?guestaccesstoken=DfptMKQvgq3MsCs2MEV32d6KU6hydxAMvMIwKVVWKOY%3d&docid=0b802bea945d04006b17f18a20cf1f814&rev=1

    I found a (seemingly) similar issue reported by SteveSaltDanwood, I asked what the solution was and this is his reply

    Carl,

    We had been in discussions with Tech Support regarding this. The question refers to KB 61278, and our sys Admin has increased both SEMMSL, SEMMNS and SEMOPM slightly. The one thing that I did, was to add -semsets 6 to the Primary Login Broker pf file, and our issue disappeared - not sure whether this actually fixed it, but since the change on 10th Feb we have had no further instances.

    Regards,

    Steve

  • Carl, do you always pre-start 60 remote client servers immidiately after db startup?

    [2016/07/13@23:48:13.123+0200] P-3986       T-140605298083648 I APW    82: (2518)  Started. 
    [2016/07/13@23:48:27.788+0200] P-4350       T-140402236524352 I SRV     1: (5646)  Started on port 11002 using TCP IPV4 address 0.0.0.0, pid 4350. 
    [2016/07/13@23:49:14.562+0200] P-4843       T-140184589739840 I SRV     2: (5646)  Started on port 11003 using TCP IPV4 address 0.0.0.0, pid 4843. 
    
    [2016/07/14@00:26:16.996+0200] P-17010      T-140595944453952 I ABL    87: (1132)  Invalid semaphore id 
    [2016/07/14@00:26:16.996+0200] P-17010      T-140595944453952 I ABL    87: (10839) SYSTEM ERROR: Unable to set semaphore set : semid = 7, errno = 22. 
    [2016/07/14@00:26:16.996+0200] P-17010      T-140595944453952 I ABL    87: (2252)  Begin transaction backout. 
    [2016/07/14@00:26:16.996+0200] P-17010      T-140595944453952 I ABL    87: (2253)  Transaction backout completed. 
    [2016/07/14@00:26:17.002+0200] P-17010      T-140595944453952 I ABL        (1132)  Invalid semaphore id 
    [2016/07/14@00:26:17.002+0200] P-17010      T-140595944453952 I ABL        (1132)  Invalid semaphore id 
    [2016/07/14@00:26:17.002+0200] P-17010      T-140595944453952 I ABL        (10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 5, errno = 22. 
    [2016/07/14@00:26:17.002+0200] P-17010      T-140595944453952 I ABL        (-----) semLockLog_2: semValue val = -1
    [2016/07/14@00:26:19.903+0200] P-3966       T-140549975136064 I BROKER  0: (2527)  Disconnecting dead user 87. 
    
    
    
    [2016/07/14@00:55:14.379+0200] P-2288       T-139877706889024 I APW    82: (2518)  Started. 
    [2016/07/14@00:57:04.277+0200] P-3010       T-140641843181376 I SRV     2: (5646)  Started on port 11003 using TCP IPV4 address 0.0.0.0, pid 3010. 
    ...
    [2016/07/14@01:02:05.506+0200] P-4201       T-140565902128960 I SRV     5: (5646)  Started on port 11006 using TCP IPV4 address 0.0.0.0, pid 4201. 
    [2016/07/14@01:25:33.654+0200] P-2626       T-139742344922944 I APPSRV 91: (1132)  Invalid semaphore id 
    [2016/07/14@01:25:33.654+0200] P-2626       T-139742344922944 I APPSRV 91: (10839) SYSTEM ERROR: Unable to set semaphore set : semid = -32761, errno = 22. 
    

    It looks like sometimes only a few servers were started successfully and the semaphore errors were issued by the self-service clients:

    Latch activity collected by dbmon is normal. And I guess db activity was next to zero right after db startup in the midnight. I suspect that starting many RCS at once can be a root case of the errors.Multiple semsets will decrease the competion for semset's during their launches.

  • No servers are prestarted, would that be better ?

    Just  had a call with Progress tech support, they suspect something is leaking file handles.

  • > No servers are prestarted

    Db log shows that /sometimes/ a lot of servers were started but there were no remote users connected.

    > Just  had a call with Progress tech support, they suspect something is leaking file handles.

    I'd run promon in background to check every second a few first lines on 16. "Semaphores" screen. If they are changed then the information in shared memory has got corrupted.

  • I trimmed the logfile, deleted all logon/logoff sequences, maybe shouldn't have done that
     

    Ø  I'd run promon in background to ..

     
    I'll try that
     
  • When you will be able to catch the changes of the shmid's then the next step would be to generate the protrace files for all Progress sessions. And maybe it's not too late to do the same immediately after a first semaphore error in db log.

  • Lastest crash 25 minutes after db startup following reboot of server. increase of -semsets to 6 did not help.

    Captured promon semaphore output 11 seconds before crash didn't look different than one after db start

    ccenv-my.sharepoint.com/.../guestaccess.aspx

  • Was collecting semaphore output every minute.

    Crash frequency increased (2 crashes last 2.5 hours) will stop collecting this data. May be unrelated but I wont take the risk

  • > Was collecting semaphore output every minute

    Time interval was too long. One second would have more chances to catch the corruptions of semid's. But even one second can be too long.

    What values had these semid's for last multi-user session?

    Only 2 of 5 semid's were corrupted this time.