Database continues to crash with semaphore errors, how to investigate?

Posted by cverbiest on 25-Jul-2016 00:53

Hi,

We're running low on ideas. Since Jul 13 our databases have been crashing with semaphore errors.

History

  • databases migrated from OE 10.2B using conv1011
  • server changed from a physical Red Hat 5 box to a Hyper-V hosted virtual CentOS 7
  • OpenEdge upgraded from 10.2B to 11.6
  • first crash on Jul 13 after changing -Mm 1024 -> 10240
  • frequent crashes of broker and servers, always showing semaphore errors

Things we tried:

  • change kernel.sem (see the sketch after this list)
  • change max open file limits
  • stop one of the databases
  • stop Apache httpd
  • truncate the bi file
  • dump & load the db
  • move the Hyper-V VHD from a Cluster Shared Volume to local disks
  • stop clock sync with the Hyper-V host
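
For reference, the kernel.sem change was along these lines. The values below are illustrative rather than the exact ones we used; the four fields are SEMMSL, SEMMNS, SEMOPM and SEMMNI.

cat /proc/sys/kernel/sem                                   # show current limits
sysctl -w kernel.sem="250 32000 32 128"                    # change at runtime
echo "kernel.sem = 250 32000 32 128" >> /etc/sysctl.conf   # persist across reboots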

I don't know if it's related, but I noticed in the latest crash that the time on one of the entries is in the wrong timezone:

[2016/07/25@06:23:11.613+0200] P-32381      T-140189001459520 I ABL    86: (453)   Logout by batchclnt on batch.
[2016/07/25@06:25:04.592+0200] P-28093      T-140142107518784 I SHUT   88: (453)   Logout by proadmin on /dev/pts/0.

                Mon Jul 25 04:25:04 2016
[2016/07/25@04:25:04.000+0000] P-28093      T-140142107518784 I SHUT       (562)   HANGUP signal received.

                Mon Jul 25 06:28:41 2016
[2016/07/25@06:28:41.616+0200] P-31604      T-140273689978688 I SRV     5: (1132)  Invalid semaphore id
[2016/07/25@06:28:41.616+0200] P-31604      T-140273689978688 I SRV     5: (10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 0, errno = 22.
[2016/07/25@06:28:41.616+0200] P-31604      T-140273689978688 I SRV     5: (-----) semLockLog_2: semValue val = -1
[2016/07/25@06:28:41.616+0200] P-31604      T-140273689978688 I SRV     5: (1132)  Invalid semaphore id
[2016/07/25@06:28:41.616+0200] P-31604      T-140273689978688 I SRV     5: (10839) SYSTEM ERROR: Unable to set semaphore set : semid = 2, errno = 22.
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (2520)  Stopped.
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (1132)  Invalid semaphore id
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 0, errno = 22.
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (-----) semLockLog_2: semValue val = -1
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (1132)  Invalid semaphore id
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (10839) SYSTEM ERROR: Unable to set semaphore set : semid = 2, errno = 22.
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (2520)  Stopped.
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (1132)  Invalid semaphore id
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 0, errno = 22.
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (-----) semLockLog_2: semValue val = -1
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (1132)  Invalid semaphore id
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (10839) SYSTEM ERROR: Unable to set semaphore set : semid = 2, errno = 22.
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (453)   Logout by proadmin on batch.
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (1132)  Invalid semaphore id
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 0, errno = 22.
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (-----) semLockLog_2: semValue val = -1
[2016/07/25@06:28:48.303+0200] P-27133      T-140230977447744 I BROKER  0: (-----) enssbrocon: child exited with 2, err=9
[2016/07/25@06:28:48.303+0200] P-27133      T-140230977447744 I BROKER  0: (8837)  4GL server process terminated.
[2016/07/25@06:29:28.457+0200] P-27133      T-140230977447744 I BROKER  0: (-----) enssbrocon: child exited with 2, err=9
[2016/07/25@06:29:28.457+0200] P-27133      T-140230977447744 I BROKER  0: (8837)  4GL server process terminated.
[2016/07/25@06:29:42.516+0200] P-27133      T-140230977447744 I BROKER  0: (-----) enssbrocon: child exited with 2, err=9
[2016/07/25@06:29:42.516+0200] P-27133      T-140230977447744 I BROKER  0: (8837)  4GL server process terminated.

Attached log file errors.

All Replies

Posted by George Potemkin on 25-Jul-2016 01:05

> I don't know if it's related but I noticed in the latest crash that the time on one of the entries is in the wrong timezone

It's just a feature:

knowledgebase.progress.com/.../Database-stalls-on-Linux-because-self-service-OpenEdge-client-deadlocks-if-it-receives-hangup-signal-at-the-wrong-time

A partial fix is included in OpenEdge 11.2.1 and 11.3.x

As a side effect of the fix, messages from signal handlers will always show as UTC time instead of the database's timezone.

Posted by George Potemkin on 25-Jul-2016 01:26

> (10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 0, errno = 22.

Errno 22 is EINVAL (invalid argument). In this case the semid is invalid. I guess the other small values of semid are wrong as well.

Carl, can you create the list of semid's allocated by the broker at startup?

promon / R&D / debghb / 16. Semaphores (but first set 5. Adjust Monitor Options / 1. Enter page length to 9999)

[2016/07/13@22:01:49.085+0200] P-60282      T-140225063741248 I SRV     5: (794)   Usernum 187 terminated abnormally. 
[2016/07/13@22:01:49.085+0200] P-60282      T-140225063741248 I SRV     5: (-----) TCP/IP write error occurred with errno 32
[2016/07/13@22:01:49.092+0200] P-60282      T-140225063741248 I SRV     5: (1132)  Invalid semaphore id 
[2016/07/13@22:01:49.092+0200] P-60282      T-140225063741248 I SRV     5: (10839) SYSTEM ERROR: Unable to set semaphore set : semid = 7, errno = 22. 


I saw these errors (TCP/IP write error occurred with errno 32) when all remote client servers were napping trying to get the MTX latch.
Do you monitor the latch activity? You can do that using ProTop or my dbmon script.

Posted by George Potemkin on 25-Jul-2016 02:06

The list of the errors from your log with unique semid's:

(10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 5, errno = 22. 
(10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = -32763, errno = 22. 
(10839) SYSTEM ERROR: Unable to set semaphore set : semid = -32762, errno = 22. 
(10839) SYSTEM ERROR: Unable to set semaphore set : semid = -32761, errno = 22. 
(10839) SYSTEM ERROR: Unable to set semaphore set : semid = 6, errno = 22.
(10839) SYSTEM ERROR: Unable to set semaphore set : semid = 7, errno = 22.

Note that -32763, -32762 and -32761 are 5, 6 and 7 minus 32768.

(6526)  Number of Semaphore Sets (-semsets): 3 

Your database used 3 semsets and all their semid's became truncated to 16 bits.

Posted by cverbiest on 25-Jul-2016 02:13

Hi George,

thanks for your reply. Not yet monitoring latch activity, will look that up

# ipcs -s

------ Semaphore Arrays --------
key        semid      owner      perms      nsems
0x00000000 294912     proadmin   666        7
0x00000000 327681     proadmin   666        97
0x00000000 360450     proadmin   666        97
0x00000000 393219     proadmin   666        7
0x00000000 425988     proadmin   666        97
0x00000000 458757     proadmin   666        97

For the LISA database:

07/25/16        Semaphores
09:07:11

User         Id        Num
   0     327681          0
   1     360450          0
   2     327681          1
   3     360450          1
   4     327681          2
   5     360450          2
   6     327681          3
   7     360450          3
   8     327681          4
   9     360450          4
  10     327681          5
  11     360450          5
  12     327681          6
  13     360450          6
  14     327681          7
  15     360450          7
  16     327681          8
  17     360450          8
  18     327681          9
  19     360450          9
  20     327681         10
  21     360450         10
  22     327681         11
  23     360450         11
  24     327681         12
  25     360450         12
  26     327681         13
  27     360450         13
  28     327681         14
  29     360450         14
  30     327681         15
  31     360450         15
  32     327681         16
  33     360450         16
  34     327681         17
  35     360450         17
  36     327681         18
  37     360450         18
  38     327681         19
  39     360450         19
  40     327681         20
  41     360450         20
  42     327681         21
  43     360450         21
  44     327681         22
  45     360450         22
  46     327681         23
  47     360450         23
  48     327681         24
  49     360450         24
  50     327681         25
  51     360450         25
  52     327681         26
  53     360450         26
  54     327681         27
  55     360450         27
  56     327681         28
  57     360450         28
  58     327681         29
  59     360450         29
  60     327681         30
  61     360450         30
  62     327681         31
  63     360450         31
  64     327681         32
  65     360450         32
  66     327681         33
  67     360450         33
  68     327681         34
  69     360450         34
  70     327681         35
  71     360450         35
  72     327681         36
  73     360450         36
  74     327681         37
  75     360450         37
  76     327681         38
  77     360450         38
  78     327681         39
  79     360450         39
  80     327681         40
  81     360450         40
  82     327681         41
  83     360450         41
  84     327681         42
  85     360450         42
  86     327681         43
  87     360450         43
  88     327681         44
  89     360450         44
  90     327681         45
  91     360450         45
  92     327681         46
  93     360450         46
  94     327681         47
  95     360450         47
  96     327681         48
  97     360450         48
  98     327681         49
  99     360450         49
 100     327681         50
 101     360450         50
 102     327681         51
 103     360450         51
 104     327681         52
 105     360450         52
 106     327681         53
 107     360450         53
 108     327681         54
 109     360450         54
 110     327681         55
 111     360450         55
 112     327681         56
 113     360450         56
 114     327681         57
 115     360450         57
 116     327681         58
 117     360450         58
 118     327681         59
 119     360450         59
 120     327681         60
 121     360450         60
 122     327681         61
 123     360450         61
 124     327681         62
 125     360450         62
 126     327681         63
 127     360450         63
 128     327681         64
 129     360450         64
 130     327681         65
 131     360450         65
 132     327681         66
 133     360450         66
 134     327681         67
 135     360450         67
 136     327681         68
 137     360450         68
 138     327681         69
 139     360450         69
 140     327681         70
 141     360450         70
 142     327681         71
 143     360450         71
 144     327681         72
 145     360450         72
 146     327681         73
 147     360450         73
 148     327681         74
 149     360450         74
 150     327681         75
 151     360450         75
 152     327681         76
 153     360450         76
 154     327681         77
 155     360450         77
 156     327681         78
 157     360450         78
 158     327681         79
 159     360450         79
 160     327681         80
 161     360450         80
 162     327681         81
 163     360450         81
 164     327681         82
 165     360450         82
 166     327681         83
 167     360450         83
 168     327681         84
 169     360450         84
 170     327681         85
 171     360450         85
 172     327681         86
 173     360450         86
 174     327681         87
 175     360450         87
 176     327681         88
 177     360450         88
 178     327681         89
 179     360450         89
 180     327681         90
 181     360450         90
 182     327681         91
 183     360450         91
 184     327681         92
 185     360450         92
 186     327681         93
 187     360450         93
 188     327681         94
 189     360450         94
 190     327681         95
 191     360450         95
 192     327681         96

For the lisalrap database (we also had crashes when the second db was not running):

07/25/16        Semaphores
09:10:34

User         Id        Num
   0     425988          0
   1     458757          0
   2     425988          1
   3     458757          1
   4     425988          2
   5     458757          2
   6     425988          3
   7     458757          3
   8     425988          4
   9     458757          4
  10     425988          5
  11     458757          5
  12     425988          6
  13     458757          6
  14     425988          7
  15     458757          7
  16     425988          8
  17     458757          8
  18     425988          9
  19     458757          9
  20     425988         10
  21     458757         10
  22     425988         11
  23     458757         11
  24     425988         12
  25     458757         12
  26     425988         13
  27     458757         13
  28     425988         14
  29     458757         14
  30     425988         15
  31     458757         15
  32     425988         16
  33     458757         16
  34     425988         17
  35     458757         17
  36     425988         18
  37     458757         18
  38     425988         19
  39     458757         19
  40     425988         20
  41     458757         20
  42     425988         21
  43     458757         21
  44     425988         22
  45     458757         22
  46     425988         23
  47     458757         23
  48     425988         24
  49     458757         24
  50     425988         25
  51     458757         25
  52     425988         26
  53     458757         26
  54     425988         27
  55     458757         27
  56     425988         28
  57     458757         28
  58     425988         29
  59     458757         29
  60     425988         30
  61     458757         30
  62     425988         31
  63     458757         31
  64     425988         32
  65     458757         32
  66     425988         33
  67     458757         33
  68     425988         34
  69     458757         34
  70     425988         35
  71     458757         35
  72     425988         36
  73     458757         36
  74     425988         37
  75     458757         37
  76     425988         38
  77     458757         38
  78     425988         39
  79     458757         39
  80     425988         40
  81     458757         40
  82     425988         41
  83     458757         41
  84     425988         42
  85     458757         42
  86     425988         43
  87     458757         43
  88     425988         44
  89     458757         44
  90     425988         45
  91     458757         45
  92     425988         46
  93     458757         46
  94     425988         47
  95     458757         47
  96     425988         48
  97     458757         48
  98     425988         49
  99     458757         49
 100     425988         50
 101     458757         50
 102     425988         51
 103     458757         51
 104     425988         52
 105     458757         52
 106     425988         53
 107     458757         53
 108     425988         54
 109     458757         54
 110     425988         55
 111     458757         55
 112     425988         56
 113     458757         56
 114     425988         57
 115     458757         57
 116     425988         58
 117     458757         58
 118     425988         59
 119     458757         59
 120     425988         60
 121     458757         60
 122     425988         61
 123     458757         61
 124     425988         62
 125     458757         62
 126     425988         63
 127     458757         63
 128     425988         64
 129     458757         64
 130     425988         65
 131     458757         65
 132     425988         66
 133     458757         66
 134     425988         67
 135     458757         67
 136     425988         68
 137     458757         68
 138     425988         69
 139     458757         69
 140     425988         70
 141     458757         70
 142     425988         71
 143     458757         71
 144     425988         72
 145     458757         72
 146     425988         73
 147     458757         73
 148     425988         74
 149     458757         74
 150     425988         75
 151     458757         75
 152     425988         76
 153     458757         76
 154     425988         77
 155     458757         77
 156     425988         78
 157     458757         78
 158     425988         79
 159     458757         79
 160     425988         80
 161     458757         80
 162     425988         81
 163     458757         81
 164     425988         82
 165     458757         82
 166     425988         83
 167     458757         83
 168     425988         84
 169     458757         84
 170     425988         85
 171     458757         85
 172     425988         86
 173     458757         86
 174     425988         87
 175     458757         87
 176     425988         88
 177     458757         88
 178     425988         89
 179     458757         89
 180     425988         90
 181     458757         90
 182     425988         91
 183     458757         91
 184     425988         92
 185     458757         92
 186     425988         93
 187     458757         93
 188     425988         94
 189     458757         94
 190     425988         95
 191     458757         95
 192     425988         96

Posted by George Potemkin on 25-Jul-2016 02:38

294912 = 0x48000
327681 = 0x50001
360450 = 0x58002

393219 = 0x60003
425988 = 0x68004
458757 = 0x70005

I guess Remote Client Servers get the semid's from db shared memory. When one of the servers corrupts this information in shared memory, the other ones begin to report the 10839's. The "mad" server "updates" semid's for its clients. If it served more than one client it may corrupt more than one semid. Can you use -Ma 1 as a workaround?

BTW, the first semset (the one with 7 semaphores) is used only during db startup and during logins/logouts of self-service clients. I guess you will not see the last byte of this semid in the 10839's.

Posted by cverbiest on 25-Jul-2016 02:52

I ran dbmon.sh, output uploaded here

https://ccenv-my.sharepoint.com/personal/cvb_cce_be/_layouts/15/guestaccess.aspx?guestaccesstoken=DfptMKQvgq3MsCs2MEV32d6KU6hydxAMvMIwKVVWKOY%3d&docid=0b802bea945d04006b17f18a20cf1f814&rev=1

I found a (seemingly) similar issue reported by SteveSaltDanwood. I asked what the solution was, and this is his reply:

Carl,

We had been in discussions with Tech Support regarding this. The question refers to KB 61278, and our sysadmin has increased SEMMSL, SEMMNS and SEMOPM slightly. The one thing that I did was to add -semsets 6 to the Primary Login Broker .pf file, and our issue disappeared - not sure whether this actually fixed it, but since the change on 10th Feb we have had no further instances.

Regards,

Steve
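
For reference, a minimal sketch of Steve's change, assuming the broker reads its parameters from a .pf file (the path and file name below are illustrative):

echo "-semsets 6" >> /data/cce/applprod/data/LISA.pf
# restart the broker so the new parameter takes effect
proshut /data/cce/applprod/data/LISA -by
proserve /data/cce/applprod/data/LISA -pf /data/cce/applprod/data/LISA.pf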

Posted by George Potemkin on 25-Jul-2016 03:12

Carl, do you always pre-start 60 remote client servers immediately after db startup?

[2016/07/13@23:48:13.123+0200] P-3986       T-140605298083648 I APW    82: (2518)  Started. 
[2016/07/13@23:48:27.788+0200] P-4350       T-140402236524352 I SRV     1: (5646)  Started on port 11002 using TCP IPV4 address 0.0.0.0, pid 4350. 
[2016/07/13@23:49:14.562+0200] P-4843       T-140184589739840 I SRV     2: (5646)  Started on port 11003 using TCP IPV4 address 0.0.0.0, pid 4843. 

[2016/07/14@00:26:16.996+0200] P-17010      T-140595944453952 I ABL    87: (1132)  Invalid semaphore id 
[2016/07/14@00:26:16.996+0200] P-17010      T-140595944453952 I ABL    87: (10839) SYSTEM ERROR: Unable to set semaphore set : semid = 7, errno = 22. 
[2016/07/14@00:26:16.996+0200] P-17010      T-140595944453952 I ABL    87: (2252)  Begin transaction backout. 
[2016/07/14@00:26:16.996+0200] P-17010      T-140595944453952 I ABL    87: (2253)  Transaction backout completed. 
[2016/07/14@00:26:17.002+0200] P-17010      T-140595944453952 I ABL        (1132)  Invalid semaphore id 
[2016/07/14@00:26:17.002+0200] P-17010      T-140595944453952 I ABL        (1132)  Invalid semaphore id 
[2016/07/14@00:26:17.002+0200] P-17010      T-140595944453952 I ABL        (10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 5, errno = 22. 
[2016/07/14@00:26:17.002+0200] P-17010      T-140595944453952 I ABL        (-----) semLockLog_2: semValue val = -1
[2016/07/14@00:26:19.903+0200] P-3966       T-140549975136064 I BROKER  0: (2527)  Disconnecting dead user 87. 



[2016/07/14@00:55:14.379+0200] P-2288       T-139877706889024 I APW    82: (2518)  Started. 
[2016/07/14@00:57:04.277+0200] P-3010       T-140641843181376 I SRV     2: (5646)  Started on port 11003 using TCP IPV4 address 0.0.0.0, pid 3010. 
...
[2016/07/14@01:02:05.506+0200] P-4201       T-140565902128960 I SRV     5: (5646)  Started on port 11006 using TCP IPV4 address 0.0.0.0, pid 4201. 
[2016/07/14@01:25:33.654+0200] P-2626       T-139742344922944 I APPSRV 91: (1132)  Invalid semaphore id 
[2016/07/14@01:25:33.654+0200] P-2626       T-139742344922944 I APPSRV 91: (10839) SYSTEM ERROR: Unable to set semaphore set : semid = -32761, errno = 22. 

It looks like sometimes only a few servers were started successfully, and the semaphore errors were issued by the self-service clients (see the log excerpts above).

Latch activity collected by dbmon is normal. And I guess db activity was next to zero right after db startup at midnight. I suspect that starting many remote client servers at once can be a root cause of the errors. Multiple semsets will decrease the competition for semsets during their launches.

Posted by cverbiest on 25-Jul-2016 04:01

No servers are prestarted, would that be better?

Just had a call with Progress tech support; they suspect something is leaking file handles.

Posted by George Potemkin on 25-Jul-2016 04:23

> No servers are prestarted

Db log shows that /sometimes/ a lot of servers were started but there were no remote users connected.

> Just had a call with Progress tech support, they suspect something is leaking file handles.

I'd run promon in the background to check, every second, the first few lines of the 16. "Semaphores" screen. If they change, then the information in shared memory has got corrupted.
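
A rough sketch of such a background check, assuming promon accepts its menu choices on stdin. The R&D / debghb / 16 navigation is the path given earlier in this thread, and the exit keystroke may differ per version, so verify the keystrokes interactively first:

while true
do
    date >> /tmp/semwatch.log
    # 'x' is assumed to exit promon; adjust to your version
    printf 'R&D\ndebghb\n16\nx\n' | promon /data/cce/applprod/data/LISA >> /tmp/semwatch.log 2>&1
    sleep 1
done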

Posted by cverbiest on 25-Jul-2016 04:30

I trimmed the logfile, deleted all logon/logoff sequences. Maybe I shouldn't have done that.

> I'd run promon in background to ..

I'll try that.

Posted by George Potemkin on 25-Jul-2016 05:10

When you are able to catch the changes of the semid's, the next step would be to generate protrace files for all Progress sessions. And maybe it's not too late to do the same immediately after the first semaphore error in the db log.
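
Sending SIGUSR1 to an OpenEdge process makes it write a protrace.<pid> file in its working directory, so a loop like the following can snapshot all sessions at once (the pgrep patterns are illustrative):

# dump a stack trace (protrace.<pid>) for every client and server process
for pid in $(pgrep _progres) $(pgrep _mprosrv)
do
    kill -USR1 $pid
done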

Posted by cverbiest on 25-Jul-2016 07:10

Latest crash came 25 minutes after db startup following a reboot of the server. Increasing -semsets to 6 did not help.

The promon semaphore output captured 11 seconds before the crash didn't look different from the one right after db start.

ccenv-my.sharepoint.com/.../guestaccess.aspx

Posted by cverbiest on 25-Jul-2016 07:55

I was collecting semaphore output every minute.

Crash frequency increased (2 crashes in the last 2.5 hours), so I will stop collecting this data. It may be unrelated, but I won't take the risk.

Posted by George Potemkin on 25-Jul-2016 08:05

> I was collecting semaphore output every minute

The time interval was too long. One second would have a better chance of catching the corruption of the semid's. But even one second can be too long.

What values did these semid's have for the last multi-user session?

Only 2 of 5 semid's were corrupted this time.

Posted by George Potemkin on 25-Jul-2016 08:13

The error was reported by the most recently started session (a batch process) 23 seconds after its login. Carl, do you know what the session did during this time?

Posted by cverbiest on 25-Jul-2016 08:15

I'll try to find out

Posted by George Potemkin on 25-Jul-2016 08:36

Just a crazy idea, but it helped one of our customers in the past:

Create and start an empty database. Force all self-service clients to connect to the empty db first and only then connect to the LISA db. I hope your application (especially its authorization) does not rely on the order of the connected databases.

Only the self-service clients can corrupt data in shared memory. Semid's are stored at the beginning of shared memory => at the beginning of the client's address space. The shared memory of the empty db will take the risk. Either the errors will disappear or you will get them for the empty database.
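
A minimal sketch of this idea. The "sacrifice" db name, paths and procedure name are illustrative; the order of the -db arguments on the client command line determines which shared memory segment is mapped first:

prodb /data/cce/applprod/data/sacrifice empty    # create an empty database
proserve /data/cce/applprod/data/sacrifice       # start it multi-user
# self-service clients connect to the sacrificial db first, then to LISA
_progres -db /data/cce/applprod/data/sacrifice -db /data/cce/applprod/data/LISA -p start.p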

Posted by cverbiest on 25-Jul-2016 08:37

Hi George,

sorry, but I'm kind of drowning in logfiles; can you give some more info on the process that you refer to?

Below is the most recent protrace file, but that's probably not the process that interests you.

PROGRESS stack trace as of Mon Jul 25 12:46:48 2016

Progress OpenEdge Release 11.6 build 1346 SP02 TF04 on Linux franky-db 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015

Command line arguments are

/usr/dlc/bin/_mprosrv -classpath /usr/oe116/jdk/lib/tools.jar:/usr/oe116/java/progress.jar:/usr/oe116/java/messages.jar:/usr/oe116/java/prosp.jar /data/cce/applprod/data/LISA -L 150000 -n 112 -spin 50000 -bibufs 20 -Mi 1 -Ma 1 -Mn 79 -Mpb 78 -Mm 10240 -lruskips 10 -semsets 6 -S 11000 -B 2000000 -tablerangesize 1000 -indexrangesize 3500 -minport 11001 -maxport 12999 -ServerType 4GL

Startup parameters:

-pf /usr/dlc/startup.pf,-cpinternal UTF-8,-cpstream UTF-8,-cpcoll Basic,-cpcase Basic,-d dmy,-numsep 46,-numdec 44,-Mm 10240,(end .pf),-classpath /usr/oe116/jdk/lib/tools.jar:/usr/oe116/java/progress.jar:/usr/oe116/java/messages.jar:/usr/oe116/java/prosp.jar,-db /data/cce/applprod/data/LISA,-L 150000,-n 112,-spin 50000,-bibufs 20,-Mi 1,-Ma 1,-Mn 79,-Mpb 78,-Mm 10240,-lruskips 10,-semsets 6,-S 11000,-B 2000000,-tablerangesize 1000,-indexrangesize 3500,-minport 11001,-maxport 12999,-ServerType 4GL

#1   [0x5e81bb] uttraceback+0xfb  from /usr/dlc/bin/_mprosrv
#2   [0x5ead4e] uttrace+0x13e  from /usr/dlc/bin/_mprosrv
#3   [0x5eaec4] utcore+0x74  from /usr/dlc/bin/_mprosrv
#4   [0x424952] drexit+0x212  from /usr/dlc/bin/_mprosrv
#5   [0x4d6045] dbExit+0x7c  from /usr/dlc/bin/_mprosrv
#6   [0x4f8d27] dbKillAll+0x32a  from /usr/dlc/bin/_mprosrv
#7   [0x4f91de] dbKillUsers+0x4a8  from /usr/dlc/bin/_mprosrv
#8   [0x47893b] mdend+0x13b  from /usr/dlc/bin/_mprosrv
#9   [0x424911] drexit+0x1d1  from /usr/dlc/bin/_mprosrv
#10  [0x4d6045] dbExit+0x7c  from /usr/dlc/bin/_mprosrv
#11  [0x581e07] semLockLog+0x200  from /usr/dlc/bin/_mprosrv
#12  [0x4f5e2c] dbDelUsrctl+0xd2  from /usr/dlc/bin/_mprosrv
#13  [0x4f7b23] brclrdead+0x44d  from /usr/dlc/bin/_mprosrv
#14  [0x4fa06a] dbWatchdog+0x764  from /usr/dlc/bin/_mprosrv
#15  [0x51b214] dsmWatchdog+0xa7  from /usr/dlc/bin/_mprosrv
#16  [0x5d4355] nsaloop+0x518  from /usr/dlc/bin/_mprosrv
#17  [0x4784b3] doserve+0xbd8  from /usr/dlc/bin/_mprosrv
#18  [0x478771] main+0x103  from /usr/dlc/bin/_mprosrv
#19  [0x49eadb15] __libc_start_main+0xf5  from /lib64/libc.so.6

Posted by cverbiest on 25-Jul-2016 08:40

We have 2 databases, and sometimes this issue appeared first on the second db.

I'll try to replace all db connections with client/server connections to see if that helps.
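
For reference, the switch is only a matter of connection parameters; a sketch using the broker port seen in the protrace above (-S 11000), with an illustrative host and procedure name:

# self-service: maps the database's shared memory into the client process
_progres /data/cce/applprod/data/LISA -p start.p
# client/server: talks to a remote server process over TCP instead
_progres LISA -H localhost -S 11000 -p start.p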

Posted by George Potemkin on 25-Jul-2016 08:42

> can you give some more info on the process that you refer to?

[2016/07/25@12:54:48.591+0200] P-9540       T-140149562840896 I ABL    86: (452)   Login by proadmin on batch. 
[2016/07/25@12:54:48.608+0200] P-9540       T-140149562840896 I ABL    86: (7129)  Usr 86 set name to batchclnt. 
[2016/07/25@12:55:11.829+0200] P-9540       T-140149562840896 I ABL    86: (1132)  Invalid semaphore id 
[2016/07/25@12:55:11.829+0200] P-9540       T-140149562840896 I ABL    86: (10839) SYSTEM ERROR: Unable to set semaphore set : semid = 2, errno = 22. 
[2016/07/25@12:55:11.830+0200] P-9540       T-140149562840896 I ABL    86: (453)   Logout by batchclnt on batch. 
[2016/07/25@12:55:11.830+0200] P-9540       T-140149562840896 I ABL        (1132)  Invalid semaphore id 
[2016/07/25@12:55:11.830+0200] P-9540       T-140149562840896 I ABL        (10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 0, errno = 22. 
[2016/07/25@12:55:11.830+0200] P-9540       T-140149562840896 I ABL        (-----) semLockLog_2: semValue val = -1
[2016/07/25@12:55:14.327+0200] P-4295       T-140025558579008 I BROKER  0: (2527)  Disconnecting dead user 86. 

> The promon semaphore output captured 11 seconds before the crash didn't look different from the one right after db start

Can you post the first 5 lines of the promon semaphore output?

Posted by George Potemkin on 25-Jul-2016 08:45

> We have 2 databases and sometimes this issues appeared first on the second db.

Only one session was connected to the second database at 14:29:29

Posted by cverbiest on 25-Jul-2016 09:01

I just noticed our batch clients connect with rather low (old) values for -s, -mmax and -D. Could this cause these kinds of issues? I'm going to increase the values anyway.

-d     dmy
-yy    1960
-E
-h     7
-D     80
-s     63
-B     1000
-Mr    8192
-TB    31
-TM    32
-mmax  5000
-Bt    200
#-cpinternal ISO8859-1
-cpinternal utf-8
-cpstream   IBM850
-rereadnolock
-p ccetools/start-b
-T /tmp

Posted by George Potemkin on 25-Jul-2016 09:02

Carl, when your database doesn't have the semaphore errors, it has TCP errors:

[2016/07/25@07:25:12.467+0200] P-37327      T-139985194260288 I SRV    16: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@07:42:00.265+0200] P-38269      T-140325921544000 I SRV    19: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@07:56:39.636+0200] P-37336      T-139744649774912 I SRV    18: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@08:22:32.523+0200] P-46233      T-139696198965056 I SRV    34: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@08:33:28.461+0200] P-46233      T-139696198965056 I SRV    34: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@08:53:58.327+0200] P-51006      T-139629681858368 I SRV    38: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@08:56:58.560+0200] P-52605      T-140382806456128 I SRV    40: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@08:58:07.372+0200] P-53007      T-139680708523840 I SRV    41: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@09:19:55.615+0200] P-52605      T-140382806456128 I SRV    40: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@09:21:45.428+0200] P-40014      T-140399388444480 I SRV    21: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@09:25:09.992+0200] P-56699      T-140641693988672 I SRV    43: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@09:47:20.925+0200] P-33969      T-140121481381696 I SRV     2: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@09:48:22.646+0200] P-52605      T-140382806456128 I SRV    40: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@10:33:19.665+0200] P-40014      T-140399388444480 I SRV    21: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@10:34:55.088+0200] P-53007      T-139680708523840 I SRV    41: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@10:48:41.338+0200] P-33974      T-139697772349248 I SRV     4: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@10:50:27.258+0200] P-52605      T-140382806456128 I SRV    40: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@11:12:19.914+0200] P-10713      T-140585542993728 I SRV    46: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@11:14:42.414+0200] P-10713      T-140585542993728 I SRV    46: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@11:44:40.016+0200] P-51424      T-139858028005184 I SRV    39: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@11:46:17.233+0200] P-11690      T-139931360241472 I SRV    48: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@11:55:47.234+0200] P-33974      T-139697772349248 I SRV     4: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@11:59:46.517+0200] P-10225      T-139944178841408 I SRV    45: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@12:01:44.940+0200] P-51424      T-139858028005184 I SRV    39: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@13:41:28.143+0200] P-4635       T-140640063956800 I SRV    29: (-----) TCP/IP write error occurred with errno 32

Most likely their root cause is hardware.

Posted by George Potemkin on 25-Jul-2016 09:06

> could this cause these kinds of issues?

Everything is possible. The good method is to keep changing things (of course, one change at a time).

Posted by cverbiest on 25-Jul-2016 09:09

> The good method is to keep changing things

I know, but it's not easy to follow that rule right now.

Thank you for all your advice and interest.

Posted by cverbiest on 25-Jul-2016 09:43

Replying to the 14:13 question about what the process was doing: I found the following 4GL trace; unfortunately the trace level is too low to be really useful.

The second log file shows lsof output that is produced by the batch job prior to the quit statement.

code fragment :

output close. /* batcherror logfile */
if opsys = "unix"
then os-command silent value(subst("sudo /usr/sbin/lsof -p &1 >> &2", WZ-PID, WZ-LOGFILE)).
run WriteMemo.

quit.

First log file:
# cat /tmp/batch_9539.log
[16/07/25@12:54:48.609+0200] P-009540 T-485005120 1 4GL -- Logging level set to = 1
[16/07/25@12:55:11.829+0200] P-009540 T-485005120 1 4GL -- Invalid semaphore id (1132)
[16/07/25@12:55:11.829+0200] P-009540 T-485005120 1 4GL -- SYSTEM ERROR: Unable to set semaphore set : semid = 2, errno = 22. (10839)
[16/07/25@12:55:11.830+0200] P-009540 T-485005120 1 4GL -- Invalid semaphore id (1132)
[16/07/25@12:55:11.830+0200] P-009540 T-485005120 1 4GL -- SYSTEM ERROR: Unable to get value of semaphore set : semid = 0, errno = 22. (10839)
[16/07/25@12:55:11.830+0200] P-009540 T-485005120 1 4GL -- semLockLog_2: semValue val = -1

# cat be9540.log
print.sh -f acrobat blancke01 1 /tmp/Kat46489_9540.pdf 2>&1
Acrobat exitcode: 0
print.sh -f acrobat blancke01 1 /tmp/Kat46493_9540.pdf 2>&1
Acrobat exitcode: 0
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
_progres 9540 proadmin cwd DIR 8,2 12288 2228226 /usr2/cce/applprod
_progres 9540 proadmin rtd DIR 8,1 4096 2 /
_progres 9540 proadmin txt REG 8,1 12589577 400553 /usr/oe116/bin/_progres
_progres 9540 proadmin mem REG 8,1 110808 394029 /usr/lib64/libresolv-2.17.so
_progres 9540 proadmin mem REG 8,1 27512 394017 /usr/lib64/libnss_dns-2.17.so
_progres 9540 proadmin mem REG 8,1 156770 1054635 /usr/oe116/lib/libproxml.so
_progres 9540 proadmin mem REG 8,1 113328 394011 /usr/lib64/libnsl-2.17.so
_progres 9540 proadmin mem REG 8,1 5102967 1054636 /usr/oe116/lib/libxerces_psc.so
_progres 9540 proadmin mem REG 8,1 2546137 1054640 /usr/oe116/lib/libicui18n48_psc.so
_progres 9540 proadmin mem REG 8,1 1751412 1054642 /usr/oe116/lib/libicuuc48_psc.so
_progres 9540 proadmin mem REG 8,1 18187923 1054638 /usr/oe116/lib/libicudata48_psc.so
_progres 9540 proadmin DEL REG 0,4 98305 /SYSV00000000
_progres 9540 proadmin mem REG 8,1 37441 1054490 /usr/oe116/lib/libenhlog.so
_progres 9540 proadmin mem REG 8,1 61928 394019 /usr/lib64/libnss_files-2.17.so
_progres 9540 proadmin mem REG 8,1 11376 393960 /usr/lib64/libfreebl3.so
_progres 9540 proadmin mem REG 8,1 2112384 394001 /usr/lib64/libc-2.17.so
_progres 9540 proadmin mem REG 8,1 88720 393235 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
_progres 9540 proadmin mem REG 8,1 995840 394050 /usr/lib64/libstdc++.so.6.0.19
_progres 9540 proadmin mem REG 8,1 1141560 394009 /usr/lib64/libm-2.17.so
_progres 9540 proadmin mem REG 8,1 40816 394005 /usr/lib64/libcrypt-2.17.so
_progres 9540 proadmin mem REG 8,1 44096 394031 /usr/lib64/librt-2.17.so
_progres 9540 proadmin mem REG 8,1 19520 394007 /usr/lib64/libdl-2.17.so
_progres 9540 proadmin mem REG 8,1 142304 394027 /usr/lib64/libpthread-2.17.so
_progres 9540 proadmin mem REG 8,1 164440 402388 /usr/lib64/ld-2.17.so
_progres 9540 proadmin 0r CHR 1,3 0t0 1028 /dev/null
_progres 9540 proadmin 1w REG 8,6 14110 25 /tmp/batcherr32
_progres 9540 proadmin 2w REG 8,6 14110 25 /tmp/batcherr32
_progres 9540 proadmin 3r REG 8,1 1749843 400498 /usr/oe116/promsgs
_progres 9540 proadmin 4r REG 8,1 944417 400495 /usr/oe116/convmap.cp
_progres 9540 proadmin 5r REG 8,1 944417 400495 /usr/oe116/convmap.cp
_progres 9540 proadmin 6u REG 8,6 74955 7235 /tmp/lbiN5u92T (deleted)
_progres 9540 proadmin 7w REG 8,7 3082181 13528468 /data/cce/applprod/data/LISA.lg
_progres 9540 proadmin 8u REG 8,7 655360 13528466 /data/cce/applprod/data/LISA.db
_progres 9540 proadmin 9u REG 8,7 655360 13528466 /data/cce/applprod/data/LISA.db
_progres 9540 proadmin 10u REG 8,7 26869760 13500420 /data/cce/applprod/data/LISA.b1
_progres 9540 proadmin 11u REG 8,7 15204352 13500421 /data/cce/applprod/data/LISA.d1
_progres 9540 proadmin 12u REG 8,7 15204352 13500421 /data/cce/applprod/data/LISA.d1
_progres 9540 proadmin 13u REG 8,7 524288000 13500422 /data/cce/applprod/data/LISA_10.d1
_progres 9540 proadmin 14u REG 8,7 524288000 13500423 /data/cce/applprod/data/LISA_10.d2
_progres 9540 proadmin 15u REG 8,7 524288000 13500424 /data/cce/applprod/data/LISA_10.d3
_progres 9540 proadmin 16u REG 8,7 131072 13500425 /data/cce/applprod/data/LISA_10.d4
_progres 9540 proadmin 17u REG 8,7 131072 13500425 /data/cce/applprod/data/LISA_10.d4
_progres 9540 proadmin 18u REG 8,7 262144000 13500426 /data/cce/applprod/data/LISA_11.d1
_progres 9540 proadmin 19u REG 8,7 262144000 13500427 /data/cce/applprod/data/LISA_11.d2
_progres 9540 proadmin 20u REG 8,7 262144000 13500428 /data/cce/applprod/data/LISA_11.d3
_progres 9540 proadmin 21u REG 8,7 131072 13500429 /data/cce/applprod/data/LISA_11.d4
_progres 9540 proadmin 22u REG 8,7 131072 13500429 /data/cce/applprod/data/LISA_11.d4
_progres 9540 proadmin 23u REG 8,7 131072 13500430 /data/cce/applprod/data/LISA_12.d1
_progres 9540 proadmin 24u REG 8,7 131072 13500430 /data/cce/applprod/data/LISA_12.d1
_progres 9540 proadmin 25u REG 8,7 2147352576 13500431 /data/cce/applprod/data/LISA_20.d1
_progres 9540 proadmin 26u REG 8,7 2147352576 13500432 /data/cce/applprod/data/LISA_20.d2
_progres 9540 proadmin 27u REG 8,7 2147352576 13500433 /data/cce/applprod/data/LISA_20.d3
_progres 9540 proadmin 28u REG 8,7 2147352576 13500434 /data/cce/applprod/data/LISA_20.d4
_progres 9540 proadmin 29u REG 8,7 2147352576 13500435 /data/cce/applprod/data/LISA_20.d5
_progres 9540 proadmin 30u REG 8,7 2147352576 13500436 /data/cce/applprod/data/LISA_20.d6
_progres 9540 proadmin 31u REG 8,7 2147352576 13500437 /data/cce/applprod/data/LISA_20.d7
_progres 9540 proadmin 32u REG 8,7 2147352576 13500438 /data/cce/applprod/data/LISA_20.d8
_progres 9540 proadmin 33u REG 8,7 2147352576 13500439 /data/cce/applprod/data/LISA_20.d9
_progres 9540 proadmin 34u REG 8,7 2147352576 13500440 /data/cce/applprod/data/LISA_20.d10
_progres 9540 proadmin 35u REG 8,7 2147352576 13500441 /data/cce/applprod/data/LISA_20.d11
_progres 9540 proadmin 36u REG 8,7 2147352576 13500442 /data/cce/applprod/data/LISA_20.d12
_progres 9540 proadmin 37u REG 8,7 2147352576 13500443 /data/cce/applprod/data/LISA_20.d13
_progres 9540 proadmin 38u REG 8,7 2147352576 13500444 /data/cce/applprod/data/LISA_20.d14
_progres 9540 proadmin 39u REG 8,7 2147352576 13500445 /data/cce/applprod/data/LISA_20.d15
_progres 9540 proadmin 40u REG 8,7 2147352576 13500446 /data/cce/applprod/data/LISA_20.d16
_progres 9540 proadmin 41u REG 8,7 2147352576 13500447 /data/cce/applprod/data/LISA_20.d17
_progres 9540 proadmin 42u REG 8,7 2147352576 13500448 /data/cce/applprod/data/LISA_20.d18
_progres 9540 proadmin 43u REG 8,7 2147352576 13500449 /data/cce/applprod/data/LISA_20.d19
_progres 9540 proadmin 44u REG 8,7 2147352576 13500450 /data/cce/applprod/data/LISA_20.d20
_progres 9540 proadmin 45u REG 8,7 2147352576 13500451 /data/cce/applprod/data/LISA_20.d21
_progres 9540 proadmin 46u REG 8,7 2147352576 13500452 /data/cce/applprod/data/LISA_20.d22
_progres 9540 proadmin 47u REG 8,7 524288 13500453 /data/cce/applprod/data/LISA_20.d23
_progres 9540 proadmin 48u REG 8,7 524288 13500453 /data/cce/applprod/data/LISA_20.d23
_progres 9540 proadmin 49u REG 8,7 2147352576 13500454 /data/cce/applprod/data/LISA_21.d1
_progres 9540 proadmin 50u REG 8,7 2147352576 13500455 /data/cce/applprod/data/LISA_21.d2
_progres 9540 proadmin 51u REG 8,7 2147352576 13500456 /data/cce/applprod/data/LISA_21.d3
_progres 9540 proadmin 52u REG 8,7 2147352576 13500457 /data/cce/applprod/data/LISA_21.d4
_progres 9540 proadmin 53u REG 8,7 2147352576 13500458 /data/cce/applprod/data/LISA_21.d5
_progres 9540 proadmin 54u REG 8,7 2147352576 13500459 /data/cce/applprod/data/LISA_21.d6
_progres 9540 proadmin 55u REG 8,7 2147352576 13500460 /data/cce/applprod/data/LISA_21.d7
_progres 9540 proadmin 56u REG 8,7 2147352576 13500461 /data/cce/applprod/data/LISA_21.d8
_progres 9540 proadmin 57u REG 8,7 2147352576 13500462 /data/cce/applprod/data/LISA_21.d9
_progres 9540 proadmin 58u REG 8,7 2147352576 13500463 /data/cce/applprod/data/LISA_21.d10
_progres 9540 proadmin 59u REG 8,7 2147352576 13500464 /data/cce/applprod/data/LISA_21.d11
_progres 9540 proadmin 60u REG 8,7 2147352576 13500465 /data/cce/applprod/data/LISA_21.d12
_progres 9540 proadmin 61u REG 8,7 524288 13500466 /data/cce/applprod/data/LISA_21.d13
_progres 9540 proadmin 62u REG 8,7 524288 13500466 /data/cce/applprod/data/LISA_21.d13
_progres 9540 proadmin 63u REG 8,7 1572864 13500467 /data/cce/applprod/data/LISA_22.d1
_progres 9540 proadmin 64u REG 8,7 1572864 13500467 /data/cce/applprod/data/LISA_22.d1
_progres 9540 proadmin 65u REG 8,7 131072000 13500468 /data/cce/applprod/data/LISA_30.d1
_progres 9540 proadmin 66u REG 8,7 131072000 13500469 /data/cce/applprod/data/LISA_30.d2
_progres 9540 proadmin 67u REG 8,7 131072000 13500470 /data/cce/applprod/data/LISA_30.d3
_progres 9540 proadmin 68u REG 8,7 524288000 13500471 /data/cce/applprod/data/LISA_30.d4
_progres 9540 proadmin 69u REG 8,7 524288000 13500472 /data/cce/applprod/data/LISA_30.d5
_progres 9540 proadmin 70u REG 8,7 524288000 13500473 /data/cce/applprod/data/LISA_30.d6
_progres 9540 proadmin 71u REG 8,7 524288 13500474 /data/cce/applprod/data/LISA_30.d7
_progres 9540 proadmin 72u REG 8,7 524288 13500474 /data/cce/applprod/data/LISA_30.d7
_progres 9540 proadmin 73u REG 8,7 131072000 13500475 /data/cce/applprod/data/LISA_31.d1
_progres 9540 proadmin 74u REG 8,7 131072000 13500476 /data/cce/applprod/data/LISA_31.d2
_progres 9540 proadmin 75u REG 8,7 131072000 13500477 /data/cce/applprod/data/LISA_31.d3
_progres 9540 proadmin 76u REG 8,7 131072000 13500478 /data/cce/applprod/data/LISA_31.d4
_progres 9540 proadmin 77u REG 8,7 524288 13500479 /data/cce/applprod/data/LISA_31.d5
_progres 9540 proadmin 78u REG 8,7 524288 13500479 /data/cce/applprod/data/LISA_31.d5
_progres 9540 proadmin 79u REG 8,7 524288 13500480 /data/cce/applprod/data/LISA_32.d1
_progres 9540 proadmin 80u REG 8,7 524288 13500480 /data/cce/applprod/data/LISA_32.d1
_progres 9540 proadmin 81ur REG 8,6 84 7248 /tmp/batch_9539.log
_progres 9540 proadmin 82u REG 8,6 0 7250 /tmp/srtVk4EVZ (deleted)
_progres 9540 proadmin 83u REG 8,6 46351608 7251 /tmp/rcdhNobO5 (deleted)
_progres 9540 proadmin 84u REG 8,6 819200 7256 /tmp/DBI485005120kkz8Gb (deleted)
_progres 9540 proadmin 88r REG 8,1 4949823 1054570 /usr/oe116/tty/adecomm.pl
_progres 9540 proadmin 89r REG 8,1 26771 1054866 /usr/oe116/tty/OpenEdge.ServerAdmin.pl
_progres 9540 proadmin 90r REG 8,1 586073 1054581 /usr/oe116/tty/adeedit.pl
_progres 9540 proadmin 91r REG 8,1 20580737 1054623 /usr/oe116/tty/prodict.pl
_progres 9540 proadmin 92r REG 8,1 56999 1054864 /usr/oe116/tty/OpenEdge.BusinessLogic.pl
_progres 9540 proadmin 93r REG 8,1 1499066 1054865 /usr/oe116/tty/OpenEdge.Core.pl
_progres 9540 proadmin 94r REG 8,1 2631412 1054577 /usr/oe116/tty/adeshar.pl
_progres 9540 proadmin 95r REG 8,1 240309 1054584 /usr/oe116/tty/adecomp.pl
_progres 9540 proadmin 96r REG 8,1 14002200 1054621 /usr/oe116/tty/dataadmin.pl

Posted by George Potemkin on 25-Jul-2016 12:17

Carl, can you save 'ipcs -s' output after each db startup? The footprint that the corruption leaves in shared memory is the most painless way to find the root cause. The semaphore errors show what the processes found in shared memory, but we don't know what was stored in the same bytes after db startup.
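
One way to keep that footprint is to log a timestamped 'ipcs -s' right after each database start in the startup script (the log path below is illustrative):

sudo -u proadmin proserve LISA
{ date; ipcs -s; } >> /var/log/lisa-ipcs.log   # record the semid's right after startup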

> Just had a call with Progress tech support, they suspect something is leaking file handles

The size of the corrupted spot in shared memory and the byte alignment of the corruption's borders would confirm or deny the assumption.

> server changed, from real redhat 5 to HyperV hosted virtual centos 7

I suspect this is the root cause.

Posted by Richard Shulman on 25-Jul-2016 15:48

Hello cverbiest,

You may want to take a look at this article for tuning semaphores on Linux:

knowledgebase.progress.com/.../Database-stops-abnormally-after-showing-the-error-10839

The errors you have listed suggest an insufficient configuration for the system and the users of the system.

I agree that to find what the values should be, you have to use ipcs -s to find out how many semaphores are commonly being used.

Then you can adjust the semaphores accordingly.
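
A sketch of that check: the awk program matches the data rows of 'ipcs -s' (whose fifth column is nsems), and the totals can be compared against the kernel.sem limits:

ipcs -s | awk '/^0x/ { sets++; sems += $5 } END { print sets, "sets,", sems, "semaphores in use" }'
cat /proc/sys/kernel/sem   # limits: SEMMSL SEMMNS SEMOPM SEMMNI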

Thank you

Posted by George Potemkin on 25-Jul-2016 21:40

> The errors you have listed suggest an insufficient configuration for the system and the users of the system.

I saw this article but I can't understand it. "Insufficient semaphore identifiers"? For the -semsets 3 (default) and/or 6 (the recent value)? On a system running only 2 databases? But the broker successfully creates the semaphore identifiers at database startup. What processes failed to create more identifiers? The ones that reported the semaphore errors? But the client processes use only one semaphore in the semaphore identifiers created by the broker, and they get it at connection time:


What are you waiting for? A reason for waiting around.
Richard Banville, June 7, 2011
pugchallenge.org/.../Waiting_AmericaPUG.pptx
Semaphore usage
Resource wait semaphore – one per connection

Posted by cverbiest on 26-Jul-2016 01:27

I'll add ipcs -s to the startup script.

Yesterday we stopped using shared-memory connections and it got us through the night without crashes

Posted by George Potemkin on 26-Jul-2016 02:36

Did the tcp errors disappear as well?

Posted by cverbiest on 26-Jul-2016 02:47

I haven't seen any TCP/IP errors in the db log, but I do see a lot of dropped packets:

ifconfig

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
       inet x.x.x.x  netmask 255.255.0.0  broadcast 10.0.255.255
       inet6 yyyyyyyyyyyyyyyyyy  prefixlen 64  scopeid 0x20<link>
       ether 00:15:5d:00:01:12  txqueuelen 1000  (Ethernet)
       RX packets 72734765  bytes 25463358707 (23.7 GiB)
       RX errors 0  dropped 1023562  overruns 0  frame 0
       TX packets 61839003  bytes 83457857300 (77.7 GiB)
       TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
       inet 127.0.0.1  netmask 255.0.0.0
       inet6 ::1  prefixlen 128  scopeid 0x10<host>
       loop  txqueuelen 0  (Local Loopback)
       RX packets 483858734  bytes 164988488628 (153.6 GiB)
       RX errors 0  dropped 0  overruns 0  frame 0
       TX packets 483858734  bytes 164988488628 (153.6 GiB)
       TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Posted by George Potemkin on 26-Jul-2016 03:00

TCP/IP write error occurred with errno 32

> I haven't seen any TCP/IP errors in the db log

Do you mean that you don't see them in the current log?

Yesterday these errors were reported in the log once per 5-10 minutes throughout the whole day.

Posted by cverbiest on 26-Jul-2016 03:29

I missed a few during my last search. There were none between 17:00 and 06:50; there have been a few errors since then, but the db stays up & running.

[2016/07/25@07:25:12.467+0200] P-37327      T-139985194260288 I SRV    16: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@07:42:00.265+0200] P-38269      T-140325921544000 I SRV    19: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@07:56:39.636+0200] P-37336      T-139744649774912 I SRV    18: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@08:22:32.523+0200] P-46233      T-139696198965056 I SRV    34: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@08:33:28.461+0200] P-46233      T-139696198965056 I SRV    34: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@08:53:58.327+0200] P-51006      T-139629681858368 I SRV    38: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@08:56:58.560+0200] P-52605      T-140382806456128 I SRV    40: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@08:58:07.372+0200] P-53007      T-139680708523840 I SRV    41: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@09:19:55.615+0200] P-52605      T-140382806456128 I SRV    40: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@09:21:45.428+0200] P-40014      T-140399388444480 I SRV    21: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@09:25:09.992+0200] P-56699      T-140641693988672 I SRV    43: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@09:47:20.925+0200] P-33969      T-140121481381696 I SRV     2: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@09:48:22.646+0200] P-52605      T-140382806456128 I SRV    40: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@10:33:19.665+0200] P-40014      T-140399388444480 I SRV    21: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@10:34:55.088+0200] P-53007      T-139680708523840 I SRV    41: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@10:48:41.338+0200] P-33974      T-139697772349248 I SRV     4: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@10:50:27.258+0200] P-52605      T-140382806456128 I SRV    40: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@11:12:19.914+0200] P-10713      T-140585542993728 I SRV    46: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@11:14:42.414+0200] P-10713      T-140585542993728 I SRV    46: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@11:44:40.016+0200] P-51424      T-139858028005184 I SRV    39: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@11:46:17.233+0200] P-11690      T-139931360241472 I SRV    48: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@11:55:47.234+0200] P-33974      T-139697772349248 I SRV     4: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@11:59:46.517+0200] P-10225      T-139944178841408 I SRV    45: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@12:01:44.940+0200] P-51424      T-139858028005184 I SRV    39: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@13:41:28.143+0200] P-4635       T-140640063956800 I SRV    29: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@13:50:06.256+0200] P-3171       T-140347927385920 I SRV    12: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@13:56:17.786+0200] P-3171       T-140347927385920 I SRV    12: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@13:59:28.007+0200] P-4133       T-139904439777088 I SRV    25: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@14:06:39.272+0200] P-4133       T-139904439777088 I SRV    25: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@14:16:58.821+0200] P-5630       T-140431553333056 I SRV    37: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@14:24:17.698+0200] P-5099       T-139919449716544 I SRV    33: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@16:58:01.036+0200] P-19422      T-140208870037312 I SRV    17: (-----) TCP/IP write error occurred with errno 32
[2016/07/26@06:53:27.189+0200] P-23437      T-140599803811648 I SRV    29: (-----) TCP/IP write error occurred with errno 32
[2016/07/26@07:39:35.723+0200] P-27754      T-140670744590144 I SRV    36: (-----) TCP/IP write error occurred with errno 32
[2016/07/26@08:49:59.790+0200] P-60434      T-139664345315136 I SRV    48: (-----) TCP/IP write error occurred with errno 32
[2016/07/26@09:08:27.365+0200] P-20130      T-139837751461696 I SRV    20: (-----) TCP/IP write error occurred with errno 32
[2016/07/26@09:14:49.987+0200] P-20130      T-139837751461696 I SRV    20: (-----) TCP/IP write error occurred with errno 32
[2016/07/26@09:46:00.119+0200] P-57586      T-140041057462080 I SRV    46: (-----) TCP/IP write error occurred with errno 32

Posted by George Potemkin on 26-Jul-2016 04:25

If I understand correctly, only the batch processes (20% of all sessions) were connected to the db locally. Today all of them are using a remote connection.

A TCP/IP write error results in error 794: Usernum <n> terminated abnormally. Is there anything common to all the sessions that died with a TCP error? The same username? The same remote host? I guess the batch processes are now connected using localhost. Will they get the 794's today?

I don't have a "theory", but I suspect that the TCP errors and the semaphore errors have the same source.

Posted by Ruanne Cluer on 26-Jul-2016 04:55

Just to be clear, not only were connections changed to client/server (-Ma 1), but this Linux guest is now on its OWN host (as opposed to shared with other guests), where the Hyper-V node is now running only the CentOS Linux. All other VMs have been moved to the second node.

I believe that this is the significant part of the change that has somehow stabilised this awful semaphore-down situation.

Posted by cverbiest on 01-Aug-2016 03:24

Found the trigger for the issue and a solution.

The database is started using a script that basically does:

sudo -u proadmin proserve LISA
sudo -u proadmin probiw LISA
sudo -u proadmin proapw LISA
sudo -u proadmin proapw LISA

If someone logs on to the system as user proadmin using ssh and executes

proshut LISA -C list

the semaphore errors occur when that user logs off. As long as the proadmin user does not log off, the db stays up & running.

On our test server the issue went away with a yum update.
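
This trigger (System V semaphores disappearing when the proadmin user's last ssh session ends) matches the default RemoveIPC=yes behaviour of systemd-logind on CentOS 7, which removes a user's IPC objects when that user's last session ends. If that is indeed the cause here, the usual mitigation is:

# in /etc/systemd/logind.conf
RemoveIPC=no

# then apply the change
systemctl restart systemd-logind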

Posted by George Potemkin on 01-Aug-2016 03:38

That is why the log files of the incidents should not be trimmed. ;-)

This thread is closed