Database continues to crash with semaphore errors, how to in

Posted by cverbiest on 25-Jul-2016 00:53

Hi,

were running low on ideas. Since jul 13 our databases are crashing with semaphore errors.

History

databases migrated from oe10.2b using conv1011
server changed, from real redhat 5 to HyperV hosted virtual centos 7
OpenEdge from 10.2B to oe 11.6
first crash on jul 13 after changing -Mm 1024 -> 10240
frequent crashes of broker, server, always showing semaphore errors

things we tried

change kernel.sem
change max open file limits
stop one of the databases
stop apache httpd
truncate bi
dump & load db
move hyper-v vhd from CluSter Shared Volume to local disks
stopped clock sync with hyperv host

I don't know if it's related but I noticed in the latest crash that the time on one of the entries is in the wrong timezone

[2016/07/25@06:23:11.613+0200] P-32381      T-140189001459520 I ABL    86: (453)   Logout by batchclnt on batch.
[2016/07/25@06:25:04.592+0200] P-28093      T-140142107518784 I SHUT   88: (453)   Logout by proadmin on /dev/pts/0.

                Mon Jul 25 04:25:04 2016
[2016/07/25@04:25:04.000+0000] P-28093      T-140142107518784 I SHUT       (562)   HANGUP signal received.

                Mon Jul 25 06:28:41 2016
[2016/07/25@06:28:41.616+0200] P-31604      T-140273689978688 I SRV     5: (1132) Invalid semaphore id
[2016/07/25@06:28:41.616+0200] P-31604      T-140273689978688 I SRV     5: (10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 0, errno = 22.
[2016/07/25@06:28:41.616+0200] P-31604      T-140273689978688 I SRV     5: (-----) semLockLog_2: semValue val = -1
[2016/07/25@06:28:41.616+0200] P-31604      T-140273689978688 I SRV     5: (1132) Invalid semaphore id
[2016/07/25@06:28:41.616+0200] P-31604      T-140273689978688 I SRV     5: (10839) SYSTEM ERROR: Unable to set semaphore set : semid = 2, errno = 22.
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (2520) Stopped.
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (1132) Invalid semaphore id
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 0, errno = 22.
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (-----) semLockLog_2: semValue val = -1
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (1132) Invalid semaphore id
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (10839) SYSTEM ERROR: Unable to set semaphore set : semid = 2, errno = 22.
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (2520) Stopped.
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (1132) Invalid semaphore id
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 0, errno = 22.
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (-----) semLockLog_2: semValue val = -1
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (1132) Invalid semaphore id
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (10839) SYSTEM ERROR: Unable to set semaphore set : semid = 2, errno = 22.
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (453)   Logout by proadmin on batch.
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (1132) Invalid semaphore id
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 0, errno = 22.
[2016/07/25@06:28:41.623+0200] P-31604      T-140273689978688 I SRV     5: (-----) semLockLog_2: semValue val = -1
[2016/07/25@06:28:48.303+0200] P-27133      T-140230977447744 I BROKER 0: (-----) enssbrocon: child exited with 2, err=9
[2016/07/25@06:28:48.303+0200] P-27133      T-140230977447744 I BROKER 0: (8837) 4GL server process terminated.
[2016/07/25@06:29:28.457+0200] P-27133      T-140230977447744 I BROKER 0: (-----) enssbrocon: child exited with 2, err=9
[2016/07/25@06:29:28.457+0200] P-27133      T-140230977447744 I BROKER 0: (8837) 4GL server process terminated.
[2016/07/25@06:29:42.516+0200] P-27133      T-140230977447744 I BROKER 0: (-----) enssbrocon: child exited with 2, err=9
[2016/07/25@06:29:42.516+0200] P-27133      T-140230977447744 I BROKER 0: (8837) 4GL server process terminated.

attacheg log file errors

OpenEdge RDBMS - Forum

Posted by cverbiest on 01-Aug-2016 03:24

Found the trigger for the issue and a solution.

The database is started using a script that basically does

sudo -u proadmin proserve LISA
sudo -u proadmin probiw LISA
sudo -u proadmin proapw LISA
sudo -u proadmin proapw LISA

if someone logs on to the system with user proadmin using ssh, and executes

proshut LISA -C list

the semaphore errors occur when that user logs off. As long as the proadmin user does not log off the db stays up & running.

On our test server the issue went away with a yum update

All Replies

Posted by cverbiest on 25-Jul-2016 01:03

posted trimmed logs here

https://ccenv-my.sharepoint.com/personal/cvb_cce_be/_layouts/15/guestaccess.aspx?guestaccesstoken=FEt3KSY0aDGyDDObljDV9pIax531FM2sJq3OwCJLByQ%3d&docid=078d8fc568fd04005a1f4097a3d6f7049&rev=1

Posted by George Potemkin on 25-Jul-2016 01:05

> I don't know if it's related but I noticed in the latest crash that the time on one of the entries is in the wrong timezone

It's just a feature:

knowledgebase.progress.com/.../Database-stalls-on-Linux-because-self-service-OpenEdge-client-deadlocks-if-it-receives-hangup-signal-at-the-wrong-time

A partial fix is included in OpenEdge 11.2.1 and 11.3.x

As a side effect of the fix, messages from signal handlers will always show as UTC time instead of the database's timezone.

Posted by George Potemkin on 25-Jul-2016 01:26

> (10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 0, errno = 22.

Errno 22 is an invalid argument. In this case the semid is invalid. I guess the other small values of semid are wrong as well.

Carl, can you create the list of semid's allocated by broker at startup?

promon/R&D/debghb/16. Semaphores (but first 5. Adjust Monitor Options / 1. Enter page length: 9999

[2016/07/13@22:01:49.085+0200] P-60282      T-140225063741248 I SRV     5: (794)   Usernum 187 terminated abnormally. 
[2016/07/13@22:01:49.085+0200] P-60282      T-140225063741248 I SRV     5: (-----) TCP/IP write error occurred with errno 32
[2016/07/13@22:01:49.092+0200] P-60282      T-140225063741248 I SRV     5: (1132)  Invalid semaphore id 
[2016/07/13@22:01:49.092+0200] P-60282      T-140225063741248 I SRV     5: (10839) SYSTEM ERROR: Unable to set semaphore set : semid = 7, errno = 22.

I saw these errors (TCP/IP write error occurred with errno 32) when all remote clients servers were napped trying to get MTX latch.
Do you monitor the latch activity? You can do that using ProTop or my dbmon script.

Posted by George Potemkin on 25-Jul-2016 02:06

The list of the errors from your log with the unique semid:

(10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 5, errno = 22. 
(10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = -32763, errno = 22. 
(10839) SYSTEM ERROR: Unable to set semaphore set : semid = -32762, errno = 22. 
(10839) SYSTEM ERROR: Unable to set semaphore set : semid = -32761, errno = 22. 
(10839) SYSTEM ERROR: Unable to set semaphore set : semid = 6, errno = 22. 
(10839) SYSTEM ERROR: Unable to set semaphore set : semid = 7, errno = 22.

Note that -32763, -32762, -32762 = 32768 minus 5, 6 and 7

(6526)  Number of Semaphore Sets (-semsets): 3

Your database used 3 semsets and all their semid's became truncated to one byte.

Posted by cverbiest on 25-Jul-2016 02:13

Hi George,

thanks for your reply. Not yet monitoring latch activity, will look that up

# ipcs -s

------ Semaphore Arrays --------

key semid owner perms nsems

0x00000000 294912 proadmin 666 7

0x00000000 327681 proadmin 666 97

0x00000000 360450 proadmin 666 97

0x00000000 393219 proadmin 666 7

0x00000000 425988 proadmin 666 97

0x00000000 458757 proadmin 666 97

for the LISA database

07/25/16 Semaphores

09:07:11

User Id Num

0 327681 0

1 360450 0

2 327681 1

3 360450 1

4 327681 2

5 360450 2

6 327681 3

7 360450 3

8 327681 4

9 360450 4

10 327681 5

11 360450 5

12 327681 6

13 360450 6

14 327681 7

15 360450 7

16 327681 8

17 360450 8

18 327681 9

19 360450 9

20 327681 10

21 360450 10

22 327681 11

23 360450 11

24 327681 12

25 360450 12

26 327681 13

27 360450 13

28 327681 14

29 360450 14

30 327681 15

31 360450 15

32 327681 16

33 360450 16

34 327681 17

35 360450 17

36 327681 18

37 360450 18

38 327681 19

39 360450 19

40 327681 20

41 360450 20

42 327681 21

43 360450 21

44 327681 22

45 360450 22

46 327681 23

47 360450 23

48 327681 24

49 360450 24

50 327681 25

51 360450 25

52 327681 26

53 360450 26

54 327681 27

55 360450 27

56 327681 28

57 360450 28

58 327681 29

59 360450 29

60 327681 30

61 360450 30

62 327681 31

63 360450 31

64 327681 32

65 360450 32

66 327681 33

67 360450 33

68 327681 34

69 360450 34

70 327681 35

71 360450 35

72 327681 36

73 360450 36

74 327681 37

75 360450 37

76 327681 38

77 360450 38

78 327681 39

79 360450 39

80 327681 40

81 360450 40

82 327681 41

83 360450 41

84 327681 42

85 360450 42

86 327681 43

87 360450 43

88 327681 44

89 360450 44

90 327681 45

91 360450 45

92 327681 46

93 360450 46

94 327681 47

95 360450 47

96 327681 48

97 360450 48

98 327681 49

99 360450 49

100 327681 50

101 360450 50

102 327681 51

103 360450 51

104 327681 52

105 360450 52

106 327681 53

107 360450 53

108 327681 54

109 360450 54

110 327681 55

111 360450 55

112 327681 56

113 360450 56

114 327681 57

115 360450 57

116 327681 58

117 360450 58

118 327681 59

119 360450 59

120 327681 60

121 360450 60

122 327681 61

123 360450 61

124 327681 62

125 360450 62

126 327681 63

127 360450 63

128 327681 64

129 360450 64

130 327681 65

131 360450 65

132 327681 66

133 360450 66

134 327681 67

135 360450 67

136 327681 68

137 360450 68

138 327681 69

139 360450 69

140 327681 70

141 360450 70

142 327681 71

143 360450 71

144 327681 72

145 360450 72

146 327681 73

147 360450 73

148 327681 74

149 360450 74

150 327681 75

151 360450 75

152 327681 76

153 360450 76

154 327681 77

155 360450 77

156 327681 78

157 360450 78

158 327681 79

159 360450 79

160 327681 80

161 360450 80

162 327681 81

163 360450 81

164 327681 82

165 360450 82

166 327681 83

167 360450 83

168 327681 84

169 360450 84

170 327681 85

171 360450 85

172 327681 86

173 360450 86

174 327681 87

175 360450 87

176 327681 88

177 360450 88

178 327681 89

179 360450 89

180 327681 90

181 360450 90

182 327681 91

183 360450 91

184 327681 92

185 360450 92

186 327681 93

187 360450 93

188 327681 94

189 360450 94

190 327681 95

191 360450 95

192 327681 96

for the lisalrap database (we also had the crashes when the second db was not running)

07/25/16 Semaphores

09:10:34

User Id Num

0 425988 0

1 458757 0

2 425988 1

3 458757 1

4 425988 2

5 458757 2

6 425988 3

7 458757 3

8 425988 4

9 458757 4

10 425988 5

11 458757 5

12 425988 6

13 458757 6

14 425988 7

15 458757 7

16 425988 8

17 458757 8

18 425988 9

19 458757 9

20 425988 10

21 458757 10

22 425988 11

23 458757 11

24 425988 12

25 458757 12

26 425988 13

27 458757 13

28 425988 14

29 458757 14

30 425988 15

31 458757 15

32 425988 16

33 458757 16

34 425988 17

35 458757 17

36 425988 18

37 458757 18

38 425988 19

39 458757 19

40 425988 20

41 458757 20

42 425988 21

43 458757 21

44 425988 22

45 458757 22

46 425988 23

47 458757 23

48 425988 24

49 458757 24

50 425988 25

51 458757 25

52 425988 26

53 458757 26

54 425988 27

55 458757 27

56 425988 28

57 458757 28

58 425988 29

59 458757 29

60 425988 30

61 458757 30

62 425988 31

63 458757 31

64 425988 32

65 458757 32

66 425988 33

67 458757 33

68 425988 34

69 458757 34

70 425988 35

71 458757 35

72 425988 36

73 458757 36

74 425988 37

75 458757 37

76 425988 38

77 458757 38

78 425988 39

79 458757 39

80 425988 40

81 458757 40

82 425988 41

83 458757 41

84 425988 42

85 458757 42

86 425988 43

87 458757 43

88 425988 44

89 458757 44

90 425988 45

91 458757 45

92 425988 46

93 458757 46

94 425988 47

95 458757 47

96 425988 48

97 458757 48

98 425988 49

99 458757 49

100 425988 50

101 458757 50

102 425988 51

103 458757 51

104 425988 52

105 458757 52

106 425988 53

107 458757 53

108 425988 54

109 458757 54

110 425988 55

111 458757 55

112 425988 56

113 458757 56

114 425988 57

115 458757 57

116 425988 58

117 458757 58

118 425988 59

119 458757 59

120 425988 60

121 458757 60

122 425988 61

123 458757 61

124 425988 62

125 458757 62

126 425988 63

127 458757 63

128 425988 64

129 458757 64

130 425988 65

131 458757 65

132 425988 66

133 458757 66

134 425988 67

135 458757 67

136 425988 68

137 458757 68

138 425988 69

139 458757 69

140 425988 70

141 458757 70

142 425988 71

143 458757 71

144 425988 72

145 458757 72

146 425988 73

147 458757 73

148 425988 74

149 458757 74

150 425988 75

151 458757 75

152 425988 76

153 458757 76

154 425988 77

155 458757 77

156 425988 78

157 458757 78

158 425988 79

159 458757 79

160 425988 80

161 458757 80

162 425988 81

163 458757 81

164 425988 82

165 458757 82

166 425988 83

167 458757 83

168 425988 84

169 458757 84

170 425988 85

171 458757 85

172 425988 86

173 458757 86

174 425988 87

175 458757 87

176 425988 88

177 458757 88

178 425988 89

179 458757 89

180 425988 90

181 458757 90

182 425988 91

183 458757 91

184 425988 92

185 458757 92

186 425988 93

187 458757 93

188 425988 94

189 458757 94

190 425988 95

191 458757 95

192 425988 96

Posted by George Potemkin on 25-Jul-2016 02:38

294912 = 0x48000
327681 = 0x50001
360450 = 0x58002

393219 = 0x60003
425988 = 0x68004
458757 = 0x70005

I guess Remote Client Servers get the semid's from db shared memory. When one of servers corrupts this information in shared memory then the other ones begin to report the 10839's. "Mad" server "updates" semid's for its clients. If it served more than one client it may corrupt more than one semid. Can you use the -Ma 1 as a workaround?

BTW, the first semset (the one with 7 semaphores) is used only during db startup and during logins/logouts of self-serviced clients. I guess you will not see the last byte of this semid in the 10839's

Posted by cverbiest on 25-Jul-2016 02:52

I ran dbmon.sh, output uploaded here

https://ccenv-my.sharepoint.com/personal/cvb_cce_be/_layouts/15/guestaccess.aspx?guestaccesstoken=DfptMKQvgq3MsCs2MEV32d6KU6hydxAMvMIwKVVWKOY%3d&docid=0b802bea945d04006b17f18a20cf1f814&rev=1

I found a (seemingly) similar issue reported by SteveSaltDanwood, I asked what the solution was and this is his reply

Carl,

We had been in discussions with Tech Support regarding this. The question refers to KB 61278, and our sys Admin has increased both SEMMSL, SEMMNS and SEMOPM slightly. The one thing that I did, was to add -semsets 6 to the Primary Login Broker pf file, and our issue disappeared - not sure whether this actually fixed it, but since the change on 10th Feb we have had no further instances.

Regards,

Steve

Posted by George Potemkin on 25-Jul-2016 03:12

Carl, do you always pre-start 60 remote client servers immidiately after db startup?

[2016/07/13@23:48:13.123+0200] P-3986       T-140605298083648 I APW    82: (2518)  Started. 
[2016/07/13@23:48:27.788+0200] P-4350       T-140402236524352 I SRV     1: (5646)  Started on port 11002 using TCP IPV4 address 0.0.0.0, pid 4350. 
[2016/07/13@23:49:14.562+0200] P-4843       T-140184589739840 I SRV     2: (5646)  Started on port 11003 using TCP IPV4 address 0.0.0.0, pid 4843. 

[2016/07/14@00:26:16.996+0200] P-17010      T-140595944453952 I ABL    87: (1132)  Invalid semaphore id 
[2016/07/14@00:26:16.996+0200] P-17010      T-140595944453952 I ABL    87: (10839) SYSTEM ERROR: Unable to set semaphore set : semid = 7, errno = 22. 
[2016/07/14@00:26:16.996+0200] P-17010      T-140595944453952 I ABL    87: (2252)  Begin transaction backout. 
[2016/07/14@00:26:16.996+0200] P-17010      T-140595944453952 I ABL    87: (2253)  Transaction backout completed. 
[2016/07/14@00:26:17.002+0200] P-17010      T-140595944453952 I ABL        (1132)  Invalid semaphore id 
[2016/07/14@00:26:17.002+0200] P-17010      T-140595944453952 I ABL        (1132)  Invalid semaphore id 
[2016/07/14@00:26:17.002+0200] P-17010      T-140595944453952 I ABL        (10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 5, errno = 22. 
[2016/07/14@00:26:17.002+0200] P-17010      T-140595944453952 I ABL        (-----) semLockLog_2: semValue val = -1
[2016/07/14@00:26:19.903+0200] P-3966       T-140549975136064 I BROKER  0: (2527)  Disconnecting dead user 87. 



[2016/07/14@00:55:14.379+0200] P-2288       T-139877706889024 I APW    82: (2518)  Started. 
[2016/07/14@00:57:04.277+0200] P-3010       T-140641843181376 I SRV     2: (5646)  Started on port 11003 using TCP IPV4 address 0.0.0.0, pid 3010. 
...
[2016/07/14@01:02:05.506+0200] P-4201       T-140565902128960 I SRV     5: (5646)  Started on port 11006 using TCP IPV4 address 0.0.0.0, pid 4201. 
[2016/07/14@01:25:33.654+0200] P-2626       T-139742344922944 I APPSRV 91: (1132)  Invalid semaphore id 
[2016/07/14@01:25:33.654+0200] P-2626       T-139742344922944 I APPSRV 91: (10839) SYSTEM ERROR: Unable to set semaphore set : semid = -32761, errno = 22.

It looks like sometimes only a few servers were started successfully and the semaphore errors were issued by the self-service clients:

Latch activity collected by dbmon is normal. And I guess db activity was next to zero right after db startup in the midnight. I suspect that starting many RCS at once can be a root case of the errors.Multiple semsets will decrease the competion for semset's during their launches.

Posted by cverbiest on 25-Jul-2016 04:01

No servers are prestarted, would that be better ?

Just had a call with Progress tech support, they suspect something is leaking file handles.

Posted by George Potemkin on 25-Jul-2016 04:23

> No servers are prestarted

Db log shows that /sometimes/ a lot of servers were started but there were no remote users connected.

> Just had a call with Progress tech support, they suspect something is leaking file handles.

I'd run promon in background to check every second a few first lines on 16. "Semaphores" screen. If they are changed then the information in shared memory has got corrupted.

Posted by cverbiest on 25-Jul-2016 04:30

I trimmed the logfile, deleted all logon/logoff sequences, maybe shouldn't have done that

Ø I'd run promon in background to ..

I'll try that

Posted by George Potemkin on 25-Jul-2016 05:10

When you will be able to catch the changes of the shmid's then the next step would be to generate the protrace files for all Progress sessions. And maybe it's not too late to do the same immediately after a first semaphore error in db log.

Posted by cverbiest on 25-Jul-2016 07:10

Lastest crash 25 minutes after db startup following reboot of server. increase of -semsets to 6 did not help.

Captured promon semaphore output 11 seconds before crash didn't look different than one after db start

ccenv-my.sharepoint.com/.../guestaccess.aspx

Posted by cverbiest on 25-Jul-2016 07:55

Was collecting semaphore output every minute.

Crash frequency increased (2 crashes last 2.5 hours) will stop collecting this data. May be unrelated but I wont take the risk

Posted by George Potemkin on 25-Jul-2016 08:05

> Was collecting semaphore output every minute

Time interval was too long. One second would have more chances to catch the corruptions of semid's. But even one second can be too long.

What values had these semid's for last multi-user session?

Only 2 of 5 semid's were corrupted this time.

Posted by George Potemkin on 25-Jul-2016 08:13

The error was reported by the most recently started session (a batch process) since 23 seconds after its login. Carl, do you know what the session did during this time?

Posted by cverbiest on 25-Jul-2016 08:15

I'll try to find out

Posted by George Potemkin on 25-Jul-2016 08:36

Just a crazy idea but it helped one our customer in the past:

Create and start an empty database. Force all self-service clients to connect the empty db first and only then connect the LISA db. I hope your application (especially its authorization) does not rely on the order of the connected databases.

Only the self-service clients can corrupt data in shared memory. Semid's are stored at the beginning of shared memory => at the beginning of the client's address space. Shared-memory of the empty db will take the risks. Either the errors will disappear or you will get them for the empty database.

Posted by cverbiest on 25-Jul-2016 08:37

Hi George,

sorry but I'm kind of drowning in logfiles, can you give some more info on the process that you refer to ?

below is the most recent protrace file but that's probably not the process that interests you

PROGRESS stack trace as of Mon Jul 25 12:46:48 2016

Progress OpenEdge Release 11.6 build 1346 SP02 TF04 on Linux franky-db 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015

Command line arguments are

/usr/dlc/bin/_mprosrv -classpath /usr/oe116/jdk/lib/tools.jar:/usr/oe116/java/progress.jar:/usr/oe116/java/messages.jar:/usr/oe116/java/prosp.jar /data/cce/applprod/data/LISA -L 150000 -n 112 -spin 50000 -bibufs 20 -Mi 1 -Ma 1 -Mn 79 -Mpb 78 -Mm 10240 -lruskips 10 -semsets 6 -S 11000 -B 2000000 -tablerangesize 1000 -indexrangesize 3500 -minport 11001 -maxport 12999 -ServerType 4GL

Startup parameters:

-pf /usr/dlc/startup.pf,-cpinternal UTF-8,-cpstream UTF-8,-cpcoll Basic,-cpcase Basic,-d dmy,-numsep 46,-numdec 44,-Mm 10240,(end .pf),-classpath /usr/oe116/jdk/lib/tools.jar:/usr/oe116/java/progress.jar:/usr/oe116/java/messages.jar:/usr/oe116/java/prosp.jar,-db /data/cce/applprod/data/LISA,-L 150000,-n 112,-spin 50000,-bibufs 20,-Mi 1,-Ma 1,-Mn 79,-Mpb 78,-Mm 10240,-lruskips 10,-semsets 6,-S 11000,-B 2000000,-tablerangesize 1000,-indexrangesize 3500,-minport 11001,-maxport 12999,-ServerType 4GL

#1 [0x5e81bb] uttraceback+0xfb from /usr/dlc/bin/_mprosrv

#2 [0x5ead4e] uttrace+0x13e from /usr/dlc/bin/_mprosrv

#3 [0x5eaec4] utcore+0x74 from /usr/dlc/bin/_mprosrv

#4 [0x424952] drexit+0x212 from /usr/dlc/bin/_mprosrv

#5 [0x4d6045] dbExit+0x7c from /usr/dlc/bin/_mprosrv

#6 [0x4f8d27] dbKillAll+0x32a from /usr/dlc/bin/_mprosrv

#7 [0x4f91de] dbKillUsers+0x4a8 from /usr/dlc/bin/_mprosrv

#8 [0x47893b] mdend+0x13b from /usr/dlc/bin/_mprosrv

#9 [0x424911] drexit+0x1d1 from /usr/dlc/bin/_mprosrv

#10 [0x4d6045] dbExit+0x7c from /usr/dlc/bin/_mprosrv

#11 [0x581e07] semLockLog+0x200 from /usr/dlc/bin/_mprosrv

#12 [0x4f5e2c] dbDelUsrctl+0xd2 from /usr/dlc/bin/_mprosrv

#13 [0x4f7b23] brclrdead+0x44d from /usr/dlc/bin/_mprosrv

#14 [0x4fa06a] dbWatchdog+0x764 from /usr/dlc/bin/_mprosrv

#15 [0x51b214] dsmWatchdog+0xa7 from /usr/dlc/bin/_mprosrv

#16 [0x5d4355] nsaloop+0x518 from /usr/dlc/bin/_mprosrv

#17 [0x4784b3] doserve+0xbd8 from /usr/dlc/bin/_mprosrv

#18 [0x478771] main+0x103 from /usr/dlc/bin/_mprosrv

#19 [0x49eadb15] __libc_start_main+0xf5 from /lib64/libc.so.6

Posted by cverbiest on 25-Jul-2016 08:40

We have 2 databases and sometimes this issues appeared first on the second db.

I'll try to replace all db connections with client server connections to see if that helps

Posted by George Potemkin on 25-Jul-2016 08:42

> can you give some more info on the process that you refer to ?

[2016/07/25@12:54:48.591+0200] P-9540       T-140149562840896 I ABL    86: (452)   Login by proadmin on batch. 
[2016/07/25@12:54:48.608+0200] P-9540       T-140149562840896 I ABL    86: (7129)  Usr 86 set name to batchclnt. 
[2016/07/25@12:55:11.829+0200] P-9540       T-140149562840896 I ABL    86: (1132)  Invalid semaphore id 
[2016/07/25@12:55:11.829+0200] P-9540       T-140149562840896 I ABL    86: (10839) SYSTEM ERROR: Unable to set semaphore set : semid = 2, errno = 22. 
[2016/07/25@12:55:11.830+0200] P-9540       T-140149562840896 I ABL    86: (453)   Logout by batchclnt on batch. 
[2016/07/25@12:55:11.830+0200] P-9540       T-140149562840896 I ABL        (1132)  Invalid semaphore id 
[2016/07/25@12:55:11.830+0200] P-9540       T-140149562840896 I ABL        (10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 0, errno = 22. 
[2016/07/25@12:55:11.830+0200] P-9540       T-140149562840896 I ABL        (-----) semLockLog_2: semValue val = -1
[2016/07/25@12:55:14.327+0200] P-4295       T-140025558579008 I BROKER  0: (2527)  Disconnecting dead user 86.

> Captured promon semaphore output 11 seconds before crash didn't look different than one after db start

Can you post first 5 lines of promon semaphore output?

Posted by George Potemkin on 25-Jul-2016 08:45

> We have 2 databases and sometimes this issues appeared first on the second db.

Only one session was connected to the second database at 14:29:29

Posted by cverbiest on 25-Jul-2016 09:01

I just detected our batch clients connect with rather low (old) values for -s -mmax and -D, could this cause these kind of issues ? I'm going to increase the values anyway

-d dmy

-yy 1960

-E

-h 7

-D 80

-s 63

-B 1000

-Mr 8192

-TB 31

-TM 32

-mmax 5000

-Bt 200

#-cpinternal ISO8859-1

-cpinternal utf-8

-cpstream IBM850

-rereadnolock

-p ccetools/start-b

-T /tmp

Posted by George Potemkin on 25-Jul-2016 09:02

Carl, when your database doesn't have the semaphore errors it has the tcp errors:

[2016/07/25@07:25:12.467+0200] P-37327      T-139985194260288 I SRV    16: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@07:42:00.265+0200] P-38269      T-140325921544000 I SRV    19: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@07:56:39.636+0200] P-37336      T-139744649774912 I SRV    18: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@08:22:32.523+0200] P-46233      T-139696198965056 I SRV    34: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@08:33:28.461+0200] P-46233      T-139696198965056 I SRV    34: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@08:53:58.327+0200] P-51006      T-139629681858368 I SRV    38: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@08:56:58.560+0200] P-52605      T-140382806456128 I SRV    40: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@08:58:07.372+0200] P-53007      T-139680708523840 I SRV    41: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@09:19:55.615+0200] P-52605      T-140382806456128 I SRV    40: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@09:21:45.428+0200] P-40014      T-140399388444480 I SRV    21: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@09:25:09.992+0200] P-56699      T-140641693988672 I SRV    43: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@09:47:20.925+0200] P-33969      T-140121481381696 I SRV     2: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@09:48:22.646+0200] P-52605      T-140382806456128 I SRV    40: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@10:33:19.665+0200] P-40014      T-140399388444480 I SRV    21: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@10:34:55.088+0200] P-53007      T-139680708523840 I SRV    41: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@10:48:41.338+0200] P-33974      T-139697772349248 I SRV     4: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@10:50:27.258+0200] P-52605      T-140382806456128 I SRV    40: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@11:12:19.914+0200] P-10713      T-140585542993728 I SRV    46: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@11:14:42.414+0200] P-10713      T-140585542993728 I SRV    46: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@11:44:40.016+0200] P-51424      T-139858028005184 I SRV    39: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@11:46:17.233+0200] P-11690      T-139931360241472 I SRV    48: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@11:55:47.234+0200] P-33974      T-139697772349248 I SRV     4: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@11:59:46.517+0200] P-10225      T-139944178841408 I SRV    45: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@12:01:44.940+0200] P-51424      T-139858028005184 I SRV    39: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@13:41:28.143+0200] P-4635       T-140640063956800 I SRV    29: (-----) TCP/IP write error occurred with errno 32

Most likely their root case is a hardware.

Posted by George Potemkin on 25-Jul-2016 09:06

> could this cause these kind of issues ?

Everything possible. The good method is to keep changing the things (of course, one change at once).

Posted by cverbiest on 25-Jul-2016 09:09

> The good method is to keep changing the things

I know but it's not easy to follow that rule right now

Thanks you for all your advice and interest

Posted by cverbiest on 25-Jul-2016 09:43

reply to the 14:13 question what the process was doing found following 4gl trace, unfortunately trace level is too low to be really useful

The second log file shows lsof output that is produced by the batch job prior to the quit statement

code fragment :

output close. /* batcherror logfile */
if opsys = "unix"
then os-command silent value(subst("sudo /usr/sbin/lsof -p &1 >> &2", WZ-PID, WZ-LOGFILE)).
run WriteMemo.

quit.

first log file

# cat /tmp/batch_9539.log
[16/07/25@12:54:48.609+0200] P-009540 T-485005120 1 4GL -- Logging level set to = 1
[16/07/25@12:55:11.829+0200] P-009540 T-485005120 1 4GL -- Invalid semaphore id (1132)
[16/07/25@12:55:11.829+0200] P-009540 T-485005120 1 4GL -- SYSTEM ERROR: Unable to set semaphore set : semid = 2, errno = 22. (10839)
[16/07/25@12:55:11.830+0200] P-009540 T-485005120 1 4GL -- Invalid semaphore id (1132)
[16/07/25@12:55:11.830+0200] P-009540 T-485005120 1 4GL -- SYSTEM ERROR: Unable to get value of semaphore set : semid = 0, errno = 22. (10839)
[16/07/25@12:55:11.830+0200] P-009540 T-485005120 1 4GL -- semLockLog_2: semValue val = -1


# cat be9540.log print.sh -f acrobat blancke01 1 /tmp/Kat46489_9540.pdf 2>&1 Acrobat exitcode: 0 print.sh -f acrobat blancke01 1 /tmp/Kat46493_9540.pdf 2>&1 Acrobat exitcode: 0 COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME _progres 9540 proadmin cwd DIR 8,2 12288 2228226 /usr2/cce/applprod _progres 9540 proadmin rtd DIR 8,1 4096 2 / _progres 9540 proadmin txt REG 8,1 12589577 400553 /usr/oe116/bin/_progres _progres 9540 proadmin mem REG 8,1 110808 394029 /usr/lib64/libresolv-2.17.so _progres 9540 proadmin mem REG 8,1 27512 394017 /usr/lib64/libnss_dns-2.17.so _progres 9540 proadmin mem REG 8,1 156770 1054635 /usr/oe116/lib/libproxml.so _progres 9540 proadmin mem REG 8,1 113328 394011 /usr/lib64/libnsl-2.17.so _progres 9540 proadmin mem REG 8,1 5102967 1054636 /usr/oe116/lib/libxerces_psc.so _progres 9540 proadmin mem REG 8,1 2546137 1054640 /usr/oe116/lib/libicui18n48_psc.so _progres 9540 proadmin mem REG 8,1 1751412 1054642 /usr/oe116/lib/libicuuc48_psc.so _progres 9540 proadmin mem REG 8,1 18187923 1054638 /usr/oe116/lib/libicudata48_psc.so _progres 9540 proadmin DEL REG 0,4 98305 /SYSV00000000 _progres 9540 proadmin mem REG 8,1 37441 1054490 /usr/oe116/lib/libenhlog.so _progres 9540 proadmin mem REG 8,1 61928 394019 /usr/lib64/libnss_files-2.17.so _progres 9540 proadmin mem REG 8,1 11376 393960 /usr/lib64/libfreebl3.so _progres 9540 proadmin mem REG 8,1 2112384 394001 /usr/lib64/libc-2.17.so _progres 9540 proadmin mem REG 8,1 88720 393235 /usr/lib64/libgcc_s-4.8.5-20150702.so.1 _progres 9540 proadmin mem REG 8,1 995840 394050 /usr/lib64/libstdc++.so.6.0.19 _progres 9540 proadmin mem REG 8,1 1141560 394009 /usr/lib64/libm-2.17.so _progres 9540 proadmin mem REG 8,1 40816 394005 /usr/lib64/libcrypt-2.17.so _progres 9540 proadmin mem REG 8,1 44096 394031 /usr/lib64/librt-2.17.so _progres 9540 proadmin mem REG 8,1 19520 394007 /usr/lib64/libdl-2.17.so _progres 9540 proadmin mem REG 8,1 142304 394027 /usr/lib64/libpthread-2.17.so _progres 9540 proadmin mem REG 8,1 164440 402388 /usr/lib64/ld-2.17.so _progres 9540 proadmin 0r CHR 1,3 0t0 1028 /dev/null _progres 9540 proadmin 1w REG 8,6 14110 25 /tmp/batcherr32 _progres 9540 proadmin 2w REG 8,6 14110 25 /tmp/batcherr32 _progres 9540 proadmin 3r REG 8,1 1749843 400498 /usr/oe116/promsgs _progres 9540 proadmin 4r REG 8,1 944417 400495 /usr/oe116/convmap.cp _progres 9540 proadmin 5r REG 8,1 944417 400495 /usr/oe116/convmap.cp _progres 9540 proadmin 6u REG 8,6 74955 7235 /tmp/lbiN5u92T (deleted) _progres 9540 proadmin 7w REG 8,7 3082181 13528468 /data/cce/applprod/data/LISA.lg _progres 9540 proadmin 8u REG 8,7 655360 13528466 /data/cce/applprod/data/LISA.db _progres 9540 proadmin 9u REG 8,7 655360 13528466 /data/cce/applprod/data/LISA.db _progres 9540 proadmin 10u REG 8,7 26869760 13500420 /data/cce/applprod/data/LISA.b1 _progres 9540 proadmin 11u REG 8,7 15204352 13500421 /data/cce/applprod/data/LISA.d1 _progres 9540 proadmin 12u REG 8,7 15204352 13500421 /data/cce/applprod/data/LISA.d1 _progres 9540 proadmin 13u REG 8,7 524288000 13500422 /data/cce/applprod/data/LISA_10.d1 _progres 9540 proadmin 14u REG 8,7 524288000 13500423 /data/cce/applprod/data/LISA_10.d2 _progres 9540 proadmin 15u REG 8,7 524288000 13500424 /data/cce/applprod/data/LISA_10.d3 _progres 9540 proadmin 16u REG 8,7 131072 13500425 /data/cce/applprod/data/LISA_10.d4 _progres 9540 proadmin 17u REG 8,7 131072 13500425 /data/cce/applprod/data/LISA_10.d4 _progres 9540 proadmin 18u REG 8,7 262144000 13500426 /data/cce/applprod/data/LISA_11.d1 _progres 9540 proadmin 19u REG 8,7 262144000 13500427 /data/cce/applprod/data/LISA_11.d2 _progres 9540 proadmin 20u REG 8,7 262144000 13500428 /data/cce/applprod/data/LISA_11.d3 _progres 9540 proadmin 21u REG 8,7 131072 13500429 /data/cce/applprod/data/LISA_11.d4 _progres 9540 proadmin 22u REG 8,7 131072 13500429 /data/cce/applprod/data/LISA_11.d4 _progres 9540 proadmin 23u REG 8,7 131072 13500430 /data/cce/applprod/data/LISA_12.d1 _progres 9540 proadmin 24u REG 8,7 131072 13500430 /data/cce/applprod/data/LISA_12.d1 _progres 9540 proadmin 25u REG 8,7 2147352576 13500431 /data/cce/applprod/data/LISA_20.d1 _progres 9540 proadmin 26u REG 8,7 2147352576 13500432 /data/cce/applprod/data/LISA_20.d2 _progres 9540 proadmin 27u REG 8,7 2147352576 13500433 /data/cce/applprod/data/LISA_20.d3 _progres 9540 proadmin 28u REG 8,7 2147352576 13500434 /data/cce/applprod/data/LISA_20.d4 _progres 9540 proadmin 29u REG 8,7 2147352576 13500435 /data/cce/applprod/data/LISA_20.d5 _progres 9540 proadmin 30u REG 8,7 2147352576 13500436 /data/cce/applprod/data/LISA_20.d6 _progres 9540 proadmin 31u REG 8,7 2147352576 13500437 /data/cce/applprod/data/LISA_20.d7 _progres 9540 proadmin 32u REG 8,7 2147352576 13500438 /data/cce/applprod/data/LISA_20.d8 _progres 9540 proadmin 33u REG 8,7 2147352576 13500439 /data/cce/applprod/data/LISA_20.d9 _progres 9540 proadmin 34u REG 8,7 2147352576 13500440 /data/cce/applprod/data/LISA_20.d10 _progres 9540 proadmin 35u REG 8,7 2147352576 13500441 /data/cce/applprod/data/LISA_20.d11 _progres 9540 proadmin 36u REG 8,7 2147352576 13500442 /data/cce/applprod/data/LISA_20.d12 _progres 9540 proadmin 37u REG 8,7 2147352576 13500443 /data/cce/applprod/data/LISA_20.d13 _progres 9540 proadmin 38u REG 8,7 2147352576 13500444 /data/cce/applprod/data/LISA_20.d14 _progres 9540 proadmin 39u REG 8,7 2147352576 13500445 /data/cce/applprod/data/LISA_20.d15 _progres 9540 proadmin 40u REG 8,7 2147352576 13500446 /data/cce/applprod/data/LISA_20.d16 _progres 9540 proadmin 41u REG 8,7 2147352576 13500447 /data/cce/applprod/data/LISA_20.d17 _progres 9540 proadmin 42u REG 8,7 2147352576 13500448 /data/cce/applprod/data/LISA_20.d18 _progres 9540 proadmin 43u REG 8,7 2147352576 13500449 /data/cce/applprod/data/LISA_20.d19 _progres 9540 proadmin 44u REG 8,7 2147352576 13500450 /data/cce/applprod/data/LISA_20.d20 _progres 9540 proadmin 45u REG 8,7 2147352576 13500451 /data/cce/applprod/data/LISA_20.d21 _progres 9540 proadmin 46u REG 8,7 2147352576 13500452 /data/cce/applprod/data/LISA_20.d22 _progres 9540 proadmin 47u REG 8,7 524288 13500453 /data/cce/applprod/data/LISA_20.d23 _progres 9540 proadmin 48u REG 8,7 524288 13500453 /data/cce/applprod/data/LISA_20.d23 _progres 9540 proadmin 49u REG 8,7 2147352576 13500454 /data/cce/applprod/data/LISA_21.d1 _progres 9540 proadmin 50u REG 8,7 2147352576 13500455 /data/cce/applprod/data/LISA_21.d2 _progres 9540 proadmin 51u REG 8,7 2147352576 13500456 /data/cce/applprod/data/LISA_21.d3 _progres 9540 proadmin 52u REG 8,7 2147352576 13500457 /data/cce/applprod/data/LISA_21.d4 _progres 9540 proadmin 53u REG 8,7 2147352576 13500458 /data/cce/applprod/data/LISA_21.d5 _progres 9540 proadmin 54u REG 8,7 2147352576 13500459 /data/cce/applprod/data/LISA_21.d6 _progres 9540 proadmin 55u REG 8,7 2147352576 13500460 /data/cce/applprod/data/LISA_21.d7 _progres 9540 proadmin 56u REG 8,7 2147352576 13500461 /data/cce/applprod/data/LISA_21.d8 _progres 9540 proadmin 57u REG 8,7 2147352576 13500462 /data/cce/applprod/data/LISA_21.d9 _progres 9540 proadmin 58u REG 8,7 2147352576 13500463 /data/cce/applprod/data/LISA_21.d10 _progres 9540 proadmin 59u REG 8,7 2147352576 13500464 /data/cce/applprod/data/LISA_21.d11 _progres 9540 proadmin 60u REG 8,7 2147352576 13500465 /data/cce/applprod/data/LISA_21.d12 _progres 9540 proadmin 61u REG 8,7 524288 13500466 /data/cce/applprod/data/LISA_21.d13 _progres 9540 proadmin 62u REG 8,7 524288 13500466 /data/cce/applprod/data/LISA_21.d13 _progres 9540 proadmin 63u REG 8,7 1572864 13500467 /data/cce/applprod/data/LISA_22.d1 _progres 9540 proadmin 64u REG 8,7 1572864 13500467 /data/cce/applprod/data/LISA_22.d1 _progres 9540 proadmin 65u REG 8,7 131072000 13500468 /data/cce/applprod/data/LISA_30.d1 _progres 9540 proadmin 66u REG 8,7 131072000 13500469 /data/cce/applprod/data/LISA_30.d2 _progres 9540 proadmin 67u REG 8,7 131072000 13500470 /data/cce/applprod/data/LISA_30.d3 _progres 9540 proadmin 68u REG 8,7 524288000 13500471 /data/cce/applprod/data/LISA_30.d4 _progres 9540 proadmin 69u REG 8,7 524288000 13500472 /data/cce/applprod/data/LISA_30.d5 _progres 9540 proadmin 70u REG 8,7 524288000 13500473 /data/cce/applprod/data/LISA_30.d6 _progres 9540 proadmin 71u REG 8,7 524288 13500474 /data/cce/applprod/data/LISA_30.d7 _progres 9540 proadmin 72u REG 8,7 524288 13500474 /data/cce/applprod/data/LISA_30.d7 _progres 9540 proadmin 73u REG 8,7 131072000 13500475 /data/cce/applprod/data/LISA_31.d1 _progres 9540 proadmin 74u REG 8,7 131072000 13500476 /data/cce/applprod/data/LISA_31.d2 _progres 9540 proadmin 75u REG 8,7 131072000 13500477 /data/cce/applprod/data/LISA_31.d3 _progres 9540 proadmin 76u REG 8,7 131072000 13500478 /data/cce/applprod/data/LISA_31.d4 _progres 9540 proadmin 77u REG 8,7 524288 13500479 /data/cce/applprod/data/LISA_31.d5 _progres 9540 proadmin 78u REG 8,7 524288 13500479 /data/cce/applprod/data/LISA_31.d5 _progres 9540 proadmin 79u REG 8,7 524288 13500480 /data/cce/applprod/data/LISA_32.d1 _progres 9540 proadmin 80u REG 8,7 524288 13500480 /data/cce/applprod/data/LISA_32.d1 _progres 9540 proadmin 81ur REG 8,6 84 7248 /tmp/batch_9539.log _progres 9540 proadmin 82u REG 8,6 0 7250 /tmp/srtVk4EVZ (deleted) _progres 9540 proadmin 83u REG 8,6 46351608 7251 /tmp/rcdhNobO5 (deleted) _progres 9540 proadmin 84u REG 8,6 819200 7256 /tmp/DBI485005120kkz8Gb (deleted) _progres 9540 proadmin 88r REG 8,1 4949823 1054570 /usr/oe116/tty/adecomm.pl _progres 9540 proadmin 89r REG 8,1 26771 1054866 /usr/oe116/tty/OpenEdge.ServerAdmin.pl _progres 9540 proadmin 90r REG 8,1 586073 1054581 /usr/oe116/tty/adeedit.pl _progres 9540 proadmin 91r REG 8,1 20580737 1054623 /usr/oe116/tty/prodict.pl _progres 9540 proadmin 92r REG 8,1 56999 1054864 /usr/oe116/tty/OpenEdge.BusinessLogic.pl _progres 9540 proadmin 93r REG 8,1 1499066 1054865 /usr/oe116/tty/OpenEdge.Core.pl _progres 9540 proadmin 94r REG 8,1 2631412 1054577 /usr/oe116/tty/adeshar.pl _progres 9540 proadmin 95r REG 8,1 240309 1054584 /usr/oe116/tty/adecomp.pl _progres 9540 proadmin 96r REG 8,1 14002200 1054621 /usr/oe116/tty/dataadmin.pl

Posted by George Potemkin on 25-Jul-2016 12:17

Carl, can you save 'ipcs -s' output after each db startup? Footprint that the corruption is leaving in shared memory is the most painless way to find out the root case. The semaphore errors show what the processes found out in shared memory but we don't know what the same bytes stored after db startup.

> Just had a call with Progress tech support, they suspect something is leaking file handles

The size of the corrupted spot in shared memory, the byte alignment of the corrution's borders would confirm or deny the assumption.

> server changed, from real redhat 5 to HyperV hosted virtual centos 7

I suspect this is the root.

Posted by Richard Shulman on 25-Jul-2016 15:48

Hello cverbiest,

You may want to take a look at this article for tuning semaphores on Linux:

knowledgebase.progress.com/.../Database-stops-abnormally-after-showing-the-error-10839

The errors you have listed suggest an insufficient configuration for the system and the users of the system.

I agree that to find what the value should be you have to use ipcs -s to find out how many semaphores are commonly being used.

Then you can adjust the semaphores accordingly.

Thank you

Posted by George Potemkin on 25-Jul-2016 21:40

> The errors you have listed suggest an insufficient configuration for the system and the users of the system.

I saw this article but I can't understand it. "Insufficient semaphore identifiers"? For the -semsets 3 (default) and/or 6 (the recent value)? On the system runnining only 2 databases? But broker successfully creates the semaphore identifiers at database startup. What processes failed to create more identifiers? The ones that reported the semaphore errors? But the client processes use only one semaphore in the semaphore identifiers created by broker and they get it at connection time:

What are you waiting for? A reason for waiting around.
Richard Banville, June 7, 2011
pugchallenge.org/.../Waiting_AmericaPUG.pptx
Semaphore usage
Resource wait semaphore – one per connection

Posted by cverbiest on 26-Jul-2016 01:27

I'll add ipcs -s to the startup script.

Yesterday we stopped using shared-memory connections and it got us through the night without crashes

Posted by George Potemkin on 26-Jul-2016 02:36

Did the tcp errors disappear as well?

Posted by cverbiest on 26-Jul-2016 02:47

I haven't seen any TCP/IP errors in the db log, I do see a lot of dropped packets

ifconfig

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500

inet x.x.x.x netmask 255.255.0.0 broadcast 10.0.255.255

inet6 yyyyyyyyyyyyyyyyyy prefixlen 64 scopeid 0x20<link>

ether 00:15:5d:00:01:12 txqueuelen 1000 (Ethernet)

RX packets 72734765 bytes 25463358707 (23.7 GiB)

RX errors 0 dropped 1023562 overruns 0 frame 0

TX packets 61839003 bytes 83457857300 (77.7 GiB)

TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536

inet 127.0.0.1 netmask 255.0.0.0

inet6 ::1 prefixlen 128 scopeid 0x10<host>

loop txqueuelen 0 (Local Loopback)

RX packets 483858734 bytes 164988488628 (153.6 GiB)

RX errors 0 dropped 0 overruns 0 frame 0

TX packets 483858734 bytes 164988488628 (153.6 GiB)

TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

Posted by George Potemkin on 26-Jul-2016 03:00

TCP/IP write error occurred with errno 32

>I haven't seen any TCP/IP errors in the db log,

Do you mean that you don't see them in the current log?

Yesterday these errors were reported in log once per 5-10 minutes through the whole day.

Posted by cverbiest on 26-Jul-2016 03:29

I missed a few during my last search, there were none between 17:00 and 06:50, there are a few errors since that time but the db stays up & running.

[2016/07/25@07:25:12.467+0200] P-37327      T-139985194260288 I SRV    16: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@07:42:00.265+0200] P-38269      T-140325921544000 I SRV    19: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@07:56:39.636+0200] P-37336      T-139744649774912 I SRV    18: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@08:22:32.523+0200] P-46233      T-139696198965056 I SRV    34: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@08:33:28.461+0200] P-46233      T-139696198965056 I SRV    34: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@08:53:58.327+0200] P-51006      T-139629681858368 I SRV    38: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@08:56:58.560+0200] P-52605      T-140382806456128 I SRV    40: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@08:58:07.372+0200] P-53007      T-139680708523840 I SRV    41: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@09:19:55.615+0200] P-52605      T-140382806456128 I SRV    40: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@09:21:45.428+0200] P-40014      T-140399388444480 I SRV    21: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@09:25:09.992+0200] P-56699      T-140641693988672 I SRV    43: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@09:47:20.925+0200] P-33969      T-140121481381696 I SRV     2: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@09:48:22.646+0200] P-52605      T-140382806456128 I SRV    40: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@10:33:19.665+0200] P-40014      T-140399388444480 I SRV    21: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@10:34:55.088+0200] P-53007      T-139680708523840 I SRV    41: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@10:48:41.338+0200] P-33974      T-139697772349248 I SRV     4: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@10:50:27.258+0200] P-52605      T-140382806456128 I SRV    40: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@11:12:19.914+0200] P-10713      T-140585542993728 I SRV    46: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@11:14:42.414+0200] P-10713      T-140585542993728 I SRV    46: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@11:44:40.016+0200] P-51424      T-139858028005184 I SRV    39: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@11:46:17.233+0200] P-11690      T-139931360241472 I SRV    48: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@11:55:47.234+0200] P-33974      T-139697772349248 I SRV     4: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@11:59:46.517+0200] P-10225      T-139944178841408 I SRV    45: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@12:01:44.940+0200] P-51424      T-139858028005184 I SRV    39: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@13:41:28.143+0200] P-4635       T-140640063956800 I SRV    29: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@13:50:06.256+0200] P-3171       T-140347927385920 I SRV    12: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@13:56:17.786+0200] P-3171       T-140347927385920 I SRV    12: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@13:59:28.007+0200] P-4133       T-139904439777088 I SRV    25: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@14:06:39.272+0200] P-4133       T-139904439777088 I SRV    25: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@14:16:58.821+0200] P-5630       T-140431553333056 I SRV    37: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@14:24:17.698+0200] P-5099       T-139919449716544 I SRV    33: (-----) TCP/IP write error occurred with errno 32
[2016/07/25@16:58:01.036+0200] P-19422      T-140208870037312 I SRV    17: (-----) TCP/IP write error occurred with errno 32
[2016/07/26@06:53:27.189+0200] P-23437      T-140599803811648 I SRV    29: (-----) TCP/IP write error occurred with errno 32
[2016/07/26@07:39:35.723+0200] P-27754      T-140670744590144 I SRV    36: (-----) TCP/IP write error occurred with errno 32
[2016/07/26@08:49:59.790+0200] P-60434      T-139664345315136 I SRV    48: (-----) TCP/IP write error occurred with errno 32
[2016/07/26@09:08:27.365+0200] P-20130      T-139837751461696 I SRV    20: (-----) TCP/IP write error occurred with errno 32
[2016/07/26@09:14:49.987+0200] P-20130      T-139837751461696 I SRV    20: (-----) TCP/IP write error occurred with errno 32
[2016/07/26@09:46:00.119+0200] P-57586      T-140041057462080 I SRV    46: (-----) TCP/IP write error occurred with errno 32

Posted by George Potemkin on 26-Jul-2016 04:25

If I understand correctly only the batch processes (20% of all sessions) were connected db locally. Today all of them are using a remote connection.

TCP/IP write error results in 794: Usernum <n> terminated abnormally. Is there anything common for all sessions that died with tcp error? The same username? The same remote host? I guess the batch processes are now connected using localhost. Will they get the 794's today?

I don't have a "theory" but I suspect that tcp errors and semaphore errors have the same source.

Posted by Ruanne Cluer on 26-Jul-2016 04:55

Just to be clear, not only were connections changed to client/server (-Ma 1)

But that this Linux guest is now on it's OWN host (as opposed to shared with other guests)

Where the Hyper-V node is now running only the Centos Linux.

All other VM's have been moved to the second node."

I believe that this is the significant part of the change that has somehow stabilised this awful semaphore down situation.

Posted by cverbiest on 01-Aug-2016 03:24

Found the trigger for the issue and a solution.

The database is started using a script that basically does

sudo -u proadmin proserve LISA
sudo -u proadmin probiw LISA
sudo -u proadmin proapw LISA
sudo -u proadmin proapw LISA

if someone logs on to the system with user proadmin using ssh, and executes

proshut LISA -C list

the semaphore errors occur when that user logs off. As long as the proadmin user does not log off the db stays up & running.

On our test server the issue went away with a yum update

Posted by George Potemkin on 01-Aug-2016 03:38

That is why the log files of the incidents should not be trimmed. ;-)

This thread is closed