Hi,
were running low on ideas. Since jul 13 our databases are crashing with semaphore errors.
History
things we tried
I don't know if it's related but I noticed in the latest crash that the time on one of the entries is in the wrong timezone
[2016/07/25@06:23:11.613+0200] P-32381 T-140189001459520 I ABL 86: (453) Logout by batchclnt on batch.[2016/07/25@06:25:04.592+0200] P-28093 T-140142107518784 I SHUT 88: (453) Logout by proadmin on /dev/pts/0. Mon Jul 25 04:25:04 2016[2016/07/25@04:25:04.000+0000] P-28093 T-140142107518784 I SHUT (562) HANGUP signal received. Mon Jul 25 06:28:41 2016[2016/07/25@06:28:41.616+0200] P-31604 T-140273689978688 I SRV 5: (1132) Invalid semaphore id[2016/07/25@06:28:41.616+0200] P-31604 T-140273689978688 I SRV 5: (10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 0, errno = 22.[2016/07/25@06:28:41.616+0200] P-31604 T-140273689978688 I SRV 5: (-----) semLockLog_2: semValue val = -1[2016/07/25@06:28:41.616+0200] P-31604 T-140273689978688 I SRV 5: (1132) Invalid semaphore id[2016/07/25@06:28:41.616+0200] P-31604 T-140273689978688 I SRV 5: (10839) SYSTEM ERROR: Unable to set semaphore set : semid = 2, errno = 22.[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (2520) Stopped.[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (1132) Invalid semaphore id[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 0, errno = 22.[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (-----) semLockLog_2: semValue val = -1[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (1132) Invalid semaphore id[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (10839) SYSTEM ERROR: Unable to set semaphore set : semid = 2, errno = 22.[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (2520) Stopped.[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (1132) Invalid semaphore id[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 0, errno = 22.[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (-----) semLockLog_2: semValue val = -1[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (1132) Invalid semaphore id[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (10839) SYSTEM ERROR: Unable to set semaphore set : semid = 2, errno = 22.[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (453) Logout by proadmin on batch.[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (1132) Invalid semaphore id[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 0, errno = 22.[2016/07/25@06:28:41.623+0200] P-31604 T-140273689978688 I SRV 5: (-----) semLockLog_2: semValue val = -1[2016/07/25@06:28:48.303+0200] P-27133 T-140230977447744 I BROKER 0: (-----) enssbrocon: child exited with 2, err=9[2016/07/25@06:28:48.303+0200] P-27133 T-140230977447744 I BROKER 0: (8837) 4GL server process terminated.[2016/07/25@06:29:28.457+0200] P-27133 T-140230977447744 I BROKER 0: (-----) enssbrocon: child exited with 2, err=9[2016/07/25@06:29:28.457+0200] P-27133 T-140230977447744 I BROKER 0: (8837) 4GL server process terminated.[2016/07/25@06:29:42.516+0200] P-27133 T-140230977447744 I BROKER 0: (-----) enssbrocon: child exited with 2, err=9[2016/07/25@06:29:42.516+0200] P-27133 T-140230977447744 I BROKER 0: (8837) 4GL server process terminated.
attacheg log file errors
Found the trigger for the issue and a solution.
The database is started using a script that basically does
sudo -u proadmin proserve LISA sudo -u proadmin probiw LISA sudo -u proadmin proapw LISA sudo -u proadmin proapw LISA
if someone logs on to the system with user proadmin using ssh, and executes
proshut LISA -C list
the semaphore errors occur when that user logs off. As long as the proadmin user does not log off the db stays up & running.
On our test server the issue went away with a yum update
posted trimmed logs here
https://ccenv-my.sharepoint.com/personal/cvb_cce_be/_layouts/15/guestaccess.aspx?guestaccesstoken=FEt3KSY0aDGyDDObljDV9pIax531FM2sJq3OwCJLByQ%3d&docid=078d8fc568fd04005a1f4097a3d6f7049&rev=1
> I don't know if it's related but I noticed in the latest crash that the time on one of the entries is in the wrong timezone
It's just a feature:
knowledgebase.progress.com/.../Database-stalls-on-Linux-because-self-service-OpenEdge-client-deadlocks-if-it-receives-hangup-signal-at-the-wrong-time
A partial fix is included in OpenEdge 11.2.1 and 11.3.x
As a side effect of the fix, messages from signal handlers will always show as UTC time instead of the database's timezone.
> (10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 0, errno = 22.
Errno 22 is an invalid argument. In this case the semid is invalid. I guess the other small values of semid are wrong as well.
Carl, can you create the list of semid's allocated by broker at startup?
promon/R&D/debghb/16. Semaphores (but first 5. Adjust Monitor Options / 1. Enter page length: 9999
[2016/07/13@22:01:49.085+0200] P-60282 T-140225063741248 I SRV 5: (794) Usernum 187 terminated abnormally. [2016/07/13@22:01:49.085+0200] P-60282 T-140225063741248 I SRV 5: (-----) TCP/IP write error occurred with errno 32 [2016/07/13@22:01:49.092+0200] P-60282 T-140225063741248 I SRV 5: (1132) Invalid semaphore id [2016/07/13@22:01:49.092+0200] P-60282 T-140225063741248 I SRV 5: (10839) SYSTEM ERROR: Unable to set semaphore set : semid = 7, errno = 22.
I saw these errors (TCP/IP write error occurred with errno 32) when all remote clients servers were napped trying to get MTX latch.Do you monitor the latch activity? You can do that using ProTop or my dbmon script.
The list of the errors from your log with the unique semid:
(10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 5, errno = 22. (10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = -32763, errno = 22. (10839) SYSTEM ERROR: Unable to set semaphore set : semid = -32762, errno = 22. (10839) SYSTEM ERROR: Unable to set semaphore set : semid = -32761, errno = 22. (10839) SYSTEM ERROR: Unable to set semaphore set : semid = 6, errno = 22. (10839) SYSTEM ERROR: Unable to set semaphore set : semid = 7, errno = 22.
Note that -32763, -32762, -32762 = 32768 minus 5, 6 and 7
(6526) Number of Semaphore Sets (-semsets): 3
Your database used 3 semsets and all their semid's became truncated to one byte.
Hi George,
thanks for your reply. Not yet monitoring latch activity, will look that up
# ipcs -s
------ Semaphore Arrays --------
key semid owner perms nsems
0x00000000 294912 proadmin 666 7
0x00000000 327681 proadmin 666 97
0x00000000 360450 proadmin 666 97
0x00000000 393219 proadmin 666 7
0x00000000 425988 proadmin 666 97
0x00000000 458757 proadmin 666 97
for the LISA database
07/25/16 Semaphores
09:07:11
User Id Num
0 327681 0
1 360450 0
2 327681 1
3 360450 1
4 327681 2
5 360450 2
6 327681 3
7 360450 3
8 327681 4
9 360450 4
10 327681 5
11 360450 5
12 327681 6
13 360450 6
14 327681 7
15 360450 7
16 327681 8
17 360450 8
18 327681 9
19 360450 9
20 327681 10
21 360450 10
22 327681 11
23 360450 11
24 327681 12
25 360450 12
26 327681 13
27 360450 13
28 327681 14
29 360450 14
30 327681 15
31 360450 15
32 327681 16
33 360450 16
34 327681 17
35 360450 17
36 327681 18
37 360450 18
38 327681 19
39 360450 19
40 327681 20
41 360450 20
42 327681 21
43 360450 21
44 327681 22
45 360450 22
46 327681 23
47 360450 23
48 327681 24
49 360450 24
50 327681 25
51 360450 25
52 327681 26
53 360450 26
54 327681 27
55 360450 27
56 327681 28
57 360450 28
58 327681 29
59 360450 29
60 327681 30
61 360450 30
62 327681 31
63 360450 31
64 327681 32
65 360450 32
66 327681 33
67 360450 33
68 327681 34
69 360450 34
70 327681 35
71 360450 35
72 327681 36
73 360450 36
74 327681 37
75 360450 37
76 327681 38
77 360450 38
78 327681 39
79 360450 39
80 327681 40
81 360450 40
82 327681 41
83 360450 41
84 327681 42
85 360450 42
86 327681 43
87 360450 43
88 327681 44
89 360450 44
90 327681 45
91 360450 45
92 327681 46
93 360450 46
94 327681 47
95 360450 47
96 327681 48
97 360450 48
98 327681 49
99 360450 49
100 327681 50
101 360450 50
102 327681 51
103 360450 51
104 327681 52
105 360450 52
106 327681 53
107 360450 53
108 327681 54
109 360450 54
110 327681 55
111 360450 55
112 327681 56
113 360450 56
114 327681 57
115 360450 57
116 327681 58
117 360450 58
118 327681 59
119 360450 59
120 327681 60
121 360450 60
122 327681 61
123 360450 61
124 327681 62
125 360450 62
126 327681 63
127 360450 63
128 327681 64
129 360450 64
130 327681 65
131 360450 65
132 327681 66
133 360450 66
134 327681 67
135 360450 67
136 327681 68
137 360450 68
138 327681 69
139 360450 69
140 327681 70
141 360450 70
142 327681 71
143 360450 71
144 327681 72
145 360450 72
146 327681 73
147 360450 73
148 327681 74
149 360450 74
150 327681 75
151 360450 75
152 327681 76
153 360450 76
154 327681 77
155 360450 77
156 327681 78
157 360450 78
158 327681 79
159 360450 79
160 327681 80
161 360450 80
162 327681 81
163 360450 81
164 327681 82
165 360450 82
166 327681 83
167 360450 83
168 327681 84
169 360450 84
170 327681 85
171 360450 85
172 327681 86
173 360450 86
174 327681 87
175 360450 87
176 327681 88
177 360450 88
178 327681 89
179 360450 89
180 327681 90
181 360450 90
182 327681 91
183 360450 91
184 327681 92
185 360450 92
186 327681 93
187 360450 93
188 327681 94
189 360450 94
190 327681 95
191 360450 95
192 327681 96
for the lisalrap database (we also had the crashes when the second db was not running)
09:10:34
0 425988 0
1 458757 0
2 425988 1
3 458757 1
4 425988 2
5 458757 2
6 425988 3
7 458757 3
8 425988 4
9 458757 4
10 425988 5
11 458757 5
12 425988 6
13 458757 6
14 425988 7
15 458757 7
16 425988 8
17 458757 8
18 425988 9
19 458757 9
20 425988 10
21 458757 10
22 425988 11
23 458757 11
24 425988 12
25 458757 12
26 425988 13
27 458757 13
28 425988 14
29 458757 14
30 425988 15
31 458757 15
32 425988 16
33 458757 16
34 425988 17
35 458757 17
36 425988 18
37 458757 18
38 425988 19
39 458757 19
40 425988 20
41 458757 20
42 425988 21
43 458757 21
44 425988 22
45 458757 22
46 425988 23
47 458757 23
48 425988 24
49 458757 24
50 425988 25
51 458757 25
52 425988 26
53 458757 26
54 425988 27
55 458757 27
56 425988 28
57 458757 28
58 425988 29
59 458757 29
60 425988 30
61 458757 30
62 425988 31
63 458757 31
64 425988 32
65 458757 32
66 425988 33
67 458757 33
68 425988 34
69 458757 34
70 425988 35
71 458757 35
72 425988 36
73 458757 36
74 425988 37
75 458757 37
76 425988 38
77 458757 38
78 425988 39
79 458757 39
80 425988 40
81 458757 40
82 425988 41
83 458757 41
84 425988 42
85 458757 42
86 425988 43
87 458757 43
88 425988 44
89 458757 44
90 425988 45
91 458757 45
92 425988 46
93 458757 46
94 425988 47
95 458757 47
96 425988 48
97 458757 48
98 425988 49
99 458757 49
100 425988 50
101 458757 50
102 425988 51
103 458757 51
104 425988 52
105 458757 52
106 425988 53
107 458757 53
108 425988 54
109 458757 54
110 425988 55
111 458757 55
112 425988 56
113 458757 56
114 425988 57
115 458757 57
116 425988 58
117 458757 58
118 425988 59
119 458757 59
120 425988 60
121 458757 60
122 425988 61
123 458757 61
124 425988 62
125 458757 62
126 425988 63
127 458757 63
128 425988 64
129 458757 64
130 425988 65
131 458757 65
132 425988 66
133 458757 66
134 425988 67
135 458757 67
136 425988 68
137 458757 68
138 425988 69
139 458757 69
140 425988 70
141 458757 70
142 425988 71
143 458757 71
144 425988 72
145 458757 72
146 425988 73
147 458757 73
148 425988 74
149 458757 74
150 425988 75
151 458757 75
152 425988 76
153 458757 76
154 425988 77
155 458757 77
156 425988 78
157 458757 78
158 425988 79
159 458757 79
160 425988 80
161 458757 80
162 425988 81
163 458757 81
164 425988 82
165 458757 82
166 425988 83
167 458757 83
168 425988 84
169 458757 84
170 425988 85
171 458757 85
172 425988 86
173 458757 86
174 425988 87
175 458757 87
176 425988 88
177 458757 88
178 425988 89
179 458757 89
180 425988 90
181 458757 90
182 425988 91
183 458757 91
184 425988 92
185 458757 92
186 425988 93
187 458757 93
188 425988 94
189 458757 94
190 425988 95
191 458757 95
192 425988 96
294912 = 0x48000 327681 = 0x50001 360450 = 0x58002 393219 = 0x60003 425988 = 0x68004 458757 = 0x70005
I guess Remote Client Servers get the semid's from db shared memory. When one of servers corrupts this information in shared memory then the other ones begin to report the 10839's. "Mad" server "updates" semid's for its clients. If it served more than one client it may corrupt more than one semid. Can you use the -Ma 1 as a workaround?
BTW, the first semset (the one with 7 semaphores) is used only during db startup and during logins/logouts of self-serviced clients. I guess you will not see the last byte of this semid in the 10839's
I ran dbmon.sh, output uploaded here
https://ccenv-my.sharepoint.com/personal/cvb_cce_be/_layouts/15/guestaccess.aspx?guestaccesstoken=DfptMKQvgq3MsCs2MEV32d6KU6hydxAMvMIwKVVWKOY%3d&docid=0b802bea945d04006b17f18a20cf1f814&rev=1
I found a (seemingly) similar issue reported by SteveSaltDanwood, I asked what the solution was and this is his reply
Carl,
We had been in discussions with Tech Support regarding this. The question refers to KB 61278, and our sys Admin has increased both SEMMSL, SEMMNS and SEMOPM slightly. The one thing that I did, was to add -semsets 6 to the Primary Login Broker pf file, and our issue disappeared - not sure whether this actually fixed it, but since the change on 10th Feb we have had no further instances.
Regards,
Steve
Carl, do you always pre-start 60 remote client servers immidiately after db startup?
[2016/07/13@23:48:13.123+0200] P-3986 T-140605298083648 I APW 82: (2518) Started. [2016/07/13@23:48:27.788+0200] P-4350 T-140402236524352 I SRV 1: (5646) Started on port 11002 using TCP IPV4 address 0.0.0.0, pid 4350. [2016/07/13@23:49:14.562+0200] P-4843 T-140184589739840 I SRV 2: (5646) Started on port 11003 using TCP IPV4 address 0.0.0.0, pid 4843. [2016/07/14@00:26:16.996+0200] P-17010 T-140595944453952 I ABL 87: (1132) Invalid semaphore id [2016/07/14@00:26:16.996+0200] P-17010 T-140595944453952 I ABL 87: (10839) SYSTEM ERROR: Unable to set semaphore set : semid = 7, errno = 22. [2016/07/14@00:26:16.996+0200] P-17010 T-140595944453952 I ABL 87: (2252) Begin transaction backout. [2016/07/14@00:26:16.996+0200] P-17010 T-140595944453952 I ABL 87: (2253) Transaction backout completed. [2016/07/14@00:26:17.002+0200] P-17010 T-140595944453952 I ABL (1132) Invalid semaphore id [2016/07/14@00:26:17.002+0200] P-17010 T-140595944453952 I ABL (1132) Invalid semaphore id [2016/07/14@00:26:17.002+0200] P-17010 T-140595944453952 I ABL (10839) SYSTEM ERROR: Unable to get value of semaphore set : semid = 5, errno = 22. [2016/07/14@00:26:17.002+0200] P-17010 T-140595944453952 I ABL (-----) semLockLog_2: semValue val = -1 [2016/07/14@00:26:19.903+0200] P-3966 T-140549975136064 I BROKER 0: (2527) Disconnecting dead user 87. [2016/07/14@00:55:14.379+0200] P-2288 T-139877706889024 I APW 82: (2518) Started. [2016/07/14@00:57:04.277+0200] P-3010 T-140641843181376 I SRV 2: (5646) Started on port 11003 using TCP IPV4 address 0.0.0.0, pid 3010. ... [2016/07/14@01:02:05.506+0200] P-4201 T-140565902128960 I SRV 5: (5646) Started on port 11006 using TCP IPV4 address 0.0.0.0, pid 4201. [2016/07/14@01:25:33.654+0200] P-2626 T-139742344922944 I APPSRV 91: (1132) Invalid semaphore id [2016/07/14@01:25:33.654+0200] P-2626 T-139742344922944 I APPSRV 91: (10839) SYSTEM ERROR: Unable to set semaphore set : semid = -32761, errno = 22.
It looks like sometimes only a few servers were started successfully and the semaphore errors were issued by the self-service clients:
Latch activity collected by dbmon is normal. And I guess db activity was next to zero right after db startup in the midnight. I suspect that starting many RCS at once can be a root case of the errors.Multiple semsets will decrease the competion for semset's during their launches.
No servers are prestarted, would that be better ?
Just had a call with Progress tech support, they suspect something is leaking file handles.
> No servers are prestarted
Db log shows that /sometimes/ a lot of servers were started but there were no remote users connected.
> Just had a call with Progress tech support, they suspect something is leaking file handles.
I'd run promon in background to check every second a few first lines on 16. "Semaphores" screen. If they are changed then the information in shared memory has got corrupted.
Ø I'd run promon in background to ..
When you will be able to catch the changes of the shmid's then the next step would be to generate the protrace files for all Progress sessions. And maybe it's not too late to do the same immediately after a first semaphore error in db log.
Lastest crash 25 minutes after db startup following reboot of server. increase of -semsets to 6 did not help.
Captured promon semaphore output 11 seconds before crash didn't look different than one after db start
ccenv-my.sharepoint.com/.../guestaccess.aspx
Was collecting semaphore output every minute.
Crash frequency increased (2 crashes last 2.5 hours) will stop collecting this data. May be unrelated but I wont take the risk
> Was collecting semaphore output every minute
Time interval was too long. One second would have more chances to catch the corruptions of semid's. But even one second can be too long.
What values had these semid's for last multi-user session?
Only 2 of 5 semid's were corrupted this time.