In November last year we migrated
Everything went according to plan and we got a huge performance boost:
A huge success, everything was good …
… until runtime clients on Windows 7 Enterprise systems randomly started to get SSL connection errors when trying to connect to the database. So far, so bad, but the situation gets worse over time: zombie remote server processes render the database unavailable for connections for as long as 17-plus minutes.
This is the error the clients are getting:
SSL error 12072 - SSL Client handshake failure (336130315) SSL routines occurred. (12168)
Error starting SSL handshake with the OpenEdge database server. (12167)
This is what I can see in the database log file:
[2020/02/17@09:31:33.820+0100] P-1045937 T-140231528724288 I SRV 13: (12151) SSL error 12067 - SSL accept failed occurred.
[2020/02/17@09:31:33.820+0100] P-1045937 T-140231528724288 I SRV 13: (12154) Error while attempting to create the SSL Client instance.
[2020/02/17@09:49:03.940+0100] P-1045937 T-140231528724288 I SRV 13: (1334) Rejecting login -- too many users for this server.
[2020/02/17@09:49:03.940+0100] P-1045937 T-140231528724288 I SRV 13: (-----) User count inconsistency detected: usrcnt=5 users=15
[2020/02/17@09:49:03.940+0100] P-1045937 T-140231528724288 I SRV 13: (-----) User count corrected: usrcnt=15 users=15
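To measure how long a remote server stays in the zombie state, the log excerpt above can be scanned for the first 12151 error per server process and the later "User count corrected" message from the same process. The following is only a rough sketch in Python: the regular expression is inferred from the line layout shown above, and the helper names (`parse_line`, `zombie_windows`) are hypothetical, not part of any OpenEdge tooling.

```python
import re
from datetime import datetime

# Layout inferred from the excerpt above (an assumption, not an official spec):
# [timestamp] P-<pid> T-<thread> <level> <source> <n>: (<msgnum>) <text>
LINE_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\]\s+P-(?P<pid>\d+)\s+T-\d+\s+\S+\s+"
    r"\S+\s+\d+:\s+\((?P<num>[^)]+)\)\s+(?P<text>.*)"
)


def parse_line(line):
    """Return (timestamp, pid, message number, text), or None if no match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%Y/%m/%d@%H:%M:%S.%f%z")
    return ts, m.group("pid"), m.group("num"), m.group("text")


def zombie_windows(lines):
    """Pair the first 12151 error of a process with its later user count fix.

    Returns a list of (pid, seconds between the two events).
    """
    first_error = {}
    windows = []
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue
        ts, pid, num, text = rec
        if num == "12151":
            # Remember only the first SSL accept failure per process.
            first_error.setdefault(pid, ts)
        elif text.startswith("User count corrected") and pid in first_error:
            start = first_error.pop(pid)
            windows.append((pid, (ts - start).total_seconds()))
    return windows


SAMPLE = [
    "[2020/02/17@09:31:33.820+0100] P-1045937 T-140231528724288 I SRV 13: (12151) SSL error 12067 - SSL accept failed occurred.",
    "[2020/02/17@09:49:03.940+0100] P-1045937 T-140231528724288 I SRV 13: (1334) Rejecting login -- too many users for this server.",
    "[2020/02/17@09:49:03.940+0100] P-1045937 T-140231528724288 I SRV 13: (-----) User count corrected: usrcnt=15 users=15",
]

if __name__ == "__main__":
    for pid, seconds in zombie_windows(SAMPLE):
        print(f"P-{pid}: {seconds / 60:.1f} minutes between 12151 and user count fix")
```

For the three sample lines, the gap works out to roughly 17.5 minutes, which matches the outage length described above. Run against a full log file, the same pairing would show whether every 12151 error leads to a window of similar length.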
Relevant database startup parameters - (250 concurrent remote clients max.):
Therefore I opened a TechSupport case. It turns out that the issue that causes a remote server to become a zombie as soon as it encounters a 12151 error is a product defect (OCTA-19107). The TechSupport engineer was able to reproduce it on 11.7.5, and a fix is currently expected to make it into 11.7.6, which is scheduled for sometime in Q2/2020.
What we’ve found out so far is that the 12151 error turns the database remote server into a zombie, while the database broker still forwards connection requests to that zombie remote server. These all fail for minutes, until the same server adjusts its user count to the maximum, at which point the clients get the error that the server has no more resources.
Still, I have not found out what causes the initial SSL connection error on a given remote server, so that it encounters the 12151 error that turns it into a zombie server, or what causes the database broker to adjust the user count to its maximum some time later.
Nevertheless - until I have found the root cause and OpenEdge 11.7.6 is released - I need a strategy to mitigate the problem in some way, and this is what I’ve come up with:
Any thoughts are welcome!
Thanks in advance.
You have minport 8400 and maxport 8499, which allows for only 99 network clients if all those ports are free (which they may not be).
Thanks for your reply. Yes, those ports (8400 to 8499) are free on the system and should be enough to start 80 remote servers.
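As a quick sanity check of the port arithmetic, here is a minimal Python sketch. It assumes that each remote server takes exactly one port from the minport/maxport range and that the range is inclusive at both ends. The function name `ports_in_range` is made up for illustration, and the check says nothing about whether the ports are actually free on the host.

```python
def ports_in_range(minport, maxport):
    """Count the ports in an inclusive range (assumption: both ends usable)."""
    if maxport < minport:
        raise ValueError("maxport must not be smaller than minport")
    return maxport - minport + 1


MINPORT = 8400        # value quoted in the thread
MAXPORT = 8499        # value quoted in the thread
REMOTE_SERVERS = 80   # number of remote servers mentioned in the reply

available = ports_in_range(MINPORT, MAXPORT)
print(f"{available} ports available for {REMOTE_SERVERS} remote servers")
print("range is large enough:", available >= REMOTE_SERVERS)
```

Under these assumptions the range contains 100 ports, which leaves some headroom over the 80 remote servers.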
If the fix is expected to be in 11.7.6, which is due soon, I'd ask TS whether a hotfix can be provided for an earlier 11.7 SP. Then you can have the fix rather than try to find a workaround, without having to wait for 11.7.6.