PASOE - analyse and recover from high CPU problem ... runawa

Posted by mroberts@rev.com.au on 26-Mar-2018 17:26

Morning All,

PASOE 11.7.2 on Linux 64 bit environment

I have 3 PASOE instances that have migrated recently to 11.7.2 from 10.2B08, and are using PAS instead of classic appserver.  These are the first PASOE instances we are using in production.

Each one is now running high CPU, and I have applied my classic appserver methodology to hunt the offending code.

On each of the PASOE instances I have found an agent session that has been running for over a day, with a status of Executing and a line number of 0 on the Callstack ... sample below

"ABLStacks": [
{
"AgentSessionId": 4,
"Callstack": [
{
"Line": 0,
"Routine": "sys/da_getui01.p",
"Source": "/tunz/dealer/bl/sys/da_getui01.r"
}
],
"Status": "Executing"
}
]

For each PASOE instance, the program name differs, but the line number remains consistent with 0
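For anyone wanting to scan for these automatically, here is a minimal sketch in Python that picks out sessions matching the pattern above (Status of Executing, Line of 0). It assumes the fragment is wrapped in braces to make a complete JSON object and that the first Callstack entry is the current frame; neither assumption comes from the documentation, and `suspect_sessions` is an illustrative name only.

```python
import json

# The sample from this post, wrapped in braces so it parses as a JSON object.
SAMPLE = """
{
  "ABLStacks": [
    {
      "AgentSessionId": 4,
      "Callstack": [
        {
          "Line": 0,
          "Routine": "sys/da_getui01.p",
          "Source": "/tunz/dealer/bl/sys/da_getui01.r"
        }
      ],
      "Status": "Executing"
    }
  ]
}
"""


def suspect_sessions(payload):
    """Return (session id, routine, source) for Executing sessions at line 0."""
    suspects = []
    for stack in payload.get("ABLStacks", []):
        frames = stack.get("Callstack", [])
        if stack.get("Status") != "Executing" or not frames:
            continue
        top = frames[0]  # assumption: the first entry is the current frame
        if top.get("Line") == 0:
            suspects.append(
                (stack.get("AgentSessionId"), top.get("Routine"), top.get("Source"))
            )
    return suspects


if __name__ == "__main__":
    for session_id, routine, source in suspect_sessions(json.loads(SAMPLE)):
        print("session", session_id, "is executing", routine, "at line 0:", source)
```

This only flags candidates; how long a session has been in that state still has to come from elsewhere.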

So, something has gone bad, and the program is spinning, using LOTS of CPU.

The silver lining on this is that none of the end users have noticed, the PAS is still serving requests with no apparent loss of service to the users ... I like this!  It's an improvement from 10.2B

But now, I am trying to kill off the day-old threads, without resorting to an outage for the users.

In the "classic" days, I would have done a kill -SIGUSR1 on the PID, got the stack trace for follow up, disconnected the PID from the databases, and then used ever increasing severity of kill commands until the process was gone.
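The "ever increasing severity" part of that routine can be sketched roughly as below. This is an illustration only: the signal order (SIGTERM, then SIGKILL), the grace period and the names `pid_alive` and `escalate` are my assumptions, and the SIGUSR1 stack trace and database disconnect steps are deliberately left out because they have to happen first, outside the script.

```python
import os
import signal
import subprocess
import sys
import time


def pid_alive(pid):
    """Best-effort liveness check: signal 0 tests for existence only."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def escalate(pid, signals=(signal.SIGTERM, signal.SIGKILL),
             grace_seconds=5.0, alive=pid_alive):
    """Send each signal in turn, waiting grace_seconds for the process to exit.

    Returns (signals_sent, gone). The default signal order is an assumption.
    """
    sent = []
    for sig in signals:
        if not alive(pid):
            return sent, True
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            return sent, True  # exited between the check and the kill
        sent.append(sig)
        deadline = time.monotonic() + grace_seconds
        while time.monotonic() < deadline:
            if not alive(pid):
                return sent, True
            time.sleep(0.1)
    return sent, not alive(pid)


if __name__ == "__main__":
    # Demonstration against a harmless child process, not a real agent.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    sent, gone = escalate(child.pid, alive=lambda _pid: child.poll() is None)
    print("sent:", [s.name for s in sent], "gone:", gone)
```

The `alive` hook is there because a child of the calling script stays visible as a zombie until it is reaped, so the demo checks `poll()` instead of probing the PID.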

So I have tried the following on one of the PAS servers, using oemanager REST APIs

1) Trim idle sessions ... nothing happened (DELETE http://server/oemanager/applications/appid/agentid/sessions)

2) Terminate the session gracefully (DELETE http://server/oemanager/applications/appid/sessions?sessionID=sessid&terminatept=0) ... did not work

3) Terminate the session forcefully (DELETE http://server/oemanager/applications/appid/sessions?sessionID=sessid&terminatept=1) ... the session went away, but the thread running the program remained

4) Trim agent (DELETE http://server/oemanager/applications/appid/agents/agentid) ... the agent has been stuck in STOPPING for 12 hours, and the CPU is still impressive

Now I'm stuck, with high CPU and an agent stuck in STOPPING, and I'm afraid if I try anything else I might do more harm than good.  I don't want to wield the kill wand around and take the database out as well.

So, I'm going to arrange an outage for these 3 customers to stop everything cleanly and start stuff back up, but was wanting advice on the correct way to kill things off in this instance without resorting to an outage.

Things I have noticed are:

1) There does not seem to be an API to kill or stop a thread if required.  The only stop is for a session, and in this instance, stopping it did not stop the underlying thread

2) There is no API to start a new agent.  Before stopping an existing agent, I would really like to start a second one to have it handling calls before I stop the first one.  Once you stop an agent, there is a time gap in starting a replacement that could be filled, if I could preemptively start the replacement.  In my case above, the PAS instance started a secondary agent itself at some stage (no input from me), that meant I could stop the first one and be sure that the second one was handling requests

3) Documentation points towards jConsole and JMX as an alternative to oemanager, but the stuff I have read does not show any detail of what it offers, and my testing on a test server shows it exposes overall memory and CPU usage, but not down to thread level.  Also, the MBeans it exposes seem to offer the same controls that are available through oemanager.  I'm willing to learn more about this option, but documentation seems to be sparse.

So, I'm wondering if anyone out there has hit this hurdle yet, and what they might have done to clear away the offending threads without resorting to an outage for the users.  It's the thing about classic appserver I liked the best: a lot of things could go bad behind the scenes, and as long as the broker was still running, combinations of trim agents, start agents and OS-level killing could resolve them with end users being none the wiser.  I need to do this in PASOE now, and have run out of ideas!

Thanks

Mark

All Replies

Posted by mollyfed on 26-Mar-2018 17:58

Not that this answers a lot of the questions you asked, but did you see this knowledge base article that explains what "line 0" means:

knowledgebase.progress.com/.../000043822

Also, though it's hardly likely to be related here: if you are using REST services on the PASOE, you may want to get hotfix 04 for 11.7.2, as it deals with REST user sessions getting mixed up (according to the hotfix readme).

Posted by mroberts@rev.com.au on 26-Mar-2018 18:30

Thanks for that Molly,

I've always wondered about the 0 ... it's interesting that the 2 programs that have popped up with that typically don't receive much in incoming parameters, but if someone has screwed up the query then the output parameter may be huge.  I'm wondering if line 0 encompasses a ridiculously large output parameter as well?

I'll have a look at the hotfix 4 and see if it applies to us, I've got a test environment I can try it on.

Mark

Posted by Roy Ellis on 27-Mar-2018 07:52

Hi Mark,

there is a new'ish REST API to terminate a session at the agent level.

curl -v -X DELETE -u tomcat:tomcat http(s)://host:port/oemanager/applications/agents/agentId/sessions/sessionId

Of course, this is still limited by what the session is doing. If it is in an infinite loop, it may not stop.

FYI: We've been working with customers struggling to solve these types of issues and have added new features in 11.7.3 (out soon) to help

1) the ability to manually start an agent through the REST API (also JMX and JConsole)

2) the ability to stop an agent and specify how long you want it to wait for requests to complete before shutting down

3) and also when you don't want to wait any longer and force it to shut down

Let me know if terminate an agent session works for you.

Regards,

Roy

Posted by mroberts@rev.com.au on 27-Mar-2018 21:26

Thanks Roy,

I'm looking forward to the new bits in the API ... especially the start agent.  

Due to CPU issues, I had to shutdown the offending environments last night.

I ended up having to kill the _mrpoapsv processes with a kill -8 after disconnecting them from the DBs.

They really did not want to die.

When I get the opportunity, I will try to kill the agent session and let you know how it goes, I'm sure I'll get another chance next week :)

Thanks

Mark.

Posted by mroberts@rev.com.au on 28-Mar-2018 16:40

Hi Roy,

I got another chance earlier than I thought I would

I tried to terminate the agent session with the URL supplied, and it did not work.

The API returned TRUE, but the sessions are still ACTIVE.

I'm holding out for 11.7.3 now.

Mark

Posted by Roy Ellis on 29-Mar-2018 06:24

Hi Mark,

there may be something you can do in 11.7.2

In 11.7.2 there is a new setting minAgents= (by default set to 0)

1) set this value to 2:  minAgents=2

2) start 2 agents: numInitialAgents=2

So now you have 2 agents that start up with PASOE and minimum agents set to 2.  

You can now send an agent stop via the REST API (something like: curl -X DELETE //host_name:port/oemanager/applications/App_name/agents/agentID) to the agent with the offending ABL sessions.

This will cause the agent to stop taking requests and after 10 seconds it will shut down that agent (this should give most requests time to finish).  

There is still the extra, second agent which can start handling customer requests.

Also the minAgents will start a new agent giving you a spare for the next time you have to stop an agent.

Something you can try before you upgrade to 11.7.3 where you can manually start the extra agent.
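For anyone scripting the agent stop rather than typing the curl command, a rough Python equivalent is sketched here. The path layout follows the curl example above; the host, port and application name in the demo are placeholders I made up, the tomcat:tomcat credentials are just the ones from my earlier curl example, and `build_agent_stop_request` is an illustrative name. The sketch only builds the request and does not send it.

```python
import base64
from urllib.parse import quote
from urllib.request import Request


def build_agent_stop_request(base_url, app_name, agent_id, user, password):
    """Build (but do not send) the DELETE request that asks an agent to stop."""
    url = "{}/oemanager/applications/{}/agents/{}".format(
        base_url.rstrip("/"),
        quote(app_name, safe=""),
        quote(agent_id, safe=""),
    )
    credentials = "{}:{}".format(user, password).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return Request(url, method="DELETE",
                   headers={"Authorization": "Basic " + token})


if __name__ == "__main__":
    # Hypothetical host, port, application name and agent id.
    req = build_agent_stop_request(
        "http://host_name:8810", "App_name", "agentID", "tomcat", "tomcat"
    )
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req, timeout=30)
```

Keeping the build and send steps separate makes it easy to print and check the URL against the right agent before anything is stopped.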

Regards, Roy

Posted by mroberts@rev.com.au on 04-Apr-2018 20:03

This is a great workaround thanks Roy.

I still need to solve why the processes are running away, and work to kill them properly (11.7.3 will hopefully get us some of the way)

But for now, having multiple agents serving the requests and having a "spare" at any one time takes the stress out of the equation

For the agent stop API ... I have found that 2 attempts to stop seem to partly work.  The first call leaves the agent in STOPPING (indefinitely) ... a second call removes the agent from REST API visibility, the replacement agent starts up, and things get back to normal.  The agent that I stopped twice still exists on the server, but at that stage I can kill it with a kill -8.

It's a workaround, but it is going OK for now.

Thanks Again

Mark.

This thread is closed