Information

Title	Troubleshooting why PROSHUT does not remove the userid from the PROMON User Control Table

URL Name	P69391

Article Number	000141997

Environment	Product: Progress Version: 9.x Product: OpenEdge Version: All Supported Versions OS: All Supported Platforms

Question/Problem Description

Troubleshooting why a client session hangs after being disconnected with PROMON or PROSHUT
Troubleshooting why after being disconnected, the process's UserID is not removed from the shared-memory
Troubleshooting why a user is still showing in the PROMON User Control Table after being disconnected

User is not shown in PROSHUT User List: proshut <dbname> -C list
User is still seen in the PROMON User Control list

PROSHUT -C DISCONNECT was used to disconnect an active client session.
Creating, updating or deleting a record causes the client to hang, and the client cannot be disconnected from the database.
Client process disconnects from the database but is not removed from the database shared-memory.
The client process (_progres, prowin, _proapsv) is hung and cannot be terminated
Client process is still running in the UNIX process list (ps -ef) or Windows tasklist
The Client process PID is still found in the UNIX process list (ps -ef) or Windows Task manager (tasklist)
The parent PID of the client process may be 1 or Init

Steps to Reproduce

Clarifying Information

Error Message

Defect Number	Enhancement PSC00142267 / OE00092240

Enhancement Number

Cause

There are many reasons why a process may not disconnect properly. The best approach is therefore to gather as much information as possible (specifically, multiple stack traces of the process) before killing the process at the operating system level (with caution). Without this information it is virtually impossible to determine the cause of the hung process.

Additional signal logging was added in Progress 9.1D09, 9.1E and OpenEdge 10.1A to help development determine the state a hung process is in. This logging shows which signals are being sent to a process and how the process reacts to those signals. For example:

When a remote client is disconnected, lines 2-4 below are the additional debug messages
When a local client is disconnected, lines 2 and 4 below are the additional debug messages

$   proshut dbname -C disconnect 24

1. SHUT    6: (-----) User 24 disconnect initiated
2. BROKER  0: (-----) Notifying srvr 1, conn 1, to terminate remote user 24
3. BROKER  0: (-----) Sending signal <n> to pid <pid> for user 24
4. SRV     1: (1166)  Server disconnecting user 24.
5. SRV     1: (739)   Logout usernum 24, userid <>, on <>
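
To pull these messages for a single user out of the database log, a filter along the following lines may help. This is a minimal sketch, assuming the log is the usual <dbname>.lg file and that the messages are worded as in the example above; the function name user_log_lines is hypothetical.

```shell
# Print database log lines that mention a given user number.
# Assumes the message wording shown in the example above
# ("User 24", "user 24", "usernum 24"); user_log_lines is a hypothetical helper.
user_log_lines() {
    usernum=$1
    logfile=$2
    # Match "user N" or "usernum N" in any case, but not "user N5".
    grep -iE "(user|usernum) ${usernum}([^0-9]|\$)" "$logfile"
}

# Example: user_log_lines 24 dbname.lg
```

Anchoring the number with a non-digit or end of line keeps user 24 from also matching user 240 or 245.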

Resolution

Troubleshooting why after being disconnected, the UserID is not removed from the shared-memory:

When a hung process does not disconnect, gather the following information prior to killing the process (with caution) then log a support call with Progress Technical Support.

Gather PROMON latch information:

The UserID that was initially used to disconnect the client with PROSHUT or PROMON is important.
If that userid is shown in the "Owner" column for any of the latches in the "Latch Counts" output, then killing that user will cause the database to shut down (by design). Choose "U" to refresh the latch screen and confirm whether the user is holding the latch for longer than a millisecond (the first display may simply have caught the user holding a latch at that instant), then re-confirm that the latch has not been acquired again. Unless the promon display page length has been adjusted, there are two pages of latches to check; press <Enter> to view the second page.

$ promon -NL <dbname> | tee promon.txt

1.  User Control -> 1.  Display all entries -> q
4. Record Locking Table -> 1. Display all entries -> q
-> R&D -> 1. Status Displays -> 4. Processes/Clients -> 2. Blocked Clients -> p (previous menu)
[-> R&D -> 1. Status Displays -> 4. Processes/Clients ] -> 3. Active Transactions -> t (top menu)
[-> R&D -> ] 1. Status Displays -> 6. Lock Table -> 1. Display all lock entries -> t (top menu)
[-> R&D -> ] debghb -> 6 (hidden option) -> 5. Locked Buffers -> p
[-> R&D -> debghb -> 6 (hidden option) ] -> 6. Buffer Locks -> p
[-> R&D -> debghb -> 6 (hidden option) ] -> 15. Buffer Lock Queue -> p
[-> R&D -> debghb -> 6 (hidden option) ] -> 8. Resource Queues -> p
[-> R&D -> debghb -> 6 (hidden option) ] -> 16.  Semaphores -> p
[-> R&D -> debghb -> 6 (hidden option) ] -> 9. TXE Lock Activity -> p
[-> R&D -> debghb -> 6 (hidden option) ] -> 11. Latch Counts -> p
x

Find the process ID (PID)

Find the process ID (PID) of the hung process from the PROMON User Control screen:

This information has already been gathered above, otherwise:

$   promon -NL <dbname> 
> 1.  User Control.  > 2.  Match a User Number  >  <Enter the Hung User's number UserID >.

Use proGetStack to determine what the process is doing

Running proGetStack <PID> on the machine where the user is connecting from generates a protrace.<PID> file for the process associated with the executable. This may require root/Administrator permissions. If the process no longer exists, an error indicates this. If the process is still running, run proGetStack a few more times, about 5-10 seconds apart; each call appends a new stack trace to the bottom of the same file.
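
The repeated capture can be scripted. The following is a minimal sketch, assuming proGetStack is on the PATH; the function name capture_stacks and the PROGETSTACK override variable are hypothetical and exist only so another command can be substituted.

```shell
# Capture several stack snapshots of one process, a few seconds apart.
# PROGETSTACK is a hypothetical override variable; it defaults to proGetStack,
# which appends each new trace to protrace.<PID>.
capture_stacks() {
    pid=$1
    count=${2:-3}    # number of snapshots
    delay=${3:-5}    # seconds between snapshots
    i=1
    while [ "$i" -le "$count" ]; do
        "${PROGETSTACK:-proGetStack}" "$pid" || return 1
        if [ "$i" -lt "$count" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
}

# Example: capture_stacks 12345 3 10
```

The function stops at the first failed call, which is what happens when the process has already gone away.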

Review the ABL Stack Trace in the output and check whether the top line changes at all between snapshots.

** ABL Stack Trace **

--> foo.p at line 25369  (foo.r)
    bar.p at line 4367 (bar.r)

Compile the program referenced in the top line of the ABL Stack, with the DEBUG-LIST option to find out what is happening at the line referenced (25369 in this example).

COMPILE foo.p DEBUG-LIST foo.dbg.
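
To see quickly whether the top line changed between snapshots, the top frames can be extracted from the protrace file and de-duplicated. This is a sketch only, assuming every snapshot marks its top ABL frame with "-->" as in the example above; distinct_top_frames is a hypothetical helper name.

```shell
# Count the distinct top ABL frames recorded in a protrace file.
# Assumes each snapshot marks its top frame with "-->" as in the example above;
# distinct_top_frames is a hypothetical helper.
distinct_top_frames() {
    n=$(grep -e '-->' "$1" | sed 's/^.*--> *//' | sort -u | wc -l | tr -d ' ')
    echo "$n"
}

# Example: distinct_top_frames protrace.12345
```

A result of 1 means the process was at the same program line in every snapshot; a larger number means the top frame moved between snapshots.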

Find what resources the process is currently using

Check whether this process is using CPU resources. Replace "<PID>" with the process ID obtained from the PROMON User Control screen above.

UNIX: "ps -ef | grep <PID>"
WINDOWS: "tasklist -v | FINDSTR <PID>"

Disconnecting a process that appears to be a rogue process consuming a lot of CPU may be difficult: if that process is holding a latch and has somehow entered a loop, this may account for the inability to stop it.

If the client process has a Parent Process ID of 1, it is bound to the INIT process and the only resolution is to shut down the database. Refer to Article:

User process bound to UNIX init process is locking a record and cannot be released
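
On UNIX, the CPU and parent-PID checks above can be combined into one call. The sketch below assumes a ps that supports the POSIX -o and -p options; the function names proc_snapshot, is_init_child and check_process are hypothetical.

```shell
# Print PID, parent PID, %CPU and cumulative CPU time for one process
# (POSIX ps output columns; headers suppressed with "=").
proc_snapshot() {
    ps -o pid= -o ppid= -o pcpu= -o time= -p "$1"
}

# Succeed when a parent PID is 1, i.e. the process has been adopted by init.
is_init_child() {
    [ "$1" -eq 1 ]
}

# Show the snapshot and warn when the process is bound to init.
check_process() {
    pid=$1
    snap=$(proc_snapshot "$pid")
    if [ -z "$snap" ]; then
        echo "PID $pid not found"
        return 1
    fi
    echo "$snap"
    set -- $snap    # split into fields: $1=pid, $2=ppid, ...
    if is_init_child "$2"; then
        echo "PID $pid has parent PID 1 (init)"
    fi
}

# Example: check_process 12345
```

Running check_process twice a few seconds apart and comparing the cumulative CPU time shows whether the process is still consuming CPU.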

Get at least 3 Stack traces

Try to get a stacktrace from the hung process. These are a couple of the more basic methods of doing this and the commands available will depend on the Operating System:

(most UNIX's): kill -16 PID > pid.txt
(Solaris): pstack PID > pid.txt
(AIX): procstack PID > pid.txt
Use a different debugger like gstack to generate the C stack

Since OpenEdge 10.1C: a protrace.<pid> including the ABL stack if available.

UNIX: kill -SIGUSR1 PID
WINDOWS: "%DLC%\bin\_debugConfig.exe" -getstack PID
(On Windows, the same version of %DLC%\pdbfiles as the currently installed version + service pack/update + hotfix must be present in order to unwind the stack properly.) Refer to Article: What is the purpose of .PDB files?

If a stacktrace is produced:

Run the command again at least 3 times to different files (or append to current file), so that the process state can be determined as static or changing between stacktrace snapshots.
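
A sketch of capturing three snapshots to separate files with the operating system tools listed above. The choice of gstack as the fallback on other platforms is an assumption (it must be installed separately), and the names stack_tool and snapshot_to_files are hypothetical.

```shell
# Pick a stack-dump command for the current platform, following the list above.
# gstack as the fallback is an assumption; it must be installed separately.
stack_tool() {
    case "${1:-$(uname -s)}" in
        SunOS) echo pstack ;;
        AIX)   echo procstack ;;
        *)     echo gstack ;;
    esac
}

# Write three snapshots, 5 seconds apart, to stack.<pid>.1.txt .. stack.<pid>.3.txt.
snapshot_to_files() {
    pid=$1
    tool=$(stack_tool)
    for n in 1 2 3; do
        "$tool" "$pid" > "stack.$pid.$n.txt" 2>&1
        sleep 5
    done
}

# Example: snapshot_to_files 12345
```

Separate files make it easy to diff consecutive snapshots and see whether the process state is static or changing.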

If a stacktrace is not produced or the command hangs:

On UNIX: In the absence of stack information, generate a core file by using "kill -8 pid" as a last resort to gather information. As long as ulimit -c is not disabled and there are no suid or hardlimits preventing core file generation, a core file also provides information about the state of local and global variables in addition to the C-stack.
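
The core limit that matters is the one in effect for the hung process, not for the shell issuing the kill. On Linux specifically, that limit can be read from /proc/<pid>/limits before resorting to "kill -8". This is a Linux-only sketch and core_soft_limit is a hypothetical helper; other UNIX platforms need a different method.

```shell
# Print the soft core-file size limit from a Linux /proc/<pid>/limits listing.
# A value of 0 means the process cannot write a core file.
core_soft_limit() {
    awk '/^Max core file size/ { print $5 }' "$1"
}

# Example: core_soft_limit /proc/12345/limits
```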

On Windows use Microsoft Process Explorer. (https://technet.microsoft.com/en-gb/sysinternals/bb896653.aspx)

Identify if the hung process is still attached to shared memory or a database file descriptor

On UNIX:

$   lsof -p <pid>

Review the list of file descriptors that the process has open. If any of the files listed is a database file, killing the process once information collection is complete may well cause the database to crash. Start by running "kill -15 <pid>" (SIGTERM) to ask the process to terminate gracefully; otherwise more aggressive kill options might be needed:

Guidelines on the use of UNIX kill command to stop a process
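
To narrow the lsof output down to likely database files, the listing can be filtered by file extension. This is a minimal sketch; the extension list (.db, .lg, .lk, .bN, .dN, .aN) is an assumption about how the database files are named, and db_files_only is a hypothetical helper.

```shell
# Keep only lsof lines whose file name looks like an OpenEdge database file.
# The extension list (.db, .lg, .lk, .bN, .dN, .aN) is an assumption about naming.
db_files_only() {
    grep -E '\.(db|lg|lk|[abd][0-9]+)$'
}

# Example: lsof -p <pid> | db_files_only
```

No output suggests the process has no database files open, although shared-memory attachments should still be reviewed in the full listing.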

On Windows, one way to identify the process shared-memory or file handles is the Microsoft Process Explorer referred above:

View > Show Lower Pane [CTRL + L]
View > Lower Pane View > Handles [CTRL + H ]

From the 'client-side' perspective:

Find out why the client session needed to be disconnected in the first place. This should not be a 'normal' action, especially when the client was doing work at the time. This information is not always easy to obtain, but the effort in obtaining it has proven to be key to ensuring that the situation does not repeat.

What was the last thing the user remembers doing in their application workflow?
What was the application session doing which caused the user to terminate their session or caused their session to hang?
Exactly how was their session terminated? The subsequent termination actions tried by the user at the time helps to further diagnose corrective actions:
- Did the user simply turn off their machine or exit their telnet session without first exiting their application session?
- Were untrappable kill signals involved, or did something else unexpectedly terminate their session for them (timeouts on the network / firewall / terminal)?

Retain the database log file.

Inclusive of the database startup prior to the issue all the way until after the database shutdown (if applicable).

Finally terminate the process, or shut down the database.

If any latches are held by this user, the database will crash when this user dies. This will not corrupt the database, provided no other corruption is at play that would cause the BI redo and undo processing to fail when the database is next accessed. It does mean that all currently running users will be kicked off and the database will shut down. Therefore, consider gracefully disconnecting current users first. An ABL script example is provided in Article:

How to disconnect all self and remote users from a database?

The longer the database is left running, the longer BI recovery will take. For OpenEdge 11.6.1 or later databases, when the database is next started, first truncate the BI file with the -crStatus and -crTXDisplay parameters before restarting multi-user. For further information refer to Article:

How to estimate how long BI Recovery will take to complete

Workaround

Notes

Progress Articles:

Orphan Remote Clients cannot be disconnected resulting in bi growth
Explanation of the Lock Table flags in promon
How to disconnect a database user with PROSHUT or PROMON
Record Locking - How to know who has locked the record using VST?

Keyword Phrase

Last Modified Date	9/16/2021 1:52 PM
