Troubleshooting why the UserID is not removed from shared memory after being disconnected:
When a hung process does not disconnect, gather the following information before killing the process (with caution), then log a support call with Progress Technical Support.
- Gather PROMON latch information:
- The UserID that was specified when the client was initially disconnected with PROSHUT or PROMON is important.
- If that UserID is shown in the "Owner" column for any of the latches in the "Latch Counts" output, then killing that user will cause the database to shut down (by design). Choose "U" to update the latch screen and confirm whether the user holds the latch for longer than a millisecond (you may simply have caught the user holding the latch at that instant), then re-confirm that the latch has not been acquired again. Unless the promon display page length has been adjusted, there are two pages of latches to check; press <Enter> to view the second page.
$ promon -NL <dbname> | tee promon.txt
1. User Control -> 1. Display all entries -> q
4. Record Locking Table -> 1. Display all entries -> q
-> R&D -> 1. Status Displays -> 4. Processes/Clients -> 2. Blocked Clients -> p (previous menu)
[-> R&D -> 1. Status Displays -> 4. Processes/Clients ] -> 3. Active Transactions -> t (top menu)
[-> R&D -> ] 1. Status Displays -> 6. Lock Table -> 1. Display all lock entries -> t (top menu)
[-> R&D -> ] debghb -> 6 (hidden option) -> 5. Locked Buffers -> p
[-> R&D -> debghb -> 6 (hidden option) ] -> 6. Buffer Locks -> p
[-> R&D -> debghb -> 6 (hidden option) ] -> 15. Buffer Lock Queue -> p
[-> R&D -> debghb -> 6 (hidden option) ] -> 8. Resource Queues -> p
[-> R&D -> debghb -> 6 (hidden option) ] -> 16. Semaphores -> p
[-> R&D -> debghb -> 6 (hidden option) ] -> 9. TXE Lock Activity -> p
[-> R&D -> debghb -> 6 (hidden option) ] -> 11. Latch Counts -> p
x
- Find the process ID (PID)
Find the PID of the hung process in the PROMON User Control screen. This information has already been gathered above; otherwise:
$ promon -NL <dbname>
1. User Control -> 2. Match a User Number -> <enter the hung user's UserID number>
- Use proGetStack to determine what the process is doing
Running proGetStack <PID> on the machine the user is connecting from generates a protrace.<PID> file for the process associated with the executable. This may require root/Administrator permissions. If the process no longer exists, an error indicates this. If the process is still running, run proGetStack a few more times, about 5-10 seconds apart. Each call appends a new protrace to the bottom of the same file.
Review the ABL Stack Trace in the output, looking at the top line to see whether it changes between calls.
** ABL Stack Trace **
--> foo.p at line 25369 (foo.r)
bar.p at line 4367 (bar.r)
Compile the program referenced in the top line of the ABL Stack, with the DEBUG-LIST option to find out what is happening at the line referenced (25369 in this example).
COMPILE foo.p DEBUG-LIST foo.dbg.
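The repeated proGetStack calls described in this step can be scripted. The following is a minimal sketch only: repeat_snapshot is a hypothetical helper name (not an OpenEdge utility), and it assumes proGetStack is on the PATH and that the PID has already been identified.

```shell
# Hypothetical helper: run a snapshot command several times with a pause
# between runs. Usage: repeat_snapshot COUNT DELAY_SECONDS COMMAND [ARGS...]
repeat_snapshot() {
    rs_count=$1
    rs_delay=$2
    shift 2
    rs_i=1
    while [ "$rs_i" -le "$rs_count" ]; do
        # Stop early if the command fails (for example, the process is gone).
        "$@" || return 1
        # Pause between snapshots, but not after the last one.
        if [ "$rs_i" -lt "$rs_count" ]; then
            sleep "$rs_delay"
        fi
        rs_i=$((rs_i + 1))
    done
}
```

For example, "repeat_snapshot 3 5 proGetStack <PID>" would take three snapshots five seconds apart, all appended to the same protrace.<PID> file.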
- Find what resources the process is currently using
Verify whether this process is using CPU resources. Replace "<PID>" with the PID obtained in Step 2 above.
UNIX: "ps -ef | grep <PID>".
WINDOWS: "tasklist -v | FINDSTR <PID>"
A process that appears to be a rogue process consuming a lot of CPU may be difficult to disconnect. If the process is holding a latch and has entered a loop, this may account for the inability to stop it.
If the client process has a Parent Process ID of 1, it is bound to the INIT process and the only resolution is to shut down the database. Refer to Article:
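On UNIX, the CPU and Parent Process ID checks in this step can be narrowed to the single process instead of grepping the full process list. This is a sketch under the assumption that the platform's ps accepts the POSIX -o and -p options; proc_summary, parent_pid and is_orphaned are hypothetical helper names.

```shell
# Show PID, parent PID, CPU percentage, elapsed time and command line
# for one process only.
proc_summary() {
    ps -o pid,ppid,pcpu,etime,args -p "$1"
}

# Print only the parent PID of the given process (empty if it is gone).
parent_pid() {
    ps -o ppid= -p "$1" | tr -d ' '
}

# Succeeds (exit status 0) when the parent PID of the process is 1.
is_orphaned() {
    [ "$(parent_pid "$1")" = "1" ]
}
```

For example, "is_orphaned <PID> && echo 'parent is init'" flags the case described above. Running proc_summary twice a few seconds apart gives a rough indication of whether the process is still consuming CPU.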
- Get at least 3 Stack traces
Try to get a stacktrace from the hung process. The following are some of the more basic methods; the commands available depend on the Operating System:
- (most UNIX systems): kill -16 PID > pid.txt
- (Solaris): pstack PID > pid.txt
- (AIX): procstack PID > pid.txt
- Use a different debugger such as gstack to generate the C stack
Since OpenEdge 10.1C, the following generate a protrace.<pid> file, including the ABL stack if available:
- UNIX: kill -SIGUSR1 PID
- WINDOWS: "%DLC%\bin\_debugConfig.exe" -getstack PID
- (On Windows, the same version of %DLC%\pdbfiles as the currently installed version + service pack/update + hotfix must be present in order to unwind the stack properly.) Refer to Article: What is the purpose of .PDB files?
- If a stacktrace is produced:
Run the command again at least 3 times to different files (or append to current file), so that the process state can be determined as static or changing between stacktrace snapshots.
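When each stacktrace is written to a different file, a byte comparison gives a quick first indication of whether the process state is static or changing. This is a sketch only: compare_snapshots is a hypothetical helper, and it assumes raw stack output such as that from pstack or procstack. Files that contain timestamps or other per-run headers will always differ and need to be reviewed by eye.

```shell
# Hypothetical helper: compare every snapshot file against the first one.
# Prints "static" if all files are byte-identical, "changing" otherwise.
compare_snapshots() {
    cs_first=$1
    shift
    for cs_file in "$@"; do
        if ! cmp -s "$cs_first" "$cs_file"; then
            echo "changing"
            return 0
        fi
    done
    echo "static"
}
```

For example, "compare_snapshots pid1.txt pid2.txt pid3.txt" prints "static" only when all three snapshots are identical.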
- If a stacktrace is not produced or the command hangs:
On UNIX: In the absence of stack information, generate a core file by using "kill -8 pid" as a last resort to gather information. As long as ulimit -c is not disabled and there are no suid settings or hard limits preventing core file generation, a core file provides information about the state of local and global variables in addition to the C stack.
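Before sending the signal, it can help to confirm that the hung process is allowed to write a core file at all. The limit that matters is the one of the hung process, not of the shell issuing the kill. The sketch below is Linux-specific (an assumption: it reads /proc/<pid>/limits, which other UNIX platforms do not provide), and core_limit_of is a hypothetical helper name.

```shell
# Linux only (assumption): print the soft core-file size limit of a running
# process. "0" means core files are disabled for that process.
core_limit_of() {
    # The line reads: "Max core file size  <soft>  <hard>  <units>",
    # so the soft limit is the fifth whitespace-separated field.
    awk '/^Max core file size/ { print $5 }' "/proc/$1/limits"
}
```

For example, if "core_limit_of <PID>" prints 0, "kill -8" will not produce a core file for that process.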
- Identify if the hung process is still attached to shared memory or a database file descriptor
On UNIX:
Review the list of file descriptors that the process is attached to. If any of the files listed is a database file, killing the process once information collection is complete may well cause the database to crash. Start by running "kill -15 <pid>" (SIGTERM) to ask the process to terminate gracefully; otherwise, more aggressive kill options might be needed.
On Windows, one way to identify the shared-memory or file handles of the process is the Microsoft Process Explorer referred to above:
View > Show Lower Pane [CTRL + L]
View > Lower Pane View > Handles [CTRL + H ]
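For the UNIX case in this step, one way to review the file descriptors is lsof. This is a sketch under assumptions: lsof is a separate tool that must be installed (it is not part of OpenEdge), open_files and filter_db_files are hypothetical helper names, and the extension pattern (.db, .lg, and numbered .a, .b and .d extents) is an illustrative guess at database file names that should be adjusted to the actual database structure.

```shell
# Assumption: lsof is installed. List the open files of one process.
open_files() {
    lsof -p "$1"
}

# Keep only lines whose file name looks like a database file.
# The extension list is an assumption for illustration only.
filter_db_files() {
    grep -E '\.(db|lg|[abd][0-9]+)$'
}
```

For example, "open_files <PID> | filter_db_files" lists only the candidate database files. Any output suggests the process is still attached to the database.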
- From the 'client-side' perspective:
Find out why the client session needed to be disconnected in the first place. This should not be a 'normal' action, especially when the client was doing work at the time. This information is not always easy to obtain, but the effort has proven to be key to ensuring that the situation does not repeat.
- What was the last thing the user remembers doing in their application workflow?
- What was the application session doing which caused the user to terminate their session or caused their session to hang?
- Exactly how was their session terminated? The termination actions the user tried at the time help to diagnose corrective actions:
- Did the user simply turn off their machine or exit their telnet session without first exiting their application session?
- Were untrappable kill signals involved, or did something else unexpectedly terminate the session (timeouts on the network / firewall / terminal)?
- Retain the database log file.
Include everything from the database startup prior to the issue until after the database shutdown (if applicable).
- Finally terminate the process, or shut down the database.
If this user holds any latches, the database will crash when this user dies. This will not corrupt the database as long as no other corruption is at play that would cause the BI redo and undo processing to fail when the database is next accessed. It also means that all currently running users will be kicked off and the database will shut down. Therefore, first consider gracefully disconnecting current users. An ABL script example is provided in Article:
The longer the database is left running, the longer BI recovery will take. For OpenEdge 11.6.1 or later databases, when the database is next started, first truncate the BI file with the -crStatus and -crTXDisplay parameters before restarting multi-user. For further information refer to Article: