Cannot terminate the Replication Server when the process is blocked or when the pica buffer is full

Title: Cannot terminate the Replication Server when the process is blocked or when the pica buffer is full
URL Name: P150061
Article Number: 000139429
Environment
Product: OpenEdge Replication
Version: 10.x, 11.0 to 11.7.5, 12.0x, 12.1.0, 12.2x
OS: All supported platforms
Question/Problem Description
Cannot stop the Replication Server process when the pica buffer (RPLS-Q) is full.

Replication Server process doesn't terminate when instructed with: DSRUTIL -C terminate server

The DSRUTIL -C terminate server instruction completes and returns to the command prompt, but the RPLS process does not stop:
The Replication Server has been instructed to shutdown

The RPLS does not terminate when no Free Message Entries are shown for the current RPLS-Q:
PROMON -> R&D -> 1. Status Displays -> 16. Database Service Manager

The RPLS also does not terminate when messaging between the RPLS and RPLA is blocked, even though Free Message Entries are still available in the RPLS-Q.
Delays in the network stack or network problems can prevent the RPLS from receiving the TCP messages indicating that the RPLA is no longer receiving messages. This exhausts the Free Message Entries and stalls production.

When the next instruction to terminate the replication server process is sent, the process hangs and does not return to the command prompt
Steps to Reproduce
1. Start the source database with a small -pica value, for example -pica 128
2. Run an intensive update, for example:

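/* Copy the last customer record 400,000 times inside a single
   transaction to generate heavy after-image (AI) write activity. */
DEF VAR i AS INT NO-UNDO.
DEF BUFFER bcust FOR customer.

DO TRANSACTION:
    FIND LAST customer NO-LOCK.

    DO i = 1 TO 400000:
        CREATE bcust.
        BUFFER-COPY customer EXCEPT custnum TO bcust.
    END.
    DISPLAY i.
END.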

3. As soon as the RPLS-Q is full:

$ dsrutil <source> -C terminate server // RPLS does not stop
$ proshut <source> -C disconnect <usernum> // RPLS does not stop
$ kill -1 (SIGHUP) ; kill -2 (SIGINT) ; kill -15 (SIGTERM)
$ kill -8 (SIGFPE) ; kill -9 // terminates RPLS pid, which may cause a dbdown

To mimic a network communications block, suspend and then resume the RPLA process around the terminate instruction:
LINUX/UNIX: kill -STOP <agent-pid> / dsrutil -C terminate server / kill -CONT <agent-pid>
WINDOWS: suspend and resume rpagent with the resmon GUI [WIN+R : resmon] or use Microsoft Sysinternals [ pssuspend rpagent / dsrutil -C terminate server / pssuspend -r rpagent ]
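
On Linux/UNIX, the suspend / terminate / resume sequence above can be wrapped in a small shell helper so that the agent is resumed once the command returns. This is only a sketch: suspend_during is a hypothetical function name, and the agent PID and the command to run are assumed to be supplied by the caller.

```shell
# Hypothetical helper: suspend a process, run a command, then resume it.
# Usage: suspend_during <pid> <command> [args...]
suspend_during() {
    pid=$1
    shift
    # SIGSTOP freezes the process, mimicking a blocked network peer.
    kill -STOP "$pid" || return 1
    rc=0
    "$@" || rc=$?
    # SIGCONT resumes the process whether or not the command succeeded.
    kill -CONT "$pid"
    return "$rc"
}
```

For example: suspend_during <agent-pid> dsrutil <source> -C terminate server. If the wrapped command itself hangs, the agent stays suspended until kill -CONT is sent manually.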
Clarifying Information
No error messages are displayed on screen or exist in the database log file
Setting -asc 501 to turn off acknowledgements, or -asc 10 to increase acknowledgement messages, does not always make a difference

Cannot proshut the replication target when the RPLA does not terminate; the proshut hangs
Cannot proshut the replication source when the RPLS does not terminate; the proshut hangs
Production source hangs all transaction processing when all AI files are LOCKED
Production source may hang further transaction processing when there are available AI files but there is no AI write activity

A source or target database with client/server connections may show Pending Connections being cleared or dead remote servers at the time (messages 2526, 2525, 12454 and 12455):
(2526)  Disconnecting client <usernum> of dead server <servernum>. 
(2525)  Disconnecting dead server <servernum>. 
(12454) Server <servernum> has <n> unresolved pending connections(s). Please check port <serverport>
(12455) Clearing pending connections from server <servernum>.
Error Message
Waiting 30 seconds for Replication Server to process last request.

The wait for a previous request to be processed has expired, this request is canceled.
Failed to instruct Replication Server to shutdown, error:-187.
Defect Number: Enhancement OE00188363 / PSC00217061 / OCTA-2584
Cause
The network bandwidth between the source and target databases is insufficient to deliver TCP/IP messages from the RPLS to the RPLA for processing during periods of high OLTP activity. As a result, the RPLS-Q fills because the current -pica value is undersized for the environment's periods of high write activity. The remaining capacity is shown in:
  1. PROMON -> R&D -> 1. Status Displays -> 16. Database Service Manager : Free Message Entries
  2. _DbSvcMgr-FreeMsgEntries VST
  • When the buffer between the RPLS and RPLA fills to capacity (RPLS-Q), the RPLS process writing a TCP/IP message for the RPLA is blocked until that message fits into the TCP/IP buffer
  • When the RPLS is blocked, it cannot check for messages from DSRUTIL and so remains unaware of the 'terminate server' request
  • When the RPLS-Q is full, pointers to newly filled AI blocks cannot be written, which stalls source database activity.
Another cause is a dead-socket condition that blocks replication communications. While this situation persists, there is no RPLS-Q processing and the database stalls because the last filled after-image block cannot be added to the RPLS-Q for processing. In other words, in this case the RPLS-Q is not full.
Resolution
Upgrade to OpenEdge 11.7.6, 12.2.1.0 or 12.3, where the OCTA-2584 enhancements are available.

Terminating the replication server or agent was re-worked to be consistent with the existing ability to disconnect servers and agents via promon or proutil -C disconnect <n>. Once upgraded, DSRUTIL (rprepl) will be able to terminate the Replication Server regardless:
A. Of the state of the PICA queue, and 
B. Of a "blocking" network communication message. 
C. In addition, the Replication Agent will also be able to terminate when it is waiting on an acknowledgement from the server (RPLS).

It is important to understand that, while this fix hardens the OpenEdge side of the stack, the underlying problem to be solved is the environment's network. The enhancement ensures that replication can be stopped so that production operations can resume (AI files will move to LOCKED status when switched) until the network problem is resolved and the RPLS and RPLA can communicate again after being restarted.

It is advised to periodically monitor the _DbSvcMgr-FreeMsgEntries VST. Bear in mind that a low Free Message Entries value will not alert to a dead-socket condition that blocks replication communications. In that case the _DbServiceManager entries typically remain static, or the "Access Count" keeps incrementing with no movement in Used/Free Messages.
Workaround

A.  Upgrade to OpenEdge 10.2B08, 11.2, 11.3 or later 

The maximum value for -pica has been increased to 1000000. 

It is not recommended to use the maximum value, as this will cause longer synchronization times at startup or at re-connection after a failure. Instead, have the network and the target machine performance checked, then recalculate the optimum -pica value for the environment's periods of high write activity. A higher -pica value reduces the likelihood of the pica queue becoming full, and therefore of being unable to terminate the RPLS process or of source database OLTP activity stalling. For an example, refer to Article: How to calculate the optimum -pica setting for OpenEdge Replication

B.  Before terminating the Replication Server 

1. First determine if the RPLS-Q is full before terminating the RPLS process:

PROMON > R&D > 1. Status Displays > 16. Database Service Manager
Status: Database Service Manager
    Free Message Entries    :          < this value >

Further details are provided in Article: How to monitor the message replication queue set with the -pica parameter.
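
The check in step 1 can be scripted once the Database Service Manager screen text has been captured. The following is a minimal sketch: free_msg_entries and rpls_queue_state are hypothetical helper names, the input is assumed to contain a line in the form shown above ("Free Message Entries : <n>"), and the low-water threshold of 10 is an arbitrary assumption, not a documented value.

```shell
# Hypothetical helper: read captured Database Service Manager screen
# text on stdin and print the "Free Message Entries" value.
free_msg_entries() {
    awk -F: '/Free Message Entries/ { gsub(/[^0-9]/, "", $2); print $2; exit }'
}

# Hypothetical helper: classify the queue from the free-entry count.
# The second argument is an assumed low-water mark (default 10).
rpls_queue_state() {
    free=$1
    low_water=${2:-10}
    if [ -z "$free" ]; then
        echo "unknown"
    elif [ "$free" -le "$low_water" ]; then
        echo "full-or-nearly-full"
    else
        echo "free"
    fi
}
```

A result of "free" corresponds to case 2a below, and "full-or-nearly-full" to case 2b. How the screen text is captured is left to the environment.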
               
2a.  If there are Free Message Entries remaining:

Proceed with terminating the Replication Server from the command line, then verify through the PROMON user list (outlined below) that the RPLS process has terminated.
     
$  dsrutil source -C terminate server
     
2b.  When there are no Free Message Entries remaining:
  • The Free Message Entries value shows 0, a small value that keeps resetting to zero, or a very low value that could be exhausted in the meantime
  • Send an instruction to terminate the server, then kill the rpserver.exe process (RPLS)
Steps to disconnect and terminate the Replication Server process:

i)  instruct the RPLS to terminate so that the database marks it 'todie'
$  dsrutil source -C terminate server

ii)   Find the RPLS User ID and PID using PROMON or  _Connect VST
    
promon > 1.  User Control > 1.  Display All Entries

Find the RPLS process under the "Type" column, then look up the PID:

Usr     Name Domain  Type Wait  Table:Part Dbkey Trans    PID Sem Srv    Login  Time    
  5  dbstart      0  RPLS   --           0     0     0  10400   0   0 02/09/18 12:44


Find the RPLS PID using ABL to query the _Connect VST:
FOR EACH _Connect WHERE
    _Connect-Usr  <> ? AND
    _Connect-Type = "RPLS" NO-LOCK:
 
  DISPLAY
    LDBNAME("dictdb") FORMAT "x(20)" LABEL "Current DB"
    _connect-type
    _connect-usr        COLUMN-LABEL "Usr#" FORMAT ">>>9"
    _connect-pid        FORMAT ">>>>>>>9".  /* wide enough for PIDs above 99999 */
END.

iii) Use promon or proshut to disconnect the User ID
$   proshut <dbname> -C disconnect <usernum_of_RPLS>

iv)  Only if the Replication Server PID still does not go away, terminate it with OS signals

Terminate the RPLS PID by using the UNIX kill command or on Windows using the Task Manager.
     
UNIX: kill -15 <pid> ( Before running a kill, verify that the PID is not touching DB files with a command like lsof -p <PID>)
WINDOWS: TASKKILL /PID <pid> /T /F

Note: kill -9 should be used only as a last resort and may result in a DB crash if the PID is still touching database files. If the RPLS process is holding a latch, buffer, or lock, then killing the RPLS process may force a shutdown of the database.
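
On UNIX, step iv) can be sketched as a helper that sends kill -15 and then waits for the PID to disappear, so that any further escalation stays a deliberate manual decision. term_and_wait is a hypothetical function name and the 30 second default timeout is an assumption. The helper never sends kill -9.

```shell
# Hypothetical helper: send SIGTERM to a PID and wait for it to exit.
# Returns 0 if the process is gone within the timeout, 1 if it is
# still running, 2 if the signal could not be sent.
term_and_wait() {
    pid=$1
    timeout=${2:-30}    # assumed default, in seconds
    kill -15 "$pid" || return 2
    waited=0
    # kill -0 sends no signal; it only tests whether the PID exists.
    while kill -0 "$pid" 2>/dev/null; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

For example: term_and_wait <pid> 60. A return code of 1 means the RPLS PID is still present and the note above about kill -9 applies.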
Last Modified Date: 7/13/2021 1:54 PM
