Information

Title	How long does RPLS take to detect network outage to the RPLA?

URL Name	000042483

Article Number	000163758

Environment	Product: OpenEdge
Version: 10.2x, 11.x
OS: All supported platforms
Other: OpenEdge Replication

Question/Problem Description

How long does the Replication Server (RPLS) take to detect a network outage to the target?
Why does the RPLS not wait for repl-Keep-Alive when the target is shut down?

Steps to Reproduce

Clarifying Information

Error Message

Defect Number

Enhancement Number

Cause

Resolution

How long does the Replication Server (RPLS) take to detect a network outage to the target?

The Replication Server (RPLS) is optimistic by design: it allows the other side time to respond and waits until the keep-alive timeout (repl-Keep-Alive) expires:

The repl-keep-alive feature was introduced in 10.0x as a logical/application-level implementation for cases where the Replication Server or Agent blocks while trying to send a message and the failure is not recognized by the listener.
repl-keep-alive is for when there are TCP/IP failures detected as communicated by the underlying TCP stack. For further context, refer to the article "How does the TCP KeepAlive mechanism work?"
The Replication Server will not react to a network outage for a minimum of the repl-Keep-Alive value. By default, the repl-Keep-Alive is 300 seconds which means it expires 5 minutes after the last message is received by the Server.
When the RPLS and RPLA have established contact and then lose contact at some future time, the source database's repl-Keep-Alive and connect-timeout values affect how long the RPLS remains running, filling the RPLS-Q, until the RPLS eventually terminates.
The RPLS polls the RPLA every third iteration, when an rpNLS_PingAgent is sent in the message. Each RPLS iteration includes checking the pica queue (RPLS-Q) for AI transaction data to send as AI blocks fill on the source, building and sending network messages, and looking for messages from the DSRUTIL utility. The time these activities take may push the detection of an unavailable agent past the repl-keep-alive value. How far past depends on the contention on the queue, which in turn depends on the update activity on the source database. This lag can be longer in a two-target environment, in a heavy-update environment, and where a very active AI archiver is present.
When the TCP/IP stack does not communicate the failure, network messages sent to the "other side" end up queued. If the queue eventually fills, replication will block on a TCP/IP send until the failure is communicated from TCP to replication or network communications resume.
The repl-Keep-Alive minimum is 90 seconds since 10.2B and there is no maximum value. To detect an unavailable remote target earlier, the value can be set lower than the default of 5 minutes. Tuning the network, and specifically the TCP kernel settings, will also improve the situation.
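
The timing described above can be illustrated with a toy model. This is a rough sketch and not how the RPLS is implemented: the function names, the per-iteration durations, and the rule that the outage is noticed at the end of the first ping iteration on or after keep-alive expiry are assumptions made purely for illustration. Only the 300-second default, the 90-second minimum, and the ping on every third iteration come from this article.

```python
MIN_KEEP_ALIVE = 90       # seconds; minimum since 10.2B (from this article)
DEFAULT_KEEP_ALIVE = 300  # seconds; default (from this article)
PING_EVERY = 3            # RPLS pings the RPLA every third iteration


def earliest_reaction(last_message_at, keep_alive=DEFAULT_KEEP_ALIVE):
    """Earliest time, in seconds, at which the RPLS can react to an outage."""
    if keep_alive < MIN_KEEP_ALIVE:
        raise ValueError(f"repl-keep-alive must be at least {MIN_KEEP_ALIVE}s")
    return last_message_at + keep_alive


def detection_time(iteration_durations, last_message_at=0,
                   keep_alive=DEFAULT_KEEP_ALIVE):
    """Toy model (assumption): the outage is noticed at the end of the first
    ping iteration that finishes on or after the keep-alive expiry.

    iteration_durations is a hypothetical list of seconds per RPLS iteration.
    Returns None if the supplied iterations end before detection.
    """
    deadline = earliest_reaction(last_message_at, keep_alive)
    clock = last_message_at
    for number, duration in enumerate(iteration_durations, start=1):
        clock += duration
        if number % PING_EVERY == 0 and clock >= deadline:
            return clock
    return None


if __name__ == "__main__":
    # Light load: short iterations, detection lands on the expiry itself.
    print(detection_time([10] * 60))
    # Heavy load: long iterations push detection past repl-keep-alive.
    print(detection_time([40] * 12))
```

In this model, slower iterations (for example under heavy update activity) move the detection point later than the keep-alive expiry, which is the effect the list above describes.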

Why does the RPLS not wait for repl-Keep-Alive when the target is shut down?

A common misconception is that when the target database is shut down, the production source RPLS will wait for the repl-keep-alive time and automatically reconnect to the replication target agent process (RPLA) once it has been restarted.

When the RPLA is manually terminated with DSRUTIL or the target database itself is shut down, the RPLS will never remain running waiting for the RPLA to be restarted. The RPLA, however, will go into PRE-TRANSITION when the RPLS is stopped, if it is configured with agent-shutdown-action=recovery. By design, there is no equivalent for the RPLS process.

The purpose of the repl-keep-alive parameter is not to 'keep the RPLS alive'. It is a logical/application-level implementation in which the RPLS sends a PING in the TCP packet to the RPLA. When repl-keep-alive expires, the connect-timeout value then affects how much longer the RPLS remains running before it terminates. In other words, repl-keep-alive is for when there are TCP/IP failures detected as communicated by the underlying TCP stack.

When the RPLA is terminated with DSRUTIL, or when the replication target is shut down, this is a stop condition for the RPLS process. This is by design. There is no benefit in keeping the RPLS alive, filling the RPLS-Q (pica) and retrying the connection, as we know that the target is no longer available. Bear in mind that should the RPLS-Q (pica) eventually fill, this will stall the source production database. Should the target for some reason not come online again as anticipated, this would be an undesirable side effect. Were it to be implemented, we would need new timeout and retry configuration parameters and values for this scenario.
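
The distinction between a stop condition and a network outage can be summarized in a small decision sketch. The enum and function names below are hypothetical and exist only to restate the behavior described in this article; they are not part of OpenEdge Replication.

```python
from enum import Enum


class Event(Enum):
    """Hypothetical labels for the situations discussed in this article."""
    AGENT_TERMINATED_WITH_DSRUTIL = "agent terminated with DSRUTIL"
    TARGET_SHUT_DOWN = "target database shut down"
    NETWORK_OUTAGE = "network outage"


# Per this article, these two events are stop conditions for the RPLS.
STOP_CONDITIONS = {Event.AGENT_TERMINATED_WITH_DSRUTIL, Event.TARGET_SHUT_DOWN}


def rpls_reaction(event, keep_alive=300):
    """Describe how the RPLS reacts to an event, as stated in this article."""
    if event in STOP_CONDITIONS:
        return "stops; repl-keep-alive does not apply"
    return (f"keeps running for at least {keep_alive}s (repl-keep-alive), "
            "then connect-timeout affects when it terminates")


if __name__ == "__main__":
    for event in Event:
        print(f"{event.value}: RPLS {rpls_reaction(event)}")
```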

While the target may "just be being stopped and immediately started again", the design is for you to restart the RPLS with DSRUTIL as soon as the target is online again. At this stage, the connect-timeout and, eventually, the defer-agent-startup timeouts govern the attempts to reconnect, after which synchronization completes and forward processing of AI transaction notes continues.
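
As a minimal sketch of that manual restart step, the helper below builds the DSRUTIL command line for restarting the Replication Server on the source database. It assumes the `dsrutil <db-name> -C restart server` form; verify the syntax against the documentation for your OpenEdge release. The function name and the database name `sourcedb` are hypothetical, and the script only prints the command rather than executing it.

```python
import shlex


def restart_server_command(source_db):
    """Build the argument list for restarting the RPLS on the source database.

    Assumes the 'dsrutil <db-name> -C restart server' form; check it against
    the documentation for your release before use.
    """
    if not source_db:
        raise ValueError("a source database name is required")
    return ["dsrutil", source_db, "-C", "restart", "server"]


if __name__ == "__main__":
    # 'sourcedb' is a placeholder; run this only once the target is online.
    print(shlex.join(restart_server_command("sourcedb")))
```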

Workaround

Reference Progress article:
What happens with replication when there are network problems?
A manual recovery option is provided under the Workaround section of that article.

Notes

Progress Articles:

How does the TCP KeepAlive mechanism work?
How does dsrutil -C terminate agent work
How to simulate a TCP/IP communication break between Replication Source and Target Agent.

Keyword Phrase

Last Modified Date	11/20/2020 7:14 AM
