
What happens with replication when there are network problems?


Information

 
Title: What happens with replication when there are network problems?
URL Name: 000044843
Article Number: 000165539
Environment:
Product: OpenEdge
Version: All supported versions
OS: All supported platforms
Other: Replication
Question/Problem Description
What happens with replication when there are network problems?
What happens with replication when there is a network outage? 
Steps to Reproduce
Clarifying Information
Error Message
Defect Number
Enhancement Number
Cause
Resolution
When the OpenEdge Replication Server (RPLS) loses its connection to one or more OpenEdge Replication Agents (RPLA) after it has previously synchronised, the RPLS tries to contact the RPLA and re-establish the connection for the amount of time determined by the connect-timeout value set in the replication server properties file. If the RPLS cannot establish the connection within the timeout, or if communications break during synchronisation, the RPLS marks the agent as unrecoverable and does not try to connect again. The RPLS must then be restarted manually; on restart it goes through initial connection and enters the defer-agent timeout, if configured.
Notwithstanding, the priority for a reliable replication environment is to keep communications from breaking in the first place: improve network stability, ensure sufficient bandwidth, and ensure that neither timeouts in the network topology nor scanning probes to the ports in use interrupt communications.
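
The reconnect window described above can be pictured with a small model. This is only an illustrative sketch in Python, not OpenEdge code: the function name, the `retry_interval` parameter and the `FakeClock` helper are assumptions made for the example, and the article does not state how often the RPLS actually retries within connect-timeout.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ReconnectResult:
    connected: bool
    attempts: int
    agent_state: str  # "connected" or "unrecoverable"


class FakeClock:
    """Deterministic clock so the model can be run without real waiting."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


def reconnect_within_timeout(
    try_connect: Callable[[], bool],
    connect_timeout: float,
    retry_interval: float,  # hypothetical: not a documented property
    clock: Callable[[], float],
    sleep: Callable[[float], None],
) -> ReconnectResult:
    """Retry try_connect() until it succeeds or connect_timeout elapses."""
    deadline = clock() + connect_timeout
    attempts = 0
    while clock() < deadline:
        attempts += 1
        if try_connect():
            return ReconnectResult(True, attempts, "connected")
        sleep(retry_interval)
    # Window expired: the agent is treated as unrecoverable and no further
    # automatic attempts are made until the server is restarted manually.
    return ReconnectResult(False, attempts, "unrecoverable")


if __name__ == "__main__":
    clock = FakeClock()
    result = reconnect_within_timeout(
        lambda: False, connect_timeout=30, retry_interval=10,
        clock=clock.time, sleep=clock.sleep,
    )
    print(result)
```

With a network that never comes back, the model gives up once the window closes and reports the agent as unrecoverable, which corresponds to the point at which a manual restart of the RPLS is required.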

When Replication Communications break:

1. When the RPLA recognizes that communication with the RPLS has failed, it goes into PRE-TRANSITION mode until the RPLS re-establishes communications, provided agent-shutdown-action=recovery is configured. Otherwise the RPLA terminates.
  • When the RPLS recognizes that an agent has failed, the server places itself into a state that allows continuous RDBMS activity, as if the RPLS were not running.
  • Blocked sockets or network instability can prevent either the RPLS or the RPLA from being notified of the failure, in which case the Replication Service queue fills and the RPLS cannot be stopped. The ability to terminate the replication server or agent under these conditions was improved in OpenEdge 11.7.6, 12.2.1 and 12.3.
2. The RPLS tries to reconnect to RPLA for a set amount of time (connect-timeout).
  • Source database activity by clients is still allowed unless synchronous replication is being used or schema updates are being performed by a process. 
3. If the RPLS is able to reconnect to the RPLA in the configured connect-timeout period, it synchronises:
  • The RPLA again begins processing AI blocks from the source database. When it gets within ten AI blocks of the RDBMS, the RPLS halts normal database activity and completes the synchronization process.
  • Schema updates are not allowed while the RPLS is performing synchronization. If schema updates are being performed when failure recovery synchronization begins, source database updates will block until failure recovery completes.
  • Source database activity cannot continue without the agent connected when synchronous replication is being used. 
  • When synchronization completes, the RPLS reinserts itself back into the AI block write process and the database is unlocked, allowing normal database activity and replication activity to continue.
  • If communications break during synchronisation, the RPLS will terminate. It will not try to reconnect using connect-timeout. Either the RPLS or the source database itself must be restarted manually to re-initialise the initial connection using connect-timeout and, thereafter, the defer-agent timeout.
  • For various reasons the RPLS can get stuck connecting to the RPLA. These situations arise when there are underlying problems in the network layers, typically when only one of the RPLA and RPLS was aware of the communication failure. The following Article discusses this further: What procedure to use when RPLS is stuck in status 1100 "RP STATE CONNECTING"?
4. If the RPLS is unable to reconnect to all agents or to a critical agent in the configured connect-timeout period, the RPLS will terminate and source database activity will continue.
  • In other words, if one agent is a critical agent, the server will continue if it can reconnect to the single critical agent. Once the second agent can be reached again, both agents will need to be started.
  • If there are no critical agents, the server must be able to reconnect to all agents or it will terminate. The RPLS marks the agent as irrecoverable and terminates.
  • When source database activity continues while the RPLS is not running, be sure that there is enough AI extent space to handle all database activity until the RPLS is restarted and replication continues. 
  • Restarting the RPLS takes the target out of PRE-TRANSITION mode and, if connect-timeout has not expired, replication resumes faster:
      $ dsrutil <sourcedb> -C restart server
  • Since version 11.6, the Replication Agent can be restarted with DSRUTIL, like the Replication Server, without restarting the target database:
      $ dsrutil <targetdb> -C restart agent
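
The critical-agent rule in step 4 can be summarised as a small decision function. The sketch below is one reading of the rules stated in this article, written in Python for illustration only; the `Agent` class and `server_continues` function are hypothetical names and are not part of OpenEdge.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Agent:
    name: str
    critical: bool     # models an agent configured as critical
    reconnected: bool  # True if reached again within connect-timeout


def server_continues(agents: List[Agent]) -> bool:
    """Return True if the modelled RPLS keeps running after the window."""
    critical = [a for a in agents if a.critical]
    if critical:
        # With a critical agent configured, only the critical agent(s)
        # must have been reached again for the server to continue.
        return all(a.reconnected for a in critical)
    # With no critical agents, every agent must have been reached again.
    return all(a.reconnected for a in agents)


if __name__ == "__main__":
    agents = [
        Agent("agent1", critical=True, reconnected=True),
        Agent("agent2", critical=False, reconnected=False),
    ]
    print(server_continues(agents))
```

In this model a server with one critical agent survives the loss of a non-critical agent, while a server with no critical agents terminates as soon as any single agent cannot be reached within connect-timeout.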
Workaround

 
Notes
Keyword Phrase
Last Modified Date: 9/15/2021 7:01 PM
