Communication error - Replication - Forum - OpenEdge RDBMS - Progress Community

This question is not answered

We are still having trouble with OpenEdge Replication at a client site. I'll try to explain. The environment is OE 10.2B08 on Windows.

This has been an ongoing issue for a while, and we are struggling to get anywhere with Tech Support as well. We had another instance of the problem just now.

The client rebooted the target. It came back up perfectly well and the agents are listening, but the repl servers are stuck at "Performing Failure Recovery". If I run Terminate Server, all I get is a line in the log file indicating that the administrator account logged in and out; no action is taken and the repl server does NOT terminate. If I repeat the action, the process that is terminating the server hangs, and if I kill that process the DB shuts down because a latch was being held. I can't disconnect it using proshut either.

At the point the target server was rebooted, we got this in the source logs:

[2017/05/15@16:54:16.178+0100] P-6108       T-7148  I RPLS   26: (9407)  Connection failure for host 192.168.16.238 port 4400 transport TCP. 
[2017/05/15@16:54:16.178+0100] P-6108       T-7148  I RPLS   26: (11713) A communications error -4008 in rpCOM_RecvMsg. 
[2017/05/15@16:54:16.178+0100] P-6108       T-7148  I RPLS   26: (-----) Diagnostic Dump of RPCommInfo_t - TCP/IP Receive Error
[2017/05/15@16:54:16.178+0100] P-6108       T-7148  I RPLS   26: (-----) 0000:  d8a0 0102 0000 0000 0000 0000 3011 0000 8abb 0000 8abb 0000 0200 0000 4400 0000 
[2017/05/15@16:54:16.178+0100] P-6108       T-7148  I RPLS   26: (-----) 0020:  c3e1 0300 fb6c 0000 0000 0000 efce 1959 0000 0000 3c41 0000 0100 0000 1900 0000 
[2017/05/15@16:54:16.178+0100] P-6108       T-7148  I RPLS   26: (-----) 0040:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
[2017/05/15@16:54:16.178+0100] P-6108       T-7148  I RPLS   26: (-----) 0060:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
[2017/05/15@16:54:16.178+0100] P-6108       T-7148  I RPLS   26: (-----) 0080:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
[2017/05/15@16:54:16.178+0100] P-6108       T-7148  I RPLS   26: (-----) 00a0:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
[2017/05/15@16:54:16.178+0100] P-6108       T-7148  I RPLS   26: (-----) 00c0:  0000 0000 0000 0000 0000 0000 3139 322e 3136 382e 3136 2e32 3338 0000 0000 0000 
[2017/05/15@16:54:16.178+0100] P-6108       T-7148  I RPLS   26: (-----) 00e0:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
[2017/05/15@16:54:16.178+0100] P-6108       T-7148  I RPLS   26: (-----) 0100:  0000 0000 0000 0000 0000 0000 3139 322e 3136 382e 3136 2e32 3338 0000 0000 0000 
[2017/05/15@16:54:16.178+0100] P-6108       T-7148  I RPLS   26: (-----) 0120:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
[2017/05/15@16:54:16.178+0100] P-6108       T-7148  I RPLS   26: (-----) 0140:  0000 0000 0000 0000 0000 0000 
[2017/05/15@16:54:16.180+0100] P-6108       T-7148  I RPLS   26: (10492) A communications error -157 occurred in function rpNLS_PollListener while receiving a message. 
[2017/05/15@16:54:16.180+0100] P-6108       T-7148  I RPLS   26: (10661) The Fathom Replication Server is beginning recovery for agent l_idx_audit. 
[2017/05/15@16:54:16.180+0100] P-6108       T-7148  I RPLS   26: (10842) Connecting to Fathom Replication Agent l_idx_audit.
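
For anyone reading the diagnostic dump above: the non-zero bytes in the rows at offsets 00c0 and 0100 are plain ASCII. The following is a minimal Python sketch that decodes two rows copied verbatim from the log. The helper names (`row_bytes`, `ascii_strings`) are hypothetical, and the layout of RPCommInfo_t is not known here, so reading bytes 12-15 of the first row as a little-endian integer is only a guess that happens to match the port reported in message 9407.

```python
def row_bytes(dump_row: str) -> bytes:
    """Convert the hex groups of one dump row (the text after 'NNNN:') to raw bytes."""
    return bytes.fromhex("".join(dump_row.split()))


def ascii_strings(raw: bytes) -> list[str]:
    """Return the NUL-separated runs in raw that consist only of printable ASCII."""
    return [
        chunk.decode("ascii")
        for chunk in raw.split(b"\x00")
        if chunk and all(32 <= b < 127 for b in chunk)
    ]


# Rows at offsets 0000 and 00c0, copied from the source log.
ROW_0000 = "d8a0 0102 0000 0000 0000 0000 3011 0000 8abb 0000 8abb 0000 0200 0000 4400 0000"
ROW_00C0 = "0000 0000 0000 0000 0000 0000 3139 322e 3136 382e 3136 2e32 3338 0000 0000 0000"

# The ASCII run in row 00c0 is the target host address.
host = ascii_strings(row_bytes(ROW_00C0))
print(host)

# Assumption: bytes 12..15 of row 0000 hold a little-endian integer (possibly the port).
port_guess = int.from_bytes(row_bytes(ROW_0000)[12:16], "little")
print(port_guess)
```

The dump therefore appears to carry the same host (and possibly port) that message 9407 already reports, rather than any extra detail about why the receive failed.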

We've advised the client to reboot the source, but it's not an ideal solution as they are 24/7. 

 
All Replies
  • %% Properties File
    %% version 1.1
    %% 6 janv. 06 13:56:16
    
    [server]
        control-agents=l_idx_cs
        database=chemsource
        defer-agent-startup=1000
        transition=manual
        transition-timeout=600
        agent-shutdown-action=recovery
        
    [control-agent.l_idx_cs]
       name=l_idx_cs
       database=chemsource
       host=192.168.16.238
       port=48090
       connect-timeout=1000
       replication-method=async
       critical=0
    
    [agent]
        name=l_idx_cs
        database=chemsource
        connect-timeout=1000
        listener-maxport=4408
        listener-minport=4406
    
    [transition]
        database-role=reverse
        restart-after-transition=0
        auto-begin-ai=1
        transition-to-agents=l_idx_cs
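
A properties file like the one above can be sanity-checked mechanically. This is a minimal Python sketch, not a Progress tool: it assumes `control-agents` is a comma-separated list of agent names, that each name should have a matching `[control-agent.<name>]` section, and that `%%` starts a comment line. The function `check_properties` and the trimmed sample text are hypothetical illustrations.

```python
import configparser

# Trimmed copy of the properties file above, with indentation normalised.
PROPERTIES = """\
%% Properties File
%% version 1.1

[server]
    control-agents=l_idx_cs
    database=chemsource
    transition=manual

[control-agent.l_idx_cs]
    name=l_idx_cs
    database=chemsource
    host=192.168.16.238
    port=48090
    replication-method=async

[agent]
    name=l_idx_cs
    database=chemsource
    listener-minport=4406
    listener-maxport=4408
"""


def check_properties(text: str) -> list[str]:
    """Return a list of human-readable problems; the list is empty if none are found."""
    cfg = configparser.ConfigParser(
        comment_prefixes=("%%",), interpolation=None, empty_lines_in_values=False
    )
    cfg.read_string(text)
    problems = []
    # Assumption: control-agents is a comma-separated list of agent names.
    agents = [a.strip() for a in cfg["server"]["control-agents"].split(",")]
    for agent in agents:
        section = f"control-agent.{agent}"
        if section not in cfg:
            problems.append(f"no [{section}] section for agent {agent}")
        elif cfg[section].get("name") != agent:
            problems.append(f"[{section}] has name={cfg[section].get('name')}")
    low = cfg["agent"].getint("listener-minport")
    high = cfg["agent"].getint("listener-maxport")
    if low > high:
        problems.append(f"listener-minport {low} is above listener-maxport {high}")
    return problems


problems = check_properties(PROPERTIES)
print(problems or "no inconsistencies found")
```

A check like this only catches naming and range mistakes inside one file; it says nothing about whether the ports are actually reachable from the source.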
    
  • Known issue. There are a few of these in 10.2B08. You're toast. Restarting the source is the only solution of which I am aware.

    Paul Koufalis
    White Star Software

    pk@wss.com
    @oeDBA (https://twitter.com/oeDBA)

    ProTop: The #1 Free OpenEdge DB Monitoring Tool
    http://protop.wss.com
  • Thanks Paul. Shouldn't you be relaxing on a beach somewhere?! :)

    Is there a way of preventing this from happening? Would, say, stopping the databases before rebooting reduce the chances?

    I ask because we don't experience these sorts of issues on other sites with very similar OERepl configurations.

  • proshut DB -C disconnect RPLS (RPLS = the user number of the repl server)

    followed by a kill from Task Manager has always worked for me so far. But then I don't use -C terminate; I use -C restart server first, and when that tells me 'server already running', I do the proshut -C disconnect and then kill from Task Manager.

  • I'm very, very reluctant to kill from Task Manager: even after the disconnect, I've had killing the process crash the DB.

  • Hello James, how did you solve the problem?
    Ezequiel Montoya
    Lima - Perú
  • No solution yet. Working with Progress Support on it. We've enabled additional logging to see if we can capture info, but we need a system reboot for it to become active.
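
While waiting for the extra logging, the existing database log can be scanned for the replication messages quoted earlier in the thread. The following is a rough Python sketch under stated assumptions: the regular expression is derived only from the log lines shown above and may not fit other OpenEdge versions, and the names `LINE_RE`, `WATCH` and `repl_events` are hypothetical.

```python
import re

# Pattern inferred from the quoted log lines: [timestamp] P-pid T-tid severity process usr: (msgnum) text
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+P-(?P<pid>\d+)\s+T-(?P<tid>\d+)\s+(?P<sev>\w)\s+"
    r"(?P<proc>\w+)\s+(?P<usr>\d+):\s+\((?P<msg>\d+|-+)\)\s+(?P<text>.*\S)\s*$"
)

# Message numbers that appear in the source log when the connection drops.
WATCH = {"9407", "11713", "10492", "10661"}


def repl_events(lines):
    """Yield (timestamp, message number, text) for RPLS lines whose number is in WATCH."""
    for line in lines:
        m = LINE_RE.match(line)
        if m and m["proc"] == "RPLS" and m["msg"] in WATCH:
            yield m["ts"], m["msg"], m["text"]


# Three lines copied from the source log in the original post.
SAMPLE = [
    "[2017/05/15@16:54:16.178+0100] P-6108       T-7148  I RPLS   26: (9407)  Connection failure for host 192.168.16.238 port 4400 transport TCP. ",
    "[2017/05/15@16:54:16.178+0100] P-6108       T-7148  I RPLS   26: (-----) Diagnostic Dump of RPCommInfo_t - TCP/IP Receive Error",
    "[2017/05/15@16:54:16.180+0100] P-6108       T-7148  I RPLS   26: (10661) The Fathom Replication Server is beginning recovery for agent l_idx_audit. ",
]

events = list(repl_events(SAMPLE))
for ts, msg, text in events:
    print(ts, msg, text)
```

In practice the `lines` argument would be an open file handle on the source database's .lg file, so the scan can be run without touching the running replication server.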