Quick replication question - Forum - OpenEdge RDBMS - Progress Community


  • Our replication server is complaining when trying to monitor:

    F:\DATABASE\LIVE>dsrutil icmasliv -C monitor
    Cannot connect to replication shared memory.  Status = -1


    The target is in normal processing, but hasn't received any data since the server stopped responding. 

    I've run Restart Server and it is now Connecting to Agents but seems to be unsuccessful in doing so. Is there anything else I can attempt before restarting the DB? 

    Also, is there anywhere I can look to see if I can work out why replication failed? Nothing obvious in the logs I've looked in so far. 

  • Had to restart the DB and it's still complaining so I'm restarting the server. :/

  • Did you check the connectivity between both servers/databases, i.e. telnet on the target database port from the source db server, do you get a response?
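The telnet-style check suggested above can also be scripted. This is a minimal sketch, not an OpenEdge tool: `can_connect` is a hypothetical helper name, and the host and port in the usage comment are simply the values that appear in the target log later in this thread. Substitute the port your replication agent actually listens on.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened.

    This only proves that something accepts connections on the port;
    it says nothing about whether replication itself is healthy.
    """
    try:
        # create_connection resolves the host and applies the timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution errors.
        return False


if __name__ == "__main__":
    # Hypothetical values; replace with your own host and port.
    print(can_connect("192.168.125.1", 4859))
```

Run it from the source server against the target, and from the target against the source, since a firewall rule may block only one direction.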

  • Patience and being a little more specific would be good :)

    The agent should be in pre-transition, since your replication server does not seem to have been running.
    So you can connect to the target with dsrutil and it says “normal processing”?

    Unless you post both database log files, it’s literally impossible to make a guess.

    Which database did you restart?
     

  • Unfortunately I am restricted by when the business says it's convenient to do things. The next "convenient" window being Wednesday when our AI extents would be full...

    I restarted the source as that is the one which was giving the error.

  • Found some tell-tale problems in the Target log file, although the timings don't match completely. Target DB was up to date until 12:19 when things went south.

    [2015/03/30@11:35:35.088+0100] P-4528       T-4520  I RPLA  162: (9407)  Connection failure for host 192.168.125.1 port 4859 transport TCP. 
    [2015/03/30@11:35:35.089+0100] P-4528       T-4520  I RPLA  162: (-----) Diagnostic Dump of RPCommInfo_t - TCP/IP Poll Error:2
    [2015/03/30@11:35:35.089+0100] P-4528       T-4520  I RPLA  162: (-----) 0000:  0000 0000 0000 0000 6080 4050 2811 0000 2311 0000 9411 0000 0200 0000 2400 0000 
    [2015/03/30@11:35:35.089+0100] P-4528       T-4520  I RPLA  162: (-----) 0020:  4a92 1100 3730 af00 0000 0000 9e26 1955 0000 0000 4021 0000 0000 0000 2c01 0000 
    [2015/03/30@11:35:35.089+0100] P-4528       T-4520  I RPLA  162: (-----) 0040:  0000 0000 58f0 ffff 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
    [2015/03/30@11:35:35.089+0100] P-4528       T-4520  I RPLA  162: (-----) 0060:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
    [2015/03/30@11:35:35.089+0100] P-4528       T-4520  I RPLA  162: (-----) 0080:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
    [2015/03/30@11:35:35.089+0100] P-4528       T-4520  I RPLA  162: (-----) 00a0:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
    [2015/03/30@11:35:35.089+0100] P-4528       T-4520  I RPLA  162: (-----) 00c0:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
    [2015/03/30@11:35:35.089+0100] P-4528       T-4520  I RPLA  162: (-----) 00e0:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
    [2015/03/30@11:35:35.089+0100] P-4528       T-4520  I RPLA  162: (-----) 0100:  0000 0000 0000 0000 0000 0000 3139 322e 3136 382e 3132 352e 3100 0000 0000 0000 
    [2015/03/30@11:35:35.089+0100] P-4528       T-4520  I RPLA  162: (-----) 0120:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
    [2015/03/30@11:35:35.089+0100] P-4528       T-4520  I RPLA  162: (-----) 0140:  0000 0000 0000 0000 0000 0000 
    [2015/03/30@11:35:35.089+0100] P-4528       T-4520  I RPLA  162: (10492) A communications error -157 occurred in function rpNLA_PollListener while receiving a message. 
    [2015/03/30@11:35:35.089+0100] P-4528       T-4520  I RPLA  162: (11699) A TCP/IP failure has occurred.  The Agent's will enter PRE-TRANSITION, waiting for connection from the Replication Server. 
    [2015/03/30@11:35:38.027+0100] P-4528       T-4520  I RPLA  162: (10392) Database f:\database\live\icmasliv is being replicated from database f:\database\live\icmasliv on host 192.168.125.1. 
    [2015/03/30@11:35:39.030+0100] P-4528       T-4520  I RPLA  162: (10671) The OpenEdge Replication Agent agent1 is beginning Recovery Synchronization at block 11913. 
    [2015/03/30@11:35:39.399+0100] P-4528       T-4520  I RPLA  162: (6806)  Retry transaction point located at dbkey 0 note type 10 updctr 0. 
    [2015/03/30@11:35:39.399+0100] P-4528       T-4520  I RPLA  162: (10705) Retry point located at logical op 1 note type 70 trid 908325446. 
    [2015/03/30@11:35:39.720+0100] P-4528       T-4520  I RPLA  162: (10670) The Source and Target databases are synchronized.  Normal processing is resuming.
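To line up timings between the source and target logs, it can help to strip the hex dump and keep only the numbered replication messages. The sketch below assumes the log line layout seen in the excerpt above (timestamp, `P-`/`T-` ids, severity, process tag, user number, message number in parentheses) and assumes that replication entries carry a tag starting with `RPL`. `replication_events` is a hypothetical helper, not part of OpenEdge.

```python
import re
from datetime import datetime

# Layout inferred from the log excerpt in this thread (an assumption).
LOG_LINE = re.compile(
    r"\[(?P<ts>[^\]]+)\]\s+P-(?P<pid>\d+)\s+T-(?P<tid>\d+)\s+(?P<sev>\w)\s+"
    r"(?P<proc>\w+)\s+(?P<usr>\d+):\s+\((?P<num>\d+|-+)\)\s+(?P<text>.*)"
)


def replication_events(lines):
    """Return (timestamp, message_number, text) for numbered RPL* entries."""
    events = []
    for line in lines:
        m = LOG_LINE.search(line)
        if not m or not m.group("proc").startswith("RPL"):
            continue
        if not m.group("num").isdigit():
            continue  # "(-----)" lines are diagnostic dump rows
        ts = datetime.strptime(m.group("ts"), "%Y/%m/%d@%H:%M:%S.%f%z")
        events.append((ts, int(m.group("num")), m.group("text").strip()))
    return events


if __name__ == "__main__":
    sample = [
        "[2015/03/30@11:35:35.088+0100] P-4528       T-4520  I RPLA  162: (9407)  Connection failure for host 192.168.125.1 port 4859 transport TCP. ",
        "[2015/03/30@11:35:35.089+0100] P-4528       T-4520  I RPLA  162: (-----) Diagnostic Dump of RPCommInfo_t - TCP/IP Poll Error:2",
        "[2015/03/30@11:35:35.089+0100] P-4528       T-4520  I RPLA  162: (11699) A TCP/IP failure has occurred.  The Agent's will enter PRE-TRANSITION, waiting for connection from the Replication Server. ",
    ]
    for ts, num, text in replication_events(sample):
        print(ts.isoformat(), num, text)
```

Running the same filter over both database log files and sorting the combined output by timestamp gives a single timeline of what each side saw.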
  • There is nothing wrong per se with this.

    The target detected a failure, the agent went into pre-transition, and then the connection came back up.

    We would need the log entries from around “Target DB was up to date until 12:19 when things went south”.

     
  • Is the target database "aware" that you restarted the source db/server? Did it say anything about it, or did it keep happily ignoring the issue on the source?

  • No, the target carried merrily on its way when the source was rebooted.

    When I attempted to restart the target db I got a message saying shared memory was already in use.

  • So the target db stopped, but gave you an error upon starting. I've seen it sometimes take a couple of minutes to shut down completely. Is the dbname.lk still there? Can you promon the database, or does it say that there is no server for it? Is the rpagent.exe process still alive (assuming Windows platform)?

  • I left it 30 minutes before trying again and still no joy. The admin service log file had an error saying shared memory was already in use. DB log file showed shutdown was complete.

  • Assuming this is Windows, Process Monitor should tell you what is holding the shared memory.
     

  • If (.lk isn't there AND promon says "no server..." AND rpagent.exe isn't there) then maybe the db's port is in a hanging state. You should be able to kill/force disconnect the port (with cports or TCP View, etc...)

    Then start the db again.
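The checklist from the last few posts can be written down as a small decision helper. This is only a sketch of the reasoning in the thread: `lock_file_exists` and `next_step` are hypothetical names, the promon and rpagent.exe results are passed in as booleans that you determine by hand, and the returned text paraphrases the advice given above rather than any official procedure.

```python
from pathlib import Path


def lock_file_exists(db_base):
    """Check for <dbname>.lk next to the database, e.g. icmasliv.lk."""
    return Path(str(db_base) + ".lk").exists()


def next_step(lk_present, promon_finds_server, rpagent_running):
    """Map the three checks from the thread to a suggested next step."""
    leftovers = []
    if lk_present:
        leftovers.append(".lk file still present")
    if promon_finds_server:
        leftovers.append("promon still finds a server")
    if rpagent_running:
        leftovers.append("rpagent.exe still running")
    if leftovers:
        # Something from the old instance is still around.
        return "shutdown incomplete: " + ", ".join(leftovers)
    # All three checks are clear, so the thread's guess is a hanging port.
    return "possibly a hanging port: inspect it, then start the db again"


if __name__ == "__main__":
    # Hypothetical path; the other two answers come from manual checks.
    lk = lock_file_exists(r"F:\DATABASE\LIVE\icmasliv")
    print(next_step(lk, promon_finds_server=False, rpagent_running=False))
```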

  • Hmmmm Back in the scenario again. I'll try and give more info this time.

    Source:

    Win 2003 server 32 bit, running Progress 11.2.1 32 bit (yes I know. We are migrating to 11.5 64 bit in May).

    Target:

    Win 2008 R2 64 bit running Progress 11.2.1 32 bit.

    Log File:

    [2015/03/30@21:01:27.491+0100] P-3632       T-3628  I RPLA  162: (9407)  Connection failure for host 192.168.125.1 port 2633 transport TCP. 
    [2015/03/30@21:01:27.492+0100] P-3632       T-3628  I RPLA  162: (-----) Diagnostic Dump of RPCommInfo_t - TCP/IP Poll Error:2
    [2015/03/30@21:01:27.492+0100] P-3632       T-3628  I RPLA  162: (-----) 0000:  0000 0000 0000 0000 28a9 6700 2811 0000 2311 0000 9411 0000 0200 0000 2400 0000 
    [2015/03/30@21:01:27.492+0100] P-3632       T-3628  I RPLA  162: (-----) 0020:  8d6d 0000 a044 0400 0000 0000 508f 1955 0000 0000 4021 0000 0000 0000 0500 0000 
    [2015/03/30@21:01:27.492+0100] P-3632       T-3628  I RPLA  162: (-----) 0040:  0000 0000 58f0 ffff 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
    [2015/03/30@21:01:27.492+0100] P-3632       T-3628  I RPLA  162: (-----) 0060:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
    [2015/03/30@21:01:27.492+0100] P-3632       T-3628  I RPLA  162: (-----) 0080:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
    [2015/03/30@21:01:27.492+0100] P-3632       T-3628  I RPLA  162: (-----) 00a0:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
    [2015/03/30@21:01:27.492+0100] P-3632       T-3628  I RPLA  162: (-----) 00c0:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
    [2015/03/30@21:01:27.492+0100] P-3632       T-3628  I RPLA  162: (-----) 00e0:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
    [2015/03/30@21:01:27.492+0100] P-3632       T-3628  I RPLA  162: (-----) 0100:  0000 0000 0000 0000 0000 0000 3139 322e 3136 382e 3132 352e 3100 0000 0000 0000 
    [2015/03/30@21:01:27.492+0100] P-3632       T-3628  I RPLA  162: (-----) 0120:  0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 
    [2015/03/30@21:01:27.492+0100] P-3632       T-3628  I RPLA  162: (-----) 0140:  0000 0000 0000 0000 0000 0000 
    [2015/03/30@21:01:27.492+0100] P-3632       T-3628  I RPLA  162: (10492) A communications error -157 occurred in function rpNLA_PollListener while receiving a message. 
    [2015/03/30@21:01:27.492+0100] P-3632       T-3628  I RPLA  162: (11699) A TCP/IP failure has occurred.  The Agent's will enter PRE-TRANSITION, waiting for connection from the Replication Server. 

    We are in Pre Transition. 

    I've restarted the server and we've gone to Performing Startup Synchronisation. 

    Fingers crossed it'll come back but some ideas where to look for why this is happening would be good as I don't appreciate getting alerted during the night :D 

  • When I say restarted the server I mean I've restarted the replication server, not the whole server!