Setting the -pica parameter for OE Replication

Hello,

I am looking for some advice on setting the -pica parameter for a group of databases with replication enabled.

From what I've read, all databases must have the same -pica value. Articles suggest the calculation is

-pica = (TAIW / BlockCount) * 1.25

So should I take the Total AI Writes from the most heavily used database?

Is this the total of AI writes within one hour?

Also, the articles say to use the AI writes from the busiest period, but I don't know when that is, so I'm not sure how best to estimate it. I calculated an average of the AI writes in order to size the required AI extents; could that average be used here?

Any advice would be greatly appreciated.
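
To make the question concrete, here is a rough sketch (in Python, with invented numbers) of how I could pick the busiest interval from periodic promon "Activity: AI Log" samples; the peak figure would then go into the formula above:

    # Hypothetical sketch: pick the busiest interval from a series of promon
    # "Activity: AI Log" samples.  The numbers below are invented; the real ones
    # would come from sampling promon (e.g. hourly) on each database.
    samples = [
        ("09:00-10:00", 41_087),   # (interval, Total AI writes in that interval)
        ("10:00-11:00", 76_472),
        ("11:00-12:00", 86_124),
        ("12:00-13:00", 30_234),
    ]

    peak_interval, peak_ai_writes = max(samples, key=lambda s: s[1])
    print(f"busiest interval: {peak_interval}, Total AI writes: {peak_ai_writes}")
    # peak_ai_writes is the "most busy period" figure for the calculation above.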

All Replies
  • Example of the pica queue filling up:

    Communication Area Size: 20500.00 KB
    Total Message Entries: 187428

    Time     Free Msgs  Used Msgs  Used HWM  Filled Cnt  Access Cnt  Collisions  Locked by Replication Server
    04:00:28    187428          0     39351           0   146653272       24813          0
    04:30:29    176446      10982     39351           0   146787113       24823      10982 
    05:00:30    129880      57548     57554           0   146903954       24837      57548 
    05:30:31     82080     105348    105349           0   147012984       24864     105348 
    06:00:32     40849     146579    146579           0   147115650       24895     146579 
    06:30:33      1727     185701    185736           0   147259217       24934     185701 
    07:00:34        90     187338    187428        2757   147399645       24985     187338 
    07:30:35         9     187419    187428        6039   147546696       25036     187419 
    08:00:36      6094     181334    187428        6826   147699472       25088     181334 
    08:30:37       148     187280    187428        8676   147835357       25132     187280 
    09:00:38       282     187146    187428       11422   147990051       25175     187146 
    09:30:39        11     187417    187428       13920   148140550       25230     187417 
    10:00:40         8     187420    187428       16565   148301390       25271     187420 
    10:30:40         0     187428    187428       19071   148474473       25320     187428   
    11:00:41         6     187422    187428       20430   148667535       25351     187422 
    11:30:42         0     187428    187428       21451   148860809       25367     187428 
    12:00:43         0     187428    187428       22290   149052507       25387     187428
    12:30:44         0     187428    187428       23280   149244708       25404     187428
    13:00:45         6     187422    187428       24393   149439472       25427     187422 
    13:30:46         0     187428    187428       25901   149634877       25463     187428
    14:00:47         0     187428    187428       27601   149829466       25516     187428
    14:30:48         0     187428    187428       30126   150026275       25562     187428
    

    Access Collisions are almost the same all the time.

    Used Message Entries = Locked by OpenEdge Replication Server

    Activity: AI Log
    
    Time    Total Wrt  AIW Wrt Notes Wrt  Bytes Wrt BusyWaits Partial writes
    04:00:28  6410836  6134415    860955K 98412861K  3506  248953
    04:30:29    30234    29677   4079689    476171K    17     429
    05:00:30    76472    75515  10441933   1211905K    40     616
    05:30:31    73177    72211  10123924   1159241K    45     614
    06:00:32    66706    65792   9253486   1055882K    45     614
    06:30:33    86124    85119  11848009   1365994K    54     622
    07:00:34    50341    49417   6874101    794400K    35     615
    07:30:35    47549    46614   6566173    749848K    37     612
    08:00:36    64745    63774   8997106   1024461K    46     620
    08:30:37    44369    43458   6426670    698868K    43     621
    09:00:38    29663    28685   4459687    464162K    41     609
    09:30:39    41087    40112   6251794    646457K    56     624
    10:00:40    28148    27162   4220942    439805K    91     617
    10:30:40    26172    25201   4019687    408615K    99     595
    11:00:41    13070    12416   2005665    201711K   173     443
    11:30:42    10044     9531   1543874    154703K   112     359
    12:00:43     8280     7677   1261205    126705K    68     347
    12:30:44     9675     9166   1479821    148782K   149     362
    13:00:45    11091    10619   1723126    171474K   185     359
    13:30:46    15510    14665   2414339    240059K   287     488
    14:00:47    17176    16211   2681374    266200K   226     514
    14:30:48    24562    23546   3876160    383844K   154     538

    And, of course, the filled pica queue causes slowness on the source db (Total AI Writes drop by a factor of 10).

  • Access collisions is the number of times that a process tried to add to the queue or remove from the queue (repl server) and found
    that the queue was locked.  It can be used together with the access count to determine what level of contention exists when accessing
    the queue.  Your example suggests that there is very little contention on the pica queue, and that interactions with the pica queue
    are not affecting the performance of the source database.
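
    For what it's worth, that check is just a ratio of the two counters from the Status: Database Service Manager output above; a minimal sketch using the 14:30:48 figures:

        # Sketch: estimate pica queue contention from the promon counters above.
        access_count      = 150_026_275   # Access Cnt at 14:30:48 (cumulative since db startup)
        access_collisions = 25_562        # Collisions at the same sample

        ratio = access_collisions / access_count * 100
        print(f"collision ratio: {ratio:.3f}%")   # ~0.017% -> very little contention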
     
    -jeff
     
  • > Access collisions is the number of times that a process tried to add to the queue or remove from the queue (repl server) and found that the queue was locked.

    What is a typical duration of a queue lock and what are the factors that can affect it?

  • Once the lock is obtained, a free entry is either available or it is not. A search of the queue is not required to find a free entry while
    the lock is held.  If a free entry is not available, the lock is released.  If a free entry on the queue is available, a small amount of data
    is updated in the message entry (112 bytes). Then the lock is released.  So the lock is held for a very, very short time.
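
    As a conceptual illustration only (not the actual engine code), that enqueue step amounts to something like this:

        # Conceptual model of the enqueue step described above -- not OpenEdge code.
        # Lock the queue, take a free entry if there is one, copy ~112 bytes of
        # message data, release the lock.
        import threading

        class PicaQueue:
            def __init__(self, entries: int):
                self.lock = threading.Lock()        # the "service latch"
                self.free = list(range(entries))    # free message entries
                self.used = {}                      # entry slot -> message payload

            def enqueue(self, payload: bytes) -> bool:
                with self.lock:                     # held only for the steps below
                    if not self.free:               # no free entry: give up, lock released
                        return False
                    slot = self.free.pop()
                    self.used[slot] = payload[:112] # small fixed-size update
                    return True                     # lock released on exit

        q = PicaQueue(entries=4)
        print(q.enqueue(b"\x00" * 112))             # True while free entries remain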
     
    -jeff
     
     
  • Thanks, Jeff.

    The pica queue lock is the same as the service latch lock, isn't it?

    >  So the lock is held for a very, very, short time in duration.

    Is it less than a microsecond, as with the "spin" latches? Sorry, I don't know the correct term for the classical latches (like MTX, USR, etc.) as opposed to the administrative latches (like the pica service latch).

    Access Count: 150,000,000 per 30 min in my example. If access count = lock count, then we can roughly estimate the total time the queue is locked as less than 150 sec per 30-minute interval, i.e. the queue is locked for 8% (or less) of the time (spelled out at the end of this post). Is that correct?

    The Access Collisions to Access Count ratio is only 0.017% in my example and 0.019% in Dmitri's example. Which processes are competing for the queue lock, and how actively? Why are the collisions so rare?

    UPD: The administrative latches seem to be used only in situations with low collisions (like the BIADD latch used by the prostrct add online utility). I guess they don't even use the spin lock mechanism. So the low access collisions in the Service Manager seem to be expected.
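
    Spelling out the estimate above (assuming roughly 1 microsecond per lock hold):

        # Rough estimate, assuming ~1 microsecond per lock hold.
        accesses_per_interval = 150_000_000   # Access Count delta per 30 min (see above)
        interval_sec          = 30 * 60
        lock_hold_sec         = 1e-6          # assumed upper bound per access

        locked_sec = accesses_per_interval * lock_hold_sec        # ~150 s
        print(f"queue locked ~{locked_sec:.0f} s of {interval_sec} s "
              f"({locked_sec / interval_sec:.1%})")                # ~8.3%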

  • I missed that Status: Database Service Manager reports the counters since db startup. So I slightly "re-worded" the same statistics I posted above.

    Status: Database Service Manager


    Time      Free Msgs  Used Msgs  Locked by RPLS  Filled       Access  Collisions  AI Blocks  Percent  Access/Block  Notes/Block
    04:00:28     187428          0               0       0  146,653,272       24813  6,161,883      96%         23.80          140
    04:30:29     176446      10982           10982       0      133,841          10     29,805      99%          4.49          137
    05:00:30     129880      57548           57548       0      116,841          14     75,856      99%          1.54          138
    05:30:31      82080     105348          105348       0      109,030          27     72,563      99%          1.50          140
    06:00:32      40849     146579          146579       0      102,666          31     66,092      99%          1.55          140
    06:30:33       1727     185701          185701       0      143,567          39     85,502      99%          1.68          139
    07:00:34         90     187338          187338    2757      140,428          51     49,726      99%          2.82          138
    07:30:35          9     187419          187419    3282      147,051          51     46,937      99%          3.13          140
    08:00:36       6094     181334          181334     787      152,776          52     64,125      99%          2.38          140
    08:30:37        148     187280          187280    1850      135,885          44     43,748      99%          3.11          147
    09:00:38        282     187146          187146    2746      154,694          43     29,054      98%          5.32          153
    09:30:39         11     187417          187417    2498      150,499          55     40,463      98%          3.72          155
    10:00:40          8     187420          187420    2645      160,840          41     27,531      98%          5.84          153
    10:30:40          0     187428          187428    2506      173,083          49     25,577      98%          6.77          157
    11:00:41          6     187422          187422    1359      193,062          31     12,627      97%         15.29          159
    11:30:42          0     187428          187428    1021      193,274          16      9,685      96%         19.96          159
    12:00:43          0     187428          187428     839      191,698          20      7,933      96%         24.16          159
    12:30:44          0     187428          187428     990      192,201          17      9,313      96%         20.64          159
    13:00:45          6     187422          187422    1113      194,764          23     10,732      97%         18.15          161
    13:30:46          0     187428          187428    1508      195,405          36     15,022      97%         13.01          161
    14:00:47          0     187428          187428    1700      194,589          53     16,662      97%         11.68          161
    14:30:48          0     187428          187428    2525      196,809          46     24,024      98%          8.19          161
    15:00:49          8     187420          187420    2627      191,212          59     26,520      98%          7.21          160
    15:30:51       2244     185184          185184    2285      175,022          70     28,485      98%          6.14          159
    16:00:52      24712     162716          162716    1165      172,104          87     42,393      99%          4.06          161
    16:30:53     103103      84325           84325       0      143,864          55     26,906      98%          5.35          152
    17:00:54     123870      63558           63558       0       68,242          14     17,954      97%          3.80          145
    17:30:55     141597      45831           45831       0       67,828           9     19,297      97%          3.51          146
    18:00:56     187428          0               0       0      178,082          40     16,868      97%         10.56          149
    18:30:57     187428          0               0       0      257,580          44      7,775      94%         33.13          143
    19:00:58     187428          0               0       0      345,113          96     25,769      98%         13.39          149
    19:30:59     187428          0               0       0      355,315          77     22,158      98%         16.04          137
    20:00:59     187428          0               0       0      281,343          69     18,364      98%         15.32          137
    20:31:00     187428          0               0       0       81,337          10        902      89%         90.17          142
    21:01:01     187428          0               0       0       60,178           5        190      71%        316.73          138
    21:31:02     187428          0               0       0       59,145           2        247      79%        239.45          140
    22:01:03     187428          0               0       0       55,854           4        130      71%        429.65          132
    22:31:04     187428          0               0       0       53,761           0         28      25%      1,920.04          125
    23:01:05     187428          0               0       0       57,486           1        161      65%        357.06          137
    23:31:06     187428          0               0       0       48,780           1         15      42%      3,252.00          132
    00:01:07     187428          0               0       0       51,271           3        658      96%         77.92          147
    00:31:08     187428          0               0       0       50,421           3      1,696      98%         29.73          143
    01:01:09     187428          0               0       0       78,023           3        442      71%        176.52          140
    01:31:10     187428          0               0       0      172,987           4      2,707      83%         63.90          143
    02:01:11     187428          0               0       0       80,442           1        281      51%        286.27          142
    02:31:12     187428          0               0       0       85,199           1        593      68%        143.67          134
    03:01:13     187428          0               0       0      185,981          23      9,860      97%         18.86          132

    where:

    First interval - 410 hrs 14 min (db uptime).
    The rest intervals - 30 min.
    "Filled", "Access", "Collisions" are the increments per interval of the corresponding counters in Status: Database Service Manager ("Area Filled Count", "Access Count", "Access Collisions").

    "AI Blocks" - the number of AI blocks generated per interval = "Total AI Writes" minus "Partial Writes" (from Activity: AI Log)

    "Percent" = "AI Blocks" / "Total AI writes" * 100% = an indirect indicator of idle db.

    "Notes/Block" = "Records written" (from Activity: AI Log) / "AI Blocks". AI blocksize = 16K. Average size of recovery notes is 112 bytes in the given example.

    Conclusions/Remarks:

    An unknown process (or processes) on the source db is creating the collisions when trying to access the pica queue, but all 47 checks by promon show the message entries locked only by the Replication Server. The collisions are extremely low even during the busiest time: 96 collisions per 30 minutes (or one collision per 20 sec). In other words, almost 100% of the pica queue accesses are generated by the Replication Server. I guess the ABL sessions do not access the pica queue. In my example there were about 2,500 sessions; it's unlikely that the Replication Server would be able to dominate over their access requests. So the access collisions can be created only by the special processes that work with AI files. UPD: Documentation says: "AIW and REPL attempting simultaneous access".

    Replication Server generates at least one access per incoming AI block (e.g. 102,666 accesses per 66,092 AI blocks). When transaction activity is low, the number of pica queue accesses per AI block can be huge. In other words, Replication Server accesses the pica queue at least 27 times per sec (e.g. 48,780 accesses per 30 min with only 15 incoming AI blocks). For comparison: at busy times the db was able to generate 85,502 AI blocks per 30 min, or 48 AI blocks per sec.

    Max value of "Area Filled Count" per 30 min interval was 3,282 or almost twice per sec.

  • > Now I know I set -pica too high at 16384.

    > Used HighWater Mark shows that.

    16 MB of RAM per db is not a big price for insurance. ;-)
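
    The arithmetic, for reference (the ~112 bytes per message entry follows from the Status output in this thread: 20500 KB for 187428 entries):

        # -pica is specified in KB; each message entry costs roughly 112 bytes.
        bytes_per_entry = 20500 * 1024 / 187428   # ~112 bytes (from the Status output)
        pica_kb = 16384                           # the value quoted above

        print(f"{pica_kb} KB = {pica_kb / 1024:.0f} MB of RAM")
        print(f"~{pica_kb * 1024 / bytes_per_entry:,.0f} message entries")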

    > So the goal is that HWM is somehow close to Total Message Entries, like half or more, right?

    Unfortunately, the HWM is almost useless without the date/time when the used message entries rose to that value. It could have happened a month ago due to some particular conditions at the time (e.g. an abnormal peak of transaction activity, a network issue between the source and target dbs, etc.). For the rest of the time (until db restart) you will see a "reminder" of a past bad day that does not matter for the current situation.


    > Area Filled Count should be small (single digit) or zero?

    Only zero! Other values would mean that performance on the source db was considerably degraded due to a too-short pica queue, at least for some period of time.

    > Access Collisions ???

    It's a field that can be totally ignored.

  • The duration of a lock of the pica queue is less than a millisecond.  Processes that flush AI buffers add entries to the pica queue and the replication server reads entries from the queue.  I should note that the pica queue is not used during the startup synchronization process.  During synchronization, the repl server reads directly from the AI extent(s).  The pica queue is then “turned on” once it gets close to the most recently written AI blocks.  The pica queue is then used during normal processing.  I would assume your contention is very low because you are running an AIW on the source.  The only contention in this case would be between the AIW and the repl server.
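
    A toy model of that flow, just to illustrate how a reader that falls behind lets the queue fill up (a sketch, not how the engine is actually implemented):

        # Toy producer/consumer model: AI buffer flushers add entries, the repl
        # server drains them.  Not engine code; sizes and rates are invented.
        from collections import deque

        QUEUE_SIZE = 10
        queue = deque()
        filled_count = 0                        # analogous to "Area Filled Count"

        def flush_ai_block(block_no: int) -> bool:
            """Writer side: try to queue a notification for one AI block."""
            global filled_count
            if len(queue) >= QUEUE_SIZE:        # queue full
                filled_count += 1
                return False
            queue.append(block_no)
            return True

        def repl_server_poll(max_blocks: int) -> int:
            """Reader side: drain up to max_blocks queued notifications."""
            drained = 0
            while queue and drained < max_blocks:
                queue.popleft()
                drained += 1
            return drained

        # Writer produces 3 blocks per tick, the reader keeps up with only 2:
        for tick in range(10):
            for b in range(3):
                flush_ai_block(tick * 3 + b)
            repl_server_poll(2)
        print(len(queue), filled_count)         # occupancy grows until the queue fills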
     
    Generally, you should see very little contention for the pica queue with an AI configuration that is well tuned. Otherwise, we’d like to hear about an environment where a high level of pica queue contention is seen.
     
    -jeff
     
  • The earlier-mentioned knowledgebase article "Clarification of the -pica replication enabled database startup parameter"

    knowledgebase.progress.com/.../P104638

    says: "There is a PICA queue on both the source and target databases"

    Should the -pica parameters be the same on source and target databases, or should we tune them individually? These databases are running on different hardware and under a different load.

    I guess that in the customer's case I posted above, the performance degradation on the source db was caused by slowness of the replication agent on the target db. Theory: the replication server was spending a lot of time waiting for acknowledgments from the replication agent, which in turn resulted in the filled pica queue on the source db => the source's APW was waiting for pica access => the updates on the source db were very slow. Is that a reasonable explanation?

    Will we also get a negative impact on the source database if the pica queue is filled on the target database for any reason? Will it cause the pica queue on the source db to fill as well?

  • Comparing the pica queues on source and target dbs:

    Source:

    09/27/17        Status: Database Service Manager
    16:30:53        
    Communication Area Size   :      20500.00 KB
      Total Message Entries   :      187428
      Free Message Entries    :      103103
      Used Message Entries    :       84325
      Used HighWater Mark     :      187428
      Area Filled Count       :       36203
      Service Latch Holder    :          -1
      Access Count            :   150708477
      Access Collisions       :       25833
    Registered Database Service Objects
    Name                             Rdy Status  Messages Locked by
    OpenEdge Replication Server       Y  RUN        84325 
    

    Target:

    09/27/17        Status: Database Service Manager
    16:30:49        
    Communication Area Size   :      20500.00 KB
      Total Message Entries   :      187428
      Free Message Entries    :      187428
      Used Message Entries    :           0
      Used HighWater Mark     :           0
      Area Filled Count       :           0
      Service Latch Holder    :          -1
      Access Count            :    24717946
      Access Collisions       :         141
    Registered Database Service Objects
    Name                             Rdy Status  Messages Locked by
    OpenEdge Replication Agent        Y  RUN            0 
    

    The Access Count on the source db is 6 times larger than on the target.
    There is a large Area Filled Count on the source db and a zero HWM on the target.
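
    In case it's useful, the two Status: Database Service Manager blocks above can be compared with a trivial parser along these lines (it assumes the "name : value" layout shown above; the exact promon layout may vary between releases):

        # Sketch: turn the "name : value" lines of a promon Status: Database
        # Service Manager block into a dict, so source and target can be diffed.
        def parse_service_manager(text: str) -> dict:
            stats = {}
            for line in text.splitlines():
                name, sep, value = line.partition(":")
                if not sep:
                    continue                    # no colon -> not a counter line
                value = value.strip().split()[0] if value.strip() else ""
                try:
                    stats[name.strip()] = float(value.replace(",", ""))
                except ValueError:
                    pass                        # skip dates, headers, etc.
            return stats

        source = parse_service_manager("""
          Access Count            :   150708477
          Access Collisions       :       25833
        """)
        target = parse_service_manager("""
          Access Count            :    24717946
          Access Collisions       :         141
        """)
        print(source["Access Count"] / target["Access Count"])   # ~6.1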

  • The pica queue does not need to be the same size on the source and the target.  The default size on the target is likely sufficient since the queue is only used by dsrutil to terminate an agent, restart an agent (11.6+), and to start transition.  When configuring restart after transition, the value used for the restart should be the same as the setting for the original source.
     
    Regarding the customer’s problem, the pica queue on the target should not fill up during the processing of the AI blocks.  The queue on the target is not used as part of the replication process.   The repl server generally doesn’t wait for acknowledgments from the agent so I would be skeptical of that theory.  The agent may not be keeping up with the source.  How similar in configuration are the source and target servers?  Is a BIW running on the target?  And if AI is configured on the target, is an AIW running?
     
  • BIW is running on the target.  AI on the target is not enabled.

    The customer's case: rather high transaction activity (~0.7 MB/sec of AI/BI writes) caused a persistent increase in the used message entries in the pica queue on the source db. When the pica queue was totally filled, we got 100% busy MTX and AIB latches, meaning that almost every snapshot in promon showed an owner of these latches. The MTX latch was owned by ABL users (a different one each time). The AIB latch was always owned by the AIW process. Just for the record: competition for the AIB latch between ABL users is not possible because they always lock the AIB (and BIB) latch under cover of the MTX latch: they lock the MTX latch first and only then request the AIB/BIB lock. I guess there are only a few special processes that can compete for the AIB latch directly, but mainly it's the AIW.

    The number of AIB latch locks was rather low for a busy latch: 700-2000 locks per sec. Under normal conditions the lock duration of the AIB latch is less than 100 nanoseconds. Hence, to make the latch 100% busy we would need 10 million locks per sec - it's impossible to get such a high number of AIB locks in a real database.

    > The duration of a lock of the pica queue is less than a millisecond.

    It's a huge time compared to the duration of the "spin" latches like AIB. And it also means that we can create only about a thousand locks per sec (which matches the numbers above; the arithmetic is spelled out below).
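
    The arithmetic behind those two statements, spelled out (the durations are the assumed figures above):

        # Assumed latch hold times from the discussion above.
        normal_hold  = 100e-9   # ~100 ns per AIB latch lock under normal conditions
        stalled_hold = 1e-3     # ~1 ms if the holder has to wait on the pica queue

        print(f"{1 / normal_hold:,.0f} locks/sec to keep the latch 100% busy")  # 10,000,000
        print(f"{1 / stalled_hold:,.0f} locks/sec ceiling when stalled")        # ~1,000 (matches the observed 700-2000/sec)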

    New explanation of the customer's case:

    I guess the AIW process requests the pica queue lock while it holds the AIB latch. When the AIW is forced to wait for the pica queue lock, the hold time of the AIB latch increases by 10 thousand times - from 100 ns to 1 ms. An ABL process that already owns the MTX latch is then forced to wait for the AIB latch. That is why the MTX latch became 100% busy at the same time as the AIB latch. The numbers of MTX/AIB/BIB latch locks are close to each other - the difference between MTX and AIB latch locks used to be plus/minus 1%, and the difference between MTX and BIB latches used to be plus/minus 10% (or more during idle time).

    If this "theory" of mine is correct, then the only solution is to have a pica queue large enough to avoid it filling up completely during the periods of highest transaction activity.

  • I misspoke regarding the duration of a pica queue lock.  The duration of a pica queue lock is less than a microsecond.  The mechanism is the same as with “spin” latches.  My apologies for any confusion.
     
    It’s not so much, in the customer’s case, that it is waiting for the pica queue lock; it’s that it’s waiting for a free message entry on the queue.  The APW is queuing messages faster than the repl server can read AI blocks and send them to the target.  The repl server’s job can be affected by network bandwidth, by the target server being less performant than the source server, and by the target not being tuned as well as the source.  In any case, I would agree that the solution for the customer is to increase the pica queue size so that it is large enough to accommodate the highest level of transaction activity.
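
    One rough way to sanity-check a -pica value against that imbalance (the rates here are invented; the real ones would come from the source's AI Log activity and the repl server's observed throughput):

        # Sketch: time until the pica queue fills when entries are queued faster
        # than the repl server drains them.  The rates below are invented examples.
        pica_entries = 187_428    # Total Message Entries for the configured -pica
        queued_rate  = 48.0       # AI blocks/sec queued at peak (figure from an earlier post)
        drained_rate = 27.0       # blocks/sec the repl server manages to ship (invented)

        net_inflow = queued_rate - drained_rate
        if net_inflow > 0:
            print(f"queue fills in ~{pica_entries / net_inflow / 60:.0f} minutes")   # ~149
        else:
            print("queue never fills at these rates")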
     
     
  • Thanks again, Jeff.

    > The duration of a pica queue lock is less than a microsecond.   The mechanism is the same as with “spin” latches.

    How long will the AIW wait before a second try if it fails to get the pica queue lock? 10 ms? Does it use the -nap value?

  • Yes, 10 ms.  No, it does not use the -nap value; it is not configurable.
     
    -jeff