Hello,
I am looking for some advice on setting the -pica parameter for a group of databases with replication enabled.
From what I've read, all databases must have the same -pica value. Articles suggest the calculation is
-pica = (TAIW / BlockCount) * 1.25
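For what it's worth, the quoted formula is just arithmetic. Here it is as a literal sketch with made-up sample numbers; the precise meaning of TAIW and BlockCount is exactly what this thread is trying to pin down, so treat the inputs as placeholders:

```python
def pica_estimate(taiw, block_count, headroom=1.25):
    """Literal implementation of the quoted formula:
    -pica = (TAIW / BlockCount) * 1.25
    taiw and block_count are whatever the article defines them to be."""
    return (taiw / block_count) * headroom

# Hypothetical sample numbers, purely to show the arithmetic:
print(round(pica_estimate(6_000_000, 1_000)))   # -> 7500
```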
So should I take the Total AI writes from the most used database?
Is this the total AI writes within 1 hour?
Also, it says to use the AI writes from the busiest period, but I don't know when that is, so I'm not sure how best to estimate it. I averaged this figure when calculating the number of AI extents required; could that average be used here?
Any advice would be greatly appreciated.
Where did you read that -pica should be the same on all databases? Because that's news to me, and something I would be falling foul of.
Which version of Progress are we talking by the way?
This advice should apply to 99% of OE environments:
If you are on any version before 10.2B08, use the maximum allowable -pica value for your version. In 10.2B08 the maximum was raised from 8,192 to 1,000,000 (I think they were sick of raising it every couple of years). On 10.2B08 or later, use 32,768.
Larger -pica values are required if a) the total volume of AI writes is VERY high and/or b) the bandwidth to the target is relatively slow.
You cannot base your pica calculation only on AI writes unless you have a very fast (read: local) network. The minute you're going over a WAN, bandwidth varies wildly. Last week, one of our customers' replication was falling behind every day and almost catching up every night. We were tracking it in ProTop, watching AI files fill and get locked, and fill and get locked. Replication status was stuck in "synchronizing" mode and never quite got back to "Normal Processing". The network admins swore nothing was wrong and that only 1% of the WAN bandwidth was being used.
It must be a Progress bug!! Not.
Copying a 1 MB file from source server to target server took minutes. Copying it back from target to source server took one second. Yeah, must be a Progress issue.
The lesson here is to monitor pica usage and to stop the replication server process before the pica buffers fill, otherwise forward processing for ALL users will be stalled and your phone will start ringing off the hook. Monitor replication server AND agent status (both sides) as there are bugs in earlier versions whereby the source will report "Normal Processing" and the target will report "Listening". ProTop does all of this and more.
Regards,
Paul
To complete Paul's story...
... after pestering the network vendor they had an "aha moment" and suddenly found something to change. Normal bandwidth was restored and the replication caught up in a few hours and returned to its normal state of being always up to date.
When you say "watching AI files fill and get locked and fill and get locked": is it possible to fill up all available AI extents on the source DB, and if so, what happens in that case? I would assume that the source DB would then generate an error and pause, but does it give an AI write error or a replication sync error? Thanks,
If AI files are fixed size, once all of them fill your database will either crash or stall, depending on the use or not of the -aistall DB startup parameter.
If AI files are variable and are rotated every x minutes, (n-1) AI files will fill (status: locked) and the nth AI file (the busy one) will continue to grow until you hit some external limit: a 2 GB file size if large files are not enabled, or a full file system. When this happens, again the DB will crash or stall depending on the use of -aistall.
As this is happening, the replication server and agent processes would be generating error messages in the database log files.
To summarize, you should be monitoring pica usage, AI file status (free, full, locked), replication server and agent status and the database log file. You can write your own scripts to do this or you can use ProTop to generate alerts when configured threshold(s) are reached. For example, you could configure ProTop to send an email when 2 AI files are locked and to send an SMS when 4 AI files are locked. Please note that alert generation is a paid add-on to ProTop Free.
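The alerting rules described above can be sketched like this. The thresholds mirror the example in the text (email at 2 locked AI files, SMS at 4); the function name and structure are made up for illustration and are not ProTop's actual implementation:

```python
# Sketch of the escalation rule: email at 2 locked AI extents, SMS at 4.
# Thresholds are example values -- tune to your own AI extent count.
def ai_alert_level(locked_ai_files, email_at=2, sms_at=4):
    if locked_ai_files >= sms_at:
        return "sms"
    if locked_ai_files >= email_at:
        return "email"
    return "ok"

print(ai_alert_level(1))   # ok
print(ai_alert_level(2))   # email
print(ai_alert_level(5))   # sms
```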
Regards,
Thanks. One of the databases will have a high volume of AI writes, so I think I should start with that database to get the most accurate value for -pica.
I'm not sure how to check the network speed. Just copy a file and see what speed it reports?
I should have said - I am using version 11.6.
Your network admin can tell you the theoretical bandwidth and then you can copy some largish files (hundreds of MB) to estimate real bandwidth.
Another thing that I like to do is rotate AI files every 15 minutes or less and chart the file sizes. If the peaks are high relative to my network bandwidth, I might ask the network admin to spit out a chart of bandwidth utilization between the main datacentre and the DR site. You may find that your peak AI writes correspond to peak network usage, requiring some QoS configuration.
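One back-of-envelope check along these lines: compare the busiest 15-minute AI burst against what the link can realistically move in 15 minutes. All the numbers and the 70% efficiency factor below are assumptions for illustration, not measured values:

```python
# Can the link drain the busiest 15-minute AI burst before the next
# rotation? Efficiency factor accounts roughly for protocol overhead.
def burst_drain_seconds(ai_mb_per_15min, link_mbit_per_sec, efficiency=0.7):
    usable_mb_per_sec = link_mbit_per_sec / 8 * efficiency  # Mbit/s -> MB/s
    return ai_mb_per_15min / usable_mb_per_sec

# Hypothetical: 1200 MB of AI in the peak 15 minutes over a 20 Mbit/s WAN.
t = burst_drain_seconds(ai_mb_per_15min=1200, link_mbit_per_sec=20)
print(f"{t:.0f} s to replicate the peak burst")  # 686 s -- inside the 900 s window
```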
Paul
Is there a way to quantitatively monitor -pica usage with promon, a VST, or anything else?
Other than knowing it is Good ("Normal Processing") or Bad ("Synchronizing")?
Everything else you describe above sounds pretty straightforward:
Count how many AI files are Locked, and check that the server and agent statuses are 1101 and 3101. We can even rotate AI files faster (every 15 min) or slower (every 60 min or even 2 hr) based on how many AI files are locked out of the total number available.
If server status not 1101 dsrutil -C restart server
If agent status not 3101 dsrutil -C restart agent
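The check-and-restart logic above can be sketched as follows. The status codes (1101 for the replication server, 3101 for the agent) and the dsrutil commands are taken from this thread; the function name and the surrounding structure are hypothetical, and real status parsing would need adapting to your dsrutil output:

```python
# Sketch: decide whether a dsrutil restart is needed, given a status code.
import subprocess  # only needed if you uncomment the run() call below

EXPECTED = {"server": 1101, "agent": 3101}

def restart_command(db, component, status):
    """Return the dsrutil restart command list, or None if status is healthy."""
    if status == EXPECTED[component]:
        return None
    return ["dsrutil", db, "-C", "restart", component]

# Example: agent reporting something other than 3101 triggers a restart.
cmd = restart_command("sports", "agent", 3050)
if cmd is not None:
    print(" ".join(cmd))              # dsrutil sports -C restart agent
    # subprocess.run(cmd, check=True) # uncomment on a real system
```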
promon -> R&D -> 1 -> 16, I think.
Dmitri,
I think you're asking for promon - R&D - 1 - 16. That's the only way that I know to monitor pica usage and that's how ProTop does it. There is nothing in the VSTs.
FYI I use dsrutil db -C status -detail and check for 6021 or 3049.
FYI #2 (not for Dmitri but for other readers) dsrutil db -C restart agent is only available in 11.6+. Prior to that, you had to bounce the target DB to restart the agent.
You can get R&D 1|16 info from the _DbServiceManager VST in 11.6+. Example:
find _dbservicemanager no-lock.
display _dbservicemanager with 1 column.

ID                                    : 1
Size of Pica                          : 8193.08
Available Pica Message Entries        : 74908
Free Pica Message Entries             : 74908
Queued Pica Message Entries           : 0
Used Message High Water Mark          : 15288
Number of Times Pica Filled           : 0
User Number of Last Pica Latch Holder : -1
Number of Times Locks Taken Out       : 799392373
Number of Lock Collisions             : 2021
PID of Lock Holder                    : 0
User Type of Lock Holder              : Not available
User Number of Lock Holder            : 0
(I fudged the column labels for the sake of clarity.)
More info on the new and updated VSTs, and other OE Repl enhancements:
http://pugchallenge.org/downloads2017/372_Whats%20New%20in%20Replication.pptx
Thank you Libor, Paul and Rob
Communication Area Size : 16384.00 KB
Total Message Entries : 149796
Free Message Entries : 149796
Used Message Entries : 0
Used HighWater Mark : 216
Area Filled Count : 0
Service Latch Holder : -1
Access Count : 589755
Access Collisions : 114
Now I know I set -pica too high at 16384.
Used HighWater Mark shows that.
So the goal is that the HWM is somewhat close to Total Message Entries, like half or more, right?
Area Filled Count should be small (single digit) or zero?
Access Collisions ???
I see you also have Access Collisions; can somebody tell me more about this field?
I think there is no problem as long as it does not exceed a certain proportion of the Access Count, but I'm curious.
Communication Area Size : 65.00 KB
Total Message Entries : 595
Free Message Entries : 595
Used Message Entries : 0
Used HighWater Mark : 495
Area Filled Count : 0
Service Latch Holder : -1
Access Count : 213971543
Access Collisions : 65747
My bet: access count and access collisions are analogous to the locks and naps of the latches. The service latch is an administrative (a.k.a. "private") latch. The administrative latches are quite different from the spin latches, but all latches exist to resolve collisions.
Example of pica filled:
Communication Area Size: 20500.00 KB
Total Message Entries: 187428
Time      Free Msgs  Used Msgs  Used HWM  Filled Cnt  Access Cnt  Collisions  Locked by Replication Server
04:00:28     187428          0     39351           0   146653272       24813       0
04:30:29     176446      10982     39351           0   146787113       24823   10982
05:00:30     129880      57548     57554           0   146903954       24837   57548
05:30:31      82080     105348    105349           0   147012984       24864  105348
06:00:32      40849     146579    146579           0   147115650       24895  146579
06:30:33       1727     185701    185736           0   147259217       24934  185701
07:00:34         90     187338    187428        2757   147399645       24985  187338
07:30:35          9     187419    187428        6039   147546696       25036  187419
08:00:36       6094     181334    187428        6826   147699472       25088  181334
08:30:37        148     187280    187428        8676   147835357       25132  187280
09:00:38        282     187146    187428       11422   147990051       25175  187146
09:30:39         11     187417    187428       13920   148140550       25230  187417
10:00:40          8     187420    187428       16565   148301390       25271  187420
10:30:40          0     187428    187428       19071   148474473       25320  187428
11:00:41          6     187422    187428       20430   148667535       25351  187422
11:30:42          0     187428    187428       21451   148860809       25367  187428
12:00:43          0     187428    187428       22290   149052507       25387  187428
12:30:44          0     187428    187428       23280   149244708       25404  187428
13:00:45          6     187422    187428       24393   149439472       25427  187422
13:30:46          0     187428    187428       25901   149634877       25463  187428
14:00:47          0     187428    187428       27601   149829466       25516  187428
14:30:48          0     187428    187428       30126   150026275       25562  187428
Access Collisions are almost the same all the time.
Used Message Entries = Locked by OpenEdge Replication Server
Activity: AI Log
Time      Total Wrt  AIW Wrt  Notes Wrt  Bytes Wrt  BusyWaits  Partial writes
04:00:28    6410836  6134415    860955K  98412861K       3506          248953
04:30:29      30234    29677    4079689    476171K         17             429
05:00:30      76472    75515   10441933   1211905K         40             616
05:30:31      73177    72211   10123924   1159241K         45             614
06:00:32      66706    65792    9253486   1055882K         45             614
06:30:33      86124    85119   11848009   1365994K         54             622
07:00:34      50341    49417    6874101    794400K         35             615
07:30:35      47549    46614    6566173    749848K         37             612
08:00:36      64745    63774    8997106   1024461K         46             620
08:30:37      44369    43458    6426670    698868K         43             621
09:00:38      29663    28685    4459687    464162K         41             609
09:30:39      41087    40112    6251794    646457K         56             624
10:00:40      28148    27162    4220942    439805K         91             617
10:30:40      26172    25201    4019687    408615K         99             595
11:00:41      13070    12416    2005665    201711K        173             443
11:30:42      10044     9531    1543874    154703K        112             359
12:00:43       8280     7677    1261205    126705K         68             347
12:30:44       9675     9166    1479821    148782K        149             362
13:00:45      11091    10619    1723126    171474K        185             359
13:30:46      15510    14665    2414339    240059K        287             488
14:00:47      17176    16211    2681374    266200K        226             514
14:30:48      24562    23546    3876160    383844K        154             538
And, of course, the filled pica queue causes slowness on the source DB (Total AI Writes drop roughly tenfold).
> Access collisions is the number of times that a process tried to add to the queue or remove from the queue (repl server) and found that the queue was locked.
What is a typical duration of a queue lock and what are the factors that can affect it?
Thanks, Jeff.
The pica queue lock is the same as service latch lock, isn't it?
> So the lock is held for a very, very, short time in duration.
Is it less than a microsecond, as with the "spin" latches? Sorry, I don't know the correct term for the classical latches (like MTX, USR etc.) as opposed to the administrative latches (like the pica service latch).
Access Count: 150,000,000 per 30 min in my example. If access count = lock count, then we can roughly estimate the total duration of queue locks as less than 150 sec during a 30 min interval, i.e. the queue is locked at most 8% of the time. Is that correct?
The Access Collisions to Access Count ratio is only 0.017% in my example and 0.019% in Dmitri's example. Which processes are competing for the queue lock, and how actively? Why are the collisions so rare?
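The two percentages above can be reproduced with quick arithmetic; the 1 microsecond lock duration is an assumed upper bound, not a measured value:

```python
# Rough upper-bound estimate of how long the pica queue spends locked.
accesses   = 150_000_000   # Access Count delta over a 30 min interval
collisions = 25_000        # Access Collisions delta, same interval
lock_us    = 1             # assumed upper bound on lock hold time, microseconds

locked_seconds = accesses * lock_us / 1_000_000    # 150.0 s
locked_pct     = locked_seconds / (30 * 60) * 100  # ~8.3 %
collision_pct  = collisions / accesses * 100       # ~0.017 %
print(f"locked {locked_pct:.1f}% of the interval, collisions {collision_pct:.3f}%")
```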
UPD: The administrative latches seem to be used only in situations with few collisions (like the BIADD latch used by the prostrct add online utility). I guess they don't even use the spin-lock mechanism, so the low access collisions in the Service Manager seem to be expected.
I missed that Status: Database Service Manager reports the counters since DB startup, so I have slightly "re-worded" the same statistics I posted above.
Status: Database Service Manager
Time      Free Msgs  Used Msgs  Locked by RPLS  Filled       Access  Collisions  AI Blocks  Percent  Access/Block  Notes/Block
04:00:28     187428          0               0       0  146,653,272       24813  6,161,883      96%         23.80          140
04:30:29     176446      10982           10982       0      133,841          10     29,805      99%          4.49          137
05:00:30     129880      57548           57548       0      116,841          14     75,856      99%          1.54          138
05:30:31      82080     105348          105348       0      109,030          27     72,563      99%          1.50          140
06:00:32      40849     146579          146579       0      102,666          31     66,092      99%          1.55          140
06:30:33       1727     185701          185701       0      143,567          39     85,502      99%          1.68          139
07:00:34         90     187338          187338    2757      140,428          51     49,726      99%          2.82          138
07:30:35          9     187419          187419    3282      147,051          51     46,937      99%          3.13          140
08:00:36       6094     181334          181334     787      152,776          52     64,125      99%          2.38          140
08:30:37        148     187280          187280    1850      135,885          44     43,748      99%          3.11          147
09:00:38        282     187146          187146    2746      154,694          43     29,054      98%          5.32          153
09:30:39         11     187417          187417    2498      150,499          55     40,463      98%          3.72          155
10:00:40          8     187420          187420    2645      160,840          41     27,531      98%          5.84          153
10:30:40          0     187428          187428    2506      173,083          49     25,577      98%          6.77          157
11:00:41          6     187422          187422    1359      193,062          31     12,627      97%         15.29          159
11:30:42          0     187428          187428    1021      193,274          16      9,685      96%         19.96          159
12:00:43          0     187428          187428     839      191,698          20      7,933      96%         24.16          159
12:30:44          0     187428          187428     990      192,201          17      9,313      96%         20.64          159
13:00:45          6     187422          187422    1113      194,764          23     10,732      97%         18.15          161
13:30:46          0     187428          187428    1508      195,405          36     15,022      97%         13.01          161
14:00:47          0     187428          187428    1700      194,589          53     16,662      97%         11.68          161
14:30:48          0     187428          187428    2525      196,809          46     24,024      98%          8.19          161
15:00:49          8     187420          187420    2627      191,212          59     26,520      98%          7.21          160
15:30:51       2244     185184          185184    2285      175,022          70     28,485      98%          6.14          159
16:00:52      24712     162716          162716    1165      172,104          87     42,393      99%          4.06          161
16:30:53     103103      84325           84325       0      143,864          55     26,906      98%          5.35          152
17:00:54     123870      63558           63558       0       68,242          14     17,954      97%          3.80          145
17:30:55     141597      45831           45831       0       67,828           9     19,297      97%          3.51          146
18:00:56     187428          0               0       0      178,082          40     16,868      97%         10.56          149
18:30:57     187428          0               0       0      257,580          44      7,775      94%         33.13          143
19:00:58     187428          0               0       0      345,113          96     25,769      98%         13.39          149
19:30:59     187428          0               0       0      355,315          77     22,158      98%         16.04          137
20:00:59     187428          0               0       0      281,343          69     18,364      98%         15.32          137
20:31:00     187428          0               0       0       81,337          10        902      89%         90.17          142
21:01:01     187428          0               0       0       60,178           5        190      71%        316.73          138
21:31:02     187428          0               0       0       59,145           2        247      79%        239.45          140
22:01:03     187428          0               0       0       55,854           4        130      71%        429.65          132
22:31:04     187428          0               0       0       53,761           0         28      25%      1,920.04          125
23:01:05     187428          0               0       0       57,486           1        161      65%        357.06          137
23:31:06     187428          0               0       0       48,780           1         15      42%      3,252.00          132
00:01:07     187428          0               0       0       51,271           3        658      96%         77.92          147
00:31:08     187428          0               0       0       50,421           3      1,696      98%         29.73          143
01:01:09     187428          0               0       0       78,023           3        442      71%        176.52          140
01:31:10     187428          0               0       0      172,987           4      2,707      83%         63.90          143
02:01:11     187428          0               0       0       80,442           1        281      51%        286.27          142
02:31:12     187428          0               0       0       85,199           1        593      68%        143.67          134
03:01:13     187428          0               0       0      185,981          23      9,860      97%         18.86          132
where:
First interval: 410 hrs 14 min (db uptime).
The remaining intervals: 30 min each.
"Filled", "Access", "Collisions" are the increments per interval of the corresponding counters in Status: Database Service Manager ("Area Filled Count", "Access Count", "Access Collisions").
"AI Blocks" - the number of AI blocks generated per interval = "Total AI Writes" minus "Partial Writes" (from Activity: AI Log)
"Percent" = "AI Blocks" / "Total AI Writes" * 100% = an indirect indicator of how idle the DB is.
"Notes/Block" = "Records written" (from Activity: AI Log) / "AI Blocks". AI blocksize = 16K. Average size of recovery notes is 112 bytes in the given example.
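As a sanity check on the 112-byte average note size, a rough back-of-envelope (ignoring AI block headers, so it slightly overestimates):

```python
# A 16 KB AI block holding ~140 notes implies roughly 117 bytes per note,
# consistent with the stated 112-byte average once headers are accounted for.
ai_block_size   = 16 * 1024
notes_per_block = 140          # typical value from the table above
avg_note_bytes  = ai_block_size / notes_per_block
print(round(avg_note_bytes))   # -> 117
```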
Conclusions/Remarks:
An unknown process (or processes) on the source DB is creating the collisions while trying to access the pica queue, yet all 47 promon checks show that the message entries are locked only by the Replication Server. The collisions are extremely low even during the busiest time: 96 collisions per 30 minutes (one collision every 20 sec). In other words, almost 100% of the pica queue accesses are generated by the Replication Server. I guess ABL sessions do not access the pica queue; in my example there were about 2,500 sessions, and it's unlikely the Replication Server would be able to dominate their access requests. So the access collisions can only be created by the special processes that work with AI files. UPD: the documentation says "AIW and REPL attempting simultaneous access".
The Replication Server generates at least one access per incoming AI block (e.g. 102,666 accesses per 66,092 AI blocks). When transaction activity is low, the number of pica queue accesses per AI block can be huge: the Replication Server accesses the pica queue at least 27 times per second (e.g. 48,780 accesses per 30 min with only 15 incoming AI blocks). For comparison: at busy times the DB was able to generate 85,502 AI blocks per 30 min, or 48 AI blocks per sec.
The max value of the "Area Filled Count" delta per 30 min interval was 3,282, or almost twice per second.
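The rates quoted in these remarks can be re-derived with quick arithmetic (all figures taken from the table above):

```python
# Re-deriving the quoted figures from the 30-minute interval data.
interval_sec = 30 * 60

# Busy interval: 102,666 accesses for 66,092 incoming AI blocks.
print(f"{102_666 / 66_092:.2f} accesses per AI block (busy)")     # 1.55

# Idle interval: 48,780 accesses for only 15 AI blocks.
print(f"{48_780 / interval_sec:.1f} accesses per second (idle)")  # 27.1

# Max "Area Filled Count" delta: 3,282 per 30 min.
print(f"{3_282 / interval_sec:.2f} fills per second (peak)")      # 1.82
```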
> Now I know I set -pica too high at 16384.
> Used HighWater Mark shows that.
16 MB of RAM per db is not a big price for insurance. ;-)
> So the goal is that HWM is somehow close to Total Message Entries, like half or more, right?
Unfortunately, the HWM is almost useless without the date/time at which the message entries rose to that value. It could have happened a month ago due to particular conditions at the time (e.g. an abnormal peak of transaction activity, a network issue between source and target DBs, etc.). The rest of the time (until DB restart) you will see only a "reminder" of a past bad day that does not matter for the current situation.
> Area Filled Count should be small (single digit) or zero?
Only zero! Any other value means that performance on the source DB was considerably degraded, at least for some period of time, due to a too-small pica queue.
> Access Collisions ???
It's a field that can be totally ignored.
The knowledgebase article mentioned earlier, "Clarification of the -pica replication enabled database startup parameter"
knowledgebase.progress.com/.../P104638
says: "There is a PICA queue on both the source and target databases"
Should the -pica parameters be the same on the source and target databases, or should we tune them individually? These databases run on different hardware and under different loads.
I guess that in the customer's case I posted above, the performance degradation on the source DB was caused by slowness of the replication agent on the target DB. Theory: the replication server spent a lot of time waiting for acknowledgments from the replication agent, which in turn filled the pica queue on the source DB => the source's AIW was waiting for pica access => updates on the source DB were very slow. Is that a reasonable explanation?
Will we also get a negative impact on the source database if the pica queue fills on the target database for any reason? Will it cause the pica queue on the source DB to fill as well?
Comparing the pica queues on source and target dbs:
Source:
09/27/17                Status: Database Service Manager                16:30:53
Communication Area Size : 20500.00 KB
Total Message Entries   : 187428
Free Message Entries    : 103103
Used Message Entries    : 84325
Used HighWater Mark     : 187428
Area Filled Count       : 36203
Service Latch Holder    : -1
Access Count            : 150708477
Access Collisions       : 25833
Registered Database Service Objects
Name                          Rdy  Status  Messages Locked by
OpenEdge Replication Server    Y   RUN     84325
Target:
09/27/17                Status: Database Service Manager                16:30:49
Communication Area Size : 20500.00 KB
Total Message Entries   : 187428
Free Message Entries    : 187428
Used Message Entries    : 0
Used HighWater Mark     : 0
Area Filled Count       : 0
Service Latch Holder    : -1
Access Count            : 24717946
Access Collisions       : 141
Registered Database Service Objects
Name                          Rdy  Status  Messages Locked by
OpenEdge Replication Agent     Y   RUN     0
Access count on source db is 6 times larger than the one on target.
Large Area Filled Count on source db and zero HWM on target.
BIW is running on the target. AI on the target is not enabled.
The customer's case: rather high transaction activity (~0.7 MB/sec of AI/BI writes) caused a persistent increase of the used message entries in the pica queue on the source DB. When the pica queue was totally filled, we got 100% busy MTX and AIB latches, meaning almost every snapshot in promon showed an owner of these latches. The MTX latch was owned by ABL users (a different one each time). The AIB latch was always owned by the AIW process. Just for the record: competition for the AIB latch between ABL users is not possible, because they always lock the AIB (and BIB) latch under cover of the MTX latch: they lock the MTX latch first and only then request the AIB/BIB lock. I guess there are only a few special processes that can compete for the AIB latch directly, but mainly it's the AIW.
The number of AIB latch locks was rather low for a busy latch: 700-2000 locks per sec. Under normal conditions the lock duration of the AIB latch is less than 100 nanoseconds; hence, to make the latch 100% busy we would need 10 million locks per sec, and it's impossible to get such a high number of AIB locks in a real database.
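The "impossible" claim is easy to check numerically. The 100 ns hold time is the figure quoted above; everything else is straightforward arithmetic:

```python
# At ~100 ns per AIB lock, saturating the latch needs ~10 million locks/sec,
# while the observed rate was only 700-2000 locks/sec.
lock_ns = 100
locks_per_sec_for_saturation = 1_000_000_000 // lock_ns   # ns in a second / hold time
observed = 2_000                                          # upper end of observed rate

print(locks_per_sec_for_saturation)                # 10000000
print(observed / locks_per_sec_for_saturation)     # 0.0002 -> at most 0.02% busy
```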
> The duration of a lock of the pica queue is less than a millisecond.
That's a huge time compared to the duration of the "spin" latches like AIB. It also means we can create only about a thousand locks per sec (which matches the numbers above).
New explanation of the customer's case:
I guess the AIW process requests the pica queue lock while it holds the AIB latch. When the AIW is forced to wait for the pica queue lock, the hold time of the AIB latch increases ten thousandfold: from 100 ns to 1 ms. An ABL process that already owns the MTX latch is then forced to wait for the AIB latch; that is why the MTX latch became 100% busy at the same time as the AIB latch. The numbers of MTX/AIB/BIB latch locks are close to each other: the difference between MTX and AIB latch locks is typically within 1%, and between MTX and BIB within 10% (or more during idle time).
If this theory of mine is correct, then the only solution is a pica queue large enough to avoid being completely filled during the periods of highest transaction activity.
Thanks again, Jeff.
> The duration of a pica queue lock is less than a microsecond. The mechanism is the same as with “spin” latches.
How long will the AIW wait before a second try if it fails to get the pica queue lock? 10 ms? Does it use the -nap value?
>The customer's case: a rather high transaction activity (~ 0.7 MB/sec of AI/BI writes) caused the persistent increase
>of the used message entries in the pica queue on source db.
What was the -pica value? And if it was increased, what is it now?
We have pretty similar transaction activity.
>BIW is running on the target. AI on the target is not enabled.
Why would one enable AI on the target DB, and is it even possible?
>Why would one enable AI on Target DB and is it even possible?
Yes, it is possible. Some people use AI on the target database in order to archive AI extents from the target rather than the source, but to me this decision is doubtful, since there can be a time lag between the source and the target.
However, if you want to use a Replication Set in OE 11.7, you must enable after-imaging on your target databases.