How to survive errors like 819 or 10566?
(819) SYSTEM ERROR: Error in undo of record delete
or
(10566) SYSTEM ERROR: Undo failed to reproduce the record in area <area-num> with rowid <DBKEY> and return code <Return Code>.
They mean that the undo part of a recovery note is corrupted. The note will be successfully replicated to a target database or rolled forward on a warm standby copy, but then these databases will fail to start: crash recovery will be terminated by the same error. In other words, we will lose all databases.
We can restore an old backup and roll forward all AI files using an endtime before the error. But that may not work either, because we need to find the time when the corrupted note was created rather than the time when Progress tried to undo the changes related to the note. The latter is the point of no return. At a minimum we should know the time when the transaction began. Unfortunately the error does not even report the TRID.
Does anybody see any solution other than using the -F option?
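For illustration, here is a minimal sketch of how the roll forward endtime could be derived once the transaction start time is known. The database name, AI file path, timestamp and one-second margin are made-up placeholders; only the general `rfutil ... -C roll forward endtime yyyy:mm:dd:hh:mm:ss -a <ai-file>` form is taken from the rfutil documentation, so please check it against your release before relying on it.

```python
from datetime import datetime, timedelta

def rollforward_command(db_name, ai_file, txn_begin, margin_seconds=1):
    """Build an rfutil roll forward command that stops just before txn_begin.

    txn_begin is the start time of the transaction that owns the corrupted
    note (not the time the undo error was reported). margin_seconds is an
    arbitrary safety margin chosen for this sketch.
    """
    end = txn_begin - timedelta(seconds=margin_seconds)
    endtime = end.strftime("%Y:%m:%d:%H:%M:%S")  # yyyy:mm:dd:hh:mm:ss
    return ["rfutil", db_name, "-C", "roll", "forward",
            "endtime", endtime, "-a", ai_file]

# Hypothetical values, for illustration only.
cmd = rollforward_command("sports", "/backup/ai/sports.a5",
                          datetime(2019, 2, 13, 5, 20, 0))
print(" ".join(cmd))
```

The sketch only builds the command line; it does not run rfutil. The hard part remains finding the transaction start time at all.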
Just for information - we already got error #819 for two customers.
Both customers are running 11.7 build 1497 SP02: one on Linux 64-bit, the other on AIX 64-bit.
The Knowledgebase describes defect PSC00293031, which is related to error #819, but it was fixed in 11.3.3 and 11.4. In our cases the transactions did not use LOB objects.
In both our cases the corrupted recovery note was RL_RMCR, the one that describes the creation of a new record. Transaction undo replaced the records with placeholders (recid locks), and exactly at that moment we got error #819.
Progress could make survival a bit easier and less painful for us if it would:
1. Report the TRID with the errors above;
2. Support a Transaction Ignore List at db startup that sets the list of TRIDs whose recovery notes should be ignored during crash recovery. It’s better to be ill than dead. It’s better to leave uncommitted only the transaction that caused the above errors than to skip crash recovery for all transactions that were active at the moment of the db crash;
3. Enhance rollforward scan with a “loud” verbose option that reports the changes described by each recovery note, in other words an option to fully decode the contents of recovery notes. It would help us to find all changes made by the transactions on the Transaction Ignore List and to fix them manually.
If all databases (source, target and standby) are fatally infected by a corrupted recovery note, which db copy should we open using the -F option?
Is a standby database the best choice? The source and target databases crashed. Some buffers modified by committed transactions were not written to disk. The -F option will lose these changes. The standby database used to roll forward AI files did not crash. All recent changes are written to disk. The -F option will not lose the changes made by committed transactions. At least the theory seems to say this. But in theory there is no difference between theory and practice. In practice there is.
In the second incident we were lucky to have only one uncommitted transaction (the one with the corrupted recovery note). The customer opened the database using the -F option, but the database soon crashed again due to new corruptions of the kind that are expected when someone uses forced access:
(14684) SYSTEM ERROR: Attempt to read block 0 which does not exist in area <index-area>, database <db>. (210)
(10831) SYSTEM ERROR: Record continuation not found, fragment recid <recid> area <data-area> 3.
(10833) SYSTEM ERROR: Invalid record with recid <recid> area <data-area> 1.
The uncommitted transaction did not update those areas. And I am almost sure these corruptions did not exist before -F was used. Why did we get these errors?
> On Feb 13, 2019, at 5:20 AM, George Potemkin wrote:
> The -F option will not lose the changes done by the committed transactions
yes, it will sometimes. consider the following scenario:
0) a transaction begins and a transaction begin note is spooled.
1) transaction creates a record. this could likely cause several block changes if a new block must be allocated, or just one if record fits in block at head of rm chain.
2) transaction creates an index entry. best case, one index block is changed, else a block split may be required.
2a) at this point, there are bi notes describing all those changes made by the transaction, probably still in memory.
3) transaction commits and a commit note is spooled.
4) lazy commit timer expires and all bi notes up to and including the commit note are flushed to disk.
5) system crashes. contents of bi and ai buffers and database buffers are lost.
6) you do a normal database restart, the redo phase will recreate the actions of any notes whose database actions did not make it to disk. what was in memory and not written to disk is recreated. the transaction will be ok. nothing lost.
alternate 6) you do a database start with -F. contents of bi log are discarded. there is no redo phase. memory contents are NOT recreated and whatever was in memory is lost forever. that could be the contents of any action performed in steps 0 through 3, including the entire transaction.
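The scenario above can be illustrated with a toy model. This is only a sketch of the redo idea, not how the storage engine is actually implemented: the note names, block names and data structures are invented for the example.

```python
def simulate(force_access):
    """Toy model of steps 0-6: returns the database contents after restart."""
    disk_db = {}      # database blocks that actually reached disk
    buffers = {}      # modified database blocks still only in memory
    bi_on_disk = []   # bi notes that were flushed to disk

    # steps 0-3: notes are spooled and blocks are changed in memory only
    notes = [
        ("begin", 1, None, None),
        ("record-create", 1, "rm-block", "new record"),
        ("index-add", 1, "ix-block", "new key"),
        ("commit", 1, None, None),
    ]
    for kind, trid, block, value in notes:
        if block is not None:
            buffers[block] = value

    # step 4: lazy commit timer flushes bi notes up to the commit note
    bi_on_disk.extend(notes)

    # step 5: crash, everything held in memory is gone
    buffers.clear()

    if force_access:
        # alternate 6: -F discards the bi log, there is no redo phase
        bi_on_disk.clear()
    else:
        # 6: redo phase re-applies notes whose changes never reached disk
        for kind, trid, block, value in bi_on_disk:
            if block is not None and block not in disk_db:
                disk_db[block] = value
    return disk_db

print(simulate(force_access=False))
print(simulate(force_access=True))
```

With a normal restart the committed record and its index entry come back from the bi notes; with forced access the result is empty, because neither the buffers nor the notes survive.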
I meant that we will not lose the changes made by committed transactions if we use the -F option to open a standby database that was used to roll forward AI files. It did not crash when the corrupted note was applied. Rfutil was successful for the last AI file. All db changes made by rfutil were saved to disk. But it will crash if we try to open the database in normal mode, because that will try to undo the transaction with the corrupted recovery note.
george, you are correct.
I’m thinking about the following plan for what to do if we get the “error in undo” again. Any comments are welcome.
0. Closely (for example, once per second) watch the db logs. If the error happens then:
1. Freeze the watchdog process (kill -SIGSTOP). This prevents the watchdog from dying during the undo of the dead client’s transaction. Hence the database will not crash immediately;
2. Optionally proquiet the database. Any changes made from this point on can be lost. We need time to make a decision;
3. Get the full information about the transaction of the dead client, mainly the transaction start time and the number of notes written and read for the current transaction;
4. Based on this information, decide whether to switch to a warm standby database and roll forward AI files to a point in time before the beginning of the transaction or (if the transaction was opened a long time ago) to continue with the current state of the database, even if we are forced to use the -F option to open the db.
5. If we choose to use the -F option then:
5.1 Disable the "quiet" point and disconnect all db sessions except, of course, the dead one;
5.2 Proquiet the database again to write all dirty blocks to disk;
5.3 Shut the database down (emergency shutdown?). Of course, the database will not be closed normally because the transaction of the dead session can’t be undone due to the error;
5.4 Truncate bi -F. It’s expected that at this point we will lose only some changes made by the dead uncommitted transaction. The changes made by other transactions are supposed to be saved on disk;
5.5 When the db is up and running, eliminate the changes made by the dead transaction. To find those changes we can use (with a bit of luck) AI scans.
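Steps 0 and 1 of the plan could be sketched roughly as follows. This is a hedged illustration only: the way the message number is matched, the log path and the watchdog PID are assumptions of mine, and how the PID is obtained is left out entirely.

```python
import os
import re
import signal
import time

# Message numbers mentioned in this thread; matched as "(NNN)" anywhere in a line.
UNDO_ERROR = re.compile(r"\((815|819|820|10566)\)")

def find_undo_error(line):
    """Return the message number if the line reports an undo error, else None."""
    m = UNDO_ERROR.search(line)
    return int(m.group(1)) if m else None

def scan_new_lines(log_path, offset):
    """Scan complete lines appended after offset; return (errors, new_offset)."""
    with open(log_path, "rb") as f:
        f.seek(offset)
        data = f.read()
    end = data.rfind(b"\n") + 1  # leave a trailing partial line for the next pass
    errors = []
    for raw in data[:end].splitlines():
        num = find_undo_error(raw.decode("utf-8", errors="replace"))
        if num is not None:
            errors.append(num)
    return errors, offset + end

def freeze_watchdog(pid):
    """Step 1: suspend the watchdog (same effect as kill -SIGSTOP <pid>)."""
    os.kill(pid, signal.SIGSTOP)

def watch(log_path, watchdog_pid, interval=1.0):
    """Step 0: poll the db log once per interval until an undo error shows up."""
    offset = os.path.getsize(log_path)  # start at the current end of the log
    while True:
        errors, offset = scan_new_lines(log_path, offset)
        if errors:
            freeze_watchdog(watchdog_pid)
            return errors
        time.sleep(interval)
```

Something like `watch("/db/prod.lg", 12345)` (both values hypothetical) would block until one of the errors appears and then freeze the given process. The remaining steps need human decisions and are deliberately not automated here.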
Did I miss some points?
>>In other words we will lose all databases.
>>Just for information - we already got error #819 for two customers.
I'm shocked. How did you survive those two cases? Have you opened a case with Progress Technical Support? What does Progress tell you about this?
Why is no one here responding to this message?
It's a database administrator's nightmare. I wouldn't want to be in this situation.
> Have you opened a case in Progress Technical Support?
What's the news? What does Progress say about this?
It's still under investigation.
Did they confirm it was a bug?
I'll share the conclusion when I get it from PTS.
The development team did a great deal of work investigating the root cause of the errors.
First of all, the two incidents that I mentioned above were caused by two absolutely different errors:
SYSTEM ERROR: Error in undo of record delete (815)
SYSTEM ERROR: Error in undo of record delete (819)
There is also error # 820, but our customers have not got it yet ;-)
SYSTEM ERROR: Error in undo of record delete (820)
Error 815 is fatal for the database. To get access to the database we need either to use the -F option or to roll forward AI files to a time before the corrupted note was created (that is not the time when message # 815 was issued).
Error 819 is recoverable. The transaction undo performed by a client’s session failed, but database crash recovery will be successful. In our case crash recovery took a long time (more than 5 hours) because the remote user did not really log out from the database, which resulted in a very large bi file (the user’s transaction stayed open for a week). The error will be fixed in 12.1.
IMHO, a workaround for such errors: on the standby database, do not apply AI files that contain notes for transactions that are not yet committed on the source database.
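A rough sketch of that selection rule, under the assumption that for each AI file we somehow know the time range of the notes it contains and that we can obtain the start time of the oldest transaction still open on the source. Both inputs, and all names below, are hypothetical; nothing here calls a real Progress utility.

```python
from datetime import datetime

def safe_ai_files(ai_files, oldest_open_txn_start):
    """Return the names of AI files that can be applied on the standby.

    ai_files: list of (name, first_note_time, last_note_time) in sequence order.
    oldest_open_txn_start: start time of the oldest transaction still open
    on the source, or None if there are no open transactions.
    """
    safe = []
    for name, first_note, last_note in ai_files:
        if oldest_open_txn_start is not None and last_note >= oldest_open_txn_start:
            break  # this file and all later ones may hold uncommitted notes
        safe.append(name)
    return safe

# Made-up example: a transaction has been open on the source since 10:15.
files = [
    ("db.a1", datetime(2019, 2, 13, 9, 0), datetime(2019, 2, 13, 9, 59)),
    ("db.a2", datetime(2019, 2, 13, 10, 0), datetime(2019, 2, 13, 10, 59)),
    ("db.a3", datetime(2019, 2, 13, 11, 0), datetime(2019, 2, 13, 11, 59)),
]
print(safe_ai_files(files, datetime(2019, 2, 13, 10, 15)))
```

The obvious cost is that a long-running transaction on the source (like the week-long one above) holds back the standby for just as long.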