Keep database online (“Five 9s”) after critical errors like 1124 or 819
For some critical errors Progress could mark the affected blocks as inaccessible instead of crashing the database. The list of inaccessible blocks can be stored in special database blocks, with the following information for each entry: area, dbkey and state. The inaccessibility state can have two values: fully inaccessible and read-only access. The inaccessibility flag needs to be checked only when a block is retrieved from disk, which minimizes the impact on performance.
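To make the idea concrete, here is a minimal sketch of such a list in Python. It is not Progress code: the names `BlockState`, `BlockId`, `InaccessibleBlockList` and `check_on_disk_read` are hypothetical, and an in-memory dictionary stands in for the special database blocks.

```python
from dataclasses import dataclass
from enum import Enum


class BlockState(Enum):
    """The two inaccessibility states proposed above."""
    FULLY_INACCESSIBLE = "fully inaccessible"
    READ_ONLY = "read-only access"


class BlockInaccessibleError(Exception):
    """Non-fatal error returned to a session instead of crashing the database."""


@dataclass(frozen=True)
class BlockId:
    area: int
    dbkey: int


class InaccessibleBlockList:
    """Hypothetical stand-in for the list kept in special database blocks."""

    def __init__(self):
        self._states = {}  # BlockId -> BlockState

    def mark(self, block, state):
        self._states[block] = state

    def clear(self, block):
        # Called once the DBA has fixed or reformatted the block.
        self._states.pop(block, None)

    def check_on_disk_read(self, block, read_only_session):
        """Run only when a block is fetched from disk, not on buffer pool hits."""
        state = self._states.get(block)
        if state is None:
            return  # block is not flagged
        if state is BlockState.READ_ONLY and read_only_session:
            return  # read-only sessions may still read the block
        raise BlockInaccessibleError(
            f"block is marked as inaccessible: area {block.area}, dbkey {block.dbkey}"
        )
```

Because the check runs only on the disk-read path, sessions that find their blocks in the buffer pool pay nothing for it.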
Case of the 1124 (“Wrong dbkey in block”): The session can mark the corresponding block as fully inaccessible, remove (empty) it from the database buffer pool, release the buffer lock and issue error 1124, which would no longer be fatal. The database keeps running. Any session that tries to retrieve the block from disk should get a message like “block is marked as inaccessible”. If the DBA is able to fix the issue with the block (for example, by emptying the filesystem cache), the inaccessibility flag of that block can be removed.
Case of the 819 (“Error in undo of record delete”): The session can mark the block as “read-only access”, remove (empty) it from the database buffer pool, release the buffer lock and continue to process the rest of the undo notes. The corresponding transaction will be undone except for the block related to the corrupted recovery note. Access to the block will be allowed only for read-only sessions (the ones that will not try to update a block in an unknown state). We can use such a session to dump the contents of the block. Then the DBA can format the block as empty (with an online, rewritten version of dbrpr / 8. Reformat Block to a Free Block). The inaccessibility flag will be removed and we can load the dumped records (if there are any) back into the database. In the worst case we will lose one block of data but not the whole database.
Case of errors like 815 (“Error in undo of record delete”), where the error means corruption in shared memory: The database will crash and its restart will eliminate the error. In such cases the blocks should not be marked as inaccessible.
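The three cases above amount to a per-error policy. A minimal sketch in Python, not Progress code: `Action`, `POLICY` and `action_for` are hypothetical names, and falling back to a crash for errors not listed is my assumption (it matches today's behavior).

```python
from enum import Enum


class Action(Enum):
    MARK_FULLY_INACCESSIBLE = "mark block fully inaccessible, keep database running"
    MARK_READ_ONLY = "mark block read-only, continue with the rest of the undo notes"
    CRASH_AND_RESTART = "crash the database; restart eliminates the error"


# Error number -> proposed handling, as described in the cases above.
POLICY = {
    1124: Action.MARK_FULLY_INACCESSIBLE,  # wrong dbkey in block
    819: Action.MARK_READ_ONLY,            # corrupted recovery note
    815: Action.CRASH_AND_RESTART,         # corruption in shared memory
}


def action_for(error_number):
    """Return the proposed handling; unlisted errors keep the current behavior."""
    # Assumption: anything not explicitly listed still crashes the database.
    return POLICY.get(error_number, Action.CRASH_AND_RESTART)
```

Only errors tied to a specific on-disk block are candidates for marking. Errors that point to shared memory corruption stay fatal because a restart clears them.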
Thanks for the specific error examples, very useful! We are working on a number of initiatives to support improved uptime as part of our Continuous Operations strategy, and will be sure to include improved error handling as part of our considerations.