
Database crashes with bkio read errors weekly due to insufficient resources on Windows

Title: Database crashes with bkio read errors weekly due to insufficient resources on Windows
URL Name: P135466
Article Number: 000167587
Environment
Product: Progress
Version: 9.x
Product: OpenEdge
Version: 10.x, 11.1x
OS: Windows 2003
Question/Problem Description
PROBKUP fails with bkioread errors regarding insufficient disk space

Errors are being reported on the fixed extents for certain data areas in the database.
Database crashes weekly due to insufficient resources on Windows 2003
Clients as well as the backup process hit the same error messages.
Sometimes they terminate abnormally, sometimes they successfully complete after retrying.
A java.net.SocketException sometimes appears in the AdminServer log at about the same time as the bkioread errors.
Clarifying Information
The Windows Event Viewer log sometimes contains the following error message:

Windows cannot load the user's profile but has logged you on with the default profile for the system.
DETAIL - Insufficient system resources exist to complete the requested service.
For more information, see Help and Support Center at
http://go.microsoft.com/fwlink/events.asp
Error Message
<function>:Insufficient disk space during <system call>, fd <file descriptor>, len <bytes>, offset <bytes>, file <file-name>. (9450)

SYSTEM ERROR: <function>: Bad file descriptor was used during <system call>, fd <file descriptor>, len <bytes>, offset <bytes>, file <file-name>. (9446)

SYSTEM ERROR: read wrong dbkey at offset <offset> in file <file> found <dbkey>, expected <dbkey>, retrying. area <number> (9445)

Corrupt block detected when reading from database. (4229)

<func-name>: Error occurred in area <num>, block number: <num>, extent: <name>. (10560)

Writing block <num> to log file. Please save and send the log file to Progress Software Corp. for investigation. (10561)

SYSTEM ERROR: Wrong dbkey in block. Found <dbkey>, should be <dbkey2> in area <num>. (1124)
Begin ABNORMAL shutdown code (2249)

[AdminServer] * java.net.SocketException: No buffer space available (maximum connections reached?): create (8172)(java.rmi.ConnectIOException: Exception creating connection to: 10.105.55.22; nested exception is:
[12/1/13 1:34:57 PM] [0] [AdminServer] * java.net.SocketException: No buffer space available (maximum connections reached?): create)
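
To see how often these messages occur, the database log can be scanned for the message numbers listed above. The following is a minimal Python sketch, not a Progress utility. It assumes each message carries its number in parentheses somewhere on the line. The function name count_watched_errors is hypothetical, and the sample lines are the message texts quoted in this article rather than a real log.

```python
import re
from collections import Counter

# Message numbers taken from the Error Message section of this article.
WATCHED_NUMBERS = {9450, 9446, 9445, 4229, 10560, 10561, 1124, 2249}

# Assumption: each message carries its number in parentheses on the line.
NUMBER_IN_PARENS = re.compile(r"\((\d+)\)")


def count_watched_errors(lines):
    """Return a Counter of watched message numbers found in the given lines."""
    counts = Counter()
    for line in lines:
        for number in NUMBER_IN_PARENS.findall(line):
            if int(number) in WATCHED_NUMBERS:
                counts[int(number)] += 1
    return counts


if __name__ == "__main__":
    # Two message texts quoted from this article, plus one unrelated line.
    sample = [
        "Corrupt block detected when reading from database. (4229)",
        "Begin ABNORMAL shutdown code (2249)",
        "A line without any message number",
    ]
    for number, hits in sorted(count_watched_errors(sample).items()):
        print(number, hits)
```

Any iterable of strings can be passed in, including an open log file, so the same function works on the database .lg file once its path is known.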
Cause
The Windows paged pool runs out of memory when the server is busy and demand for memory is high.
Resolution
Increase the Windows PagedPoolSize registry value to allow a larger paged pool. A setting of 0xFFFFFFFF makes Windows allocate the largest possible paged pool, at the expense of other resources on the computer.
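
As an illustration of the setting described above, the following Python sketch generates the text of a .reg file that sets PagedPoolSize. It rests on assumptions not stated in this article: the registry key path is the Memory Management key as described in Microsoft KB 312362, and the helper name build_reg_file is hypothetical. Treat it as a sketch to review, and back up the registry before importing anything.

```python
# Assumption: registry key as described in Microsoft KB 312362.
MEMORY_MANAGEMENT_KEY = (
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet"
    r"\Control\Session Manager\Memory Management"
)


def build_reg_file(paged_pool_size=0xFFFFFFFF):
    """Return .reg file text that sets PagedPoolSize to the given DWORD value."""
    if not 0 <= paged_pool_size <= 0xFFFFFFFF:
        raise ValueError("PagedPoolSize must fit in a 32-bit DWORD")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{MEMORY_MANAGEMENT_KEY}]",
        # DWORD values in a .reg file are written as 8 hexadecimal digits.
        f'"PagedPoolSize"=dword:{paged_pool_size:08x}',
    ]
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    print(build_reg_file())
```

The generated text can be saved to a .reg file and reviewed before it is imported on the server.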

Suggestions from Microsoft:

1. Set PagedPoolSize to 0xFFFFFFFF, which is the maximum.
Note that PoolUsageMaximum defaults to 80, which means the Memory Manager starts trimming pool memory at 80% usage.

2. Upgrade to a 64-bit platform, which provides far more pool resources (up to 128 GB, instead of 256 MB non-paged pool and 470 MB paged pool on 32-bit). The following Microsoft articles provide further information about pool resources on x64:

Server is unable to allocate memory from the system paged pool
http://support.microsoft.com/kb/312362  

Comparison of 32-bit and 64-bit memory architecture for 64-bit editions of Windows XP and Windows Server 2003
http://support.microsoft.com/default.aspx?scid=kb;EN-US;294418

3. Examine Perfmon metrics

Perfmon will show that the system is critically low on paged pool memory, with less than 20% available.
The system may not yet have logged a 2020 or 2021 event, but may report error 1450 instead. Microsoft article 317249 provides further information about these events.
Extract from 317249:

Other components of the operating system may not work and may generate error messages that report a status code of 1450 in the data section of their event log message. That is, "Insufficient System Resources."

These events may be found in the System event log or in the Application event log. These messages may apply to the issue that is described in this article only if the underlying event was a connection to the server service. However, this fact is not easily determined. For example, there is Event ID 1055 that is generated by CLUSSVC. This event is from the cluster service that usually reports a failed connection to the server service.

Based on the symptoms I have seen in this case, I would encourage the customer to move to x64. I cannot currently tell whether this will only delay the issue; however, that is not very likely. Only with a memory dump can we get a better grip on what is actually happening. Based on the Perfmon data alone, an x64 OS will relieve the system. Fine-tuning the current system is possible but may be a lengthy process.
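
The Perfmon check in suggestion 3 comes down to simple arithmetic, sketched here in Python. It assumes the paged pool usage is read by hand from the Perfmon counter Memory\Pool Paged Bytes, and it uses the 470 MB paged pool figure quoted above for 32-bit Windows as the limit. The real limit on a given server may differ, and the function names are hypothetical.

```python
MB = 1024 * 1024

# Paged pool figure quoted in this article for 32-bit Windows (assumed limit).
PAGED_POOL_LIMIT_32BIT = 470 * MB


def paged_pool_available_pct(pool_paged_bytes, limit_bytes=PAGED_POOL_LIMIT_32BIT):
    """Return the percentage of the paged pool that is still available."""
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    # Clamp the reading so the result always stays between 0 and 100.
    used = min(max(pool_paged_bytes, 0), limit_bytes)
    return 100.0 * (limit_bytes - used) / limit_bytes


def is_critically_low(pool_paged_bytes, limit_bytes=PAGED_POOL_LIMIT_32BIT,
                      threshold_pct=20.0):
    """True when less than threshold_pct of the paged pool is available."""
    return paged_pool_available_pct(pool_paged_bytes, limit_bytes) < threshold_pct


if __name__ == "__main__":
    reading = 400 * MB  # hypothetical Pool Paged Bytes reading
    print(round(paged_pool_available_pct(reading), 1), is_critically_low(reading))
```

A reading that stays below the 20% mark while the server is busy matches the condition described in suggestion 3.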
Last Modified Date: 11/20/2020 7:34 AM
