Unexplained database crashes

Posted by James Palmer on 03-Jun-2016 06:29

Windows 2008 R2

Progress 10.2B08

VMWare environment

Our client has reported that they are finding databases crashing on them from time to time. It's different databases and different times. 

Audit DB: 

[2016/06/03@06:46:40.489+0100] P-6584 T-6588 I WDOG 22: (2527) Disconnecting dead user 33.
[2016/06/03@06:47:58.692+0100] P-13924 T-3848 I ABL 33: (452) Login by WH17 on CON:.
[2016/06/03@06:47:58.696+0100] P-13924 T-3848 I ABL 33: (12699) Database audit Options:
[2016/06/03@06:48:40.507+0100] P-6584 T-6588 I WDOG 22: (2527) Disconnecting dead user 35.
[2016/06/03@06:49:21.512+0100] P-12836 T-17296 I ABL 35: (452) Login by WH17 on CON:.
[2016/06/03@06:49:21.516+0100] P-12836 T-17296 I ABL 35: (12699) Database audit Options:
[2016/06/03@06:49:50.507+0100] P-6584 T-6588 I WDOG 22: (2256) SYSTEM ERROR: User 33 died during microtransaction.
[2016/06/03@06:49:50.507+0100] P-6584 T-6588 I WDOG 22: (2520) Stopped.

and from the admin server log

[03/06/16 06:49:50] [2] [Database] Error detected in config: l_idx40_audit.defaultconfiguration(15); attempting graceful shutdown. (8468) 
[03/06/16 06:49:51] [3] [Database] Abnormal communication failure for config: l_idx40_audit.defaultconfiguration (8490) 
[03/06/16 06:49:51] [2] [Database] In crash handler for config com.progress.juniper.admin.MProservJuniper@1a76eff. (8486) 
[03/06/16 06:49:51] [1] [Database] Configuration is exiting: l_idx40_audit.defaultconfiguration (7394) 

pws DB

[2016/06/02@17:18:50.192+0100] P-7064 T-6212 I WDOG 22: (2523) User 30 died with 1 buffers locked.
[2016/06/02@17:18:50.192+0100] P-7064 T-6212 I WDOG 22: (2520) Stopped.

from the admin server log

[02/06/16 17:18:50] [2] [Database] Error detected in config: l_idx40_pws.defaultconfiguration(16); attempting graceful shutdown. (8468) 
[02/06/16 17:18:51] [3] [Database] Abnormal communication failure for config: l_idx40_pws.defaultconfiguration (8490) 
[02/06/16 17:18:51] [2] [Database] In crash handler for config com.progress.juniper.admin.MProservJuniper@169360f. (8486) 
[02/06/16 17:18:51] [1] [Database] Configuration is exiting: l_idx40_pws.defaultconfiguration (7394) 

I've found http://knowledgebase.progress.com/articles/Article/P112852?q=8490&l=en_US&fs=Search&pn=1 but the DataCentre that host the server for them can find no evidence of maintenance around this time and nothing in the VMWare logfiles that's pertinent. 

Has anyone seen this before? Any pointers for what the cause might be?

All Replies

Posted by George Potemkin on 03-Jun-2016 06:51

Users died while updating the databases. If after-image is enabled then you can check what users did just before the crashes.

Posted by James Palmer on 03-Jun-2016 06:59

No after imaging available. Yes I know.

Posted by ChUIMonster on 03-Jun-2016 07:13

It is a Windows "server"...

The problem is almost certainly that Windows is killing a task.  Possibly just another window or something put in the background but also very possibly a scheduled task being run with the same id as someone who logs in occasionally.  

When they logout (or get disconnected) Windows "helpfully" kills all of the other tasks running under the same userid.

Posted by James Palmer on 03-Jun-2016 07:23

The tricky thing is that the user number in question on each of the failures is not one that's ever shown as logging in to the database, so I can't see who it is. This leads me to believe it's an internal process rather than an actual user. So user 30 on pws is currently an APW process.

Posted by George Potemkin on 03-Jun-2016 07:24

> Windows "helpfully" kills all of the other tasks running under the same userid.

How to explain that in all (two) cases it's the same message in the admin server log?

Posted by ChUIMonster on 03-Jun-2016 07:27

So far as i can tell that could just be the admin servers way of saying "darn!  the db disconnected a session".

Posted by ChUIMonster on 03-Jun-2016 07:29

The db log file should tell you who user 30 *was* prior to wdog cleaning up after it.

Posted by James Palmer on 03-Jun-2016 07:47

I know but I can't find reference to the users. Log files:

audit.lg

pws.lg

Posted by ChUIMonster on 03-Jun-2016 07:59

Are you saying that the db .lg was truncated and that there was no previous reference to user 33 and 30 respectively?

Or are you saying that it is a complete .lg and that the very first reference to these users is  WDOG disconnecting them?

And what's up with logins on the same IDs usr# *after* WDOG disconnecting them?  Is WH17 something special?

Posted by Richard Shulman on 03-Jun-2016 08:02

Hello James,

A quick way to search for the dead user.

Identify the wdog message.

Example:

[2016/06/03@09:24:38.070+0100] P-8156       T-6724  I WDOG   25: (2527)  Disconnecting dead user 33.

Then search up from there for the user number followed by a colon.

Example:

[2016/06/03@08:08:26.572+0100] P-10144      T-10412 I ABL    33: (452)   Login by WH5 on CON:.

[2016/06/03@08:08:26.579+0100] P-10144      T-10412 I ABL    33: (12699) Database pws Options:  

Though this doesn't answer why it died it gives you something to search for.

You now know that WH5 which is a shared memory connection (confirmed two ways one WDOG caught it and two it shows up as ABL in this listing as opposed to SRV).

You could search to see if a batch job is scheduled to connect at this time or if that is a true user you could ask them what they can remember from that login session (if you are close enough in time for them to recall).

Look for protraces that are between the login and the wdog detection as dead in case it died but was able to leave a protrace. (I admit since it died in a fashion that didn't give it time to say it generated a core in the log the chances are slim.)

Hope this helps.

Posted by George Potemkin on 03-Jun-2016 08:02

audit.lg was used 2645927.644 sec = 30 day(s) 14:58:47.644

It contains 2,127 messages: "Disconnecting dead user"

In average it's one dead user per 20 min.

Posted by George Potemkin on 03-Jun-2016 08:16

[2016/06/01@15:56:16.924+0100] P-7972       T-11908 I ABL    33: (453)   Logout by WH21 on CON:.

It was the last logout of user 33.

Followed by 15 (!!!) logins of user 33.

[2016/06/01@15:56:38.785+0100] P-15724      T-16740 I ABL    33: (452)   Login by WH17 on CON:.

...

[2016/06/03@06:47:58.692+0100] P-13924      T-3848  I ABL    33: (452)   Login by WH17 on CON:.

and by 14 messages: "Disconnecting dead user 33."

and one message "User 33 died during microtransaction"

Posted by James Palmer on 03-Jun-2016 08:58

[mention:c5a1bc9bc3aa4bdb801cb182cec075cd:e9ed411860ed4f2ba0265705b8793d05] No idea why I didn't spot that. Slow day. Very helpful thanks.

Posted by George Potemkin on 03-Jun-2016 09:03

James, can you enable debugging (or clientlog) for a particular user? 15 sequential deaths is not a coincidence.

Posted by James Palmer on 03-Jun-2016 09:12

I'm not sure George. New to the application. I need to establish who these WH users are with the client. Most sites the users have names relating to their windows logon, but these are not.

This thread is closed