07-Aug-2015

Crashes
A few days ago, Diana suddenly became completely unresponsive. Not even the HALT button (which would normally halt the CPU and return to console mode) worked, nor did CTRL-P, so I had to power off the system and restart it. All seemed well after that.
Until today.

When I looked at the server stats this morning to see if there were more PHP errors, I found none AT ALL: the server log showed the webserver had been running for only about 30 minutes. That means the system had completely restarted after a crash. It was not a power outage, since that would stop the disks as well and the system would not restart (the boot process would not be able to access the disks since the HSZ50 isn’t ready yet).
So I checked HyperSpy, and that showed two of them. Since I was at work, investigations had to be postponed till after work.
When that could be done, I logged in remotely and found a number of crashes today:
$ dir operator*/dat

Directory SYS$SYSROOT:[SYSMGR]

$ dir/dat oper*

Directory SYS$SYSROOT:[SYSMGR]

OPERATOR.LOG;4369     7-AUG-2015 12:11:09.78
OPERATOR.LOG;4368     7-AUG-2015 07:25:40.54
OPERATOR.LOG;4367     7-AUG-2015 06:03:43.25
OPERATOR.LOG;4366     7-AUG-2015 02:11:16.71
OPERATOR.LOG;4365     7-AUG-2015 00:21:40.78
OPERATOR.LOG;4364     7-AUG-2015 00:00:00.83
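
To see how far apart these log versions are, the creation times can be subtracted from each other. The following is only a minimal sketch in Python: the timestamps are copied from the listing above, while the helper names (`parse_vms_time`, `gaps`) are made up for this illustration. It assumes each new OPERATOR.LOG version marks either a restart or a routine log renewal; it cannot tell the two apart.

```python
from datetime import datetime

# Creation times copied from the directory listing (versions 4364 to 4369).
LOG_TIMES = [
    "7-AUG-2015 00:00:00.83",
    "7-AUG-2015 00:21:40.78",
    "7-AUG-2015 02:11:16.71",
    "7-AUG-2015 06:03:43.25",
    "7-AUG-2015 07:25:40.54",
    "7-AUG-2015 12:11:09.78",
]

# Explicit month table, so parsing does not depend on the locale.
MONTHS = {name: number for number, name in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1)}


def parse_vms_time(text):
    """Parse a 'D-MMM-YYYY HH:MM:SS.CC' timestamp (hypothetical helper)."""
    date_part, time_part = text.split()
    day, month, year = date_part.split("-")
    hours, minutes, seconds = time_part.split(":")
    whole, hundredths = seconds.split(".")
    return datetime(int(year), MONTHS[month], int(day),
                    int(hours), int(minutes), int(whole),
                    int(hundredths) * 10000)  # hundredths -> microseconds


def gaps(times):
    """Return the time between each pair of consecutive timestamps."""
    stamps = [parse_vms_time(t) for t in times]
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    for start, gap in zip(LOG_TIMES, gaps(LOG_TIMES)):
        print(f"{start}  next version after {gap}")
```

The gaps are irregular, ranging from roughly 20 minutes to several hours.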

Using ANA/ERR/ELV translate/since=today, I found that one was caused by INVTQEFMT in the NULL process, and it seems all the others were caused by INVEXCEPTN in a webserver process (either the server itself or a worker process), all with the current IPL at 8. The last crash told me:

$ ana/crash

OpenVMS system dump analyzer
...analyzing an Alpha compressed selective memory dump...

%SDA-W-LMBMISSING, "S2 Space" LMB missing from dump
Dump taken on  7-AUG-2015 12:06:30.75 using version V8.4
INVEXCEPTN, Exception while above ASTDEL

%SDA-W-PROCNOTDUMPED, process WASD:80-123 not dumped; process private memory not accessible

Something to look into at home.
So I did, but when I logged in to my (still open) Motif session, the system crashed. Again.
This gave me a clue on the other crashes: it may have to do with the new memory.

So the next step was to take Diana down and test the state of memory:
>>> show mem

showed 2048 MB of memory, no bad pages.

I let the system run for some time, and then it happened: the number of bad pages increased, got stable, decreased, stabilized, increased again, like a yoyo. This explained the odd behaviour: the system ran fine for quite some time and then suddenly found itself in bad memory, causing errors above ASTDEL (IPL=2).
But which one of the four was faulty – if any?
So I first installed the original 512 MB in bank 0 and let it run. No problems.
Installed 2 of the new DIMMs in bank 0 and let it run for 30 minutes: no problem.
Swapped one of the two with another one. Same test, same result.
Swapped the other one, so now I had all four of them tested: no problem either.
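
The swap sequence above can be written down as a small schedule. This is only a sketch in Python: the DIMM labels A to D and the function names are hypothetical, chosen for illustration. It shows that swapping one module at a time touches all four DIMMs in three runs, but leaves some pairings untried and never runs all four together.

```python
from itertools import combinations


def swap_schedule(dimms):
    """Pairs tested when one DIMM is swapped per run (hypothetical helper).

    Each run keeps one module from the previous pair and adds a new one.
    """
    pairs = [(dimms[0], dimms[1])]
    for new in dimms[2:]:
        kept = pairs[-1][1]
        pairs.append((kept, new))
    return pairs


def untested_pairs(dimms, tested):
    """Pairs of DIMMs that were never installed together in a run."""
    seen = {frozenset(pair) for pair in tested}
    return [pair for pair in combinations(dimms, 2)
            if frozenset(pair) not in seen]


if __name__ == "__main__":
    dimms = ["A", "B", "C", "D"]  # placeholder labels for the four new DIMMs
    tested = swap_schedule(dimms)
    print("tested pairs:  ", tested)
    print("untested pairs:", untested_pairs(dimms, tested))
```

A clean result from these runs says that each DIMM worked in at least one pairing for 30 minutes, not that every combination is good.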

This was odd.

So I installed all four of them and booted. Or at least tried to: now there wasn’t any response from the keyboard, and the sequence shown on the console didn’t show anything else.
Only that DKA0 (the local disk) was missing, but that I knew already.

Checked that as well: moved the disk from under the floppy drive to the bottom of the right-side cage (already holding the CD and DVD drives), tested the connectors, and found that this issue seemed to be caused by a bad cable…

Too late to continue tonight.
