10-Aug-2015

Memory in error
Because heat may contribute to the problems, I reorganized the installation.
I had the DS10 under the HSZ50 and three BA356 cabinets; these have their airflow directed downward, which may prevent hot air from the server from escaping.
So I turned the units upside-down so they now blow upward. I also gave the DS10 a bit more room around it.
Second, I set the console to serial. I already have the HSZ50 hooked up to a VT420 terminal, so the server could be hooked to the second session. That shows a bit more information on boot.

Did a memtest. It now worked, where earlier it had refused for some reason, claiming that 0 wasn’t a valid address… I ran a test over a large portion of memory and again, the system froze. Neither CTRL-P on the console nor the HALT button gave a response.

Restarted, and memtest failed again with the same message.
Restarted again, and memtest did run. Tested up to the maximum possible, no errors.
Did sho mem, several times – and the system froze again.
Restarted; sho mem gave no problems over a longer period, so I booted. All seemed OK until, after about 30 minutes, the system froze again.

There is definitely something wrong with this memory. My guess: one of the DIMMs breaks the system. The question is: which one?
There isn’t any message anywhere that gives a hint of memory problems, and once VMS is running, there is no way to get this information. I could, of course, write a program that checks pages on the fly, but that would wreak havoc on performance. On top of that, the kernel will intervene to prevent problems with system data and structures, which is exactly what this program would want to detect…
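
For illustration only, the kind of page-checking program mentioned above could look roughly like the sketch below. It is written in Python rather than anything VMS-specific, and the names fill, verify and pattern_test are made up for this example. It only exercises the process's own virtual memory, so it cannot target specific physical pages or DIMMs, and it has exactly the drawbacks already mentioned.

```python
# Minimal write/read-back pattern test over a user-mode buffer.
# Assumption: this only touches the process's own virtual memory; the
# operating system decides which physical pages (and DIMMs) back it.

def fill(buf, pattern):
    """Overwrite every byte of buf with the given pattern byte."""
    buf[:] = bytes([pattern]) * len(buf)

def verify(buf, pattern):
    """Return (offset, expected, actual) for every byte that differs."""
    return [(i, pattern, b) for i, b in enumerate(buf) if b != pattern]

def pattern_test(buf, patterns=(0x00, 0xFF, 0x55, 0xAA)):
    """Write each pattern, read it back, and collect all mismatches."""
    errors = []
    for p in patterns:
        fill(buf, p)
        errors.extend(verify(buf, p))
    return errors

if __name__ == "__main__":
    buffer = bytearray(1024 * 1024)  # 1 MB test area; the size is arbitrary
    bad = pattern_test(buffer)
    print("mismatches:", len(bad))
```

The alternating patterns 0x55 and 0xAA are a common choice because together they drive every bit to both 0 and 1.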

In the end, I decided to remove the 2 GB of memory and return to 512 MB for the time being. I could, of course, add 1 GB at a time and see if that works. But only when I’m at the site…

08-Aug-2015

More on the crashes
Little time to do some more testing.
But whatever I tried, I couldn’t get Diana working again. Perhaps I had the bad DIMM in slot 0? But since we had a small trip planned, I decided to take the server down for the weekend and see what I could do on return. There are still two PWS systems available that could do the job, or part of it, if all else fails.
Could it be something on the motherboard that makes the system unresponsive? In that case I would need to get a replacement…

07-Aug-2015

Crashes
All of a sudden, Diana was completely unresponsive a few days ago. Not even the HALT button (that would normally halt the CPU and return to console mode) worked, nor did CTRL-P, so I had to power off the system and restart it. All seemed well after that.
Until today.

When looking at the server stats this morning to see if there were more PHP errors, I found none AT ALL: the server log showed the webserver had been running for only about 30 minutes. That means the system had completely restarted after a crash. It was not a power outage, since that would stop the disks as well and the system would not restart (the boot process would not be able to access the disks since the HSZ50 wouldn’t be ready yet).
So I checked HyperSpy, and that showed two such restarts. Since I was at work, investigations had to be postponed till after work.
When that could be done, I logged in remotely and found a number of crashes today:
$ dir operator*/dat

Directory SYS$SYSROOT:[SYSMGR]

$ dir/dat oper*

Directory SYS$SYSROOT:[SYSMGR]

OPERATOR.LOG;4369     7-AUG-2015 12:11:09.78
OPERATOR.LOG;4368     7-AUG-2015 07:25:40.54
OPERATOR.LOG;4367     7-AUG-2015 06:03:43.25
OPERATOR.LOG;4366     7-AUG-2015 02:11:16.71
OPERATOR.LOG;4365     7-AUG-2015 00:21:40.78
OPERATOR.LOG;4364     7-AUG-2015 00:00:00.83
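
As a rough aid for reading this listing, the sketch below (Python, with made-up helper names parse_line and restart_gaps) computes the time between successive OPERATOR.LOG versions. It assumes that each new version marks a moment at which a new log was opened, typically a restart; the version created at 00:00:00.83 may well be a scheduled rollover rather than a crash, so the first gap should be read with that in mind.

```python
from datetime import datetime

# Month abbreviations as they appear in a VMS directory listing.
MONTHS = {m: i for i, m in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1)}

# The listing exactly as shown above.
LISTING = """\
OPERATOR.LOG;4369     7-AUG-2015 12:11:09.78
OPERATOR.LOG;4368     7-AUG-2015 07:25:40.54
OPERATOR.LOG;4367     7-AUG-2015 06:03:43.25
OPERATOR.LOG;4366     7-AUG-2015 02:11:16.71
OPERATOR.LOG;4365     7-AUG-2015 00:21:40.78
OPERATOR.LOG;4364     7-AUG-2015 00:00:00.83
"""

def parse_line(line):
    """Split 'NAME;VER  D-MON-YYYY HH:MM:SS.cc' into (name, datetime)."""
    name, date, time = line.split()
    day, mon, year = date.split("-")
    hms, hundredths = time.split(".")
    h, m, s = (int(x) for x in hms.split(":"))
    stamp = datetime(int(year), MONTHS[mon.upper()], int(day),
                     h, m, s, int(hundredths) * 10000)
    return name, stamp

def restart_gaps(listing):
    """Return (file, time since the previous version) in chronological order."""
    entries = sorted((parse_line(l) for l in listing.splitlines() if l.strip()),
                     key=lambda e: e[1])
    return [(b[0], b[1] - a[1]) for a, b in zip(entries, entries[1:])]

if __name__ == "__main__":
    for name, gap in restart_gaps(LISTING):
        print(name, "opened", gap, "after the previous version")
```

For this listing the gaps range from roughly 20 minutes to almost 5 hours, which fits a fault that strikes irregularly rather than on a schedule.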

Using ANA/ERR/ELV translate/since=today, I found that one was caused by INVTQEFMT in the NULL process, and it seems all the others were caused by INVEXCEPTN in a webserver process (either the server itself or a worker process), all with the current IPL at 8. The last crash told me:

$ ana/crash

OpenVMS system dump analyzer
...analyzing an Alpha compressed selective memory dump...

%SDA-W-LMBMISSING, "S2 Space" LMB missing from dump
Dump taken on  7-AUG-2015 12:06:30.75 using version V8.4
INVEXCEPTN, Exception while above ASTDEL

%SDA-W-PROCNOTDUMPED, process WASD:80-123 not dumped; process private memory not accessible

Something to look into at home.
So I did, but when I logged in to my (still open) Motif session, the system crashed – again.
This gave me a clue on the other crashes: it may have to do with the new memory.

So the next step was to take Diana down and test the state of memory:
>>> show mem

showed 2048 MB of memory, no bad pages.

I let the system run for some time, and then it happened: the number of bad pages increased, got stable, decreased, stabilized, increased again – like a yoyo. This explained the odd behaviour: the system ran fine for quite some time and then suddenly found itself in bad memory, causing errors above ASTDEL (IPL=2).
But which one of the four was faulty – if any?
So I first installed the original 512 MB in bank 0 and let it run. No problems.
Installed 2 of the new DIMMs in bank 0 and let it run for 30 minutes: no problem.
Swapped one of the two with another one. Same test, same result.
Swapped the other one, so now I had all four of them tested: no problem either.

This was odd.

So I installed all four of them, and booted. At least, I tried to, but now there wasn’t any response from the keyboard, and the boot sequence on the console didn’t show anything unusual.
Only that DKA0 (the local disk) was missing, but that I knew already.

Checked that as well: moved the disk from under the floppy drive to the bottom of the right-side cage (already holding CD and DBD drives), tested the connectors, and found that this issue seemed to be caused by a bad cable…

Too late to continue tonight.