15-Aug-2015

Retry more memory
Since no external activity was expected over the weekend, I stopped the server, added 1 GB of memory to the 512 MB that had been installed from the beginning, and tried to test the system.
Well, of course I could start the test:
>>> test
It reports a runtime of 150 seconds, and says that CTRL-C may be used to stop testing once it has started.
And that’s about it. I let it run for about 10 minutes and lost confidence:

  • No output whatsoever.
  • ^C doesn’t work.
  • ^P doesn’t work.
  • The HALT button doesn’t work.
  • The lack of output may have to do with the environment variable d_verbose being zero.

So I stopped the machine with the power-off button and started it again the same way.

>>> more el

doesn’t show anything but the last startup sequence, so it is of no use at all.
Hence the next step was
>>> set d_verbose 1
>>> test

With the very same result.

Restarted, let the system run for some time, then did
>>> show mem
and that gave me no bad pages and 1.5 GB of memory.
No surprises 🙂
But since the keyboard was completely unresponsive, could there have been a bad DIMM? So I swapped them and repeated the whole sequence, with exactly the same results.
So, since I would be around if things went wrong, I started VMS (knowing that the startup sequence has now been tested several times and just worked 🙂) to see where this would bring me.

11-Aug-2015

Check
A simple check this morning:
  • It all works. No crashes, no freezes since yesterday evening.
  • One stack overflow in the PHP executor – so this is definitely caused by lack of internal memory.

The plan is to do some testing spread over a few days: add 1 GB at a time to the 512 MB that will stay in place and see what happens. Let it run for a night and check. If all is well, I can assume these DIMMs are both OK. Then swap one of them with one of the other pair and test again; if that is still OK, the one not yet tested should be the bad one.
There is one weird thing that I noticed yesterday evening, though: Outlook won’t send a message – a reply to an older message. It is kept in the Outbox, and Outlook complains that I’m not connected when showing the activity. But all other internet-related activity just works. I restarted SMTP but that didn’t help at all.
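Something to check the next time this happens – assuming the SMTP side runs on HP TCP/IP Services (on another stack the commands would be different) – is whether the service is actually up and whether anything is sitting in the SMTP queue:

$ TCPIP SHOW SERVICE SMTP
$ TCPIP SHOW MAIL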

10-Aug-2015

Memory in error
Because heat may contribute to the problems, I reorganized the installation.
I had the DS10 under the HSZ50 and three BA356 cabinets; these have their airflow directed downward, which may block the hot air from the server from escaping.
So I turned the units upside down so they now blow upward, and I gave the DS10 a bit more room around it.
Second, I set the console to serial – I already have the HSZ50 hooked up to a VT420 terminal, so the server could be hooked up to the terminal’s second session. That shows a bit more information on boot.
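For reference, the switch itself is just a couple of SRM console commands (from memory, so the exact behaviour may differ per firmware version; an init is needed for it to take effect):

>>> show console
>>> set console serial
>>> init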

Did a memtest – it now worked, where earlier it had, for some reason, refused because 0 wasn’t a valid address… But now it did. I ran a test over a large portion of memory and, again, the system froze. Not even CTRL-P on the console or the HALT button gave a response.
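For reference, a test over a portion of memory is an invocation roughly like the one below – the switches are from memory (-sa and -ea for start and end address, -p for the number of passes), and the addresses are only an illustration, not the exact range I tested:

>>> memtest -sa 0x200000 -ea 0x10000000 -p 1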

Restarted – and memtest failed again with the same message.
Restarted again – and this time memtest did run. Tested as much memory as possible: no errors.
Did show mem several times – and the system froze again.
Restarted; show mem gave no problems over a longer period, so I booted. All seemed OK until, after about 30 minutes, the system froze again.

There is definitely something wrong with this memory. My guess: one of the DIMMs breaks the system – the question is which one.
There isn’t any message – anywhere – that gives a hint of memory problems, and once VMS is running, there is no way to get this information. I could, of course, write a program that checks pages on the fly, but that would cause havoc on performance. Besides, the kernel would intervene to prevent problems with system data and structures – exactly what such a program would want to detect…

In the end, I decided to remove the 2 GB of memory and return to 512 MB, for the time being. I could, of course, add 1 GB at a time and see if that works. But only when I’m at the site…

08-Aug-2015

More on the crashes
Little time to do some more testing.
But whatever I tried, I couldn’t get Diana working again. Perhaps I had the bad DIMM in slot 0? But since we had a small trip planned, I decided to take the server down for the weekend – and see what I could do on return. There are still two PWS systems available that could do the job, or part of it, if all else fails.
Could it be something on the motherboard that causes the system to be unresponsive? In that case I would need to get a replacement…

07-Aug-2015

Crashes
All of a sudden, Diana was completely unresponsive a few days ago. Not even the HALT button (which would normally halt the CPU and return to console mode) worked, nor did CTRL-P, so I had to power off the system and restart it. All seemed well after that.
Until today.

When looking at the server stats this morning to see if there were more PHP errors, I found none AT ALL: the server log showed the webserver had been running for about 30 minutes. That means the system was completely restarted after a crash; not a power outage, since that would stop the disks as well and the system would not restart (the boot process would not be able to access the disks because the HSZ50 wouldn’t be ready yet).
So I checked HyperSpy, and that showed two of them. Since I was at work, investigation had to be postponed until after work.
When that could be done, I logged in remotely and found a number of crashes today:
$ dir operator*/dat

Directory SYS$SYSROOT:[SYSMGR]

$ dir/dat oper*

Directory SYS$SYSROOT:[SYSMGR]

OPERATOR.LOG;4369     7-AUG-2015 12:11:09.78
OPERATOR.LOG;4368     7-AUG-2015 07:25:40.54
OPERATOR.LOG;4367     7-AUG-2015 06:03:43.25
OPERATOR.LOG;4366     7-AUG-2015 02:11:16.71
OPERATOR.LOG;4365     7-AUG-2015 00:21:40.78
OPERATOR.LOG;4364     7-AUG-2015 00:00:00.83

Using ANA/ERR/ELV translate/since=today, I found that one was caused by INVTQEFMT in the NULL process, and it seems all the others were caused by INVEXCEPTN in a webserver process – either the server itself or a worker process; in all cases the current IPL was 8. The last crash told me:

$ ana/crash

OpenVMS system dump analyzer
...analyzing an Alpha compressed selective memory dump...

%SDA-W-LMBMISSING, "S2 Space" LMB missing from dump
Dump taken on  7-AUG-2015 12:06:30.75 using version V8.4
INVEXCEPTN, Exception while above ASTDEL

%SDA-W-PROCNOTDUMPED, process WASD:80-123 not dumped; process private memory not accessible
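The obvious SDA starting points for digging further would be something like the commands below (CLUE being the crash-dump analysis extension; commands from memory, output not shown):

SDA> show crash
SDA> clue crash
SDA> show stack
SDA> show summary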

Something to look into at home.
So I did, but when I logged in to my (still open) Motif session, the system crashed – again.
This gave me a clue about the other crashes: they may have to do with the new memory.

So the next step was to take Diana down and test the state of memory:
>>> show mem

showed 2048 MB of memory, no bad pages.

Let the system run for some time, and then it happened: the number of bad pages increased, got stable, decreased, stabilized, increased again – like a yo-yo. This explained the odd behaviour: the system ran fine for quite some time and suddenly found itself in bad memory – causing errors above ASTDEL (IPL 2).
But which one of the four was faulty – if any?
So I first installed the original 512 MB in bank 0 and let it run. No problems.
Installed two of the new DIMMs in bank 0 and let it run for 30 minutes; no problem.
Swapped one of the two with another one. Same test, same result.
Swapped the other one, so now I had all four of them tested: no problems either.

This was odd.

So I installed all four of them and booted. At least, I tried to, but now there wasn’t any response from the keyboard – and the sequence shown on the console didn’t show anything else.
Only that DKA0 (the local disk) was missing, but that I knew already.

Checked that as well: moved the disk from under the floppy drive to the bottom of the right-side cage (already holding the CD and DVD drives), tested the connectors, and found that this issue seemed to be caused by a bad cable…

Too late to continue tonight.