18-Aug-2015

Stable – up to now
It looks fine up to now: the system has been up and running continuously for 48 hours, without a glitch. The usual peak on Monday just after midnight (processing last week’s logfiles) didn’t trouble it, nor does the occasional higher demand on resources. It seems that over the last two days the system has been quite busy between 14:30 and 16:30, according to the peaks in the HyperSpy output:

CPU and memory (both physical and virtual)

Paging and buffered IO

Direct IO (disk) and network

The peaks also show up in the WASD graph over the last 72 hours (starting when the system was booted – the part before holds no data, of course):

WASD Graph

The next step is to contact the supplier for a replacement for the DIMM that seems to be bad.

16-Aug-2015

Hung – again
Two things were wrong this morning: To begin, I couldn’t connect to Wifi. It was switched off – my mistake yesterday: attempting to re-connect a Wifi-connected printer using WPS, I switched it off, apparently. Once that was settled, I could authenticate but it was impossible to get an IP address.
Which in my case means that the server was – again – not responding. Not crashed, since that would cause an auto-restart which would enable DHCP as well, within 5-10 minutes.
And sure enough: no response on ^P nor the HALT button – meaning I had to stop the machine the hard way.
I exchanged one of the DIMMs and restarted, and the system has run ever since. At least, longer than yesterday, when HyperSpy shows the system had been running for just a few hours:
2015-08-16_20-54-34
and silently stopped working, for some reason, just after 19:00. The error log doesn’t show ANYTHING after the last volume change, except the last control entry.
However, startup does show , yesterday as well, and probably more often on startup, but for that I’ll need to retrace these moments.

These problems occur when I install extra memory, 512 MB for each DIMM in bank 1; with the original 512 MB (2 x 256 in bank 0) there is nothing wrong. But since the system has now run for 12 hours without a glitch, both HALT and ^P work, and SRM’s show mem doesn’t show any bad pages so far, it seems I found the bad DIMM.
Hopefully.
Anyway, I’m now running on three times the memory I had originally.
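As a quick check on that figure, the totals can be worked out from the DIMM sizes mentioned above. This is only a small sketch; the variable names are mine and not anything the console reports.

```python
# DIMM sizes in MB, as described in this entry
bank0 = [256, 256]   # the original memory (2 x 256 in bank 0)
bank1 = [512, 512]   # the extra memory (512 MB per DIMM in bank 1)

original_mb = sum(bank0)
total_mb = sum(bank0) + sum(bank1)

# Total installed memory and the growth factor relative to the original
print(total_mb, "MB installed,", total_mb / original_mb, "times the original")
```

That gives 1536 MB, which matches the 1.5 GB that show mem reported on 15-Aug.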

There are still a few questions to be answered, though. The console manual I have differs from what I see on screen and from what I can specify as parameters. But that is most likely related to the firmware version (7.2-1) and system spec: the manual seems to cover either the EV6@466MHz or the EV67@600MHz, where Diana is an EV67@617MHz. Which may explain the differences in DIP-switch settings…

15-Aug-2015

Retry more memory
Because no external activity was expected over the weekend, I stopped the server, added 1 GB of memory to the 512 MB that has been installed from the beginning, and tried to test the system.
Well, of course I could start the test:
>>>test
It tells me the runtime is 150 seconds, and that CTRL-C may stop testing once started.
And that’s about it. I let it run for about 10 minutes, then lost confidence:

  • NO output, whatsoever.
  • ^C doesn’t work
  • ^P doesn’t work
  • HALT button doesn’t work
  • No output may have to do with environment variable d_verbose being zero.
    So I stopped the machine with the power button, and started it again the same way.

    >>> more el

    doesn’t say anything but the last startup-sequence so it is of no use at all.
    Hence the next step was
    >>>set d_verbose 1
    >>>test

    With the very same result.

    Restarted, let the system run for some time, then did
    >>>show mem
    and that gave me no bad pages, and 1.5 GB of memory.
    No surprises 🙂
    But since the keyboard was completely unresponsive, there might have been a bad DIMM? So I exchanged them, and repeated the whole sequence, with exactly the same results.
    So since I would be around if things went wrong, I started VMS (knowing that the startup sequence has now been tested several times and just worked 🙂 ) to see where this would bring me.

    11-Aug-2015

    Check
    A simple check this morning:
    * It all works. No crashes, no freezes since yesterday evening
    * One stack overflow in the PHP executor – so this is definitely caused by lack of internal memory

    The plan is to do some testing spread over a few days: add 1 GB at a time to the 512 MB that will stay in place and see what happens. Let it run for a night and check. If all is well, I can assume these DIMMs are both Ok. Then swap one of them with one of the other pair and test again; if still Ok, the one not yet tested should be the bad one.
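The swap plan above can be sketched as a small elimination routine. This is a sketch under assumptions, not a tested procedure: it assumes exactly one of the four extra DIMMs is bad, and that an overnight run reliably tells a clean set from a faulty one. The labels A to D and the function `runs_clean` are hypothetical; in practice `runs_clean` is me installing the pair and checking the next morning. The branch for a failing first pair is my own extension of the plan.

```python
from typing import Callable, Tuple

Pair = Tuple[str, str]

def find_bad_dimm(first_pair: Pair, second_pair: Pair,
                  runs_clean: Callable[[Pair], bool]) -> str:
    """Return the label of the single bad DIMM among two pairs.

    runs_clean(installed) must return True when the system survives
    a test run with the given two DIMMs added to the original 512 MB.
    """
    a, b = first_pair
    c, d = second_pair
    if runs_clean((a, b)):
        # a and b are both good; swap b for c and test again.
        # If that is still clean, the untested DIMM d must be the bad one.
        return d if runs_clean((a, c)) else c
    # The first pair failed, so the bad DIMM is a or b, and c is good
    # under the single-fault assumption. Pair a with c to tell them apart.
    return b if runs_clean((a, c)) else a


# Simulated example: pretend DIMM "C" is the bad one.
bad = "C"
print(find_bad_dimm(("A", "B"), ("C", "D"), lambda installed: bad not in installed))
```

Either way it takes two test runs to single out one DIMM. With an intermittent fault a clean night proves less than a freeze does, so a longer run gives more confidence.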
    There is one weird thing, though, that I noticed yesterday evening: Outlook won’t send a message, a reply to an older message. It is kept in the Outbox, and Outlook complains that I’m not connected when showing the activity. But all other internet-related activity just works. I restarted SMTP but that didn’t help at all.

    10-Aug-2015

    Memory in error
    Because heat may contribute to the problems, I reorganized the installation.
    I had the DS10 under the HSZ50 and three BA356 cabinets; these have their airflow directed downward, which may keep hot air from the server from escaping.
    So I turned the units upside-down so they now blow upward. And I gave the DS10 a bit more room around.
    Second, I set the console to serial: I already have the HSZ50 hooked up to a VT420 terminal, so the server could be hooked to the second session. That shows a bit more information on boot.

    Did a memtest. It now worked, where earlier it refused, for some reason, because 0 wasn’t a right address… I did a test over a large portion of memory and again the system froze. Not even CTRL-P on the console or the HALT button gave a response.

    Restarted – and memtest failed again on the same message.
    Restarted again – and memtest did run. Tested to the max possible, no errors.
    Did sho mem, several times – and the system froze again.
    Restarted, sho mem gave no problems over a longer period – so I booted. All seemed Ok until, after about 30 minutes, the system froze again.

    There is definitely something wrong with this memory. My guess: one of the DIMMs breaks the system, the question is which one?
    There isn’t any message, anywhere, that gives a hint of memory problems, and once VMS is running there is no way to get this information. I could, of course, write a program that checks pages on the fly, but that would cause havoc on performance. Plus, the kernel will intervene to prevent problems with system data and structures, which is exactly what this program wants to detect…
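To show what such a checker would do in principle, here is a minimal sketch of a pattern test over a buffer. It is not a VMS program and it cannot target specific physical pages: it only tests whatever memory the operating system hands to the process, which is exactly the limitation described above. The function name and the choice of patterns are my own assumptions.

```python
from typing import List, Sequence, Tuple

def check_buffer(size_bytes: int,
                 patterns: Sequence[int] = (0x55, 0xAA, 0x00, 0xFF)
                 ) -> List[Tuple[int, int]]:
    """Fill a buffer with each byte pattern and read it back.

    Returns a list of (pattern, offset) pairs where the value read
    back differs from the value written; empty means no mismatch.
    """
    buf = bytearray(size_bytes)
    mismatches = []
    for pattern in patterns:
        expected = bytes([pattern]) * size_bytes
        buf[:] = expected                  # write the pattern
        if buf != expected:                # read back and compare
            mismatches.extend(
                (pattern, offset)
                for offset, value in enumerate(buf)
                if value != pattern
            )
    return mismatches


# Example: test 1 MB of process memory.
print(check_buffer(1024 * 1024))
```

Alternating 0x55 and 0xAA flips every bit between passes, which is why those two values are a common choice. Running something like this continuously over a large buffer would compete with everything else for memory and cache, hence the performance concern.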

    In the end, I decided to remove the 2 GB of memory and return to 512 MB, for the time being. I could, of course, add 1 GB at a time and see if that works. But only when I’m at the site…