SYSMGR in the attic – Page 59 – What goes on behind the doors of the (private) data center

18-Aug-2015 / 20-Aug-2015

18-Aug-2015

Stable – up to now
It looks fine up to now: the system has been up and running continuously for 48 hours, without a glitch. The usual peak on Monday just after midnight (processing last week’s logfiles) didn’t trouble it, nor does the occasional higher demand on resources. Over the last two days, the system has been quite busy between 14:30 and 16:30, according to the peaks in the HyperSpy output:

CPU and memory (both physical and virtual)

The peaks also show up in the WASD graph over the last 72 hours (starting when the system was booted – the part before holds no data, of course):

Next is to contact the supplier for a replacement DIMM, for the one that seems to be bad.

16-Aug-2015

Hung – again
Two things were wrong this morning: To begin, I couldn’t connect to Wifi. It was switched off – my mistake yesterday: attempting to re-connect a Wifi-connected printer using WPS, I switched it off, apparently. Once that was settled, I could authenticate but it was impossible to get an IP address.
Which in my case means that the server was – again – not responding. Not crashed, since that would cause an auto-restart which would enable DHCP as well, within 5-10 minutes.
And sure enough: no response to ^P or the HALT button – meaning I had to stop the machine the hard way.
I exchanged one of the DIMMs and restarted – and the system has run ever since. At least, longer than yesterday, when HyperSpy shows the system had been running for just a few hours:

and silently stopped working – for some reason – just after 19:00. The error log doesn’t show ANYTHING after the last volume change except the last control entry.
However, startup does show , yesterday as well, and probably more often on startup, but for that I’ll need to retrace these moments.

These problems occur when I install extra memory, 512 MB for each DIMM in bank 1; with the original 512 MB (2 x 256 in bank 0) there is nothing wrong. But since the system has now run for 12 hours without a glitch, HALT and ^P both work, and SRM’s show mem doesn’t show any bad pages so far, it seems I found the bad DIMM.
Hopefully.
Anyway, I’m now running on three times the memory I had originally.

There are still a few questions to be answered, though. The console manual I have differs from what I see on screen and from what I can specify as parameters. But that is most likely related to the firmware version (7.2-1) and system spec; the manual seems to cover either the EV6 @ 466 MHz or the EV67 @ 600 MHz, whereas Diana is an EV67 @ 617 MHz. Which may explain the differences in DIP-switch settings…

16-Aug-2015

15-Aug-2015

Retry more memory
Because no external activity was expected over the weekend, I stopped the server, added 1 GB of memory to the 512 MB that has been installed from the beginning, and tried to test the system.
Well, of course I could start the test:
>>>test
It reports that the runtime is 150 seconds, and that CTRL-C may stop testing once started.
And that’s about it. Let it run for about 10 minutes, lost confidence:

NO output, whatsoever.

^C doesn’t work

^P doesn’t work

HALT button doesn’t work

The lack of output may have to do with the environment variable d_verbose being zero.
So stopped the machine by the power-off button, and started it again using the same.

>>> more el

doesn’t say anything but the last startup-sequence so it is of no use at all.
Hence the next step was
>>>set d_verbose 1
>>>test
With the very same result.

Restarted, let the system run for some time, then did
>>>show mem
and that gave me no bad pages, and 1.5 GB of memory.
No surprises 🙂
But since the keyboard was completely unresponsive, there might have been a bad DIMM? So I exchanged them, and repeated the whole sequence, with exactly the same results.
So since I would be around if things went wrong, I started VMS (knowing that the startup sequence has now been tested several times and just worked 🙂 ) to see where this would bring me.

11-Aug-2015 / 16-Aug-2015

11-Aug-2015

Check
A simple check this morning:
* It all works. No crashes, no freezes since yesterday evening
* One stack overflow in the PHP executor – so this is definitely caused by a lack of internal memory

The plan is to do some testing spread over a few days: Add 1 GB at a time to the 512 that will stay in place and see what happens. Let it run for a night and check. If all is well, I could assume these DIMMs are both Ok. Swap one of them with one of the other pair and test again; If still Ok, the one not yet tested should be the bad one.
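The elimination plan above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not something that runs on the server: the names `find_bad_dimm` and `runs_ok` and the labels A1, A2, B1, B2 are made up for the example, and `runs_ok` stands in for "install these two DIMMs next to the 512 MB, let it run for a night, and check". It assumes exactly one of the four DIMMs is bad. The branch for a failing first night is not spelled out in the plan; it is filled in here under that same single-fault assumption.

```python
from typing import Callable, Set, Tuple


def find_bad_dimm(
    pair_a: Tuple[str, str],
    pair_b: Tuple[str, str],
    runs_ok: Callable[[Set[str]], bool],
) -> str:
    """Return the label of the suspect DIMM.

    Assumption: exactly one of the four DIMMs is bad.
    runs_ok(dimms) is a hypothetical stand-in for an overnight run with
    the given two DIMMs installed in addition to the base 512 MB.
    """
    a1, a2 = pair_a
    b1, b2 = pair_b

    if runs_ok({a1, a2}):
        # Night 1 was fine: both DIMMs of pair A are assumed good,
        # so the bad one is in pair B. Swap a2 for b1 and test again.
        if runs_ok({a1, b1}):
            return b2  # still fine: the one not yet tested is bad
        return b1      # trouble appeared together with b1

    # Night 1 failed (branch added for this sketch): the bad DIMM is
    # a1 or a2, and both DIMMs of pair B are good.
    if runs_ok({a1, b1}):
        return a2      # a1 runs fine next to a known-good DIMM
    return a1
```

Either way, two overnight runs are enough to point at one DIMM, as long as the single-fault assumption holds and a bad DIMM actually misbehaves within one night.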
There is one weird thing though that I noticed yesterday evening: Outlook won’t send a message – a reply to an older message; it is kept in the Outbox and Outlook complains that I’m not connected when showing the activity. But all other internet-related activity just works. I restarted SMTP but that didn’t help at all.

10-Aug-2015

Memory in error
Because heat may contribute to the problems, I reorganized the installation.
I had the DS10 under the HSZ50 and three BA356 cabinets; these have their airflow directed downward, which may keep hot air from the server from escaping.
So I turned the units upside-down so they now blow upward. And I gave the DS10 a bit more room around it.
Second, I set console to serial – I already have the HSZ50 hooked up to a VT420 terminal, the server could be hooked to the second session. That shows a bit more information on boot.

Did a memtest – it now worked, where before it refused, for some reason, because 0 wasn’t a valid address… I did a test over a large portion of memory and again, the system froze. Neither CTRL-P on the console nor the HALT button gave a response.

Restarted – and memtest failed again with the same message.
Restarted again – and memtest did run. Tested to the max possible, no errors.
Did sho mem, several times – and the system froze again.
Restarted, sho mem gave no problems over a longer period – so I booted. All seemed Ok until, after about 30 minutes, the system froze again.

There is definitely something wrong with this memory. My guess: one of the DIMMs breaks the system, the question is which one?
There isn’t any message – anywhere – that gives a hint of memory problems, and once VMS is running, there is no way to get this information. I could, of course, write a program that checks pages on the fly but that would cause havoc on performance. Plus, the kernel will intervene to prevent problems with system data and structures – exactly what this program wants to detect…
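To give an idea of what such a page-checking program boils down to, here is a minimal sketch in Python of a write-and-read-back pattern test. It is an illustration only, with invented names (`fill`, `verify`, `pattern_test`), and it shares the limitations mentioned above: it only touches memory the operating system hands to its own process, it cannot choose which physical pages (or which DIMM) it lands on, and caches may hide a faulty cell. It is no substitute for the console memtest.

```python
from typing import List, Tuple


def fill(buf: bytearray, pattern: int) -> None:
    """Overwrite every byte of buf with the given byte pattern."""
    buf[:] = bytes([pattern]) * len(buf)


def verify(buf: bytearray, pattern: int) -> List[int]:
    """Return the offsets in buf that do not hold the pattern."""
    if buf.count(pattern) == len(buf):
        return []  # fast path: every byte matches
    return [i for i, b in enumerate(buf) if b != pattern]


def pattern_test(
    size_bytes: int,
    patterns: Tuple[int, ...] = (0x00, 0xFF, 0x55, 0xAA),
) -> List[Tuple[int, int]]:
    """Write each pattern to a buffer and read it back.

    Returns a list of (pattern, offset) pairs for bytes that read back
    wrong. The buffer lives in this process's virtual memory only.
    """
    buf = bytearray(size_bytes)
    mismatches: List[Tuple[int, int]] = []
    for p in patterns:
        fill(buf, p)
        mismatches.extend((p, off) for off in verify(buf, p))
    return mismatches
```

Something like `pattern_test(64 * 1024 * 1024)` would exercise 64 MB of whatever pages the process happens to get. The alternating patterns 0x55 and 0xAA are there so that every bit is written as both 0 and 1, next to the opposite value.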

In the end, I decided to remove the 2 GB of memory and return to 512 MB, for the time being. I could, of course, add 1 GB at a time and see if that works. But only when I’m at the site…
