April 2008 – Page 2 – SYSMGR in the attic

12-Apr-2008

Cluster issues
Some time ago, I tried to bring a second Alpha into the cluster, using the shared SCSI connection. But whatever I tried, at some point the connection with the system disk was lost; the console software doesn’t like that and starts spitting out tons of data.
Tonight I decided to give it another try: hook my AlphaServer 400 system onto the shared SCSI and boot it from the common system disk. One thing I had already found to cause a problem: making the system disk the quorum disk wasn’t a good idea. Therefore, I swapped a 36 GB disk in the storage shelves for an old 4.3 GB one, and defined that as the quorum disk. It meant I had to do something in the HSZ50 as well, to give it the proper size. After that, I autogen’d Diana, without feedback because the data was too old and Autogen complained.
Rebooting Diana was no problem, but the quorum disk was lost and regained, lost and regained, and so on: the disk must have been bad. After I replaced it with another one, Diana hung and had to be crashed. My mistake: the machine should have been stopped before I started messing with the quorum disk… But when I started Diana after resetting the controller, the system kept waiting for the internal SCSI to poll. It started working immediately after I switched off the AlphaServer, and returned to more normal activity.

Starting the AlphaServer from the local disk (VMS 8.3 as well, but without all the patches) was no problem, but all licenses had expired. Next I tried to mount the disk on which all system files are stored, but that caused havoc on Diana again: the disks were found to be improperly dismounted, so mount verification started, and Diana lost contact with the disks.

I must have a look into the controllers. I have now installed KZPBY-CY and that _should_ work, but I’ll have to try with KZPSA because that is said to work properly…
Luckily, I now have good listings of the node that won’t start, to show the experts.

MySQL will have to reside on this machine for the time being. Or I have to set up Dido as a standalone machine.


07-Apr-2008

Updates
The last update contained a patch that caused trouble debugging images. Today I installed a patch that corrects the problem, plus a number of patches that were released after that last big update. Within 5 minutes, Diana was up and running again, but when the webserver came alive, there were several streams accessing the database, and I was accessing the Wiki to see if that was still working. It was, and extremely fast (in my experience). What followed wasn’t a real surprise in hindsight: MySQL couldn’t cope with the sudden load. But since the watchdog was due to run within a minute, it was back in no time…
Another advantage of the reboot: pagefile usage had been about 25% over the last few days. That is rather high, and I haven’t found out yet what caused it. Now it’s back to more normal levels.
Normal access seems to work fine. There are still some things to check (CIFS, for instance, but there is somewhat more trouble there), but these are of less importance. Except, of course, that I ran into a problem: because one logical wasn’t defined, the homepage didn’t show up. That problem is now solved.
Hopefully, debugging an image now works again.


04-Apr-2008

System performance view gone
A day usually starts with checking the operator logs, system performance and some other things that happened the day before. This morning, the HyperSpi views on system performance were missing: the logical used wasn’t defined properly. Quite likely it wasn’t working yesterday either, because I used the WASD startup procedure, where the HyperSpi logicals are defined as well. This probably messed up the logical. That means there are no HyperSpi statistics from the time I ran the procedure after upgrading the webserver. After I corrected the logical, statistics resumed. So there is a gap of 2 days. But I do have the T4 data!

MySQL keeps raising issues
The watcher procedure lists when MySQL is restarted, and the result so far, for 2008:

MYSQL Restarted 2008-01-14 07:31:00.23
MYSQL Restarted 2008-01-21 17:25:00.07
MYSQL Restarted 2008-01-21 22:10:00.23
MYSQL Restarted 2008-02-06 03:35:00.43
MYSQL Restarted 2008-02-17 19:50:00.08
MYSQL Restarted 2008-02-17 20:35:00.63
MYSQL Restarted 2008-03-08 21:35:00.35
MYSQL Restarted 2008-03-10 07:35:00.27
MYSQL Restarted 2008-03-13 21:20:00.51
MYSQL Restarted 2008-03-23 20:51:00.08
MYSQL Restarted 2008-03-24 23:06:00.11
MYSQL Restarted 2008-03-28 07:06:00.24
MYSQL Restarted 2008-03-30 20:36:00.25
MYSQL Restarted 2008-04-01 18:21:00.26
MYSQL Restarted 2008-04-02 21:21:00.35
MYSQL Restarted 2008-04-04 11:51:00.51
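To see whether the restarts are getting more frequent, the watcher output can be tallied per month. The following is only a minimal sketch in Python; the function name `restarts_per_month` is made up for illustration and is not part of the watcher procedure itself, and the timestamps are simply copied from the list above.

```python
from collections import Counter

# Restart timestamps copied verbatim from the watcher output for 2008.
RESTARTS = """\
MYSQL Restarted 2008-01-14 07:31:00.23
MYSQL Restarted 2008-01-21 17:25:00.07
MYSQL Restarted 2008-01-21 22:10:00.23
MYSQL Restarted 2008-02-06 03:35:00.43
MYSQL Restarted 2008-02-17 19:50:00.08
MYSQL Restarted 2008-02-17 20:35:00.63
MYSQL Restarted 2008-03-08 21:35:00.35
MYSQL Restarted 2008-03-10 07:35:00.27
MYSQL Restarted 2008-03-13 21:20:00.51
MYSQL Restarted 2008-03-23 20:51:00.08
MYSQL Restarted 2008-03-24 23:06:00.11
MYSQL Restarted 2008-03-28 07:06:00.24
MYSQL Restarted 2008-03-30 20:36:00.25
MYSQL Restarted 2008-04-01 18:21:00.26
MYSQL Restarted 2008-04-02 21:21:00.35
MYSQL Restarted 2008-04-04 11:51:00.51
"""


def restarts_per_month(text):
    """Count 'MYSQL Restarted' entries per YYYY-MM (hypothetical helper)."""
    counts = Counter()
    for line in text.splitlines():
        parts = line.split()
        # Expected shape: MYSQL Restarted <date> <time>
        if len(parts) == 4 and parts[:2] == ["MYSQL", "Restarted"]:
            counts[parts[2][:7]] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    for month, count in restarts_per_month(RESTARTS).items():
        print(month, count)
```

Run on the list above, this groups the 16 restarts by month, which makes the jump in March easy to spot.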

WordPress 2.5 IS indeed faster than 2.3, and though it looks like the database engine is used more efficiently (statements are checked more rigorously before actually executing them), it still happens quite often that MySQL runs out of core, and so does PHP. I’ll need to dig far deeper to find out exactly why. A specific T4 examination may show something.
Funny, though, that the Wiki (being a Python application) seems to have no problem at all. It’s not fast, but so far I’ve seen no messages stating that free memory is low, or any other weird messages. It simply works.

Perhaps installing MySQL 5.1-22 might help, but I’m afraid it will require an even bigger memory footprint than the current average of 30 MB (according to WASD’s Admin SYSTEM output). It’s only feasible if it’s more stable under stress 😉

Spam filter message
There was one funny thing in the operator log:

%%%%%%%%%%%  OPCOM   3-APR-2008 09:30:39.87  %%%%%%%%%%%
Message from user SYSTEM on DIANA
%PTSMTP-E-SHAREPRIV, workers require SHARE privilege; setsockopt of UXC$C_SHARE failed
-SYSTEM-F-PROTOCOL, network protocol error

There was no big load on the mail system at that moment, so where this came from, I don’t know. PTSMTP, the spam filter, continued without a problem with 2 worker processes; a third was not needed at this point, unless messages had been rejected because of DNS-blacklisted senders or relay attempts. So I ran the statistics procedure:

PMAS statistics for 04
Total messages    :  460 = 100.0 o/o
DNS Blacklisted   :  305 =  66.3 o/o (Files: 3)
Relay attempts    :    3 =    .6 o/o (Files: 1)
Processed by PMAS :  152 =  33.0 o/o (Files: 3)
Discarded         :   45 =  29.6 o/o (processed),  9.7 o/o (all)
Quarantained      :   47 =  30.9 o/o (processed), 10.2 o/o (all)
Delivered         :   60 =  39.4 o/o (processed), 13.0 o/o (all)

305 blacklisted messages in 3 days (today’s files haven’t been renamed yet) is about 3000 in one month. That is more than usual, but not really extreme. But it’s possible. I would have to scrutinize the logfile, and I know the time:
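The arithmetic behind these figures is easy to re-check. The sketch below, in Python, recomputes the percentages and the monthly projection from the counts in the statistics output. The helper names are made up for illustration, and the truncation to one decimal is only my assumption, inferred from the printed values (60 out of 152 shows as 39.4 where rounding would give 39.5); it is not taken from the statistics procedure itself.

```python
# Counts copied from the PMAS statistics output.
TOTAL = 460
BLACKLISTED = 305
RELAY_ATTEMPTS = 3
PROCESSED = 152
DISCARDED, QUARANTINED, DELIVERED = 45, 47, 60


def tenths_percent(part, whole):
    """Percentage truncated (not rounded) to one decimal, as a string.

    Integer arithmetic avoids floating-point surprises. Truncation is an
    assumption inferred from the printed report.
    """
    tenths = part * 1000 // whole
    return f"{tenths // 10}.{tenths % 10}"


def monthly_projection(count, days_observed, days_in_month=30):
    """Scale a count seen over a few days to a 30-day month."""
    return count * days_in_month / days_observed


if __name__ == "__main__":
    print("DNS blacklisted:", tenths_percent(BLACKLISTED, TOTAL), "o/o")
    print("Processed      :", tenths_percent(PROCESSED, TOTAL), "o/o")
    for name, count in (("Discarded", DISCARDED),
                        ("Quarantined", QUARANTINED),
                        ("Delivered", DELIVERED)):
        print(f"{name:<15}:", tenths_percent(count, PROCESSED), "o/o (processed),",
              tenths_percent(count, TOTAL), "o/o (all)")
    print("Blacklisted per month, projected:", monthly_projection(BLACKLISTED, 3))
```

The projection of 305 messages over 3 days comes out at 3050 for a 30-day month, which is the "about 3000" mentioned above.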

3-APR-2008 09:27:03.63: Address (88.227.51.122) blacklisted (4)
3-APR-2008 09:28:08.69: Address (89.218.166.171) blacklisted (4)
3-APR-2008 09:36:36.56: Address (81.5.15.231) blacklisted (4)
3-APR-2008 09:36:44.76: Address (219.148.119.178) blacklisted (4)
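Picking the entries around the time of the OPCOM message can also be done mechanically. This is a rough sketch in Python, assuming the logfile lines always have the shape shown above (a VMS-style timestamp, a colon and a space, then the message); the function names and the window size are my own choices for illustration.

```python
from datetime import datetime, timedelta

MONTHS = {name: number for number, name in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1)}


def parse_vms_time(stamp):
    """Parse 'D-MMM-YYYY HH:MM:SS.cc' (hundredths of a second) into a datetime."""
    date_part, time_part = stamp.split()
    day, month, year = date_part.split("-")
    hms, _, hundredths = time_part.partition(".")
    hour, minute, second = (int(x) for x in hms.split(":"))
    micro = int(hundredths.ljust(2, "0")[:2]) * 10_000 if hundredths else 0
    return datetime(int(year), MONTHS[month.upper()], int(day),
                    hour, minute, second, micro)


def entries_near(lines, event_time, window):
    """Return the log lines whose timestamp lies within `window` of `event_time`."""
    hits = []
    for line in lines:
        # The timestamp ends at the first colon that is followed by a space.
        stamp, sep, _ = line.partition(": ")
        if sep and abs(parse_vms_time(stamp) - event_time) <= window:
            hits.append(line)
    return hits


# Log lines copied from the spam filter logfile excerpt.
LOG = [
    "3-APR-2008 09:27:03.63: Address (88.227.51.122) blacklisted (4)",
    "3-APR-2008 09:28:08.69: Address (89.218.166.171) blacklisted (4)",
    "3-APR-2008 09:36:36.56: Address (81.5.15.231) blacklisted (4)",
    "3-APR-2008 09:36:44.76: Address (219.148.119.178) blacklisted (4)",
]

if __name__ == "__main__":
    opcom_time = parse_vms_time("3-APR-2008 09:30:39.87")
    for hit in entries_near(LOG, opcom_time, timedelta(minutes=5)):
        print(hit)
```

With a 5-minute window around 09:30:39.87, only the two entries before the message qualify, which fits the picture of no unusual burst at that moment.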

No weirdness in other logs either.

It’s been considered just an incident.
