12-Apr-2008

Cluster issues
Some time ago, I tried to bring a second Alpha into the cluster, using the shared SCSI connection. But whatever I tried, at some point the connection with the system disk was lost; the console software doesn’t like that and starts spitting out tons of data.
Tonight I decided to give it another try: hook my AlphaServer 400 system onto the shared SCSI and boot it from the common system disk. One cause of trouble I had already found: it seems that making the system disk the quorum disk wasn’t a good idea. Therefore, I swapped a 36 GB disk in the storage shelves for an old 4.3 GB one, and defined that as the quorum disk. That meant I had to change something in the HSZ50 as well, to give it the proper size. After that, I ran AUTOGEN on Diana, without feedback because AUTOGEN complained that the feedback data was too old.
Rebooting Diana was no problem, but the quorum disk was lost and regained, lost and regained, and so on: the disk must have been bad. After I replaced it with another one, Diana hung and had to be crashed. My mistake: the machine should have been stopped before I started fiddling with the quorum disk… But when I started Diana after resetting the controller, the system kept waiting for the internal SCSI to poll. It started working immediately after I switched off the AlphaServer, and returned to more normal activity.

Starting the AlphaServer from its local disk (VMS 8.3 as well, but without all the patches) was no problem, but all licenses had expired. Next I tried to mount the disk on which all system files are stored, but that caused havoc on Diana again: the disks were found to be improperly dismounted, so mount verification started – and Diana lost contact with the disks.

I must have a look at the controllers. I have now installed a KZPBY-CY and that _should_ work, but I’ll have to try a KZPSA because that is said to work properly…
Luckily, I now have good listings from the node that won’t start, to show the experts.

MySQL will have to reside on this machine for the time being. Or I have to set up Dido as a standalone machine.

10-Apr-2008

Procedures available for download
Checking the system and the access to it is something to be done regularly. Two of these tasks are scanning the web access log for deliberate abuse attempts (weekly) and gathering the mail statistics (monthly).
Checking the weblogs means scanning hundreds of lines while the number of actual attempts is rather low, and getting the mail statistics is just a matter of counting records in the logs. Both are very boring activities that the system can do faster and with fewer errors than any human can.
So I wrote DCL procedures to do it: WeblogScan.com for scanning the WASD logfiles, and pmas.com for counting mail statistics from the PMAS logfiles. You cannot use them without changing some information, but what to adapt is clearly noted in the scripts.
Another handy procedure, which requires some site-specific settings to be adapted as well, is my MySQL watchdog. It limits MySQL downtime after a crash to a maximum of 15 minutes.
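
For readers who want the idea rather than the DCL: below is a minimal Python sketch of what one scheduled watchdog pass might look like. It is not the downloadable procedure. The names `watchdog_pass`, `is_running` and `restart` are hypothetical, and the check and restart actions are passed in as functions because how to test for and start MySQL depends on the site. If such a pass is scheduled every 15 minutes, a crash is noticed within at most 15 minutes.

```python
from datetime import datetime

def watchdog_pass(is_running, restart, log, now=None):
    """Run one watchdog pass.

    is_running: callable returning True when the service is up (site specific).
    restart:    callable that starts the service again (site specific).
    log:        list that receives one line per restart.
    Returns True if a restart was performed, False otherwise.
    """
    if is_running():
        return False
    restart()
    # Timestamp with hundredths of a second, like the restart list in this blog.
    stamp = (now or datetime.now()).strftime("%Y-%m-%d %H:%M:%S.%f")[:-4]
    log.append(f"MYSQL Restarted {stamp}")
    return True
```

The scheduling itself (a batch job resubmitting itself, for instance) is left out; the sketch only covers the check, restart and log steps.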

All files are free; use them to make your life a bit easier.

Don’t rely on them blindly. They work for me and I’m satisfied with them, but that doesn’t automatically mean they will work for you. No guarantee, no support (well, if you know a way to enhance them, please supply feedback so everyone can benefit from your experience).

You may pass them on but please leave the header intact.

07-Apr-2008

Updates
The last update contained a patch that caused trouble debugging images. Today I installed a patch that corrects the problem, plus a number of patches that were released after that last big update. Within 5 minutes, Diana was up and running again, but when the webserver came alive, there were several streams accessing the database – and I was accessing the Wiki to see if that was still working. It was, and extremely fast (in my experience). Afterwards that was no real surprise: MySQL couldn’t cope with the sudden load. But since the watchdog was due to run within a minute, it was back in no time…
Another advantage of the reboot: pagefile usage was about 25% over the last few days. That is rather high, and I haven’t found out yet what caused it. Now it’s back to more normal levels.
Normal access seems to work fine. There are still some things to check (CIFS, for instance, but there is somewhat more trouble there) but these are of less importance. Except, of course, that I ran into a problem: because one logical wasn’t defined, the homepage didn’t show up. That problem is now solved.
Hopefully, debugging an image now works again.

04-Apr-2008

System performance view gone
A day usually starts with checking the operator logs, system performance and some other things that happened yesterday. This morning, the HyperSpi views of the system performance were missing – the logical used wasn’t defined properly. Quite likely it wasn’t working yesterday either, because I used the WASD startup procedure, in which the HyperSpi logicals are defined as well. This probably messed up the logical. That means there are no HyperSpi statistics from the time I ran the procedure after upgrading the webserver. After I corrected the logical, statistics resumed. So there is a gap of 2 days. But I do have the T4 data!

MySQL keeps raising issues
The watcher procedure lists when MySQL is restarted, and the result so far, for 2008:

MYSQL Restarted 2008-01-14 07:31:00.23
MYSQL Restarted 2008-01-21 17:25:00.07
MYSQL Restarted 2008-01-21 22:10:00.23
MYSQL Restarted 2008-02-06 03:35:00.43
MYSQL Restarted 2008-02-17 19:50:00.08
MYSQL Restarted 2008-02-17 20:35:00.63
MYSQL Restarted 2008-03-08 21:35:00.35
MYSQL Restarted 2008-03-10 07:35:00.27
MYSQL Restarted 2008-03-13 21:20:00.51
MYSQL Restarted 2008-03-23 20:51:00.08
MYSQL Restarted 2008-03-24 23:06:00.11
MYSQL Restarted 2008-03-28 07:06:00.24
MYSQL Restarted 2008-03-30 20:36:00.25
MYSQL Restarted 2008-04-01 18:21:00.26
MYSQL Restarted 2008-04-02 21:21:00.35
MYSQL Restarted 2008-04-04 11:51:00.51
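
To see whether the restarts are getting more frequent, the list can be counted per month. The following is a minimal Python sketch, assuming every log line has exactly the form shown above; the function names `parse_restarts` and `restarts_per_month` are made up for this illustration.

```python
from collections import Counter
from datetime import datetime

PREFIX = "MYSQL Restarted "

def parse_restarts(lines):
    """Return the restart timestamps found in the watchdog log lines."""
    stamps = []
    for line in lines:
        line = line.strip()
        if line.startswith(PREFIX):
            # %f accepts the two-digit hundredths used in the log.
            stamps.append(datetime.strptime(line[len(PREFIX):],
                                            "%Y-%m-%d %H:%M:%S.%f"))
    return stamps

def restarts_per_month(stamps):
    """Count restarts per calendar month, keyed as 'YYYY-MM'."""
    return Counter(s.strftime("%Y-%m") for s in stamps)

# Three lines taken from the list above, as a small sample.
sample = [
    "MYSQL Restarted 2008-01-14 07:31:00.23",
    "MYSQL Restarted 2008-03-08 21:35:00.35",
    "MYSQL Restarted 2008-03-10 07:35:00.27",
]
print(restarts_per_month(parse_restarts(sample)))
```

Counted by hand over the full list of 16 restarts, that gives 3 in January, 3 in February, 7 in March and 3 so far in April.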

WordPress 2.5 IS indeed faster than 2.3, and though it looks like the database engine is used more efficiently (statements are checked more rigorously before actually executing them), it still happens quite often that MySQL runs out of memory – and so does PHP. But I’ll need to dig far deeper to find out exactly why. A specific T4 examination may show something.
Funny, though, that the Wiki (being a Python application) seems to have no problem at all. It’s not fast, but so far I’ve seen no messages stating that free memory is low, or any other weird messages. It simply – works.

Perhaps installing MySQL 5.1-22 might help, but I’m afraid it will require an even bigger memory footprint than the current average of 30 MB (according to WASD’s Admin SYSTEM output). It’s only feasible if it’s more stable under stress 😉

Spam filter message
There was one funny thing in the operator log:

%%%%%%%%%%% OPCOM 3-APR-2008 09:30:39.87 %%%%%%%%%%%
Message from user SYSTEM on DIANA
%PTSMTP-E-SHAREPRIV, workers require SHARE privilege; setsockopt of UXC$C_SHARE failed
-SYSTEM-F-PROTOCOL, network protocol error

There was no big load on the mail system at that moment, so where this came from – I don’t know. PTSMTP – the spam filter – continued without a problem, with 2 worker processes; a third was not needed at this point, unless messages have been rejected because of DNS blacklisted senders or relay attempts. So I ran the statistics procedure:

PMAS statistics for 04
Total messages    :  460 = 100.0 o/o
DNS Blacklisted   :  305 =  66.3 o/o (Files:  3)
Relay attempts    :    3 =    .6 o/o (Files:  1)
Processed by PMAS :  152 =  33.0 o/o (Files:  3)
        Discarded :   45 =  29.6 o/o (processed),   9.7 o/o (all)
     Quarantained :   47 =  30.9 o/o (processed),  10.2 o/o (all)
        Delivered :   60 =  39.4 o/o (processed),  13.0 o/o (all)
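
The percentages are plain arithmetic on the counts. The figures above are consistent with truncating to one decimal rather than rounding (60 of 152 is 39.47, shown as 39.4), which is what integer arithmetic produces. Whether pmas.com computes it exactly this way is an assumption; here is a minimal Python sketch of that calculation, with `pct` as a made-up helper name.

```python
def pct(part, whole):
    """Percentage of part in whole, truncated to one decimal (integer math)."""
    tenths = part * 1000 // whole          # percentage in tenths of a percent
    return f"{tenths // 10}.{tenths % 10}"

total = 460        # all messages
processed = 152    # messages processed by PMAS

print("DNS Blacklisted :", pct(305, total))
print("Delivered       :", pct(60, processed), "(processed),", pct(60, total), "(all)")
```

The counts themselves add up: 305 + 3 + 152 = 460 messages in total, and 45 + 47 + 60 = 152 processed.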

305 blacklisted messages in 3 days (today’s files haven’t been renamed yet) is about 3000 in one month. Well, more than usual, but not really extreme. But possible. I would have to scrutinize the logfile, and I know the time:

3-APR-2008 09:27:03.63: Address (88.227.51.122) blacklisted (4)
3-APR-2008 09:28:08.69: Address (89.218.166.171) blacklisted (4)
3-APR-2008 09:36:36.56: Address (81.5.15.231) blacklisted (4)
3-APR-2008 09:36:44.76: Address (219.148.119.178) blacklisted (4)

No weirdness in the other logs either.

I’ll consider it just an incident.

03-Apr-2008

Webpage program getting on
Tonight I took some time to finish the first stage: all modules now compile and the program links, though most modules are simply RETURN statements because they will be coded later. But I cannot continue at this moment for one stupid reason: the last update cycle installed the SYS7 update, and that breaks DEBUG: I cannot run an image built /DEBUG… This was first mentioned on ITRC, and the offending patch has been superseded by SYS8; I already have that, but have had no time to install it.
A second thing to be done, then, was an update of the home page program. I don’t want to update texts in the image itself, because that requires the program to be recompiled, linked, tested and copied. Instead, it’s easier to create a text file and have it read by the program.
It was rather simple, a very primitive (not yet perfect) scheme: create a file named “dd-mmm-yyyy_A_Title_You_want_to_be_displayed”, fill it with the text you want (including hyperlinks and other HTML stuff), define a logical (/SYSTEM) that refers to this file, and that’s it. The program reads the logical (and fails if it doesn’t exist – testing IS a requirement), extracts the date and title, replacing all underscores with spaces, reads all lines into an array, and outputs them using the reporting tools.
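
The naming scheme is easy to illustrate. The following is a minimal Python sketch of the date and title extraction only, not the actual program. The function name `parse_page_name`, the `strip_ext` option and the example file name are made up for the illustration; the translation of the logical and the file reading are left out.

```python
import os
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}

def parse_page_name(filename, strip_ext=False):
    """Split 'dd-mmm-yyyy_Some_Title' into a date and a readable title.

    strip_ext is a hypothetical option that drops a trailing extension
    such as '.txt' before the title is built.
    """
    if strip_ext:
        filename = os.path.splitext(filename)[0]
    # Everything before the first underscore is the date, the rest the title.
    date_part, _, title_part = filename.partition("_")
    day, month, year = date_part.split("-")
    when = date(int(year), MONTHS[month.lower()], int(day))
    return when, title_part.replace("_", " ")

# Hypothetical file name following the scheme described above.
print(parse_page_name("03-Apr-2008_Webpage_program_getting_on.txt"))
```

Without `strip_ext`, the extension stays part of the title, which is the kind of imperfection a first version of such a scheme tends to have.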

The biggest trouble was reading a file created by EVE. But I was thinking too “error aware”; it wasn’t so troublesome after all.

The result is the current page – including the extension “.txt” in the header – I already said it isn’t perfect 🙂 – finished after midnight.