30-Dec-2006

Updates
have been applied to both the OS and the forum software. Diana (and with it, potentially every machine that will boot from this common system disk) now runs OpenVMS version 8.3. PHPBB2 has been updated to version 2.0.22, as suggested on the administration panel. That wasn’t as smooth as expected: you need to follow the manual to the letter, and for each of the forums. For this site it meant a double update, since the databases are separate and these updates require some access to the database (I didn’t look into what exactly has been updated, but that it was required was evident).
But, as expected, the OpenVMS update was a piece of cake.
DNS problems
It looks as if DNS is not properly re-initialised after a system crash. At least, I found that TCPIP SHOW HOST gave some “zone transfer error” messages that could only be removed by deleting some journal files (*.DB_JNL) and giving the DNS data files a new version entry. A normal shutdown of the system works, but a crash is quite another story.
I added a new zone, Grootersnet.local, to be used instead of the currently used intra.grootersnet.nl (no problem showing this: it’s a 192.168.*.* network, so it cannot be accessed, except by Diana, and I simply trust that OpenVMS box ;)), but I kept the old zone for the moment. I’m thinking of setting up a list of CNAME entries in a grootersnet.nl zone, all referring to www.grootersnet.nl, which will reside with my ISP, so the basic reference will be correct and beyond my control (the IP address is owned by my ISP, it was just assigned to me). But that’s still in the pipeline.
MySQL games
MySQL seems to play funny tricks as well in case of a crash: each time, it seems to lose which presentation model had been chosen for this blog, and reverts to the default skin. I’m not at all sure whether a normal shutdown would work properly: shutting down MySQL in a non-interactive manner is somewhat, let’s call it, “obscure”. The fun part, however, is that PHPBB2 has no such trouble at all. So it might be a WordPress issue after all.

26-Dec-2006

Boot Ok
The ID of the card was all right after all: it was set to 5 before. But it did not solve the problem of the error.
The full output:

>>>b dkb100 -flags 1,0
(boot dkb100.1.0.11.0 -flags 1,0)
block 0 of dkb100.1.0.11.0 is a valid boot block
reading 1168 blocks from dkb100.1.0.11.0
bootstrap code read in
base = 1f2000, image_start = 0, image_bytes = 92000
initializing HWRPB at 2000
initializing page table at 1e4000
initializing machine state
setting affinity to the primary CPU
jumping to bootstrap code
%APB-I-FILENOTLOC, Unable to locate SYSBOOT.EXE
%APB-I-LOADFAIL, Failed to load secondary bootstrap, status = 00000910

halted CPU 0

halt code = 5
HALT instruction executed
PC = 20003d94
warning -- HWRPB is invalid.
>>>

I searched for error code 910:
$ write sys$output f$message(%x910)
%SYSTEM-W-NOSUCHFILE, no such file
$

This is weird!
So I decided to take a look in AXP082:[SYS1], the root used, and compared it to AXP082:[SYS0], the one from which Diana boots. Apart from its obvious place ([VMS$COMMON]), SYSBOOT.EXE can be found in [SYS0.SYSCOMMON] and in [SYS1.syscommon]. I always thought case didn’t matter, but it seems it does!
Next, I simply renamed the directory to its uppercase equivalent and behold:

b dkb100 -flags 1,0
waiting for pkb0.6.0.11.0 to poll...
waiting for pkb0.6.0.11.0 to poll...
waiting for pkb0.6.0.11.0 to poll...
(boot dkb100.1.0.11.0 -flags 1,0)
block 0 of dkb100.1.0.11.0 is a valid boot block
reading 1168 blocks from dkb100.1.0.11.0
bootstrap code read in
base = 1f2000, image_start = 0, image_bytes = 92000
initializing HWRPB at 2000
initializing page table at 1e4000
initializing machine state
setting affinity to the primary CPU
jumping to bootstrap code
%SYSBOOT-W-NOERLDUMP, Unable to locate SYS$ERRLOG.DMP
%SYSBOOT-W-NODUMP, unable to locate system dump file

OpenVMS (TM) Alpha Operating System, Version V8.2
© Copyright 1976-2005 Hewlett-Packard Development Company, L.P.

%DECnet-I-LOADED, network base image loaded, version = 05.12.00

%DECnet-W-NOOPEN, could not open SYS$SYSROOT:[SYSEXE]NET$CONFIG.DAT

but then trouble begins.

%SYSINIT-I- waiting to form or join an OpenVMS Cluster
%VMScluster-I-LOADSECDB, loading the cluster security database
%SYSTEM-I-MOUNTVER, $116$DKA100: (DIDO PKB) is offline. Mount verification in progress.

%SYSTEM-I-MOUNTVER, $116$DKA100: (DIDO PKB) has completed mount verification.

%EWA0, FastFD mode set by console
%EWA0, Link state: UP
%CNXMAN, Sending VMScluster membership request to system DIANA
%CNXMAN, Sending VMScluster membership request to system DIANA
%SYSTEM-I-MOUNTVER, $116$DKA100: (DIDO PKB) is offline. Mount verification in progress.

%SYSTEM-I-MOUNTVER, $116$DKA100: (DIDO PKB) has completed mount verification.

%MSCPLOAD-I-CONFIGSCAN, enabled automatic disk serving
%CNXMAN, Sending VMScluster membership request to system DIANA
...
%CNXMAN, Have connection to system DIANA
%CNXMAN, Sending VMScluster membership request to system DIANA
...
and this continues, and continues. Still, Diana does show that a new member is coming in:

+-------------------------------+
|      SYSTEMS      |  MEMBERS  |
|-------------------+-----------|
|  NODE  | SOFTWARE |  STATUS   |
|--------+----------+-----------|
| DIANA  | VMS V8.2 | MEMBER    |
| DIDO   | VMS V8.2 | NEW       |
+-------------------------------+

So the blunt way: I removed DIDO as a cluster member, which removes the whole SYS$SYSROOT tree for that node as well. After that, I re-added the node, which creates a [SYS1] directory, including the case error, and then waits for the new member to boot.
After having corrected the case error, I booted the new server, with the same result: it gets a connection to Diana, but no data seems to be received. So it will never get into the cluster….
I booted it once more, but locally. It then forms a cluster of its own and makes not even an attempt to connect to Diana, but the system and all disks, even the one on the shared SCSI bus, are fully accessible.
It will just fail to form a cluster with Diana!
Is DECnet the part to blame??
SRM updated
The SRM has been updated to the latest version (7.0) but that did not solve the problems.
VMS 8.3
had been downloaded some time ago, but today I finished creating CDs so I can install it on whichever system. The new AlphaServer400, perhaps?

26-Dec-2006

All quiet on the mail and login front

No remarks in operator.log, no newly banned accesses in accounting.dat… Just the ever-growing amount of unsolicited mail that gets blocked, and the ever-growing number of spoofed (or hacked) addresses that cannot be filtered out without scanning the subject and content of the messages.
Phishing
Two phishing attempts, both claiming to come from eBay. But examining the code showed that some links refer to an encoded (and therefore suspicious) address.
Webserver abuse
The web server seems once again to be a nice challenge for someone:

211.239.241.23 - - [15/Dec/2006:19:31:15 +0100] "GET / HTTP/1.0" 200 3147
211.239.241.23 - - [15/Dec/2006:19:31:22 +0100] "OPTIONS / HTTP/1.0" 200 172
211.239.241.23 - - [15/Dec/2006:19:31:23 +0100] "OPTIONS /" 501 694
211.239.241.23 - - [15/Dec/2006:19:31:23 +0100] "- -" 0 0
211.239.241.23 - - [15/Dec/2006:19:31:29 +0100] "GET /nice%20ports%2C/Tri%6Eity.txt%2ebak HTTP/1.0" 404 864
211.239.241.23 - - [15/Dec/2006:19:31:29 +0100] "- -" 400 870
211.239.241.23 - - [15/Dec/2006:19:31:30 +0100] "- -" 0 0
211.239.241.23 - - [15/Dec/2006:19:31:35 +0100] "- -" 0 0
211.239.241.23 - - [15/Dec/2006:19:31:41 +0100] "- -" 0 0
211.239.241.23 - - [15/Dec/2006:19:31:46 +0100] "HELP -" 400 870
211.239.241.23 - - [15/Dec/2006:19:31:47 +0100] "- -" 0 0
211.239.241.23 - - [15/Dec/2006:19:31:52 +0100] "- -" 0 0
211.239.241.23 - - [15/Dec/2006:19:31:57 +0100] "default -" 400 870
211.239.241.23 - - [15/Dec/2006:19:31:58 +0100] "- -" 0 0
211.239.241.23 - - [15/Dec/2006:19:32:03 +0100] "- -" 0 0
211.239.241.23 - - [15/Dec/2006:19:32:08 +0100] "- -" 0 0
211.239.241.23 - - [15/Dec/2006:19:32:14 +0100] "- -" 0 0
211.239.241.23 - - [15/Dec/2006:19:32:19 +0100] "- -" 0 0
211.239.241.23 - - [15/Dec/2006:19:32:24 +0100] "< NTP/1.0" 501 694
211.239.241.23 - - [15/Dec/2006:19:32:25 +0100] "- -" 0 0
211.239.241.23 - - [15/Dec/2006:19:32:30 +0100] "- -" 0 0
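
To make the odd request in this log readable, the entries can be parsed and the percent-encoding undone. This is just a sketch in Python using a handful of the lines above; the regular expression assumes the common log format shown here, and the names are made up for illustration:

```python
import re
from collections import Counter
from urllib.parse import unquote

# A few entries copied from the access log above.
LOG_LINES = [
    '211.239.241.23 - - [15/Dec/2006:19:31:15 +0100] "GET / HTTP/1.0" 200 3147',
    '211.239.241.23 - - [15/Dec/2006:19:31:29 +0100] '
    '"GET /nice%20ports%2C/Tri%6Eity.txt%2ebak HTTP/1.0" 404 864',
    '211.239.241.23 - - [15/Dec/2006:19:31:46 +0100] "HELP -" 400 870',
    '211.239.241.23 - - [15/Dec/2006:19:31:47 +0100] "- -" 0 0',
]

# host, two unused fields, [timestamp], "request", status, size
PATTERN = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d+) (\d+)$')


def parse(line):
    """Return (host, timestamp, decoded request, status, size) or None."""
    match = PATTERN.match(line)
    if match is None:
        return None
    host, when, request, status, size = match.groups()
    return host, when, unquote(request), int(status), int(size)


records = [parse(line) for line in LOG_LINES]
status_counts = Counter(rec[3] for rec in records if rec is not None)
decoded_probe = records[1][2]

print(decoded_probe)
print(status_counts)
```

Decoded, the request reads "GET /nice ports,/Trinity.txt.bak HTTP/1.0". As far as I know, that path is what Nmap's service-version detection sends, so this looks more like an automated scan than a targeted attack.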

According to WHOIS this is a Korean address, but there seems to be no domain connected to it. DIG won't find it either:
$ dig -x 211.239.241.23

; <<>> DiG 9.2.1 <<>> -x 211.239.241.23
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 33783
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;23.241.239.211.in-addr.arpa. IN PTR

;; Query time: 2715 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Tue Dec 26 11:32:53 2006
;; MSG SIZE rcvd: 45

$

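SERVFAIL? No, this is NOT an error on Diana: digging the address of www.hp.com, for instance, does give a valid answer. A SERVFAIL means the resolver got no usable answer from the name servers responsible for this reverse zone (a plainly missing record would give NXDOMAIN), so the address most likely has no working reverse (PTR) registration in any DNS.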
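
For reference, the name that dig -x actually asks for is just the address with its octets reversed under in-addr.arpa. A minimal sketch in Python (standard library only) that derives it, so it can be compared with the QUESTION SECTION above:

```python
import ipaddress

# The address seen in the web server log.
addr = ipaddress.ip_address("211.239.241.23")

# reverse_pointer gives the PTR owner name used for reverse lookups.
ptr_name = addr.reverse_pointer
print(ptr_name)
```

This only constructs the query name; it does not contact any name server.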

25-Dec-2006 (2)

Oops

Just had a thought: the SCSI ID of the newly installed KZPBA should be “6”, or “5”, or whatever, except “7”. That one is in use by Diana! It _should_ be easy to change in SRM: set PKB0_ID (or whatever the variable is named) to 6.

25-Dec-2006

Installing a new server

proved to be more challenging than expected.

Yesterday, I put the AlphaStation400 in place of the two AlphaStation 200s and booted it from its local system disk. It did form a cluster with Diana when booted from its local disk while it stood near the workbench, but this time it remained a cluster on its own. That is: the machine does request to form a cluster with Diana, but Diana does not respond, or the response isn’t received or handled:
$
%%%%%%%%%%% OPCOM 24-DEC-2006 20:02:28.51 %%%%%%%%%%%
20:02:28.50 Node DIANA (csid 00010001) received VMScluster membership request from node DIDO

$
%%%%%%%%%%% OPCOM 24-DEC-2006 20:02:28.51 %%%%%%%%%%%
20:02:28.51 Node DIANA (csid 00010001) proposed addition of node DIDO

$
%%%%%%%%%%% OPCOM 24-DEC-2006 20:02:28.51 %%%%%%%%%%%
20:02:28.51 Node DIANA (csid 00010001) completed VMScluster state transition

It might have been the hub in between causing trouble, but the problem persisted even when connected directly to the switch. I had removed the two single-ended SCSI cards (NEC8100 – Symbiont brand) and the DE450 NIC, and put in a KZPBA-CY differential SCSI card (required to access the shared SCSI bus) and a DE500 NIC, because that would allow 100Mb full duplex. Setting the latter to just 10Mb, which is the limit for the hub, did not help; even half duplex was no solution. The problem continued…

But it did boot, though it could not access the disks on the shared SCSI bus without causing havoc. Quite obvious, since to be able to do so I first had to do some work on the basic configuration: allow access to the shared SCSI bus by adding its PKB slot in SYS$SYSTEM:SYS$CONFIG.DAT with the same allocation class as Diana (116), and set the DEVICE_NAMING parameter to 1 in the system configuration. After that, I could indeed access the disks over the shared SCSI bus and copy whatever I needed, causing Diana to spit out messages like:

%%%%%%%%%%% OPCOM 25-DEC-2006 21:26:03.96 %%%%%%%%%%%
Device $116$DKA100: (DIANA PKB, DIDO) is offline.
Mount verification is in progress.

%%%%%%%%%%% OPCOM 25-DEC-2006 21:26:03.97 %%%%%%%%%%%
Mount verification has completed for device $116$DKA100: (DIANA PKB, DIDO)

and these occur on the new server as well.

At some point it got even worse: DKA100 (its system disk) was reported to hold the wrong volume, stated on the terminal screen (and not(!) in the operator log), causing the session, and the machine, to hang. Even stopping the AS400 didn’t help anymore at that point.

It required Diana to be rebooted, or reset, because access to the system was now impossible. It happened once or twice….

But once AS400/DIDO was properly set up, and the addition of DIDO as a cluster member re-initiated on Diana, the procedure on Diana asked for DIDO to be booted. But this time, there is a VERY SEVERE problem:
%APB-I-FILENOTLOC, Unable to locate SYSBOOT.EXE
%APB-I-LOADFAIL, Failed to load secondary bootstrap, status = 00000910

halted CPU 0

halt code = 5
HALT instruction executed
PC = 20003d94
warning -- HWRPB is invalid.

Though I doubt it is the problem, I reckon an update of the SRM console software (currently 6.9) is no big deal. So I got the latest available (7.0), as well as the latest for the AlphaServer 1000 (at least, I expected to retrieve it, but the site or the file is no longer available) and the AlphaServer 2100 (due to enter the datacentre on Thursday).