Monthly maintenance on April 2nd showed no surprises, except for the low number of relay attempts:
PMAS statistics for March
Total messages    :   2696 = 100.0 o/o
DNS Blacklisted   :      0 =    .0 o/o (Files:  0)
Relay attempts    :    218 =   8.0 o/o (Files: 31)
Accepted by PMAS  :   2478 =  91.9 o/o (Files: 31)
  Handled by explicit rule
         Rejected :   1741 =  70.2 o/o (processed),  64.5 o/o (all)
         Accepted :    173 =   6.9 o/o (processed),   6.4 o/o (all)
  Handled by content
        Discarded :    422 =  17.0 o/o (processed),  15.6 o/o (all)
     Quarantained :    126 =   5.0 o/o (processed),   4.6 o/o (all)
        Delivered :     16 =    .6 o/o (processed),    .5 o/o (all)
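
The percentages in this report can be checked by hand. Below is a minimal sketch, assuming (from the figures themselves, not from any PMAS documentation) that the report cuts off after one decimal instead of rounding; the function name pct is mine, purely for illustration:

```python
def pct(count: int, total: int) -> float:
    """Percentage of total, truncated (not rounded) to one decimal."""
    # Work in whole tenths of a percent so the cut-off is exact.
    return (count * 1000 // total) / 10


total = 2696      # all messages in March
accepted = 2478   # accepted by PMAS: the base of the "(processed)" column

print(pct(218, total))       # relay attempts, share of all messages
print(pct(1741, accepted))   # rejected by explicit rule, share of processed
print(pct(1741, total))      # rejected by explicit rule, share of all
```

With ordinary rounding the relay attempts would come out as 8.1 and the rejects as 64.6 of all messages, so the report seems to simply drop the extra digits.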

There was just one log of relay attempts significantly larger than the rest: most were either empty or just 4 blocks in size (the minimum, which can hold just a few lines). This one was 56 blocks (28 KB), holding 190 records sent between 21:10:43.65 and 21:14:46.53 from address, with sender “(fake, of course)@grootersnet.nl” trying to reach locotrones1029@gmail.com. Again, the address seems to belong to Hostwinds.com.

Software updates
Triggered by the fact that the secured sites had access issues yesterday (invalid certificate date) and by messages on the WASD mailing list concerning issues with WCME (entry 32 and up), I knew I had to update WCME to renew the certificates. Mark had also posted a mail on a new WASD version (and therefore ALAMODE, the real-time monitor). Checking the site, I found that there was also a newer release of MonDeSi (the real-time web-based system monitor), so I decided to get WCME directly and install and run it.
Checking the system, however, there was no WCME program running – and there should be one: WCME-overseer. The last log file it created was from January this year, and I restarted the system in February, so the process hadn’t run for two months. No problem, since all it does is check the date of the certificates and, if these are about to expire, run the WCME program to create new ones.
And that hadn’t been done, obviously.
To find out why, I executed the line in the server startup procedure that gets it running. But time after time it failed, with no logfile, though I requested one. The next step was to analyze the audit log – and behold:
Security alarm (SECURITY) and security audit (SECURITY) on DIANA, system id: 21505
Auditable event:          Detached process login failure
Event time:                4-APR-2018 19:51:24.72
PID:                      2020FDA4
Process name:             WCME-startup
Username:                 HTTP$NOBODY
Process owner:            [HTTP$NOBODY]
Image name:               $116$DKA100:[SYS0.SYSCOMMON.][SYSEXE]LOGINOUT.EXE
Posix UID:                -2
Posix GID:                -2 (%XFFFFFFFE)
Status:                   %RMS-E-PRV, insufficient privilege or file protection violation

I changed the command line to use SYSTEM – and that caused WCME-overseer to run. Auto-renewal should now happen overnight. This morning, I indeed got the message that certificate renewal had been successful – but the sites still had problems. Reason: the new certificates were not copied to the right spot. I have a procedure to take care of that; it may have run, but it was looking in the wrong location. Changed that, and the certificates are now in the right spot. But I needed to restart the server to make it accept the new certificates: only after an update of the server can this be done without restarting it… Anyway, after that the sites were accessible as they should be.

So now came the update of WASD, OpenSSL and ALAMODE – which went flawlessly as before – except that $ httpd/do=exit didn’t work as expected, as I found out after I started the new version: I still got the old server when accessing the admin pages… I killed it using stop/id; the new server (running, but in ‘starting’ mode) took over directly and was running fine. ALAMODE worked fine as well, once it was properly installed.

The next to be updated was the system monitor MonDeSi – no problems here either, except that, for some reason, a $ httpd/do=restart was required to get the new version running. Probably an issue with caching…

The next update is WordPress – version 4.9.5 – which should cause no problems.

WordPress has been updated to 4.9.5 – as well as Akismet (4.0.3).


Database running on IRIS
Downloaded the latest version of MariaDB from Mark Berryman (5.5.59) which installed flawlessly (as expected) and running the MySQL_Install_db script went fine, until I (again) encountered an error, but now it shows the reason:

180310 18:43:30 [Note] $3$lda2:[000000.mysql055.][bin.ia64]mysqld.exe;1 (mysqld 5.5.59-MariaDB) starting as process 555746351 ...
180310 18:43:30 InnoDB: The InnoDB memory heap is disabled
180310 18:43:30 InnoDB: Mutexes and rw_locks use InnoDB's own implementation
180310 18:43:30 InnoDB: Compressed tables use zlib 1.2.11
180310 18:43:31 InnoDB: Initializing buffer pool, size = 128.0M
180310 18:43:32 InnoDB: Assertion failure in thread 2070851264 in file [freeware.mariadb-5^.5^.59.storage.xtradb.os]os0sync.c;1 line 123
InnoDB: Failing assertion: pthread_cond_init(cond, NULL) == 0
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.5/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
%DEBUGBOOT-W-EXPGFLQUOTA, exceeded pagefile quota

though the user (mysql051_srv) should have enough (2,000,000 blocks). Doubling it made no difference…

Checking sys$system:pagefile.sys showed the reason: it was just 604064 blocks in size. WAY too small, so I had to increase its size to match the requirement – for now, this is 3000000 blocks. Perhaps a second pagefile would have been better, but this was the easy fix. It required a reboot of the system, after which there was no problem at all running the script. At least, up to the final startup of the server: this failed, time after time. So I asked for a logfile to be created (the SUBMIT statement has the “/nolog” option; I changed that to “/log=mysql055_root:[MySQL_server]” to get it there). But: no file created…
Brainwave: take a look into accounting – and behold:

10-MAR-2018 19:17:21 LOGFAIL             MYSQL051_SRV 2140042D          00D3810C

Same for every attempt.
$ ACC/FULL/TYPE=LOGFAIL to get more detail:

Username:          MYSQL051_SRV      UIC:               [MYSQL051,MYSQL051_SRV]
Account:                      Finish time:       10-MAR-2018 19:17:21.34
Process ID:        2140042D          Start time:        10-MAR-2018 19:17:21.29
Owner ID:                            Elapsed time:                0 00:00:00.04
Terminal name:                       Processor time:              0 00:00:00.01
Remote node addr:                    Priority:          4
Remote node name:                    Privilege <31-00>: 00108000
Remote ID:                           Privilege <63-32>: 00000000
Remote full name:
Posix UID:         -2                Posix GID:         -2 (%XFFFFFFFE)
Queue entry:       2                 Final status code: 00D3810C
Queue name:        SYS$BATCH
Job name:          start_mysqld
Final status text: %LOGIN-F-DISUSER, account is disabled
Page faults:              164        Direct IO:                 12
Page fault reads:           4        Buffered IO:               14
Peak working set:        3056        Volumes mounted:            0
Peak page file:        173680        Images executed:            1
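
The final status code in this record already gives the severity before the message text is looked up. Below is a minimal sketch of how a 32-bit OpenVMS condition value is laid out (severity in bits 0-2, message number in bits 3-15, facility number in bits 16-27); the names decode_status and SEVERITY are mine, for illustration only:

```python
# Severity letters as they appear in %FACILITY-S-IDENT message prefixes.
SEVERITY = {0: "W", 1: "S", 2: "E", 3: "I", 4: "F"}


def decode_status(code: int) -> dict:
    """Split an OpenVMS condition value into its main fields."""
    return {
        "severity": SEVERITY.get(code & 0x7, "?"),   # bits 0-2
        "message": (code >> 3) & 0x1FFF,             # bits 3-15
        "facility": (code >> 16) & 0xFFF,            # bits 16-27
    }


# Final status code taken from the accounting record.
print(decode_status(0x00D3810C))
```

The severity comes out as F, which matches the -F- in the final status text.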

Of course. Creating a user requires “/flags=NODISUSER” to activate. And I forgot that one.
mc authorize mod mysql051_srv/flags=nodisuser

and redo startup. And this time:

$ sho sys/proc=mar*
OpenVMS V8.4  on node IRIS   10-MAR-2018 20:00:40.07   Uptime  0 00:50:34
  Pid    Process Name    State  Pri      I/O       CPU       Page flts  Pages
21400434 MariaDB_Server  HIB      6     2914   0 00:00:00.96     14466  15845 M

So now I do have a database running, to be filled: get the backup from Diana (which is an SQL script) and run it (tomorrow – then this entry will be there as well).

Now the database is running, get WASD – 11.2 (latest), which I will install on DIANA as well.


Setting up IRIS
Getting on with setting up IRIS, one of the big Itanium boxes, to hold the same software as DIANA (the DS10) involved installing new software. I started with the MariaDB database (5.5.58) as downloaded (and installed on DIANA). Setting up the software is no problem, but the script to create an empty database failed when creating certificates: the script refers to SSLROOT:, which doesn’t exist, and later on the server program crashes. But I made a mistake in the beginning, so no database was created.
The missing SSLROOT was easily solved, and I restarted the script – now without errors. Creating certificates was no problem – but the server still crashed:

Running MySQL for the first time...
Using Mailbox MBA564:

180304 20:10:41 InnoDB: The InnoDB memory heap is disabled
180304 20:10:41 InnoDB: Mutexes and rw_locks use InnoDB's own implementation
180304 20:10:41 InnoDB: Compressed tables use zlib 1.2.6
180304 20:10:42 InnoDB: Initializing buffer pool, size = 128.0M
180304 20:10:43 InnoDB: Assertion failure in thread 2070851264 in file [freeware.mariadb-5^.5^.25.storage.xtradb.os]os0sync.c;1 li4
InnoDB: Failing assertion: 0 == pthread_mutex_init(fast_mutex, MY_MUTEX_INIT_FAST)
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.5/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
%SYSTEM-F-OPCCUS, opcode reserved to customer fault at PC=FFFFFFFF848DDC60, PS=0000001B
%TRACE-F-TRACEBACK, symbolic stack dump follows
%TRACE-I-END, end of TRACE stack dump
%RMS-E-EOF, end of file detected

There is a new version of the software available (5.5.59), so that may solve these problems. And there is a new version of the WASD webserver as well, to be installed on both AXP and Itanium later this week.

Diana’s fan now has stalled…But it’s pretty cool up here, and the system isn’t very busy, so it won’t be much of a problem for a short time. A new fan has been ordered – waiting to be delivered this week.


Monthly maintenance in itself showed no problems. Mail was as usual:

PMAS statistics for February
Total messages    :   2456 = 100.0 o/o
DNS Blacklisted   :      0 =    .0 o/o (Files:  0)
Relay attempts    :   1280 =  52.1 o/o (Files: 28)
Accepted by PMAS  :   1176 =  47.8 o/o (Files: 28)
  Handled by explicit rule
         Rejected :    842 =  71.5 o/o (processed),  34.2 o/o (all)
         Accepted :    116 =   9.8 o/o (processed),   4.7 o/o (all)
  Handled by content
        Discarded :    161 =  13.6 o/o (processed),   6.5 o/o (all)
     Quarantained :     49 =   4.1 o/o (processed),   1.9 o/o (all)
        Delivered :      8 =    .6 o/o (processed),    .3 o/o (all)

but there has been something strange in the antirelay logfiles: between 10-Feb and today, these are a mixture of PMAS.log content and PMAS_ANTIRELAY.LOG content. All files were therefore over 60 blocks in size. Searching through these files for the antirelay status (571) yielded just two files that actually required inspection: the ones of the 11th and the 25th:

11-FEB-2018 between 21:24:54.19 and 21:28:00.87, from address (190 entries)
25-FEB-2018 between 20:29:42.78 and 25-FEB-2018, from address (146 entries).

In both cases, the sender used a bogus user from this domain, trying to relay to locotrones1029@gmail.com. Once again, hosted by Hostwinds.com, so I’ll warn them again.

Last Itanium system into cluster
The two ‘big’ servers (IRIS and INDRA) had been added to the cluster already, but one (INGE) was still to do.
Contrary to what I initially thought, these two big servers have two single-core processors, without hyperthreading. Being similar in hardware (and from the same supplier), it looks like these are the older ones of the three. That only CPU #1 is added to the set of CPUs is therefore obvious.
INGE, once booted, shows that CPUs #1, 2 and 3 are added, so it is a newer one. However, it is the one causing problems. I had the machine running when I added the other systems, but I couldn’t add this one because there was an issue with the SCSI controller: the machine didn’t boot when the system disk was in slot 0 (PUN0, LUN 0) – where it was when I installed VMS on it. I moved it to slot #2 (PUN2, LUN 0), where I could boot it by entering EFI, accessing this disk directly and starting FS0:\EFI\VMS\VMS_LOADER. But in the EFI boot menu, it was still residing on PUN0 LUN0. So I need to do it interactively.

However, trying to access the ILO yesterday failed: the ILO was non-responsive to PING, TELNET and HTTP on the designated address. So today it was a matter of finding out the cause. Thinking it might be the battery failing (though the supplier said he had replaced it), I took the ILO out, put in a new battery, and re-installed it. Since DHCP is enabled by default, I could figure out what address it would have – but on DIANA, $ DHCPDBDUMP didn’t show any address that worked. $ TCPIP SHO HOST gave me a hint, since the name of the management port is MP<macaddress>. Now I could access the MP, set the configuration to what it should be, and work from there. I booted the machine from bay #2 – and that worked. Next, I shut down the machine to find out what message was given if the disk was in slot #0 – where it should be. This time, there was nothing wrong; perhaps because, when the ILO was out, I could check the connector and may have moved it a bit… So I added the machine to the cluster. Now being able to access the disk containing the new licenses, this machine is set to work until 12-Mar-2019 – like all others.
Hopefully, this will keep working – fingers crossed.
The next stage is setting this machine up (as well as INDRA) like DIANA and IRIS, and setting up a quorum node on a laptop running a 64-bit version of Windows (so not the current one).

Fan bearings gone
The main fan of the DS10 server (DIANA) is getting very loud – sounds like the bearings are gone. A new one has been ordered, but it is uncertain when it will arrive…


Cluster completed
One big problem in creating the cluster was the existence of the quorum disk, which neither Itanium system can access directly. Plus, on one of these nodes, I set EXPECTED_VOTES to 3. Even without specifying the quorum disk, that system waited, and waited… to join the cluster, where the other one (with EXPECTED_VOTES set to 1) would happily join.

Probably not the right thing to do – but now all systems have NO quorum disk and EXPECTED_VOTES of 1 (each contributing one vote). Now I got my cluster after rebooting all systems:

View of Cluster from system ID 21505  node: DIANA                                                              24-FEB-2018 20:28:18
│        SYSTEMS        │ MEMBERS │
│ DIANA  │ VMS V8.4     │ MEMBER  │
│ IRIS   │ VMS V8.4     │ MEMBER  │
│ INDRA  │ VMS V8.4     │ MEMBER  │

Similar on the other nodes.
Now Daphne (the small Alpha system) and Inge (to be renamed after another (female) deity, as all other systems are) need to be updated for the removed quorum disk. Plus I’ll set up a FreeAXP process on the console laptop to function as quorum node.

Startup procedures for IRIS are the same as on Diana – except (for now) that TCPIP is started in the main procedure instead of the batch one. I will put that back where I defined it. The FTP issue has also been solved: login failed because SYLOGIN.COM was not accessible (owned by [SYSTEM], but with protection (W:) – so no world access). Changed that to (W:RE) and it’s OK now… So TCPIP will probably work as well when started in the batch procedure – as will SSH (which didn’t start either).
Should have thought about that; this is the one file that I changed in between….

Startup of INDRA will be copied from IRIS (and changed accordingly).

Fun part of reboot of Diana: it speeds up the blogs ….

Found out after publishing this post the first time: SHUTDOWN of both Itaniums caused loss of quorum on Diana, so I had to restart one of the Itaniums and invoke SHUTDOWN1 to add the option “REMOVE_NODE”. It stops the server, but allows Diana to continue – since quorum is now adjusted. This should be added to the SHUTDOWN command on all systems – except Diana – to allow that system to continue, whatever happens to the other servers. Or adjust the votes for the Itanium boxes to 0.
Or both.
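
The loss of quorum follows from simple arithmetic. Below is a minimal sketch, assuming the usual OpenVMS rule that quorum is (EXPECTED_VOTES + 2) / 2 with the fraction dropped, and assuming the running cluster was counting with the three votes actually present; the function names are mine, for illustration only:

```python
def quorum(expected_votes: int) -> int:
    """Quorum as derived from EXPECTED_VOTES (integer division)."""
    return (expected_votes + 2) // 2


def keeps_running(votes_present: int, expected_votes: int) -> bool:
    """A node (or set of nodes) continues only while the votes present reach quorum."""
    return votes_present >= quorum(expected_votes)


# Three members with one vote each: quorum is 2.
print(quorum(3))

# Diana alone after both Itaniums were shut down without REMOVE_NODE:
# one vote present, but quorum is still 2, so Diana hangs.
print(keeps_running(1, 3))

# With quorum adjusted on shutdown (or with the Itaniums at 0 votes),
# only Diana's single vote counts and it is enough.
print(keeps_running(1, 1))
```

Either fix works by making sure that Diana’s own vote is enough to meet quorum once the other nodes are gone.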


Re-installation – a second time
Made a mistake in the re-installation: I set the system up as a cluster member but entered the wrong data (cluster password), so Iris didn’t start – it hung on joining the cluster, obviously. So I re-installed VMS again from scratch (INIT), now without clustering to begin with, and configured TCPIP. Now the problems with FTP were gone.
Next, I copied the saved general directory that contains all startup files (amongst other things) and the system-specific files that call these local procedures, making sure I covered all that is installed (and bypassing everything that isn’t yet), and rebooted. It didn’t work, because the queue manager needed to be defined and started, and queues defined. Once that was done, the reboot went fine – as expected – except that, once again, FTP was said to be started in the log, but the server process (TCPIP$FTP_1) was still missing. So I moved the startup of TCPIP from the local procedure (started on queue SYS$STARTUP) to SYSTARTUP_VMS.COM (which is started by the STARTUP process), and now FTP (and the other services that failed to start) do run.

But in the process I encountered something weird.
Previously, on startup, the EFI startup procedure would add CPUs 1, 2 and 3 to join the pool. CPU #0 is the startup processor (Monarch). Now, it’s just CPU #1 that is added. Is that the second core on the first CPU – or the first on the second CPU? The EFI shell command cpuconfig shows me 2 CPUs running and active – and when trying to enable hyperthreading, the system responds that the CPUs don’t support it. So one CPU must be down… Something to dive into…

Anyway: I can now access the system using FTP, so I can move all the software that is needed onto the system and re-install compilers and development environments.

NASTY!!!! MariaDB keeps causing problems…Lost connection again…


TCPIP trouble on IRIS
Starting TCPIP services like FTP and SMTP keeps failing on IRIS. The log file states that the procedure for the captive account could not be accessed due to lack of privilege. There is no apparent reason for this: the user entry in SYSUAF is OK, as is the protection of the file, including the full path. I compared it to the data on Diana; it’s the same except for a few system files on disk: these are owned by [1,1] (aka [SYSTEM]) on Diana but by [1,1] on IRIS – which may cause a problem? I set all files to match the ownership on Diana, but that won’t work for INDEXF.SYS, since the file is locked as soon as the disk is mounted…
So the only option left is to re-install OpenVMS on an INITIALIZEd disk – meaning I will have to redo all installations. To avoid redoing all the work, I made an image backup of the system disk first, so I can restore the files I changed instead of redoing the edits.


IRIS set up
I want to configure Iris (the big Itanium server) the same way as Diana – the Alpha system. Apart from the obvious differences because of architecture, and a minor change in naming the centralized location (to be moved to a shared disk once I get my external storage (anticipated end of May)) it should be the same.
So I copied the startup environment completely and adapted SYSTARTUP_VMS.COM to do just the basic stuff and run the rest in batch. Of course the files have been changed to fit what’s currently available. However, though submitted to the queue, the batch job never started. I found out that SYSUAF and RIGHTSLIST – both authorization files – were the wrong ones: on Diana, these are kept in the shared environment, and the definitions in SYLOGICALS.COM refer to that environment. That caused problems starting the procedure – and a lot more problems: the location is renamed on IRIS, and the drive the account resides on is wrong in this SYSUAF…
After I changed all references to that location (renamed on IRIS, but the files should not be used here anyway!) restart did start the batch procedure. But still not everything runs fine: FTP for instance will not start, there is something wrong with access to the login file. Same for SMTP, NTP and BIND…
Removed the services, and the users and identifiers, will probably have to re-install TCPIP – completely…
Anyway: most files are now available; if need be, I could pass them via Diana, since IRIS boots into the cluster and the shared-SCSI disks are accessible via MSCP – even the boot drive of Diana.

To be continued 🙂