Iris setup continued
The startup procedures have been adapted, but a few things were still wrong. Since there was no real database yet (I had created one but not loaded any data into it), shutdown would fail because the login information, which uses a password, was wrong: there were no passwords yet. WASD wouldn’t start because of a typo in one of the startup procedures, and the PHP environment had to be set up, and adapted, as well.
All these things have now been done. The database can be started (and stopped). WASD starts up, with one more thing to check: LIBZ must still be located and installed. PHP (7.0, to begin with) has been installed and is available; all required logicals are now set.
Next, I copied the MySQL backup from Diana, which is actually an SQL file. I sourced it into the database (I have to revert the changes in startup to hold the original login information, which has been restored…), installed the latest version of WordPress (4.9.5), and copied the security information to Iris as well.

Next stage: change the startup procedure, install the blogs and tracking images from Diana, and the WASD configuration files (change them so I can access the blogs). I have to set things up in the router as well to address the right servers.


Maintenance jobs
No surprises in the monthly run. Mail is OK:

PMAS statistics for April
Total messages    :   2932 = 100.0 o/o
DNS Blacklisted   :      0 =    .0 o/o (Files:  0)
Relay attempts    :    410 =  13.9 o/o (Files: 30)
Accepted by PMAS  :   2522 =  86.0 o/o (Files: 30)
  Handled by explicit rule
         Rejected :   1916 =  75.9 o/o (processed),  65.3 o/o (all)
         Accepted :    140 =   5.5 o/o (processed),   4.7 o/o (all)
  Handled by content
        Discarded :    274 =  10.8 o/o (processed),   9.3 o/o (all)
     Quarantained :    163 =   6.4 o/o (processed),   5.5 o/o (all)
        Delivered :     29 =   1.1 o/o (processed),    .9 o/o (all)
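
As a cross-check, the percentages in the report can be recomputed from the raw counts. The sketch below uses the April figures; the function name is mine, and the assumption that the report truncates to one decimal instead of rounding is only inferred from the numbers (410 of 2932 is 13.98%, shown as 13.9).

```python
def pct_truncated(part, whole):
    """Percentage with one decimal place, truncated instead of rounded."""
    return (part * 1000 // whole) / 10

# April counts, copied from the report
total = 2932
relay = 410
accepted = 2522
by_rule = {"Rejected": 1916, "Accepted": 140}
by_content = {"Discarded": 274, "Quarantined": 163, "Delivered": 29}

# The parts should add up to the totals
assert relay + accepted == total
assert sum(by_rule.values()) + sum(by_content.values()) == accepted

for name, count in {**by_rule, **by_content}.items():
    print(f"{name:>12}: {pct_truncated(count, accepted):5.1f} (processed), "
          f"{pct_truncated(count, total):5.1f} (all)")
```

Both sums hold, so the "processed" column is relative to the 2522 accepted messages and the "all" column to the 2932 total.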

Even the number of relay attempts is low. Only the 2nd and 26th had about 190 attempts; both log files are 56 blocks in size.
Again, a dummy user within this account was used, trying to reach a gmail address:
02-Apr-2018 189 attempts between 06:28:36.69 and 06:32:07.00, address, to 1029mandaditos@gmail.com
26-Apr-2018 191 attempts between 17:11:35.46 and 17:15:35.00, address, to 1029mandaditos@gmail.com
The address is hosted at Hostwinds.com – again….

Webserver update gone wrong
The webserver was updated on 05-Apr-2018, but it was only last Monday that I noticed something was very wrong: none of the images in the Tracks environment was accessible. Every request ran into a 404 error (File not found). It started when the old version (which failed to react to the exit command…) kept the newly installed version from starting. Nothing was wrong at that moment, until I forced the old server to stop: from one moment to the next, all access to these files was gone. You can see it in the access log:
- - [05/Apr/2018:10:45:57 +0000] "GET /Tracks/donauradweg2/11-05/slides/38-Pavement.html HTTP/1.1" 200 3652
- - [05/Apr/2018:10:46:45 +0000] "GET /Tracks/donauradweg2/18-05/CastleHill/55-TheFront.jpg HTTP/1.1" 200 1111326
- - [05/Apr/2018:10:47:01 +0000] "GET /Tracks/USA2004/31-Jul/slides/04-BlueUoachitaRange.html HTTP/1.1" 200 4382
- - [05/Apr/2018:10:47:09 +0000] "GET /tracks/italy2009/26-05-SwissWalk/slides/23-HangingGarden.html HTTP/1.1" 200 3638
- - [05/Apr/2018:10:47:22 +0000] "GET /Tracks/Scotland2007/10-Jun/slides/08-Quai-orig.html HTTP/1.1" 200 3870
- - [05/Apr/2018:10:47:38 +0000] "GET /Tracks/GroeneHart/Woerden_IJsselstein/slides/10-MontfoortEntrance.html HTTP/1.1" 200 4289
- - [05/Apr/2018:10:48:36 +0000] "GET /tracks/rheinsteig-2/03-09/slides/02-Tree.html HTTP/1.1" 200 4725
- - [05/Apr/2018:10:49:26 +0000] "GET /tracks/havezatenpad/dalfsen-zwolle/31-Stadion_jpg_orig.html HTTP/1.1" 200 1283

Here (at 10:49:42, according to the server log) I killed WASD with STOP/ID, and the new version started up. The next lines are from this new server:
- - [05/Apr/2018:10:49:46 +0000] "GET /Tracks/Scotland2007/08-Jun/slides/06-LochTulla.html HTTP/1.1" 404 921
- - [05/Apr/2018:10:49:53 +0000] "GET /Tracks/USA2004/23-Jul/09-GoingRound/slides/04-Tower-orig.html HTTP/1.1" 404 921
- - [05/Apr/2018:10:50:41 +0000] "GET /Tracks/USA2004/21-Jul/slides/08-KimLooksOut.html HTTP/1.1" 404 921
- - [05/Apr/2018:10:52:09 +0000] "GET /tracks/rheinsteig-2/01-09/slides/13-StateBorder-orig.html HTTP/1.1" 404 921
- - [05/Apr/2018:10:52:22 +0000] "GET /tracks/italy2009/24-05-MainLand/slides/41-AlongTheCanal.html HTTP/1.1" 404 921
- - [05/Apr/2018:10:52:54 +0000] "GET /Tracks/Havezatenpad/Oldenzaal-Drienerlo/slides/06-WayUp.html HTTP/1.1" 404 921
- - [05/Apr/2018:10:53:50 +0000] "GET /tracks/lahnsteig/0410/slides/17-Fruit.html HTTP/1.1" 404 921
- - [05/Apr/2018:10:54:17 +0000] "GET /Tracks/Scotland2007/13-Jun/JedburghAbbey/slides/24-Artwork.html HTTP/1.1" 404 921
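
The moment of the switch can also be pulled out of the access log mechanically, which would have flagged the problem weeks earlier. This is a minimal sketch, assuming the log lines keep the layout shown above; the function names and the /tracks/ prefix check are my own choices for illustration, and the two sample lines are copied from the log.

```python
import re
from datetime import datetime

# Timestamp, request line, status and size as they appear in the access log
LINE_RE = re.compile(
    r'\[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+)'
)

def parse(line):
    """Return (timestamp, path, status) or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    when = datetime.strptime(m["ts"], "%d/%b/%Y:%H:%M:%S %z")
    return when, m["path"], int(m["status"])

def first_failure(lines, prefix="/tracks/"):
    """Timestamp of the first 404 under the given path prefix (case-insensitive)."""
    for line in lines:
        rec = parse(line)
        if rec is None:
            continue
        when, path, status = rec
        if path.lower().startswith(prefix) and status == 404:
            return when
    return None

sample = [
    '- - [05/Apr/2018:10:49:26 +0000] "GET /tracks/havezatenpad/dalfsen-zwolle/31-Stadion_jpg_orig.html HTTP/1.1" 200 1283',
    '- - [05/Apr/2018:10:49:46 +0000] "GET /Tracks/Scotland2007/08-Jun/slides/06-LochTulla.html HTTP/1.1" 404 921',
]
print(first_failure(sample))
```

The prefix comparison is done in lower case because the log shows both /Tracks/ and /tracks/ in use.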

I notified Mark Daniel and he located the bug within days (he’s on vacation…) and sent out a fix. I applied it this morning and restarted the new version. The problem is now solved.

However, he found something in the IP traffic that causes delay: an RTT of about 300 ms, probably caused by the fact that about 50% of the packets are dropped.

What causes this: router, system??? Something to be investigated…


WCME issue
Since WCME has been updated, all seems well except that renewal of the certificate for the main site (this one) fails day after day. Reason might be that https: is not yet enabled, or that things have not been set up yet. Or not properly. However, this has a somewhat lower priority – but needs to be addressed…

By the way: I _think_ I found a reason for losing the connection now and then. There is a line in the WASD mapping that disallows access for processes that have over 10 connections. That is valid, but it should not be applied to my own address, since the server may use that for its own processing. I couldn’t connect to the normal site, got “too many connections” in the WATCH output, and the server returned the (intended) 503 error….
Changed that to be applicable for any address – except my own. It seems to influence the blog performance as well.
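
The logic of the changed rule, in outline. This is a conceptual Python sketch only, not WASD mapping syntax; the function name, the dictionary of open connections and the example addresses (taken from the documentation ranges) are all hypothetical.

```python
def connection_allowed(client, open_connections, limit=10, exempt=()):
    """Accept a client unless it has over `limit` connections; exempt addresses always pass."""
    if client in exempt:
        return True
    return open_connections.get(client, 0) <= limit

# Hypothetical state: one busy address, one quiet one
open_now = {"192.0.2.1": 11, "192.0.2.2": 3}

print(connection_allowed("192.0.2.1", open_now))                        # refused
print(connection_allowed("192.0.2.1", open_now, exempt=("192.0.2.1",)))  # let through
```

In the server, the refused case is what shows up as the 503 response.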

Second, I started working on the procedures and programs to scan the log files and relate the data in terms of time and source. For file transfer and web, the logs of these services have the data in them. For mail, I have to add router data in order to find out who is trying to pass mail. Such mail will be discarded by the spam filter, but the signalling of incoming mail does not show the originator, so I need both PMAS and router logs to find the culprits.
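
One way to relate the two logs is to match on time: for every mail event, look up which source addresses the router saw within a few seconds. The sketch below assumes both logs have already been parsed into timestamps (and, for the router, a source address); the record layout, the two-second window and the addresses are hypothetical, not the real PMAS or router formats.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime, timedelta

def correlate(mail_times, router_events, window_s=2):
    """Map each mail timestamp to the router source addresses seen within the window."""
    router_events = sorted(router_events)           # (timestamp, source) tuples
    times = [t for t, _ in router_events]
    delta = timedelta(seconds=window_s)
    result = {}
    for t in mail_times:
        lo = bisect_left(times, t - delta)
        hi = bisect_right(times, t + delta)
        result[t] = [src for _, src in router_events[lo:hi]]
    return result

# Hypothetical data: one relay attempt and two router entries
mail = [datetime(2018, 4, 2, 6, 28, 36)]
router = [
    (datetime(2018, 4, 2, 6, 28, 35), "192.0.2.10"),
    (datetime(2018, 4, 2, 6, 40, 0), "198.51.100.7"),
]
print(correlate(mail, router))
```

Sorting once and using bisect keeps the lookup cheap even for a month of router data; the window has to be wide enough to absorb clock differences between the router and the server.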


Monthly maintenance on April 2nd shows no surprises, except the low number of relay attempts:
PMAS statistics for March
Total messages    :   2696 = 100.0 o/o
DNS Blacklisted   :      0 =    .0 o/o (Files:  0)
Relay attempts    :    218 =   8.0 o/o (Files: 31)
Accepted by PMAS  :   2478 =  91.9 o/o (Files: 31)
  Handled by explicit rule
         Rejected :   1741 =  70.2 o/o (processed),  64.5 o/o (all)
         Accepted :    173 =   6.9 o/o (processed),   6.4 o/o (all)
  Handled by content
        Discarded :    422 =  17.0 o/o (processed),  15.6 o/o (all)
     Quarantained :    126 =   5.0 o/o (processed),   4.6 o/o (all)
        Delivered :     16 =    .6 o/o (processed),    .5 o/o (all)

There was just one log of relay attempts significantly larger than the rest. Most were either empty or just 4 blocks in size (the minimum, which could hold just a few lines). This one was 56 blocks (28 KB) holding 190 records, sent from 21:10:43.65 to 21:14:46.53 from address, sender “(fake, of course)@grootersnet.nl”, trying to reach locotrones1029@gmail.com. Again, the address seems to belong to Hostwinds.com.
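
The arithmetic behind those numbers, as a small sketch. It relies on the VMS disk block being 512 bytes; the variable names are mine.

```python
from datetime import datetime

BLOCK_BYTES = 512                      # size of one OpenVMS disk block

size_bytes = 56 * BLOCK_BYTES          # size of the large antirelay log
size_kb = size_bytes / 1024

start = datetime.strptime("21:10:43.65", "%H:%M:%S.%f")
end = datetime.strptime("21:14:46.53", "%H:%M:%S.%f")
duration_s = (end - start).total_seconds()

records = 190
print(size_bytes, "bytes =", size_kb, "KB")
print("bytes per record:", size_bytes / records)
print("attempts per second:", records / duration_s)
```

So the burst lasted just over four minutes, at a little under one attempt per second, with roughly 150 bytes of log per attempt.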

Software updates
Triggered by the fact that the secured sites had access issues yesterday (invalid certificate date) and messages on the WASD mailing list concerning issues with WCME (entry 32 and up) I knew I had to update WCME to renew the certificates. Mark had also posted a mail on a new WASD version (and therefore, ALAMODE, the real-time monitor); Checking the site I found that there was also a newer release of MonDesi (the real-time web-based system monitor) so I decided to get WCME directly and install and run it.
Checking the system, however, there was no WCME program running, and there should be one: WCME-overseer. The last log file it created was from January this year, and I restarted the system in February, so the process hadn’t run for two months. No problem in itself, since all it does is check the date of the certificates and, if these are about to expire, run the WCME program to create new ones.
And that hadn’t been done, obviously.
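
The kind of check the overseer performs can be pictured like this. It is a conceptual sketch, not WCME's actual code: the function name, the 30-day margin and the dates are all hypothetical.

```python
from datetime import date, timedelta

def needs_renewal(not_after, today, margin_days=30):
    """True if the certificate has expired or expires within the margin."""
    return not_after - today <= timedelta(days=margin_days)

today = date(2018, 4, 4)
print(needs_renewal(date(2018, 4, 20), today))   # expires soon
print(needs_renewal(date(2018, 7, 1), today))    # still fine
print(needs_renewal(date(2018, 4, 3), today))    # already expired
```

With no overseer process running, a check like this is simply never made, which is why the certificates ran out unnoticed.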
To find out why, I executed the line in the server-startup procedure to get it running. But time after time it failed; No logfile, though I requested one. Next step is to analyze the audit log – and behold:
Security alarm (SECURITY) and security audit (SECURITY) on DIANA, system id: 21505
Auditable event:          Detached process login failure
Event time:                4-APR-2018 19:51:24.72
PID:                      2020FDA4
Process name:             WCME-startup
Username:                 HTTP$NOBODY
Process owner:            [HTTP$NOBODY]
Image name:               $116$DKA100:[SYS0.SYSCOMMON.][SYSEXE]LOGINOUT.EXE
Posix UID:                -2
Posix GID:                -2 (%XFFFFFFFE)
Status:                   %RMS-E-PRV, insufficient privilege or file protection violation

I changed the command line to use SYSTEM, and that caused WCME-overseer to run. Auto-renewal should now happen overnight. This morning, I indeed got the message that certificate renewal had been successful, but the sites still had problems. Reason: the new certificates were not copied to the right spot. I have a procedure to take care of that; it may have run but was looking at the wrong location. Changed that, and the certificates are now in the right spot. But I needed to restart the server to make it accept the new certificates: only after an update of the server could this be done without restarting it…. Anyway, after that the sites were accessible as they should be.

So now came the update of WASD, OpenSSL and ALAMODE, which went flawlessly as before, except that $ httpd/do=exit didn’t work as expected, as I found out after I started the new version: I still got the old server when accessing the admin pages…. Killed it using stop/id; the new server (running, but in ‘starting’ mode) took over directly and was running fine. Alamode worked fine as well, once it was properly installed.

The next to be updated was the system monitor MonDeSi. No problems here either, except that, for some reason, a $ httpd/do=restart was required to get the new version running. Probably an issue with caching…

The next update is WordPress – version 4.9.5 – which should cause no problems.

WordPress has been updated to 4.9.5 – as well as Akismet (4.0.3).


Database running on IRIS
Downloaded the latest version of MariaDB from Mark Berryman (5.5.59) which installed flawlessly (as expected) and running the MySQL_Install_db script went fine, until I (again) encountered an error, but now it shows the reason:

180310 18:43:30 [Note] $3$lda2:[000000.mysql055.][bin.ia64]mysqld.exe;1 (mysqld 5.5.59-MariaDB) starting as process 555746351 ...
180310 18:43:30 InnoDB: The InnoDB memory heap is disabled
180310 18:43:30 InnoDB: Mutexes and rw_locks use InnoDB's own implementation
180310 18:43:30 InnoDB: Compressed tables use zlib 1.2.11
180310 18:43:31 InnoDB: Initializing buffer pool, size = 128.0M
180310 18:43:32 InnoDB: Assertion failure in thread 2070851264 in file [freeware.mariadb-5^.5^.59.storage.xtradb.os]os0sync.c;1 line 123
InnoDB: Failing assertion: pthread_cond_init(cond, NULL) == 0
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.5/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
%DEBUGBOOT-W-EXPGFLQUOTA, exceeded pagefile quota

though the user (mysql051_srv) should have enough (2.000.000) blocks. Doubling it made no difference…

Checking sys$system:pagefile.sys showed the reason: it was just 604064 blocks in size. WAY too small, so I had to increase its size to match the requirement; for now, this is 3000000. Perhaps a second pagefile would have been better, but this was the easy fix. It required a reboot of the system, after which there was no problem at all running the script. At least, up to the final startup of the server: this failed, time after time. So I added a logfile to be created (the SUBMIT statement has the “/nolog” option; I changed that to “/log=mysql055_root:[MySQL_server]” to get it there). But: no file created….
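
As a side note on the page file figures above, here they are converted to megabytes. This is a small sketch that assumes the user's quota is counted in the same 512-byte units as the blocks of the page file; the function name is mine.

```python
BLOCK_BYTES = 512                      # one disk block; the quota unit is assumed to match

def blocks_to_mib(blocks):
    """Convert a count of 512-byte blocks to MiB."""
    return blocks * BLOCK_BYTES / (1024 * 1024)

pagefile_old = 604064                  # original size of pagefile.sys
user_quota = 2000000                   # quota of the database user
pagefile_new = 3000000                 # size after the fix

for label, blocks in [("old page file", pagefile_old),
                      ("user quota", user_quota),
                      ("new page file", pagefile_new)]:
    print(f"{label:>14}: {blocks_to_mib(blocks):8.1f} MiB")
```

The quota allowed more than three times what the old page file could physically supply, which fits the observation that doubling the quota changed nothing while enlarging the file did.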
Mind-wave: take a look into accounting – and behold:

10-MAR-2018 19:17:21 LOGFAIL             MYSQL051_SRV 2140042D          00D3810C

Same for every attempt.
$ ACC/FULL/TYPE=LOGFAIL to get more detail:

Username:          MYSQL051_SRV      UIC:               [MYSQL051,MYSQL051_SRV]
Account:                      Finish time:       10-MAR-2018 19:17:21.34
Process ID:        2140042D          Start time:        10-MAR-2018 19:17:21.29
Owner ID:                            Elapsed time:                0 00:00:00.04
Terminal name:                       Processor time:              0 00:00:00.01
Remote node addr:                    Priority:          4
Remote node name:                    Privilege <31-00>: 00108000
Remote ID:                           Privilege <63-32>: 00000000
Remote full name:
Posix UID:         -2                Posix GID:         -2 (%XFFFFFFFE)
Queue entry:       2                 Final status code: 00D3810C
Queue name:        SYS$BATCH
Job name:          start_mysqld
Final status text: %LOGIN-F-DISUSER, account is disabled
Page faults:              164        Direct IO:                 12
Page fault reads:           4        Buffered IO:               14
Peak working set:        3056        Volumes mounted:            0
Peak page file:        173680        Images executed:            1

Of course. Creating a user requires “/flags=NODISUSER” to activate. And I forgot that one.
mc authorize mod mysql051_srv/flags=nodisuser

and redo startup. And this time:

$ sho sys/proc=mar*
OpenVMS V8.4  on node IRIS   10-MAR-2018 20:00:40.07   Uptime  0 00:50:34
  Pid    Process Name    State  Pri      I/O       CPU       Page flts  Pages
21400434 MariaDB_Server  HIB      6     2914   0 00:00:00.96     14466  15845 M

So now I do have a database running, to be filled: get the backup from Diana (which is an SQL script) and run it (tomorrow – then this entry will be there as well).

Now the database is running, get WASD – 11.2 (latest), which I will install on DIANA as well.


Setting up IRIS
Getting on with setting up IRIS, one of the big Itanium boxes, to hold the same software as DIANA (the DS10) involved new software to be installed. I started with the MariaDB database (5.5.58) as downloaded (and installed on DIANA). Setting up the software is no problem, but starting the script to create an empty database failed when creating certificates: the script refers to SSLROOT:, which doesn’t exist, and later on the server program crashes. But I made a mistake in the beginning, so no database was created.
Missing SSLROOT is easily solved:
and I restarted the script – now without errors. Creating certificates was no problem – but the server still crashed:

Running MySQL for the first time...
Using Mailbox MBA564:

180304 20:10:41 InnoDB: The InnoDB memory heap is disabled
180304 20:10:41 InnoDB: Mutexes and rw_locks use InnoDB's own implementation
180304 20:10:41 InnoDB: Compressed tables use zlib 1.2.6
180304 20:10:42 InnoDB: Initializing buffer pool, size = 128.0M
180304 20:10:43 InnoDB: Assertion failure in thread 2070851264 in file [freeware.mariadb-5^.5^.25.storage.xtradb.os]os0sync.c;1 li4
InnoDB: Failing assertion: 0 == pthread_mutex_init(fast_mutex, MY_MUTEX_INIT_FAST)
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.5/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
%SYSTEM-F-OPCCUS, opcode reserved to customer fault at PC=FFFFFFFF848DDC60, PS=0000001B
%TRACE-F-TRACEBACK, symbolic stack dump follows
%TRACE-I-END, end of TRACE stack dump
%RMS-E-EOF, end of file detected

There is a new version of the software available (5.5.59), so that may solve these problems. And there is a new version of the WASD webserver as well, to be installed on both AXP and Itanium later this week.

Diana’s fan now has stalled…But it’s pretty cool up here, and the system isn’t very busy, so it won’t be much of a problem for a short time. A new fan has been ordered – waiting to be delivered this week.


Monthly maintenance in itself showed no problems. Mail was as usual:

PMAS statistics for February
Total messages    :   2456 = 100.0 o/o
DNS Blacklisted   :      0 =    .0 o/o (Files:  0)
Relay attempts    :   1280 =  52.1 o/o (Files: 28)
Accepted by PMAS  :   1176 =  47.8 o/o (Files: 28)
  Handled by explicit rule
         Rejected :    842 =  71.5 o/o (processed),  34.2 o/o (all)
         Accepted :    116 =   9.8 o/o (processed),   4.7 o/o (all)
  Handled by content
        Discarded :    161 =  13.6 o/o (processed),   6.5 o/o (all)
     Quarantained :     49 =   4.1 o/o (processed),   1.9 o/o (all)
        Delivered :      8 =    .6 o/o (processed),    .3 o/o (all)

but there has been something strange in the antirelay logfiles: between 10-Feb and today, these are a mixture of PMAS.log content and PMAS_ANTIRELAY.LOG. All files were therefore over 60 blocks in size. Searching through these files for the antirelay status (571) yielded just two files that actually required inspection, those of the 11th and 25th:

11-FEB-2018 between 21:24:54.19 and 21:28:00.87, from address (190 entries)
25-FEB-2018 between 20:29:42.78 and 25-FEB-2018, from address (146 entries).

In both cases, a bogus user from this domain was used, trying to relay to locotrones1029@gmail.com. Once again, hosted by Hostwinds.com; so I’ll warn them again.

Last Itanium system into cluster
The two ‘big’ servers (IRIS and INDRA) are added to the cluster already, but one (INGE) was still to do.
Contrary to what I initially thought, these two big servers have two single-core processors, without hyperthreading. Being similar in hardware (and from the same supplier), it looks like these are the older ones of the three. That only CPU #1 is added to the set of CPUs is therefore obvious.
INGE, once booted, shows that CPUs #1, 2 and 3 are added, so it is a newer one. However, it is the one causing problems. I had the machine running when I added the other systems, but I couldn’t add this one: earlier, there was an issue with the SCSI controller, so the machine didn’t boot when the system disk was in slot 0 (PUN0, LUN 0), where it was when I installed VMS on it. I moved it to slot #2 (PUN2, LUN 0), where I could boot it by accessing EFI, selecting this disk directly and starting FS0:\EFI\VMS\VMS_LOADER. But in the EFI boot menu, it was still residing on PUN0 LUN0. So I need to do it interactively.

However, trying to access the ILO yesterday failed: the ILO was non-responsive to PING, TELNET and HTTP on the designated address. So today it was a matter of finding out the cause. Thinking it might be the battery failing (though the supplier said he had replaced it), I took the ILO out, put in a new battery, and re-installed it. Since DHCP is enabled by default, I should have been able to figure out what address it would have, but on DIANA, $ DHCPDBDUMP didn’t show any address that worked. $ TCPIP SHO HOST gave me a hint, since the name of the management port is MP<macaddress>. Now I could access the MP, set the configuration to what it should be, and work from there. Booted the machine from bay #2, and that worked. Next, I shut down the machine to find out what message was given if the disk was in slot #0, where it should be. This time, there was nothing wrong; perhaps because, when the ILO was out, I could check the connector and may have moved it a bit… So I added the machine to the cluster. Now being able to access the disk containing the new licenses, this machine is set to work until 12-Mar-2019, like all others.
Hopefully, this will keep working – fingers crossed.
Next stage is setting this machine up (as well as INDRA) as DIANA and IRIS, and a quorum node on a laptop running a 64-bit version of Windows (so not the current one).

Fan bearings gone
The main fan of the DS10 server (DIANA) is getting very loud – it sounds like the bearings are gone. A new one has been ordered but it is uncertain when it will arrive….


Cluster completed
One big problem in creating the cluster was the existence of the quorum disk, which both Itanium systems cannot access directly. Plus, on one of these nodes, I set EXPECTED_VOTES to 3. Even without specifying the quorum disk, that system waited, and waited… to join the cluster, whereas the other one (with EXPECTED_VOTES set to 1) would happily join.

Probably not the right thing to do, but now all systems have NO quorum disk and EXPECTED_VOTES of 1 (each contributing one vote). After rebooting all systems, I got my cluster:

View of Cluster from system ID 21505  node: DIANA                                                              24-FEB-2018 20:28:18
│        SYSTEMS        │ MEMBERS │
│ DIANA  │ VMS V8.4     │ MEMBER  │
│ IRIS   │ VMS V8.4     │ MEMBER  │
│ INDRA  │ VMS V8.4     │ MEMBER  │

Similar on the other nodes.
Now Daphne (the small Alpha system) and Inge (to be renamed after another (female) deity, as all other systems are) need to be updated for the removed quorum disk. Plus I’ll set up a FreeAXP process on the console laptop to function as quorum node.

Startup procedures for IRIS are the same as on Diana, except (for now) that TCPIP is started in the main procedure instead of the batch one. I will put that back where I defined it. The FTP issue has also been solved: because SYLOGIN.COM was not accessible (owned by [SYSTEM], but with protection (W:), so no access), login failed. Changed that to (W:RE) and it’s OK now… So it will probably work as well when started in the batch procedure, as will SSH (which didn’t start either).
Should have thought about that; this is the one file that I changed in between….

Startup of INDRA will be copied from IRIS (and changed accordingly).

Fun part of reboot of Diana: it speeds up the blogs ….

Found out after publishing this post the first time: SHUTDOWN of both Itaniums caused loss of quorum on Diana, so I had to restart one of the Itaniums, invoke SHUTDOWN1 to add option “REMOVE_NODE”. It stops the server, but allows Diana to continue – since now quorum is adjusted. This should be added to the SHUTDOWN command on all systems – except Diana – to allow that system to continue, whatever happens to the other servers. Or adjust votes for the Itanium boxes to 0..
Or both.
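
Why Diana hung can be worked out from the cluster quorum formula, quorum = (EXPECTED_VOTES + 2) / 2 with the fraction dropped. The sketch below assumes that, once all three members have joined, the cluster works with the sum of the votes present (3) even though each node has EXPECTED_VOTES set to 1; the function names are mine.

```python
def quorum(expected_votes):
    """OpenVMS cluster quorum: (EXPECTED_VOTES + 2) / 2, truncated."""
    return (expected_votes + 2) // 2

def has_quorum(votes_present, expected_votes):
    return votes_present >= quorum(expected_votes)

# Three members with one vote each
print(quorum(3))             # votes needed to keep running

# Both Itaniums shut down WITHOUT REMOVE_NODE: quorum stays as it was
print(has_quorum(1, 3))      # Diana alone is short of votes and hangs

# With REMOVE_NODE, quorum is recalculated as each node leaves
print(has_quorum(2, 2))      # after the first Itanium has left
print(has_quorum(1, 1))      # after the second: Diana continues

# Alternative: Itaniums contribute 0 votes, Diana holds the only vote
print(has_quorum(1, 1))
```

Either fix leaves Diana with enough votes on its own; giving the Itaniums 0 votes also covers the case where they crash instead of shutting down cleanly.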


Re-installation – a second time
Made a mistake in re-installation: set the system up as a cluster member, added the wrong data (cluster password) so Iris didn’t start – hung on joining the cluster – obviously. So I re-installed VMS again from scratch (INIT), now without clustering to begin with, and configured TCPIP. Now the problems with FTP were gone.
Next, I copied the saved general directory that contains all startup files (amongst other things) and the system-specific files that call these local procedures, making sure I covered all that is installed (and bypassing everything that isn’t yet), and rebooted. It didn’t work, because the queue manager needed to be defined and started, and queues defined. Once that was done, reboot went fine, as expected, except that, once again, FTP was said to be started in the log, but the server process (TCPIP$FTP_1) was still missing. So I moved startup of TCPIP from the local procedure (started on queue SYS$STARTUP) to SYSTARTUP_VMS.COM (which is started by the STARTUP process), and now FTP (and the other services that failed to start) do run.

But in the process I encountered something weird.
Previously, on startup, the EFI startup procedure would add CPUs 1, 2 and 3 to join the pool. CPU #0 is the startup processor (Monarch). Now, it’s just CPU #1 that is added. Is that the second core on the first CPU, or the first on the second CPU? The EFI shell command cpuconfig shows me 2 CPUs running and active, and when trying to enable hyperthreading, the system responds that the CPUs don’t support it. So one CPU must be down… Something to dive into…

Anyway: I can now access the system using FTP, so I can move all software that is needed onto the system and re-install compilers and development environments.

NASTY!!!! MariaDB keeps causing problems…Lost connection again…