15-Sep-2022

Issues with Dione
Just before we left for a short stay abroad, Dione stalled for some reason, and this was only noticed hours later. The system did not respond to any input, not even ^P (to return to the console). So the only fast way out was to cycle power (via RMS) and reboot. This worked, but shortly afterwards the system stalled again. So I started Diana, which was still missing a number of required updates to its configuration files, but at least e-mail worked properly and the websites were accessible. The only problem found was that the blogs didn’t work, but only this blog had been updated and not yet copied; on the “trips, tracks and travels” site most of the images could be seen, but the blog was not accessible either. Some investigation showed that the PHP environment was missing data, but the repair had to wait until the end of our trip.

During the trip, the LetsEncrypt certificates on Diana were updated.

The first thing after our return was to repair the problem in the PHP directory: copying the environment from a logical disk to real hardware had created a [.000000] directory containing the files that should reside in the root of the environment. The issue was solved by moving these files one level up and deleting 000000.dir;. Now the blogs worked again.

Other things that needed to be updated on Diana required Dione to be started, but it stalled during boot. So I booted it minimal, to check what was going on. No problem at all! Next, using DFU, I checked the operator.log files to see if they contained any reason, and behold: though DIR showed operator.log;7035 (the previous one), it could not be found using TYPE. So I checked all disks with DFU VERIFY /Directory and /FIX, which found and fixed a number of inconsistencies, but the file was still there. Next came DFU/REPAIR, which again found a number of inconsistencies, but the file was now gone. I checked the other disks as well (using /REBUILD), reset Startup_P1 to “” and rebooted. Now the system started immediately.

The next action concerned the PMAS configuration files, where I store my own lists of accepted, quarantined, discarded and rejected messages. These had been updated just before the trip and so needed to be moved to Diana as well. I left the files that Process sends as they were, because I had no way of knowing which ones are really new. Next I compiled the files, so these are fine as well.

Finally, I copied the latest MySQL backup from Dione to Diana to bring this blog up to date and to add this entry.

To finalize the action, I had to submit the batch jobs. The PMAS jobs were started and running once Diana was up and e-mail activity was activated, but the maintenance jobs were missing, so I added them. There may still be a number of issues with them, to be solved tomorrow.

On Dione, I stopped most jobs but left the PMAS jobs running: there is no problem with that, and it will keep the PMAS files up to date. Since Diana now holds the cluster IP address (to which all Internet traffic is forwarded), there is no problem with both systems running active-active.

Update (19-Sep-2022)
I put the issue with Dione on the VMS-SIG mailing list and asked for possible causes. It seems that if one of the disks in the RAID configuration starts to fail, writes to the RAID set are stalled, effectively blocking all activity until the write finally succeeds. However, the controller should remove the failing disk from the set. It can also be caused by a failing accelerator option or failing cache batteries. So I checked Diana first:
$ mc msa$util
MSA> set contr pkc0
MSA> sho cont

Adapter: _PKC0: (DEFAULT)
Smart Array 5300 (c) COMPAQ P2313ABFFOA1F6 Software 2.86
SCSI_VERSION = X3.131:1994 (SCSI-2)
Supported Redundancy Mode: Sym Active/Active Asym Active/Active
Current Redundancy mode: Not Available/Not currently redundant
Current Role: Standby
Cache:
45 megabyte read cache 179 megabyte write cache
Cache is enabled and Cache is GOOD.
No unflushed data in cache.
Battery:
Battery is fully charged.
Controller Mode:
Controller is in HBA Mode.
MSA> sho disk

Parallel SCSI device [Disk]
Disk 0: SCSI bus 0 id 0 size 279.40 [300.00] GB
Disk 000, # 0, size 52420095 blocks, (25.00 [26.84] GB), Unit 0.
Disk 000, # 1, size 52420095 blocks, (25.00 [26.84] GB), Unit 1.
Disk 000, # 2, size 52420095 blocks, (25.00 [26.84] GB), Unit 2.
Disk 000, # 3, size 428662395 blocks, (204.40 [219.48] GB), Unit 3.
Disk 000, # 4, size 13732 blocks, (6.71 [7.03] MB), Unused.

Parallel SCSI device [Disk]
Disk 1: SCSI bus 0 id 1 size 279.40 [300.00] GB
Disk 001, # 0, size 52420095 blocks, (25.00 [26.84] GB), Unit 0.
Disk 001, # 1, size 52420095 blocks, (25.00 [26.84] GB), Unit 1.
Disk 001, # 2, size 52420095 blocks, (25.00 [26.84] GB), Unit 2.
Disk 001, # 3, size 428662395 blocks, (204.40 [219.48] GB), Unit 3.
Disk 001, # 4, size 13732 blocks, (6.71 [7.03] MB), Unused.
MSA> sho unit

Unit 0:
In PDLA mode, Unit 0 is Lun 0.
Cache status : enabled
Max Boot Partition: Unknown
Volume status : VOLUME OK
2 Data Disk(s) used by lun 0:
Disk 0: Partition 0; (SCSI bus 0, SCSI id 0)
Disk 1: Partition 0; (SCSI bus 0, SCSI id 1)
Spare physical drives:
No spare drives are designated.
Logical Volume Raid Level: RAID 1. Mirroring
stripe_size=128.0KB
Logical Volume Capacity : 25.00 [26.84] GB

Unit 1:
In PDLA mode, Unit 1 is Lun 1.
Cache status : enabled
Max Boot Partition: Unknown
Volume status : VOLUME OK
2 Data Disk(s) used by lun 1:
Disk 0: Partition 1; (SCSI bus 0, SCSI id 0)
Disk 1: Partition 1; (SCSI bus 0, SCSI id 1)
Spare physical drives:
No spare drives are designated.
Logical Volume Raid Level: RAID 1. Mirroring
stripe_size=128.0KB
Logical Volume Capacity : 25.00 [26.84] GB

Unit 2:
In PDLA mode, Unit 2 is Lun 2.
Cache status : enabled
Max Boot Partition: Unknown
Volume status : VOLUME OK
2 Data Disk(s) used by lun 2:
Disk 0: Partition 2; (SCSI bus 0, SCSI id 0)
Disk 1: Partition 2; (SCSI bus 0, SCSI id 1)
Spare physical drives:
No spare drives are designated.
Logical Volume Raid Level: RAID 1. Mirroring
stripe_size=128.0KB
Logical Volume Capacity : 25.00 [26.84] GB

Unit 3:
In PDLA mode, Unit 3 is Lun 3.
Cache status : enabled
Max Boot Partition: Unknown
Volume status : VOLUME OK
2 Data Disk(s) used by lun 3:
Disk 0: Partition 3; (SCSI bus 0, SCSI id 0)
Disk 1: Partition 3; (SCSI bus 0, SCSI id 1)
Spare physical drives:
No spare drives are designated.
Logical Volume Raid Level: RAID 1. Mirroring
stripe_size=128.0KB
Logical Volume Capacity : 204.40 [219.48] GB
MSA>

After Dione was started, I did the same on that one:

$ mc msa$util
MSA> set contr pkc0
MSA> sho cont

Adapter: _PKC0: (DEFAULT)
Smart Array 5300 (c) COMPAQ P2313ABFF0683X Software 3.56
SCSI_VERSION = X3.131:1994 (SCSI-2)
Supported Redundancy Mode:
Not currently Redundant
Current Role: Active
Cache:
45 megabyte read cache 179 megabyte write cache
Cache is enabled and Cache is GOOD.
No unflushed data in cache.
Battery:
Battery is fully charged.
Controller Mode:
Controller is in RAID Mode.
MSA> sho disk

Parallel SCSI device [Disk]
Disk 100: SCSI bus 1 id 0 size 279.40 [300.00] GB
Disk 100, # 0, size 104856255 blocks, (50.00 [53.69] GB), Unit 0.
Disk 100, # 1, size 104856255 blocks, (50.00 [53.69] GB), Unit 1.
Disk 100, # 2, size 104856255 blocks, (50.00 [53.69] GB), Unit 2.
Disk 100, # 3, size 271353915 blocks, (129.39 [138.93] GB), Unit 3.
Disk 100, # 4, size 13732 blocks, (6.71 [7.03] MB), Unused.

Parallel SCSI device [Disk]
Disk 101: SCSI bus 1 id 1 size 279.40 [300.00] GB
Disk 101, # 0, size 104856255 blocks, (50.00 [53.69] GB), Unit 0.
Disk 101, # 1, size 104856255 blocks, (50.00 [53.69] GB), Unit 1.
Disk 101, # 2, size 104856255 blocks, (50.00 [53.69] GB), Unit 2.
Disk 101, # 3, size 271353915 blocks, (129.39 [138.93] GB), Unit 3.
Disk 101, # 4, size 13732 blocks, (6.71 [7.03] MB), Unused.
MSA> sho unit

Unit 0:
In PDLA mode, Unit 0 is Lun 0.
Cache status : enabled
Max Boot Partition: Unknown
Volume status : VOLUME OK
2 Data Disk(s) used by lun 0:
Disk 100: Partition 0; (SCSI bus 1, SCSI id 0)
Disk 101: Partition 0; (SCSI bus 1, SCSI id 1)
Spare physical drives:
No spare drives are designated.
Logical Volume Raid Level: RAID 1. Mirroring
stripe_size=128.0KB
Logical Volume Capacity : 50.00 [53.69] GB

Unit 1:
In PDLA mode, Unit 1 is Lun 1.
Cache status : enabled
Max Boot Partition: Unknown
Volume status : VOLUME OK
2 Data Disk(s) used by lun 1:
Disk 100: Partition 1; (SCSI bus 1, SCSI id 0)
Disk 101: Partition 1; (SCSI bus 1, SCSI id 1)
Spare physical drives:
No spare drives are designated.
Logical Volume Raid Level: RAID 1. Mirroring
stripe_size=128.0KB
Logical Volume Capacity : 50.00 [53.69] GB

Unit 2:
In PDLA mode, Unit 2 is Lun 2.
Cache status : enabled
Max Boot Partition: Unknown
Volume status : VOLUME OK
2 Data Disk(s) used by lun 2:
Disk 100: Partition 2; (SCSI bus 1, SCSI id 0)
Disk 101: Partition 2; (SCSI bus 1, SCSI id 1)
Spare physical drives:
No spare drives are designated.
Logical Volume Raid Level: RAID 1. Mirroring
stripe_size=128.0KB
Logical Volume Capacity : 50.00 [53.69] GB

Unit 3:
In PDLA mode, Unit 3 is Lun 3.
Cache status : enabled
Max Boot Partition: Unknown
Volume status : VOLUME OK
2 Data Disk(s) used by lun 3:
Disk 100: Partition 3; (SCSI bus 1, SCSI id 0)
Disk 101: Partition 3; (SCSI bus 1, SCSI id 1)
Spare physical drives:
No spare drives are designated.
Logical Volume Raid Level: RAID 1. Mirroring
stripe_size=128.0KB
Logical Volume Capacity : 129.39 [138.93] GB
MSA>
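
As a side note on reading these listings: each size is shown twice, for example 25.00 [26.84] GB. The small sketch below reproduces both figures from the block counts, assuming 512-byte disk blocks (an assumption on my part, though it matches the numbers shown). The first figure then turns out to be in binary units (2^30 bytes), the bracketed one in decimal units (10^9 bytes). The function name block_sizes is just made up for this illustration.

```python
BLOCK_BYTES = 512  # assumption: one disk block is 512 bytes


def block_sizes(blocks):
    """Return (binary GB, decimal GB) for a block count, rounded to 2 digits."""
    total_bytes = blocks * BLOCK_BYTES
    return round(total_bytes / 2**30, 2), round(total_bytes / 10**9, 2)


# Block counts copied from the "sho disk" listings of both systems
units = [
    ("Diana unit 0", 52420095),
    ("Diana unit 3", 428662395),
    ("Dione unit 0", 104856255),
    ("Dione unit 3", 271353915),
]

for label, blocks in units:
    binary_gb, decimal_gb = block_sizes(blocks)
    print(f"{label}: {binary_gb:.2f} [{decimal_gb:.2f}] GB")
```

So the two numbers describe the same capacity; neither points at missing or lost space.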

Apart from the unit sizes (Dione was set up this way when I got it; on Diana I defined the sizes myself), the major difference is that on Diana the controller is in HBA mode, whereas on Dione the controller is in RAID mode. And there is no indication of a failing or failed disk. Or could the RAID mode on Dione cause the problem? Something to ask on VMS-SIG.
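
Comparing two long listings by eye is error-prone, so a quick script can help. This is only a rough sketch, not something I ran on the systems themselves: it takes a few lines copied from both "sho cont" outputs above and reports the lines that appear in only one of them. The name differing_lines and the choice of excerpt are my own.

```python
def differing_lines(first, second):
    """Return (only_in_first, only_in_second) as lists of non-blank lines."""
    lines_a = [ln.strip() for ln in first.splitlines() if ln.strip()]
    lines_b = [ln.strip() for ln in second.splitlines() if ln.strip()]
    only_a = [ln for ln in lines_a if ln not in lines_b]
    only_b = [ln for ln in lines_b if ln not in lines_a]
    return only_a, only_b


# Excerpts copied from the "sho cont" output of each system
DIANA = """\
Current Role: Standby
Cache is enabled and Cache is GOOD.
No unflushed data in cache.
Battery is fully charged.
Controller is in HBA Mode.
"""

DIONE = """\
Current Role: Active
Cache is enabled and Cache is GOOD.
No unflushed data in cache.
Battery is fully charged.
Controller is in RAID Mode.
"""

only_diana, only_dione = differing_lines(DIANA, DIONE)
print("Only on Diana:", only_diana)
print("Only on Dione:", only_dione)
```

For these excerpts it lists the controller mode and the current role as the differing lines, while the cache and battery lines are identical on both systems.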