State of procedures

(Triggered by an article by Jim Duff).

My remark on the issue – a failing disk in a shadow set went unnoticed – is that it is mainly a matter of inappropriate system monitoring. Disks are prone to error and MUST be monitored. At the very least, they must be checked at regular intervals. Well, every 6 months IS regular – but I mean shorter ones. Jim published a procedure that can do a good job, but some other things need to be considered as well.

Backup jobs are a fair example. So are monitoring scripts, and any other procedure that runs automatically and is more or less critical.
It is often assumed these jobs have 2 states: they have run successfully, or they haven’t. In a lot of cases, if not most, the system manager is only notified if something went wrong – and no message usually means the procedure finished successfully. So everyone relies on the absence of a message, assuming the backup has finished correctly – until the moment of truth arrives and a disk needs to be restored – and it is found that the backup never actually existed.
This is No Horror Story. It happens…

In principle, procedures have 3 states:

  • it has run and finished successfully
  • it has run but failed
  • it has not run at all
The first two must be brought to the attention of the system manager – or the responsible operator, if that is appropriate. As soon as possible, preferably – but it largely depends on priority. In some cases, failure of a backup is not that important. In other cases, immediate action is a requirement.
One medium – often used, I guess – is mail, or paging. In case of success, the message does not need a full body. In case of failure, however, more information on the error may be required. In case of e-mail, that could be the logfile, for instance.
The third state is ‘delivered’ by the absence of such a message. No message means Big Trouble.
You can stretch this even further: signal each event separately. But it largely depends on the significance of each step – and the importance of the whole process.
A watchdog program or procedure could add a monitoring facility – but mostly it just adds one more element to the chain. If that one breaks, it breaks monitoring totally, and you, as a system manager, will have no idea what’s gone wrong and right, and you will have to revert to the logfiles and other traces of activity (accounting, audit).
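To make the three states concrete, here is a minimal sketch in Python of how the receiving side could tell them apart. It is not taken from any existing procedure: the names `JobState` and `classify`, the `(timestamp, ok)` report layout and the `max_silence` limit are all assumptions made for illustration only.

```python
from datetime import datetime, timedelta
from enum import Enum


class JobState(Enum):
    SUCCEEDED = "has run and finished successfully"
    FAILED = "has run but failed"
    NOT_RUN = "has not run at all"


def classify(last_report, now, max_silence):
    """Return the state of a job, judged by its most recent report.

    last_report is None (nothing ever received) or a (timestamp, ok) tuple.
    Both the tuple layout and max_silence are illustrative assumptions.
    """
    if last_report is None:
        return JobState.NOT_RUN
    reported_at, ok = last_report
    if now - reported_at > max_silence:
        # An old report counts as silence: the job did not run this time.
        return JobState.NOT_RUN
    return JobState.SUCCEEDED if ok else JobState.FAILED


# Example: a nightly backup that should report at least once every 26 hours.
now = datetime(2006, 11, 11, 8, 0)
last = (datetime(2006, 11, 9, 2, 0), True)   # last success was two nights ago
print(classify(last, now, timedelta(hours=26)).value)
```

The point of the sketch is that a success message from two nights ago is treated the same as no message at all, so silence is never read as good news.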

The same applies to procedures monitoring the system. Again, the feasibility of logging depends on the situation. Scanning disks for failures is good, but you must be sure the job HAS run, and what the outcome has been. Even if no errors were found: send a message that the disks seem to be Ok.

But it depends. Sometimes, a message on each run isn’t feasible. I have a job that runs every 15 minutes or so, scanning the system for the MySQL server, to restart it when it has failed. I DID have an issue with that job, where resubmitting itself silently failed, and when MySQL actually crashed, I had to wait all day before I was able to restart it – days after… It would have been noticed if I had received a message of the failure (Ok, I got it – in the logfile that I should have checked…). This has been altered – and the logfile is now in the operator webspace, like all other logs and utilities.
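For a job like this, one option is to stay quiet on a normal run but never on a failed resubmit. The Python sketch below only illustrates that idea and is not the actual job: `run_cycle` and its four callables (`is_running`, `restart`, `resubmit`, `notify`) are hypothetical names, to be filled in with whatever the system offers.

```python
def run_cycle(is_running, restart, resubmit, notify):
    """One pass of a self-resubmitting watch job.

    All four arguments are callables supplied by the caller; they are
    hypothetical placeholders, not an existing API.
    Returns the list of events that were reported (empty on a quiet run).
    """
    events = []
    try:
        if not is_running():
            restart()
            events.append("restarted")
    except Exception as exc:
        events.append(f"check failed: {exc}")
    try:
        # Resubmit even if the check went wrong, so the chain is not broken.
        resubmit()
    except Exception as exc:
        events.append(f"resubmit failed: {exc}")
    if events:
        # Only speak up when something happened; a normal run stays silent.
        notify("; ".join(events))
    return events
```

This only catches a resubmit that fails with an error the job can see. If the job itself dies, nothing is sent, and only the absence of its regular trace can reveal that.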

A difference in perception?

Hoff thinks it’s funny, I think it’s pathetic:

LIB$SIGNAL, JAVA style

If this is coding, modern style, I pity those that will have to maintain that code in the future. Or is this a way to make yourself irreplaceable, or pump up your income (if you get paid by the characters typed)…

If only Java had a function like LIB$SIGNAL, life would be much easier. If this is humour, it’s just to let us know that our life as VMS programmers is much, much easier.

11-Nov-2006

We’re here!

Starting the SYSMGR blog on www.grootersnet.nl went flawlessly (well, almost) within a minute! The only thing still to do is to import what’s on Blogger – and just link to here. I don’t know yet if I’ll keep that.

Stay tuned. There’s more to come.

Most data will probably be in System’s Logbook, because that’s what I’m talking about.

About

It’s all about what goes on in the datacenter. That is: my datacenter.

The kernel systems: Diana, and, in time, Daphne and Dido, provide all connectivity to the outside world known as The Internet, with Io looking after them. They are by nature hardened enough to withstand almost all Evil that lurks out there – running OpenVMS 8.3. Their task is made lighter by Cerberus, and sometimes Charon – the firewalling routers between The Internet and the Local Area Network, which is inhabited by Aphrodite, Irene and Demeter – all running Windows XP SP2, and Pallas/Athene – which may present herself, based on user preference, as a Windows XP machine, or Suse 10.1 – so Linux.
Then there is Demeter – the company laptop – running Windows XP as well, and her offspring Persephone, a Personal Alpha instance, emulating an AlphaStation 3000 with 96 Mb memory, and, depending on the configuration, 1, 2, 3 or 4 disks up to 4Mb each, of which one can be a CD, or perhaps a memory stick on the USB bus (yeah, that works!), with or without network, and that could be over 1Gb UTP or even 54Mb wireless.