(Triggered by an article by Jim Duff).
My remark on the issue – a failing disk in a shadow set went unnoticed – is that it is mainly a matter of inappropriate system monitoring. Disks are prone to error and MUST be monitored, or at least checked at regular intervals. Well, every 6 months IS regular – but I mean shorter intervals. Jim published a procedure that can do a good job, but some other things need to be considered as well.
A fair example is backup jobs. So are monitoring scripts, and any other procedure that runs automatically and is more or less critical.
It is often assumed these jobs have 2 states: they have run successfully, or they haven’t. In a lot of cases, if not most, the system manager is only notified if something went wrong – and no message usually means the procedure finished successfully. So everyone relies on the absence of a message, assuming the backup has finished correctly – until the moment of truth arrives, a disk needs to be restored, and it is found that the backup never actually existed.
This is No Horror Story. It happens…
In principle, a procedure has 3 states:
it has run and finished successfully
it has run but failed
it has not run at all
The first two must be brought to the attention of the system manager – or the responsible operator, if that is appropriate. As soon as possible, preferably – but it largely depends on priority. In some cases, failure of a backup is not that important. In other cases, immediate action is a requirement.
One medium – often used, I guess – is mail, or paging. In case of success, the message does not need a full body. In case of failure, however, more information on the error may be required. With e-mail, that could be the logfile, for instance.
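As an illustration of the short-on-success, detailed-on-failure message, here is a minimal sketch in Python using the standard `email` package. The function name `build_notification` and the subject layout are assumptions made for this example, and actually sending the message (sender, recipient, mail server) is left out.

```python
from email.message import EmailMessage
from pathlib import Path
from typing import Optional


def build_notification(job: str, success: bool,
                       logfile: Optional[Path] = None) -> EmailMessage:
    """Build a status mail for one job run (hypothetical helper)."""
    msg = EmailMessage()
    msg["Subject"] = f"{job}: {'OK' if success else 'FAILED'}"
    if success:
        # Success only needs a one-line body.
        msg.set_content(f"{job} finished successfully.")
    else:
        # Failure carries the evidence: attach the logfile if there is one.
        msg.set_content(f"{job} failed. The logfile is attached if available.")
        if logfile is not None and logfile.exists():
            msg.add_attachment(logfile.read_text(errors="replace"),
                               filename=logfile.name)
    return msg
```

The point of sending a message in both cases is that the subject line alone tells the operator which of the first two states applies.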
The third state is ‘delivered’ by absence of such a message. No message means Big Trouble.
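The receiving side can make the third state explicit by treating a missing or outdated report as a result of its own. A minimal sketch in Python, assuming the last message of each job is kept as a timestamp plus a success flag; the names `Report` and `classify` and the 24-hour window are illustrative assumptions, not part of any existing tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Report:
    """Last message received from a job (hypothetical record layout)."""
    received_at: datetime
    success: bool


def classify(report: Optional[Report], now: datetime,
             max_age: timedelta = timedelta(hours=24)) -> str:
    """Return 'succeeded', 'failed' or 'not run' for one job."""
    # No report at all, or one older than the expected window,
    # means the job did not run: the third state.
    if report is None or now - report.received_at > max_age:
        return "not run"
    return "succeeded" if report.success else "failed"
```

The window should match the job's schedule: a nightly backup might allow a day, a job that runs every 15 minutes far less.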
You can stretch this even further: signal each event separately. But it largely depends on the significance of each step – and the importance of the whole process.
A watchdog program or procedure could add a monitoring facility – but mostly it just adds one more element to the chain. If that one breaks, monitoring breaks totally, and you, as a system manager, will have no idea what’s gone wrong and what’s gone right; you will have to revert to the logfiles and other traces of activity (accounting, audit).
The same applies to procedures monitoring the system. Again, the feasibility of logging depends on the situation. Scanning disks for failures is good, but you must be sure the job HAS run, and know what the outcome has been. Even if no errors were found: send a message that the disks seem to be Ok.
But it depends. Sometimes, a message on each run isn’t feasible. I have a job that runs every 15 minutes or so, checking that the MySQL server is running and restarting it when it has failed. I DID have an issue with that job: resubmitting itself silently failed, and when MySQL actually crashed, I had to wait all day before I was able to restart it – days after… It would have been noticed if I had received a message of the failure (Ok, I got it – in the logfile that I should have checked…). This has been altered – and the logfile is now in the operator webspace, like all other logs and utilities.
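When a message on every run is too much, one alternative is to let the job leave a trace on each run and have something else check how old that trace is. The sketch below, in Python, assumes the job updates a heartbeat file every time it runs; the file, the function name `is_stale` and the grace factor are assumptions for illustration, not a description of the job above. The check itself is one more element in the chain, so it is best run from a different scheduler than the job it watches.

```python
import os
import time
from typing import Optional


def is_stale(heartbeat_path: str, interval_s: float = 15 * 60,
             grace: float = 2.0, now: Optional[float] = None) -> bool:
    """True when the job has missed its schedule or never ran."""
    now = time.time() if now is None else now
    try:
        last_run = os.path.getmtime(heartbeat_path)
    except OSError:
        # No heartbeat file at all: the job has never run.
        return True
    # Allow some slack (grace * interval) before raising the alarm.
    return now - last_run > grace * interval_s
```

With a 15-minute interval and a grace factor of 2, a job that fails to resubmit itself would be flagged after about half an hour instead of days later.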