State of procedures

(Triggered by an article by Jim Duff).

My remark on the issue – a failing disk in s shadow set was unnoticed – is mainly a matter of inappropiate system monitoring. Disks are prone to error and MUST be monitored. At least: watched in regular intervals. Well, very in 6 moths IS regular – but I mean shorter ones. Jim published a procedure that can do a good job, but some other things need to be considered as well.

A fair example is backup jobs. As are monitoring scripts. Any other procedure that runs automatically and is’more or less critical.
It is often assumed these jobs have 2 states: they have run successfully, or they haven’t. In a lot of cases, if not most, the system manager is only notifies if something went wrong – and no message usually means the procedure finished succesfully. So everyone relies on the absence of a message assuming backup has finished corerectly – until the moment of truth arrrives and a disk needs to be restored – wand it was found that the backup was actually non-existing.
This is No Horror Story. It happens…..

Principle procedures have 3 states:

  • it has run and finsihed succesfully
  • it has run but failed
  • it has not run at all
  • The first two must be brought to the attenmtion of the ssytem manager – or responsible operator, if that is appropiate. As soon as possible, preferrably – but it largely depends on priority. In some cases, failure of backup is not that important. In other cases, immediate action is a requirement.
    One medium – often used, I guess – is mail, or paging. In case of success, the message does not have to be a full body. In case of failure however, it might be required that more information on the error is added. In case of e-mail, the logfile, for instance.
    The third state is ‘delivered’ by absence of such a message. No message means Big Trouble.
    You can stretch this even further: signal each event separately. But it largely depends on the significance of each step – and the importance of the whole process.
    A watchdog program or procedure could add a monitoring facility – but mostly adds just one element in the chain. If that one brakes, it breaks monitoring totally, and you, as a system manager, will have no idea what’s gone wrong and right, and you will have to revert to the logfiles and other traces of activity (accounting, audit).

    The same applies to procedures mionitoring the system. Again, the feasibilty of logging depends. Scanning disks for failures is good, but you must be sure the job HAS run, and what the outcome has been. Even if there were no errros found: send a message the disks seem to be Ok.

    But it depends. Sometimes, a message on each run isn’t feasable. I have a job run every 15 minutes or so, scanning the system for teh MySQL server, to restart it when it has failed. I DID have an issue with that job, where resubmitting itself silently failed, and when MySQL actually crahed, I had to wait all day before I was able to restart it – days after…It would have been noticed if I had received a message of the failure (Ok, I got it – in the logfile that I should have checked…) This has been altered – and the logfile is now in the operator webspace, like all other logs and utilities.

    12-Dec-2007

    Updates
    Last patches have been installed tonight – and I ran AUTOGEN to recalculate the system settings based on feedback. It caused NPAGEDYN be increased signficantly, and after last reboot (there were two: one after Update 5, and one after the other three) freepages was lower than before. While typing this, I have a program running that monitors MySQL to see where it breaks – if it does. It certainly looks faster.
    The MySQL problems might be caused by PTHREAD or CRTL – both are heavily used by MySQL according the VMS-author (Jean Francois Pieronne) and Update 5 contains updates for both. I thought I had them installed, the PThread patch one anyway.
    So far, nothing’s wrong – it hasn’t run for a long time so it might be fine today and cause trouble later on. The watcher is running so if things go wrong, I’ll just have to wait for a short time (15 minutes at most).

    Current state: about 60% of WSQUO is used (77000 out of 131072) and it has been more (70%, or 92100). Well, time will tell what comes out of this 🙂

    Clamav antivirus built
    which wasn’t as straight forward as described. Expanding .TAR-GZ and untarring the resulting .TAR was no problem (the originally loaded .GZ was definitely broken) but building the products was another matter.
    First, I started building BZIP2. There is a CONFIGURE procedure that ran fine and craetes a BUILD procedure, but I got compilation warnings that caused MMS to give up – and building failed, completely. But after I removed all files and untarred the package again, building went flawless. So now I got that libary 😀
    Building Clamav has the same strcuture: First run the configure script and the Build one thereafter. However, Configure siganlled a fatl condition: I was still missing a library that had to be downloaded as well. After that had been installed, the configure script went on and I could start a BUILD. That gave a number of warnings and these may cause problems when running the programs. It might have to do with the C-compiler version:

    $ cc/version
    HP C V7.3-009 on OpenVMS Alpha V8.3

    where the original may have been built with 6.5 (or earlier).
    Now I only need to install it to test it – but that will have to wait until I’ve time enough: I still have to figure out how to integrate it with PMAS.

    Web content
    Some older tracks have been published and in the coming weeks, I hope to have done the backlog.
    The program to replace the PHP code is getting along. All modules compile on both Alpha and Itanium (I got myself an account on HP’s testdrive) and I’m curently do some rework on the most important wones for this work. The testprogram compiles does link still fails. But we’re getting somewhere! (Walking is not that fast either)

    09-Dec-2007

    Fine so far
    The system has been working fine since the last MySQL crashes and I mailed to the mailing list for assistance. Not any process in outswapped mode (LEFO or HIBO) found druing the last days, even when PHP was running. and for the rest there seems to be no trouble either. This is what the system should run like 🙂
    I downloaded latest patches, I may do the update tonight as well, but since it requres one or two reboots, I prefer to have some more time.
    I also wanted to install ClaMAV – an OpenSource virus scanner that is ported to VMS but with the currently installed executables I cannot extract the files from the file – it’s a .GZ file and Gunzip seems to locate more than on e entry so won’t process the file. There maight be something wrong, whatever I try it fails. See if I can get a decent (more recent) version of Gunzip somewhere – freeware CD, most likely.
    Another thing to do is install CFIS – Samba – but again, too late in the day to start with it. I;ll have to spin up a disk and keep it online. No problem since I got one in this shelf available, but seeting it all up needs to be thought over and prepared well. Maybe some tim ethis week – if programming doesnt take all time (which is quite likely).
    I found that the new package I installed on the forums (User_Shield) will actually allow a bunch of inactive users be activated or deleted in one push – but the buttons show up in the SubSilver template only, to in order to flush the fake members, it’s needed to change the forum’s look. I think I’l have to update the PHP code of the theme in order to have the buttons available in the Galaxian theme used.
    Some more data might be added to Trips, Track and Travels – there is still a number of tracks to publish, even where MapSource refuses to startup at the moment. Garmin has been notified, waiting for the answer. Likely it will be that a files needs to be re-installed.
    But No….
    When storing this post, MySQL crashed again when I tried to access the page:

    071209 21:54:32 InnoDB: Error: Write to file /database/mysql41/ibdata1 failed at offset 0 1048576.
    InnoDB: 212992 bytes should have been written, only -1 were written.
    InnoDB: Operating system error number 12.
    InnoDB: Check that your OS and file system support files of this size.
    InnoDB: Check also that the disk is not full or a disk quota exceeded.
    InnoDB: Error number 12 means 'not enough core'.
    InnoDB: Some operating system error numbers are described at
    InnoDB: http://dev.mysql.com/doc/mysql/en/Operating_System_error_codes.html
    071209 21:54:32InnoDB: Assertion failure in thread 72258240 in file MYSQL_ROOT:[innobase.fil]fil0fil.c;1 line 3922
    InnoDB: Failing assertion: ret
    InnoDB: We intentionally generate a memory trap.
    InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
    InnoDB: If you get repeated assertion failures or crashes, even
    InnoDB: immediately after the mysqld startup, there may be
    InnoDB: corruption in the InnoDB tablespace. Please refer to
    InnoDB: http://dev.mysql.com/doc/mysql/en/Forcing_recovery.html
    InnoDB: about forcing recovery.
    mysqld got signal 10;
    This could be because you hit a bug. It is also possible that this binary
    or one of the libraries it was linked against is corrupt, improperly built,
    or misconfigured. This error can also be caused by malfunctioning hardware.
    We will try our best to scrape up some info that will hopefully help diagnose
    the problem, but since we have already crashed, something is definitely wrong
    and this may fail.

    key_buffer_size=8388608
    read_buffer_size=131072
    max_used_connections=2
    max_connections=100
    threads_connected=2
    It is possible that mysqld could use up to
    key_buffer_size + (read_buffer_size + sort_buffer_size)*max_connections = 72191 K
    bytes of memory
    Hope that's ok; if not, decrease some variables in the equation.

    InnoDB: Thread 72159936 stopped in file MYSQL_ROOT:[innobase.os]os0sync.c;1 line 501
    InnoDB: Thread 72209088 stopped in file MYSQL_ROOT:[innobase.include]sync0sync.ic;1 line 111
    %CMA-F-EXIT_THREAD, current thread has been requested to exit
    InnoDB: Thread 82752192 stopped in file MYSQL_ROOT:[innobase.include]sync0sync.ic;1 line 111
    %TRACE-F-TRACEBACK, symbolic stack dump follows
    image module routine line rel PC abs PC
    PTHREAD$RTL 0 000000000004367C FFFFFFFF80E6367C
    PTHREAD$RTL 0 000000000006E9E8 FFFFFFFF80E8E9E8
    0 FFFFFFFF8016F054 FFFFFFFF8016F054
    0 FFFFFFFF80376E0C FFFFFFFF80376E0C
    DECC$SHR_EV56 0 00000000001D2DC4 FFFFFFFF8106ADC4
    DECC$SHR_EV56 0 00000000001D2A8C FFFFFFFF8106AA8C
    0 FFFFFFFF801877CC FFFFFFFF801877CC

    (Today’s crash data)

    It might be this file is too small:

    $ dir database:[mysql41]/siz

    Directory DATABASE:[mysql41]

    ibdata1.;1 53248
    ib_logfile0.;1 4096
    ib_logfile1.;1 4096

    Total of 3 files, 61440 blocks.

    that is:

    $ dir database:[mysql41]/siz=unit=bytes

    Directory DATABASE:[mysql41]

    ibdata1.;1 26MB
    ib_logfile0.;1 2MB
    ib_logfile1.;1 2MB

    Total of 3 files, 30MB

    Should not be a problem. Or is it because the file cannot be extended:

    $ dir database:[000000]/sec

    Directory DATABASE:[000000]

    000000.DIR;1 [SYSTEM] (RWED,RWED,RE,E)
    BACKUP.SYS;1 [SYSTEM] (RWED,RWED,RE,)
    BADBLK.SYS;1 [SYSTEM] (RWED,RWED,RE,)
    BADLOG.SYS;1 [SYSTEM] (RWED,RWED,RE,)
    BITMAP.SYS;1 [SYSTEM] (RWED,RWED,RE,)
    CONTIN.SYS;1 [SYSTEM] (RWED,RWED,RE,)
    CORIMG.SYS;1 [SYSTEM] (RWED,RWED,RE,)
    INDEXF.SYS;1 [SYSTEM] (RWED,RWED,RE,)
    mysql41.DIR;1 [SYSTEM] (RWE,RWE,RE,E)
    SECURITY.SYS;1 [SYSTEM] (RWED,RWED,RE,)
    VOLSET.SYS;1 [SYSTEM] (RWED,RWED,RE,)

    Set the directory to W:RE (alas, PHP dos not (yet) take ACL in account) and make MySQL_SERVER the owner. If it happens again, failure to extend the file should not be the cause.

    04-Dec-2007

    L’Histoire se repète….
    This morning, trying to access the Bootcamp blog, MySQL crashed again. Same (or fairly similar) cause:

    071204 7:52:03 InnoDB: Error: the OS said file flush did not succeed
    071204 7:52:03 InnoDB: Operating system error number 12 in a file operation.
    InnoDB: Error number 12 means 'not enough core'.
    InnoDB: Some operating system error numbers are described at
    InnoDB: http://dev.mysql.com/doc/mysql/en/Operating_System_error_codes.html
    InnoDB: File operation call: 'flush'.
    InnoDB: Cannot continue operation.

    It could be the case that multiple PHP_WASD processes were activated. I screwed the throttle back – and lowered PHP’s memory requirements from 16M to 10M. As far as I can see, it still works – with an occasional “Not enough core” wtithin PHP – MySQL is running without a problem….

    03-Dec-2007

    MySQL down – again
    MySQL server crashed again this morning:

    Version: '4.1.14-log' socket: '' port: 3306 Source distribution
    071203 7:35:45 InnoDB: Error: the OS said file flush did not succeed
    071203 7:35:45 InnoDB: Operating system error number 12 in a file operation.
    InnoDB: Error number 12 means 'not enough core'.
    InnoDB: Some operating system error numbers are described at
    InnoDB: http://dev.mysql.com/doc/mysql/en/Operating_System_error_codes.html
    InnoDB: File operation call: 'flush'.
    InnoDB: Cannot continue operation.
    %DELETE-W-SEARCHFAIL, error searching for MYSQL_ROOT:[MYSQL_SERVER.TMP]*.*;*
    -RMS-E-FNF, file not found
    MYSQL_SERVER job terminated at 3-DEC-2007 07:35:47.95

    The watcher wasn’t running, but since logging ws disabled, There is no way to locate the reason. The last log is of 26-Nov-2007, 23:32 so it might be that changing data may have caused an error. But even in acconting, it cannot be found.
    So I just started it again – with logging enabled, to find out – eventually.

    In general this means that MySQL 4.1 does not have the required stability….
    Adjusting management procedures
    The >anti-spam engine creates a large number of log files when storing statistics information and these files must be deleted “manually”. That is: in the local clean-up procedure that is called in the nightly batch, it is cheecked if such files exist and if so, all are deleted:

    if f$search("sqlite*.*") .nes. "" then delete sqlite*.*;*

    I’ve added this line some time ago but still there is quite a number of files present. No wonder: I’d better specified where to look:

    if f$search("sys$manager:sqlite*.*") .nes. "" then delete sys$manager:sqlite*.*;*

    👿 Duh 👿
    The backup procedures have been enhanced a bit as well. Logging of the MySQL backup went to the wrong place, and only the public web is backed up – until today. I chnaged both procedures: have the logging come to the right spot, and keep only the latest web backup available, otherwise the disk wqill be crowded with unused files. They are big enough, but all tyhese backups are not needed so I can clean them up anyway – if backup succeeds.
    Why did mysql-watcher not run? Well, the answer is found in the logfile:

    Resubmitted 30-Nov-2007 23:32:00.00
    Resubmitted 30-Nov-2007 23:47:00.00
    Resubmitted 31-Nov-2007 00:02:00.00

    😳 Invalid date…