SYSMGR in the attic – Page 144 – What goes on behind the doors of the (private) data center

13-Nov-2007

Closed a security hole
The blogs reside in the public area, and when scanning the report on unwanted accesses derived from the webserver logs, I noticed that somehow the full path had been exposed. This means any user has access to the configuration file – which contains the database access information. The file needs to be accessible by the web user (HTTP$NOBODY) – READ at least – to allow database access. There is no easy way around that, and certainly not at short notice.
So the only solution was to take the database access data out of that file and store it in a safe place that cannot be reached through the webserver directly. It’s a minor change in the PHP code: just include a file containing the sensitive data.
This has been done now – and the access data has been changed.

This is a change to be proposed to the WordPress team.

MySQL slower – but more stable
At least, I didn’t notice any MySQL breakdown in the past weeks. It’s quite a bit slower – all buffers are half as big as before – but considering the problems encountered, the current preference is stability above speed.
The batch job that keeps an eye on MySQL runs every 15 minutes but hasn’t found a MySQL-Server process missing, so far.

5-Nov-2007

05-Nov-2007

File corruption
I made an image backup of the webdisk to another – clean – disk in batch last night. The log shows it did succeed, but with parity errors when reading one disk container: the one previously used to hold the webs:

...
%BACKUP-S-CREATED, created $116$DKA201:[000000]USER.DIR;1
%BACKUP-S-CREATED, created $116$DKA201:[000000]VOLSET.SYS;1
%BACKUP-S-CREATED, created $116$DKA201:[000000]webdisk.dsk;1
%BACKUP-E-READVERR, virtual read error on file [000000]webdisk_old.dsk;1 at block 4193050
-SYSTEM-F-PARITY, parity error
%BACKUP-E-READVERR, virtual read error on file [000000]webdisk_old.dsk;1 at block 4193051
-SYSTEM-F-PARITY, parity error
...
-SYSTEM-F-PARITY, parity error
%BACKUP-E-READVERR, virtual read error on file [000000]webdisk_old.dsk;1 at block 4193795
-SYSTEM-F-PARITY, parity error
%BACKUP-S-CREATED, created $116$DKA201:[000000]webdisk_old.dsk;1
%BACKUP-E-READVERR, virtual read error on file []webdisk_old.dsk;1 at block 4193050
-SYSTEM-F-PARITY, parity error
...
%BACKUP-E-READVERR, virtual read error on file []webdisk_old.dsk;1 at block 4193795
-SYSTEM-F-PARITY, parity error
%BACKUP-S-CREATED, created $116$DKA201:[]webdisk_old.dsk;1
SYSTEM job terminated at 5-NOV-2007 03:45:22.66

So the file has been copied, but it’s quite likely that the original backup was somehow broken already. And when a disk container is broken, so is the disk it represents. No wonder it couldn’t be opened!

The other containers were fine. No trouble whatsoever.
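
A long BACKUP log like the one above is easier to judge once the READVERR lines are boiled down to a block range per file. The following is a minimal sketch in Python, not something that ran on Diana: the function name `summarize_read_errors` and the `SAMPLE` text are made up for illustration, and the sample holds only the three block numbers visible in the excerpt, since the rest of the log was elided.

```python
import re
from collections import defaultdict

# Matches lines such as:
# %BACKUP-E-READVERR, virtual read error on file [000000]webdisk_old.dsk;1 at block 4193050
READVERR = re.compile(
    r"%BACKUP-E-READVERR, virtual read error on file (?P<file>\S+) at block (?P<block>\d+)"
)


def summarize_read_errors(log_text):
    """Return {file: (lowest_block, highest_block, lines_seen)} for READVERR lines."""
    blocks = defaultdict(list)
    for match in READVERR.finditer(log_text):
        blocks[match.group("file")].append(int(match.group("block")))
    return {
        name: (min(found), max(found), len(found))
        for name, found in blocks.items()
    }


# Hypothetical sample: only the lines that are visible in the log excerpt.
SAMPLE = """\
%BACKUP-S-CREATED, created $116$DKA201:[000000]webdisk.dsk;1
%BACKUP-E-READVERR, virtual read error on file [000000]webdisk_old.dsk;1 at block 4193050
-SYSTEM-F-PARITY, parity error
%BACKUP-E-READVERR, virtual read error on file [000000]webdisk_old.dsk;1 at block 4193051
-SYSTEM-F-PARITY, parity error
%BACKUP-E-READVERR, virtual read error on file [000000]webdisk_old.dsk;1 at block 4193795
-SYSTEM-F-PARITY, parity error
"""

if __name__ == "__main__":
    for name, (first, last, seen) in summarize_read_errors(SAMPLE).items():
        span = last - first + 1
        print(f"{name}: blocks {first}..{last} (span of {span} blocks, {seen} lines in sample)")
```

For the excerpt, the reported blocks run from 4193050 to 4193795, a span of 746 blocks of 512 bytes each. Whether every block in between failed cannot be told from the shortened log.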

I connected the original container to LD, but the mount failed with too many parity errors reading header information. I disconnected the container file and deleted it from the original disk.
Quite likely the disk wasn’t broken after all, so I put it back online, mounted it and examined the container file on that disk. Same problem! So I traced the file using LD/TRACE while mounting it, but that didn’t get me much further: it just showed a lot of READ and some WRITE actions to disk on MOUNT. So I quit the examination and dismounted the disk, but forgot to disconnect – an error I only found when I tried to connect the copied container to the same LD device. This failed – and I disconnected the device, but from the wrong process.
Now two processes got stuck in RWAST state: the one that allocated the real disk, and the one from which I should have disconnected the LD device… All of a sudden, the new webdisk was reported to have the wrong volume loaded – where it had worked nicely before. I rebooted the system to prevent any more damage; the webdisk was found to be in mount verification mode, so that took some time. Nevertheless, the system came up nicely, without errors.

Keep it under close watch, but hopefully this trouble is now over – at least for some time.

Mail statistics October
Processed October’s mail statistics:

Total number of messages: 3213
Blacklisted: 2271 (70.5%) – Average 73, min 34, max 149
Relayblock: 126 (4%) – Average 4, min 0, max 76
Filtered : 424 (13.1%) – Average 13.7, min 7, max
Delivered : 413 (12.9%) – Average 13.3, min 1, max 33

This does not take into account:
– False positives (just a few)
– False negatives (a few as well)
– known and expected relay attempts (check by ISP)

The program to analyse these is still waiting for completion 😉 but I’ve seen that the peak in relay attempts comes from just one address – probably trying to bring down the server (all within a minute!).
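
The shares and daily averages in the table can be re-derived from the raw counts. The sketch below is only an illustration in Python, not the analysis program mentioned above. The names `COUNTS`, `TOTAL_REPORTED` and `summarize` are invented here, and the 31-day divisor is an assumption (October), although it does reproduce the posted averages.

```python
# Counts as posted in the October statistics.
TOTAL_REPORTED = 3213
DAYS_IN_OCTOBER = 31  # assumption: averages are per calendar day
COUNTS = {
    "Blacklisted": 2271,
    "Relayblock": 126,
    "Filtered": 424,
    "Delivered": 413,
}


def summarize(counts, total, days):
    """Return {category: (share of total in percent, average per day)}."""
    return {
        name: (round(100 * count / total, 1), round(count / days, 1))
        for name, count in counts.items()
    }


if __name__ == "__main__":
    for name, (share, per_day) in summarize(COUNTS, TOTAL_REPORTED, DAYS_IN_OCTOBER).items():
        print(f"{name:<12} {share:5.1f}%  average {per_day:.1f} per day")
    # The four categories need not add up to the reported total.
    print("Sum of categories:", sum(COUNTS.values()), "reported total:", TOTAL_REPORTED)
```

Run against the posted numbers, the four categories add up to 3234, which is 21 more than the reported total of 3213, and some recomputed shares come out a few tenths off the table (for example 70.7% rather than 70.5% for Blacklisted). The sketch does not settle which figure is the right one.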

4-Nov-2007

03-Nov-2007

Severe problems
Normally, it’s a good thing to combine fully unrelated jobs if possible. In this case, updates to Diana required a reboot, and some work needed to be done on the electrical system, which meant a power-down.
So I installed the patches and shut down Diana – and the disk controller. Next: take the power down, do the work and power the whole bunch up again.
That, however, proved a bit more cumbersome than I thought: firing up the Alpha AND the disk controller AND the fully loaded disk shelves at once gave a spike on the power grid that tripped the circuit breaker, so power did not return. I switched all computer hardware off, and restored power afterwards – unit by unit.

The problems were revealed soon. VERY soon..

It looked as if the disk shelves and the HSX50 controller came up nicely – except for one disk that failed to spin up.
I took that one off the shelf and restarted the controller.
Next, start Diana. Same problem as a few weeks ago: it comes up, and suddenly stops responding. After switching the system off and on several times and restarting the HSZ controller, it finally worked, all of a sudden. So I kept it running.
Booting revealed a failure starting MySQL – and the webs didn’t work either. It turned out that the disk containing the webs could not be mounted due to parity errors in accessing SECURITY.SYS. That meant that the disk had to be re-initialized, and I would have to restore everything from backup. I did make a full backup of the whole disk some time ago, and the public web is copied each Monday; nothing has changed in these static pages last week. Just the operator logs (and their zips) would – once again – be lost.

So I did an INIT of the webdisk and restored the backup. That took all evening yesterday and part of the night.
In the meantime, I took the opportunity to do some ‘de’-tuning of MySQL so that it requires about half the memory it used to take. It will slow down all blogs, but stability is more important – if it doesn’t require so many pages, it should be less likely to run into “not enough core”.

This morning the webdisk could be mounted and the disk containers could be connected to the LD driver, but mounting them simply failed. Completely. The error count on the disk went up, so chances are the disk is really broken.
To speed things up, I installed the backup disk. Again, mounting the disk containers wasn’t fully free of errors, but they could be mounted. That meant I had the webs back (though somewhat outdated), and I could restore the databases.
That’s what I thought.

I knew I had to prepare something: create an empty database, and execute the full backup SQL file. It took somewhat more work: I had to refer back to the original installation page and redo some of the work. I found the databases were still listed, but accessing tables failed. No wonder: the data dictionary was still intact and MySQL refers to these .FRM files! I renamed them and created new databases using PHPMyAdmin. Now the databases were back in order. This has been done on a new container, only 4 GB in size – big enough to hold the databases.

As a follow-up, I cleared the old database disk container, copied all webs to it, then restored the backup of the PUBLIC web onto it and relabeled the volume – this is currently in progress.
After dismounting and disconnecting the smaller container, I will rename the container files so that the former database container becomes the web container, connect it to LD and mount the volume. Hopefully, it will be up to date after that.

Backing up this whole disk will take the rest of the night.

Now what caused all this? It’s mainly hardware failure – it may very well have to do with the power returning to the systems. I’ll need to investigate Diana for this, but for that there is no alternative system – yet. The only alternatives I have at the moment are the AlphaServer 400 or one of the AlphaStation 200s. Both lack memory and are much slower – not just in clock ticks, but in processor as well (EV4, compared with EV56 in Diana).
The Alpha1000 and 2000 don’t start at the moment, and would need to be set up to use the shared SCSI as well. I do have cards left ….

One more MySQL failure
MySQL once again ran into an ACCVIO error….

30-Oct-2007 / 11-Dec-2007

30-Oct-2007

Low memory
One of the problems encountered is a shortage of memory. Or better: a lack of free pages. The problem became clear yesterday, when trying to FTP a 500Kb file from Demeter (Windows XP) to Diana (VMS). The connection reconnected – and reconnected – and kept reconnecting, while MONITOR, running in a telnet session, showed free pages dropping from about 16K to zero – in seconds. Even at this moment – where MySQL has done some work, and the PHP engine has happily started doing some work as well (and it will be triggered once in a while to get this text saved while not yet committed) – the free list has dropped from 12K to 1.5K. A massive 10K pages in use (I have MONITOR running).

The next two images show the memory usage and paging yesterday, when I did the FTP connection at approximately 21:00 (All times GMT)

The blue line is pagefile usage – and I have a massive 3GB online. About 75% of it was in use at some point – guess what THAT would have caused if I still had my previous 1Gb file….

It has been so bad that HyperSpy (the monitor that allows web-based monitoring) didn’t get through at some point – that explains the gap in the memory graph.

What I don’t understand: I reversed the changes I made to accommodate Distributed Netbeans and Webes. Before these changes, there wasn’t really a performance issue. At least, I never ran into one. But it’s quite possible I overlooked something. Another weird thing: I uploaded a complete set of photographs and pages this weekend, all in all about 22M – 40 times as much – but that went nice and smooth. The only difference: that was over a 100Mb LAN, not over a 54Mb wireless access point (on the same network).

He who understands, please explain…

First things first.
MODPARAMS.DAT still contained settings for Advanced Server, which I never got to work and which was never really installed after I upgraded the system, so I got rid of these settings. Some other things need to be addressed as well; 256 Gb should be way enough – and I don’t plan to use Java again, on Diana anyway.
So I searched the HP site for performance documentation – and the manual in the 8.3 set is the one for VMS 7.3. But I guess most of it is still valid.

That means: back to the drawing board and calculator, to get at least a somewhat better free list. Having some processes swapped out is no big deal if these aren’t heavily used :(. It’s been quite a while since my last tuning job. About 20 years, I reckon…. (But that was a much easier problem.)

And of course: MySQL server just crashed again. Not enough core. I’ve seen it.

29-Oct-2007

29-Oct-2007

Updates postponed
Due to other priorities, no updates could be installed last weekend. None of them is really critical, so they can be postponed without problems. It will have to be done some weeks later; there is little room in the coming weeks.
MySQL crash
When accessing the blog this afternoon, the MySQL server crashed – again. Why does the server crash if one thread fails?
Perhaps the value of some variables should be lowered, but why did it work rather well in the past? The server has crashed before, but stability has decreased without obvious reason: just one system parameter changed? It doesn’t make sense….
Upload of the logfile fails:
%HTTPD-W-NOTICED, 29-OCT-2007 18:33:41, CGI:1969, not a strict CGI response
-NOTICED-I-SERVICE, http://www.grootersnet.nl:80
-NOTICED-I-CLIENT, 192.168.0.33
-NOTICED-I-URI, POST (72 bytes) /sysblog/wp-admin/upload.php?style=inline&tab=upload&post_id=-1193677653
-NOTICED-I-SCRIPT, /sysblog/wp-admin/upload.php sysblog:[wp-admin]upload.php (cgi_exe:phpwasd.exe) SYSBLOG:[wp-admin]upload.php
-NOTICED-I-CGI, 2553595354454D2D462D485041524954482C206869676820 (129 bytes) %SYSTEM-F-HPARITH, high performance arithmetic trap, Imask=00000000, Fmask=00000002, summary=02, PC=00000000001E9C94, PS=0000001B
-NOTICED-I-RXTX, err:0/0 raw:7643/0 net:1182/0

whereas uploading a .JPG file did work this afternoon (before the server crashed). Well, I’ll see if I can get the data uploaded some other time.
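
The hex string on the -NOTICED-I-CGI line is worth decoding, because it shows what the script actually sent back instead of a CGI header. A small Python sketch follows; the function name `decode_cgi_hex` is made up here, and the only input is the hex string copied from the log line above.

```python
def decode_cgi_hex(hex_dump):
    """Turn the hex dump from a -NOTICED-I-CGI line into readable text."""
    return bytes.fromhex(hex_dump).decode("ascii", errors="replace")


# Hex string exactly as it appears in the webserver log.
HEX_FROM_LOG = "2553595354454D2D462D485041524954482C206869676820"

if __name__ == "__main__":
    print(repr(decode_cgi_hex(HEX_FROM_LOG)))  # '%SYSTEM-F-HPARITH, high '
```

The 24 bytes shown are the start of the 129-byte response and spell out the beginning of the HPARITH message. So the first thing the PHP engine returned was the VMS error text itself, which fits the "not a strict CGI response" warning.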
