08-Aug-2015

More on the crashes
I had little time for more testing.
But whatever I tried, I couldn’t get Diana working again. Perhaps I had the bad DIMM in slot 0? But since we had a small trip planned, I decided to take the server down for the weekend and see what I could do on return. There are still two PWS systems available that could do the job, or part of it, if all else fails.
Could it be something on the motherboard that caused the system to be unresponsive? In that case I would need to get a replacement…

07-Aug-2015

Crashes
All of a sudden, Diana was completely unresponsive a few days ago. Not even the HALT button (which would normally halt the CPU and return to console mode) worked, nor did CTRL-P, so I had to power off the system and restart it. All seemed well after that.
Until today.

When looking at the server stats this morning to see if there were more PHP errors, I found none at all: the server log showed the webserver had been running for only about 30 minutes. That means the system had completely restarted after a crash. It was not a power outage, since that would stop the disks as well and the system would not restart (the boot process would not be able to access the disks since the HSZ50 would not be ready yet).
So I checked HyperSpy, and that showed two crashes. Since I was at work, investigations had to be postponed till after work.
When that could be done, I logged in remotely and found a number of crashes today:
$ dir operator*/dat

Directory SYS$SYSROOT:[SYSMGR]

$ dir/dat oper*

Directory SYS$SYSROOT:[SYSMGR]

OPERATOR.LOG;4369     7-AUG-2015 12:11:09.78
OPERATOR.LOG;4368     7-AUG-2015 07:25:40.54
OPERATOR.LOG;4367     7-AUG-2015 06:03:43.25
OPERATOR.LOG;4366     7-AUG-2015 02:11:16.71
OPERATOR.LOG;4365     7-AUG-2015 00:21:40.78
OPERATOR.LOG;4364     7-AUG-2015 00:00:00.83
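
To see how long the system stayed up between new operator logs, the creation times in this listing can be turned into intervals. This is a minimal Python sketch, not something that ran on Diana: the helper names are made up for illustration, and it assumes each new OPERATOR.LOG version marks a restart, which may not hold for the one created at midnight.

```python
from datetime import datetime

MONTHS = {name: number for number, name in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1)}

# Rows copied from the directory listing above.
LISTING = """\
OPERATOR.LOG;4369     7-AUG-2015 12:11:09.78
OPERATOR.LOG;4368     7-AUG-2015 07:25:40.54
OPERATOR.LOG;4367     7-AUG-2015 06:03:43.25
OPERATOR.LOG;4366     7-AUG-2015 02:11:16.71
OPERATOR.LOG;4365     7-AUG-2015 00:21:40.78
OPERATOR.LOG;4364     7-AUG-2015 00:00:00.83
"""


def parse_vms_time(text):
    """Parse an OpenVMS absolute time such as '7-AUG-2015 12:11:09.78'."""
    date_part, time_part = text.split()
    day, month, year = date_part.split("-")
    hms, hundredths = time_part.split(".")
    hour, minute, second = (int(x) for x in hms.split(":"))
    # VMS prints hundredths of a second; datetime wants microseconds.
    return datetime(int(year), MONTHS[month.upper()], int(day),
                    hour, minute, second, int(hundredths) * 10_000)


def creation_times(listing):
    """Return the creation times of all OPERATOR.LOG versions, oldest first."""
    times = []
    for line in listing.splitlines():
        fields = line.split(None, 1)
        if len(fields) == 2 and fields[0].startswith("OPERATOR.LOG;"):
            times.append(parse_vms_time(fields[1].strip()))
    return sorted(times)


times = creation_times(LISTING)
gaps = [later - earlier for earlier, later in zip(times, times[1:])]
for started, gap in zip(times[1:], gaps):
    print(f"{started:%H:%M:%S}  new log {gap} after the previous one")
```

The gaps vary widely, which fits a fault that only bites now and then rather than one on a fixed schedule.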

Using ANA/ERR/ELV TRANSLATE/SINCE=TODAY, I found that one was caused by INVTQEFMT in the NULL process, and it seems all the others were caused by INVEXCEPTN in a webserver process (either the server itself or a worker process), all with a current IPL of 8. The last crash told me:

$ ana/crash

OpenVMS system dump analyzer
...analyzing an Alpha compressed selective memory dump...

%SDA-W-LMBMISSING, "S2 Space" LMB missing from dump
Dump taken on  7-AUG-2015 12:06:30.75 using version V8.4
INVEXCEPTN, Exception while above ASTDEL

%SDA-W-PROCNOTDUMPED, process WASD:80-123 not dumped; process private memory not accessible

Something to look into at home.
So I did, but when I logged in to my (still open) Motif session, the system crashed again.
This gave me a clue about the other crashes: it may have to do with the new memory.

So the next step was to take Diana down and test the state of memory:
>>> show mem

showed 2048 MB of memory, no bad pages.

I let the system run for some time, and then it happened: the number of bad pages increased, stabilized, decreased, stabilized, and increased again, like a yo-yo. This explained the odd behaviour: the system ran fine for quite some time and then suddenly found itself in bad memory, causing errors above ASTDEL (IPL=2).
But which one of the four was faulty – if any?
So I first installed the original 512 MB in bank 0 and let it run. No problems.
Installed two of the new DIMMs in bank 0 and let it run for 30 minutes: no problem.
Swapped one of the two with another one. Same test, same result.
Swapped the other one, so now I had all four of them tested: no problem either.

This was odd.

So I installed all four of them and booted. At least, I tried to, but now there wasn’t any response from the keyboard, and the sequence shown on the console revealed nothing unusual.
Only that DKA0 (the local disk) was missing, but that I knew already.

I checked that as well: moved the disk from under the floppy drive to the bottom of the right-side cage (already holding the CD and DVD drives), tested the connectors, and found that this issue seemed to be caused by a bad cable…

Too late to continue tonight.

06-Aug-2015

First upgrade
I’ve done some testing on the preview of Windows 10, using the old Athena box, and didn’t find anything disturbing, so I am confident enough to upgrade my workstation. And so I did. No issues with the upgrade, and all seems to work fine; the only thing I needed to reinstall was the NVIDIA GeForce driver set, but even that went without a problem.
There are still a few Windows 7 systems, but these couldn’t be upgraded to Windows 8 since HP had installed some software that I do use and that is not compatible with Windows 8, so I expect it will cause problems with Windows 10 as well. These are to be scanned first, before I decide whether or not to upgrade.

Oh yes, there is this DKA0 disk… Well, next week.

03-Aug-2015

Maintenance
No real surprises at first glance:
PMAS statistics for July
Total messages    :   3010 = 100.0 o/o
DNS Blacklisted   :      0 =    .0 o/o (Files:  0)
Relay attempts    :   1680 =  55.8 o/o (Files: 31)
Accepted by PMAS  :   1330 =  44.1 o/o (Files: 31)
Handled by explicit rule
Rejected :    762 =  57.2 o/o (processed),  25.3 o/o (all)
Accepted :    227 =  17.0 o/o (processed),   7.5 o/o (all)
Handled by content
Discarded :    170 =  12.7 o/o (processed),   5.6 o/o (all)
Quarantained :    139 =  10.4 o/o (processed),   4.6 o/o (all)
Delivered :     32 =   2.4 o/o (processed),   1.0 o/o (all)
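
The percentages in this report can be reproduced from the raw counts. Accepted plus relay attempts only add up to 99.9, which suggests the report truncates to one decimal rather than rounding (1330 of 3010 is 44.19 percent, shown as 44.1). A minimal Python sketch of that arithmetic follows; it is not the script that produced the report, and the function name is made up for illustration.

```python
def pct_truncated(count, total):
    """Percentage of count in total, truncated (not rounded) to one decimal."""
    tenths = count * 1000 // total   # integer number of tenths of a percent
    return f"{tenths // 10}.{tenths % 10}"


TOTAL = 3010      # all messages in July
ACCEPTED = 1330   # messages accepted by PMAS ("processed")

# Labels spelled as in the report.
handled = {
    "Rejected": 762,
    "Accepted": 227,
    "Discarded": 170,
    "Quarantained": 139,
    "Delivered": 32,
}

# The five categories should account for every processed message.
assert sum(handled.values()) == ACCEPTED

for label, count in handled.items():
    print(f"{label:<12} : {count:6d} = "
          f"{pct_truncated(count, ACCEPTED):>5} o/o (processed), "
          f"{pct_truncated(count, TOTAL):>5} o/o (all)")
```

Integer arithmetic is used on purpose, so the truncation cannot be disturbed by floating-point rounding.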

But there have been quite a few attempts to relay:
PTSMTP_ANTIRELAY.LOG-2015-07-11 is 118 blocks: check file ANTIRELAY.-2015-07-11
It contains 486 records, most from din0017@163.com (183.14.9.249, 460), and far fewer from zurdocore19@gmail.com (193.0.200.136, 28).

PTSMTP_ANTIRELAY.LOG-2015-07-12 is 70 blocks: check file ANTIRELAY.-2015-07-12
This file contains 288 records, all mentioning din998@126.com (183.14.9.249).

This month’s champion however is another Chinese script:
PTSMTP_ANTIRELAY.LOG-2015-07-28 is 196 blocks: check file ANTIRELAY.-2015-07-28
It contains 704 attempts, all but two by xiaonanzi11162@sina.com (14.215.136.26).

But this month’s reports may be lost.
Nothing reported, not even Info-ZIP (used to store the files there), that DKA0:[LOGSARCHIVE] could not be found. As it turned out, the disk doesn’t seem to be connected, so the system cannot locate it. Looking back, I had already noticed I had one pagefile where I used to have two.
Looking into the error log (ANA/ERR/ELV TRANSLATE/SINCE=22-Jul/OUT=x.x) I noticed the disk was properly dismounted, but there was no mention of it on startup. At least, no error. Though if I had looked at the overview of free blocks in the system, I would have realized something was wrong:

%STARTUP-I-DSKINFO, Determining free diskspace
Device Name   Volume Label   Used Blocks     Free Blocks      Total
------------- ----------- -------------- --------------- ---------
$116$DKA100:  AXP084      51799491 (73%)  19315132 (27%)  71114623
$116$DKA101:  WEB2006     45337251 (64%)  25777372 (36%)  71114623
$116$DKA102:  USERDISK    27580063 (39%)  43534560 (61%)  71114623
$116$DKA105:  L500        27042046 (39%)  44072577 (61%)  71114623
$116$DKA106:  QUORUM           348 ( 1%)   8377680 (99%)   8378028
$1$LDA1:      WEB_DISK2    3302336 (**%)    297664 ( 0%)  33600000
$1$LDA4:      HYPERSPI      307004 (16%)   1692996 (84%)   2000000
$1$LDA10:     JFPLIB0004A   206448 (83%)     43552 (17%)    250000
$1$LDA11:     JFPPY0100A    747488 (68%)    352512 (32%)   1100000

No DKA0 shown here.
So I will have to re-open the system, check connectors to DKA0 and restart…
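
A missing disk like this is easy to overlook by eye, so the startup report could be checked against a list of units that ought to be there. The following is a minimal Python sketch of that idea, not part of the actual startup procedure: the function names and the list of expected units are assumptions for illustration, and devices are compared by unit name only (the part after the last `$`).

```python
import re

# Rows copied from the DSKINFO report above.
DSKINFO = """\
Device Name   Volume Label   Used Blocks     Free Blocks      Total
------------- ----------- -------------- --------------- ---------
$116$DKA100:  AXP084      51799491 (73%)  19315132 (27%)  71114623
$116$DKA101:  WEB2006     45337251 (64%)  25777372 (36%)  71114623
$116$DKA102:  USERDISK    27580063 (39%)  43534560 (61%)  71114623
$116$DKA105:  L500        27042046 (39%)  44072577 (61%)  71114623
$116$DKA106:  QUORUM           348 ( 1%)   8377680 (99%)   8378028
$1$LDA1:      WEB_DISK2    3302336 (**%)    297664 ( 0%)  33600000
$1$LDA4:      HYPERSPI      307004 (16%)   1692996 (84%)   2000000
$1$LDA10:     JFPLIB0004A   206448 (83%)     43552 (17%)    250000
$1$LDA11:     JFPPY0100A    747488 (68%)    352512 (32%)   1100000
"""


def mounted_devices(report):
    """Return the device names (without trailing colon) listed in the report."""
    devices = []
    for line in report.splitlines():
        match = re.match(r"^(\S+):\s", line.strip())
        if match:
            devices.append(match.group(1))
    return devices


def missing_units(report, expected_units):
    """Return the expected unit names that do not appear in the report."""
    present = {name.rsplit("$", 1)[-1] for name in mounted_devices(report)}
    return [unit for unit in expected_units if unit not in present]


# Hypothetical list of units that should always be mounted.
EXPECTED = ["DKA0", "DKA100", "DKA101", "DKA102", "DKA105", "DKA106"]
print("Missing:", missing_units(DSKINFO, EXPECTED))
```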

23-Jul-2015

Results of more memory
The main question is: did this memory expansion do any good? It should be visible in the output of the performance tools.
First, Hyperspy++ (which comes with WASD): it does show memory usage is much lower, but consistent with the previous size (about a quarter), and paging hasn’t changed much:
[Screenshot: 2015-07-23_08-24-39]
T4 output shows a similar freelist size, but the modified page list has increased:
[Screenshot: 2015-07-23_08-30-00]
and here as well, paging statistics haven’t changed much:
[Screenshot: 2015-07-23_08-30-37]
which is to be expected, since memory utilization is pretty much the same:
[Screenshot: 2015-07-23_08-33-09]
This seems odd, but is to be expected, since I still have to reconfigure the paging parameters; that’s what AUTOGEN is all about. I plan to do that sometime next week.

But on the subjective side, there are improvements: it is obvious that the increased size of the modified page list allows faster execution of PHP code. No ACCVIO or stack overflows so far; just a few:

%HTTPD-W-NOTICED, 22-JUL-2015 19:52:40, CGI:1997, not a strict CGI response
-NOTICED-I-SERVICE, http://www.grootersnet.nl:80
-NOTICED-I-CLIENT, 82.161.236.244
-NOTICED-I-URI, GET (100 bytes) /sysblog/wp-admin/admin-ajax.php?action=dashboard-widgets&widget=dashboard_primary&pagenow=dashboard
-NOTICED-I-SCRIPT, /sysblog/wp-admin/admin-ajax.php sysblog:[wp-admin]admin-ajax.php (phpwasd:) SYSBLOG:[WP-ADMIN]admin-ajax.php
-NOTICED-I-CGI, 504850205761726E696E673A20204D6F64756C652027646F (2048 bytes) PHP Warning: Module 'dom' already loaded in Unknown on line 0.
-NOTICED-I-RXTX, err:0/0 raw:1154/0 net:1154/0

yesterday from the workstation, and one today. But it all seems to work without problems.
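
The long hex string in the -NOTICED-I-CGI line is simply the start of the script’s output as ASCII bytes, shown next to the readable text. A minimal Python sketch to decode it, with the string copied from the log above (the variable names are made up for illustration):

```python
# First bytes of the script response, as logged by WASD in hex.
hex_prefix = "504850205761726E696E673A20204D6F64756C652027646F"

decoded = bytes.fromhex(hex_prefix).decode("ascii")
print(repr(decoded))        # the start of the PHP warning text
print(len(decoded), "bytes")
```

So the response began with a PHP warning instead of CGI headers, which is why WASD flagged it as not a strict CGI response.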

SCSI hardware is already present
When changing the memory, I also realized there are two SCSI cards in the box: one serves the internal disk (which holds the page and swap files), and the other connects to the external storage unit. The second is surely a KZPBA, and the first one probably is as well; it too has a 68-pin connector at the back. So there is no real need to install another KZPBA: it’s all there.
The one thing left to find out is whether terminators are present. Access is not that easy, so I didn’t check last night. So there will be another (short) outage, probably on a weekend.

Hot, hot, hot
Tonight everything was stalled. Not even the console triggered a reaction. Normally, if the screen has gone blank, there are two LEDs flashing on the keyboard, but even these were unlit. The reset button, which would normally invoke ^P, gave no response. So I stopped the machine the hard way (powered it off) and powered it on again, and there I found the system temperature was 44 degrees. So the machine had probably halted due to high temperatures. I restarted it (and found a bug in WASD?) and cooled it somewhat by pushing air through the vents in front. Now hoping it won’t get that hot again. (Yes, I know. I need air conditioning in here. As soon as I have the bucks, I’ll get one.)