08-Sep-2013

A family member sent a message – by SMS and Skype – that mail bounced. No mention of the reason,
A brief look into the system revealed that just the main process of PMAS was running – no worker processes. So I stopped PMAS – and got a number of error messages – mainly that logicals were not defined.
Next, I looked into the PMAS log files. The first strange thing was that there was a PMAS.LOG-2013-09-04, and a PMAS.LOG – dated 05-Sep-2013. I would expect files PMAS.LOG-2013-09-05, PMAS.LOG-2013-09-06 and PMAS.LOG-2013-09-07, but these were all missing.PMAS.LOG however ended with the last message at 22:26:34.06, a message being quarantined. But after that, it seems no new messages were received!
Looking into PMAS_STARTUP.LOG, that contains all messages also sent to OPCOM (and therefore, be logged in OPERATOR.LOG) revealed why:
5-SEP-2013 23:13:23.51 **
%PTSMTP-E-WORKERDIED, worker PTSMTP 005D (2126D8E0) terminated unexpectedly
-PTSMTP-I-WORKEREXIT, worker exiting
-PTSMTP-E-SOCKETFAILED, failed to create socket - -SYSTEM-W-DEVALLOC, device already allocated to another user
-PTSMTP-I-WORKERCONN, while processing connection from 62.175.202.16,17314

This specific address (62.175.202.16) continues sending messages at a high rate, until
** 7-SEP-2013 08:57:25.13 **
%PTSMTP-E-WORKERDIED, worker PTSMTP 022F (2126B0B5) terminated unexpectedly
-PTSMTP-I-WORKEREXIT, worker exiting
-PTSMTP-E-SOCKETFAILED, failed to create socket - -SYSTEM-W-DEVALLOC, device already allocated to another user
-PTSMTP-I-WORKERCONN, while processing connection from 62.175.202.16,50436

and after that, a number of other addresses took over – it might have been the same source using a anonymiser. These addresse included: 182.18.222.91, 201.240.162.58, 200.37.163.163, 123.16.92.3,
181.67.6.108, 113.172.23.195, and more.
At some point, the main program notices:
** 6-SEP-2013 00:12:25.18 **
%PTSMTP-I-DBSTATUS, status for configuration PMAS_ROOT:[DATA]PTSMTP.CONF:
Total of 9310 connections accepted; 0 workers currently active

This happens when someone tries to block the system for the same reason – as, for example in Jily, but the main process was able to restart a new workerprocess:
** 19-JUL-2013 02:12:24.88 **
%PTSMTP-I-DBSTATUS, status for configuration PMAS_ROOT:[DATA]PTSMTP.CONF:
Total of 4439 connections accepted; 0 workers currently active

** 19-JUL-2013 04:12:24.88 **
%PTSMTP-I-DBSTATUS, status for configuration PMAS_ROOT:[DATA]PTSMTP.CONF:
Total of 4441 connections accepted; 1 worker currently active
Pid Process Name State Connections Serviced
21257C5F PTSMTP 001D idle 2

and at some moments, the maximum of worker-processes (10) is reached, but it never caused a problem:
** 23-JUL-2013 23:05:44.23 **
%PTSMTP-I-DBSTATUS, status for configuration PMAS_ROOT:[DATA]PTSMTP.CONF:
Total of 4891 connections accepted; 10 workers currently active
Pid Process Name State Connections Serviced
21257F2D PTSMTP 0024 idle 226
212530A6 PTSMTP 0025 idle 27
21261FE7 PTSMTP 0026 exiting 1
21257068 PTSMTP 0027 exiting 1
2125FE69 PTSMTP 0028 exiting 1
2126236A PTSMTP 0029 exiting 1
2125D3ED PTSMTP 002A exiting 1
212602EE PTSMTP 002B exiting 1
2125D002 PTSMTP 002C exiting 1
2125F883 PTSMTP 002D exiting 1

since the surplus simply dies out.
But this aabuser had sent so many messages that the last ‘normal’ signal was written early 04-Sep-2013:
** 4-SEP-2013 04:12:25.17 **
%PTSMTP-I-DBSTATUS, status for configuration PMAS_ROOT:[DATA]PTSMTP.CONF:
Total of 8050 connections accepted; 2 workers currently active
Pid Process Name State Connections Serviced
2126DD69 PTSMTP 0055 idle 174
21265CA4 PTSMTP 0056 idle 16

After that, one of the server processes crashed, and a second was never brought into existence:

** 4-SEP-2013 06:12:25.17 **
%PTSMTP-I-DBSTATUS, status for configuration PMAS_ROOT:[DATA]PTSMTP.CONF:
Total of 8057 connections accepted; 1 worker currently active
Pid Process Name State Connections Serviced
-PTSMTP-I-WORKEREXIT, worker exiting
-PTSMTP-E-SOCKETFAILED, failed to create socket - -SYSTEM-W-DEVALLOC, device already allocated to another user
-PTSMTP-I-WORKERCONN, while processing connection from 117.217.5.97,10830

** 4-SEP-2013 06:12:25.17 **
%PTSMTP-I-DBSTATUS, status for configuration PMAS_ROOT:[DATA]PTSMTP.CONF:
Total of 8057 connections accepted; 1 worker currently active
Pid Process Name State Connections Serviced
21265CA4 PTSMTP 0056 idle 18

From that moment on, only one worker process remains – but when that one died, another was started. and when the bomb started, one after the other that was created went down:
** 5-SEP-2013 23:13:23.51 **
%PTSMTP-E-WORKERDIED, worker PTSMTP 005D (2126D8E0) terminated unexpectedly
-PTSMTP-I-WORKEREXIT, worker exiting
-PTSMTP-E-SOCKETFAILED, failed to create socket - -SYSTEM-W-DEVALLOC, device already allocated to another user
-PTSMTP-I-WORKERCONN, while processing connection from 62.175.202.16,17314

** 5-SEP-2013 23:13:34.03 **
%PTSMTP-E-WORKERDIED, worker PTSMTP 005A (2126A8DC) terminated unexpectedly
-PTSMTP-I-WORKEREXIT, worker exiting
-PTSMTP-E-SOCKETFAILED, failed to create socket - -SYSTEM-W-DEVALLOC, device already allocated to another user
-PTSMTP-I-WORKERCONN, while processing connection from 62.175.202.16,22436

** 5-SEP-2013 23:13:43.06 **
%PTSMTP-E-WORKERDIED, worker PTSMTP 005E (212733E4) terminated unexpectedly
-PTSMTP-I-WORKEREXIT, worker exiting
-PTSMTP-E-SOCKETFAILED, failed to create socket - -SYSTEM-W-DEVALLOC, device already allocated to another user
-PTSMTP-I-WORKERCONN, while processing connection from 62.175.202.16,24835

** 5-SEP-2013 23:14:35.74 **
%PTSMTP-E-WORKERDIED, worker PTSMTP 0060 (21272867) terminated unexpectedly
-PTSMTP-I-WORKEREXIT, worker exiting
-PTSMTP-E-SOCKETFAILED, failed to create socket - -SYSTEM-W-DEVALLOC, device already allocated to another user
-PTSMTP-I-WORKERCONN, while processing connection from 62.175.202.16,6114

** 5-SEP-2013 23:14:47.17 **
%PTSMTP-E-WORKERDIED, worker PTSMTP 0061 (2126C1EC) terminated unexpectedly
-PTSMTP-I-WORKEREXIT, worker exiting
-PTSMTP-E-SOCKETFAILED, failed to create socket - -SYSTEM-W-DEVALLOC, device already allocated to another user
-PTSMTP-I-WORKERCONN, while processing connection from 62.175.202.16,37761

until at some moment, none was active:
** 6-SEP-2013 00:11:58.20 **
%PTSMTP-E-WORKERDIED, worker PTSMTP 007A (2126ED0A) terminated unexpectedly
-PTSMTP-I-WORKEREXIT, worker exiting
-PTSMTP-E-SOCKETFAILED, failed to create socket - -SYSTEM-W-DEVALLOC, device already allocated to another user
-PTSMTP-I-WORKERCONN, while processing connection from 62.175.202.16,60929

** 6-SEP-2013 00:12:25.18 **
%PTSMTP-I-DBSTATUS, status for configuration PMAS_ROOT:[DATA]PTSMTP.CONF:
Total of 9310 connections accepted; 0 workers currently active

and it continues. The last statusmessage – that is normally sen every 2 hours – was found to be:
** 8-SEP-2013 14:12:25.19 **
%PTSMTP-I-DBSTATUS, status for configuration PMAS_ROOT:[DATA]PTSMTP.CONF:
Total of 10417 connections accepted; 0 workers currently active

until I stopped the program:
** 8-SEP-2013 16:09:23.24 **
%PTSMTP-I-CTRLEXIT, PTSMTP exiting
-PTSMTP-I-EXITREQUESTED, exit requested by operator

** 8-SEP-2013 16:09:23.24 **
%PTSMTP-I-DBSTATUS, status for configuration PMAS_ROOT:[DATA]PTSMTP.CONF:
Total of 10464 connections accepted; 0 workers currently active

In the end – when I stopped the service, it has ‘processed’ about 2500 message in 3 days ; these are the ones that actually came through. The number of SYSTEM-W-DEVALLOC messages in those days is about 850, and the number of messages that the service limit has been exceed, is about 90.
The real amount is found in the router logging: from the beginning until I stopped the service, the mail filter had to handle over 2400 message – of which over 1100 sent by 62.175.202.16, several every second.
Too much for PMAS to handle.
From a certain point, several other senders kick in but My guess is these are trying this for a long time; and because PMAS had been overloaded and is unable to start new worker processes, these encounter the same problem as other legitimate senders.

In hypewrSpy – on eof the tools in which I check the system performance – there is no real clue on what was going on. I still have to look into the T4 data but I doubt there will some clue.

In retrospect, last Friday I also noted that there was very little mail coming in – either deliverable of blocked by the system (well – the spam filter).

I have informed Process.com of the issue: That PMAS was unable – for some reason – to create a new process, although the system has slots available (the maximum seen is 67 where I have room for 110 processes), and that PMQAS should block any repeated requests over a certain amount from one address, like I have set for WASD; and signal – by email, for instance, that a site tries to block access to ail.
And, of course, the offending address – owned by ono.com in Spain – has been denied access in the router – and I would like to inform ONO.COM if only I found an address. But the site is already mentions a a potentially dangerous one. Probably trying to relay – but even that cannot be determined. The last log is of 4-Sep-2013 – so there hasn’t been an attempt since than – or could not be written?

13-Jan-2012

DoS attacks on blogs – part 2
One of the things I did after the server was restarted, was to define s throttle on the blogs, the wiki and the download area. This wioll oprevent this number of concurrent accesses to them, limiting the risks. It may mean that when I get a lot of requests, some will have to wait a bit longer, or will be queued for some time, or get a “Server busy” error. The issue is clearly visisble in yesterday’s history. Not so much in CPU – there are some spikes up to 25%:

At times there have been peaks in CPU, normally it’s just a few\percent, these spikes up to 25 are remarkable, esepcially in the timeframe in which they occur…
In memory usage – especially pagefile usage – the problem is clear:

Free memory is exhausted, but there is space enough avaliable in the pagefile, and far more cqan be paged into the files; but since none of the processes clearly ran out of virtual memory, there is something else that blocked processing. The only culprit in that case, might be MySQL….
The number of processes:

runs in to the roof – in steps, and it follows memory usage: starting to rise at six to stablize just a few minutes later, until the numbert of processes increases again at 7, again stabilizes at 9, and again increasing in steps until the system of out of slots at 10; in the next half hour, is seems some processes com to an end but immediately, that free slot is taken again….From that moment on, until the system is rebootes at 21:00, no new processes can be created; once in a while one ends, but another will take the slot immedeately…
Paging show2s a massive peak at 6:00 and 7:00.

These will be the moments that the large amount of processes are created. It matches the graph seen yesterday on the amount of requests on these moments.
The graph of buffered IO shows the same peaks:

So the problem started at 6:00, stabalized until 7:00, and then contibued until the system ran out of resources at about 10:30 (local time, which is UTC without DST on the system).

Armed with these data, I searched the access log.
At 4:00, 66.249.66.186 requested a few pages in the SYSMGR blog, but there were a few minutes between them and that is hardly noticble. Nothing weird, actually, this happens more often.
But at 6:00, 204.11.219.95 kicks in. Within 16 minutes this address fires 50 subsequent GETs on the Tracks blog index; the fist 20 succee; he get’s 503 errors on the next 5; the next 5 succeed, the next 6 result in 502 errors followed by succeeding.
About 15 minutes later, at 6:39, addresses 66.249.66.186 and 66.249.71.152 start accessing the SYSMGR blog index with straight requests. Several other addesses do the same for RSS feeds, others scanning for monthly indices to be displayed. Not that often – verey few minutes, but continuously. Most, of course, ending in a 500-style error after 10:30….

These 66.xxx.xxx.xxx addresses are owned by Google, and so do some of the RSS feeds. This is no big deal, it’s rather nomal. These come with a larger interval. The real culprit seems to be 204.11.219.95:
$ dig -x 204.11.219.95

; < <>> DiG 9.3.1 < <>> -x 204.11.219.95
;; global options: printcmd
;; Got answer:
;; ->>HEADER< <- opcode: QUERY, status: NXDOMAIN, id: 6130
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 0

;; QUESTION SECTION:
;95.219.11.204.in-addr.arpa. IN PTR

;; ANSWER SECTION:
95.219.11.204.in-addr.arpa. 43200 IN CNAME 95.219.11.204.in-addr.networkvirtue.com.

;; AUTHORITY SECTION:
networkvirtue.com. 2560 IN SOA a.ns.networkvirtue.com. hostmaster.networkvirtue.com. 1305226632 16384 2048 1048576
2560

;; Query time: 5843 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Fri Jan 13 13:10:14 2012
;; MSG SIZE rcvd: 149

$
$ whois 204.11.219.95
----Server: whois.arin.net [AMERICAS] response for 204.11.219.95
#
# Query terms are ambiguous. The query is assumed to be:
# "n 204.11.219.95"
#
# Use "?" to get help.
#

#
# The following results may also be obtained via:
# http://whois.arin.net/rest/nets;q=204.11.219.95?showDetails=true&showARIN=false&ext=netref2
#

Peak Web Hosting Inc. PEAK-WEB-HOSTING (NET-204-11-216-0-1) 204.11.216.0 - 204.11.223.255
Gal Halevy GAL-HALEVY-NETWORK (NET-204-11-219-64-1) 204.11.219.64 - 204.11.219.127

#
# ARIN WHOIS data and services are subject to the Terms of Use
# available at: https://www.arin.net/whois_tou.html
#

So I know where to signal abuse.
By the way: Mail reception stalled as well:

%%%%%%%%%%% OPCOM 12-JAN-2012 10:14:27.38 %%%%%%%%%%%
Message from user INTERnet on DIANA
INTERnet ACP SMTP Abort Request from Host: 192.168.0.2 Port: 54120

%%%%%%%%%%% OPCOM 12-JAN-2012 10:14:27.38 %%%%%%%%%%%
Message from user INTERnet on DIANA
INTERnet ACP AUXS failure Status = %SYSTEM-F-NOSLOT

Same problem….

12-Jan-2012

DoS attack on blogs
This morning, the web was unreliable in both speed of access and success. Webmail, which normally works like a charm, would react slowly, cause a 503 error, or time-out. The same to the opertion desk: the home page is a plain HTML file – no withles nor bells. That would load fine, although slow, but access to functions within the menu could cause similar problems.
Luckily, the WASD web-pages were acecsable.
Looking at what could cause the problems, I looked at the activity on the system, and I notices a really large number of PHP-server processes: meaning that someone was trying to blow the blogs to pieces.
So the next action was to stop these processes, but it seemed an impossible task: for any process I killed, another one was created. Or processes were said to be ‘suspended’…
Next stop: restart the webserver – which normally causes all pending PHP-servers to disappear. But not so this time; that is: the server list of running processes showed them gone, but SHOW SYSTEM still had them….So I retried – wityh no luck.
At some point, an error 500 (unexpected server error) was returned whenever a rfequest was send that would the webserver require to create a new process; but since the admin apges are handled internally, WATCH could show me the reason: “no pcb available”. The system was simply out of gas….But not completely, wherever the webserver could handle requests itself, like the admin pages, the images beyond the Trips,Tracks&Travels blog, or download files, that worked as before. Also mail and other processes normally running kept running as usual; a bit slower, perhaps.
It was not until later in the afternoon that I had the ability to solve the problem, because of this lack of process blocks, login wasn’t possible either – I just had to work from the console.
To my luck, the DecWindows session on my console was still up and running, so from there, I could try to clean up the mess. Each slot tahat would normally be open, was now occupied by a ser4ver-subprocess running PHP, in either LEF or LEFO state. So I stopped each of them. Next re-showing what was running, the processes re-appeared, to I tried again – with no result.
The only solution I had than was to reboot the server. After that, the webs worked like they should.

Next thing is to examine what happened….

24-Jan-2011

More malicious attempts blocked at the gate
I expected it to happen: When one network is blocked, attempts will come from another source. So in the cause of a few days, I blocked a few more subnets. It seems to help, since the number of alerts dropped, and accessing the site is more constant and stable. I’m working on a method to report on abusive access – no matter in what way: DoS, Spam, FTP store, HTTP(s) attempts to break into software I don’t run (or on a different location). And than I’ll be able to publish them…
One package though needs closer attention, as I think chnaging it’ s locations to ReadOnly is trouiblesome, I need some advise from the specialist in that area. But I removed a possible cause of trouible – when looking to it’s name, I suspect it was – and cleaned quite a lot of rubbish tghat is must have left behind. Also, the users I previously disabled have now been deleted, and the caches cleared.

(The Dutch railways offer Internet in some of the fast trains – free until 2012, according their site. That’s where I wrote the entry)