How to get automated monitoring right

If you want to accurately and effectively monitor your systems, you need to avoid both false positives and false negatives.

This is a challenge I've faced a couple of times, at DataCash and at UK2.NET: how do you successfully monitor large automated systems? In slightly more detail: you need to find out, quickly and accurately, when anything goes wrong. (The alternatives are fixing things before they break, which is impossible, and fixing things only when your clients complain, which is unpalatable.)

Proper monitoring

The first issue is seemingly simple: you need to monitor everything. First of all, this means that you monitor your monitoring systems: the lack of any report is in itself a problem. (You do have a secondary data centre for this sort of purpose, right?) It also means that your monitoring systems produce two types of report: 1) a regular, everyday, behind-the-scenes "nothing to worry about" report that you only get to see if you really care, and 2) a "damn, something weird happened" report that gets sent to your inbox / mobile phone / pager as soon as possible.
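To make that two-report split concrete, here's a rough sketch in Python. All the names here (CheckResult, dispatch_reports and so on) are mine, made up for illustration, not anything either company actually ran:

```python
from dataclasses import dataclass

@dataclass
class CheckResult:
    name: str      # which check ran, e.g. "web" or "smtp"
    ok: bool       # did it pass?
    detail: str    # one-line human-readable summary

def dispatch_reports(results):
    """Split check results into the quiet digest and the urgent alerts.

    The digest is the behind-the-scenes "nothing to worry about" report;
    the alerts are what gets sent to someone's phone immediately.
    """
    digest = [r for r in results if r.ok]
    alerts = [r for r in results if not r.ok]
    return digest, alerts
```

The point of keeping both lists, rather than discarding the passes, is that an empty digest is itself suspicious: it might mean the checks never ran.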

Incidentally, never underestimate the value of full, comprehensive, unobtrusive logging. If you're doing something on your desktop machine and it crashes, or otherwise behaves fairly strangely, you've got some chance of knowing why things went wrong. (Often the machine will tell you, to your face.) But if there's a problem with a machine like a web server, well, your best hope is that shortly before everything went wrong it left you a Cthulhu-esque diary of a series of increasingly frightening events that got weirder and weirder as they went along, until the machine's brain snapped and it collapsed into a non-responsive gibbering heap.

(Hmmm. Perhaps I'm creating the wrong image here. After all, my point is that we need logs so we can work out, in retrospect, what went wrong, but any seasoned Cthulhu player knows that the last thing you do is read the notes left behind by a now insane / dead lunatic, because chances are that, by reading those notes, the same thing will happen to you.)

Anyway. It also means that you have to do something more intricate than just checking whether the box in question is actually up. You need to automate connecting to a web server and fetching a web page; you need to connect to an FTP server, send email, check that the email went where it should have gone, and so on. Or, if you're not prepared to do that, you need scripts that monitor the output of your web server, FTP server, MTA or whatever, and squawk when something weird happens.
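The web server case is the easiest to sketch: rather than pinging the box, actually fetch a page and check the response looks sane. A minimal version, using only the standard library (the URL and expectations are placeholders):

```python
import urllib.request

def check_http(url, expect_status=200, timeout=5):
    """Return True iff the server answers with the expected status.

    Any connection failure, timeout or unexpected status counts as a
    failure -- from the monitoring system's point of view they're all
    the same bad news.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == expect_status
    except OSError:
        return False
```

A real check would go further (match a string in the body, time the response), but even this catches the machine that pings fine while its web server sits there dead.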

No false alerts

OK, so the first issue is making sure that you catch every standard problem. The second is just as tricky: it's making sure that you only send out alerts for things that are actual problems. (It's the boy who cried wolf, but with technology.) Otherwise, when your staff come to work and read their email, and there's a whole bunch of random messages saying that something might be vaguely wrong, well, they're going to ignore them, either deliberately (delete all such mails) or unwittingly (not paying as much attention to those emails, because they know what they're going to say).
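One common way to cut down the wolf-crying (a sketch, not the only approach) is to require several consecutive failures before alerting, so a single transient blip never reaches anyone's inbox. The class name and threshold below are made up:

```python
class FlapGuard:
    """Only fire an alert after `threshold` consecutive failures."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    def record(self, ok):
        """Feed in one check result; return True when an alert should fire."""
        self.failures = 0 if ok else self.failures + 1
        # Fire exactly once, when the threshold is first crossed,
        # rather than on every subsequent failure.
        return self.failures == self.threshold
```

The trade-off is obvious: a threshold of three checks at five-minute intervals means real outages take up to fifteen minutes to surface. That's the price of staff actually reading the alerts.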

Also, if you check whether everything's fine every five minutes, say, and something breaks at 1pm and stays broken until 3pm, your systems shouldn't scream every five minutes that all hell has broken loose. In a situation like that, either a) staff know about it, but don't have time to turn off the warnings; or b) they don't know about it, which means the warnings have been useless. You need to be able to sit still if staff know about the problem, or escalate it if there's no response (e.g. the guy on call isn't answering his phone).
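That quiet-or-escalate behaviour can be sketched as a little state machine. Everything here — the class name, the fifteen-minute deadline, the three outcomes — is an assumption for illustration:

```python
import time

class Escalator:
    """Track one ongoing problem: alert once, then stay quiet or escalate."""

    def __init__(self, escalate_after=900):   # 15 minutes, say
        self.escalate_after = escalate_after
        self.first_seen = None
        self.acked = False

    def problem_seen(self, now=None):
        """Called each time a check still fails.

        Returns "alert" on the first failure, "quiet" while we wait for
        (or after) an acknowledgement, and "escalate" once the deadline
        passes with no response from the on-call person.
        """
        now = time.time() if now is None else now
        if self.first_seen is None:
            self.first_seen = now
            return "alert"
        if self.acked:
            return "quiet"
        if now - self.first_seen >= self.escalate_after:
            return "escalate"
        return "quiet"

    def acknowledge(self):
        """The on-call person has seen it; stop making noise."""
        self.acked = True
```

The key property is that a two-hour outage produces one page and, at worst, one escalation — not twenty-four identical screams.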

I'm not pretending UK2 get this right, by any means. But, if I had the opportunity to revamp our systems, these are the principles I'd bear in mind when I did.