Software failures are common - if you've used software, you've seen it fail. They're among the most common causes of service outages, second only to power failures. So, if you want your service to work reliably, something has to monitor the software that implements it to see whether it's working. Software outages are estimated at around 25% of all unplanned outages. Like most general estimates measured against other people's computers - YMMV - Your Mileage May Vary.
A sufficient condition for program triviality is that it have no bugs.
So, this is clearly a big chunk of potential outages to watch over and try to eliminate or recover from when you can. Of course, before you can recover from a problem, you have to know there's a problem. That's where monitoring comes in. Monitoring software is software that monitors (or watches) other software.
Local or Network Monitoring
There are lots of packages for monitoring software available in the wild. I tend to favor monitoring software that runs on the machine being monitored - it works even when the network is down, it minimizes network traffic, and security concerns are greatly reduced. But it can be relatively painful to administer, since you have to install and configure software on each machine, and you have to separately monitor the servers. If you're monitoring thousands of servers this way, it's potentially very painful (at least without the right tools for managing it centrally). Since I like clusters, I'll mention that if you're running a cluster, this monitoring can be administered centrally on the cluster, making it much simpler than it would otherwise be - in fact, it may come "for free" when you configure the services in the cluster.
How to Monitor Services?
You have software on your server that implements services. So, how are you going to monitor it? There are basically two ways to monitor software externally - the easy way, and the not-so-easy way.
- A really simple way to monitor a service is to just look to see if it's still running. If you're on a UNIX-like machine, you can do a ps to see if it's still running, and if it is, say "Good service!", and if it's not, say "Bad service - Go to your room!". If the majority of your software failures result in the process exiting (leaving a core or not), then this will work just fine for you. This is how most init script status actions work. However, if the software hangs, or gives crazy or otherwise incorrect results, this method won't detect that kind of misbehavior for you - you'll need a less-simple method. It's also worth noting that, simple as this is, you have to be on the machine that's running the service. You can't directly see the process table of a remote machine. This technique is the basis for how the UNIX respawn init directive works - when a service exits it gets automatically restarted. Simple, but usually too simple - which is why it only sees limited use in practice. You could argue that good HA systems spend half of their time implementing the respawn directive right ;-).
- A not-so-simple-way to monitor a service is to use the service and see if the result you got was reasonable, and completed in a reasonable time. For example, if you are monitoring a web server, you might do an HTTP GET operation on the port the server is on, and see if you get reasonable-looking HTML and a non-error return code in a "reasonable" time. For example, the Linux-HA apache resource agent does exactly this using the wget command. If you have a database, you might do a short query whose answer is easily sanity-checked. This is what the Linux-HA DB2 resource agent does (as do several other database resource agents). Another advantage of this technique is that you can often perform it remotely - which works out well if you want to monitor everything centrally.
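Both approaches above can be sketched in a few lines of shell. This is a minimal sketch, not any real resource agent's code - the service name, URL, and timeout values are made-up placeholders:

```shell
#!/bin/sh
# Simple check: is the process in the process table at all?
# (Only works on the machine actually running the service.)
check_process() {
    pgrep -x "$1" > /dev/null    # exit 0 if a process with that name exists
}

# Not-so-simple check: exercise the service and sanity-check the result,
# with a time limit - roughly what an HTTP resource agent does with wget.
# The default URL and the 10-second timeout are example values.
check_http() {
    url=${1:-http://localhost:80/}
    wget -q -O /dev/null --timeout=10 --tries=1 "$url"
}

# Example usage:
#   check_process httpd          && echo "Good service!" || echo "Bad service!"
#   check_http http://localhost/ && echo "sane"          || echo "hung or broken"
```

Note that `check_http` can be pointed at a remote machine's URL, while `check_process` cannot - which is exactly the local-versus-remote distinction described above.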
Is there some kind of standard for monitoring UNIX-like services?
- The Linux Standard Base defines a standard for init scripts which tells how to start, stop, and determine the status of a service. It's the status part that we care about here - because this implements a poor-man's simple monitoring operation.
- The Open Cluster Framework defines a standard for resource agents (RAs) which tells how to start, stop, and (lucky for us) monitor a service. In fact, the monitor action is interesting, because it defines multiple levels of monitoring. Maybe you want to do something lightweight frequently, and something heavier-weight less often. If so, then the OCF RA standard may be for you. For the interested, it's simple to convert an init script into an OCF resource agent - since the OCF RA specification was built on top of the LSB init specification.
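To make the status idea concrete, here's a minimal sketch of an LSB-style init script for a hypothetical daemon (the daemon name and pidfile path are invented for illustration). Per the LSB convention, status exits 0 for a running service and 3 for a stopped one:

```shell
#!/bin/sh
# Minimal LSB-style init script sketch for a hypothetical daemon "mydaemon".
DAEMON=/usr/sbin/mydaemon                   # made-up path
PIDFILE=${PIDFILE:-/var/run/mydaemon.pid}   # made-up pidfile location

do_status() {
    # LSB exit codes: 0 = running, 1 = dead but pidfile exists, 3 = stopped
    [ -f "$PIDFILE" ] || return 3
    kill -0 "$(cat "$PIDFILE")" 2>/dev/null || return 1
    return 0
}

case "${1:-}" in
    start)  "$DAEMON" & echo $! > "$PIDFILE" ;;
    stop)   [ -f "$PIDFILE" ] && kill "$(cat "$PIDFILE")" && rm -f "$PIDFILE" ;;
    status) do_status ;;
    *)      echo "Usage: $0 {start|stop|status}" ;;
esac
```

An OCF resource agent looks much the same, with `monitor` in place of `status` - which is part of why converting one into the other is straightforward.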
Both of these standards only work for monitoring services locally. That is, you have to be on the server to use either. So, you can't use them if you want to monitor services remotely.
Quis custodiet ipsos custodes? - Who will watch the watchers?
If your monitoring software fails, how would you know? It's software, and like all software it might fail - so who's monitoring it to make sure it works? As you can see, this is a potentially endless problem. In my experience, no one has any endless solutions - only endless problems. But there is a reasonable way to approach it - create a hierarchy of watchers.
Applications (which can also watch themselves) are watched by a watcher program, which in turn registers with a watchdog driver. In Linux, that can be either the softdog watchdog driver or a hardware watchdog driver. To use a watchdog driver, the watching program has to check in and tickle the driver periodically. If it doesn't, the system reboots - a good reason to make the top-level watcher as simple as possible. How to tickle a watchdog optimally is good fodder for a future post.
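Here's a sketch of what tickling looks like from user space, assuming the standard Linux watchdog device node (the 10-second interval in the comment is a made-up value; a real watcher would check its charges before each tickle):

```shell
#!/bin/sh
# Sketch: keeping a (soft)watchdog happy from user space.
# If we ever stop writing to the device, the driver reboots the machine.
WATCHDOG=${WATCHDOG:-/dev/watchdog}

tickle_once() {
    # Any write to the device resets its countdown timer.
    printf '.' > "$WATCHDOG"
}

stop_watchdog() {
    # Writing 'V' just before closing the device is the "magic close",
    # telling the driver to stop the countdown (if the driver allows it).
    printf 'V' > "$WATCHDOG"
}

# A top-level watcher loop might look like:
#   while tickle_once; do
#       sleep 10    # must be comfortably under the watchdog timeout
#   done
```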
Unfortunately, the standard Linux softdog driver only allows one program at a time to use it. Unless you want to restrict yourself to only one watcher on a system, you have to have some kind of watchdog program that the watchers register with. The Linux-HA project provides the apphbd (application heartbeat daemon) designed expressly for this purpose. So, you register with apphbd, and apphbd registers with the watchdog driver - which watches it. Apphbd can watch as many programs as you want - but they have to register with it, and they have to tickle it (send it heartbeats) periodically. That means your application has to be modified to do it - which is different from the two earlier approaches. Just like the watchdog driver, you have to tickle apphbd periodically, or it notifies someone who will take a recovery action. Fortunately, however, apphbd doesn't reboot the machine - normally the application just gets restarted if it dies or stops sending heartbeats. Avoiding unnecessary reboots is generally thought to be a good thing ;-) - so most people like this a lot.
In the process of talking about how to watch watchers, we've added a third method of monitoring applications. The first two could be done with any application, and the third only by applications you've modified. To summarize, they are:
- Checking to see if the process is running (local only)
- Exercising an application API to check for application sanity (local or remote)
- "Tickling" a watchdog program or driver (typically local) - requires modifying the application.
But what if the kernel fails, or the driver fails, or the watchdog hardware fails - what do you do then? Really, there's only one thing to do: let another machine watch this one, and give it the ability to kill your machine using something like STONITH. Voila! You've just made an HA cluster. How does that monitoring machine get monitored? You let the cluster members monitor each other - resulting in a sort of mutual monitoring society. Can this fail? Of course! Given enough problems, anything can fail - but it's really unlikely. Trying to make this kind of problem go away completely helps you discover that paranoia can be a very expensive hobby.
This idea of hierarchical watchers is generally a good pattern to follow for monitoring. For example, have each machine monitor its own services, have each cluster monitor its member machines, then monitor the clusters centrally. This minimizes network traffic, and it scales to very large environments.
I've covered the basic concepts of service monitoring, without quite getting around to talking about how to implement it in practice. This will be covered in a future posting.
Gregory Pfister, In Search of Clusters, 2nd ed., Prentice Hall, 1998, p. 390.
http://linux-ha.org/STONITH or http://en.wikipedia.org/wiki/STONITH