For the purposes of this posting, I'm concentrating more on things that can tell you if a particular service is working or not working and recover a failed service, and less on a datacenter-wide view of service health.
As I discussed in my previous posting on this topic, there are three basic ways to monitor a service:
- See if the process providing the service is alive
- Use the service API to determine if the service is well (simplified version: check to see if someone is listening on the port)
- Instrument the service to periodically heartbeat (check in with) a watchdog program or device
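The "simplified version" of a type 2 check really is just a TCP connect. Here's a minimal sketch in Python (host and port are whatever your service uses):

```python
import socket

def port_check(host, port, timeout=2.0):
    """Simplified type 2 check: is anything accepting connections on the port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A real type 2 monitor goes further and speaks the service's own protocol (for example, issuing an HTTP GET and checking the response) to verify the daemon isn't just accepting connections but actually answering.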
There are a lot of programs that monitor services using these three methods. Many of the open source programs providing this kind of monitoring are listed on the Linux-HA service monitoring page. Rather than reproduce that page here, I'll discuss a few of the better-known programs found there.
- Monit is very highly thought of by many sysadmins, and is often used for service monitoring and restart. The information in this post about monit was provided by the most excellent Christian Wilken.
Monit is a monitoring tool which can take the necessary actions to ensure service availability. It can monitor services locally or remotely by polling a specific port (type 2 monitoring). It can also monitor binary files: for example, you might want monit to watch your apache binary, or any other binary on the system. It checks the md5sum and the octal permissions of the file, warns the administrator if something has changed, and can unmonitor the affected service. It can also check the uid and gid of the file.
Through the integrated web interface you can see uptime, CPU load and memory consumption for each monitored service, and you can decide what to do when a service fails.
Here's an example: you want to monitor apache (port 80) and restart the service if it fails. If monit fails to restart apache (after, say, 3 tries), you can have it raise an alarm (and send mail to the admins). You can also set up the apache service monitoring to depend on apache_bin (the apache binary file). That means that if the apache binary has changed, the apache service is unmonitored until you take action.
Monit also monitors its own binary, so if you (or someone else, for that matter) update the monit code, you will also get an alarm and have to take action. However, it does not monitor its own operation, nor register itself with some other monitoring service like /dev/watchdog or apphbd. Monit is a flexible monitoring/automation tool, and its config file is simple and easy to learn: a simple yet effective tool.
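To make this concrete, here is roughly what the apache example above looks like in monit's control file (the paths and pidfile location are assumptions; adjust them for your distribution):

```
check process apache with pidfile /var/run/apache2.pid
    start program = "/etc/init.d/apache2 start"
    stop program  = "/etc/init.d/apache2 stop"
    if failed port 80 protocol http then restart
    if 3 restarts within 5 cycles then timeout
    depends on apache_bin

check file apache_bin with path /usr/sbin/apache2
    if failed checksum then unmonitor
```

The `timeout` action makes monit give up and alert after repeated failed restarts, and the `depends on apache_bin` line implements the binary-change behavior described above.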
- Mon is another well-thought-of monitoring tool. It is a straightforward Perl program which has goals very similar to Monit. Mon monitors services, and then takes actions when services are found to have failed. For example, it will notify an administrator, or restart a service. It comes with scripts which are capable of monitoring a large number of services. It has been around for many years, and has many happy users. It is simple to configure, normally monitors remotely, supports dependencies between monitors, and is typically used as a type 2 monitoring service. Like most similar services, nothing normally monitors mon's operation to see if it is still running, or operating correctly.
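A mon configuration for an HTTP check looks something like the fragment below (hostnames, address and schedule are placeholders; the shape follows the examples shipped with mon):

```
hostgroup web www.example.com

watch web
    service http
        interval 1m
        monitor http.monitor
        period wd {Sun-Sat}
            alert mail.alert admin@example.com
            upalert mail.alert admin@example.com
            alertevery 1h
```

The `monitor` line names one of the bundled monitor scripts, and the `alert`/`upalert` lines say what to do when the service goes down and comes back up.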
- Nagios is a very flexible system management package which provides a lot of capabilities for monitoring servers and services and taking a variety of actions when failures occur. It is designed to watch and manage your entire data center. For the most part it uses SNMP to do that, but it can use other monitoring methods as well. One of the more interesting features of Nagios is that it supports service dependencies. Normally, each service is defined by hand, and Nagios can take a wide variety of actions when things fail, including escalation actions. However, it does not natively provide process restarting. Recently Gerard Petersen has documented a way of getting Nagios to do this with a consistent set of rules. Understanding his method of doing this generically requires an extensive knowledge of Nagios, and Gerard describes it as "somewhat cumbersome"; nevertheless, it does add this capability to Nagios in a generic way. His restart technique is "type 1" monitoring (done remotely with ssh): it only checks whether the process is alive. You can tie this capability together with real service functionality checks (type 2) using dependencies. One pretty unsurprising caveat: if the service can't be stopped gracefully and it has to do a kill -9 to stop it, then the service may not work properly afterwards. Even so, it will attempt to kill the process and recover without manual intervention. In summary, Nagios works remotely, performs both type 1 and type 2 checks, and implements service dependencies.
It is worth noting that this Nagios technique will not work for servers which are part of an HA cluster, because there is no fixed association between a server name and which services are running on it. You could use it to monitor the HA server processes if you like, but usually the HA software will do that itself. Note that when you combine management services, you need to make sure you understand which management service is going to control what, know how each affects the other and keep them strictly out of each other's way.
Nagios does much more than just monitor services and restart daemons, however. It keeps statistics, provides historical and near-real-time performance and load graphs, offers an extensive set of alerts, and performs a variety of other system management functions. I don't know if or how the central Nagios processes are monitored.
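For a flavor of how restart-on-failure is usually wired into Nagios, here is a sketch using the generic event-handler mechanism (the host name, command name and script path are made up for illustration; this is not Gerard's specific method, just the underlying hook it builds on):

```
define service {
    host_name              web1
    service_description    HTTP
    check_command          check_http
    max_check_attempts     3
    event_handler_enabled  1
    event_handler          restart-httpd
}

define command {
    command_name  restart-httpd
    command_line  /usr/local/nagios/libexec/eventhandlers/restart-httpd $SERVICESTATE$ $SERVICESTATETYPE$
}
```

The event handler script receives the service state macros and can decide, for example, to attempt a restart only on a HARD CRITICAL state.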
- apphbd is part of the Linux-HA suite of programs. It performs type 3 monitoring of properly instrumented services. That is, you write a program which connects to apphbd and sends it heartbeats periodically. When your application doesn't send a heartbeat message by the expected time, or exits without informing apphbd of its intention to exit, apphbd notifies a recovery application of this condition, and that application restarts yours. Apphbd normally registers with /dev/watchdog to send heartbeats to the kernel watchdog device. As a precaution against false reboots, apphbd locks itself into memory and runs with soft realtime priority.
- Heartbeat from the Linux-HA project is typically thought of as a clustering package. However, it is perfectly happy to manage a single system, which it thinks of as a cluster with one node. It is capable of doing both type 1 and type 2 checks, implements service dependencies, and will automatically restart services if they die, including services which depend on the failed service. The configuration is in XML, and isn't difficult, but can be tedious.
For the case of monitoring a single machine, I wrote a prototype script which looks at your init.d directories for the set of active services and creates a cib.xml configuration file covering the services on your machine. Heartbeat will then poll those services to ensure that they are running, and restart them (and any services which depend on them) automatically. Because the configuration this script creates relies solely on the init scripts, the monitoring is done using the status actions the LSB defines, so for the most part these are type 1 monitors. To do type 2 service monitoring, it is necessary to create an R2-style Heartbeat configuration and make sure the services to be monitored have a repeating monitor action declared for them. Of course, the script mentioned before makes sure all this happens.
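For reference, an R2-style resource with a recurring monitor operation looks roughly like this in cib.xml (the ids and intervals are illustrative):

```xml
<primitive id="resource_apache" class="lsb" type="apache">
  <operations>
    <op id="apache_monitor" name="monitor" interval="30s" timeout="20s"/>
  </operations>
</primitive>
```

The `monitor` op is what turns a merely-configured resource into a monitored one; without it, Heartbeat starts the service but never re-checks it.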
If you wish to do more thorough monitoring, you can convert these scripts to OCF resource agents, modify the configuration, and perform any kind of monitoring you wish. Heartbeat itself can be monitored via SNMP or CIM, which lets you easily include Heartbeat in data-center-wide monitoring with Nagios, OpenNMS or another package. In addition, if you wish to protect yourself against OS hangs, crashes and hardware problems, you can easily add servers to your one-node cluster, make some small rule changes, and your service will continue running even if a server or OS dies. The process of changing from a 1-node to an n-node cluster isn't at all complicated, provided things are configured properly and you use the autojoin feature.
It is worth noting that the action which Heartbeat takes when a service can't be stopped is quite different from Nagios'. In Heartbeat, when a resource (service) can't be stopped, Heartbeat would normally reboot the machine. Although this sounds drastic, it can get most services running even when the problem is in the kernel. Since normally Heartbeat is managing a cluster, the service would normally be taken over immediately by another node in the cluster.
With regard to monitoring Heartbeat itself: it monitors its own operation, and it can be configured to heartbeat either with a watchdog device like softdog, or with apphbd (discussed above). When you put it into a cluster, the nodes of the cluster monitor each other, and use STONITH to reboot machines which have stopped working. If you use Heartbeat, then, you get as complete a set of monitoring as is available.
- cl_respawn is a simple tool which is packaged as part of the Heartbeat package. cl_respawn takes arguments which tell it how often to heartbeat with apphbd, and the value of a "magic" exit code: when the application exits with this return code, cl_respawn does not restart its child process. Although it only does type 1 monitoring, it doesn't poll to see if a process has died; the OS gives it a signal and it responds immediately. When you invoke a process with cl_respawn, it is necessary to make sure it does not fork off and run in the background.
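The general pattern cl_respawn implements can be sketched in a few lines of Python (the magic exit code value is made up for illustration, and the real cl_respawn also heartbeats with apphbd; this just shows the event-driven respawn loop):

```python
import os

MAGIC_EXIT = 100  # hypothetical "do not respawn" exit code

def supervise(argv, magic_exit=MAGIC_EXIT, max_respawns=None):
    """Respawn argv whenever it exits, unless it exits with magic_exit.

    Event-driven, like cl_respawn: os.waitpid() blocks until the child
    dies, so the restart happens immediately, with no polling interval.
    Returns the number of respawns performed.
    """
    respawns = 0
    while True:
        pid = os.fork()
        if pid == 0:
            # Child: must stay in the foreground (no daemonizing/forking)
            try:
                os.execvp(argv[0], argv)
            finally:
                os._exit(127)  # only reached if exec failed
        _, status = os.waitpid(pid, 0)   # blocks until the child exits
        code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
        if code == magic_exit:
            return respawns              # deliberate shutdown: don't respawn
        respawns += 1
        if max_respawns is not None and respawns >= max_respawns:
            return respawns              # give up (cl_respawn would alarm here)
```

Because the supervisor is the parent of the service, the "did it die?" question is answered by the kernel, for free, which is exactly why the child must not fork itself into the background.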
- Supervise from D. J. Bernstein's daemontools replaces the usual init.d with a service directory, so that things are installed, stopped, started and otherwise managed in a non-standard way. However, as supervise is part of this whole process, service monitoring and restarting come along for free. When a daemon is started it becomes a child of supervise, and when it dies, supervise is notified and respawns it immediately, without waiting for a poll interval. Supervise itself does not appear to be monitored for failures, and svscan, which starts it, appears to be unmonitored as well. Daemontools does not appear to support service dependencies. Like cl_respawn, when you invoke a process with daemontools it is necessary to make sure it does not fork off and run in the background.
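A daemontools service is defined by a run script in its service directory, and the key point, as noted, is that the daemon must stay in the foreground so it remains a child of supervise. A typical run script looks like this (the daemon name and its foreground flag are placeholders for whatever your daemon uses):

```sh
#!/bin/sh
# run -- exec'd by supervise; the daemon must not background itself
exec /usr/local/bin/mydaemon -foreground
```

The `exec` matters: it replaces the shell with the daemon, so that supervise's child really is the daemon itself and not an intermediate shell.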
- Watchdog drivers can either be implemented in software, like softdog, or in hardware. All the watchdog drivers that ship with Linux share the same API (the one implemented by softdog), so a program which works with softdog works quite nicely with the hardware-based watchdog drivers. It's worth noting that kernel hangs which completely disable kernel timer services can cause softdog to malfunction. The good news is that hangs like this are extremely rare. With a hardware-based watchdog device, it takes a hardware failure for it to fail to reset a hung machine; this, too, is extremely rare. If one is a true belt-and-suspenders sysadmin (with enough time and budget), one could use both a software and a hardware based watchdog.
- ldirectord is a specialized monitoring system which is delivered as a subpackage of the Linux-HA suite. It monitors servers and services in a load balancing cluster and takes dead servers out of the load balancing rotation, and restarts services when they stop responding. It is a type 2 service, and because it's an HA resource, is normally monitored by Heartbeat itself.
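An ldirectord configuration fragment showing its type 2 check looks something like this (the addresses, page and expected string are placeholders):

```
# ldirectord.cf
virtual=192.168.0.10:80
        real=192.168.1.2:80 masq
        real=192.168.1.3:80 masq
        service=http
        request="/index.html"
        receive="Test Page"
        scheduler=rr
        checktype=negotiate
```

With `checktype=negotiate`, ldirectord periodically requests the page from each real server; a server which fails to return the expected string is quietly removed from the load balancing rotation and re-added when it recovers.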
Howtoforge articles on monitoring.
Related systems - not quite on topic
Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids. It is based on a hierarchical design targeted at federations of clusters. It relies on a multicast-based listen/announce protocol to monitor state within clusters and uses a tree of point-to-point connections amongst representative cluster nodes to federate clusters and aggregate their state.
Although it doesn't attempt to repair problems, it is widely used in high-performance clusters and might be of interest to people reading this posting.
OpenNMS is a Java-based, enterprise-grade open source network monitoring platform. It consists of a community-supported open-source project as well as a commercial services, training and support organization.
The goal is for OpenNMS to be a truly distributed, scalable platform for all aspects of the FCAPS network management model, and to make this platform available to both open source and commercial applications.
OpenNMS is oriented towards responding to SNMP data and traps, but can be modified to monitor other things. It does not appear to me to be well-suited to recovering from a service or process hanging, nor for monitoring a single system. Where it shines is in managing an entire data center with a large number of SNMP-enabled servers and network devices.