Somehow people seem to think that heartbeats and pings are the same thing. They're not. Heartbeats are typically semi-intelligent elements in a hierarchy of watchers. Pings can't play the same role. This blog post talks about why, and about how hierarchies of watchers work, both in general and more specifically in the case of the Assimilation Monitoring Project (AMP), which breaks with tradition in a useful and novel way.
Let's start with pings. The ping command sends a literal ICMP echo request packet from one machine to another. When the request is received, it is returned as an ICMP echo reply. This is often done at a very low level in the OS, or might even be done by the NIC itself - which only requires power to the NIC and a network connection. This means that an ICMP ping doesn't test very much at all. In addition, response to pings is often disabled for security reasons. So this is a very poor method of determining even the minimal health of a system. All you really know for sure is that it has power, and was booted at some time in the past. Not very helpful.
Heartbeats have two portions, which are semi-independent. There is the heartbeat sender, which is typically a user process that sends heartbeats at some interval as instructed. There is the heartbeat receiver, which expects and listens for heartbeats. If it goes without hearing expected heartbeats for a certain period of time, it will report that fact upstream - in our case to the CMA.
Heartbeats are typically sent by user processes which are running under the OS, and if the OS stops scheduling user processes, heartbeats will stop being sent. In addition the packets they send go through the normal user process queuing system, and if that fails, heartbeats will not be sent. Many more things are validated as operational by sending and expecting heartbeats than by pings.
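To make the receiver side described above concrete, here is a minimal sketch in Python. It is not AMP's actual code: the class name HeartbeatReceiver, the deadtime parameter, and the method names are all hypothetical, and timestamps are passed in explicitly so the logic can be followed without real timers or sockets.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatReceiver:
    """Tracks when each peer was last heard from (all names are hypothetical)."""
    deadtime: float                                  # seconds of silence tolerated
    last_heard: dict = field(default_factory=dict)   # peer -> time of last heartbeat
    reported: set = field(default_factory=set)       # peers already reported as silent

    def expect(self, peer, now):
        """Start expecting heartbeats from `peer` as of time `now`."""
        self.last_heard[peer] = now

    def heartbeat(self, peer, now):
        """Record a heartbeat that arrived from an expected peer at time `now`."""
        if peer in self.last_heard:
            self.last_heard[peer] = now
            self.reported.discard(peer)              # the peer is being heard again

    def check(self, now):
        """Return peers that have newly exceeded the deadtime.

        In a real system this list would be reported upstream.
        """
        late = [peer for peer, seen in self.last_heard.items()
                if now - seen > self.deadtime and peer not in self.reported]
        self.reported.update(late)
        return sorted(late)
```

Each silent peer is returned only once until it is heard from again, so the upstream system gets a report when something changes and nothing otherwise.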
Quis custodiet ipsos custodes?
A key thing about monitoring systems for availability is that everything needs to be watched, and all the watchers need to be watched. In Latin, this comes out to be the classic question "Quis custodiet ipsos custodes?" - Who will watch the watchmen? The original question is more commonly thought about in political terms regarding corruption, but it applies very well here, where any computer can fail and, for reliability, you want to make sure that as many failures as possible are caught - so they can be repaired - whether by human or automatic means.
It is common in highly reliable systems like telephone systems to have a hierarchy of watchers, where the behavior of various elements is observed by other elements through a well-organized hierarchy - until the top level is reached. These techniques may also be connected to watchdog timers, which reboot the machine unless they are positively acknowledged in a fashion similar to heartbeats. These can also assist the watching hierarchy when properly used.
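The watchdog idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Watchdog class and its method names are hypothetical, time is passed in explicitly, and where a hardware watchdog would reboot the machine, this sketch merely reports that the deadline has passed.

```python
class Watchdog:
    """A software stand-in for a watchdog timer (hypothetical names)."""

    def __init__(self, timeout, now=0.0):
        self.timeout = timeout
        self.deadline = now + timeout

    def pet(self, now):
        """Positive acknowledgement: push the deadline out by one timeout."""
        self.deadline = now + self.timeout

    def expired(self, now):
        """True once the deadline passes without a pet().

        A real watchdog would reset the machine at this point.
        """
        return now >= self.deadline
```

Note the inversion compared with a heartbeat receiver: the watched software has to keep proving it is alive to the timer, and the penalty for silence is a reset, not a report.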
Once you reach the top of the hierarchy, hierarchies fail us - and then we wonder who will watch the watcher at the top. In addition, hierarchies don't scale with low effort - one has to keep growing the hierarchy as the number of systems grows.
Within a single nanoprobe in AMP, the hierarchy is pretty normal - our nanoprobes invoke resource agents which watch the resources that they are assigned to watch. The nanoprobes themselves set timers to watch the resource agents that are watching the resources. But now the problem gets more difficult. Who will watch the nanoprobes? Although watchdog timers could be of some help in recovering from certain kinds of failures, they don't solve the problem of failure detection, and are a bit drastic for most customers.
In a conventional highly available computer cluster such as Linux-HA or Pacemaker, complex multicast consensus algorithms such as Paxos or Corosync ensure that failures on other systems can be properly observed. Unfortunately, these algorithms require multicast and don't scale to thousands of systems.
Breaking With Tradition (oh my!)
It is at this point that AMP's architecture breaks with tradition. We arrange for the overwhelming majority of systems to simply monitor each other: each system is monitored by two other systems, and each system in turn monitors two other systems. By connecting the systems in a circle, like people standing in a circle holding hands, we give each system two neighbors which it watches, and which also watch it. Because each system is monitored by two other systems, losing a single system doesn't cause us to lose monitoring capability (there's no single point of failure). This looks a lot like the picture below.
In AMP's monitoring scheme, each system only has to know about two other systems, and each system is only known about by two other systems. By closing the circle, the problem of who watches the watchers is solved in a more scalable way than a hierarchy - and without requiring multicast.
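The circle arrangement is easy to sketch. The Python below is an illustration, not AMP's implementation: ring_neighbors and live_watchers are hypothetical helper names, and the sketch assumes at least three systems so that the two neighbors are distinct.

```python
def ring_neighbors(systems):
    """Map each system to its (previous, next) neighbor in a closed ring."""
    n = len(systems)
    if n < 3:
        raise ValueError("need at least three systems for two distinct neighbors")
    return {name: (systems[(i - 1) % n], systems[(i + 1) % n])
            for i, name in enumerate(systems)}


def live_watchers(layout, failed):
    """For each surviving system, list the neighbors still able to watch it."""
    return {name: [w for w in neighbors if w not in failed]
            for name, neighbors in layout.items()
            if name not in failed}
```

For example, with five systems "a" through "e", ring_neighbors gives "a" the neighbors ("e", "b"). If "c" then fails, live_watchers shows that "b" and "d" are each still watched by one neighbor, and both of them are in a position to notice that "c" has gone silent.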
As a result, our central system (the CMA) doesn't have to actively watch these other systems, because they are watching each other. In our implementation, the CMA does have to lay out the neighbor arrangements, but once set up, the watching takes place without intervention.
And at the top level, the CMA is small enough to use, or get the assistance of, standard clustering techniques.
This article covers many of the same issues as the "How to implement 'no news is good news' monitoring reliably" posting from a different perspective. It seems good to present this information in several ways, because it seems a bit counterintuitive.