Sometimes when you've been doing something for a long time, it's easy to take for granted the things you know. In recent days, I've run into several people who think that a "no news is good news" methodology for monitoring can't possibly be reliable. So, this blog post is about that - how the Assimilation Monitoring Project (AMP) follows a "no news is good news" methodology and still reports failures reliably.
For monitoring, what you want is the assurance that if something fails, you'll know about it. In the worst case, you might not diagnose it correctly, but you'll know that something's wrong. So, let's look at how this is architected, walk through how each component is monitored, and see whether there's any way for a failure to happen, slip past our monitoring net, and go unreported.
You Can't Tell The Players Without A Program
There are a variety of components that are used in the monitoring process for a server named S. They are:
- The hardware for S
- The network switch port that S is connected to or cable or NIC in S
- The nanoprobe running on S
- The OS for S
- The agents performing monitoring on S
- The nanoprobes, servers or NICs on S's neighbors - let's call them S+1 and S-1
- The CMA - running under an HA system similar to Linux-HA or Pacemaker
- The network fabric connecting S and its neighbors to the CMA
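To make the neighbor relationships above concrete, here is a minimal sketch (not AMP's actual code) of a heartbeat ring: each server exchanges heartbeats with the servers on either side of it, so any single failure is observed by two independent neighbors.

```python
# Hypothetical model of a heartbeat ring. Server names are illustrative.

def ring_neighbors(servers, s):
    """Return the (S-1, S+1) ring neighbors of server s."""
    i = servers.index(s)
    n = len(servers)
    return servers[(i - 1) % n], servers[(i + 1) % n]

servers = ["A", "B", "C", "D"]
print(ring_neighbors(servers, "A"))  # ('D', 'B')
```

The wrap-around indexing is what makes this a ring rather than a chain: the first and last servers are each other's neighbors, so every server has exactly two observers.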
So, let's see what happens when each of these fails.
If the hardware for S fails, for example, loses power or crashes, then the nanoprobe on S will stop sending heartbeats and both S+1 and S-1 will report the failure to the CMA.
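The "stop sending heartbeats" detection can be sketched as a simple deadline check - a neighbor is declared dead when no heartbeat has arrived within some deadtime window. The deadtime value and data shapes here are illustrative, not AMP's actual configuration.

```python
# Hedged sketch of heartbeat deadtime detection (not the real nanoprobe code).

DEADTIME = 10.0  # seconds without a heartbeat before declaring failure (illustrative)

def dead_neighbors(last_heard, now, deadtime=DEADTIME):
    """Return the peers whose most recent heartbeat is older than deadtime."""
    return [peer for peer, t in last_heard.items() if now - t > deadtime]

# S last heard from 30 seconds ago, S+2 only 2 seconds ago:
last_heard = {"S": 100.0, "S+2": 128.0}
print(dead_neighbors(last_heard, now=130.0))  # ['S']
```

Both S+1 and S-1 run the equivalent of this check independently, which is why a hardware or OS failure of S produces two reports to the CMA rather than one.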
If the switch port, cable or NIC for S fails, then S+1 and S-1 will report the failure to the CMA, and S will attempt (but fail) to report problems with S+1 and S-1. The CMA will splice S out of the loop, and S+1 and S-1 will become neighbors. This can also happen if a firewall blocks our packets. Except for the nanoprobe running on the CMA server, all communication uses the same port - so, it is unlikely that a firewall rule would block communication to the CMA and not also block heartbeats to S+1 and S-1.
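The splice operation described above is straightforward to picture: the CMA removes S from the ring, and the servers that flanked it become direct heartbeat partners. A minimal sketch, with illustrative names:

```python
# Hypothetical sketch of the CMA splicing a failed server out of the ring.

def splice_out(ring, failed):
    """Return a new ring with the failed server removed; its former
    neighbors become adjacent and heartbeat each other directly."""
    return [s for s in ring if s != failed]

ring = ["S-1", "S", "S+1", "S+2"]
print(splice_out(ring, "S"))  # ['S-1', 'S+1', 'S+2']
```

After the splice, S-1 and S+1 are adjacent in the ring, so monitoring coverage is restored without S.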
If the nanoprobe running on S fails (crashes), then S+1 and S-1 will report that S has crashed. Not strictly accurate, but a failure does get reported.
If the OS for S crashes or stops scheduling user processes, then the nanoprobe on S will stop sending heartbeats and both S+1 and S-1 will report the failure to the CMA.
If the agents performing monitoring fail by a timeout, the timeout will be reported to the CMA.
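One common way to detect such a timeout is to run the monitoring agent under a deadline and treat expiry as a reportable failure in its own right. This is a hedged sketch of that pattern; the command and deadline are illustrative, not AMP's actual agent mechanism.

```python
# Illustrative agent-timeout detection using a subprocess deadline.
import subprocess

def run_agent(cmd, timeout=5):
    """Run a monitoring agent command; classify the outcome."""
    try:
        rc = subprocess.run(cmd, timeout=timeout).returncode
        return "ok" if rc == 0 else "failed"
    except subprocess.TimeoutExpired:
        return "timeout"  # reported to the CMA as a monitoring failure

# An agent that hangs past its deadline is reported as a timeout:
print(run_agent(["sleep", "10"], timeout=1))  # timeout
```

Note that a timeout is deliberately distinguished from a clean failure: a hung agent tells you something different from a service that answered "down".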
If the agents performing the monitoring fail with a false positive, then a failure will be reported to the CMA even though the service is healthy. If they fail with a false negative, then no failure will be reported to the CMA. Both cases are most likely caused by a bug in the monitoring agent.
If neighbor S+1 fails, then S-1 is still able to report failures of S. If both neighbors S+1 and S-1 fail, then temporarily no one is able to report failures of S. However, the CMA will detect an anomaly, because S should also have reported the failure of S+1. The anomaly can trigger more extensive probing from the CMA to reveal the actual situation. For any given failure of any given S, there should always be two or four servers (depending on neighbor topology) reporting that failure. If the number of nanoprobes reporting the failure is not what it should be, then the CMA can probe systems to discover what the actual situation is. Note that this is a very rare occurrence, and it is unlikely that the software will diagnose the exact failure correctly, but humans can be notified that something odd has happened.
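The reporter-count anomaly check described above can be sketched in a few lines: the topology predicts how many reports a genuine failure should produce, and any other count warrants deeper probing. The topology names and counts here are assumptions for illustration.

```python
# Hedged sketch of the CMA's reporter-count anomaly check.

def expected_reporters(topology):
    # In a simple ring each server has two neighbors; a denser neighbor
    # arrangement (assumed here for illustration) could have four.
    return {"ring": 2, "dense": 4}[topology]

def is_anomalous(reports_received, topology="ring"):
    """True if the number of failure reports doesn't match the topology."""
    return reports_received != expected_reporters(topology)

print(is_anomalous(1))  # True  - only one neighbor reported; probe further
print(is_anomalous(2))  # False - both neighbors reported, as expected
```

Too few reports suggests a reporter has also failed (or is partitioned); too many suggests something stranger. Either way, the check only flags the anomaly - diagnosis is left to deeper probing or a human.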
If the CMA (or the server running it) fails when an HA system is protecting it, then there is a period of time when reports of all types from nanoprobes cannot be received. Once the CMA restarts, its IP address will be taken over by the new CMA, and reports will once again be processed. All communication takes place by reliable UDP, so packets will be retransmitted until the new CMA is ready to process them. It is important to note that packets are not acknowledged by the CMA until they have been fully processed or stored on disk in a place that will survive a failover. Currently, packets are not stored on a persistent queue; they are always fully processed before acknowledgements are sent.
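The "acknowledge only after processing" rule is what makes the failover window safe. A toy model, with illustrative names and retry counts - not AMP's actual reliable-UDP implementation: the sender retransmits until it gets an acknowledgement, and the receiver acknowledges only once processing has completed, so a crash before processing simply causes another retransmission.

```python
# Toy model of retransmit-until-ack with ack-after-processing semantics.

def deliver(packet, process, max_retries=5):
    """Retransmit packet until process() succeeds; return the attempt count."""
    for attempt in range(1, max_retries + 1):
        try:
            process(packet)      # fully process BEFORE acknowledging
            return attempt       # this attempt was acknowledged
        except ConnectionError:  # CMA down or failing over: no ack, retransmit
            continue
    raise RuntimeError("gave up after %d attempts" % max_retries)

# Simulate a CMA that is mid-failover for the first two transmissions:
state = {"down": 2}
def cma_process(pkt):
    if state["down"] > 0:
        state["down"] -= 1
        raise ConnectionError
    # ...process or durably store pkt here, then the ack is sent...

print(deliver("heartbeat-report", cma_process))  # 3
```

Because no acknowledgement is ever sent for work that wasn't completed, the failover loses no reports - it only delays them by a few retransmissions.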
If the network fabric between S and the CMA fails, there are a few cases to consider. If any of our monitored servers is connected to the failed switch, then the failure will be detected as noted earlier. If this is a backbone switch to which none of our nanoprobes are directly connected, then this failure may not be detected by the current software. If any of our heartbeat paths go through this switch, then we will detect that something is wrong, but we may not diagnose it correctly. With a fairly dense coverage of servers in the infrastructure by nanoprobes, the probability of this going unnoticed is low. Although it might be desirable to do so, I don't know of any systems that test every possible path through the switching hierarchy.
These types of backbone switches also cannot be discovered by our current stealthy CDP or LLDP methods. Although it is straightforward to create agents to monitor these switches, it does not fit as naturally into the current architecture as do servers, services, and directly connected switches. This is similar to the case of systems on which we cannot install agents. They can be monitored, but unlike the more normal cases, it is a completely manual effort to configure them and interpret the results.
With more "noisy" discovery of switches through SNMP, such switches could be automatically configured and the overall topology could be more fully mapped - ensuring that the monitoring agents are not stuck behind the failing switch. This is not planned for the near future.
In summary, through the judicious use of redundancy, the AMP architecture does a very creditable job of being quite reliable while sticking to a "no news is good news" philosophy.