watchdog

November 04, 2007

Availability, MTBF, MTTR and other bedtime tales

If we let A represent availability, then the simplest formula for availability is:
    A = Uptime/(Uptime + Downtime)

Of course, it's more interesting when you start looking at the things that influence uptime and downtime.  The most common measures that can be used in this way are MTBF and MTTR.

    MTBF is Mean Time Between Failures
    MTTR is Mean Time To Repair

    A = MTBF / (MTBF + MTTR)

One interesting observation you can make when reading this formula is that if you could instantly repair everything (MTTR = 0), then it wouldn't matter what the MTBF is - Availability would be 100% (1) all the time.
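To make the formula concrete, here is a minimal sketch in Python. The figures (a 1000 hour MTBF, a 4 hour manual repair, a 30 second failover) are made up purely for illustration, and the function name `availability` is my own, not part of any library.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """A = MTBF / (MTBF + MTTR); both arguments must use the same time unit."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures, chosen only for illustration.
MTBF = 1000.0  # hours between failures

for label, mttr in [("4 hour manual repair", 4.0),
                    ("30 second failover", 30.0 / 3600.0),
                    ("instant repair", 0.0)]:
    print(f"{label:>22}: A = {availability(MTBF, mttr):.6f}")
```

With the same MTBF, shrinking the repair time from hours to seconds moves availability from roughly 99.6% to better than 99.999%, and an MTTR of zero gives exactly 1.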

That's exactly what HA clustering tries to do.  It tries to make the MTTR as close to zero as it can by automatically (autonomically) switching in redundant components for failed components as fast as it can.  Depending on the application architecture and how fast failure can be detected and repaired, a given failure might not be observable at all by a client of the service.  If it's not observable by the client, then in some sense it didn't happen at all.  This idea of viewing things from the client's perspective is an important one in a practical sense, and I'll talk about that some more later on.

It's important to realize that any given data center or cluster provides many services, and not all of them are related to each other.  Failure of one component in the system may not cause failure of the system.  Indeed, good HA design eliminates single points of failure by introducing redundancy.  If you're going to try to calculate MTBF in a real-life (meaning complex) environment with redundancy and interrelated services, it's going to be very complicated to do.

    MTBFx is Mean Time Between Failures for entity x
    MTTRx is Mean Time To Repair for entity x
    Ax is the Availability of entity x

    Ax = MTBFx / (MTBFx + MTTRx)

In practice, these measures (MTBFx and MTTRx) are hard to come by for nontrivial real systems - in fact, they're so tied in to application reliability and architecture, hardware architecture, deployment strategy, operational skill and training, and a whole host of other factors, that you can actually compute them only very very rarely.  So, why did I spend your time talking about it?  That's simple - although you probably won't compute them, you can learn some important things from these formulas, and you can see how mistakes you make in viewing these formulas might lead you to some wrong conclusions.

Let's get right into one example of a wrong conclusion you might draw from incorrectly applying these formulas.

Let's say we have a service which runs on a single machine, and you put it onto a cluster composed of two computers, each with a certain individual MTBF (Mi), where you can fail over from a failed computer to the other one ("repair") in a certain repair time (Ri).  With two computers, they'll fail twice as often as a single computer, so the system MTBF becomes Mi/2.  If you compute the availability of the cluster, it then becomes:

    A = (Mi/2) / (Mi/2 + Ri)

Using this (incorrect) analysis for a 1000 node cluster performing the same service, the system MTBF becomes Mi/1000.

    A = (Mi/1000) / (Mi/1000 + Ri)

If you take the number of nodes in the cluster to the limit (approaching infinity), the Availability approaches zero.

    A = 0/(0+Ri) = 0/Ri = 0

This makes it appear that adding cluster nodes decreases availability.  Is this really true?  Of course not!  The mistake here is thinking that the service needed all those cluster nodes to make it go.  If your service was a complicated interlocking scientific computation that would stop if any cluster node failed, then this model might be correct.  But if the other nodes were providing redundancy or unrelated services, then they would have no effect on the MTBF of the service in question.  Of course, as they break, you'd have to repair them, which would mean replacing systems more and more often, which would be both annoying and expensive, but it wouldn't cause the service availability to go down.
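The difference between the two ways of counting can be sketched numerically. The values of Mi and Ri below are hypothetical, and the "service view" function is a deliberate simplification: it assumes the service needs only one node at a time and ignores the chance that the spare is also down when a failover is needed.

```python
def availability(mtbf: float, mttr: float) -> float:
    return mtbf / (mtbf + mttr)


MI = 10_000.0    # hypothetical per-node MTBF, in hours
RI = 1.0 / 60.0  # hypothetical failover ("repair") time: one minute, in hours


def naive_cluster_availability(nodes: int) -> float:
    """The incorrect model: every node failure counts as a service failure."""
    return availability(MI / nodes, RI)


def service_availability(nodes: int) -> float:
    """The service runs on one node at a time, so spare nodes add no failures.

    The node count is accepted only to make the comparison explicit.
    """
    return availability(MI, RI)


for n in (2, 1000, 1_000_000):
    print(n,
          round(naive_cluster_availability(n), 6),
          round(service_availability(n), 6))
```

The naive figure keeps dropping as nodes are added, while the service figure does not move, which is the point of the argument above.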

To properly apply these formulas, even intuitively, you need to make sure you understand what your service is, how you define a failure, how the service components relate to each other, and what happens when one of them fails.  Here are a few rules of thumb for thinking about availability:

  • Complexity is the enemy of reliability (MTBF).  This can take many forms:
    • Complex software fails more often than simple software
    • Complex hardware fails more often than simple hardware
    • Software dependencies usually mean that if any component fails, the whole service fails
    • Configuration complexity lowers the chances of the configuration being correct
    • Complexity drastically increases the possibility of human error
      • What is complex software? - Software whose model of the universe doesn't match that of the staff who manage it.
  • Redundancy is the friend of availability - it allows for quick autonomic recovery - significantly improving MTTR.  Replication is another word for redundancy.
  • Good failure detection is vital - HA and other autonomic software can only recover from failures it detects.  Undetected failures have human-speed MTTR or worse, not autonomic-speed MTTR.  They can be worse than human-speed MTTR because the humans are surprised that it wasn't automatically recovered and they respond more slowly than normal.  In addition, the added complexity of correcting an autonomic service and trying to keep their fingers out of the gears may slow down their thought processes.
  • Non-essential components don't count - failure of inactive or non-essential components doesn't affect service availability.  These inactive components can be hardware (spare machines), or software (like administrative interfaces), or hardware only being used to run non-essential software.  More generally, for the purpose of calculating the availability of service X, non-essential components include anything not running service X or services essential to X.

The real world is much more complex than any simple rules of thumb like these, but these are certainly worth taking into account.

October 05, 2007

How to use a watchdog timer

In an earlier posting[1], I mentioned that explaining how to optimally use a watchdog driver would be a good thing to talk about later. Now seems a good time to talk about that, giving a brief overview of some good techniques for getting the most out of your watchdog timer.

As was mentioned previously, one can have a software watchdog timer like softdog[2], or a hardware timer, or a watchdog utility like apphbd[3] (application heartbeat daemon). Although each method has its advantages and disadvantages, the methods that an application can use to take best advantage of them are very similar.

The basic idea of using a watchdog is simple:  Periodically send a heartbeat to the watchdog timer.  If the application fails to heartbeat in the specified interval, or exits prematurely, then a recovery action is taken.  So, all your application has to do  is set a timer and tickle the watchdog timer when your timer goes off.  Sounds extremely simple - and for the most part it is.
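The heartbeat contract can be sketched with a small in-process simulation. Everything here is hypothetical: `SimulatedWatchdog` and its methods are names I made up for illustration, not the softdog or apphbd interface, and a real watchdog does its checking from outside your process (in the kernel, in hardware, or in a separate daemon). The clock is injectable so the timing can be shown without actually sleeping.

```python
import time
from typing import Callable


class SimulatedWatchdog:
    """In-process stand-in for a watchdog timer (hypothetical, illustration only)."""

    def __init__(self, timeout: float, on_expire: Callable[[], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout = timeout
        self.on_expire = on_expire
        self.clock = clock
        self.last_beat = clock()
        self.expired = False

    def tickle(self) -> None:
        """Heartbeat: the application reports that it is still alive."""
        self.last_beat = self.clock()

    def check(self) -> bool:
        """Run the recovery action once if the heartbeat is overdue."""
        if not self.expired and self.clock() - self.last_beat > self.timeout:
            self.expired = True
            self.on_expire()
        return self.expired


# Drive the simulation with a fake clock instead of real sleeps.
now = [0.0]
events = []
dog = SimulatedWatchdog(timeout=5.0,
                        on_expire=lambda: events.append("recover"),
                        clock=lambda: now[0])

now[0] = 4.0
dog.tickle()   # heartbeat arrives in time
now[0] = 8.0
dog.check()    # 4 seconds since the last heartbeat: nothing happens
now[0] = 14.0
dog.check()    # 10 seconds of silence: the recovery action runs
print(events)
```

In a real deployment the recovery action would be something like restarting the application or rebooting the machine, rather than appending to a list.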

How to get into trouble with watchdog timers

If your application does disk I/O or grows in size as it runs, or calls functions or system calls that might block, then the timing of your application can change dramatically when the system is under heavy I/O load or memory pressure.  This can mean that your application is judged to be hung when it's not.  When this happens, the watchdog timer you're using will trigger a recovery action - maybe restarting your application, or rebooting the machine.  For this kind of a situation, this is probably not what you had in mind.  Of course, if you're in an HA environment like Linux-HA[4] where a machine reboot will cause a service failover, this may be exactly what's needed to straighten out your problem.  As always, YMMV[5].

When not to use watchdog timers

Watchdog timers need reasonably reliable real-time performance, and an application which you can modify and which runs periodically (or which you're willing to make run periodically).  If you can't modify the application, or it is expected to have extremely erratic real-time performance, or it doesn't run continuously, or it runs in an environment which services your application erratically, then watchdog timers may not be for you.

Making your watchdog timer do more

Having your application not appear to be dead is sort-of-OK, but not exactly a deep metric for how well your application is behaving. Since this code will run inside your application, and you have to modify your application anyway, you have the opportunity to make this a form of white-box testing [6]. To do this, tie tickling your watchdog timer to good measures of your program's sanity. Below are a few examples of how you might go about doing this.  Not every example is appropriate for every type of application, so take this as food for thought.

  • Audit your data structures. If you have several data structures representing work to do, clients, outputs, etc., you can audit your data structures for mutual consistency, and only send out heartbeats when you don't find any errors.  For example, perhaps every piece of work to be done ought to belong to an active client. This is an interesting technique - because it can allow for transient "false positives" or errors that get corrected by the natural flow of the application.  If you audit your data structures once every 10 seconds, but set your watchdog timeout to 30 seconds, then you can have a transient error last up to two iterations without causing a restart.  If it persists beyond that, your watchdog timer will take action.

  • Check for work being processed.  If you have work in your input queues, but no work has been completed since the last heartbeat, then suppress sending out a heartbeat until some work actually gets processed.

  • Check for old work to be done.  If you have work queues, you can skip heartbeats whenever you have work in your queues which is "too old".  This may be a symptom that your application isn't processing its work, or that it has somehow lost track of this particular piece of work.

Of course, these are just a few simple ideas, but they may spark some better ideas for your particular application.  Anything you can examine periodically to see if your application seems to be doing whatever it is it's supposed to do is a potential candidate for using to control when and whether to tickle your watchdog timer.
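The three checks above can be combined into a single "should I heartbeat?" decision. This is only a sketch under assumptions of my own: the `AppState` and `WorkItem` structures, the `should_heartbeat` function, and the 60 second age threshold are all hypothetical stand-ins for whatever your application really keeps track of.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class WorkItem:
    client_id: str
    enqueued_at: float  # seconds, from any monotonic clock


@dataclass
class AppState:
    active_clients: Set[str] = field(default_factory=set)
    queue: List[WorkItem] = field(default_factory=list)
    completed_total: int = 0          # incremented by the worker code
    completed_at_last_check: int = 0  # bookkeeping for the progress check


MAX_WORK_AGE = 60.0  # hypothetical "too old" threshold, in seconds


def should_heartbeat(state: AppState, now: float) -> bool:
    # Audit: every queued piece of work must belong to an active client.
    audit_ok = all(item.client_id in state.active_clients
                   for item in state.queue)
    # Progress: if work is waiting, something must have completed since
    # the previous check.
    progressed = state.completed_total > state.completed_at_last_check
    progress_ok = progressed or not state.queue
    # Age: no queued item may be older than the threshold.
    age_ok = all(now - item.enqueued_at <= MAX_WORK_AGE
                 for item in state.queue)
    state.completed_at_last_check = state.completed_total
    return audit_ok and progress_ok and age_ok


state = AppState(active_clients={"alice"})
print(should_heartbeat(state, now=0.0))   # idle and consistent
state.queue.append(WorkItem("alice", enqueued_at=0.0))
state.completed_total += 1
print(should_heartbeat(state, now=10.0))  # work waiting, but progress was made
print(should_heartbeat(state, now=20.0))  # work waiting, nothing completed
```

The application would call something like this each time its timer goes off and tickle the watchdog only when it returns True, so a persistent problem eventually lets the watchdog expire.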

Making your watchdog timer more reliable

Something you should really avoid is false positives on watchdog timeouts - especially when the consequence is to reboot the machine. Spurious reboots are typically frowned upon ;-).  This is all a fine idea, but how exactly do you go about doing that?  Here are a few tips to keep in mind.

  1. Tune your timeout interval. Some watchdog timers (like apphbd) have a warning level as well as a fatal timing level. Be sure and take advantage of the warning level to help you tune your application's heartbeat interval.  For example, if your application really ought to send out a heartbeat every second, you can set the warning time threshold to 1.5 seconds, and the fatal watchdog timer to a much larger value, say 5 or 10 seconds. That way, as you test, the warnings show you how close you come to the fatal time limit in your most extreme cases of load.

  2. Know your application's expected behavior.  If it is single-threaded and does short-lived tasks throughout the day, except for once a day when it produces a summary report, then keep that extra work and the delay it can cause in mind when setting your timer value.

  3. Make your application a better real-time citizen. Rather than processing all its work in one huge uninterruptible chunk, make sure you process chunks whose size is bounded by some constant amount of time, and allow interruptions (for processing watchdog timer requests) between these chunks.

  4. Consider using the glib mainloop task dispatcher[7]. The mainloop construct is great for event driven programs which don't actually require threading. We use it extensively in Heartbeat [4], and it has worked out very well for us.

  5. Consider using threads. Many people swear by threads as the only way to write any reasonable program. Many people (including many of those same people) swear at them[8]. When properly used, threads can be helpful in writing programs with better real-time behavior.

  6. Consider doing your disk I/O in a separate process (or thread). It's usually disk I/O or memory allocation (implying disk I/O) which is most likely to hang your program and give it unpredictable realtime behavior.

  7. Consider using asynchronous I/O. This is another technique for avoiding blocking by disk I/O. It's not terribly portable, and the API seems to me to be a little subject to change, but it's a really nice idea.

  8. Consider locking your program into memory. If your program needs about the same amount of real memory as it occupies virtual memory, then the system impact of locking your program into memory may not be high. This is not for everyone - because if everyone did this, then why bother implementing virtual memory?

  9. Consider setting your program with a soft-realtime (POSIX realtime) priority. Even more than the previous step, this step is not for everyone. If your program goes into an infinite loop, the entire system stops - end of story. Not good. But, if your program is critical, small, and well-behaved, this can be a reasonable thing to do.

  10. Consider running real time Linux [9]. This is an even more drastic step, but if your whole system needs to be real time, some great work has been done here which you might look into.

  11. If you're writing in Java, consider real-time Java from IBM[10]. A better recommendation might be to not use Java, but for some people Java is their religion. If you or your management insists on Java, then this may be just the ticket for you.
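As an illustration of tip 3 (bounded chunks of work with room for heartbeats in between), here is a minimal single-threaded sketch. The function `process_in_chunks`, its `tickle` callback, and the time budget are hypothetical names and values of my own. Note the limitation: a single task that runs longer than the budget still overruns it, so the tasks themselves need to be reasonably small.

```python
import time
from collections import deque
from typing import Callable, Deque


def process_in_chunks(work: Deque[Callable[[], None]],
                      tickle: Callable[[], None],
                      budget: float = 0.2) -> int:
    """Run queued tasks in slices of roughly `budget` seconds each,
    calling `tickle` between slices. Returns the number of slices."""
    slices = 0
    while work:
        deadline = time.monotonic() + budget
        # Always run at least one task per slice so the loop makes progress.
        work.popleft()()
        while work and time.monotonic() < deadline:
            work.popleft()()
        slices += 1
        tickle()  # the place to heartbeat (ideally after a sanity check)
    return slices


beats = []
tasks = deque((lambda: None) for _ in range(1000))
n = process_in_chunks(tasks,
                      tickle=lambda: beats.append(time.monotonic()),
                      budget=0.05)
print(n, len(beats))
```

The same shape fits naturally into an event loop: each slice becomes one callback, and the heartbeat is just another timer event.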

In summary, here are general ideas to keep in mind:

  • Tie tickling the watchdog timer to the sanity of your application

  • Do what is reasonable to improve the predictability of the realtime behavior of your program

Hopefully the tips above will help you do this.

References

[1] http://techthoughts.typepad.com/managing_computers/2007/09/monitoring---a-.html
[2] http://linux-ha.org/softdog
[3] http://linux.die.net/man/8/apphbd
[4] http://linux-ha.org/
[5] http://en.wiktionary.org/wiki/your_mileage_may_vary
[6] http://en.wikipedia.org/wiki/White_box_testing
[7] http://library.gnome.org/devel/glib/unstable/glib-The-Main-Event-Loop.html
[8] http://sourcefrog.net/weblog/software/languages/java/java-threads.html
[9] http://www-03.ibm.com/press/us/en/pressrelease/21232.wss
[10] http://domino.research.ibm.com/comm/research_projects.nsf/pages/metronome.index.html