In an earlier posting[1], I mentioned that explaining how to optimally use a watchdog driver would be a good thing to talk about later. Now seems a good time to talk about that, giving a brief overview of some good techniques for getting the most out of your watchdog timer.
As was mentioned previously, one can have a software watchdog timer like softdog[2], or a hardware timer, or a watchdog utility like apphbd[3] (application heartbeat daemon). Although each method has its advantages and disadvantages, the methods that an application can use to take best advantage of them are very similar.
The basic idea of using a watchdog is simple: Periodically send a heartbeat to the watchdog timer. If the application fails to heartbeat in the specified interval, or exits prematurely, then a recovery action is taken. So, all your application has to do is set a timer and tickle the watchdog timer when your timer goes off. Sounds extremely simple - and for the most part it is.
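Here is a minimal sketch of that loop in Python. It assumes the Linux watchdog device interface, where any write to the device counts as a heartbeat and writing the character V just before closing asks drivers that support "magic close" to disarm the timer. The device path in the docstring, the function name, and the intervals are illustrative assumptions, not part of any particular driver's documentation. The device is passed in as a file-like object so you can try the loop against an in-memory buffer first.

```python
import io
import time


def heartbeat_loop(device, interval_s, keep_running, sleep=time.sleep):
    """Tickle `device` once per interval until keep_running() returns False.

    `device` is any writable binary file-like object. On Linux this would
    typically be open("/dev/watchdog", "wb", buffering=0) (an assumption;
    check your driver's documentation).
    """
    beats = 0
    while keep_running():
        device.write(b"\0")   # any write counts as a heartbeat
        device.flush()
        beats += 1
        sleep(interval_s)
    # Orderly shutdown: "V" asks drivers with magic-close support to disarm
    # the timer, so a deliberate exit is not treated as a crash.
    device.write(b"V")
    device.flush()
    return beats


if __name__ == "__main__":
    # Dry run against an in-memory buffer instead of the real device.
    fake = io.BytesIO()
    remaining = iter([True, True, True, False])
    n = heartbeat_loop(fake, 0.01, lambda: next(remaining))
    print(n, fake.getvalue())
```

In a real application the heartbeat would normally be driven by whatever timer or event loop the program already has, rather than by a dedicated sleep loop like this one.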
How to get into trouble with watchdog timers
If your application does disk I/O, grows in size as it runs, or calls functions or system calls that might block, then the timing of your application can change dramatically when the system is under heavy I/O load or memory pressure. This can mean that your application is judged to be hung when it's not. When this happens, the watchdog timer you're using will trigger a recovery action - maybe restarting your application, or rebooting the machine. In this kind of situation, that is probably not what you had in mind. Of course, if you're in an HA environment like Linux-HA[4] where a machine reboot will cause a service failover, this may be exactly what's needed to straighten out your problem. As always, YMMV[5].
When not to use watchdog timers
Watchdog timers need reasonably reliable real-time performance, and an application which you can modify and which runs periodically (or which you're willing to make run periodically). If you can't modify the application, if it is expected to have extremely erratic real-time performance, if it doesn't run continuously, or if it runs in an environment which services it erratically, then watchdog timers may not be for you.
Making your watchdog timer do more
Having your application not appear to be dead is sort-of-OK, but not exactly a deep metric for how well your application is behaving. Since this code will run inside your application, and you have to modify your application anyway, you have the opportunity to make this a form of white-box testing [6]. To do this, tie tickling your watchdog timer to good measures of your program's sanity. Below are a few examples of how you might go about doing this. Not every example is appropriate for every type of application, so take this as food for thought.
Audit your data structures. If you have several data structures representing work to do, clients, outputs, etc., you can audit your data structures for mutual consistency, and only send out heartbeats when you don't find any errors. For example, perhaps every piece of work to be done ought to belong to an active client. This is an interesting technique - because it can tolerate transient "false positives", or errors that get corrected by the natural flow of the application. If you audit your data structures once every 10 seconds, but set the watchdog's heartbeat timeout to 30 seconds, then a transient error can last up to two audit iterations without causing a restart. If it persists beyond that, your watchdog timer will take action.
Check for work being processed. If you have work in your input queues, but no work has been completed since the last heartbeat, then suppress sending out a heartbeat until some work actually gets processed.
Check for old work to be done. If you have work queues, you can skip heartbeats whenever you have work in your queues which is "too old". This may be a symptom that your application isn't processing its work, or that it has somehow lost track of this particular piece of work.
Of course, these are just a few simple ideas, but they may spark some better ideas for your particular application. Anything you can examine periodically to see if your application seems to be doing whatever it is it's supposed to do is a potential candidate for using to control when and whether to tickle your watchdog timer.
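To make the three checks above concrete, here is a hedged sketch of a heartbeat that is only sent when the application looks sane. The `WorkItem` and `AppState` classes, the field names, and the 60 second age limit are all hypothetical; your own data structures will look different, and `tickle` stands in for whatever call actually sends your heartbeat.

```python
from dataclasses import dataclass, field


@dataclass
class WorkItem:
    client_id: str
    enqueued_at: float          # seconds, same clock as `now` below


@dataclass
class AppState:
    active_clients: set = field(default_factory=set)
    queue: list = field(default_factory=list)
    completed: int = 0                  # total items finished so far
    completed_at_last_beat: int = 0     # snapshot taken at the last heartbeat


def audit_ok(state):
    """Every queued item must belong to an active client."""
    return all(item.client_id in state.active_clients for item in state.queue)


def making_progress(state):
    """With work waiting, something must have finished since the last beat."""
    return not state.queue or state.completed > state.completed_at_last_beat


def no_stale_work(state, now, max_age_s):
    """No queued item may be older than max_age_s."""
    return all(now - item.enqueued_at <= max_age_s for item in state.queue)


def maybe_heartbeat(state, now, tickle, max_age_s=60.0):
    """Call tickle() only if all sanity checks pass; report what happened."""
    healthy = (audit_ok(state)
               and making_progress(state)
               and no_stale_work(state, now, max_age_s))
    if healthy:
        tickle()
        state.completed_at_last_beat = state.completed
    return healthy
```

If you call `maybe_heartbeat` every 10 seconds against a 30 second watchdog timeout, a check that fails once or twice and then recovers never triggers a recovery action, which is the tolerance for transient errors described above.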
Making your watchdog timer more reliable
Something you should really avoid is false positives on watchdog timeouts - especially when the consequence is to reboot the machine. Spurious reboots are typically frowned upon ;-). This is all a fine idea, but how exactly do you go about doing that? Here are a few tips to keep in mind.
Tune your timeout interval. Some watchdog timers (like apphbd) have a warning level as well as a fatal timing level. Be sure to take advantage of the warning level to help you tune your application's heartbeat interval. For example, if your application really ought to send out a heartbeat every second, you can set the warning time threshold to 1.5 seconds, and the fatal watchdog timer to a much larger value, say 5 or 10 seconds. That way, as you test, the warnings show you how late your heartbeats get in your most extreme cases of load, and how close you come to the fatal limit.
Know your application's expected behavior. If it is single-threaded and does short-lived tasks throughout the day, except for once a day when it produces a summary report, then keep that extra work and the delay it can cause in mind when setting your timer value.
Make your application a better real-time citizen. Rather than processing all its work in one huge uninterruptible chunk, make sure you process chunks whose size is bounded by some constant amount of time, and allow interruptions (for processing watchdog timer requests) between these chunks.
Consider using the glib mainloop task dispatcher[7]. The mainloop construct is great for event driven programs which don't actually require threading. We use it extensively in Heartbeat [4], and it has worked out very well for us.
Consider using threads. Many people swear by threads as the only way to write any reasonable program. Many people (including many of those same people) swear at them[8]. When properly used, threads can be helpful in writing programs with better real-time behavior.
Consider doing your disk I/O in a separate process (or thread). It's usually disk I/O or memory allocation (implying disk I/O) which is most likely to hang your program and give it unpredictable realtime behavior.
Consider using asynchronous I/O. This is another technique for avoiding blocking by disk I/O. It's not terribly portable, and the API seems to me to be a little subject to change, but it's a really nice idea.
Consider locking your program into memory. If your program needs about the same amount of real memory as it occupies virtual memory, then the system impact of locking your program into memory may not be high. This is not for everyone - because if everyone did this, then why bother implementing virtual memory?
Consider running your program at a soft-realtime (POSIX realtime) priority. Even more than the previous step, this step is not for everyone. If your program goes into an infinite loop, the entire system stops - end of story. Not good. But, if your program is critical, small, and well-behaved, this can be a reasonable thing to do.
Consider running real time Linux [9]. This is an even more drastic step, but if your whole system needs to be real time, some great work has been done here which you might look into.
If you're writing in Java, consider real-time Java from IBM[10]. A better recommendation might be to not use Java, but for some people Java is their religion. If you or your management insists on Java, then this may be just the ticket for you.
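Two of the tips above, tuning against a warning threshold and processing work in time-bounded chunks, can be sketched together. The function below is an illustration only: the names, the 0.2 second chunk budget, and the 1.5 second warning threshold are assumptions, and `tickle` again stands in for your real heartbeat call. It works through a list of items, sends a heartbeat between chunks, and records the longest gap between heartbeats so you can see how much headroom you have.

```python
import time


def process_in_chunks(items, handle, tickle, chunk_budget_s=0.2,
                      warn_gap_s=1.5, clock=time.monotonic):
    """Process `items`, tickling the watchdog between time-bounded chunks.

    Returns (max_gap_s, warnings): the longest time between two heartbeats
    and how many gaps exceeded warn_gap_s. All thresholds are illustrative.
    """
    max_gap = 0.0
    warnings = 0
    last_beat = clock()
    tickle()
    pending = list(items)
    i = 0
    while i < len(pending):
        chunk_start = clock()
        # Work until the chunk's time budget is spent (always at least one item).
        while i < len(pending):
            handle(pending[i])
            i += 1
            if clock() - chunk_start >= chunk_budget_s:
                break
        now = clock()
        gap = now - last_beat
        max_gap = max(max_gap, gap)
        if gap > warn_gap_s:
            warnings += 1       # a real program would log this for tuning
        tickle()
        last_beat = now
    return max_gap, warnings
```

Note that the budget is only checked between items, so a single slow item (the once-a-day summary report, say) can still blow well past it. The reported maximum gap is what tells you whether such an item needs to be split up or moved elsewhere, or whether the fatal timeout simply needs more room.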
In summary, here are general ideas to keep in mind:
Tie tickling the watchdog timer to the sanity of your application
Do what is reasonable to improve the predictability of the realtime behavior of your program
Hopefully the tips above will help you do this.
References
[1] http://techthoughts.typepad.com/managing_computers/2007/09/monitoring---a-.html
[2] http://linux-ha.org/softdog
[3] http://linux.die.net/man/8/apphbd
[4] http://linux-ha.org/
[5] http://en.wiktionary.org/wiki/your_mileage_may_vary
[6] http://en.wikipedia.org/wiki/White_box_testing
[7] http://library.gnome.org/devel/glib/unstable/glib-The-Main-Event-Loop.html
[8] http://sourcefrog.net/weblog/software/languages/java/java-threads.html
[9] http://www-03.ibm.com/press/us/en/pressrelease/21232.wss
[10] http://domino.research.ibm.com/comm/research_projects.nsf/pages/metronome.index.html