monitoring

June 10, 2008

Watch that basket!

The computing industry has lots of trends, numerous buzzwords, and a number of hot topics.  Sometimes these are in conflict with each other, or at least start out that way...  But, in the end, there are often good ways to harmonize all these various things.

Let's wander into virtual machine territory again today.  If you have gone to the trouble to create a bunch of virtual machines, the chances are you hope to do a little server consolidation - because when that's properly done it can save you some money.

This sounds good, and indeed has lots of good things going for it.  It's buzzword compliant, it's green, it saves you green (money).  What's not to like?

To see what you might not like if this is all you do, let's take an example to make it obvious...

If you put all your virtual machines on one physical server, then if that server fails, you lose all your virtual machines.  If you put ten virtual machines on one server, then the impact of that server crashing is roughly ten times as great as the crash of a single non-virtualized server.  If you work at it, you might be able to consolidate the ten most critical virtual machines onto a single server - and bring your entire data center to a halt with just one crash - bringing a suddenly much more personal meaning to the term "shock and awe".

This is not typically what people are looking for in their data center - and could easily be one of those career-limiting mistakes that you'd like to avoid - unless you already have your next job lined up.

This falls under the "putting all your eggs into one basket" way of doing business.  This is part of a famous quote - but not the whole quote.  Mark Twain said "Put all your eggs in the one basket and --- WATCH THAT BASKET"[1].  So, to follow Mark Twain's advice, we need to not just put our eggs into one basket, we also need to watch that basket.

As most of you already know, watching servers and services is most commonly done by high-availability software - something like Linux-HA[2].  A properly configured HA system will watch the basket for you, and keep the worst from happening to your basket, your servers or your career.

As you can see, doing virtualization for reasons of consolidation doesn't make much sense unless you also add management software (HA software or otherwise) to watch your basket of virtual machines for you.

In the end, it's easy to see that all these things are connected - virtualization, server consolidation, power savings (green computing), and availability management - and you want to manage them all together.

[1] http://herbison.com/herbison/broken_eggs_watch.html
[2] http://linux-ha.org/

December 13, 2007

How Managed Virtualization (including HA) conflicts with System Management

Managed Virtualization Versus System Management

In an earlier post[1], I talked about a couple of kinds of virtualization, comparing two of them and highlighting their strengths.  This posting discusses how virtualization can confuse and confound conventional systems management - both automated and manual, and gives some thoughts on how to deal with it.

We all know that virtualization is a GoodThing(TM).  Therefore, it can't really have any disadvantages, can it?  <tongue-in-cheek-off> Unfortunately, it does have disadvantages.  The great strength of virtualization is its ability to break the ties between a service or operating system and the server which implements its service.  Many software systems and a good number of human beings find this confusing.  If I want to reboot a physical server, what services or operating systems will be disrupted by the reboot?

Conversely, if I want to do something to the machine that's running a particular service, which machine do I have to log into?  If you're running service virtualization (conventional HA like Linux-HA[2]) on top of server virtualization (à la Xen or VMware), then you have a doubly difficult task - first you have to figure out which virtual machine is running a service, then you have to figure out which physical machine is running that particular virtual machine.
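
The two questions above (which machine do I log into, and what breaks if I reboot this box) amount to a two-level lookup and its inverse.  Here is a minimal sketch of that bookkeeping, assuming a made-up inventory; every service, VM and host name below is hypothetical, and in a real data center these mappings change whenever the HA or virtualization software moves something.

```python
# Hypothetical inventory: all names are invented for illustration only.
service_to_vm = {"web": "vm-a", "db": "vm-b", "mail": "vm-c"}
vm_to_host = {"vm-a": "host1", "vm-b": "host1", "vm-c": "host2"}


def host_for_service(service):
    """Follow service -> virtual machine -> physical host."""
    return vm_to_host[service_to_vm[service]]


def services_on_host(host):
    """List every service that a reboot of this physical host would disrupt."""
    return sorted(
        service
        for service, vm in service_to_vm.items()
        if vm_to_host[vm] == host
    )
```

For example, `host_for_service("db")` answers "where do I log in?", while `services_on_host("host1")` answers "what goes down if I reboot host1?".  The hard part in practice is not the lookup but keeping the two tables current.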

This can be really annoying and can easily result in system administrators[3] making mistakes either in the middle of the night, or when under pressure (which all sysadmins know is pretty much all the time).

Remember - Complexity is the Enemy of Reliability.   This is just another example of my favorite phrase at work.

And, if you want to have server monitoring software which tries to figure out whether a service is stopped and have it restart it, then it can also get confused by the fact that all these stupid servers and services are always moving around.  They just won't stay put!  Back in the olden days, you logged into a server and you edited the inittab, and you always knew what hardware it was running on and what server it was.  Now, with virtualization, and especially with virtualization management software, you never know what's where.

A Recipe for Chaos and Conflict

Your HA software and/or your virtualization management software can move things around on you.  Imagine that you have these four kinds of things in your data center:

  • High-Availability (HA/service-virtualization) management software

  • Virtualization management software

  • System management monitoring software

  • Human system administrators

This is a recipe for chaos, interspersed with the occasional career-limiting disaster. It's this kind of thing that leads system administrators to pull their hair out, and keep their resumes up to date.  None of these is bad by itself, in fact, each is a GoodThing(TM).  But they don't normally play well with each other. In typical myopic software design fashion, each of these layers is usually unaware of the other layers (except, of course for the last (human) layer - who has to make up for all the poor integration).

In addition, since the software layers typically aren't aware of all this wonderful virtualization going on, they can't really deal with the picture reliably.  They don't know what should be happening where, because it isn't fixed.  The various virtualization management packages keep changing things!

So, what's a body to do?  As far as I know, there are two basic options.

  1. Integrate the four layers of management with each other using things like CIM[4] and SNMP[5]

  2. Empower your HA software to also manage the server virtualization of your data center

Integration of Layers

Virtually every data center (sadly, pun intended) has a variety of server types and a variety of operating systems, and a variety of management software.  They mostly don't play well with each other.  Almost the only way to get them to play together - even if imperfectly - is to have them talk together using industry standard protocols.

Today, that means using SNMP or CIM.   Here is my personal view on the characteristics of these two protocols for your consideration.

  • SNMP - widely deployed - implemented in a truly compatible way, but far too weak for a job this hard.  SNMP is great for grabbing statistics, checking whether a server or router is up and what kind of load it is seeing in great detail.  Anything much beyond this, and the MIBs become 100% vendor-specific - meaning that cross-vendor integration breaks down - basically completely.  For HA clustering or virtualization management or worse yet the combination of the two - forget it.

  • CIM - widely deployed in expensive disk subsystems - but rarely deployed outside that.  It has newly developed models for virtualization and clustering, but like most standards they're mostly lowest-common-denominator standards, and unfortunately not widely deployed.  For example, Linux-HA[2] implements CIM, but unfortunately Linux-HA has tremendous power and capability which CIM can't begin to model.  So, this winds up being only possible to model using vendor-specific extensions - greatly weakening the possible integrations.

Now, I'm not saying that these two protocols are useless - far from it. Without open standards like CIM and SNMP, the prospect truly is hopeless.   But I am saying that integrating them in the typical-for-the-industry highly-heterogeneous data center is a challenge, and the more layers there are to integrate, the bigger the challenge.  Since standards necessarily trail industry practice, the more "bleeding edge" the topic (i.e., HA clustering or virtualization) and the more powerful the underlying tool (like Linux-HA), the greater the mismatch.

Here we have two bleeding edge topics and four layers.  Yikes!  Surely there must be some kind of alternative to this somewhat-unattractive mess.

Decrease The Layers and Let Them Manage Themselves

As I mentioned in my earlier virtualization posting, some HA packages (like Linux-HA) can also manage virtualization simultaneously.  So, one way of dealing with this is to let (or extend) your service virtualization product also manage your server virtualization.  One advantage of this approach is that service virtualization software (HA software) is comparatively mature technology, minimizing the risk.

Unfortunately, this doesn't yet go all the way in solving the problem either.  There are a few things that should change to make this really work well. These include

  • Support much larger HA clusters - hundreds to thousands of nodes.  In an ideal world, you'd really like as few of these HA/virtualization clusters as you can get.  Today you'd typically have to have one of these clusters for every 8-32 physical servers - which makes an awful lot of these things to manage in a data center containing hundreds or thousands of servers.

  • Integrate with many virtualization layers - Such a product would need to integrate with Xen, IBM System Z, IBM System P, Linux KVM, VMware, and future virtualization layers like the one promised by Microsoft.   This isn't rocket science, but by the time you're done, it will be some work.

  • Support monitoring and controlling services inside the virtual machine - Otherwise you haven't really integrated the two layers - and you wind up running some HA software inside some of the virtual machines.  Again, this isn't rocket science, but it will require some work[1] for each operating system you want to manage services for.

  • Integrate with provisioning systems - so that you can add and delete virtual machines and allocate disk to them and their applications with fewer possibilities for error, and more automation.

None of these items are technically difficult, and none of them are prohibitively expensive to implement.  Given that I'm the project leader for Linux-HA, and Linux-HA is one of the most capable HA products around, you might imagine that some of these thoughts are on my mind for our future  ;-).  Of course, that doesn't eliminate the necessity for integration with the remaining layers above, which is why Linux-HA implements both CIM and SNMP.  This allows the virtualization management infrastructure to actively and autonomically manage  servers and services, while letting it bubble up events (especially those it can't automatically recover from) to the management consoles and humans via protocols like SNMP and/or CIM.

Conclusions

Virtualization technologies add complexity to the data center along with the benefits they bring, and in the process may render the existing management facilities less than useful.  However, if HA and virtualization management are performed by a single entity, and open standards like CIM and SNMP are used, systems can be actively managed and the problems can be minimized.

See Also

Preparing for Virtual Management http://www.itbusinessedge.com/blogs/dcc/?p=276

References

[1] http://techthoughts.typepad.com/managing_computers/2007/09/virtualization-.html
[2] http://linux-ha.org/
[3] http://linux-ha.org/SysAdmin
[4] http://www.dmtf.org/standards/cim/
[4] http://en.wikipedia.org/wiki/Common_Information_Model_%28computing%29
[5] http://en.wikipedia.org/wiki/Simple_Network_Management_Protocol

December 04, 2007

A brief overview of load balancing techniques

Something that people commonly do which involves a form of automation is load balancing.  Load balancing is the idea that incoming network requests are distributed across a set of servers which then each provide the same service.  If you spread the load across "n" servers, then in an ideal world what you get is "n" times the throughput.  And, since you have redundant servers, with the right kind of automation software, you can also get a degree of high-availability.   This is way cool!  This article will talk about load balancing as a general technique, and specifically about ways to do it on Linux using free or open source software.  In particular we'll talk about the Linux Virtual Server project[1], (LVS, ipvs) and the Cluster IP[2] as load balancing techniques.

Meanwhile back in the real world, we see some slight differences from this ideal view of things.  We see that load balancers often introduce single points of failure, and that the load balancer or some kind of back end servers typically introduce scalability limitations.  To really understand these problems, we need to look at specific load balancing techniques in a little more detail.  Please understand that I'm not an in-depth expert on any of these techniques, but I do have basic familiarity with the methods described here.

Linux Virtual Server
The first technique we'll cover is the Linux Virtual Server[1] (LVS) - which is implemented by the ipvs kernel module.  Much of what I have to say about LVS also applies to most load balancers - hardware or software, since they typically work roughly the same way as LVS.

I usually describe  LVS clusters as being similar to a baseball[3] diamond - with the load balancer on third base, web (or other) "real servers" stretched from home plate to second base, and the back end database on first base.  In this image, requests flow from the left to right starting from the users in the dugout to the left of second base foul line,  and responses flow from right to left from the database or file server on first base back to the users in the dugout.  [This imagery works great when talking to Americans or Japanese on the phone, but often fails for people from other cultures].

The first thing to notice is that the only inherently scalable portion of this arrangement is the web servers in the middle.  The load balancer (on third base) and the database server (on first base) are each potentially performance bottlenecks and potentially single points of failure.

If you make each of them redundant to eliminate single points of failure, the picture looks something like this:

[Figure: Diamondha640]

There are a number of variations on this basic theme:

  • Failover vs load sharing load balancers

  • Different applications on the "real servers" instead of WAS / Web servers.

  • Different routing techniques for the load balancer

  • Different data sources instead of a DB2 database

In the end, however, they look a lot the same, and work very similarly.

In a NAT[5] arrangement, both incoming and outgoing packets flow through the LVS director.  In a direct routing arrangement, only incoming packets flow through the LVS director, and outgoing packets bypass the director, and go directly to the clients.

LVS monitoring
Although you could set this all up by hand and start all the services by hand, if anything failed, then you'd have to reconfigure things by hand.  Since the theme of this blog is automation, obviously, the right answer is to automate this setup and reconfiguration on failure.  A common way to do this is to use the Linux-HA software[6], which includes the LVS tool ldirectord[4].  Ldirectord will look at your real servers and see if they and the services they're running are operating correctly.  It will then take corrective action if it sees problems.  The Linux-HA software will watch the directors (sitting on third base), and fail things over and back if problems come up, to eliminate the single point of failure on third base.

As of now, the most common configurations of real servers have them be part of an LVS cluster, but not part of a Linux-HA cluster.  For historical reasons, the load balancers (directors) on third base are in one cluster, and the database server(s) on first base are commonly in separate clusters.  However, with release 2.x versions of Linux-HA it is perfectly sensible to include them both in the same cluster, perhaps in an n+1 sparing arrangement.  If you have fewer than 10-12 real servers, then it might also make sense to let Linux-HA manage those real servers as well.  The reason for the upper limit is to ensure that the total cluster isn't larger than the current Linux-HA limitations on cluster size (approximately 16 nodes).

Another possible configuration is to use Linux-HA to monitor your real servers.  This would involve writing a clone resource agent for configuring LVS to point at the various real servers.  This might result in a more scalable monitoring arrangement than the current ldirectord monitoring arrangement, since the monitoring is done on each real server, and only errors are reported back to Linux-HA.
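
To make the idea of director-side monitoring concrete, here is a minimal sketch of the kind of check a monitor could run against each real server.  This is not ldirectord's implementation or configuration format; it is only an illustration, and the function names and the addresses in the usage note are assumptions.

```python
import socket


def tcp_check(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def healthy_servers(real_servers, check=tcp_check):
    """Keep only the (host, port) pairs that pass the health check.

    A monitor would run this periodically and tell the load balancer to
    stop sending traffic to whatever dropped out of the list.
    """
    return [(host, port) for host, port in real_servers if check(host, port)]
```

A call such as `healthy_servers([("192.0.2.11", 80), ("192.0.2.12", 80)])` (documentation addresses, purely for illustration) returns the servers still answering.  Note that all of this checking runs on the director, which is part of why the director eventually becomes a bottleneck.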

This is a very brief overview of LVS, which perhaps we can expand on in a future posting.  For a thorough treatment of LVS,  I recommend The Linux Enterprise Cluster[7] by Karl Kopper.

Performance characteristics
Clearly every inbound packet has to go through the load balancer (director) - so it has to receive, look at, and forward each inbound packet.  It may also have to rewrite headers and recompute checksums on each packet.  If it is configured with NAT, then it also has to read and rewrite all outbound packets as well.  In addition, with ldirectord and similar software, the director also has the job of monitoring all the real server processes on all the real servers.  Eventually, this node (or these nodes) will become a bottleneck.  When this happens depends on the nature of the workload, the complexity of monitoring, and the director configuration chosen.

Cluster IP
Although LVS doesn't require a master's degree to configure, some features of it do have a reasonably steep learning curve.  For a very easy-to-configure, albeit less scalable load distribution method on Linux, you might consider using ClusterIP addresses[2].

What is a Cluster IP?
The unique feature of a cluster IP is that it has no load balancer, hence no single point of failure.  Wow! That seems weird!  What does the picture look like?  If you move the users out of the dugout onto third base, you'll get the basic idea.  But that picture brings lots of questions to mind - like how do packets get routed?

The answer is simple - each machine in the cluster has the same IP address.  Say what?  The same IP address? Yes.  I mean the same IP address.   How can this work?  This sounds like it flies in the face of usual teaching about networking.  Which it does.

Enter the Multicast MAC address
The trick to making this work is to have each machine have an ARP table entry with the same MAC address in it - a multicast MAC address.  So when an ARP request is given, all nodes in the cluster respond, but they all give the same answer: "I have IP address XXX with MAC address YYY".  So, in effect, there is no confusion - because it doesn't matter which ARP reply is listened to, they all say the same thing.  Therefore at the IP level everyone is happy.

So far, this is a reasonably satisfying answer, but not quite complete.  What about addressing at the MAC level, and at the TCP or UDP level?

At the MAC level, multicast MAC addresses are recognized by switches, and packets sent to them are forwarded to all the switch ports, since everyone has presented that MAC address as "theirs".  So, the switch copies all the packets to all the servers.

What happens at the TCP or UDP level?
This is where things get a little more interesting.  Now, it's more obvious how each machine gets the packets - because every machine gets them.  But, now what?  We clearly don't want every machine to respond to a given TCP packet. That would totally confuse everything, as would giving every packet to all the applications.  To solve this problem, Linux has added a hashing feature which allows the source address, source and destination port number to be used in a hashing function to allow it to decide which machine will respond to any given request.  So, if you have three hash buckets and three servers, the packet header information (source IP and port numbers) can be hashed into three buckets with one bucket assigned to each server.   If the packet hashes to the hash bucket assigned to this server, then it is kept, and passed along to the UDP or TCP layers.  If it doesn't hash to the bucket assigned to this server, then it's just dropped (ignored).

So, this hashing method determines which host serves the requests.  Although the ethernet driver in every machine sees each packet, each packet is only processed by one machine.  Now you know how it works.
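
The accept-or-drop decision described above can be sketched in a few lines.  This is only an illustration of the idea: the CRC32 hash here is a stand-in and not the hash function the Linux kernel actually uses, and the function names and bucket numbering are assumptions.

```python
import zlib


def bucket_for(src_ip, src_port, dst_port, total_buckets):
    """Hash the packet header fields into one of total_buckets buckets (1..N).

    CRC32 is a stand-in chosen for illustration; every node must simply
    use the same deterministic function.
    """
    key = f"{src_ip}:{src_port}:{dst_port}".encode()
    return zlib.crc32(key) % total_buckets + 1


def accepts(my_buckets, src_ip, src_port, dst_port, total_buckets):
    """True if this node owns the packet's bucket; otherwise it drops it."""
    return bucket_for(src_ip, src_port, dst_port, total_buckets) in my_buckets
```

Because every node computes the same hash on the same header fields, the nodes agree on who owns each packet without ever talking to each other, as long as each bucket is assigned to exactly one node.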

It also turns out to be very easy to configure using Linux-HA, as you can see on our ClusterIP web page[8].  In the process, Linux-HA also handles all the redundancy and failover of cluster IP buckets for you automatically.  Very cool indeed.

If you only configure one bucket per node, then when a node fails, all of its traffic has to get assigned to one machine.  If you start out with 3 nodes in your ClusterIP group, and one node dies, then that means that one node gets all the additional traffic - effectively doubling its workload.  So, a better idea for "n" nodes is to have n*(n-1) cluster IP buckets.  Each node then starts out with n-1 buckets, so when any given machine fails, each of the n-1 remaining nodes can take over exactly one of them, and the workload is split evenly across the remaining nodes.  In Linux-HA terminology, the ClusterIP address is called a clone resource, and what you want is to configure clone_max to n*(n-1) and clone_node_max also to n*(n-1).  Although clone_node_max probably doesn't have to be this large, it would allow a single node to handle all the traffic, if a sufficient number of ClusterIP peers die.
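
The arithmetic behind the n*(n-1) suggestion can be checked with a small sketch.  This only illustrates the counting argument; it is not how Linux-HA actually places clone instances, and the function names and the round-robin reassignment are assumptions.

```python
def assign_buckets(nodes):
    """Give each of n nodes n-1 buckets, for n*(n-1) buckets in total."""
    n = len(nodes)
    per_node = n - 1
    return {
        node: list(range(i * per_node, (i + 1) * per_node))
        for i, node in enumerate(nodes)
    }


def fail_node(assignment, failed):
    """Hand the failed node's buckets to the survivors, round-robin."""
    survivors = sorted(node for node in assignment if node != failed)
    result = {node: list(assignment[node]) for node in survivors}
    for i, bucket in enumerate(assignment[failed]):
        result[survivors[i % len(survivors)]].append(bucket)
    return result
```

With three nodes there are six buckets, two per node.  After one node fails, each survivor holds three of the six buckets, so each carries half the traffic instead of one survivor carrying two thirds.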

Performance characteristics
Every node in the cluster will see all incoming IP packets.  As I understand it, many/all switches will also send every packet to every switch port in the subnet (or vlan).  This argues for a small subnet for this function.  But, the packets are discarded at a very early stage - minimizing the overhead on the host.  Outbound packets are not affected by this arrangement.  This kind of arrangement works well for these kinds of cases:

  • long processing time per packet (complex J2EE applications, for example)

  • small incoming packets with large outgoing packets

  • smaller number of processing nodes

It probably works less well with the opposite kinds of configurations:

  • high number of incoming packets with trivial processing per packet

  • large incoming packets (uploading DVD images, for example)

  • large number of processing nodes

Note that in this case, since there is no head-end processor like an LVS director that can be a single point of failure, no special provisions are needed for high-availability when used with Linux-HA.  It is typically not as scalable as an LVS load balancer, but it is trivial to set up and use.

[1] http://www.linuxvirtualserver.org/
[2] http://flaviostechnotalk.com/wordpress/index.php/2005/06/12/loadbalancer-less-clusters-on-linux/
[3] http://en.wikipedia.org/wiki/Baseball
[4] http://www.vergenet.net/linux/ldirectord/
[5] http://en.wikipedia.org/wiki/Network_address_translation
[6] http://linux-ha.org/
[7] http://www.nostarch.com/frameset.php?startat=cluster
[8] http://www.linux-ha.org/ClusterIP

November 12, 2007

Alan eats his own cl_respawn dog food. Yum!!

In this posting, I show how to use cl_respawn[1] to monitor my system logging and help keep it running, and along the way, I improved cl_respawn a little as well.  In addition, I explain why I couldn't just use the respawn directive in /etc/inittab[5] (and why you probably can't either).   I first talked about cl_respawn in one of my first blog posts[6].

The problem

When we run our automated CTS[2] tests for Linux-HA[3] we rely on the guaranteed log entry delivery provided by syslog-ng[4].  Basically, we redirect all our logs in a test cluster to a test overseer machine, and then CTS watches this consolidated log for errors and correct behavior.

This is a nice system and it works pretty well, but it relies on the reliability of syslog-ng.  For the most part, that's just fine.  But, sometimes syslog-ng just stops running.  Then the tests show that Heartbeat has failed, but it's really just syslog-ng that's crashed on me.  So, in the past I added some code to CTS to make it test the logging after every error, and then hit the machines over the head with a hammer and restart logging if logging wasn't working.

This was sort-of OK, because it meant subsequent tests would run fine, but the one test would show failed - even though it probably succeeded.  This would be fine, except that one of my machines (my oldest and slowest) had syslog-ng die on it a few times a day.  I don't know why, and as long as I can live with it for my testing, I don't much care.  I just want it to work.  (I know, it's a lousy attitude, but I have way more to do than I can possibly do).

The solution

Then I had this revolutionary thought - I could use HA software to make my logging highly available!!

Hold the presses, folks, new headline reads
   "HA guru realizes he can use HA software just like he tells everyone else to do!"

To fix this problem all I had to do was change the init script for syslog to use our cool little cl_respawn tool to babysit the syslog-ng service.  Although I could have used Heartbeat to monitor this service, it seemed like overkill and would have conflicted with CTS.

So, I set out to use cl_respawn to restart syslog-ng quickly - minimizing but not eliminating the possibility of losing important log messages.

When I looked at the init scripts (they're from SUSE Linux), they had these statements in them:

  • For starting
    startproc -p ${syslog_pid} ${BINDIR}/${syslog} $params
  • For stopping
    killproc -p ${syslog_pid} -TERM ${BINDIR}/${syslog} ; rc_status -v
  • For status
    checkproc -p ${syslog_pid}      ${BINDIR}/${syslog}; rc_status -v

My first thought was all I had to do was insert cl_respawn ahead of the ${BINDIR}/${syslog} and I'd be done.  Well.... not quite...

If I had done that, then the pid file for the service ${syslog_pid} would have pointed not to cl_respawn, but to syslog-ng.  So, when I tried to shut down syslog, cl_respawn would have just respawned it.  OOPS.  Not quite the right effect.

What was necessary was for the syslog pid file to contain the pid of cl_respawn, not the pid of syslog-ng.  One minor problem - the author of cl_respawn didn't deal with pidfiles.  To fix that, I added support for a -p option to tell it the name of the pid file to use.

Now I try it.  Uh-oh... It didn't work.  The logs are quickly filled with attempts to start  ${syslog} and having it fail continually with  socket in use.  What was all that about?

By default, syslog-ng forks itself into the background,  and its parent process exits.  That makes cl_respawn think it's died - so it restarts it - and it fails ad infinitum.  So, I read the man page for syslog-ng and discover the -F option to keep it from forking.  Without that, cl_respawn can't tell when it dies.
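
The two requirements discovered above (the pid file must name the babysitter, and the babysat program must stay in the foreground) are easier to see in a small sketch.  This is not cl_respawn's code; it is a hypothetical Python illustration of the same idea, and the class name, method names and the paths in the usage note are assumptions.

```python
import os
import signal
import subprocess
import time


class Respawner:
    """Run a command in the foreground and restart it whenever it exits."""

    def __init__(self, cmd, pidfile, delay=0.0):
        self.cmd = cmd
        self.pidfile = pidfile
        self.delay = delay  # pause between restarts, in seconds
        self.child = None
        self.stopping = False

    def stop(self, signum=None, frame=None):
        """Stop respawning and terminate the child (usable as a signal handler)."""
        self.stopping = True
        if self.child is not None and self.child.poll() is None:
            self.child.terminate()

    def install_signal_handler(self):
        """Make SIGTERM (what killproc -TERM sends) stop supervisor and child."""
        signal.signal(signal.SIGTERM, self.stop)

    def run(self, max_starts=None):
        """Supervise the command; return how many times it was started."""
        # The pid file holds OUR pid, not the child's, so killing the pid
        # in the file stops the babysitter rather than triggering a respawn.
        with open(self.pidfile, "w") as f:
            f.write(f"{os.getpid()}\n")
        starts = 0
        try:
            while not self.stopping and (max_starts is None or starts < max_starts):
                self.child = subprocess.Popen(self.cmd)
                starts += 1
                # If the command forks into the background, wait() returns
                # at once and we would restart it forever: the failure
                # described above.  The child must stay in the foreground.
                self.child.wait()
                if self.delay:
                    time.sleep(self.delay)
        finally:
            os.remove(self.pidfile)
        return starts
```

A hypothetical use would be `r = Respawner(["/sbin/syslog-ng", "-F"], "/var/run/syslog-ng.pid")`, followed by `r.install_signal_handler()` and `r.run()`; both paths are examples only and will differ between distributions.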

Along the way, I read the code, find a couple of other minor bugs and fix them.  I update my init script and now it looks like this:

  • For starting
    startproc -p ${syslog_pid} /usr/bin/cl_respawn -p ${syslog_pid} ${BINDIR}/${syslog} -F $params
  • For stopping
    killproc -p ${syslog_pid} -TERM /usr/bin/cl_respawn ; rc_status -v
  • For status
    checkproc -p ${syslog_pid}      /usr/bin/cl_respawn; rc_status -v

Of course, if you don't run SUSE Linux, then your init scripts will look somewhat different, but I'm sure you'll figure it out.

Why not just use respawn in inittab?

Those of you who know UNIX administration to any degree realize that /etc/inittab[5] has a respawn directive you can give it.  Why wouldn't that do the trick?  The short answer is service dependencies.   The longer answer is below:

  • Logging depends on other /etc/init.d services, so you don't want it to start until after those other services (like the network) are started.  The LSB init script system supports these dependencies and starts things in the right order.
  • Other services depend on logging.  A number of other services can't start until after logging starts.  If you try and disable the /etc/init.d/syslog service on your machine so you can start it with respawn from /etc/inittab, havoc ensues - because these other services won't start until the /etc/init.d/syslog service is started.  If you disable it, they won't start.
  • What fun would that be?  I mean, if we wrote this cl_respawn tool, we probably ought to use it ;-).

What did I learn?

  • How to use cl_respawn in real life
  • Some missing requirements for cl_respawn
  • I was reminded of the advantages of using our own software
  • How handy simple little tools like cl_respawn can be

[1] http://hg.linux-ha.org/dev/file/tip/tools/cl_respawn.c
[2] http://linux-ha.org/CTS
[3] http://linux-ha.org/
[4] http://www.balabit.com/network-security/syslog-ng/opensource-logging-system/
[5] http://www.freebsd.org/cgi/man.cgi?query=inittab&manpath=Red+Hat+Linux%2Fi386+9&format=html
[6] http://techthoughts.typepad.com/managing_computers/2007/09/tools-for-servi.html

November 04, 2007

Availability, MTBF, MTTR and other bedtime tales

If we let A represent availability, then the simplest formula for availability is:
    A = Uptime/(Uptime + Downtime)

Of course, it's more interesting when you start looking at the things that influence uptime and downtime.  The most common measures that can be used in this way are MTBF and MTTR.

    MTBF is  Mean Time Between Failures
    MTTR is Mean Time To Repair   

 A = MTBF / (MTBF+MTTR)

One interesting observation you can make when reading this formula is that if you could instantly repair everything (MTTR = 0), then it wouldn't matter what the MTBF is - Availability would be 100% (1) all the time.
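
As a quick check of that observation, here is the formula as a tiny function.  The numbers in the usage note are made up for illustration, and the only assumption is that MTBF and MTTR are given in the same units.

```python
def availability(mtbf, mttr):
    """A = MTBF / (MTBF + MTTR), with both arguments in the same units."""
    return mtbf / (mtbf + mttr)


def downtime_hours_per_year(mtbf, mttr):
    """Expected downtime over a 365-day year (8760 hours)."""
    return (1.0 - availability(mtbf, mttr)) * 8760
```

For a hypothetical service with an MTBF of 1000 hours, cutting the MTTR from 10 hours to 1 hour raises `availability(1000, 10)` to `availability(1000, 1)`, and `availability(1000, 0)` is exactly 1.0 no matter what the MTBF is.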

That's exactly what HA clustering tries to do.  It tries to make the MTTR as close to zero as it can by automatically (autonomically) switching in redundant components for failed components as fast as it can.   Depending on the application architecture and how fast failure can be detected and repaired, a given failure might not be observable at all by a client of the service.  If it's not observable by the client, then in some sense it didn't happen at all.  This idea of viewing things from the client's perspective is an important one in a practical sense, and I'll talk about that some more later on.

It's important to realize that any given data center, or cluster provides many services, and not all of them are related to each other.  Failure of one component in the system may not cause failure of the system.  Indeed, good HA design eliminates single points of failure by introducing redundancy.  If you're going to try and calculate MTBF in a real-life (meaning complex) environment with redundancy and interrelated services, it's going to be very complicated to do.

    MTBFx is  Mean Time Between Failures for entity x
    MTTRx is Mean Time To Repair for entity x
    Ax is the Availability of entity x   

Ax = MTBFx / (MTBFx+MTTRx)

In practice, these measures (MTBFx and MTTRx) are hard to come by for nontrivial real systems - in fact, they're so tied in to application reliability and architecture, hardware architecture, deployment strategy, operational skill and training, and a whole host of other factors, that you can actually compute them only very very rarely.  So, why did I spend your time talking about it?  That's simple - although you probably won't compute them, you can learn some important things from these formulas, and you can see how mistakes you make in viewing these formulas might lead you to some wrong conclusions.

Let's get right into one example of a wrong conclusion you might draw from incorrectly applying these formulas.

Let's say we have a service which runs on a single machine, and you put it onto a cluster composed of two computers, each with a certain individual MTBF (Mi), where you can fail over to the other computer ("repair" a computer) in a certain repair time (Ri).  With two computers, they'll fail twice as often as a single computer, so the system MTBF becomes Mi/2.  If you compute the availability of the cluster, it then becomes:

    A = Mi/2 / (Mi/2+Ri)

Using this (incorrect) analysis for a 1000 node cluster performing the same service, the system MTBF becomes Mi/1000.

    A = Mi/1000 / (Mi/1000+Ri)

If you take the number of nodes in the cluster to the limit (approaching infinity), the Availability approaches zero.

    A = 0/(0+Ri) = 0/Ri = 0

This makes it appear that adding cluster nodes decreases availability.  Is this really true?  Of course not!  The mistake here is thinking that the service needed all those  cluster nodes to make it go.  If your service was a complicated interlocking scientific computation that would stop if any cluster node failed, then this model might be correct.  But if the other nodes were providing redundancy or unrelated services, then they would have no effect on MTBF of the service in question.  Of course, as they break, you'd have to repair them, which would mean replacing systems more and more often, which would be both annoying and expensive, but it wouldn't cause the service availability to go down.
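
Here is a small Python sketch of the difference between the two ways of counting.  The node MTBF and failover time are hypothetical numbers, and the parameter essential_nodes is a name I made up for "the number of nodes whose failure actually stops the service".

```python
def availability(mtbf: float, mttr: float) -> float:
    """A = MTBF / (MTBF + MTTR), with both values in the same time unit."""
    return mtbf / (mtbf + mttr)


def service_availability(node_mtbf: float, repair_time: float,
                         essential_nodes: int) -> float:
    """Availability when a failure of any one of `essential_nodes` stops the service.

    Only nodes the service cannot run without shorten its MTBF.
    """
    return availability(node_mtbf / essential_nodes, repair_time)


NODE_MTBF = 10_000.0  # Mi in hours (hypothetical)
FAILOVER = 0.01       # Ri in hours (hypothetical)

for nodes in (2, 1000):
    # Incorrect analysis: every node in the cluster is counted as essential.
    naive = service_availability(NODE_MTBF, FAILOVER, essential_nodes=nodes)
    # The service runs on one machine at a time; the rest are spares.
    actual = service_availability(NODE_MTBF, FAILOVER, essential_nodes=1)
    print(f"{nodes} nodes: naive={naive:.6f} actual={actual:.6f}")
```

The naive figure drops as nodes are added, while the figure for a service that needs only one node at a time does not move at all.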

To properly apply these formulas, even intuitively, you need to make sure you understand what your service is, how you define a failure, how the service components relate to each other, and what happens when one of them fails.  Here are a few rules of thumb for thinking about availability:

  • Complexity is the enemy of reliability (MTBF).  This can take many forms:
    • Complex software fails more often than simple software
    • Complex hardware fails more often than simple hardware
    • Software dependencies usually mean that if any component fails, the whole service fails
    • Configuration complexity lowers the chances of the configuration being correct
    • Complexity drastically increases the possibility of human error
      • What is complex software? - Software whose model of the universe doesn't match that of the staff who manage it.
  • Redundancy is the friend of availability - it allows for quick autonomic recovery - significantly improving MTTR.  Replication is another word for redundancy.
  • Good failure detection is vital - HA and other autonomic software can only recover from failures it detects.  Undetected failures have human-speed MTTR or worse, not autonomic-speed MTTR.  They can be worse than human-speed MTTR because the humans are surprised that it wasn't automatically recovered and they respond more slowly than normal.  In addition, the added complexity of correcting an autonomic service and trying to keep their fingers out of the gears may slow down their thought processes.
  • Non-essential components don't count - failure of inactive or non-essential components doesn't affect service availability.  These inactive components can be hardware (spare machines), or software (like administrative interfaces), or hardware only being used to run non-essential software.  More generally, for the purpose of calculating the availability of service X, non-essential components include anything not running service X or services essential to X.

The real world is much more complex than any simple rules of thumb like these, but these are certainly worth taking into account.

October 10, 2007

Split-brain, Quorum, and Fencing - updated

In some ways, an HA system is pretty simple - it starts services, it stops them, and it sees if they and the computers that run them are still running.  But, there are a few bits of important "rocket science" hiding in there among all these apparently simple tasks.  Much of the rocket science that's there centers around trying to solve a single thorny problem - split brain.  The methods that are used to solve this problem are quorum and fencing.  Unfortunately, if you manage an HA system you need to understand these issues.  So this post will concentrate on these three topics: split-brain, quorum, and fencing.

If you have three computers and some way for them to communicate with each other, you can make a cluster out of them, and each can monitor the others to see if a peer has crashed.  Unfortunately, there's a problem here - you can't distinguish a crash of a peer from broken communications with the peer.  All you really know is that you can't hear anything from them.  You're really stuck in a Dunn's law[1] situation - where you really don't know very much, but desperately need to.  Maybe you don't feel too desperate yet.  Perhaps you think that you don't need to be able to distinguish these two cases.  The truth is that sometimes you don't need to, but much of the time you very much need to be able to tell the difference.  Let's see if I can make this clearer with an illustration.

Let's say you have three computers, paul, silas, and mark, and paul and silas can't hear anything from mark and vice versa.  Let's further suppose that mark had a filesystem /importantstuff from a SAN volume mounted on it when we lost contact with it, and that mark is alive but out of contact.  What happens if we just go ahead and mount /importantstuff up on paul? The short answer is that bad things will happen[2]. /importantstuff will be irreparably corrupted as two different computers update the disk independently.  The next question you'll ask yourself is "Where are those backup tapes?". That's the kind of question that's been known to be career-ending.

Split-Brain

This problem of a subset of computers in a cluster beginning to operate autonomously from each other is called Split Brain[3]. In our example above, the cluster has split into two subclusters: {paul, silas} and {mark}, and each subset is unaware of the other.  This is perhaps the most difficult problem to deal with in high-availability clustering.  Although this situation does not occur frequently in practice, it does occur more often than one would guess.  As a result, it's vital that a clustering system have a way to safely deal with this situation.

Earlier I mentioned that there was information you really want to know, but don't know.  Exactly what information did I mean?   What I wanted to know was "is it safe to mount up /importantstuff somewhere else?".  In turn, you could figure that out if you knew the answer to one of these two questions:  "Is mark really dead?" which is one way of figuring out "Is mark going to write on the volume any more?"  But, of course, since we can't communicate with mark, this is pretty hard to figure out.  So, cluster developers came up with a kind of clever way of ensuring that this question can be answered.  We call that answer fencing.

Fencing

Fencing is the idea of putting a fence around a subcluster so that it can't access cluster resources, like  /importantstuff.  If you put a fence between it and its resources, then suddenly you know the answer to the question "Is mark going to write on the volume any more?" - and the answer is no - because that's what the fence is designed to prevent.  So, instead of passively wondering what the answer to the safeness question is, fencing takes action to ensure the "right" answer to the question.

This sort of abstract idea of fencing is fine enough, but how is this fencing stuff actually done? There are basically two general techniques:  resource fencing[4] and node fencing[5].

  • Resource fencing is the idea that if you know what resources a node might be using, then you can use some method of keeping it from accessing those resources. For example, if one has a disk which is accessed through a Fibre Channel switch, then one can talk to the Fibre Channel switch and tell it to deny the errant node access to the SAN.

  • Node fencing is the idea that one can keep a node from accessing all resources - without knowing what kind of resources it might be accessing, or how one might deny access to them.  A common way of doing this is to power off or reset the errant node.  This is a very effective if somewhat inelegant method of keeping it from accessing anything at all.  This technique is also called STONITH[6] - which is a  graphic and colorful acronym standing for Shoot The Other Node In The Head.

With fencing, we can easily keep errant nodes from accessing resources, and we can now keep the world safe for democracy - or at least keep our little corner of it safe for clustering.  An important aspect of good fencing techniques is that they're performed without the cooperation of the node being fenced off, and that they give positive confirmation that the fencing was done.  Since errant nodes are suspect, it's by far better to rely on positive confirmation from a correctly operating fencing component than to rely on errant cluster nodes you can't communicate with to police themselves.
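
The "positive confirmation" rule can be sketched in a few lines of Python.  This is only an illustration: fence and start_resource are hypothetical callables standing in for whatever talks to your power switch or SAN switch and whatever mounts the filesystem, and they are not part of any real cluster API.

```python
from typing import Callable


class FencingFailed(Exception):
    """Raised when the fencing device did not confirm the node is fenced."""


def take_over(node: str,
              fence: Callable[[str], bool],
              start_resource: Callable[[], None]) -> None:
    """Start the resource only after fence(node) positively confirms success.

    `fence` is assumed to talk to a power switch or SAN switch, never to the
    errant node itself, and to return True only when that device reports
    that the node is off or cut off from its resources.
    """
    if not fence(node):
        # No confirmation: assume the node may still write to the volume.
        raise FencingFailed(f"no confirmation that {node} was fenced")
    start_resource()
```

In terms of the example cluster, paul would call take_over("mark", ...) and would mount /importantstuff only if the call returns without raising.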

Although fencing is sufficient to ensure safe resource access, it is not typically considered to be sufficient for happy cluster operation because without some other mechanism, there are some behaviors it can get into which can be significantly annoying (even if your data really is safe).  To discuss this, let's return to our sample cluster.

Earlier we talked about how paul or silas could use fencing to keep the errant node mark from accessing /importantstuff.  But, what about mark?  If mark is still alive, then it is going to regard paul and silas as errant, not itself.  So, it would also proceed to fence paul and silas - and progress in the cluster would stop.  If it is using STONITH, then one could get into a sort of infinite reboot loop, with nodes declaring each other as errant and rebooting each other, coming back up and doing it all over again.  Although this is kind of humorous the first time you see this in a test environment - in production with important services, the humor of the situation probably wouldn't be your first thought.  To solve this problem, we introduce another new mechanism - quorum.

Quorum

One way to solve the mutual fencing dilemma described above is to somehow select only one of these two subclusters to carry on and fence the subclusters it can't communicate with.  Of course, you have to solve it without communicating with the other subclusters - since that's the problem - you can't communicate with them.  The idea of quorum represents the process of selecting a unique (or distinguished for the mathematically inclined) subcluster.

The most classic solution to selecting a single subcluster is a majority vote.  If you choose a subcluster with more than half of the members in it, then (barring bugs) you know there can't be any other subclusters like this one. So, this looks like a simple and elegant solution to the problem. For many cases, that's true.  But, what if your cluster only has two nodes in it?  Now,  if you have a single node fail, then you can't do anything - no one has quorum.  If this is the case, then two machines have no advantage over a single machine - it's not much of an HA cluster.  Since 2-node HA clusters are by far the most common size of HA cluster, it's kind of an important case to handle well.  So, how are we going to get out of this problem?
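
A majority vote is easy to write down.  The following Python sketch uses the node names from the example above; the function name has_majority is mine and not taken from any cluster package.

```python
def has_majority(subcluster: set, cluster: set) -> bool:
    """True if the subcluster holds more than half of the configured nodes."""
    return 2 * len(subcluster & cluster) > len(cluster)


three_nodes = {"paul", "silas", "mark"}
print(has_majority({"paul", "silas"}, three_nodes))  # True
print(has_majority({"mark"}, three_nodes))           # False

# The two-node problem: after one node fails, the survivor has exactly half.
two_nodes = {"paul", "silas"}
print(has_majority({"paul"}, two_nodes))             # False
```

Because the test is a strict "more than half", at most one subcluster can ever pass it, and in a two-node cluster a lone survivor never does.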

Quorum Variants and Improvements

What you need in this case, is some kind of a 3rd party arbitrator to help select who can fence off the other nodes and allow you to bring up resources - safely.  To solve this problem there is a variety of other methods available to act as this arbitrator - either software or hardware. Although there are several methods available to use as arbitrator, we'll only talk about one each of hardware and software methods: SCSI reserve and Quorum Daemon.

  • SCSI reserve:  In hardware, we fall back on our friend SCSI reserve.  In this usage, both nodes try to reserve a disk partition available to both of them, and the SCSI reserve mechanism ensures that only one of the two of them can succeed.  Although I won't go into all the gory details here, SCSI reserve creates its own set of problems, including that it won't work reliably over geographic distances.  A disk which one uses in this way with SCSI reserve to determine quorum is sometimes called a quorum disk.  Some HA implementations (notably Microsoft's) require a quorum disk.

  • Quorum Daemon:  In Linux-HA[7], we have implemented a quorum daemon - whose sole purpose in life is to arbitrate quorum disputes between cluster members.  One could argue that for the purposes of quorum this is basically SCSI reserve implemented in software - and such an analogy is a reasonable one.  However, since it is designed for only this purpose, it has a number of significant advantages over SCSI reserve - one of which is that it can conveniently and reliably operate over geographic distances, making it ideal for disaster recovery (DR) type situations.  I'll cover the quorum daemon and why it's a good thing in more detail in a later posting.  Both HP and Sun have similar implementations, although I have security concerns about them, particularly over long distances.  Other than the security concerns (which might or might not concern you), both HP's and Sun's implementations are also good ideas.

Arguably the best way to use these alternative techniques is not directly as a quorum method, but rather as a way of breaking ties when the number of nodes in a subcluster is exactly half the number of nodes in the cluster.  Otherwise, these mechanisms can become single points of failure - that is, if they fail the cluster cannot recover.
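
As a rough Python sketch of this tie-breaking approach: wins_tiebreak below is a hypothetical callable standing in for whatever arbitrator you use (a SCSI reservation attempt, a request to a quorum daemon), and it is consulted only when the vote is an exact tie.

```python
from typing import Callable


def has_quorum(subcluster: set, cluster: set,
               wins_tiebreak: Callable[[], bool]) -> bool:
    """Majority vote, with an external arbitrator used only to break exact ties."""
    votes = len(subcluster & cluster)
    total = len(cluster)
    if 2 * votes > total:
        return True             # clear majority: the arbitrator is not needed
    if 2 * votes == total:
        return wins_tiebreak()  # exactly half: let the arbitrator pick one side
    return False                # minority: never has quorum
```

With this arrangement, a failed arbitrator only matters in the exact-half case; a subcluster with a clear majority proceeds without it.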

Alternatives to Fencing

There are times when it is impossible to use normal 3rd-party fencing techniques.  For example, in a split-site configuration (a cluster which is split across geographically distributed sites), when inter-site communication fails, then attempts to fence will also fail.  In these cases, there are a few self-fencing alternatives which one can use when the more normal third-party fencing methods aren't available.  These include:

  • Node suicide.  If a node is running resources and it loses quorum, then it can power itself off or reboot itself (sort of a self-STONITH).  The remaining nodes wait "long enough" for the other node to notice and kill itself.  The problem is that a node which is sick might not succeed in committing suicide, or might not notice that it had a membership change, or had lost quorum.   It is equally bad if notification of these events is simply delayed "too long".  Since there is a belief that the node in question is, or at least might be, malfunctioning, this is not a trivial question.  In this case, use of hardware or software watchdog timers becomes critical.

  • Self-shutdown.  This self-fencing method is a variant on suicide, except that resources are stopped gracefully.  It has many of the same problems, except it is somewhat less reliable because the time to shut down resources can be quite long.  Like the case above, use of hardware or software watchdog timers becomes critical.

Note that without fencing, the membership and quorum algorithms are extremely critical.  You've basically lost a layer of protection, and you've switched from relying on a component which gives positive confirmation to relying on a probably faulty component to fence itself, and then hoping without confirmation that you've waited long enough before continuing.

Summary

Split-brain is the idea that a cluster can have communication failures, which can cause it to split into subclusters.  Fencing is the way of ensuring that one can safely proceed in these cases, and quorum is the idea of determining which subcluster can fence the others and proceed to recover the cluster services.

An Important Final Note

It is fencing which best guarantees the safety of your resources.  Nothing else works quite as well.  If you have fencing in your cluster software, and you have irreparable resources (i.e. that would be irreparably damaged in a split-brain situation), then you must configure fencing.  If your HA software doesn't support (3rd party) fencing, then I suggest that you consider getting a different HA package.

See Also

General cluster concepts[8]

References

[1]   http://linux-ha.org/DunnsLaw
[2]   http://linux-ha.org/BadThingsWillHappen
[3]   http://linux-ha.org/SplitBrain
[4]   http://linux-ha.org/ResourceFencing
[5]   http://linux-ha.org/NodeFencing
[6]   http://linux-ha.org/STONITH
[7]   http://linux-ha.org/
[8]   http://linux-ha.org/ClusterConcepts

October 05, 2007

How to use a watchdog timer

In an earlier posting[1], I mentioned that explaining how to optimally use a watchdog driver would be a good thing to talk about later. Now seems a good time to talk about that, giving a brief overview of some good techniques for getting the most out of your watchdog timer.

As was mentioned previously, one can have a software watchdog timer like softdog[2], or a hardware timer, or a watchdog utility like apphbd[3] (application heartbeat daemon). Although each method has its advantages and disadvantages, the methods that an application can use to take best advantage of them are very similar.

The basic idea of using a watchdog is simple:  Periodically send a heartbeat to the watchdog timer.  If the application fails to heartbeat in the specified interval, or exits prematurely, then a recovery action is taken.  So, all your application has to do  is set a timer and tickle the watchdog timer when your timer goes off.  Sounds extremely simple - and for the most part it is.
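
To show the mechanics, here is a toy in-process watchdog in Python.  It is a simulation for illustration only: it does not use the interface of softdog, a hardware timer, or apphbd, and the class and method names are made up.  The clock is passed in so the behavior can be exercised without waiting in real time.

```python
import time


class Watchdog:
    """A toy watchdog: expires if not tickled within `timeout` seconds."""

    def __init__(self, timeout: float, clock=time.monotonic):
        self.timeout = timeout
        self._clock = clock
        self._last_tickle = clock()

    def tickle(self) -> None:
        """Record a heartbeat from the application."""
        self._last_tickle = self._clock()

    def expired(self) -> bool:
        """True once more than `timeout` seconds have passed since the last heartbeat."""
        return self._clock() - self._last_tickle > self.timeout
```

The application calls tickle() from its own timer; whatever plays the role of the watchdog checks expired() and takes the recovery action when it returns True.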

How to get into trouble with watchdog timers

If your application does disk I/O or grows in size as it runs, or calls functions or system calls that might block, then the timing of your application can change dramatically when the system is under heavy I/O load or memory pressure.  This can mean that your application is judged to be hung when it's not.  When this happens, the watchdog timer you're using will trigger a recovery action - maybe restarting your application, or rebooting the machine.  For this kind of a situation, this is probably not what you had in mind.  Of course, if you're in an HA environment like Linux-HA[4] where a machine reboot will cause a service failover, this may be exactly what's needed to straighten out your problem.  As always, YMMV[5].

When not to use watchdog timers

Watchdog timers need reasonably reliable real-time performance, and an application which you can modify and which runs periodically (or which you're willing to make run periodically).  If you can't modify the application, or it is expected to have extremely erratic real-time performance, it doesn't run continuously, or runs in an environment which services your application erratically, then watchdog timers may not be for you.

Making your watchdog timer do more

Having your application not appear to be dead is sort-of-OK, but not exactly a deep metric for how well your application is behaving. Since this code will run inside your application, and you have to modify your application anyway, you have the opportunity to make this a form of white-box testing [6]. To do this, tie tickling your watchdog timer to good measures of your program's sanity. Below are a few examples of how you might go about doing this.  Not every example is appropriate for every type of application, so take this as food for thought.

  • Audit your data structures. If you have several data structures representing work to do, clients, outputs, etc., you can audit your data structures for mutual consistency, and only send out heartbeats when you don't find any errors.  For example, perhaps every piece of work to be done ought to belong to an active client. This is an interesting technique - because it can allow for transient "false positives" or errors that get corrected by the natural flow of the application.  If you audit your data structures once every 10 seconds, but set your watchdog timeout to 30 seconds, then you can have a transient error last up to two iterations without causing a restart.  If it persists beyond that, your watchdog timer will take action.

  • Check for work being processed.  If you have work in your input queues, but no work has been completed since the last heartbeat, then suppress sending out a heartbeat until some work actually gets processed.

  • Check for old work to be done.  If you have work queues, you can skip heartbeats whenever you have work in your queues which is "too old".  This may be a symptom that your application isn't processing its work, or that it has somehow lost track of this particular piece of work.

Of course, these are just a few simple ideas, but they may spark some better ideas for your particular application.  Anything you can examine periodically to see if your application seems to be doing whatever it is it's supposed to do is a potential candidate for using to control when and whether to tickle your watchdog timer.
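
Here is one way the ideas above might look in Python.  Everything in it is an assumption made for illustration: work items are dictionaries with "client" and "queued_at" keys, and tickle is whatever function sends the heartbeat to your watchdog.

```python
from typing import Callable, Iterable


def heartbeat_if_sane(checks: Iterable[Callable[[], bool]],
                      tickle: Callable[[], None]) -> bool:
    """Tickle the watchdog only when every sanity check passes."""
    if all(check() for check in checks):
        tickle()
        return True
    return False  # heartbeat suppressed; the watchdog keeps counting down


def work_belongs_to_active_clients(work_items, active_clients) -> bool:
    """Audit: every queued work item names a client that is still active."""
    return all(item["client"] in active_clients for item in work_items)


def no_stale_work(work_items, now: float, max_age: float) -> bool:
    """Audit: no queued work item has waited longer than max_age seconds."""
    return all(now - item["queued_at"] <= max_age for item in work_items)
```

A failed check doesn't trigger anything by itself; it simply withholds the heartbeat, so a transient problem that clears before the watchdog timeout never causes a restart.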

Making your watchdog timer more reliable

Something you should really avoid is false positives on watchdog timeouts - especially when the consequence is to reboot the machine. Spurious reboots are typically frowned upon ;-).  This is all a fine idea, but how exactly do you go about doing that?  Here are a few tips to keep in mind.

  1. Tune your timeout interval. Some watchdog timers (like apphbd) have a warning level as well as a fatal timing level. Be sure to take advantage of the warning level to help you tune your application's heartbeat interval.  For example, if your application really ought to send out a heartbeat every second, you can set the warning time threshold to 1.5 seconds, and the fatal watchdog timer to a much larger value, say 5 or 10 seconds. That way, as you test, you can see from the warnings how close you come to the fatal time limit in your most extreme cases of load.

  2. Know your application's expected behavior.  If it is single-threaded and does short-lived tasks throughout the day, except for once a day when it produces a summary report, then keep that extra work and the delay it can cause in mind when setting your timer value.

  3. Make your application a better real-time citizen. Rather than processing all its work in one huge uninterruptible chunk, make sure you process chunks whose size is bounded by some constant amount of time, and allow interruptions (for processing watchdog timer requests) between these chunks.

  4. Consider using the glib mainloop task dispatcher[7]. The mainloop construct is great for event driven programs which don't actually require threading. We use it extensively in Heartbeat [4], and it has worked out very well for us.

  5. Consider using threads. Many people swear by threads as the only way to write any reasonable program. Many people (including many of those same people) swear at them[8]. When properly used, threads can be helpful in writing programs with better real-time behavior.

  6. Consider doing your disk I/O into a separate process (or thread). It's usually disk I/O or memory allocation (implying disk I/O) which is most likely to hang your program and give it unpredictable realtime behavior.

  7. Consider using asynchronous I/O. This is another technique for avoiding blocking by disk I/O. It's not terribly portable, and the API seems to me to be a little subject to change, but it's a really nice idea.

  8. Consider locking your program into memory. If your program needs about the same amount of real memory as it occupies virtual memory, then the system impact of locking your program into memory may not be high. This is not for everyone - because if everyone did this, then why bother implementing virtual memory?

  9. Consider setting your program with a soft-realtime (POSIX realtime) priority. Even more than the previous step, this step is not for everyone. If your program goes into an infinite loop, the entire system stops - end of story. Not good. But, if your program is critical, small, and well-behaved, this can be a reasonable thing to do.

  10. Consider running real time Linux [9]. This is an even more drastic step, but if your whole system needs to be real time, some great work has been done here which you might look into.

  11. If you're writing in Java, consider real-time Java from IBM[10]. A better recommendation might be to not use Java, but for some people Java is their religion.  If you or your management insists on Java, then this may be just the ticket for you.
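
As an illustration of tip 3, here is a minimal Python sketch of processing work in time-bounded slices with a heartbeat between slices.  The function and parameter names are hypothetical, and the sketch assumes that handling any single item is short compared to the watchdog timeout.

```python
import time
from collections import deque
from typing import Callable


def process_in_slices(queue: deque,
                      handle: Callable[[object], None],
                      tickle: Callable[[], None],
                      slice_seconds: float,
                      clock: Callable[[], float] = time.monotonic) -> None:
    """Drain `queue` in slices of roughly slice_seconds, tickling between slices."""
    while queue:
        deadline = clock() + slice_seconds
        handle(queue.popleft())  # always make progress, even with a tiny slice
        # Keep working until this slice's time budget is used up.
        while queue and clock() < deadline:
            handle(queue.popleft())
        tickle()  # heartbeat between slices
```

The time between heartbeats is then bounded by the slice length plus the time for one item, rather than by the total amount of queued work.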

In summary, here are general ideas to keep in mind:

  • Tie tickling the watchdog timer with the sanity of your application

  • Do what is reasonable to improve the predictability of the realtime behavior of your program

Hopefully the tips above will help you do this.

References

[1] http://techthoughts.typepad.com/managing_computers/2007/09/monitoring---a-.html
[2] http://linux-ha.org/softdog
[3] http://linux.die.net/man/8/apphbd
[4] http://linux-ha.org/
[5] http://en.wiktionary.org/wiki/your_mileage_may_vary
[6] http://en.wikipedia.org/wiki/White_box_testing
[7] http://library.gnome.org/devel/glib/unstable/glib-The-Main-Event-Loop.html
[8] http://sourcefrog.net/weblog/software/languages/java/java-threads.html
[9] http://www-03.ibm.com/press/us/en/pressrelease/21232.wss
[10] http://domino.research.ibm.com/comm/research_projects.nsf/pages/metronome.index.html

September 21, 2007

Tools for monitoring services

For the purposes of this posting, I'm concentrating more on things that can tell you if a particular service is working or not working and recover a failed service, and less on a datacenter-wide view of service health.

As I discussed in my previous posting on this topic[1], there are three basic ways to monitor a service:

  1. See if the process providing the service is alive
  2. Use the service API to determine if the service is well (simplified version: check to see if someone is listening on the port)
  3. Instrument the service to periodically heartbeat (check in with) a watchdog program or device
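
The first two methods can be sketched with the Python standard library alone.  This is a simplified illustration with function names of my own choosing, not how any of the tools below are implemented; the type 1 check assumes a POSIX system, and the type 2 check is the simplified "is someone listening on the port" version.

```python
import os
import socket


def process_alive(pid: int) -> bool:
    """Type 1: is a process with this PID present? (POSIX only.)"""
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without sending a signal
    except ProcessLookupError:
        return False
    except PermissionError:
        return True      # the process exists but belongs to another user
    return True


def port_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Simplified type 2: does something accept TCP connections on host:port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A type 1 check passes for a process that is alive but hung, and a port check passes for a service that accepts connections but answers with garbage, which is why a real type 2 check speaks the service's own protocol.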

There are a lot of programs to monitor services using these three methods.  Many of the open source programs to provide these kinds of services are listed on the Linux-HA service monitoring page[2].  Rather than reproduce that page here, I'll discuss a few of the better-known programs found there. 

  • Monit[3] is very highly thought of by many sysadmins, and is often used for service monitoring and restart.  The information in this post about monit[3] was provided by the most excellent Christian Wilken.

    Monit is a monitoring tool which can take the necessary action to ensure service availability. It can monitor services locally or remotely by polling a specific port (type 2 monitoring). It can also monitor binary files; for example, you might want monit to monitor your apache binary or any other binary on the system. It checks the md5sum and the octal permissions of the file, warns the administrator if something has changed, and unmonitors the service. It can also check the uid and gid of the file.

    Through the integrated web interface you can see uptime, cpu load and memory consumption for each service monitored. You can decide what to do when a service fails.

    Here's an example: you want to monitor apache (port 80) and if it fails you want to restart the service. If it fails to restart apache (after let's say 3 tries) you can set it up to raise an alarm (and send a mail to the admins). You can also set up the apache service monitoring to depend on apache_bin (the binary apache file). That means if the apache binary has changed the apache service is unmonitored until you take action.

    Monit also monitors its own binaries so if you (or someone else for that matter) update the monit code you will also get an alarm and you have to take action. However, it does not monitor its own operation, or register itself with some other monitoring service like /dev/watchdog or apphbd.   Monit is a very flexible monitoring/automation tool and its config file is simple and easy to learn.  Monit is a simple yet effective tool.
  • Mon[4] is another well-thought-of monitoring tool.  It is a straightforward Perl program which has goals very similar to Monit.  Mon monitors services, and then takes actions when services are found to have failed.  For example, it will notify an administrator, or restart a service.  It comes with scripts which are capable of monitoring a large number of services.  It has been around for many years, and has many happy users.  It is simple to configure, normally monitors remotely, supports dependencies between monitors, typically is used as a type 2 monitoring service.  Like most similar services, nothing normally monitors mon's operation to see if it is still running, or operating correctly.
  • Nagios[7] is a very flexible system management package which provides for a lot of capabilities for monitoring servers and services  and taking a variety of actions when failures occur.  It is designed to watch and manage your entire data center.  For the most part, it uses SNMP to do that - however it can use other methods for monitoring as well.  One of the more interesting features of Nagios is that it supports service dependencies.  Normally, each service is defined by hand, and it can take a wide variety of actions when things fail, including escalation actions.  However, it does not natively provide process restarting.   Recently Gerard Petersen has documented[8] a way of getting nagios to do this with a consistent set of rules.  Understanding his method of doing this generically requires an extensive knowledge of Nagios and Gerard describes it as "somewhat cumbersome".  Nevertheless, it does add this capability to Nagios in a generic way.  It is a "type 1" monitoring system (done remotely with ssh) - it only checks if the process is alive.  You can tie this capability together with real service functionality checks (type 2) using dependencies.  However, if the service can't be stopped gracefully, and it has to do a kill -9 to stop it, then the service may not work properly afterwards.  This is a pretty unsurprising caveat.  However, it will attempt to kill the process and recover without manual intervention.  Nagios works remotely and performs both type 1 and type 2 checks (remotely), and implements service dependencies.

    It is worth noting that this Nagios technique will not work for servers which are part of an HA cluster, because there is no fixed association between a server name and which services are running on it.  You could use it to monitor the HA server processes if you like, but usually the HA software will do that itself.  Note that when you combine management services, you need to make sure you understand which management service is going to control what, know how each affects the other and  keep them strictly out of each other's way.

    Nagios does much more than just monitor services and restart daemons, however.  It keeps statistics, and provides historical and near-real-time performance and load graphs, provides an extensive set of alerts, and performs a variety of other system management functions. I don't know if/how the central Nagios processes are monitored.

  • apphbd[9] is part of the Linux-HA[10] suite of programs.  It performs type 3 monitoring of properly instrumented services.  That is, you can write a program which connects to apphbd, and sends it heartbeats periodically.  When your application doesn't send a heartbeat message by the expected time, or exits without informing apphbd of its intention to exit, then apphbd will notify another application of this condition, which will restart your application.  Apphbd normally registers with /dev/watchdog to send heartbeats to the kernel watchdog device.  As a precaution against false reboots, apphbd locks itself into memory and runs with soft realtime priority.

  • Heartbeat[10] from the Linux-HA project is typically thought of as a clustering package.  However, it is perfectly happy to manage a single system - which it thinks of as a cluster with one node.  It is capable of doing both type 1 and type 2 checks, implements service dependencies, and will automatically restart services if they die, including services which depend on the failed service.  The configuration is in XML, and isn't difficult but can be tedious.

    For the case of monitoring a single machine, I wrote a prototype script[11] which looks in your init.d directories for the set of active services, and creates a cib.xml configuration file for those services.  Heartbeat will then poll to monitor those services and ensure that they are running, and restart them (and any services which depend on them) automatically.  Because the configuration this script creates relies solely on the init scripts, the monitoring is done by using the status actions of the LSB.  As a result, for the most part this is type 1 monitoring.  To do type 2 service monitoring, it is necessary to create an R2-style Heartbeat configuration and make sure the services to be monitored have a repeating monitor action declared for them.  Of course, the script mentioned before makes sure all this happens.

    If you wish to do more thorough monitoring, you can convert these scripts to OCF resource agents, modify the configuration and perform any kind of monitoring you wish.  Heartbeat itself can be monitored by SNMP or CIM - which lets you easily include Heartbeat in data-center-wide monitoring via Nagios, OpenNMS or another package.  In addition, if you wish to protect yourself against OS hangs and crashes and hardware problems, you can easily add servers to your one-node cluster, make some small rule changes, and now your service will continue running even if a server or OS dies.  The process of changing from a 1-node to an n-node cluster isn't at all complicated, provided one has things configured properly and uses the autojoin[12] feature.

    It is worth noting that the action which Heartbeat takes when a service can't be stopped is quite different from Nagios'.  In Heartbeat, when a resource (service) can't be stopped, Heartbeat would normally reboot the machine.  Although this sounds drastic, it can get most services running even when the problem is in the kernel.  Since normally Heartbeat is managing a cluster, the service would normally be taken over immediately by another node in the cluster.

    With regard to monitoring Heartbeat itself: it monitors its own operation, and it can be configured to heartbeat either with a watchdog device like softdog, or with apphbd (discussed above).  When you put it into a cluster, then the nodes of the cluster monitor each other, and use STONITH to reboot machines which have stopped working.  If you use Heartbeat, then you get as complete a set of monitoring as is available.

  • cl_respawn is a simple tool which is packaged as part of the Heartbeat[10] package.  cl_respawn takes arguments which tell it how often to heartbeat with apphbd, and the value of a "magic" exit code.  When the application exits with this magic exit code, cl_respawn does not restart its child process.  Although it only does type 1 monitoring, it doesn't poll to see if a process has died; the OS gives it a signal and it responds immediately.  When you invoke a process with cl_respawn, it is necessary to make sure it does not fork off and run in the background.

  • Supervise[13] from D. J. Bernstein's daemontools[14] replaces the usual init.d with a service directory, so that things are installed, stopped and started and otherwise managed in a non-standard way.  However, as supervise is part of this whole process, service monitoring and restarting come along for free.  When a daemon is started it becomes a child of supervise.  When a daemon dies, supervise is notified and the daemon is automatically respawned immediately without waiting for a poll interval.  Supervise itself does not appear to be monitored for failures by svscan, which appears to be unmonitored.  Daemontools does not appear to support service dependencies.  Like cl_respawn, when you invoke a process with daemontools it is necessary to make sure it does not fork off and run in the background.

  • Watchdog drivers can either be implemented in software like softdog[15] or they can be implemented in hardware.  All the watchdog drivers that ship with Linux share the same API - that implemented by softdog[15].  As a result, a program which works with softdog can work quite nicely with the hardware-based watchdog drivers.  It's worth noting that kernel hangs which completely disable kernel timer services can cause softdog to malfunction.  The good news is that hangs like this are extremely rare.  If one has a hardware-based watchdog device, it takes a hardware failure for it to fail to reset a hung machine.  This, too, is extremely rare.  If one is a true belt-and-suspenders sysadmin (with enough time and budget), one could use both a software and a hardware-based watchdog.

  • ldirectord[16] is a specialized monitoring system which is delivered as a subpackage of the Linux-HA suite.  It monitors servers and services in a load balancing cluster, takes dead servers out of the load balancing rotation, and restarts services when they stop responding.  It performs type 2 monitoring and, because it's an HA resource, is normally monitored by Heartbeat itself.
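
To make the "respawn on exit" idea behind tools like cl_respawn and supervise a little more concrete, here is a minimal Python sketch.  It is not how either tool is implemented; the function name respawn, the default magic exit code of 100, and the max_restarts safety limit are all assumptions made up for this illustration.  The point it shows is that the parent blocks in wait() rather than polling, and that the child must stay in the foreground for this to work.

```python
import subprocess
import time


def respawn(argv, magic_exit_code=100, max_restarts=None, restart_delay=0.0):
    """Run argv as a foreground child and restart it whenever it exits.

    Returns the number of times the child was started.  The loop ends when
    the child exits with magic_exit_code (the "do not restart me" signal),
    or after max_restarts restarts (a hypothetical safety limit).
    """
    starts = 0
    while True:
        child = subprocess.Popen(argv)
        starts += 1
        # wait() blocks until the OS reports the child's exit - no polling.
        status = child.wait()
        if status == magic_exit_code:
            return starts
        if max_restarts is not None and starts > max_restarts:
            return starts
        time.sleep(restart_delay)
```

If the child daemonizes itself (forks and exits), wait() returns right away and the loop would start a second copy, which is why both tools insist that the process stay in the foreground.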

See Also
Howtoforge articles on monitoring[17].

Related systems - not quite on topic

Ganglia[5] is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids. It is based on a hierarchical design targeted at federations of clusters. It relies on a multicast-based listen/announce protocol to monitor state within clusters and uses a tree of point-to-point connections amongst representative cluster nodes to federate clusters and aggregate their state.

Although it doesn't attempt to repair problems, it is widely used in high-performance clusters and might be of interest to people reading this post.

OpenNMS[6] is a Java-based enterprise grade open source network monitoring platform. It consists of a community supported open-source project as well as a commercial services, training and support organization.

The goal is for OpenNMS to be a truly distributed, scalable platform for all aspects of the FCAPS network management model, and to make this platform available to both open source and commercial applications.

OpenNMS is oriented towards responding to SNMP data and traps, but can be modified to monitor other things.  It does not appear to me to be well-suited to recovering from a service or process hanging, nor for monitoring a single system.  Where it shines is in managing an entire data center with a large number of SNMP-enabled servers and network devices.

References

[1]  http://techthoughts.typepad.com/managing_computers/2007/09/monitoring---a-.html
[2]  http://linux-ha.org/RelatedTechnologies/MonitoringSoftware
[3]  http://www.tildeslash.com/monit/
[4]  http://mon.wiki.kernel.org/index.php/Main_Page
[5]  http://ganglia.info/
[6]  http://www.opennms.org/
[7]  http://www.nagios.org/
[8]  http://www.gp-net.nl/2007/05/12/nagios-remote-eventhandling-for-init-scripts/en/
[9]  http://linux.die.net/man/8/apphbd
[10] http://linux-ha.org/
[11] http://hg.linux-ha.org/dev/raw-file/8c5da9553636/tools/1node2heartbeat
[12] http://linux-ha.org/ha.cf/AutojoinDirective
[13] http://cr.yp.to/daemontools/supervise.html
[14] http://cr.yp.to/daemontools.html
[15] http://en.wikibooks.org/wiki/Linux_Kernel_Drivers_Annotated/Character_Drivers/Softdog_Driver
[16] http://linux.die.net/man/8/ldirectord
[17] http://www.howtoforge.com/taxonomy_menu/1/59

September 15, 2007

Service Monitoring - basics of a key part of automated management

Software failures are pretty common.  If you've used software, you've seen software fail.  Typically they're among the most common types of failures, with only power failures being more common.  So, if you want your service to work reliably, something is going to have to monitor the software that implements it to see if it's working.  Software outages are estimated at around 25% of the total of unplanned outages[1].  Like most general estimates measured against other people's computers - YMMV - Your Mileage May Vary[2].

A sufficient condition for program triviality is that it have no bugs

So, it seems pretty obvious that this is a pretty big chunk of potential outages to watch over and try and eliminate or recover from if you can.  Of course, before you can recover from a problem you have to know there's a problem.  That's where monitoring comes in.  Monitoring software is software that monitors (or watches) other software.

Local or Network Monitoring

There are lots of packages for monitoring software available in the wild.  I tend to favor monitoring software that runs on the machine being monitored - it works even when the network is down, it minimizes the network traffic, and security concerns are greatly reduced.  But, it can be relatively painful to administer since you have to install and configure software on each machine, and you have to separately monitor the servers.  If you have thousands of servers you're monitoring in this way, it's potentially very painful (at least without the right tools for managing it centrally).  Since I like clusters, I'll mention that if you're running a cluster, this monitoring can be administered centrally on the cluster, making that much simpler than it would otherwise be - in fact it may come "for free" when you configure the services in the cluster.

How to Monitor Services?

You have software on your server which implements services.  So, how are you going to monitor it?   There are basically two different ways to monitor software externally - the easy way, and the not-so-easy way.

  1. A really simple way to monitor a service is to just look to see if it's still running.  If you're on a UNIX-like machine, you can do a ps to see if it's still running, and if it is, say "Good service!", and if it's not, say "Bad service - Go to your room!".  If the majority of your software failures result in the process exiting (leaving a core or not), then this will work just fine for you.  This is how most init script status actions work.  However, if the software hangs, or gives crazy or otherwise incorrect results, this method won't detect that kind of misbehavior for you - you'll need a less-simple method.  It's also worth noting that, simple as this is, you have to be on the machine that's running the service.  You can't directly see the process table of a remote machine.  This technique is the basis for how the UNIX respawn init directive works - when a service exits it gets automatically restarted.  Simple, but usually too simple - which is why it only sees limited use in practice.  You could argue that good HA systems spend half of their time implementing the respawn directive right ;-).
  2. A not-so-simple way to monitor a service is to use the service and see if the result you got was reasonable, and completed in a reasonable time.  For example, if you are monitoring a web server, you might do an HTTP GET operation on the port the server is on, and see if you get reasonable-looking HTML and a non-error return code in a "reasonable" time.  The Linux-HA[3] apache[4] resource agent does exactly this using the wget[5] command.  If you have a database, you might do a short query whose answer is easily sanity-checked.  This is what the Linux-HA[3] DB2[6] resource agent does (as do several other database resource agents).  Another advantage of this technique is that you can often perform it remotely - which works out well if you want to monitor everything centrally.
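
As a rough illustration of the two approaches, here is a short Python sketch.  The function names, the 5 second timeout and the "<html" marker are assumptions chosen for the example, not anything the Linux-HA resource agents use.  The first function checks a known process ID rather than parsing ps output, and like any type 1 check it only works on the machine where the process lives.  The second does an HTTP GET and sanity-checks the reply, so it can be run from anywhere.

```python
import http.client
import os
import urllib.error
import urllib.request


def process_is_running(pid):
    """Type 1 check: does a process with this PID exist on this machine?"""
    try:
        os.kill(pid, 0)  # signal 0 delivers nothing; it only checks the PID
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def web_service_is_sane(url, timeout=5.0, expect=b"<html"):
    """Type 2 check: use the service and sanity-check the answer."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read(65536)
            return response.status == 200 and expect in body.lower()
    except (urllib.error.URLError, http.client.HTTPException,
            OSError, ValueError):
        # Refused connections, timeouts, HTTP errors and bad URLs all
        # count as "the service is not healthy".
        return False
```

Note that a hung web server passes the first check and fails the second, which is exactly the difference between the two methods.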

Is there some kind of standard for monitoring UNIX-like services?

In fact, there are a couple of them and Linux-HA[3] implements them both.  These two standards are the LSB (Linux Standard Base[7]) and the OCF (Open Cluster Framework[8]) standards.

  • The Linux Standard Base defines a standard for init scripts[9] which tells how to start, stop, and determine the status of a service.  It's the status part that we care about here - because this implements a poor-man's simple monitoring operation.
  • The Open Cluster Framework[8] defines a standard for resource agents[10] (RAs) which tells how to start, stop, and (lucky for us) monitor a service.  In fact, the monitor action is interesting, because it defines multiple levels of monitoring.  Maybe you want to do something lightweight frequently, and something heavier-weight less often.  If so, then the OCF RA standard may be for you.  For the interested, it's simple to convert an init script into an OCF resource agent - since the OCF RA specification was built on top of the LSB init specification.

Both of these standards only work for monitoring services locally.  That is, you have to be on the server to use either.  So, you can't use them if you want to monitor services remotely.
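
For the LSB case, the status action reports its result through the exit code of the init script.  The sketch below interprets those codes as I understand them from the LSB init script specification[9]; the function names and the example script path are assumptions for illustration only.

```python
import subprocess

# Exit codes of the "status" action as defined by the LSB init script spec.
LSB_STATUS = {
    0: "running",
    1: "dead, but pid file exists",
    2: "dead, but lock file exists",
    3: "not running",
    4: "status unknown",
}


def describe_lsb_status(exit_code):
    """Translate an LSB status exit code into a human-readable string."""
    return LSB_STATUS.get(exit_code, "reserved or application-specific code")


def init_script_status(script_path):
    """Run '<script> status' and return (is_running, description)."""
    result = subprocess.run([script_path, "status"],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0, describe_lsb_status(result.returncode)
```

A call such as init_script_status("/etc/init.d/apache2") (a hypothetical path) has to run on the server itself, which is the local-only limitation described above.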

Quis custodiet ipsos custodes?  - Who will watch the watchers?

If your monitoring software fails, how would you know?  Since it's software and, like all software, might fail, who's monitoring it to make sure it works?  As you can see, this is a potentially endless problem.  In my experience, no one has any endless solutions - only endless problems.  But, there is a reasonable way to approach it - create a hierarchy of watchers.

Applications (which can also watch themselves) are watched by a watcher program, which in turn registers with a watchdog driver.  In Linux, that can be either the softdog[11] watchdog driver or a hardware watchdog driver.  To use a watchdog driver, the watching program has to check in periodically and tickle the driver.  If it doesn't, then the system reboots.  That's a good reason to make the top level watcher as simple as possible.  How to tickle a watchdog optimally is good fodder for a future post.
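
A minimal sketch of such a top level watcher in Python might look like the following.  It assumes the usual Linux convention that any write to the opened watchdog device restarts its countdown; the function name, the is_healthy callback and the 10 second interval are invented for this example, and the interval has to be shorter than the driver's timeout.

```python
import time


def tickle_watchdog(device, is_healthy, interval=10.0, sleep=time.sleep):
    """Tickle an open watchdog device for as long as is_healthy() is true.

    device is a writable binary file object, normally obtained with
    open("/dev/watchdog", "wb", buffering=0).  Returns the number of
    tickles written.  When the health check fails we simply stop writing,
    so the watchdog's timer expires and the machine is reset.
    """
    tickles = 0
    while is_healthy():
        device.write(b"\0")  # any write restarts the watchdog's countdown
        device.flush()
        tickles += 1
        sleep(interval)
    return tickles
```

Taking the device and the sleep function as parameters is a design choice made so the loop can be exercised against an in-memory file; trying it against the real /dev/watchdog on a machine you care about is a good way to get an unplanned reboot.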

Unfortunately, the standard Linux softdog driver only allows one program at a time to use it.  Unless you want to restrict yourself to only one watcher on a system, you have to have some kind of watchdog program that the watchers register with.  The Linux-HA project provides the apphbd[12] (application heartbeat daemon) designed expressly for this purpose.  So, you register with apphbd, and apphbd registers with the watchdog driver - which watches it.  Apphbd can watch as many programs as you want - but they have to register with it, and they have to tickle it (send it heartbeats) periodically.  That means your application has to be modified to do it - which is different from the two earlier approaches.  Just like the watchdog driver, you have to tickle apphbd periodically, or it notifies someone who will take a recovery action.  Fortunately, however, apphbd doesn't reboot the machine - normally the application just gets restarted if it dies or stops sending heartbeats.  Avoiding unnecessary reboots is generally thought to be a good thing  ;-) - so most people like this a lot.
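
To show the bookkeeping a heartbeat daemon of this kind has to do, here is a toy in-process model in Python.  It is emphatically not the apphbd client API (which I am not reproducing here); the class and method names are hypothetical, and a real daemon would talk to its clients over IPC and trigger a recovery action instead of just returning names.

```python
import time


class HeartbeatMonitor:
    """Toy, in-process model of what a heartbeat daemon keeps track of."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._clients = {}  # name -> (interval in seconds, time of last beat)

    def register(self, name, interval):
        """A client promises to send a heartbeat at least every interval."""
        self._clients[name] = (interval, self._clock())

    def heartbeat(self, name):
        """Record a heartbeat ("tickle") from a registered client."""
        interval, _ = self._clients[name]
        self._clients[name] = (interval, self._clock())

    def unregister(self, name):
        """A clean exit: the client said goodbye, so silence is no failure."""
        del self._clients[name]

    def overdue(self):
        """Return the sorted names of clients whose heartbeat is late."""
        now = self._clock()
        return sorted(name
                      for name, (interval, last) in self._clients.items()
                      if now - last > interval)
```

Anything that shows up in overdue() is a candidate for a restart, while a client that unregistered first is left alone.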

In the process of talking about how to watch watchers, we've added a third method of monitoring applications.  The first two could be done with any application, and the third only by applications you've modified.  To summarize, they are:

  1. Checking to see if the process is running (local only)
  2. Exercising an application API to check for application sanity (local or remote)
  3. "Tickling" a watchdog program or driver (typically local) - requires modifying the application.

But what if the kernel fails, or the driver fails, or the watchdog hardware fails, what do you do then?   Really there's only one thing to do - that's let another machine watch this one - and give it the ability to kill your machine using something like STONITH[13].  Voila!  You've just made an HA cluster.  How does that monitoring machine get monitored?  You let the cluster members monitor each other - resulting in a sort of mutual monitoring society.  Can this fail?  Of course! - given enough problems anything can fail - but it's really unlikely.  Trying to make this kind of problem go away completely helps you discover that paranoia can be a very expensive hobby.

This idea of hierarchical watchers is generally a good pattern to follow for monitoring.  For example, have each machine monitor its own services, and monitor the machines in a cluster, then monitor the clusters centrally.  This minimizes network traffic, and is scalable to very large environments.

I've covered the basic concepts of service monitoring, without quite getting around to talking about how to implement it in practice.  This will be covered in a future posting.

References
[1] Gregory Pfister - In Search Of Clusters - Second Edition. p. 390, 1998, Prentice Hall
[2] http://en.wiktionary.org/wiki/your_mileage_may_vary
[3] http://linux-ha.org/
[4] http://httpd.apache.org/
[5] http://www.gnu.org/software/wget/
[6] http://www.ibm.com/db2
[7] http://www.linuxbase.org/
[8] http://opencf.org/
[9] http://www.linuxbase.org/spec/refspecs/LSB_3.0.0/LSB-Core-generic/LSB-Core-generic/iniscrptact.html
[10] http://www.opencf.org/cgi-bin/viewcvs.cgi/specs/ra/resource-agent-api.txt?rev=HEAD
[11] http://en.wikibooks.org/wiki/Linux_Kernel_Drivers_Annotated/Character_Drivers/Softdog_Driver
[12] http://pwet.fr/man/linux/administration_systeme/apphbd
[13] http://linux-ha.org/STONITH or http://en.wikipedia.org/wiki/STONITH