
November 2007

November 27, 2007

Quorum Server Illustrated - updated

In two earlier posts [1] [2], I gave brief descriptions of the quorum server which seem to have left as much confusion as they provided clarity.  This post is only about the Linux-HA quorum server, and includes illustrations for clarity.

The Linux-HA Quorum API

In the Linux-HA quorum API, you can configure a number of quorum modules which are used as follows.  If a quorum module returns HAVEQUORUM, then the cluster has quorum.  If it returns NOQUORUM then the cluster does not have quorum.  If a quorum module returns QUORUMTIE, then the next quorum module in the list is consulted.  If the final module returns QUORUMTIE, then it is treated as a NOQUORUM event.

The quorum daemon is normally used in conjunction with the normal arithmetic voting quorum module, so that it is only consulted when the number of nodes a partition can see is exactly half the number of nodes configured in the cluster.  So, it is worth noting that the quorum server will never be consulted if a cluster has an odd number of nodes.
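
The consultation order described above can be sketched in a few lines of Python.  This is only an illustration, not the Linux-HA implementation: the function names (majority_module, quorum_server_module, evaluate_quorum) and the way the quorum server's answer is passed in are assumptions made for the example.  Only the three result names come from the description above.

```python
from enum import Enum
from typing import Callable, Iterable


class Quorum(Enum):
    HAVEQUORUM = "HAVEQUORUM"
    NOQUORUM = "NOQUORUM"
    QUORUMTIE = "QUORUMTIE"


def majority_module(nodes_up: int, nodes_configured: int) -> Quorum:
    """Arithmetic voting: more than half wins, exactly half is a tie."""
    if 2 * nodes_up > nodes_configured:
        return Quorum.HAVEQUORUM
    if 2 * nodes_up == nodes_configured:
        return Quorum.QUORUMTIE
    return Quorum.NOQUORUM


def quorum_server_module(granted_by_server: bool) -> Quorum:
    """Hypothetical stand-in for asking the quorum server (no networking)."""
    return Quorum.HAVEQUORUM if granted_by_server else Quorum.NOQUORUM


def evaluate_quorum(modules: Iterable[Callable[[], Quorum]]) -> Quorum:
    """Consult modules in order; a tie from the last one counts as NOQUORUM."""
    for module in modules:
        result = module()
        if result is not Quorum.QUORUMTIE:
            return result
    return Quorum.NOQUORUM


if __name__ == "__main__":
    # Two-node cluster, one node visible, quorum server grants us quorum.
    result = evaluate_quorum([
        lambda: majority_module(1, 2),
        lambda: quorum_server_module(True),
    ])
    print(result.name)
```

With an odd number of configured nodes, majority_module can never return QUORUMTIE, so the second module in the list is never reached.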

Quorum Server Scenarios

Below, I'll go through the basic quorum server cases so you can see how all this works in more detail - with pictures, even!

Normal Situation - Everything up
[Figure: quorum server - normal operation, everything up]

In the picture above, everything is normal.  The quorum server is up, and both sites are also up.  Because the cluster has all its nodes up, the quorum server is irrelevant.

Single Site Failure
[Figure: quorum server - New Jersey site failed]

In the situation above, we show the "New Jersey" site as down.  The conventional voting quorum now has a tie (1/2 - exactly half of the nodes), so the quorum server is consulted.  Since only New York is talking to the quorum server, the quorum server grants quorum to the New York site.

Split Brain Avoided
[Figure: quorum server - split brain avoided]

In the case above, the link between the sites has been lost, but both sites and the quorum server are all up.  In this case, both New York and New Jersey contact the quorum server because each sees 1/2 nodes as being up - resulting in a tie condition.

The quorum server will choose one of the two sites to provide quorum to; here I assume that New York was chosen.  Because New Jersey wasn't granted quorum, it will shut its resources down.

What happens when the quorum server goes down?
[Figure: quorum server failed, both sites up]

That is the situation shown above.  Because New York and New Jersey are both up, they have 2/2 votes and both provide service as they should.  This illustrates the point that the quorum server is not a single point of failure.

Multiple Failures -> Loss of Service

[Figure: multiple failures, no service]

In this final case, multiple failures have occurred - both New Jersey and the quorum server are down.  New York sees a tie but cannot reach the quorum server, so it doesn't have quorum; it shuts down its services and none are provided by any node in the cluster.  Of course, this situation can be overridden in the cluster configuration by changing the quorum policy, but from an automated perspective, this is all that can be (should be) done.
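
The five scenarios can be replayed with a small Python sketch.  It is a simplified model written for this walkthrough, not code from Linux-HA: the function site_provides_service, the scenario table, and the flag that stands in for the quorum server's choice are all hypothetical, and the cluster is assumed to have exactly two nodes, one per site.

```python
def site_provides_service(nodes_seen: int, nodes_configured: int,
                          server_reachable: bool, server_picks_us: bool) -> bool:
    """Return True if a site keeps its resources running.

    nodes_seen counts the nodes this site can talk to, including itself.
    server_picks_us models the quorum server's choice; it only matters
    when the vote is tied and the server is reachable.
    """
    if 2 * nodes_seen > nodes_configured:
        return True        # clear majority, quorum server not consulted
    if 2 * nodes_seen < nodes_configured:
        return False       # clear minority
    return server_reachable and server_picks_us   # tie: ask the quorum server


# (scenario, New York's view, New Jersey's view); None means the site is down.
# Each view is (nodes_seen, server_reachable, server_picks_us).
SCENARIOS = [
    ("everything up", (2, True, True), (2, True, True)),
    ("New Jersey down", (1, True, True), None),
    ("inter-site link lost", (1, True, True), (1, True, False)),
    ("quorum server down", (2, False, False), (2, False, False)),
    ("New Jersey and quorum server down", (1, False, False), None),
]


def outcome(view) -> str:
    """Describe what a site does given its view of the cluster."""
    if view is None:
        return "down"
    nodes_seen, reachable, picked = view
    serving = site_provides_service(nodes_seen, 2, reachable, picked)
    return "serving" if serving else "stopped"


for name, new_york, new_jersey in SCENARIOS:
    print(f"{name}: New York {outcome(new_york)}, New Jersey {outcome(new_jersey)}")
```

In the link-lost row the model assumes, as the text does, that the quorum server picked New York.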

Security Concerns

If you want to run your quorum server communications across networks which mig

November 12, 2007

Alan eats his own cl_respawn dog food. Yum!!

In this posting, I show how to use cl_respawn[1] to monitor my system logging and help keep it running, and along the way, I improved cl_respawn a little as well.  In addition, I explain why I couldn't just use the respawn directive in /etc/inittab[5] (and why you probably can't either).   I first talked about cl_respawn in one of my first blog posts[6].

The problem

When we run our automated CTS[2] tests for Linux-HA[3] we rely on the guaranteed log entry delivery provided by syslog-ng[4].  Basically, we redirect all our logs in a test cluster to a test overseer machine, and then CTS watches this consolidated log for errors and correct behavior.

This is a nice system and it works pretty well, but it relies on the reliability of syslog-ng.  For the most part, that's just fine.  But, sometimes syslog-ng just stops running.  Then the tests show that Heartbeat has failed, but it's really just syslog-ng that's crashed on me.  So, in the past I added some code to CTS to make it test the logging after every error, and then hit the machines over the head with a hammer and restart logging if logging wasn't working.

This was sort-of OK, because it meant subsequent tests would run fine, but the one test would show failed - even though it probably succeeded.  This would be fine, except that one of my machines (my oldest and slowest) had syslog-ng die on it a few times a day.  I don't know why, and as long as I can live with it for my testing, I don't much care.  I just want it to work.  (I know, it's a lousy attitude, but I have way more to do than I can possibly do).

The solution

Then I had this revolutionary thought - I could use HA software to make my logging highly available!!

Hold the presses, folks, new headline reads
   "HA guru realizes he can use HA software just like he tells everyone else to do!"

To fix this problem all I had to do was change the init script for syslog to use our cool little cl_respawn tool to babysit the syslog-ng service.  Although I could have used Heartbeat to monitor this service, it seemed like overkill and would have conflicted with CTS.

So, I set out to use cl_respawn to restart syslog-ng quickly - minimizing but not eliminating the possibility of losing important log messages.

When I looked at the init scripts (they're from SUSE Linux), they had these statements in them:

  • For starting
    startproc -p ${syslog_pid} ${BINDIR}/${syslog} $params
  • For stopping
    killproc -p ${syslog_pid} -TERM ${BINDIR}/${syslog} ; rc_status -v
  • For status
    checkproc -p ${syslog_pid}      ${BINDIR}/${syslog}; rc_status -v

My first thought was that all I had to do was insert cl_respawn ahead of the ${BINDIR}/${syslog} and I'd be done.  Well.... not quite...

If I had done that, then the pid file for the service ${syslog_pid} would have pointed not to cl_respawn, but to syslog-ng.  So, when I tried to shut down syslog, cl_respawn would have just respawned it.  OOPS.  Not quite the right effect.

What was necessary was for the syslog pid file to contain the pid of cl_respawn, not the pid of syslog-ng.  One minor problem - the author of cl_respawn didn't deal with pidfiles.  To fix that, I added support for a -p option to tell it the name of the pid file to use.

Now I try it.  Uh-oh... It didn't work.  The logs are quickly filled with attempts to start  ${syslog} and having it fail continually with  socket in use.  What was all that about?

By default, syslog-ng forks itself into the background,  and its parent process exits.  That makes cl_respawn think it's died - so it restarts it - and it fails ad infinitum.  So, I read the man page for syslog-ng and discover the -F option to keep it from forking.  Without that, cl_respawn can't tell when it dies.
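
To make the two problems concrete (whose pid goes in the pid file, and why a forking daemon confuses the babysitter), here is a minimal Python sketch of a respawn loop.  It is not how cl_respawn is implemented; the function name supervise, its parameters, and the cap on the number of starts are hypothetical and exist only so the example terminates.

```python
import os
import subprocess
import sys
import tempfile
import time


def supervise(argv, pidfile, max_starts=3, delay=0.1):
    """Run argv in the foreground and start it again whenever it exits.

    The pid file records the supervisor's own pid, so a stop script that
    kills the pid in the file stops the supervisor rather than the child
    (which would simply be respawned).  Returns how many times the child
    was started.  A real tool would loop until told to stop.
    """
    with open(pidfile, "w") as f:
        f.write(f"{os.getpid()}\n")
    starts = 0
    try:
        while starts < max_starts:
            child = subprocess.Popen(argv)
            starts += 1
            # wait() returns when the process we started exits.  A daemon
            # that forks into the background makes it return at once, which
            # looks exactly like a crash, so the child must stay in the
            # foreground for this loop to be meaningful.
            child.wait()
            time.sleep(delay)
    finally:
        os.unlink(pidfile)
    return starts


if __name__ == "__main__":
    # Demonstration with a child that exits immediately.
    demo_pidfile = os.path.join(tempfile.mkdtemp(), "demo.pid")
    print(supervise([sys.executable, "-c", "pass"], demo_pidfile, max_starts=2))
```

A child that daemonizes itself would be restarted on every pass through the loop while the earlier copy still holds its socket, which matches the "socket in use" failures described above.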

Along the way, I read the code, find a couple of other minor bugs and fix them.  I update my init script and now it looks like this:

  • For starting
    startproc -p ${syslog_pid} /usr/bin/cl_respawn -p ${syslog_pid} ${BINDIR}/${syslog} -F $params
  • For stopping
    killproc -p ${syslog_pid} -TERM /usr/bin/cl_respawn ; rc_status -v
  • For status
    checkproc -p ${syslog_pid}      /usr/bin/cl_respawn; rc_status -v

Of course, if you don't run SUSE Linux, then your init scripts will look somewhat different, but I'm sure you'll figure it out.

Why not just use respawn in inittab?

Those of you who know UNIX administration to any degree realize that /etc/inittab[5] has a respawn directive you can give it.  Why wouldn't that do the trick?  The short answer is service dependencies.   The longer answer is below:

  • Logging depends on other /etc/init.d services, so you don't want it to start until after those other services (like the network) are started.  The LSB init script system supports these dependencies and starts things in the right order.
  • Other services depend on logging.  A number of other services can't start until after logging starts.  If you try and disable the /etc/init.d/syslog service on your machine so you can start it with respawn from /etc/inittab, havoc ensues - because these other services won't start until the /etc/init.d/syslog service is started.  If you disable it, they won't start.
  • What fun would that be?  I mean, if we wrote this cl_respawn tool, we probably ought to use it ;-).

What did I learn?

  • How to use cl_respawn in real life
  • Some missing requirements for cl_respawn
  • I was reminded of the advantages of using our own software
  • How handy simple little tools like cl_respawn can be

[1] http://hg.linux-ha.org/dev/file/tip/tools/cl_respawn.c
[2] http://linux-ha.org/CTS
[3] http://linux-ha.org/
[4] http://www.balabit.com/network-security/syslog-ng/opensource-logging-system/
[5] http://www.freebsd.org/cgi/man.cgi?query=inittab&manpath=Red+Hat+Linux%2Fi386+9&format=html
[6] http://techthoughts.typepad.com/managing_computers/2007/09/tools-for-servi.html

November 04, 2007

Availability, MTBF, MTTR and other bedtime tales

If we let A represent availability, then the simplest formula for availability is:
    A = Uptime/(Uptime + Downtime)

Of course, it's more interesting when you start looking at the things that influence uptime and downtime.  The most common measures that can be used in this way are MTBF and MTTR.

    MTBF is  Mean Time Between Failures
    MTTR is Mean Time To Repair   

 A = MTBF / (MTBF+MTTR)

One interesting observation you can make when reading this formula is that if you could instantly repair everything (MTTR = 0), then it wouldn't matter what the MTBF is - Availability would be 100% (1) all the time.
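
A quick way to get a feel for the formula is to plug in numbers.  The short Python sketch below does that; the 1000-hour MTBF and the repair times are made-up figures chosen only for illustration.

```python
def availability(mtbf: float, mttr: float) -> float:
    """A = MTBF / (MTBF + MTTR); both arguments in the same time unit."""
    if mtbf <= 0 or mttr < 0:
        raise ValueError("mtbf must be positive and mttr must not be negative")
    return mtbf / (mtbf + mttr)


MTBF_HOURS = 1000.0  # hypothetical figure, for illustration only

# Same failure rate, progressively faster repair.
for mttr_hours in (4.0, 1.0, 1.0 / 60.0, 0.0):
    a = availability(MTBF_HOURS, mttr_hours)
    print(f"MTTR {mttr_hours:8.4f} h -> availability {a:.6f}")
```

The last row uses MTTR = 0 and shows the point made above: with instant repair the MTBF drops out and availability is 1.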

That's exactly what HA clustering tries to do.  It tries to make the MTTR as close to zero as it can by automatically (autonomically) switching in redundant components for failed components as fast as it can.   Depending on the application architecture and how fast failure can be detected and repaired, a given failure might not be observable at all by a client of the service.  If it's not observable by the client, then in some sense it didn't happen at all.  This idea of viewing things from the client's perspective is an important one in a practical sense, and I'll talk about that some more later on.

It's important to realize that any given data center or cluster provides many services, and not all of them are related to each other.  Failure of one component in the system may not cause failure of the system.  Indeed, good HA design eliminates single points of failure by introducing redundancy.  If you're going to try and calculate MTBF in a real-life (meaning complex) environment with redundancy and interrelated services, it's going to be very complicated to do.

    MTBFx is  Mean Time Between Failures for entity x
    MTTRx is Mean Time To Repair for entity x
    Ax is the Availability of entity x   

Ax = MTBFx / (MTBFx+MTTRx)

In practice, these measures (MTBFx and MTTRx) are hard to come by for nontrivial real systems - in fact, they're so tied in to application reliability and architecture, hardware architecture, deployment strategy, operational skill and training, and a whole host of other factors, that you can actually compute them only very very rarely.  So, why did I spend your time talking about it?  That's simple - although you probably won't compute them, you can learn some important things from these formulas, and you can see how mistakes you make in viewing these formulas might lead you to some wrong conclusions.

Let's get right into one example of a wrong conclusion you might draw from incorrectly applying these formulas.

Let's say we have a service which runs on a single machine, and you put it onto a cluster composed of two computers, each with a certain individual MTBF (Mi), where you can fail over to the other computer ("repair") in a certain repair time (Ri).  With two computers, they'll fail twice as often as a single computer, so the system MTBF becomes Mi/2.  If you compute the availability of the cluster, it then becomes:

    A = (Mi/2) / (Mi/2 + Ri)

Using this (incorrect) analysis for a 1000 node cluster performing the same service, the system MTBF becomes Mi/1000.

    A = (Mi/1000) / (Mi/1000 + Ri)

If you take the number of nodes in the cluster to the limit (approaching infinity), the Availability approaches zero.

    A = 0/(0+Ri) = 0/Ri = 0

This makes it appear that adding cluster nodes decreases availability.  Is this really true?  Of course not!  The mistake here is thinking that the service needed all those  cluster nodes to make it go.  If your service was a complicated interlocking scientific computation that would stop if any cluster node failed, then this model might be correct.  But if the other nodes were providing redundancy or unrelated services, then they would have no effect on MTBF of the service in question.  Of course, as they break, you'd have to repair them, which would mean replacing systems more and more often, which would be both annoying and expensive, but it wouldn't cause the service availability to go down.
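
The difference between the two ways of counting can be seen numerically.  The Python sketch below is an illustration under stated assumptions: the values of Mi and Ri are invented, and the function names are hypothetical.  The first function reproduces the incorrect analysis, where every node failure is charged to the service; the second charges only failures of nodes the service actually needs.

```python
def naive_availability(mi: float, ri: float, nodes: int) -> float:
    """The incorrect analysis: any node failure counts as a service failure."""
    system_mtbf = mi / nodes
    return system_mtbf / (system_mtbf + ri)


def service_availability(mi: float, ri: float, nodes_needed: int = 1) -> float:
    """Count only failures of the nodes the service cannot run without."""
    service_mtbf = mi / nodes_needed
    return service_mtbf / (service_mtbf + ri)


MI_HOURS = 10_000.0   # hypothetical per-node MTBF
RI_HOURS = 0.05       # hypothetical failover ("repair") time, 3 minutes

for cluster_size in (1, 2, 1000):
    naive = naive_availability(MI_HOURS, RI_HOURS, cluster_size)
    actual = service_availability(MI_HOURS, RI_HOURS)
    print(f"{cluster_size:5d} nodes: naive {naive:.6f}, service {actual:.6f}")
```

The naive figure falls as nodes are added, while the figure for a service that runs on one node at a time does not change with cluster size.  Setting nodes_needed to the cluster size gives the interlocking-computation case, where the naive model is the right one.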

To properly apply these formulas, even intuitively, you need to make sure you understand what your service is, how you define a failure, how the service components relate to each other, and what happens when one of them fails.  Here are a few rules of thumb for thinking about availability:

  • Complexity is the enemy of reliability (MTBF).  This can take many forms
    • Complex software fails more often than simple software
    • Complex hardware fails more often than simple hardware
    • Software dependencies usually mean that if any component fails, the whole service fails
    • Configuration complexity lowers the chances of the configuration being correct
    • Complexity drastically increases the possibility of human error
      • What is complex software? - Software whose model of the universe doesn't match that of the staff who manage it.
  • Redundancy is the friend of availability - it allows for quick autonomic recovery - significantly improving MTTR.  Replication is another word for redundancy.
  • Good failure detection is vital - HA and other autonomic software can only recover from failures it detects.  Undetected failures have human-speed MTTR or worse, not autonomic-speed MTTR.  They can be worse than human-speed MTTR because the humans are surprised that it wasn't automatically recovered and they respond more slowly than normal.  In addition, the added complexity of correcting an autonomic service and trying to keep their fingers out of the gears may slow down their thought processes.
  • Non-essential components don't count - failure of inactive or non-essential components doesn't affect service availability.  These inactive components can be hardware (spare machines), or software (like administrative interfaces), or hardware only being used to run non-essential software.  More generally, for the purpose of calculating the availability of service X, non-essential components include anything not running service X or services essential to X.
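
Two of these rules of thumb (dependencies and redundancy) have a simple arithmetic form if you are willing to assume that components fail independently of each other.  That assumption is a textbook simplification and is not something the rules above claim; the function names and the 0.99 figures in this Python sketch are likewise only illustrative.

```python
from math import prod


def series(*availabilities: float) -> float:
    """Service needs every component: availabilities multiply."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """Service needs any one component: it is down only if all are down."""
    return 1.0 - prod(1.0 - a for a in availabilities)


APP = 0.99   # hypothetical availability of one application server
DB = 0.99    # hypothetical availability of the database it depends on

# Dependency: the service needs both the application and the database.
print(f"app and db in series:      {series(APP, DB):.4f}")

# Redundancy: two interchangeable application servers.
print(f"two app servers, parallel: {parallel(APP, APP):.4f}")

# Both together.  Non-essential components are simply left out.
print(f"redundant apps plus db:    {series(parallel(APP, APP), DB):.4f}")
```

A dependency chain is never more available than its weakest member, while a redundant pair is more available than either member alone.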

The real world is much more complex than any simple rules of thumb like these, but these are certainly worth taking into account.