In some ways, an HA system is pretty simple - it starts services, it stops them, and it checks whether they and the computers that run them are still running. But there are a few bits of important "rocket science" hiding among these apparently simple tasks. Much of that rocket science centers on solving a single thorny problem - split-brain. The methods used to solve this problem are quorum and fencing. Unfortunately, if you manage an HA system you need to understand these issues. So this post will concentrate on these three topics: split-brain, quorum, and fencing.
If you have three computers and some way for them to communicate with each other, you can make a cluster out of them, and each can monitor the others to see if a peer has crashed. Unfortunately, there's a problem here - you can't distinguish a crash of a peer from broken communications with that peer. All you really know is that you can't hear anything from it. You're really stuck in a Dunn's law[1] situation - where you really don't know very much, but desperately need to. Maybe you don't feel too desperate yet. Perhaps you think you don't need to be able to distinguish these two cases. The truth is that sometimes you don't, but much of the time you very much need to be able to tell the difference. Let's see if I can make this clearer with an illustration.
Let's say you have three computers, paul, silas, and mark, and paul and silas can't hear anything from mark, and vice versa. Let's further suppose that mark had a filesystem /importantstuff from a SAN volume mounted on it when we lost contact with it, and that mark is actually alive but out of contact. What happens if we just go ahead and mount /importantstuff up on paul? The short answer is that bad things will happen[2]. /importantstuff will be irreparably corrupted as two different computers update the disk independently. The next question you'll ask yourself is "Where are those backup tapes?". That's the kind of question that's been known to be career-ending.
Split-Brain
This problem of subsets of computers in a cluster beginning to operate autonomously from each other is called split-brain[3]. In our example above, the cluster has split into two subclusters: {paul, silas} and {mark}, and each subcluster is unaware of the other. This is perhaps the most difficult problem to deal with in high-availability clustering. Although this situation does not occur frequently in practice, it does occur more often than one would guess. As a result, it's vital that a clustering system have a way to deal with it safely.
Earlier I mentioned that there was information you really want to know, but don't. Exactly what information did I mean? What I wanted to know was "is it safe to mount up /importantstuff somewhere else?". In turn, you could figure that out if you knew the answer to either of these two questions: "Is mark really dead?" or, more directly, "Is mark going to write on the volume any more?" But, of course, since we can't communicate with mark, this is pretty hard to figure out. So, cluster developers came up with a kind of clever way of ensuring that this question can be answered. We call that answer fencing.
Fencing
Fencing is the idea of putting a fence around a subcluster so that it can't access cluster resources, like /importantstuff. If you put a fence between it and its resources, then suddenly you know the answer to the question "Is mark going to write on the volume any more?" - and the answer is no - because that's what the fence is designed to prevent. So, instead of passively wondering what the answer to the safeness question is, fencing takes action to ensure the "right" answer to the question.
This sort of abstract idea of fencing is fine enough, but how is this fencing stuff actually done? There are basically two general techniques: resource fencing[4] and node fencing[5].
Resource fencing is the idea that if you know what resources a node might be using, then you can use some method of keeping it from accessing those resources. For example, if one has a disk which is accessed through a Fibre Channel switch, then one can talk to the switch and tell it to deny the errant node access to the SAN.
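To make this concrete, here's a minimal sketch of resource fencing in Python. The SanSwitchClient class is a hypothetical stand-in for the switch's management interface - real switches expose this through vendor-specific CLIs or SNMP, so none of the names below are real APIs:

```python
# A minimal resource-fencing sketch. "SanSwitchClient" is a hypothetical
# stand-in for a Fibre Channel switch's management interface; real switches
# use vendor-specific CLIs or SNMP for this.

class SanSwitchClient:
    def __init__(self, address):
        self.address = address

    def disable_port(self, port):
        # A real implementation would issue the vendor's "disable port"
        # command here and return whether the switch confirmed it.
        print(f"disabling port {port} on switch {self.address}")
        return True

def resource_fence(switch_address, errant_node_port):
    """Cut the errant node off from the SAN and insist on confirmation."""
    switch = SanSwitchClient(switch_address)
    if not switch.disable_port(errant_node_port):
        raise RuntimeError("fencing not confirmed - do NOT take over the volume")
    # Only after positive confirmation is it safe to mount /importantstuff elsewhere.
```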
Node fencing is the idea that one can keep a node from accessing all resources - without knowing what kind of resources it might be accessing, or how one might deny access to them. A common way of doing this is to power off or reset the errant node. This is a very effective if somewhat inelegant method of keeping it from accessing anything at all. This technique is also called STONITH[6] - which is a graphic and colorful acronym standing for Shoot The Other Node In The Head.
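As a concrete (if simplified) illustration, here's a sketch of STONITH done through a node's IPMI management processor by shelling out to ipmitool - one common out-of-band way to power a node off. The BMC address and credentials are placeholders, and a real STONITH agent would add retries and timeouts:

```python
# A minimal STONITH sketch using IPMI to power off an errant node
# out-of-band. The host, user, and password below are placeholders.
import subprocess

def stonith(bmc_host, user, password):
    """Power off the errant node and return only on positive confirmation."""
    base = ["ipmitool", "-I", "lanplus", "-H", bmc_host, "-U", user, "-P", password]
    subprocess.run(base + ["chassis", "power", "off"], check=True)
    # Ask the fencing device itself (not the errant node) whether it worked.
    status = subprocess.run(base + ["chassis", "power", "status"],
                            capture_output=True, text=True, check=True)
    if "off" not in status.stdout.lower():
        raise RuntimeError("node still powered on - fencing not confirmed")

# Example (placeholder values): stonith("192.0.2.10", "admin", "secret")
```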
With fencing, we can easily keep errant nodes from accessing resources, and we can now keep the world safe for democracy - or at least keep our little corner of it safe for clustering. An important aspect of good fencing techniques is that they're performed without the cooperation of the node being fenced off, and that they give positive confirmation that the fencing was done. Since errant nodes are suspect, it's far better to rely on positive confirmation from a correctly operating fencing component than to rely on errant cluster nodes you can't communicate with to police themselves.
Although fencing is sufficient to ensure safe resource access, it is not typically considered sufficient for happy cluster operation, because without some other mechanism the cluster can get into behaviors which are significantly annoying (even if your data really is safe). To discuss this, let's return to our sample cluster.
Earlier we talked about how paul or silas could use fencing to keep the errant node mark from accessing /importantstuff. But what about mark? If mark is still alive, then it is going to regard paul and silas as errant, not itself. So it would also proceed to fence paul and silas - and progress in the cluster would stop. If it is using STONITH, then one could get into a sort of infinite reboot loop, with nodes declaring each other errant, rebooting each other, coming back up, and doing it all over again. Although this is kind of humorous the first time you see it in a test environment, in production with important services the humor of the situation probably wouldn't be your first thought. To solve this problem, we introduce another mechanism - quorum.
Quorum
One way to solve the mutual fencing dilemma described above is to somehow select only one of these two subclusters to carry on and fence the subclusters it can't communicate with. Of course, you have to solve it without communicating with the other subclusters - since that's the problem - you can't communicate with them. The idea of quorum represents the process of selecting a unique (or distinguished for the mathematically inclined) subcluster.
The most classic solution to selecting a single subcluster is a majority vote: if you choose a subcluster with more than half of the members in it, then (barring bugs) you know there can't be any other subcluster like it. So this looks like a simple and elegant solution to the problem. For many cases, that's true. But what if your cluster only has two nodes in it? Now, if a single node fails, you can't do anything - neither side has quorum. If this is the case, then two machines have no advantage over a single machine - it's not much of an HA cluster. Since 2-node HA clusters are by far the most common size of HA cluster, it's kind of an important case to handle well. So, how are we going to get out of this problem?
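The majority rule itself is trivial to state in code; this little sketch shows both why it works for our three-node example and why it falls apart for two nodes:

```python
# Majority-vote quorum: a subcluster may proceed (and fence the others)
# only if it contains a strict majority of the cluster's configured nodes.

def has_quorum(subcluster_size, total_nodes):
    return subcluster_size > total_nodes / 2

# Three-node cluster: {paul, silas} has quorum, {mark} does not.
assert has_quorum(2, 3) and not has_quorum(1, 3)
# Two-node cluster: after one failure, neither survivor has quorum.
assert not has_quorum(1, 2)
```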
Quorum Variants and Improvements
What you need in this case is some kind of third-party arbitrator to help select who can fence off the other nodes and bring up resources - safely. There is a variety of methods, both hardware and software, that can act as this arbitrator. Although several are available, we'll only talk about one of each kind: SCSI reserve and a quorum daemon.
SCSI reserve: In hardware, we fall back on our friend SCSI reserve. In this usage, both nodes try to reserve a disk partition available to both of them, and the SCSI reserve mechanism ensures that only one of the two can succeed. Although I won't go into all the gory details here, SCSI reserve creates its own set of problems, including the fact that it won't work reliably over geographic distances. A disk which is used this way with SCSI reserve to determine quorum is sometimes called a quorum disk. Some HA implementations (notably Microsoft's) require a quorum disk.
Quorum Daemon: In Linux-HA[7], we have implemented a quorum daemon - whose sole purpose in life is to arbitrate quorum disputes between cluster members. One could argue that for the purposes of quorum this is basically SCSI reserve implemented in software - and such an analogy is a reasonable one. However, since it is designed for only this purpose, it has a number of significant advantages over SCSI reserve - one of which is that it can conveniently and reliably operate over geographic distances, making it ideal for disaster recovery (DR) type situations. I'll cover the quorum daemon and why it's a good thing in more detail in a later posting. Both HP and Sun have similar implementations, although I have security concerns about them, particularly over long distances. Other than the security concerns (which might or might not concern you), both HP's and Sun's implementations are also good ideas.
Arguably the best way to use these alternative techniques is not directly as a quorum method, but rather as a way of breaking ties when the number of nodes in a subcluster is exactly half the number of nodes in the cluster. Otherwise, these mechanisms can become single points of failure - that is, if they fail the cluster cannot recover.
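In code, that tie-breaking approach might look like the sketch below, where ask_arbitrator stands in for whichever mechanism you use (quorum daemon, SCSI reserve, or something else) - it is a placeholder, not a real Linux-HA API:

```python
# Quorum with an external tie-breaker. The arbitrator is consulted only
# when the subcluster holds exactly half the nodes, so an arbitrator
# failure cannot take down a cluster that still has a clear majority.

def has_quorum(subcluster_size, total_nodes, ask_arbitrator):
    if subcluster_size > total_nodes / 2:
        return True                  # clear majority: no arbitrator needed
    if subcluster_size == total_nodes / 2:
        return ask_arbitrator()      # exact tie (e.g. 1 of 2): third party decides
    return False                     # minority: never quorate
```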
Alternatives to Fencing
There are times when it is impossible to use normal 3rd-party fencing techniques. For example, in a split-site configuration (a cluster which is split across geographically distributed sites), when inter-site communication fails, attempts to fence across that link will also fail. In these cases, there are a few self-fencing alternatives which one can use when the more normal third-party fencing methods aren't available. These include:
Node suicide. If a node is running resources and it loses quorum, then it can power itself off or reboot itself (a sort of self-STONITH). The remaining nodes wait "long enough" for the other node to notice and kill itself. The problem is that a sick node might not succeed in killing itself, or might not notice that it had a membership change or had lost quorum. It is equally bad if notification of these events is simply delayed "too long". Since there is a belief that the node in question is, or at least might be, malfunctioning, this is not a trivial concern. In this case, use of hardware or software watchdog timers becomes critical.
Self-shutdown. This self-fencing method is a variant on suicide, except that resources are stopped gracefully. It has many of the same problems, and is somewhat less reliable because the time to shut down resources can be quite long. Like the case above, use of hardware or software watchdog timers becomes critical; a minimal watchdog sketch follows below.
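To show how a watchdog backs up these self-fencing schemes, here is a minimal sketch using the standard Linux /dev/watchdog interface. cluster_has_quorum is a placeholder for the real membership/quorum check; the key property is that a node too sick to keep running this loop gets reset by the hardware anyway:

```python
# Suicide-style self-fencing backed by the Linux watchdog device.
# Opening /dev/watchdog arms the timer; each write "pets" it. If the node
# loses quorum, hangs, or is too sick to keep petting, the watchdog
# resets the machine after its timeout - no cooperation required.
import time

PET_INTERVAL = 5   # seconds; must be well under the watchdog's timeout

def cluster_has_quorum():
    raise NotImplementedError("placeholder for the real quorum/membership check")

def watchdog_loop():
    with open("/dev/watchdog", "wb", buffering=0) as wd:
        while cluster_has_quorum():
            wd.write(b"\0")              # pet the watchdog
            time.sleep(PET_INTERVAL)
        # Quorum lost: stop petting and simply wait to be reset.
        while True:
            time.sleep(60)
```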
Note that without fencing, the membership and quorum algorithms are extremely critical. You've basically lost a layer of protection, and you've switched from relying on a component which gives positive confirmation to relying on a probably faulty component to fence itself, and then hoping without confirmation that you've waited long enough before continuing.
Summary
Split-brain is the idea that a cluster can have communication failures, which can cause it to split into subclusters. Fencing is the way of ensuring that one can safely proceed in these cases, and quorum is the idea of determining which subcluster can fence the others and proceed to recover the cluster services.
An Important Final Note
It is fencing which best guarantees the safety of your resources. Nothing else works quite as well. If you have fencing in your cluster software, and you have irreparable resources (i.e., resources that would be irreparably damaged in a split-brain situation), then you must configure fencing. If your HA software doesn't support (3rd party) fencing, then I suggest that you consider getting a different HA package.
See Also
General cluster concepts[8]
References
[1] http://linux-ha.org/DunnsLaw
[2] http://linux-ha.org/BadThingsWillHappen
[3] http://linux-ha.org/SplitBrain
[4] http://linux-ha.org/ResourceFencing
[5] http://linux-ha.org/NodeFencing
[6] http://linux-ha.org/STONITH
[7] http://linux-ha.org/
[8] http://linux-ha.org/ClusterConcepts
I don't totally agree that a SCSI-reserved disk is a quorum disk. In Red Hat and M$ cluster suites, the quorum disk is a shared partition (FC or SCSI only, I think) which is written to by cluster members, each to its own "sector". A successful write is counted as a vote.
Posted by: zul | 18 October 2007 at 17:35
Not all SCSI-reserve disks are quorum disks, and not all uses of shared disks are quorum disks.
In the situation you're describing, I wouldn't call a disk used this way a quorum disk. It's being used as a communication method - a method of sending "I'm alive" or heartbeat messages. Used in this way, the disk is another communication method - a disk-based networking scheme - and when it comes to communication paths - the more the merrier.
Adding communication methods doesn't eliminate the need for quorum, nor the need for fencing, and it doesn't eliminate split brain. For example, there was a scheduler bug in Red Hat 2.4.18-2.4.20 which caused Heartbeat to stop running for nearly an hour. This would make the other node think that node was dead, but it wasn't. And, since it wasn't running, it wouldn't notice that it didn't have quorum.
Posted by: Alan Robertson | 18 October 2007 at 21:18
You write: "As of this writing, I believe that the quorum daemon feature is unique to Linux-HA."
Sun Cluster supports Quorum Servers (scqsd) since Version 3.2. Docs link: http://docs.sun.com/app/docs/doc/819-5360/gbdud?a=view
Posted by: Nils Goroll | 27 November 2007 at 02:07
Thanks for catching this problem. I'll correct the text. After writing this, I also found out that HP had a similar mechanism. I'll update both this page and the one here: http://techthoughts.typepad.com/managing_computers/2007/10/more-about-quor.html with your links. Thanks!
Posted by: Alan R. | 27 November 2007 at 06:45