In a previous article[1], I talked about quorum and alluded to some details that I'll expand on here. Let's examine a couple of common quorum tie-breaker methods, and see what's useful, what's hype, and what's painful to use.
Can the standard voting quorum method fail?
By fail, I mean: can it grant quorum to two partitions simultaneously? The answer, unfortunately, is yes, even though it seems like a mathematical impossibility. This is because the world is more complicated than simple mathematics, and quorum methods don't stand alone - they are tied to membership algorithms. If a membership algorithm fails and, for a period of time, a given node appears to be in the membership of two partitions, then while that's true, both partitions could legitimately think they have quorum. Like many possible failures, this one is unlikely, but it is certainly possible. Sigh... Paranoia is so depressing.
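To make this concrete, here's a tiny illustrative sketch (mine, not Linux-HA code): in a five-node cluster, a broken membership layer briefly reports node C as a member of two partitions at once, and simple majority voting then grants quorum to both.

```python
def has_quorum(partition, total_nodes):
    """Standard voting quorum: strictly more than half the total votes."""
    return len(partition) > total_nodes / 2

TOTAL = 5  # a five-node cluster: A, B, C, D, E

# Membership failure: node C briefly appears in BOTH partitions at once.
partition_1 = {"A", "B", "C"}
partition_2 = {"C", "D", "E"}

print(has_quorum(partition_1, TOTAL))  # True
print(has_quorum(partition_2, TOTAL))  # True - split brain, despite majority voting
```

Note that C's vote is effectively counted twice: the two partitions claim 3 + 3 = 6 votes among only 5 nodes, which is how both clear the "more than half" bar.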
SCSI Reserve Limitations
In the earlier article, I mentioned that SCSI reserve was often painful to use, without being explicit about why. Let's explore that now. SCSI reserve is an operation which happens at the physical disk volume level - that is, at the SCSI LUN level. With "dumb" disks this typically corresponds to an entire disk spindle - which nowadays is something like 180G-750G of disk - clearly a significant waste of resources.
However, most people who use shared disks use them in a SAN, with a smart disk controller which allows the creation of "logical" volumes much smaller than a physical disk, each presented as a single SCSI LUN, using any RAID method you want. For a disk used just as a quorum disk, you don't need to actually write anything on it, so you want it as small as you can make it. But most people probably make it RAID 1, which means the minimum size is probably something like 2 gigabytes of SAN disk. If you have a large data center, it could easily have 50 clusters in it, and each one requires such a quorum device. This makes for a lot of extra volumes, extra administration, and possibilities for confusion and human error. In addition, smaller SAN disk units may have limitations on the total number of disk partitions they can manage.
So, perhaps you think: I'm smarter than that - I'll just use a software volume manager to take care of this for me, and avoid all those extra logical disks in my SAN. Unfortunately, that typically doesn't work. When you issue a SCSI reserve, it can only reserve a physical volume, not a logical partition, so many logical volume managers block reserve operations. Logical volume managers are not much help here.
To make it even more complicated, multi-pathing to disk devices often confuses (i.e., "breaks") disk reservations - particularly SCSI II non-persistent reservations.
Of course, if you're replicating data instead of sharing it, disk reserve operations are of no help at all - since reserving one disk volume has no effect whatsoever on the other volume.
None of these considerations change based on what kind of SCSI reserve you issue (persistent, or the older non-persistent reserve). However, even more problems occur when you use the older SCSI II reserve (the most commonly available kind): since it isn't persistent, it is broken by a bus reset or a device reset. So, if you use SCSI II reserve, you have to continually verify that you still hold the reservation.
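The kind of watchdog loop this forces on you can be sketched as follows. Here `still_hold_reservation` and `on_lost` are hypothetical callbacks standing in for a real SCSI reservation check and your recovery action - this is an illustration of the polling burden, not a real SCSI API:

```python
import time

def monitor_reservation(still_hold_reservation, on_lost,
                        interval=1.0, max_checks=None):
    """Poll, because a bus or device reset silently drops a SCSI II reserve.

    Returns True if the reservation held for all checks, False if it was lost.
    """
    checks = 0
    while max_checks is None or checks < max_checks:
        if not still_hold_reservation():
            on_lost()  # e.g. try to re-reserve, or stop cluster resources
            return False
        time.sleep(interval)
        checks += 1
    return True
```

In a real cluster, the loss handler has to assume another node may already hold the reservation by the time it runs - which is exactly the extra complexity that persistent reservations were designed to avoid.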
Count Key Data Disk Reserve
Mainframe disk subsystems support a non-SCSI disk model called extended count key data (ECKD). These disks also support reserve operations similar to those provided by SCSI.
Quorum Daemon - Helpful or Hype?
Earlier, I said that the Linux-HA[2] quorum daemon[3] can help out here - not only for local shared-disk situations, but for disaster-recovery-type situations where you're replicating data between sites - and gave a hand-waving-style argument for why that's so. Let's see if we can go past the hand waving into more of the details, so you can see what this is and how it works, and decide for yourself whether it would be helpful to you, or whether this is just more hype written by someone who likes his project's work.
If you recall, I also described it as analogous to a software implementation of SCSI reserve. But, in this case, there are no disks involved, so the hardware and SCSI protocol limitations mentioned above go away. So, if the hardware has gone away, what kind of software has replaced it, and how does it work?
In the simplest view, the quorum daemon is simple - it takes TCP connections from clients, and when multiple clients want quorum for the same cluster, it grants it to exactly one of them. So, at this level of detail, it's quite simple, and the logic is also straightforward. But there's a little more to it when you get into the details, so let's spend a few words on the next level of detail, for those of you who are still skeptical.
TCP doesn't do a good job of telling you when your peer goes away, so the client and server processes send each other heartbeats. If the client stops hearing heartbeats, it notifies its cluster that it no longer has quorum. When the server stops hearing heartbeats, it takes quorum away from the client and sends it noquorum messages. Before switching quorum from one client to another, the quorum daemon imposes a configurable delay, to allow the previous quorum owner time to notice that it has lost quorum and shut down any resources it might be running.
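That grant-revoke-delay cycle can be modeled in a few lines. This is a toy sketch of the server-side logic, under my own simplifying assumptions (explicit timestamps instead of real timers and messages) - not the daemon's actual code:

```python
class QuorumGrants:
    """Toy model of the quorum daemon's per-cluster grant logic."""

    def __init__(self, hb_timeout, switch_delay):
        self.hb_timeout = hb_timeout      # seconds of silence before revoking
        self.switch_delay = switch_delay  # grace period before re-granting
        self.owner = None
        self.last_hb = None
        self.revoked_at = None

    def heartbeat(self, client, now):
        if client == self.owner:
            self.last_hb = now

    def request(self, client, now):
        # First, revoke quorum from an owner that has gone silent.
        if self.owner is not None and now - self.last_hb > self.hb_timeout:
            self.owner, self.revoked_at = None, now
        if self.owner is None:
            # Wait out the delay so the old owner can notice it lost quorum
            # and shut its resources down before anyone else gets the grant.
            if self.revoked_at is not None and now - self.revoked_at < self.switch_delay:
                return False
            self.owner, self.last_hb, self.revoked_at = client, now, None
        return self.owner == client

# Two partitioned clients "a" and "b" competing for the same cluster's quorum:
q = QuorumGrants(hb_timeout=5, switch_delay=10)
print(q.request("a", now=0))   # True  - first requester wins
print(q.request("b", now=1))   # False - "a" still holds quorum
print(q.request("b", now=7))   # False - "a" is revoked, but the delay is running
print(q.request("b", now=18))  # True  - delay expired, quorum switches to "b"
```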
What other interesting design features does it have? Well, for one, all communication between client and server goes over SSL, with certificate authentication. The server has a copy of the client's public certificate, and the clients have copies of the server's public certificate - so you don't need to get your certificates from a certificate authority, because we authenticate them by certificate, not through a certificate authority (which could be a single point of failure). Because of the way we use the SSL certificates, both authentication and authorization are bundled together.
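The pinning idea - trusting exactly one known certificate rather than any CA - can be sketched with Python's standard `ssl` module. This illustrates the concept only (the daemon itself is not written in Python), and the certificate filename is a hypothetical placeholder:

```python
import ssl

def pinned_client_context(server_cert_file=None):
    """Build a client TLS context that trusts ONLY the server's own certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.check_hostname = False           # identity comes from the pinned cert,
    ctx.verify_mode = ssl.CERT_REQUIRED  # not from a CA-signed hostname
    if server_cert_file:                 # e.g. "quorumd.pem" (hypothetical path)
        ctx.load_verify_locations(cafile=server_cert_file)
    return ctx
```

The server side does the same thing in reverse with each client's public certificate, which is how authentication and authorization end up bundled: presenting the right certificate *is* the authorization.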
Another nice design feature is that a single quorum daemon doesn't have any fixed upper limit on the number of clusters it can support - a single quorum daemon process should be able to support hundreds of clusters. This is an advantage over just adding a third node to your cluster, because the daemon can serve hundreds of clusters, whereas typically a single computer can't join more than one cluster. Indeed, since it doesn't run the cluster stack for any given cluster, it would be possible for the quorum daemon to work with multiple cluster implementations - as long as they put in the hooks to talk to it.
The combination of these two features allows you to put your quorum daemon on a completely different site, even at a colocation facility, and let your communication flow over the public Internet - securely. This is a really nice, economical choice for split-site DR-style clusters, for companies that don't have three or more major data centers. Of course, if you have dozens of clusters that aren't split-site, you can make a single quorum daemon cluster with fencing to serve all the other clusters in your site - all without bothering your storage group to create and manage all those little tiny quorum partitions for you. In fact, without requiring shared storage at all.
Some kinds of software support resource-level quorum - that is, quorum at the resource level rather than the whole-cluster level. Some of these are called resource-driven clusters. The quorum daemon idea could be used for those arrangements as well.
So, what's not to like about the quorum daemon? Well, you do have to make a certificate for the server, and one for each client. But you don't have to pay for them, because they don't need to come from one of the well-known certificate producers. It would be nice if the quorum daemon synced its state to a slave copy that could be used in a cluster as a master/slave resource, in case the machine running the quorum daemon failed. You can still run it in a cluster with another machine to take over for it, but the takeover isn't as graceful as it would be if there were a slave copy ready to take over for the master, complete with its quorum state.
I've recently learned that HP ServiceGuard[4] has a feature similar to this which they call the quorum server daemon (qs)[5]. The documentation indicates that HP's qs authenticates by IP address and does not use SSL for communication or authentication/authorization. As a result, there are security concerns associated with it, and qs is probably unsuitable for split-site configurations.
And, even more recently, thanks to Nils Gorrol, I've learned that Sun has had a similar feature in their sqsd[6] since SunCluster version 3.2. Looking at the documentation, it appears to have security issues similar to those HP's qs has for split-site arrangements and potentially insecure networks.
Thanks to all those who keep me on the straight and narrow regarding the facts ;-)
Ping tiebreaker
Some HA systems provide a ping tiebreaker. To make this work, you pick an address outside the cluster to ping, and any partition that can ping that address has quorum. The obvious advantage is that it's very simple to set up - it doesn't require any additional servers or shared disk. The disadvantage (and it's a big one) is that it's very possible for multiple partitions to think they have quorum. In the case of split-site (disaster recovery) type clusters, it's going to happen fairly often. If you use this method for a single site in conjunction with fencing, it will likely work out quite well - a lot better than no tiebreaker, or one that always says "you have quorum". Having said that, it's significantly inferior to any of the other methods.
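The failure mode is easy to see in a sketch. With reachability passed in as a function (a hypothetical stand-in for a real ping), nothing stops two partitions from both passing the test at once:

```python
def ping_tiebreaker_has_quorum(can_reach_tiebreaker):
    """A partition has quorum if and only if it can ping the tiebreaker address."""
    return can_reach_tiebreaker()

# Split-site scenario: the inter-site link is down, but both sites can
# still reach the outside world - so both "win" the tiebreaker.
site_a = lambda: True  # site A can ping the tiebreaker address
site_b = lambda: True  # site B can too

print(ping_tiebreaker_has_quorum(site_a))  # True
print(ping_tiebreaker_has_quorum(site_b))  # True - both partitions claim quorum
```

Unlike the quorum daemon, there is no single arbiter here - each partition evaluates the test independently, which is why fencing is needed to make this safe within a single site.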
If I've omitted things you want to know about, or this brings up ideas or questions in your mind, or you know of other implementations of the quorum daemon idea, by all means post a reply to this article.
References
[1] http://techthoughts.typepad.com/managing_computers/2007/10/split-brain-quo.html
[2] http://linux-ha.org/
[3] http://www.linux-ha.org/QuorumServerGuide
[4] http://h71036.www7.hp.com/enterprise/cache/6468-0-0-0-121.html
[5] http://www.docs.hp.com/en/B3936-90065/ch03s01.html
[6] http://docs.sun.com/app/docs/doc/819-5360/gbdud?a=view
Hi Alan!
Sorry to be dense, but I am still not getting what happens if (when) the quorum daemon goes down?
Emily
Posted by: Emily Ratliff | 13 November 2007 at 20:27
I have some good pictures for this, I just forgot about them when writing this article...
The recommended way to use the quorum server is effectively as a "tie-breaker" quorum method. So, it is only consulted when the previous quorum method determines that it has a tie. As a result, two different things might happen:
1) If both nodes/sites are up and the quorum daemon is down, nothing bad happens, since the quorum daemon isn't consulted (we have 2/2 votes, so there is no tie).
2) If one of the nodes is down, or the communication between the nodes is down AND the quorum daemon is down, then the cluster won't have quorum. Note that the quorum server is NOT a single point of failure, because it takes an additional failure (a node or the inter-site comm link) for the system to fail as a whole.
Posted by: Alan R. | 14 November 2007 at 13:51
> Ping Quorum ... it's very possible for multiple partitions to think they have quorum. In
> the case of split-site (disaster recovery) type clusters, it's going to happen fairly often.
Why? Please explain it, provide some details. Thank you.
Posted by: Alexander Horz | 05 January 2008 at 12:26
Alexander:
First let's take the local case...
ARP cache corruption can cause this kind of a problem without any hardware failures. Also I have seen switches fail this way - A can communicate with B, B can communicate with C, but A can't communicate with C. In these cases it's typically caused by a firmware bug or broken hardware, but it happens.
For the split-site case, the possibilities are less obscure, and more likely.
In a split-site cluster, it is normal to have a dedicated link between the sites. This is done for reasons of privacy and latency.
This link is completely separate from the links to the outside world, so it can fail independently. In fact, if it does fail, it's likely that the links to the outside world are still working. When this happens, both sides think the other is dead, and both can ping the outside address (for example www.google.com). Now you're in big split-brain trouble.
For either of these cases, if you were consulting the quorum server, then it would only grant quorum to one side - and the problem goes away (or never happens, depending on your perspective).
Posted by: Alan Robertson | 07 January 2008 at 14:31