
31 October 2007

Comments


Emily Ratliff

Hi Alan!

Sorry to be dense, but I am still not getting what happens if (when) the quorum daemon goes down?

Emily

Alan R.

I have some good pictures for this, I just forgot about them when writing this article...

The recommended way to use the quorum server is effectively as a "tie-breaker" quorum method. So, it is only consulted when the previous quorum method determines that it has a tie. As a result, two different things might happen:

1) If both nodes/sites are up and the quorum daemon is down, nothing bad happens, since the quorum daemon isn't consulted (we have 2/2 votes, so there is no tie).

2) If the quorum daemon is down AND either one of the nodes is down or the communication between the nodes is down, then the cluster won't have quorum. Note that the quorum server is NOT a single point of failure, because it takes an additional failure (a node or the inter-site comm link) for the system to fail as a whole.
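
To make the two cases concrete, here is a minimal sketch of the tie-breaker decision. It is an illustration only, not the actual quorum daemon code: the function name `has_quorum` and its parameters are made up for this example.

```python
from typing import Optional


def has_quorum(votes_seen: int, total_votes: int,
               tiebreaker_grant: Optional[bool]) -> bool:
    """Hypothetical tie-breaker quorum check (names are assumptions).

    votes_seen       -- votes this partition can currently count
    total_votes      -- votes configured in the whole cluster
    tiebreaker_grant -- True/False if the quorum server answered,
                        None if the quorum server is unreachable
    """
    if 2 * votes_seen > total_votes:
        # Clear majority: the quorum server is never consulted.
        return True
    if 2 * votes_seen < total_votes:
        # Clear minority: the quorum server cannot help.
        return False
    # Exact tie: only now does the quorum server's answer matter.
    # An unreachable quorum server (None) means no quorum.
    return tiebreaker_grant is True


# Case 1: both nodes up, quorum daemon down -> 2/2 votes, still quorum.
print(has_quorum(2, 2, None))
# Case 2: one node (or the link) down AND quorum daemon down -> no quorum.
print(has_quorum(1, 2, None))
```

The quorum server's answer only matters in the last branch, which is why losing it alone does not take the cluster down.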

Alexander Horz

> Ping Quorum ... it's very possible for multiple partitions to think they have quorum. In
> the case of split-site (disaster recovery) type clusters, it's going to happen fairly often.
Why? Please explain it, provide some details. Thank you.

Alan Robertson

Alexander:
First let's take the local case...
ARP cache corruption can cause this kind of a problem without any hardware failures. Also I have seen switches fail this way - A can communicate with B, B can communicate with C, but A can't communicate with C. In these cases it's typically caused by a firmware bug or broken hardware, but it happens.

For the split-site case, the possibilities are less obscure, and more likely.
In a split-site cluster, it is normal to have a dedicated link between the sites. This is done for reasons of privacy and latency.

This link is completely separate from the links to the outside world, so it can fail independently. In fact, if it does fail, it's likely that the links to the outside world are still working. When this happens, both sides think the other is dead, and both can still ping the outside host (for example www.google.com). Now you're in big split-brain trouble.

For either of these cases, if you were consulting the quorum server, then it would only grant quorum to one side - and the problem goes away (or never happens, depending on your perspective).
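
Here is a rough sketch of the difference for the split-site case, assuming the dedicated inter-site link has failed while both sites can still reach the outside world. The names `ping_quorum` and `QuorumServer` are invented for this illustration and are not the real implementation. The toy server simply grants quorum to the first site that asks.

```python
from typing import Optional


def ping_quorum(can_ping_outside: bool) -> bool:
    # Ping quorum (simplified): a site claims quorum whenever it can
    # reach the outside address, no matter what the other site sees.
    return can_ping_outside


class QuorumServer:
    """Hypothetical arbiter: grants quorum to at most one site."""

    def __init__(self) -> None:
        self.holder: Optional[str] = None

    def request(self, site: str) -> bool:
        # First site to ask becomes the holder; everyone else is refused.
        if self.holder is None:
            self.holder = site
        return self.holder == site


# Dedicated link is down, but both sites still reach the outside world.
sites = ["site_a", "site_b"]

ping_result = {s: ping_quorum(can_ping_outside=True) for s in sites}
split_brain = sum(ping_result.values()) > 1   # both sides think they have quorum

server = QuorumServer()
server_result = {s: server.request(s) for s in sites}
single_owner = sum(server_result.values()) == 1   # exactly one side has quorum

print(ping_result, split_brain)
print(server_result, single_owner)
```

With ping quorum, each side decides on its own. With the quorum server, the decision is made in one place, so the two sides cannot both win.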

