
October 31, 2007


Comments

Emily Ratliff

Hi Alan!

Sorry to be dense, but I am still not getting what happens if (when) the quorum daemon goes down?

Emily

alanr

I have some good pictures for this, I just forgot about them when writing this article...

The recommended way to use the quorum server is effectively as a "tie-breaker" quorum method. So, it is only consulted when the previous quorum method determines that it has a tie. As a result, two different things might happen:

1) If both nodes/sites are up and the quorum daemon is down, nothing bad happens, since the quorum daemon isn't consulted (we have 2/2 votes, so there is no tie).

2) If the quorum daemon is down AND, at the same time, either one of the nodes is down or the communication between the nodes is down, then the cluster won't have quorum. Note that the quorum server is NOT a single point of failure, because it takes an additional failure (a node or the inter-site communication link) for the system to fail as a whole.
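
To make the two cases concrete, here is a minimal sketch of the tie-breaker logic in Python. It is not the actual quorum daemon code; the names `has_quorum`, `ask_quorum_server` and `server_down` are made up for illustration, and an unreachable quorum server is modeled as a `ConnectionError`.

```python
def has_quorum(votes_seen, total_votes, ask_quorum_server):
    """Majority voting, with the quorum server consulted only on a tie.

    ask_quorum_server is a hypothetical callable that returns True or False,
    or raises ConnectionError when the quorum daemon cannot be reached.
    """
    if 2 * votes_seen > total_votes:
        return True          # clear majority: the quorum server is never asked
    if 2 * votes_seen < total_votes:
        return False         # clear minority: no point in asking
    try:
        return ask_quorum_server()   # exact tie: let the quorum server decide
    except ConnectionError:
        return False         # tie and no tie-breaker: no quorum


def server_down():
    """Stand-in for a quorum daemon that is down."""
    raise ConnectionError("quorum daemon unreachable")


# Case 1: both nodes up (2/2 votes), quorum daemon down.
case_1 = has_quorum(2, 2, server_down)

# Case 2: a node sees only itself (1/2 votes), quorum daemon down.
case_2 = has_quorum(1, 2, server_down)

# Same tie, but the quorum daemon is up and grants quorum to this node.
case_tie_granted = has_quorum(1, 2, lambda: True)
```

In this sketch `case_1` keeps quorum because the tie-breaker is never called, while `case_2` loses it only because two things have failed at once.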

Alexander Horz

> Ping Quorum ... it's very possible for multiple partitions to think they have quorum. In
> the case of split-site (disaster recovery) type clusters, it's going to happen fairly often.
Why? Please explain it, provide some details. Thank you.

Alan Robertson

Alexander:
First let's take the local case...
ARP cache corruption can cause this kind of problem without any hardware failure. Also, I have seen switches fail this way - A can communicate with B, B can communicate with C, but A can't communicate with C. In these cases it's typically caused by a firmware bug or broken hardware, but it does happen.

For the split-site case, the possibilities are less obscure, and more likely.
In a split-site cluster, it is normal to have a dedicated link between the sites. This is done for reasons of privacy and latency.

This link is completely separate from the links to the outside world, so it can fail independently. In fact, if it does fail, it's likely that the links to the outside world are still working. When this happens, both sides think the other is dead, and both can still ping the outside host (for example www.google.com), so both conclude that they have quorum. Now you're in big split-brain trouble.

For either of these cases, if you were consulting the quorum server, then it would only grant quorum to one side - and the problem goes away (or never happens, depending on your perspective).
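
As a rough illustration of the split-site case, the toy Python simulation below compares the two approaches once the dedicated inter-site link has failed. Everything in it is an assumption made for the example: the site names, the `ping_quorum` rule, and the `QuorumServer` class, which simply grants quorum to whichever site asks first. A real quorum server also has to handle things like releasing and re-granting quorum, which are left out here.

```python
class QuorumServer:
    """Toy arbiter: grants quorum to at most one site (hypothetical)."""

    def __init__(self):
        self.holder = None

    def request(self, site):
        # The first site to ask becomes the holder; everyone else is refused.
        if self.holder is None:
            self.holder = site
        return self.holder == site


def ping_quorum(can_reach_outside):
    """Ping quorum: a site claims quorum if it can ping the outside host."""
    return can_reach_outside


# The dedicated link between the sites is down, but each site can
# still reach the outside world.
sites = {"site_a": True, "site_b": True}

# With ping quorum, every site that can ping the outside host claims quorum.
ping_winners = [site for site, ok in sites.items() if ping_quorum(ok)]

# With a quorum server, each site asks the same arbiter.
server = QuorumServer()
server_winners = [site for site in sites if server.request(site)]

print("ping quorum granted to:  ", ping_winners)
print("quorum server granted to:", server_winners)
```

With ping quorum both sites end up in the winners list, which is the split-brain situation; with the shared arbiter only one site does.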
