
27 November 2007

Comments


Kris Buytaert

Alan,

Could it be that you have placed the wrong image in the "What happens when the quorum server goes down?" section?

Alan R.

Indeed I did. I'll correct it. This whole confusion about images (clobbering them with each other, and so on) is what delayed the article a week (that and the Thanksgiving Holiday in the US). I thought I had that straightened out. Groan...

Thanks for your quick response on this!

Alan R.

I've corrected it, and I also added a brief section on quorum server security. Thanks again Kris!

Atanas

"So, it is worth noting that the quorum server will never be consulted if a cluster has an odd number of nodes."

This statement confuses me a little. What if we have, for example, a 5-node cluster? Three nodes reside in site A and the other two in site B. Imagine the network between the two sites fails. The first subcluster has quorum (3 nodes) and continues to serve, whereas the other subcluster loses quorum (2 nodes) and no services are brought up there.

This is OK as long as it is only the network that has gone down. But what if site A has been struck by a disaster? The second subcluster must take over the service. From its point of view it cannot see the other nodes in site A, and again it does not have quorum (2 nodes), so it cannot run the service.

IMO this is where the quorum server comes in, since it is located in a different site. In the disaster case above it cannot connect to the nodes in site A, but it can connect to the 2 nodes left in site B, so the quorum server grants quorum to the remaining nodes.

What would you say about that?

Alan R.

Ahhh... OK. I should have been more precise in my language. In the case you gave, one would normally make each site's total number of quorum votes the same (sorry, I forgot to say that). So each node in site A would get 2 votes apiece, and those in site B would get 3 votes apiece.

This is how I'd normally recommend doing it, so that you don't wind up in a situation where you can't fail over because of an imbalance of votes between the sites.

Does that help?
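
If it helps to see the arithmetic, here is a rough sketch in Python (just an illustration of the vote counting, not Heartbeat code; the node names and vote values are made up) of how that weighting works out in your scenario:

votes = {
    # site A: 3 nodes, 2 votes apiece
    "a1": 2, "a2": 2, "a3": 2,
    # site B: 2 nodes, 3 votes apiece
    "b1": 3, "b2": 3,
}

total = sum(votes.values())   # 12 votes in the whole cluster

def has_majority(partition):
    # A partition can claim quorum on its own only with a strict majority of votes.
    return sum(votes[n] for n in partition) * 2 > total

site_a = ["a1", "a2", "a3"]   # 6 votes
site_b = ["b1", "b2"]         # 6 votes

print(has_majority(site_a))   # False: site A alone cannot claim quorum
print(has_majority(site_b))   # False: site B alone cannot claim quorum

Since neither site has a majority by itself, the quorum server gets consulted, and it can grant quorum to whichever site is still reachable, which is exactly what you want in the disaster case.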

Jonny

I have tried testing this out, as it sounds like a great idea and a potential solution to some problems we are having.
I tried a 2-node cluster and a separate quorumd server, but I cannot seem to get it to work. I don't get any log messages referring to quorum, so I don't know whether I have it working or not.
I set it up in 3 virtual machines on Xen.
A practical how-to would be excellent!

ZiLi0n

Split Brain Avoided: how does the quorum server select a subcluster? On which attributes is this decision based?

Alan R.

I didn't write that code, so my memory may not be perfect in this regard, but what I remember is that each subcluster petitioning for quorum can give a weight (how badly it wants quorum). The quorum server then selects the one with the greatest weight.

How that weight gets set, I don't recall. But, that's the idea, IIRC...
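
In rough pseudocode (Python just for illustration; this is not the real quorumd implementation, and the names are made up), the selection looks something like this:

def grant_quorum(petitions):
    # petitions: list of (subcluster, weight) pairs from partitions that
    # could not reach a majority on their own. The quorum server grants
    # quorum to the petitioner with the greatest weight.
    if not petitions:
        return None
    winner, _ = max(petitions, key=lambda p: p[1])
    return winner

print(grant_quorum([("site_A", 6), ("site_B", 4)]))   # site_A wins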

ZiLi0n

I have a cluster with two nodes and a quorum server. I followed the steps at "http://www.linux-ha.com/QuorumServerGuide" to set up the quorum server.

The quorum server has the heartbeat package compiled with the option "enable-quorumd=yes"... but do the two nodes also need this option? I think yes... but I cannot find the quorumd plugin.

I start quorumd on the quorum server: /usr/lib/heartbeat/quorumd

This is the log from one node:

crmd[10880]: 2008/07/29_13:15:15 info: crmd_ccm_msg_callback: Quorum (re)attained after event=NEW MEMBERSHIP (id=8)
crmd[10880]: 2008/07/29_13:15:15 info: ccm_event_detail: NEW MEMBERSHIP: trans=8, nodes=2, new=0, lost=0 n_idx=0, new_idx=2, old_idx=4
crmd[10880]: 2008/07/29_13:15:15 info: ccm_event_detail: CURRENT: ast2 [nodeid=1, born=1]
crmd[10880]: 2008/07/29_13:15:15 info: ccm_event_detail: CURRENT: ast1 [nodeid=0, born=8]
crmd[10880]: 2008/07/29_13:15:15 info: process_client_disconnect: Received HUP from tengine:[-1]
crmd[10880]: 2008/07/29_13:15:15 info: do_election_count_vote: Updated voted hash for ast1 to vote
crmd[10880]: 2008/07/29_13:15:15 info: do_election_count_vote: Election ignore: our vote (ast1)
crmd[10880]: 2008/07/29_13:15:15 info: crmdManagedChildDied: Process pengine:[24696] exited (signal=0, exitcode=0)
crmd[10880]: 2008/07/29_13:15:15 info: process_client_disconnect: Received HUP from pengine:[-1]
crmd[10880]: 2008/07/29_13:15:16 info: do_election_count_vote: Election check: vote from ast2

crmd[10880]: 2008/07/29_13:15:13 WARN: crmd_ha_msg_callback: Ignoring HA message (op=noop) from ast2: not in our membership list (size=1)
crmd[10880]: 2008/07/29_13:15:14 WARN: crmd_ha_msg_callback: Ignoring HA message (op=noop) from ast2: not in our membership list (size=1)
crmd[10880]: 2008/07/29_13:15:14 info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
crmd[10880]: 2008/07/29_13:15:14 info: mem_handle_event: no mbr_track info
crmd[10880]: 2008/07/29_13:15:14 info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
crmd[10880]: 2008/07/29_13:15:14 info: mem_handle_event: instance=8, nodes=2, new=1, lost=0, n_idx=0, new_idx=2, old_idx=4
crmd[10880]: 2008/07/29_13:15:14 info: crmd_ccm_msg_callback: Quorum lost after event=INVALID (id=8)
crmd[10880]: 2008/07/29_13:15:14 info: crmd_ccm_msg_callback: Quorum lost: triggering transition (INVALID)
crmd[10880]: 2008/07/29_13:15:14 ERROR: do_ccm_update_cache: 2 nodes w/o quorum
crmd[10880]: 2008/07/29_13:15:14 info: ccm_event_detail: INVALID: trans=8, nodes=2, new=1, lost=0 n_idx=0, new_idx=2, old_idx=4
crmd[10880]: 2008/07/29_13:15:14 info: ccm_event_detail: CURRENT: ast2 [nodeid=1, born=1]
crmd[10880]: 2008/07/29_13:15:14 info: ccm_event_detail: CURRENT: ast1 [nodeid=0, born=8]
crmd[10880]: 2008/07/29_13:15:14 info: ccm_event_detail: NEW: ast2 [nodeid=1, born=1]

This is on the quorum server:

Jul 29 13:08:08 arbitro quorumd: [2801]: debug: receive 0 byte or error from client 1
Jul 29 13:08:08 arbitro quorumd: [2801]: debug: client 1 disconnected
Jul 29 13:08:08 arbitro quorumd: [2801]: debug: delete client 1
Jul 29 13:08:23 arbitro quorumd: [2801]: debug: create new client 2
Jul 29 13:11:52 arbitro quorumd: [2801]: debug: client 2 disconnected
Jul 29 13:11:52 arbitro quorumd: [2801]: debug: delete client 2
Jul 29 13:11:53 arbitro quorumd: [2801]: debug: create new client 3
Jul 29 13:15:12 arbitro quorumd: [2801]: debug: create new client 4
Jul 29 13:15:12 arbitro quorumd: [2801]: debug: receive 0 byte or error from client 4
Jul 29 13:15:12 arbitro quorumd: [2801]: debug: client 4 disconnected
Jul 29 13:15:12 arbitro quorumd: [2801]: debug: delete client 4
Jul 29 13:15:12 arbitro quorumd: [2801]: debug: create new client 5
Jul 29 13:15:12 arbitro quorumd: [2801]: debug: receive 0 byte or error from client 3
Jul 29 13:15:12 arbitro quorumd: [2801]: debug: client 3 disconnected
Jul 29 13:15:12 arbitro quorumd: [2801]: debug: delete client 3
Jul 29 13:15:13 arbitro quorumd: [2801]: debug: receive 0 byte or error from client 5
Jul 29 13:15:13 arbitro quorumd: [2801]: debug: client 5 disconnected

Thanks!!!
