DR

November 27, 2007

Quorum Server Illustrated - updated

In two earlier posts [1] [2], I gave brief descriptions of the quorum server which seem to have left as much confusion as they provided clarity.  This post is only about the Linux-HA quorum server, and includes illustrations for clarity.

The Linux-HA Quorum API

In the Linux-HA quorum API, you can configure a number of quorum modules which are used as follows.  If a quorum module returns HAVEQUORUM, then the cluster has quorum.  If it returns NOQUORUM then the cluster does not have quorum.  If a quorum module returns QUORUMTIE, then the next quorum module in the list is consulted.  If the final module returns QUORUMTIE, then it is treated as a NOQUORUM event.

The quorum daemon is normally used in conjunction with the nomal arithmetic voting quorum module, so that it is only consulted when the number of nodes in the cluster is exactly half the number of configured modules in the system.  So, it is worth noting that the quorum server will never be consulted if a cluster has an odd number of nodes.

Quorum Server Scenarios

Below, I'll go through the basic quorum server cases so you can see how all this works in more detail - with pictures, even!

Normal Situation - Everything up
Quorum_server_normalsm_2

In the picture above, everything is normal.  The quorum server is up, and both sites are also up.  Because the cluster has all its nodes up, the quorum server is irrelevant.

Single Site Failure
Quorum_server_nj_failedsm_3

In the situation above, we show the "New Jersey" site as down.  In this case, the conventional voting quorum has a tie (1/2 - exactly half of the nodes).  In this case the quourm server is consulted.  Since only New York is talking to the quorum server, the quorum server grants quorum to the New York site.

Split Brain Avoided
Quorum_server_splitbrainsm_2

In the case above, the link between the sites has been lost, but both sites and the quorum server are all up.  In this case, both New York and New Jersey contact the quorum server because each sees 1/2 nodes as being up - resulting in a tie condition.

In this case, the quorum server will choose one of the two sites to provide quorum to, and I assume in this case that New York was chosen.  Because New Jersey  wasn't granted quorum, it will shut its resources down.

What happens when the quorum server goes down?
Quorum_server_failed_both_upsm

That is the situation shown above.  Because New York and New Jersey are both up, they have 2/2 votes and both provide service as they should.  This illustrates the point that the quorum server is not a single point of failure.

Multiple Failures -> Loss of Service

Multiple_failures_no_servicesm_3

In this final case, multiple failures have occurred - both New Jersey and the quorum server are down.  In this case, New York doesn't have quorum, so it shuts down services and none are provide by any node in the cluster.  Of course, this situation can be overridden in the cluster configuration by changing the quorum policy, but from an automated perspective, this is all that can be (should be) done.

Security Concerns

If you want to run your quorum server communications across networks which mig

October 31, 2007

More about quorum - updated

In a previous article[1], I talked about quorum, and alluded to some more details about quorum which I'll discuss here in a little more detail.  Let's examine a couple of common quorum tie-breaker methods, and see what's useful, and what's hype, and what's painful to use.

Can the standard voting quorum method fail?
By fail, I mean, can it grant quorum to two partitions simultaneously?  The answer, unfortunately, is yes, even though it seems like a mathematical impossibility.  This is because the world, unfortunately, is more complicated than simple mathematics, and quorum methods don't stand alone.  Quorum methods are tied to membership algorithms.  If a membership method fails, and for a period of time, a given node appears to be in the membership of two partitions, then while that's true, both partitions could legitimately think they have quorum.  Like many possible failures, this one is unlikely, but it is certainly possible.  Sigh...  Paranoia is so depressing.

SCSI Reserve Limitations

In the earlier article, I mentioned that SCSI reserve was often painful to use, without being explicit on why.  Let's explore that now.  SCSI reserve is an operation which happens at the physical disk volume level - that is, at the SCSI LUN  level.  With "dumb" disks this typically corresponds to an entire disk spindle - which nowadays is something like 180G-750G of disk data - which is clearly a significant waste of resources.

However, most people who use shared disks use them in a SAN are using it with a smart SAN disk controller which allows the creation of "logical" volumes which correspond to more much smaller sizes for a single SCSI LUN, using any RAID method you want.  For a disk used just as a quorum disk, you don't need to actually write anything on it, so you want it as small as you can make it.  But, most people probably make it RAID 1, which means that the minimum size is probably something like 2 gigabytes of SAN disk. If you have a large data center, it could easily have 50 clusters in it, and each one requires such a quorum device.  In this case, this makes for a lot of extra volumes, extra administration and possibilities for confusion and human error going on here.  In addition, smaller SAN disk units may have limitations on the total number of disk partitions they can manage.

So, perhaps you think - I'm smarter than that, I'll just use software volume managers to take care of this for me, and avoid all those extra logical disks in my SAN.  Unfortunately, that typically doesn't work.  This is because when you issue the SCSI reserve, it can't reserve a logical partition, only a physical partition, so many logical volume managers block reserve operations.  So, logical volume managers are not much help here.

To make it even more complicated, multi-pathing to disk devices often confuses (i.e., "breaks") disk reservation issues - particularly with SCSI II non-persistent reservations.

Of course, if you're replicating data instead of sharing it, disk reserve operations are of no help at all - since reserving a disk on one disk volume has no effect whatsoever on the other volume.

None of these considerations are changed by what kind of SCSI reserve you issue (persistent or the older non-persistent reserve).  However, there are even more problems that occur when you use the older non-persistent SCSI II reserve (the most commonly available kind), since it isn't persistent, and is broken by a bus reset or a device reset.  So, if you use SCSI II reserve, then you have to continually verify that you still have the reserve.

Count Key Data Disk Reserve

Mainframe disk subsystems support a non SCSI-based disk model, called the extended count key data (ECKD) disks.  These disks also support reserve methods similar to those provided by SCSI.

Quorum Daemon - Helpful or Hype?
Earlier, I said that the Linux-HA[2] quorum daemon[3] can help out here - not only for local shared disk situations, but for disaster-recovery type situations where you're replicating data between sites, and gave a hand-waving style argument on why that's so.  Let's see if we can go past the hand waving into more of the details so you can see what this is, how this works, and decide for yourself it would be helpful to you, or if this is just more hype written by someone who likes his project's work.

If you recall, I also described it as analogous to a software implementation of SCSI reserve.  But, in this case, there are no disks involved, so the hardware and SCSI protocol limitations mentioned above go away.  So, if the hardware has gone away, what kind of software has replaced it, and how does it work?

In the simplest view, the quorum daemon is simple - it takes TCP connections from clients, and when multiple clients both want to have quorum for the same cluster, it grants it to exactly one of them.  So, at this level of detail, it's quite simple, and the logic also straightforward.  But, there's a little more to it when you get into the details, so let's spend a few words describing the next level of detail, for those of you who are still skeptical.

TCP doesn't do a good job of telling you when your peer goes away, so both the client and the server processes send the other side heartbeats.  If the client stops hearing heartbeats, then he notifies his cluster that he no longer has quorum.  When the server stops hearing heartbeats, he takes quorum away from the client, and sends him noquorum messages.  Before switching quorum from one client to the other, the quorum daemon induces a configurable delay to allow the previous quorum owner time to notice that they've lost quorum and to shut down any resources it might be running.

What other interesting design features does it have?  Well, for one, all communication between client and server goes over SSL, with certificate authentication.  The server has a copy of the client's public certificate, and the clients have copies of the server's public certificate - so you don't need to get your certificates from a certificate authority - because we authenticate them by certificate, not through certificate authority (which could be a single point of failure).  Because of the way we use the SSL certificates, both authentication and authorization are bundled together.

Another nice design feature, is that a single quorum daemon doesn't have any fixed upper limit on the number of clusters it can support.  So, a single quorum daemon process should be able to support hundreds of clusters.  This makes it an advantage over just adding a third node to your cluster - because it can server hundreds of clusters, whereas typically a single computer can't join more than one cluster.  Indeed, since this doesn't run the cluster stack for any given cluster, it would be possible for the quorum daemon to work with multiple cluster implementations - as long as they put the hooks in to talk to it.

The combination of these two features allows you to put your quorum daemon on a completely different site, even at a colocation facility, and let your communication flow over the public Internet - securely.  This is really nice economical choice for your split-site DR-style clusters, for companies that don't have 3 or more major data centers.  Of course, if you have dozens of clusters that aren't split-site, you can make a single quorum daemon cluster with fencing to serve all the other clusters in your site.  All without bothering your storage group to create and manage all those little tiny quorum partitions for you.  In fact, without requiring shared storage at all.

Some kinds of software support resource-level quorum, that is, quorum on a resource-level rather than a whole cluster level.  Some of these are called resource-driven clusters.  The quorum daemon idea could also be used for those arrangements as well.

So, what's not to like about the quorum daemon?  Well, you do have to make a certificate for the server, and one for each client.  But, you don't have to pay for them, because they don't need to come from one of the well-known certificate producers.  It would be nice if the quorum daemon would sync its state to a slave copy that could be used in a cluster as a master/slave resource in case the machine running the quorum daemon failed.  You can still run it in a cluster with another machine to take over for it, but the takeover isn't as graceful as it would be if there were a slave copy ready to take over for the master complete with its quorum state.

I've recently learned that HP ServiceGuard[4] has a feature similar to this which they call the quorum server daemon (qs)[5].  The documentation indicates that HP's qs authenticates by IP address and does not use SSL for communication or authentication/authorization.  As a result, there are security concerns associated with it, and qs is probably unsuitable for split-site configurations.

And, even more recently, thanks to Nils Gorrol I've learned that Sun has a similar feature in their sqsd [6] since SunCluster version 3.2.  Looking at the documentation, it looks as though it has similar security issues that HP's qs has for split-site arrangements and potentially insecure networks.

Thanks to all those who keep me on the straight and narrow regarding the facts ;-)

Ping tiebreaker

Some HA systems provide  a ping tiebreaker.  To make this work, you pick a address outside the cluster to ping, and any partition that can ping that address has quorum.  The obvious advantage is that it's very simple to set up - doesn't require any additional servers or shared disk.  The disadvantage (and it's a big one) is that it's very possible for multiple partitions to think they have quorum.  In the case of split-site (disaster recovery) type clusters, it's going to happen fairly often.  If you can use this method for a single site in conjunction with fencing, then it will likely work out quite well.  It's a lot better than no tiebreaker, or one that always says "you have quorum".  Having said that, it's significantly inferior to any of the other methods.

If I've omitted things you want to know about, or this brings up ideas or questions in your mind, or you know of other implementations of the quorum daemon idea, by all means post a reply to this article.

References

[1] http://techthoughts.typepad.com/managing_computers/2007/10/split-brain-quo.html
[2] http://linux-ha.org/
[3] http://www.linux-ha.org/QuorumServerGuide
[4] http://h71036.www7.hp.com/enterprise/cache/6468-0-0-0-121.html
[5] http://www.docs.hp.com/en/B3936-90065/ch03s01.html
[6] http://docs.sun.com/app/docs/doc/819-5360/gbdud?a=view

October 29, 2007

Bad application design => Bad availability (more Rockies ticket debacle)

One final quick note on the Rockies ticket sales debacle - following up on my previous posting[1] on the subject.  This note discusses how to including the humans in your system design can improve both your perceived availability and your customer satisfaction while cutting your costs.

Pacolian did tweak their application to be a little more friendly to their infrastructure and perhaps fixed some networking issues, so that people did eventually get tickets to the Rockies/Red Sox games in Denver.  But, many, many people got told they were going to get tickets, but the system was still so slow and under such heavy load that it timed out many or even most users before they could pay for their tickets.  So, even the second try was highly unsatisfactory, and increased infrastructure costs.  Although you could argue that the system didn't crash, having it be so slow as to be unusable certainly creates the perception of unavailability in the minds of many users.

If you want to sell tickets to a hot event, there are many ways to do it.  For any system you come up with, you can expect the users who use it to try and game the system somehow to increase their odds.

What are the goals of selling tickets to events?

  • sell all the tickets
  • fairness
    • discourage professional ticket scalpers
    • limit number of tickets sold to any one party
    • make "gaming" the system harder
  • minimize infrastructure costs
  • minimize customer frustration

The way Pacolian did it, was to allow people to queue up in the web site and then sell the tickets first-come-first served beginning at a certain time.  Although this might sound reasonable because it's modeled after how tickets are sold at box offices in the real physical world, it's a bad idea in the Internet.  This encourages exactly the behavior that took their systems down, maximized customer frustration, and infrastructure costs, while still giving an edge to professional scalpers.  In their system, if you want tickets, obviously you open as many browser windows at a time as you can, hit the system as hard as you can at exactly the same time to increase your chances of getting in line first.  This creates a situation similar to what operating system people sometimes call thundering herd[2] [3] behavior.  The difference here is that the thundering herd loop included people and browsers.  It makes kind of a comical sight if you think about it...

Another method which I seem to recall hearing that other vendors use is a lottery system.  Such a system might work like this (details may need tweaking):

  1. Announce a registration period of 8-18 hours where people can register for which games they want to buy tickets for, and which price ranges they're willing to buy.
  2. Anyone registering during this time can register once, and give an email address where they can be reached, and which can be accessed from this computer.  Warn people not to do this on shared computers.
  3. Each computer gets one cookie and one email address for registering.  If you try and re-register the same browser or use the email address multiple times, you're rejected.
  4. At the end of the registration period, winners are randomly selected, and notified by email including a random cookie which is correlated to the cookie given to their browser earlier.  (you have to have both the cookie and the email to purchase a ticket).
  5. They have a specified period of time to purchase the tickets, and any given credit card number can't be used more than once per game.  Any tickets not paid for by that time get sold to other applicants.

This means that there is no advantage to registering early or late.  As long as you register, you have the same chance as anyone else.  The peak load on such a lottery system system is probably one or two orders of magnitude below that of the Pacolian thundering herd design.

The result of this is:

  • System availability goes up,
  • Customer satisfaction goes up
  • Infrastructure costs go down.
  • The possibility of gaming the system still exists, but is probably no worse than the original Pacolian system.

All in all, a big win.  To be fair, I didn't think of this idea myself, but before the Rockies/Pacolian debacle, I'd never put much thought into ticket sales either.  Because I'm not an expert in this area, there are no doubt many improvements that an expert would make to my proposal to make it harder to game.  However, from an availability perspective, this application design is much more robust than the original Pacolian system because it takes into account the motivations of the humans that are part of the computer system.

References

[1] http://techthoughts.typepad.com/managing_computers/2007/10/the-cost-of-un-.html
[2] http://catb.org/jargon/html/T/thundering-herd-problem.html
[3] http://en.wikipedia.org/wiki/Thundering_herd

October 22, 2007

The cost of un-availability - and the value of a bad example

Today, the good people of Major League Baseball suffered what looked like a denial-of-service attack which kept them from selling tickets to the (at least) the World Series games in Denver (at Coors Field).  This "attack" started at the same time as tickets sales began - 1000 MDT.

Amusingly enough, this apparent denial-of-service attack was probably caused by customers.  This year's World Series promises to be a good one, between the venerable Boston Red Sox[1] and the unbelievably hot and exceptional Colorado Rockies[2] (Go Rockies!), who are in the World Series for the first time ever, and who have played some absolutely amazing baseball in recent weeks.

As a result of this first-of-a-kind opportunity, many Coloradans stayed home from work, or took a "break" from work to order tickets all at once.  The server infrastructure couldn't stand up to the load, and no one got enough packets through to be able to order any tickets.

Since I was one of those trying to order tickets, and this was clearly a lack of availability in a critical time, I did a tiny bit of investigation.  It appears that they had about 15 servers in their mix.  The ticket sales are being managed by Paciolan[3].

In an event like this, it is vital that they have both a load balancing methodology and a load-shedding methodology.  It appears that they had both.  However, the symptoms suggest that their load-shedding methodology was insufficient to this incredible load.  Some of us Coloradans may be fickle Rockies' fans, but we're sure loyal when they're hot - especially when our other teams aren't going anywhere!    Unfortunately, the Rockies were too hot and and their fans too loyal for Pacolian's load shedding infrastructure.

Here's what I can see about their infrastructure from the outside:

  • They have external web servers which put you in a holding pattern, and try and get through to the inner sanctum of web servers once a minute.
  • If you get in to the inner web servers, you can then order tickets, presumably without a heavy overload, since the inner infrastructure limits the number of simultaneous users.

However, there is an Achilles' heel here - which both the loyal and the fickle Rockies' fans ran immediately into.  You have to have enough network bandwidth to allow people access to the outer infrastructure.  If you don't, various bad things can happen - your load balancers can crash, your routers can crash.  I'm not a networking expert, but if the offered load is an order of magnitude or two higher than the incoming infrastructure can support, most packets won't get through.  If most don't get through, then it doesn't matter how good the load shedding methods are, or how robust your servers are.  Customers can't buy your product.  OOPS!

Pacolian claimed that they were the victims of a real DDOS attack, and that they measured 8.5 million hits in an hour.  Quite honestly, that doesn't sound that high to me.  Personally, I had 4 browsers going at once.  8.5 million hits and hour is only 2361 hits/second, which is less than 142K hits/minute.  In my judgment, between people like me, school kids, scalpers, etc. 142K people isn't very many.  If they were all running 3 or 4 browsers, it would only take about 35K people - which is basically nothing.  Also, when you look at network bandwidth, ethernet links, no matter how fast they are, can only support a relatively small number of packets/second maximum because of minimum times required between packets.  IIRC, that number is something like 1000 packets/sec.    If they only had a single gigabit link to their infrastrucure, 142K people would be 2 orders of magnitude larger than that - which would put us in the right ball park (pun intended) for the behavior that was observed.

One can look at the fact that they almost certainly had only a single site to take this load as a single point of failure.  The networking infrastructure to that site failed.  Exactly how it failed, I can't say.  The fact that it failed is indisputable.

The good news for the Rockies is that because it's the World Series - eventually, somehow those tickets will get sold. But, in the mean time, they appear to have dumped Paciolan for this event[4].  This is unfortunate from my perspective, since I'm in the UK at the moment, and can't exactly run down to Coors Field to wait in line to buy tickets.  Oh well.  I guess it's not really all about me, eh?  :-D

But, this will take longer, and has already aggravated customers a great deal.  The Rockies will probably survive this error - after all that's not their specialty - they're baseball players.  But, it will take Paciolan a while to live this down.  They underestimated the load, maybe they bid too low for selling the tickets, maybe it was done with too little lead time, whatever.  In the end, this is their failure.  This will have cost them a good bit of reputation.  If they had succeeded, they would probably have the World Series business and maybe other sports for several years to come.  Under the circumstances, this will no doubt be a tremendous opportunity cost  - the cost of lost opportunities.

Although I can't quantify the size of their opportunity loss, it is a clear illustration of how it is that lack of availability translates directly into the bottom line of companies - even if it's future revenues.  To be fair to Paciolan,I believe that this is the first time anyone has attempted to sell  World Series tickets only by the web.  Hopefully, it won't be the last.

IBM has done some very high-profile sports web sites in the past, and I can tell you from the people that I've worked with who worked on those, that they required lots of money, incredible planning, always at least three geographic sites with separate networking infrastructure.  Fortunately, so far, IBM has not suffered any embarrassing failures of this magnitude.

Maybe you're saying "I don't sell World Series Tickets, so this doesn't apply to me".  Probably you don't sell World Series tickets.  But, you do probably do something vital for your company's future health.  But, if you take this catastrophe to heart, maybe you can avoid the same embarassing and expensive fate as Paciolan.

If anyone from Paciolan reads this article, I'm sure our readership would love to hear what actually went wrong, and how, in your opinion, it might have been avoided.  Of course, feel free to correct the things I guessed at as well!

Even more than usual, I commend my disclaimer page to you.  I don't speak for anyone but myself.  Not my employer, the Boston Red Sox, the outstanding Colorado Rockies, nor the unlucky/unfortunate Paciolan.

See also my follow-up posting [5] on this subject.

References

[1] http://boston.redsox.mlb.com/index.jsp?c_id=bos
[2] http://colorado.rockies.mlb.com/index.jsp?c_id=col
[3] http://www.paciolan.com/
[4] http://colorado.rockies.mlb.com/news/press_releases/press_release.jsp?ymd=20071022&content_id=2276226&vkey=pr_col&fext=.jsp&c_id=col
[5] http://techthoughts.typepad.com/managing_computers/2007/10/bad-application.html

October 10, 2007

Split-brain, Quorum, and Fencing - updated

In some ways, an HA system is pretty simple - it starts services, it stops them, and it sees if they and the computers that run them are still running.  But, there are a few bits of important "rocket science" hiding in there among all these apparently simple tasks.  Much of the rocket science that's there centers around trying to solve a single thorny problem - split brain.  The methods that are used to solve this problem are quorum and fencing.  Unfortunately, if you manage an HA system you need to understand these issues.  So this post will concentrate on these three topics: split-brain, quorum, and fencing.

If you have three computers and some way for them to communicate with each other, you can make a cluster out of them and,each can monitor the others to see if their peer has crashed.  Unfortunately, there's a problem here - you can't distinguish a crash of a peer from broken communications with the peer.  All you really know is that you can't hear anything from them.  You're really stuck in a Dunn's law[1] situation - where you really don't know very much, but desperately need to.  Maybe you don't feel too desperate yet.  Perhaps you think that you don't need to be able to distinguish these two cases.  The truth is that sometimes you don't need to, but much of the time you very much need to be able to tell the difference.  Let's see if I can make this clearer with an illustration.

Let's say you have three computers, paul, silas, and mark, and paul and silas can't hear anything from mark and vice versa.  Let's further suppose that mark had a filesystem /importantstuff from a SAN volume mounted on it when we lost contact with it. and that mark is alive but out of contact.  What happens if we just go ahead and mount /importantstuff up on paul? The short answer is that bad things will happen[2]. /importantstuff will be irreparably corrupted as two different computers update the disk independently.  The next question you'll ask yourself is "Where are those backup tapes?". That's the kind of question that's been known to be career-ending.

Split-Brain

This problem of a subset of computers in a cluster beginning to operate autonomously from each other is called Split Brain[3]. In our example above, the cluster has split into two subclusters: {paul, silas} and {mark}, and each subset is unaware of the others.  This is the perhaps most difficult problem to deal with in high-availability clustering.  Although this situation does not occur frequently in practice, it does occur more often than one would guess.  As a result, it's vital that a clustering system have a way to safely deal with this situation.

Earlier I mentioned that there was information you really want to know, but don't know.  Exactly what information did I mean?   What I wanted to know was "is it safe to mount up /importantstuff somewhere else?".  In turn, you could figure that out if you knew the answer to one of these two questions:  "Is mark really dead?" which is one way of figuring out "Is mark going to write on the volume any more?"  But, of course, since we can't communicate with mark, this is pretty hard to figure out.  So, cluster developers came out with a kind of clever way of ensuring that this question can be answered.  We call that answer fencing.

Fencing

Fencing is the idea of putting a fence around a subcluster so that it can't access cluster resources, like  /importantstuff.  If you put a fence between it and its resources, then suddenly you know the answer to the question "Is mark going to write on the volume any more?" - and the answer is no - because that's what the fence is designed to prevent.  So, instead of passively wondering what the answer to the safeness question is, fencing takes action to ensure the "right" answer to the question.

This sort of abstract idea of fencing is fine enough, but how is this fencing stuff actually done? There are basically two general techniques:  resource fencing [4] and node fencing.[5]. 

  • Resource fencing is the idea that if you know what resources a node might be using, then you can use some method of keeping it from accessing those resources. For example, if one has a disk which is accessed by a fiber channel switch, then one can talk to the fiber channel switch and tell it to deny the errant node access to the SAN.

  • Node fencing is the idea that one can keep a node from accessing all resources - without knowing what kind of resources it might be accessing, or how one might deny access to them.  A common way of doing this is to power off or reset the errant node.  This is a very effective if somewhat inelegant method of keeping it from accessing anything at all.  This technique is also called STONITH[6] - which is a  graphic and colorful acronym standing for Shoot The Other Node In The Head.

With fencing, we can easily keep errant nodes from accessing resources, and we can now keep the world safe for democracy - or at least keep our little corner of it safe for clustering.  An important aspect of good fencing techniques is that they're performed without the cooperation of the node being fenced off, and that they give positive confirmation that the fencing was done.  Since errant nodes are suspect, it's by far better to rely on positive confirmation from a correctly operating fencing component than to rely on errant cluster nodes you can't communicate with to police themselves.

Although fencing is sufficient to ensure safe resource access, it is not typically considered to be sufficient for happy cluster operation because without some other mechanism, there are some behaviors it can get into which can be significantly annoying (even if your data really is safe).  To discuss this, let's return our sample cluster.

Earlier we talked about how paul or silas could use fencing to keep the errant node mark from accessing /importantstuff.  But, what about mark?  If mark is still alive, then it is going to regard paul and silas as errant, not itself.  So, it would also proceed to fence paul and silas - and progress in the cluster would stop.  If it is using STONITH, then one could get into a sort of infinite reboot loop, with nodes declaring each other as errant and rebooting each other, coming back up and doing it all over again.  Although this is kind of humorous the first time you see this in a test environment - in production with important services, the humor of the situation probably wouldn't be your first thought.  To solve this problem, we introduce another new mechanism - quorum.

Quorum

One way to solve the mutual fencing dilemma described above is to somehow select only one of these two subclusters to carry on and fence the subclusters it can't communicate with.  Of course, you have to solve it without communicating with the other subclusters - since that's the problem - you can't communicate with them.  The idea of quorum represents the process of selecting a unique (or distinguished for the mathematically inclined) subcluster.

The most classic solution to selecting a single subcluster is a majority vote.  If you choose a subcluster with more than half of the members in it, then (barring bugs) you know there can't be any other subclusters like this one. So, this is looks like a simple and elegant solution to the problem. For many cases, that's true.  But, what if your cluster only has two nodes in it?  Now,  if you have a single node fail, then you can't do anything - no one has quorum.  If this is the case, then two machines have no advantage over a single machine - it's not much of an HA cluster.  Since 2-node HA clusters are by far the most common size of HA cluster, it's kind of an important case to handle well.  So, how are we going to get out of this problem?

Quorum Variants and Improvements

What you need in this case, is some kind of a 3rd party arbitrator to help select who can fence off the other nodes and allow you to bring up resources - safely.  To solve this problem there is a variety of other methods available to act as this arbitrator - either software or hardware. Although there are several methods available to use as arbitrator, we'll only talk about one each of hardware and software methods: SCSI reserve and Quorum Daemon.

  • SCSI reserve:  In hardware, we fall back on our friend SCSI reserve.  In this usage, both nodes try and reserve a disk partition available to both of them, and the SCSI reserve mechanism ensures that only one of the two of them can succeed.  Although I won't go into all the gory details here, SCSI reserve creates its own set of problems including it won't work reliably over geographic distances.  A disk which one uses in this way with SCSI reserve to determine quorum is sometimes called a quorum disk.  Some HA implementations (notably Microsoft's) require a quorum disk.

  • Quorum Daemon:  In Linux-HA[7], we have implemented a quorum daemon - whose sole purpose in life is to arbitrate quorum disputes between cluster members.  One could argue that for the purposes of quorum this is basically SCSI reserve implemented in software - and such an analogy is a reasonable one.  However, since it is designed for only this purpose, it has a number of significant advantages over SCSI reserve - one of which is that it can conveniently and reliably operate over geographic distances, making it ideal for disaster recovery (DR) type situations.  I'll cover the quorum daemon and why it's a good thing in more detail in a later posting.  Both HP and Sun have similar implementations, although I have security concerns about them, particularly over long distances.  Other than the security concerns (which might or might not concern you), both HP's and Sun's implementations are also good ideas.

Arguably the best way to use these alternative techniques is not directly as a quorum method, but rather as a way of breaking ties when the number of nodes in a subcluster is exactly half the number of nodes in the cluster.  Otherwise, these mechanisms can become single points of failure - that is, if they fail the cluster cannot recover.

Alternatives to Fencing

There are times when it is impossible to use normal 3rd-party fencing techniques.  For example, in a split-site configuration (a cluster which is split across geographically distributed sites), when inter-site communication fails, then attempts to fence will also fail.  In these cases, there are a few self-fencing alternatives which one can use when the more normal third-party fencing methods aren't available.  These include:

  • Node suicide.  If a node is running resources and it loses quorum, then it can power itself off or reboot itself (sort of a self-STONITH).  The remaining nodes wait "long enough" for the other node to notice and kill itself.  The problem is that a node which is sick might not succeed in self-suicide, or might not notice that it had a membership change, or had lost quorum.   It is equally bad if notification of these events is simply delayed "too long".  Since there is a belief that the node in question is, or at least might be, malfunctioning, this is not a trivial question.  In this case, use of hardware or software watchdog timers becomes critical.

  • Self-shutdown.  This self-fencing method is a variant on suicide, except that resources are stopped gracefully.  It has many of the same problems, except it is somewhat less reliable because the time to shut down resources can be quite long.  Like the case above, use of hardware or software watchdog timers becomes critical.

Note that without fencing, the membership and quorum algorithms are extremely critical.  You've basically lost a layer of protection, and you've switched from relying on a component which gives positive confirmation to relying on a probably faulty component to fence itself, and then hoping without confirmation that you've waited long enough before continuing.

Summary

Split-brain is the idea that a cluster can have communication failures, which can cause it to split into subclusters.  Fencing is the way of ensuring that one can safely proceed in these cases, and quorum is the idea of determining which subcluster can fence the others and proceed to recover the cluster services.

An Important Final Note

It is fencing which best guarantees the safety of your resources.  Nothing else works quite as well.  If you have fencing in your cluster software, and you have irreparable resources (i.e. that would be irreparably damaged in a split-brain situation), then you must configure fencing.  If your HA software doesn't support (3rd party) fencing, then I suggest that you consider getting a different HA package.

See Also

General cluster concepts[8]

References

[1]   http://linux-ha.org/DunnsLaw
[2]   http://linux-ha.org/BadThingsWillHappen
[3]   http://linux-ha.org/SplitBrain
[4]   http://linux-ha.org/ResourceFencing
[5]   http://linux-ha.org/NodeFencing
[6]   http://linux-ha.org/STONITH
[7]   http://linux-ha.org/
[8]   http://linux-ha.org/ClusterConcepts

September 28, 2007

Virtualization as High Availability (Disaster Recovery) or High-Availability as Virtualization?

Many pundits and other folks like VMware's CEO Diane Greene have touted[1] virtualization as being the "cure" to disaster recovery, many for the past several years.  Disaster recovery can be pretty reasonably viewed as being high-availability over distance, so it makes some sense to see how DR, HA and virtualization fit together.  What's hype here, and what's real?  Let's look and see what we find out.

What is virtualization?

Wikipedia[2]defines virtualization like this:

In computing, virtualization is a broad term that refers to the abstraction of computer resources. One useful definition is "a technique for hiding the physical characteristics of computing resources from the way in which other systems, applications or end users interact with those resources".

At the moment, this term is most commonly used to refer to machine or OS-level virtualization,  storage virtualization and something I'll call container virtualization.  For now, let's ignore storage virtualization.   Machine-level virtualization means basically making it difficult to tell exactly which physical machine an OS is running on at any given point in time.  There are many other kinds of virtualization as well - for example service virtualization.   Service virtualization doesn't hide or abstract physical machines, but instead virtualizes or hides services.  That is, there is no fixed binding between services and physical machines.  This is, in fact, exactly what classic HA software does - software like Linux-HA[3] . What an HA system does is use service virtualization to recover from server failures, and monitors services so that it can restart them - either locally or on another server.

Customers care about services, not about OSes or physical machines, so from their point of view, machine and service virtualization are similar.  Both  provide the potential for recovering from failures of physical servers and OSes. However, since machine virtualization doesn't  have any visibility of the individual services and the dependencies between them, it can't monitor them and restart them if they fail.

But, the ideas of virtualization need management entities to oversee them and decide to move virtual entities, and coordinate the movement of these virtual entities from one place to another.    VMware will soon have their Site Recovery Manager[4], and Linux has Linux-HA[3] (among other solutions).  The purpose of such management entities is to detect and respond to failures and administrative requests by starting, stopping, and migrating virtual entities in their purview to different sets of physical servers.

What does server (or OS-level) virtualization brings to the table here?

Why would I care?  If HA systems produce a result which sounds so far like it would be similar or even superior from the customer's perspective to what OS level virtualization does (at least from an availability perspective), then why should I care?  The answer is not found in responses to failures, but in responses to administrative events.  If you want to migrate a running virtual machine from physical machine A to physical machine B, many server virtualization techniques (including Xen[10] and VMware) provide a new tool - transparent migration.  In transparent migration, the virtual machine image is migrated from physical machine A to physical machine B while processing work continually without appearing to stop.  This is a pretty cool trick, since you can then migrate virtual machines with apparently zero downtime.  Sadly, you can't migrate a crashed OS or an OS from a crashed server in this way.  As a result, it doesn't help at all with unplanned outages.  However, it makes it possible to take a planned outage with very minimal impact.  In a practical sense, if your server is running full-throttle and writing on all the pages in memory very rapidly, the migration will likely impact performance. So, it's best to not to migrate virtual machines during peak load without good reasons.


Can conventional HA software manage transparent migration?

The simple answer is "mostly no".  Most HA software has only three basic operations that they perform on their resources:  Start, Stop and Monitor.  If you map these operations to operate on virtual machines as resources, start means boot the OS, stop means shut it down, and monitor means look and see if it appears to be running correctly.  Note that each of these also operates on only one machine at a time. A migration like described above becomes stop the virtual machine on machine A, and then start it on machine B.  This is hardly transparent, since the stop and start operations create an outage which will range in time from two minutes to 30 or more minutes and completely loses all transient application state.

The problem is in the HA software's model of resources (or services).  The operations which it performs are like this OP(resource, node) for any given OP like start, stop or monitor.  What's needed instead is a operation model which permits OP(resource, fromnode, tonode).  In addition, when one adds dependencies, then the migratable resource can't depend on any non-migratable resource. [It's actually a little more complicated than this, but it would take too much time to explain all the details in this post).

Are there any HA packages that can model transparent migration?

Yes.  Sorry to sound a bit like a broken record[13], but Linux-HA[3] can do that very nicely - and to my knowledge it is the only HA system which can.  It includes support for migratable resources like virtual machines and container virtualization, and handles the complexities of dependencies alluded to above, along with other potential problems not alluded to above.  So, if you use Linux-HA to manage your virtual machines, you get full use of transparent migration without having to figure out some monumental kludge to shoehorn something kind of like transparent migration into your HA package.  It just works.  And, it just works for containers, and resources which support checkpoint restart.  All in all, a nice feature I'd say.

What does transparent migration have to do with disaster recovery?

Nothing.  Well... Almost nothing.  Let me explain...  Since transparent migration only helps move services when the machines are up, that's not normally very useful in a disaster - since presumably a fire or flood or power failure, or building collapse or whatever has rather rudely and typically unexpectedly interrupted service - precluding migration. Why did I say almost nothing above?  Because some kinds of disasters (hurricanes, and floods for example) often give warning in advance - so you can use transparent migration for those cases.  The other reason is that you can test your disaster recovery without any service outages.  Or rather, you can partially test your disaster recovery.  If there are any resources (disk drives for example) which are only needed during boot, then if those are missing, you probably won't notice when you migrate over and back. So, it is still desirable to periodically do non-transparent migrations of your services to make sure things are all still working after the latest configuration changes.  However, you can migrate over and back every weekend, and maybe once a month or once a quarter perform non-transparent migrations just to make sure all your bases are covered.

Can't you do this with an HA cluster split across sites?

Yes.  You can do it with only service virtualization and a conventional HA system, or you can do it with Linux-HA[3] and either service or server virtualization.  In either case, you have to prepare for the replication of data[5] across sites.  Note that making sure all the data can be accessed from multiple physical machines is part of the baseline discipline which is imposed both by virtual machine failover and by service failover.   This discipline is a starting place for any kind of disaster recovery.  This is a valuable discipline to undertake.  So, although this discipline of replicating data and OSes is a good thing, since it's not unique to virtual machines, this part is more hype than reality.  There's nothing special about virtual machines in this respect.  In this respect, any differences that exist, don't exist because of virtualization per se, but how easy the management tools are to use correctly, and how comprehensive they are.

Advantages of service virtualization

  • Monitors individual services [6] [7] - can detect and recover from software errors that don't cause OS crashes

  • Can recover individual services (more fine-grained)

  • Can recover servers and services without time for a reboot  (faster)

Advantages of Server virtualization

  • Transparent migration for administrative events is outage-free

  • All virtual machines more or less look alike (simplicity)

Best of Both Worlds?

Linux-HA[3] can model virtual machines as resources, and it can model normal services (web server, etc) as resources.  Can it do both at the same time?  Well... Yes and No...  Doing this would require that you have both virtual machines as resources and as cluster members at the same time.  Although you could do this, it might have unpleasant side effects.  However, you could have a cluster of virtual machines and a cluster of real machines - and that would work - but would be two clusters to manage.  To truly get the best of both worlds, you'd have to implement our container resource proposal[8] or something like it.  Of course, since Linux-HA is an open source project, we'd love to see a patch to implement it ;-).

See Also

Virtualization Blog[9],
Dabcc virtualization and server management web site[12],

References

[1]  http://www.byteandswitch.com/document.asp?doc_id=133643
[2] http://en.wikipedia.org/wiki/Virtualization
[3] http://linux-ha.org/
[4] http://www.vmware.com/company/news/releases/srm.html
[5] http://techthoughts.typepad.com/managing_computers/2007/09/automated-disas.html
[6] http://techthoughts.typepad.com/managing_computers/2007/09/monitoring---a-.html
[7] http://techthoughts.typepad.com/managing_computers/2007/09/tools-for-servi.html
[8] http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1417
[9] http://www.virtualization.info/
[10] http://www.xensource.com/
[11] http://www.vmware.com/
[12] http://www.dabcc.com/
[13] http://en.wikipedia.org/wiki/Gramophone_record

September 12, 2007

Requested / Suggested Topics - updated

I've been thinking about what kind of content would be useful here.  Those of you who know me know that I'm happy to talk about almost anything.  But not all things are useful, and there are more useful things to talk about than I have time to talk about.  So, here are some of my thoughts from recent conversations with people and reading Russell Coker's HA blog postings[1], and a request[2] from Emily Ratliff's blog.

  • Finish the series on Disaster Recovery
  • Talk about the capabilities of the policies you can express in Linux-HA's[3] CIB
  • How repeated failures accumulate in failure stickiness
  • The importance of civility and friendliness in projects - how Linux-HA[3] has succeeded and failed in this way, and why
  • Fencing (STONITH[4]) discussion
  • Quorum and the relationship to STONITH (fencing)
  • HA and virtualization - HA as virtualization, virtualization as HA[5].
  • Linux-HA[3] capability tour
  • HA/DR training and certification
  • Single System Image - do you need it?
  • Tracking Linux-HA installs - how many people use it and how do you figure it out?

Please chime in with your thoughts and suggestions on which of these (or other) topics interest you.

PS:  Comments should be turned on for the blog now.  Sorry they weren't on at first for my previous posting.

Links:
[1] http://etbe.coker.com.au/category/ha/
[2] http://www.ratliff.net/blog/index.php/2007/09/10/alan-robertsons-new-blog-on-managing-computers/
[3] http://linux-ha.org/
[4] http://linux-ha.org/STONITH
[5] http://www.byteandswitch.com/document.asp?doc_id=133643&WT.svl=news1_3

 

September 07, 2007

Automated Disaster Recovery - Data Replication

This post covers the basics of data replication in automated disaster recovery.  Automated disaster recovery means that when a site providing a service goes down, the service will continue running elsewhere without human intervention.  If you're willing to trust your automation, this is an ideal situation - one site goes down (for whatever reason) and the other one takes over - automatically - and pretty quickly.  In most cases, your users get continued access to the service.  The Linux-HA software is well-prepared to do this for you correctly - even when your cluster is split across sites.  It has specific technology in it to deal with this situation - which I'll explain in a later post.

In Disaster recovery terminology, this kind of a solution provides meets both excellent recovery point objectives (RPO), and recovery time objectives (RTO).

There are a few problems with this though -- the second site has to have enough servers to be able to run the service, and of course, the software to run the service, and - the most difficult part - it has to have an up-to-date copy of the data the service uses.  This post will concentrate on this aspect of automated disaster recovery.

There are two basic approaches to replicating data across sites - application-specific, or disk volume level replication.

Application-specific Replication
A number of applications (DB2 UDB, Oracle, DHCP, DNS, etc.) have built-in (or add-on) replication mechanisms which will do a great job of keeping two copies of your data in sync - and these methods are typically the best when they're available.  However, if your application doesn't have such a method, then you'll have to use disk-volume-level replication.

Disk-volume-level replication
If you're running Linux, a good general way of replicating dynamic data copied across sites is DRBD.  DRBD is a great open source tool which keeps two copies of data in sync, and worries a lot about data integrity -- which copy of the data is up to date, and related things.  It works most efficiently when you make one side master and update from only the master side.  Typically, in a split-site design, this is typically what's needed - both for efficiency and keeping both your sanity and your data's sanity.

DRBD is only one method of doing this.  Most storage vendors (IBM for example) sell products that can replicate storage volumes across distances.  There are also a number of other commercial replication packages, and a few other open source disk replication packages.

If you are going to do automated site failover, then it's highly recommended that you replicate your data synchronously rather than asynchronously.

Synchronous replication means that before a disk write (or database transaction) is reported as being complete, that the replication software makes sure the other side has a good copy of the data on disk.  That way, no writes (or transactions) are ever reported as complete, but somehow lost after a failover.

With asynchronous replication,  the replication software ensures only that the write (or transaction) is queued to send to the other machine before the write or transaction is reported as being complete.  This means that in case of a crash, the last few writes or transactions may be lost - which may compromise your recovery point objective (RPO)..  If you fail over to the backup site, those last few transactions may be lost permanently.  This is obviously a disadvantage.  If you failed over automatically due to a communications link being down for a few minutes, those transactions were lost for what may turn out to be very little reason.  This is the kind of thing that has been known to encourage people to make out their resumes unexpectedly.

So, with asynchronous replication having rather serious limitations, why do people use it?  It's simple really - the speed of light.  If your two sites are 1000 km apart, the round trip time to send a packet, write the data, and confirm it is greater than the speed of light in fiber.   Normally people think of the speed of light as 300,000 km/sec.  However, in fibre, the speed of light is more like 195000 km/sec.  If you want your transactions to complete with no more than 1ms delay due to network delays, then you need your round trip ping time to be less than 1 ms.   This means that in the worst case, the round trip distance can't be more than 195 km (or 97 km each way).  However, with disk delays, and delays induced by store-and-forward algorithms in networking hardware, the practical rule-of-thumb distance is more like 50 km.  But, of course, this depends on how write-intensive your application is, and how much delay you can tolerate.  I'm not an expert in this area, but perhaps someone reading this can offer their opinions on it.  You probably figured out pretty much right away, that you can't get this kind of latency over the public Internet - there are just too many hops.  This means for synchronous replication that you need a dedicated (and obviously low-latency) network between your sites.

To summarize - for the things we've talked about so far, you need:

  • Two (or more) sites
  • Servers on both sites
  • Replication software or hardware
  • Duplicate storage on both sites
  • Dedicated low-latency bandwidth to connect the two sites

Of course, this isn't nearly enough for a complete solution, but it covers most of the basics for replication.

Another resource worth looking at is Christoph Miasch's thesis on Server-Based Wide Area Data Replication for Disaster Recovery