Heartbeat

January 07, 2008

A Complete Cluster Stack for Linux

Recently, I've had some folks ask me offline what exactly would a “complete” Linux cluster stack look like.  That's a good question, and this posting is intended to address that question.

So let's start with – what kind of cluster?  For the purposes of this posting, I'm primarily talking about a full-function high-availability enterprise-style cluster, not primarily a load balancing cluster, and not a high-performance scientific (Beowulf-style) cluster.

A few caveats before proceeding – much of what I'll reference below will be relative to the Linux-HA[1] framework, but the concepts are easily translated to any other clustering framework one might have in mind.

It's also worth noting that not every application, nor every configuration needs every component.   Adding unnecessary components adds complexity, and complexity is the enemy of reliability.

Many of the components (cluster filesystem, DLM) are primarily needed by cluster-aware applications.  Note that at this time (early 2008) very few applications are cluster-aware.

The Full Cluster Stack Exposed!

Below is a picture of the full cluster stack – which I'll describe in more detail later.  For the most part, the components higher up in the picture build on the components lower down in the picture.  To simplify the drawing, I didn't add all the who-uses-whom lines that one might want to make a detailed study of this subject.

Full_cluster_stack

 

Cluster Comm - Intracluster  communications

The most basic component any cluster needs is intracluster communications. There are a variety of different possibilities, but guaranteed packet delivery is a requirement.  Linux-HA has its own custom comm layer for doing this.  It's not perfect, but it works.  At one time in the past, we provided support for the AIS cluster APIs, and if you use OpenAIS today, then you can still have a reasonable cluster using Linux-HA and providing compatible support for the AIS protocols.  As will become clear, it's not a perfect configuration, but it's a reasonable one.  (Of course, like everything, it can always be improved even more)

Very large clusters (hundreds to tens of thousands of nodes) will likely require a different communication protocol, since most guaranteed delivery multicast protocols don't scale that high. 

Nevertheless, in an ideal world, all cluster components and cluster-aware applications would sit on top of the same set of communications protocols.

Membership – who's in the cluster?

Looking to the right of the cluster comm box on our architecture chart, you'll see the membership box.  The next basic function that a cluster has to provide is membership services.  Membership closely related to communication – since a simplistic view of membership is just who we can communicate with.  It is highly desirable that everyone in the cluster be able to communicate with everyone else. It's the job of the membership layer to provide this information to the cluster.

When your communication fails in weird ways, it's the job of the membership layer to present a view of the cluster that makes sense – in spite of the weird kinds of failures that might be going on.

If we eventually wind up with multiple kinds of communications methods, then we'll also have multiple ways of becoming a member.

Linux-HA (with or without OpenAIS) supports the AIS membership APIs.

Like I mentioned for communication, in an ideal world,  all cluster components would use exactly the same membership information.  However, it is important to note that the membership one uses must be computed using the communication method being used by the application.  So, unless every cluster-aware application uses both the common communication method and the common membership, it risks getting its membership out of sync with respect to its communication and other components using other communication methods.  In many cases, this can't be avoided.  Methods for coping with such discrepencies are discussed in more detail at the end of this post.

Fencing

Fencing is the ability to “disable” nodes not currently in our membership without their cooperation  Many of you will remember having discussed this in some detail in an earlier post[2]. As I explained in more detail there, fencing  is vital to ensure safe cluster operation

Our current implementation is STONITH[3]-based - STONITH == Shoot The Other Node In The Head

Quorum

Quorum is a topic we've talked about extensively in a couple of earlier posts [2] [4]. Quorum encapsulates the ability to determine whether the cluster can continue to operate safely or not

Quorum is tied closely to both fencing and membership.  In practice, as we discussed before, it is often highly desirable to implement multiple types of quorum.  Linux-HA currently provides multiple implementations and can provide more through plugins.  Like membership and communication, it is desirable for all cluster components to use the same quorum mechanism.  All the interesting and legal ways that quorum can interact with fencing and membership and the communication layer are too detailed for this posting.

Cluster Filesystems

Cluster filesystems allow multiple machines to sanely mount the same filesystem at the same time.  This is a great boon to parallel applications.  Cluster filesystems typically don't use the network or another server involved when doing bulk I/O. Each node mounting the filesystem is normally expected to have access to the data.  This typically requires a SAN.

Typically, this is done to improve performance, but convenience and manageability are common secondary goals  Cluster filesystems are related to, but distinct from, network filesystems like NFS and CIFS. On Linux, there are several cluster filesystems available including

  • Red Hat's GFS

  • Oracle's OCFS2

  • IBM's GPFS

  • etc.

Normally, when they're being used for performance reasons, cluster-aware applications are required.  You can't typically just run 'n' copies of your favorite cluster-ignorant application and have it work.  The filesystem won't scramble the data, but your application typically will.  It goes without saying that high-performance cluster filesystems run in the kernel – unlike all the other items we've talked about before.  Because of the high-performance, it's common for a cluster filesystem to have its own communication and membership code – not using the typically userland communications and membership code.  Since membership isn't high bandwidth or really low latency information, it is possible to feed membership from a user-space membership layer into the kernel.  Of course, then the membership and the communications layer are out of sync.  It is arguable whether this is an improvement or not.

Cluster filesystems typically need cluster lock managers – described in the next section.

DLM - Distributed Lock Manager

A DLM (Distributed Lock Manager) provides locking services across the cluster, and it's an interesting piece of code to implement them – particularly the error recovery. 

To some degree, DLMs are analogous to System V semaphores but – cluster-aware.  In addition, they provide much more sophisticated API and semantics.  Although DLM APIs are fairly well understood, there is no formal standard, so switching from one to another can be annoying.  Red Hat has a reasonable kernel-based DLM which they use with GFS.  DLMs commonly have their own separate communications and membership code.  The comments about getting membership from user-space and having them be potentially different from cluster filesytems also apply here.

Cluster Volume Managers

You might think that you really don't need a cluster-aware volume manager.  Sometimes you might be right.  More often, if you thought that, you'd be wrong.   A cluster volume manager is just like a regular volume manager – only cluster aware.  This is to keep different nodes from getting inconsistent views of the layout of a set of disks or volumes.   The current cluster-aware volume managers are EVMS and CLVM.  Only CLVM is expected to survive into the long term.

The big challenges for cluster volume managers are high-performance mirroring and snapshots.  These operations are potentially very difficult to implement right and fast.  Cluster-aware volume managers often have both kernel and user-space components.  The membership inconsistency issues here are similar to those for cluster filesystems and the DLM.

CRM - Cluster Resource Manager

Every HA cluster has something like a CRM, but they may divide up these functions differently.  Our CRM is a policy-based decision maker for what should run where – handling failed services and failed cluster nodes.

The CRM is similar to UNIX/Linux startup init scripts – it starts everything up – but across a cluster following some policies, and managing failures.

The Linux-HA CRM is arguably the best cluster resource manager around today – at least in terms of flexibility and power.  It has usability issues, and can be extended, but those are solvable.

The Linux-HA CRM function is largely divided between the PE and TE – which are described below.

PE - Policy Engine

The Policy Engine is a key component of the CRM and does two distinct things.

  • It determines what should run where (cluster layout)

  • It creates a graph of actions of how to get from the current state of affairs to the new desired state

This graph of actions is then given to the TE (described below).

The system would have more flexibility if he PE were split into two parts for these two functions, and supported plugins for the cluster layout function. 

It currently isn't aware of resource cost, nor of absolute resource limits and load balancing considerations, which complicate optimal placement.   Those would be good things to add to it in the future.  Having plugins for doing resource placement would also be a highly useful and desirable thing.

TE - Transition Engine

Receives a graph of actions to perform from the policy engine, then uses the LRM proxy to communicate with the LRMs to carry out the actions

Its main jobs are action sequencing, error detection and reporting

CIB - Cluster Information Base

The CIB manages information on cluster configuration and current status.  The cluster configuration includes the configuration and policies as defined by the system administrator.

Its key difficulty is to keep a consistent copy replicated across the cluster, resolving potential version differences.

All the data it manages is XML, and the CIB has a minimal knowledge of the structure of this XML.

LRM - Local Resource Manager

In the Linux-HA architecture, a local resource manager runs on every machine and carries out the tasks given to it. Everything that gets done gets carried out by the LRM.  Examples are:

  • start this resource

  • stop this resource

  • monitor this resource

  • migrate this resource

  • etc.

The LRM provides interface matching to the various kinds of resources through Resource Agents.  The Linux-HA LRM supports several classes of Resource Agents.

The LRM is not at all cluster-aware.  It can support an arbitrary number of clients, one of which is the LRM communications proxy (below).

LRM Communications Proxy

The LRM proxy communicates between the CRM and the LRMs on all the various machines.  This function is currently built into the CRM.  This architectural decision was based on expedience more than anything else.

To support larger clusters this needs to be separated out, made more scalable, and more flexible.  This would allow a large number of LRMs to be supported by a small number of LRM proxies.   In large systems, this would probably use the ClusterIP capability to provide load distribution (leveling) across multiple LRM proxies.

Init - Initialization and recovery

This code does really three things:

  • Sequences the startup of the cluster components

  • Recovers from component failures (restart or reboot)

  • Sequences the shutdown of all the various cluster components

This is currently provided by Linux-HA and bundled with the Linux-HA communications code.  This likely needs to be separated out to a separate proxy function (process) in the future.

Infrastructure

The Linux-HA infrastructure libraries (“clplumbing”) does a wide variety of things.  A few samples include:

  • Inter-Process Communication

  • Process management

  • Event management and scheduling

  • Many other miscellaneous functions

Surprisingly, these libraries amount to about 20K lines of code.

Quorum Daemon

The quorum daemon is an unusual daemon, because it's the only daemon we have that's intended to run outside the cluster proper.  It is instrumental in solving certain knotty quorum problems – especially for:

  • 2-node clusters (very common)

  • Split-site (disaster recovery) clusters

This was discussed extensively in previous postings [4] and [5].

Management Daemon

Provides Complete Authenticated Configuration and Status API.  This includes both information contained in the CIB, and also information about the communications configuration and so on.

The management daemon is used by the:

  • GUI

  • SNMP agent

  • CIM agent

Clients are authenticated using PAM, and all communications is via SSL, so its clients can safely be outside cluster, or even outside a firewall.  This daemon should provide different levels of authorization depending on the authenticated user, and should log its actions in a format suitable for Sarbanes-Oxley (SOX) auditing purposes.

GUI

Provides a Graphical User Interface providing configuration and status information.  It also supports creating and configuring the cluster.  Note that at the present time, there are a number of useful cluster configurations it cannot create.

CIM and SNMP agents

The CIM and SNMP agents provide CIM and SNMP management interfaces for systems management tools.  The CIM interface supports status updates and configuration changes, whereas the SNMP interfaces only report status.

Disadvantages of this architecture

For a variety of reasons, kernel space doesn't have access to user-space cluster communications or membership.

As a result, both the DLM and most cluster filesytems implements their own membership and communications.

This is in contradiction to the “ideal world” statements earlier.  This can result in some odd cases where one communication method is working in a particular case, but another method is not.  This results in differences in membership – which can have bad effects.

Why this might not be quite as bad as it seems

One reason why one might not worry about this as much as one might, is because it's a problem which one can't make go away.  A cluster system will always have to interface with software packages which do their own communication, and compute their own membership for a variety of usually good reasons.  As a result, this is a problem which we can't make go away.  Instead we have to deal with it effectively.  There are basically two cases to consider:

  1. The “Main” membership thinks that node X should not be in the cluster, whereas the “Other” membership thinks it should be.

  2. The “Other” membership thinks that node X should not be in the cluster, whereas the “Main” membership thinks it should be.

Let's take these two cases one at a time:

Case 1:

If the main membership thinks node X is not in the cluster, then it will simply not start any resources on node X.  This takes care of the problem.

Case 2:

If the “Other” membership discovers that a particular node should be dropped from its view of membership, and it can inform the CRM not to start its resources on that machine, then the local view of this membership from the perspective of the resources it deals with is effectively made to exclude these Other-errant nodes.  In the Linux-HA CRM this is easily done having the Other-resources write node attributes to cause those nodes to be excluded, and the rules would then be written to exclude those nodes from consideration for running Other-related resources.

Although Case 2 isn't pretty, it works, and no amount of wishing and hoping is likely to ever make this kind of problem go away in the general case - particularly when one involves proprietary applications  So, even if there is some membership discrepancy, it can is always possible to manage it appropriately assuming you can get a tiny bit of cooperation from the application.

References

[1] http://linux-ha.org/
[2] http://techthoughts.typepad.com/managing_computers/2007/10/split-brain-quo.html

[3] http://linux-ha.org/STONITH
[4] http://techthoughts.typepad.com/managing_computers/2007/10/more-about-quor.html
[5] http://techthoughts.typepad.com/managing_computers/2007/11/quorum-server-i.html

December 13, 2007

How Managed Virtualization (including HA) conflicts with System Management

Managed Virtualization Versus System Management

In an earlier post[1], I talked about a couple of kinds of virtualization, comparing two of them and highlighting their strengths.  This posting discusses how virtualization can confuse and confound conventional systems management - both automated and manual, and gives some thoughts on how to deal with it.

We all know that virtualization is a GoodThing(TM).  Therefore, it can't really have any disadvantages, can it?  <tongue-in-cheek-off> Unfortunately, it does have disadvantages.  The great strength of virtualization is its ability to break the ties between a service or operating system and the server which implements its service.  Many software systems and a good number of human beings find this confusing.  If I want to reboot a physical server, what services or operating systems will be disrupted by the reboot?

Conversely, if I want to do something to the machine that's running a particular service, which machine do I have to log into?  If you're running both service virtualization (conventional HA like Linux-HA[2]) on top of server virtualization (ala Xen or VMware), then you have a doubly difficult task - first you have to figure out which virtual machine is running a service, then you have to figure out which physical machine is running that particular virtual machine.

This can be really annoying and can easily result in system administrators[3] making mistakes either in the middle of the night, or when under pressure (which all sysadmins know is pretty much all the time).

Remember - Complexity is the Enemy of Reliability.   This is just another example of my favorite phrase at work.

And, if you want to have server monitoring software which tries to figure out whether a service is stopped and have it restart it, then it can also get confused by the fact that all these stupid servers and services are always moving around.  They just won't stay put!  Back in the olden days, you logged into a server and you edited the inittab, and you always knew what hardware it was running on and what server it was.  Now, with virtualization, and especially with virtualization management software, you never know what's where.

A Recipe for Chaos and Conflict

Your HA software and/or your virtualization management software can move things around on you.  Imagine that you have these four kinds of things in your data center:

  • High-Availability (HA/service-virtualization) management software

  • Virtualization management software

  • System management monitoring software

  • Human system administrators

This is a recipe for chaos, interspersed with the occasional career-limiting disaster. It's this kind of thing that leads system administrators to pull their hair out, and keep their resumes up to date.  None of these is bad by itself, in fact, each is a GoodThing(TM).  But they don't normally play well with each other. In typical myopic software design fashion, each of these layers is usually unaware of the other layers (except, of course for the last (human) layer - who has to make up for all the poor integration).

In addition, since the software layers typically aren't aware of all this wonderful virtualization going on, they can't really deal with the picture reliably.  They don't know what should be happening where, because it isn't fixed.  The various virtualization management packages keep changing things!

So, what's a body to do?  As far as I know, there are two basic options.

  1. Integrate the four layers of management with each other using things like CIM[4] and SNMP[5]

  2. Empower your HA software to also manage the server virtualization of your data center

Integration of Layers

Virtually every data center (sadly, pun intended) has a variety of server types and a variety of operating systems, and a variety of management software.  They mostly don't play well with each other.  Almost the only way to get them to play together - even if imperfectly - is to have them talk together using industry standard protocols.

Today, that means using SNMP or CIM.   Here is my personal view on the characteristics of these two protocols for your consideration.

  • SNMP - widely deployed - implemented in a truly compatible way, but far too weak for a job this hard.  SNMP is great for grabbing statistics, checking whether a server or router is up and what kind of load it is seeing in great detail.  Anything much beyond this, and the MIBs become 100% vendor-specific - meaning that cross-vendor integration breaks down - basically completely.  For HA clustering or virtualization management or worse yet the combination of the two - forget it.

  • CIM - widely deployed in expensive disk subsystems - but rarely deployed outside that.  It has newly developed models for virtualization and clustering, but like most standards they're mostly lowest-common-denominator standards, and unfortunately not widely deployed.  For example, Linux-HA[2] implements CIM, but unfortunately Linux-HA has tremendous power and capability which CIM can't begin to model.  So, this winds up being only possible to model using vendor-specific extensions - greatly weakening the possible integrations.

Now, I'm not saying that these two protocols are useless - far from it. Without open standards like CIM and SNMP, the prospect truly is hopeless.   But I am saying that integrating them in the typical-for-the-industry highly-heterogeneous data center is a challenge, and the more layers there are to integrate, the bigger the challenge.  Since standards necessarily trail industry practice, the more "bleeding edge" the topic (i.e., HA clustering or virtualization) and the more powerful the underlying tool (like Linux-HA), the greater the mismatch.

Here we have two bleeding edge topics and four layers.  Yikes!  Surely there must be some kind of alternative to this somewhat-unattractive mess.

Decrease The Layers and Let Them Manage Themselves

As I mentioned in my earlier virtualization posting, some HA packages (like Linux-HA) can also manage virtualization simultaneously.  So, one way of dealing with this is to let (or extend) your service virtualization product also manage your server virtualization.  One advantage of this approach is that service virtualization software (HA software) is comparatively mature technology, minimizing the risk.

Unfortunately, this doesn't yet go all the way in solving the problem either.  There are a few things that should change to make this really work well. These include

  • Support much larger HA clusters - hundreds to thousands of nodes.  In an ideal world, you'd really like fewer of these HA/virtualization clusters as you can get.  Today you'd typically have to have one of these clusters for every 8-32 physical servers - which makes an awfully lot of these things to manage in a data center containing hundreds or thousands of servers.

  • Integrate with many virtualization layers - Such a product would need to integrate with Xen, IBM System Z, IBM System P, Linux KVM, VMware, and future virtualization layers like the one promised by Microsoft.   This isn't rocket science, but by the time you're done, it will be some work.

  • Support monitoring and controlling services inside the virtual machine - Otherwise you haven't really integrated the two layers - and you wind up running some HA software inside some of the virtual machines.  Again, this isn't rocket science, but it will require some work[1] for each operating system you want to manage services for.

  • Integrate with provisioning systems - so that you can add and delete virtual machines and allocate disk to them and their applications with fewer possibilities for error, and more automation.

None of these items are technically difficult, and none of them are prohibitively expensive to implement.  Given that I'm the project leader for Linux-HA, and Linux-HA is one of the most capable HA products around, you might imagine that some of these thoughts are on my mind for our future  ;-).  Of course, that doesn't eliminate the necessity for integration with the remaining layers above, which is why Linux-HA implements both CIM and SNMP.  This allows the virtualization management infrastructure to actively and autonomically manage  servers and services, while letting it bubble up events (especially those it can't automatically recover from) to the management consoles and humans via protocols like SNMP and/or CIM.

Conclusions

Virtualization technologies add complexity to the data center along with the benefits they bring, and in the process may render the existing management facilities less than useful.  However, if HA and Virtualization management are performed by a single entity, and open standards like CIM and SNMP are used, systems can be active the problems can be minimized.

See Also

Preparing for Virtual Management http://www.itbusinessedge.com/blogs/dcc/?p=276

References

[1] http://techthoughts.typepad.com/managing_computers/2007/09/virtualization-.html
[2] http://linux-ha.org/
[3] http://linux-ha.org/SysAdmin
[4] http://www.dmtf.org/standards/cim/
[4] http://en.wikipedia.org/wiki/Common_Information_Model_%28computing%29
[5] http://en.wikipedia.org/wiki/Simple_Network_Management_Protocol

December 04, 2007

A brief overview of load balancing techniques

Something that people commonly do which involves a form of automation is load balancing.  Load balancing is the idea that incoming network requests are distributed across a set of servers which then each provide the same service.  If you spread the load across "n" servers, then in an ideal world what you get is "n" times the throughput.  And, since you have redundant servers, with the right kind of automation software, you can also get a degree of high-availability.   This is way cool!  This article will talk about load balancing as a general technique, and specifically about ways to do it on Linux using free or open source software.  In particular we'll talk about the Linux Virtual Server project[1], (LVS, ipvs) and the Cluster IP[2] as load balancing techniques.

Meanwhile back in the real world, we see some slight differences from this ideal view of things.  We see that load balancers often introduce single points of failure, and that that the load balancer or some kind of back end servers typically introduce scalability limitations.  To really understand these  problems, we need to look at specific load balancing techniques in a little more detail.  Please understand that I'm not an in-depth expert on any of these techniques, but I do have basic familiarity with the methods described here.

Linux Virtual Server
The first technique we'll cover is the Linux Virtual Server[1] (LVS) - which is implemented by the ipvs kernel module.  Much of what I have to say about LVS also applies to the most load balancers - hardware or software, since they typically work roughly the same way as LVS.

I usually describe  LVS clusters as being similar to a baseball[3] diamond - with the load balancer on third base, web (or other) "real servers" stretched from home plate to second base, and the back end database on first base.  In this image, requests flow from the left to right starting from the users in the dugout to the left of second base foul line,  and responses flow from right to left from the database or file server on first base back to the users in the dugout.  [This imagery works great when talking to Americans or Japanese on the phone, but often fails for people from other cultures].

The first thing to notice is that the only inherently scalable portion of this arrangement is the web servers in the middle.  The load balancer (on third base) and the database server (on first base) are each potentially performance bottlenecks and potentially single points of failure.

If you make each of them redundant to eliminate single points of failure, the picture looks something like this:

Diamondha640
There are a number of variations on this basic theme:

  • Failover vs load sharing load balancers

  • Different applications on the "real servers" instead of WAS / Web servers.

  • Different routing techniques for the load balancer

  • Different data sources instead of a DB2 database

In the end, however, they look a lot the same, and work very similarly.

In a NAT[5] arrangement, both incoming and outgoing packets flow through the LVS director.  In a direct routing arrangement, only incoming packets flow through the LVS director, and outgoing packets bypass the director, and go directly to the clients.

LVS monitoring
Although you could set this all up by hand and start all the services by hand, if anything failed, then you'd have to reconfigure things by hand.  Since the theme of this blog is automation, obviously, the right answer is to automate this setup and reconfiguration on failure.   A common way to do this is to use the Linux-HA software[6], which includes the LVS tool ldirectord[4]. Ldirectord will look at your real servers and see if they and the services they're running are operating correctly.  It will then take corrective action if it sees problems.  The Linux-HA software will watch the directors (sitting on third base), and fail things over and back if problems come up, to eliminate the single point of failure on third base.  As of now, the most common configurations of real servers have them be part of an LVS cluster, but not part of a Linux-HA cluster.  For historical reasons, the load balancers (directors) on third base are in one cluster, and the database server(s) on first base are commonly in separate clusters.  However, with release 2.x versions of Linux-HA it is perfectly sensible to include the both in the same cluster, perhaps in an n+1 sparing arrangement.  If you have fewer than 10-12 real servers, then it might also make sense to let Linux-HA manage those real servers as well.  The reason for the upper limit is to ensure that the total cluster isn't larger than the current Linux-HA limitations on cluster size (approximately 16 nodes).  Another  possible configuration is to use Linux-HA to monitor your real servers.  This would involve writing a clone resource agent for configuring LVS to point at the various real servers.  This might result in a more scalable monitoring arrangement than the current ldirectord monitoring arrangement, since the monitoring is done on each real server, and only errors are reported back to Linux-HA. 

This is a very brief overview of LVS, which perhaps we can expand on in a future posting.  For a thorough treatment of LVS,  I recommend The Linux Enterprise Cluster[7] by Karl Kopper.

Performance characteristics
Clearly every inbound packet has to go through the load balancer (director) - so it has to receive, look at, and forward each inbound packet.  It may also have to rewrite headers and recompute checksums on each packet.  If it configured with NAT, then it also has to read and rewrite all outbound packets as well. In addition, with ldirectord and similar software, the director also has the job of monitoring the all the real server processes on all the real servers.  Eventually, this node (or these nodes) will become a bottleneck.  When this happens depends on the nature of the workload, the complexity of monitoring, and the director configuration chosen.

Cluster IP
Although LVS doesn't require a master's degree to configure, some features of it do have a reasonably steep learning curve.  For a very easy-to-configure, albeit less scalable load distribution method on Linux, you might consider using ClusterIP addresses[2].

What is a Cluster IP?
The unique feature of a cluster IP is that it has no load balancer, hence no single point of failure.  Wow! That seems weird!  What does the picture look like?  If you move the users out of the dugout onto third base, you'll get the basic idea.  But that picture brings lots of questions to mind - like how do packets get routed?

The answer is simple - each machine in the cluster has the same IP address.  Say what?  The same IP address? Yes.  I mean the same IP address.   How can this work?  This sounds like it flies in the face of usual teaching about networking.  Which it does.

Enter the Multicast MAC address
The trick to making this work is to have each machine have an ARP table entry with the same MAC address in it - a multicast MAC address.  So when an ARP request is given, all nodes in the cluster respond, but they all give the same answer"I have IP address XXX with MAC address YYY".  So, in effect, there is no confusion - because it doesn't matter which ARP reply is listened to, they all say the same thing.  Therefore at the IP level everyone is happy.

So far, this is a reasonably satisfying answer, but not quite omplete.  What about addressing at the MAC level, and at the TCP or UDP level?

At the MAC level, multicast MAC addresses are recognized by switches, and is routed to all the switch ports, since everyone has presented that MAC address as "theirs". So, it copies all the packets to all the servers.

What happens at the TCP or UDP level?
This is where things get a little more interesting.  Now, it's more obvious how each machine gets the packets - because every machine gets them.  But, now what?  We clearly don't want every machine to respond to a given TCP packet. That would totally confuse everything, as would giving every packet to all the applications.  To solve this problem, Linux has added a hashing feature which allows the source address, source and destination port number to be used in a hashing function to allow it to decide which machine will respond to any given request.  So, if you have three hash buckets and three servers, the packet header information (source IP and port numbers) can be hashed into three buckets with one bucket assigned to each server.   If the packet hashes to the hash bucket assigned to this server, then it is kept, and passed along to the UDP or TCP layers.  If it doesn't hash to the bucket assigned to this server, then it's just dropped (ignored).

So, this hashing method determines which host serves the requests.  Although the ethernet driver in every machine sees each packet, each packet is only processed by one machine each.  Now you know how it works.

It also turns out to be very easy to configure using Linux-HA, as you can see on our ClusterIP web page[8].  In the process, Linux-HA also handles all the redundancy and failover of cluster IP buckets for you automatically.  Very cool indeed.

If you only configure one bucket per node, then when a node fails, all of its traffic has to get assigned to one machine.  If you start out with 3 nodes in your ClusterIP group, and one node dies, then that means that one node gets all the additional traffic - effectively doubling its workload.  So, a better idea for "n" nodes, to have n*(n-1) cluster IP buckets.  That way when any given machine fails, its workload is split evenly across the remaining nodes.  In Linux-HA terminology, the ClusterIP address is called a clone resource, and what you want is to configure clone_max to n*(n-1)and clone_node_max also to n*(n-1).  Although clone_node_max probably doesn't have to be this large, it would allow a single node to handle all the traffic, if a sufficient number of ClusterIP peers die.

Performance characteristics
Every node in the cluster will see all incoming IP packets.  As I understand it, many/all switches will also send every packet to every switch port in the subnet (or vlan).  This argues for a small subnet for this function.  But, the packets are discarded at a very early stage - minimizing the overhead on the host.  Outbound packets are not affected by this arrangement.  This kind of arrangement works well for these kinds of cases:

  • long processing time per packet (complex J2EE applications, for example)

  • small incoming packets with large outgoing packets

  • smaller number of processing nodes

It probably works less well with the opposite kinds of configurations:

  • high number of incoming packets with trivial processing per pacekt

  • large incoming packets (uploading DVD images, for example)

  • large number of processing nodes

Note that in this case, since there is no head-end processor like an LVS director that can be a single point of failure, so no special provisions are needed for high-availability when used with Linux-HA.  It is typically not as scalable as LVS load balancer, but it is trivial to set up and use.

[1] http://www.linuxvirtualserver.org/
[2] http://flaviostechnotalk.com/wordpress/index.php/2005/06/12/loadbalancer-less-clusters-on-linux/
[3] http://en.wikipedia.org/wiki/Baseball
[4] http://www.vergenet.net/linux/ldirectord/
[5] http://en.wikipedia.org/wiki/Network_address_translation
[6] http://linux-ha.org/
[7] http://www.nostarch.com/frameset.php?startat=cluster
[8] http://www.linux-ha.org/ClusterIP

November 27, 2007

Quorum Server Illustrated - updated

In two earlier posts [1] [2], I gave brief descriptions of the quorum server which seem to have left as much confusion as they provided clarity.  This post is only about the Linux-HA quorum server, and includes illustrations for clarity.

The Linux-HA Quorum API

In the Linux-HA quorum API, you can configure a number of quorum modules which are used as follows.  If a quorum module returns HAVEQUORUM, then the cluster has quorum.  If it returns NOQUORUM then the cluster does not have quorum.  If a quorum module returns QUORUMTIE, then the next quorum module in the list is consulted.  If the final module returns QUORUMTIE, then it is treated as a NOQUORUM event.

The quorum daemon is normally used in conjunction with the nomal arithmetic voting quorum module, so that it is only consulted when the number of nodes in the cluster is exactly half the number of configured modules in the system.  So, it is worth noting that the quorum server will never be consulted if a cluster has an odd number of nodes.

Quorum Server Scenarios

Below, I'll go through the basic quorum server cases so you can see how all this works in more detail - with pictures, even!

Normal Situation - Everything up
Quorum_server_normalsm_2

In the picture above, everything is normal.  The quorum server is up, and both sites are also up.  Because the cluster has all its nodes up, the quorum server is irrelevant.

Single Site Failure
Quorum_server_nj_failedsm_3

In the situation above, we show the "New Jersey" site as down.  In this case, the conventional voting quorum has a tie (1/2 - exactly half of the nodes).  In this case the quourm server is consulted.  Since only New York is talking to the quorum server, the quorum server grants quorum to the New York site.

Split Brain Avoided
Quorum_server_splitbrainsm_2

In the case above, the link between the sites has been lost, but both sites and the quorum server are all up.  In this case, both New York and New Jersey contact the quorum server because each sees 1/2 nodes as being up - resulting in a tie condition.

In this case, the quorum server will choose one of the two sites to provide quorum to, and I assume in this case that New York was chosen.  Because New Jersey  wasn't granted quorum, it will shut its resources down.

What happens when the quorum server goes down?
Quorum_server_failed_both_upsm

That is the situation shown above.  Because New York and New Jersey are both up, they have 2/2 votes and both provide service as they should.  This illustrates the point that the quorum server is not a single point of failure.

Multiple Failures -> Loss of Service

Multiple_failures_no_servicesm_3

In this final case, multiple failures have occurred - both New Jersey and the quorum server are down.  In this case, New York doesn't have quorum, so it shuts down services and none are provide by any node in the cluster.  Of course, this situation can be overridden in the cluster configuration by changing the quorum policy, but from an automated perspective, this is all that can be (should be) done.

Security Concerns

If you want to run your quorum server communications across networks which mig

November 12, 2007

Alan eats his own cl_respawn dog food. Yum!!

In this posting, I show how to use cl_respawn[1] to monitor my system logging and help keep it running, and along the way, I improved cl_respawn a little as well.  In addition, I explain why I couldn't just use the respawn directive in /etc/inittab[5] (and why you probably can't either).   I first talked about cl_respawn in one of my first blog posts[6].

The problem

When we run our automated CTS[2] tests for Linux-HA[3] we rely on the guaranteed log entry delivery provided by syslog-ng[4].  Basically, we redirect all our logs in a test cluster to a test overseer machine, and then CTS watches this consolidated log for errors and correct behavior.

This is a nice system and it works pretty well, but it relies on the reliability of syslog-ng.  For the most part, that's just fine.  But, sometimes syslog-ng just stops running.  Then the tests show that Heartbeat has failed, but it's really just syslog-ng that's crashed on me.  So, in the past I added some code to CTS to make it test the logging after every error, and then hit the machines over the head with a hammer and restart logging if logging wasn't working.

This was sort-of OK, because it meant subsequent tests would run fine, but the one test would show failed - even though it probably succeeded.  This would be fine, except that one of my machines (my oldest and slowest) had syslog-ng die on it a few times a day.  I don't know why, and as long as I can live with it for my testing, I don't much care.  I just want it to work.  (I know, it's a lousy attitude, but I have way more to do than I can possibly do).

The solution

Then it I had this revolutionary thought - I could use HA software to make my logging highly available!!

Hold the presses, folks, new headline reads
   "HA guru realizes he can use HA software just like he tells everyone else to do!"

To fix this problem all I had to do was change the init script for syslog to use our cool little cl_respawn tool to babysit the syslog-ng service.  Although I could have used Heartbeat to monitor this service, it seemed like overkill and would have conflicted with CTS.

So, I set out to use cl_respawn to restart syslog-ng quickly - minimizing but not eliminating the possibliity of losing important log messages.

When I looked at the init scripts (they're from SUSE Linux), they had these statements in them:

  • For starting
    startproc -p ${syslog_pid} ${BINDIR}/${syslog} $params
  • For stopping
    killproc -p ${syslog_pid} -TERM ${BINDIR}/${syslog} ; rc_status -v
  • For status
    checkproc -p ${syslog_pid}      ${BINDIR}/${syslog{; rc_status -v
  • checkproc -p ${syslog_pid}      /usr/bin/cl_respawn; rc_status -v

My first thought was ll I had to do was insert cl_respawn ahead of the ${BINDIR}/syslog and I'd be done.  Well.... not quite...

If I had done that, then the pid file for the service ${syslog_pid} would have pointed not to cl_respawn, but to syslog-ng.  So, when I tried to shut down syslog, cl_respawn would have just respawned it.  OOPS.  Not quite the right effect.

What was necessary was for the syslog pid file to contain the pid of cl_respawn, not the pid of syslog-ng.  One minor problem - the author of cl_respawn didn't deal with pidfiles.  To fix that, I added support for a -p option to tell it the name of the pid file to use.

Now I try it.  Uh-oh... It didn't work.  The logs are quickly filled with attempts to start  ${syslog} and having it fail continually with  socket in use.  What was all that about?

By default, syslog-ng forks itself into the background,  and its parent process exits.  That makes cl_respawn think it's died - so it restarts it - and it fails ad infinitum.  So, I read the man page for syslog-ng and discover the -F option to keep it from forking.  Without that, cl_respawn can't tell when it dies.

Along the way, I read the code, find a couple of other minor bugs and fix them.  I update my init script and now it looks like this:

  • For starting
    startproc -p ${syslog_pid} /usr/bin/cl_respawn -p ${syslog_pid} ${BINDIR}/${syslog} -F $params
  • For stopping
    killproc -p ${syslog_pid} -TERM /usr/bin/cl_respawn ; rc_status -v
  • For status
  • checkproc -p ${syslog_pid}      /usr/bin/cl_respawn; rc_status -v

Of course, if you don't run SUSE Linux, then your init scripts will look somewhat different, but I'm sure you'll figure it out.

Why not just use respawn in inittab?

Those of you who know UNIX administration to any degree realize that /etc/inittab[5] has a respawn directive you can give it.  Why wouldn't that do the trick?  The short answer is service dependencies.   The longer answer is below:

  • Logging depends on other /etc/init.d services, so you don't want it to start until after those other services (like the network) are started.  The LSB init script system supports these dependencies and starts things in the right order.
  • Other services depend on logging.  A number of other services can't start until after logging starts.  If you try and disable the /etc/init.d/syslog service on your machine so you can start it with respawn from /etc/inittab, havoc ensues - because these other services won't start until the /etc/init.d/syslog service is started.  If you disable it, they won't start.
  • What fun would that be?  I mean, if we wrote this cl_respawn tool, we probably ought to use it ;-).

What did I learn?

  • How to use cl_respawn in real life
  • Some missing requirements for cl_respawn
  • I was reminded of the advantages of using our own software
  • How handy simple little tools like cl_respawn can be

[1http://hg.linux-ha.org/dev/file/tip/tools/cl_respawn.c
[2http://linux-ha.org/CTS
[3http://linux-ha.org/
[4http://www.balabit.com/network-security/syslog-ng/opensource-logging-system/"
[5http://www.freebsd.org/cgi/man.cgi?query=inittab&manpath=Red+Hat+Linux%2Fi386+9&format=html
[6http://techthoughts.typepad.com/managing_computers/2007/09/tools-for-servi.html

October 31, 2007

More about quorum - updated

In a previous article[1], I talked about quorum, and alluded to some more details about quorum which I'll discuss here in a little more detail.  Let's examine a couple of common quorum tie-breaker methods, and see what's useful, and what's hype, and what's painful to use.

Can the standard voting quorum method fail?
By fail, I mean, can it grant quorum to two partitions simultaneously?  The answer, unfortunately, is yes, even though it seems like a mathematical impossibility.  This is because the world, unfortunately, is more complicated than simple mathematics, and quorum methods don't stand alone.  Quorum methods are tied to membership algorithms.  If a membership method fails, and for a period of time, a given node appears to be in the membership of two partitions, then while that's true, both partitions could legitimately think they have quorum.  Like many possible failures, this one is unlikely, but it is certainly possible.  Sigh...  Paranoia is so depressing.

SCSI Reserve Limitations

In the earlier article, I mentioned that SCSI reserve was often painful to use, without being explicit on why.  Let's explore that now.  SCSI reserve is an operation which happens at the physical disk volume level - that is, at the SCSI LUN  level.  With "dumb" disks this typically corresponds to an entire disk spindle - which nowadays is something like 180G-750G of disk data - which is clearly a significant waste of resources.

However, most people who use shared disks use them in a SAN are using it with a smart SAN disk controller which allows the creation of "logical" volumes which correspond to more much smaller sizes for a single SCSI LUN, using any RAID method you want.  For a disk used just as a quorum disk, you don't need to actually write anything on it, so you want it as small as you can make it.  But, most people probably make it RAID 1, which means that the minimum size is probably something like 2 gigabytes of SAN disk. If you have a large data center, it could easily have 50 clusters in it, and each one requires such a quorum device.  In this case, this makes for a lot of extra volumes, extra administration and possibilities for confusion and human error going on here.  In addition, smaller SAN disk units may have limitations on the total number of disk partitions they can manage.

So, perhaps you think - I'm smarter than that, I'll just use software volume managers to take care of this for me, and avoid all those extra logical disks in my SAN.  Unfortunately, that typically doesn't work.  This is because when you issue the SCSI reserve, it can't reserve a logical partition, only a physical partition, so many logical volume managers block reserve operations.  So, logical volume managers are not much help here.

To make it even more complicated, multi-pathing to disk devices often confuses (i.e., "breaks") disk reservation issues - particularly with SCSI II non-persistent reservations.

Of course, if you're replicating data instead of sharing it, disk reserve operations are of no help at all - since reserving a disk on one disk volume has no effect whatsoever on the other volume.

None of these considerations are changed by what kind of SCSI reserve you issue (persistent or the older non-persistent reserve).  However, there are even more problems that occur when you use the older non-persistent SCSI II reserve (the most commonly available kind), since it isn't persistent, and is broken by a bus reset or a device reset.  So, if you use SCSI II reserve, then you have to continually verify that you still have the reserve.

Count Key Data Disk Reserve

Mainframe disk subsystems support a non SCSI-based disk model, called the extended count key data (ECKD) disks.  These disks also support reserve methods similar to those provided by SCSI.

Quorum Daemon - Helpful or Hype?
Earlier, I said that the Linux-HA[2] quorum daemon[3] can help out here - not only for local shared disk situations, but for disaster-recovery type situations where you're replicating data between sites, and gave a hand-waving style argument on why that's so.  Let's see if we can go past the hand waving into more of the details so you can see what this is, how this works, and decide for yourself it would be helpful to you, or if this is just more hype written by someone who likes his project's work.

If you recall, I also described it as analogous to a software implementation of SCSI reserve.  But, in this case, there are no disks involved, so the hardware and SCSI protocol limitations mentioned above go away.  So, if the hardware has gone away, what kind of software has replaced it, and how does it work?

In the simplest view, the quorum daemon is simple - it takes TCP connections from clients, and when multiple clients both want to have quorum for the same cluster, it grants it to exactly one of them.  So, at this level of detail, it's quite simple, and the logic also straightforward.  But, there's a little more to it when you get into the details, so let's spend a few words describing the next level of detail, for those of you who are still skeptical.

TCP doesn't do a good job of telling you when your peer goes away, so both the client and the server processes send the other side heartbeats.  If the client stops hearing heartbeats, then he notifies his cluster that he no longer has quorum.  When the server stops hearing heartbeats, he takes quorum away from the client, and sends him noquorum messages.  Before switching quorum from one client to the other, the quorum daemon induces a configurable delay to allow the previous quorum owner time to notice that they've lost quorum and to shut down any resources it might be running.

What other interesting design features does it have?  Well, for one, all communication between client and server goes over SSL, with certificate authentication.  The server has a copy of the client's public certificate, and the clients have copies of the server's public certificate - so you don't need to get your certificates from a certificate authority - because we authenticate them by certificate, not through certificate authority (which could be a single point of failure).  Because of the way we use the SSL certificates, both authentication and authorization are bundled together.

Another nice design feature, is that a single quorum daemon doesn't have any fixed upper limit on the number of clusters it can support.  So, a single quorum daemon process should be able to support hundreds of clusters.  This makes it an advantage over just adding a third node to your cluster - because it can server hundreds of clusters, whereas typically a single computer can't join more than one cluster.  Indeed, since this doesn't run the cluster stack for any given cluster, it would be possible for the quorum daemon to work with multiple cluster implementations - as long as they put the hooks in to talk to it.

The combination of these two features allows you to put your quorum daemon on a completely different site, even at a colocation facility, and let your communication flow over the public Internet - securely.  This is really nice economical choice for your split-site DR-style clusters, for companies that don't have 3 or more major data centers.  Of course, if you have dozens of clusters that aren't split-site, you can make a single quorum daemon cluster with fencing to serve all the other clusters in your site.  All without bothering your storage group to create and manage all those little tiny quorum partitions for you.  In fact, without requiring shared storage at all.

Some kinds of software support resource-level quorum, that is, quorum on a resource-level rather than a whole cluster level.  Some of these are called resource-driven clusters.  The quorum daemon idea could also be used for those arrangements as well.

So, what's not to like about the quorum daemon?  Well, you do have to make a certificate for the server, and one for each client.  But, you don't have to pay for them, because they don't need to come from one of the well-known certificate producers.  It would be nice if the quorum daemon would sync its state to a slave copy that could be used in a cluster as a master/slave resource in case the machine running the quorum daemon failed.  You can still run it in a cluster with another machine to take over for it, but the takeover isn't as graceful as it would be if there were a slave copy ready to take over for the master complete with its quorum state.

I've recently learned that HP ServiceGuard[4] has a feature similar to this which they call the quorum server daemon (qs)[5].  The documentation indicates that HP's qs authenticates by IP address and does not use SSL for communication or authentication/authorization.  As a result, there are security concerns associated with it, and qs is probably unsuitable for split-site configurations.

And, even more recently, thanks to Nils Gorrol I've learned that Sun has a similar feature in their sqsd [6] since SunCluster version 3.2.  Looking at the documentation, it looks as though it has similar security issues that HP's qs has for split-site arrangements and potentially insecure networks.

Thanks to all those who keep me on the straight and narrow regarding the facts ;-)

Ping tiebreaker

Some HA systems provide  a ping tiebreaker.  To make this work, you pick a address outside the cluster to ping, and any partition that can ping that address has quorum.  The obvious advantage is that it's very simple to set up - doesn't require any additional servers or shared disk.  The disadvantage (and it's a big one) is that it's very possible for multiple partitions to think they have quorum.  In the case of split-site (disaster recovery) type clusters, it's going to happen fairly often.  If you can use this method for a single site in conjunction with fencing, then it will likely work out quite well.  It's a lot better than no tiebreaker, or one that always says "you have quorum".  Having said that, it's significantly inferior to any of the other methods.

If I've omitted things you want to know about, or this brings up ideas or questions in your mind, or you know of other implementations of the quorum daemon idea, by all means post a reply to this article.

References

[1] http://techthoughts.typepad.com/managing_computers/2007/10/split-brain-quo.html
[2] http://linux-ha.org/
[3] http://www.linux-ha.org/QuorumServerGuide
[4] http://h71036.www7.hp.com/enterprise/cache/6468-0-0-0-121.html
[5] http://www.docs.hp.com/en/B3936-90065/ch03s01.html
[6] http://docs.sun.com/app/docs/doc/819-5360/gbdud?a=view

October 10, 2007

Split-brain, Quorum, and Fencing - updated

In some ways, an HA system is pretty simple - it starts services, it stops them, and it sees if they and the computers that run them are still running.  But, there are a few bits of important "rocket science" hiding in there among all these apparently simple tasks.  Much of the rocket science that's there centers around trying to solve a single thorny problem - split brain.  The methods that are used to solve this problem are quorum and fencing.  Unfortunately, if you manage an HA system you need to understand these issues.  So this post will concentrate on these three topics: split-brain, quorum, and fencing.

If you have three computers and some way for them to communicate with each other, you can make a cluster out of them and,each can monitor the others to see if their peer has crashed.  Unfortunately, there's a problem here - you can't distinguish a crash of a peer from broken communications with the peer.  All you really know is that you can't hear anything from them.  You're really stuck in a Dunn's law[1] situation - where you really don't know very much, but desperately need to.  Maybe you don't feel too desperate yet.  Perhaps you think that you don't need to be able to distinguish these two cases.  The truth is that sometimes you don't need to, but much of the time you very much need to be able to tell the difference.  Let's see if I can make this clearer with an illustration.

Let's say you have three computers, paul, silas, and mark, and paul and silas can't hear anything from mark and vice versa.  Let's further suppose that mark had a filesystem /importantstuff from a SAN volume mounted on it when we lost contact with it. and that mark is alive but out of contact.  What happens if we just go ahead and mount /importantstuff up on paul? The short answer is that bad things will happen[2]. /importantstuff will be irreparably corrupted as two different computers update the disk independently.  The next question you'll ask yourself is "Where are those backup tapes?". That's the kind of question that's been known to be career-ending.

Split-Brain

This problem of a subset of computers in a cluster beginning to operate autonomously from each other is called Split Brain[3]. In our example above, the cluster has split into two subclusters: {paul, silas} and {mark}, and each subset is unaware of the others.  This is the perhaps most difficult problem to deal with in high-availability clustering.  Although this situation does not occur frequently in practice, it does occur more often than one would guess.  As a result, it's vital that a clustering system have a way to safely deal with this situation.

Earlier I mentioned that there was information you really want to know, but don't know.  Exactly what information did I mean?   What I wanted to know was "is it safe to mount up /importantstuff somewhere else?".  In turn, you could figure that out if you knew the answer to one of these two questions:  "Is mark really dead?" which is one way of figuring out "Is mark going to write on the volume any more?"  But, of course, since we can't communicate with mark, this is pretty hard to figure out.  So, cluster developers came out with a kind of clever way of ensuring that this question can be answered.  We call that answer fencing.

Fencing

Fencing is the idea of putting a fence around a subcluster so that it can't access cluster resources, like  /importantstuff.  If you put a fence between it and its resources, then suddenly you know the answer to the question "Is mark going to write on the volume any more?" - and the answer is no - because that's what the fence is designed to prevent.  So, instead of passively wondering what the answer to the safeness question is, fencing takes action to ensure the "right" answer to the question.

This sort of abstract idea of fencing is fine enough, but how is this fencing stuff actually done? There are basically two general techniques:  resource fencing [4] and node fencing.[5]. 

  • Resource fencing is the idea that if you know what resources a node might be using, then you can use some method of keeping it from accessing those resources. For example, if one has a disk which is accessed by a fiber channel switch, then one can talk to the fiber channel switch and tell it to deny the errant node access to the SAN.

  • Node fencing is the idea that one can keep a node from accessing all resources - without knowing what kind of resources it might be accessing, or how one might deny access to them.  A common way of doing this is to power off or reset the errant node.  This is a very effective if somewhat inelegant method of keeping it from accessing anything at all.  This technique is also called STONITH[6] - which is a  graphic and colorful acronym standing for Shoot The Other Node In The Head.

With fencing, we can easily keep errant nodes from accessing resources, and we can now keep the world safe for democracy - or at least keep our little corner of it safe for clustering.  An important aspect of good fencing techniques is that they're performed without the cooperation of the node being fenced off, and that they give positive confirmation that the fencing was done.  Since errant nodes are suspect, it's by far better to rely on positive confirmation from a correctly operating fencing component than to rely on errant cluster nodes you can't communicate with to police themselves.

Although fencing is sufficient to ensure safe resource access, it is not typically considered to be sufficient for happy cluster operation because without some other mechanism, there are some behaviors it can get into which can be significantly annoying (even if your data really is safe).  To discuss this, let's return our sample cluster.

Earlier we talked about how paul or silas could use fencing to keep the errant node mark from accessing /importantstuff.  But, what about mark?  If mark is still alive, then it is going to regard paul and silas as errant, not itself.  So, it would also proceed to fence paul and silas - and progress in the cluster would stop.  If it is using STONITH, then one could get into a sort of infinite reboot loop, with nodes declaring each other as errant and rebooting each other, coming back up and doing it all over again.  Although this is kind of humorous the first time you see this in a test environment - in production with important services, the humor of the situation probably wouldn't be your first thought.  To solve this problem, we introduce another new mechanism - quorum.

Quorum

One way to solve the mutual fencing dilemma described above is to somehow select only one of these two subclusters to carry on and fence the subclusters it can't communicate with.  Of course, you have to solve it without communicating with the other subclusters - since that's the problem - you can't communicate with them.  The idea of quorum represents the process of selecting a unique (or distinguished for the mathematically inclined) subcluster.

The most classic solution to selecting a single subcluster is a majority vote.  If you choose a subcluster with more than half of the members in it, then (barring bugs) you know there can't be any other subclusters like this one. So, this is looks like a simple and elegant solution to the problem. For many cases, that's true.  But, what if your cluster only has two nodes in it?  Now,  if you have a single node fail, then you can't do anything - no one has quorum.  If this is the case, then two machines have no advantage over a single machine - it's not much of an HA cluster.  Since 2-node HA clusters are by far the most common size of HA cluster, it's kind of an important case to handle well.  So, how are we going to get out of this problem?

Quorum Variants and Improvements

What you need in this case, is some kind of a 3rd party arbitrator to help select who can fence off the other nodes and allow you to bring up resources - safely.  To solve this problem there is a variety of other methods available to act as this arbitrator - either software or hardware. Although there are several methods available to use as arbitrator, we'll only talk about one each of hardware and software methods: SCSI reserve and Quorum Daemon.

  • SCSI reserve:  In hardware, we fall back on our friend SCSI reserve.  In this usage, both nodes try and reserve a disk partition available to both of them, and the SCSI reserve mechanism ensures that only one of the two of them can succeed.  Although I won't go into all the gory details here, SCSI reserve creates its own set of problems including it won't work reliably over geographic distances.  A disk which one uses in this way with SCSI reserve to determine quorum is sometimes called a quorum disk.  Some HA implementations (notably Microsoft's) require a quorum disk.

  • Quorum Daemon:  In Linux-HA[7], we have implemented a quorum daemon - whose sole purpose in life is to arbitrate quorum disputes between cluster members.  One could argue that for the purposes of quorum this is basically SCSI reserve implemented in software - and such an analogy is a reasonable one.  However, since it is designed for only this purpose, it has a number of significant advantages over SCSI reserve - one of which is that it can conveniently and reliably operate over geographic distances, making it ideal for disaster recovery (DR) type situations.  I'll cover the quorum daemon and why it's a good thing in more detail in a later posting.  Both HP and Sun have similar implementations, although I have security concerns about them, particularly over long distances.  Other than the security concerns (which might or might not concern you), both HP's and Sun's implementations are also good ideas.

Arguably the best way to use these alternative techniques is not directly as a quorum method, but rather as a way of breaking ties when the number of nodes in a subcluster is exactly half the number of nodes in the cluster.  Otherwise, these mechanisms can become single points of failure - that is, if they fail the cluster cannot recover.

Alternatives to Fencing

There are times when it is impossible to use normal 3rd-party fencing techniques.  For example, in a split-site configuration (a cluster which is split across geographically distributed sites), when inter-site communication fails, then attempts to fence will also fail.  In these cases, there are a few self-fencing alternatives which one can use when the more normal third-party fencing methods aren't available.  These include:

  • Node suicide.  If a node is running resources and it loses quorum, then it can power itself off or reboot itself (sort of a self-STONITH).  The remaining nodes wait "long enough" for the other node to notice and kill itself.  The problem is that a node which is sick might not succeed in self-suicide, or might not notice that it had a membership change, or had lost quorum.   It is equally bad if notification of these events is simply delayed "too long".  Since there is a belief that the node in question is, or at least might be, malfunctioning, this is not a trivial question.  In this case, use of hardware or software watchdog timers becomes critical.

  • Self-shutdown.  This self-fencing method is a variant on suicide, except that resources are stopped gracefully.  It has many of the same problems, except it is somewhat less reliable because the time to shut down resources can be quite long.  Like the case above, use of hardware or software watchdog timers becomes critical.

Note that without fencing, the membership and quorum algorithms are extremely critical.  You've basically lost a layer of protection, and you've switched from relying on a component which gives positive confirmation to relying on a probably faulty component to fence itself, and then hoping without confirmation that you've waited long enough before continuing.

Summary

Split-brain is the idea that a cluster can have communication failures, which can cause it to split into subclusters.  Fencing is the way of ensuring that one can safely proceed in these cases, and quorum is the idea of determining which subcluster can fence the others and proceed to recover the cluster services.

An Important Final Note

It is fencing which best guarantees the safety of your resources.  Nothing else works quite as well.  If you have fencing in your cluster software, and you have irreparable resources (i.e. that would be irreparably damaged in a split-brain situation), then you must configure fencing.  If your HA software doesn't support (3rd party) fencing, then I suggest that you consider getting a different HA package.

See Also

General cluster concepts[8]

References

[1]   http://linux-ha.org/DunnsLaw
[2]   http://linux-ha.org/BadThingsWillHappen
[3]   http://linux-ha.org/SplitBrain
[4]   http://linux-ha.org/ResourceFencing
[5]   http://linux-ha.org/NodeFencing
[6]   http://linux-ha.org/STONITH
[7]   http://linux-ha.org/
[8]   http://linux-ha.org/ClusterConcepts

October 05, 2007

How to use a watchdog timer

In an earlier posting[1], I mentioned that explaining how to optimally use a watchdog driver would be a good thing to talk about later. Now seems a good time to talk about that, giving a brief overview of some good techniques for getting the most out out of your watchdog timer.

As was mentioned previously, one can have a software watchdog timer like softdog[2], or a hardware timer, or a watchdog utility like apphbd[3] (application heartbeat daemon). Although each method has its advantages and disadvantages, the methods that an application can use to take best advantage of them are very similar.

The basic idea of using a watchdog is simple:  Periodically send a heartbeat to the watchdog timer.  If the application fails to heartbeat in the specified interval, or exits prematurely, then a recovery action is taken.  So, all your application has to do  is set a timer and tickle the watchdog timer when your timer goes off.  Sounds extremely simple - and for the most part it is.

How to get into trouble with watchdog timers

If your application does disk I/O or grows in size as it runs, or calls functions or systems calls that might block, then the timing of your application can change dramatically when the system is under heavy I/O load or memory pressure.  This can mean that your application is judged to be hung when it's not.  When this happens, the watchdog timer you're using will trigger a recovery action - maybe restarting your application, or rebooting the machine.  For this kind of a situation, this is probably not what you had in mind.  Of course, if you're in an HA environment like Linux-HA[4] where a machine reboot will cause a service failover, this may exactly what's needed to straighten out your problem.  As always, YMMV[5].

When not to use watchdog timers

Watchdog timers need reasonably reliable real-time performance, and an application which you can modify and which runs periodically (or which you're willing to make run periodically).  If you can't modify the application, or it is expected to have extremely erratic real-time performance, it doesn't run continuously, or runs in an environment which services your application erratically, then watchdog timers may not be for you.

Making your watchdog timer do more

Having your application not appear to be dead is sort-of-OK, but not exactly a deep metric for how well your application is behaving. Since this code will run inside your application, and you have to modify your application anyway, you have the opportunity to make this a form of white-box testing [6]. To do this, tie tickling your watchdog timer to good measures of your program's sanity. Below are a few examples of how you might go about doing this.  Not every example is appropriate for every type of application, so take this as food for thought.

  • Audit your data structures. If you have several data structures representing work to do, clients, outputs, etc., you can audit your data structures for mutual consistency, and only send out heartbeats when you don't find any errors.  For example, perhaps every piece of work to be done ought to belong to an active client. This is an interesting technique - because it can allow for transient "false positives" or errors that get corrected by the natural flow of the application.  If you audit your data structures once every 10 seconds, but set your heartbeat rate to 30 seconds, then you can have a transient error last up to two iterations without causing a restart.  If it persists beyond that, your watchdog timer will take action.

  • Check for work being processed.  If you have work in your input queues, but no work has been completed since the last heartbeat, then suppress sending out a heartbeat until some work actually gets processed.

  • Check for old work to be done.  If you have work queues, you can skip heartbeats whenever you have work in your queues which is "too old".  This may be a symptom that your application isn't processing its work, or that it has somehow lost track of this particular piece of work.

Of course, these are just a few simple ideas, but they may spark some better ideas for your particular application.  Anything you can examine periodically to see if your application seems to be doing whatever it is it's supposed to do is a potential candidate for using to control when and whether to tickle your watchdog timer.

Making your watchdog timer more reliable

Something you should really avoid is false positives on watchdog timeouts - especially when the consequence is to reboot the machine. Spurious reboots are typically frowned upon ;-).  This is all a fine idea, but how exactly do you go about doing that?  Here are a few tips to keep in mind.

  1. Tune your timeout interval. Some watchdog timers (like apphbd) have a warning level as well as a fatal timing level. Be sure and take advantage of the warning level to help you tune your application's heartbeat interval.  For example, if your application really ought to send out a heartbeat every second, you can set the warning time threshold to 1.5 seconds, and the fatal watchdog timer to a much larger value, say 5 or 10 seconds. That way, as you test, you can see how close you come to the 3 second time limit in your most extreme cases of load.

  2. Know your application's expected behavior.  If is single-threaded and does short-lived tasks throughout the day, except for once a day when it produces a summary report, then keep that extra work and the delay it can cause in mind when setting your timer value.

  3. Make your application a better real-time citizen. Rather than processing all its work in one huge uninterruptible chunk, make sure you process chunks whose size is bounded by some constant amount of time, and allow interruptions (for processing watchdog timer requests) between these chunks.

  4. Consider using the glib mainloop task dispatcher[7]. The mainloop construct is great for event driven programs which don't actually require threading. We use it extensively in Heartbeat [4], and it has worked out very well for us.

  5. Consider using threads. Many people swear by threads as the only way to write any reasonable program. Many people (including many of those same people) swear at them[8], When properly used, threads can be helpful in writing programs with better real-time behavior.

  6. Consider doing your disk I/O into a separate process (or thread). It's usually disk I/O or memory allocation (implying disk I/O) which is most likely to hang your program and give it unpredictable realtime behavior.

  7. Consider using asynchronous I/O. This is another technique for avoiding blocking by disk I/O. It's not terribly portable, and the API seems to me to be a little subject to change, but it's a really nice idea.

  8. Consider locking your program into memory. If your program needs about the same amount of real memory as it occupies virtual memory, then the system impact of locking your program into memory may not be high. This is not for everyone - because if everyone did this, then why bother implementing virtual memory?

  9. Consider setting your program with a soft-realtime (POSIX realtime) priority. Even more than the previous step, this step is not for everyone. If your program goes into an infinite loop, the entire system stops - end of story. Not good. But, if your program is critical, small, and well-behaved, this can be a reasonable thing to do.

  10. Consider running real time Linux [9]. This is an even more drastic step, but if your whole system needs to be real time, some great work has been done here which you might look into.

  11. If you're writing in Java, consider real-time Java from IBM[10]. A better recommendation might be to not use Java, but for some people Java is their religion, but if you or your management insists on Java, then this may be just the ticket for you.

In summary, here are general ideas to keep in mind:

  • Tie tickling the watchdog timer with the sanity of your application

  • Do what is reasonable to improve the predictability of the realtime behavior of your program

Hopefully the tips above will help you do this.

References

[1] http://techthoughts.typepad.com/managing_computers/2007/09/monitoring---a-.html
[2] http://linux-ha.org/softdog
[3] http://linux.die.net/man/8/apphbd
[4] http://linux-ha.org/
[5] http://en.wiktionary.org/wiki/your_mileage_may_vary
[6] http://en.wikipedia.org/wiki/White_box_testing
[7] http://library.gnome.org/devel/glib/unstable/glib-The-Main-Event-Loop.html
[8] http://sourcefrog.net/weblog/software/languages/java/java-threads.html
[9] http://www-03.ibm.com/press/us/en/pressrelease/21232.wss
[10] http://domino.research.ibm.com/comm/research_projects.nsf/pages/metronome.index.html